Diverse data sources are increasingly being combined to overcome the quality and cost constraints inherent in population-based research. In this paper, we describe the performance of a Medicare claims-based algorithm for identifying incident cancer, in comparison with observational cohort data from the Nurses’ Health Study (NHS).
Methods:
Claims data for NHS participants linked to Medicare were analyzed using 4 versions of a cancer identification algorithm across 3 cancer sites (breast, colorectal, and lung). The versions evaluated were an update of the original Setoguchi algorithm and 3 others that differed in the data used for prevalent cancer exclusions.
Results:
The algorithm that yielded the highest positive predictive value (PPV, 0.52–0.82) and κ statistic (0.62–0.87) in identifying incident cancer cases used both Medicare claims and observational cohort data (NHS) to remove prevalent cases. Compared with this dual data source-informed algorithm, the version that used only NHS data to remove prevalent cancer cases performed nearly as well (PPV, 0.50–0.79; κ, 0.61–0.85), whereas the version that used only claims performed substantially worse (PPV, 0.42–0.60; κ, 0.54–0.70).
Conclusions:
Our findings suggest claims-based algorithms identify incident cancer with variable reliability when measured against an observational cohort study reference standard. Self-reported baseline information available in cohort studies is more effective in removing prevalent cancer cases than are claims data algorithms. Use of claims-based algorithms should be tailored to the research question at hand and the nature of available observational cohort data.
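As an illustration of the two agreement measures reported above, the following minimal sketch computes PPV and Cohen's κ from a 2×2 cross-classification of algorithm-identified cases against a reference standard. The class name, function names, and counts are hypothetical and chosen only for illustration; they are not taken from the NHS-Medicare data, and the study's own computation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cross-classification of an algorithm against a reference standard."""
    both_pos: int   # algorithm positive, reference positive
    alg_only: int   # algorithm positive, reference negative
    ref_only: int   # algorithm negative, reference positive
    both_neg: int   # algorithm negative, reference negative

    @property
    def total(self) -> int:
        return self.both_pos + self.alg_only + self.ref_only + self.both_neg


def ppv(t: TwoByTwo) -> float:
    """Share of algorithm-positive cases confirmed by the reference."""
    return t.both_pos / (t.both_pos + t.alg_only)


def cohen_kappa(t: TwoByTwo) -> float:
    """Chance-corrected agreement between algorithm and reference."""
    n = t.total
    observed = (t.both_pos + t.both_neg) / n
    # Expected agreement if the two classifications were independent,
    # given their marginal totals.
    alg_pos = t.both_pos + t.alg_only
    ref_pos = t.both_pos + t.ref_only
    alg_neg = n - alg_pos
    ref_neg = n - ref_pos
    expected = (alg_pos * ref_pos + alg_neg * ref_neg) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only (not study data).
example = TwoByTwo(both_pos=80, alg_only=20, ref_only=10, both_neg=890)
example_ppv = ppv(example)
example_kappa = cohen_kappa(example)
```

In this sketch, removing prevalent cases before classification would mainly reduce `alg_only`, which raises both PPV and κ; this is one way to read the differences reported between the algorithm versions.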