Recurrent cancer is common, costly, and lethal, yet we know little about it in community-based populations. Electronic health records and tumor registries contain vast amounts of data regarding community-based patients, but usually lack recurrence status. Existing algorithms that use structured data to detect recurrence have limitations.Methods:
We developed algorithms to detect the presence and timing of recurrence after definitive therapy for stages I–III lung and colorectal cancer using 2 data sources that contain a widely available type of structured data (claims or electronic health record encounters) linked to gold-standard recurrence status: Medicare claims linked to the Cancer Care Outcomes Research and Surveillance study, and the Cancer Research Network Virtual Data Warehouse linked to registry data. Twelve potential indicators of recurrence were used to develop separate models for each cancer in each data source. Detection models maximized area under the ROC curve (AUC); timing models minimized average absolute error. Algorithms were compared by cancer type/data source, and contrasted with an existing binary detection rule.Results:
Detection model AUCs (>0.92) exceeded existing prediction rules. Timing models yielded absolute prediction errors that were small relative to follow-up time (<15%). Similar covariates were included in all detection and timing algorithms, though differences by cancer type and dataset challenged efforts to create 1 common algorithm for all scenarios.Conclusions:
Valid and reliable detection of recurrence using big data is feasible. These tools will enable extensive, novel research on quality, effectiveness, and outcomes for lung and colorectal cancer patients and those who develop recurrence.