Introduction: Stroke research using widely available institutional, statewide, and national retrospective data depends on accurate identification of stroke subtypes from claims data. Despite the abundance of such data and advances in clinical informatics, few published studies have applied machine learning models to improve on previously reported administrative stroke identification algorithms.
Hypothesis: We hypothesized that machine learning models applied to claims data coded using the International Classification of Diseases, Ninth Revision (ICD-9), can accurately identify patients with ischemic stroke (IS), intracerebral hemorrhage (ICH), and subarachnoid hemorrhage (SAH), and that these models would outperform previously published algorithms in our patient cohort.
Methods: We developed a gold standard list of 427 stroke patients consecutively admitted to our institution from 1/1/2015 to 9/30/2015 using an internal stroke database, and used 75% of it to train and 25% to test two machine learning models: one using a classification and regression tree (CART) and another using regularized logistic regression. There were 2,241 negative controls. For comparison, we also applied a previously reported stroke detection algorithm by Tirschwell and Longstreth to our cohort.
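The 75%/25% partition described in the Methods can be sketched as follows. This is a minimal illustration only: the patient identifiers, the random seed, the `split_cohort` helper, and the choice to split cases and controls separately (so both sets keep the same case-to-control ratio) are assumptions, not details reported in the abstract.

```python
import random


def split_cohort(ids, train_fraction=0.75, seed=0):
    """Shuffle a list of identifiers and split it into train and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers; only the cohort sizes come from the abstract.
cases = [f"case_{i}" for i in range(427)]
controls = [f"control_{i}" for i in range(2241)]

# Assumption: cases and controls are split separately (a stratified split).
train_cases, test_cases = split_cohort(cases)
train_controls, test_controls = split_cohort(controls)

train_ids = train_cases + train_controls
test_ids = test_cases + test_controls
```

Under these assumptions every patient lands in exactly one of the two sets, and the models would be fit on `train_ids` and evaluated only on `test_ids`.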
Results: The CART model had a κ of 0.72, 0.82, 0.59; sensitivity of 95%, 99%, 99%; and specificity of 88%, 78%, 75% for IS, ICH, and SAH, respectively. The regularized logistic regression model had a κ of 0.73, 0.80, 0.59; sensitivity of 95%, 99%, 99%; and specificity of 89%, 78%, 75% for IS, ICH, and SAH, respectively. The previously reported algorithm by Tirschwell and Longstreth had a κ of 0.71, 0.56, 0.64; sensitivity of 98%, 99%, 99%; and specificity of 64%, 52%, 50% for IS, ICH, and SAH, respectively.
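For readers less familiar with the three reported metrics, the sketch below shows how Cohen's κ, sensitivity, and specificity follow from a 2×2 confusion matrix for a single stroke subtype treated as a binary outcome. The function name and the example counts are hypothetical and are not taken from the study data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, and Cohen's kappa from 2x2 counts."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # true positives among all actual positives
    specificity = tn / (tn + fp)  # true negatives among all actual negatives

    # Observed agreement between the algorithm and the gold standard.
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only.
example = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

Because κ corrects for chance agreement, it can rank algorithms differently from sensitivity or specificity alone, which is why all three are reported side by side.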
Conclusion: Compared with the previously reported ICD-9-based detection algorithm, the machine learning models had a higher κ for diagnosis of IS and ICH, similar sensitivity for all subtypes, and higher specificity for all stroke subtypes in our cohort. Applying machine learning models to identify stroke subtypes from administrative data sets can lead to highly accurate models of stroke subtype identification for health services researchers.