Objective: The TOAST classification for ischemic stroke (IS) is critical to determining management and predicting outcome. Adjudication is performed manually by highly trained stroke clinicians, which is time-consuming, error-prone, and difficult to scale to large datasets. Electronic medical records (EMR) could be leveraged to automate this process. We hypothesized that machine-learning-enabled natural language processing (NLP) for multiclass classification could determine the TOAST subtype from free text stored in the EMR.
Methods: We selected 1099 IS patients from an observational registry with TOAST subtyping confirmed by board-certified vascular neurologists. We analyzed text-based EMR data including progress notes and radiology reports. For each patient, we concatenated all notes into a single document. We tokenized the resulting text into a “bag of words” representation using n-grams (unigrams, bigrams and trigrams). We used five-fold cross-validation to guard against overfitting. To reduce the high dimensionality of the features, we used principal component analysis (150 components) and L1-regularized logistic regression, and then combined the features thus obtained within each fold. Next, we applied several classification methods (K-nearest neighbors, Support Vector Machines, Random Forests, Extra Trees classifiers, Gradient Boosting Machines, Extreme Gradient Boosting and stacked ensembles) to assess the accuracy and discrimination of machine learning techniques for TOAST subtyping compared to manual subtyping (gold standard). We performed receiver operating characteristic analysis to assess the discrimination of each model.
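As an illustration of the text-representation step described in the Methods, the following is a minimal sketch, not the authors' implementation. It concatenates one patient's notes into a single document and counts unigrams, bigrams and trigrams. The tokenizer (lowercasing, alphanumeric runs only), the function names and the example notes are all assumptions introduced here for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_counts(notes, max_n=3):
    """Concatenate a patient's notes into one document and count all 1..max_n-grams."""
    tokens = tokenize(" ".join(notes))
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical notes for a single patient (invented text, not registry data).
patient_notes = [
    "MRI brain: acute infarct in the left MCA territory.",
    "Progress note: atrial fibrillation noted on telemetry.",
]

features = ngram_counts(patient_notes)
print(features["atrial fibrillation"])
print(features["left mca territory"])
```

Because the notes are joined before tokenization, some n-grams span the boundary between two notes. In a sketch like this, the per-patient counts would then be assembled into a feature matrix before any dimensionality reduction or classification.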
Results: Our best classification method achieved an accuracy of 41 +/- 5% using radiology reports alone and 64 +/- 4% using progress notes alone. Combining radiology reports and progress notes, we achieved an accuracy of 66 +/- 5% with high discrimination (90 +/- 4%).
Conclusions: Compared to manual approaches, automated machine learning and NLP can discriminate TOAST subtypes using EMR data with moderate accuracy and high discrimination. The automated pipeline, if validated, could enable large-scale stroke epidemiology research using EMR data nationwide.