Introduction: Cardiac catheterization reports are rich sources of clinical data, but their unstructured format makes large-scale analysis time- and effort-intensive. Natural Language Processing (NLP) and Machine Learning (ML) are computational techniques that have previously been leveraged to analyze unstructured clinical data. Here, we apply those techniques to cardiac catheterization reports from a large academic medical center.
Hypothesis: NLP and ML techniques can be leveraged to analyze a large number of unstructured cardiac catheterization reports to identify the presence of obstructive coronary artery disease (CAD) at the level of individual coronary arteries.
Methods: A randomly selected set of 200 full-text catheterization reports was manually labeled for the presence or absence of obstructive CAD in each of the Left Main, Left Anterior Descending, Left Circumflex, and Right Coronary Arteries. The reports were then processed using the “text2vec” NLP and “caret” ML packages in R, and a subset of the labeled reports was used to train a suite of ML algorithms: LDA, GLM, GLMNET, linear SVM, radial SVM, CART, bagged CART, random forest, and neural network. The algorithms were compared with respect to sensitivity, specificity, and ROC characteristics, and random forest demonstrated the best overall performance. The final random forest model was validated against manual adjudication in a test set consisting of the remaining labeled reports. The model was then used to classify a set of 4,226 unlabeled reports.
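The preprocessing and evaluation steps described in the Methods can be sketched in a few lines. This is a minimal illustration in standard-library Python, not the R pipeline used in the study: the function names, the 75% training fraction, the random seed, and the two toy report strings are all assumptions made for the example, and no classifier is trained here.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(token_lists):
    """Map every distinct token to a column index, in sorted order."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {term: idx for idx, term in enumerate(vocab)}


def to_count_vector(tokens, vocabulary):
    """Bag-of-words counts for one report, ordered by vocabulary index."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]


def split_train_test(items, train_fraction, seed):
    """Shuffle reproducibly, then split into training and test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity from binary labels (1 = obstructive CAD)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Invented toy sentences, for illustration only (not study data).
    toy_reports = [
        "LAD with severe stenosis. RCA without significant disease.",
        "Left main patent. LAD patent.",
    ]
    token_lists = [tokenize(r) for r in toy_reports]
    vocabulary = build_vocabulary(token_lists)
    matrix = [to_count_vector(toks, vocabulary) for toks in token_lists]
    print(len(vocabulary), "terms;", len(matrix), "rows")

    # The abstract does not state the split; 0.75 is an assumed value.
    train_ids, test_ids = split_train_test(range(200), 0.75, seed=1)
    print(len(train_ids), "training reports;", len(test_ids), "test reports")
```

In a per-artery setup such as the one described, a feature matrix of this kind would be paired with one binary label per coronary artery, and each candidate algorithm would be scored on held-out reports with metrics like those returned by `sensitivity_specificity`.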
Results: The final classifier identified obstructive CAD at the level of individual coronary arteries with high precision (0.951) and recall (0.940); the F1 score was 0.946. The overall distribution of obstructive CAD is shown in the figure below and is consistent with what has been reported in the literature.
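The F1 score is the harmonic mean of precision and recall. As a quick arithmetic check, the sketch below recomputes it from the rounded values reported above; the function name `f1_score` is an illustrative choice. The result is about 0.9455, which is consistent with the reported 0.946 once rounding of the published precision and recall is taken into account.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in the Results.
f1 = f1_score(0.951, 0.940)
print(round(f1, 4))
```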
Conclusions: NLP and ML techniques can be effectively applied to extract data from full-text cardiac catheterization reports.