Abstract
BACKGROUND Multi-center electronic health records (EHR) can support quality improvement initiatives and comparative effectiveness research in stroke care. However, EHR-based research is limited by the difficulty of abstracting key clinical variables from unstructured data at scale, a problem compounded by missing data. Here we develop a natural language processing (NLP) model that automatically reads EHR notes to determine the NIH stroke scale (NIHSS) score of patients with acute stroke.
METHODS The study included notes from patients with acute stroke (>= 18 years) admitted to the Massachusetts General Hospital (MGH) (2015-2022). The MGH data were divided into training (70%) and hold-out test (30%) sets. A two-stage model was developed to predict the admission NIHSS. A linear model with the least absolute shrinkage and selection operator (LASSO) was trained on the training set. For notes in the test set where the NIHSS was documented, the score was extracted using regular expressions (stage 1); for notes where the NIHSS was not documented, the LASSO model was used for prediction (stage 2). The reference standard for NIHSS was obtained from the Get With The Guidelines Stroke Registry. The two-stage model was tested on the hold-out test set and validated in the Medical Information Mart for Intensive Care III (MIMIC-III) dataset v1.4 (2001-2012), using root mean squared error (RMSE) and Spearman correlation (SC).
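The two-stage logic described in the methods can be illustrated with a minimal sketch. The regular expression, the function names, and the clipping of predictions to the 0-42 NIHSS range are assumptions made for illustration; the authors' actual expressions and model are not given here. The `fallback` argument stands in for the trained LASSO model and can be any callable that maps a note to a predicted score.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: matches phrases such as "NIHSS: 12", "NIHSS of 7",
# or "NIH stroke scale score 4". This is not the authors' expression.
NIHSS_PATTERN = re.compile(
    r"\b(?:NIHSS|NIH\s+stroke\s+scale)(?:\s+score)?\s*(?:[:=]|of|is|was)?\s*(\d{1,2})\b",
    re.IGNORECASE,
)


def extract_nihss(note: str) -> Optional[int]:
    """Stage 1: return the first documented score in the valid 0-42 range, else None."""
    for match in NIHSS_PATTERN.finditer(note):
        score = int(match.group(1))
        if 0 <= score <= 42:
            return score
    return None


def two_stage_nihss(note: str, fallback: Callable[[str], float]) -> float:
    """Use the documented score when present; otherwise call the stage 2 model."""
    documented = extract_nihss(note)
    if documented is not None:
        return float(documented)
    # Assumption: clip the regression output to the valid NIHSS range.
    return min(42.0, max(0.0, fallback(note)))
```

In practice the `fallback` callable would wrap a fitted text-regression pipeline; a constant function such as `lambda note: 5.0` is enough to exercise the control flow.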
RESULTS We included 4,163 patients (MGH = 3,876; MIMIC = 287); the average age was 69 [SD 15] years, 53% were male, and 72% were white. Ninety percent of patients had ischemic stroke and 10% had hemorrhagic stroke. The two-stage model achieved an RMSE [95% CI] of 3.13 [2.86-3.41] (SC = 0.90 [0.88-0.91]) in the MGH hold-out test set and 2.01 [1.58-2.38] (SC = 0.96 [0.94-0.97]) in the MIMIC validation set.
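For readers less familiar with the two metrics, the following sketch shows how RMSE and Spearman correlation can be computed from paired reference and predicted scores. It uses only the Python standard library; the function names are illustrative, and the sketch does not reproduce the confidence interval procedure, which is not described here.

```python
import math
from typing import List, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between reference and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

RMSE is expressed in NIHSS points, so it reflects the typical size of the error, while Spearman correlation only measures whether patients are ordered by severity in the same way as the reference standard.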
CONCLUSIONS The automatic NLP-based model can enable large-scale stroke severity phenotyping from EHR and therefore support real-world quality improvement and comparative effectiveness studies in stroke.
Competing Interest Statement
The authors have declared no competing interest.
Clinical Trial
The study consists of a retrospective data analysis and it was approved by the Mass General Brigham Institutional Review Board; a waiver of informed consent was obtained for this observational study.
Funding Statement
Acknowledgements: This work was funded by NIH R01NS131347 (SFZ).
Sources of Funding: Dr. Westover was supported by grants from the NIH (RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119) and the NSF (2014431). Dr. Sahar F. Zafar was supported by the NIH (K23NS114201, R01NS126282, R01AG082693).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
Authors' e-mail addresses: M. Brandon Westover: mwestover{at}mgh.harvard.edu, Aneesh B. Singhal: asinghal{at}mgh.harvard.edu, Sahar F. Zafar: sfzafar{at}mgh.harvard.edu
Data Availability
The de-identified clinical data used in this manuscript will be made publicly available in a designated repository upon acceptance of the manuscript.
Non-standard Abbreviations and Acronyms
- AUC: Area under the receiver-operating characteristic curve
- BIDMC: Beth Israel Deaconess Medical Center
- CI: Confidence intervals
- CONSORT: Consolidated Standards of Reporting Trials
- CPT: Current Procedural Terminology
- EHRs: Electronic health records
- ICD: International Classification of Diseases
- ICU: Intensive care unit
- LASSO: Least absolute shrinkage and selection operator
- MCA: Middle cerebral artery
- MGH: Massachusetts General Hospital
- MIMIC-III: Medical Information Mart for Intensive Care III
- NIHSS: National Institutes of Health Stroke Scale
- NLP: Natural language processing
- RMSE: Root mean squared error
- SC: Spearman correlation
- STROBE: STrengthening the Reporting of OBservational studies in Epidemiology