PT - JOURNAL ARTICLE
AU - Fernandes, Marta
AU - Westover, M. Brandon
AU - Singhal, Aneesh B.
AU - Zafar, Sahar F.
TI - Automated Extraction of Stroke Severity from Unstructured Electronic Health Records using Natural Language Processing
AID - 10.1101/2024.03.08.24304011
DP - 2024 Jan 01
TA - medRxiv
PG - 2024.03.08.24304011
4099 - http://medrxiv.org/content/early/2024/03/11/2024.03.08.24304011.short
4100 - http://medrxiv.org/content/early/2024/03/11/2024.03.08.24304011.full
AB - BACKGROUND: Multi-center electronic health records (EHR) can support quality improvement initiatives and comparative effectiveness research in stroke care. However, limitations of EHR-based research include challenges in abstracting key clinical variables from unstructured data at scale. This is further compounded by missing data. Here we develop a natural language processing (NLP) model that automatically reads EHR notes to determine the NIH stroke scale (NIHSS) score of patients with acute stroke.
METHODS: The study included notes from acute stroke patients (>= 18 years) admitted to the Massachusetts General Hospital (MGH) (2015-2022). The MGH data were divided into training (70%) and hold-out test (30%) sets. A two-stage model was developed to predict the admission NIHSS. A linear model with the least absolute shrinkage and selection operator (LASSO) was trained within the training set. For notes in the test set where the NIHSS was documented, the scores were extracted using regular expressions (stage 1); for notes where the NIHSS was not documented, LASSO was used for prediction (stage 2). The reference standard for NIHSS was obtained from the Get With The Guidelines Stroke Registry.
The two-stage model was tested on the hold-out test set and validated in the MIMIC-III dataset (Medical Information Mart for Intensive Care III, v1.4, 2001-2012), using root mean squared error (RMSE) and Spearman correlation (SC).
RESULTS: We included 4,163 patients (MGH = 3,876; MIMIC = 287); the average age was 69 [SD 15] years, 53% were male, and 72% were white. 90% of patients had ischemic stroke and 10% had hemorrhagic stroke. The two-stage model achieved an RMSE [95% CI] of 3.13 [2.86-3.41] (SC = 0.90 [0.88-0.91]) in the MGH hold-out test set and 2.01 [1.58-2.38] (SC = 0.96 [0.94-0.97]) in the MIMIC validation set.
CONCLUSIONS: The automatic NLP-based model can enable large-scale stroke severity phenotyping from EHR and therefore support real-world quality improvement and comparative effectiveness studies in stroke.
Competing Interest Statement: The authors have declared no competing interest.
Clinical Trial: The study consists of a retrospective data analysis and it was approved by the Mass General Brigham Institutional Review Board; a waiver of informed consent was obtained for this observational study.
Funding Statement: Acknowledgements: This work was funded by NIH R01NS131347 SFZ. Sources of Funding: Dr. Westover was supported by grants from the NIH (RF1AG064312, RF1NS120947, R01AG073410, R01HL161253, R01NS126282, R01AG073598, R01NS131347, R01NS130119) and NSF (2014431). Dr. Sahar F.
Zafar was supported by the NIH (K23NS114201, R01NS126282, R01AG082693).
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The study consists of a retrospective data analysis and it was approved by the Mass General Brigham Institutional Review Board; a waiver of informed consent was obtained for this observational study.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes
The clinical de-identified data used in this manuscript will be publicly available in a designated repository, upon acceptance of the manuscript.
AUC - Area under the receiver-operating characteristic curve
BIDMC - Beth Israel Deaconess Medical Center
CI - Confidence intervals
CONSORT - Consolidated Standards of Reporting Trials
CPT - Current Procedural Terminology
EHRs - Electronic health records
ICD - International Classification of Diseases
ICU - Intensive care unit
LASSO - Least absolute shrinkage and selection operator
MCA - Middle cerebral artery
MGH - Massachusetts General Hospital
MIMIC-III - Medical Information Mart for Intensive Care III
NIHSS - National Institutes of Health Stroke Scale
NLP - Natural language processing
RMSE - Root mean squared error
SC - Spearman correlation
STROBE - STrengthening the Reporting of OBservational studies in Epidemiology
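
The two-stage logic described in the abstract (regular-expression extraction when the NIHSS is documented, model prediction otherwise, scored by RMSE) can be illustrated with a minimal sketch. This is not the authors' code: the pattern `NIHSS_PATTERN`, the function names, and the `fallback` callable standing in for the trained LASSO model are all hypothetical, and a real pipeline would likely need a much richer set of patterns.

```python
import math
import re
from typing import Callable, Optional, Sequence

# Hypothetical pattern (an assumption, not the study's regex): "NIHSS" or
# "NIH stroke scale", an optional "score", up to 20 non-digit characters,
# then a one- or two-digit number.
NIHSS_PATTERN = re.compile(
    r"\b(?:NIHSS|NIH\s+stroke\s+scale)\b(?:\s+score)?[^0-9\n]{0,20}?(\d{1,2})\b",
    re.IGNORECASE,
)


def extract_nihss(note: str) -> Optional[int]:
    """Stage 1: return the first documented score in the valid 0-42 range, else None."""
    for match in NIHSS_PATTERN.finditer(note):
        score = int(match.group(1))
        if 0 <= score <= 42:  # the NIHSS ranges from 0 to 42
            return score
    return None


def two_stage_nihss(note: str, fallback: Callable[[str], float]) -> float:
    """Use the documented score when present (stage 1); otherwise call `fallback` (stage 2).

    `fallback` is a placeholder for a trained regression model such as LASSO.
    """
    documented = extract_nihss(note)
    if documented is not None:
        return float(documented)
    return fallback(note)


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root mean squared error between predictions and reference scores."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))
```

Passing the stage 2 model in as a callable keeps the sketch independent of any particular library; in practice the callable would wrap text featurization plus the fitted linear model.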