Abstract
Background Assessment of stroke risk in patients with atrial fibrillation (AF) is crucial for guiding anticoagulation therapy. CHA₂DS₂-VASc is a widely used score for defining this risk, but current assessments rely on manual calculation by clinicians or approximations from structured EHR data elements. Unstructured clinical notes contain rich information that could enhance risk assessment. We developed and validated a Retrieval-Augmented Generation (RAG) approach to extract CHA₂DS₂-VASc risk factors from unstructured notes in patients with AF.
Methods We employed a RAG architecture paired with the large language model Llama3.1 to extract features relevant to CHA₂DS₂-VASc scores from unstructured notes. The model was deployed on a random set of 1,000 clinical notes (934 AF patients) from Yale New Haven Health System (YNHHS). To establish a gold standard, two clinicians manually reviewed and labeled CHA₂DS₂-VASc risk factors in a random subset of 200 notes. CHA₂DS₂-VASc scores were calculated for each patient twice: using structured data alone, and after incorporating risk factors identified with RAG. We assessed performance across risk factors using the macro-averaged area under the receiver operating characteristic curve (AUROC). For external validation, we used 100 manually labeled clinical notes from the MIMIC-IV database.
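The comparison between structured-only and RAG-augmented scores can be illustrated with a minimal sketch. It uses the standard CHA₂DS₂-VASc point weights; the factor names, the helper `cha2ds2_vasc`, the rule of taking the union of structured and RAG-derived factors, and the example patient are all assumptions made for illustration, not the study's code or data.

```python
# Standard CHA2DS2-VASc weights for the comorbidity components.
# The key names are hypothetical labels chosen for this sketch.
POINTS = {
    "chf": 1,
    "hypertension": 1,
    "diabetes": 1,
    "stroke_tia": 2,
    "vascular_disease": 1,
}


def age_points(age):
    """2 points for age >= 75, 1 point for age 65 to 74, otherwise 0."""
    if age >= 75:
        return 2
    if age >= 65:
        return 1
    return 0


def cha2ds2_vasc(factors, age, female):
    """Sum the points for a set of risk-factor labels plus age and sex."""
    unknown = set(factors) - set(POINTS)
    if unknown:
        raise ValueError(f"unknown risk factors: {sorted(unknown)}")
    return sum(POINTS[f] for f in factors) + age_points(age) + (1 if female else 0)


# Hypothetical patient: structured data records only diabetes, while the
# note-derived (RAG) factors also include hypertension and prior stroke/TIA.
structured = {"diabetes"}
rag = {"hypertension", "diabetes", "stroke_tia"}

score_structured = cha2ds2_vasc(structured, age=78, female=True)
# Assumption: a factor counts if either source reports it (set union).
score_combined = cha2ds2_vasc(structured | rag, age=78, female=True)

print(score_structured, score_combined)  # prints "4 7"
```

Because the union can only add factors, the combined score in this sketch is never lower than the structured-only score.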
Results The RAG model demonstrated robust performance in extracting risk factors from clinical notes. In the 1,000 clinical notes, RAG identified several risk factors more frequently than structured data elements, including hypertension (82.4% vs 26.2%), stroke/TIA (62.9% vs 45.5%), vascular disease (83.4% vs 56.6%), and diabetes (84.1% vs 47.2%). In the 200 expert-annotated notes, the RAG approach achieved AUROCs ranging from 0.96 to 0.98 for hypertension, diabetes, and age ≥75 years. Incorporating risk factors identified by RAG increased CHA₂DS₂-VASc scores compared with using structured data alone.
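For readers unfamiliar with the metric, macro-averaged AUROC is the unweighted mean of the per-risk-factor AUROCs. The sketch below computes it in plain Python using the pairwise (Mann-Whitney) definition of AUROC; the function names and the toy labels and scores are hypothetical and are not drawn from the study.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as half a win, so the
    function also works when predictions are binary (0/1) rather than
    continuous.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_auroc(per_factor):
    """Unweighted mean of the AUROC computed separately for each factor."""
    values = [auroc(labels, scores) for labels, scores in per_factor.values()]
    return sum(values) / len(values)


# Toy example (hypothetical data): gold labels and model scores per factor.
toy = {
    "hypertension": ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    "diabetes": ([1, 0, 1, 0], [0.9, 0.6, 0.4, 0.2]),
}

macro = macro_auroc(toy)
print(macro)  # prints 0.875 (mean of 1.0 and 0.75)
```

Macro-averaging gives each risk factor equal weight regardless of how common it is in the annotated notes.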
Conclusion An LLM-optimized RAG can accurately extract CHA₂DS₂-VASc risk factors from unstructured clinical notes in AF patients. This approach can enable computable risk assessment and guide appropriate anticoagulation therapy.
Competing Interest Statement
The authors PMT and RK are coinventors of a provisional patent not related to the current work (63/606,203). RK is an Associate Editor of JAMA. He receives support from the Doris Duke Charitable Foundation (under award 2022060). He also receives research support, through Yale, from Bristol-Myers Squibb, Novo Nordisk, and BridgeBio. He is a coinventor of U.S. Provisional Patent Applications 63/177,117, 63/428,569, 63/346,610, 63/484,426, 63/508,315, and 63/606,203 and is a cofounder of Ensight-AI and Evidence2Health, health platforms to improve cardiovascular diagnosis and evidence-based cardiovascular care.
Funding Statement
PA is supported by F30HL176149. RK is supported by the National Heart, Lung, and Blood Institute of the National Institutes of Health (under awards R01HL167858 and K23HL153775) and the Doris Duke Charitable Foundation (under award 2022060). Dr. Thangaraj was supported by the National Institutes of Health under award T32HL155000. The funding organizations had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The Yale Institutional Review Board reviewed the study, approved the protocol, and waived the need for informed consent as this is a secondary analysis of existing data.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The data used for this study cannot be shared publicly because they contain protected health information, and sharing them would violate patient privacy. Access to the MIMIC-IV cohort can be requested at https://physionet.org/content/mimiciv/3.0/.