RT Journal Article
SR Electronic
T1 Evaluating Large Language Models in Extracting Cognitive Exam Dates and Scores
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2023.07.10.23292373
DO 10.1101/2023.07.10.23292373
A1 Zhang, Hao
A1 Jethani, Neil
A1 Jones, Simon
A1 Genes, Nicholas
A1 Major, Vincent J.
A1 Jaffe, Ian S.
A1 Cardillo, Anthony B.
A1 Heilenbach, Noah
A1 Ali, Nadia Fazal
A1 Bonanni, Luke J.
A1 Clayburn, Andrew J.
A1 Khera, Zain
A1 Sadler, Erica C.
A1 Prasad, Jaideep
A1 Schlacter, Jamie
A1 Liu, Kevin
A1 Silva, Benjamin
A1 Montgomery, Sophie
A1 Kim, Eric J.
A1 Lester, Jacob
A1 Hill, Theodore M.
A1 Avoricani, Alba
A1 Chervonski, Ethan
A1 Davydov, James
A1 Small, William
A1 Chakravartty, Eesha
A1 Grover, Himanshu
A1 Dodson, John A.
A1 Brody, Abraham A.
A1 Aphinyanaphongs, Yindalon
A1 Masurkar, Arjun
A1 Razavian, Narges
YR 2024
UL http://medrxiv.org/content/early/2024/02/13/2023.07.10.23292373.abstract
AB Importance: Large language models (LLMs) are crucial for medical tasks. Ensuring their reliability is vital to avoid false results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests like MMSE and CDR.
Objective: Evaluate ChatGPT and LlaMA-2 performance in extracting MMSE and CDR scores, including their associated dates.
Methods: Our data consisted of 135,307 clinical notes (Jan 12th, 2010 to May 24th, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed by ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and training the reviewers. The remaining 722 were assigned to reviewers, with 309 each assigned to two reviewers simultaneously. Inter-rater agreement (Fleiss’ Kappa), precision, recall, true/false negative rates, and accuracy were calculated.
Our study follows TRIPOD reporting guidelines for model validation.
Results: For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR, the results were lower overall, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on double-reviewed notes. LlaMA-2 errors included 27 cases of total hallucination, 19 cases of reporting other scores instead of MMSE, 25 missed scores, and 23 cases of reporting only the wrong date. In comparison, ChatGPT’s errors included only 3 cases of total hallucination, 17 cases of reporting the wrong test instead of MMSE, and 19 cases of reporting a wrong date.
Conclusions: In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy, with better performance compared to LlaMA-2. The use of LLMs could benefit dementia research and clinical care by identifying eligible patients for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs is crucial to understanding their capabilities and limitations.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This study was funded by the NYU Langone Medical Center Information Technology (MCIT) center. Authors N.R. and A.M. are also funded by the National Institute on Aging of the National Institutes of Health under Award Numbers R01AG085617 and P30AG066512. Authors H.Z., S.J., V.J.M., J.A.D., A.A.B., Y.A., A.M., and N.R.
are also supported by award number R01AG079175 from the National Institute on Aging of the National Institutes of Health.
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Ethics committee/IRB of New York University Langone Health gave ethical approval of this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes
The data contains private patient information and will not be made available.