RT Journal Article
SR Electronic
T1 Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2024.01.26.24301810
DO 10.1101/2024.01.26.24301810
A1 Hager, Paul
A1 Jungmann, Friederike
A1 Bhagat, Kunal
A1 Hubrecht, Inga
A1 Knauer, Manuel
A1 Vielhauer, Jakob
A1 Holland, Robbie
A1 Braren, Rickmer
A1 Makowski, Marcus
A1 Kaisis, Georgios
A1 Rueckert, Daniel
YR 2024
UL http://medrxiv.org/content/early/2024/01/26/2024.01.26.24301810.abstract
AB Clinical decision making is one of the most impactful parts of a physician’s responsibilities and stands to benefit greatly from AI solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills that are necessary for deployment in a realistic clinical decision making environment, including gathering information, adhering to established guidelines, and integrating into clinical workflows. To understand how useful LLMs are in real-world settings, we must evaluate them in the wild, i.e. on real-world data under realistic conditions. Here we have created a curated dataset based on the MIMIC-IV database spanning 2400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians on average), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. 
Overall, our analysis reveals that LLMs are currently not ready for clinical deployment while providing a dataset and framework to guide future studies.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This study did not receive any funding.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Institutional Review Board of Beth Israel Deaconess Medical Center granted a waiver of informed consent and approved the data sharing initiative. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes.
The dataset is available to all researchers who create an account on https://physionet.org/ and follow the steps to gain access to the MIMIC-IV database (https://physionet.org/content/mimiciv/2.2/). Access is given after completing the "CITI Data or Specimens Only Research" training course.
The data use agreement of PhysioNet for "credentialed health data" must also be signed. The dataset can then be recreated using the code found at: https://github.com/paulhager/MIMIC-Clinical-Decision-Making-Dataset. The code to create the dataset uses Python v3.10 and pandas v2.1.3.