RT Journal Article
SR Electronic
T1 Evaluating and Mitigating Limitations of Large Language Models in Clinical Decision Making
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2024.01.26.24301810
DO 10.1101/2024.01.26.24301810
A1 Hager, Paul
A1 Jungmann, Friederike
A1 Bhagat, Kunal
A1 Hubrecht, Inga
A1 Knauer, Manuel
A1 Vielhauer, Jakob
A1 Holland, Robbie
A1 Braren, Rickmer
A1 Makowski, Marcus
A1 Kaisis, Georgios
A1 Rueckert, Daniel
YR 2024
UL http://medrxiv.org/content/early/2024/01/26/2024.01.26.24301810.abstract
AB Clinical decision making is one of the most impactful parts of a physician’s responsibilities and stands to benefit greatly from AI solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills that are necessary for deployment in a realistic clinical decision making environment, including gathering information, adhering to established guidelines, and integrating into clinical workflows. To understand how useful LLMs are in real-world settings, we must evaluate them in the wild, i.e. on real-world data under realistic conditions. Here we have created a curated dataset based on the MIMIC-IV database spanning 2400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians on average), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information.
Overall, our analysis reveals that LLMs are currently not ready for clinical deployment while providing a dataset and framework to guide future studies.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: This study did not receive any funding.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes.
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Institutional Review Board of Beth Israel Deaconess Medical Center granted a waiver of informed consent and approved the data sharing initiative.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes.
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes.
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes.
The dataset is available to all researchers who create an account on https://physionet.org/ and follow the steps to gain access to the MIMIC-IV database (https://physionet.org/content/mimiciv/2.2/). Access is given after completing the "CITI Data or Specimens Only Research" training course.
The PhysioNet data use agreement for "credentialed health data" must also be signed. The dataset can then be recreated using the code found at: https://github.com/paulhager/MIMIC-Clinical-Decision-Making-Dataset. The code to create the dataset uses Python v3.10 and pandas v2.1.3.