Abstract
Aims Large language models (LLMs) have shown promise in therapeutic decision-making comparable to medical experts, but these studies have used specially prepared patient data. The aim of this study was to determine whether LLMs can make guideline-adherent treatment decisions based on real-world patient data.
Methods and Results We conducted a retrospective study of 80 patients with severe aortic stenosis who were scheduled for either surgical (SAVR, n=24) or transcatheter aortic valve replacement (TAVR, n=56) by our institutional Heart Team in 2022. Various LLMs (BioGPT, GPT-3.5, GPT-4, GPT-4 Turbo, Llama-2, Mistral, and PaLM-2) were queried using either deidentified original medical reports or manually generated case summaries to determine the most guideline-adherent treatment. Agreement with the Heart Team was measured using Cohen’s kappa coefficients, reliability using intraclass correlation coefficients (ICCs), and fairness using frequency bias indices (FBIs), with FBIs >1 indicating bias towards TAVR. When presented with original medical reports, LLMs showed poor performance (kappa: -0.47 to 0.09, ICC: 0.0 to 0.91, FBI: 0.95 to 1.53). The LLMs’ performance improved substantially when case summaries were used as input and additional guideline knowledge was added to the prompt (kappa: -0.02 to 0.62, ICC: 0.01 to 0.97, FBI: 0.46 to 1.24). Qualitative analysis revealed instances of hallucinations in all LLMs tested.
Conclusion Our findings suggest that even advanced LLMs currently make informed treatment decisions only with extensively pre-processed data, not with original patient data. Unreliable responses, bias and hallucinations pose significant health risks and highlight the need for caution in applying LLMs to real-world clinical decision-making.
Introduction
Large Language Models (LLMs) have recently demonstrated their impressive capabilities in medicine, exemplified by passing medical board exams1, making correct diagnoses in complex clinical cases2 and excelling in physician-patient communication.3
Most recently, the use of LLMs in therapeutic decision making has been trialed. Several studies have shown that LLMs can make treatment decisions for patients with oncological and cardiovascular diseases that are in substantial agreement with the respective treatment decisions made by clinical experts in tumor boards4-7 and Heart Teams (HTs)8. However, a common feature of these studies was that the LLMs did not make treatment decisions based on real-world patient data in its original format (e.g., discharge letters, imaging reports, etc.) but based on pre-processed data.
In clinical practice, relevant patient data such as patient characteristics, comorbidities, tumor stages and imaging results are typically available in free-text format, either as medical text reports or as text entries in the electronic health record – a format that is likely to persist in the near future. In the studies cited above, however, the investigators extracted decision-relevant patient data from the original medical reports in a pre-processing step and provided them to the LLMs as concise, high-quality input. It is therefore still unknown to what extent LLMs can make treatment decisions based on the original medical data, a scenario that could significantly reduce physician workload, potentially increase guideline adherence and thus improve patient care.
In this study, we investigate the impact of data representation, i.e. using original medical reports versus case summaries either manually written by physicians or generated by LLMs, on the performance of LLMs in therapeutic decision-making.
As our study population, we selected patients with severe aortic stenosis (AS). This cohort was chosen because the parameters relevant to decision-making are readily quantifiable, the potential for resource optimization is substantial, and the prevalence of the condition is increasing. If left untreated, AS is associated with high morbidity and mortality.9 Treatment modalities for severe AS include surgical aortic valve replacement (SAVR), transcatheter aortic valve replacement (TAVR) and, to a lesser extent, medical therapy. The choice of the optimal treatment modality depends on several clinical variables, including patient age, estimated surgical risk, comorbidities and anatomical factors, as specified in the 2021 guidelines of the European Society of Cardiology (ESC) and the European Association for Cardio-Thoracic Surgery (EACTS) for the management of valvular heart disease.10 The 2021 ESC/EACTS Guidelines strongly endorse an active, collaborative consultation with the multidisciplinary Heart Team (HT). HTs comprise cardiologists, cardiac surgeons, cardiac imaging specialists and cardiac anesthesiologists. In HT meetings, these experts review a patient’s condition based on patient data laboriously extracted from medical reports before arriving at a treatment decision using a guideline-based approach.
Methods
We presented patient data to an LLM to obtain a treatment decision of either SAVR or TAVR. We assessed the degree of concordance between the treatment decisions provided by the LLM and the treatment decisions provided by the HT. Furthermore, we assessed decidability, reliability and fairness of the LLMs. Finally, we compared the performance of seven state-of-the-art LLMs to the performance of a simple non-LLM reference model. In an ablative manner, we studied the effect of using case summaries instead of the original medical reports and adding guideline knowledge to the prompt, respectively, resulting in four distinct experiments (Figure 1). In addition, we conducted a fifth experiment to determine whether the best-performing LLM could make sound treatment decisions based on case summaries created by the LLM itself in an intermediate step.
Study population
This study included patients treated at a heart center. We screened our hospital information system for all patients with severe degenerative AS who were scheduled for a HT meeting at one campus of our center in 2022 and identified 80 patients with sufficiently digitized documentation. As part of a quaternary care center, our institutional HT receives preselected patients scheduled for invasive AS treatment. Therefore, the number of patients recommended for conservative treatment at our institution is negligible. As a result, we decided to limit the possible therapeutic options for this study to SAVR and TAVR, excluding conservative therapy. This study was approved by the research ethics committee of Charité – Universitätsmedizin Berlin (EA1/146/23).
Data collection
Medical reports were available as Portable Document Format (PDF) files in our hospital information system. For each patient, we included the following pre-procedural reports: the two most recent discharge letters (including letters from external clinics), the invasive coronary angiography report, the echocardiography report, the CT scan report, and the HT protocol. We manually anonymized these reports. HT meeting protocols are standardized documents that contain decision-relevant patient characteristics, such as comorbidities and surgical risk scores, as well as the final treatment decision of the HT (Supplementary material online, Figure S1). A detailed description of our institutional HT is provided in the Supplementary material online.
Large Language Models
The study employed several state-of-the-art LLMs, namely GPT-3.511, GPT-412 and GPT-4 Turbo by OpenAI, and PaLM 2 by Google13. In addition, we used the open-source models Mistral-7B14, Llama 2 by Meta15 and BioGPT16. These LLMs had either demonstrated proficiency on similar tasks or had undergone pre-training on medical literature. Model details are provided in Supplementary Table S1. We set the model hyperparameters to their default values, except for the temperature, which we set to zero in line with previous studies in the medical domain.17 The temperature is a hyperparameter in LLMs that controls the randomness of the model’s output. A low value results in a more deterministic and focused model output, thereby reducing variability and creativity. A detailed description of how we accessed the LLMs and handled input size constraints is given in the Supplementary material online (Section Supplementary methods).
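To illustrate the role of the temperature, the following minimal Python sketch applies temperature scaling to a softmax over next-token scores. The scores (logits) and the function name are hypothetical and are not taken from any of the models studied; treating a temperature of zero as greedy (argmax) selection is an assumed convention, since the division by the temperature is otherwise undefined.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities for the given logits at a temperature.

    Assumption: a temperature of 0 is treated as greedy decoding, i.e.
    all probability mass goes to the highest-scoring token.
    """
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.5]
for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

In this sketch, lowering the temperature concentrates probability on the top-scoring token, which is why a temperature of zero is expected to reduce run-to-run variability.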
Reference model
The reference model represented an algorithmic emulation of the 2021 ESC/EACTS Guidelines for the management of valvular heart disease.10 More specifically, the reference model assigned patients to either SAVR or TAVR according to a flowchart (Supplementary material online, Figure S2) and relevant clinical variables (Table S4, Table S5).10 Model details are provided in the Supplementary material online.
Experiments
Five experiments were conducted to investigate the effects of data pre-processing on LLM performance:
RAW: In the RAW experiment, anonymized medical reports (i.e., the two most recent discharge letters, the invasive coronary angiography report, the echocardiography report, and the CT scan report) were concatenated, and stored in a unified text file. This file was programmatically inserted into a prompt template. Each prompt included an introductory or continuation phrase and concluded with a request for a treatment decision (Supplementary material online, Table S6).
RAW+: As it is unknown whether the LLMs we used had sufficient guideline knowledge, we compiled a summary of the relevant clinical practice guideline (CPG) content from the ESC/EACTS guidelines (hereafter referred to as the CPG resume).10 We added this CPG resume to the prompt along with the unified text reports.
SUM: To study the effect of content compression, we replaced the original medical reports used in RAW with concise case summaries. These case summaries were created based on patient data extracted from the HT protocols.
SUM+: Case summaries were used as input and enriched with the CPG resume (Figure 1).
SUMLLM+: In a first query, case summaries were generated from the original medical reports by GPT-4 Turbo before these LLM-generated case summaries were used as input enriched with the CPG resume in a second query.
Prompt templates, the CPG resume and an exemplary case summary are shown in the Supplementary material online (Table S6).
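The way the four single-query experiments differ in their input can be sketched as follows. This is an illustrative Python sketch only: the template wording, the function names and the placeholder texts are hypothetical, and the prompt templates actually used are those listed in Table S6.

```python
# Hypothetical template; the study's actual wording is given in Table S6.
PROMPT_TEMPLATE = (
    "{intro}\n\n"
    "{guideline_block}"
    "Patient data:\n{patient_data}\n\n"
    "{question}"
)

def concatenate_reports(reports):
    """Join the individual report texts into one unified text."""
    return "\n\n".join(r.strip() for r in reports if r.strip())

def build_prompt(patient_data, cpg_resume=None):
    """Insert patient data (and optionally the CPG resume) into the template."""
    guideline_block = ""
    if cpg_resume:
        guideline_block = f"Guideline summary:\n{cpg_resume}\n\n"
    return PROMPT_TEMPLATE.format(
        intro="Please review the following patient.",           # assumed phrase
        guideline_block=guideline_block,
        patient_data=patient_data,
        question="Which treatment is indicated: SAVR or TAVR?",  # assumed phrase
    )

# Placeholder inputs standing in for real documents
reports = ["<discharge letter 1>", "<discharge letter 2>", "<echo report>"]
case_summary = "<case summary>"
cpg_resume = "<CPG resume>"

prompts = {
    "RAW": build_prompt(concatenate_reports(reports)),
    "RAW+": build_prompt(concatenate_reports(reports), cpg_resume),
    "SUM": build_prompt(case_summary),
    "SUM+": build_prompt(case_summary, cpg_resume),
}
```

In this sketch, only the patient data block and the optional guideline block change between experiments, which mirrors the ablative design described above.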
The LLMs’ responses were manually reviewed and categorized as either “TAVR”, “SAVR” or “indeterminate”. Indeterminate responses occur when the model output does not match the available answer choices or when the model determines that there is insufficient information to support a decision (Supplementary material online, Table S7). To assess reliability and obtain robust estimates of performance metrics, the LLMs were presented with the same prompt input 10 times in succession for each experiment and patient (hereafter referred to as ‘runs’) to obtain a treatment decision. To prevent memory bias, a new chat session was initiated for each run.
Performance metrics
We quantified concordance by measuring accuracy and interrater agreement. Accuracy was calculated as the proportion of treatment decisions that agreed with those made by the HT. Interrater agreement was estimated using Cohen’s kappa coefficients. Decidability was quantified as the proportion of determinate treatment decisions. Bias was quantified using frequency bias indices (FBI), defined as the ratio of predicted to observed treatment decisions for TAVR.
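The following Python sketch shows one way these four metrics could be computed from paired decisions. It is a simplified illustration under stated assumptions: indeterminate responses are counted as disagreements and treated as a third category for Cohen’s kappa, which may differ from the handling strategies described in Table S8, and the example decisions are invented.

```python
from collections import Counter

def accuracy(ht, llm):
    """Proportion of LLM decisions that match the HT decision."""
    return sum(h == l for h, l in zip(ht, llm)) / len(ht)

def cohen_kappa(ht, llm):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(ht)
    p_o = accuracy(ht, llm)  # observed agreement
    ht_counts, llm_counts = Counter(ht), Counter(llm)
    labels = set(ht_counts) | set(llm_counts)
    # Agreement expected by chance from the marginal frequencies
    p_e = sum(ht_counts[c] * llm_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def decidability(llm):
    """Proportion of determinate (TAVR or SAVR) responses."""
    return sum(l in ("TAVR", "SAVR") for l in llm) / len(llm)

def frequency_bias_index(ht, llm, label="TAVR"):
    """Ratio of predicted to observed decisions for the given label."""
    return llm.count(label) / ht.count(label)

# Invented example: 10 patients, one indeterminate LLM response
ht = ["TAVR"] * 7 + ["SAVR"] * 3
llm = ["TAVR"] * 6 + ["indeterminate", "TAVR", "SAVR", "SAVR"]

print(accuracy(ht, llm))               # 0.8
print(round(cohen_kappa(ht, llm), 3))  # 0.556
print(decidability(llm))               # 0.9
print(frequency_bias_index(ht, llm))   # 1.0
```

An FBI above 1 in this sketch means that TAVR was predicted more often than it was chosen by the HT.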
Due to the limitations of individual metrics, we used three different metrics to quantify reliability: intraclass correlation coefficients (ICCs), normalized Shannon entropy, and the proportion of unanimously accurate treatment decisions. A detailed description of the performance metrics, including strategies for handling indeterminate responses, is provided in the Supplementary material online (Table S8).
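Two of the reliability metrics can be sketched in a few lines of Python. The sketch assumes that entropy is computed per patient over the repeated runs and normalized by the logarithm of the number of possible response categories (here three: TAVR, SAVR and indeterminate); the exact definitions used in the study are given in Table S8, and the example runs are invented.

```python
import math
from collections import Counter

CATEGORIES = ("TAVR", "SAVR", "indeterminate")  # assumed category set

def normalized_entropy(runs, categories=CATEGORIES):
    """Shannon entropy of the responses over repeated runs, scaled to [0, 1].

    0 means every run gave the same response; 1 means the responses were
    spread evenly over all categories.
    """
    counts = Counter(runs)
    n = len(runs)
    h = 0.0
    for c in categories:
        p = counts.get(c, 0) / n
        if p > 0:
            h -= p * math.log(p)
    return h / math.log(len(categories))

def unanimous_accuracy(runs_by_patient, ht):
    """Proportion of patients for whom every run matched the HT decision."""
    hits = sum(
        all(r == truth for r in runs)
        for runs, truth in zip(runs_by_patient, ht)
    )
    return hits / len(ht)

# Invented example: three patients, four runs each
runs_by_patient = [
    ["TAVR", "TAVR", "TAVR", "TAVR"],           # unanimous and correct
    ["SAVR", "TAVR", "SAVR", "indeterminate"],  # inconsistent
    ["TAVR", "TAVR", "TAVR", "TAVR"],           # unanimous but wrong
]
ht = ["TAVR", "SAVR", "SAVR"]

for runs in runs_by_patient:
    print(round(normalized_entropy(runs), 3))
print(unanimous_accuracy(runs_by_patient, ht))
```

The third invented patient shows why the metrics complement each other: the responses are perfectly consistent (entropy of zero) yet unanimously incorrect.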
Statistical analysis
Patient characteristics for patients who received SAVR vs. TAVR were compared using Student’s t-test for normally distributed continuous variables and Mann-Whitney U test for non-normally distributed continuous variables. The Shapiro-Wilk test was used to assess normality. The chi-squared test was used for binary variables and Fisher’s exact test for sparse binary data.
Accuracy and Cohen’s kappa were computed with Python’s sklearn.metrics package (version 1.2.2). ICCs were calculated based on a one-way random effects, absolute agreement, single-rater model18 using Python’s pingouin package (version 0.5.3).
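For readers unfamiliar with this ICC form, the sketch below computes the one-way random effects, single-rater ICC directly from its analysis-of-variance definition, with patients as targets and runs as ratings. It is a self-contained illustration rather than the pingouin-based computation used in the study; the numeric coding (TAVR = 1, SAVR = 0) and the example ratings are assumptions, and indeterminate responses are not handled here.

```python
def icc_one_way(ratings):
    """ICC for a one-way random effects, single-rater model.

    ratings: one list per patient containing k numeric ratings (one per run).
    Returns NaN when the coefficient is undefined (no variance at all).
    """
    n = len(ratings)     # number of patients (targets)
    k = len(ratings[0])  # number of runs per patient
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-patient and within-patient mean squares
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        return float("nan")
    return (ms_between - ms_within) / denominator

# Invented example: four patients, three runs each (TAVR = 1, SAVR = 0)
ratings = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(round(icc_one_way(ratings), 3))  # 0.727
```

In this sketch, a single inconsistent run in one of four patients already lowers the coefficient from 1.0 to about 0.73.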
Results
Patient characteristics
A total of 80 patients with severe AS who were discussed at our institutional HT in 2022 were included. Of these patients, 24 (30 %) underwent SAVR while 56 (70 %) underwent TAVR. Patient characteristics are presented in Table 1.
Qualitative analysis
The LLMs’ output ranged from nonsensical treatment recommendations (e.g., heart transplant) and purely fabricated content, to correctly assessing the patient’s status, choosing the correct treatment option, and supporting the treatment decision with additional anatomical insights (Table 2).
Qualitative analysis revealed that smaller models (e.g., BioGPT) tended to provide conflicting treatment recommendations for the same patient. In contrast, the frontier models (e.g., GPT-4, PaLM 2) consistently provided the same treatment recommendation when presented with the same patient data repeatedly over 10 runs.
In each experiment, all LLMs produced hallucinations of varying severity and frequency. These included instructional, contextual and factual inconsistencies (Table 2).
Quantitative analysis
Figure 2 and Table S9 (Supplementary material online) present the performance metrics. In the RAW experiment, LLMs achieved accuracies of up to 71 % and Cohen’s kappa coefficients of up to 0.09. Some LLMs gave indeterminate treatment recommendations in up to 54 % of cases (e.g., GPT-3.5) and showed low reliability as evidenced by low ICCs and high entropy values (e.g., Mistral). FBIs were substantially higher than 1.0 for all LLMs except BioGPT. The reference model generally outperformed the LLMs in the RAW experiment regarding the metrics we assessed.
In the RAW+ experiments, performance metrics did not change substantially. However, the performance metrics improved in the SUM experiment and peaked in the SUM+ experiment, where some LLMs (e.g., GPT-4 Turbo) drew level with the reference model.
A general trend towards more concordant treatment decisions, fewer indeterminate responses, increased reliability, and less bias towards TAVR was observed with increasing data pre-processing and information enrichment efforts from RAW to SUM+ (Figure 2, Figure 3).
When instructing GPT-4 Turbo to generate case summaries as an intermediate step, we observed an accuracy of 74 %, a Cohen’s kappa coefficient of 0.17, and an FBI of 1.2.
Discussion
To our knowledge, this is the first study to evaluate the impact of input data representation, including real-world medical data, on the ability of LLMs to make guideline-adherent treatment decisions.
Current LLMs make incorrect decisions based on original clinical data
Our analysis reveals that LLMs struggled to process original medical reports effectively, often outputting ‘TAVR’ or providing indeterminate responses. The LLMs showed low agreement with the HT, exhibited undecidability and unreliability, and displayed a strong bias towards TAVR. The high accuracies observed with some LLMs in the RAW experiment can be largely attributed to the class imbalance within our patient cohort, where 70 % of patients received TAVR.
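The effect of class imbalance can be made concrete with a hypothetical classifier that always answers ‘TAVR’. The Python sketch below applies it to the class distribution of our cohort (56 TAVR, 24 SAVR); the classifier is an illustrative assumption and does not correspond to any of the LLMs tested.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# HT decisions in the cohort and a hypothetical always-TAVR classifier
ht = ["TAVR"] * 56 + ["SAVR"] * 24
always_tavr = ["TAVR"] * 80

accuracy = sum(h == p for h, p in zip(ht, always_tavr)) / len(ht)
kappa = cohen_kappa(ht, always_tavr)
fbi = always_tavr.count("TAVR") / ht.count("TAVR")

print(round(accuracy, 2), round(kappa, 2), round(fbi, 2))  # 0.7 0.0 1.43
```

Such a classifier reaches an accuracy of 70 % while its kappa is zero and its FBI is about 1.43, which is why accuracy alone is not informative in this cohort.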
LLMs require extensive data pre-processing to make sound therapeutic decisions
Performance improved substantially when physician-made case summaries were used as input, and when guideline knowledge was added to the prompts. GPT-4 and GPT-4 Turbo stood out as the most capable LLMs in our experiments. When given case summaries and a CPG resume, these two models showed substantial agreement with HT treatment decisions and drew level with the reference model in terms of accuracy, interrater agreement, decidability, and bias.
When GPT-4 Turbo was instructed to first generate a case summary before making a treatment decision, its performance was found to be intermediate between the results observed with original patient data and physician-generated case summaries.
Data representation affects LLM performance
An explanation for the underperformance of LLMs in the RAW experiment is not immediately apparent due to their opaque nature and the lack of established tools that allow the direct examination of input-output correlations. However, the underperformance cannot be attributed to missing or incorrectly applied guideline knowledge, since the performance in RAW+ was similar to that in RAW and since LLMs can presumably apply clinical knowledge to clinical cases, as shown by their ability to pass medical board exams.1,19 This, together with the significant improvement in almost all performance measures observed when providing case summaries instead of original medical reports, suggests that the representation of input data is the most critical factor in LLM performance.
This finding is consistent with the fact that virtually all studies in which LLMs have been shown to make sound treatment decisions used pre-processed clinical data as input.4-8 Of particular note is the study by Salihu et al.8 In this study, data from patients with severe AS were provided to GPT-4 to obtain a treatment decision for either TAVR, SAVR or conservative management. Patient data were provided in the form of a standardized multiple-choice questionnaire with 14 key clinical variables as input, similar to our SUM experiments. GPT-4 treatment decisions were in substantial agreement with HT treatment decisions, a finding that we were able to reproduce in our experiments.
Similarly, in studies on tasks beyond therapeutic decision-making, such as answering board exam questions1 and diagnosing complex clinical cases2,20, LLMs performed particularly well when the input data were concise and information-dense.
Basic research has indicated that LLMs struggle with lengthy texts21, with texts spanning multiple prompts, which can lead to memory loss22, and with texts that have a low signal-to-noise ratio.23 A study by Levy et al.24 demonstrated that LLM reasoning performance declined notably with increasing input length. Specifically, the authors observed a 26 % drop in LLM performance when the input length was artificially increased from 250 to 3,000 tokens, i.e. a range of input lengths comparable to our study (see Supplementary material online, Table S3).
Recently, Hager et al.25 investigated the ability of LLMs to correctly diagnose patients presenting to the emergency department with abdominal pain. In this study, it was shown that deliberately withholding relevant clinical information from the LLMs paradoxically improved their diagnostic accuracy. Overall, this implies that LLMs are sensitive to both the signal-to-noise ratio and the sheer quantity of information provided.
LLMs may not be ideally suited for clinical decision-making
The results obtained with pre-processed patient data in our study and in previous studies demonstrate the potential of LLMs in medicine. However, the use of curated and pre-processed data does not reflect the clinical situation: To this day, the communication of clinical data within hospitals is largely based on unstructured free text.
Healthcare professionals have high expectations of AI to reduce their workload. This is not the case when physicians must manually extract and prepare key patient data for LLMs, as data extraction, not the actual decision-making task, is usually the most labor-intensive step. Interestingly, the observation that the performance with physician-generated summaries (i.e., SUM+) was substantially better than with summaries created by GPT-4 Turbo (i.e., SUMLLM+) suggests that current LLMs are not yet capable of pre-processing clinical data at the level of physicians. Once key patient data have been extracted and made available as input to a decision support model, the question arises as to why, of all machine learning (ML) models, LLMs should be used for therapeutic decision-making.
Selecting the best treatment modality for a patient is essentially a classification task. Some traditional ML models, such as tree-based models, are specifically developed for this purpose, in contrast to LLMs, which are designed to generate text. Very simple reference models performed similarly to GPT-4 in our study and in the study by Salihu et al.8, suggesting that more sophisticated non-LLM models might generally be better decision-makers than LLMs if trained accordingly. In addition, non-generative models do not exhibit undesirable behaviors such as hallucinations and unreliability26,27, and provide explainability and established measures of uncertainty quantification, two hallmarks of reasonable AI28 that are currently not adequately implemented for LLMs29,30.
Another hallmark of reasonable AI is to address algorithmic bias. It is conceivable that the bias we observed in virtually all LLMs in our study could be due to LLMs being exposed to an abundance of TAVR-related internet literature during training compared to SAVR, subsequently influencing their treatment decisions.
Overall, we propose using LLMs to extract clinical data31 and generate input for downstream non-LLMs, which then perform the decision-making. While this strategy should ideally exploit the strengths of LLMs and well-established ML classifiers, its effectiveness remains to be proven in future studies.
Limitations
Our study is subject to certain limitations. These include a small patient cohort from a single center and the retrospective nature of our investigation. Nevertheless, the size of our study cohort (n=80) was comparable to that of previous key publications2,32 studying the performance of LLMs in medicine, and we assume that our patient cohort was sufficiently large given the clear trends we observed.
The intentionally vague instruction to make decisions according to “the guidelines” left it unclear which specific guidelines had to be followed. For instance, the 2021 ESC guidelines differ from the 2020 ACC/AHA guidelines33. This ambiguity could contribute to the poor performance in the RAW experiment, as the 2021 ESC guidelines were used as benchmark. However, in the RAW+ experiments, the content of the 2021 ESC guidelines was included in the prompt, yet the performance did not improve substantially. Therefore, it must be assumed that this ambiguous instruction was not the driving factor for the poor performance in the RAW experiment.
The HT decisions against which we compared the LLMs’ treatment decisions could themselves be non-objective and deviate from the CPGs. We manually reviewed the HT treatment decisions and found no substantial deviations from the CPGs. Since human decision-makers (i.e. physicians) ultimately make the treatment decisions, the ground truth in experiments such as ours is inherently susceptible to some degree of subjectivity.
Given the limited cohort size and the considerable length of the medical reports, few-shot prompting or fine-tuning was not a viable option. We did not employ more sophisticated prompting techniques, such as Chain-of-Thought34, and confined hyperparameter tuning to the temperature parameter.
Conclusions
Our experiments have been among the most challenging tasks LLMs have been asked to perform in the medical sciences. Overall, we conclude that LLMs are currently not suitable as decision makers for the treatment of patients with severe AS, as our results suggest that a) LLMs require elaborate pre-processing of patient data to make informed treatment decisions, and b) LLMs are currently not able to pre-process original patient data on par with medical experts. Thus, we do not share the medical community’s concern that staff will be replaced by artificial intelligence35 in clinical decision-making in the near future.
Our findings suggest that LLMs should be used cautiously, particularly by medical laypersons seeking medical advice, such as second opinions. Users without extensive domain knowledge may receive treatment recommendations at a level similar to our RAW experiments. This is because medical laypersons may not be able to support prompts with guideline knowledge or create case summaries of sufficient quality but will only be able to use original medical reports. The study by Hager et al.25 suggests that LLMs perform poorly when collecting additional patient data sequentially, as physicians would during a patient-physician dialogue. This suggests that the alternative to our approach (not providing all clinical data to the LLM at once, but having medical laypersons provide essential information incrementally during a chat session) is also likely to lead to suboptimal therapeutic recommendations.
Finally, medical laypersons may not be able to recognize hallucinations as effectively as medical professionals. This, combined with the eloquent and persuasive linguistic style of most LLMs, has the potential to mislead users by creating an illusion of greater certainty than warranted, aggravating the hazardous effects of incorrect treatment recommendations.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Funding
This work was supported by the German Centre for Cardiovascular Research (DZHK), funded by the German Federal Ministry of Education and Research, and the Charité – Universitätsmedizin Berlin. D.H. received two grants from the DZHK (Grant Number: 81X3100214 and Grant Number: 81X3100220).
T.R. and D.H. are participants in the BIH Charité Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin, and the Berlin Institute of Health at Charité (BIH).
Conflicts of Interests
Djawid Hashemi reports financial engagements beyond the scope of the presented work. These activities include consultation services and speaking engagements for companies including AstraZeneca, Bayer Vital, Boehringer Ingelheim, Coliquio, and Novartis.
Tobias D. Trippel reports on the potential conflict of interest by holding shares of Microsoft, Amazon and Palantir Technologies.
Axel Unbehaun serves as physician proctor to Boston Scientific, Edwards Lifesciences, and Medtronic.
Jörg Kempfert reports personal fees from Edwards, personal fees from LSI, outside the submitted work.
Benjamin O’Brien declares Research funding from the British Heart Foundation and the National Institute for Health Science Research and relevant financial activities outside the submitted work with following commercial entities: Teleflex, Abiomed in relation to consultancy fees.
Felix Balzer reports funding from Medtronic and grants from the German Federal Ministry of Education and Research, grants from the German Federal Ministry of Health, grants from the Berlin Institute of Health, personal fees from Elsevier Publishing, grants from Hans Böckler Foundation, other from Robert Koch Institute, grants from Einstein Foundation, and grants from Berlin University Alliance outside the submitted work.
Volkmar Falk declares relevant financial activities outside the submitted work with following commercial entities: Medtronic GmbH; Biotronik SE & Co.; Abbott GmbH & Co. KG; Boston Scientific; Edwards Lifesciences; Berlin Heart; Novartis Pharma GmbH; JOTEC GmbH; Zurich Heart. In relation to: Educational Grants (including travel support); Fees for lectures and speeches; Fees for professional consultation; Research and study funds.
Alexander Meyer declares the receipt of consulting and lecturing fees from Medtronic, lecturing fees from Bayer, and consulting fees from Pfizer. Alexander Meyer is founder and managing director of x-cardiac GmbH.
The remaining authors have no conflicts of interest to disclose.
Data sharing statement
The (anonymized) data underlying this article will be shared on reasonable request to the corresponding author.
Author Contributions
Conception and design of the study and literature review: TR, MH, DH, AM. Data collection: DH, FR. Analysis and interpretation of the data: MH, TR, AM, NH. Drafting of the manuscript: TR, MH, DH, AM, NH. All authors: revising and editing the manuscript.
Acknowledgements
Dr. Roeschl and Dr. Hashemi are participants in the BIH Charité Digital Clinician Scientist Program funded by the Charité – Universitätsmedizin Berlin, and the Berlin Institute of Health at Charité (BIH).
We thank Michael Gudo (MORPHISTO GmbH) for providing access to GPT-4 and Hadi El Ali (B.Sc.), University of Bayreuth, for contributing to the illustration of Figure 1.