Abstract
Importance Large Language Models (LLMs) can assist in a wide range of healthcare-related activities. Current approaches to evaluating LLMs make it difficult to identify the most impactful LLM application areas.
Objective To summarize the current evaluation of LLMs in healthcare in terms of 5 components: evaluation data type, healthcare task, Natural Language Processing (NLP)/Natural Language Understanding (NLU) task, dimension of evaluation, and medical specialty.
Data Sources A systematic search of PubMed and Web of Science was performed for studies published between 01-01-2022 and 02-19-2024.
Study Selection Studies evaluating one or more LLMs in healthcare.
Data Extraction and Synthesis Three independent reviewers categorized 519 studies in terms of data used in the evaluation, the healthcare tasks (the what) and the NLP/NLU tasks (the how) examined, the dimension(s) of evaluation, and the medical specialty studied.
Results Only 5% of reviewed studies utilized real patient care data for LLM evaluation. The most popular healthcare tasks were assessing medical knowledge (e.g. answering medical licensing exam questions, 44.5%), followed by making diagnoses (19.5%), and educating patients (17.7%). Administrative tasks such as assigning provider billing codes (0.2%), writing prescriptions (0.2%), generating clinical referrals (0.6%) and clinical notetaking (0.8%) were less studied. For NLP/NLU tasks, the vast majority of studies examined question answering (84.2%). Other tasks such as summarization (8.9%), conversational dialogue (3.3%), and translation (3.1%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias and toxicity (15.8%), robustness (14.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in internal medicine (42%), surgery (11.4%) and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%) and medical genetics (0.2%) being the least represented.
Conclusions and Relevance Existing evaluations of LLMs mostly focused on accuracy of question answering for medical exams, without consideration of real patient care data. Dimensions like fairness, bias and toxicity, robustness, and deployment considerations received limited attention. To draw meaningful conclusions and improve LLM adoption, future studies need to establish a standardized set of LLM applications and evaluation dimensions, perform evaluations using data from routine care, and broaden testing to include administrative tasks as well as multiple medical specialties.
0. Key Points
Question: How are healthcare applications of large language models (LLMs) currently evaluated?
Findings: Studies rarely used real patient care data for LLM evaluation. Administrative tasks such as generating provider billing codes and writing prescriptions were understudied. Natural Language Processing (NLP)/Natural Language Understanding (NLU) tasks like summarization, conversational dialogue, and translation were infrequently explored. Accuracy was the predominant dimension of evaluation, while fairness, bias and toxicity assessments were neglected. Evaluations in specialized fields, such as nuclear medicine and medical genetics were rare.
Meaning: Current LLM assessments in healthcare remain shallow and fragmented. To draw concrete insights on their performance, evaluations need to use real patient care data across a broad range of healthcare and NLP/NLU tasks and medical specialties with standardized dimensions of evaluation.
2. Introduction
The adoption of Artificial Intelligence (AI) in healthcare is rising, catalyzed by the emergence of Large Language Models (LLMs) like OpenAI’s ChatGPT 1 2 3 4. Unlike predictive AI, generative AI produces original content such as sound, image, and text5. Within the realm of generative AI, LLMs produce structured, coherent prose in response to text inputs, with broad application in health system operations 6. Prominent applications such as facilitating clinical note-taking have already been implemented by several health systems in the U.S., and there is excitement in the medical community for improving healthcare efficiency, quality, and patient outcomes 7 8. A recent report estimates that LLMs could unlock a substantial portion of the $1 trillion in untapped healthcare efficiency improvements, including an estimated savings ranging from 5 to 10 percent of US healthcare spending or approximately $200 billion to $360 billion annually based on 2019 figures 9 10.
New and revolutionary technologies are often met with excitement about their many potential uses, leading to widespread and often unfocussed experimentation across different healthcare applications. Thus, as expected, the performance of LLMs in real-world healthcare settings too, remains inconsistently conducted and evaluated 11 12. For instance, Cadamuro et al. assessed ChatGPT-4’s diagnostic ability by evaluating relevance, correctness, helpfulness, and safety, finding responses to be generally superficial and sometimes inaccurate, lacking in helpfulness and safety 13. In contrast, Pagano et al. also assessed diagnostic ability, but focused solely on correctness, concluding that ChatGPT-4 exhibited a high level of accuracy comparable to clinician responses 14.Thus, we hypothesize that the current evaluation landscape lacks the uniformity, thoroughness, and robustness necessary to effectively guide the deployment of LLMs in a real-world setting.
Our research aimed to assess the evaluation landscape of LLMs’ healthcare applications with the goal of accelerating their time to impact and successful integration. Through a systematic review of 519 studies, we comprehensively categorized how LLMs have been evaluated in healthcare along 5 axes: evaluation data type used, healthcare task, NLP/NLU task, dimension of evaluation, and medical specialty. To enable the categorization of the diverse range of applications and their evaluation setups, we used two categorization frameworks: the first describes healthcare applications of LLMs in terms of their constituent healthcare and NLP/NLU tasks, and the second describes dimensions of evaluation and associated metrics. These frameworks were then applied systematically to characterize the current state of evaluations to quantify the variability in LLM application evaluations and identify areas for further exploration.
Among the studies reviewed, there was no existing categorization framework that was consistently used. This lack of standardization required the creation of categorization framework, where we define each category as well as show illustrative examples of each. While such categorization can be a limitation, given that it provides a way to consistently discuss the testing and evaluations of LLM applications, it may have use beyond this review.
Our results show that evaluations of LLM applications in healthcare have been unevenly distributed both in terms of dimensions of evaluation used and in terms of medical specialty and application.
3. Methods
3.1 Design
A systematic review was conducted following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines as shown in Figure 1 15.
3.2 Information sources
Peer-reviewed studies and preprints from January 1 2022, to February 19, 2024, were retrieved from PubMed and Web of Science databases, using specific keywords as detailed in Supplement 1. Our search focused on titles and abstracts to identify studies on evaluation of LLMs’ healthcare applications. This two-year period aimed to capture publications evaluating LLM healthcare applications since the public launch of ChatGPT in November 2022. Given our hypothesis that the current landscape lacks the necessary elements needed to truly assess LLM performance in healthcare, we included a broad spectrum of studies. Citations were imported into EndNote 21 (Clarivate) for analysis.
3.3 Categorization framework
Each study was categorized by evaluation data type, healthcare task, NLP/NLU task, dimension of evaluation, and medical specialty. Healthcare task categories were developed using publicly available healthcare task and competency lists and were refined by consulting board-certified MDs 16 17 as outlined in Table 1. NLP/NLU categories and dimension of evaluations were developed using the Holistic Evaluation of Language Models (HELM) and Hugging Face frameworks 18 19 as shown in Tables 2 and 3. Medical specialties were adapted from Accreditation Council for Graduate Medical Education (ACGME) residency programs. 20
3.4 Eligibility criteria and screening
Screening was conducted by SB, YL, and LOE using the Covidence software (Covidence, 2024) as outlined in Figure 1. Included studies used LLMs for healthcare tasks and evaluated their performance. Excluded articles were those focused on multimodal tasks or basic biological science research with LLMs.
3.5 Data extraction and labeling
We adopted a paired review approach, wherein each study was categorized into evaluation data type, healthcare tasks, NLP/NLU tasks, dimension(s) of evaluation, and medical specialty by at least one human reviewer (SB, YL, or LOE) and GPT-4, based on the title and abstract. Note that GPT-4 was used as a force multiplier while the final categories were assigned by the human reviewers. In instances of disagreements regarding category assignments, the methods sections of the studies were retrieved, and final categories were determined through reviewer consensus. The prompts given to GPT-4 can be found in Supplement 2.
Each study received one or more healthcare tasks, NLP/NLU task, and dimension of evaluation labels as appropriate, hence the percentages sum above 100% in Table 4. In addition, each study could be assigned more than one medical specialty based on the evaluation conducted.
4. Results
749 relevant studies were screened for eligibility. After applying the inclusion and exclusion criteria described in 3.4, 519 studies were included in the analysis using the frameworks developed by the authors.
4.1 Categorization framework for healthcare tasks, NLP/NLU tasks and dimensions of evaluation
We deconstructed each healthcare application of an LLM into its constituent healthcare task (Table 1), i.e. the clinical and non-clinical task it is used for (the “what”), and the NLP/NLU task (Table 2), i.e. the language processing task being performed (the “how”). Examples of a healthcare task are diagnosing a patient’s disease, recommending a treatment for osteoarthritis. Examples of the language-processing job to be accomplished – which is not necessarily specific to the medical domain are summarizing the impression section of a radiology report, answering questions about the symptoms of type 2 diabetes etc.
An example of how healthcare tasks and NLP/NLU combine for a healthcare application of LLM is how Gan et al. evaluated LLM performance for mass-casualty triaging 40. The healthcare task (the “what”) is triaging patients while the NLP/NLU tasks (the “how”) are information extraction (extracting detailed patient information from the triage questionnaire scenarios, including age, symptoms, and vital signs), text classification (classifying the triage questionnaire scenarios into different triage levels), and question answering (generating final decision responses to the triage questionnaire).
We initially compiled a list of healthcare tasks using publicly available resources 16 17. Subsequently, through consultation with three board-certified MDs, we refined the list through iterative discussions to establish the final categories for classification, as outlined in Table 1. To compile a list of common NLP/NLU tasks, we referred to sources such as the Holistic Evaluation of Language Models (HELM) study and the Hugging Face task framework to derive 6 categories: 1) Summarization, 2) Question answering, 3) Information extraction, 4) Text classification (such as clinical notes, research articles, and documents), 5) Translation, and 6) Conversational dialogue (Table 2) 18 19.
We categorized the most common dimensions of evaluation used in the reviewed studies based on the list outlined in Table 3. These dimensions include: 1) Accuracy, 2) Calibration and uncertainty, 3) Robustness, 4) Factuality, 5) Comprehensiveness, 6) Fairness, bias, and toxicity, and 7) Deployment considerations. Fairness, bias, and toxicity were grouped together for ease of analysis, due to their infrequent occurrence in the reviewed studies, and relevance to ethical evaluation of LLMs. Additionally, we compiled common metrics for each dimension (eTable 1) to serve as a starting framework for researchers designing studies to assess LLM performance in healthcare applications.
4.2 Distribution of studies based on evaluation data type
Among the reviewed studies, 5% evaluated and tested LLMs using real patient care data, while the remaining relied on data such as medical examination questions, clinician-designed vignettes or Subject Matter Expert (SME) generated questions.
4.3 Categorizing articles based on healthcare tasks and NLP/NLU tasks
The studies we examined had a predominant focus on evaluating LLMs for their medical knowledge (Table 4), primarily through assessments such as the USMLE. This trend assumes that because we assess medical professionals’ readiness for entering clinical practice through board-style examinations, mirroring this type of evaluation for LLMs is adequate to certify their fitness-for-use. Making diagnoses, educating patients and making treatment recommendations were the other common healthcare tasks studied. While these tasks represent critical aspects of healthcare delivery, validating the utility of LLMs in supporting them requires assessment with real patient care data. The limited examination of administrative tasks like assigning provider billing codes, writing prescriptions, generating clinical referrals, and clinical notetaking suggests a gap in studying LLMs’ use for high-value, immediately impactful administrative tasks. These tasks are often labor intensive, presenting a ripe opportunity for testing LLMs to enhance efficiency in these areas 41.
Among the NLP/NLU tasks, most studies evaluated LLM performance through question answering tasks. These tasks ranged from addressing generic inquiries about symptoms and treatments to tackling board-style questions featuring clinical vignettes. Approximately a quarter of the studies focused on text classification and information extraction tasks. Tasks such as summarization, conversational dialogue, and translation remained underexplored. This gap is significant because condensing patient records into concise summaries, translating medical content into simpler languages or the patient’s native language, and facilitating conversations through chatbots are often touted benefits of using LLMs and could substantially alleviate physician burden.
4.4 Categorizing articles based on the dimensions of evaluation
As seen in Table 4, accuracy and comprehensiveness were overwhelmingly the top two most examined dimensions, whereas factuality, fairness, bias, and toxicity, robustness, deployment considerations, and calibration and uncertainty were infrequently assessed. This suggests a potential gap in assessing the broader capabilities and suitability of LLMs for real-world deployment. While accuracy and comprehensiveness are crucial for ensuring the reliability and effectiveness of LLMs in healthcare tasks, dimensions like fairness, bias, and toxicity are equally vital for addressing ethical concerns and ensuring equitable outcomes. Similarly, robustness and deployment considerations are essential for assessing the sustainability of integrating LLMs into healthcare systems. The limited assessment of calibration and uncertainty raises questions about the extent to which researchers are addressing the need for LLMs to provide uncertainty quantifications, particularly in healthcare scenarios.
4.5 Distribution of studies by medical specialty
We categorized studies according to the Accreditation Council for Graduate Medical Education (ACGME) residency programs, augmented to include additional categories to capture studies investigating applications in dental specialties, treatment of genetic disorders and generic healthcare applications 20. Notably, over a fifth of the studies were categorized as generic, indicating a significant focus on healthcare applications that are relevant to many specialties, rather than a specific specialty. Among the specialties, internal medicine, surgery, and ophthalmology were the top specialties. Nuclear medicine, physical medicine, and medical genetics were the least prevalent specialties in studies, accounting for 12 studies in total. The exact percentage of studies in different specialties are outlined in eTable 2. The distribution of studies across specialties underscores the potential for LLMs to contribute to a wide range of medical specialties, but also signals opportunities for further exploration within less represented areas such as nuclear medicine, physical medicine, and medical genetics.
5. Discussion
Our systematic review of 519 studies summarizes existing evaluations of LLMs across medical specialties, the underlying healthcare task, NLP/NLU task, and dimension of evaluation. Doing such categorization captures the heterogeneity of current LLM applications in healthcare while providing a concrete way to discuss their testing and evaluation in a consistent manner. While our categorization we created may be viewed as a limitation, given the precise definitions and illustrative examples of each category, it may have utility beyond this review.
Overall, we identified six limitations in the current efforts and suggest how to address them in future. Our findings call for an urgent need to develop consensus-driven guidance for evaluating LLMs in medicine, in a manner similar to the creation of the blueprint for trustworthy AI by The Coalition for Health AI for traditional AI models42.
The need for evaluations based on real patient care data
One striking finding is that only 5% of the studies used real patient care data for evaluation, with most studies using a mix of medical exam questions, patient vignettes and subject matter expert generated questions 43 14 44. Shah et al noted that testing LLMs with hypothetical medical questions is like assessing a car’s performance with multiple-choice questions before certifying it for road use 11. Real patient care data encompasses the complexities of clinical practice, providing a more thorough evaluation of LLM performance that will closely mirror real-world performance 14 45 46 6.
Real-world LLM evaluations provide valuable insights that may be overlooked in simulations or synthetic environments. For instance, while LLMs have been touted for potentially saving time and enhancing clinician experience, Garcia et al. found that the mean utilization rate for drafting patient messaging responses in an EHR system was only 20%, resulting in a reduction in burnout score but no time savings 47.
Given the importance of using real patient care data, systems need to be created to ensure their use in evaluating LLMs’ healthcare applications. The Office of the National Coordinator for Health Information Technology (ONC) recently passed HT-1, the first federal regulation to set specific reporting requirements for developers of AI tools through their ‘model report cards’48. ONC and other regulators should look to embed a mandate for the use of patient care data in the evaluation process of LLM tools into its requirements.
The need to standardize the task formulations and dimensions of evaluation
There is a lack of consensus on which dimensions of evaluation to examine for a given healthcare task or NLP/NLU task. For instance, for a medical education task, Ali et al. tested the performance of GPT-4 on a written board examination focusing on output accuracy as the sole dimension 21. Another study tested the performance of ChatGPT on the USMLE, focusing on output accuracy, factuality and comprehensiveness as primary dimensions of evaluation 49.
To address this challenge, we need to establish shared definitions of tasks and corresponding dimensions of evaluation. Similar to how efforts such as Holistic Evaluation of Language Models (HELM) define the dimensions of evaluation of an LLM that matter in general, a framework specific for healthcare is necessary to define the core dimensions of evaluation to be assessed across studies. Doing so enables better comparisons and cumulative learning from which reliable conclusions can be drawn for future technical work and policy guidance.
Prioritize immediately impactful, administrative applications
Current research predominantly focuses on medical knowledge tasks, such as answering medical exam questions (44.5%), or complex healthcare tasks, as well as making diagnoses (19.5%) and making treatment recommendations (9.2%). However, there are many administrative tasks in healthcare that are often labor-intensive, requiring manual input and contributing to physician burnout 41. Particularly, areas such as assigning provider billing codes (1 study), writing prescriptions (1 study), generating clinical referrals (3 studies), and clinical note-taking (4 studies); all of which remain under-researched and could greatly benefit from a systematic evaluation of using LLMs for those tasks 38 39 50 51.
The need to bridge gaps in LLM utilization across clinical specialties
The substantial representation of generic healthcare applications, accounting for over a fifth of the studies, underscores the potential of LLMs in addressing needs applicable to many specialties, such as summarizing medical reports. In contrast, the scarcity of research in particular specialties like nuclear medicine (3 studies), physical medicine (2 studies), and medical genetics (1 study) suggests an untapped potential for using LLMs in these complex medical domains that often present intricate diagnostic challenges and demand personalized treatment approaches52 53 54 55. The lack of LLM-focused studies in these areas may indicate the need for increased awareness, collaboration, or specialized adaptation of such models to suit the unique demands of these specialties.
The need for a realistic accounting of financial impact
Generative AI is projected to create $200 billion to $360 billion in healthcare cost savings through productivity improvements 10. However, the implementation of these tools could pose a significant financial burden to health systems. In a recent review by Sahni and Carrus, defining the cost and benefit of deploying AI was highlighted as one of the greatest challenges 56. It is key for health systems to capture this, to accurately estimate and budget for increased implementation and computing costs 57.
Within this review, only one study conducted a financial impact or cost-effectiveness analysis. Rau et al. investigated the use of ChatGPT to develop personalized imaging, demonstrating “an average decision time of 5 minutes and a cost of €0.19 for all cases, compared to 50 minutes and €29.99 for radiologists” 58. However, this analysis was a parallel implementation of the LLM solution compared with the traditional radiologist approach, thus not providing a realistic assessment of the added value of LLM integration into existing clinical workflows and its corresponding financial impact.
While the dearth of real-world testing is understandable given the infancy of LLM applications in healthcare, it is imperative to establish realistic assessments of these tools before reallocating resources from other healthcare initiatives. Notably, such assessments should estimate the total cost of implementation, which includes not only the cost to run the model but also expenses associated with monitoring, maintenance, and any necessary infrastructure adjustments.
The need to better define and quantify bias
Recent studies have highlighted a concerning trend of LLMs perpetuating race-based medicine in their responses 59. This phenomenon can be attributed to the tendency of LLMs to reproduce information from their training data, which may contain human biases 60.To improve our methods for evaluating and quantifying bias, we need to first collectively establish what it means to be unbiased.
While efforts to assess racial and ethical biases exist, only 15.8% of studies have conducted any evaluation that delves into how factors such as race, gender, or age impact bias in the model’s output 61 62 63. Future research should place greater emphasis on such evaluations, particularly as policymakers develop best practices and guidance for model assurance.
Mandating these evaluations as part of a “model report card” could be a proactive step towards mitigating harmful biases perpetuated by LLMs 64.
The need to publicly report failure modes
The analysis of failure modes has long been regarded as fundamental in engineering and quality management, facilitating the identification, examination, and subsequent mitigation of failures65. The FDA has databases for adverse event reporting in pharmaceuticals and medical devices, but there is currently no analogous place for reporting failure modes for AI systems, let alone LLMs, in healthcare 66 67.
In the ‘Conclusion’ sections of many studies, only a select few researched why the deployment of the LLM did not produce satisfactory results (e.g. ineffective prompt engineering) 68. A deeper examination of failure modes and why the exercise was deemed unsuccessful or inaccurate (e.g. the reference data was factually incorrect or outdated), is necessary to further improve the use of LLMs in healthcare settings.
6. Conclusion
The evaluation of LLMs lacks standardized task definitions and dimensions of evaluation. This systematic review underscores the need for evaluating LLMs using real patient care data, particularly on administrative healthcare tasks like generating provider billing codes, writing prescriptions, and clinical note-taking. It highlights the need to expand testing criteria beyond accuracy to include fairness, bias, toxicity, robustness, and deployment considerations across different medical specialties. Establishing shared task definitions and rigorous testing and evaluation standards are crucial for the safe integration of LLMs in healthcare. Realistic financial accounting and robust reporting of failures are essential to accurately assess their value and safety in clinical settings. Broadly, there is an urgent need to develop a nationwide consensus and guidance for evaluating LLMs in healthcare, so that we may realize the tremendous promise these groundbreaking technologies have to offer.
Author contributions
SB, YL, LOE, NHS conceived of the study, defined the main outcomes and measures. SB, LOE, YL searched the literature to identify the publications to review and categorized the publications. SB, LOE, YL and NHS drafted the manuscript. SB designed the GPT-4 based screening strategy with input from DD. SB developed the NLP/NLU task and dimensions of evaluation framework. LOE developed the healthcare task framework. DD guided the creation and categorization of healthcare tasks. AC and AS guided the creation of NLP/NLU task categorization. JAF guided the creation of the dimensions of evaluation categorization, SK helped select HELM dimensions to reuse. MK refined the medical specialty categorization, and MW critiqued the review methodology and figure organization. LL and HH assessed the usefulness of the frameworks for other analyses. NRS guided LOE and YL on all aspects of performing systematic reviews. AM reviewed and edited the manuscript for framing the discussion. KS and TT assessed the relevance of the results for developing consensus LLM testing and evaluation guidance for CHAI. MAP critiqued the deployment concerns in health systems and reviewed the categories. All authors reviewed, edited and approved of the final manuscript.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Supplement 1. Search terms for PubMed as of 02/19/2024
((“Large Language Model” [Title/Abstract] OR “ChatGPT” [Title/Abstract] OR “Generative AI” [Title/Abstract]) AND (“Health” [Title/Abstract] OR “Medical” [Title/Abstract] OR “Clinical” [Title/Abstract] OR “Medicine” [Title/Abstract]) AND (“Test” [Title/Abstract] OR “Evaluate” [Title/Abstract] OR “Performance” [Title/Abstract] OR “Assess” [Title/Abstract]))
Search terms for Web of science as of 02/19/2024
(TS=(“Large Language Model” OR “ChatGPT” OR “Generative AI”)
AND
TS=(“Health” OR “Medical” OR “Clinical” OR “Medicine”)
AND
TS=(“Test” OR “Evaluate” OR “Performance” OR “Assess”))
Supplement 2. Prompts used to extract and assign categories for human review
Prompt 1
“You are assisting in a systematic review of large language models in healthcare. Summarize the {entity_type} mentioned in this research abstract in 25 words”
Prompt 2
“Using the generated summaries, identify and categorize the following text based on {entity_type}:\n\n{text}\n\nCategories:”
Where entity_type can be NLP task, medical specialty or metric and categories is the list of possible values for each entity_type to make categorization into, for the NLP task, metric and medical specialty.
Acknowledgements
We thank Nicholas Chedid for extensive guidance in the development of the healthcare task categorization.
Footnotes
We made minor edits to the content of the document, based off peer review.