Systematic Review of Large Language Models for Patient Care: Current Applications and Challenges

Felix Busch; Lena Hoffmann; Christopher Rueger; Elon HC van Dijk; Rawen Kader; Esteban Ortiz-Prado; Marcus R Makowski; Luca Saba; Martin Hadamitzky; Jakob Nikolas Kather; Daniel Truhn; Renato Cuocolo; Lisa C Adams; Keno K Bressem

doi:10.1101/2024.03.04.24303733

Abstract

The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care using a data-driven convergent synthesis approach. We searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4,349 initial records, 89 studies across 29 medical specialties were included, primarily examining models based on the GPT-3.5 (53.2%, n=66 of 124 different LLMs examined per study) and GPT-4 (26.6%, n=33/124) architectures in medical question answering, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations included 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations included 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. In conclusion, this study is the first review to systematically map LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.

1. Introduction

Public and academic interest in large language models (LLMs) and their potential applications has increased substantially, especially since the release of OpenAI’s ChatGPT (Chat Generative Pre-trained Transformer) in November 2022.^1–3 One of the main reasons for their popularity is their remarkable ability to mimic human writing, a result of extensive training on massive amounts of text and reinforcement learning from human feedback.⁴

Since most LLMs are designed as general-purpose chatbots, recent research has focused on developing specialized models for the medical domain, such as Meditron or BioMistral, by enriching the training data of LLMs with medical knowledge.^5,6 However, this approach to fine-tuning LLMs requires significant computational resources that are not available to everyone and is also not applicable to closed-source LLMs, which are often the most powerful. Therefore, another approach to improve LLMs for biomedicine is to use techniques such as Retrieval-Augmented Generation (RAG).⁷ RAG allows information to be dynamically retrieved from medical databases during the model generation process, enriching the output with medical knowledge without the need to train the model.
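To make the retrieval step concrete, the following is a minimal sketch of the RAG idea under strong simplifying assumptions: the knowledge base is a hypothetical in-memory list of three illustrative sentences, retrieval is plain word overlap rather than the embedding-based search typically used in practice, and no model is called. The function only assembles the augmented prompt. All names (`KNOWLEDGE_BASE`, `retrieve`, `build_prompt`) are illustrative and not taken from any cited system.

```python
import re

# Hypothetical in-memory knowledge base; a real system would query a curated medical database.
KNOWLEDGE_BASE = [
    "Metformin is an oral medication commonly used to treat type 2 diabetes.",
    "Magnetic resonance imaging does not use ionizing radiation.",
    "Hypertension means persistently elevated blood pressure.",
]


def tokenize(text):
    """Lowercase a text and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most word tokens with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Prepend retrieved context to the question; the model itself is left unchanged."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."


if __name__ == "__main__":
    print(build_prompt("Does magnetic resonance imaging use radiation?", KNOWLEDGE_BASE))
```

The design point is that the medical knowledge enters through the prompt at inference time, so the same approach applies to closed-source models whose weights cannot be fine-tuned.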

LLMs hold great promise for improving the efficiency and accuracy of healthcare delivery, e.g., by extracting clinical information from electronic health records, summarizing, structuring, or explaining medical texts, streamlining administrative tasks in clinical practice, and enhancing medical research, quality control, and education.^8–10 In addition, LLMs have been shown to be versatile tools for supporting diagnosis or serving as prognostic models.^11,12

In contrast to applications primarily aimed at healthcare professionals, LLMs could also be used to promote patient education and empowerment by providing answers to medical questions and translating complex medical information into more accessible language.^4,13 Thereby, LLMs may promote personalized medicine and broaden access to medical knowledge, empowering patients to actively participate in their healthcare decisions.

However, despite the growing body of research and the clear potential of LLMs, systematized information on their use in patient care is lacking. To date, there has been no evaluation of existing research to understand the scope of applications and identify limitations that may currently hinder the successful integration of LLMs into clinical practice.

Therefore, this systematic review aims to analyze and synthesize the literature on LLMs in patient care, providing a systematic overview of 1) current applications and 2) challenges and limitations, with the purpose of establishing a foundational framework and taxonomy for the implementation and evaluation of LLMs in healthcare settings.

2. Methods

This systematic review was pre-registered in the International Prospective Register of Systematic Reviews (PROSPERO) under the identifier CRD42024504542 before the start of the initial screening and was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.^14,15

2.1 Eligibility criteria

We searched 5 databases, namely Web of Science, PubMed, Embase/Embase Classic, Association for Computing Machinery (ACM) Digital Library, and Institute of Electrical and Electronics Engineers (IEEE) Xplore, as of January 25, 2024, to identify qualitative, quantitative, and mixed methods studies published between January 1, 2022, and December 31, 2023, that examined the use of LLMs for patient care. LLMs for patient care were defined as any artificial neural network that follows a transformer architecture and can be used to generate and translate text and other content or perform other natural language processing tasks for the purpose of disease management and support (i.e., prevention, preclinical management, diagnosis, treatment, or prognosis) that could be directed to or used by patients. Articles had to be available in English and contain sufficient data for thematic synthesis (e.g., conference abstracts that did not provide sufficient information on study results were excluded). Given the recent surge in publications on LLMs such as ChatGPT, we allowed for the inclusion of preprints if no corresponding peer-reviewed article was available. Duplicate reports of the same study, non-human studies, and articles limited to technology development/performance evaluation, pharmacy, human genetics, epidemiology, psychology, psychosocial support, or behavioral assessment were excluded.

2.2 Screening and data extraction

Initially, we conducted a preliminary search on PubMed and Google Scholar to define relevant search terms. The final search strategy included terms for LLMs, generative AI, and their applications in medicine, health services, clinical practices, medical treatments, and patient care (as detailed by database in Supplementary Section 1). After importing the bibliographic data into Rayyan and removing duplicates, LH and CR conducted an independent blind review of each article’s title and abstract.¹⁶ Any article flagged as potentially eligible by either reviewer proceeded to the full-text evaluation stage. For this stage, LH and CR used a custom data extraction form created in Google Forms (available online)¹⁷ to collect all relevant data independently from the studies that met the inclusion criteria. Quality assessment was also performed independently for each article within this data extraction form, using the Mixed Methods Appraisal Tool (MMAT) 2018.¹⁸ Disagreements at any stage of the review were resolved through discussion with the author FB. For studies with incomplete data, we attempted to contact the corresponding authors for clarification or additional information.

2.3 Data analysis

Because we sought to include a diverse range of outcomes and study designs (qualitative, quantitative, and mixed methods), a meta-analysis was not practical. Instead, a data-driven convergent synthesis approach was selected for the thematic synthesis of LLM applications and limitations in patient care.¹⁹ Following Thomas and Harden, FB coded each study’s numerical and textual data in Dedoose using free line-by-line coding.^20,21 Initial codes were then systematically categorized into descriptive and subsequently into analytic themes, incorporating new codes for emerging concepts within a hierarchical tree structure. Upon completion of the codebook, FB and LH reviewed each study to ensure consistent application of codes. Discrepancies were resolved through discussion with the author KKB, and the final codebook and analytical themes were discussed and refined in consultation with all contributing authors.
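The hierarchical tree structure of the codebook can be pictured as a nested mapping from concepts to the studies in which each code was present. The sketch below is illustrative only: the code names are borrowed from Section 3.4, the study identifiers are hypothetical placeholders, and the actual coding was performed in Dedoose rather than in code.

```python
# Hypothetical miniature codebook: first-order concept -> second-order code
# -> third-order code -> set of study identifiers (placeholders, not real data).
codebook = {
    "design": {
        "not optimized for medical use": {
            "limitations in clinical reasoning": {"study_a", "study_b"},
            "lack of clinical context": {"study_b", "study_c"},
        },
    },
    "output": {
        "non-reproducibility": {
            "non-deterministic output": {"study_a"},
            "unreliable references": {"study_c", "study_d"},
        },
    },
}


def studies_under(node):
    """Collect every study coded at or below a node of the tree."""
    if isinstance(node, set):
        return set(node)
    found = set()
    for child in node.values():
        found |= studies_under(child)
    return found


def presence(node, total_studies):
    """Return (count, percentage) of studies in which a concept is present."""
    n = len(studies_under(node))
    return n, round(100 * n / total_studies, 1)
```

Because the study sets are merged with a union, a study counts once per parent concept even when it carries several of that concept's child codes.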

3. Results

3.1 Screening results

Of the 4,349 reports identified, 2,991 underwent initial screening, and 126 were deemed suitable for potential inclusion and underwent full-text screening. Two articles could not be retrieved because the authors or the corresponding title and abstract could not be identified online. Following full-text screening, 35 articles were excluded, and 89 articles were included in the final review. Most studies were excluded because they targeted the wrong discipline (n=10/35, 28.6%) or population (n=7/35, 20%) or were not original research (n=8/35, 22.9%) (see Supplementary Section 2). For example, we evaluated a study that focused on classifying physician notes to identify patients without active bleeding who were appropriate candidates for thromboembolism prophylaxis.²² Although the classification tasks may lead to patient treatment, the primary outcome was informing clinicians rather than directly forwarding this information to patients. We also reviewed a study assessing the accuracy and completeness of several LLMs when answering Methotrexate-related questions.²³ This study was excluded because it focused solely on the pharmacological treatment of rheumatic disease. For a detailed breakdown of the inclusion and exclusion process at each stage, please refer to the PRISMA flowchart in Figure 1.

Figure 1. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram.
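The screening counts reported in Section 3.1 can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in that section; the variable names are illustrative.

```python
full_text_candidates = 126   # deemed suitable for full-text screening
not_retrieved = 2            # could not be identified online
excluded_full_text = 35      # excluded after full-text screening

# Studies remaining after retrieval failures and full-text exclusions.
included = full_text_candidates - not_retrieved - excluded_full_text

# Exclusion reasons named explicitly in the text (the remainder is in Supplementary Section 2).
reported_reasons = {"wrong discipline": 10, "wrong population": 7, "not original research": 8}
shares = {reason: round(100 * n / excluded_full_text, 1) for reason, n in reported_reasons.items()}

print(included)  # 89
print(shares)    # {'wrong discipline': 28.6, 'wrong population': 20.0, 'not original research': 22.9}
```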

3.2 Characteristics of included studies

Table 1 summarizes the characteristics of the analyzed studies, including their setting, results, and conclusions. One study (n=1/89, 1.1%) was published in 2022²⁴, 84 (n=84/89, 94.4%) in 2023^13,25–107, and 4 (n=4/89, 4.5%) in 2024^108–111 (all of which were peer-reviewed publications of preprints published in 2023). Most studies were quantitative non-randomized (n=84/89, 94.4%)^{13,25–27,29–101,103,104,106,107,109–111}, 4 (n=4/89, 4.5%)^{28,102,105,108} had a qualitative study design, and one (n=1/89, 1.1%)²⁴ was quantitative randomized according to the MMAT 2018 criteria. However, the LLM outputs were often first analyzed quantitatively and then subjected to a qualitative analysis of certain responses. Therefore, if the primary outcome was quantitative, we considered the study design to be quantitative rather than mixed methods, resulting in the inclusion of zero mixed methods studies. The quality of the included studies was mixed (see Table 2). The authors were primarily affiliated with institutions in the United States (n=47 of 122 different countries identified per publication, 38.5%), followed by Germany (n=11/122, 9%), Turkey (n=7/122, 5.7%), the United Kingdom (n=6/122, 4.9%), China/Australia/Italy (n=5/122, 4.1%, respectively), and 24 (n=36/122, 29.5%) other countries.
Most studies examined one or more applications based on the GPT-3.5 architecture (n=66 of 124 different LLMs examined per study, 53.2%)^{13,26–29,31–34,36–40,42–49,52–54,56–61,63,65–67,71,72,74,75,77,78,81–89,91,92,94,95,97–100,102–104,106–109,111}, followed by GPT-4 (n=33/124, 26.6%)^{13,25,27,29,30,34–36,41,43,50,51,54,55,58,61,64,68–70,74,76,79–81,83,87,89,90,93,96,98,99,101,105}, Bard (n=10/124, 8.1%; now known as Gemini)^{33,48,49,55,73,74,80,87,94,99}, Bing Chat (n=7/124, 5.7%; now Microsoft Copilot)^{49,51,55,73,94,99,110}, and other applications based on Bidirectional Encoder Representations from Transformers (BERT; n=4/124, 3.2%)^13,83,84, Large Language Model Meta-AI (LLaMA; n=3/124, 2.4%)⁵⁵, or Claude by Anthropic (n=1/124, 0.8%)⁵⁵. The majority of applications were primarily targeted at patients (n=64 of 89 included studies, 71.9%)^{24,25,29,32,34–39,41–43,45–48,52–54,56–60,62,63,65,66,68–71,73–75,77–80,85–95,97,99,100,102–111} or both patients and caregivers (n=25/89, 28.1%)^{13,26–28,30,31,33,40,44,49–51,55,61,64,67,72,76,81–84,96,98,101}. Information about conflicts of interest and funding was not explicitly stated in 23 (n=23/89, 25.8%) studies, while 48 (n=48/89, 53.9%) reported that there were no conflicts of interest or funding. A total of 18 (n=18/89, 20.2%) studies reported the presence of conflicts of interest and funding.^{13,24,38,40,54,58,59,67,69–71,74,80,84,96,103,105,111} Most studies did not report information about the institutional review board (IRB) approval (n=55/89, 61.8%) or deemed IRB approval unnecessary (n=28/89, 31.5%). Six studies obtained IRB approval (n=6/89, 6.7%).^{52,82,84–86,92}

Table 1. Overview of included studies and corresponding authors, year of publication, affiliation countries of authors, study design, medical specialty, purpose of study, large language model (LLM)/tool examined, target user, evaluation/setting, main outcome, and conclusion.

Table 2. Evaluation of included studies according to the Mixed Methods Appraisal Tool (MMAT) 2018.¹⁸

3.3 Applications of Large Language Models

An overview of the presence of codes for each study is provided in Supplementary Section 3. The majority of articles investigated the use and feasibility of LLMs as medical chatbots (n=84/89, 94.4%)^{13,24–62,64–66,68,69,71–96,98–111}, while fewer reports additionally or exclusively focused on the generation of patient information (n=19/89, 21.4%)^{24,31,43,48,49,57,59,62,67,70,79,88–91,97,102,106,107}, including clinical documentation such as informed consent forms (n=5/89, 5.6%)^{43,67,91,97,102} and discharge instructions (n=1/89, 1.1%)³¹, or translation/summarization tasks of medical texts (n=5/89, 5.6%)^{24,49,57,79,89}, creation of patient education materials (n=5/89, 5.6%)^{48,62,90,106,107}, and simplification of radiology reports (n=2/89, 2.3%)^59,88. Most reports evaluated LLMs in English (n=88/89, 98.9%)^{13,24–103,105–111}, followed by Arabic (n=2/89, 2.3%)^32,104, Mandarin (n=2/89, 2.3%)^36,75, and Korean or Spanish (n=1/89, 1.1%, respectively)⁷⁵. The top-five specialties studied were ophthalmology (n=10/89, 11.2%)^{37,40,48,51,65,74,97,98,100,101}, gastroenterology (n=9/89, 10.1%)^{25,32,34,36,39,61,62,72,96}, head and neck surgery/otolaryngology (n=8/89, 9%)^{35,42,56,64,66,76,78,79}, and radiology^{59,70,88–90,110} or plastic surgery^{45,47,49,102,107,108} (n=6/89, 6.7%, respectively). A schematic illustration of the identified concepts of LLM applications in patient care is shown in Figure 2.

Figure 2. Schematic illustration of the identified concepts for the application of large language models (LLMs) in patient care.

3.4 Limitations of Large Language Models

The thematic synthesis of limitations resulted in two main concepts: one related to design limitations and one related to output.

3.4.1 Design limitations

In terms of design limitations, many authors noted that LLMs are not optimized for medical use (n=46/89, 51.7%)^{13,26,28,34,35,37–39,46,49,50,54–59,61,62,65,66,68,70,71,79–81,83–85,88,91,93–98,100–107,109}, including implicit knowledge/lack of clinical context (n=13/89, 14.6%)^{28,39,46,66,71,79,81,83–85,98,103}, limitations in clinical reasoning (n=7/89, 7.9%)^{55,84,95,102–105}, limitations in medical image processing/production (n=5/89, 5.6%)^{37,55,91,106,107}, and misunderstanding of medical information and terms by the model (n=7/89, 7.9%)^{28,38,39,59,62,65,97}. In addition, data-related limitations were identified, including limited access to data on the internet (n=22/89, 24.7%)^{38,39,41,43,54–57,59,60,64,76,79,82–84,88,91,94,96,104,109}, the undisclosed origin of training data (n=36/89, 40.5%)^{25,26,29,30,32,34,36,37,40,46,47,50,51,53–60,64,65,70,71,76,82,83,91,94–96,101,105,109}, limitations in providing, evaluating, and validating references (n=20/89, 22.5%)^{45,49,54–57,65,71,73,76,80,83,85,91,94,96,98,101,103,105}, and storage/processing of sensitive health information (n=8/89, 9%)^{13,34,46,55,62,76,83,109}. Further second-order concepts included black-box algorithms, i.e., non-explainable AI (n=12/89, 13.5%)^{27,36,55,57,65,73,76,83,91,94,103,105}, limited engagement and dialogue capabilities (n=10/89, 11.2%)^{13,27,28,37,38,51,56,66,95,103}, and the inability of self-validation and correction (n=4/89, 4.5%)^61,73,74,107.

3.4.2 Output limitations

The evaluation of limitations in output data yielded 9 second-order codes concerning the non-reproducibility (n=38/89, 42.7%)^{28,29,34,38,39,41,43,45,46,49,54–61,64,65,71–73,76,80,82,83,85,90,91,94,96,98,99,101,103–105}, non-comprehensiveness (n=78/89, 87.6%)^{13,25,26,28–30,32–44,46,48–62,64,65,67–79,81–98,100,102–107,109–111}, incorrectness (n=78/89, 87.6%)^{13,25–44,46,49–52,54–62,64–66,69–79,81–85,87–107,109–111}, (un-)safeness (n=39/89, 43.8%)^{28,30,35,37,39,40,42–44,46,50,51,57–60,62,64,65,69,70,73,74,76,78–80,82,84,85,91,94,95,98–100,105,106,109}, bias (n=6/89, 6.7%)^{26,32,34,36,66,103}, and the dependence of the quality of output on the prompt-/input provided (n=27/89, 30.3%)^{26–28,34,38,41,44,46,51,52,56,68–72,74,76,78,79,81–83,90,94,95,100,101} or the environment (n=16/89, 18%)^{13,34,46,49–51,54,58,60,72,73,88,90,93,97,109}.

For non-reproducibility, key concepts included the non-deterministic nature of the output, e.g., due to inconsistent results across multiple iterations (n=34/89, 38.2%)^{28,29,34,38,39,41,43,46,58–61,72,76,82,90,94,98,99,101,103,104} and the inability to provide reliable references (n=20/89, 22.5%)^{45,49,54–57,65,71,73,76,80,83,85,91,94,96,98,101,103,105}. Non-comprehensiveness included nine concepts related to generic/non-personalized output (n=34/89, 38.2%)^{13,28,30,34,37,38,41,43,49,51,56,57,59,61,65,70,77,79,81,84–86,90,94,95,100,102–107,110}, incompleteness of output (n=68/89, 76.4%)^{13,25,26,28–30,32,34–39,41–44,46,49–52,55–62,64,65,67–69,72–77,79,81–86,89–98,100,102–107,109–111}, provision of information that is not standard of care (n=24/89, 27%)^{28,40,43,46,49,50,54,57,58,65,69,72,73,77,78,81,85,91,94,98,100,103,107,111} and/or outdated (n=12/89, 13.5%)^{13,25,32,34,38,41,43,44,49,54,83,84}, and production of oversimplified (n=10/89, 11.2%)^{38,46,49,54,59,79,84,85,103}, superfluous (n=16/89, 18%)^{13,28,34,38,46,62,72,79,86,90,94,97,100,106,107}, overcautious (n=7/89, 7.9%)^{13,28,37,51,70,103,110}, overempathic (n=1/89, 1.1%)¹³, or output with inappropriate complexity/reading level for patients (n=22/89, 24.7%)^{13,34,42,48,50,51,53,55,56,67,71,78,79,85,87,88,90,93,106,107,109,110}. For incorrectness, we identified 6 key concepts. Some of the incorrect information could be attributed to what is commonly known as hallucination (n=38/89, 42.7%)^{25,28,32,33,35–38,40–44,49–51,57–60,65,73,74,76,77,81,83,85,91,94,96–98,100,103,106,107,109}, i.e., the creation of entirely fictitious or false information that has no basis in the input provided or in reality (e.g., “You may be asked to avoid eating or drinking for a few hours before the scan” for a bone scan). 
However, numerous instances of misinformation were more appropriately classified under alternative concepts of the original psychiatric analogy, as described in detail by Currie et al.^43,112,113 These include illusion (n=12/89, 13.5%)^{28,36,38,43,57,59,77,78,85,88,94,105}, which is characterized by the generation of deceptive perceptions or the distortion of information by conflating similar but separate concepts (e.g., suggesting that MRI-type sounds might be experienced during standard nuclear medicine imaging), delirium (n=34/89, 38.2%)^{13,26,28,30,37,43,50,58,59,61,65,70,72–75,77,79,81–85,90–92,94,95,98,102,103,107,109,110}, which indicates significant gaps in vital information, resulting in a fragmented or confused understanding of a subject (e.g., omission of crucial information about caffeine cessation for stress myocardial perfusion scans), extrapolation (n=11/89, 12.4%)^{43,59,65,78,81,91,94,106,107,110}, which involves applying general knowledge or patterns to specific situations where they are inapplicable (e.g., advice about injection-site discomfort that is more typical of CT contrast administration), delusion (n=14/89, 15.7%)^{28,30,43,50,59,65,69,73,74,78,81,94,103,111}, i.e., a fixed, false belief held despite contradictory evidence (e.g., inaccurate waiting times for the thyroid scan), and confabulation (n=18/89, 20.2%)^{25,28,36–38,40,46,59,62,65,71,77–79,94,103,107}, i.e., filling in memory or knowledge gaps with plausible but invented information (e.g., “You should drink plenty of fluids to help flush the radioactive material from your body” for a biliary system–excreted radiopharmaceutical).

Many studies rated the generated output as unsafe, including misleading (n=34/89, 38.2%)^{28,30,35,43,44,46,50,51,57–60,62,64,65,69,73,74,76,78–80,82,84,85,94,95,98–100,105,106,109} or even harmful content (n=26/89, 29.2%)^{28,30,37,39,40,42,43,50,51,58–60,70,73,74,76,79,84,85,91,94,95,98–100,109}. A minority of reports identified biases in the output, which were related to language (n=2/89, 2.3%)^32,36, insurance status¹⁰³, underserved racial groups²⁶, or underrepresented procedures³⁴ (n=1/89, 1.1%, each). Finally, many authors suggested that performance was related to the prompting/input provided or the environment, i.e., depending on the evidence (n=7/89, 7.9%)^{52,68,69,71,81,82,95}, complexity (n=11/89, 12.4%)^{28,34,44,46,70,74,76,79,94,102}, specificity (n=13/89, 14.6%)^{27,38,41,56,70,72,74,76,78,81,95,100,101}, quantity (n=3/89, 3.4%)^26,52,74 of the input, type of conversation (n=3/89, 3.4%)^27,51,90, or the appropriateness of the output related to the target group (n=9/89, 10.1%)^{46,49,51,54,72,90,93,97,109}, provider/organization (n=4/89, 4.5%)^13,50,60,88, and local/national medical resources (n=5/89, 5.6%)^{34,50,58,60,73}. Figure 3 illustrates the hierarchical tree structure and quantity of the codes derived from the thematic synthesis of limitations.

Figure 3. Illustration of the hierarchical tree structure for the thematic synthesis of large language model (LLM) limitations in patient care, including the presence of codes for each concept.
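One way to quantify the non-deterministic behavior described in Section 3.4.2 is to submit the same prompt several times and compare the responses. The sketch below assumes the responses have already been collected as strings and uses the mean pairwise similarity ratio from Python's standard `difflib` module as a rough consistency score. This metric is an assumption chosen for illustration and was not prescribed by the included studies; the function name `consistency_score` is likewise hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(responses):
    """Mean pairwise similarity (0 to 1) across repeated responses to one prompt."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

A score of 1.0 means every repetition was character-for-character identical; lower values indicate that the wording, and possibly the content, varied between iterations. Character-level similarity does not capture whether two differently worded answers agree clinically, so a score like this would complement, not replace, expert review.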

4. Discussion

In this systematic review, we synthesized the current applications and limitations of LLMs in patient care, incorporating a broad analysis across 29 medical specialties and highlighting key limitations in LLM design and output, providing a comprehensive framework and taxonomy for their future implementation and evaluation in healthcare settings.

Most articles examined the use of LLMs based on the GPT-3.5 or GPT-4 architecture for answering medical questions, followed by the generation of patient information, including medical text summarization or translation and clinical documentation. The conceptual synthesis of LLM limitations revealed two key concepts: the first related to design, including 6 second-order and 12 third-order codes, and the second related to output, including 9 second-order and 32 third-order codes.

Although many LLMs have been developed specifically for the biomedical domain in recent years, we found that ChatGPT has been a disruptor in the medical literature on LLMs, with GPT-3.5 and GPT-4 accounting for almost 80% of the LLMs examined in this systematic review. While it was not possible to conduct a meta-analysis of the performance on medical tasks, many authors provided a positive outlook towards the integration of LLMs into clinical practice. However, the use of proprietary models such as ChatGPT in the biomedical field raises concerns because the limited access to the underlying algorithms, training data, and data processing and storage mechanisms makes them non-transparent and, thus, significantly limits their applicability in healthcare.¹¹⁴ Furthermore, integrating proprietary models into patient care applications makes these applications susceptible to performance changes associated with model updates, which may break existing functionalities and lead to harmful outcomes for patients. Therefore, especially in the biomedical field, open-source models such as BioMistral may offer a viable solution.⁶ Given the limited number of articles on open-source LLMs in our review, we strongly encourage future studies investigating the applicability of open-source LLMs in patient care. We identified several key limitations regarding the design and output. Not surprisingly, many reports noted the limitation that the LLMs studied were not optimized for the medical domain.
One possible solution to this limitation may be to provide medical knowledge during inference using RAG.¹¹⁵ However, even when trained for general purposes, ChatGPT has previously been shown to pass the United States Medical Licensing Examination (USMLE), the German State Examination in Medicine, or even a radiology board-style examination without images.^116–119 Although outperformed on specific tasks by specialized medical LLMs, such as Google’s Med-PaLM 2, this suggests that general-purpose LLMs can comprehend complex medical literature and case scenarios to a degree that meets professional standards.¹²⁰ Furthermore, given the large amounts of data on which proprietary models such as ChatGPT are trained, it is plausible that they have been exposed to more medical data overall than smaller specialized models despite being generalist models.

It should also be noted that passing these exams does not equate to the practical competence required of a healthcare provider.¹²¹ In addition, reliance on exam-based assessments carries a significant risk of bias. For example, if the exam questions or similar variants are publicly available and, thus, may be present in the training data, a correct answer does not demonstrate any knowledge beyond memorization of the training data.¹²² In fact, these types of tests can be misleading in estimating the model’s true abilities in terms of comprehension or analytical skills. Many studies have reported limitations in the output related to comprehensiveness, safety, correctness, reproducibility, and dependence of the output on the input/prompt and environment. Specifically, for correctness, we followed the taxonomy of Currie et al. to classify incorrect outputs more precisely into illusions, delusions, delirium, confabulation, and extrapolation, thus proposing a framework for a more precise and structured error classification that improves the characterization of incorrect outputs and enables more detailed performance comparisons with other research.^43,112,113 In contrast, only a minority of studies identified biases, for example, reflecting the unequal representation of certain content or the biases inherent in human-generated text in the training data.¹²³ This may indicate that the implemented safeguards are effective. However, not much is known about the technology and developer policies of proprietary LLMs, and previous work has shown that automated jailbreak generation is possible across various commercial LLM chatbots.¹²⁴ This also mirrors our concept of data-related limitations, particularly regarding the handling of sensitive health information.
Together with the limited transparency about the origin of the training data and the unexplainable and non-deterministic nature of the output, this raises a key question when applying LLMs to the medical domain: how can we entrust our patients to LLMs if they are neither reliable nor transparent? Given that models like ChatGPT are already publicly accessible and widely used, patients may already refer to them for medical questions in much the same way they use Google Search, making concerns about their early adoption somewhat academic.¹²⁵

In addition, the identified limitations in comprehensiveness, including the generation of content with high complexity and an inappropriate reading level, which was above the 6th-grade level recommended by the American Medical Association (AMA) in almost all studies analyzed, may further limit the utility of LLMs for patient information, particularly for patients with low health literacy.¹²⁶ Overall, this can lead to results that are misleading and harmful, as described in many of the reports in our review. In addition to advances in the development of LLMs and the focus on open source, it will therefore be necessary to develop and implement a well-validated scale to determine the quality and safety of LLM outputs in medical practice, such as the recent effort made to adapt the widely recognized Physician Documentation Quality Instrument (PDQI-9) for the assessment of AI transcripts and clinical summaries.¹²⁷
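Reading level is typically estimated with a readability formula. As one hedged example, the sketch below implements the Flesch-Kincaid Grade Level, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The review does not state which formula each included study used, the function names are illustrative, and the vowel-group syllable counter is a crude approximation, so the resulting grades are indicative only.

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of consecutive vowels (at least one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Estimate the U.S. school grade level needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def exceeds_recommended_level(text, max_grade=6.0):
    """Flag text whose estimated grade is above the 6th-grade recommendation."""
    return flesch_kincaid_grade(text) > max_grade
```

Short sentences built from short words score low, while long sentences with polysyllabic medical terms score high, which is why unedited LLM output on clinical topics tends to exceed the recommended level.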

Finally, the implementation of regulatory mandates like the forthcoming European Union AI Act and the associated challenges faced by generative AI and LLMs, for example, in terms of training data transparency and validation of non-deterministic output, will show which approaches the companies will take to bring these models into compliance with the law. How the notified bodies interpret and enforce the law in practice will likely be decisive for the further development of LLMs in the biomedical sector.¹²⁸

4.1 Limitations

Our study has limitations. First, our review focused on LLM applications and limitations in patient care, thus excluding research directed at clinicians only. Future studies may extend our synthesis approach to LLM applications that explicitly focus on healthcare professionals. Second, there is a risk that potentially eligible studies were not included in our analysis if they were not present in the 5 databases reviewed or were not available in English. However, we screened nearly 3,000 articles in total and systematically analyzed 89 articles, providing a comprehensive overview of the current state of LLMs in patient care, even if some articles may have been missed. Third, the rapid development and advancement of LLMs make it difficult to keep this systematic review up to date. For example, Gemini 1.5 Pro was released in February 2024, and corresponding articles are not included in this review, which synthesized articles from 2022 to 2023. Continued updates will be essential to monitor emerging areas and limitations in this rapidly evolving field.

5. Conclusion

In conclusion, this review provides a systematic overview of current LLM applications and limitations in patient care. Our conceptual synthesis provides a structured taxonomy that may lay the groundwork for both the implementation and critical evaluation of LLMs in healthcare settings.

6. Declarations

6.2 Competing interests

JNK declares consulting services for Owkin, France; DoMore Diagnostics, Norway; Panakeia, UK, and Scailyte, Basel, Switzerland; furthermore JNK holds shares in Kather Consulting, Dresden, Germany; and StratifAI GmbH, Dresden, Germany, and has received honoraria for lectures and advisory board participation by AstraZeneca, Bayer, Eisai, MSD, BMS, Roche, Pfizer and Fresenius. DT holds shares in StratifAI GmbH, Dresden, Germany and has received honoraria for lectures by Bayer. KKB reports grants from the European Union (101079894) and Wilhelm-Sander Foundation; participation on a Data Safety Monitoring Board or Advisory Board for the EU Horizon 2020 LifeChamps project (875329) and the EU IHI Project IMAGIO (101112053); speaker Fees for Canon Medical Systems Corporation and GE HealthCare. RK receives medical consultancy fees from Odin Vision.

6.3 Author contributions

Conceptualization: FB, LCA, KKB; Project administration: FB; Resources: FB, LCA, KKB; Software: FB, LCA, KKB; Data curation: FB, LH, CR; Formal analysis: FB, LH, CR, LCA, KKB; Investigation: FB, LH, CR, LCA, KKB; Methodology: FB; Supervision: FB, LCA, KKB; Validation: FB, LH, CR, EHCvD, RK, EOP, MRM, LS, MH, JNK, DT, RC, LCA, KKB; Visualization: FB, LCA; Writing – original draft preparation: FB, LH, LCA, KKB; Writing – review & editing: FB, LH, CR, EHCvD, RK, EOP, MRM, LS, MH, JNK, DT, RC, LCA, KKB.

6.4 Data availability

All data generated or analyzed during this study are included in this published article and its supplementary information files.

6.1 Acknowledgements

This research is funded by the European Union (101079894). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the granting authority can be held responsible for them. The funder had no role in the study design, data collection and analysis, manuscript preparation, or decision to publish.

Posted March 05, 2024.

[1] ↵
Milmo, D. ChatGPT reaches 100 million users two months after launch, <https://www.theguardian.com/technology/2023/feb/02/chatgpt-100-million-users-open-ai-fastest-growing-app> (2023).

[2] OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774; doi:10.48550/arXiv.2303.08774 (2023).
OpenUrl CrossRef

[3] ↵
Zhao, W. X. et al. A survey of large language models. arXiv preprint arXiv:2303.18223; doi:10.48550/arXiv.2303.18223 (2023).
OpenUrl CrossRef

[4] ↵
Clusmann, J. et al. The future landscape of large language models in medicine. Communications Medicine 3, 141; doi:10.1038/s43856-023-00370-1 (2023).
OpenUrl CrossRef

[5] ↵
Chen, Z. et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079; doi:10.48550/arXiv.2311.16079 (2023).
OpenUrl CrossRef

[6] ↵
Labrak, Y. et al. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. arXiv preprint arXiv:2402.10373; doi:10.48550/arXiv.2402.10373 (2024).
OpenUrl CrossRef

[7] ↵
Xiong, G., Jin, Q., Lu, Z. & Zhang, A. Benchmarking Retrieval-Augmented Generation for Medicine. arXiv preprint arXiv:2402.13178; doi:10.48550/arXiv.2402.13178 (2024).
OpenUrl CrossRef

[8] ↵
Yang, X., et al. A large language model for electronic health records. npj Dig Med 5, 194; doi:10.1038/s41746-022-00742-2 (2022).
OpenUrl CrossRef

[9] Tian, S. et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 25; doi:10.1093/bib/bbad493 (2024).
OpenUrl CrossRef

[10] Adams, L. C. et al. Leveraging GPT-4 for Post Hoc Transformation of Free-text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study. Radiology 307, e230725; doi:10.1148/radiol.230725 (2023).

[11] McDuff, D. et al. Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164; doi:10.48550/arXiv.2312.00164 (2023).

[12] Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362; doi:10.1038/s41586-023-06160-y (2023).

[13] Liu, S. et al. Leveraging Large Language Models for Generating Responses to Patient Messages. medRxiv 2023.07.14.23292669; doi:10.1101/2023.07.14.23292669 (2023).

[14] Busch, F., Hoffmann, L., Adams, L. C. & Bressem, K. K. A systematic review of current large language model applications and biases in patient care, <https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42024504542> (2024).

[15] Page, M. J. et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372, n71; doi:10.1136/bmj.n71 (2021).

[16] Ouzzani, M., Hammady, H., Fedorowicz, Z. & Elmagarmid, A. Rayyan—a web and mobile app for systematic reviews. Syst Rev 5, 210; doi:10.1186/s13643-016-0384-4 (2016).

[17] Data extraction form, <https://docs.google.com/forms/d/e/1FAIpQLScFwE5KaOugxX_xXtt9Y6fbBhV4s77S9cWRdVuiHh34vmArkQ/viewform> (2024).

[18] Hong, Q. N. et al. The Mixed Methods Appraisal Tool (MMAT) version 2018 for information professionals and researchers. Educ Inf 34, 285–291; doi:10.3233/EFI-180221 (2018).

[19] Hong, Q. N., Pluye, P., Bujold, M. & Wassef, M. Convergent and sequential synthesis designs: implications for conducting and reporting systematic reviews of qualitative and quantitative evidence. Syst Rev 6, 61; doi:10.1186/s13643-017-0454-2 (2017).

[20] Thomas, J. & Harden, A. Methods for the thematic synthesis of qualitative research in systematic reviews. BMC Med Res Methodol 8, 45; doi:10.1186/1471-2288-8-45 (2008).

[21] Sociocultural Research Consultants, LLC. Dedoose Version 9.2.4, cloud application for managing, analyzing, and presenting qualitative and mixed method research data (Los Angeles, CA, 2024).

[22] Savage, T., Wang, J. & Shieh, L. A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation. JMIR Med Inform 11, e49886; doi:10.2196/49886 (2023).

[23] Coskun, B. N., Yagiz, B., Ocakoglu, G., Dalkilic, E. & Pehlivan, Y. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int; doi:10.1007/s00296-023-05473-5 (2023).

[24] Bitar, H., Babour, A., Nafa, F., Alzamzami, O. & Alismail, S. Increasing Women’s Knowledge about HPV Using BERT Text Summarization: An Online Randomized Study. Int J Environ Res Public Health 19; doi:10.3390/ijerph19138100 (2022).

[25] Samaan, J. S. et al. Artificial Intelligence and Patient Education: Examining the Accuracy and Reproducibility of Responses to Nutrition Questions Related to Inflammatory Bowel Disease by GPT-4. medRxiv 2023.10.28.23297723; doi:10.1101/2023.10.28.23297723 (2023).

[26] Eromosele, O. B., Sobodu, T., Olayinka, O. & Ouyang, D. Racial Disparities in Knowledge of Cardiovascular Disease by a Chat-Based Artificial Intelligence Model. medRxiv 2023.09.20.23295874; doi:10.1101/2023.09.20.23295874 (2023).

[27] Johri, S. et al. Guidelines For Rigorous Evaluation of Clinical LLMs For Conversational Reasoning. medRxiv 2023.09.12.23295399; doi:10.1101/2023.09.12.23295399 (2024).

[28] Braga, A. V. N. M. et al. Use of ChatGPT in Pediatric Urology and its Relevance in Clinical Practice: Is it useful? medRxiv 2023.09.11.23295266; doi:10.1101/2023.09.11.23295266 (2023).

[29] King, R. C. et al. Appropriateness of ChatGPT in answering heart failure related questions. medRxiv 2023.07.07.23292385; doi:10.1101/2023.07.07.23292385 (2023).

[30] Huang, S. S. et al. Fact Check: Assessing the Response of ChatGPT to Alzheimer’s Disease Statements with Varying Degrees of Misinformation. medRxiv 2023.09.04.23294917; doi:10.1101/2023.09.04.23294917 (2023).

[31] Hanna, J. J., Wakene, A. D., Lehmann, C. U. & Medford, R. J. Assessing Racial and Ethnic Bias in Text Generation for Healthcare-Related Tasks by ChatGPT. medRxiv 2023.08.28.23294730; doi:10.1101/2023.08.28.23294730 (2023).

[32] Samaan, J. S. et al. ChatGPT’s ability to comprehend and answer cirrhosis related questions in Arabic. Arab J Gastroenterol 24, 145–148; doi:10.1016/j.ajg.2023.08.001 (2023).

[33] Patnaik, S. S. & Hoffmann, U. Quantitative evaluation of ChatGPT versus Bard responses to anaesthesia-related queries. Br J Anaesth 132, 169–171; doi:10.1016/j.bja.2023.09.030 (2024).

[34] Ali, H. et al. Evaluating the performance of ChatGPT in responding to questions about endoscopic procedures for patients. iGIE 2, 553–559; doi:10.1016/j.igie.2023.10.001 (2023).

[35] Suresh, K. et al. Utility of GPT-4 as an Informational Patient Resource in Otolaryngology. medRxiv 2023.05.14.23289944; doi:10.1101/2023.05.14.23289944 (2023).

[36] Yeo, Y. H. et al. GPT-4 outperforms ChatGPT in answering non-English questions related to cirrhosis. medRxiv 2023.05.04.23289482; doi:10.1101/2023.05.04.23289482 (2023).

[37] Knebel, D. et al. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies-An Analysis of 10 Fictional Case Vignettes. Klin Monbl Augenheilkd 1, 5–35; doi:10.1055/a-2149-0447 (2023).

[38] Zhu, L., Mou, W. & Chen, R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med 21, 269; doi:10.1186/s12967-023-04123-5 (2023).

[39] Lahat, A., Shachar, E., Avidan, B., Glicksberg, B. & Klang, E. Evaluating the Utility of a Large Language Model in Answering Common Patients’ Gastrointestinal Health-Related Questions: Are We There Yet? Diagnostics (Basel) 13; doi:10.3390/diagnostics13111950 (2023).

[40] Bernstein, I. A. et al. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw Open 6, e2330320; doi:10.1001/jamanetworkopen.2023.30320 (2023).

[41] Rogasch, J. M. M. et al. ChatGPT: Can You Prepare My Patients for [¹⁸F]FDG PET/CT and Explain My Reports? J Nucl Med, jnumed.123.266114; doi:10.2967/jnumed.123.266114 (2023).

[42] Campbell, D. J. et al. Evaluating ChatGPT Responses on Thyroid Nodules for Patient Education. Thyroid®; doi:10.1089/thy.2023.0491 (2023).

[43] Currie, G., Robbie, S. & Tually, P. ChatGPT and Patient Information in Nuclear Medicine: GPT-3.5 Versus GPT-4. J Nucl Med Technol 51, 307–313; doi:10.2967/jnmt.123.266151 (2023).

[44] Draschl, A. et al. Are ChatGPT’s Free-Text Responses on Periprosthetic Joint Infections of the Hip and Knee Reliable and Useful? J Clin Med 12; doi:10.3390/jcm12206655 (2023).

[45] Alessandri-Bonetti, M., Liu, H. Y., Palmesano, M., Nguyen, V. T. & Egro, F. M. Online patient education in body contouring: A comparison between Google and ChatGPT. J Plast Reconstr Aesthet Surg 87, 390–402; doi:10.1016/j.bjps.2023.10.091 (2023).

[46] Coskun, B., Ocakoglu, G., Yetemen, M. & Kaygisiz, O. Can ChatGPT, an Artificial Intelligence Language Model, Provide Accurate and High-quality Patient Information on Prostate Cancer? Urology 180, 35–58; doi:10.1016/j.urology.2023.05.040 (2023).

[47] Durairaj, K. K. et al. Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT “Wins” Rhinoplasty Consultations: Should We Be Worried? Facial Plast Surg Aesthet Med; doi:10.1089/fpsam.2023.0224 (2023).

[48] Kianian, R., Sun, D., Crowell, E. L. & Tsui, E. The Use of Large Language Models to Generate Education Materials about Uveitis. Ophthalmol Retina; doi:10.1016/j.oret.2023.09.008 (2023).

[49] Seth, I. et al. Exploring the Role of a Large Language Model on Carpal Tunnel Syndrome Management: An Observation Study of ChatGPT. J Hand Surg Am 48, 1025–1033; doi:10.1016/j.jhsa.2023.07.003 (2023).

[50] Inojosa, H. et al. Can ChatGPT explain it? Use of artificial intelligence in multiple sclerosis communication. Neurol Res Pract 5, 48; doi:10.1186/s42466-023-00270-8 (2023).

[51] Lyons, R. J., Arepalli, S. R., Fromal, O., Choi, J. D. & Jain, N. Artificial intelligence chatbot performance in triage of ophthalmic conditions. Can J Ophthalmol; doi:10.1016/j.jcjo.2023.07.016 (2023).

[52] Babayiğit, O., Tastan Eroglu, Z., Ozkan Sen, D. & Ucan Yarkac, F. Potential Use of ChatGPT for Patient Information in Periodontology: A Descriptive Pilot Study. Cureus 15, e48518; doi:10.7759/cureus.48518 (2023).

[53] Mondal, H., Dash, I., Mondal, S. & Behera, J. K. ChatGPT in Answering Queries Related to Lifestyle-Related Diseases and Disorders. Cureus 15, e48296; doi:10.7759/cureus.48296 (2023).

[54] Kim, H. W., Shin, D. H., Kim, J., Lee, G. H. & Cho, J. W. Assessing the performance of ChatGPT’s responses to questions related to epilepsy: A cross-sectional study on natural language processing and medical information retrieval. Seizure 114, 1–8; doi:10.1016/j.seizure.2023.11.013 (2023).

[55] Song, H. et al. Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis. J Med Syst 47, 125; doi:10.1007/s10916-023-02021-3 (2023).

[56] Zalzal, H. G., Abraham, A., Cheng, J. H. & Shah, R. K. Can ChatGPT help patients answer their otolaryngology questions? Laryngoscope Investig Otolaryngol; doi:10.1002/lio2.1193 (2023).

[57] Chervenak, J., Lieman, H., Blanco-Breindel, M. & Jindal, S. The promise and peril of using a large language model to obtain clinical information: ChatGPT performs strongly as a fertility counseling tool with limitations. Fertil Steril 120, 575–583; doi:10.1016/j.fertnstert.2023.05.151 (2023).

[58] Bushuven, S. et al. “ChatGPT, Can You Help Me Save My Child’s Life?” - Diagnostic Accuracy and Supportive Capabilities to Lay Rescuers by ChatGPT in Prehospital Basic Life Support and Paediatric Advanced Life Support Cases - An In-silico Analysis. J Med Syst 47, 123; doi:10.1007/s10916-023-02019-x (2023).

[59] Jeblick, K. et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol; doi:10.1007/s00330-023-10213-1 (2023).

[60] Samaan, J. S. et al. Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery. Obes Surg 33, 1790–1796; doi:10.1007/s11695-023-06603-5 (2023).

[61] Zhou, J. M., Li, T. Y., Fong, S. J., Dey, N. & Crespo, R. G. Exploring ChatGPT’s Potential for Consultation, Recommendations and Report Diagnosis: Gastric Cancer and Gastroscopy Reports’ Case. Int J Interact Multimed Artif Intell 8, 7–13; doi:10.9781/ijimai.2023.04.007 (2023).

[62] Oniani, D. et al. Toward Improving Health Literacy in Patient Education Materials with Neural Machine Translation Models. AMIA Jt Summits Transl Sci Proc 2023, 418–426 (2023).

[63] Hernandez, C. A. et al. The Future of Patient Education: AI-Driven Guide for Type 2 Diabetes. Cureus 15, e48919; doi:10.7759/cureus.48919 (2023).

[64] Kuşcu, O., Pamuk, A. E., Sütay Süslü, N. & Hosal, S. Is ChatGPT accurate and reliable in answering questions regarding head and neck cancer? Front Oncol 13, 1256459; doi:10.3389/fonc.2023.1256459 (2023).

[65] Biswas, S., Logan, N. S., Davies, L. N., Sheppard, A. L. & Wolffsohn, J. S. Assessing the utility of ChatGPT as an artificial intelligence-based large language model for information to answer questions on myopia. Ophthalmic Physiol Opt 43, 1562–1570; doi:10.1111/opo.13207 (2023).

[66] Chiesa-Estomba, C. M. et al. Exploring the potential of Chat-GPT as a supportive tool for sialendoscopy clinical decision making and patient information support. Eur Arch Otorhinolaryngol; doi:10.1007/s00405-023-08104-8 (2023).

[67] Decker, H. et al. Large Language Model-Based Chatbot vs Surgeon-Generated Informed Consent Documentation for Common Procedures. JAMA Netw Open 6, e2336997; doi:10.1001/jamanetworkopen.2023.36997 (2023).

[68] Kaarre, J. et al. Exploring the potential of ChatGPT as a supplementary tool for providing orthopaedic information. Knee Surg Sports Traumatol Arthrosc 31, 5190–5198; doi:10.1007/s00167-023-07529-2 (2023).

[69] Ferreira, A. L., Chu, B., Grant-Kels, J. M., Ogunleye, T. & Lipoff, J. B. Evaluation of ChatGPT Dermatology Responses to Common Patient Queries. JMIR Dermatol 6, e49280; doi:10.2196/49280 (2023).

[70] Truhn, D. et al. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep 13, 20159; doi:10.1038/s41598-023-47500-2 (2023).

[71] Hurley, E. T. et al. Evaluation High-Quality of Information from ChatGPT (Artificial Intelligence-Large Language Model) Artificial Intelligence on Shoulder Stabilization Surgery. Arthroscopy; doi:10.1016/j.arthro.2023.07.048 (2023).

[72] Cankurtaran, R. E., Polat, Y. H., Aydemir, N. G., Umay, E. & Yurekli, O. T. Reliability and Usefulness of ChatGPT for Inflammatory Bowel Diseases: An Analysis for Patients and Healthcare Professionals. Cureus 15, e46736; doi:10.7759/cureus.46736 (2023).

[73] Birkun, A. A. & Gautam, A. Large Language Model (LLM)-Powered Chatbots Fail to Generate Guideline-Consistent Content on Resuscitation and May Provide Potentially Harmful Advice. Prehosp Disaster Med 38, 757–763; doi:10.1017/s1049023x23006568 (2023).

[74] Pushpanathan, K. et al. Popular large language model chatbots’ accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience 26, 108163; doi:10.1016/j.isci.2023.108163 (2023).

[75] Shao, C. Y. et al. Appropriateness and Comprehensiveness of Using ChatGPT for Perioperative Patient Education in Thoracic Surgery in Different Language Contexts: Survey Study. Interact J Med Res 12, e46900; doi:10.2196/46900 (2023).

[76] Vaira, L. A. et al. Accuracy of ChatGPT-Generated Information on Head and Neck and Oromaxillofacial Surgery: A Multicenter Collaborative Analysis. Otolaryngol Head Neck Surg; doi:10.1002/ohn.489 (2023).

[77] Chen, S. et al. Use of Artificial Intelligence Chatbots for Cancer Treatment Information. JAMA Oncol 9, 1459–1462; doi:10.1001/jamaoncol.2023.2954 (2023).

[78] Bellinger, J. R. et al. BPPV Information on Google Versus AI (ChatGPT). Otolaryngol Head Neck Surg; doi:10.1002/ohn.506 (2023).

[79] Nielsen, J. P. S., von Buchwald, C. & Grønhøj, C. Validity of the large language model ChatGPT (GPT4) as a patient information source in otolaryngology by a variety of doctors in a tertiary otorhinolaryngology department. Acta Otolaryngol 143, 779–782; doi:10.1080/00016489.2023.2254809 (2023).

[80] Sezgin, E., Chekeni, F., Lee, J. & Keim, S. Clinical Accuracy of Large Language Models and Google Search Responses to Postpartum Depression Questions: Cross-Sectional Study. J Med Internet Res 25, e49240; doi:10.2196/49240 (2023).

[81] Floyd, W. et al. Current Strengths and Weaknesses of ChatGPT as a Resource for Radiation Oncology Patients and Providers. Int J Radiat Oncol Biol Phys; doi:10.1016/j.ijrobp.2023.10.020 (2023).

[82] Uz, C. & Umay, E. “Dr ChatGPT”: Is it a reliable and useful source for common rheumatic diseases? Int J Rheum Dis 26, 1343–1349; doi:10.1111/1756-185x.14749 (2023).

[83] Athavale, A., Baier, J., Ross, E. & Fukaya, E. The potential of chatbots in chronic venous disease patient management. JVS Vasc Insights 1; doi:10.1016/j.jvsvi.2023.100019 (2023).

[84] Li, Y. et al. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus 15, e40895; doi:10.7759/cureus.40895 (2023).

[85] Seth, I. et al. Comparing the Efficacy of Large Language Models ChatGPT, BARD, and Bing AI in Providing Information on Rhinoplasty: An Observational Study. Aesthet Surg J Open Forum 5, ojad084; doi:10.1093/asjof/ojad084 (2023).

[86] Lockie, E. & Choi, J. Evaluation of a chat GPT generated patient information leaflet about laparoscopic cholecystectomy. ANZ J Surg; doi:10.1111/ans.18834 (2023).

[87] Haver, H. L., Lin, C. T., Sirajuddin, A., Yi, P. H. & Jeudy, J. Use of ChatGPT, GPT-4, and Bard to Improve Readability of ChatGPT’s Answers to Common Questions About Lung Cancer and Lung Cancer Screening. AJR Am J Roentgenol 221, 701–704; doi:10.2214/ajr.23.29622 (2023).

[88] Li, H. et al. Decoding radiology reports: Potential application of OpenAI ChatGPT to enhance patient understanding of diagnostic reports. Clin Imaging 101, 137–141; doi:10.1016/j.clinimag.2023.06.008 (2023).

[89] Scheschenja, M. et al. Feasibility of GPT-3 and GPT-4 for in-Depth Patient Education Prior to Interventional Radiological Procedures: A Comparative Analysis. Cardiovasc Intervent Radiol 47, 245–250; doi:10.1007/s00270-023-03563-2 (2024).

[90] Gordon, E. B. et al. Enhancing Patient Communication With Chat-GPT in Radiology: Evaluating the Efficacy and Readability of Answers to Common Imaging-Related Questions. J Am Coll Radiol 21, 353–359; doi:10.1016/j.jacr.2023.09.011 (2024).

[91] Stroop, A. et al. Large language models: Are artificial intelligence-based chatbots a reliable source of patient information for spinal surgery? Eur Spine J; doi:10.1007/s00586-023-07975-z (2023).

[92] Coraci, D. et al. ChatGPT in the development of medical questionnaires. The example of the low back pain. Eur J Transl Myol 33; doi:10.4081/ejtm.2023.12114 (2023).

[93] Ye, C., Zweck, E., Ma, Z., Smith, J. & Katz, S. Doctor Versus Artificial Intelligence: Patient and Physician Evaluation of Large Language Model Responses to Rheumatology Patient Questions in a Cross-Sectional Study. Arthritis Rheumatol; doi:10.1002/art.42737 (2023).

[94] Mohammad-Rahimi, H. et al. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endod J 57, 305–314; doi:10.1111/iej.14014 (2024).

[95] Hermann, C. E. et al. Let’s chat about cervical cancer: Assessing the accuracy of ChatGPT responses to cervical cancer questions. Gynecol Oncol 179, 164–168; doi:10.1016/j.ygyno.2023.11.008 (2023).

[96] Kerbage, A. et al. Accuracy of ChatGPT in Common Gastrointestinal Diseases: Impact for Patients and Providers. Clin Gastroenterol Hepatol; doi:10.1016/j.cgh.2023.11.008 (2023).

[97] Shiraishi, M. et al. Generating Informed Consent Documents Related to Blepharoplasty Using ChatGPT. Ophthalmic Plast Reconstr Surg; doi:10.1097/iop.0000000000002574 (2023).

[98] Barclay, K. S. et al. Quality and Agreement With Scientific Consensus of ChatGPT Information Regarding Corneal Transplantation and Fuchs Dystrophy. Cornea; doi:10.1097/ico.0000000000003439 (2023).

[99] Qarajeh, A. et al. AI-Powered Renal Diet Support: Performance of ChatGPT, Bard AI, and Bing Chat. Clin Pract 13, 1160–1172; doi:10.3390/clinpract13050104 (2023).

[100] Chowdhury, M. et al. Can Large Language Models Safely Address Patient Questions Following Cataract Surgery? Proceedings of the 5th Clinical Natural Language Processing Workshop; doi:10.18653/v1/2023.clinicalnlp-1.17 (2023).

[101] Singer, M. B., Fu, J. J., Chow, J. & Teng, C. C. Development and Evaluation of Aeyeconsult: A Novel Ophthalmology Chatbot Leveraging Verified Textbook Knowledge and GPT-4. J Surg Educ 81, 438–443; doi:10.1016/j.jsurg.2023.11.019 (2024).

[102] Xie, Y. et al. Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT. Aesthetic Plast Surg 47, 1985–1993; doi:10.1007/s00266-023-03338-7 (2023).

[103] Nastasi, A. J., Courtright, K. R., Halpern, S. D. & Weissman, G. E. A vignette-based evaluation of ChatGPT’s ability to provide appropriate and equitable medical advice across care contexts. Sci Rep 13, 17885; doi:10.1038/s41598-023-45223-y (2023).

[104] Biswas, M., Islam, A., Shah, Z., Zaghouani, W. & Brahim Belhaouari, S. Can ChatGPT be Your Personal Medical Assistant? Tenth International Conference on Social Networks Analysis, Management and Security (SNAMS), 1–5; doi:10.1109/SNAMS60348.2023.10375477 (2023).

[105] Panagoulias, D., Palamidas, F., Virvou, M. & Tsihrintzis, G. Evaluating the Potential of LLMs and ChatGPT on Medical Diagnosis and Treatment. 14th International Conference on Information, Intelligence, Systems & Applications (IISA), 1–9; doi:10.1109/IISA59645.2023.10345968 (2023).

[106] Chandra, A., Davis, M. J., Hamann, D. & Hamann, C. R. Utility of Allergen-Specific Patient-Directed Handouts Generated by Chat Generative Pretrained Transformer. Dermatitis 34, 448; doi:10.1089/derm.2023.0059 (2023).

[107] Hung, Y.-C., Chaker, S., Sigel, M., Saad, M. & Slater, E. Comparison of Patient Education Materials Generated by Chat Generative Pre-Trained Transformer Versus Experts: An Innovative Way to Increase Readability of Patient Education Materials. Ann Plast Surg 91, 409–412; doi:10.1097/SAP.0000000000003634 (2023).

[108] Capelleras, M., Soto-Galindo, G. A., Cruellas, M. & Apaydin, F. ChatGPT and Rhinoplasty Recovery: An Exploration of AI’s Role in Postoperative Guidance. Facial Plast Surg; doi:10.1055/a-2219-4901 (2024).

[109] Scquizzato, T. et al. Testing ChatGPT ability to answer laypeople questions about cardiac arrest and cardiopulmonary resuscitation. Resuscitation 194, 110077; doi:10.1016/j.resuscitation.2023.110077 (2024).

[110] Kuckelman, I. J. et al. Assessing AI-Powered Patient Education: A Case Study in Radiology. Acad Radiol 31, 338–342; doi:10.1016/j.acra.2023.08.020 (2024).

[111] Sulejmani, P. et al. A large language model artificial intelligence for patient queries in atopic dermatitis. J Eur Acad Dermatol Venereol; doi:10.1111/jdv.19737 (2024).

[112] Currie, G. & Barry, K. ChatGPT in Nuclear Medicine Education. J Nucl Med Technol 51, 247–254; doi:10.2967/jnmt.123.265844 (2023).

[113] Currie, G. M. Academic integrity and artificial intelligence: is ChatGPT hype, hero or heresy? Semin Nucl Med 53, 719–730; doi:10.1053/j.semnuclmed.2023.04.008 (2023).

[114] Li, J., Dada, A., Puladi, B., Kleesiek, J. & Egger, J. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed 245, 108013; doi:10.1016/j.cmpb.2024.108013 (2024).

[115] Jin, M. et al. Health-LLM: Personalized Retrieval-Augmented Disease Prediction Model. arXiv preprint arXiv:2402.00746; doi:10.48550/arXiv.2402.00746 (2024).

[116] Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375; doi:10.48550/arXiv.2303.13375 (2023).

[117] Brin, D. et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep 13, 16492; doi:10.1038/s41598-023-43436-9 (2023).

[118] Jung, L. B. et al. ChatGPT Passes German State Examination in Medicine With Picture Questions Omitted. Dtsch Arztebl Int 120, 373–374; doi:10.3238/arztebl.m2023.0113 (2023).

[119] Bhayana, R., Krishna, S. & Bleakney, R. R. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology 307, e230582; doi:10.1148/radiol.230582 (2023).

[120] Singhal, K. et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617; doi:10.48550/arXiv.2305.09617 (2023).

[121] Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2, e0000198; doi:10.1371/journal.pdig.0000198 (2023).

[122] Kapoor, S., Henderson, P. & Narayanan, A. Promises and pitfalls of artificial intelligence for legal applications. arXiv preprint arXiv:2402.01656; doi:10.48550/arXiv.2402.01656 (2024).

[123] Navigli, R., Conia, S. & Ross, B. Biases in Large Language Models: Origins, Inventory, and Discussion. ACM J Data Inf Qual 15, 1–21; doi:10.1145/3597307 (2023).

[124] Deng, G. et al. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715; doi:10.48550/arXiv.2307.08715 (2023).

[125] Ayoub, N. F., Lee, Y. J., Grimm, D. & Divi, V. Head-to-Head Comparison of ChatGPT Versus Google Search for Medical Knowledge Acquisition. Otolaryngol Head Neck Surg; doi:10.1002/ohn.465 (2023).

[126] Weiss, B. D. Health Literacy: A Manual for Clinicians. Chicago, IL: American Medical Association, American Medical Foundation (2003).

[127] Tierney, A. A. et al. Ambient Artificial Intelligence Scribes to Alleviate the Burden of Clinical Documentation. NEJM Catalyst 5, CAT.23.0404; doi:10.1056/CAT.23.0404 (2024).

[128] Council of the European Union. Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts - Analysis of the final compromise text with a view to agreement. 2021/0106(COD), 1–272 (2024).
