ABSTRACT
Rationale and Objectives Large Language Models (LLMs) have the potential to enhance medical training, education, and diagnosis. However, since these models were not originally designed for medical purposes, there are concerns regarding their reliability and safety in clinical settings. This review systematically assesses the utility, advantages, and potential risks of employing LLMs in the field of hematology.
Materials and Methods We searched PubMed, Web of Science, and Scopus databases for original publications on LLMs application in hematology. We limited the search to articles published in English from December 01 2022 to March 25, 2024, coinciding with the introduction of ChatGPT. To evaluate the risk of bias, we used the adapted version of the Quality Assessment of Diagnostic Accuracy Studies criteria (QUADAS-2).
Results Eleven studies fulfilled the eligibility criteria. The studies varied in their goals and methods, covering medical education, diagnosis, and clinical practice. GPT-3.5 and GPT-4’s demonstrated superior performance in diagnostic tasks and medical information propagation compared to other models like Google’s Bard (currently called Gemini). GPT-4 demonstrated particularly high accuracy in tasks such as interpreting hematology cases and diagnosing hemoglobinopathy, with performance metrics of 76% diagnostic accuracy and 88% accuracy in identifying normal blood cells. However, the study also revealed discrepancies in model consistency and the accuracy of provided references, indicating variability in their reliability.
Conclusion While LLMs present significant opportunities for advancing clinical hematology, their incorporation into medical practice requires careful evaluation of their benefits and limitations.
INTRODUCTION
LLMs such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude, are reshaping the field of text generation. Notably, ChatGPT-3.5 alone boasts more than 100 million monthly active users.1
LLMs are increasingly used in clinical research and practice. They serve various purposes such as simplifying complex medical terminology for patient education.2 They also support physician training and academic learning through interactive tools3 and enhance diagnostic accuracy by creating predictive models from Electronic Health Records (EHR) data for patient outcomes.4 In hematology, a field with strict protocols and standards, the potential of LLMs goes beyond simple text interactions. They could significantly reshape the field.5
However, adopting LLMs, which were not initially designed for medical purposes, raises concerns regarding their reliability, including challenges in contextual understanding and interpretability, biases in the training data, and ethical and regulatory issues. Doubts also persist about the dependability of their outputs for making clinical decisions.6 7 As LLMs become more common in healthcare, the necessity to test their applications increases. This review evaluates the application of LLMs in the field of hematology, systematically assessing their benefits, limitations, and potential risks in medical training, education, and diagnosis. It includes a thorough examination of studies from major databases, focusing on their use since the introduction of ChatGPT. The review guides us toward cautious yet optimistic integration of LLMs into clinical hematology practice.
Tokens
Tokens are the basic units of data processed by LLMs. In the context of text, a token can be a word, part of a word or a character.29
Prompt
In the context of AI, a "prompt" refers to the input given to a language model to initiate and guide its output generation.29
Autoregression
Autoregression in AI refers to the process where a model predicts the next word or sequence based on the previous inputs. It operates by evaluating the probabilities of various possible continuations and selecting the most likely next element in the sequence.29
METHODS
Search Strategy
A systematic review was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement (PRISMA),8 relevant guidelines from the diagnostic test accuracy extension,9 and the recommendations for systematic reviews of prediction models (CHARMS checklist).10
A systematic search of the published literature was conducted on March 25, 2024. PubMed, Web of Science, and Scopus were used as databases.
Our search strategy targeted original studies at the intersection of LLMs and hematology, employing a set of search terms relevant to both fields. The detailed search strategy is outlined in the Supplementary Materials ("Detailed Search Terms").
We limited the search to articles published in English after December 1, 2022, to align with the introduction of ChatGPT, marking the first widespread release of LLMs.
Peer-reviewed original publications on the subject of LLMs applications in Hematology were included. We excluded articles that were not related to applications of LLMs in hematology, articles that were not original, and conference abstracts. To ensure that we did not inadvertently exclude relevant articles, we searched the bibliographies of the articles included in our study.
The study is registered with PROSPERO (CRD42024525241).
Study Selection
Two reviewers (AM and SS) independently screened the titles and abstracts to determine whether the studies met the inclusion criteria. In unclear cases, the full-text article was reviewed. Disagreements were adjudicated by a third reviewer (EK). The two authors (AM and SS) independently assessed the full texts of the included articles.
Data Extraction
Data from all included studies was collected into a standardized data extraction sheet. Data included publication year, LLM model types, objective, sample size, main findings, and limitations.
Quality Assessment and Risk of Bias
To evaluate the risk of bias, we used the adapted version of the Quality Assessment of Diagnostic Accuracy Studies criteria (QUADAS-2).11
Data Synthesis
We conducted a narrative synthesis of the findings from the included studies. Due to the heterogeneity in the study designs and outcomes, a meta-analysis was not implemented. Instead, we focused on summarizing the applications, benefits, and limitations of LLMs in Hematology as reported in the studies.
RESULTS
A total of 325 articles were retrieved in the initial search. After exclusion (Supplementary Figure 1), 11 studies evaluating the application of LLMs in hematology were included.
The included studies were diverse in their objectives and methodologies and covered various aspects of hematology including lymphoma, transfusion medicine, hemophilia, hemoglobinopathies, and stem cell transplant. Categories included medical education, diagnosis, and clinical practice (Figure3).
All studies were evaluated for risk of bias and applicability using the QUADAS-2 tool (Supplementary Table 1). Generally, the studies demonstrated a low risk of bias across the criteria of index test, reference standard, and flow and timing. In some studies, the composition of the sample (patient selection criteria) presented an intermediate risk of bias. For example, in their article, Kumari A. et al.12 wrote 50 hematologic questions that were posed to the LLMs. The authors chose which questions to ask, which might have influenced the outcomes of the study.
The characteristics of the studies are presented in Table 1. Objectives, reference standards, sample sizes, and main findings are presented in Table 2. The strengths and limitations of LLMs as presented in the included studies are summarized in Table 3. In the following section, we detail the included studies separated by theme.
Medical Education
Two studies compared the effectiveness of ChatGPT (GPT-3.5) with Google search in answering medical questions on stem cell therapy and breast implant-associated anaplastic large cell lymphoma (BIA-ALCL).13 14 Both studies highlighted GPT-3.5’s superiority in propagating professional information.
In another study by Van de Wyngaert C. et al., GPT-3.5’s answers to frequently asked questions about hemophilia were assessed and compared to the answers found on specialized websites. GPT’s responses were more relevant, exhaustive, scientifically valid, and understandable, compared to the websites.15
When assessing the performance of LLMs on the Internal Medicine postgraduate knowledge of transfusion practice (BEST-TEST), a multiple-choice test regarding transfusion medicine, GPT-4 performed the best with an average of 87% correct answers, followed by Bard (54%) and GPT-3.5 (44%).16
Klang et al. demonstrated GPT-3.5’s utility for academic writing in thrombosis and hemostasis. The model’s limitations highlighted the importance of human judgment and medical expertise in the writing of academic papers.17
Medical Diagnosis
Kumari A. et al. compared the capability of different LLMs in solving hematology cases. GPT-3.5 performed the best, achieving a score of 3.2 out of 5, followed by Google’s Bard with a score of 2.2, and Microsoft’s Bing with 2.0.12
In a study that investigated the ability of GPT models to diagnose hemoglobinopathy (Beta heterozygote, 3.7 homozygote, SEA, MED, FIL, HbC or HbE heterozygote) by interpreting patients’ laboratory results, GPT-4 achieved a diagnostic accuracy of 76%, with 97% true positives and 56% true negatives. In the same task, GPT-3.5 achieved an accuracy of 53%, with 75% true positives and 31% true negatives.18
Yang et al., evaluated the potential of GPT-4 to assist with blood morphology identification. GPT-4 identified normal blood cells with an accuracy of 88%, exceeding the accuracy of identifying abnormal blood cells at a rate of 54%. Regarding identifying abnormal cells, the accuracy of GPT-4 was slightly higher than that of the manual method, which was 49.5%.19
Clinical practice and therapeutic recommendation
Duey et al. compared the recommendations of GPT-3.5 and GPT4 on thromboembolic prophylaxis in spine surgery. They showed that GPT-3.5 had an accuracy rate of 33%, while ChatGPT-4.0 achieved an accuracy of 92%.20
The performance of GPT models and Google Bard in answering common questions related to clinical transfusion practice was better with explicitly phrased questions compared to those with implicit, more "realistic" phrasing. GPT-4 performed best with a score of 83% on implicit questions and 100% on explicit questions. GPT-3.5 performance was the most affected by question phrasing, with a score of 17% on the implicit questions and 83% on explicit questions.16
Stephens et al. evaluated the quality of indications for irradiated blood components for TA-GVHD prevention, as provided by GPT models, Google Bard, and Microsoft Bing. Bing’s responses were rated the most accurate, with a grade of 3.75 out of 5, followed by GPT-3.5 with 3.5, GPT-4 with 3, and Bard with 3. GPT-4’s responses were considered the most complete, receiving a completeness grade of 2.25 out of 3, followed by GPT-3.5 with 2, Bing with 1.75, and Bard with 1.5.21
A study by Hurley NC. et al. evaluated the recommendations of GPT models and Google Bard, regarding the necessity for red blood cell transfusion, based on short case presentations. GPT-4 demonstrated the highest performance, with a correct result rate of 91%. Bard performed least well, with a correct result rate of 45.5%.16 When assessing the capabilities of GPT-4, PaLm2, Llama2-13b, and Llama2-70b in making complex decisions in hematopoietic stem cell transplantation, only 58.8% of the LLMs’ answers matched the experts’ positions.22
Benefits and limitations of LLMs in hematology
From the included studies, several strengths and limitations of LLMs in hematology were identified (Table 3, Figure 5).
Opinions regarding the consistency of GPT models in the field of hematology were divided. The study by Kurstjens S. et al.18 described ChatGPT as an inconsistent model (choosing the same answer 80% of the time), while referring to GPT-4 as a very consistent one (choosing the same answer 98.3% of the time). In contrast, Hurley et al.16 characterize three LLMs as consistent models, with Bard, GPT-3.5, and GPT-4 choosing the same answer 87%, 99%, and 97% of the time, respectively. The discrepancy highlights the debate within the field regarding the reliability of these models’ outputs.
Some studies highlighted issues with the references used by the LLMs. Three studies reported that some of GPT-3.5’s references were fake, outdated, or contained errors.14 17 20 Another study, by Civettini et al., indicated that in some cases, Llama-2 and GPT-4 could not provide references.22 Duey et al. observed that GPT-4 did not explicitly cite any sources in its responses, thus avoiding the display of problematic references.20 Nonetheless, it is essential to observe that Bing Chat provided appropriate references in the study by Stephens et al. .21
Three studies noted limitations in the performance of GPT models due to the large datasets used to train them, which are not updated enough. One focused on GPT-3.5,17 another on GPT-4,19 a third on both.20 Van de Wyngaert et al. also mentioned the potential impact of this issue on ChatGPT’s performance.15
Concerns regarding the reliability of the data were raised. GPT-3.520 and PaLm222 might derive their answers from sources that are not suitable, leading to potentially incorrect responses. For example, PaLm2 consistently based its responses on UpToDate (uptodate.com) information on stem cell transplantation in acute myeloid leukemia, not only when asked about acute myeloid leukemia but also when asked about acute lymphoblastic leukemia.
Other studies pointed out that some of LLMs’ answers are not accurate enough.17 20 21 For instance, LLMs recommended medications that were not indicated,16 and GPT models offered recommendations even when there was insufficient evidence.20 However, additional research suggests that GPT’s responses are reliable and based on academic sources.13 15
LLMs also encountered some problems when dealing with laboratory data.12 18 For example, ChatGPT interpreted hemoglobin results, expressed in nmol/L, as if the concentration was in g/dL.18
An important strength that was observed in the reviewed articles is that LLMs offered detailed and specific responses,13 14 15 19 21 written in a manner similar to common everyday language.13 Additionally, GPT models sometimes included a disclaimer advising users to seek professional medical advice for the most accurate information.20 21
DISCUSSION
This systematic review examines the use of Large Language Models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard within the field of Hematology. It highlights the potential and the limitations of these models in enhancing medical education, advancing diagnostic accuracy, and shaping clinical practice within the field.
Evaluation of LLMs in Medical Education, Diagnosis and Clinical Decision-making
The studies reviewed demonstrate that LLMs can surpass traditional electronic methods in delivering medical education to both patients and physicians. For example, ChatGPT provided responses that were more relevant and easier to understand than those of standard search engines and specialized medical websites.13 14 15 Patient interaction with chatbot can help patients understand their illness and treatment options. It can also serve as a translator for medical terminology used by physicians. This capability suggests that LLMs could play a pivotal role in democratizing access to reliable medical information, thereby potentially transforming medical education by making high-quality knowledge more accessible.
In diagnostic applications, the performance of LLMs varied, with some models achieving diagnostic accuracies that rival traditional methods.19 This variance underscores the importance of ongoing model training and validation to ensure that these tools provide accurate and reliable diagnostics.
The application of LLMs in clinical settings has shown that while they can offer accurate recommendations for treatment and diagnosis, their reliability can be inconsistent.16 20 21 22 This inconsistency is particularly concerning given the high stakes of medical decision-making. For instance, the studies highlighted issues with the models’ references14 17 20 22 and the reliability of their training data,20 22 which could lead to incorrect or outdated medical advice.
Moreover, the variation in the performance of different LLMs in handling implicit versus explicit queries raises concerns about their practicality in real-world clinical environments, where queries may not always be clearly formulated.16 This limitation could affect the models’ utility in emergency settings or complex cases where nuanced understanding and rapid decision-making are crucial.
The Potential of LLMs in Hematology
While LLMs are often praised for their potential to streamline operations and enhance decision-making, it is crucial to acknowledge the implications of their integration into healthcare.23 24 The reliance on AI- generated knowledge could shift the focus of medical authority from trained professionals to AI systems, altering the power dynamics in healthcare settings. Such a shift could exacerbate existing challenges related to trust and accountability in medical practice.
Additionally, the adoption of LLMs might lead to a homogenization of medical knowledge, as these models are trained primarily on existing datasets that may not fully capture the diversity of patient experiences or the complexities of rare conditions.25 This could stifle innovation in medical thinking and reduce the personalized nature of patient care, potentially leading to a one-size-fits-all approach that neglects individual patient needs and contexts.
Another important point is the critical understanding of how chatbots function, as the design of user prompts greatly influences the responses they generate. In research settings, prompts are meticulously crafted to ensure precision and relevance; however, this level of detail often diminishes in real-world applications. To secure accurate, non-generic responses, users must clearly specify their audience and the context of their questions. Considering the challenges associated with creating effective prompts, customizing LLM to meet the diverse needs of hematology patients is extremely important.
Future Directions
The applications of LLMs in hematology discussed in this review are currently in their early stages, and the full spectrum of their potential remains largely untapped. There is much to discover about what LLMs can offer to the field. There are many more subareas of hematology, such as leukemias, coagulation disorders, and myelodysplastic syndromes, that have not yet been studied using LLMs. Additionally, there are many more tasks where LLMs could make significant contributions including, identifying adverse events, enriching risk prediction models, enhancing patient management, as well as improving administrative processes and compliance to guidelines.26
To harness the benefits of LLMs while mitigating the risks, future research should focus on enhancing the transparency and accountability of these models. Developing standards for the ethical use of AI in medicine, improving the diversity and reliability of training datasets, and implementing robust validation processes are critical steps toward responsible integration.
Furthermore, fostering a collaborative environment where AI complements rather than replaces human expertise could help maintain the essential role of medical professionals’ judgment. Integrating LLMs into clinical practice should not diminish the value of human expertise but rather augment it, ensuring that medical care remains compassionate, individualized, and informed by both human empathy and AI’s analytical capabilities.
This systematic review encounters several limitations. The small number of studies included restricts the ability to draw broad conclusions. Variability in study tasks and methods hinders the execution of a meta- analysis. Additionally, as the investigation of LLMs in this field is relatively recent, the long-term impacts remain uncertain. Concerns regarding the training data—specifically its lack of diversity and inherent biases—could compromise the reliability of AI-generated recommendations in clinical practice.
Furthermore, the continuous improvement in LLM technology means that the performance of an LLM noted in a reported study may not accurately reflect its current capabilities. For example, in December 2023, Google made some significant improvements to Bard, including renaming it to Gemini.27 Moreover,
In November 2023, Elon Musk’s xAI introduced a new AI model named Grok. The development team highlights that Grok distinguishes itself from existing LLMs by its ability to generate responses based on ’real-time knowledge,’ a feature which provides users with the most updated information.28
In conclusion, While LLMs offer substantial opportunities for advancing hematology, it is important to recognize that the field is still in its early stages, not yet addressing many diseases and potential tasks. Moreover, its development is limited by variability, reliability issues, and inaccuracies. As such, integrating LLMs into clinical practice must be navigated with careful consideration of both their potential and their pitfalls.
Balancing innovation with caution will be key to realizing the benefits of LLMs without compromising the integrity and human-centered nature of medical care.
Data Availability
All data produced in the present work are contained in the manuscript