Abstract
Background The rapid advancement of generative artificial intelligence (AI) has revolutionized the understanding and generation of human language. The integration of these models into healthcare has shown potential for improving medical diagnostics, yet the diagnostic performance of generative AI models has not been comprehensively evaluated, nor has it been extensively compared with that of physicians.
Methods In this systematic review and meta-analysis, a comprehensive search of Medline, Scopus, Web of Science, Cochrane Central, and medRxiv was conducted for studies published from June 2018 through December 2023, focusing on those that validate generative AI models for diagnostic tasks. Meta-analysis was performed to summarize the performance of the models and to compare the accuracy of the models with that of physicians. The quality of studies was assessed using the Prediction Model Study Risk of Bias Assessment Tool.
Results The search resulted in 54 studies being included in the meta-analysis, with 13 of these also used in the comparative analysis. Eight models were evaluated across 17 medical specialties. The overall accuracy for generative AI models across 54 studies was 57% (95% confidence interval [CI]: 51–63%). The I-squared statistic of 96% signifies a high degree of heterogeneity among the study results. Meta-regression analysis of generative AI models revealed significantly improved accuracy for GPT-4, and reduced accuracy for some specialties such as Neurology, Endocrinology, Rheumatology, and Radiology. The comparison meta-analysis demonstrated that, on average, physicians exceeded the accuracy of the models (difference in accuracy: 14% [95% CI: 8–19%], p-value <0.001). However, in the performance comparison between GPT-4 and physicians, GPT-4 performed slightly better than non-experts (–4% [95% CI: –10% to 2%], p-value = 0.173) and slightly worse than experts (6% [95% CI: –1% to 13%], p-value = 0.091). The quality assessment indicated a high risk of bias in the majority of studies, primarily due to small sample sizes.
Conclusions Generative AI exhibits promising diagnostic capabilities, with accuracy varying significantly by model and medical specialty. Although generative AI models have not yet reached the reliability of expert physicians, the findings suggest that they have the potential to enhance healthcare delivery and medical education, provided they are integrated with caution and their limitations are well understood. This study also highlights the need for more rigorous research standards and a larger number of cases in the future.
Introduction
In recent years, the advent of generative artificial intelligence (AI) has marked a transformative era in our society.1–8 These advanced computational systems have demonstrated exceptional proficiency in interpreting and generating human language, thereby setting new benchmarks in AI’s capabilities. Generative AI models, with their deep learning architectures, have rapidly evolved, showcasing a remarkable understanding of complex language structures, contexts, and even images. This evolution has not only expanded the horizons of AI but also opened new possibilities in various fields, including healthcare.9,10
The integration of generative AI models in the medical domain has spurred a growing body of research focusing on their diagnostic capabilities.11 Studies have extensively examined the performance of these models in interpreting clinical data, understanding patient histories, and even suggesting possible diagnoses.12,13 In medical diagnoses, the accuracy, speed, and efficiency of generative AI models in processing vast amounts of medical literature and patient information have been highlighted, positioning them as valuable tools. This research has begun to outline the strengths and limitations of generative AI models in diagnostic tasks in healthcare.
Despite the growing research on generative AIs in medical diagnostics, there remains a significant gap in the literature: a comprehensive meta-analysis of the diagnostic capabilities of the models, followed by a comparison of their performance with that of physicians.14 Such a comparison is crucial for understanding the practical implications and effectiveness of generative AI models in real-world medical settings. While individual studies have provided insights into the capabilities of generative AI models,12,13 a systematic review and meta-analysis is necessary to aggregate these findings and draw more robust conclusions about their comparative effectiveness against traditional diagnostic practices by physicians.
This paper aims to bridge the existing gap in the literature by conducting a meticulous meta-analysis of the diagnostic capabilities of generative AI models in healthcare. Our focus is to provide a comprehensive diagnostic performance evaluation of generative AI models and the comparison of their diagnostic performance with that of physicians. By synthesizing the findings from various studies, we endeavor to offer a nuanced understanding of the effectiveness, potential, and limitations of generative AI models in medical diagnostics. This analysis is intended to serve as a foundational reference for future research and practical applications in the field, ultimately contributing to the advancement of AI-assisted diagnostics in healthcare.
Methods
Protocol and Registration
This systematic review was prospectively registered with PROSPERO (CRD42023494733). Our study adhered to the relevant sections of guidelines from the Preferred Reporting Items for a Systematic Review and Meta-analysis (PRISMA) of Diagnostic Test Accuracy Studies.15,16 All stages of the review (title and abstract screening, full-text screening, data extraction, and assessment of bias) were performed in duplicate by two independent reviewers (H.Takita and D.U.), and disagreements were resolved by discussion with a third independent reviewer (H.Tatekawa).
Search Strategy and Study Selection
A search was performed to identify studies that validate a generative AI model for diagnostic tasks. A search strategy was developed, including variations of the terms generative AI and diagnosis. The search strategy was as follows: articles in English that included the words “large language model”, “LLM”, “generative artificial intelligence”, “generative AI”, “generative pre-trained transformers”1, “GPT”1, “Bing”, “Bard”, “PaLM”7,8, “Pathways Language Model”, “LaMDA”17, “Language Model for Dialogue Applications”, “Llama”5,6, or “Large Language Model Meta AI” and also “diagnosis”, “diagnostic”, “quiz”, “examination”, or “vignette” were included. We searched the following electronic databases for literature from June 2018 through December 2023: Medline, Scopus, Web of Science, Cochrane Central, and medRxiv. June 2018 represents when the first generative AI model was published.1 We included all articles that fulfilled the following inclusion criteria: primary research studies that validate a generative AI for diagnosis. We applied the following exclusion criteria to our search: review articles, case reports, comments, editorials, retracted articles, and those not related to diagnostic performance.
Data Extraction
Titles and abstracts were screened before full-text screening. Data were extracted using a predefined data extraction sheet. A count of excluded studies, including the reason for exclusion, was recorded in a PRISMA flow diagram.16 For the meta-analysis of generative AI performance, we extracted the following information from each study: first author, model and version, model task, test dataset type (internal, external, or unknown),18 medical specialty, accuracy, sample size, and publication status (preprint or peer-reviewed). For most generative AI models, only the training period was disclosed, with no information on which data were used for training. Therefore, when a generative AI model was tested with data from outside the training period, the test dataset type was classified as external; when it was tested with data that were publicly available during the training period, the type was classified as unknown. When both the model’s and the physicians’ diagnostic performance were presented in the same paper, we extracted both for the comparative analysis. We also considered the type of physician involved in relevant studies: physicians were classified as non-experts if they were trainees or residents, and as experts if they were beyond this stage in their career. When a single model used multiple prompts and individual performances were available in one article, we took their average.
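The extraction rules above can be illustrated with a minimal Python sketch. It is not the extraction sheet used in this review: the function names are hypothetical, and the sketch assumes that a model's training period can be summarized by a single cutoff date and that career stage is recorded as a plain string.

```python
from datetime import date
from statistics import mean


def classify_test_dataset(data_public_since: date, training_cutoff: date) -> str:
    """Label a test dataset relative to a model's training period.

    Assumption: the training period is summarized by one cutoff date.
    Data first made public after the cutoff count as external testing;
    data already public during training cannot be verified as unseen,
    so they are labeled unknown.
    """
    return "external" if data_public_since > training_cutoff else "unknown"


def classify_physician(career_stage: str) -> str:
    """Trainees and residents are non-experts; later career stages are experts."""
    return "non-expert" if career_stage.strip().lower() in {"trainee", "resident"} else "expert"


def average_prompt_accuracy(accuracies: list[float]) -> float:
    """Collapse several prompt-level accuracies for one model into one value."""
    return mean(accuracies)
```

For example, a quiz published in mid-2023 and tested on a model with a hypothetical September 2021 cutoff would be labeled external, whereas a case collection from 2019 would be labeled unknown.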
Quality Assessment
We used the Prediction Model Study Risk of Bias Assessment Tool (PROBAST) to assess papers for bias and applicability.19 This tool uses signaling questions in four domains (participants, predictors, outcomes, and analysis) to provide both an overall and a granular assessment. We did not include some PROBAST signaling questions because they are not relevant to generative AI models. Details of modifications made to PROBAST are in Appendix Table S1 (online).
Statistical Analysis
First, we conducted a meta-analysis of generative AI studies reporting accuracy data to estimate the pooled accuracy of the diagnostic performance. A meta-regression analysis was then performed on the accuracy of these models to identify sources of heterogeneity across studies, incorporating covariates such as model type, medical specialty, task of the model, type of test dataset, level of bias, and publication status. Second, we compared the diagnostic performance of generative AI models with that of physicians. For this analysis, we used the difference in accuracy, calculated by subtracting the models’ accuracy from that of the physicians, so that a positive difference indicates higher physician accuracy. We used an inverse-variance-weighted random-effects model, with the DerSimonian–Laird estimator for the between-study variance and normal approximation intervals based on summary measures for the confidence intervals (CIs) of individual study results. A random-effects model rather than a fixed-effects model was specified in the study protocol because heterogeneity among the included studies was expected. To assess publication bias, we used a funnel plot, evaluating effect size and standard error as described by Egger et al.20 Statistical significance was set at a P value below 0.05. All calculations were performed using R (version 4.0.0) with the ‘metafor’ package.
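For readers unfamiliar with the DerSimonian–Laird approach, the following Python sketch shows the textbook computation of the pooled estimate, the between-study variance, and the I-squared statistic. It is an illustration only and not the analysis code of this study, which was run in R with ‘metafor’. The function names are hypothetical, and the sketch assumes that accuracy is pooled on the raw proportion scale with a binomial variance; the scale actually used is not stated here.

```python
import math
from statistics import NormalDist


def accuracy_variance(accuracy, n_cases):
    """Binomial variance of a proportion (assumed scale: raw accuracy in [0, 1]).

    Note: an accuracy of exactly 0 or 1 gives zero variance and would
    need a continuity correction before pooling.
    """
    return accuracy * (1.0 - accuracy) / n_cases


def dersimonian_laird(effects, variances, level=0.95):
    """Inverse-variance random-effects pooling with the DerSimonian-Laird estimator."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two studies with matching variances")
    k = len(effects)
    # Fixed-effect weights and estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    # Method-of-moments estimate of the between-study variance, truncated at 0.
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # I-squared: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "se": se,
        "ci": (pooled - z * se, pooled + z * se),
        "tau2": tau2,
        "q": q,
        "i2": i2,
    }
```

With two hypothetical studies reporting accuracies of 0.4 and 0.6, each with variance 0.01, the sketch gives a pooled accuracy of 0.5, a between-study variance of 0.01, and an I-squared of 50%.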
Results
Study Selection and Characteristics
We identified 13,966 studies, of which 7,940 were duplicates. After screening, 54 studies were included in the meta-analysis of generative AI diagnostic performance12,13,21–72 and 13 studies in the comparative analysis between generative AI models and physicians (Figure 1 and Table 1).32,33,35–41,49,52,58,62 The most evaluated models were GPT-4 (31 articles; ref. 4) and GPT-3.5 (28), while models such as GPT-4V (6), PaLM2 (3; ref. 8), Llama 2 (2; ref. 6), Prometheus (2), GPT-3 (1; ref. 3), Glass (1), and Med-42 (1) had less representation. OpenAI developed GPT-3, GPT-3.5, GPT-4, and GPT-4V, several of which are accessible through ChatGPT. Google’s PaLM2 is implemented in its Bard system. Meta created Llama 2, and Med-42 is a fine-tuned version of Llama 2. Microsoft’s Bing incorporates Prometheus, which is based on OpenAI’s GPT technology. Lastly, Glass Health developed a model named Glass. The review spanned a wide range of medical specialties, with General medicine being the most common (14 articles). Other specialties like Radiology (10), Ophthalmology (8), Emergency medicine (5), Neurology (3), and Dermatology (3) were represented, as well as Gastroenterology, Cardiology, Pediatrics, Otolaryngology, Urology, Endocrinology, Gynecology, Orthopedic surgery, Rheumatology, Psychiatry, and Plastic surgery with one article each. Regarding model tasks, free text tasks were the most common, with 47 articles, followed by choice tasks at 13. For test dataset types, 40 articles involved external testing, while 14 were unknown due to the training data for the generative AI models being unknown. Of the included studies, 37 were peer-reviewed, while 17 were preprints. Study characteristics are shown in Table 1 and Appendix Table S2 (online).
Thirteen studies compared the performance of generative AI models with physicians.32,33,35–41,49,52,58,62 GPT-4 (8 articles) was the most frequently evaluated, followed by GPT-3.5 (7), GPT-4V (2), Llama 2 (1), and GPT-3 (1). Comparisons with both expert and non-expert physicians were found for GPT-4, GPT-3.5, GPT-4V, and GPT-3, whereas Llama 2 was compared only with experts. The studies covered a variety of medical specialties. Ophthalmology was the most frequently studied specialty with 4 articles, followed by Radiology with 3 articles. General medicine and Emergency medicine were evaluated in 2 articles each. Endocrinology and Urology were each represented once. For model tasks, free text tasks were more prevalent with 12 articles, whereas choice tasks were represented in 3 articles. Regarding test types, external testing was more common with 7 articles, compared to 6 articles of unspecified or unknown test types.
Quality Assessment
PROBAST assessment led to an overall rating of 45/54 (83%) studies at high risk of bias, 8/54 (15%) studies at low risk of bias, 10/54 (19%) studies at high concern for generalizability, and 44/54 (81%) studies at low concern for generalizability (Figure 2). The main reasons for these ratings were evaluation of models on a small test set and the inability to confirm external evaluation because the training data of the generative AI models are unknown. Detailed results are shown in Appendix Table S2 (online).
Meta-analysis for generative AI models
The pooled accuracy of generative AI models showed varied performance across different models and medical specialties (Figure 3 and Appendix Figures S1–3 [online]). The overall accuracy for generative AI models was found to be 57% with a 95% CI of 51–63%. The I-squared statistic of 96% signifies a high degree of heterogeneity among the study results. In the meta-regression analysis examining the performance of various generative AI models across different specialties, the results revealed differences in effectiveness (Table 2). For the models, GPT-4 was associated with significantly higher accuracy, with a coefficient of 26.1 (95% CI: 6.6 to 45.6, p = 0.009), while other models, such as GPT-3.5, GPT-4V, Llama 2, PaLM2 (Bard), and Prometheus (Bing), did not demonstrate significant results. Regarding the performance across specialties, wide variations were observed. The fields of Neurology, Endocrinology, Rheumatology, and Radiology displayed significant negative coefficients, with Neurology at –21.7 (95% CI: –41.2 to –2.1, p = 0.03), Endocrinology at –42.0 (95% CI: –61.3 to –22.6, p = 0.002), Rheumatology at –41.4 (95% CI: –78.5 to –4.3, p = 0.029), and Radiology at –24.9 (95% CI: –40.7 to –9.1, p <0.001). Other specialties such as Pediatrics, Gynecology, Urology, Otolaryngology, Orthopedic surgery, Ophthalmology, and Plastic surgery showed positive but non-significant coefficients. No significant heterogeneity was observed based on the risk of bias or on publication status. Overall, the meta-regression analysis indicates that among various generative AI models, GPT-4 significantly outperforms others, though performance varies considerably across medical specialties, with some specialties associated with significantly lower accuracy.
We assessed publication bias by using a regression analysis to quantify funnel plot asymmetry (Appendix Figure S1 [online]), which suggested a low risk of publication bias (p = 0.572).
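As a rough illustration of a regression test for funnel plot asymmetry, the Python sketch below implements the classical Egger formulation: the standardized effect is regressed on precision by ordinary least squares, and an intercept far from zero suggests asymmetry. This is an assumption about the general technique, not a reproduction of the regression run in ‘metafor’ for this study, whose exact model specification may differ. The function name is hypothetical, and the p-value is left out because it requires a t distribution that the Python standard library does not provide.

```python
import math


def egger_regression(effects, standard_errors):
    """Classical Egger test: OLS of (effect / SE) on (1 / SE).

    Returns the intercept (the asymmetry measure), the slope, the
    standard error of the intercept, the t statistic, and its degrees
    of freedom (k - 2).
    """
    k = len(effects)
    if k != len(standard_errors) or k < 3:
        raise ValueError("need at least three studies with matching standard errors")
    x = [1.0 / se for se in standard_errors]                     # precision
    y = [e / se for e, se in zip(effects, standard_errors)]      # standardized effect
    x_bar = sum(x) / k
    y_bar = sum(y) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance and the usual OLS standard error of the intercept.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (k - 2)
    se_intercept = math.sqrt(s2 * (1.0 / k + x_bar ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return {
        "intercept": intercept,
        "slope": slope,
        "se_intercept": se_intercept,
        "t": t_stat,
        "df": k - 2,
    }
```

In this formulation, a set of studies whose effects do not depend on their standard errors yields an intercept near zero, whereas small studies with systematically larger effects push the intercept away from zero.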
Meta-analysis comparing generative AI models and physicians
In our comparison meta-analysis, physicians generally outperformed generative AI models (Figure 4). Across all models, physicians’ accuracy was on average 14% higher than that of the models (95% CI: 8–19%, p < 0.001). Although physicians overall and experts specifically both outperformed GPT-4, the differences were not statistically significant (4% difference, 95% CI: –2% to 10%, p = 0.192 against physicians overall, and 6% difference, 95% CI: –1% to 13%, p = 0.091 against experts). In the comparison of GPT-4 with non-experts, GPT-4 showed a slight, but not statistically significant, advantage (difference of –4%, 95% CI: –10% to 2%, p = 0.173). GPT-3.5 was consistently outperformed by physicians, with performance 16% lower than that of physicians overall (95% CI: 7–24%, p < 0.001), 4% lower than that of non-experts (95% CI: 2–6%, p < 0.001), and 26% lower than that of experts (95% CI: 16–36%, p < 0.001). GPT-4V followed a pattern similar to that of GPT-3.5: its performance was 22% lower than that of physicians overall (95% CI: 1–43%, p = 0.039), 14% lower than that of non-experts (95% CI: –7% to 35%, p = 0.188), and 44% lower than that of experts (95% CI: 33–56%, p < 0.001). Llama 2 likewise showed 47% lower performance than experts (95% CI: 33–61%, p < 0.001).
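To make the sign convention of these differences concrete, the following Python sketch computes a per-study difference in accuracy with a normal-approximation confidence interval, where a positive value means that physicians were more accurate than the model. The function name and the counts in the example are hypothetical, and the variance formula assumes that the physician and model results are independent samples, which ignores any pairing when both are evaluated on the same cases.

```python
import math
from statistics import NormalDist


def accuracy_difference(physician_correct, physician_total,
                        model_correct, model_total, level=0.95):
    """Physician accuracy minus model accuracy with a normal-approximation CI.

    Assumption: the two accuracies are treated as independent binomial
    proportions, so their variances add.
    """
    p_phys = physician_correct / physician_total
    p_model = model_correct / model_total
    diff = p_phys - p_model
    variance = (p_phys * (1.0 - p_phys) / physician_total
                + p_model * (1.0 - p_model) / model_total)
    se = math.sqrt(variance)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff, se, (diff - z * se, diff + z * se)


# Hypothetical study: physicians 70/100 correct, model 50/100 correct.
diff, se, ci = accuracy_difference(70, 100, 50, 100)
```

Per-study differences and standard errors of this kind are the inputs that a random-effects model then pools across studies.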
Discussion
In this systematic review and meta-analysis, we analyzed the diagnostic performance of generative AI and physicians. We initially identified 13,966 studies, ultimately including 54 in the meta-analysis and 13 in the comparative analysis with physicians. The study spanned various AI models and medical specialties, with GPT-4 being the most evaluated. Quality assessment revealed a majority of studies at high risk of bias. The meta-analysis showed a pooled accuracy of 57% (95% CI: 51–63%) for generative AI models. Meta-regression analysis highlighted significant differences in effectiveness of different AI models across medical fields. The comparative analysis revealed that physicians generally outperformed AI models, although in non-expert settings, some AI models showed comparable performance. To the best of our knowledge, this is the first meta-analysis of generative AI models in diagnostic tasks. This comprehensive study highlights the varied capabilities and limitations of generative AI in medical diagnostics.
The meta-analysis of generative AI models in healthcare reveals crucial insights for clinical practice. Despite the overall modest accuracy of 57% for generative AI models in medical applications, the significantly higher accuracy of GPT-4 suggests its potential utility in certain clinical scenarios. The variation in effectiveness across specialties, particularly the lower effectiveness in fields like Neurology, Endocrinology, Rheumatology, and Radiology, underscores the need for cautious implementation and further refinement of AI models in these areas. The data indicate that the knowledge of generative AI models is stronger in some medical specialties than in others; with an understanding of these characteristics, the models have the potential to function as valuable support tools in medical settings. Importantly, the close performance of GPT-4 to physicians in non-expert scenarios highlights the possibility of AI augmenting healthcare delivery in resource-limited settings or as a preliminary diagnostic tool, thereby potentially increasing accessibility and efficiency in patient care.73,74
The comparison between generative AI and physician performances, particularly in the context of medical education, offers intriguing perspectives.75 The overall higher accuracy of physicians compared to AI models emphasizes the irreplaceable value of human judgement and experience in medical decision-making. However, the comparable performance of GPT-4 and physicians in non-expert settings reveals an opportunity for integrating AI into medical training. This could include using AI as a teaching aid for medical students and residents, especially in simulating non-expert scenarios where AI’s performance is nearly equivalent to that of healthcare professionals.76 Such integration could enhance learning experiences, offering diverse clinical case studies and facilitating self-assessment and feedback. Additionally, the narrower performance gap between GPT-4 and physicians even in expert settings suggests that AI could be used to supplement advanced medical education, helping to identify areas for improvement and providing supporting information. This approach could foster a more dynamic and adaptive learning environment, preparing future medical professionals for an increasingly digital healthcare landscape.
Although accuracy did not differ significantly by risk of bias, the PROBAST quality assessment reveals a high risk of bias in 83% of studies.19 This raises significant concerns about the reliability of current generative AI research in healthcare and highlights the crucial need for rigorous and transparent methodologies, including extensive external evaluation to assess real-world performance accurately.77 Moreover, the transparency of training data and its collection period is paramount. Without this transparency, it is impossible to determine whether the test dataset is an external dataset or not. Such transparency ensures an understanding of the model’s knowledge, context, and limitations, aids in identifying potential biases, and facilitates independent replication and validation, which are fundamental to scientific integrity. As generative AI models continue to evolve, fostering a culture of rigorous transparency is essential to ensure their safe, effective, and equitable application in clinical settings,78 ultimately enhancing the quality of healthcare delivery and medical education.
The methodology of this study, while comprehensive, has limitations. This meta-analysis involved primary studies with considerable heterogeneity. The performance of generative AI models might vary significantly in real-world scenarios, which are often more complex than research settings. There were not many studies that compared generative AI and physicians using the same sample. Future research should focus on addressing the identified limitations. This includes conducting studies with more diverse datasets, exploring the performance of generative AI models in varied clinical environments, and examining their impact on different patient demographics. Additionally, longitudinal studies assessing the long-term efficacy and impact of generative AI models in clinical practice would be valuable.
In conclusion, this meta-analysis provides a nuanced understanding of the capabilities and limitations of generative AI in medical diagnostics. While generative AI models, particularly advanced iterations like GPT-4, have shown progressive improvements and hold promise for assisting in diagnosis, their effectiveness remains highly variable across different models and medical specialties. With an overall moderate accuracy of 57%, generative AI models are not yet reliable substitutes for expert physicians but may serve as valuable aids in non-expert scenarios and as educational tools for medical trainees. The findings also underscore the need for continued advancements and specialization in model development, as well as rigorous, externally validated research to overcome the prevalent high risk of bias and ensure generative AIs’ effective integration into clinical practice. As the field evolves, continuous learning and adaptation for both generative AI models and medical professionals are imperative, alongside a commitment to transparency and stringent research standards. This approach will be crucial in harnessing the potential of generative AI models to enhance healthcare delivery and medical education while safeguarding against their limitations and biases.
Funding
There was no funding provided for this study.
Role of the Sponsor
There was no funding provided for this study. The corresponding author had full access to all data in the study and final responsibility for the decision to submit the report for publication.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
IRB Approval
Not applicable.
Disclosures
The authors have nothing to disclose.
Reproducible Research Statement
Study protocol and metadata are available from Dr. Ueda (e-mail, ai.labo.ocu{at}gmail.com).
Acknowledgement
We utilized ChatGPT for assistance with parts of the English proofing.