Abstract
Background Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. Past research, however, has not had the scale and accuracy for use in clinical decision making. The power of large language models (LLMs) may be changing this. In this study, we evaluated the performance and interpretability of Generative Pre-trained Transformer 4 Vision (GPT-4V), a multimodal LLM, on medical licensing examination questions with images.
Methods We used three sets of multiple-choice questions with images from United States Medical Licensing Examination (USMLE), USMLE question bank for medical students (AMBOSS), and Diagnostic Radiology Qualifying Core Exam (DRQCE) to test GPT-4V’s accuracy and explanation quality. We compared GPT-4V with two other large language models, GPT-4 and ChatGPT, which cannot process images. We also assessed the preference and feedback of healthcare professionals on GPT-4V’s explanations.
Results GPT-4V achieved high accuracies on USMLE (86.2%), AMBOSS (62.0%), and DRQCE (73.1%), outperforming ChatGPT and GPT-4 by average relative increases of 131.8% and 64.5%, respectively. GPT-4V scored in the 70th-80th percentile among AMBOSS users preparing for the exam. GPT-4V also passed the full USMLE exam with an accuracy of 90.7%. GPT-4V’s explanations were preferred by healthcare professionals when it answered correctly, but they revealed several issues such as image misunderstanding, text hallucination, and reasoning error when it answered incorrectly.
Conclusion GPT-4V showed promising results for medical licensing examination questions with images, suggesting its potential for clinical decision support. However, GPT-4V needs to improve its explanation quality and reliability for clinical use.
1-2 sentence description AI models offer potential as imaging diagnostic support tools, but their performance and interpretability are often unclear. Here, the authors show that GPT-4V, a large multimodal language model, can achieve high accuracy on medical licensing exams with images, but also reveal several issues in its explanation quality.
Introduction
Using computers to help make clinical diagnoses and guide treatments has been a goal of artificial intelligence since its inception.1 The adoption of electronic health record (EHR) systems by hospitals in the US has resulted in an unprecedented amount of digital data associated with patient encounters. Computer-assisted clinical diagnostic support systems (CDSS) endeavor to enhance clinicians’ decisions with patient information and clinical knowledge.2 There is burgeoning interest in CDSS for enhanced imaging3, often termed radiomics, in various applications such as breast cancer detection4, COVID-19 detection5, congenital cataract diagnosis6, and hidden fracture localization7. For a decision to be trustworthy for clinicians, a CDSS should not only make the prediction but also provide accurate explanations.8–10 However, most previous imaging CDSS only highlight areas deemed significant by the AI,11–15 providing limited insight into the explanation of the diagnosis.16
Large language models (LLMs) can generate explanations because they are trained with reinforcement learning from human feedback to follow user requests, including requests to explain an answer. Typical LLM examples include Chat Generative Pre-trained Transformer (ChatGPT), a renowned chatbot released by OpenAI in November 2022, and its successor Generative Pre-trained Transformer 4 (GPT-4), released in March 2023. The influence of ChatGPT is attributed to its conversational prowess and its performance, which approaches or matches human-level competence in cognitive tasks spanning various domains including medicine.17 ChatGPT has achieved commendable results in the United States Medical Licensing Examinations, leading to discussions about the readiness of LLM applications for integration into clinical18–20 and educational21–23 environments.
One limitation of ChatGPT is that it can only read and generate text and cannot process other data modalities, such as images. This limitation, known as “single-modality,” is common among LLMs.24,25 Advancements in multimodal LLMs promise enhanced capabilities and integration with diverse data sources.26,27 OpenAI’s recent introduction of GPT-4V is a step toward bridging this divide. GPT-4V, a state-of-the-art multimodal LLM, is equipped with visual processing ability, enabling it to understand and describe visual content.28 By incorporating GPT-4V into current imaging CDSS, physicians could ask open-ended questions pertaining to a patient’s medical evaluation, taking into account all available information including images, symptoms, and lab results. This would allow for an interactive experience in which the AI suggests both a decision and an explanation to support physicians.
However, the ability of GPT-4V to analyze medical images remains unknown. For users to have sufficient confidence in its responses, GPT-4V must perform comparably to humans on assessments of medical knowledge and reasoning. In this work, we aim to assess GPT-4V’s performance on medical licensing examination questions with images and to analyze how interpretable its explanations are to healthcare professionals.
Method
This cross-sectional study compared the performance of GPT-4V, GPT-4, and ChatGPT on medical licensing examination question answering. It also investigated the quality of GPT-4V’s explanations in answering these questions. The study protocol was deemed exempt by the Institutional Review Board at the VA Bedford Healthcare System, and informed consent was waived due to minimal risk to patients. This study was conducted in October 2023.
Medical Exam Data Collection
We obtained study questions from three sources. The United States Medical Licensing Examination (USMLE) consists of three steps required to obtain a medical license in the United States. The USMLE assesses a physician’s ability to apply knowledge, concepts, and principles, which is critical to both health and disease management and is the foundation for safe, efficient patient care. The Step1, Step2 Clinical Knowledge (CK), and Step3 sample exams released by the National Board of Medical Examiners (NBME) consist of 119, 120, and 137 questions, respectively. Each question contained multiple options to choose from. We then selected all questions with images, resulting in 19, 13, and 18 questions from Step1, Step2 CK, and Step3, respectively. Disciplines include but are not limited to radiology, dermatology, orthopedics, ophthalmology, and cardiology.
The sample exams included only a limited number of questions with images. Thus, we further collected similar questions from AMBOSS, a widely used question bank for medical students that is not publicly available and requires registration. AMBOSS provides performance data from its past users, which enabled us to assess the model’s effectiveness relative to students. For each question, AMBOSS provided an expert-written hint to help the student answer the question and a difficulty level ranging from 1 to 5. Levels 1, 2, 3, 4, and 5 represent the easiest 20%, 20%-50%, 50%-80%, 80%-95%, and 95%-100% of questions, respectively.29 We randomly selected 10 questions from each of the 5 difficulty levels and repeated this process for Step1, Step2 CK, and Step3, resulting in a total of 150 questions.
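The sampling scheme described above (10 questions per difficulty level, for each of the three steps) can be sketched as a stratified random draw. This is only an illustration: the pool of question identifiers, the `sample_questions` helper, and the fixed seed are assumptions made for the example, not details reported in the study.

```python
import random

STEPS = ["Step1", "Step2CK", "Step3"]
LEVELS = [1, 2, 3, 4, 5]   # AMBOSS difficulty levels
PER_STRATUM = 10           # questions drawn per (step, level) pair

def sample_questions(pool, seed=0):
    """Draw PER_STRATUM question ids without replacement from every (step, level) stratum.

    `pool` maps (step, level) to a list of candidate question ids (hypothetical here).
    """
    rng = random.Random(seed)
    return {
        (step, level): rng.sample(pool[(step, level)], PER_STRATUM)
        for step in STEPS
        for level in LEVELS
    }

# Hypothetical pool with 40 placeholder ids per stratum.
pool = {
    (step, level): [f"{step}-L{level}-Q{i}" for i in range(40)]
    for step in STEPS
    for level in LEVELS
}
selected = sample_questions(pool)
total = sum(len(ids) for ids in selected.values())
print(total)  # 3 steps x 5 levels x 10 questions = 150
```

Sampling a fixed number per stratum, rather than drawing 150 questions at random from the whole bank, keeps every difficulty level equally represented, which is what allows accuracy to be compared across levels later.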
The Diagnostic Radiology Qualifying Core Exam (DRQCE), offered after 36 months of residency training, is an image-rich exam to evaluate a candidate’s core fund of knowledge and clinical judgment across practice domains of diagnostic radiology. DRQCE is not publicly available and requires registration. We collected 26 questions with images from the preparation exam offered by the American Board of Radiology (ABR). Thus, we had a total of 226 questions with images from the three sources. To illustrate GPT-4V’s potential as an imaging diagnostic support tool, we modified a patient case report30 to resemble a typical “curbside consult” question between medical professionals.31
How to Answer Image Questions using GPT-4V Prompt
GPT-4V took image and text data as inputs to generate textual outputs. Given that the input format (prompt) plays a key role in optimizing model performance, we followed the standard prompting guidelines for the visual question-answering task. Specifically, we prompted GPT-4V by first adding the image, then appending the context (i.e., patient information) and question, and finally providing the multiple-choice options, each on a new line. An example user prompt and GPT-4V response are shown in Figure 1. When an image contained multiple sub-images, we uploaded each sub-image to GPT-4V. We did not use the hint in the prompt unless otherwise specified. The response consisted of the selected option as an answer, supported by a textual explanation of that choice. When using ChatGPT and GPT-4, which cannot handle image data, images were omitted from the prompt. Responses were collected from the September 25, 2023 version of the models. Each question was manually entered into the ChatGPT website independently (in a new chat window).
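The prompt layout described above (image first, then context and question, then one option per line) can be sketched as follows. Because the questions were entered manually on the ChatGPT website, this is not the authors' code or any vendor API: the `build_prompt` function, the part dictionaries, the option label format, and the file names are all hypothetical and serve only to make the ordering explicit.

```python
def build_prompt(image_paths, context, question, options, include_images=True):
    """Return an ordered list of prompt parts: images first, then one text block.

    The text block holds the context and question, followed by the
    multiple-choice options, each on its own line. Setting
    include_images=False mimics the text-only setting used for models
    that cannot process images.
    """
    parts = []
    if include_images:
        # One part per sub-image, in the order given.
        parts.extend({"type": "image", "path": path} for path in image_paths)
    option_lines = [f"({label}) {text}" for label, text in options]
    text = "\n".join([f"{context} {question}", *option_lines])
    parts.append({"type": "text", "text": text})
    return parts

# Placeholder content only; not a question from any of the exams.
options = [("A", "<option A>"), ("B", "<option B>"), ("C", "<option C>")]
multimodal = build_prompt(
    ["panel_a.png", "panel_b.png"],
    "<patient information>",
    "<question>",
    options,
)
text_only = build_prompt(
    ["panel_a.png", "panel_b.png"],
    "<patient information>",
    "<question>",
    options,
    include_images=False,
)
print(multimodal[-1]["text"])
```

Keeping the text block identical in both settings means any accuracy difference between the multimodal and text-only runs can be attributed to the image rather than to a change in wording.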
Evaluation Metrics
For answer accuracy, we evaluated the model’s performance by comparing the model’s choice with the correct choice provided by the exam board or question bank website. We defined accuracy as the ratio of the number of correct choices to the total number of questions.
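As a minimal sketch of this definition, accuracy is the fraction of questions for which the model's selected option matches the answer key. The `accuracy` helper and the four placeholder answers below are illustrative assumptions, not study data.

```python
def accuracy(model_choices, correct_choices):
    """Fraction of questions where the model's choice equals the answer key."""
    if len(model_choices) != len(correct_choices):
        raise ValueError("choices and answer key must have the same length")
    if not correct_choices:
        raise ValueError("at least one question is required")
    n_correct = sum(m == c for m, c in zip(model_choices, correct_choices))
    return n_correct / len(correct_choices)

# Placeholder answers for four hypothetical questions.
example = accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"])
print(f"{example:.1%}")  # 3 of 4 correct
```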
We also evaluated the quality of the explanations by collecting preferences from 3 healthcare professionals (one medical doctor, one registered nurse, and one medical student). For each question from the AMBOSS dataset (n=150), we asked the healthcare professionals to choose their preference between an explanation by GPT-4V, an explanation by an expert, or a tie.
Additionally, we asked the healthcare professionals to evaluate whether GPT-4V’s explanations were sufficient and comprehensive.32,33 They determined whether the following information was present in each explanation:
Image interpretation: GPT-4V tried to interpret the image in the explanation, and the interpretation was sufficient to support its choice.
Question information: The explanation contained information related to the textual context (i.e., patient information) of the question, and this information was essential for GPT-4V’s choice.
Comprehensive explanation: The explanation included comprehensive reasoning over all available evidence (e.g., symptoms, lab results) that led to the final answer.
Finally, for each question answered incorrectly, we asked healthcare professionals to check if the explanation contained any of the following errors:
Image misunderstanding: a sentence in the explanation showed an incorrect interpretation of the image. Example: GPT-4V said that a bone in the image belonged to the hand, but it belonged to the foot.
Text hallucination: a sentence in the explanation contained an incorrect statement about something other than the image. Example: Claiming Saxenda was just insulin.
Reasoning error: a sentence did not properly convert the information (either image or text) into an answer. Example: GPT-4V identified a patient trip occurring in the last 3 months, and although Chagas disease usually develops 10-20 years after infection, it still diagnosed the patient as having Chagas disease.
Non-medical error: any error that is not medical in nature. GPT is known to struggle with tasks requiring precise spatial localization, such as identifying chess positions on the board, and with calculations, such as 1 + 1 = ?.
Statistical Analysis
GPT-4V’s accuracies on the AMBOSS dataset were compared across difficulty levels using unpaired chi-square tests with a significance level of 0.05. All analyses were conducted in Python (version 3.10.11).
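A chi-square comparison of this kind can be sketched in plain Python. The exact test configuration used in the study is not stated, so the following assumes a standard Pearson chi-square test of independence on a 2 x 5 table (correct or incorrect by difficulty level). The counts are placeholders invented for illustration, with 30 questions per level, and are not the values in Table 2.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# PLACEHOLDER counts (not study data): columns are difficulty levels 1..5.
placeholder_table = [
    [25, 22, 18, 14, 9],   # correct answers per level
    [5, 8, 12, 16, 21],    # incorrect answers per level
]
statistic, dof = chi_square_statistic(placeholder_table)

# Critical value of the chi-square distribution for dof = 4 at alpha = 0.05.
CRITICAL_VALUE_DOF4 = 9.488
print(f"chi2 = {statistic:.2f}, dof = {dof}, "
      f"significant = {statistic > CRITICAL_VALUE_DOF4}")
```

In practice a library routine such as `scipy.stats.chi2_contingency` would typically be used instead, since it also returns the p-value directly; the hand-written version is shown only to make the computation explicit.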
Results
Overall Answer Accuracy
For questions with images, GPT-4V achieved accuracies of 84.2%, 85.7%, and 88.9% on Step1, Step2CK, and Step3 of the USMLE, respectively, outperforming ChatGPT and GPT-4 by 42.1 and 21.1 percentage points on Step1, 35.7 and 21.4 on Step2CK, and 38.9 and 22.2 on Step3 (Table 1). Similarly, on DRQCE, GPT-4V achieved an accuracy of 73.1%, outperforming ChatGPT (19.2%) and GPT-4 (26.9%).
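Because this paper reports both accuracy differences and relative increases, it may help to see the two computed side by side. The sketch below uses the DRQCE accuracies quoted above; the helper names are illustrative, and the resulting figures are derived here for illustration rather than quoted from the paper's tables.

```python
def absolute_gain(new, baseline):
    """Difference between two accuracies, in percentage points."""
    return new - baseline

def relative_increase(new, baseline):
    """Gain expressed as a percentage of the baseline accuracy."""
    return (new - baseline) / baseline * 100

# DRQCE accuracies (%) as reported in the text.
gpt4v = 73.1
baselines = {"ChatGPT": 19.2, "GPT-4": 26.9}

gains = {
    name: (absolute_gain(gpt4v, acc), relative_increase(gpt4v, acc))
    for name, acc in baselines.items()
}
for name, (points, relative) in gains.items():
    print(f"{name}: +{points:.1f} percentage points, +{relative:.1f}% relative")
```

A gain of about 54 percentage points over ChatGPT corresponds to a relative increase of roughly 281%, which illustrates why relative figures can look much larger than the underlying accuracy differences when the baseline is low.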
For all questions in the USMLE sample exam (including those without images), GPT-4V achieved accuracies of 88.2%, 90.8%, and 92.7% on Step1, Step2CK, and Step3, respectively (Table 1), exceeding the USMLE passing standard (about 60%). However, it achieved limited accuracy in several medical disciplines such as anatomy (25.0%), emergency medicine (25.0%), and pathology (50.0%). A grasp of images is essential to correctly answer the majority of questions in these disciplines.
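The per-step accuracies can be pooled into a single full-exam figure by weighting each step by its number of questions (119, 120, and 137, as given in the data collection section). The per-step counts of correct answers are not stated in the text; the sketch below infers them by rounding, which is an assumption made for illustration.

```python
# Question counts per step and reported accuracies (%) from the text.
totals = {"Step1": 119, "Step2CK": 120, "Step3": 137}
reported = {"Step1": 88.2, "Step2CK": 90.8, "Step3": 92.7}

# ASSUMPTION: correct-answer counts are inferred by rounding, not reported.
correct = {step: round(reported[step] / 100 * totals[step]) for step in totals}

# Pooled accuracy weights every question equally across the three steps.
pooled = 100 * sum(correct.values()) / sum(totals.values())
print(correct, f"pooled = {pooled:.1f}%")
```

Under this assumption the pooled value agrees with the 90.7% full-exam accuracy reported in the abstract. Note that a simple unweighted mean of the three percentages would differ slightly, because Step3 contributes more questions than the other two steps.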
Accuracy Decreases When Difficulty Increases
When GPT-4V was asked questions without the hint, it achieved accuracies of 60%, 64%, and 66% on AMBOSS Step1, Step2CK, and Step3, respectively. GPT-4V was in the 72nd, 76th, and 80th percentile among AMBOSS users who were preparing for Step1, Step2CK, and Step3, respectively. Table 2 shows a decreasing trend in performance as question difficulty increased in the AMBOSS dataset (P<0.05). However, the decreasing trend was not observed when GPT-4V was given the hint. Out of 55 wrong answers without the hint, 17 were corrected when the hint was provided. An example and detailed analysis are provided in supplementary material figure 1.
Quality of Explanation
We first evaluated the healthcare professionals’ preference between GPT-4V-generated and expert-generated explanations. When GPT-4V answered incorrectly, it was no surprise that healthcare professionals overwhelmingly preferred expert explanations, as shown in Table 3. When GPT-4V answered correctly, votes favoring the expert explanation exceeded votes favoring GPT-4V by only 4, out of a total of 95 votes.
We further evaluated the quality of the GPT-4V-generated explanations by verifying whether each explanation included interpretation of the image and of the question text (Table 4). Among the 95 correct answers, 84.2% (n=80) of the responses contained an interpretation of the image, while 96.8% (n=92) aptly captured the information presented in the question. On the other hand, among the 55 incorrect answers, 92.8% (n=51) interpreted the image, and 89.1% (n=49) depicted the question’s details. In terms of explanation comprehensiveness, GPT-4V offered a comprehensive explanation in 79.0% (n=75) of correct responses. In contrast, only 7.2% (n=4) of the wrong responses had a comprehensive explanation that led to the final choice.
We also evaluated the explanations accompanying GPT-4V’s incorrect answers across the 4 error types outlined above: image misunderstanding, text hallucination, reasoning error, and non-medical error. Among questions with wrong answers (n=55), we found that 76.3% (n=42) of questions included misunderstanding of the image, 45.5% (n=25) included a reasoning error, 18.2% (n=10) included text hallucination, and none included non-medical errors.
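The counts above can be tallied to show that the error types are not mutually exclusive: the labels sum to more than the number of incorrect answers, so some explanations must carry more than one error type. The sketch below uses only the counts reported in the text; the variable names are illustrative, and recomputed percentages may differ from the reported ones in the last decimal place depending on rounding.

```python
n_incorrect = 55  # questions answered incorrectly

# Number of incorrect answers whose explanation showed each error type.
error_counts = {
    "image misunderstanding": 42,
    "reasoning error": 25,
    "text hallucination": 10,
    "non-medical error": 0,
}

# Share of incorrect answers affected by each error type (%).
shares = {name: 100 * count / n_incorrect for name, count in error_counts.items()}

# If labels were mutually exclusive, they could sum to at most n_incorrect.
total_labels = sum(error_counts.values())
extra_labels = total_labels - n_incorrect

for name, share in shares.items():
    print(f"{name}: {error_counts[name]}/{n_incorrect} ({share:.1f}%)")
print(f"labels assigned: {total_labels}, exceeding incorrect answers by {extra_labels}")
```

Treating the annotation as multi-label matters when interpreting the results: the percentages describe how often each error type appears, not a partition of the 55 incorrect answers into single causes.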
Case Study on Consult Conversation
The consultation conversation, regarding a 45-year-old woman with hypertension, fatigue, and altered mental status, is provided in supplementary material figure 2. We found that the interactive design of GPT-4V allowed physicians to seek additional information by posing follow-up questions. Specifically, GPT-4V initially provided an irrelevant response when asked to interpret the CT scan. However, it was able to adjust its response and accurately identify the potential medical condition depicted in the image after receiving a physician’s visual hint: an arrow pointing to the part of the CT scan that the physician wanted GPT-4V to analyze.
Through comparing GPT-4V’s responses with the case report, we also found that GPT-4V generally offered responses that were clear and coherent. When asked about the differential diagnosis, GPT-4V listed 3 conditions (Primary Aldosteronism, Hypertension, and Cushing’s Syndrome) and explained why each should be considered; a medical doctor deemed these explanations relevant. Following a query about the subsequent steps to ascertain the origin of the anomaly, GPT-4V recommended a PET-CT scan. Utilizing the patient’s PET-CT scan, it was able to locate a tumor in the mediastinum, lending credence to the suspicion of Cushing’s Syndrome. Finally, GPT-4V asked for further studies, such as a biopsy of the mass, to confirm the diagnosis.
Discussion
A prevailing research direction involves thoroughly investigating ChatGPT’s ability to handle a diverse set of medical examination questions.21,34–36 GPT-4 was recently introduced as the upgraded model for ChatGPT. Studies showed that GPT-4 outperforms ChatGPT in various medical tasks.31,37 The collective insights illustrated the power of ChatGPT and GPT-4 in medical exam answering and the potential for medical decision support. However, previous evaluations were limited to questions without images.
In this study on the evaluation of medical exam questions with images, we found that GPT-4V selected more correct choices than ChatGPT and GPT-4, as shown in Table 1. Consequently, when evaluated on all questions in the USMLE sample exam, GPT-4V achieved an accuracy of 90.7%, outperforming ChatGPT (58.5%) and GPT-4 (83.8%). The passing standard for the USMLE was typically set at 60%, indicating that GPT-4V performed at a level similar to a medical graduate in the final year of study. The accuracy of GPT-4V not only highlights its grasp of the biomedical and clinical sciences essential for medical practice but also showcases its ability in patient management and problem solving,38 both of which indicate potential for clinical routines such as summarizing radiology reports39 and differential diagnosis40,41. We further explored GPT-4V’s potential in CDSS through a consultation conversation. When physicians posed follow-up queries and provided visual hints, GPT-4V showed promise in CDSS applications, including CT scan interpretation, differential diagnosis, and follow-up exam recommendation. However, the quality of GPT-4V’s explanations in support of its clinical decision making remained an open question, which motivated the next part of our analysis.
In terms of explanation quality, we found that more than 80% of responses from GPT-4V provided an interpretation of both the image and the question in support of its answer selection, regardless of correctness. This suggested that GPT-4V consistently took into account both the image and question elements while generating responses. Figure 1 illustrates an example of a high-quality explanation that utilizes both text and image in answering a hard question. More than 70% of students answered incorrectly on the first try, because both bacterial pneumonia and pulmonary embolism may involve symptoms such as cough. To differentiate them, GPT-4V correctly interpreted the X-ray as showing the radiologic sign of Hampton hump, which further increased the suspicion of pulmonary infarction rather than pneumonia.42 To show the need for the X-ray as mentioned in the explanation, we removed the image from the input, and GPT-4V switched the answer to bacterial pneumonia while also acknowledging the possibility of pulmonary infarction. This change in response demonstrated the high quality of the GPT-4V explanation, as its explanation about the X-ray was not fictional and it truly needed the X-ray to answer this question. The high quality of GPT-4V explanations was also supported by the healthcare professionals’ preference voting. When comparing explanations generated by experts with ones generated by GPT-4V, the margin by which the expert explanation was preferred over GPT-4V’s was minimal (4 votes) when GPT-4V correctly answered the question (n=95). Previous studies have shown limited utilization of CDSS, as most of them offered limited decision explanation and thus gained less trust among physicians than the physicians’ own colleagues.43–46 Thus, GPT-4V could enhance the effectiveness and trustworthiness of CDSS by providing high-quality, expert-preferred explanations, encouraging broader adoption and more confident utilization among physicians.
We also found that the accuracy of GPT-4V was related to the comprehensiveness of the explanation. GPT-4V offered a comprehensive explanation in 79.0% of correct responses. In contrast, only 7.2% of the wrong answers had a comprehensive explanation. Thus, the absence of key information may be the cause of inaccurate answers. These observations suggested that model performance could be enhanced by training GPT-4V with more intricate, clinical-specific data, including resources that carry insights from experienced physicians such as UpToDate,47 and medical research literature such as PubMed.26
Image misunderstanding was the primary reason why GPT-4V answered incorrectly. Out of 55 wrong responses, 42 (76.3%) involved misunderstanding of the image. In comparison, only 10 (18.2%) of the mistakes involved text hallucination. GPT-4V’s proficiency in processing images lagged considerably behind its text-handling capability. To circumvent its image interpretation issue, we additionally prompted GPT-4V with a short hint that described the image. We found that 40.5% (17 out of 42) of these responses switched to the correct answer. Corrections from the hint indicated that GPT-4V could be easily persuaded. Within a conversational interface, medical professionals can readily guide and refine GPT-4V’s initial outputs. This adaptability could be advantageous for physicians, as it allows for real-time adjustments and ensures that the generated information aligns more closely with the clinical context or the specific details of a patient’s case.47 With customized hints from physicians, GPT-4V’s utility and reliability as an auxiliary tool could be enhanced.
Another drawback of GPT-4V involved its tendency to produce factually inaccurate responses, a problem often referred to as the hallucination effect, which is prevalent among many large language models, including GPT-4V.48 We found that more than 18% of GPT-4V’s explanations for incorrectly answered questions contained hallucinations. Thus, when designing clinical support tools for high-risk situations such as patient diagnosis, it is crucial to integrate GPT-4V with a probabilistic model that reports a confidence interval indicating the reliability of the response.49 This would help identify CDSS responses for which additional physician review is warranted.16
Limitations
This study has several limitations. First, our findings are constrained in their applicability due to the modest sample size. We gathered 226 questions that included images, which might not comprehensively represent all medical disciplines. Second, while GPT-4V has demonstrated proficiency on medical licensing examinations, its real-world applicability, especially in dynamic, user-interactive scenarios, remains untested. Therefore, while the results are promising, extrapolating the efficacy of GPT-4V to broader clinical applications requires appropriate benchmarks and further research.20
Conclusion
While GPT-4V showcased remarkable accuracy across a spectrum of medical disciplines and varying difficulty levels in this study, further refinement is needed, particularly in its explanatory capabilities, prior to any clinical integration. Medical students and professionals must be acutely aware of its limitations and consistently cross-verify with authoritative sources. Notably, state-of-the-art LLMs like GPT-4V, even with their sophisticated capabilities, are not yet on the brink of supplanting physicians. Their performance in specialized examinations, though noteworthy, still exhibits imperfections, which can lead to consequential inaccuracies and uncertainties. Coupled with the well-known ethical concerns, the credibility and readiness of LLMs for clinical settings remain under scrutiny. However, the preliminary results are promising, suggesting that the present technology is poised to influence clinical practices. As research and development persist, we anticipate a more extensive and profound integration of AI in the medical domain.
Conflicts of Interest
The authors declare no conflicts of interest.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.