Abstract
Importance Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. However, little of this research has had the scale and accuracy needed to be turned into clinical practice. The power of large language models (LLMs) may now turn the tide. In this study, we evaluated the accuracy of the newly released Generative Pre-trained Transformer 4 with vision (GPT-4V), a large multimodal model trained to analyze image inputs together with text instructions from the user, on medical licensing examination questions. To our knowledge, this study is the first to evaluate GPTs for interpreting medical images.
Objective This study aimed to evaluate the performance of GPT-4V on medical licensing examination questions with images, and to analyze the interpretability of its explanations.
Design, Setting, and Participants We used 3 sets of multiple-choice questions with images to evaluate GPT-4V’s performance. The first set was the United States Medical Licensing Examination (USMLE) sample questions from the National Board of Medical Examiners (NBME) for Step1, Step2CK, and Step3. The second set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on exam performance relative to its user base. The third set was the Diagnostic Radiology Qualifying Core Exam (DRQCE) from the American Board of Radiology. The study (including data analysis) was conducted from September to October 2023.
Main Outcomes and Measures The choice accuracy of GPT-4V was compared with that of two other large language models, GPT-4 and ChatGPT. GPT-4V’s explanations were evaluated across 4 qualitative metrics: image misunderstanding, text hallucination, reasoning error, and non-medical error.
Results On the 3 exams with images, NBME, AMBOSS, and DRQCE, GPT-4V achieved accuracies of 86.2%, 62.0%, and 73.1%, respectively. GPT-4V outperformed ChatGPT and GPT-4 by 131.8% and 64.5%, respectively, on average across the data sets. The model showed a decreasing trend in performance as question difficulty increased in the AMBOSS dataset. GPT-4V achieved an accuracy of 90.7% on the full USMLE sample exam, exceeding the passing threshold of about 60% accuracy. Among the incorrect answers, 75.9% of responses included a misinterpretation of the image. However, 39.0% of them could be easily solved with a short hint.
Conclusion In this cross-sectional study, GPT-4V achieved high accuracy on USMLE questions, placing it in the 70th to 80th percentile among AMBOSS users preparing for the exam. The results suggest the potential of GPT-4V for clinical decision support. However, the explanations generated by GPT-4V revealed several issues, and explanation quality needs to improve before potential use in clinical decision support.
Introduction
Using computers to help make clinical diagnoses and guide treatments has been a goal of artificial intelligence since its inception.1 The adoption of electronic health record (EHR) systems by hospitals in the US has resulted in an unprecedented amount of digital data associated with patient encounters. Computer-assisted clinical diagnostic support systems (CDSSs) endeavor to enhance clinicians’ decisions with patient information and clinical knowledge.2 There is burgeoning interest in CDSSs for enhanced imaging,3 often termed radiomics, in applications such as breast cancer detection,4 COVID-19 detection,5 diagnosis of congenital cataracts,6 and localization of hidden fractures.7 For a decision to be trustworthy for clinicians, a CDSS should not only make a prediction but also provide an accurate explanation.8–10 However, most previous imaging CDSSs only highlight areas deemed significant by the AI,11–14 providing limited insight into the explanation of the diagnosis.15
Large language models (LLMs) can generate explanations because they are trained with reinforcement learning from human feedback to follow user requests, including requests to explain an answer. Prominent examples include ChatGPT, a chatbot released by OpenAI in November 2022, and its successor, Generative Pre-trained Transformer 4 (GPT-4), released in March 2023. The influence of ChatGPT is attributed to its conversational prowess and its performance, which approaches or matches human-level competence in cognitive tasks spanning various domains, including medicine.16 ChatGPT has achieved commendable results on the United States Medical Licensing Examinations, leading to discussions about the readiness of LLM applications for integration into clinical,17–19 educational,20–22 and research23 environments.
One limitation of ChatGPT is that it can only read and generate text and is unable to process other data modalities, such as images. This “single-modality” limitation is a common issue among many LLMs.24,25 Advancements in multimodal models promise enhanced capabilities and integration with diverse data sources.26,27 OpenAI’s recent introduction of GPT-4V has made strides toward bridging this divide. GPT-4V, a state-of-the-art multimodal large language model, is equipped with visual processing ability, enabling it to understand and describe visual content.28
However, the ability of GPT-4V to analyze medical images remains unknown. GPT-4V must perform comparably to humans on assessments of medical knowledge and reasoning for users to have sufficient confidence in its responses. In this work, we aim to assess GPT-4V’s performance on medical licensing examination questions with images, and to analyze its explanations for interpretability by healthcare professionals.
Method
This cross-sectional study compared the performance of GPT-4V, GPT-4, and ChatGPT in answering medical licensing examination questions. It also investigated the quality of GPT-4V’s explanations in answering these questions. The study protocol was deemed exempt by the Institutional Review Board at the VA Bedford Healthcare System, and informed consent was waived due to minimal risk to patients. This study was conducted in October 2023.
Medical Exam Data Collection
We obtained study questions from three sources. The United States Medical Licensing Examination (USMLE) consists of three steps required to obtain a medical license in the United States. The USMLE assesses a physician’s ability to apply knowledge, concepts, and principles that are critical to both health and disease management and form the foundation for safe, efficient patient care. The Step1, Step2 Clinical Knowledge (CK), and Step3 USMLE sample exams released by the National Board of Medical Examiners (NBME) consist of 119, 120, and 137 questions, respectively. We then selected the questions with images, resulting in 19, 13, and 18 questions from Step1, Step2CK, and Step3, respectively. The disciplines covered include, but are not limited to, radiology, dermatology, orthopedics, ophthalmology, and cardiology.
The sample exams included only a limited number of questions with images. Thus, we further collected similar questions from AMBOSS, a widely used question bank for medical students that is not publicly available and requires registration. AMBOSS provides exam performance data based on its students’ past performance, which enabled us to assess the comparative effectiveness of the model. For each question, AMBOSS provides an expert-written hint to help the student answer the question and a difficulty level ranging from 1 to 5. Levels 1, 2, 3, 4, and 5 represent the easiest 20%, 20%-50%, 50%-80%, 80%-95%, and 95%-100% of questions, respectively.29 We randomly selected 10 questions from each of the 5 difficulty levels and repeated this process for Step1, Step2CK, and Step3, resulting in a total of 150 questions.
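The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the question pool, its field names (`id`, `step`, `level`), the random seed, and the handling of bin boundaries (each bin is treated as closed on the left) are hypothetical and are not taken from the study.

```python
import random

# Difficulty bins from the text: (level, lower percentile, upper percentile).
DIFFICULTY_BINS = [(1, 0, 20), (2, 20, 50), (3, 50, 80), (4, 80, 95), (5, 95, 100)]


def difficulty_level(percentile):
    """Map a difficulty percentile (0-100) to a level from 1 to 5.

    Assumption: bins are closed on the left, so percentile 20 falls in level 2.
    """
    for level, low, high in DIFFICULTY_BINS:
        if low <= percentile < high or (level == 5 and percentile == 100):
            return level
    raise ValueError(f"percentile out of range: {percentile}")


def stratified_sample(questions, per_level=10, seed=0):
    """Draw `per_level` questions at random from each (step, level) stratum."""
    rng = random.Random(seed)
    strata = {}
    for q in questions:
        strata.setdefault((q["step"], q["level"]), []).append(q)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], per_level))
    return sample


# Hypothetical pool: 30 image questions for every step and difficulty level.
pool = [
    {"id": f"{step}-{level}-{i}", "step": step, "level": level}
    for step in ("Step1", "Step2CK", "Step3")
    for level in range(1, 6)
    for i in range(30)
]
selected = stratified_sample(pool)
print(len(selected))  # 3 steps x 5 levels x 10 questions
```

Sampling a fixed number per level, rather than sampling in proportion to the bin widths, gives the hardest 5% of questions the same weight as the easiest 20%, which is what makes a per-level comparison of accuracy possible.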
The Diagnostic Radiology Qualifying Core Exam (DRQCE), offered after 36 months of residency training, is an image-rich exam to evaluate a candidate’s core fund of knowledge and clinical judgment across practice domains of diagnostic radiology. DRQCE is not publicly available and requires registration. We collected 26 questions with images from the preparation exam offered by the American Board of Radiology (ABR). Thus, we had a total of 226 questions with images from the three sources.
How to Answer Image Questions using GPT-4V Prompt
GPT-4V took image and text data as inputs to generate textual outputs. Given that the input format (prompt) plays a key role in optimizing model performance, we followed the standard prompting guidelines for the visual question-answering task. Specifically, we prompted GPT-4V by first adding the image, then appending the context (i.e., patient information) and the question, and finally providing the multiple-choice options, each separated by a new line. When an image contained multiple sub-images, we uploaded each sub-image to GPT-4V. An example prompt and response are shown in Figure 1. We did not use a hint in the prompt unless otherwise specified. The response consists of the selected option as the answer, supported by a textual explanation to substantiate the selection. When using ChatGPT and GPT-4, which cannot handle image data, images were omitted from the prompt. Responses were collected from the September 25, 2023 version of the models. Each question was manually entered into the ChatGPT website independently (in a new chat window).
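The prompt layout described above (image first, then context and question, then one option per line) can be illustrated with a small helper. This is a sketch only: in the study the questions were entered manually through the website, so the function names, the dictionary layout, the image file name, and the abbreviated vignette below are all hypothetical. The options and the "D: ..." labeling follow the example in Figure 1.

```python
def build_prompt(context, question, options):
    """Assemble the text part: context, question, then one lettered option per line."""
    letters = "ABCDEFGHIJ"
    if len(options) > len(letters):
        raise ValueError("too many options")
    option_lines = [f"{letters[i]}: {text}" for i, text in enumerate(options)]
    return "\n".join([context, question, *option_lines])


def build_request(image_paths, context, question, options, include_images=True):
    """Order the inputs as described in the text: image(s) first, then the text.

    include_images=False mimics the text-only models, for which images were omitted.
    """
    parts = []
    if include_images:
        parts = [{"type": "image", "path": path} for path in image_paths]
    parts.append({"type": "text", "text": build_prompt(context, question, options)})
    return parts


# Hypothetical, abbreviated vignette based on the Figure 1 example.
context = "A 55-year-old woman has a 3-day history of chest pain and cough with blood-tinged sputum."
question = "Which of the following is the most likely diagnosis?"
options = ["Sarcoidosis", "Pericarditis", "Bacterial pneumonia", "Pulmonary embolism", "Pneumothorax"]

multimodal = build_request(["chest_xray.png"], context, question, options)
text_only = build_request(["chest_xray.png"], context, question, options, include_images=False)
print(multimodal[-1]["text"])
```

Keeping the image parts separate from the text makes it straightforward to reuse the same text prompt for the models that cannot read images.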
Evaluation Metrics
For answer accuracy, we evaluated the model’s performance by comparing the model’s choice with the correct choice provided by the exam board or question bank. We defined accuracy as the ratio of the number of correct choices to the total number of questions.
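As a concrete reading of this definition, the snippet below computes accuracy as the number of matching choices divided by the number of questions. The answer lists are made-up placeholders, not study data.

```python
def accuracy(model_choices, correct_choices):
    """Fraction of questions where the model's choice equals the answer key."""
    if len(model_choices) != len(correct_choices):
        raise ValueError("both lists must cover the same questions")
    if not correct_choices:
        raise ValueError("at least one question is required")
    hits = sum(m == c for m, c in zip(model_choices, correct_choices))
    return hits / len(correct_choices)


# Hypothetical choices for four questions; the third one is wrong.
model_choices = ["D", "A", "C", "B"]
answer_key = ["D", "A", "B", "B"]
print(accuracy(model_choices, answer_key))  # 3 correct out of 4 -> 0.75
```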
We also evaluated the quality of the explanations based on the preferences of 3 healthcare professionals (one medical doctor, one registered nurse, and one medical student). For each question from the AMBOSS dataset (n=150), we asked the healthcare professionals to choose between the explanation by GPT-4V, the explanation by an expert, or a tie.
Additionally, we asked the healthcare professionals to evaluate whether GPT-4V’s explanations were sufficient and comprehensive.30,31 We asked them to label whether the following information was present in each explanation:
Image interpretation: The explanation tries to interpret what the image means.
Question information: The explanation uses information related to the question.
Comprehensive explanation: The explanation includes comprehensive reasoning that leads to the final answer.
Finally, for each question answered incorrectly, we asked healthcare professionals to check if the explanation contained any of the following errors:
Image misunderstanding: if the sentence in the explanation shows an incorrect interpretation of the image. Example: It says a bone in the image is from the hand, but it is the foot.
Text hallucination: if the sentence in the explanation contains an incorrect statement that is unrelated to the image. Example: Claiming Saxenda is just insulin.
Reasoning error: if the sentence did not properly convert the information (either image or text) to an answer. Example: It identifies that the patient’s trip occurred in the last 3 months and that Chagas disease usually develops 10∼20 years after infection, but still diagnoses Chagas disease.
Non-medical error: any error that is not medical in nature. GPT is known to struggle with tasks requiring precise spatial localization, such as identifying the positions of pieces on a chessboard. It is also known to struggle with calculations, such as 1 + 1 = ?.
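Because one explanation can contain more than one kind of error, the four categories above are best treated as multi-label annotations. The following sketch tallies them per question; the question identifiers and labels are invented for illustration and do not come from the study's annotations.

```python
from collections import Counter

ERROR_TYPES = (
    "image misunderstanding",
    "text hallucination",
    "reasoning error",
    "non-medical error",
)


def error_rates(annotations):
    """Return, for each error type, the fraction of questions labeled with it.

    `annotations` maps a question id to the set of error types found in its
    explanation. Fractions can sum to more than 1 because labels may overlap.
    """
    for labels in annotations.values():
        unknown = set(labels) - set(ERROR_TYPES)
        if unknown:
            raise ValueError(f"unknown error type(s): {sorted(unknown)}")
    counts = Counter(label for labels in annotations.values() for label in labels)
    total = len(annotations)
    return {error: counts[error] / total for error in ERROR_TYPES}


# Hypothetical annotations for four incorrectly answered questions.
annotations = {
    "q1": {"image misunderstanding"},
    "q2": {"image misunderstanding", "reasoning error"},
    "q3": {"text hallucination"},
    "q4": {"image misunderstanding", "reasoning error"},
}
rates = error_rates(annotations)
print(rates)
```

In this toy example the rates sum to 1.5, which shows why the reported percentages for the error categories need not add up to 100%.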
Statistical Analysis
GPT-4V’s accuracies on the AMBOSS dataset were compared across difficulty levels using unpaired chi-square tests with a significance level of 0.05. All analyses were conducted in Python (version 3.10.11).
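To make the test concrete, here is a minimal sketch of a Pearson chi-square test of independence on a 5 x 2 table of difficulty level by outcome. The text does not say which Python library was used, so the statistic is computed by hand here. The per-level counts are hypothetical placeholders (chosen only so that they total 95 correct and 55 incorrect answers) and are not the values in Table 2. The value 9.488 is the standard chi-square critical value for 4 degrees of freedom at a significance level of 0.05.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if outcome were independent of difficulty level.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# HYPOTHETICAL counts: rows are difficulty levels 1-5, columns are (correct, incorrect).
table = [[27, 3], [24, 6], [20, 10], [15, 15], [9, 21]]

statistic, dof = chi_square_statistic(table)
CRITICAL_VALUE_DF4_ALPHA_005 = 9.488
significant = statistic > CRITICAL_VALUE_DF4_ALPHA_005
print(round(statistic, 2), dof, significant)
```

A statistic above the critical value indicates that accuracy differs across difficulty levels; it does not by itself establish a monotonic trend, which would call for a dedicated trend test.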
Results
Overall Answer Accuracy
For questions with images, GPT-4V achieved accuracies of 84.2%, 85.7%, and 88.9% on Step1, Step2CK, and Step3 of the USMLE questions, respectively, outperforming ChatGPT and GPT-4 by 42.1 and 21.1 percentage points on Step1, 35.7 and 21.4 percentage points on Step2CK, and 38.9 and 22.2 percentage points on Step3 (Table 1). Similarly, on the DRQCE, GPT-4V achieved an accuracy of 73.1%, outperforming ChatGPT (19.2%) and GPT-4 (26.9%).
For all questions in the USMLE sample exam (including those without images), GPT-4V achieved accuracies of 88.2%, 90.8%, and 92.7% on Step1, Step2CK, and Step3, respectively (Table 1), exceeding the passing standard for the USMLE (about 60%). However, it achieved limited accuracy in several medical disciplines, such as anatomy (25.0%), emergency medicine (25.0%), and pathology (50.0%). Understanding the image is essential to correctly answering the majority of questions in these disciplines.
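The per-step accuracies can be related to the overall figure of 90.7% reported for the full sample exam with a short calculation. The question totals (119, 120, and 137) are taken from the data description. The numbers of correct answers are not stated in the text; the values below are an assumption, inferred as the integers that reproduce the reported per-step percentages.

```python
# (correct, total) per step. Totals come from the text; the correct counts are
# INFERRED from the reported percentages and are not stated by the authors.
steps = {"Step1": (105, 119), "Step2CK": (109, 120), "Step3": (127, 137)}

per_step = {name: round(100 * c / t, 1) for name, (c, t) in steps.items()}
print(per_step)

# Pooled accuracy weights every question equally.
pooled_correct = sum(c for c, _ in steps.values())
pooled_total = sum(t for _, t in steps.values())
pooled = round(100 * pooled_correct / pooled_total, 1)

# An unweighted mean of the three percentages weights every step equally.
unweighted = round(sum(100 * c / t for c, t in steps.values()) / len(steps), 1)
print(pooled, unweighted)
```

Under these inferred counts, pooling all 376 questions gives 90.7%, whereas the unweighted mean of the three step accuracies gives 90.6%, so the overall figure is consistent with a pooled calculation.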
Accuracy Decreases When Difficulty Increases
When GPT-4V was asked questions without the hint, it achieved accuracies of 60%, 64%, and 66% on AMBOSS Step1, Step2CK, and Step3, respectively. GPT-4V was in the 72nd, 76th, and 80th percentile among AMBOSS users who were preparing for Step1, Step2CK, and Step3, respectively. Table 2 shows a decreasing trend in performance as question difficulty increased in the AMBOSS dataset (P<0.05). However, the decreasing trend was not observed when GPT-4V was questioned with the hint. Out of 55 wrong answers, 17 were corrected by the hint. An example and detailed analysis are provided in the supplementary material.
Quality of Explanation
We first evaluated the healthcare professionals’ preference between GPT-4V-generated explanations and expert-generated explanations. When GPT-4V answered incorrectly, it was no surprise that healthcare professionals overwhelmingly preferred the expert explanations, as shown in Table 3. When GPT-4V answered correctly, votes favoring the expert exceeded votes favoring GPT-4V by only 4, out of a total of 95 votes.
We further evaluated the quality of the GPT-4V generated explanation by verifying if GPT-4V used the provided image and question to answer the question in Table 4. When examining the 95 correct answers, 84.2% (n=80) of the responses contained an interpretation of the image, while 96.8% (n=92) aptly captured the information presented in the question. On the other hand, for the 55 incorrect answers, 92.8% (n=51) interpreted the image, and 89.1% (n=49) depicted the question’s details. In terms of comprehensive explanation, GPT-4V offered a comprehensive one in 79.0% (n=75) of correct responses. In contrast, only 7.2% (n=4) of the wrong answers had a comprehensive explanation.
We also evaluated the explanations that accompanied GPT-4V’s incorrect answers across the 4 metrics outlined above: image misunderstanding, text hallucination, reasoning error, and non-medical error. Among questions with wrong answers (n=55), we found that 76.3% (n=42) included a misunderstanding of the image, 45.5% (n=25) included a reasoning error, 18.2% (n=10) included text hallucination, and none included a non-medical error.
Discussion
A prevailing research direction involves thoroughly investigating ChatGPT’s ability to handle a diverse set of medical examination questions.20,32,33 GPT-4 was recently introduced as the upgraded model for ChatGPT. Studies showed that GPT-4 outperforms ChatGPT in various medical tasks.34,35 These collective insights illustrate the power of ChatGPT and GPT-4 in answering medical exam questions and their potential for medical decision support. However, previous evaluations were limited to questions without images.
In this evaluation of medical exam questions with images, we found that GPT-4V selected more correct choices than ChatGPT and GPT-4, as shown in Table 1. As a result, when evaluated on all questions in the USMLE sample exam, GPT-4V achieved an accuracy of 90.7%, outperforming ChatGPT (58.5%) and GPT-4 (83.8%). The passing standard for the USMLE is typically set at about 60%, indicating that GPT-4V performed at a level similar to a medical student in the final year of study. In terms of explanation quality, we found that more than 80% of responses from GPT-4V provided an interpretation of the image and the question in support of its answer selection, regardless of correctness. This suggests that GPT-4V consistently took into account both the image and the question while generating responses.
Figure 1 illustrates an example of a high-quality explanation that uses both text and image in answering a hard question. More than 70% of students answered this question incorrectly on the first try, because both bacterial pneumonia and pulmonary embolism may involve symptoms such as cough. To differentiate them, GPT-4V correctly interpreted the X-ray as showing the radiologic sign of Hampton hump, which further increased the suspicion of pulmonary infarction rather than pneumonia.36 To show that the X-ray was needed, as stated in the explanation, we removed the image from the input, and GPT-4V switched its answer to bacterial pneumonia while also acknowledging the possibility of pulmonary embolism. This change in response demonstrated the high quality of the GPT-4V explanation: its statements about the X-ray were not fictional, and it truly needed the X-ray to answer this question. The high quality of GPT-4V explanations was also supported by the preference voting. When comparing explanations generated by experts with those generated by GPT-4V, the margin of preference for the expert over GPT-4V was minimal (n=4) when GPT-4V answered the question correctly (n=95).
We also found that the accuracy of GPT-4V was related to the comprehensiveness of the explanation. GPT-4V offered a comprehensive explanation in 79.0% of correct responses. In contrast, only 7.2% of the wrong answers had a comprehensive explanation. Thus, the absence of key information may be a cause of inaccurate answers. These observations suggest that model performance could be enhanced by training GPT-4V with more intricate, clinically specific data, such as resources with insights from experienced physicians like UpToDate,37 and medical research literature such as PubMed.38
Image misunderstanding was the primary reason why GPT-4V answered incorrectly. Out of 55 wrong responses, 42 (76.3%) involved a misunderstanding of the image. In comparison, only 10 (18.2%) of the mistakes were attributed to text hallucination. GPT-4V’s proficiency in processing images lagged considerably behind its text-handling capability. To circumvent this image interpretation issue, we additionally prompted GPT-4V with a short hint that described the image. We found that 40.5% (17 out of 42) of these responses switched to the correct answer. Correction from the hint indicated that GPT-4V could be readily guided. Within a conversational interface, medical professionals can readily guide and refine GPT-4V’s initial outputs. This adaptability could be advantageous for physicians, as it allows for real-time adjustments and helps the generated information align more closely with the clinical context or the specific details of a patient’s case. By doing so, it could enhance the utility and reliability of GPT-4V as an auxiliary tool in medical settings.37
Another drawback of GPT-4V is its tendency to produce factually inaccurate responses, a problem often referred to as hallucination, which is prevalent among large language models.39 We found that more than 18% of GPT-4V explanations for incorrect answers contained hallucinations. Thus, when designing clinical support tools for high-risk situations such as patient diagnosis, it is crucial to integrate GPT-4V with a probabilistic model that reports a confidence interval indicating the reliability of the response.40 This would enhance the reliability of CDSS responses by indicating when additional physician review is warranted.15
Limitations
This study has several limitations. First, the applicability of our findings is constrained by the modest sample size. We gathered 226 questions that included images, which might not comprehensively represent all medical disciplines. Second, GPT-4V was trained on online websites and licensed data up until September 2022. Thus, GPT-4V may be unable to answer questions that require more recent medical knowledge. However, we did not exclude such questions during evaluation. Furthermore, while GPT-4V has demonstrated proficiency on medical licensing examinations, its real-world applicability, especially in dynamic, user-interactive scenarios, remains untested. Therefore, while the results are promising, extrapolating the efficacy of GPT-4V to broader clinical applications requires appropriate benchmarks and further research.19
Conclusion
While GPT-4V showcased remarkable accuracy across a spectrum of medical disciplines and varying difficulty levels in this study, it is paramount that further refinement be undertaken, particularly in enhancing its explanatory capabilities prior to any clinical assimilation. Medical students and professionals must be acutely aware of its limitations and consistently cross-verify with authoritative sources. Notably, even with their sophisticated capabilities, state-of-the-art LLMs like GPT-4V are not on the brink of supplanting physicians. Their performance in specialized examinations, though noteworthy, still exhibits imperfections, which can lead to consequential inaccuracies and uncertainties. Coupled with the well-known ethical concerns, the credibility and readiness of LLMs for clinical settings remain under scrutiny. However, the preliminary results are promising, suggesting that the present technology is poised to influence clinical practices. As research and development persist, we anticipate a more extensive and profound integration of AI in the medical domain.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
GPT Response Below
Let’s go through the information provided and evaluate each option based on the clinical picture and the chest x-ray:
**Clinical Information**:
- 55-year-old woman
- Smoked for 40 years (20 pack-years)
- Recent upper respiratory tract symptoms
- 3-day history of chest pain and cough with blood-tinged sputum
- Chest pain is sharp, stabbing, and worsened with coughing
- History of multiple sclerosis and uses a wheelchair
- On ocrelizumab and dantrolene
- Vital signs show mild fever and elevated pulse
- Auscultation shows crackles in the right lower lung
**Chest X-ray**:
Hampton hump is a radiologic sign which consists of a shallow wedge-shaped opacity in the periphery of the lung.
**Options**:
**Sarcoidosis** - Sarcoidosis often presents with bilateral hilar lymphadenopathy on chest x-ray. The given image doesn’t seem to show this. Additionally, the acute symptoms don’t align well with a diagnosis of sarcoidosis.
**Pericarditis** - This would present with chest pain that’s positional and relieved by sitting forward. The x-ray provided doesn’t suggest pericardial effusion. Unlikely.
**Bacterial pneumonia** - Given her symptoms and the crackles in the right lower lung, this is a possibility. However, the chest x-ray would typically show a lobar consolidation or infiltrate which isn’t clearly seen here.
**Pulmonary embolism** - Given her history of immobility (uses a wheelchair) and the acute onset of sharp, stabbing chest pain exacerbated by coughing, this is a strong contender. The presence of a Hampton Hump should further increase suspicion of a PE.
**Pneumothorax** - This would present with sudden onset chest pain and shortness of breath. On x-ray, you’d expect to see a clear line indicating the collapsed lung and no lung markings beyond that line. This is not evident in the provided image.
**Conclusion**:
Based on the provided information and the chest x-ray, the most likely diagnosis is:
**D: Pulmonary embolism**.
The reason for this choice is the combination of her risk factors (immobility, smoking), her clinical presentation (sharp, stabbing chest pain exacerbated by coughing, blood-tinged sputum), and the fact that the x-ray doesn’t clearly support the other diagnoses.
Conflicts of Interest
The authors declare no conflicts of interest.
Supplementary Material
Here is the link to supplement cases.docx