Performance of Multimodal GPT-4V on USMLE with Image: Potential for Imaging Diagnostic Support with Explanations

Zhichao Yang; Zonghai Yao; Mahbuba Tasmin; Parth Vashisht; Won Seok Jang; Beining Wang; Dan Berlowitz; Hong Yu

doi:10.1101/2023.10.26.23297629

Abstract

Importance Using artificial intelligence (AI) to help clinical diagnoses has been an active research topic for more than six decades. However, few studies have achieved the scale and accuracy needed for translation into clinical practice. The power of large language models (LLMs) may change this. In this study, we evaluated the accuracy of the newly released Generative Pre-trained Transformer 4 with vision (GPT-4V), a large multimodal model trained to analyze image inputs together with text instructions from the user, on medical licensing examination questions. This study is the first to evaluate GPT models for interpreting medical images.

Objective This study aimed to evaluate the performance of GPT-4V on medical licensing examination questions with images, as well as to analyze interpretability.

Design, Setting, and Participants We used 3 sets of multiple-choice questions with images to evaluate GPT-4V’s performance. The first set was the United States Medical Licensing Examination (USMLE) sample questions for Step1, Step2CK, and Step3 from the National Board of Medical Examiners (NBME). The second set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on exam performance relative to its user base. The third set was the Diagnostic Radiology Qualifying Core Exam (DRQCE) from the American Board of Radiology. The study (including data analysis) was conducted from September to October 2023.

Main Outcomes and Measures The choice accuracy of GPT-4V was compared to two other large language models, GPT-4 and ChatGPT. The GPT-4V explanation was evaluated across 4 qualitative metrics: image misunderstanding, text hallucination, reasoning error, and non-medical error.

Results On the 3 exams with images (NBME, AMBOSS, and DRQCE), GPT-4V achieved accuracies of 86.2%, 62.0%, and 73.1%, respectively. GPT-4V outperformed ChatGPT and GPT-4 by 131.8% and 64.5% on average across the data sets. The model showed a decreasing trend in performance as question difficulty increased in the AMBOSS dataset. GPT-4V achieved an accuracy of 90.7% on the full USMLE sample exam, exceeding the passing threshold of about 60% accuracy. Among the incorrect answers, 75.9% of responses included a misinterpretation of the image. However, 39.0% of them could be easily solved with a short hint.

Conclusion In this cross-sectional study, GPT-4V achieved a high accuracy on USMLE questions, placing it in the 70th to 80th percentile among AMBOSS users preparing for the exam. The results suggest the potential of GPT-4V for clinical decision support. However, the explanations generated by GPT-4V revealed several issues, and explanation quality needs to improve before potential use in clinical decision support.

Introduction

Using computers to help make clinical diagnoses and guide treatments has been a goal of artificial intelligence since its inception.¹ The adoption of electronic health record (EHR) systems by hospitals in the US has resulted in an unprecedented amount of digital data associated with patient encounters. Computer-assisted clinical diagnostic support systems (CDSS) endeavor to enhance clinicians’ decisions with patient information and clinical knowledge.² There is burgeoning interest in CDSS for enhanced imaging³, often termed radiomics, in various applications such as breast cancer detection⁴, COVID-19 detection⁵, diagnosis of congenital cataracts⁶, and localization of hidden fractures⁷. For a decision to be trustworthy for clinicians, a CDSS should not only make the prediction but also provide accurate explanations.^8–10 However, most previous imaging CDSS only highlight areas deemed significant by the AI,^11–14 providing limited insight into the explanation of the diagnosis.¹⁵

Large language models (LLMs) can generate explanations because they are trained with reinforcement learning from human feedback to follow user requests, including requests to explain an answer. Typical LLM examples include ChatGPT, a renowned chatbot released by OpenAI in November 2022, and its successor Generative Pre-trained Transformer 4 (GPT-4), released in March 2023. The influence of ChatGPT is attributed to its conversational prowess and its performance, which approaches or matches human-level competence in cognitive tasks spanning various domains, including medicine.¹⁶ ChatGPT has achieved commendable results on the United States Medical Licensing Examinations, leading to discussions about the readiness of LLM applications for integration into clinical^17–19, educational^20–22, and research²³ environments.

One limitation of ChatGPT is that it can only read and generate text and is unable to process other data modalities, such as images. This limitation, known as “single-modality,” is a common issue among many LLMs.^24,25 Advancements in multimodal models promise enhanced capabilities and integration with diverse data sources.^26,27 OpenAI’s recent introduction of GPT-4V has made strides toward bridging this divide. GPT-4V, a state-of-the-art multimodal large language model, is equipped with visual processing ability, enabling it to understand and describe visual content.²⁸

However, the ability of GPT-4V to analyze medical images remains unknown. For users to have sufficient confidence in its responses, GPT-4V must perform comparably to humans on assessments of medical knowledge and reasoning. In this work, we aim to assess GPT-4V’s performance on medical licensing examination questions with images and to analyze how interpretable its explanations are for healthcare professionals.

Method

This cross-sectional study compared the performance of GPT-4V, GPT-4, and ChatGPT in answering medical licensing examination questions. The study also investigated the quality of GPT-4V’s explanations for its answers. The study protocol was deemed exempt by the Institutional Review Board at the VA Bedford Healthcare System, and informed consent was waived due to minimal risk to patients. This study was conducted in October 2023.

Medical Exam Data Collection

We obtained study questions from three sources. The United States Medical Licensing Examination (USMLE) consists of three steps required to obtain a medical license in the United States. The USMLE assesses a physician’s ability to apply knowledge, concepts, and principles that are critical to both health and disease management and form the foundation for safe, efficient patient care. The Step1, Step2 Clinical Knowledge (CK), and Step3 sample exams released by the National Board of Medical Examiners (NBME) consist of 119, 120, and 137 questions, respectively. We then selected the questions with images, resulting in 19, 13, and 18 questions from Step1, Step2CK, and Step3, respectively. The disciplines covered include, but are not limited to, radiology, dermatology, orthopedics, ophthalmology, and cardiology.

The sample exam included only a limited number of questions with images. We therefore collected similar questions from AMBOSS, a widely used question bank for medical students that is not publicly available and requires registration. AMBOSS provides performance data from its users, which enabled us to compare the model’s performance with that of past AMBOSS students. For each question, AMBOSS provides an expert-written hint that helps the student answer the question and a difficulty level ranging from 1 to 5. Levels 1, 2, 3, 4, and 5 represent the easiest 20%, 20%-50%, 50%-80%, 80%-95%, and 95%-100% of questions, respectively.²⁹ We randomly selected 10 questions from each of the 5 difficulty levels and repeated this process for Step1, Step2CK, and Step3, resulting in a total of 150 questions.
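
The stratified sampling described above can be sketched in Python as follows. This is a minimal illustration rather than the script used in the study: the `question_pool` mapping, the `sample_questions` function, the question IDs, and the fixed random seed are all assumptions introduced here.

```python
import random

STEPS = ("Step1", "Step2CK", "Step3")
DIFFICULTY_LEVELS = (1, 2, 3, 4, 5)
PER_STRATUM = 10  # questions drawn per (step, difficulty) pair, as stated in the text


def sample_questions(question_pool, seed=0):
    """Draw PER_STRATUM question IDs at random from every (step, difficulty) stratum.

    question_pool is a hypothetical mapping {(step, difficulty): [question_id, ...]}.
    """
    rng = random.Random(seed)  # fixed seed only so that the sketch is reproducible
    selected = []
    for step in STEPS:
        for level in DIFFICULTY_LEVELS:
            candidates = question_pool[(step, level)]
            selected.extend(rng.sample(candidates, PER_STRATUM))
    return selected


# Placeholder pool with made-up IDs; a real pool would hold AMBOSS image questions.
pool = {
    (step, level): [f"{step}-L{level}-Q{i}" for i in range(40)]
    for step in STEPS
    for level in DIFFICULTY_LEVELS
}
sampled = sample_questions(pool)
print(len(sampled))  # 3 steps x 5 levels x 10 questions = 150
```

Sampling within each stratum, rather than from the pooled questions, guarantees that every difficulty level is equally represented for every step.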

The Diagnostic Radiology Qualifying Core Exam (DRQCE), offered after 36 months of residency training, is an image-rich exam to evaluate a candidate’s core fund of knowledge and clinical judgment across practice domains of diagnostic radiology. DRQCE is not publicly available and requires registration. We collected 26 questions with images from the preparation exam offered by the American Board of Radiology (ABR). Thus, we had a total of 226 questions with images from the three sources.

How to Answer Image Questions using GPT-4V Prompt

GPT-4V took image and text data as inputs and generated textual outputs. Given that the input format (prompt) plays a key role in optimizing model performance, we followed the standard prompting guidelines for the visual question-answering task. Specifically, we prompted GPT-4V by first adding the image, then appending the context (i.e., patient information) and the question, and finally providing the multiple-choice options, each separated by a new line. When an image contained multiple sub-images, we uploaded each sub-image to GPT-4V. An example prompt and response are shown in Figure 1. We did not include the hint in the prompt unless otherwise specified. The response consists of the selected option as the answer, supported by a textual explanation that substantiates the selection. When using ChatGPT and GPT-4, which cannot handle image data, images were omitted from the prompt. Responses were collected from the September 25, 2023 version of the models. Each question was entered manually into the ChatGPT website in a new chat window, so that questions were answered independently.

Figure 1. A case study of GPT-4V in answering USMLE questions with chest x-ray. Input prompt and output response are separated by “GPT Response Below”. Option D is the correct answer and is selected by GPT-4V without hint.

Question: A 55-year-old woman comes to the physician because of a 4-day history of chest pain and coughing up blood. The chest discomfort is characterized as piercing and aggravated by coughing. She encountered symptoms of a sore throat and nasal congestion 11 days prior. She was diagnosed with multiple sclerosis 15 years ago and utilizes a wheelchair for movement. A 40-year history of smoking a pack of cigarettes daily is noted, but she abstains from alcohol. Her current prescriptions include ocrelizumab and dantrolene. Temperature is 37.9°C (100.2°F), pulse is 105/min, and blood pressure is 110/60 mm Hg. Lung auscultation reveals occasional inspiratory crackles in the right lower lobe, while the cardiac exam shows no abnormalities. Neurologic examination shows stiffness and decreased sensation of the lower extremities; there is diffuse hyperreflexia. An x-ray of the chest is provided. Which of the following is the most likely diagnosis?

Options:

A: Sarcoidosis

B: Pericarditis

C: Bacterial pneumonia

D: Pulmonary embolism

E: Pneumothorax

Let’s think step by step. First, analyze each option one by one, and then give your final prediction and reasons to be sure we have the right answer.
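
The text portion of the prompt layout shown above can be sketched in Python as follows. This is an illustrative assumption, not the authors' tooling (the study entered prompts manually on the ChatGPT website): the `build_prompt` function, the `Hint:` label, and the placement of the hint are hypothetical, and the image itself is assumed to be uploaded separately before this text.

```python
# Instruction wording taken from the example prompt in Figure 1.
COT_INSTRUCTION = (
    "Let's think step by step. First, analyze each option one by one, "
    "and then give your final prediction and reasons to be sure we have the right answer."
)


def build_prompt(question, options, hint=None):
    """Assemble the text that follows the uploaded image.

    options maps an option letter to its text. The optional hint and its
    "Hint:" label are assumptions; the paper does not say how hints were worded.
    """
    lines = [f"Question: {question}"]
    if hint is not None:
        lines.append(f"Hint: {hint}")
    lines.append("Options:")
    # One option per line, as described in the Method section.
    lines.extend(f"{letter}: {text}" for letter, text in options.items())
    lines.append(COT_INSTRUCTION)
    return "\n".join(lines)


prompt = build_prompt(
    "An x-ray of the chest is provided. Which of the following is the most likely diagnosis?",
    {
        "A": "Sarcoidosis",
        "B": "Pericarditis",
        "C": "Bacterial pneumonia",
        "D": "Pulmonary embolism",
        "E": "Pneumothorax",
    },
)
print(prompt)
```

For ChatGPT and GPT-4, the same text would be sent without the preceding image upload.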

Evaluation Metrics

For answer accuracy, we evaluated the model’s performance by comparing the model’s choice with the correct choice provided by the exam board or question bank. We defined accuracy as the ratio of the number of correct choices to the total number of questions.
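
As a minimal sketch of this definition, accuracy can be computed in Python as shown below. The `accuracy` function name and the toy answers are made up for illustration and are not data from the study.

```python
def accuracy(predicted, answer_key):
    """Ratio of correct choices to the total number of questions."""
    if len(predicted) != len(answer_key):
        raise ValueError("predicted and answer_key must have the same length")
    if not answer_key:
        raise ValueError("at least one question is required")
    correct = sum(p == k for p, k in zip(predicted, answer_key))
    return correct / len(answer_key)


# Toy example with made-up answers: 4 of the 5 choices match the key.
predicted = ["D", "A", "C", "B", "E"]
answer_key = ["D", "A", "B", "B", "E"]
print(f"{accuracy(predicted, answer_key):.1%}")  # 80.0%
```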

We also evaluated the quality of the explanation by preference from 3 healthcare professionals (one medical doctor, one registered nurse, and one medical student). For each question from AMBOSS dataset (n=150), we asked the healthcare professionals to choose their preference between an explanation by GPT-4V, an explanation by an expert, or a tie.

Additionally, we asked the healthcare professionals to evaluate whether each GPT-4V explanation was sufficient and comprehensive.^30,31 We asked them to label whether each of the following was present in the explanation:

Image interpretation: The explanation tries to interpret what the image means.
Question information: The explanation uses information related to the question.
Comprehensive explanation: The explanation includes comprehensive reasoning that leads to the final answer.

Finally, for each question answered incorrectly, we asked healthcare professionals to check if the explanation contained any of the following errors:

Image misunderstanding: a sentence in the explanation shows an incorrect interpretation of the image. Example: It says a bone in the image is from the hand, but it is from the foot.
Text hallucination: a sentence in the explanation contains something that is factually incorrect (other than about the image). Example: Claiming Saxenda is just insulin.
Reasoning error: a sentence did not properly convert the information (either image or text) into an answer. Example: It identifies that the patient’s trip occurred in the last 3 months and that Chagas disease usually develops 10–20 years after infection, but still diagnoses Chagas disease.
Non-medical error: any error that is not medical in nature. Example: GPT is known to struggle with tasks requiring precise spatial localization, such as identifying the positions of chess pieces on a board, and with calculations, such as 1 + 1 = ?.
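
One way to record and tally these labels is sketched below in Python. The label identifiers, the `tally_errors` function, and the annotations are hypothetical; the sketch only illustrates that a single explanation may carry more than one error label, so the per-category counts need not sum to the number of questions.

```python
from collections import Counter

ERROR_TYPES = (
    "image_misunderstanding",
    "text_hallucination",
    "reasoning_error",
    "non_medical_error",
)


def tally_errors(annotations):
    """Count how many questions carry each error label.

    annotations maps a question ID to the set of error labels assigned to its
    explanation; the labels are not mutually exclusive.
    """
    counts = Counter()
    for question_id, labels in annotations.items():
        unknown = set(labels) - set(ERROR_TYPES)
        if unknown:
            raise ValueError(f"unknown labels for {question_id}: {sorted(unknown)}")
        counts.update(set(labels))  # each label counts once per question
    return {name: counts[name] for name in ERROR_TYPES}


# Made-up annotations for four incorrectly answered questions.
annotations = {
    "q1": {"image_misunderstanding"},
    "q2": {"image_misunderstanding", "reasoning_error"},
    "q3": {"text_hallucination"},
    "q4": {"reasoning_error"},
}
error_counts = tally_errors(annotations)
for name, n in error_counts.items():
    print(f"{name}: {n}/{len(annotations)} ({n / len(annotations):.1%})")
```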

Statistical Analysis

GPT-4V’s accuracies on the AMBOSS dataset were compared across difficulty levels using unpaired chi-square tests with a significance level of 0.05. All analyses were conducted in Python (version 3.10.11).
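
A chi-square test of this kind can be sketched in pure Python as follows. The paper does not state which library or exact test configuration was used, so this is an assumption-laden illustration: the function names are hypothetical, the per-difficulty counts are placeholder values invented for the example (they are not the study's Table 2 data), and the p-value uses a closed form that holds only for an even number of degrees of freedom.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_sf_even_dof(x, dof):
    """Upper-tail chi-square probability; this closed form is valid for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(dof // 2))


# HYPOTHETICAL counts for difficulty levels 1-5 (30 questions per level).
# Row 0: correct answers, row 1: incorrect answers. Not data from the study.
table = [
    [26, 22, 19, 16, 12],
    [4, 8, 11, 14, 18],
]
stat, dof = chi_square_independence(table)
p_value = chi2_sf_even_dof(stat, dof)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.4f}")
print("significant at 0.05" if p_value < 0.05 else "not significant at 0.05")
```

With five difficulty levels and two outcomes, the test has 4 degrees of freedom, which is why the even-dof closed form suffices here.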

Results

Overall Answer Accuracy

For questions with images, GPT-4V achieved accuracies of 84.2%, 85.7%, and 88.9% on the Step1, Step2CK, and Step3 USMLE questions, respectively, outperforming ChatGPT and GPT-4 by 42.1% and 21.1% in Step1, 35.7% and 21.4% in Step2CK, and 38.9% and 22.2% in Step3 (Table 1). Similarly, GPT-4V achieved an accuracy of 73.1% in DRQCE, outperforming ChatGPT (19.2%) and GPT-4 (26.9%).

Table 1. Performance of ChatGPT, GPT-4, and GPT-4V on USMLE sample exam from NBME.

For all questions in the USMLE sample exam (including those without images), GPT-4V achieved accuracies of 88.2%, 90.8%, and 92.7% on Step1, Step2CK, and Step3, respectively (Table 1), exceeding the passing standard for the USMLE (about 60%). However, it achieved limited accuracy in several medical disciplines, such as anatomy (25.0%), emergency medicine (25.0%), and pathology (50.0%). Understanding the image is essential to correctly answer the majority of questions in these disciplines.

Accuracy Decreases When Difficulty Increases

Without the hint, GPT-4V achieved accuracies of 60%, 64%, and 66% on the AMBOSS Step1, Step2CK, and Step3 questions, respectively. This placed GPT-4V in the 72nd, 76th, and 80th percentile among AMBOSS users preparing for Step1, Step2CK, and Step3, respectively. Table 2 shows a decreasing trend in performance as question difficulty increased in the AMBOSS dataset (P<0.05). However, the decreasing trend was not observed when GPT-4V was given the hint. Out of 55 wrong answers, 17 were corrected by the hint. An example and detailed analysis are provided in the supplementary material.

Table 2. Performance of GPT-4V on USMLE AMBOSS. For each step, overall: n=50; difficulty 1: n=10; difficulty 2: n=10; difficulty 3: n=10; difficulty 4: n=10; difficulty 5: n=10.

Quality of Explanation

We first evaluated the healthcare professionals’ preference between GPT-4V-generated explanations and expert-written explanations. When GPT-4V answered incorrectly, healthcare professionals unsurprisingly and overwhelmingly preferred the expert explanations, as shown in Table 3. When GPT-4V answered correctly, votes favoring the expert explanation exceeded votes favoring the GPT-4V explanation by only 4, out of a total of 95 votes.

Table 3. Healthcare professionals’ preferred explanation for 150 AMBOSS questions.

We further evaluated the quality of the GPT-4V-generated explanations by verifying whether GPT-4V used the provided image and question text to answer the question (Table 4). Among the 95 correct answers, 84.2% (n=80) of the responses contained an interpretation of the image, while 96.8% (n=92) captured the information presented in the question. Among the 55 incorrect answers, 92.8% (n=51) interpreted the image, and 89.1% (n=49) used the question’s details. GPT-4V offered a comprehensive explanation in 79.0% (n=75) of correct responses. In contrast, only 7.2% (n=4) of the wrong answers had a comprehensive explanation.

Table 4. Qualitative analysis of GPT-4V response quality for AMBOSS questions.

We also evaluated the explanations that accompanied GPT-4V’s incorrect answers across the 4 metrics outlined above: image misunderstanding, text hallucination, reasoning error, and non-medical error. Among questions with wrong answers (n=55), we found that 76.3% (n=42) included a misunderstanding of the image, 45.5% (n=25) included a reasoning error, 18.2% (n=10) included text hallucination, and none included a non-medical error.

Discussion

A prevailing research direction involves thoroughly investigating ChatGPT’s ability to handle a diverse set of medical examination questions.^20,32,33 GPT-4 was recently introduced as the upgraded model for ChatGPT. Studies showed that GPT-4 outperforms ChatGPT in various medical tasks.^34,35 These collective insights illustrated the power of ChatGPT and GPT-4 in answering medical exam questions and their potential for medical decision support. However, previous evaluations were limited to questions without images.

In this evaluation of medical exam questions with images, we found that GPT-4V selected more correct choices than ChatGPT and GPT-4, as shown in Table 1. Consequently, across all questions in the USMLE sample exam, GPT-4V achieved an accuracy of 90.7%, outperforming ChatGPT (58.5%) and GPT-4 (83.8%). The passing standard for the USMLE is typically set at about 60%, indicating that GPT-4V performed at a level similar to a medical graduate in the final year of study. In terms of explanation quality, more than 80% of GPT-4V responses provided an interpretation of the image and of the question in support of the selected answer, regardless of correctness. This suggests that GPT-4V consistently took both the image and the question into account when generating responses. Figure 1 illustrates an example of a high-quality explanation that uses both text and image to answer a hard question. More than 70% of students answered this question incorrectly on the first try, because both bacterial pneumonia and pulmonary embolism may involve symptoms such as cough. To differentiate them, GPT-4V correctly interpreted the X-ray as showing the radiologic sign of Hampton hump, which increased the suspicion of pulmonary infarction rather than pneumonia.³⁶ To test whether the X-ray was needed, as the explanation implied, we removed the image from the input; GPT-4V then switched its answer to bacterial pneumonia while still acknowledging the possibility of pulmonary embolism. This change in response indicates that the explanation about the X-ray was not fabricated and that the model truly needed the X-ray to answer the question. The high quality of GPT-4V explanations was also supported by the preference voting. When explanations written by experts were compared with those generated by GPT-4V, the margin in favor of the expert explanations was small (4 votes) when GPT-4V answered the question correctly (n=95).

We also found that the accuracy of GPT-4V was related to the comprehensiveness of the explanation. GPT-4V offered a comprehensive explanation in 79.0% of correct responses. In contrast, only 7.2% of the wrong answers had a comprehensive explanation. Thus, the absence of key information may be a cause of inaccurate answers. These observations suggest that model performance could be enhanced by training GPT-4V with more intricate, clinically specific data that carry insights from experienced physicians, such as UpToDate,³⁷ and with medical research literature, such as PubMed.³⁸

Image misunderstanding was the primary reason why GPT-4V answered incorrectly. Out of 55 wrong responses, 42 (76.3%) involved a misunderstanding of the image. In comparison, only 10 (18.2%) of the mistakes were attributed to text hallucination. GPT-4V’s proficiency in processing images lagged considerably behind its text-handling capability. To circumvent its image interpretation issue, we additionally prompted GPT-4V with a short hint that described the image. We found that 40.5% (17 out of 42) of these responses switched to the correct answer. The corrections prompted by the hint indicate that GPT-4V can be easily persuaded. Within a conversational interface, medical professionals can readily guide and refine GPT-4V’s initial outputs. This adaptability could be advantageous for physicians, as it allows for real-time adjustments and ensures that the generated information aligns more closely with the clinical context or the specific details of a patient’s case. Doing so enhances the utility and reliability of GPT-4V as an auxiliary tool in medical settings.³⁷

Another drawback of GPT-4V was its tendency to produce factually inaccurate responses, a problem often referred to as hallucination, which is prevalent among large language models.³⁹ We found that 18.2% of the GPT-4V explanations for incorrectly answered questions contained hallucinations. Thus, when designing clinical support tools for high-risk situations such as patient diagnosis, it is crucial to pair GPT-4V with a probabilistic model that provides a confidence interval indicating the reliability of the response.⁴⁰ This would enhance the reliability of the CDSS by indicating when additional physician review is warranted.¹⁵

Limitations

This study has several limitations. First, the applicability of our findings is constrained by the modest sample size. We gathered 226 questions that included images, which might not comprehensively represent all medical disciplines. Second, GPT-4V was trained on online websites and licensed data up until September 2022. Thus, GPT-4V is unable to answer questions that require more recent medical knowledge. However, we did not exclude such questions during evaluation. Furthermore, while GPT-4V has demonstrated proficiency on medical licensing examinations, its real-world applicability, especially in dynamic, user-interactive scenarios, remains untested. Therefore, while the results are promising, extrapolating the efficacy of GPT-4V to broader clinical applications requires appropriate benchmarks and further research.¹⁹

Conclusion

While GPT-4V showcased remarkable accuracy across a spectrum of medical disciplines and varying difficulty levels in this study, it is paramount that further refinement be undertaken, particularly in enhancing its explanatory capabilities prior to any clinical assimilation. Medical students and professionals must be acutely aware of its limitations and consistently cross-verify with authoritative sources. Notably, even with their sophisticated capabilities, state-of-the-art LLMs like GPT-4V are not on the brink of supplanting physicians. Their performance in specialized examinations, though noteworthy, still exhibits imperfections, which can lead to consequential inaccuracies and uncertainties. Coupled with the well-known ethical concerns, the credibility and readiness of LLMs for clinical settings remain under scrutiny. However, the preliminary results are promising, suggesting that the present technology is poised to influence clinical practices. As research and development persist, we anticipate a more extensive and profound integration of AI in the medical domain.

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

GPT Response Below

Let’s go through the information provided and evaluate each option based on the clinical picture and the chest x-ray:

**Clinical Information**:

- 55-year-old woman
- Smoked for 40 years (20 pack-years)
- Recent upper respiratory tract symptoms
- 3-day history of chest pain and cough with blood-tinged sputum
- Chest pain is sharp, stabbing, and worsened with coughing
- History of multiple sclerosis and uses a wheelchair
- On ocrelizumab and dantrolene
- Vital signs show mild fever and elevated pulse
- Auscultation shows crackles in the right lower lung

**Chest X-ray**:

Hampton hump is a radiologic sign which consists of a shallow wedge-shaped opacity in the periphery of the lung.

**Options**:

**Sarcoidosis** - Sarcoidosis often presents with bilateral hilar lymphadenopathy on chest x-ray. The given image doesn’t seem to show this. Additionally, the acute symptoms don’t align well with a diagnosis of sarcoidosis.
**Pericarditis** - This would present with chest pain that’s positional and relieved by sitting forward. The x-ray provided doesn’t suggest pericardial effusion. Unlikely.
**Bacterial pneumonia** - Given her symptoms and the crackles in the right lower lung, this is a possibility. However, the chest x-ray would typically show a lobar consolidation or infiltrate which isn’t clearly seen here.
**Pulmonary embolism** - Given her history of immobility (uses a wheelchair) and the acute onset of sharp, stabbing chest pain exacerbated by coughing, this is a strong contender. The presence of a Hampton Hump should further increase suspicion of a PE.
**Pneumothorax** - This would present with sudden onset chest pain and shortness of breath. On x-ray, you’d expect to see a clear line indicating the collapsed lung and no lung markings beyond that line. This is not evident in the provided image.

**Conclusion**:

Based on the provided information and the chest x-ray, the most likely diagnosis is:

**D: Pulmonary embolism**.

The reason for this choice is the combination of her risk factors (immobility, smoking), her clinical presentation (sharp, stabbing chest pain exacerbated by coughing, blood-tinged sputum), and the fact that the x-ray doesn’t clearly support the other diagnoses.

Conflicts of Interest

The authors declare no conflicts of interest.

Supplementary Material

Here is the link to supplement cases.docx

References

1.
Shortliffe EH, Cimino JJ. Biomedical Informatics: Computer Applications in Health Care and Biomedicine. Springer; 2014.
2.
Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digital Medicine. 2020;3.
3.
Rajpurkar P, Lungren MP. The Current and Future State of AI Interpretation of Medical Images. The New England Journal of Medicine. 2023;388(21):1981–1990.
4.
Aggarwal R, Sounderajah V, Martin G, et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digital Medicine. 2021;4. https://api.semanticscholar.org/CorpusID:233139020
5.
Wang L, Lin ZQ, Wong A. COVID-Net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Scientific Reports. 2020;10. https://api.semanticscholar.org/CorpusID:215768886
6.
Long E, Lin H, Liu Z, et al. An artificial intelligence platform for the multihospital collaborative management of congenital cataracts. Nature Biomedical Engineering. 2017;1. https://api.semanticscholar.org/CorpusID:113460889
7.
Rayan JC, Reddy N, Kan JH, Zhang W, Annapragada AV. Binomial Classification of Pediatric Elbow Fractures Using a Deep Learning Multiview Approach Emulating Radiologist Decision Making. Radiology: Artificial Intelligence. 2019;1(1):e180015.
8.
Bussone A, Stumpf S, O’Sullivan D. The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems. 2015 International Conference on Healthcare Informatics. Published online 2015:160-169.
9.
Panigutti C, Beretta A, Giannotti F, Pedreschi D. Understanding the impact of explanations on advice-taking: a user study for AI-based clinical Decision Support Systems. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. Published online 2022. https://api.semanticscholar.org/CorpusID:248419322
10.
Gaube S, Suresh H, Raue M, et al. Non-task expert physicians benefit from correct explainable AI advice when reviewing X-rays. Scientific Reports. 2023;13(1):1383.
11.
Singh A, Mohammed AR, Zelek JS, Lakshminarayanan V. Interpretation of deep learning using attributions: application to ophthalmic diagnosis. In: Optical Engineering + Applications; 2020. https://api.semanticscholar.org/CorpusID:221616930
12.
Eitel F, Ritter K. Testing the Robustness of Attribution Methods for Convolutional Neural Networks in MRI-Based Alzheimer’s Disease Classification. In: Suzuki K, Reyes M, Syeda-Mahmood T, et al., eds. Interpretability of Machine Intelligence in Medical Image Computing and Multimodal Learning for Clinical Decision Support. Springer International Publishing; 2019:3–11.
13.
Papanastasopoulos Z, Samala RK, Chan HP, et al. Explainable AI for medical imaging: deep-learning CNN ensemble for classification of estrogen receptor status from breast MRI. In: Medical Imaging; 2020. https://api.semanticscholar.org/CorpusID:216291456
14.
Shamout FE, Shen Y, Wu N, et al. An artificial intelligence system for predicting the deterioration of COVID-19 patients in the emergency department. NPJ Digital Medicine. 2021;4. https://api.semanticscholar.org/CorpusID:220968946
15.
Shen Y, Heacock L, Elias J, et al. ChatGPT and Other Large Language Models Are Double-edged Swords. Radiology. Published online 2023:230163.
16.
OpenAI. GPT-4 Technical Report. ArXiv. 2023;abs/2303.08774. https://api.semanticscholar.org/CorpusID:257532815
17.
Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Network Open. 2023;6(10):e2336483. doi:10.1001/jamanetworkopen.2023.36483
18.
Decker H, Trang K, Ramirez J, et al. Large Language Model-Based Chatbot vs Surgeon-Generated Informed Consent Documentation for Common Procedures. JAMA Network Open. 2023;6. https://api.semanticscholar.org/CorpusID:263774434
19.
Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine. Published online 2023. https://api.semanticscholar.org/CorpusID:258375371
20.
Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2022;2.
21.
Thirunavukarasu AJ, Hassan R, Mahmood S, et al. Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care. JMIR Medical Education. 2023;9. https://api.semanticscholar.org/CorpusID:258259005
22.
Cooper AZ, Rodman A. AI and Medical Education - A 21st-Century Pandora’s Box. The New England Journal of Medicine. Published online 2023. https://api.semanticscholar.org/CorpusID:260322445
23.
Hua HU, Kaakour AH, Rachitskaya A, Srivastava S, Sharma S, Mammo DA. Evaluation and Comparison of Ophthalmic Scientific Abstracts and References by Current Artificial Intelligence Chatbots. JAMA Ophthalmology. 2023;141(9):819–824. doi:10.1001/jamaophthalmol.2023.3119
24.
Khader F, Müller-Franzes G, Wang T, et al. Multimodal Deep Learning for Integrating Chest Radiographs and Clinical Parameters: A Case for Transformers. Radiology. 2023;309(1):e230806.
25.
Topol EJ. As artificial intelligence goes multimodal, medical applications multiply. Science. 2023;381(6663):adk6139.
26.
Jared Spataro. Introducing Microsoft 365 Copilot—your copilot for work. Accessed October 15, 2023. https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work/
27.
Zoubin Ghahramani. Introducing PaLM 2. Published October 15, 2023. https://blog.google/technology/ai/google-palm-2-ai-large-language-model/
28.
Yang Z, Li L, Lin K, et al. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). ArXiv. 2023;abs/2309.17421. https://api.semanticscholar.org/CorpusID:263310951
29.
AMBOSS. AMBOSS Question difficulty. Published 10/15/2023. https://support.amboss.com/hc/en-us/articles/360035679652-Question-difficulty
30.
Yu M, Chang S, Zhang Y, Jaakkola T. Rethinking Cooperative Rationalization: Introspective Extraction and Complement Control. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics; 2019:4094–4103. doi:10.18653/v1/D19-1420
31.
Zaidan O, Eisner J, Piatko C. Using “Annotator Rationales” to Improve Machine Learning for Text Categorization. In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. Association for Computational Linguistics; 2007:260–267. https://aclanthology.org/N07-1033
32.
Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology. Published online 2023:230582.
33.
Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Medical Education. 2023;9. https://api.semanticscholar.org/CorpusID:256663603
34.
Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on Medical Challenge Problems. ArXiv. 2023;abs/2303.13375. https://api.semanticscholar.org/CorpusID:257687695
35.
Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. The New England Journal of Medicine. 2023;388(25):2399.
36.
Patel UB, Ward TJ, Kadoch MA, Cham MD. Radiographic features of pulmonary embolism: Hampton’s hump. Postgraduate Medical Journal. 2014;90:420–421.
37.
Lourenco AP, Slanetz PJ, Baird GL. Rise of ChatGPT: It May Be Time to Reassess How We Teach and Test Radiology Residents. Radiology. Published online 2023:231053.
38.
Zhang S, Xu Y, Usuyama N, et al. Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing. ArXiv. 2023;abs/2303.00915. https://api.semanticscholar.org/CorpusID:257280046
39.
Ji Z, Lee N, Frieske R, et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys. 2022;55:1–38.
40.
Jiang S, Xu YY, Lu X. ChatGPT in Radiology: Evaluating Proficiencies, Addressing Shortcomings, and Proposing Integrative Approaches for the Future. Radiology. 2023;308(1):e231335.


Posted October 26, 2023.


Subject Area

Radiology and Imaging


