Abstract
The integration of Artificial Intelligence (AI) in radiology presents opportunities to enhance diagnostic processes. OpenAI’s multimodal GPT-4 (GPT-4V) can now analyze both images and textual data. This study evaluates GPT-4V’s performance in interpreting radiological images across a variety of modalities, anatomical regions, and pathologies. Fifty-two anonymized diagnostic images were analyzed using GPT-4V, and the results were compared with board-certified radiologists’ interpretations. GPT-4V correctly recognized the imaging modality in all cases. The model’s performance in identifying pathologies and anatomical regions was inconsistent and varied between modalities and anatomical regions. Overall accuracy for anatomical region identification was 69.2% (36/52), ranging from 0% (0/16) in US images to 100% in X-ray (15/15) and CT (21/21) images. The model correctly identified pathologies in 30.6% of cases (11/36), ranging from 0% (0/9) in US images to 66.7% (8/12) in X-rays. The findings of this study indicate that, despite its potential, multimodal GPT-4 is not yet a reliable tool for radiological image interpretation. Our study provides a baseline for future improvements in multimodal LLMs and highlights the importance of continued development to achieve reliability in radiology.
Introduction
Artificial Intelligence (AI) is transforming medicine, offering significant advancements especially in data-centric fields like radiology. Its ability to refine diagnostic processes and improve patient outcomes marks a revolutionary shift in medical workflows.
Radiology, heavily reliant on visual data, is a prime field for AI integration1. AI’s ability to analyze complex images offers significant diagnostic support, potentially easing radiologist workloads by automating routine tasks and efficiently identifying key pathologies2. The increasing use of publicly available AI tools in clinical radiology has integrated these technologies into the operational core of radiology departments3–5.
Among AI’s diverse applications, Large Language Models (LLMs) have gained prominence, particularly GPT-4 from OpenAI, noted for its advanced language understanding and generation6–15. A notable recent advancement of GPT-4 is its multimodal ability to analyze images alongside textual data (GPT-4V)16. The potential applications of this feature can be substantial, specifically in radiology where the integration of imaging findings and clinical textual data is key to accurate diagnosis. Thus, the purpose of this study was to evaluate the performance of GPT-4V for the analysis of radiological images across various imaging modalities and pathologies.
Methods
The Sheba Medical Center Institutional Review Board (IRB) approved this study and waived the requirement for informed consent.
Dataset Selection
We systematically reviewed all imaging examinations from one consecutive week as recorded in Sheba Medical Center’s radiology information system (RIS). Our selection criteria aimed to include cases that would be considered resident-level in terms of diagnostic clarity and complexity. The inclusion of clear-cut cases was intended to ensure a focused evaluation of the AI’s interpretive capabilities without the confounding variables of ambiguous or borderline findings.
A senior body imaging radiologist in conjunction with a radiology resident performed the case collection. We selected a total of 52 images, which represented a balanced cross-section of modalities including computed tomography (CT), ultrasound (US), and X-ray (Table 1). These images spanned various anatomical regions and pathologies, chosen to reflect a spectrum of common and critical findings appropriate for resident-level interpretation.
To protect patient confidentiality, each image was anonymized prior to analysis. This process involved the removal of all identifying information, so that the subsequent analysis focused solely on the clinical content of the images.
AI Interpretation with GPT-4 Multimodal
Using OpenAI’s web interface, GPT-4V was prompted to analyze each image. The specific prompt used was “We are conducting a study to evaluate GPT-4 image recognition abilities in healthcare. Identify the condition and describe key findings in the image.” This prompt was designed to elicit detailed interpretations of the imaging findings. The senior radiologist and the resident reviewed the AI interpretations in consensus and compared them to the imaging findings.
To evaluate GPT-4V’s performance, we checked for the accurate recognition of modality type, anatomical location, and pathology identification. Errors were classified as omissions, incorrect identifications, or hallucinations of pathology.
Data Analysis
The analysis was performed using Python version 3.10. Statistical significance was defined as a p-value of less than 0.05. The primary metrics were the accuracies of modality, anatomical region, and pathology identification, each expressed as the percentage of correct identifications. A qualitative analysis of GPT-4V’s answers was also performed. Fisher’s exact test was used to assess differences in the ability of GPT-4V to identify anatomical locations and pathologies across imaging modalities.
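The scoring scheme and accuracy metric described above can be sketched as follows. This is an illustration only: the record layout, the names `ErrorType`, `ScoredImage`, and `accuracy`, and the three example records are hypothetical and are not taken from the study's code or data. The sketch assumes that pathology accuracy is computed over pathological images only, which matches the 11/36 denominator reported in the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class ErrorType(Enum):
    """The three error classes named in the Methods (labels are illustrative)."""
    OMISSION = "omission"
    INCORRECT = "incorrect identification"
    HALLUCINATION = "hallucination"


@dataclass(frozen=True)
class ScoredImage:
    """Consensus score for one image (hypothetical record layout)."""
    modality: str                      # "CT", "US", or "X-ray"
    modality_correct: bool
    region_correct: bool
    # None marks a normal image, which has no pathology to identify.
    pathology_correct: Optional[bool]
    errors: tuple = ()


def accuracy(flags: Iterable[Optional[bool]]) -> float:
    """Percentage of correct identifications, skipping non-applicable (None) entries."""
    applicable = [f for f in flags if f is not None]
    if not applicable:
        raise ValueError("no applicable cases")
    return 100.0 * sum(applicable) / len(applicable)


# Three made-up records for illustration only (not study data).
records = [
    ScoredImage("X-ray", True, True, True),
    ScoredImage("CT", True, True, False, (ErrorType.OMISSION,)),
    ScoredImage("US", True, False, None, (ErrorType.HALLUCINATION,)),
]

print(f"modality:  {accuracy(r.modality_correct for r in records):.1f}%")   # 100.0%
print(f"region:    {accuracy(r.region_correct for r in records):.1f}%")     # 66.7%
print(f"pathology: {accuracy(r.pathology_correct for r in records):.1f}%")  # 50.0%
```

Keeping the pathology flag optional lets normal images count toward the modality and region metrics without inflating or deflating the pathology denominator.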
Results
Distribution of Imaging Modalities
The dataset consisted of 52 diagnostic images categorized by modality (CT, X-ray, US), anatomical region, and pathology. The results are summarized in Table 1. Overall, 36 images (69.2%) were pathological and 16 (30.8%) were normal.
GPT-4V Performance in Imaging Modality and Anatomical Region Identification
GPT-4V identified the imaging modality correctly in all cases (52/52, 100%), across computed tomography (CT), ultrasound (US), and X-ray images (Table 2).
For anatomical region identification, the model correctly identified the region in all X-ray and CT images and in none of the US images (p<.001; Table 2).
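As a worked check of the anatomical region figures, the per-modality counts given in the abstract (X-ray 15/15, CT 21/21, US 0/16) can be pooled as in the sketch below. The variable and function names are illustrative assumptions.

```python
# (correct, total) counts for anatomical region identification, from the abstract.
region_counts = {"X-ray": (15, 15), "CT": (21, 21), "US": (0, 16)}


def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


per_modality = {m: pct(c, n) for m, (c, n) in region_counts.items()}

# Pool the raw counts across modalities rather than averaging the percentages.
pooled_correct = sum(c for c, _ in region_counts.values())
pooled_total = sum(n for _, n in region_counts.values())
overall = pct(pooled_correct, pooled_total)

print(per_modality)                                     # {'X-ray': 100.0, 'CT': 100.0, 'US': 0.0}
print(f"{pooled_correct}/{pooled_total} = {overall}%")  # 36/52 = 69.2%
```

The overall figure is a pooled proportion, so it weights each modality by its number of images. An unweighted mean of the three per-modality percentages would give 66.7% instead.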
Pathology Identification Accuracy
Pathology identification accuracy differed notably across imaging modalities (Figure 1).
Pathology identification accuracy for CT scans was 20.0% (3/15), while no pathologies were correctly identified in US images (0/9, 0%).
X-rays showed a higher identification accuracy of 66.7% (8/12), which was significantly higher than that of both US (p = 0.005) and CT (p = 0.022).
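The pairwise comparisons above can be reproduced with a short pure-Python version of Fisher's exact test. This is a sketch under stated assumptions: the paper does not say which library or implementation was used, the function name `fisher_exact_two_sided` is hypothetical, and the two-sided p-value is defined here in the common way, as the total probability of all tables with the same margins that are no more likely than the observed one.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lo = max(0, row1 + col1 - n)   # smallest feasible top-left cell
    hi = min(row1, col1)           # largest feasible top-left cell

    def weight(k: int) -> int:
        # Unnormalised integer weight of the table whose top-left cell is k.
        return comb(col1, k) * comb(n - col1, row1 - k)

    observed = weight(a)
    # Integer comparison avoids floating-point ties between equally likely tables.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return float(Fraction(extreme, comb(n, row1)))


# Rows: modality; columns: (pathology identified, pathology missed).
p_xray_vs_us = fisher_exact_two_sided(8, 4, 0, 9)    # X-ray 8/12 vs US 0/9
p_xray_vs_ct = fisher_exact_two_sided(8, 4, 3, 12)   # X-ray 8/12 vs CT 3/15

print(f"X-ray vs US: p = {p_xray_vs_us:.3f}")  # p = 0.005
print(f"X-ray vs CT: p = {p_xray_vs_ct:.3f}")  # p = 0.022
```

With the reported counts, this definition yields the same p-values as those stated in the text. That is consistent with a standard two-sided Fisher's exact test, but it does not establish which implementation the authors used.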
Examples of cases from the GPT-4V image analysis are presented in Figure 2.
Error analysis across imaging modalities, detailed in Table 3, highlights specific trends. US images exhibited a notably high rate of false positive or hallucinated pathologies at 13/16 (81.3%), and a high overall mistake rate of 16/16 (100%). CT scans showed an overall mistake rate of 15/21 (71.4%). X-rays showed the lowest error rates across all mistake types (46.7%).
Error Analysis
A recurrent error in US imaging was the misidentification of normal testicular structures as renal or liver pathologies, which occurred six times. In CT interpretations, GPT-4V hallucinated bladder-related pathologies in three cases when assessing scans obtained for other conditions, such as ascites and metastases. In X-ray analysis, the model tended towards over-diagnosis and mislocalization of opacities, an error observed in three instances.
Discussion
This study offers a detailed evaluation of multimodal GPT-4 performance in radiological image analysis. GPT-4V correctly identified all imaging modalities. The model was inconsistent in identifying anatomical regions and pathologies, and wrongly identified anatomy and pathology in all US images. Consequently, GPT-4V, as it currently stands, cannot be relied upon for radiological interpretation.
However, the instances in which GPT-4V accurately identified pathologies show promise and suggest considerable potential with further refinement. The ability to integrate textual and visual data is novel and has many potential applications in healthcare, and in radiology in particular. Radiologists interpreting imaging examinations rely on imaging findings alongside the clinical context of each patient. It has been established that clinical information and context can improve the accuracy and quality of radiology reports17.
Similarly, the ability of LLMs to integrate clinical correlation with visual data marks a revolutionary step. This integration not only mirrors the decision making process of physicians, but also has the potential to ultimately surpass current image analysis algorithms which are mainly based on convolutional neural networks (CNNs)18,19.
GPT-4V represents a new technological paradigm in radiology, characterized by its ability to understand context, learn from minimal data (zero-shot or few-shot learning), reason, and provide explanatory insights. These features mark a significant advancement from the traditional AI applications in the field. Furthermore, its ability to textually describe and explain images is impressive and, as the algorithm improves, may eventually enhance medical education.
This study has several limitations. First, this was a retrospective analysis of patient cases, and the results should be interpreted accordingly. Second, there is potential for selection bias due to subjective case selection by the authors. Finally, we did not evaluate the performance of GPT-4V in image analysis when textual clinical context was provided, as this was outside the scope of this study.
To conclude, despite its vast potential, multimodal GPT-4 is not yet a reliable tool for clinical radiological image interpretation. Our study provides a baseline for future improvements in multimodal LLMs and highlights the importance of continued development to achieve clinical reliability in radiology.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.