PT  - JOURNAL ARTICLE
AU  - Cesur, Turay
AU  - Gunes, Yasin Celal
AU  - Camur, Eren
AU  - Dağlı, Mustafa
TI  - Empowering Radiologists with ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases
AID  - 10.1101/2024.06.25.24309247
DP  - 2024 Jan 01
TA  - medRxiv
PG  - 2024.06.25.24309247
4099  - http://medrxiv.org/content/early/2024/06/25/2024.06.25.24309247.short
4100  - http://medrxiv.org/content/early/2024/06/25/2024.06.25.24309247.full
AB  - Purpose This study evaluated the diagnostic accuracy and differential diagnosis capabilities of 12 Large Language Models (LLMs), one cardiac radiologist, and three general radiologists in cardiac radiology. The impact of ChatGPT-4o assistance on radiologist performance was also investigated.Materials and Methods We collected publicly available 80 “Cardiac Case of the Month’’ from the Society of Thoracic Radiology website. LLMs and Radiologist-III were provided with text-based information, whereas other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx Score) were analyzed using the chi-square, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.Results The unassisted diagnostic accuracy of the cardiac radiologist was 72.5%, General Radiologist-I was 53.8%, and General Radiologist-II was 51.3%. With ChatGPT-4o, the accuracy improved to 78.8%, 70.0%, and 63.8%, respectively. The improvements for General Radiologists-I and II were statistically significant (P≤0.006). All radiologists’ DDx scores improved significantly with ChatGPT-4o assistance (P≤0.05). Remarkably, Radiologist-I’s GPT-4o-assisted diagnostic accuracy and DDx Score were not significantly different from the Cardiac Radiologist’s unassisted performance (P&amp;gt;0.05).Among the LLMs, Claude 3.5 Sonnet and Claude 3 Opus had the highest accuracy (81.3%), followed by Claude 3 Sonnet (70.0%). Regarding the DDx Score, Claude 3 Opus outperformed all models and Radiologist-III (P&amp;lt;0.05). The accuracy of the general radiologist-III significantly improved from 48.8% to 63.8% with GPT4o-assistance (P&amp;lt;0.001).Conclusion ChatGPT-4o may enhance the diagnostic performance of general radiologists for cardiac imaging, suggesting its potential as a valuable diagnostic support tool. Further research is required to assess its clinical integration.Competing Interest StatementThe authors have declared no competing interest.Funding StatementThis study did not receive any funding.Author DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesI confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.YesAll data used in the present study are available upon reasonable request to the authorsAbbreviationsAIArtificial IntelligenceDDxDifferential DiagnosisLLMsLarge Language ModelsMRIMagnetic Resonance ImagingSTARDStandards for Reporting Diagnostic Accuracy Studies