RT Journal Article
SR Electronic
T1 Comparison of the diagnostic accuracy among GPT-4 based ChatGPT, GPT-4V based ChatGPT, and radiologists in musculoskeletal radiology
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2023.12.07.23299707
DO 10.1101/2023.12.07.23299707
A1 Horiuchi, Daisuke
A1 Tatekawa, Hiroyuki
A1 Oura, Tatsushi
A1 Shimono, Taro
A1 Walston, Shannon L
A1 Takita, Hirotaka
A1 Matsushita, Shu
A1 Mitsuyama, Yasuhito
A1 Miki, Yukio
A1 Ueda, Daiju
YR 2023
UL http://medrxiv.org/content/early/2023/12/09/2023.12.07.23299707.abstract
AB Objective To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4 based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology. Materials and Methods We included 106 “Test Yourself” cases from Skeletal Radiology between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4 based ChatGPT and the medical history and images into GPT-4V based ChatGPT, then both generated a diagnosis for each case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. The diagnostic accuracy rates were determined based on the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4 based ChatGPT, GPT-4V based ChatGPT, and radiologists. Results GPT-4 based ChatGPT significantly outperformed GPT-4V based ChatGPT (p < 0.001) with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106). The diagnostic accuracy of GPT-4 based ChatGPT was comparable to that of the radiology resident but was lower than that of the board-certified radiologist, although the differences were not significant (p = 0.78 and 0.22, respectively). The diagnostic accuracy of GPT-4V based ChatGPT was significantly lower than those of both radiologists (p < 0.001 and < 0.001, respectively). Conclusion GPT-4 based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V based ChatGPT.
While GPT-4 based ChatGPT’s diagnostic performance was comparable to radiology residents, it did not reach the performance level of board-certified radiologists in musculoskeletal radiology. Competing Interest Statement The authors have declared no competing interest. Funding Statement This study was supported by Guerbet. Author Declarations The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This study used the cases published in the journal Skeletal Radiology (URL: https://link.springer.com/journal/256). All data produced in the present work are contained in the manuscript.