RT Journal Article
SR Electronic
T1 Multimodal Large Language Model Passes Specialty Board Examination and Surpasses Human Test-Taker Scores: A Comparative Analysis Examining the Stepwise Impact of Model Prompting Strategies on Performance
JF medRxiv
FD Cold Spring Harbor Laboratory Press
SP 2024.07.27.24310809
DO 10.1101/2024.07.27.24310809
A1 Samaan, Jamil S.
A1 Margolis, Samuel
A1 Srinivasan, Nitin
A1 Srinivasan, Apoorva
A1 Yeo, Yee Hui
A1 Anand, Rajsavi
A1 Samaan, Fadi S.
A1 Mirocha, James
A1 Safavi-Naini, Seyed Amir Ahmad
A1 El Kurdi, Bara
A1 Soroush, Ali
A1 Watson, Rabindra
A1 Gaddam, Srinivas
A1 Elmore, Joann G.
A1 Spiegel, Brennan M.R.
A1 Tatonetti, Nicholas P.
YR 2024
UL http://medrxiv.org/content/early/2024/07/29/2024.07.27.24310809.abstract
AB Background: Large language models (LLMs) have shown promise in answering medical licensing examination-style questions. However, there is limited research on the performance of multimodal LLMs on subspecialty medical examinations. Our study benchmarks the performance of multimodal LLMs enhanced by model prompting strategies on gastroenterology subspecialty examination-style questions and examines how these prompting strategies incrementally improve overall performance. Methods: We used the 2022 American College of Gastroenterology (ACG) self-assessment examination (N=300). This test is typically completed by gastroenterology fellows and established gastroenterologists preparing for the gastroenterology subspecialty board examination. We employed a sequential implementation of model prompting strategies: prompt engineering, retrieval augmented generation (RAG), five-shot learning, and an LLM-powered answer validation revision model (AVRM). GPT-4 and Gemini Pro were tested. Results: Implementing all prompting strategies improved the overall score of GPT-4 from 60.3% to 80.7% and that of Gemini Pro from 48.0% to 54.3%. Unlike Gemini Pro’s, GPT-4’s score surpassed both the 70% passing threshold and the 75% average human test-taker score. Stratification of questions by difficulty showed that the accuracy of both LLMs mirrored that of human examinees, rising as human test-taker accuracy increased. The addition of the AVRM to prompt engineering, RAG, and five-shot learning increased GPT-4’s accuracy by 4.4%. The incremental addition of model prompting strategies improved accuracy for both non-image (57.2% to 80.4%) and image-based (63.0% to 80.9%) questions for GPT-4, but not for Gemini Pro. Conclusions: Our results underscore the value of model prompting strategies in improving LLM performance on subspecialty-level licensing exam questions.
We also present a novel implementation of an LLM-powered reviewer model in the context of subspecialty medicine, which further improved model performance when combined with other prompting strategies. Our findings highlight the potential future role of multimodal LLMs, particularly with the implementation of multiple model prompting strategies, as clinical decision support systems in subspecialty care for healthcare providers.
Competing Interest Statement: Jamil S. Samaan declares that they have no conflict of interest. Samuel Margolis declares that they have no conflict of interest. Nitin Srinivasan declares that they have no conflict of interest. Yee Hui Yeo declares that they have no conflict of interest. Rajsavi Anand declares that they have no conflict of interest. Fadi S. Samaan declares that they have no conflict of interest. James Mirocha declares that they have no conflict of interest. Seyed Amir Ahmad Safavi-Naini received non-significant financial compensation as an R&D associate from AryaspCo. Bara El Kurdi declares that they have no conflict of interest. Ali Soroush declares that they have no conflict of interest. Rabindra Watson declares that they have no conflict of interest. Srinivas Gaddam declares that they have no conflict of interest. Joann G. Elmore declares that they have no conflict of interest. Brennan M.R. Spiegel declares that they have no conflict of interest. Nicholas P. Tatonetti declares that they have no conflict of interest.
Funding Statement: None.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: 2022 American College of Gastroenterology (ACG) self-assessment examination (available at https://education.gi.org/satest/satest_18). I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes.
Abbreviations: ChatGPT, Chat Generative Pre-trained Transformer; LLM, Large language model; AI, Artificial Intelligence; USMLE, United States Medical Licensing Examination; RAG, Retrieval Augmented Generation; AGA, American Gastroenterological Association; ASGE, American Society for Gastrointestinal Endoscopy; AASLD, American Association for the Study of Liver Diseases; AVRM, Answer Validation Revision Model.