Summary
Background and Objectives Recent advancements in large language models (LLMs) such as GPT-3.5 and GPT-4 have shown impressive potential in a wide array of applications, including healthcare. While GPT-3.5 and GPT-4 showed heterogeneous results across specialized medical board examinations, the performance of these models in neurology board exams remains unexplored.
Methods An exploratory, prospective study was conducted between May 17 and May 31, 2023. The evaluation utilized a question bank approved by the American Board of Psychiatry and Neurology, designed as part of a self-assessment program. Questions were presented in a single-best-answer, multiple-choice format. The results from the question bank were validated with a small cohort of questions from the European Board for Neurology. All questions were categorized into lower-order (recall, understanding) and higher-order (apply, analyze, synthesize) questions. The performance of GPT-3.5 and GPT-4 was assessed in relation to overall performance, question type, and topic. In addition, the confidence level in responses and the reproducibility of correctly and incorrectly answered questions were evaluated. Univariable analysis was carried out. The chi-squared test with Bonferroni correction was used to determine performance differences based on question characteristics. To differentiate characteristics of correctly and incorrectly answered questions, a high-dimensional t-SNE analysis of the question representations was performed.
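As an illustration of the statistical comparison described in the Methods, the following minimal sketch computes a Pearson chi-squared test on a 2x2 table and applies a Bonferroni correction. It uses the GPT-4 lower-order and higher-order counts reported in the Results. The function names, the choice to omit a continuity correction, and the number of comparisons are assumptions made for this example and are not taken from the study.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]].

    Assumption: no continuity (Yates) correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


# GPT-4 counts from the Results: correct answers out of all questions.
lower_correct, lower_total = 790, 893
higher_correct, higher_total = 872, 1063

chi2, p = chi_squared_2x2(
    lower_correct, lower_total - lower_correct,
    higher_correct, higher_total - higher_correct,
)

# Hypothetical number of comparisons; the abstract does not state it.
n_comparisons = 2
p_adjusted = bonferroni(p, n_comparisons)

print(f"chi2 = {chi2:.2f}, p = {p:.2e}, Bonferroni-adjusted p = {p_adjusted:.2e}")
```

The Bonferroni step multiplies each p-value by the number of tests performed, so the adjusted value depends directly on how many comparisons are counted.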
Results In May 2023, GPT-3.5 correctly answered 66.8% of 1956 questions, whereas GPT-4 performed better, correctly answering 85% of questions; these scores correspond to near-passing and passing the neurology board exam, respectively. GPT-4’s performance surpassed both GPT-3.5 and question bank users (mean human user score: 73.8%). An analysis of twenty-six question categories showed that GPT-4 outperformed human users in Behavioral, Cognitive and Psych-related questions and demonstrated superior performance to GPT-3.5 in six categories. Both models performed better on lower-order than higher-order questions according to Bloom's taxonomy for learning and assessment (GPT-4: 790 of 893 (88.5%) vs. 872 of 1063 (82%); GPT-3.5: 639 of 893 (71.6%) vs. 667 of 1063 (62.7%)), with GPT-4 outperforming GPT-3.5 on both lower-order and higher-order questions. Confident language was used consistently by both models, even when incorrect (GPT-4: 99.3%, 292 of 294 incorrect answers; GPT-3.5: 100%, 650 of 650 incorrect answers). Reproducible answers of GPT-3.5 and GPT-4 (defined as more than 75% same output across 50 independent queries) were associated with a higher percentage of correct answers (GPT-3.5: 66 of 88 (75%); GPT-4: 78 of 96 (81.3%)) than inconsistent answers (GPT-3.5: 5 of 13 (38.5%); GPT-4: 1 of 4 (25%)). Lastly, the high-dimensional embedding analysis of correctly and incorrectly answered questions revealed no clear differentiation into distinct clusters.
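The reproducibility criterion above (more than 75% same output across 50 independent queries) can be sketched as follows. This is one possible reading, assuming that "same output" refers to the share of the most frequent answer choice; the function names and the sample answers are hypothetical and do not come from the study.

```python
from collections import Counter


def modal_share(answers):
    """Return the most frequent answer and its share of all answers."""
    if not answers:
        raise ValueError("answers must not be empty")
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)


def is_reproducible(answers, threshold=0.75):
    """True if the most frequent answer makes up more than `threshold` of outputs."""
    _, share = modal_share(answers)
    return share > threshold


# Hypothetical outputs for one question queried 50 times.
answers = ["B"] * 40 + ["C"] * 10
top_answer, share = modal_share(answers)
print(top_answer, share, is_reproducible(answers))
```

Under this reading, a question whose modal answer appears in 38 or more of 50 queries counts as reproducible, and one with 37 or fewer counts as inconsistent.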
Discussion Despite the absence of neurology-specific training, GPT-4 demonstrated commendable performance, whereas GPT-3.5 performed slightly below the average human question bank user. Higher-order cognitive tasks proved more challenging for both GPT-4 and GPT-3.5. Nevertheless, GPT-4’s performance was equivalent to a passing grade for specialized neurology board exams. These findings suggest that with further refinements, LLMs like GPT-4 could play a pivotal role in applications for clinical neurology and healthcare in general.
Competing Interest Statement
W.W. is inventor of the patent WO2017020982A1, "Agents for use in the treatment of glioma". This patent covers new treatment strategies that all target the formation and function of TMs in glioma.
Funding Statement
None.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes