Zero-Shot Multimodal Question Answering for Assessment of Medical Student OSCE Physical Exam Videos

Michael J. Holcomb; Shinyoung Kang; Ameer Shakur; Sol Vedovato; David Hein; Thomas O. Dalton; Krystle K. Campbell; Daniel J. Scott; Gaudenz Danuser; Andrew R. Jamieson

doi:10.1101/2024.06.05.24308467

Abstract

The Objective Structured Clinical Examination (OSCE) is a critical component of medical education in which the data gathering, clinical reasoning, physical examination, diagnostic, and planning capabilities of medical students are assessed in a simulated outpatient clinical setting, with standardized patient actors (SPs) playing the role of patients with a predetermined diagnosis, or case. This study is the first to explore the zero-shot automation of physical exam grading in OSCEs by applying multimodal question answering techniques to the analysis of audiovisual recordings of simulated medical student encounters. Employing a combination of large multimodal models (LLaVA-1.6 7B, 13B, and 34B; GPT-4V; and GPT-4o), automatic speech recognition (Whisper v3), and large language models (LLMs), we assess the feasibility of applying these component systems to the domain of student evaluation without any retraining. Our approach converts video content into textual representations, comprising transcripts of the audio component and structured descriptions of selected video frames generated by the multimodal model. These representations, referred to as “exam stories,” are then used as context for an abstractive question-answering problem posed to an LLM. A collection of 191 audiovisual recordings of medical student encounters with an SP for a single OSCE case was used as a test bed for exploring relevant features of successful exams. During this case, the students should have performed three physical exams: 1) mouth exam, 2) ear exam, and 3) nose exam. Each examination was scored from the audiovisual recordings by two trained, non-faculty standardized patient evaluators (SPEs), and an experienced, non-faculty SPE adjudicated disagreements. The percentage agreement between the described methods and the SPEs’ determination of exam occurrence varied from 26% to 83%.
The audio-only methods, which relied exclusively on the transcript for exam recognition, achieved uniformly higher agreement than both the image-only methods and the combined methods across model sizes. The outperformance of the transcript-only methods was strongly linked to the presence of key phrases with which the student-physician would “signpost” the progression of the physical exam for the standardized patient, either alerting the patient that an examination was about to begin or giving the patient instructions. Multimodal models offer tremendous opportunity for improving the workflow of physical examination evaluation, for example by saving time and guiding focus for better assessment. While these models offer the promise of unlocking audiovisual data for downstream analysis with natural language processing methods, our findings reveal a gap between the off-the-shelf AI capabilities of many available models and the nuanced requirements of clinical practice, highlighting a need for further development and enhanced evaluation protocols in this area. We are actively pursuing a variety of approaches to realize this vision.
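The agreement measure reported above can be illustrated with a minimal sketch. It assumes that each exam (mouth, ear, nose) is reduced to a boolean "exam occurred" label per recording, and that the reference label is the two evaluators' shared score or, when they disagree, the adjudicator's score. The function names `reference_label` and `percent_agreement` and the toy labels are hypothetical and are not taken from the study's code or data.

```python
from typing import Optional, Sequence


def reference_label(rater_a: bool, rater_b: bool,
                    adjudicator: Optional[bool] = None) -> bool:
    """Resolve two evaluator scores into one reference label.

    Assumption: when the two evaluators agree, their shared score is used;
    otherwise the adjudicator's score decides.
    """
    if rater_a == rater_b:
        return rater_a
    if adjudicator is None:
        raise ValueError("evaluators disagree and no adjudicated score was given")
    return adjudicator


def percent_agreement(predicted: Sequence[bool],
                      reference: Sequence[bool]) -> float:
    """Percentage of recordings where the method matches the reference label."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one recording is required")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


# Toy labels for four hypothetical recordings of a single exam type.
toy_reference = [
    reference_label(True, True),
    reference_label(True, False, adjudicator=True),
    reference_label(False, False),
    reference_label(False, True, adjudicator=False),
]
toy_predicted = [True, False, False, True]
toy_score = percent_agreement(toy_predicted, toy_reference)
```

Raw percentage agreement does not correct for chance or for class imbalance, so it is best read alongside the base rate of each exam in the data.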

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

The authors gratefully acknowledge Azure compute credits provided by Microsoft as part of an Accelerating Foundation Model Research grant award (https://www.microsoft.com/en-us/research/project/afmr-cognition-and-societal-benefits/).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethics committee/IRB of UT Southwestern Medical Center gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Footnotes

Added experimental results with GPT-4o.

Data Availability

Raw OSCE audio-video recording data of the students (FERPA protected) are not publicly available. Privacy-sensitive analytical outputs may be made available upon request to the authors.