Abstract
Purpose To evaluate the application of Retrieval-Augmented Generation (RAG), a technique that combines information retrieval with text generation, to benchmark the performance of open-source and proprietary generative large language models (LLMs) in medical question-answering tasks within the ophthalmology domain.
Methods Our dataset comprised 260 multiple-choice questions sourced from two question-answer banks designed to assess ophthalmic knowledge: the American Academy of Ophthalmology’s Basic and Clinical Science Course (BCSC) Self-Assessment program and OphthoQuestions. Our RAG pipeline first retrieved documents from the BCSC companion textbook using ChromaDB, then reranked them with Cohere to refine the context provided to the LLMs. We benchmarked four models, GPT-4 and three open-source models (Llama-3-70B, Gemma-2-27B, and Mixtral-8×7B, all under 4-bit quantization), under three settings: zero-shot, zero-shot with Chain-of-Thought (zero-shot-CoT), and RAG. Model performance was evaluated using accuracy on the two datasets. Quantization was applied to improve the efficiency of the open-source models, and the effect of quantization level was also measured.
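The retrieve-then-rerank flow described in the Methods can be illustrated with a minimal sketch. This is not the authors' pipeline: the bag-of-words similarity below is a toy stand-in for ChromaDB's vector search, the term-coverage scorer is a toy stand-in for Cohere's reranker, and the function names (`retrieve`, `rerank`, `build_prompt`), the prompt wording, and the example passages are all hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words counts (toy stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, passages, k):
    """Stage 1: broad recall of k candidates (the role ChromaDB plays)."""
    q = bow(question)
    return sorted(passages, key=lambda p: cosine(q, bow(p)), reverse=True)[:k]


def rerank(question, candidates, n):
    """Stage 2: rescore candidates and keep n (the role Cohere plays).

    Assumption: scores by the fraction of question terms a passage covers.
    """
    q_terms = set(question.lower().split())

    def coverage(passage):
        return len(q_terms & set(passage.lower().split())) / max(len(q_terms), 1)

    return sorted(candidates, key=coverage, reverse=True)[:n]


def build_prompt(question, options, context):
    """Assemble a multiple-choice prompt (hypothetical wording)."""
    ctx = "\n".join(f"- {c}" for c in context)
    opts = "\n".join(f"{label}. {text}" for label, text in options)
    return (
        f"Context:\n{ctx}\n\nQuestion: {question}\n{opts}\n"
        "Answer with a single letter."
    )


# Illustrative passages, not taken from the BCSC textbook.
passages = [
    "acute angle closure glaucoma presents with a painful red eye and a mid-dilated pupil",
    "the cornea provides most of the refractive power of the eye",
    "cataract surgery replaces the opaque lens with an intraocular lens",
]
question = "which condition presents with a painful red eye and a mid-dilated pupil"
options = [("A", "acute angle closure glaucoma"), ("B", "cataract")]

candidates = retrieve(question, passages, k=2)
context = rerank(question, candidates, n=1)
prompt = build_prompt(question, options, context)
```

Separating the two stages lets a cheap first pass favour recall while a slower, more precise scorer decides which few passages actually reach the model's context window.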
Results Using RAG, GPT-4-turbo’s accuracy increased from 80.38% to 91.92% on BCSC and from 77.69% to 88.65% on OphthoQuestions. Importantly, the RAG pipeline greatly improved the overall accuracy of the open-source models: Llama-3 from 57.50% to 81.35% (a 23.85 percentage-point increase), Gemma-2 from 62.12% to 79.23% (17.11 percentage points), and Mixtral-8×7B from 52.89% to 75% (22.11 percentage points). Zero-shot-CoT yielded no significant overall improvement in the models’ performance. 4-bit quantization was shown to be as effective as 8-bit quantization while requiring half the resources.
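The gains quoted for the open-source models are absolute differences between two accuracies, that is, percentage points rather than relative percentages. A short sketch of the arithmetic, using only the figures reported above (the dictionary layout and the function name `gains` are illustrative assumptions):

```python
# Overall accuracy (%) without and with RAG, as reported in the Results.
reported = {
    "Llama-3": (57.50, 81.35),
    "Gemma-2": (62.12, 79.23),
    "Mixtral-8x7B": (52.89, 75.00),
}


def gains(baseline, with_rag):
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = with_rag - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


for model, (baseline, with_rag) in reported.items():
    absolute, relative = gains(baseline, with_rag)
    print(f"{model}: +{absolute:.2f} percentage points ({relative:.1f}% relative)")
```

The absolute values reproduce the 23.85, 17.11, and 22.11 figures in the text; the relative gains are larger because each is measured against a baseline well below 100%.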
Conclusion Our work demonstrates that integrating RAG significantly enhances LLM accuracy, especially for smaller, privacy-preserving open-source LLMs that can be run in sensitive and resource-constrained environments such as hospitals, offering a viable alternative to cloud-based LLMs like GPT-4-turbo.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
- UCL UKRI Centre for Doctoral Training in AI-enabled Healthcare studentship (EP/S021612/1).
- NIHR AI Award AI_AWARD02488.
SL acknowledges support from Medical Research Council Clinical Research Training Fellowship (MR/X006271/1).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Some data produced in the present study are available upon reasonable request to the authors.