Assessing DxGPT: Diagnosing Rare Diseases with Various Large Language Models

Juanjo do Olmo; Javier Logroño; Carlos Mascías; Marcelo Martínez; Julián Isla

doi:10.1101/2024.05.08.24307062

Abstract

Diagnosing rare diseases is a significant challenge in healthcare, with patients often experiencing long delays and misdiagnoses. The large number of rare diseases, and the difficulty doctors face in being familiar with all of them, contribute to this problem. Artificial intelligence, particularly large language models (LLMs), has shown promise in improving the diagnostic process: the extensive knowledge encoded in these models can help doctors navigate the complexities of diagnosing rare diseases.

Foundation 29 presents a comprehensive evaluation of DxGPT, a web-based platform designed to assist healthcare professionals in the diagnostic process for rare diseases. The platform currently uses GPT-4, but this study also compares its performance with other large language models, including Claude 3, Gemini 1.5 Pro, Llama, Mistral, Mixtral, and Cohere Command R+. DxGPT is not a medical device; it is a decision support tool intended to aid clinical reasoning.

This study extends beyond initial synthetic patient cases, incorporating real-world data from the RAMEDIS and Peking Union Medical College Hospital (PUMCH) datasets. The analysis used two main metrics: Strict Accuracy (P1), how often the first diagnostic suggestion matched the actual diagnosis, and Top-5 Accuracy (P1 + P5), how often the correct diagnosis appeared among the top five suggestions. The results show a complex picture of diagnostic accuracy, with performance varying significantly across models and datasets:

On the synthetic dataset, closed models like GPT-4, Claude, and Gemini exhibited relatively high accuracy. Open models like Llama 3 and Mixtral performed reasonably well, though lagging behind.
On the RAMEDIS rare disease cases, the Claude 3 Opus model achieved 55% Strict Accuracy and 70% Top-5 Accuracy, outperforming the other closed models. Open models like Llama 3 and Mixtral showed moderate accuracy.
The PUMCH dataset proved challenging for all models, with the highest Strict Accuracy at 59.46% (GPT-4 Turbo 1106) and Top-5 Accuracy at 64.86%.
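
The two metrics defined above can be illustrated with a minimal sketch. It assumes each test case has already been scored with the 1-based position at which the correct diagnosis appeared among the five suggestions, or None if it was absent; it further assumes that P5 counts cases where the correct diagnosis appeared in positions 2 to 5, so that P1 + P5 covers the whole top five. The function name and the example ranks are hypothetical and are not taken from the study's code or data.

```python
from typing import Optional, Sequence, Tuple


def accuracy_metrics(ranks: Sequence[Optional[int]]) -> Tuple[float, float]:
    """Return (strict_accuracy, top5_accuracy) for a list of scored cases.

    Each entry in `ranks` is the 1-based position of the correct diagnosis
    among the model's five suggestions, or None if it was not suggested.
    """
    if not ranks:
        raise ValueError("at least one scored case is required")

    total = len(ranks)
    # P1: correct diagnosis was the first suggestion.
    p1 = sum(1 for r in ranks if r == 1)
    # P5 (assumed meaning): correct diagnosis was in positions 2 to 5.
    p5 = sum(1 for r in ranks if r is not None and 2 <= r <= 5)

    strict_accuracy = p1 / total
    top5_accuracy = (p1 + p5) / total
    return strict_accuracy, top5_accuracy


# Hypothetical ranks for ten cases (illustrative only, not study data).
example_ranks = [1, 3, None, 1, 5, None, 2, 1, None, 1]
strict, top5 = accuracy_metrics(example_ranks)
print(f"Strict Accuracy: {strict:.2%}, Top-5 Accuracy: {top5:.2%}")
```

Under this reading, Top-5 Accuracy can never be lower than Strict Accuracy, since every P1 case is also counted in the top five.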

These findings demonstrate the potential of DxGPT and LLMs in improving diagnostic methods for rare diseases. However, they also emphasize the need for further validation, particularly in real-world clinical settings, and comparison with human expert diagnoses. Successful integration of AI into medical diagnostics will require collaboration between researchers, clinicians, and regulatory bodies to ensure safety, efficacy, and ethical deployment.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

Foundation 29 received a grant from Takeda to develop pilot #1 for the Global Rare Disease Commission https://www.globalrarediseasecommission.com/Report/. Pilot #1 explores how to use artificial intelligence for rare disease (RD) diagnosis. The work produced is publicly available at www.dx29.ai. Foundation 29 received a grant from GW Pharma to develop www.dx29.ai. This is an open-source, free-of-charge tool for physicians to accelerate time to diagnosis for patients with rare diseases. The work produced is publicly available at www.dx29.ai. Foundation 29 received a grant from UCB Pharma to develop https://dxgpt.app/. This is an open-source tool based on the GPT-4 Azure OpenAI model and is free of charge for physicians and patients, to accelerate time to diagnosis for patients with rare diseases. The work produced is publicly available at https://dxgpt.app/ and https://github.com/foundation29org/dxgpt_testing. Foundation 29 received a grant from Italfarmaco to develop https://dxgpt.app/. This is an open-source tool based on the GPT-4 Azure OpenAI model and is free of charge for physicians and patients, to accelerate time to diagnosis for patients with rare diseases. The work produced is publicly available at https://dxgpt.app/ and https://github.com/foundation29org/dxgpt_testing. These grants are not related to any of these pharmaceutical companies' products. This study was funded by all four of these grants as part of DxGPT development.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

https://huggingface.co/datasets/chenxz/RareBench

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes