Abstract
Background: This study evaluates the performance of several AI models, including DeepSeek, in diagnosing corneal diseases, glaucoma, and neuroophthalmologic disorders.
Methods: We retrospectively selected 53 case reports from the Department of Ophthalmology and Visual Sciences at the University of Iowa, comprising 20 corneal disease cases, 11 glaucoma cases, and 22 neuroophthalmology cases. The case descriptions were input into DeepSeek, ChatGPT4.0, ChatGPT01, and Qwens 2.5 Max. The models' responses were compared with diagnoses rendered by human experts (corneal specialists, glaucoma attendings, and neuroophthalmologists). We determined diagnostic accuracy and interobserver agreement, the latter defined as the percentage-point difference between each AI model's performance and the average human expert performance.
Results: DeepSeek achieved an overall diagnostic accuracy of 79.2%, with specialty-specific accuracies of 90.0% in corneal diseases, 54.5% in glaucoma, and 81.8% in neuroophthalmology. ChatGPT01 outperformed the other models with an overall accuracy of 84.9% (85.0% in corneal diseases, 63.6% in glaucoma, and 95.5% in neuroophthalmology), while Qwens exhibited a lower overall accuracy of 64.2% (55.0% in corneal diseases, 54.5% in glaucoma, and 77.3% in neuroophthalmology). Interobserver agreement analysis showed that in corneal diseases, DeepSeek differed from the human expert average by -3.3% (90.0% vs 93.3%), ChatGPT01 by -8.3%, and Qwens by -38.3%. In glaucoma, DeepSeek exceeded the human expert average by +3.0% (54.5% vs 51.5%), ChatGPT4.0 and ChatGPT01 each exceeded it by +12.1%, and Qwens was +3.0% above it. In neuroophthalmology, DeepSeek and ChatGPT4.0 were 9.1% lower than the human average, ChatGPT01 exceeded it by +4.6%, and Qwens was 13.6% lower.
Conclusions: ChatGPT01 demonstrated the highest overall diagnostic accuracy, especially in neuroophthalmology, while DeepSeek and ChatGPT4.0 showed comparable performance. Qwens underperformed relative to the other models, especially in corneal diseases. Although these AI models exhibit promising diagnostic capabilities, they currently lag behind human experts in certain areas, underscoring the need to integrate them collaboratively with clinical judgment.
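The accuracy and agreement figures reported for DeepSeek can be reproduced with simple arithmetic. The sketch below is illustrative only. The per-specialty case totals and the human expert averages for corneal diseases and glaucoma are taken from the abstract, but the counts of correct diagnoses are an assumption, back-calculated from the reported percentages rather than stated by the authors. The function and variable names are likewise hypothetical.

```python
# Case totals per specialty, as stated in the abstract.
CASES = {"cornea": 20, "glaucoma": 11, "neuro": 22}

# ASSUMPTION: correct-diagnosis counts for DeepSeek, inferred from the
# reported percentages (e.g. 90.0% of 20 cases = 18). Not stated by the authors.
DEEPSEEK_CORRECT = {"cornea": 18, "glaucoma": 6, "neuro": 18}

# Human expert averages (%) that the abstract states explicitly.
HUMAN_AVG = {"cornea": 93.3, "glaucoma": 51.5}


def accuracy(correct, total):
    """Diagnostic accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


def gap_vs_humans(model_acc, human_acc):
    """Percentage-point difference between model and human expert accuracy."""
    return round(model_acc - human_acc, 1)


per_specialty = {s: accuracy(DEEPSEEK_CORRECT[s], CASES[s]) for s in CASES}
overall = accuracy(sum(DEEPSEEK_CORRECT.values()), sum(CASES.values()))
gaps = {s: gap_vs_humans(per_specialty[s], HUMAN_AVG[s]) for s in HUMAN_AVG}

print(per_specialty)  # per-specialty accuracy (%)
print(overall)        # overall accuracy (%) across all 53 cases
print(gaps)           # percentage-point gap relative to human experts
```

Under these assumed counts, the overall accuracy is pooled across all 53 cases rather than averaged over the three specialties, which is consistent with the 79.2% reported for DeepSeek.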
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes