Abstract
Introduction The inability of Large Language Models (LLMs) to communicate uncertainty is a significant barrier to their use in medicine. Before LLMs can be integrated into patient care, the field must assess methods to measure uncertainty in ways that are useful to physician-users.
Objective Evaluate the ability of uncertainty metrics to quantify LLM confidence in diagnosis and treatment selection tasks by assessing their discrimination and calibration.
Methods We examined the discrimination and calibration of Confidence Elicitation, Token-Level Probabilities, and Sample Consistency metrics across GPT3.5, GPT4, Llama2-70B and Llama3-70B. Uncertainty metrics were evaluated against three datasets of open-ended patient scenarios.
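To make the two evaluated properties concrete, the following is a minimal sketch of how discrimination and calibration could be scored for any uncertainty metric that outputs a confidence in [0, 1]. The abstract does not name the statistics used, so the choice of AUROC for discrimination and expected calibration error (ECE) for calibration is an assumption here, as are the function names and the bin count.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Discrimination (assumed statistic): the probability that a randomly
    chosen correct answer received higher confidence than a randomly chosen
    incorrect answer. Ties count as half."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Calibration (assumed statistic): the average gap between mean
    confidence and observed accuracy over equal-width confidence bins,
    weighted by the number of answers in each bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece
```

Under these definitions a metric can rank correct answers above incorrect ones perfectly (AUROC of 1.0) while still being poorly calibrated, which is why the two properties are assessed separately.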
Results Sample Consistency methods outperformed Token-Level Probability and Confidence Elicitation methods. Sample Consistency by sentence embedding cosine similarity achieved the highest discrimination performance but poor calibration, while Sample Consistency by GPT annotation achieved the second-best discrimination with more accurate calibration. Nearly all uncertainty metrics discriminated better on diagnosis questions than on treatment selection questions. Verbalized confidence (Confidence Elicitation) consistently over-estimated model confidence.
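As an illustration of the best-discriminating metric, the sketch below scores Sample Consistency from the embeddings of several answers sampled for the same question. It assumes the score is the mean pairwise cosine similarity; the abstract does not state the exact aggregation, and the function names are hypothetical. The embeddings would come from a sentence embedding model, which is not specified here, so plain lists of floats stand in for them.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def sample_consistency(embeddings: Sequence[Sequence[float]]) -> float:
    """Assumed aggregation: mean cosine similarity over all pairs of
    sampled answers. Higher agreement is read as higher confidence."""
    if len(embeddings) < 2:
        raise ValueError("need at least two sampled answers")
    sims = [cosine_similarity(u, v) for u, v in combinations(embeddings, 2)]
    return sum(sims) / len(sims)
```

A raw similarity score of this kind is not a probability of correctness, which is consistent with the metric discriminating well while needing re-calibration against reference cases, as noted in the Conclusions.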
Conclusions Sample Consistency methods were the best-performing metrics for assessing LLM uncertainty in medical diagnosis and treatment selection. We suggest Sample Consistency by sentence embedding cosine similarity if the user has a set of reference cases with which to re-calibrate their results, and Sample Consistency by GPT annotation if the user does not have reference cases and requires accurate raw calibration. Our results also confirm that LLMs are consistently over-confident when verbalizing their confidence through Confidence Elicitation.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
We have no funding sources to report applicable to this project.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study used the MedQA dataset, which is publicly available (https://doi.org/10.48550/arXiv.2009.13081). We also used the New England Journal of Medicine Case Report Series, which is also publicly published. Finally, we created a third dataset of fictional simulated patient data de novo for this study.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data is available for review at: https://doi.org/10.6084/m9.figshare.25962529.v1