Conformal Triage for Medical Imaging AI Deployment
==================================================

* Anastasios N. Angelopoulos
* Stuart Pomerantz
* Synho Do
* Stephen Bates
* Christopher P. Bridge
* Daniel C. Elton
* Michael H. Lev
* R. Gilberto González
* Michael I. Jordan
* Jitendra Malik

## Abstract

**Background** The deployment of black-box AI models in medical imaging presents significant challenges, especially in maintaining reliability across different clinical settings. These challenges are compounded by distribution shifts that can lead to failures in reproducing the accuracy attained during the AI model’s original validations.

**Method** We introduce the conformal triage algorithm, designed to categorize patients into low-risk, high-risk, and uncertain groups within a clinical deployment setting. This method leverages a combination of a black-box AI model and conformal prediction techniques to offer statistical guarantees of predictive power for each group. The high-risk group is guaranteed to have a high positive predictive value, while the low-risk group is assured a high negative predictive value. Prediction sets are never constructed; instead, conformal techniques directly assure high accuracy in both groups, even in clinical environments different from those in which the AI model was originally trained, thereby ameliorating the challenges posed by distribution shifts. Importantly, a representative data set of exams from the testing environment is required to ensure statistical validity.

**Results** The algorithm was tested using a head CT model previously developed by Do and col-leagues [9] and a data set from Massachusetts General Hospital. The results demonstrate that the conformal triage algorithm provides reliable predictive value guarantees to a clinically significant extent, reducing the number of false positives from 233 (45%) to 8 (5%) while only abstaining from prediction on 14% of data points, even in a setting different from the training environment of the original AI model.

**Conclusions** The conformal triage algorithm offers a promising solution to the challenge of deploying black-box AI models in medical imaging across varying clinical settings. By providing statistical guarantees of predictive value for categorized patient groups, this approach significantly enhances the reliability and utility of AI in optimizing medical imaging workflows, particularly in neuroradiology.

## 1 Introduction

AI models for medical imaging often fail to match their initial training and testing accuracy when implemented clinically [5, 6]. They struggle to *generalize*: they cannot perform as well on new data as on the data used for development. This may be due to differences in demographics—such as age, sex, acuity of presentation, and disease prevalence—as well as technical differences in hardware and protocols from the carefully curated datasets normally used to train the AI [10, 13]. The performance gap may be immediately evident, or drifting conditions may cause it to appear over time [12]. Such degradation can result in real patient harm and undermine trust in continued use of the algorithm.

The present work proposes applying an uncertainty quantification (UQ) technique called *Learn-then-Test* [2] assure *high predictive value* of a black-box AI in different clinical environments. This is achieved using recent representative imaging data from the local site. Specifically, we split patients into a low-risk group with a high negative predictive value, a high-risk group with a high positive predictive value, and an uncertain group, building on a previous approach developed at the Massachusetts General Hospital wherein neuroradiologists decide these groupings manually [3, 18]. In our new approach, these groupings are decided by a black-box machine learning algorithm, using imaging as input, in a way that endows the groupings with statistical guarantees. We call this pipeline *conformal triage*, as the triage is determined by an AI algorithm wrapped by a conformal prediction-type procedure [17, 1, 2]. Compared to existing techniques [7, 3, 18], conformal triage is post hoc (it does not require any retraining/fine-tuning), and does not make any assumptions about the form of the model or data distribution. Thus, we can provide formal guarantees of high predictive value even if there has been drift from the original training conditions.1

We demonstrate conformal predictive triage on a modification of the head CT model previously developed at Massachusetts General Hospital by Do and colleagues [9, 4]. The model was developed to detect intracranial hemorrhage, but when applied to real-world imaging data, it was found to also identify other major intracranial abnormalities (ICs) such as brain tumors—an unexpected, clinically useful capability. The head CT model generates a per-exam probability of a significant intracranial abnormality revealed by CT. Conformal predictive triage is then performed—critically, using a calibration set of representative site exams—to provide a statistical guarantee on both the positive predictive value (PPV) and negative predictive value (NPV) of the algorithm. Patients classified as high risk will be positive at the rate of the chosen PPV target, and similarly, those classified as low risk are guaranteed to be negative with high probability. In exchange for these guarantees, the algorithm is allowed to abstain on the most uncertain cases; the number of abstentions is determined by how stringent the PPV/NPV thresholds are.

This procedure allows the uncertainty of the model to transfer to a new local patient population, while retaining formal statistical rigor. This system is useful in neuroradiologist workflow optimization, e.g., ensuring the cases with the highest likelihood of positive findings are prioritized for diagnostic review. If properly implemented, this could benefit broader hospital operations and improve wait and discharge times.

### Contributions

Our work is the first to propose and perform a deployment and analysis of a *distribution-free* selective classification technique in the field of medicine. Methodologically, we show how to use selective classification to perform statistically valid triage. The selective classification approach, critical for downstream guarantees on the triage algorithm, distinguishes our approach from other forms of conformal prediction that have been applied in clinical medicine, e.g. [8, 14, 11], which generally output prediction sets and thus cannot be used to give triage guarantees (see [16] and [7] for helpful reviews). Our experiments provide a large-scale validation of the approach on real medical data from Mass General Brigham; to our knowledge, this is the first empirical investigation of this family of approaches in real medical imaging data not carefully curated a priori.

## 2 Methods

The standard method for tuning the performance of a binary classifier is to trade off PPV and NPV using the receiver operating characteristic (ROC) curve. Our methodology, by contrast, does not require trading off PPV and NPV; both are anchored to a pre-specified level using the calibration data, and the algorithm abstains from prediction if it is not sufficiently confident. If the calibration set is representative of future data, then our algorithm comes with formal guarantees that the PPV and NPV chosen will not fall below the desired operating thresholds. We present the mathematical details of this conformal predictive triage procedure in Section 2.1.

We applied conformal predictive triage to our IC Detection algorithm on two retrospective CT exam datasets from Massachusetts General Hospital, which we labeled ourselves; the details of these datasets are in Section 2.2. We have open-sourced the model outputs on these datasets, along with the ground truth labels, in order to facilitate the reproducibility of our results.2

### 2.1 Conformal predictive triage

#### Statistical setup

We use a form of selective classification to ensure the PPV and NPV are calibrated. Our statistical model is as follows. We consider a calibration data set of *n* pairs of CT scans *X*i and binary labels *Y*i in the set {−, +}, signifying ‘IC-negative’ and ‘IC-positive’ cases. We receive a new test scan *X*, for which there is an *unknown* binary label *Y* which we seek to predict. In this paper, we assume that the data points (*X*1, *Y*1), …, (*X*n, *Y*n), (*X, Y*) are exchangeable—invariant to ordering—which is a mathematical way to capture the intuitive notion that the calibration data is representative of the test scan.

#### Models

In order to perform the prediction, we require access to two models. The first is a deep-learning-based IC detector *g* that takes the images from a head CT exam as input and outputs a probability of IC for each slice of the scan. The second is an aggregator function *h* that maps the output of *g* to a number between 0 and 1 signifying the estimated probability of IC in the entire brain. The IC classifier *g* is the ML model developed by Do and colleagues in [9, 4], and we experiment with different functions *h*, including isotonic and logistic regressions; see Section 3. For convenience, we give the name *f* = *h* ∘ *g* to the chain of models (i.e., *f* (*x*) = *h*(*g*(*x*))).

#### Formal Goal

The goal is to achieve both a high PPV and a high NPV by allowing the model to abstain on the cases that exceed the uncertainty threshold, i.e., for which the statistical guarantee for PPV or NPV can not be met. Towards that end, we introduce a parameter λ that controls how many abstentions (non-predictions) are made. In particular, we set *Ŷ* (*x*), the prediction of *Y*, to be ![Formula][1]</img>  and we let λ = (*λ*FPR, *λ*FNR) denote the pair of parameters. The goal is to select a data-driven value ![Graphic][2]</img> satisfying, with high probability, that the false positive rate (FPR) and the false negative rate (FNR) are controlled: ![Formula][3]</img>  and ![Formula][4]</img>  for some user-chosen parameters *α*PPV and *α*NPV. The parameter ![Graphic][5]</img> will be selected based on a calibration dataset that is assumed to be representative of future data points. We want to guarantee, without any further assumptions on the form of the data distribution or the model *f*, that the PPV and NPV are greater than or equal to 1 − *α*PPV and 1 − *α*NPV, respectively. This assumption-light statistical approach allows our method to apply to any model and data distribution, as long as an appropriate calibration set is available.

#### Method

We build upon the Learn then Test variant of conformal prediction developed in [2]. The method constructs discrete grid, Λ = {0, 1*/m*, …, (*m* − 1)*/m*, 1} from ![Graphic][6]</img> is selected. For false positive rate control, ![Formula][7]</img>  where ![Formula][8]</img>  and the function BinoUCBδ is the upper end of the binomial confidence interval constructed at level 1 − *δ*. The second coordinate, ![Graphic][9]</img>, is constructed analogously. The user can elect to apply a Bonferroni correction (i.e., to choose both coordinates with level *δ/*2) to achieve a simultaneous guarantee for PPV and NPV. This strategy is generally statistically tight.

### 2.2 Description of Data Sets

We perform experiments with two data sets of head CT scans collected at Massachusetts General Hospital along with a pre-trained, pre-existing intracranial hemorrhage detector *f* built on historical data from the hospital system; details are referred to the prior work [9, 4]. In both datasets, the feature *X*i is a CT scan with a variable number of slices. The label *Y*i is binary, corresponding to an indicator {−, +} of an IC. The first data set 𝒟1 is a painstakingly curated data set consisting of 827 consecutive CT exams which were scrutinized by a senior, board-certified neuroradiologist with over 30 years of experience to produce the labels *Y*i—in this sense, the dataset truly contains gold-standard labels. The second data set 𝒟2 contains 9122 consecutive CT exams whose ground truth labels were extracted by a regular-expression-based pattern-matching algorithm on the radiology report text, which was then manually verified by a senior neuroradioloigst (SRP) with greater than 20 years experience. Each data point also contains a binary indicator of diagnostic indeterminacy—cases where the neuroradiologist that originally read the scan was not certain of the disease state—and also an indicator of parsing indeterminacy, i.e., cases where the original neuroradiologist may have understood the disease state, but did not communicate it in a clear way in their notes. The data set 𝒟2 is of reasonably high quality, representing the standard of care that patients receive at a hospital; not all scans are read carefully by experienced neuroradiologists, and thus there may be more errors in these labels than in 𝒟1.

## 3. Experiments

### Setup

We use several methods of regression. To estimate the function *h* which is chained with the original slice-wise risk predictor *g*, we evaluate the use of isotonic regression, a logistic regression model, and a simple thresholding classifier. All three algorithms work reasonably well, and we focus on the thresholding classifier for the sake of simplicity. In the figures, models are trained using 3000 data points from 𝒟2 and evaluated on the remaining data points from 𝒟2 as well as those from 𝒟1. Below are given representative quantitative results on both 𝒟1 and the validation split of 𝒟2, and also plots of the accuracy as a function of *λ* in order to understand the qualitative behavior of the procedure and its tightness (in general, it is slightly conservative). It should be noted that the choice of ![Graphic][10]</img> is optimized on 𝒟2, so the data in 𝒟1 is subject to a shifting distribution. On 𝒟2, we also report results on the subset of data points without any diagnostic or parsing indeterminacies (this subset is labeled ‘no ID’).

Our main point of comparison is a heuristically derived set of thresholds manually designed by the expert neuroradiologist. The neuroradiologist developed a hand-designed thresholding rule on a consecutive series of ED head CTs. It was tested on a subsequent 1000 consecutive cases from the ED. The rule is that if *h*(*x*) assigns any three slices to have an estimated probability of at least 0.9, then we set *f* (*x*) = +. If not, and if additionally there are no more than a single slice above 0.6, then *f* (*x*) = −. Otherwise, *f* (*x*) is set to equal ?. This is a strong baseline in-distribution on 𝒟1 (the in-distribution data), but it is limited in reproducibility and generalizability to different and changing patient populations. Our machine learning approach can achieve a similar PPV/NPV tradeoff, while remaining tunable to different PPV/NPV set points. It is a scalable and repeatable procedure for calibrating AI algorithms in real hospital scenarios that avoids the need for a neuroradiologist to hand-design such algorithms.

The plots include enough information to reconstruct the output of our procedure for any choices of *α*PPV and *α*NPV. As a visual aid, we plot as gray lines the value of ![Graphic][11]</img> chosen with *α*PPV = 0.3 and *α*NPV = 0.05. This mirrors the use case of triage discussed in the introduction: the NPV threshold is substantially more stringent so as to avoid missing positive cases. Intuitively, the ramification of this choice is that the procedure will be more aggressive in identifying positive cases than negative ones.

### Top-line result

Conformal triage effectively calibrates the previously developed model to a *current* set of patient exams, enabling high-confidence PPV and NPV compared to the original model, without the need for retraining and an acceptable rate of abstention (non-prediction). The exact numbers depend on the desired tradeoff between abstention rate and accuracy. On the most extreme end, the calibrated isotonic regression procedure can provide a formal PPV guarantee of at least 95% and NPV of at least 95%, with only 4.9% of data points abstained; see the third row from the bottom of Table 2. Results on this extreme end tend to be unstable due to the small number of samples, so in our plots, we calibrate to a PPV of 90% and an NPV of 95% using the logistic regression procedure, leading to an abstention rate of 14% on the validation data. By standard ROC analysis, an NPV of 95% would yield PPV of 55%, a decrease of 40 percentage points. On our data, this is a decrease from to 233 to 8 false positives—a clinically significant result.

#### Comparison of different regression methods

Tables 1 and 2 include results on all the regression algorithms. The baseline performance of the model using the heuristically derived thresholds is included in the caption of the tables. Our algorithm matches or exceeds the performance of the hand-designed algorithm. It is important to notice that on 𝒟1 the hand-designed algorithm has a PPV of 0.92, while on 𝒟2 the PPV is 0.7. This is a large drop in performance that is caused by the distribution shift in the population of the two datasets, and it is exactly the sort of situation we want to solve with our methodology. The calibrated version achieves the desired statistical guarantee of PPV and NPV for all settings of *α*PPV and *α*NPV.

View this table:
[Table 1:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/T1)

Table 1: Numerical results of PPV, NPV, and abstention (non-prediction) rate as a function of model type and desired PPV and NPV guarantee level.
Results are computed on 𝒟1. The baseline hand-designed rule achieves a PPV of 0.92, an NPV of 0.94, and an abstention rate of 0.17.

View this table:
[Table 2:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/T2)

Table 2: Numerical results of PPV, NPV, and abstention (non-prediction) rate as a function of model type and desired PPV and NPV guarantee level.
Results are computed on 𝒟2. The baseline hand-designed rule achieves a PPV of 0.7, an NPV of 0.96, and an abstention rate of 0.14.

#### Tracing the PPV and NPV as a function of *λ*

We plot the empirical PPV and NPV on test data as a function of *λ* in Figures 1 and 2. The figures indicate that the NPV and PPV are controlled at the desired level on 𝒟2. The control is more conservative on the PPV side than for the NPV side, as there are fewer positive data points; the conservativeness would go away with the use of more data. On the other hand, on 𝒟1, the PPV is controlled, but the NPV is not. The root cause is that the calibration data has more negatives than 𝒟1, since 𝒟1 is higher quality and is less likely to have positives that are missed. This is further evidence of the imperative to collect good ground-truth data and a calibration set that represents the future patient population to ensure the proper error rate. Finally, we note that although the PPV is controlled, it is quite conservative; this is because positive cases are relatively rare in our data, so there is not enough information to calibrate the PPV to a very stringent level.

![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/11/2024.02.09.24302543/F1.medium.gif)

[Figure 1:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/F1)

Figure 1: PPV as a function of the threshold.
The horizontal axis is the parameter *λ*PPV. The vertical axis is the PPV, i.e., the fraction of the scans labeled + that are indeed +. The gray dotted line indicates the target PPV, which is guaranteed to hold with probability at least 1 − *α*PPV. The left plot is computed on 𝒟1 and the right is on 𝒟2.

![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/11/2024.02.09.24302543/F2.medium.gif)

[Figure 2:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/F2)

Figure 2: NPV as a function of the threshold.
The horizontal axis is the parameter *λ*NPV. The vertical axis is the NPV, i.e., the fraction of the scans labeled − that are indeed −. The left plot is computed on 𝒟1 and the right is on 𝒟2.

#### Evaluating the True Positive Rate (TPR) and True Negative Rate (TNR)

Next we look at the TPR and TNR to understand how many positive and negative cases are assigned predictions (as opposed to the abstention/non-prediction category, ‘?’). The PPV often trades off with the TPR, as increasing the number of abstentions improves the former and hurts the latter (since fewer positive cases are recovered). The analogous principle holds for negative cases. Figures 3 and 4 illustrate this tradeoff.

![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/11/2024.02.09.24302543/F3.medium.gif)

[Figure 3:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/F3)

Figure 3: TPR as a function of the threshold.
The horizontal axis is the parameter *λ*PPV. The vertical axis is the TPR, i.e., the fraction of actually positive scans labeled positive. The left plot is computed on 𝒟1 and the right is on 𝒟2.

![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/11/2024.02.09.24302543/F4.medium.gif)

[Figure 4:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/F4)

Figure 4: TNR as a function of the threshold.
The horizontal axis is the parameter *λ*NPV. The vertical axis is the TNR, i.e., the fraction of actually negative scans labeled negative. The left plot is computed on 𝒟1 and the right is on 𝒟2.

#### Checking the marginal proportion of + and – cases

It is of interest to check what fraction of the total cases are assigned to the classes + and – as a function of *λ*—this is called the *marginal proportion of positive (or negative) cases*. Of course, setting *λ*PPV = 0 would result in all cases being assigned to the class +, and then the marginal proportion of positives would be 1. This is clearly undesirable because it means a large false positive rate. However, it may be necessary to over-predict the class + in order to achieve the safety guarantee, due to the stringent choice of *α*NPV; this is what the data in Figures 5 and 6 indicate. Meanwhile, the negative class is slightly under-predicted as *α*PPV is less stringent. These effects are not particularly extreme, but are tunable by changing the *α* parameters, as one can witness by examining the shape of the curves.

![Figure 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/11/2024.02.09.24302543/F5.medium.gif)

[Figure 5:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/F5)

Figure 5: Marginal proportion of cases assigned + as a function of the threshold.
The horizontal axis is the parameter *λ*PPV. The vertical axis is the fraction of cases assigned to the class +. The left plot is computed on 𝒟1 and the right is on 𝒟2.

![Figure 6:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2024/02/11/2024.02.09.24302543/F6.medium.gif)

[Figure 6:](http://medrxiv.org/content/early/2024/02/11/2024.02.09.24302543/F6)

Figure 6: Marginal proportion of cases assigned - as a function of the threshold.
The horizontal axis is the parameter *λ*NPV. The vertical axis is the fraction of cases assigned to the class -. In green is shown the true population fraction of - cases. The left plot is computed on 𝒟1 and the right is on 𝒟2.

## 4 Discussion

Conformal uncertainty quantification provides a non-heuristic approach to accurate medical AI in varying conditions, yielding better decision-making and risk management. This provides guarantees of reliable performance using an automated algorithm that can be re-run as the data distribution changes, even if the base model cannot be retrained. The requirement is a small set of exams representative of the population. Different strategies for tuning our PPV and NPV guarantees enable different clinical use-cases, as described below.

### Rule-out triage

This approach prioritizes minimizing the false-negative rates to effectively ‘rule out’ non-IC cases. Setting *α**NPV* to be small leads to a high accuracy in the negative group. The implication is that most of the actual negative cases are correctly identified, and these can be prepared for discharge, while safely directing hospital resources towards the more urgent cases.

### Rule-in triage

This strategy focuses on minimizing false-positive rates to ‘rule in’ potential IC cases. This involves setting *α**PPV* to be quite small, leading to a high accuracy in the positive group. The implication is that the diagnosed positive cases are actual positives, minimizing the resources spent on false alarms, and ensuring that neuroradiologists can read the most urgent scans quickly.

Both the rule-out and rule-in triage approaches can be achieved with only *one* of the PPV and NPV control parameters, i.e., one of the coordinates of *λ*. This is akin to a more rigorous form of ROC curve analysis. However, our approach also allows for simultaneous control of PPV and NPV, similar to picking two points on the ROC curve simultaneously.

### Real-time triage

This approach assures *both* high specificity and sensitivity, providing real-time insights that are neither too conservative nor too liberal. It involves picking both *α*PPV and *α*NPV to achieve both forms of control simultaneously, while leaving the middle group as uncertain. It facilitates more dynamic decision-making and can adapt to changing circumstances and data distributions.

The algorithms that we have presented are feasible in practice, do not require manual intervention, and are straightforward to implement. The data requirements are limited; the *n* required for calibration is not very large unless the levels *α*PPV and *α*NPV are quite stringent. Importantly, the approach is post hoc, meaning it can work with any AI model with no requirements on the training/development of that model. This makes it an attractive tool for deployment of medical imaging AI, and potentially also as a tool for regulators [15] to give basic guidelines that the models follow rigorous validation protocols.

## Data Availability

All data produced are available online at [https://github.com/aangelopoulos/conformal-triage](https://github.com/aangelopoulos/conformal-triage).

[https://github.com/aangelopoulos/conformal-triage](https://github.com/aangelopoulos/conformal-triage) 

## Footnotes

*   {angelopoulos{at}berkeley.edu,michael_jordan{at}berkeley.edu,malik{at}berkeley.edu}, stuartpomerantz{at}gmail.com, mlev{at}partners.org, {delton{at}mgh.harvard.edu, sdo{at}mgh.harvard.edu, cbridge{at}mgh.harvard.edu, rggonzalez{at}mgh.harvard.edu}

*   1 Importantly, the guarantee assumes access to a small amount of representative calibration data from the new domain.

*   2 Code available at [https://github.com/aangelopoulos/conformal-triage](https://github.com/aangelopoulos/conformal-triage).

*   Received February 9, 2024.
*   Revision received February 9, 2024.
*   Accepted February 11, 2024.


*   © 2024, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/)

## References

1.  [1].Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification, 2021.
    
    
2.  [2]. Anastasios N Angelopoulos,  Stephen Bates,  Emmanuel J Candès,  Michael I Jordan, and  Lihua Lei. Learn then Test: Calibrating predictive algorithms to achieve risk control. arXiv preprint arXiv:2110.01052, 2021.
    
    
3.  [3]. Michelle Chua,  Doyun Kim,  Jongmun Choi,  Nahyoung G Lee,  Vikram Deshpande,  Joseph Schwab,  Michael H Lev,  Ramon G Gonzalez,  Michael S Gee, and  Synho Do. Tackling prediction uncertainty in machine learning for healthcare. Nature Biomedical Engineering, 7(6):711–718, 2023.
    
    
4.  [4]. Synho Do,  Michael Lev, and  Ramon Gilberto Gonzalez. Systems and methods for brain hemorrhage classification in medical images using an artificial intelligence network, June 28 2022. US Patent 11,373,750.
    
    
5.  [5]. Jean Feng,  Rachael V Phillips,  Ivana Malenica,  Andrew Bishara,  Alan E Hubbard,  Leo A Celi, and  Romain Pirracchio. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. NPJ Digital Medicine, 5(1):66, 2022.
    
    
6.  [6]. Christopher J Kelly,  Alan Karthikesalingam,  Mustafa Suleyman,  Greg Corrado, and  Dominic King. Key challenges for delivering clinical impact with artificial intelligence. BMC medicine, 17:1–9, 2019.
    
    
7.  [7]. Benjamin Kompa,  Jasper Snoek, and  Andrew L Beam. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digital Medicine, 4(1):4, 2021.
    
    
8.  [8]. Bhawesh Kumar,  Anil Palepu,  Rudraksh Tuwani, and  Andrew Beam. Towards reliable zero shot classification in self-supervised models with conformal prediction. arXiv preprint arXiv:2210.15805, 2022.
    
    
9.  [9]. Hyunkwang Lee,  Sehyo Yune,  Mohammad Mansouri,  Myeongchan Kim,  Shahein H Tajmir,  Claude E Guerrier,  Sarah A Ebert,  Stuart R Pomerantz,  Javier M Romero,  Shahmir Kamalian, et al. An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets. Nature biomedical engineering, 3(3):173–182, 2019.
    
    
10. [10]. Matthew J Leming,  Esther E Bron,  Rose Bruffaerts,  Yangming Ou,  Juan Eugenio Iglesias,  Randy L Gollub, and  Hyungsoon Im. Challenges of implementing computer-aided diagnostic models for neuroimages in a clinical setting. NPJ Digital Medicine, 6(1):129, 2023.
    
    
11. [11]. Charles Lu,  Andréanne Lemay,  Ken Chang,  Katharina Höbel, and  Jayashree Kalpathy-Cramer. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 12008–12016, 2022.
    
    
12. [12]. Natalia Norori,  Qiyang Hu,  Florence Marcelle Aellen,  Francesca Dalia Faraci, and  Athina Tzovara. Addressing bias in big data and ai for health care: A call for open science. Patterns, 2(10), 2021.
    
    
13. [13]. Sandosh Padmanabhan,  Tran Quoc Bao Tran, and  Anna F Dominiczak. Artificial intelligence in hypertension: seeing through a glass darkly. Circulation Research, 128(7):1100–1118, 2021.
    
    
14. [14]. Harris Papadopoulos,  Alex Gammerman, and  Volodya Vovk. Reliable diagnosis of acute abdominal pain with conformal prediction. Engineering Intelligent Systems, 17(2):127, 2009.
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000281470100007&link_type=ISI) 

15. [15]. US Food and  Drug Administration. Marketing submission recommendations for a predetermined change control plan for artificial intelligence. Machine Learning (AI/ML) Enabled Device Software Functions, 2023.
    
    
16. [16]. Janette Vazquez and  Julio C Facelli. Conformal prediction in clinical medical sciences. Journal of Healthcare Informatics Research, 6(3):241–252, 2022.
    
    
17. [17]. Vladimir Vovk,  Alex Gammerman, and  Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005.
    
    
18. [18]. Byung C Yoon,  Stuart R Pomerantz,  Nathaniel D Mercaldo,  Swati Goyal,  Eric M L’Italien,  Michael H Lev,  Karen A Buch,  Bradley R Buchbinder,  John W Chen,  John Conklin, et al. Incorporating algorithmic uncertainty into a clinical machine deep learning algorithm for urgent head CTs. PLOS one, 18(3):e0281900, 2023.

 [1]: /embed/graphic-1.gif
 [2]: /embed/inline-graphic-1.gif
 [3]: /embed/graphic-2.gif
 [4]: /embed/graphic-3.gif
 [5]: /embed/inline-graphic-2.gif
 [6]: /embed/inline-graphic-3.gif
 [7]: /embed/graphic-4.gif
 [8]: /embed/graphic-5.gif
 [9]: /embed/inline-graphic-4.gif
 [10]: /embed/inline-graphic-5.gif
 [11]: /embed/inline-graphic-6.gif