Abstract
Saliency methods, which “explain” deep neural networks by producing heat maps that highlight the areas of the medical image that influence model prediction, are often presented to clinicians as an aid in diagnostic decision-making. Although many saliency methods have been proposed for medical imaging interpretation, rigorous investigation of the accuracy and reliability of these strategies is necessary before they are integrated into the clinical setting. In this work, we quantitatively evaluate seven saliency methods—including Grad-CAM, Grad-CAM++, and Integrated Gradients—across multiple neural network architectures using two evaluation metrics. We establish the first human benchmark for chest X-ray segmentation in a multilabel classification set up, and examine under what clinical conditions saliency maps might be more prone to failure in localizing important pathologies compared to a human expert benchmark. We find that (i) while Grad-CAM generally localized pathologies better than the other evaluated saliency methods, all seven performed significantly worse compared with the human benchmark; (ii) the gap in localization performance between Grad-CAM and the human benchmark was largest for pathologies that had multiple instances, were smaller in size, and had shapes that were more complex; (iii) model confidence was positively correlated with Grad-CAM localization performance. While it is difficult to know whether poor localization performance is attributable to the model or to the saliency method, our work demonstrates that several important limitations of saliency methods must be addressed before we can rely on them for deep learning explainability in medical imaging.
Introduction
Deep learning has enabled automated medical imaging interpretation at the level of practicing experts in some settings1–3. While the potential benefits of automated diagnostic models are numerous, lack of model interpretability in the use of “black-box” deep neural networks (DNNs) represents a major barrier to clinical trust and adoption4–6. In fact, it has been argued that the European Union’s recently adopted General Data Protection Regulation (GDPR) affirms an individual’s right to an explanation in the context of automated decision-making7. Although the importance of DNN interpretability is widely acknowledged and many techniques have been proposed, little emphasis has been placed on how best to quantitatively evaluate these explainability methods8.
One type of DNN interpretation strategy widely used in the context of medical imaging is based on saliency (or pixel-attribution) methods9–12. Saliency methods produce heat maps highlighting the areas of the medical image that most influenced the DNN’s prediction. Since saliency methods provide post-hoc interpretability of models that are never exposed to bounding box annotations or pixel-level segmentations during training, they are particularly useful in the context of medical imaging where ground-truth segmentations can be especially time-consuming and expensive to obtain. The heat maps help to visualize whether a DNN is concentrating on the same regions of a medical image that a human expert would focus on, rather than concentrating on a clinically irrelevant part of the medical image or even on confounders in the image13–15. Saliency methods have been widely used for a variety of medical imaging tasks and modalities including, but not limited to, visualizing the performance of a convolutional neural network (CNN) in predicting (1) myocardial infarction16 and hypoglycemia17 from electrocardiograms, (2) visual impairment18, refractive error19, and anaemia20 from retinal photographs, (3) long-term mortality21 and tuberculosis22 from chest X-ray (CXR) images, and (4) appendicitis23 and pulmonary embolism24 on computed tomography scans. However, recent work has shown that saliency methods used to validate model predictions can be misleading in some cases and may lead to increased bias and loss of user trust in high-stakes contexts such as healthcare25–28. Therefore, a rigorous investigation of the accuracy and reliability of these strategies is necessary before they are integrated into the clinical setting29.
In this work, we perform a systematic evaluation of seven common saliency methods in medical imaging (Grad-CAM30, Grad-CAM++31, Integrated Gradients32, Eigen-CAM41, DeepLIFT42, Layer-Wise Relevance Propagation43, and Occlusion44) using three common CNN architectures (DenseNet12133, ResNet15234, Inception-v435). In doing so, we establish the first human benchmark in CXR segmentation by collecting radiologist segmentations for 10 pathologies using CheXpert, a large publicly available CXR dataset36. To compare saliency method segmentations with expert segmentations, we use two metrics to capture localization accuracy: (1) mean Intersection over Union, a metric that measures the overlap between the saliency method segmentation and the expert segmentation, and (2) hit rate, a less strict metric than mIoU that does not require the saliency method to locate the full extent of a pathology. We find that (1) while Grad-CAM generally localizes pathologies more accurately than the other evaluated saliency methods, all seven perform significantly worse compared with a human radiologist benchmark (although it is difficult to know whether poor localization performance is attributable to the model or to the saliency method); (2) the gap in localization performance between Grad-CAM and the human benchmark is largest for pathologies that have multiple instances on the same CXR, are smaller in size, and have shapes that are more complex; (3) model confidence is positively correlated with Grad-CAM localization performance. We publicly release a development dataset of expert segmentations, which we call CheXlocalize, to facilitate further research in DNN explainability for medical imaging.
Results
Framework for evaluating saliency methods
Seven methods were evaluated—Grad-CAM, Grad-CAM++, Integrated Gradients, Eigen-CAM, DeepLIFT, Layer-Wise Relevance Propagation (LRP), and Occlusion—in a multi-label classification setup on the CheXpert dataset (Fig. 1a). We ran experiments using three CNN architectures previously used on CheXpert: DenseNet121, ResNet152, and Inception-v4. For each combination of saliency method and model architecture, we trained and evaluated an ensemble of 30 CNNs (see Methods for ensembling details). We then passed each of the CXRs in the dataset’s holdout test set into the trained ensemble model to obtain image-level predictions for the following 10 pathologies: Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Lung Lesion, Lung Opacity, Pleural Effusion, Pneumothorax, and Support Devices. Of the 14 observations labeled in the CheXpert dataset: Fracture and Pleural Other were not included in our analysis because they had low prevalence in our test set (fewer than 10 examples); Pneumonia was not included because it is a clinical (as opposed to a radiological) diagnosis; and No Finding was not included because it is not applicable to evaluating localization performance. For each CXR, we used the saliency method to generate heat maps, one for each of the 10 pathologies, and then applied a threshold to each heat map to produce binary segmentations (top row, Fig. 1a). Thresholding is determined per pathology using Otsu’s method37, which iteratively searches for a threshold value that maximizes inter-class pixel intensity variance. We also conducted a second thresholding scheme in which we iteratively search for a threshold value that maximizes per pathology mIoU on the validation set. Both thresholding schemes reported similar findings (see Extended Data Fig. 1). The result shows that our evaluation of localization performance is robust to different saliency map thresholding schemes. Additionally, to calculate the hit rate evaluation metric (described below), we extracted the pixel in the saliency method heat map with the largest value as the single most representative point on the CXR for that pathology.
We obtained two independent sets of pixel-level CXR segmentations on the holdout test set: ground-truth segmentations drawn by two board-certified radiologists (middle row, Fig. 1a) and human benchmark segmentations drawn by a separate group of three board-certified radiologists (bottom row, Fig. 1a). The human benchmark segmentations and the saliency method segmentations were compared with the ground-truth segmentations to establish the human benchmark localization performance and the saliency method localization performance, respectively. Additionally, for the hit rate evaluation metric, the radiologists who drew the benchmark segmentations were also asked to locate a single point on the CXR that was most representative of the pathology at hand (see Supplementary Figs. S1 through S11 for detailed instructions given to the radiologists). Note that the human benchmark localization performance demonstrates interrater variability, and we use it as a reference when evaluating saliency method pipelines.
We used two evaluation metrics to compare segmentations (Fig. 1b). First, we used mean Intersection over Union (mIoU), a metric that measures how much, on average, either the saliency method or benchmark segmentations overlapped with the ground-truth segmentations. Second, we used hit rate, a less strict metric that does not require the saliency method or benchmark annotators to locate the full extent of a pathology. Hit rate is based on the pointing game setup38, in which credit is given if the most representative point identified by the saliency method or the benchmark annotators lies within the ground-truth segmentation. A “hit” indicates that the correct region of the CXR was located regardless of the exact bounds of the binary segmentations. Localization performance is then calculated as the hit rate across the dataset39. In addition, we report the sensitivity and specificity values of the saliency method pipeline and the human benchmark in Extended Data Fig. 2.
Evaluating localization performance
In order to compare the localization performance of the saliency methods with the human benchmark, we first used Grad-CAM, Grad-CAM++, and Integrated Gradients to run eighteen experiments, one for each combination of saliency method (Grad-CAM, Grad-CAM++, or Integrated Gradients) and CNN architecture (DenseNet121, ResNet152, or Inception-v4) using one of the two evaluation metrics (mIoU or hit rate) (see Extended Data Fig. 3). We also ran experiments to evaluate the localization performances of DenseNet121 with Eigen-CAM, DeepLIFT, LRP, and Occlusion. We found that Grad-CAM with DenseNet121 generally demonstrated better localization performance across pathologies and evaluation metrics than the other combinations of saliency method and architecture (see Table 1 for localization performance on the test set of all seven saliency methods using DenseNet121). Accordingly, we compared Grad-CAM with DenseNet121 (“saliency method pipeline”) with the human benchmark using both mIoU and hit rate. The localization performance for each pathology is reported on the true positive slice of the dataset (CXRs that contain saliency method, human benchmark, and ground-truth segmentations). Localization performance was calculated this way so that saliency methods were not penalized by DNN classification error: while the benchmark radiologists were provided with ground-truth labels when annotating the dataset, saliency method segmentations were created based on labels predicted by the model. (See Extended Data Fig. 4 for saliency method pipeline localization performance on the full dataset using mIoU.)
We found that the saliency method pipeline demonstrated significantly worse localization performance when compared with the human benchmark using both mIoU (Fig. 2a) and hit rate (Fig. 2b) as an evaluation metric, regardless of model classification AUROC. For each metric, we report the 95% confidence intervals using the bootstrap method with 1,000 bootstrap samples40. For five of the 10 pathologies, the saliency method pipeline had a significantly lower mIoU than the human benchmark. For example, the saliency method pipeline had one of the highest AUROC scores of the 10 pathologies for Support Devices (0.969), but had among the worst localization performance for Support Devices when using both mIoU (0.163 [95% CI 0.154, 0.172]) and hit rate (0.357 [95% CI 0.303, 0.408]) as evaluation metrics. On two pathologies (Atelectasis and Consolidation) the saliency method pipeline significantly outperformed the human benchmark. On average, across all 10 pathologies, mIoU saliency method pipeline performance was 26.6% [95% CI 18.1%, 35.0%] worse than the human benchmark, with Lung Lesion displaying the largest gap in performance (76.2% [95% CI 59.1%, 87.5%] worse than the human benchmark) (Extended Data Fig. 5). Consolidation was the pathology on which the mIoU saliency method pipeline performance exceeded the human benchmark the most, by 56.1% [95% CI 42.7%, 69.4%]. For seven of the 10 pathologies, the saliency method pipeline had a significantly lower hit rate than the human benchmark. On average, hit rate saliency method pipeline performance was 29.4% [95% CI 15.0%, 43.2%] worse than the human benchmark (Extended Data Fig. 6), with Lung Lesion again displaying the largest gap in performance (65.9% [95% CI 35.3%, 91.7%] worse than the human benchmark). The hit rate saliency method pipeline did not significantly outperform the human benchmark on any of the 10 pathologies; for the remaining three of the 10 pathologies, the hit rate performance differences between the saliency method pipeline and the human benchmark were not statistically significant. Therefore, while the saliency method pipeline significantly underperformed the human benchmark regardless of evaluation metric used, the average performance gap was larger when using hit rate as an evaluation metric than when using mIoU as an evaluation metric.
We compared saliency method pipeline localization performance using an ensemble model to localization performance using the top performing single checkpoint for each pathology. We found that the single model has worse localization performance than the ensemble model for all pathologies when using mIoU and for six of the 10 pathologies when using hit rate (see Extended Data Fig. 7).
Characterizing underperformance of saliency method pipeline
In order to better understand the underperformance of the saliency method pipeline localization, we first conducted a qualitative analysis with a radiologist by visually inspecting both the segmentations produced by the saliency method pipeline (Grad-CAM with DenseNet121) and the human benchmark segmentations. We found that, in general, saliency method segmentations fail to capture the geometric nuances of a given pathology, and instead produce coarse, low-resolution heat maps. Specifically, our qualitative analysis found that the performance of the saliency method depended on three pathological characteristics (Fig. 3a): (1) number of instances: when a pathology had multiple instances on a CXR, the saliency method segmentation often highlighted one large confluent area, instead of highlighting each distinct instance of the pathology separately; (2) size: saliency method segmentations tended to be significantly larger than human expert segmentations, often failing to respect clear anatomical boundaries; (3) shape complexity: the saliency method segmentations for pathologies with complex shapes frequently included significant portions of the CXR where the pathology is not present.
Informed by our qualitative analysis and previous work in histology45, we defined four geometric features for our quantitative analysis (Fig. 3b): (1) number of instances (for example, bilateral Pleural Effusion would have two instances, whereas there is only one instance for Cardiomegaly), (2) size (pathology area with respect to the area of the whole CXR), (3) elongation and (4) irrectangularity (the last two features measure the complexity of the pathology shape and were calculated by fitting a rectangle of minimum area enclosing the binary mask). See Extended Data Fig. 8 for the distribution of the four pathological characteristics across all 10 pathologies.
For each evaluation metric, we ran 8 simple linear regressions: four with the evaluation metric (IoU or hit rate) of the saliency method pipeline (Grad-CAM with DenseNet121) as the dependent variable (to understand the relationship between the geometric features of a pathology and saliency method localization performance), and four with the difference between the evaluation metrics of the saliency method pipeline and the human benchmark as the dependent variable (to understand the relationship between the geometric features of a pathology and the gap in localization performance between the saliency method pipeline and the human benchmark). Each regression used one of the four geometric features as a single independent variable, and only the true positive slice was included in each regression. Each feature was normalized using z-score normalization and the regression coefficient can be interpreted as the effect of that geometric feature on the evaluation metric at hand. See Table 2 for coefficients from the regressions using both evaluation metrics, where we also report the 95% confidence interval and the Bonferroni corrected p-values. For confidence intervals and p-values, we used the standard calculation for linear models.
Our statistical analysis showed that as the area ratio of a pathology increased, mIoU saliency method localization performance improved (0.566 [95% CI 0.526, 0.606]). We also found that as elongation and irrectangularity increased, mIoU saliency method localization performance worsened (elongation: -0.425 [95% CI -0.497, -0.354], irrectangularity: -0.256 [95% CI -0.292, -0.219]). We observed that the effects of these three geometric features were similar for hit rate saliency method localization performance in terms of levels of statistical significance and direction of the effects. However, there was no evidence that the number of instances of a pathology had a significant effect on either mIoU (−0.115 [95% CI -0.220, -0.009]) or hit rate (−0.051 [95% CI -0.364, 0.244]) saliency method localization. Therefore, regardless of evaluation metric, saliency method localization performance suffered in the presence of pathologies that were small in size and complex in shape.
We found that these same three pathological characteristics—larger size, and higher elongation and irrectangularity—characterized the gap in mIoU localization performance between saliency method and human benchmark. We observed that the gap in hit rate localization performance was significantly characterized by all four geometric features (number of instances, size, elongation, and irrectangularity). As the number of instances increased, despite no significant change in hit rate localization performance itself, the gap in hit rate localization performance between saliency method and the human benchmark increased (0.470 [95% CI 0.114, 0.825]). This suggests that the saliency method performs especially poorly in the face of a multi-instance diagnosis.
Effect of model confidence on localization performance
We also conducted statistical analyses to determine whether there was any correlation between the model’s confidence in its prediction and saliency method pipeline performance (Table 3). We first ran a simple regression for each pathology using the model’s probability output as the single independent variable and using the saliency method IoU as the dependent variable. We then performed a simple regression that uses the same approach as above, but that includes all 10 pathologies. For each of the 11 regressions, we used the full dataset since the analysis of false positives and false negatives was also of interest. In addition to the linear regression coefficients, we also computed the Spearman correlation coefficients to capture any potential non-linear associations.
We found that for all pathologies, model confidence was positively correlated with mIoU saliency method pipeline performance. The p-values for all coefficients were below 0.001 except for the coefficients for Pneumothorax (n=11) and Lung Lesion (n=50), the two pathologies for which we had the fewest positive examples. Of all the pathologies, model confidence for positive predictions of Enlarged Cardiomediastinum had the largest linear regression coefficient with mIoU saliency method pipeline performance (1.974, p-value<0.001). Model confidence for positive predictions of Pneumothorax had the largest Spearman correlation coefficient with mIoU saliency method pipeline performance (0.734, p-value<0.01), followed by Pleural Effusion (0.690, p-value<0.001). Combining all pathologies (n=2365), the linear regression coefficient was 0.109 (95% CI [0.083, 0.135]), and the Spearman correlation coefficient was 0.285 (95% CI [0.239, 0.331]). We also performed analogous experiments using hit rate as the dependent variable and found comparable results (Extended Data Fig. 9).
Discussion
The purpose of this work was to evaluate the performance of some of the most used saliency methods for deep learning explainability using a variety of model architectures. We establish the first human benchmark for CXR segmentation in a multilabel classification setup and demonstrate that saliency maps are consistently worse than expert radiologists regardless of model classification AUROC. We use qualitative and quantitative analyses to establish that saliency method localization performance is most inferior to expert localization performance when a pathology has multiple instances, is smaller in size, or has shapes that are more complex, suggesting that deep learning explainability as a clinical interface may be less reliable and less useful when used for pathologies with those characteristics. We also show that model assurance is positively correlated with saliency method localization performance, which could indicate that saliency methods are safer to use as a decision aid to clinicians when the model has made a positive prediction with high confidence.
Because ground-truth segmentations for medical imaging are time-consuming and expensive to obtain, the current norm in medical imaging—both in research and in industry—is to use classification models on which saliency methods are applied post-hoc for localization, highlighting the need for investigations into the reliability of these methods in clinical settings46,47. There are public CXR datasets containing image-level labels annotated by expert radiologists (e.g., the CheXpert validation set), multilabel bounding box annotations (e.g., ChestX-ray848 and VinDr-CXR49), and segmentations for a single pathology (e.g., SIIM-ACR Pneumothorax Segmentation50). To our knowledge, however, there are no other publicly available CXR datasets with multilabel pixel-level expert segmentations. By publicly releasing a development dataset, CheXlocalize, of 234 images with 885 expert segmentations, and a competition with a test set of 668 images, we hope to encourage the further development of saliency methods and other explainability techniques for medical imaging.
Our work has several potential implications for human-AI collaboration in the context of medical decision-making. Heat maps generated using saliency methods are advocated as clinical decision support in the hope that they not only improve clinical decision-making, but also encourage clinicians to trust model predictions51–53. Many of the large CXR vendors54–56 use localization methods to provide pathology visualization in their computer-aided detection (CAD) products. In addition to being used for clinical interpretation, saliency method heat maps are also used for the evaluation of CXR interpretation models, for quality improvement (QI) and quality assurance (QA) in clinical practices, and for dataset annotation57. Explainable AI is critical in high-stakes contexts such as health care, and saliency methods have been used successfully to develop and understand models generally. Indeed, we found that the saliency method pipeline significantly outperformed the human benchmark on two pathologies when using mIoU as an evaluation metric. However, our work also suggests that saliency methods are not yet reliable enough to validate individual clinical decisions made by a model. We found that saliency method localization performance, on balance, performed worse than expert localization across multiple analyses and across many important pathologies (our findings are consistent with recent work focused on localizing a single pathology, Pneumothorax, in CXRs58). We hypothesize that this could be an algorithmic artifact of saliency methods, whose relatively small heat maps (14×14 for Grad-CAM) are interpolated to the original image dimensions (usually 2000×2000), resulting in coarse resolutions. If used in clinical practice, heat maps that incorrectly highlight medical images may exacerbate well documented biases (chiefly, automation bias) and erode trust in model predictions (even when model output is correct), limiting clinical translation22.
Since IoU computes the overlap of two segmentations but pointing game hit rate better captures diagnostic attention, we suggest using both metrics when evaluating localization performance in the context of medical imaging. While IoU is a commonly used metric for evaluating semantic segmentation outputs, there are inherent limitations to the metric in the pathological context. This is indicated by our finding that even the human benchmark segmentations had low overlap with the ground truth segmentations (the highest expert mIoU was 0.720 for Cardiomegaly). One potential explanation for this consistent underperformance is that pathologies can be hard to distinguish, especially without clinical context. Furthermore, whereas many people might agree on how to segment, say, a cat or a stop sign in traditional computer vision tasks, radiologists use a certain amount of clinical discretion when defining the boundaries of a pathology on a CXR. There can also be institutional and geographic differences in how radiologists are taught to recognize pathologies, and studies have shown that there can be high interobserver variability in the interpretation of CXRs59–61. We sought to address this with the hit rate evaluation metric, which highlights when two radiologists share the same diagnostic intention, even if it is less exact than IoU in comparing segmentations directly. The human benchmark localization using hit rate was above 0.9 for four pathologies (Pneumothorax, Cardiomegaly, Support Devices, and Enlarged Cardiomediastinum); these are pathologies for which there is often little disagreement between radiologists about where the pathologies are located, even if the expert segmentations are noisy. Further work is needed to demonstrate which segmentation evaluation metrics, even beyond overlap and hit rate, are more appropriate for certain pathologies and downstream tasks when evaluating saliency methods for the clinical setting.
Our work builds upon several studies investigating the validity of saliency maps for localization62–64 and upon some early work on the trustworthiness of saliency methods to explain DNNs in medical imaging47. However, as recent work has shown32, evaluating saliency methods is inherently difficult given that they are post-hoc techniques. To illustrate this, consider the following models and saliency methods as described by some oracle: (1) a model M_bad that has perfect AUROC for a given image classification task, but that we know does not localize well (i.e. because the model picks up on confounders in the image); (2) a model M_good that also has perfect AUROC, but that we know does localize well (i.e. is looking at relevant regions of the image); (3) a saliency method S_bad that does not properly reflect the model’s attention; and (4) a saliency method S_good that does properly reflect the model’s attention. Let us say that we are evaluating the following pipeline: we first classify an image and we then apply a saliency method post hoc. Imagine that our evaluation reveals poor localization performance as measured by mIoU or hit rate (as was the case in our findings). There are three possible pipelines (combinations of model and saliency method) that would lead to this scenario: (1) M_bad + S_good; (2) M_good + S_bad; and (3) M_bad + S_bad. The first scenario (M_bad + S_good) is the one for which saliency methods were originally intended: we have a working saliency method that properly alerts us to models picking up on confounders. The second scenario (M_good + S_bad) is our nightmare scenario: we have a working model whose attention is appropriately directed, but we reject it based on a poorly localizing saliency method. Because all three scenarios result in poor localization performance, it is difficult—if not impossible—to know whether poor localization performance is attributable to the model or to the saliency method (or to both). While we cannot say whether models or saliency methods are failing in the context of medical imaging, we can say that we should not rely on saliency methods to evaluate model localization. Future work should explore potential techniques for localization performance attribution.
There are several limitations of our work. First, we did not investigate the impact of pathology prevalence in the training data on saliency method localization performance. Second, some pathologies, such as effusions and cardiomegaly, are in similar locations across frontal view CXRs, while others, such as lesions and opacities, can vary in locations across CXRs. Future work could investigate how the location of pathologies on a CXR in the training/test data distribution, and the consistency of those locations, affect saliency method localization performance. Third, while we compared saliency method-generated pixel-level segmentations to human expert pixel-level segmentations, future work might explore how saliency method localization performance changes when comparing bounding-box annotations, instead of pixel-level segmentations. Fourth, we explored post-hoc interpretability methods given their prevalence in the context of medical imaging, but we hope that by publicly releasing our development dataset of pixel-level expert segmentations we can facilitate the development of models that make use of ground-truth segmentations during training57. Fifth, the lack of a given finding can in certain cases inform clinical diagnoses. A common example of this is the lack of normal lung tissue pattern towards the edges of the thoracic cage, which is used to detect pneumothorax. For any characteristic pattern, both the absence and the presence provide diagnostic information to the radiologist. For example, the absence of a pleural effusion pattern is also used to rule out pleural effusion. For any characteristic radiological pattern, both the presence as well as the absence contributes to the final radiology report. Future work can explore counterfactual visual explanations that are similar to the counterfactual diagnostic process of a radiologist. Sixth, future work should further explore the potentially confounding effect of model calibration on the evaluation of saliency methods, especially when using segmentation, as opposed to classification, models. Finally, the impact of saliency methods on the trust and efficacy of users is underexplored.
In conclusion, we present a rigorous evaluation of a range of saliency methods and a human benchmark dataset, which can serve as a foundation for future work exploring deep learning explainability techniques. This work is a reminder that care should be taken when leveraging common saliency methods to validate individual clinical decisions in deep learning-based workflows for medical imaging.
Methods
Ethical and information governance approvals
A formal Stanford IRB review was conducted for the original collection of the CheXpert dataset. The IRB waived the requirement to obtain informed consent as the data were retrospectively collected and fully anonymized.
Dataset and clinical taxonomy
Dataset description
The localization experiments were performed using CheXpert, a large public dataset for chest X-ray interpretation. The CheXpert dataset contains 224,316 chest X-rays for 65,240 patients labeled for the presence of 14 observations (13 pathologies and an observation of “No Finding”) as positive, negative, or uncertain. The CheXpert validation set consists of 234 chest X-rays from 200 patients randomly sampled from the full dataset and was labeled according to the consensus of three board-certified radiologists. The test set consists of 668 chest X-rays from 500 patients not included in the training or validation sets and was labeled according to the consensus of five board-certified radiologists. See Extended Data Fig. 10 for test set summary statistics.
Ground-truth segmentation
The chest X-rays in our validation set and test set were manually segmented by two board-certified radiologists with 18 and 27 years of experience, using the annotation software tool MD.ai65 (see Supplementary Figs. S12 through S14). The radiologists were asked to contour the region of interest for all observations in the chest X-rays for which there was a positive ground truth label in the CheXpert dataset. For a pathology with multiple instances, all the instances were contoured. For Support Devices, radiologists were asked to contour any implanted or invasive devices including pacemakers, PICC/central catheters, chest tubes, endotracheal tubes, feeding tubes and stents and ignore ECG lead wires or external stickers visible in the chest X-ray.
Evaluating the expert performance using benchmark segmentation
To evaluate the expert performance on the test set using the IoU evaluation method, three radiologists, certified in Vietnam with 9, 10, and 18 years of experience, were asked to segment the regions of interest for all observations in the chest X-rays for which there was a positive ground truth label in the CheXpert dataset. These radiologists were also provided the same instructions for contouring as were provided to the radiologists drawing the reference segmentations. To extract the “maximally activated” point from the benchmark segmentations, we asked the same radiologists to locate each pathology present on each CXR using only a single most representative point for that pathology on the CXR (see Supplementary Figs. S1 through S11 for the detailed instructions given to the radiologists). There was no overlap between these three radiologists and the two who drew the reference segmentations.
Classification network architecture and training protocol
Multi-label classification model
The model takes as input a single-view chest X-ray and outputs the probability for each of the 14 observations. In case of availability of more than one view, the models output the maximum probability of the observations across the views. Each chest X-ray was resized to 320×320 pixels and normalized before it was fed into the network. We used the same image resolutions as CheXpert36 and CheXnet2, which demonstrated radiologist-level performance on external test sets with 320×320 images. There are models that are commercially deployed and have similar dimensions. For example, the architecture used by a medical AI software vendor Annalise.ai66 is based on EfficientNet67, which takes input of 224×224. Chest X-rays were normalized prior to being fed into the network by subtracting the mean of all images in the CheXpert training set and then dividing by the standard deviation of all images in the CheXpert training set. The model architectures (DenseNet121, ResNet152, and Inception-v4) were used. Cross-entropy loss was used to train the model. The Adam optimizer68 was used with default β-parameters of β1 = 0.9 and β2 = 0.999. The learning rate was hyperparameter tuned for the different model architectures. Grid search was used to tune the learning rates. We searched over learning rates of 1e-3, 1e-4, and 1e-5. The best learning rate for each architecture was: 1×10−4 for DenseNet121, 1×10−5 for ResNet152, 1×10−5 for Inceptionv4. Batches were sampled using a fixed batch size of 16 images.
Ensembling
We use an ensemble of checkpoints to create both predictions and saliency maps to maximize model performance. In order to capture uncertainties inherent in radiograph interpretation, we train our models using four uncertainty handling strategies outlined in CheXpert: Ignoring, Zeroes, Ones, and 3-Class Classification. For each of the four uncertainty handling strategies, we train our model three separate times, each time saving the 10 checkpoints across the three epochs with the highest average AUC across 5 observations selected for their clinical importance and prevalence in the validation set: Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion. In total, after training, we have saved 4 x 30 = 120 checkpoints for a given model. Then, from the 120 saved checkpoints for that model, we select the top 10 performing checkpoints for each pathology. For each CXR and each task, we compute the predictions and saliency maps using the relevant checkpoints. We then take the mean both of the predictions and of the saliency maps to create the final set of predictions and saliency maps for the ensemble model. See Supplementary Table S1 for the performance of each model architecture (DenseNet121, ResNet152, and Inception-v4) on each of the pathologies.
CNN interpretation strategy
Saliency methods were used to visualize the decision made by the classification network. The saliency map was resized to the original image dimension using bilinear interpolation. It was then normalized using max-min normalization and then converted into a binary segmentation using binary thresholding (Otsu’s method). We also reported mIoU localization performance using different saliency map thresholding values. We first applied max-min normalizations to the saliency maps so that each value gets transformed into a decimal between 0 and 1. We then passed in a range of threshold values from 0.2 to 0.8 to create binary segmentations and calculated the mIoU score per pathology under each threshold on the validation set. Then for the analysis on the full dataset (see Extended Data Fig. 4), we further ensure that the final binary segmentation is consistent with model probability output by applying another layer of thresholding such that the segmentation mask produced all zeros if the predicted probability was below a chosen level. The probability threshold is searched on the interval of [0,0.8] with steps of 0.1. The exact value is determined per pathology by maximizing the mIoU on validation set.
For Occlusion, we used a window size of 40 and a stride of 40 for each CXR.
Segmentation evaluation metrics
Localization performance of each segmentation was evaluated using Intersection over Union (IoU) score. The IoU is the ratio between the area of overlap and the area of union between the ground truth and the predicted areas, ranging from 0 to 1 with 0 signifying no overlap and 1 signifying perfectly overlapping segmentation. Confidence intervals are calculated using bootstrapping with 1000 bootstrap samples. The variance in the width of CI across pathologies can be explained by difference in sample sizes. For the percentage decrease from expert mIoU to AI mIoU, we bootstrapped the difference between human benchmark and saliency method localization and created the 95% confidence intervals. The confidence intervals for hit rates were calculated in the same fashion. For the evaluation of Integrated Gradients using IoU, we applied box filtering of kernel size 100 to smooth the pixelated map. For DeepLIFT, we applied box filtering of kernel size 50. For LRP, we used a kernel size of 80. The kernel sizes are tuned on the validation set. The noisy map is not a concern for hit rate because a single max pixel is extracted for the entire image.
Statistical analysis
Pathology Characteristics
We used four features to characterize the pathologies. (1) Number of instances is defined as the number of disjoint components in the segmentation. (2) Size is the area of the pathology divided by the total image area. (3) and (4) Elongation and irrectangularity are geometric features that measure shape complexities. They were designed to quantify what radiologists qualitatively described as focal or diffused. To calculate the metrics, a rectangle of minimum area enclosing the contour is fitted to each pathology. Elongation is defined as the ratio of the rectangle’s longer side to short side. Irrectangularity = 1 - (area of segmentation/area of enclosing rectangle), with values ranging from 0 to 1 with 1 being very irrectangular. When there are multiple instances within one pathology, we used the characteristics of the dominant instance (largest in perimeter).
Model Confidence
We used the probability output of the DNN architecture for model confidence. The probabilities were normalized using max-min normalization per pathology before aggregation.
Linear Regression
For each evaluation scheme (overlap and hit rate), we ran two groups of simple linear regressions, with AI evaluation metrics and their differences as the response variables. Each group has four regressions using the above four pathological characteristics as the regressions’ single attribute, respectively, and only the true positive slice was included in each regression. All features are normalized using min-max normalization so that they are comparable on scales of magnitudes. We report the 95% confidence interval and Bonferroni adjusted p-value of the regression coefficients.
Data Availability
CheXpert data is available at https://stanfordmlgroup.github.io/competitions/chexpert/. The validation set and corresponding benchmark radiologist annotations will be available online for the purpose of extending the study.
Data Availability
The CheXlocalize dataset is available here: https://stanfordaimi.azurewebsites.net/datasets/abfb76e5-70d5-4315-badc-c94dd82e3d6d. The CheXpert dataset is available here https://stanfordmlgroup.github.io/competitions/chexpert/.
Code Availability
The code used to generate segmentations from saliency method heat maps, fine-tune segmentation thresholds, generate segmentations from human annotations, and evaluate localization performance is available in the following public repository under the MIT License: https://github.com/rajpurkarlab/cheXlocalize. The version used for this publication is available at https://doi.org/10.5281/zenodo.681628869.
Author Contributions
Conceptualization: P.R. and A.P. Design: P.R., A.P., A.S., X.G. and A.A. Data analysis and interpretation: A.S., X.G., A.A., P.R., A.P., S.T., C.N., V.N., J.S., and F.B. Drafting of the manuscript: A.S., X.G., A.A., and P.R. Critical revision of the manuscript for important intellectual content: A.P, S.T., C.N., V.N., J.S., F.B, A.N., and M.L. Supervision: A.N., M.L., and P.R. Research was primarily performed while A.S. was at Stanford University. M.L. and P.R. contributed equally.
Competing Interests
M.L. is an advisor for and/or has research funded by GE, Philips, Carestream, Nines Radiology, Segmed, Centaur Labs, Microsoft, BunkerHill, and Amazon Web Services (none of the funded research was relevant to this project). A.P. is a medical associate at Cerebriu. The remaining authors declare no competing interests.
Extended Data
Acknowledgements
We would like to acknowledge MD.ai for generously providing us access to their annotation platform. We would like to acknowledge Weights & Biases for generously providing us access to their experiment tracking tools.
Footnotes
New saliency methods have been added and the language of the manuscript has been updated.