Abstract
Deep learning has enabled automated medical image interpretation at a level often surpassing that of practicing medical experts. However, many clinical practices have cited a lack of model interpretability as reason to delay the use of “black-box” deep neural networks in clinical workflows. Saliency maps, which “explain” a model’s decision by producing heat maps that highlight the areas of the medical image that influence model prediction, are often presented to clinicians as an aid in diagnostic decision-making. In this work, we demonstrate that the most commonly used saliency map generating method, Grad-CAM, results in low performance for 10 pathologies on chest X-rays. We examined under what clinical conditions saliency maps might be more dangerous to use compared to human experts, and found that Grad-CAM performs worse for pathologies that had multiple instances, were smaller in size, and had shapes that were more complex. Moreover, we showed that model confidence was positively correlated with Grad-CAM localization performance, suggesting that saliency maps were safer for clinicians to use as a decision aid when the model had made a positive prediction with high confidence. Our work demonstrates that several important limitations of interpretability techniques for medical imaging must be addressed before use in clinical workflows.
Introduction
Deep learning has enabled automated medical imaging interpretation at a level shown to surpass that of practicing experts in some settings1–3. While the potential benefits of automated diagnostic models are numerous, lack of model interpretability in the use of “black-box” deep neural networks (DNNs) represents a major barrier to clinical trust and adoption4,5,6. In fact, it has been argued that the European Union’s recently adopted General Data Protection Regulation (GDPR) affirms an individual’s right to an explanation in the context of automated decision-making7. Although many DNN interpretability techniques have been proposed, rigorous investigation of the accuracy and reliability of these strategies is lacking and necessary before they are integrated into the clinical setting8.
One type of DNN interpretation strategy widely used in the context of medical imaging is based on saliency (or pixel-attribution) methods9–12. Saliency methods produce heat maps that highlight the areas of the medical image that most influenced the DNN’s prediction. The heat maps help to visualize whether a DNN is concentrating on the same regions of the medical image that a human expert would focus attention on for a given diagnosis, rather than concentrating on a clinically irrelevant part of the medical image or even on confounders in the image13–15. However, recent work has shown that saliency methods used to validate model predictions can be misleading in some cases and may lead to increased bias and loss of user trust with concerning implications for clinical translation efforts16.
The purpose of this work is to perform a systematic evaluation of the most common saliency method, Grad-CAM17, on multi-label classification models for medical imaging interpretation from chest X-rays. In order to evaluate how well the saliency method identifies critical areas of an image for diagnosis, we compared the saliency method segmentations to human expert benchmark and reference annotations. We evaluated the accuracy of the saliency method segmentations and the human expert benchmark segmentations first by calculating their overlap with the human expert reference segmentations, and then by determining whether the segmentations correctly located the pathology of concern, regardless of the exact bounds of the segmentations. We further conducted statistical analyses to better understand how the localization accuracy of saliency methods is affected both by pathological characteristics and also by model confidence.
Results
Framework for evaluating a saliency method on multi-label classification models
A model can be trained to perform pixel-level localization in one of two ways: either through supervised learning, in which the model is trained directly on pixel-level segmentations, or through weakly supervised learning, in which the model is trained only on image-level class labels. Because ground truth segmentations for medical imaging can be especially time-consuming and expensive to obtain given the domain expertise required to create them18, one of the most common saliency methods in medical imaging is a weakly supervised localization technique, Grad-CAM, in which the classification model is never exposed to pixel-level segmentations during training. Instead, Grad-CAM generates a heat map corresponding to each image-level task label that highlights the regions of the input image that, in theory, are most indicative of that task label. This saliency method has been widely used for a variety of medical imaging tasks and modalities, including but not limited to visualizing the performance of a convolutional neural network in predicting (1) myocardial infarction19 and hypoglycemia20 from electrocardiograms, (2) visual impairment21, refractive error22, and anaemia23 from retinal photographs, (3) long-term mortality24 and tuberculosis25 from chest X-ray images, and (4) appendicitis26 and pulmonary embolism27 on computed tomography scans.
We use the following framework, which we call CheXplanation, to evaluate Grad-CAM in a multi-label classification setup. We trained and evaluated an ensemble of 30 CNN models on CheXpert, a large publicly available CXR dataset28. We then passed each of the 668 CXRs in the dataset’s holdout test set into the ensemble model to obtain image-level predictions for the following 10 pathologies: Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Lung Lesion, Lung Opacity, Pleural Effusion, Pneumothorax, and Support Devices. For each CXR, we used Grad-CAM to generate 10 heat maps, one for each of the 10 pathologies. We then applied a threshold to the heat maps to produce 10 binary segmentations in order to evaluate their overlap with the human expert reference segmentations (see Fig. 1a). We also extracted the location of the pixel with the largest value from each heat map to determine whether it fell within the bounds of the reference segmentation (see Fig. 1d). In doing so, we could evaluate Grad-CAM’s localization performance regardless of the exact bounds of its binary segmentations.
To evaluate how well the saliency method segmentations identified clinically relevant pathology regions of input CXRs (“AI localization performance”), we obtained pixel-level reference segmentations on the holdout test set from two board-certified radiologists who were asked to segment any of the 10 pathologies that were present in each CXR as determined by the dataset’s ground-truth labels (see Fig. 1b). We also established a human benchmark (“expert localization performance”) by collecting segmentations from three additional radiologists who were asked to segment the 10 pathologies of interest present in each CXR as determined by the dataset’s ground-truth labels (there was no overlap between these three radiologists and the two who drew the reference segmentations) (see Fig. 1c). This second group of radiologists was also asked to locate each pathology present on each CXR using only a single most representative point for that pathology on the CXR to see whether that point fell within the bounds of the reference segmentation (see Fig. 1e). Which point in the CXR is “most representative” varied by pathology, but the point would always lie inside the pathology’s segmentation drawn by the radiologist. For example, the most representative point for Pneumothorax would be wherever the pathology is most evident, whereas the most representative point for Cardiomegaly would be the center of the heart. See Supplementary Figs. 1 through 11 for the detailed instructions given to the radiologists.
Our dataset of expert reference segmentations has been made publicly available to encourage further development and evaluation of CXR interpretation models.
Evaluating the localization performance of the saliency method
We used two evaluation schemes to compare the AI localization performance to the expert localization performance. First, we used mean Intersection over Union (mIoU) to measure how much, on average, either the saliency method segmentations or the human benchmark segmentations overlapped with the reference segmentations. Second, we used the pointing game setup29, in which a “hit” is when the single point used to locate a pathology lies within the reference segmentation and a “miss” is when the single point lies outside the reference segmentation. Localization performance is then calculated as the hit rate across the dataset30. See Fig. 2a for mIoU and hit rate for Grad-CAM segmentations on example CXRs.
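The two evaluation schemes can be illustrated with a minimal sketch. It assumes binary masks and heat maps stored as NumPy arrays of equal shape; the function names `iou` and `pointing_game_hit` and the toy arrays are illustrative and are not taken from the study's code.

```python
import numpy as np

def iou(pred_mask, ref_mask):
    """Intersection over Union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return None  # both masks empty: IoU is undefined, the caller decides
    return float(np.logical_and(pred, ref).sum() / union)

def pointing_game_hit(heat_map, ref_mask):
    """True if the heat map's largest-valued pixel lies inside the reference mask."""
    heat = np.asarray(heat_map)
    row, col = np.unravel_index(np.argmax(heat), heat.shape)
    return bool(np.asarray(ref_mask, dtype=bool)[row, col])

# Toy 4x4 example (synthetic values, not study data).
ref = np.zeros((4, 4), dtype=bool)
ref[1:3, 1:3] = True            # 4-pixel reference region
pred = np.zeros((4, 4), dtype=bool)
pred[1:4, 1:4] = True           # 9-pixel predicted region that covers the reference
heat = np.zeros((4, 4))
heat[2, 2] = 1.0                # heat map peak inside the reference region

example_iou = iou(pred, ref)                 # intersection 4, union 9
example_hit = pointing_game_hit(heat, ref)   # peak falls inside the reference
```

The toy case shows why the two metrics can diverge: an oversized segmentation is penalized under IoU even when its peak lands on the pathology and counts as a hit.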
For nine of the 10 pathologies, the saliency segmentations had a lower overlap with the reference segmentations than did the human benchmark segmentations (see Fig. 2b). Similarly, for nine of the 10 pathologies, the saliency method had a lower hit rate than the human benchmark (see Fig. 2c). Under both the overlap and the pointing game evaluation schemes, the gap between AI localization performance and expert localization performance was the largest for Lung Lesion: AI mIoU was 93.6% smaller than expert mIoU, and Grad-CAM hit rate was 74.9% smaller than expert hit rate. Lung Lesion is also the pathology for which the saliency method had both the lowest mIoU (0.027) and the lowest hit rate (0.215). Under the hit rate scheme, Support Devices and Pneumothorax displayed the second and third largest gaps, respectively, between AI and expert localization performance (Support Devices: 72.4%; Pneumothorax: 60.4%). Under the overlap scheme, Support Devices and Pneumothorax also displayed the fourth and third largest gaps, respectively, between AI and expert localization performance (Support Devices: 67.4%; Pneumothorax: 68.5%).
Under the overlap evaluation scheme, the only pathology for which the saliency method (mIoU 0.145, 95% CI [0.126, 0.163]) outperformed the human benchmark (mIoU 0.122, 95% CI [0.108, 0.136]) was Atelectasis. Under the hit rate evaluation scheme, the only pathology for which the saliency method (hit rate 0.624, 95% CI [0.462, 0.771]) outperformed the human benchmark (hit rate 0.513, 95% CI [0.355, 0.674]) was Consolidation. However, neither difference was statistically significant. Both the saliency method and the human benchmark achieved their highest mIoUs for Cardiomegaly (AI: 0.275, 95% CI [0.242, 0.30]; expert: 0.712, 95% CI [0.692, 0.731]). AI hit rate was largest for Cardiomegaly (0.903, 95% CI [0.857, 0.945]) and Enlarged Cardiomediastinum (0.761, 95% CI [0.712, 0.81]). Expert hit rate was above 0.95 for Pneumothorax (1.0, 95% CI [1.0, 1.0]), Cardiomegaly (0.971, 95% CI [0.944, 0.994]), and Enlarged Cardiomediastinum (0.953, 95% CI [0.927, 0.975]).
We also evaluated whether and how the gap between AI and expert localization performance changed from when overlap was used as an evaluation metric to when hit rate was used as an evaluation metric. Percentage decrease from expert to AI localization performance fell the most from mIoU to hit rate for Consolidation (41.2% − (−21.6%) = 62.8%), Cardiomegaly (61.4% − 7.1% = 54.3%), and Enlarged Cardiomediastinum (70.0% − 20.1% = 49.9%). Percentage decrease from expert to AI localization performance increased by far the most from mIoU to hit rate for Atelectasis (−18.9% − 59.0% = −77.9%).
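The percentage figures above can be reproduced from the reported point estimates. The sketch below assumes the gap is computed as (expert − AI) / expert; that formula is not stated explicitly in the text, but it matches the quoted values. The helper name `pct_decrease` is illustrative.

```python
def pct_decrease(expert, ai):
    """Percentage decrease from expert to AI performance (negative if AI is higher)."""
    return 100.0 * (expert - ai) / expert

# Point estimates quoted in the Results text.
consolidation_hit = pct_decrease(expert=0.513, ai=0.624)   # AI above expert, so negative
atelectasis_miou = pct_decrease(expert=0.122, ai=0.145)    # AI above expert, so negative
cardiomegaly_miou = pct_decrease(expert=0.712, ai=0.275)

# Change in the gap for Consolidation when moving from mIoU (41.2%) to hit rate.
consolidation_gap_change = 41.2 - consolidation_hit
```

A negative percentage decrease means the saliency method scored higher than the human benchmark on that metric, which is why subtracting it widens the reported change for Consolidation.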
Characterizing the gaps between AI localization performance and expert localization performance
In order to better understand under what circumstances the AI localization performance was closer to, or further from, the expert localization performance, we first conducted a qualitative analysis by visually inspecting both the saliency method segmentations and the benchmark radiologist segmentations with a radiologist. Then, to inform our qualitative interpretation, we conducted a statistical analysis to quantify how both the saliency method segmentations and the benchmark radiologist segmentations performed in the presence of four pathological characteristics31: (1) number of instances for a given pathology (for example, bilateral Pleural Effusion would have two instances, whereas there is usually only one instance for Cardiomegaly), (2) area ratio (pathology area with respect to the area of the whole CXR), (3) elongation, and (4) irrectangularity (the last two features were meant to measure the complexity of the pathology’s shape). See Fig. 3a for example segmentations with the above 4 characteristics. See Fig. 3b for the distribution of the four pathological characteristics across all 10 pathologies. Lung Lesion had the largest number of instances on average, and the lowest mean area ratio among all the pathologies. Support Devices and Pneumothorax had the two highest mean elongation and irrectangularity values, respectively.
For each evaluation scheme (overlap and hit rate), we ran 12 simple linear regressions: four with the AI evaluation metric (IoU or hit rate) as the response variable, four with the expert evaluation metric as the response variable, and four with the difference between the expert and AI evaluation metrics as the response variable. Each group of four regressions used the above four pathological characteristics as the regression’s single attribute, respectively, and only CXRs with a positive label were included in each regression (n=1534). Each regression coefficient can be interpreted as the effect of that pathological characteristic on the evaluation metric at hand. See Table 2 for coefficients from the regressions under the overlap and hit rate evaluation schemes.
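As a rough illustration of one group of these regressions, the sketch below fits three single-predictor least-squares lines (AI, expert, and expert minus AI) against one pathological characteristic. The data are synthetic, the function name is hypothetical, and confidence intervals and the Bonferroni correction are not shown; whether the study standardized any variables is not stated, so raw values are assumed.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic rows, one per positive (CXR, pathology) pair; not study data.
area_ratio = [0.01, 0.05, 0.10, 0.20, 0.30]
ai_iou = [0.02, 0.06, 0.11, 0.21, 0.31]
expert_iou = [0.30, 0.35, 0.42, 0.50, 0.60]

responses = {
    "ai": ai_iou,
    "expert": expert_iou,
    "expert_minus_ai": [e - a for e, a in zip(expert_iou, ai_iou)],
}
# One regression per response variable, all using the same single attribute.
slopes = {name: simple_linear_regression(area_ratio, y)[0]
          for name, y in responses.items()}
```

Because least squares is linear in the response, the slope for the difference equals the expert slope minus the AI slope, which is a useful sanity check when reading the three coefficient columns side by side.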
Our qualitative analysis uncovered three patterns in the saliency method segmentations that were associated with lower localization performance. First, we observed that when multiple instances of a single pathology are present in a CXR, instead of highlighting each distinct instance of the pathology separately, the saliency method segmentation often highlights one large confluent area that encompasses all of the instances (see Fig. 3c). Second, we found that saliency method segmentations tend to be significantly larger than either the human benchmark or reference segmentations, and often fail to respect clear anatomical boundaries (see Fig. 3d). Correspondingly, the AI overlap coefficient for area ratio was 0.556 (95% CI [0.510, 0.601]), suggesting that as a pathology’s area ratio decreases, AI localization performance worsens. Furthermore, under the overlap evaluation scheme, the gap between AI localization performance and expert localization performance increases as a pathology’s area ratio decreases: the area ratio coefficient was −0.151 (95% CI [−0.233, −0.07]) when the difference between expert and AI overlap was the response variable. Third, our qualitative analysis showed that, in segmenting complex and elongated pathologies, when the AI segmentations include the pathology, they also frequently enclose significant portions of the CXR where the pathology is not present (see Fig. 3e). Similarly, our statistical analysis demonstrated that AI localization performance worsens both as a pathology’s elongation increases (overlap coefficient = −0.375, 95% CI [−0.453, −0.297]) and as a pathology’s irrectangularity increases (overlap coefficient = −0.205, 95% CI [−0.245, −0.165]).
Our statistical analysis showed that the first and second trends listed above held not only for the AI segmentations, but also for the expert segmentations: as the number of instances of a pathology increases, expert localization performance worsens (overlap coefficient = −0.178, 95% CI [−0.334, −0.021]), and as the area ratio of a pathology increases, expert localization performance improves (overlap coefficient = 0.404, 95% CI [0.334, 0.475]). However, our statistical analysis showed no evidence that the third trend holds for the expert segmentations: expert localization performance does not worsen as pathology complexity increases (irrectangularity overlap coefficient = 0.073, 95% CI [0.016, 0.13]; elongation overlap coefficient was not statistically significant).
The results of the above experiments using hit/miss as an evaluation metric were consistent with the results when using overlap as an evaluation metric. In both experiments, we reported the 95% confidence interval and the Bonferroni corrected p-values.
Effect of model confidence on AI localization performance
Since the saliency method is highly dependent on the DNN’s architecture, we conducted statistical analyses to determine whether there was any correlation between the model’s confidence in its prediction and AI localization performance. We first ran a simple regression for each pathology using the model’s probability output for the pathology as the single independent variable and using the IoU of the AI segmentation with reference segmentation as the response variable. We then performed a simple regression that uses the same approach as above, but that includes all 10 pathologies. For each of the 11 regressions, we excluded true negative cases in order to calculate the IoU score for the expert segmentations. In addition to the linear regression coefficients, we also computed the Spearman correlation coefficients to capture any potential non-linear associations (see Table 3).
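To make the difference between the two coefficients concrete, the following sketch computes a Pearson correlation and a Spearman rank correlation on synthetic probability and IoU values. It is a dependency-free illustration with hypothetical helper names, not the study's analysis code, and it omits p-values and confidence intervals.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Synthetic example: IoU rises monotonically but not linearly with probability.
prob = [0.55, 0.60, 0.70, 0.85, 0.95]
iou_vals = [0.01, 0.02, 0.05, 0.20, 0.60]
rho = spearman(prob, iou_vals)   # ranks agree perfectly
r = pearson(prob, iou_vals)      # lower, because the relation is curved
```

Spearman's coefficient depends only on rank order, so it reaches its maximum for any monotonic relationship, while a linear coefficient is attenuated when the relationship is curved.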
We found that for all pathologies, model confidence was positively correlated with AI localization performance. The p-values for all of the coefficients were below 0.001 except for the coefficients for Pneumothorax (n = 11) and Lung Lesion (n = 50), the two pathologies for which we had the fewest positive examples. Of all the pathologies, model confidence for positive predictions of Enlarged Cardiomediastinum had the largest linear regression coefficient with AI localization performance (1.974, p-value = 2.523e-24). Model confidence for positive predictions of Pneumothorax had the largest Spearman correlation coefficient with AI localization performance (0.734, p-value = 0.01), although the coefficient was not as statistically significant as the Spearman correlation coefficient for Pleural Effusion (0.69, p-value = 8.08e-24), the second largest of all the pathologies. Combining all of the pathologies (n = 2365), the linear regression coefficient was 0.109 (95% CI [0.083, 0.135]), and the Spearman correlation coefficient was 0.285 (95% CI [0.239, 0.331]). We also performed analogous experiments using hit rate as the response variable and found comparable results (see Supplementary Table 1).
Discussion
The purpose of this work was to evaluate the performance of saliency methods, which are widely used in clinical practice for DNN prediction explainability. We demonstrate that saliency maps are consistently worse than expert radiologists at localizing a variety of pathologies on CXRs. We use qualitative and quantitative analyses to establish that AI localization performance is furthest from expert localization performance in the face of pathologies that have multiple instances, are smaller in size, and have shapes that are more complex, suggesting that deep learning explainability as a clinical interface may be less reliable and less useful when used for pathologies with those characteristics. We show that model confidence is positively correlated with AI localization performance, which could indicate that the saliency methods are safer to use as a decision aid to clinicians when the model has made a positive prediction with high confidence. Finally, since IoU computes the overlap of two segmentations but pointing game hit rate better captures diagnostic attention, we suggest using both metrics to evaluate both AI and expert localization performance.
Our work has several potential implications for patient care. Heat maps generated using saliency methods are advocated as clinical decision support in the hope that the heat maps not only improve clinical decision-making, but also encourage clinicians to trust model predictions32–34. However, we found that AI localization, on balance, performed worse than expert localization across multiple analyses. This is consistent with recent work focused on localizing a single pathology, Pneumothorax, in CXRs35. Our work expanded this exploration and found that common saliency methods may underperform for many important pathologies on CXRs. If used in clinical practice, heat maps that incorrectly highlight medical images may exacerbate well-documented biases, chiefly automation bias, and erode trust in model predictions (even when model output is correct), limiting clinical translation25.
Determining when saliency methods are more likely to succeed or fail in localizing pathologies could have further implications for patient care. That knowledge could inform not only under what clinical conditions saliency methods might be safer to use, but also how we might improve saliency methods in the future. We found that AI localization performance worsens in the presence of pathologies that have multiple instances. We also found that AI localization performance worsens in the presence of pathologies that are smaller in size compared with the CXR image. This result explains why, under both the overlap and hit rate schemes, the gap between AI and expert localization performance was the largest for Lung Lesion, whose mean area ratio was the smallest of the 10 pathologies we explored. Moreover, AI localization performance under both evaluation schemes was best for Enlarged Cardiomediastinum and Cardiomegaly, which had the largest and third largest area ratios, respectively, of the 10 pathologies, suggesting that saliency methods might be safer to use in the context of these two pathologies, or pathologies with similar characteristics. Grad-CAM segmentations often fail to respect clear anatomical boundaries, and we hypothesize that this is an algorithmic artifact of Grad-CAM: its heat map is produced at the resolution of the final feature map (14 x 14) and then interpolated to the original image dimension (usually 2000 x 2000), resulting in coarse resolution. We also found that AI localization performance worsens in the presence of pathologies whose shapes are more complex. AI localization for Pneumothorax and Support Devices, both of which were more elongated and complex than any of the other conditions, underperformed compared to expert localization performance; however, this performance gap must also be considered in the context of the model training data prevalence, and future work may explore the impact of training data prevalence on the localization performance of saliency methods.
While IoU is a commonly used metric for evaluating semantic segmentation outputs, there are inherent limitations to the metric in the pathological context. This is indicated by our finding that even the expert segmentations had relatively low overlap with the reference segmentations (the highest expert mIoU was 0.712 for Cardiomegaly). One potential explanation for this consistent underperformance is that pathologies can be hard to distinguish, especially without clinical context. Furthermore, whereas many people might agree on how to segment, say, a cat or a stop sign in traditional computer vision tasks, radiologists use a certain amount of clinical discretion when defining the boundaries of a pathology on a CXR. There can also be institutional and geographic differences in how radiologists are taught to recognize pathologies, and studies have shown that there can be high interobserver variability in the interpretation of CXRs36–38. We sought to address this with the hit rate evaluation metric, which highlights when two radiologists share the same diagnostic intention, even if it is less exact than IoU in comparing segmentations directly. Expert performance using hit rate was above 0.95 for four pathologies (Pneumothorax, Cardiomegaly, Support Devices, and Enlarged Cardiomediastinum); these are pathologies for which there is often little disagreement between radiologists about where the pathologies are located, even if the expert segmentations are noisy. The only pathology for which AI localization performance was better than expert localization performance under the hit rate scheme was Consolidation. 
However, because the hit rate scheme required the benchmark radiologists to select only one point on the CXR, even if there were multiple instances of the pathology present (as is often the case with Consolidation), it is likely that the hit rate setup unfairly penalized expert performance in this case and that it is not the best evaluation metric to use for Consolidation. Further work is needed to validate this hypothesis and to demonstrate which segmentation evaluation metrics, even beyond overlap and hit rate, are more appropriate for which pathologies when evaluating saliency methods for the clinical setting.
Our work builds upon several studies investigating the validity of saliency maps in localization39,40 and upon some early work on trustworthiness of saliency methods to explain DNNs in medical imaging41. We substantially extend the body of literature by doing a comprehensive analysis on a multi-label classification task using the most popular saliency method in the medical context, Grad-CAM. Our work analyzes 10 different commonly occurring clinical pathologies as opposed to only Pneumothorax and Pneumonia. In addition to demonstrating that Grad-CAM is not yet ready for clinical use, we also highlight the strengths and weaknesses of Grad-CAM by doing quantitative statistical analysis, thus opening future avenues of research to improve Grad-CAM in particular and saliency methods in general. Finally, we establish new ground truth and benchmark segmentations on 10 different CXR observations facilitating future research on attribution methods.
There are several limitations of our work. We chose to evaluate Grad-CAM, since it has become one of the most popular explainability methods for CXRs, but future work may also evaluate other saliency methods, including Integrated Gradients42, WILDCAT43, and SmoothGrad44. While we used DenseNet121 as the underlying model architecture for the saliency method, since it was shown to produce the best classification results on the CheXpert dataset, future work should explore and compare different model architectures for CXR explainability. Our dataset had only 11 CXRs with Pneumothorax and 50 CXRs with Lung Lesion. Future work should investigate the impact of pathology prevalence in the training data on AI localization performance. Some pathologies, such as effusions and cardiomegaly, are always in the same place in a frontal view CXR; others, such as lesions and opacities, can occur in different locations on a CXR. Future work could investigate how a pathology’s location on a CXR, and the consistency of that location, affect AI localization performance. Finally, we compared Grad-CAM-generated pixel-level segmentations to human expert pixel-level segmentations, but future work might explore how AI localization performance changes when comparing bounding-box annotations, instead of pixel-level segmentations.
In conclusion, we demonstrate that not only should we not yet rely on saliency methods for deep learning explainability in CXRs, but saliency methods are particularly brittle in the presence of pathologies that have multiple instances, are smaller in size, and have shapes that are more complex. Although our findings suggest that care should be taken when deploying saliency methods into clinical practice, we demonstrate scenarios in which heat maps generated by saliency methods are most consistent with human expert annotations. Our work serves as the foundation for future work that rigorously evaluates a range of saliency methods using a variety of evaluation metrics before deep learning explainability techniques are integrated into the medical workflow.
Methods
Ethical and information governance approvals
This study does not involve human subject participants.
Dataset and clinical taxonomy
Dataset description
The localization experiments were performed using CheXpert, a large public dataset for chest X-ray interpretation. The CheXpert dataset contains 224,316 chest X-rays for 65,240 patients labeled for the presence of 14 observations (13 pathologies and an observation of “No Finding”) as positive, negative, or uncertain. The CheXpert validation set consists of 234 chest X-rays from 200 patients randomly sampled from the full dataset and was labeled according to the consensus of three board-certified radiologists. The test set consists of 668 chest X-rays from 500 patients not included in the training or validation sets and was labeled according to the consensus of five board-certified radiologists. See Supplementary Table 2 for dataset summary statistics.
Reference segmentation
The chest X-rays in our validation set and test set were manually segmented by two board-certified radiologists with 18 and 27 years of experience, using the annotation software tool MD.ai45 (see Supplementary Figs. 12 through 14). The radiologists were asked to contour the region of interest for all observations in the chest X-rays for which there was a positive ground truth label in the CheXpert dataset. For a pathology with multiple instances, all the instances were contoured. For Support Devices, radiologists were asked to contour any implanted or invasive devices, including pacemakers, PICC/central catheters, chest tubes, endotracheal tubes, feeding tubes, and stents, and to ignore ECG lead wires or external stickers visible in the chest X-ray. Finally, of the 14 observations labeled in the CheXpert dataset, Fracture, Pleural Other, and Pneumonia were not segmented because they had low prevalence and/or ill-defined boundaries unfit for segmentation.
Evaluating the expert performance using benchmark segmentation
To evaluate the expert performance on the test set using the IoU evaluation method, three radiologists, certified in Vietnam with 9, 10, and 18 years of experience, were asked to segment the regions of interest for all observations in the chest X-rays for which there was a positive ground truth label in the CheXpert dataset. These radiologists were also provided the same instructions for contouring as were provided to the radiologists drawing the reference segmentations. To extract the “maximally activated” point from the benchmark segmentations, we asked the same radiologists to locate each pathology present on each CXR using only a single most representative point for that pathology on the CXR (see Supplementary Figs. 1 through 11 for the detailed instructions given to the radiologists). There was no overlap between these three radiologists and the two who drew the reference segmentations.
Classification network architecture and training protocol
Multi-label classification model
The model takes as input a single-view chest X-ray and outputs the probability for each of the 14 observations. When more than one view is available, the model outputs the maximum probability of each observation across the views. Each chest X-ray was resized to 320 × 320 pixels and normalized before it was fed into the network. The DenseNet121 model architecture46 was used. Cross-entropy loss was used to train the model. The Adam optimizer47 was used with default β-parameters of β1 = 0.9 and β2 = 0.999, and the learning rate was fixed at 1 × 10⁻⁴ for the duration of the training. Batches were sampled using a fixed batch size of 16 images.
Ensembling
An ensemble of 30 DenseNet121 checkpoints was created to improve the performance of the model. The 30 checkpoints were generated by training the model for 3 epochs and selecting the 10 checkpoints from each epoch with the highest average AUC across 5 observations selected for their clinical importance and prevalence in the validation set: Atelectasis, Cardiomegaly, Consolidation, Edema, and Pleural Effusion. See Supplementary Table 3 for the performance of the model on each of the pathologies.
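The checkpoint selection rule can be sketched as follows. The record layout (keys `path`, `epoch`, `auc`), the file names, and the AUC values are all hypothetical and serve only to illustrate "top 10 per epoch by mean AUC over five observations"; how checkpoints were saved within an epoch and how the ensemble combines its members are not described in the text and are not modeled here.

```python
from statistics import mean

SELECTION_TASKS = ["Atelectasis", "Cardiomegaly", "Consolidation",
                   "Edema", "Pleural Effusion"]

def select_checkpoints(checkpoints, per_epoch=10):
    """Keep the `per_epoch` checkpoints with the highest mean validation AUC in each epoch.

    `checkpoints` is a list of dicts with hypothetical keys:
    "path" (str), "epoch" (int), and "auc" (dict mapping task name to AUC).
    """
    by_epoch = {}
    for ckpt in checkpoints:
        by_epoch.setdefault(ckpt["epoch"], []).append(ckpt)
    selected = []
    for epoch in sorted(by_epoch):
        ranked = sorted(
            by_epoch[epoch],
            key=lambda c: mean(c["auc"][t] for t in SELECTION_TASKS),
            reverse=True,
        )
        selected.extend(ranked[:per_epoch])
    return selected

# Synthetic records: 3 epochs x 12 checkpoints, AUC values invented for illustration.
records = [
    {"path": f"epoch{e}_iter{i}.ckpt", "epoch": e,
     "auc": {t: 0.70 + 0.01 * i for t in SELECTION_TASKS}}
    for e in range(1, 4) for i in range(12)
]
ensemble = select_checkpoints(records)   # 10 checkpoints from each of 3 epochs
```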
DNN interpretation strategy
Saliency method
Grad-CAM was used to visualize the decision made by the classification network. Grad-CAM uses the gradients of the target flowing into the final convolutional layer to produce a saliency map that highlights the regions on which the model focuses while making the decision. The saliency map produced by Grad-CAM was resized to the original image dimensions, normalized using min-max normalization, and converted into a binary segmentation by thresholding with Otsu’s method48. To further ensure that the final binary segmentation was consistent with the model’s probability output, a second threshold was applied so that the segmentation mask was all zeros if the predicted probability fell below a chosen level. This probability threshold was searched over the interval [0, 0.8] in steps of 0.1, and the value for each pathology was chosen to maximize the mIoU on the validation set.
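The post-processing steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: the saliency map is already resized and given as a list of rows, Otsu's method is applied to a 256-bin histogram, and all function names (`min_max_normalize`, `otsu_threshold`, `saliency_to_mask`, `search_threshold`) are hypothetical.

```python
from typing import Callable, List

Map = List[List[float]]

# Candidate probability thresholds: 0.0, 0.1, ..., 0.8.
PROB_THRESHOLDS = [i / 10 for i in range(9)]


def min_max_normalize(saliency: Map) -> Map:
    """Rescale values to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in saliency for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in saliency]
    return [[(v - lo) / (hi - lo) for v in row] for row in saliency]


def otsu_threshold(normalized: Map, bins: int = 256) -> float:
    """Otsu's method: pick the cut that maximizes between-class variance."""
    hist = [0] * bins
    for row in normalized:
        for v in row:
            hist[min(int(v * bins), bins - 1)] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_var, best_bin = -1.0, 0
    w0, sum0 = 0, 0.0
    for t in range(bins - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, t
    # Pixels in bins above `best_bin` are foreground.
    return (best_bin + 1) / bins


def saliency_to_mask(saliency: Map, probability: float, prob_threshold: float) -> List[List[int]]:
    """Binary mask from a saliency map, gated by the model's probability."""
    if probability < prob_threshold:
        return [[0 for _ in row] for row in saliency]
    norm = min_max_normalize(saliency)
    cut = otsu_threshold(norm)
    return [[1 if v >= cut else 0 for v in row] for row in norm]


def search_threshold(validation_miou: Callable[[float], float]) -> float:
    """Return the candidate threshold with the highest validation mIoU.

    `validation_miou` is a hypothetical callback mapping a threshold to the
    mIoU obtained on the validation set for one pathology.
    """
    return max(PROB_THRESHOLDS, key=validation_miou)
```

The probability gate runs before any thresholding, so a low-confidence prediction always yields an empty mask regardless of the saliency values.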
Segmentation evaluation metrics
Localization performance of each segmentation was evaluated using the Intersection over Union (IoU) score. The IoU is the ratio of the area of overlap to the area of union between the ground truth and predicted segmentations; it ranges from 0 to 1, with 0 signifying no overlap and 1 signifying perfect overlap. We then compared the mean IoU (mIoU) of Grad-CAM and the radiologist benchmark on each pathology. The mIoU is the average IoU over all images in the test dataset. True negatives, where both segmentations were all zeros, were excluded from the mean. Confidence intervals were calculated using bootstrapping with 1000 bootstrap samples. The variation in confidence interval width across pathologies can be explained by differences in sample size.
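A minimal sketch of these metrics follows. It assumes binary masks stored as lists of rows and a percentile bootstrap over per-image IoU scores. The text specifies only that 1000 bootstrap samples were used, so the percentile form and the function names are assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple

Mask = List[List[int]]


def iou(pred: Mask, truth: Mask) -> Optional[float]:
    """Intersection over Union; None marks a true negative (both masks empty)."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    if union == 0:
        return None
    return inter / union


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Average IoU over (prediction, ground truth) pairs, skipping true negatives."""
    scores = [s for s in (iou(p, t) for p, t in pairs) if s is not None]
    if not scores:
        raise ValueError("no pairs with a non-empty mask")
    return sum(scores) / len(scores)


def bootstrap_ci(scores: Sequence[float], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `scores` (assumed form)."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Returning `None` for empty pairs keeps the true-negative exclusion explicit, rather than silently scoring such cases as 0 or 1.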
Statistical analysis
Pathology Characteristics
We used four features to characterize the pathologies: (1) Number of instances is defined as the number of disjoint components in the segmentation. (2) Area ratio is the area of the pathology divided by the total image area. (3, 4) Elongation and irrectangularity are geometric features that measure shape complexity. They were designed to quantify what radiologists qualitatively described as focal or diffuse. To calculate these metrics, a rectangle of minimum area enclosing the contour is fitted to each pathology. Elongation is defined as the ratio of the rectangle’s longer side to its shorter side. Irrectangularity is defined as 1 − (area of segmentation / area of enclosing rectangle), with values ranging from 0 to 1 and 1 being very irrectangular. When there are multiple instances within one pathology, we used the characteristics of the dominant instance (the one with the largest perimeter).
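The four features can be sketched as follows. Two assumptions are made for illustration: components are counted with 8-connectivity (the connectivity is not stated in the text), and the side lengths of the minimum-area enclosing rectangle are computed elsewhere and passed in. All function names are hypothetical.

```python
from collections import deque
from typing import List

Mask = List[List[int]]


def count_instances(mask: Mask) -> int:
    """Number of disjoint components in a binary mask (8-connectivity assumed)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def area_ratio(mask: Mask) -> float:
    """Segmented area divided by total image area."""
    return sum(sum(row) for row in mask) / (len(mask) * len(mask[0]))


def elongation(rect_width: float, rect_height: float) -> float:
    """Longer side of the enclosing rectangle divided by its shorter side."""
    return max(rect_width, rect_height) / min(rect_width, rect_height)


def irrectangularity(seg_area: float, rect_width: float, rect_height: float) -> float:
    """1 - segmentation area / enclosing rectangle area."""
    return 1 - seg_area / (rect_width * rect_height)
```

For example, a segmentation of area 6 inside a 4 × 2 enclosing rectangle has elongation 2 and irrectangularity 0.25.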
Model Confidence
We used the probability output of the DNN architecture as the measure of model confidence. The probabilities were normalized using min-max normalization per pathology before aggregation.
Linear Regression
For each evaluation scheme (overlap and hit rate), we ran three groups of simple linear regressions, with the expert evaluation metric, the AI evaluation metric, and their difference as the response variables. Each group has four regressions, each using one of the four pathology characteristics above as its single explanatory variable. Only CXRs with a positive label were included in each regression (n=1534). All features were normalized using min-max normalization so that they are on comparable scales. We report the 95% confidence interval and p-value of the regression coefficients.
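One such regression can be sketched with the standard library alone. This illustration assumes ordinary least squares and uses a normal approximation for the confidence interval and p-value. An exact analysis would use the t distribution with n − 2 degrees of freedom, which is nearly identical at n=1534. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean
from typing import List, Sequence, Tuple


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale a feature to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def simple_regression(x: Sequence[float], y: Sequence[float],
                      level: float = 0.95) -> Tuple[float, Tuple[float, float], float]:
    """Fit y = a + b*x; return (b, confidence interval for b, two-sided p-value).

    Uses a normal approximation to the t distribution (an assumption that is
    reasonable only for large n).
    """
    n = len(x)
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance and standard error of the slope.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)

    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (slope - z * se, slope + z * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(slope) / se)) if se > 0 else 0.0
    return slope, ci, p_value


# Toy data: a single characteristic (feature) against a response such as IoU.
feature = min_max([0, 1, 2, 3, 4])
response = [0.1, 0.9, 2.1, 2.9, 4.0]
slope, ci, p_value = simple_regression(feature, response)
```

Because the feature is rescaled to [0, 1], the slope is the expected change in the response between the smallest and largest observed feature values, which makes coefficients comparable across characteristics.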
Data Availability
CheXpert data is available at https://stanfordmlgroup.github.io/competitions/chexpert/. The validation set and corresponding benchmark radiologist annotations will be available online for the purpose of extending the study.
Code Availability
All code used to produce the results of the paper will be in a public repository for the purpose of reproducing the study. The link to the code will be added to the text of the paper for the camera-ready version.
Competing Interests
There are no competing interests.
Author Contributions
(1) Conceptualization: P.R. and A.P., (2) Design: P.R., A.P., A.S., X.G. and A.A., (3) Data analysis and interpretation: A.S., X.G., A.A., P.R., A.P., S.T., C.N., V.N., J.S., and F.B., (4) Drafting of the manuscript: A.S., X.G., A.A., and P.R., (5) Critical revision of the manuscript for important intellectual content: A.P., S.T., C.N., V.N., J.S., F.B., A.N., and M.L., (6) Supervision: A.N., M.L., and P.R.