Abstract
Ultrasound is an important imaging modality for the detection and characterization of breast cancer. Though consistently shown to detect mammographically occult cancers, especially in women with dense breasts, breast ultrasound has been noted to have high false-positive rates. In this work, we present an artificial intelligence (AI) system that achieves radiologist-level accuracy in identifying breast cancer in ultrasound images. To develop and validate this system, we curated a dataset consisting of 288,767 ultrasound exams from 143,203 patients examined at NYU Langone Health, between 2012 and 2019. On a test set consisting of 44,755 exams, the AI system achieved an area under the receiver operating characteristic curve (AUROC) of 0.976. In a reader study, the AI system achieved a higher AUROC than the average of ten board-certified breast radiologists (AUROC: 0.962 AI, 0.924±0.02 radiologists). With the help of the AI, radiologists decreased their false positive rates by 37.4% and reduced the number of requested biopsies by 27.8%, while maintaining the same level of sensitivity. To confirm its generalizability, we evaluated our system on an independent external test dataset where it achieved an AUROC of 0.911. This highlights the potential of AI in improving the accuracy, consistency, and efficiency of breast ultrasound diagnosis worldwide.
Breast cancer is the most frequently diagnosed cancer and the leading cause of cancer-related deaths among women worldwide [1]. It is estimated that 281,550 new cases of invasive breast cancer will be diagnosed among women in the United States in 2021, eventually leading to approximately 43,600 deaths [2]. Identifying breast cancer at an early stage before metastasis enables more effective treatments and therefore significantly improves survival rates [3, 4]. Mammography has long been the most widely utilized imaging technique for screening and early detection of breast cancer, but it is not without limitations. In particular, for women with dense breast tissue, the sensitivity of mammography drops from 85% to 48-64% [5]. This is a significant drawback, as women with dense breasts have a 4-fold increased risk of developing breast cancer [6]. Moreover, mammography is not always accessible, especially in resource-limited settings, where the high cost of equipment is prohibitive and skilled technologists and radiologists are not available [7].
Given the limitations of mammography, ultrasound (US) plays an important role in breast cancer diagnosis. It often serves as a supplementary modality to mammography in screening settings [8] and as the primary imaging modality in many diagnostic settings, including the evaluation of palpable breast abnormalities [9]. Moreover, US can help further evaluate and characterize breast masses and is therefore frequently used for performing image guided breast biopsies [10]. Breast US has several advantages compared to other imaging modalities, including relatively lower cost, lack of ionizing radiation, and the ability to evaluate images in real time [4]. In particular, US is especially effective at distinguishing solid breast masses from fluid-filled cystic lesions. In addition, breast US is able to detect cancers obscured on mammography, making it particularly useful in diagnosing cancers in women with mammographically dense breast tissue [11].
Despite these advantages, interpreting breast US is a challenging task. Radiologists evaluate US images using different features including lesion size, shape, margin, echogenicity, posterior acoustic features, and orientation, which vary significantly across patients [12]. Ultimately, they determine if the imaged findings are benign, need short-term follow-up imaging, or require a biopsy based on their suspicion of malignancy. There is considerable intra-reader variability in these recommendations and breast US has been criticized for increasing the number of false-positive findings [13, 14]. Compared to mammography alone, the addition of US in breast cancer screening leads to an additional 5-15% of patients being recalled for further imaging and an additional 4-8% of patients undergoing biopsy [15, 16, 17]. However, only 7-8% of biopsies prompted by screening US are found to identify cancers [15, 17].
Computer-aided diagnosis (CAD) systems were first proposed more than a decade ago to assist radiologists in the interpretation of breast US exams [18]. Early CAD systems often relied on handcrafted visual features that are difficult to generalize across US images acquired using different protocols and US units [19, 20, 21, 22, 23, 24]. Recent advances in deep learning have facilitated the development of AI systems for the automated diagnosis of breast cancer from US images [25, 26, 27]. However, the majority of these efforts rely on image-level or pixel-level labels, which require experts to manually mark images containing visible lesions within each exam or annotate lesions in each image, respectively [28, 29, 30, 31, 32, 33]. As a result, existing studies have been based on small datasets consisting of several hundred or several thousand US images. Deep learning models trained on those datasets might not sufficiently learn the diverse characteristics of US images observed in clinical practice. This is especially important for US imaging, as lesion appearance can vary substantially depending on the imaging technique and the manufacturer of the US unit. Moreover, prior research has primarily focused on differentiating between benign and malignant breast lesions, hence evaluating AI systems only on images that contain either benign or malignant lesions [34, 35, 36]. In contrast, the majority of breast cancer screening exams are negative (no lesions are present) [7, 11]. In addition, most AI systems in previous studies do not interpret the model’s predictions, resulting in “black-box” models [28, 29, 30, 31, 32, 33, 34, 35, 36]. So far, there has been little work on interpretable AI systems for breast US.
In this work, we present an AI system (Figure 1) to identify malignant lesions in breast US images with the primary goal of reducing the frequency of false positive findings. The AI system was trained to perform classification and localization in a weakly supervised manner [37, 38, 39]. That is, our AI system is able to explain its predictions by indicating locations of malignant lesions even though it is trained with binary breast-level cancer labels only (see Methods section ‘Breast-level cancer labels’), which were automatically extracted from pathology reports. The explainability of our system enables clinicians to develop trust and better understand its strengths and limitations.
The proposed system provides several advances relative to previous work. First, to the best of our knowledge, the dataset used to train and evaluate this AI system is larger than any prior dataset used for this application [29, 40]. Second, to understand the potential value of this AI system in clinical practice, we conducted a reader study to compare its diagnostic accuracy with that of ten board-certified breast radiologists. The AI system achieved a higher area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) than the ten radiologists on average. Moreover, we showed that the hybrid model, which aggregates the predictions of the AI system and radiologists, improved radiologists’ specificity and decreased biopsy rate while maintaining the same level of sensitivity. In addition, we showed that the performance of the AI system remained robust across patients from different age groups and mammographic breast densities. The accuracy of our system also remained high when tested on an external dataset [40].
Results
Datasets
The AI system was developed and evaluated using the NYU Breast Ultrasound Dataset [41] consisting of 5,442,907 images within 288,767 breast US exams (including both screening and diagnostic exams) collected from 143,203 patients examined between 2012 and 2019 at NYU Langone Health in New York, USA. The NYU Langone hospital system spans multiple sites across New York City and Long Island, allowing the inclusion of a diverse patient population. The dataset included 28,914 exams associated with a biopsy, and among those, the biopsy yielded benign and malignant results for 26,843 and 5,593 breasts, respectively. Patients in the dataset were randomly divided into a training set (60%) that was used for model training, a validation set (10%) that was used for hyperparameter tuning, and an internal test set (30%) that was used for model evaluation. Each patient was included in only one of the three sets. We used a subset of the internal test set for the reader study. The statistics of the overall dataset, the internal test set, and the reader study set are summarized in Table 1.
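The patient-level split described above can be illustrated with a minimal sketch. The function name `split_patients`, the fixed seed, and the use of integer patient identifiers are assumptions made for illustration; the text specifies only the 60/10/30 proportions and that each patient appears in exactly one set.

```python
import random

def split_patients(patient_ids, fractions=(0.6, 0.1, 0.3), seed=0):
    """Randomly assign each unique patient to exactly one of three sets."""
    # De-duplicate so that all exams of a patient follow the patient.
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed: an assumption, for reproducibility
    rng.shuffle(unique_ids)
    n = len(unique_ids)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),  # remainder goes to the test set
    }

# Hypothetical cohort of 1,000 patients.
splits = split_patients(range(1000))
```

Splitting on patient identifiers rather than on exams is what prevents images from the same patient from appearing in both training and evaluation data.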
Each breast within an exam was assigned a label indicating the presence of cancer using pathology results. The pathology examinations were conducted on tissues obtained using image-guided biopsy or surgical excision. As shown in Figure 1b, all cancer-positive exams were accompanied by at least one pathology report indicating malignancy that was collected within 30 days before or 120 days after the US examination. This time frame was chosen to maximize the inclusion of both lesions found at primary screening US and lesions found during targeted US after an initial imaging workup with a different modality. We filtered the internal test set to ensure that cancers were visible on positive exams and that negative exams had either a cancer-negative biopsy or at least one negative follow-up US exam (see Methods section ‘Additional filtering of the test set’). Studies with neither a biopsy nor any negative follow-up were included in the training and validation sets but excluded from the internal test set.
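As a minimal sketch of the labeling window, the function below marks a breast as cancer-positive when any malignant pathology report falls between 30 days before and 120 days after the exam. The name `breast_label`, the inclusive window boundaries, and the example dates are assumptions for illustration; the actual extraction pipeline is described in the Methods section.

```python
from datetime import date, timedelta

def breast_label(exam_date, malignant_report_dates, days_before=30, days_after=120):
    """Return 1 if any malignant pathology report falls inside the window, else 0."""
    start = exam_date - timedelta(days=days_before)
    end = exam_date + timedelta(days=days_after)
    # Inclusive boundaries are an assumption; the text does not specify them.
    return int(any(start <= d <= end for d in malignant_report_dates))

exam = date(2018, 3, 1)
label_inside = breast_label(exam, [date(2018, 6, 1)])     # 92 days after the exam
label_too_late = breast_label(exam, [date(2018, 8, 1)])   # 153 days after the exam
label_too_early = breast_label(exam, [date(2018, 1, 15)]) # 45 days before the exam
```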
To assess the ability of the AI system to generalize across patient populations and image acquisition protocols, we further evaluated it on the public Breast Ultrasound Images (BUSI) dataset collected at the Hospital for Early Detection and Treatment of Women’s Cancer in Cairo, Egypt [40]. This external test set consisted of 780 images, of which 437 were benign, 210 were malignant, and 133 were negative (no lesion present). These images were collected from 600 patients. Of note, the BUSI dataset was acquired using different US machines and was collected from patients with contrasting demographic backgrounds compared to the NYU dataset. Each image in the BUSI dataset was associated with a label indicating the presence of any malignant lesions.
AI system performance
On the internal test set of 44,755 US exams (25,003 patients, 79,156 breasts), the AI system achieved an AUROC of 0.976 (95% CI: 0.972, 0.980) in identifying breasts with malignant lesions. Additionally, we stratified patients by age, mammographic breast density, and US machine manufacturer, and evaluated AI model performance across these sub-populations (Table 2). The AI system maintained high diagnostic accuracy among all age groups (AUROC: 0.969-0.981), mammographic breast densities (AUROC: 0.964-0.979), and US device manufacturers (AUROC: 0.974-0.990). In addition, we evaluated the AI system on the external test set (BUSI dataset) [40]. Even though the AI system was not trained on any images of the external test set, it maintained a high level of diagnostic accuracy (0.911 AUROC, 95% CI: 0.885, 0.933).
Reader study
To compare the performance of the AI system with that of breast radiologists, we constructed a reader study subset by selecting 663 exams (644 patients, 1,024 breasts) from the internal test set. Among the exams selected for this study, 73 breasts had biopsy-proven cancer, 535 breasts had a biopsy yielding exclusively benign findings, and 416 breasts were not biopsied but were evaluated by radiologists as likely benign and had a follow-up benign evaluation at 1-2 years. These proportions were chosen to increase the difficulty of the interpretation task and increase statistical power. Readers were informed that the study dataset was enriched with cancers but were not informed of the enrichment level.
Ten board-certified breast radiologists rated each breast according to the Breast Imaging Reporting and Data System (BI-RADS) [12]. Radiologists’ experience is described in Table A.1. Readers were provided with contextual information typically available in the clinical setting, including the patient’s age, burnt-in annotations showing measurements of suspicious findings, and notes from the technologist, such as specifying any region of palpable concern or pain. In contrast, the AI system was not provided any contextual information.
For each reader, we computed a receiver operating characteristic (ROC) curve and a precision-recall curve by comparing their BI-RADS scores to the ground-truth outcomes (see Methods section ‘Statistical analysis’). The ten radiologists achieved an average AUROC of 0.924 (SD: 0.020, 95% CI: 0.905, 0.944) and an average AUPRC of 0.565 (SD: 0.072, 95% CI: 0.465, 0.625) (Figure A.1). Compared to the average radiologist in this study, the AI system achieved a higher AUROC of 0.962 (95% CI: 0.943, 0.979) with an AUROC improvement of 0.038 (95% CI: 0.028, 0.052, P<0.001) and a higher AUPRC of 0.752 (95% CI: 0.675, 0.849) with an AUPRC improvement of 0.187 (95% CI: 0.140, 0.256, P<0.001) (Figure 2). In addition, we also compared the specificity and sensitivity achieved by the AI system and radiologists. We assigned a positive prediction to any breast a radiologist gave a BI-RADS score of ≥4, and a negative prediction to any breast that was given a BI-RADS score of 1-3. A BI-RADS score of ≥4 is an assessment that indicates a radiologist thinks an exam is suspicious for malignancy. This was selected as the threshold for positive predictions since this is the score above which a patient will typically undergo an invasive procedure (biopsy or surgical excision) to definitively determine whether they have cancer [12]. With this methodology, the ten radiologists achieved an average specificity of 80.7% (SD: 4.7%, 95% CI: 78.9%, 82.6%) and an average sensitivity of 90.1% (SD: 4.3%, 95% CI: 86.4%, 93.8%). At the average radiologist’s specificity, the AI system achieved a sensitivity of 94.5% (95% CI: 89.4%, 100.0%) and an improvement in sensitivity of 4.4% (95% CI: -0.3%, 7.5%, P=0.0278). At the average radiologist’s sensitivity, the AI system achieved a higher specificity of 85.6% (95% CI: 83.9%, 88.0%) with an absolute increase in specificity of 4.9% (95% CI: 3.0%, 7.1%; P<0.001). 
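The per-reader metrics described above can be sketched as follows. The AUROC is computed with the rank-based (Mann-Whitney) formulation, counting tied scores as one half, and sensitivity and specificity use the BI-RADS ≥4 threshold from the text. The tie handling, the use of integer BI-RADS scores without subcategories, and the toy data are assumptions; the exact procedure is given in the Methods section ‘Statistical analysis’.

```python
def auroc(scores, labels):
    """Probability that a random positive case scores higher than a random negative case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties count as half a win (an assumption suited to ordinal BI-RADS scores).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(birads, labels, threshold=4):
    """Binarize BI-RADS scores at the threshold and compare with ground truth."""
    tp = sum(1 for b, y in zip(birads, labels) if b >= threshold and y == 1)
    tn = sum(1 for b, y in zip(birads, labels) if b < threshold and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg

# Toy reader: eight breasts, three with cancer (illustrative values only).
toy_birads = [5, 4, 3, 2, 4, 3, 1, 2]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_auroc = auroc(toy_birads, toy_labels)
toy_sens, toy_spec = sensitivity_specificity(toy_birads, toy_labels)
```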
At the average radiologist’s sensitivity, the AI system recommended tissue biopsies on 19.8% (95% CI: 17.9%, 22.1%) of breasts and 32.5% (95% CI: 26.9%, 39.2%) of these biopsies were for breasts ultimately found to have cancer. Compared to the average reader’s biopsy rate of 24.3% (SD: 4.5%, 95% CI: 22.0%, 26.5%) and average PPV of 27.1% (SD: 4.1%, 95% CI: 22.9%, 33.1%), the AI system achieved an absolute reduction in biopsy rate of 4.5% (95% CI: 2.9%, 6.5%, P<0.001) which corresponds to 18.6% of all biopsies recommended by the average radiologist and achieved an absolute improvement in PPV of 5.4% (95% CI: 2.4%, 8.9%, P<0.001). The performance of the AI system and readers is summarized in Table A.2.
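To make the relationship between these quantities concrete, the sketch below derives biopsy rate, PPV, sensitivity, and specificity from confusion counts. The counts (66 true positives and 137 false positives among 73 cancer-positive and 951 cancer-negative breasts) are not reported in the text; they were back-calculated here from the stated percentages and should be read as an assumption that happens to reproduce them.

```python
def operating_point(tp, fp, n_pos, n_neg):
    """Summarize one decision threshold from its confusion counts."""
    n_total = n_pos + n_neg
    return {
        "sensitivity": tp / n_pos,
        "specificity": (n_neg - fp) / n_neg,
        "biopsy_rate": (tp + fp) / n_total,  # every positive call triggers a biopsy
        "ppv": tp / (tp + fp),               # fraction of biopsies that find cancer
    }

# Assumed counts for the AI system on the 1,024 reader-study breasts.
ai_point = operating_point(tp=66, fp=137, n_pos=73, n_neg=951)
ai_percent = {name: round(100 * value, 1) for name, value in ai_point.items()}
```

Under these assumed counts the sensitivity is 66/73, slightly above the readers' average, which is consistent with choosing the nearest achievable threshold on a finite set of cases.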
Subgroup analysis on the biopsied population
We conducted additional analyses on two clinically relevant subgroups in the reader study to understand the relative strengths of the AI system and radiologists. The first analysis examined diagnostic accuracy exclusively amongst breasts with lesions that had undergone biopsy evaluation (73 breasts with biopsy-confirmed malignant lesions and 535 breasts with exclusively biopsy-confirmed benign lesions). Breasts that yielded normal findings were not included. As expected, compared to the overall reader study population, AUROC (mean: 0.896, SD: 0.024, 95% CI: 0.874, 0.929) and specificity (mean: 69.8%, SD: 6.9%, 95% CI: 67.7%, 73.6%) of radiologists declined in this sub-population. Additionally, the average biopsy rate of radiologists increased to 37.4% (SD: 6.4%, 95% CI: 33.1%, 39.8%). On this subgroup, the AI system achieved an AUROC of 0.941 (95% CI: 0.922, 0.968). Compared to radiologists, the AI system demonstrated an absolute improvement of 8.5% (95% CI: 5.3%, 11.1%; P<0.001) in specificity, an absolute reduction of 7.5% (95% CI: 4.4%, 9.6%, P<0.001) in biopsy rate, and an absolute improvement in PPV of 6.7% (95% CI: 3.0%, 9.8%, P<0.001), while matching the average radiologist’s sensitivity. The performance of each reader is shown in Table A.3.
Next, we evaluated the accuracy of readers and the AI system exclusively amongst breasts with biopsy-confirmed cancers (97 malignant lesions across 73 breasts). As shown in Table A.4, we stratified malignant lesions by cancer subtype, histologic grade, and biomarker profile. This was done to further investigate the AI system’s ability to discriminate between benign and malignant lesions. Certain types of breast cancers (such as high grade, triple biomarker negative cancers) may closely resemble benign masses (more likely to have oval/round shape and circumscribed margins, less likely to have posterior attenuation compared to other cancers) and are considered particularly difficult to characterize [43]. This analysis demonstrated that the sensitivity of the AI system was similar to that of the readers across all stratification categories. There were no significant differences in sub-populations of patients where the AI system had inferior performance.
Qualitative analysis of saliency maps
In an attempt to understand the AI system’s potential utility as a decision support tool, we qualitatively assessed six studies using the AI’s saliency maps. These saliency maps indicated where the system identified potentially benign and malignant lesions, and represent data that could be made available to radiologists (in addition to breast-level predictions of malignancy) if the AI system were integrated into clinical practice. Figure 3a,b shows two 1.5 cm irregularly shaped hypoechoic masses with indistinct margins that ultimately underwent biopsy and were found to be invasive ductal carcinoma. All readers as well as the AI system correctly identified these lesions as being suspicious for malignancy. Figure 3c displays a small 7 mm complicated cystic/solid nodule with a microlobulated contour, which 7 out of 10 readers as well as the AI system thought appeared benign. However, this lesion ultimately underwent biopsy and was found to be invasive ductal carcinoma. Figure 3d displays a 7 mm superficial and palpable hypoechoic mass with surrounding echogenicity that underwent biopsy and was found to be benign fat necrosis. However, the AI system as well as 9 out of 10 readers incorrectly thought this lesion was suspicious for malignancy and recommended that it undergo biopsy. Lastly, Figure 3e shows a small 7 mm ill-defined area and Figure 3f displays a 9 mm mildly heterogeneous lobulated solid nodule. All 10 radiologists thought these two lesions appeared suspicious and recommended that they undergo biopsy. In contrast, the AI system correctly classified the exams as benign, and the lesions were ultimately found to be benign fibrofatty tissue (Figure 3e) and a fibroadenoma (Figure 3f).
Although we were unable to determine clear patterns among these US exams, the presence of cases where the AI system correctly contradicted the majority of readers and produced appropriate localization information underscores the potential complementary role the AI system might play in helping human readers more frequently reach accurate diagnoses.
Potential clinical applications
To evaluate the potential of our AI system to augment radiologists’ diagnosis, we created hybrid models of the AI system and the readers. The predictions of each hybrid model were computed as an equally weighted average between the AI system and each reader (see Methods section ‘Hybrid model’). This analysis revealed that the performance of all readers was improved by incorporating the predictions of the AI system (Figure 4, Table A.5). On average, the hybrid models improved radiologists’ AUROC by 0.037 (SD: 0.013, 95% CI: 0.011, 0.070, P<0.001) and improved their AUPRC by 0.219 (SD: 0.060, 95% CI: 0.089, 0.372, P<0.001). At the radiologists’ sensitivity levels, the hybrid models increased their average specificity from 80.7% to 88.0% (average increase 7.3%, SD: 3.8%, 95% CI: 2.7%, 18.5%, P<0.001), increased their PPV from 27.1% to 38.0% (average increase 10.8%, SD: 5.3%, 95% CI: 3.7%, 25.0%, P<0.001), and decreased their average biopsy rate from 24.3% to 17.6% (average decrease, 6.8%, SD: 3.5%, 95% CI: 2.3%, 17.1%, P<0.001). The reduction in biopsies achieved by the hybrid model represented 27.8% of all biopsies recommended by radiologists.
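A minimal sketch of the equally weighted average is given below. Only the equal weighting comes from the text; how a reader's BI-RADS score is converted into a value comparable with the AI system's output is described in the Methods section ‘Hybrid model’, so the linear rescaling in `birads_to_score` is a hypothetical placeholder rather than the procedure used in this study.

```python
def birads_to_score(birads, lowest=1, highest=5):
    """Hypothetical mapping: rescale an integer BI-RADS score linearly to [0, 1]."""
    return (birads - lowest) / (highest - lowest)

def hybrid_prediction(ai_prob, reader_score, weight=0.5):
    """Weighted average of the AI prediction and the reader score (equal weights by default)."""
    return weight * ai_prob + (1.0 - weight) * reader_score

# Example: the reader assigns BI-RADS 4 while the AI output is low (illustrative values).
reader_score = birads_to_score(4)                # 0.75 under the placeholder mapping
combined = hybrid_prediction(0.10, reader_score)
```

In this example the combined score falls between the two inputs, so a confident benign prediction from the AI system can pull a borderline reader assessment below a biopsy threshold, which is the mechanism behind the specificity gains reported above.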
In addition, the AI system could also be used to help radiologists triage US exams. To evaluate the potential of the AI system in identifying cancer-negative cases with high confidence, we selected a very low decision threshold to triage women into a no-radiologist work stream. On the reader study subset, using this triage paradigm, the AI system achieved an NPV of 99.86% while retaining a specificity of 77.7%. This result suggests that it may be feasible to dismiss 77.7% of normal/benign cases and skip radiologist review if we accept missing one cancer in every 740 negative predictions, which is less than 1/6 of the false negative rate observed among radiologists in the reader study (one missed cancer for every 109 negative evaluations). To evaluate the potential of the AI system in triaging patients into an enhanced assessment work stream, we used a very high decision threshold. In this enhanced assessment work stream, the AI system achieved a PPV of 84.4% while retaining a sensitivity of 52.1%. These results suggest that it may be feasible to rapidly prioritize more than half of cancer cases, with approximately five out of six biopsies leading to a diagnosis of cancer. For comparison, only 27.1% of the biopsies that the radiologists recommended led to a diagnosis of cancer. While we demonstrated the potential of AI in automatically triaging breast US exams, confirmation of these performance estimates would require extensive validation in a clinical setting.
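The two triage operating points can be checked with a short calculation. The confusion counts below are not reported in the text; they were inferred here from the stated percentages and the reader study composition (73 cancer-positive and 951 cancer-negative breasts), and are therefore an assumption.

```python
def triage_metrics(tp, fp, tn, fn):
    """Predictive values and rates for one decision threshold."""
    return {
        "npv": tn / (tn + fn),          # how trustworthy a negative call is
        "specificity": tn / (tn + fp),  # share of negatives that are dismissed
        "ppv": tp / (tp + fp),          # how trustworthy a positive call is
        "sensitivity": tp / (tp + fn),  # share of cancers that are escalated
    }

# Very low threshold (rule-out stream): assumed 739 true negatives and 1 missed cancer.
rule_out = triage_metrics(tp=72, fp=212, tn=739, fn=1)
# Very high threshold (enhanced assessment stream): assumed 38 true and 7 false positives.
rule_in = triage_metrics(tp=38, fp=7, tn=944, fn=35)

negatives_per_missed_cancer = (739 + 1) / 1  # negative predictions per false negative
```

With these assumed counts, one missed cancer among 740 negative predictions corresponds to the NPV of 99.86%, and 38 cancers among 45 positive predictions corresponds to roughly five out of six.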
Discussion
In this work, we present a radiologist-level AI system that is capable of automatically identifying malignant lesions in breast US images. Trained and evaluated on a large dataset collected from 20 imaging sites affiliated with a large medical center, the AI system maintained a high level of diagnostic accuracy across a diverse range of patients whose images were acquired using a variety of US units. By validating its performance on an external dataset, we produced preliminary results substantiating its ability to generalize across a patient cohort with different demographic composition and image acquisition protocols.
Our study has several strengths. First, in the reader study subset, we found that the AI system performed comparably to board-certified breast radiologists. The ten radiologists achieved an average sensitivity of 90.1% (SD: 4.3%, 95% CI: 86.4%, 93.8%) and an average specificity of 80.7% (SD: 4.7%, 95% CI: 78.9%, 82.6%). The sensitivity of radiologists in our study is consistent with the results reported in other breast US reader studies [10, 44], as well as the sensitivity of breast radiologists observed in clinical practice, despite the fact that radiologists in our study did not have access to the patient’s medical record or prior breast imaging [15, 45]. Compared to radiologists in our reader study, the AI system was able to detect cancers with the same sensitivity, while obtaining a higher specificity (85.6%, 95% CI: 83.9%, 88.0%), a higher PPV (32.5%, 95% CI: 26.9%, 39.2%), and a lower biopsy rate (19.8%, 95% CI: 17.9%, 22.1%). Moreover, the AI system achieved a higher AUROC (0.962, 95% CI: 0.943, 0.979) and AUPRC (0.752, 95% CI: 0.675, 0.849) than all ten radiologists. This trend was confirmed in our subgroup analysis, which showed that the system could accurately interpret US exams that are deemed difficult by radiologists.
Another strength of this study is that we explored the benefits of collaboration between radiologists and AI. We proposed and evaluated a hybrid diagnostic model that combined the predictions from radiologists and the AI system. The results from our reader study suggest that such collaboration improves the diagnostic accuracy and reduces false positive biopsies for all ten radiologists (Table A.5). In fact, breast US has come under criticism for having a high false positive rate [13, 14]. As reported by multiple clinical studies, only 7-8% of breast biopsies performed under US guidance are found to yield cancers [15, 17]. Indeed, for the ten radiologists in our cancer-enriched reader study subset, on average 19.3% (SD: 4.7%, 95% CI: 17.7%, 20.6%) of cancer-negative exams were falsely diagnosed as positive and only 27.1% (SD: 4.1%, 95% CI: 22.9%, 33.1%) of the exams that they recommended to undergo biopsy actually had cancer. In this study, we showed that the hybrid models reduced the average radiologist’s false positive rate to 12.0% (SD: 3.9%, 95% CI: 7.6%, 21.0%), representing a 37.4% (SD: 13.0%, 95% CI: 34.1%, 40.0%) relative reduction. The hybrid models also increased the average radiologist’s PPV to 38.0% (SD: 6.0%, 95% CI: 24.1%, 50.0%). These results indicate that our AI system has the potential to aid radiologists in their interpretation of breast US exams to reduce the number of false positive interpretations and benign biopsies performed.
Beyond improving radiologists’ performance, we also explored how AI systems could be used to help radiologists triage US exams. We showed that high-confidence operating points provided by the AI system can be used to automatically dismiss the majority of low-risk benign exams and escalate high-risk cases to an enhanced assessment stream (Table A.6). Prospective clinical studies will be required to understand the full extent to which this technology can benefit US reading.
Finally, we have made technical contributions to the methodology of deep learning for medical image analysis. Prior work on AI systems for interpreting breast US exams, and other similar applications, relies on manually collected image-level or pixel-level labels [28, 29, 30, 31, 32, 33]. In contrast, our AI system was trained using breast-level labels which were automatically extracted from pathology reports. This is an important difference, as developing a reliable AI system for clinical use requires training and validation on large-scale datasets to ensure the network will function well across the broad spectrum of cases encountered in clinical practice. At such a scale, it is impractical to collect labels manually. We address this issue by adopting the weakly supervised learning paradigm to train models at scale without the need for image-level or pixel-level labels. This paradigm enables the model to generate interpretable saliency maps that highlight informative regions in each image. With the saliency maps, researchers can perform qualitative error analysis and understand the strengths and limitations of the AI system. Furthermore, an interpretable AI system trained with such a large dataset could help discover novel data-driven imaging biomarkers, leading to a better understanding of breast cancer.
Despite the contributions of our study in advancing breast cancer diagnosis, it has some limitations. We focused on the evaluation of an AI system that detects breast cancer only using US imaging. In clinical practice, US imaging is often used as a complementary modality to mammography. One promising research direction is to utilize multimodal learning [46, 47] to combine information from other imaging modalities. Moreover, the diagnosis produced by our AI system is based only on a single US exam, while breast radiologists often refer to patients’ prior imaging to evaluate the morphological changes of suspicious findings over time. Future research could focus on augmenting AI systems to extract relevant information from past US exams.
Another limitation of this work is the design of the reader study. To provide a fair comparison with the AI system, readers in our study were only provided with US images, patients’ ages, and notes from the operating technician. In clinical practice, breast radiologists also have access to other information such as patients’ prior breast imaging and their electronic medical records. Additionally, in the breast cancer screening setting, a screening US examination is typically accompanied by a screening mammogram. Even if prior US exams are not available, radiologists can typically refer to the mammogram for additional information, which can also influence the way that a US exam is interpreted. Finally, the qualitative analysis presented in this study was conducted over a limited set of exams. A systematic study on the differences between the AI system and the perception of radiologists in sonography interpretation is required to understand the limitations of such systems.
Despite these limitations, we believe this study is a meaningful contribution to the emerging field of AI-based decision support systems for interpreting breast US exams. On a clinically realistic population, our AI system achieved a higher diagnostic accuracy (AUROC: 0.976, 95% CI: 0.972, 0.980) than prior AI systems for breast US lesion classification (AUROC: 0.82-0.96) [32, 34, 48, 49, 50, 51, 52], though we acknowledge these systems can be compared only approximately as they were evaluated on different datasets. Key features that contributed to our AI system’s high level of performance were the large dataset used in training, along with utilization of the weakly supervised learning paradigm that enables the system to learn from automatically extracted labels. Furthermore, as our AI system was evaluated on a large test set (>44,000 US exams) acquired from a diverse range of US units and patients of diverse demographics, we are optimistic about its ability to perform well prospectively, in the hands of radiologists. A few recent studies have demonstrated in retrospective reader studies that AI systems can improve the performance of radiologists when they have access to the decision support tool while reviewing US exams [48, 52]. However, these studies utilized an AI system that required radiologists to localize lesions by manually drawing bounding boxes. Moreover, these studies used small datasets and did not evaluate the AI’s performance on sub-populations stratified by age and breast density. This makes it hard to determine if the system would maintain performance across the broad range of US exams that a radiologist might encounter in different clinical settings. Regardless of these limitations, these studies demonstrate that an AI system with a relatively low AUROC of 0.86-0.88 can substantially improve the diagnostic accuracy of radiologists.
Based on these results, we are optimistic that our AI system, which does not require radiologists to localize lesions and achieved a higher diagnostic accuracy (AUROC: 0.976) on a larger diverse patient population, could enable radiologists to achieve even greater levels of performance. As a next step, our system requires prospective validation before it can be widely deployed in clinical practice. The potential impact that such a system could have on women’s imaging is immense, given the enormous volume of women who undergo breast US exams each year.
In conclusion, we examined the potential of AI in US exam evaluation. We demonstrated in a reader study that deep learning models trained with a sufficiently large amount of data are able to produce diagnoses as accurate as those of experienced radiologists. We further showed that the collaboration between AI and radiologists can significantly improve their specificity and obviate 27.8% of requested biopsies. We believe this research could supplement future approaches to breast cancer diagnosis. In addition, the general approach employed in our work, namely the framework for weakly supervised classification and localization, may enable utilization of deep learning in similar medical image analysis tasks.
Data Availability
The external test dataset is publicly available at https://scholar.cu.edu.eg/?q=afahmy/pages/dataset. The NYU Breast Ultrasound Dataset is not currently permitted for public release by the institutional review board. We published the following report explaining how the dataset was created for reproducibility: https://cs.nyu.edu/~kgeras/reports/ultrasound_datav1.0.pdf.
Methods
Ethical approval
This retrospective study was approved by the NYU Langone Health Institutional Review Board (ID#i18-00712_CR3) and is compliant with the Health Insurance Portability and Accountability Act. Informed consent was waived since the study presents no more than minimal risk. This study is reported following the TRIPOD guidelines [53].
NYU breast ultrasound dataset
The dataset used in this study was collected from the NYU Langone Health system (New York, USA) across 20 imaging sites. The final dataset contained 288,767 exams (5,442,907 images) acquired from 143,203 patients imaged between January 2012 and September 2019. Each US exam included between 4 and 70 images with 18.8 images per exam on average (Figure A.2a). The images had an average resolution of 665 × 603 pixels in width and height, respectively (Figure A.2b). A summary of the acquisition devices is shown in Table A.7. Each exam was associated with additional patient metadata as well as a radiology report summarizing the findings. We extracted breast tissue density from the patients’ past mammography reports and assigned “unknown” to patients who did not have any mammography exams. Both screening and diagnostic US exams were included. Screening exams are performed for women who have no symptoms or signs of breast cancer, while diagnostic US exams are used to evaluate women who present with symptoms such as a new lump or pain in the breast, or to further evaluate abnormalities detected on a screening examination. While screening exams are typically comprehensive and image both breasts, diagnostic US exams vary in terms of how targeted they are, and might image both breasts, one breast, or sometimes just a single lesion. The dataset was filtered as described in the next section. Further details can be found in the technical report [41].
Filtering of the dataset
We initially extracted a dataset of 425,506 breast US exams consisting of 8,448,978 images collected from 212,716 unique patients. We then applied a few levels of filtering to obtain the final dataset for training and evaluating the neural network. This entailed the exclusion of exams with invalid patient identifiers, exams collected before 2012, exams collected from patients younger than 16 years of age, duplicate images, exams from non-female patients, and invalid images based on the ImageType attribute, which consisted of non-US images such as reports or demographic data screenshots. We further excluded images that were collected during biopsy procedures based on the PerformedProcedureStepDescription, StudyDescription & RequestedProcedureDescription attributes of the image metadata, in that order, images with missing metadata information relating to the type of procedure, images with more than 80% zero pixels, exams with multiple patient identifiers or study dates, exams with an extreme number of images, and exams with missing image laterality.
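One of the image-level exclusion criteria above (more than 80% zero pixels) is simple enough to illustrate. The following is a minimal sketch, assuming a grayscale image stored as a list of pixel rows; the function names and toy images are hypothetical and are not taken from the original preprocessing pipeline.

```python
def zero_pixel_fraction(image):
    """Fraction of pixels equal to zero in a 2D grayscale image (list of rows)."""
    total = sum(len(row) for row in image)
    zeros = sum(1 for row in image for px in row if px == 0)
    return zeros / total if total else 1.0


def keep_image(image, max_zero_fraction=0.8):
    """Keep an image unless more than 80% of its pixels are zero."""
    return zero_pixel_fraction(image) <= max_zero_fraction


# Toy images (hypothetical values).
mostly_blank = [[0, 0, 0, 0, 0], [0, 0, 0, 0, 7]]  # 9 of 10 pixels are zero
normal = [[0, 12, 40], [3, 0, 25]]                 # 2 of 6 pixels are zero

kept = [name for name, img in [("mostly_blank", mostly_blank), ("normal", normal)]
        if keep_image(img)]
```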
Patients were then randomly split among training (60%), validation (10%) and test (30%) sets. After splitting, each patient appeared in only one of the training, validation, and test sets. The training set consisted of 3,930,347 images within 209,162 exams collected from 101,493 patients. The validation set consisted of 653,924 images within 34,850 exams collected from 16,707 patients. The test set consisted of 858,636 images within 44,755 exams collected from 25,003 patients. The training set was used to optimize learnable parameters in the models. The validation set was used to tune the hyperparameters and select the best models. The test set was used to evaluate the performance of the models selected using the validation set. We applied additional filtering on the test set as described in the next section.
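The key property of this split is that it is made at the patient level, so all exams of a patient fall into the same subset. A minimal sketch of such a split is given below; the record layout, the seed, and the helper name `split_patients` are illustrative assumptions rather than the code used in the study.

```python
import random


def split_patients(patient_ids, fractions=(0.6, 0.1, 0.3), seed=0):
    """Randomly assign each unique patient to the train, validation, or test set."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }


# Hypothetical exam records: (exam_id, patient_id); several exams per patient.
exams = [(f"exam{i}", f"patient{i % 50}") for i in range(120)]
splits = split_patients(pid for _, pid in exams)

# Every exam inherits the subset of its patient, so no patient spans two subsets.
exam_split = {
    exam_id: next(name for name, ids in splits.items() if pid in ids)
    for exam_id, pid in exams
}
```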
Additional filtering of the test set
To provide a clinically realistic evaluation of the AI system, we additionally refined the test set using the steps summarized in Figure A.3. First, we ensured that each non-biopsied exam was followed by a subsequent cancer-negative exam. Non-biopsied patients who had a negative (BI-RADS 1) or benign (BI-RADS 2) US exam were only included in the test set if they did not have any malignant breast pathology found within 0-15 months following their US exam, and had follow-up imaging between 6 and 24 months that was also negative or benign (BI-RADS 1-2). Patients who did not undergo biopsy and had probably benign US exams (BI-RADS 3) were included in the test set if they did not have any malignant breast pathology found within 0-15 months following their exam, and met one of two additional criteria: all of their subsequent US exams in the 4-36 months following their initial US exam were BI-RADS 1-2, or they had at least one follow-up US exam at 24-36 months which was evaluated as BI-RADS 1-3.
Next, we refined exams with biopsy-proven benign findings to determine if the pathology results were deemed by the radiologist to be concordant or discordant with the imaging features of the breast lesion. Patients with biopsy reports that confirmed a discordant benign finding were only included in the test set if they received a subsequent biopsy (that was not discordant) or breast surgery within the 6 months following the initial discordant biopsy. Patients with benign discordant biopsies that did not receive subsequent pathological evaluation were excluded.
Lastly, we ensured that exams with biopsy-proven cancers contained images of these cancers. Since breast US produces small images which do not comprehensively capture the entire breast, a proportion of patients diagnosed with breast cancer did not have images of the cancer in any of their US images. US exams with a label indicating malignancy and a BI-RADS score of 1-2 were excluded as these exams typically did not contain images of the cancer. Additionally, patients diagnosed with breast cancer who did not have any breast pathology obtained using US-guided biopsy were also excluded, since the majority of patients diagnosed using MRI and stereotactic-guided biopsies had malignancies that were sonographically occult. US exams that received a BI-RADS score of 0, 3, and 6, as well as patients who had breast pathology obtained using multi-modal image guidance (US plus stereotactic and/or MRI guided biopsies) had their cases manually reviewed to confirm that breast cancer was visible on the US exam. Patients who were given a BI-RADS score of 4-5 and had all their breast pathology obtained using US-guided biopsy were presumed to have visible cancers and were not manually reviewed.
Breast-level cancer labels
Among all the exams in the dataset, 28,914 exams (approximately 10%) were associated with at least one biopsy performed within 30 days prior or 120 days after the US examination. The cancer labels of biopsies were determined using their associated pathology reports. In cases where there were multiple pathology reports recorded within the considered time window, all of these reports were evaluated. Malignant findings included primary breast cancers: invasive ductal carcinoma, invasive lobular carcinoma, special-type invasive carcinoma (including tubular, mucinous and cribriform carcinomas), inflammatory carcinoma, intraductal papillary carcinoma, microinvasive carcinoma, ductal carcinoma in situ, as well as non-primary breast cancers: lymphoma and phyllodes. Benign findings included cyst, fibroadenoma, scar, sclerosing adenosis, lobular carcinoma in situ, columnar cell changes, atypical lobular hyperplasia, atypical ductal hyperplasia, papilloma, periductal mastitis and usual ductal hyperplasia. The labels were automatically extracted from the corresponding pathology reports using a natural language processing pipeline developed earlier [41]. Of note, patients with multiple pathology reports could be assigned both malignant and benign labels if their exam contained both types of lesions.
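The matching of biopsies to an exam and the derivation of the two binary labels can be sketched as follows. This is an illustration only: the record layout, the inclusive window boundaries, and the function names are assumptions, and the actual labels were extracted from pathology reports by the natural language processing pipeline cited above.

```python
from datetime import date, timedelta


def biopsies_in_window(exam_date, biopsies, days_before=30, days_after=120):
    """Biopsies performed from 30 days before to 120 days after the exam (inclusive)."""
    start = exam_date - timedelta(days=days_before)
    end = exam_date + timedelta(days=days_after)
    return [b for b in biopsies if start <= b["date"] <= end]


def breast_labels(exam_date, biopsies):
    """Return (benign, malignant) binary labels; both can be 1 for the same breast."""
    matched = biopsies_in_window(exam_date, biopsies)
    benign = int(any(b["finding"] == "benign" for b in matched))
    malignant = int(any(b["finding"] == "malignant" for b in matched))
    return benign, malignant


# Hypothetical pathology records for one breast.
exam_date = date(2018, 3, 1)
biopsies = [
    {"date": date(2018, 2, 10), "finding": "benign"},     # 19 days before: inside
    {"date": date(2018, 5, 15), "finding": "malignant"},  # 75 days after: inside
    {"date": date(2018, 9, 1), "finding": "malignant"},   # 184 days after: outside
]
labels = breast_labels(exam_date, biopsies)
```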
Breast Ultrasound Images Dataset
This external dataset was collected in 2018 from Baheya Hospital for Early Detection and Treatment of Women’s Cancer (Cairo, Egypt) with the LOGIQ E9 ultrasound system and the LOGIQ E9 Agile ultrasound. It included 780 breast US images, with an average resolution of 500 × 500 pixels, acquired from 600 female patients whose ages ranged between 25 and 75 years old. Among these 780 images, 133 were normal images without cancerous masses, 437 were images containing malignant masses and 210 were images with benign masses. We refer the reader to the original paper for more information about this public dataset [40].
Deep neural network architecture
We present a deep learning model (DLM) whose architecture is shown in Figure A.4. To explain the mechanics of this model, we first introduce some notation. Let $x \in \mathbb{R}^{H \times W \times 3}$ denote an RGB US image with a resolution of $H \times W$ pixels and let $X = \{x_1, x_2, \ldots, x_K\}$ denote an image set that contains all images acquired from one breast of the patient during a US exam. The DLM is trained to process the image set $X$, which may vary in the number of images it contains (Figure A.2), and to generate two probability estimates $\hat{y}^b, \hat{y}^m \in [0, 1]$ that indicate the predicted probability of the presence of benign and malignant lesions in the patient’s breast, respectively. The DLM is designed to resemble the diagnostic procedure performed by radiologists. First, it generates saliency maps and probability estimates for each image $x_k$ in the image set. This step is similar to a radiologist roughly scanning through each US image and looking for abnormal findings. Then it computes a set of attention scores which indicate the importance of each image to the cancer diagnosis task. This procedure can be seen as an analogue to a radiologist concentrating on images that contain suspicious lesions. Finally, it forms a breast-level cancer diagnosis by combining information collected from all images. This is analogous to a radiologist comprehensively considering signals in all images to render a full diagnosis. Below we describe each step in detail.
Saliency maps. The DLM first utilizes a convolutional neural network [54] $f_g$ (parameterized as ResNet-18 [55]) to extract a representation of each image $x_k$ in an image set $X$, denoted by $h_k \in \mathbb{R}^{h \times w \times C}$. The height, the width, and the number of channels are denoted by $h$, $w$, and $C$, respectively. Inspired by Zhou et al. [38], we then apply a convolutional layer with 1 × 1 convolutional filters followed by a sigmoid non-linearity to transform $h_k$ into two saliency maps $A_k^b, A_k^m \in [0, 1]^{h \times w}$. These saliency maps highlight approximate locations of benign and malignant lesions in each image. Each element $A_k^c[i, j]$, $c \in \{b, m\}$, denotes the contribution of spatial location $(i, j)$ towards predicting the presence of benign/malignant lesions. The resolution of the saliency maps $(h, w)$ depends on the implementation of $f_g$ and is usually smaller than the resolution of the input image $(H, W)$. In this work, we set $h = w = 8$, $C = 512$, and $H = W = 256$.
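To make the 1 × 1 convolution step concrete, the sketch below applies a per-location linear map followed by a sigmoid to a tiny feature grid, using plain Python lists instead of PyTorch tensors. The feature values, weights, and biases are made up for illustration, and the use of a bias term is an assumption; in the actual model the features come from ResNet-18 with $h = w = 8$ and $C = 512$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def saliency_maps(features, weights, biases):
    """Apply a 1x1 convolution plus sigmoid to an h x w x C feature grid.

    weights holds one length-C filter per class (benign, malignant), so the
    result is one h x w map per class with values strictly between 0 and 1.
    """
    maps = []
    for w_c, b_c in zip(weights, biases):
        maps.append([
            [sigmoid(sum(w * f for w, f in zip(w_c, pixel)) + b_c) for pixel in row]
            for row in features
        ])
    return maps


# Toy 2 x 2 grid with C = 3 channels (hypothetical values).
features = [
    [[0.0, 0.0, 0.0], [1.0, 2.0, 0.5]],
    [[0.2, 0.1, 0.0], [3.0, 1.0, 2.0]],
]
weights = [[0.5, -0.5, 0.0], [1.0, 1.0, 1.0]]  # benign filter, malignant filter
biases = [0.0, -2.0]
benign_map, malignant_map = saliency_maps(features, weights, biases)
```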
Attention scores. The images in the image set $X$ might significantly differ in how relevant each of them is to the classification task. To address this issue, we utilize the Gated Attention Mechanism [56], allowing the model to select which information to incorporate from all images. Specifically, we first apply global max pooling to transform the representation $h_k$ computed for the image $x_k$ into a vector $v_k \in \mathbb{R}^{C}$. Two attention scores $\alpha_k^b, \alpha_k^m \in [0, 1]$ that indicate the importance of each image $x_k$ to the estimation of the probability of the presence of benign and malignant findings in the breast are computed as

$$\alpha_k = \frac{\exp\{W^\top(\tanh(V v_k^\top) \odot \mathrm{sigm}(U v_k^\top))\}}{\sum_{i=1}^{K} \exp\{W^\top(\tanh(V v_i^\top) \odot \mathrm{sigm}(U v_i^\top))\}},$$

where $\alpha_k = [\alpha_k^b, \alpha_k^m]$ denotes the concatenation of attention scores for both benign and malignant findings, $\odot$ denotes an element-wise multiplication, $\mathrm{sigm}$ denotes the sigmoid function, and $W \in \mathbb{R}^{L \times 2}$, $V \in \mathbb{R}^{L \times M}$ and $U \in \mathbb{R}^{L \times M}$ are matrices of learnable parameters. In all experiments, we set $L = 512$ and $M = 128$.
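As an illustration of how gated attention turns per-image feature vectors into weights that sum to one, the following sketch computes attention scores for a single class in plain Python. The toy matrices are hypothetical, and the shape convention used here (hidden units × input features for the two gating matrices) is an assumption made so the example is dimensionally consistent; it is not claimed to match the trained model.

```python
import math


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(vectors, V, U, w):
    """Softmax-normalised gated attention scores, one per image, for one class."""
    logits = []
    for v in vectors:
        gate = [math.tanh(a) * sigmoid(b)
                for a, b in zip(matvec(V, v), matvec(U, v))]
        logits.append(sum(wi * gi for wi, gi in zip(w, gate)))
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three images with 2-dimensional pooled features (hypothetical values).
vectors = [[0.1, 0.2], [2.0, 1.5], [0.1, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
alpha = gated_attention(vectors, V, U, w)
```

In this toy case the second image has the strongest features and therefore receives the largest weight, while the two identical images receive equal weights.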
Cancer diagnosis. Lastly, the DLM aggregates the information from all US images in the image set $X$ and generates the final diagnosis using the attention scores and saliency maps. We first use an aggregation function $f_{agg}(A) : \mathbb{R}^{h \times w} \mapsto [0, 1]$ to transform the saliency maps $A_k^c$, $c \in \{b, m\}$, into image-level predictions:

$$\hat{y}_k^b = f_{agg}(A_k^b), \qquad \hat{y}_k^m = f_{agg}(A_k^m).$$

In our work, we parameterize $f_{agg}$ as the top t% pooling proposed by Shen et al. [57]. Namely, we define the aggregation function as

$$f_{agg}(A) = \frac{1}{|H^+|} \sum_{(i,j) \in H^+} A[i, j],$$

where $H^+$ denotes the set containing locations of the top t% values in $A$, and $t$ is a hyperparameter. The breast-level cancer prediction is then defined as the average of all image-level cancer predictions weighted by the attention scores $\alpha_k^c$:

$$\hat{y}^c = \sum_{k=1}^{K} \alpha_k^c \, \hat{y}_k^c, \qquad c \in \{b, m\}.$$
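The two aggregation steps described here (top t% pooling within an image, then an attention-weighted average across images) can be sketched as follows. This is a minimal illustration for one class; treating t as a fraction between 0 and 1 and rounding the number of selected locations up are assumptions, and the toy saliency maps and attention scores are made up.

```python
import math


def top_t_pooling(saliency_map, t):
    """Mean of the top t fraction of values in a 2D saliency map."""
    values = sorted((a for row in saliency_map for a in row), reverse=True)
    n_top = max(1, math.ceil(t * len(values)))
    return sum(values[:n_top]) / n_top


def breast_level_prediction(saliency_maps, attention, t):
    """Attention-weighted average of image-level predictions for one class."""
    image_preds = [top_t_pooling(A, t) for A in saliency_maps]
    return sum(a * p for a, p in zip(attention, image_preds))


# Two images from one breast, each with a 2 x 2 saliency map (hypothetical values).
maps = [
    [[0.9, 0.1], [0.2, 0.0]],
    [[0.1, 0.1], [0.3, 0.1]],
]
attention = [0.75, 0.25]  # attention scores are assumed to sum to one
y_hat = breast_level_prediction(maps, attention, t=0.25)
```

With t = 0.25 each 2 × 2 map contributes only its single largest value, so the image-level predictions are 0.9 and 0.3 and the breast-level prediction is their weighted average.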
Training details
In order to constrain the saliency maps to only highlight important regions, we impose an L1 regularization on each saliency map $A_k^c$, $c \in \{b, m\}$, which penalizes the DLM for highlighting irrelevant pixels:

$$L_{reg}(A_k^c) = \sum_{(i,j)} \left| A_k^c[i, j] \right|.$$

Despite the relative complexity of our proposed framework, this DLM can be trained end-to-end using stochastic gradient descent with the following loss function, defined for a single training example (i.e. one breast) as

$$L(y, \hat{y}) = \sum_{c \in \{b, m\}} \left[ \mathrm{BCE}(y^c, \hat{y}^c) + \beta \sum_{k=1}^{K} L_{reg}(A_k^c) \right],$$

where BCE is the binary cross-entropy and $\beta$ is a hyperparameter. For all experiments, the training loss is optimized using Adam [58]. Of note, labels indicating the presence of benign lesions ($y^b$) were also used during training to regularize the network through multi-task learning [59]. On the test set, we focus on evaluating predictions of malignancy since it is a more clinically relevant task: identification of malignant lesions has an immediate and significant impact on patient management (biopsy, potential surgery), whereas identification of benign breast lesions typically does not alter management compared to patients without breast lesions [12].
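A small numeric sketch of this objective, for one breast, is given below. It combines a binary cross-entropy term per class with an L1 penalty on the saliency maps scaled by β. Whether the penalty is summed or averaged over images and pixels is an assumption here (the sketch sums), and the clipping constant `eps` is an implementation detail added only to keep the logarithm finite.

```python
import math


def bce(y, y_hat, eps=1e-7):
    """Binary cross-entropy for one label, with clipping for numerical safety."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def l1_penalty(saliency_map):
    return sum(abs(a) for row in saliency_map for a in row)


def training_loss(labels, preds, saliency_maps, beta):
    """labels and preds are dicts keyed by 'b' and 'm'; saliency_maps[c] lists K maps."""
    loss = 0.0
    for c in ("b", "m"):
        loss += bce(labels[c], preds[c])
        loss += beta * sum(l1_penalty(A) for A in saliency_maps[c])
    return loss


# One breast with a single image and 1 x 2 saliency maps (hypothetical values).
labels = {"b": 0, "m": 1}
preds = {"b": 0.5, "m": 0.5}
maps = {"b": [[[0.5, 0.5]]], "m": [[[0.5, 0.5]]]}
loss = training_loss(labels, preds, maps, beta=0.1)
```

Here each class contributes ln 2 from the cross-entropy and 0.1 from the penalty, so the total is 2 ln 2 + 0.2.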
We optimized the hyperparameters with random search [60]. Specifically, we searched for the learning rate $\eta \in 10^{[-5.5, -4]}$ on a logarithmic scale, the regularization hyperparameter $\beta \in 10^{[-3, 0.5]}$ on a logarithmic scale, the weight decay hyperparameter $\lambda \in 10^{[-6, -3.5]}$ on a logarithmic scale, and the pooling threshold $t \in [0.1, 0.5]$ on a linear scale. We trained 30 separate models using hyperparameters uniformly sampled from the ranges above. Each model was trained for 50 epochs. We saved the model weights from the training epoch that achieved the highest AUROC on the validation set. To further improve our results, we used model ensembling [61]. Specifically, we averaged the breast-level predictions of the 3 models that achieved the highest AUROC on the validation set to produce the overall prediction of the ensemble.
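The sampling scheme and the ensembling rule can be sketched as follows. Sampling "on a logarithmic scale" is implemented here by drawing the exponent uniformly and raising 10 to that power; the dictionary keys, the seed, and the toy validation scores are illustrative assumptions.

```python
import random


def sample_hyperparameters(rng):
    """Draw one configuration from the search ranges described above."""
    return {
        "learning_rate": 10 ** rng.uniform(-5.5, -4.0),
        "beta": 10 ** rng.uniform(-3.0, 0.5),
        "weight_decay": 10 ** rng.uniform(-6.0, -3.5),
        "pooling_t": rng.uniform(0.1, 0.5),
    }


def ensemble_top_k(val_aurocs, predictions, k=3):
    """Average the predictions of the k models with the highest validation AUROC."""
    best = sorted(range(len(val_aurocs)), key=lambda i: val_aurocs[i], reverse=True)[:k]
    n_cases = len(predictions[0])
    return [sum(predictions[i][j] for i in best) / k for j in range(n_cases)]


rng = random.Random(0)
configs = [sample_hyperparameters(rng) for _ in range(30)]

# Hypothetical validation AUROCs and test predictions for four trained models.
val_aurocs = [0.90, 0.95, 0.80, 0.93]
predictions = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.9], [0.6, 0.4]]
ensemble = ensemble_top_k(val_aurocs, predictions, k=3)
```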
During training, we applied image augmentation including random horizontal flipping (p=0.5), random rotation (−45° to 45°), random translation in both horizontal and vertical directions (up to 10% of the image size), scaling by a random factor between 0.7 and 1.5, and random shearing (−25° to 25°). The resulting image was then resized to 256 × 256 pixels using bilinear interpolation and normalized. During the validation and test stages, the original image was resized and normalized without any augmentation.
Reader study
We performed a reader study to compare the performance of the proposed DLM with breast radiologists. This study included ten board-certified breast radiologists with an average of 15 years of clinical experience (Table A.1). Their experience ranged from 3 to 40 years. Nine of the ten radiologists were fellowship-trained in breast imaging. The one radiologist who did not receive formal fellowship training (R10) worked as a sub-specialized breast radiologist and had over 30 years of breast imaging experience. The readers were provided with US images including metadata (breast laterality, position of the probe, notes from the sonographer) and the age of the patient. For each breast in all exams, the readers were then asked to provide a diagnostic BI-RADS score using the values 1, 2, 3, 4A, 4B, 4C or 5. A score of 0 was not permitted.
Hybrid model
To explore the potential benefit that the AI system might be able to provide, we created a hybrid model for each radiologist, whose predictions were created by averaging the predictions of the respective radiologist and the AI model: ŷhybrid = λ ŷexpert + (1 −λ) ŷAI. The BI-RADS scores of radiologists were used as their predictions. Both ŷAI and ŷexpert were standardized to have zero mean and unit variance. In this study, we set λ = 0.5. We note that λ = 0.5 is not the optimal value. On the other hand, the performance obtained by retroactively fine-tuning λ on the reader study is not transferable to realistic clinical settings. Therefore, we chose λ = 0.5 as the most natural way of aggregating two predictions without prior knowledge of their quality.
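The hybrid prediction can be illustrated with a few lines of Python. In this sketch the reader's BI-RADS assessments are assumed to be already encoded as ordinal numbers, and standardization uses the population standard deviation; both choices, as well as the toy scores, are assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Rescale scores to zero mean and unit variance (requires non-constant scores)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def hybrid_predictions(expert_scores, ai_scores, lam=0.5):
    """Weighted average of standardized reader and AI predictions."""
    z_expert = standardize(expert_scores)
    z_ai = standardize(ai_scores)
    return [lam * e + (1 - lam) * a for e, a in zip(z_expert, z_ai)]


# Five breasts: ordinal suspicion levels from one reader and AI probabilities
# (hypothetical values).
expert = [0, 0, 1, 3, 5]
ai = [0.02, 0.10, 0.05, 0.60, 0.95]
hybrid = hybrid_predictions(expert, ai)
```

Standardizing first puts the two sources on a comparable scale, so that averaging with equal weights does not let the source with the larger numeric range dominate.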
Statistical analysis
In this study, we evaluated the performance of the AI system, radiologists, and the hybrid models using the following evaluation metrics: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, specificity, biopsy rate, negative predictive value (NPV), and positive predictive value (PPV). AUROC and AUPRC were used to assess the diagnostic accuracy of the probabilistic predictions generated by the AI system/hybrid models and the BI-RADS scores of the readers. The BI-RADS scores were treated as a 6-point index of suspicion for malignancy: scores of 1 and 2 were collapsed into the lowest category of suspicion; scores 3, 4A, 4B, 4C and 5 were treated independently as increasing levels of suspicion. AUROC avoids the subjectivity in selecting the thresholds to dichotomize continuous predictions, since it compares performance across all possible recall rates. However, AUROC weights omission and commission errors equally and therefore could provide excessively optimistic estimates in extremely imbalanced classification tasks such as cancer diagnosis, where the negative cases often overwhelm the positive cases [62]. Therefore, to complement AUROC, we also reported AUPRC, which solely evaluates the ability to correctly identify the positive cases. We calculated both AUROC and AUPRC using the Python Scikit-learn API [63].
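To illustrate how categorical BI-RADS assessments can be scored with AUROC, the sketch below maps them to the 6-point index described above and computes AUROC as the probability that a randomly chosen malignant case is ranked above a randomly chosen non-malignant case, counting ties as one half. This dependency-free function stands in for the Scikit-learn routine used in the study; the numeric values assigned to the six levels and the toy reader data are assumptions.

```python
# Six ordinal levels of suspicion; BI-RADS 1 and 2 share the lowest level.
BIRADS_INDEX = {"1": 0, "2": 0, "3": 1, "4A": 2, "4B": 3, "4C": 4, "5": 5}


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical assessments from one reader for eight breasts.
reader_birads = ["1", "2", "3", "4A", "2", "4C", "5", "3"]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
reader_scores = [BIRADS_INDEX[b] for b in reader_birads]
reader_auroc = auroc(labels, reader_scores)
```

In this toy example 13.5 of the 15 malignant/non-malignant pairs are ordered correctly, giving an AUROC of 0.9.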
In addition, we evaluated the binary predictions of the AI system, the hybrid models, and the readers using sensitivity, specificity, biopsy rate, NPV, and PPV. These metrics are commonly used to assess the diagnostic accuracy in clinical studies [7, 11, 15]. The PPV reported in this study corresponds to PPV2, which is defined as the number of breasts with cancer that were recommended to undergo biopsy divided by the total number of breast biopsies recommended [12]. For each breast, the AI system and the hybrid models produced a probabilistic score that represents the likelihood of cancer being present. We dichotomized these scores to produce binary predictions by selecting a score threshold that separates positive and negative decisions. To compute sensitivity, we dichotomized the AI system’s probabilistic predictions to match the average reader’s specificity. To calculate the specificity, biopsy rate, PPV and NPV, we dichotomized the AI system’s probabilistic predictions to match the average reader’s sensitivity. We similarly dichotomized the predictions of each hybrid model using the sensitivity/specificity of its respective reader. For all evaluation metrics, we estimated 95% confidence intervals using 1,000 iterations of the bootstrap method [64].
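The threshold-matching and bootstrap steps can be sketched as below. The rule used to pick the threshold (the smallest one whose specificity is at least the target), the percentile form of the bootstrap interval, and the decision to redraw resamples that contain only one class are all assumptions made for this illustration, as are the toy scores.

```python
import random


def threshold_for_specificity(labels, scores, target_specificity):
    """Smallest threshold whose specificity reaches the target.

    A case is called positive when its score is >= the threshold.
    """
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    for thr in sorted(set(scores)) + [float("inf")]:
        specificity = sum(s < thr for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            return thr


def sensitivity(labels, scores, thr):
    positives = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= thr for s in positives) / len(positives)


def bootstrap_ci(labels, scores, metric, n_iter=1000, seed=0):
    """Percentile 95% confidence interval from resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # redraw resamples that lack one of the classes
            stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_iter)], stats[int(0.975 * n_iter) - 1]


# Hypothetical AI scores for six breasts; the reader's specificity is taken as 0.75.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
thr = threshold_for_specificity(labels, scores, target_specificity=0.75)
ai_sensitivity = sensitivity(labels, scores, thr)
ci_low, ci_high = bootstrap_ci(labels, scores,
                               lambda ys, ss: sensitivity(ys, ss, thr), n_iter=200)
```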
In the reader study, we compared the AUROC, AUPRC, sensitivity, specificity, PPV, and biopsy rate of the AI system and hybrid models with those of the average radiologist. The confidence intervals for these differences were obtained through 1,000 iterations of the bootstrap method [64]. The p-values were computed using a one-tailed permutation test [65]. In each of 10,000 trials, we randomly swapped, for each case, the AI/hybrid model’s score with the comparator reader’s score, yielding a reader–AI difference sampled from the null distribution. A one-sided p-value was computed by comparing the observed statistic to the empirical quantiles of the null distribution. We used a statistical significance threshold of 0.05.
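A minimal sketch of such a paired permutation test is shown below, using AUROC as the statistic. Swapping each case's pair of scores with probability 0.5 and adding one to the numerator and denominator of the p-value are assumptions of this sketch; the toy scores are placed on a common scale purely for illustration.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, scores_a, scores_b, metric, n_trials=10000, seed=0):
    """One-sided p-value for metric(a) > metric(b) under random per-case swaps."""
    rng = random.Random(seed)
    observed = metric(labels, scores_a) - metric(labels, scores_b)
    at_least_as_large = 0
    for _ in range(n_trials):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap the two scores for this case
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = metric(labels, perm_a) - metric(labels, perm_b)
        at_least_as_large += diff >= observed
    return (at_least_as_large + 1) / (n_trials + 1)


# Hypothetical scores for ten breasts from an AI model and one reader.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
ai = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9, 0.95]
reader = [0.5, 0.4, 0.6, 0.3, 0.7, 0.5, 0.6, 0.4, 0.8, 0.3]
p_value = permutation_p_value(labels, ai, reader, auroc, n_trials=2000)
```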
Code availability
The neural networks used in our AI system were developed in PyTorch [66]. Code for preprocessing the data and running the inference, sufficient to evaluate our system on other datasets, is available for research purposes upon a reasonable request made to the corresponding author.
Author contributions
YS, FES and JO are the co-first authors of this paper. YS, FES and KJG designed the experiments with neural networks. YS conducted the experiments with neural networks. YS, FES, JO, JW, KK, JP, NW and CH built the data preprocessing pipeline. YS and FES conducted the reader study and analyzed the data. YS, FES, and JO conducted literature search. YS and JB conducted the statistical analysis. JO, CH, SW, AM, RE, DA, CT, NS, YG, CC, SG, JA, CL, SKS, CL, RM, CM, AL, BR, LM and LH collected the data. LH analyzed the results from a clinical perspective. KJG and FES supervised the execution of all elements of the project. All authors provided critical feedback and helped shape the manuscript.
Competing interests
The authors declare no competing interests.
A Extended Data
Acknowledgements
The authors would like to thank Mario Videna, Abdul Khaja and Michael Costantino for supporting our computing environment, Benny Huang and Marc Parente for extracting the data, Yizhuo Ma for providing graphical design consultation, and Catriona C. Geras for proofreading the manuscript. We also gratefully acknowledge the support of Nvidia Corporation with the donation of some of the GPUs used in this research. This work was supported in part by grants from the National Institutes of Health (P41EB017183, R21CA225175), the National Science Foundation (1922658), the Gordon and Betty Moore Foundation (9683), the Polish National Agency for Academic Exchange (PPN/IWA/2019/1/00114/U/00001) and NYU Abu Dhabi.