Abstract
Importance Early detection of pneumothorax, most often on chest radiograph (CXR), can help determine need for emergent clinical intervention. The ability to accurately detect and rapidly triage pneumothorax with an artificial intelligence (AI) model could assist with earlier identification and improve care.
Objective This study aimed to compare the accuracy of an AI model (Annalise Enterprise) to consensus thoracic radiologist interpretations in detecting (1) pneumothorax (incorporating both non-tension and tension pneumothorax) and (2) tension pneumothorax.
Design A retrospective standalone performance assessment was conducted on a dataset of 1,000 CXR cases.
Setting The cases were obtained from four hospitals in the United States.
Participants The cases were obtained from patients aged 18 years or older. They were selected using two strategies from all CXRs performed at the hospitals including inpatients and outpatients. The first strategy identified consecutive pneumothorax cases through a manual review of radiology reports and the second strategy identified consecutive tension pneumothorax cases using natural language processing. For both strategies, negative cases were selected by taking the next negative case acquired from the same x-ray machine. The final dataset was an amalgamation of these processes.
Methods Each case was interpreted independently by up to three radiologists to establish consensus ground truth interpretations. Each case was then interpreted by the AI model for the presence of pneumothorax and tension pneumothorax.
Main Outcome The primary endpoints were the areas under the receiver operating characteristic curves (AUCs) for the detection of pneumothorax and tension pneumothorax. The secondary endpoints were the sensitivities and specificities for the detection of pneumothorax and tension pneumothorax at predefined operating points.
Results Model inference was successfully performed in 307 non-tension pneumothorax, 128 tension pneumothorax and 550 negative cases. The AI model detected pneumothorax with AUC of 0.979 (94.3% sensitivity, 92.0% specificity) and tension pneumothorax with AUC of 0.987 (94.5% sensitivity, 95.3% specificity).
Conclusions and Relevance The assessed AI model accurately detected pneumothorax and tension pneumothorax on this CXR dataset. Its use in the clinical workflow could lead to earlier identification and improved care for patients with pneumothorax.
Question Does a commercial artificial intelligence model accurately detect simple and tension pneumothorax on chest x-ray?
Findings This retrospective study used 1,000 chest x-rays from four hospitals in the United States to compare artificial intelligence model outputs to consensus thoracic radiologist interpretations. The model detected pneumothorax (incorporating both simple and tension pneumothorax) with area under the curve (AUC) of 0.979 and tension pneumothorax with AUC of 0.987. The sensitivity and specificity were 94.3% and 92.0% respectively for pneumothorax, and 94.5% and 95.3% for tension pneumothorax.
Meaning This artificial intelligence model could assist radiologists through its accurate detection of pneumothorax.
Introduction
The diagnosis of a pneumothorax, especially tension pneumothorax, on chest radiograph (CXR) can lead to critical interventions for patient care.1 The automated detection of pneumothorax and tension pneumothorax through artificial intelligence (AI) has been hypothesized to improve patient care in multiple ways.2,3 Firstly, it can assist in triaging CXRs for sooner interpretation by a radiologist based on the suspected presence of a pneumothorax. Secondly, it can provide a second “set of eyes” to support identification of a pneumothorax.
Several AI models have been developed to assess for the presence of pneumothorax.4-9 Many models have also received US Food and Drug Administration (FDA) clearance for use in pneumothorax identification.10-17 These cleared models are all computer assisted triage devices, which are intended to aid in prioritization and triage of time sensitive findings.18
This retrospective study aimed to assess a commercial AI model that detects both pneumothorax and tension pneumothorax. It examined the model performance across a range of technical and demographic subgroups to assess the generalizability of the model. It also examined the model performance in the presence or absence of ancillary findings to determine how they impact model accuracy.
Methods
Study design
This retrospective standalone model performance study was conducted using radiology cases from four hospitals within the Mass General Brigham (MGB) network. It was approved by the MGB Institutional Review Board with waiver of informed consent. It was conducted in accordance with relevant guidelines and regulations including the Health Insurance Portability and Accountability Act (HIPAA).
Case selection
The cohort was selected based on two strategies that utilized the radiology reports within the MGB radiology archive: consecutive pneumothorax cases through manual review and consecutive tension pneumothorax cases through a natural language processing search engine. Each strategy involved taking the next negative case acquired on the same x-ray machine after each positive case to avoid temporal and technical bias. The cohort considered all CXRs performed at a hospital including inpatient and outpatient. The CXRs were obtained from patients at least 18 years of age.
For the consecutive pneumothorax cases, the consecutive cases were identified in a prospective order between June 1 2019 and May 31 2021. There were 85 report-positive cases and 85 report-negative cases from each of the four hospitals. A radiologist reviewed consecutive CXR reports at each of the hospitals until these numbers were achieved.
For the consecutive tension pneumothorax cases, the report-positive cases were identified between June 1 2015 and May 31 2021 using a natural language processing search for “tension pneumothorax” amongst CXR reports from the same four hospitals utilizing a commercial radiology report search engine (Nuance mPower Clinical Analytics). A radiologist then confirmed these cases were positive through a CXR report review. There were 160 report-positive cases selected across all four hospitals in a consecutive, retrospective manner from the most recent. An equal number of report-negative cases were then selected by taking the next report-negative case after each report-positive case on each x-ray machine.
All cases were deidentified and underwent an image quality review by an American Board of Radiology (ABR)-certified radiologist. The image quality review identified the patient positioning (erect or supine) and projections present (antero-posterior (AP), postero-anterior (PA) and/or lateral). Cases were excluded if they did not include a CXR or did not include a frontal projection (AP or PA). The review was performed using the FDA-cleared eUnity image visualization software (Version 6 or higher) and an internal web-based annotation system.
Ground truth interpretations
Ground truth interpretations were performed by three ABR-certified radiologists with fellowship training in thoracic radiology. They provided their interpretations independently, without access to the original radiology reports and in different worklist orders. They used the same image visualization software and annotation system as was used in the image quality review.
The radiologists answered one or three multiple choice questions about pneumothorax: if a pneumothorax was present or absent, if the size of pneumothorax was <2cm or ≥2cm, and if features of tension pneumothorax were present or absent (the latter two questions only applied if a pneumothorax was present). They also indicated the presence of ten ancillary findings: pleural effusion (including hemothorax), rib fracture, pneumomediastinum, pneumoperitoneum, subcutaneous emphysema, focal or diffuse pulmonary abnormalities (including nodules, masses, emphysema, airspace or interstitial processes), skin fold, intercostal drain, evidence of thoracic surgery (including thoracotomy wires or sutures), and other lines / tubes / devices (including pacemaker, endotracheal tube, central line).
For determining consensus for the three pneumothorax findings, a “2+1” strategy was used: the first two radiologists interpreted every case and a third radiologist then interpreted cases with discrepant interpretations. Any persistent discrepancies (which could occur for the size and features of tension pneumothorax questions after the three interpretations) were resolved at a meeting of all three radiologists. The ancillary findings were considered present if any radiologist interpreted them as being present.
Model inference
The evaluated AI model was version 2.0.0 of the Annalise Enterprise CXR Triage Pneumothorax device. This device is a variant of the Annalise Enterprise (CXR module) device, which is commercially available in some non-US markets. The Annalise Enterprise (CXR module) device is trained to identify over 100 different radiological findings and is a deep-convolutional neural network trained on over 750,000 CXRs, which were each labelled by three radiologists.19 The model was installed at MGB for use in this study and received only the Digital Imaging and Communications in Medicine (DICOM)-formatted CXR cases.
Analysis
The statistical analysis was performed in R (version 4.0.2) on the full analysis set. The predefined primary endpoints were the areas under the receiver operating characteristic curves (AUCs) for the detection of pneumothorax (incorporating non-tension and tension pneumothorax) and tension pneumothorax; they were calculated using the consensus annotations and the classification scores from the AI model. The predefined secondary endpoints were the sensitivities and specificities for the detection of pneumothorax and tension pneumothorax; they were calculated using the same outputs as the AUCs and the operating points of the model.
These analyses were repeated for additional subgroups as detailed in the results. The sex, age and manufacturers were derived from clinical databases or DICOM fields for each radiology case. Any missing data were treated as “Unknown” and no data were imputed. The manufacturers in the DICOM field may have represented manufacturers other than the x-ray machine (e.g., the cassette). The patient positioning, projections and ancillary findings were obtained as part of the case selection or ground truth interpretations as described above. All confidence intervals (CIs) were calculated using bootstrapped intervals with 2,000 resamples. The predefined passing criterion for the primary endpoints was AUC >0.95 based on recognized FDA performance benchmarks.18 While not predefined passing criteria, this manuscript refers to additional benchmarks of sensitivity >80% and specificity >80%. The sample sizes for each of pneumothorax and tension pneumothorax were calculated using prior model results and to ensure the lower bound of the 95% CI for AUC was >0.95.
Results
Participants
An initial cohort of 1,000 CXR cases were selected for this project (Supplementary Figure 1). Three cases were excluded as they did not contain a CXR during an image quality review. Twelve cases were unsuccessful at model inference. The remaining 985 cases were used for analysis. They included 435 (44.2%) positive pneumothorax cases and 550 (55.8%) negative pneumothorax cases. There were 128 cases that were positive for tension pneumothorax (29.4% of positive pneumothorax cases; 13.0% of total cases). The demographic and technical breakdowns are provided in Table 1.
Pneumothorax detection
The AI model identified pneumothorax (incorporating non-tension and tension pneumothorax) with AUC 0.979 (95% CI 0.970-0.987; Figure 1A), sensitivity 94.3% (95% CI 92.0-96.3%) and specificity 92.0% (95% CI 89.6-94.2%). When considering the subgroup analyses for sex, age, manufacturer, patient positioning and projections, all subgroups achieved at least AUC 0.95, sensitivity 80% and specificity 80% except the Kodak and Unknown manufacturer subgroups (Table 2). These two subgroups had small cohort sizes and underperformed due to false negative cases: 2 false negative cases out of 7 total positive cases for the Kodak manufacturer subgroup and 1 false negative case out of 2 total positive cases for the Unknown manufacturer subgroup. The model achieved higher sensitivity for detecting pneumothorax amongst tension (compared with non-tension) pneumothorax and ≥2cm (compared with <2cm) sized pneumothorax. It detected a pneumothorax in all cases that had a tension pneumothorax.
Tension pneumothorax detection
The AI model identified tension pneumothorax with AUC 0.987 (95% CI 0.980-0.992; Figure 1B), sensitivity 94.5% (95% CI 90.6-97.7%) and specificity 95.3% (95% CI 93.9-96.6%). When considering the subgroup analyses for sex, age, manufacturer, patient positioning and projections, all subgroups achieved at least AUC 0.95, sensitivity 80% and specificity 80% (Table 3).
Detection of pneumothorax and tension pneumothorax when ancillary findings present
The AI model accuracy was assessed in the presence or absence of ten ancillary findings. It mostly achieved AUC 0.95, sensitivity 80% and specificity 80% for the detection of both pneumothorax and tension pneumothorax in the presence or absence of each finding (Table 4). The exceptions included performing with AUC <0.95 and specificity <80% for the detection of pneumothorax in the presence of findings commonly associated with pneumothorax including rib fracture, pneumomediastinum, subcutaneous emphysema and intercostal drain; it also had AUC <0.95 for the detection of pneumothorax when there was evidence of thoracic surgery. In these situations, the model still performed with AUC >0.90.
Discussion
This retrospective study assessed the performance of an AI model at detecting pneumothorax and tension pneumothorax on CXR. The model overall achieved AUC >0.95, sensitivity >80% and specificity >80%. These results are consistent with other FDA-cleared pneumothorax detection models.10-14,16,17 This model is, however, the first FDA-cleared model that also incorporates tension pneumothorax detection.
This study demonstrated consistent accuracy of the AI model in identifying pneumothorax and tension pneumothorax across demographic and technical subgroups including age, sex, manufacturer, patient positioning and CXR projection. This performance suggests that the model has generalizability for the diverse clinical scenarios that it may encounter moving forward. It also achieved sensitivity >80% for detecting pneumothorax when the pneumothorax size was <2cm or ≥2cm, and when the features of tension pneumothorax were present or absent. It unsurprisingly had a higher sensitivity for detecting a pneumothorax when it was larger (≥2cm) or had features of tension; it had sensitivity of 100% for the latter.
The AI model similarly demonstrated consistent model accuracy across most ancillary findings. There were, however, some ancillary findings that the model performed less well on. In particular, when the ancillary findings pneumomediastinum, subcutaneous emphysema and intercostal drain were present, the model had specificity of <50% for detection of pneumothorax. These three findings are commonly associated with pneumothorax and it is possible that the model uses them to identify pneumothorax; hence their presence could contribute to the model calling a case positive. It is also possible that their presence suggests a resolved pneumothorax that the model calls positive. A follow-on experiment could utilize the segmentation function of this model to understand where the model believes the pneumothorax is; this study utilized the subsequently FDA-cleared version that included the classification functionality and did not include the segmentation functionality. Similarly, the presence of the ancillary findings rib fracture and evidence of thoracic surgery each had AUC 0.945 for detection of pneumothorax; these results likely reflect the associations of these findings with pneumothorax albeit not as strong as the associations of pneumomediastinum, subcutaneous emphysema and intercostal drain.
The cohort selection included two strategies: the consecutive pneumothorax cases were identified by a manual review of the original radiology reports from consecutive CXR cases, while the tension pneumothorax cases were initially identified with natural language processing before the manual review. The first strategy ensured that cases reflected the distribution of real-world pneumothorax cases, including cases with a spectrum of pneumothorax conspicuity. The second strategy was used given the infrequency of tension pneumothorax amongst all CXR cases. One item that we noted was that there were fewer ground truth-positive cases compared to the number of original radiology report-positive cases. We hypothesized that this discrepancy occurred because of the lack of associated clinical history and the lack of other imaging such as chest computed tomography, which might have increased the detection rate in the original clinical environment. The ground truth radiologists in this current study only had access to the CXR.
A key limitation of this study is that it is a retrospective study outside of the clinical workflow. While it demonstrates the accuracy of the AI model in interpreting imaging across many demographic and technical subgroups, it does not do so within the broader clinical environment. Further evaluation will be required to know how the model impacts the clinical workflow including case prioritization and patient outcomes. Further evaluation will also be required as the model encounters clinical scenarios beyond the current study including from new radiographic equipment manufacturers or models.
Conclusion
This study assessed an AI model that accurately detected pneumothorax and tension pneumothorax. Its use in the clinical environment may lead to earlier identification and improved care for patients with pneumothorax.
Data Availability
The AI model is part of the FDA cleared Annalise Enterprise CXR Triage Pneumothorax device, which is commercially available in the US. The test dataset generated for this study contains protected patient information. Some data may be available for research purposes from the corresponding author upon reasonable request.
Funding
This study was funded by Annalise-AI. JMH, BCB, SM, JKC, INC, SD, MG, VM, MKK, KD are employees of Mass General Brigham and/or Massachusetts General Hospital, which had received institutional funding from Annalise-AI for the study. GB, JCYS and CMJ are employed by or seconded to Annalise-AI.
Acknowledgements
The authors thank the broader Mass General Brigham Data Science Office and Annalise teams for their assistance with this project.