Abstract
Chest X-rays (CXRs) are a rich source of information for physicians and are essential for disease diagnosis and treatment selection. Recent deep learning models aim to alleviate strain on medical resources and improve patient care by automating the detection of diseases from CXRs. However, shortages of labeled CXRs can pose a serious challenge when training models. Currently, models are generally pretrained on ImageNet, but they often must then be fine-tuned on hundreds of thousands of labeled CXRs to achieve high performance. The current approach to model development is therefore not viable for tasks with only a small amount of labeled data. An emerging method for reducing reliance on large amounts of labeled data is self-supervised learning (SSL), which uses unlabeled CXR datasets to automatically learn features that can be leveraged for downstream interpretation tasks. In this work, we investigated whether self-supervised pretraining methods could outperform traditional ImageNet pretraining for chest X-ray interpretation. We found that SSL-pretrained models outperformed ImageNet-pretrained models on thirteen datasets spanning diverse geographies, clinical settings, and prediction tasks. We thus show that SSL on unlabeled CXR data is a promising pretraining approach for a wide variety of CXR interpretation tasks, enabling a shift away from costly labeled datasets.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Most datasets used in this study are public and can be accessed through their respective websites. Requests concerning the TB portals dataset should be addressed to Maha Farhat.
Reis, Eduardo Pontes. "BRAX, a Brazilian labeled chest X-ray dataset." https://www.nature.com/articles/s41597-022-01608-8
Kermany, Daniel S. et al. "Identifying medical diagnoses and treatable diseases by image-based deep learning." Cell 172.5 (2018): 1122-1131. https://www.sciencedirect.com/science/article/pii/S0092867418301545
Jaeger, Stefan et al. "Two public chest X-ray datasets for computer-aided screening of pulmonary diseases." Quantitative Imaging in Medicine and Surgery 4.6 (2014): 475. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4256233/
Tang, Jennifer SN et al. "CLiP, catheter and line position dataset." Scientific Data 8.1 (2021): 1-7. https://www.nature.com/articles/s41597-021-01066-8
https://www.kaggle.com/jesperdramsch/siimacrpneumothorax-segmentation-data
Lanfredi, Ricardo Bigolin et al. "REFLACX, a dataset of reports and eye-tracking data for localization of abnormalities in chest X-rays." arXiv preprint arXiv:2109.14187 (2021). https://arxiv.org/abs/2109.14187
Wang, Xiaosong et al. "Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. https://openaccess.thecvf.com/content_cvpr_2017/html/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.html
Demner-Fushman, Dina et al. "Preparing a collection of radiology examinations for distribution and retrieval." Journal of the American Medical Informatics Association 23.2 (2016): 304-310. https://academic.oup.com/jamia/article/23/2/304/2572395
Johnson, Alistair EW et al. "MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports." Scientific Data 6.1 (2019): 1-8. https://www.nature.com/articles/s41597-019-0322-0
Bustos, Aurelia et al. "Padchest: A large chest X-ray image dataset with multi-label annotated reports." Medical Image Analysis 66 (2020): 101797. https://www.sciencedirect.com/science/article/pii/S1361841520301614?casa_token=mrtbcDdVl44AAAAA:9RnbkQfTXrk4e7tG-XR967koQar3fLFJztV9sBX446MKlCSe2xwEWWFHf-_BsvwbzYNSILZ_yozh https://bimcv.cipf.es/bimcv-projects/padchest/
Shiraishi, Junji et al. "Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules." American Journal of Roentgenology 174.1 (2000): 71-74. https://www.ajronline.org/doi/full/10.2214/ajr.174.1.1740071
Nguyen, Ha Q. et al. "VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations." arXiv preprint arXiv:2012.15029 (2020). https://arxiv.org/abs/2012.15029
Gabrielian, Andrei et al. "TB DEPOT (Data Exploration Portal): A multi-domain tuberculosis data analysis resource." PLoS ONE 14.5 (2019): e0217410. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0217410
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
Niveditha S. Iyer and Aditya Gulati share first authorship; Agustina D. Saenz and Pranav Rajpurkar share senior authorship.
Data Availability
Most datasets used in this study are public and can be accessed through their respective websites. Requests concerning the TB portals dataset should be addressed to Maha Farhat (Maha_Farhat@hms.harvard.edu).
https://www.nature.com/articles/s41597-022-01608-8
https://www.sciencedirect.com/science/article/pii/S0092867418301545
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4256233/
https://www.nature.com/articles/s41597-021-01066-8
https://www.kaggle.com/jesperdramsch/siimacrpneumothorax-segmentation-data
https://arxiv.org/abs/2109.14187
https://academic.oup.com/jamia/article/23/2/304/2572395