Abstract
Deep learning (DL) has been applied successfully in proofs of concept across biomedical imaging, spanning modalities and medical specialties [1–17]. Labeled data is critical to training and testing DL models, and such models traditionally require large amounts of training data, straining the limited (human) resources available for expert labeling/annotation. It would be ideal to prioritize labeling those images that are most likely to improve model performance and skip images that are redundant. However, straightforward, robust, and quantitative metrics for measuring and eliminating redundancy in datasets have not yet been described. Here, we introduce a new method, ENRICH (Eliminate Needless Redundancy for Imaging Challenges), for assessing image dataset redundancy and test it on a well-benchmarked medical imaging dataset [3]. First, we compute pairwise similarity metrics for images in a given dataset, resulting in a matrix of pairwise-similarity values. We then rank images based on this matrix and use these rankings to curate the dataset and minimize its redundancy. Using this method, we achieve similar AUC scores in a binary classification task with just a fraction of our original dataset (AUC of 0.99 ± 1.35e-05 on 44 percent of available images vs. AUC of 0.99 ± 9.32e-06 on all available images, p-value 0.0002) and better scores than same-sized training subsets chosen at random. We also demonstrate similar Jaccard scores in a multi-class segmentation task while eliminating redundant images (average Jaccard index of 0.58 on 80 percent of available images vs. 0.60 on all available images). Thus, algorithms that reduce dataset redundancy based on image similarity can significantly reduce the number of training images required, while preserving performance, in medical imaging datasets.
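The pipeline summarized above (pairwise similarity matrix, ranking, curation) can be illustrated with a minimal sketch. The abstract does not specify the similarity metric or the ranking rule, so everything below is an assumption made for illustration only: cosine similarity on flattened pixels stands in for the similarity metric, a greedy "least similar to what is already kept" rule stands in for the ranking, and the function names `pairwise_similarity`, `rank_by_redundancy`, and `curate` are hypothetical rather than taken from ENRICH.

```python
import numpy as np


def pairwise_similarity(images):
    """Return an (n, n) matrix of cosine similarities between flattened images.

    Assumption: cosine similarity is a placeholder metric, not necessarily
    the one used by the authors.
    """
    flat = images.reshape(len(images), -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    flat = flat / np.clip(norms, 1e-12, None)  # guard against all-zero images
    return flat @ flat.T


def rank_by_redundancy(sim):
    """Order image indices from least to most redundant (greedy heuristic).

    Assumption: start from the image with the lowest mean similarity to the
    others, then repeatedly add the image whose highest similarity to any
    already-ranked image is smallest.
    """
    n = sim.shape[0]
    mean_sim = (sim.sum(axis=1) - np.diag(sim)) / max(n - 1, 1)
    first = int(np.argmin(mean_sim))
    order = [first]
    # max_to_ranked[i] = highest similarity between image i and any ranked image
    max_to_ranked = sim[first].copy()
    max_to_ranked[first] = np.inf  # never pick an already-ranked image again
    while len(order) < n:
        nxt = int(np.argmin(max_to_ranked))
        order.append(nxt)
        max_to_ranked = np.maximum(max_to_ranked, sim[nxt])
        max_to_ranked[nxt] = np.inf
    return order


def curate(images, fraction):
    """Keep the least redundant `fraction` of the dataset; return sorted indices."""
    order = rank_by_redundancy(pairwise_similarity(images))
    keep = max(1, int(round(fraction * len(images))))
    return sorted(order[:keep])


if __name__ == "__main__":
    # Toy dataset: images 0 and 1 are exact duplicates, 2 and 3 are distinct.
    a = np.array([[1, 0], [0, 0]])
    c = np.array([[0, 1], [0, 0]])
    d = np.array([[0, 0], [1, 0]])
    toy = np.stack([a, a, c, d])
    print(curate(toy, 0.75))  # one of the two duplicates is dropped
```

In this toy case the duplicate pair is the only redundancy, so keeping 75 percent of the images removes exactly one copy. On real images, a learned embedding or a perceptual metric would likely be a more meaningful basis for similarity than raw pixels; that choice is left open here.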
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
EC, RA, RA, and RA were supported by the Department of Defense (W81XWH-19-1-0294) and the National Heart, Lung, and Blood Institute (NIH R01HL146398). RA and RA were supported by the National Institute of Allergy and Infectious Diseases (NIH R01AI148747-01). EC and RA were supported by the American Heart Association (17IGMV33870001).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
All datasets were obtained retrospectively and de-identified, with waived consent in compliance with the Institutional Review Board (IRB) at the University of California, San Francisco (UCSF). IRB 17-21481
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Due to the sensitive nature of patient data, we are not able to make these data publicly available at this time.