PT - JOURNAL ARTICLE
AU - Chinn, Erin
AU - Arora, Rohit
AU - Arnaout, Ramy
AU - Arnaout, Rima
TI - ENRICH: Exploiting Image Similarity to Maximize Efficient Machine Learning in Medical Imaging
AID - 10.1101/2021.05.22.21257645
DP - 2021 Jan 01
TA - medRxiv
PG - 2021.05.22.21257645
4099 - http://medrxiv.org/content/early/2021/05/25/2021.05.22.21257645.short
4100 - http://medrxiv.org/content/early/2021/05/25/2021.05.22.21257645.full
AB - Deep learning (DL) has been applied with success in proofs of concept across biomedical imaging, including across modalities and medical specialties [1–17]. Labeled data is critical to training and testing DL models, and such models traditionally require large amounts of training data, straining the limited (human) resources available for expert labeling/annotation. It would be ideal to prioritize labeling those images that are most likely to improve model performance and skip images that are redundant. However, straightforward, robust, and quantitative metrics for measuring and eliminating redundancy in datasets have not yet been described. Here, we introduce a new method, ENRICH (Eliminate Needless Redundancy for Imaging Challenges), for assessing image dataset redundancy and test it on a well-benchmarked medical imaging dataset [3]. First, we compute pairwise similarity metrics for images in a given dataset, resulting in a matrix of pairwise-similarity values. We then rank images based on this matrix and use these rankings to curate the dataset so as to minimize its redundancy. Using this method, we achieve similar AUC scores in a binary classification task with just a fraction of our original dataset (AUC of 0.99 ± 1.35e-05 on 44 percent of available images vs. AUC of 0.99 ± 9.32e-06 on all available images, p-value 0.0002) and better scores than same-sized training subsets chosen at random. We also demonstrate similar Jaccard scores in a multi-class segmentation task while eliminating redundant images
(average Jaccard index of 0.58 on 80 percent of available images vs. 0.60 on all available images). Thus, algorithms that reduce dataset redundancy based on image similarity can significantly reduce the number of training images required, while preserving performance, in medical imaging datasets.
Competing Interest Statement: The authors have declared no competing interest.
Funding Statement: EC, RA, RA, and RA were supported by the Department of Defense (W81XWH-19-1-0294) and the National Heart, Lung, and Blood Institute (NIH R01HL146398). RA and RA were supported by the National Institute of Allergy and Infectious Diseases (NIH R01AI148747-01). EC and RA were supported by the American Heart Association (17IGMV33870001).
Author Declarations:
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: All datasets were obtained retrospectively and de-identified, with waived consent in compliance with the Institutional Review Board (IRB) at the University of California, San Francisco (UCSF). IRB 17-21481
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
Due to the sensitive nature of patient data, we are not able to make these data publicly available at this time.
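
Illustrative sketch (not the authors' code): the abstract describes three steps, namely computing a pairwise-similarity matrix, ranking images from it, and curating the dataset to reduce redundancy. The record does not say which similarity metric or ranking rule ENRICH uses, so the example below makes its own assumptions: cosine similarity on flattened pixel values, and a greedy rule that repeatedly keeps the image least similar to those already kept. The function names `pairwise_cosine_similarity` and `select_least_redundant` are hypothetical.

```python
import numpy as np


def pairwise_cosine_similarity(images: np.ndarray) -> np.ndarray:
    """Return an (n, n) matrix of cosine similarities between flattened images.

    Assumption: cosine similarity stands in for whatever metric the paper uses.
    """
    flat = images.reshape(len(images), -1).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)  # guard against all-zero images
    return unit @ unit.T


def select_least_redundant(similarity: np.ndarray, keep: int) -> list[int]:
    """Greedily pick `keep` image indices that are mutually dissimilar.

    Assumption: this greedy ranking is a stand-in, not the published rule.
    """
    n = similarity.shape[0]
    if not 0 < keep <= n:
        raise ValueError("keep must be between 1 and the number of images")

    # Seed with the image that is least similar to all others overall.
    off_diagonal_sum = similarity.sum(axis=1) - np.diag(similarity)
    first = int(np.argmin(off_diagonal_sum))
    selected = [first]

    # For every image, track its highest similarity to any selected image.
    max_sim_to_selected = similarity[first].copy()
    max_sim_to_selected[first] = np.inf  # never pick the same image twice

    while len(selected) < keep:
        nxt = int(np.argmin(max_sim_to_selected))
        selected.append(nxt)
        max_sim_to_selected = np.maximum(max_sim_to_selected, similarity[nxt])
        max_sim_to_selected[nxt] = np.inf
    return selected


if __name__ == "__main__":
    # Four toy 2x2 "images"; index 1 is an exact duplicate of index 0.
    toy = np.array(
        [
            [[1, 0], [0, 0]],
            [[1, 0], [0, 0]],
            [[0, 1], [0, 0]],
            [[0, 0], [1, 0]],
        ],
        dtype=float,
    )
    sim = pairwise_cosine_similarity(toy)
    print(select_least_redundant(sim, keep=3))
```

On the toy input, keeping three of the four images leaves out one of the two identical images, which is the behavior the abstract attributes to redundancy-based curation. Real medical images would likely call for a more robust similarity measure than raw-pixel cosine similarity.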