ABSTRACT
A number of challenges hinder artificial intelligence (AI) models from effective clinical translation. Foremost among these are: (1) reproducibility or repeatability, defined as the ability of a model to make consistent predictions on repeat images from the same patient taken under identical conditions; (2) clinical uncertainty or the equivocal nature of certain pathologies, which must be acknowledged in order to accurately and meaningfully separate true normal from true disease cases; and (3) lack of portability or generalizability, which causes AI model performance to vary across axes of data heterogeneity. We recently investigated the development of an AI pipeline on digital images of the cervix, utilizing a multi-heterogeneous dataset (“SEED”) of 9,462 women (17,013 images) and a multi-stage model selection and optimization approach, to generate a diagnostic classifier that assigns images of the cervix to “normal”, “indeterminate” and “precancer/cancer” (denoted as “precancer+”) categories. In this work, we investigated the performance of this multiclass classifier on external data (“EXT”) not used in training or internal validation, to assess the portability of the classifier when moving to new settings. We assessed both the repeatability and the classification performance of our classifier across the two axes of heterogeneity present in our dataset, image capture device and geography, utilizing both out-of-the-box inference and retraining with “EXT”.
Our results indicate strong repeatability of our multiclass model utilizing Monte-Carlo (MC) dropout, which carries over well to “EXT” (95% limits of agreement range = 0.2 - 0.4) even in the absence of retraining. Classification performance on “EXT” is strong once the model is retrained (% extreme misclassifications = 4.0% for n = 26 “EXT” individuals added to “SEED” in a 2n normal : 2n indeterminate : n precancer+ ratio) and improves incrementally as images from additional individuals are included in retraining. We additionally find that device-level heterogeneity affects our model performance more than geography-level heterogeneity. Our work supports both (1) the development of comprehensively designed AI pipelines, with design strategies incorporating multiclass ground truth and MC dropout, on multi-heterogeneous data that are specifically optimized to improve repeatability, accuracy, and risk stratification; and (2) the need for optimized retraining approaches that address data heterogeneity (e.g., when moving to a new device) to facilitate effective use of AI models in new settings.
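As an illustration of the two summary metrics quoted above, the following is a minimal sketch rather than the pipeline's actual code. It assumes the conventional Bland-Altman definition of 95% limits of agreement (mean paired difference ± 1.96 standard deviations) applied to model scores on repeat images, and it assumes that an "extreme misclassification" is a normal case predicted as precancer+ or the reverse, expressed as a fraction of all cases. The function names, the class encoding (0 = normal, 1 = indeterminate, 2 = precancer+) and the choice of denominator are hypothetical.

```python
import numpy as np


def limits_of_agreement(scores_a, scores_b):
    """Bland-Altman 95% limits of agreement for paired scores.

    scores_a, scores_b: model scores for two repeat images of the
    same individuals (assumed pairing; one pair per individual).
    Returns (lower, upper) = mean difference -/+ 1.96 * SD of differences.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    mean_diff = diff.mean()
    sd_diff = diff.std(ddof=1)  # sample standard deviation
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


def extreme_misclassification_rate(y_true, y_pred, normal=0, precancer_plus=2):
    """Fraction of cases where normal and precancer+ are confused.

    Assumed definition: a case counts as extreme if the ground truth is
    normal and the prediction is precancer+, or vice versa. The
    denominator here is all cases, which is an assumption.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    extreme = ((y_true == normal) & (y_pred == precancer_plus)) | (
        (y_true == precancer_plus) & (y_pred == normal)
    )
    return float(extreme.mean())


if __name__ == "__main__":
    # Toy values for illustration only; not data from the study.
    lower, upper = limits_of_agreement([0.1, 0.5, 0.9, 0.3], [0.2, 0.4, 0.8, 0.4])
    print(f"95% limits of agreement: [{lower:.3f}, {upper:.3f}], width {upper - lower:.3f}")
    rate = extreme_misclassification_rate([0, 0, 1, 2, 2], [0, 2, 1, 0, 1])
    print(f"extreme misclassifications: {100 * rate:.1f}%")
```

Under this reading, a narrower interval between the lower and upper limits indicates more repeatable predictions, and a lower extreme misclassification rate indicates fewer clinically consequential errors.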
AUTHOR SUMMARY
Artificial intelligence (AI) model robustness has emerged as a pressing issue, particularly in medicine, where model deployment requires rigorous standards of approval. In the context of this work, model robustness refers to both the reproducibility of model predictions across repeat images and the portability of model performance to external data. Real-world clinical data are often heterogeneous across multiple axes, and distribution shifts along one or more of these axes are the norm rather than the exception. Current deep learning (DL) models for cervical cancer and in other domains exhibit poor repeatability and overfitting, and frequently fail when evaluated on external data. As recently as March 2023, the FDA issued draft guidance on effective implementation of AI/DL models, highlighting the need to adapt models to data distribution shifts.
To address these concerns, we conducted a thorough investigation of the generalizability of a deep learning model for cervical cancer screening, utilizing the distribution shifts present in our large, multi-heterogeneous dataset. We highlight optimized strategies to adapt an AI-based clinical test, which in our case was a cervical cancer screening triage test, to external data from a new setting. Given the severe clinical burden of cervical cancer, and the fact that existing screening approaches, such as visual inspection with acetic acid (VIA), are unreliable, inaccurate, and invasive, there is a critical need for an automated, AI-based pipeline that can evaluate cervical lesions more consistently and in a minimally invasive fashion. Our work represents one of the first efforts at generating and externally validating a cervical cancer diagnostic classifier that is reliable, consistent, accurate, and clinically translatable, in order to triage women into appropriate risk categories.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. 75N91019D00024, Task Order 75N91019F00134, as part of the National Cancer Institute Cancer Cures Moonshot Initiative. No commercial support was obtained. Brian Befano was supported by NCI/NIH under Grant T32CA09168. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethics committee/IRB of the National Cancer Institute (NCI) and the National Institutes of Health (NIH) gave ethical approval for this work.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes