Summary
Background Determining the molecular pathways involved in the development of colorectal cancer (CRC) and knowing the status of key mutations are crucial for selecting the optimal targeted therapy. The goal of this study is to explore machine learning for predicting the status of the three main CRC molecular pathways – microsatellite instability (MSI), chromosomal instability (CIN), and CpG island methylator phenotype (CIMP) – for detecting BRAF and TP53 mutations, and for predicting hypermutated (HM) tumors, all from whole-slide images (WSIs) of CRC slides stained with hematoxylin and eosin (H&E).
Methods We propose a novel iterative draw-and-rank sampling (IDaRS) algorithm to select representative sub-images, or tiles, from a WSI given a single WSI-level label, without needing any detailed annotations at the cell or region level. IDaRS is used to train a deep convolutional network to predict key molecular parameters in CRC (specifically, HM tumors, the status of the three main CRC molecular pathways – MSI, CIN, CIMP – and two key mutations, BRAF and TP53) from digitized images of routine H&E-stained tissue slides of CRC patients (n=497 for the TCGA cohort and n=47 for the Pathology AI Platform (PAIP) cohort). The visual fields that IDaRS identifies as most predictive of each pathway and of HM tumors are analyzed to verify known histological features and, for the first time, to reveal novel histological features. This is achieved by a systematic, data-driven analysis of the cellular composition of strongly predictive tiles.
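The draw-and-rank idea described in the Methods can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `draw_and_rank`, the parameters `k_top` and `r_random`, and the stand-in `score_fn` (a placeholder for the network's predicted probability of the slide-level label) are all assumptions introduced here. In actual training, the network would be updated on the selected tiles in each iteration, so the ranking would change as the model learns.

```python
import random


def draw_and_rank(tile_ids, score_fn, n_iters=5, k_top=4, r_random=8, seed=0):
    """Illustrative draw-and-rank loop over the tiles of a single slide.

    tile_ids : list of hashable tile identifiers for one WSI.
    score_fn : hypothetical callable, tile_id -> float, standing in for the
               model's predicted probability of the slide-level label.
    Returns the k_top highest-scoring tiles kept after the final iteration.
    """
    rng = random.Random(seed)
    top = []  # top-ranked tiles carried over from the previous iteration
    for _ in range(n_iters):
        # Draw: sample fresh random tiles that are not already in the top set.
        pool = [t for t in tile_ids if t not in top]
        drawn = rng.sample(pool, min(r_random, len(pool)))
        candidates = top + drawn
        # (A training step on `candidates`, all labeled with the
        #  slide-level label, would happen here.)
        # Rank: order candidates by score and keep the best k_top.
        ranked = sorted(candidates, key=score_fn, reverse=True)
        top = ranked[:k_top]
    return top


# Toy usage with a fixed scoring function (assumption: higher id = higher score).
selected = draw_and_rank(list(range(100)), score_fn=lambda t: t / 100)
```

Carrying the top-ranked tiles forward while drawing new random ones is one way to balance exploiting tiles already known to be informative against exploring the rest of the slide.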
Findings IDaRS predicts the three main CRC genetic pathways and key mutations with high accuracy from deep learning-based analysis of WSIs of H&E-stained slides. It achieves state-of-the-art AUROC values of 0.90, 0.83, and 0.81 for predicting MSI, CIN, and HM status, respectively, on the TCGA cohort, which are significantly higher than those of any other currently published method on that cohort. We also report prediction of the status of the CIMP pathway (CIMP-High versus CIMP-Low) from H&E slides, with an AUROC of 0.79. A key feature of the proposed method is that it enables a systematic, data-driven analysis of the cellular composition of image tiles strongly predictive of the various molecular parameters. Using this automated quantitative analysis, we identified key discriminative histological features associated with HM tumors and with each molecular pathway. Relatively high proportions of tumor-infiltrating lymphocytes and necrosis are strongly associated with HM and MSI, and moderately associated with CIMP-H and genome-stable (GS) cases, whereas relatively high proportions of neoplastic epithelial type 2 (NEP2), mesenchymal, and neoplastic epithelial type 1 (NEP1) cells are associated with CIN cases.
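For readers unfamiliar with the AUROC metric reported in the Findings, the following minimal sketch computes it as the probability that a randomly chosen positive slide receives a higher score than a randomly chosen negative slide (ties count as one half). The function name `auroc` and the toy labels and scores are hypothetical and are not data from this study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels : iterable of 0/1 slide-level ground-truth labels.
    scores : iterable of predicted scores, one per slide.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count as 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): three of four positive/negative pairs
# are ordered correctly.
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.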
Interpretation Automated, highly accurate prediction of genetic pathways and key mutations from image analysis of simple H&E-stained sections can provide time- and cost-effective decision support. This work shows that a deep learning algorithm can mine both visually recognizable and sub-visual histological patterns associated with molecular pathways and key mutations in CRC in a data-driven manner.
Funding This study was funded by the UK Medical Research Council (award MR/P015476/1).
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
The research reported in this publication was supported by the UK Medical Research Council (award MR/P015476/1).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Anonymized scanned whole-slide images were retrieved from The Cancer Genome Atlas (TCGA) project through the Genomic Data Commons Portal (https://portal.gdc.cancer.gov/). De-identified pathology images and annotations in the PAIP (Pathology AI Platform) cohort (used as the external validation cohort in this study) were prepared and provided by the Seoul National University Hospital through a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C0316). The organizing committee of the PAIP 2020 Challenge: MSI Prediction in Colorectal Cancer made the PAIP cohort available for this research study, as permitted by its institutional review board (Seoul National University Hospital IRB No. H-1808-035-964).
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
Fig. 9 and Fig. 10 replaced with sharper versions for better viewing; Table 2 updated; Abstract revised.
Data Availability
All images and the associated pathway and mutation status for the TCGA cohort (COAD and READ) used in this study are publicly available at https://portal.gdc.cancer.gov/ and cBioPortal. A link to the TCGA manifest file that can be used to download all images for the TCGA cohort can be found in the Supplementary Materials document. The ground truth labels of TCGA-CRC-DX for HMD/LMD, MSI/MSS, CIN/GS, and CIMP-H/L were obtained from Liu et al. A link to the spreadsheet containing the corresponding clinical and molecular data, including cancer stages, subtypes, and the status of mutations and pathways, can also be found in the Supplementary Materials. De-identified pathology images and annotations from the Pathology AI Platform (PAIP), used in this study with institutional permission, can be obtained via appropriate data access requests through the URL: http://www.wisepaip.org/paip.