Abstract
Objectives Develop an interpretable AI algorithm to rule out normal large bowel endoscopic biopsies saving pathologist resources.
Design Retrospective study.
Setting One UK NHS site was used for model training and internal validation. External validation conducted on data from two other NHS sites and one site in Portugal.
Participants 6,591 whole-slides images of endoscopic large bowel biopsies from 3,291 patients (54% Female, 46% Male).
Main outcome measures Area under the receiver operating characteristic and precision recall curves (AUC-ROC and AUC-PR), measuring agreement between consensus pathologist diagnosis and AI generated classification of normal versus abnormal biopsies.
Results A graph neural network was developed incorporating pathologist domain knowledge to classify the biopsies as normal or abnormal using clinically driven interpretable features. Model training and internal validation were performed on 5,054 whole slide images of 2,080 patients from a single NHS site resulting in an AUC-ROC of 0.98 (SD=0.004) and AUC-PR of 0.98 (SD=0.003). The predictive performance of the model was consistent in testing over 1,537 whole slide images of 1,211 patients from three independent external datasets with mean AUC-ROC = 0.97 (SD=0.007) and AUC-PR = 0.97 (SD=0.005). Our analysis shows that at a high sensitivity threshold of 99%, the proposed model can, on average, reduce the number of normal slides to be reviewed by a pathologist by 55%. A key advantage of IGUANA is its ability to provide an explainable output highlighting potential abnormalities in a whole slide image as a heatmap overlay in addition to numerical values associating model prediction with various histological features. Example results with can be viewed online at https://iguana.dcs.warwick.ac.uk/.
Conclusions An interpretable AI model was developed to screen abnormal cases for review by pathologists. The model achieved consistently high predictive accuracy on independent cohorts showing its potential in optimising increasingly scarce pathologist resources and for achieving faster time to diagnosis. Explainable predictions of IGUANA can guide pathologists in their diagnostic decision making and help boost their confidence in the algorithm, paving the way for future clinical adoption.
What is already known on this topic
Increasing screening rates for early detection of colon cancer are placing significant pressure on already understaffed and overloaded histopathology resources worldwide and especially in the United Kingdom1.
Approximately a third of endoscopic colon biopsies are reported as normal and therefore require minimal intervention, yet the biopsy results can take up to 2-3 weeks2.
AI models hold great promise for reducing the burden of diagnostics for cancer screening but require incorporation of pathologist domain knowledge and explainability.
What this study adds
This study presents the first AI algorithm for rule out of normal from abnormal large bowel endoscopic biopsies with high accuracy across different patient populations.
For colon biopsies predicted as abnormal, the model can highlight diagnostically important biopsy regions and provide a list of clinically meaningful features of those regions such as glandular architecture, inflammatory cell density and spatial relationships between inflammatory cells, glandular structures and the epithelium.
The proposed tool can both screen out normal biopsies and act as a decision support tool for abnormal biopsies, therefore offering a significant reduction in the pathologist workload and faster turnaround times.
Introduction
Histological examination is a vital component in ensuring accurate diagnosis and appropriate treatment of many diseases. In routine practice, it involves visual assessment of key histological and cellular patterns in the tissue, which is a major step in understanding the state of various conditions, such as cancer. Histopathology has been at the forefront of many advances in care including, but not limited to, cancer screening programmes, molecular pathology, tumour classification and companion diagnostic testing, resulting in a rapid rise in demand for histology-derived data3. This extra workload is placing tremendous pressure on pathologists, with 78% of UK cellular pathology departments already facing significant staff shortages4. The surging demand and staffing challenges ultimately lead to delays in diagnosis, negatively impacting patient care especially for those with abnormal conditions (e.g., cancer or serious inflammation) where early intervention and treatment are critical5.
New National Institute for Health and Care Excellence (NICE) guidelines for referral of suspected cancer forecast a rise in demand for endoscopy, with more than 750,000 additional procedures performed per year by 20206, leading to a breach in standard wait times in a quarter of NHS hospitals7,8. Endoscopic large bowel biopsies constitute approximately 10% of all requests in the UK National Health Service (NHS) pathology laboratories. During the examination process, the pathologist examines each biopsy slide searching for disease, typically working from low to high magnification, and analyses a set of pre-defined histological features, such as gland architecture, inflammation and nuclear atypia for signs of abnormality9,10. The resulting report indicates the presence of any disease process and categorises the abnormality into the most appropriate diagnosis11,12. An overview of the pathologist diagnostic decision process for reporting endoscopic colon biopsies is provided in Supplementary Fig 1. Approximately a third of colonic biopsy samples are reported as normal (Supplementary Table 1), representing a substantial workload where the pathologist’s expertise is not fully utilised. The underlying hypothesis of this study is that automated screening of normal biopsies may help address rising histopathology capacity challenges.
Since the advent of digital pathology13, there has been a sharp increase in the development of artificial intelligence (AI) tools that enable computational analysis of multi-gigapixel whole-slide images (WSIs). In particular, deep learning (DL) algorithms have achieved remarkable performance not only in routine diagnostic tasks, such as cancer grading14 and finding metastasis in lymph nodes, but also in finding origins for cancers of unknown primary (CUP)15 and improved patient stratification16,17. Notably, Campanella et al.18 presented a seminal paper on clinical-grade WSI classification, while Bejnordi et al.19 demonstrated that AI models are capable of surpassing pathologist performance for breast cancer metastasis detection. These models can be leveraged to help reduce inevitable errors in diagnosis, given that humans are naturally prone to mistakes, especially when faced with fatigue or distractions20,21. AI tools are not as susceptible to these kinds of errors and therefore may help mitigate oversight, reduce workload and increase reproducibility.
Differentiating between normal and neoplastic colorectal WSIs using DL has previously been addressed, with reports of excellent performance22-24. However, distinguishing normal from abnormal tissue samples required for large bowel biopsy screening remains a challenge, due to the difficulty in detecting various subtle conditions, such as mild inflammation. To the best of our knowledge, there are no existing multi-centric studies for normal vs abnormal classification of large bowel biopsies. Existing methods for colonic analysis operate on high power sub-images (or image patches) and so do not explicitly model both the tissue micro- and macro-structure, including glandular architecture, inflammatory cell density and spatial relationships between inflammatory cells, glandular structures and the epithelium. Relying solely on DL models to automatically detect histological patterns that are diagnostically relevant in small image regions may lead to sub-optimal performance. Alternatively, explicitly incorporating histological features that are routinely used by pathologists during the colon biopsy diagnostic workflow may not only improve performance over conventional DL models but may also increase transparency and interpretability of the algorithm’s decision making to the pathologist – a key requirement for trustworthy AI based medical decision models25,26.
To help reduce the burden of large bowel biopsy screening, we propose the first interpretable AI algorithm for large bowel slide classification employing a gland-graph network named IGUANA (Interpretable Gland-Graphs using a Neural Aggregator). In the proposed approach, a WSI is modelled as a graph with nodes27-30, each representing a gland associated with a set of 25 interpretable features capturing gland architecture, intra-gland nuclear morphology and inter-gland cell density. The interconnections between these nodes capture the spatial organisation of glands within the tissue. The node features were developed in collaboration with pathologists and in accordance with existing diagnostic pathways to boost predictive accuracy, interpretability and alignment with known histological characteristics of a wide range of colorectal pathologies. IGUANA identifies highly predictive regions in the biopsy tissue slide and provides an explanation as to why they may be highly predictive. Because of the use of biologically meaningful features, this explanation can easily be interpreted by a pathologist as the basis of the algorithm’s diagnostic decision making. We validate our algorithm on an internal dataset containing 5,054 WSIs and an independent multi-centre dataset containing 1,561 WSIs, achieving the best performance compared to recent top-performing approaches. In addition, we analyse predictive regions identified by IGUANA along with local and WSI-level explanations and show that our approach can identify areas of abnormality, such as inflammation and neoplasia. A summary of our overall pipeline can be seen in Fig 1, which consists of the following steps: 1) histological segmentation, 2) feature extraction and edge generation, 3) graph prediction and 4) graph explanation. The code for IGUANA is available in the open-source domain for research purposes (https://github.com/TissueImageAnalytics/iguana) and example results can be visualised in an interactive demo available at https://iguana.dcs.warwick.ac.uk.
Methods
WSI datasets
We collected data from four patient cohorts containing routine Haematoxylin and Eosin (H&E) stained WSIs of endoscopic colon biopsies from the following centres: 1) University Hospitals Coventry and Warwickshire (UHCW) NHS Trust, United Kingdom; 2) South Warwickshire NHS Foundation Trust, United Kingdom; 3) East Suffolk and North Essex (ESNE) NHS Foundation Trust, United Kingdom and 4) IMP Diagnostics Laboratory, Portugal23. Glass slides from UHCW were digitised with a GE Omnyx slide scanner at a pixel resolution of 0.275 microns per pixel (MPP). Slides from ESNE and South Warwickshire Hospitals were digitised with 3DHISTECH scanners at pixel resolutions of 0.122 MPP and 0.139 MPP, respectively. IMP Diagnostics slides were digitised with a Leica GT450 scanner at a pixel resolution of 0.263 MPP. In total, we collected 6,591 WSIs from 3,291 patients, with 5,054 from UHCW, 148 from ESNE, 257 from South Warwickshire and 1,132 from IMP Diagnostics. To rigorously evaluate our approach for colon biopsy screening, we performed 3-fold internal cross-validation on the UHCW dataset and held out the remaining three datasets for independent external validation. When creating the folds for internal cross-validation, the data was split with stratification at patient-level to ensure that our method was evaluated on completely unseen cases.
Collaborating pathologists categorised WSIs from UHCW, ESNE and South Warwickshire at slide-level into a ground-truth diagnosis label of either normal, non-neoplastic or neoplastic with consensus review of discordant cases. A wide range of histological conditions were present across the datasets to reflect the clinical screening procedure. A full summary of the specific diagnoses is shown in Supplementary Table 2. For this study, non-neoplastic and neoplastic classes were combined into a single abnormal category. WSIs from IMP diagnostics were originally categorised as either non-neoplastic, low-grade dysplasia or high-grade dysplasia23, where the non-neoplastic category contained a mixture of normal, inflammatory and hyperplastic slides. Therefore, our team of pathologists additionally reviewed non-neoplastic slides from IMP to separate normal from abnormal tissue samples. In our final curated datasets, 42% of slides from UHCW, 61% from ESNE, 40% from South Warwickshire and 84% from IMP Diagnostics were labelled as abnormal. We provide a data description diagram showing the experiment design and the inclusion and exclusion criteria used in Supplementary Fig 2. In addition, we provide an overview of all datasets used in this study in Supplementary Fig 3 and give a demographic breakdown of patients within the development set in Supplementary Fig 4.
Identification of histological objects
The first step of IGUANA requires the segmentation of various histological objects within the WSI, which enables subsequent graph construction and feature extraction. For this, we utilise our recently published Cerberus31 model, which performs simultaneous segmentation and classification of nuclei, glands, lumen and different tissue regions. During training, we use a multi-task learning strategy, which allows the utilisation of multiple independent datasets and enables simultaneous prediction with a single network. Therefore, our localisation step is computationally efficient and does not require multiple passes through various networks. As well as delineating object boundaries, Cerberus determines the category of each nucleus and differentiates the surface epithelium from other glands. Cerberus is trained on a large amount of data from 12 different centres, including more than 535K nuclei, 51K glands and 56K lumen annotations. In our previous work, we have shown that this crucial step of initial localisation generalises well to unseen examples31.
Extraction of clinically interpretable features
After performing segmentation of various histological objects using Cerberus, we extract a set of clinically meaningful features, which were carefully chosen in collaboration with pathologists so that they reflect what features are considered during the screening procedure. Our model’s ability to localise glands, lumen and nuclei within the tissue allows us to extract interesting gland, intra-gland and inter-gland features that are potentially capable of identifying various histological conditions. Specifically, the inter-gland features are defined in the non-glandular surrounding area, also known as the lamina propria. To obtain this region, we utilise the patch-based tissue type classification output from Cerberus and consider both normal gland and tumour patch predictions. We then subtract the gland segmentation output from the prediction map and carry out a series of refinement steps to obtain the final estimation of the lamina propria. We ensure that each feature that we consider has a key biological significance. For example, we consider the size and morphology of glands, which can be indicative of cancer. For quantifying the morphology, we utilise the best alignment metric (BAM)32, which provides a measure of how elliptical an object is, to help capture abnormal glands with irregular shapes. We also take into account the number of lumen along with their corresponding morphology, which can be suggestive of conditions such as cribriform architecture and serrated polyps, respectively. Furthermore, the organisation of epithelial cells and the amount of different inflammatory cells within the gland are diagnostically informative. For example, normal glands will have epithelial cells organised at the boundary and neutrophils within the gland are indicative of crypt abscesses. For measuring the epithelial organisation, we compute the mean and standard deviation of distances of epithelial nuclei centroids to their nearest gland boundary. We also utilise the mean and standard deviation of inter-epithelial nuclear distances within the gland. Certain inflammatory conditions, such as lymphocytic colitis, will have an increased number of inflammatory cells within the lamina propria. Therefore, we extract inter-glandular features indicative of the local density of inflammatory cells and report the associated cellular composition. Overall, we compute a set of 25 features, which are standardised before utilisation within our graph-based machine learning model. Visual examples of features used within our framework, along with examples from the 5th and 95th percentiles, are given in Supplementary Fig 5. We also provide a more in-depth description of these features, along with what conditions they can detect in Supplementary Table 3.
Gland-graph neural network for interpretable diagnosis
Recently, graph neural networks (GNNs) have become popular in Computational Pathology (CPath)27 due to their ability to model a large WSI as an interconnection of nodes representing histologically important constructs characterized by node-level features28,33,34. An added advantage of using GNNs for predictive modelling in CPath is their ability to generate an explanation of their output in terms of the node level features35-37.
Thus, once the different histological objects have been identified, each WSI is represented as a gland-graph. Here, glands are represented as nodes on a 2D plane that are connected by edges if they are within a certain distance of each other. Each gland is then associated with a set of 25 features that were previously described. Therefore, the overall graph provides a mechanism for representing local features across the entire tissue sample. As opposed to surgical resections, which usually contains a large bulk of tissue, biopsies can contain many separate tissue segments on the slide. This arrangement has no biological significance and, therefore, it would be unreasonable for glands to be connected between neighbouring tissue regions. Thus, we also ensure that an edge between any two given glands only exists if they are both located within the same tissue segment.
Upon formation of our gland-graph representation of a WSI in terms of its nodes and edges, we pass the input through a GNN, which sequentially aggregates features within the slide to predict the diagnosis. An important aspect of IGUANA is its ability to provide an interpretable and explainable output, which can be used to facilitate the diagnostic process and potentially for biomarker discovery. For this, we utilise GNNExplainer36, which generates a subset of nodes and features that play a crucial role in the GNN’s prediction. To obtain a WSI-level explanation, the local features are averaged within the top ten most predictive nodes. This enables analysis over larger cohorts to identify existing sub-populations. To further increase model interpretability, we can also visualise the intermediate nuclear, lumen and gland localisation results overlaid on top of the original WSI. We provide further detail of our graph-based approach in Supplementary Section S3.
Software, optimisation and reproducibility
We implemented our framework with the open-source software library PyTorch version 1.1038, PyTorch Geometric version 2.1.139 and Python version 3.6 on a workstation equipped with one NVIDIA Tesla V100 GPU. We utilised scikit-learn version 1.0.240 to perform the comparative experiments using random forest and fastcluster version 1.2.641 to perform biclustering analysis. We trained our graph neural network for 50 epochs using a batch size of 64 and an initial learning rate of 0.005, which was reduced by a factor of 0.2 after 25 epochs. It should be noted, that despite using a GPU with 32GB RAM, our GNN framework incurred a low memory utilisation and therefore different specification GPUs may also be used. The interactive demo was developed using the tile server from TIAToolbox42 and Bokeh 2.4.3. No changes were made to the AI system or the hardware over the course of the study.
Model code, along with a full list of software requirements, is located at https://github.com/TissueImageAnalytics/iguana. Model code and weights are for research purposes only and are therefore shared under a non-commercial Creative Commons license.
Patient and public involvement
Lay members have made a valuable contribution to this project in ensuring that the patient is at the heart of this project. Three lay advisors have been working with us since the conception of this project. One of the advisors is part of the National Cancer Research Institute (NCRI) consumer network and Independent Cancer Patient’s Voice (ICPV) group, who are both supportive of new technologies being brought into the NHS for patient benefit.
Results
Large-scale cross validation for colon biopsy screening
To rigorously evaluate our approach for colon biopsy screening, we performed 3-fold cross-validation using 5,054 Haematoxylin and Eosin (H&E) stained colon biopsy WSIs from University Hospitals Coventry and Warwickshire (UHCW), where each slide was labelled as either normal or abnormal. Interpretable screening of normal colon biopsies is a challenging problem due to a wide spectrum of large bowel abnormalities including a variety of neoplastic and inflammatory conditions. Fig 3 shows the results of IGUANA, achieving an average area under the receiver operating characteristic (AUC-ROC) curve of 0.9783 ± 0.0036 and an area under the precision-recall (AUC-PR) curve of 0.9798 ± 0.0031. These scores determine the level of agreement between consensus pathologist diagnosis and AI generated classification of normal versus abnormal biopsies. We also include results obtained using other existing slide-level classification algorithms such as IDaRS43, CLAM44 and a random forest (RF) baseline classifier using our glandular features (denoted by Gland-RF). We observe that IGUANA achieves the best performance compared to both patch-based methods (IDARS and CLAM), demonstrating its strong predictive ability given that it uses only 25 features per gland. We provide additional comparative results between IGUANA and IDaRS in Supplementary Fig 6. Note that despite IGUANA outperforming it, the Gland-RF model produces comparable performance – signifying the strength of our set of clinically-derived features – albeit without the localised interpretability provided by IGUANA. Also, as opposed to the two patch-based methods, IGUANA provides concrete justification as to why a certain diagnostic class was predicted. We go into further detail on interpretability and explainability later in this section.
Model generalization to independent cohorts
A true reflection of a model’s clinical utility requires the assessment of its performance on completely unseen cohorts. For this, we utilised three additional cohorts of H&E-stained colon biopsy slides, providing a total of 1,537 WSIs. These cohorts consisted of 1,132 slides from IMP Diagnostics Laboratory in Portugal23, 148 slides from East Suffolk and North Essex (ESNE) NHS Foundation Trust and 257 slides and South Warwickshire NHS Foundation Trust, where slides were again categorised as either normal or abnormal. We observe from Fig 3 that our model attains high performance for both the ESNE and South Warwickshire cohorts, reaching AUC-ROC scores of 0.9567 ± 0.0155, 0.9649 ± 0.0025 and 0.9789 ± 0.0023 and AUC-PR scores of 0.9731 ± 0.0105, 0.9466 ± 0.0034 and 0.9949 ± 0.0006 for ESNE, South Warwickshire and IMP datasets, respectively. It is evident that there is a large difference in performance between IGUANA and other approaches on the external cohorts, signifying that superior generalisation to unseen data is a strength of our model. In particular, at a sensitivity of 0.99 we obtain a percentage increase over IDaRS of 47.4%, 63.6% and 58.9% for IMP, ESNE and South Warwickshire cohorts, respectively. Example results obtained by our segmentation model across the four datasets are shown in Fig 2.
Analysis of expected reduction in pathologist workload
The real-world value of our approach is determined by its ability to reduce pathologist workload. As our model is intended for screening, it must achieve high sensitivity to minimise the risk of false negatives. Therefore, assessment of the specificity at high sensitivity cut-off thresholds provides a good indication of its potential effectiveness as a screening tool. Here, the specificity is directly indicative of the percentage reduction in normal slides that require pathologist review. In the middle column of Fig 3 we display the specificity of our model at sensitivities of 0.97, 0.98 and 0.99 on all datasets used in our experiments, where we see that IGUANA sustains the best performance at various cut-offs compared to other methods. During internal cross-validation, we obtain specificities of 0.7865 ± 0.0429, 0.6720 ± 0.1128 and 0.5409 ± 0.1210 for sensitivities of 0.97, 0.98 and 0.99, respectively. For independent validation, our method obtains average specificities across the three external datasets of 0.7513 ± 0.0919, 0.6679 ± 0.0779 and 0.5487 ± 0.1599 for sensitivities of 0.97, 0.98 and 0.99. Therefore, this indicates that at a sensitivity of 0.99, our method is able to screen around 55% of normal cases during both internal and external validation.
In Supplementary Fig 7 we show the proportion of slides that require pathologist review to achieve a certain sensitivity18. In these plots, we consider a target sensitivity of 0.99, which is reasonable due to high levels of inter-observer disagreement for conditions such as mild inflammation. We also show with a vertical dashed line the proportion of abnormal slides in each dataset, which indicates the minimum number of slides that need to be reviewed for screening. For each of the cohorts, we observe that for our target of 0.99 sensitivity our model can screen out 32%, 31%, 17% and 13% of total slides from UHCW, South Warwickshire, ESNE and IMP datasets, respectively. If considering a sensitivity of 0.97, we can screen out 44% of slides from UHCW, 46% from South Warwickshire, 30% from ESNE and 19% from IMP.
Performance across anatomical sites
We perform a statistical analysis of test results on the internal UHCW dataset to analyse potential differences in performance over colonic sites, such as the ascending, descending and transverse colon. After performing the Mann-Whitney U test45 between the results of different sites, we found that no statistically significant difference was observed, with p-values > 0.05. This suggests that there is insufficient evidence of significant differences in the results across anatomical sites.
Local feature explanations increase model transparency
A major component of IGUANA is the ability to provide an interpretable and explainable output. In Fig 4, we display visual explanations of the most predictive nodes and features given by IGUANA. Node explanations are shown in the form of a heatmap, where relatively high values indicate glandular areas that contribute to the slide being classified as abnormal. Therefore, we should expect that all glands in a normal slide will have low values in the associated heatmap as shown in Fig 4a, where no glands contribute to the slide being classified as abnormal. Fig 4b-d show WSIs with hyperplastic polyps, inflammation and adenocarcinoma, respectively. Hyperplastic polyps are often characterised by intraluminal folds and lumen dilation. On the other hand, inflammatory conditions usually have an increased number of lymphocytes, plasma cells, eosinophils and neutrophils within the lamina propria and potentially within the glands. Other indicators of inflammation can include crypt branching and crypt dropout. Colon adenocarcinoma is often denoted by irregular glandular morphology, epithelial nuclear atypia and multiple lumina. High-grade cancers typically lose their glandular appearance and form solid sheets of tumour cells. It can be observed that IGUANA is able to pick up abnormal glands with features in line with the above descriptions. In particular, we see that the most predictive glands in Fig 4b contain lumen with a clearly irregular morphology, whereas highlighted glands in 4c show areas with a high degree of inflammation. The adenocarcinoma heatmap in Fig 4d highlights areas that have lost their conventional glandular appearance. Specifically, epithelial nuclei are no longer arranged at the gland boundary, cribriform architecture is observed and glands appear much larger, due to the formation of tumour cell sheets.
In addition to the node explanation heatmap, IGUANA indicates why certain glands are being identified as abnormal. This is useful because it can provide confirmation that the correct features are being identified by the model, giving researchers and clinicians confidence that it is performing as expected. This strategy can also be used to identify additional features within abnormal conditions. To show this, in Fig 4, we display the most predictive glands in each slide and provide the corresponding feature explanations. Specifically, we display the top ten features in descending order of significance, along with their corresponding feature importance values between 0 and 1. Here, we expect that the feature explanations should align with what is observed in the associated cropped regions. In our hyperplastic polyp example, we see that the top glands (i.e., 1, 2 and 3) contain lumen with abnormal morphology, whereas lumen dilation is observed in top gland 4. In line with this, lumen morphology and lumen composition are high-scoring features across the provided examples. We also observe that lumen size and organisation of epithelial nuclei within the glands are often found to be important features. In the example shown in Fig 4c, we observe that top glands have a high degree of inflammation, which is matched by top features, such as inflammatory cell density, gland density and lamina propria neutrophil proportion. In the adenocarcinoma example, we see that the top four glands are all large, have irregular morphology and often display solid sheets of tumour cells with no obvious glandular structure. This is highlighted in the feature explanation, where gland morphology, gland size and epithelial organisation are consistently top-ranked features. Here, epithelial organisation describes how the epithelial nuclei are positioned at the gland boundary. Due to the presence of solid tumour patterns across the top glands, this feature is frequently highlighted in cancerous cases. We provide additional visual examples of the interpretability of our model output in Fig 5.
WSI-level feature explanations are consistent with known histological patterns
In Fig 6a, we show WSI-level explanations averaged over different sub-conditions in the UHCW and IMP cohorts. We focus on these datasets because they are the largest, with both containing over 1,000 samples. Here, we plot top 10 features across the various sub-conditions for increased readability. These plots can be used both to confirm that the global explanations are as expected and to understand which features are particularly important for categorising a certain sub-condition as abnormal. In both UHCW and IMP cohorts, the normal radar plots have a small radius, indicating that no feature contributes to the slide being classified as abnormal. For inflammatory cases, the UHCW and IMP radar plots show that a wide range of features can contribute to the slide being classified as abnormal, where there may be both cellular and architectural changes in the tissue. However, the most important features that can differentiate between other sub-conditions include inflammatory cell density, gland lymphocyte infiltration and gland density. Gland density can be indicative of gland dropout, which is a sign of inflammation. The UHCW radar plots for dysplasia and adenocarcinoma are similar, where the most important features are gland morphology, gland epithelial cell organization, gland epithelial cell size and variation of gland epithelial cell size. This is in line with the key expected histological patterns observed within these tissue types. Likewise, these plots are similar to the low- and high-grade dysplasia plots for the IMP cohorts, indicating that the correct histological features are being highlighted when providing the WSI feature explanation. For hyperplastic polyps, we can see that lumen composition, lumen morphology and epithelial cell organisation have a large influence in the slide being classified as abnormal. Lumen composition is the ratio of lumen to gland size and therefore can identify glands with lumen dilation, which is a distinguishing feature of hyperplastic polyps. Conversely, lumen serrations, which are present in hyperplastic polyps, can lead to irregular lumen morphology, further validating the feature explanations output by our model.
WSI-level feature explanations identify population subgroups
In Fig 6b, we perform hierarchical biclustering of all abnormal slides and WSI-level feature importance scores to help identify various subgroups that exist within the UHCW dataset. At the bottom of the plot, we identify various patient clusters which have varying histological appearance. These are numbered as follows: 1) general sign of inflammation, without neutrophil infiltration; 2) inflammation with a high degree of both lymphocytic and neutrophilic gland infiltration; 3) mainly neoplastic slides with irregular-shaped glands and large epithelial cells; 4) irregular gland morphology, with minimal inflammation; 5) abnormal lumen morphology and composition, with signs of inflammation in the lamina propria; 6) increased eosinophilic infiltration in the lamina propria and 7) neoplastic slides with gland epithelial clustering. Therefore, this gives us confidence that the network is learning key histological differences among the dataset to make an informed WSI-level prediction. More fine-grained clusters can be observed by referring to the associated dendrograms in the biclustering plot.
Interactive visualisation of results
We provide an interactive demo at https://iguana.dcs.warwick.ac.uk showing sample IGUANA results and highlighting the full output of our model at global and local levels, including the intermediate gland, lumen and nuclear segmentation results. In particular, we display the node explanations overlaid as a heat map on top of the glands and the local explanations by hovering over each node in the overlaid graph. Here, we provide the top 5 features to provide insight into what is contributing to certain glands being flagged as abnormal. It may also be of interest to assess the difference in features for nodes across the WSI. Therefore, we also enable visualisation of each of the 25 features overlaid on top of the glands as heat maps.
Discussion
Key findings
There has been a staffing crisis in pathology for many years46, which is being further exacerbated by the increased demand for histopathological examination. Embracing new technologies and AI in clinical practice may be necessary as hospitals seek to find new ways to improve patient care47. AI screening of large bowel endoscopic biopsies holds great promise in helping to reduce these escalating workloads by filtering out normal specimens. However, currently there doesn’t exist a solution that can do this with a high predictive performance. Also, explainable AI is now recognised as a key requirement for trustworthy AI in human-centred decision making26, yet is usually not considered in many healthcare applications. Therefore, in this study we developed an AI model that can accurately differentiate normal from abnormal large bowel endoscopic biopsies, while providing an explanation for why a particular diagnosis was made.
We demonstrated that our proposed method for automatic colon biopsy screening could achieve a strong performance during both internal cross validation (mean AUC-ROC=0.98, mean AUC-PR= 0.98) and on three independent external datasets (mean AUC-ROC=0.97, mean AUC-PR=0.97). Highly sensitive tools for screening are required to minimise the number of undetected abnormal conditions, since the false negative report is likely to lead to delayed diagnosis and potential patient harm. Currently, we obtain promising specificities of 0.789 ± 0.043 at a sensitivity of 0.97 and 0.541 ± 0.121 at a sensitivity of 0.99, which could have a positive impact on reducing pathologist workloads. We also show in Supplementary Fig 7 the expected reduction in clinical workload, where we report up to a 32% time saving by screening out normal biopsies that do not require assessment, while still maintaining a sensitivity of 0.99.
Analysis of misclassifications
To understand misclassifications made by our model, we show six normal slides with the highest predicted abnormality scores in Supplementary Fig 8. After inspection, we see that IGUANA correctly classifies these slides and therefore identifies mislabelling errors in the dataset. Here, the examples should have been labelled as either inflammatory or hyperplastic polyp. In the figure, we include sample image regions, as well as local and WSI-level feature explanations that are reflective of the true category of each slide. In addition, we performed a false negative analysis, where in Supplementary Fig 9a we show the counts of various sub-conditions along with the corresponding number of false negatives. In Supplementary Fig 9b, we show the false negative rate of each category. It can be observed that the model found slides with lymphocytic and collagenous colitis somewhat challenging, with false positive rates of 0.29 and 0.46, respectively. Explicit modelling of the sub-epithelial collagen band should enable us to better detect collagenous colitis. It may be worth noting that there was a relatively small number of collagenous colitis samples in all four cohorts and so they may not have a large impact on the overall performance. Also, a high false negative rate was observed in the mild inflammation category, but this is to be expected because they are visually similar to normal samples.
Strengths and limitations
Not only does the IGUANA algorithm attain a strong performance, it also highlights diagnostically important areas within the tissue and provides an explanation as to which clinical features in those areas are most related to its prediction. This is a major strength of our study as it implies that IGUANA can be used to screen out normal biopsies with high accuracy and also used as a highly interpretable decision support tool within abnormal biopsies.
To enable explainable predictions, our algorithm relies on an accurate intermediate segmentation step, which requires many pixel-level annotations. This can be a time-consuming step and can therefore act as a bottleneck in the development of similar methods. In addition, the type of features that can be incorporated into our AI algorithm are dependent on which kinds of histological objects are initially localised. For example, we do not currently detect goblet cells and so do not include features indicative of goblet cell-rich hyperplastic polyps. Other histological objects that could be added include giant cells, signet ring cells and mitotic figures. In addition, although we segment the surface epithelium, we do not extract any associated features that can help identify conditions such as collagenous colitis. Our method also does not assess surface abnormality to detect intestinal spirochaetosis or pigment to detect melanosis coli. These shortcomings will be addressed in future work. In Supplementary Fig 1b, we highlight diagnostic features (in red colour) that are not currently modelled in our framework.
Findings in context
There have been recent AI approaches developed for cancer detection in colonic whole-slide images22,48,49. However, such approaches cannot be used for screening in clinical practice because they often fail to identify non-cancerous abnormalities such as inflammation. Similarly, AI models have been developed for detecting polyps50,51, inflammatory bowel disease52 or grading dysplasia23, but again they do not address the problem of screening normal from all types of abnormality. Our approach utilises retrospective biopsies from pathology archives, where data is accordingly labelled as normal or abnormal to reflect the clinical screening process. Therefore, unlike other approaches, our AI model can be directly implemented as a triaging tool and may therefore have a profound effect on reducing pathologist workloads. In addition, most recent automatic methods rely on weak supervision, where only the overall diagnosis is used to guide the algorithm. This strategy may be advantageous because it does not rely on the time-consuming task of collecting many annotations. However, this limits the interpretability of the output, which may hinder the acceptance of such models in hospitals.
Implications for clinicians and policy makers
Analysis of colon biopsy slides by visual examination, either under the microscope or more recently on the computer screen, is the current gold standard. However, the current practice is unsustainable with increasing numbers of specimens that require examination and due to staff shortages, where only 3% of NHS hospitals report adequate staffing4. With advances in cancer screening programmes and no immediate sign of the pathologist staffing crisis being resolved, additional measures to assist with reporting will be essential. Our proposed AI model addresses this unmet need by automatically filtering out normal colonic biopsies that require minimal intervention, yet make up a substantial proportion of all cases, with high degree of accuracy. As a result, our model significantly reduces the number of samples that require review by pathologists.
AI models are now starting to be used in clinical practice for prostate cancer detection, where a clear advantage for clinicians has been demonstrated in terms of reducing workloads and increasing reporting accuracy53,54. There is growing evidence that automated methods for tissue diagnosis can transform pathologist workflows and help drive new policies in healthcare. However, no such tool currently exists for screening large bowel endoscopic biopsies, perhaps due to the fact that no automated tool has been able to accurately detect all kinds of abnormality, including inflammation, dysplasia, hyperplasia and neoplasia. With its triaging capability, the proposed model promises to have positive implications on patient treatment due to faster time to diagnosis, resulting in the potential for early intervention where it is needed the most.
Extending the model to surgical resections
Despite the screening of endoscopic large bowel biopsies being a focus of this study, the proposed approach could be applied to resection samples with minimal modification. Our method may be especially powerful in this case because each tissue segment within the slide is typically larger, allowing greater spatial context to be explored. Two potential areas of interest using resection samples include the prediction of genetic alterations55 and survival analysis17,56. In these cases, histological biomarkers are less well known, as compared to those used for routine screening tasks. Therefore, our approach might be used to aid biomarker discovery and help toward further understanding of which morphological patterns are associated with certain genetic alterations and clinical outcomes.
Conclusion
We have shown that IGUANA offers promise as an effective triaging tool for AI-based colon biopsy screening with a strong emphasis on diagnostic interpretability providing concrete justification as to why a certain diagnostic class was predicted and making its predictions transparent and explainable. The proposed AI method can help alleviate current issues in pathologist shortages in the NHS and worldwide and reduce turnaround times of the screening results. Before deployment in clinical practice, a larger scale validation is required with further analysis of IGUANA’s feature explanation output. In addition, considerable time needs to be invested into extending the current user interface so that it easily integrates with current pathologists’ clinical workflows. This will involve a detailed study on the effectiveness of the decision support tool within abnormal biopsies and assessing its implications on time to report the diagnosis.
Data Availability
Original WSIs from University Hospitals Coventry and Warwickshire NHS Trust, East Suffolk and North Essex NHS Foundation Trust and South Warwickshire NHS Foundation Trust will be made available upon completion of the PathLAKE project. Relevant information on obtaining the data from the IMP cohort can be found in the original publication. Graph data will be made available upon request, under a non-commercial Creative Commons license, to enable reproduction and improvement of the results obtained in this paper.
Data Sharing
Original WSIs from University Hospitals Coventry and Warwickshire NHS Trust, East Suffolk and North Essex NHS Foundation Trust and South Warwickshire NHS Foundation Trust will be made available upon completion of the PathLAKE project. Relevant information on obtaining the data from the IMP cohort can be found in the original publication16. Graph data will be made available upon request, under a non-commercial Creative Commons license, to enable reproduction and improvement of the results obtained in this paper.
Contributors
SG, DS and NR designed the study with support from all co-authors. SG, FM and NR developed the methods. SG wrote the code and carried out the experiments. MB provided results using IDaRS and CLAM. MA, YWT, EH, KD, HS, AR, SW, AA, KB, MN, KH, HE, KG and DS provided diagnostic annotations of colonic biopsy slides. SG, FM, MA, KG, DS and NR performed analysis and interpretation of the results. MB, MJ, NW, WL, AB and SEAR provided technical and material support. SG, FM, DS and NR were all involved in the drafting of the paper. NR is the Corresponding Author and guarantor. All authors read and approved the final paper. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding
This study was supported by the PathLAKE digital pathology consortium which is funded by the Data to Early Diagnosis and Precision Medicine strand of the government’s Industrial Strategy Challenge Fund, managed and delivered by UK Research and Innovation (UKRI). FM acknowledges funding from EPSRC grant EP/W02909X/1.
Competing Interests
All authors have completed the ICMJE uniform disclosure form and declare: this research was supported by grants from UK Research and Innovation (UKRI); SG, DS and NR are co-founders of Histofy Ltd, but this has not influenced the work in this study; no other relationships or activities that could appear to have influenced the submitted work. DS reports personal fees from Royal Philips, and NR and FM report research funding from GlaxoSmithKline, all outside the submitted work.
Ethical Approval
This study was conducted under Health Research Authority National Research Ethics approval 15/NW/0843; IRAS 189095 and the Pathology image data Lake for Analytics, Knowledge and Education (PathLAKE) research ethics committee approval (REC reference 19/SC/0363, IRAS project ID 257932, South Central - Oxford C Research Ethics Committee). Data collection and usage of the IMP Diagnostics dataset was performed in accordance with national legal and ethical standards applicable to this cohort16.
Transparency Statement
The lead authors (the manuscript’s guarantors) affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained. To help ensure this, the authors followed the principles outlined by the MI-CLAIM57 and DECIDE-AI58 checklists. We include a completed DECIDE-AI checklist in Supplementary Section S6.
Supplementary material
S1. Diagnostic Pathway
S2. Data Audit & Cohort Details
S3. Extended Methods
Graph neural networks for computational pathology
Existing graph neural networks (GNNs) for computational pathology usually consider fixed-size image patches at each node1-3 and so fail to incorporate features derived from macrostructures, which can span multiple image patches. However, GNNs that use nodes at centres of image patches in WSIs4 may have poor interpretability. Instead, nodes can be centred at known histological entities, such as nuclei and glands, allowing pathologists to directly reason with a model’s predicted outcome5. Although some methods position nodes at known entities, Deep Learning-based features are commonly used6,7, again leading to reduced interpretability. Rather than using features derived from image regions, graphs built on top of meaningful entities enable the extraction of morphological features. For example, previous methods have constructed graphs on top of nuclei (also known as cell graphs), allowing utilisation of interpretable cellular features8,9. However, nuclei are the most basic building blocks in the tissue and therefore associated features may have limited expressive power, and may fail to model important multi-cellular structures, such as glands. Cell graphs can also be very large, where a single tissue sample can contain tens of thousands of nuclei, leading to the generation of intractable graph models. To overcome recent limitations in the literature, our proposed method utilises the concept of gland-graphs for WSI classification, where the nodes are positioned at glands within the tissue, with associated human-interpretable features. The features that we utilise are clinically-meaningful and in line with pathologist diagnostic pathways, leading to excellent performance and providing a highly-interpretable output.
Identification of histological objects
Cerberus10 is a fully convolutional neural network with a shared ResNet3411 encoder and T independent decoders, each of which makes a prediction for a specific task t. Specifically, we consider 6 tasks: 1) gland instance segmentation, 2) gland semantic segmentation, 3) nuclear instance segmentation, 4) nuclear semantic segmentation, 5) lumen instance segmentation and 6) tissue type patch classification. Here, the instance and semantic segmentation branches are combined to achieve simultaneous segmentation and classification. For segmentation tasks, we use U-Net12-inspired decoders that upsample the features step-by-step by a factor of two. After each upsampling operation, we incorporate features from the encoder with skip connections, followed by two convolutions (3×3 kernel) with batch normalisation13. This is repeated until the features have the same spatial dimensions as the input. For tissue type classification, we use global average pooling to reduce the features at the output of the encoder to a 256-dimensional vector, which is then followed by two fully connected layers. The tissue type classification output is used to estimate the lamina propria region.
Gland-graph construction
Mathematically, a graph is defined as G ≡ (V, E), where V is a set of N vertices (or nodes) and E is a set of edges, where ei,j ∈ E denotes an edge between nodes i and j ∈ V. In our case, V describes the set of all glands in a WSI. Each node typically has an associated k-dimensional feature vector xi for i ∈ V. In existing methods, an edge ei,j is constructed if the Euclidean distance between the centroids of nodes i and j is less than a certain threshold9,14. The distance between neighbouring node centroids is suitable for convex node entities, such as nuclei, because centroids will usually be located within the object. However, glands can often be non-convex, especially when they become cancerous. Therefore, we instead define an edge ei,j in our gland-graph if the minimum distance between points on the boundary contours of two glands i and j is less than a certain distance α.
Gland-graph neural network
Upon formation of our gland-graph representation G ≡ (V, E) of a WSI in terms of its nodes V and their non-directional edges E, we pass the input through a GNN to compute the slide-level prediction score. Each node vector xi represents a gland in terms of the previously described features and the GNN aggregates information across nodes using the edges in its computation. Note that the number of nodes and edges in each graph can be different depending upon the tissue structure. The GNN first applies a linear operation on xi ∈ ℝ25 to produce another node-level feature representation for input into two Principal Neighbourhood Aggregation (PNA) graph convolution15 layers. Each PNA layer (l = f,2) updates each node representation by aggregating information from its neighbours j ∈ ℵi according to the following rule: , where γl and ρl are multi-layer perceptrons (MLPs) each with their own trainable weights. PNA uses a combination of aggregation strategies (denoted by ⊕) based on scaling of mean, standard deviation, minimum and maximum aggregation operators over node features. It has been shown recently that using this aggregation approach is superior to methods that use a single aggregation step, such as computing the sum as it enables the resulting GNN to be a better discriminator of local graph structures15. Outputs of the two PNA layers are concatenated and fused with a linear operation to arrive at the final node-level feature embedding ri. The final output f(G) ∈ ℝC is obtained by performing global attention pooling: where ψ and ω are MLPs, C denotes the number of classes predicted by the network, ⊙ denotes element-wise multiplication and hence the global pooling operator learns to assign a varying weight to different gland representations, signifying their relative importance in the final prediction. Finally, a softmax function is applied and all the trainable weights in the GNN are optimised in an end-to-end fashion by minimising the binary cross entropy loss between the output and the ground-truth labels of training slides.
Gland-graph explanation
To enable an interpretable and explainable output, we utilise the graph pruning method GNNExplainer16, which generates a subset of nodes and features that play a crucial role in the GNN’s prediction. The intuition here is that removing unimportant nodes and features should have a negligible impact on the performance and can therefore be removed. Specifically, GNNExplainer is formulated as an optimisation task that maximises the mutual information between a GNN’s prediction and the distribution of possible subgraph structures. Practically, this is achieved by learning a real-valued mask, which gives less weight to unimportant graph components. For our approach, we learn a node explanation mask Mn ∈ ℝN and a feature explanation mask Mf ∈ ℝN×25, where N denotes the number of nodes in each WSI and 25 is the pre-defined number of features. Rather than applying a threshold to the learned masks to give a compact subgraph, we visualise the raw mask output, which gives an interpretable and explainable output that can be discussed with clinicians. Specifically, the learned mask Mn provides the node explanation which can be overlaid on top of the glands in each WSI as a heatmap. Similarly, Mf can be used to identify the top features for each node/gland and the corresponding importance values. To obtain a WSI-level explanation, the local features are averaged within the top ten most predictive nodes.
To assess which node explanation method was best, we utilised the metric proposed by Jaume et al.5. The intuition behind their proposed metric is that a superior node explanation technique should be able to locate top nodes that can better differentiate between classes (in our case normal vs abnormal). We compared node explanations given by GNNExplainer, integrated gradients31 and the attention scores given by our network and found that GNNExplainer gave the best results in terms of class separability. It is important to note that GNNExplainer is a post-hoc method, which is why we used attention pooling during optimisation of our predictive model.
Comparative methods
To rigorously assess the performance of our approach, we compare with IDaRS17, CLAM18 and a random forest classifier using the interpretable glandular features that we extracted (denoted by Gland-RF). Both IDaRS and CLAM are recent state-of-the-art DL approaches that use raw H&E image patches as input in a multiple-instance learning (MIL) framework. The Gland-RF model computes the mean and standard deviation of all local features within the slide to obtain a fixed-size global feature vector before input to the model. When fitting the Gland-RF for each fold, we perform a grid search over the hyperparameters and select the best models in terms of their performance on the validation set.
S4. Interpretable Features
S5. Extended Results
S6. DECIDE-AI Checklist
Acknowledgements
We acknowledge Dr SA Sanders and Dr Naresh Chachlani for their assistance in providing WSIs from South Warwickshire NHS Foundation Trust.
Footnotes
Changed corresponding author.