Abstract
Tumor-Infiltrating Lymphocytes (TILs) have strong prognostic and predictive value in breast cancer, but their visual assessment is subjective. We present MuTILs, a convolutional neural network architecture specifically optimized for the assessment of TILs in whole-slide image scans in accordance with clinical scoring recommendations. MuTILs is a concept bottleneck model, designed to be explainable and to encourage sensible predictions at multiple resolutions. Our computational scores match visual scores and have independent prognostic value in invasive breast cancers from the TCGA dataset.
Introduction
Tumor-Infiltrating Lymphocytes (TILs) are an important prognostic and predictive biomarker in basal and Her2+ breast carcinomas [1]. The stromal TILs score is the fraction of stroma within the tumor bed occupied by lymphoplasmacytic infiltrates (Fig 1). TILs are assessed visually by pathologists through examination of formalin-fixed paraffin-embedded, hematoxylin and eosin (FFPE H&E) stained slides from tumor biopsies or resections. Visual scores are subject to considerable inter- and intraobserver variability, and hence a set of standardized recommendations was developed by the international Immuno-Oncology Working Group [2,3]. Nevertheless, observer variability remains a critical limiting factor in the widespread adoption of TILs scoring in research and clinical settings. Therefore, a set of recommendations was published for developing computational tools for TILs assessment [4]. This brief report describes the development and validation of MuTILs, an explainable deep-learning model for the evaluation of TILs.
Methods
MuTILs jointly segments tissue regions and cell nuclei and extends our earlier work on this topic (Fig 2) [5]. It comprises two parallel U-Nets (each with a depth of 5) for segmenting regions and nuclei at 1 and 0.5 microns-per-pixel (MPP), respectively [6]. Inspired by the HookNet architecture, information is passed from the region branch down to the nucleus branch to provide low-power context [7]. Additionally, we employed a series of constraints to promote compatible, biologically sensible predictions.
We relied on images from 125 infiltrating ductal breast carcinoma patients from the BCSS and NuCLS datasets [8,9]. Additionally, we supplemented the training set with annotations from 85 slides from the Cancer Prevention Study II cohort [10]. The slides were separated into training and testing sets using 5-fold internal-external cross-validation, using the same folds as the NuCLS modeling paper [9,11]. For training, we extrapolated the nuclear labels from the small ∼256×256 pixel high-power fields to large 1024×1024 pixel regions of interest (ROIs) by using NuCLS models to perform inference on the same slides they were trained on to obtain bootstrapped “weak” labels. Generalization results presented here use manual labels (Fig 3).
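The fold construction can be illustrated with a minimal sketch, assuming that internal-external cross-validation holds out whole contributing institutions in each fold so that no institution appears in both training and testing. The slide and site identifiers and the round-robin assignment are hypothetical; the folds used in this work were taken from the NuCLS modeling paper rather than generated this way.

```python
def institution_folds(slide_to_site, n_folds=5):
    """Assign whole institutions to folds so no site spans train and test.

    slide_to_site: dict mapping slide id -> institution id (hypothetical ids).
    Returns a list of (train_slides, test_slides) tuples, one per fold.
    """
    sites = sorted(set(slide_to_site.values()))
    # Round-robin assignment of sites to folds (an illustrative choice only).
    site_fold = {site: i % n_folds for i, site in enumerate(sites)}

    folds = []
    for k in range(n_folds):
        train = sorted(s for s, site in slide_to_site.items() if site_fold[site] != k)
        test = sorted(s for s, site in slide_to_site.items() if site_fold[site] == k)
        folds.append((train, test))
    return folds


# Toy example: 10 slides from 5 hypothetical institutions.
slide_to_site = {f"slide_{i}": f"site_{i % 5}" for i in range(10)}
folds = institution_folds(slide_to_site, n_folds=5)
```

Splitting by institution rather than by slide means that each testing set probes generalization to sites the model has not seen.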
For whole-slide image (WSI) inference, we relied on data from 305 breast carcinoma patients for validation, of whom 269 had infiltrating ductal carcinomas and 156 had Her2+ carcinomas. Visual scores were assessed by one pathologist (RS) and used as the baseline. The WSI accession and tiling workflow used the histolab and large_image packages and included: 1. Tissue detection; 2. Detection and exclusion of empty space and markers/inking; 3. Tiling the slide and scoring tiles at a very low resolution (2 MPP); 4. Analyzing the top 300 tiles [12,13]. Fixing the number of analyzed ROIs ensured a near-constant run time of less than two hours per slide. Low-resolution tiles with a high composition of cellular (hematoxylin-rich) and acellular (eosin-rich) regions received a higher informativeness score. This favored tiles with more peritumoral stroma. Color deconvolution was performed using the Macenko method from the HistomicsTK package [14,15]. Each of the top informative tiles was assigned to one of the trained MuTILs models in a grid-like fashion. This scheme acted as a form of ensembling without increasing the overall inference time.
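The tile ranking and model assignment can be sketched as follows. This is only an illustration under stated assumptions: the informativeness function (product of the hematoxylin-rich and eosin-rich fractions) and the (row + column) modulo rule for the grid-like assignment are hypothetical stand-ins, since the exact formulas are not given here. The stain fractions are random placeholders rather than outputs of color deconvolution.

```python
import numpy as np


def informativeness(hematoxylin_frac, eosin_frac):
    """Hypothetical score: high only when both stain fractions are high."""
    return np.asarray(hematoxylin_frac) * np.asarray(eosin_frac)


def select_and_assign(scores, grid_rows, grid_cols, n_models=5, top_k=300):
    """Keep the top_k highest-scoring tiles and give each a model index.

    The (row + col) % n_models rule is an assumed reading of "grid-like":
    neighboring tiles are handled by different cross-validation models.
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")[:top_k]
    model_idx = (np.asarray(grid_rows)[order] + np.asarray(grid_cols)[order]) % n_models
    return order, model_idx


# Toy slide: a 4 x 5 grid of low-resolution tiles with random stain fractions.
rng = np.random.default_rng(0)
rows, cols = np.divmod(np.arange(20), 5)
scores = informativeness(rng.random(20), rng.random(20))
kept, models = select_and_assign(scores, rows, cols, n_models=5, top_k=8)
```

Because each tile is processed by exactly one model, the number of forward passes equals the number of tiles, which is why this form of ensembling does not add inference time.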
Trained MuTILs models were then used to segment tissue and nuclear components. A Euclidean distance transform was applied to detect stroma within 32 microns of the tumor boundary. The fraction of image pixels occupied by this peritumoral stroma was considered a saliency score. We assessed the following variants of the TILs score (Fig 1):
Number of TILs / Stromal area (nTSa)
Number of TILs / Number of cells in stroma (nTnS)
Number of TILs / Total Number of cells (nTnA)
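As a minimal sketch of how the peritumoral stroma, the saliency score, and the three variants could be computed for a single ROI, assuming a region mask with hypothetical integer label codes and nuclear counts that are already available: the label codes, function names, and the choice to divide by the total stromal area for nTSa are illustrative assumptions, not the released implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

OTHER, TUMOR, STROMA = 0, 1, 2  # hypothetical region label codes


def peritumoral_stroma(region_mask, mpp=1.0, max_dist_um=32.0):
    """Stroma pixels lying within max_dist_um of the nearest tumor pixel."""
    tumor = region_mask == TUMOR
    if not tumor.any():
        return np.zeros(region_mask.shape, dtype=bool)
    # distance_transform_edt measures the distance to the nearest zero
    # element, so tumor pixels are zero (False) and all others nonzero.
    dist_um = distance_transform_edt(~tumor, sampling=mpp)
    return (region_mask == STROMA) & (dist_um <= max_dist_um)


def roi_scores(region_mask, n_tils, n_cells_stroma, n_cells_total, mpp=1.0):
    """Saliency and the three TILs score variants for one ROI."""
    stroma_area_um2 = np.count_nonzero(region_mask == STROMA) * mpp ** 2
    saliency = peritumoral_stroma(region_mask, mpp=mpp).mean()
    return {
        "saliency": float(saliency),
        "nTSa": n_tils / stroma_area_um2,   # TILs per unit stromal area
        "nTnS": n_tils / n_cells_stroma,    # TILs per stromal cell
        "nTnA": n_tils / n_cells_total,     # TILs per cell of any type
    }


# Toy ROI at 1 MPP: a 20-pixel-wide tumor band next to 80 pixels of stroma.
mask = np.full((100, 100), STROMA)
mask[:, :20] = TUMOR
scores = roi_scores(mask, n_tils=40, n_cells_stroma=100, n_cells_total=400)
```

In this toy ROI, the 32 stroma columns nearest the tumor band count as peritumoral, so the saliency is 0.32.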
We obtained these score variants both globally (aggregating region and nuclear counts from all ROIs) and through saliency-weighted averaging of scores obtained for each ROI independently. A simple linear calibration was then used to ensure the scores occupied a similar range as the visual scores.
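The difference between the two aggregation schemes, and the calibration step, can be sketched as follows. The function names are hypothetical, and an ordinary least-squares line fit is assumed for the calibration because the fitting procedure is not specified here.

```python
import numpy as np


def global_score(numerators, denominators):
    """Pool counts over all ROIs first, then take a single ratio."""
    return np.sum(numerators) / np.sum(denominators)


def saliency_weighted_score(numerators, denominators, saliency):
    """Score each ROI independently, then average weighted by saliency."""
    per_roi = np.asarray(numerators, float) / np.asarray(denominators, float)
    return np.average(per_roi, weights=saliency)


def fit_linear_calibration(computational, visual):
    """Least-squares slope and intercept mapping computational to visual."""
    slope, intercept = np.polyfit(computational, visual, deg=1)
    return slope, intercept


# Two toy ROIs: TILs counts, stromal cell counts, and saliency scores.
tils, cells, saliency = [10, 30], [100, 100], [0.1, 0.3]
pooled = global_score(tils, cells)
weighted = saliency_weighted_score(tils, cells, saliency)

# Toy calibration set (synthetic values, not study data).
slope, intercept = fit_linear_calibration([0.1, 0.2, 0.3], [10.0, 30.0, 50.0])
```

Here the pooled score is 0.2 while the saliency-weighted score is 0.25, because the ROI with more peritumoral stroma contributes more to the weighted average.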
Results
Table 1 shows the region segmentation and nucleus classification accuracy on the testing sets. MuTILs achieves high accuracy for stromal region segmentation (DICE=80.8±0.4), as well as the classification of fibroblasts (AUROC=91.0±3.6), lymphocytes (AUROC=93.0±1.1), and plasma cells (AUROC=81.6±6.6), all of which contribute to the computational TILs score. This accuracy is also supported by qualitative examination of model predictions on both the ROIs from BCSS and NuCLS datasets (Fig 3) and the full WSI (Fig 4). Computational TILs score variants had a modest-to-high correlation with the visual scores (Spearman R ranging from 0.55 to 0.58) (Fig 5). Some slides were outliers with discrepant visual and computational scores; the causes for this discrepancy are discussed below. Both global and ROI saliency-weighted scores were significantly correlated with the visual scores (p<0.001).
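For reference, a rank correlation of this kind can be computed with SciPy as sketched below. The data are synthetic placeholders, not values from the study, and the calibration coefficients are arbitrary.

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic placeholder scores for 50 hypothetical slides.
rng = np.random.default_rng(42)
visual = rng.uniform(0, 80, size=50)
computational = 0.01 * visual + rng.normal(0, 0.2, size=50)

rho_raw, p_raw = spearmanr(visual, computational)

# A linear rescaling with positive slope preserves ranks, so the
# Spearman correlation is the same before and after such a calibration.
calibrated = 200.0 * computational - 10.0
rho_cal, p_cal = spearmanr(visual, calibrated)
```

Because the Spearman coefficient depends only on ranks, a linear calibration with a positive slope changes the range of the computational scores without affecting their correlation with the visual scores.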
We examined the prognostic value of MuTILs on infiltrating ductal carcinomas and Her2+ carcinomas. While we had access to visual scores from the basal cohort, the number of outcomes was limited, and neither visual nor computational scores had prognostic value. Progression-free interval (PFI) is the endpoint used per recommendations from Liu et al. for TCGA, with progression events including local and distant spread, recurrence, or death [16]. First, we examined the Kaplan-Meier curves for patient subgroups using a TILs-score threshold of 10% for the stromal TILs score and the median value for the nTnA computational score variant (Fig 6). Both visual and computational scores had good separation within the infiltrating ductal cohort, although only the nTnS and nTnA computational scores had significant log-rank p-values (p=0.009 and p=0.006, respectively). Within the Her2+ cohort, all metrics had good separation on the Kaplan-Meier curves, although the visual score had a borderline p-value. All computational scores were significant within this cohort (p=0.018 for nTSa, p=0.002 for nTnS, and p=0.006 for nTnA).
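The subgroup comparison can be illustrated with a minimal, self-contained sketch of the Kaplan-Meier estimator and a two-group log-rank test. This is a textbook implementation written for illustration, not the software used for the analysis, and the four-patient data set is synthetic.

```python
import numpy as np
from scipy.stats import chi2


def kaplan_meier(time, event):
    """Return distinct event times and the Kaplan-Meier survival estimate."""
    time, event = np.asarray(time, float), np.asarray(event, bool)
    event_times = np.unique(time[event])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)
        n_events = np.sum((time == t) & event)
        s *= 1.0 - n_events / at_risk
        surv.append(s)
    return event_times, np.array(surv)


def logrank_test(time, event, group):
    """Two-group log-rank test; group is boolean (True = high score)."""
    time = np.asarray(time, float)
    event = np.asarray(event, bool)
    group = np.asarray(group, bool)
    obs_minus_exp, variance = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & group).sum()
        d = ((time == t) & event).sum()
        d1 = ((time == t) & event & group).sum()
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:  # hypergeometric variance is undefined for a single subject
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = obs_minus_exp ** 2 / variance
    return stat, chi2.sf(stat, df=1)


# Synthetic example: split patients at the median computational score.
time = np.array([1, 2, 3, 4])
event = np.array([1, 1, 1, 1])
score = np.array([0.05, 0.10, 0.30, 0.40])
high = score > np.median(score)
stat, p_value = logrank_test(time, event, high)
km_times, km_surv = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

The test statistic is compared against a chi-squared distribution with one degree of freedom, and swapping the group labels leaves it unchanged.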
We also examined the prognostic value of the continuous (unthresholded) TILs scores using Cox proportional hazards regression, with and without controlling for clinically salient covariates including patient age, AJCC pathologic stage, histologic subtype, and basal status (Table 2). Within the infiltrating ductal cohort, the only metric with significant independent prognostic value on multivariable analysis was the nTnS computational score. Within the Her2+ cohort, the visual score was not independently prognostic (p=0.158), while the computational scores all had independent prognostic value, with the most prognostic being the nTnS variant (p=0.003, HR<0.001). Saliency-weighted ROI scores almost always had better prognostic value than global computational scores.
Discussion
MuTILs is a concept bottleneck model; it learns to predict the individual components that contribute to the TILs score (i.e., peritumoral stroma and TILs cells) and uses those to make the final predictions [17]. This setup makes its predictions explainable and helps identify sources of error.
The region constraint helped provide context for the nuclear predictions at high resolution, which helped reduce misclassification of immature fibroblasts and plasma cells as cancer (Fig 7). A qualitative examination of slides with discrepant visual and computational TILs scores shows there are three major contributors to discrepancies:
Misclassifications of some benign or low-grade tumor nuclei as TILs.
Variations in TILs density in different areas within the slide, which causes inconsistencies in visual scoring. This phenomenon is also a well-known contributor to inter-observer variability in visual TILs scoring [3].
Variable influence of tertiary lymphoid structures on the WSI-level score.
Our results show that the most prognostic TILs score variant (nTnS) is derived from dividing the number of TILs cells by the total number of cells within the stromal region. The visual scoring guidelines rely on the nTSa, which is reflected in the slightly higher correlation of the nTSa variant with the visual scores compared to nTnS [2]. So why is nTnS more prognostic than nTSa? There are two potential explanations. First, it may be that nTnS is better controlled for stromal cellularity since it would be the same in low- vs. high-cellularity stromal regions as long as the proportion of stromal cells that are TILs is the same. Second, nTnS may be less noisy since it relies entirely on nuclear assessment at 20x objective, while stromal regions are segmented at half that resolution.
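The first explanation can be made concrete with a small worked example, using hypothetical counts for two stromal regions of equal area in which 20% of the stromal cells are TILs but the overall cellularity differs fourfold.

```python
def ntsa(n_tils, stromal_area_um2):
    """Number of TILs per unit stromal area."""
    return n_tils / stromal_area_um2


def ntns(n_tils, n_stromal_cells):
    """Number of TILs per cell in the stroma."""
    return n_tils / n_stromal_cells


# Hypothetical regions with the same area and the same TILs proportion.
area_um2 = 10_000.0
low_cellularity = {"tils": 20, "cells": 100}
high_cellularity = {"tils": 80, "cells": 400}

ntns_low = ntns(low_cellularity["tils"], low_cellularity["cells"])
ntns_high = ntns(high_cellularity["tils"], high_cellularity["cells"])
ntsa_low = ntsa(low_cellularity["tils"], area_um2)
ntsa_high = ntsa(high_cellularity["tils"], area_um2)
```

Under these assumed counts, nTnS is 0.2 in both regions, whereas nTSa is four times higher in the high-cellularity region even though the composition of the infiltrate is identical.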
Finally, we note that this validation was done only using the TCGA cohort, and future work will include validation on more breast cancer cohorts. In addition, we note that MuTILs has limited ability to distinguish cancer from normal breast tissue at low resolution, which may necessitate manual curation of the analysis region, especially for low-grade cases.
Conclusion
MuTILs is a lightweight deep learning model for reliable computational assessment of TILs scores in breast carcinomas. It jointly classifies tissue regions and cell nuclei at different resolutions and uses these predictions to derive patient-level TILs scores. We show that MuTILs predictions generalize well for the predominant tissue and cell classes relevant for TILs scoring. Furthermore, computational scores are significantly correlated with visual assessment and have strong independent prognostic value in infiltrating ductal carcinoma and Her2+ breast cancer.
Data Availability
The BCSS and NuCLS datasets used for training and validation are publicly available, and so are the TCGA clinical data. The Cancer Prevention Study-II data is available via the American Cancer Society (https://www.cancer.org/).
Acknowledgments
This work was supported by the U.S. NIH NCI grants U01CA220401 and U24CA19436201. We acknowledge support from Dr. David Gutman and the American Cancer Society, including Dr. Mia M. Gaudet, Dr. Samantha Puvanesarajah, Dr. Lauren Teras, James Hodge, and Elizabeth Bain.