Abstract
Spatial quantification is a critical step in most computational pathology tasks, from guiding pathologists to areas of clinical interest to discovering tissue phenotypes behind novel biomarkers. To circumvent the need for manual annotations, modern computational pathology methods have favoured multiple-instance learning approaches that can accurately predict whole-slide image labels, albeit at the expense of losing their spatial awareness. We prove mathematically that a model using instance-level aggregation could achieve superior spatial quantification without compromising on whole-slide image prediction performance. We then introduce a superpatch-based measurable multiple instance learning method, SMMILe, and evaluate it across 6 cancer types, 3 highly diverse classification tasks, and 8 datasets involving 3,850 whole-slide images. We benchmark SMMILe against 9 existing methods, and show that in all cases SMMILe matches or exceeds state-of-the-art whole-slide image classification performance while simultaneously achieving outstanding spatial quantification.
1 Introduction
Spatial predictions are critical for computational pathology. For instance, pathologists assess tissue samples visually, and explainable artificial intelligence (AI) tools are expected to produce spatial maps to guide their attention [1–3] The discovery of spatial associations between tissue phenotype and the corresponding genotype can provide biological insights [4, 5], guide biomarker discovery [6, 7], and facilitate downstream tasks such as spatially resolved sequencing [8, 9].
A key bottleneck in the development of spatially-aware computational pathology models is the need for detailed spatial annotations, often unfeasible due to the vast scale of gigapixel images and the need for specialized domain knowledge [10]. Multiple Instance Learning (MIL) has emerged as the leading learning paradigm for whole-slide imaging (WSI) analysis [11], due to its ability to efficiently utilize patient-level labels, which are readily obtainable from pathology reports.
The vast majority of MIL-based computational pathology models adopt a so-called representation-based methodology with an attention mechanism to identify highly discriminative tissue regions, for example for cancer detection and diagnosis [12]. These approaches have been hugely successful, achieving excellent slide-level classification performance. However, attention maps produced by representation-based MIL approaches are often imprecise and unreliable for spatial analysis [13, 14]. This limitation significantly hampers the utility of these methods in scenarios that require accurate phenotypic descriptions beyond a simple global label.
In this manuscript we introduce a new method that performs accurate spatial quantification concurrently with WSI classification, called SMMILe (Superpatch-based Measurable Multiple Instance Learning method). The method is based on instance-based MIL equipped with an attention mechanism, a type of MIL with remarkable localization ability [15]. To address the potential limitations of instance-based MIL approaches, specifically their inferior WSI-level prediction capability and misdetection rate, SMMILe uses Neural Image Compression (NIC) [16], two novel instance-based comprehensive attention modules, and a Bayesian refinement network. We investigate the capabilities of SMMILe on 3 diverse families of pathology tasks in WSI analysis, including metastasis detection, subtype prediction, and grading, presented as binary, multi-class, and multi-label classification tasks. We study 8 datasets for 6 cancer types (Fig. 1a), including lung, renal, ovarian, breast, gastric, and prostate cancer. We demonstrate that SMMILe achieves state-of-the-art classification performance at the WSI level across cancer types (Fig. 1b). Importantly, it simultaneously provides spatial predictions that surpass 9 state-of-the-art computational pathology methods by a large margin (Fig. 1c).
2 Results
2.1 Instance-based learning produces highly skewed attention maps by design
Despite sharing a foundational MIL framework, instance-based (IAMIL) and representation-based (RAMIL) multiple-instance learning methods diverge in their methodologies for assigning attention scores to individual instances. Understanding these differences is critical to design a framework that has the best of both worlds. With that objective, we analyze the mathematical formulation of the different frameworks by means of three theorems and a targeted synthetic experiment (Section 4.1).
First, we find that, mathematically, RAMIL and IAMIL share a clear mutual link: RAMIL with the most commonly used linear attention is strictly equivalent to Logit-based attention MIL (LAMIL), a variant of IAMIL. The distinction in attention score allocation between LAMIL and IAMIL is closely related to the properties of the output layer’s activation function.
Second, we find that as a consequence of its intrinsic mathematical formulation, IAMIL assigns relatively lower attention scores to low-discriminative and non-discriminative instances, and higher scores to those with highly discriminative power, compared to the allocation patterns observed in LAMIL.
To examine the distinctions between RAMIL and IAMIL in the practical setting, we create a distinguishable synthetic dataset with a ‘positive’ and a ‘negative’ class adhering to the MIL settings (Fig. 2a, Supplementary Table 1). Then, we utilize the architecture of AB-MIL [17] as the basis for RAMIL and modify it for IAMIL by conducting attention pooling at the instance level (Fig. 2b, see Methods for details of theoretical findings and synthetic experiments). As shown in Fig. 2c,d, we find that IAMIL tends to allocate attention scores to highly discrimina-tive instances that are twice higher than RAMIL. However, other positive instances with lower discriminative power have lower attention scores in comparison to those in RAMIL. In addition, while the attention scores for most of the negative instances in IAMIL fall below 0.003, RAMIL assigns a considerable portion above 0.01 – indicating that the negative instances are assigned much lower attention scores by IAMIL as compared to RAMIL.
Together with our theoretical findings, these observations highlight the key limitation of IAMIL: it produces highly skewed attention maps, focusing only on a limited subset of highdiscriminative regions within WSI, leading to a decreased recall rate for the relevant tissue regions. Solving this critical shortcoming is the main barrier for IAMIL to achieve its potential as a superior foundational approach for accurate spatial quantification in WSIs.
2.2 Whole slide image classification
Building on from these results, we developed SMMILe, a superpatch-based measurable multiple instance learning method. SMMILe is designed to benefit from the spatial awareness of IAMIL, and equipped with custom modules that address the shortcomings identified previously.
SMMILe comprises a convolutional layer, an instance detector, an instance classifier, and five novel modules (i.e., the slide pre-processing, the consistency constraint, the parameter-free instance dropout, the delocalised instance sampling, and MRF-based instance refinement). The convolution layer for the instance embeddings enhances their local receptive field. The instance detector, designed with multiple streams, identifies the significance of each instance’s embedding for different categories, facilitating multi-label classification tasks. The classifier assigns each instance’s embedding to its respective category. Ultimately, the bag-level (WSI) classification is determined by aggregating the product of detection and classification scores from all instances (patches).
We evaluated the WSI-level classification performance of SMMILe on eight distinct cancer datasets across three different categories of pathology tasks of increasing complexity.
The simplest category was binary classification. Both instances and bags are categorized into only positive and negative classes (e.g. presence of cancer or not). As an example, we trained the framework to detect breast lymph node metastases using the well-studied Breast (Camelyon16) dataset [18]. We also studied multi-class classification, in which instances and bags can belong to one of a number of positive categories, or to the negative class (e.g. different cancer subtypes vs. no cancer). As an example, we studied four different examples of cancer subtyping tasks: Lung (TCGA-LU, non-small cell lung cancer subtyping) [19, 20], Renal-3 (TCGA-RCC, renal cell carcinoma subtyping, three categories) [21–23], Renal-4 (IH-RCC, renal cell carcinoma subtyping, four categories), and Ovarian (UBC-OCEAN, ovarian cancer subtyping) [24, 25]. Finally, we considered the multi-label classification case, which remains relatively unexplored in previous computational pathology literature. Here a bag may contain instances of multiple positive categories simultaneously, in addition to negative instances (e.g. mixed-type tumour). We evaluated three datasets for heterogeneity tumour analysis: Gastric (TCGA-STAD, gastric cancer subtyping) [26], Gastric Endoscopy (IH-ESD, histotype classification for endoscopic submucosal dissection specimens), and Prostate (SICAPv2, Gleason grading) [27].
All datasets were randomly divided at the patient level into five subsets to facilitate five-fold cross-validation, with the results reporting the mean and variance for each metric.
We benchmarked SMMILe against the two fundamental attention-based MIL methods (RAMIL and IAMIL), in addition to five current state-of-the-art MIL-based WSI classification methods: CLAM [28], DSMIL [29], TransMIL [30], DTFD-MIL [31], and AddMIL [13]. Moreover, we consider two NIC-based WSI classification methods: the standard neural image compression (NIC) approach [16], and NIC-WSS [32], an enhanced variant optimized for regional segmentation. Except for the already tessellated Prostate dataset, we process these WSIs into patches without overlapping. Patch embeddings are extracted from an identical layer (the third residual block) of the ResNet-50, which has been pre-trained on the ImageNet dataset. This approach is uniformly applied across all baselines, including two NIC-based methods initially utilizing an encoder pre-trained on histopathological datasets, paralleling the encoder used in SMMILe to ensure equitable comparison (see Methods for details of datasets and implementation).
SMMILe consistently surpasses comparative methods in WSI classification across nearly all datasets for both accuracy and macro AUC score (Fig. 3a,c). SMMILe attains accuracy of 87.68%, 94.63%, and 87.32% on the Lung, Renal-3, and Prostate datasets, respectively, exceeding the second-best methods by margins of 3.08%, 2.49%, and 2.92%. In the cases of Breast and Ovarian datasets, while CLAM and DSMIL record the highest accuracy, SMMILe trails narrowly by 1.2% and 0.2% in accuracy, respectively. However, SMMILe secures the top macro AUC scores of 93.21% for the Breast dataset and 90.51% for the Ovarian dataset. The error bars (Fig. 3a) and the box plot shapes (Fig. 3c) further underscore the robustness of SMMILe in WSI classification across different datasets. The two methods based on NIC demonstrate suboptimal WSI classification performance on most datasets. This shortcoming originates mainly from their intrinsic design, which is intended for handling WSI embeddings generated by pretrained encoders which are finely tuned for discriminative features pertinent to specific domains. These results demonstrate that SMMILe can deliver superior and consistent WSI classification performance across a wide range of cancer types and computational pathology tasks.
2.3 Spatial quantification and visualization
Beyond the accurate classification of entire WSIs, we wanted to assess the capabilities of SMMILe for spatial quantification, i.e., patch-level classification, compared to existing methods. Spatial ground truth annotations were available for eight datasets, either in their entirety (Breast, Gastric Endoscopy, and Prostate datasets) or in part (Lung, RCC-3, RCC-4, and Gastric datasets). The Ovarian dataset represents a unique case, with annotations available for only a subset of patches within certain WSIs. Consequently, the evaluation of this dataset is confined to the spatial quantification of these specifically annotated patches. None of the spatial annotations were used for model training in any case.
The derivation of patch-level predictions in representation-based attention MIL baselines (RAMIL, CLAM, DSMIL, TransMIL, and DTFD-MIL) relies on the raw attention scores. In the case of NIC and NIC-WSS, patch-level predictions are acquired from grad-CAM [33] outputs, whereas, for the instance-based MIL methods (IAMIL, and AddMIL), predictions are based on instance scores. SMMILe possesses an instance refinement network capable of directly generating patch-level predictions.
Across all evaluated datasets, SMMILe surpasses other methods by substantial margins (Fig. 3b,d and supplementary tables 3-10). It achieves macro AUC scores that either exceed or approach 90% in nearly all datasets. An exception is the Gastric Endoscopy dataset, characterized by its extremely imbalanced patient distribution, where SMMILe still demonstrates superior performance. Specifically, SMMILe outperforms the second-best methods by over 20% on the Lung, RCC-3, and RCC-4 datasets; by more than 10% on the Gastric, and Prostate datasets; by nearly 10% on the Breast, and Gastric Endoscopy datasets. In the case of the Ovarian dataset, given the incomplete and highly imbalanced patch-level annotations (a few non-tumor regions are annotated) of this dataset, several methods report commendable accuracy and macro AUC scores. However, while SMMILe only exceeds that of the second-best method (DTFD-MIL) by 3.29% in the macro AUC score, it significantly outperforms DTFD-MIL by 18.94% and 17.09% in the accuracy and macro F1 score, respectively.
Meanwhile, SMMILe exhibits a significant improvement in macro precision, recall, and F1-score (Supplementary Fig. 4), on the Lung, Ovarian, Renal-3, Renal-4, and Gastric datasets, because WSIs in these datasets are diagnostic slides of primary tumors, encompassing a vast array of both positive and negative instances. Consequently, SMMILe can effectively mitigate the missed detection and false alarm issues that are prevalent with other baselines.
For the Breast dataset, characterized by a lower proportion of tumor area within each WSI, which means the presence of tumors in only a few positive instances in most WSIs, IAMIL manages to achieve commendable patch-level performance as well. However, the improvement of SMMILe in the Prostate dataset is relatively modest when evaluated in terms of precision, and recall. This dataset, comprising a modest collection of biopsy slides (153 in total), from which only a limited quantity of instances can be generated, does not allow the proposed instance refinement network to be fully leveraged due to the sparse availability of instances for training. Nonetheless, SMMILe has a minimum of 17% improvement in macro AUC scores over other baselines in the Prostate dataset, highlighting its superior capability in distinguishing instances belonging to different histotypes.
Additionally, the spatial visualization results (Fig. 4) reveal that both representation-based MIL (RAMIL, CLAM, DSMIL, TransMIL, DTFD-MIL) and NIC-based methods (NIC, NICWSS) incur substantial false positive and false negative errors. Of these, NICWSS demonstrates better consistency with respect to the ground truth. In general, instance-based MIL methods (IAMIL, AddMIL) predominantly identify only a limited number of highly discriminative regions, aligning with our theorem. SMMILe exhibits superior performance in spatial quantification, often producing results nearly indistinguishable from the ground truth, even in challenging multi-label datasets. These results confirm that SMMILe successfully meets both objectives of precise WSI classification and spatial quantification on various datasets.
2.4 Module integration improve SMMILe performance
We conducted comprehensive ablation studies to assess the contribution of each module in SMMILe (Fig. 5a), including the consistency constraint, the parameter-free instance dropout (Fig. 5c), the delocalised instance sampling (Fig. 5d), the instance refinement network (Fig. 5e), and the MRF constraint (Fig. 5e), each denoted as Cons, InD, InS, InR, and MRF, respectively.
From the WSI classification results (Fig. 3e, the first line chart), it is evident that the baseline SMMILe configuration without any modules (index 0) establishes a robust benchmark, attributable to the integration of the NIC feature compression layer with IAMIL. The InD (index 2) and InS (index 3), acting as two forms of WSI-level augmentation strategies, further enhance the performance of SMMILe across most datasets. However, there is an exception with the Breast dataset, which focuses on breast cancer metastasis detection. Given that many WSIs contain only a few positive instances (i.e., cancerous patches), InD and InS might, in certain scenarios, mask all positive instances, adversely impacting the optimization process and the WSI level classification performance of SMMILe. Moreover, the InR (index 4 and index 5) and MRF (index 6) positively impact WSI classification performance by refining the decision boundaries of the instance classifier, albeit to varying extents. Nonetheless, there are exceptions. The WSI classification in the RCC-3 dataset is relatively straightforward, allowing all methods to achieve high results. Elevating macro AUC performance from 99% to 100% is practically unattainable. Additionally, the sparse presence of positive instances in the WSIs of the Breast and Gastric Endoscopy datasets, along with the limited instance scenario in the Prostate dataset, reduces the effectiveness of MRF (index 6). In certain cases, this reduction in effectiveness can even result in decreased WSI classification performance.
From the patch-level spatial quantification results (Fig. 3e, the second line chart), it is apparent that every module of SMMILe significantly enhances the spatial quantification performance of SMMILe. Integrating with Cons (index 1), which is solely enacted on negative bags, leads to a notable enhancement on the Breast and Gastric Endoscopy datasets, specifically, an increase of 2% and 3% in macro AUC score, respectively. The InD (index 2) and InS (index 3) modules report enhancements in macro AUC scores ranging from 1-8%, across various datasets. InR (index 4 and index 5) exerts a considerable impact across almost all datasets; for example, it facilitates a macro AUC score improvement of approximately 11% for the RCC-4 dataset and 17% for the Lung dataset. Meanwhile, by leveraging spatial smoothness, MRF (index 6) contributes to additional improvements in macro AUC scores for all datasets, except for the Breast dataset, where increases of about 1-8% are observed. Furthermore, more detailed metrics regarding spatial quantification (Fig. 3f), including macro precision, recall, and F1-score, show that in most datasets the baseline SMMILe (index 0) achieves good precision (Fig. 3f, the first bar chart), but the recall is poor (Fig. 3f, the second bar chart). This aligns with our theoretical findings, which posit that instance-based attention MIL can accurately identify high-discriminative positive instances, yet often neglects many low-discriminative positive instances. Consequently, it can be noted that each module enhances the recall capability of SMMILe while maintaining high pre-cision, as well as improving the F1 score. It is noteworthy that the InR module (index 4, index 5, and index 6), due to its capability to refine and align the decision boundaries for instance classification, contributes the most significant improvement to the performance of SMMILe in terms of macro precision, recall, and F1-score.
3 Discussion
To date, relying on patient-level diagnostic labels, which are easy to obtain, a vast number of weakly supervised learning approaches, particularly MIL methods, have been developed for WSI analysis. Most of them utilize attention-based representation-level aggregation for precise WSI classification but overlook the spatial quantification required for many critical tasks and discoveries in clinical pathology. Although these representation-based MIL methods can offer spatial interpretability through the visualization of attention scores, which seems capable of achieving spatial quantification, few works quantitatively assess their interpretability. Here, we conducted a detailed evaluation of WSI-level classification and spatial quantification on 9 SOTA weakly supervised WSI classification methods across 8 diverse datasets across 6 cancer types. In addition to common binary and multi-class classification datasets, we also assessed 3 more complex multi-label classification datasets. We demonstrated that although existing methods can achieve satisfactory WSI classification performance, their spatial quantification capabilities are inadequate for clinical applications and downstream analysis.
In this study, we delved into the theoretical underpinnings of how attention scores are allocated differently across MIL methodologies, specifically between RAMIL and IAMIL. These findings were further corroborated by synthetic experimentation. The theoretical exploration and synthetic experimental evidence suggest that IAMIL, despite its inherent challenges (relatively inferior bag-level prediction capabilities and a high precision yet low recall in positive instance detection) holds superior potential for spatial quantification and interpretability in WSI analysis compared to RAMIL. Subsequently, we illustrated that by integrating NIC, comprehensive attention modules, and an instance refinement network into the IAMIL framework, it is possible to develop a robust MIL method, SMMILe. Compared with other SOTA MIL methods, SMMILe not only excels in superior and consistent WSI classification performance but also provides excellent spatial quantification performance across a wide range of CPath tasks (metastasis detection, subtyping, and grading), with only patient-level labels available for model training.
Furthermore, we demonstrated that the advancements in performance can be directly linked to the innovative modules integrated within SMMILe. Specifically, NIC empowers SMMILe by facilitating convolution operations that broaden the local receptive field of instance embeddings, thereby boosting WSI classification performance. Two advanced comprehensive attention modules (parameter-free instance dropout, and delocalised instance sampling) elevate SMMILe’s ability to recognize positive patches with relatively low-discriminative features, improving its ability to differentiate between such patches and those that are negative. Moreover, given that these two modules incorporate stochastic processes, generating unique instance arrangements for each iteration, akin to WSI-level data augmentation also contributes to enhancing the classification performance and robustness of SMMILe at the WSI level. Building on these enhancements, the instance refinement network addresses a prevalent challenge in attention-based MIL methods: the variability in the optimal decision threshold for attention scores at the patch level across different WSIs. This variability complicates the task of selecting a universal threshold. The network markedly bolsters SMMILe’s ability to discriminate at the patch level by learning a unified decision boundary across different WSIs. When combined with the MRF energy constraint, it further improves the spatial coherence of patch-level predictions, ensuring smoother transitions and consistency in classifications across contiguous patches.
This work has certain limitations. While the evaluation has been performed on eight diverse datasets, it was confined to H&E stained slides for histotype-based classification, due to the available datasets; however, spatial quantification for (1) immunohistochemistry stained slides is also a routine pathological analysis task; and (2) genotype-based classification task could provide profound insights for understanding diseases, while the results of spatial quantification can be validated through spatial transcriptomics. Additionally, the potential of SMMILe has not yet been fully explored. Firstly, the scale of individual datasets is relatively small, with the largest dataset, Lung (TCGA-LU), containing under 1,000 WSIs. Training on larger and more diverse datasets could further enhance its performance. Secondly, the current instance embeddings are solely based on the encoder pre-trained on ImageNet. Combining SMMILe with encoders meticulously fine-tuned on large-scale pathological image databases could further boost its effectiveness.
In conclusion, we believe SMMILe possesses significant potential for clinical application: For general pathology diagnostic tasks, SMMILe not only provides slide-level diagnoses but also offers a comprehensive diagnostic rationale. For tasks involving the quantitative analysis of pathological phenotypes, SMMILe enables precise spatial quantification, further supporting extensive retrospective studies aimed at exploring the relationship between the proportions of different pathological phenotypes and patient treatment responses or prognostic evaluations. For the task of discovering new biomarkers, SMMILe has the potential to identify and quantify previously unknown pathological phenotypes correlated with genotypes, thereby facilitating subsequent targeted and quantitative analyses at the genetic level, such as single-cell and spatial transcriptomics sequencing of subregions.
4 Methods
4.1 RAMIL versus IAMIL
We delve into the distinct mechanisms of RAMIL and IAMIL, presenting novel theoretical contributions that elucidate their differences. Central to our investigation are two pivotal questions, addressed through three original theorems and a targeted synthetic experiment: (1) How does the variation in the stage of attention aggregation lead to noticeable differences in the attention score allocation? (2) How do these differences contribute to a reduced false positive rate in IAMIL compared to RAMIL?
Our exploration of these questions reveals that, typically in the training process, IAMIL assigns relatively lower attention scores to low-discriminative and non-discriminative instances, and higher scores to those of high-discriminative, in comparison to the allocation patterns observed in RAMIL.
Furthermore, it also disproves some previous interpretations regarding the inaccurate attention results in RAMIL [13]. Contrary to the belief that the attention score allocation in RAMIL has poor interpretability due to instances contributing positively or negatively to bag-level predictions, or due to instance interactions at the classification stage, our findings suggest a different explanation. In RAMIL, bag-level prediction logits can indeed be precisely decomposed into marginal instance contributions as determined by the instance attention scores. However, the properties of the output activation function result in a diminished contribution of these attention scores to the bag-level prediction score. Consequently, during optimization, the loss constraint is limited for attention score allocation, leading to a part of negative instances receiving relatively high attention scores.
4.1.1 Preliminaries
Let {x1, x2, …, xK} be a bag of instances (patches) extracted from a WSI X ∈ ℝW ×H×3, with the WSI-level label Y as the supervision, where xk ∈ ℝD×D×3 is a patch represented by its raw pixels and K represents the number of instances in this bag. In WSI analysis, representation-based MIL approaches have emerged as the dominant paradigm due to their superior bag-level prediction performance [34–37]. These approaches typically involve three essential steps, which are as follows: (1) Embedding all instances into the same representation space by utilizing the pre-trained encoder e(·), a bag of representations is denoted as {h1, h2, …, hK}, where hk = e(xk). (2) Aggregating these instance-level representations into a bag-level representation H by a permutation-invariant function, e.g., mean, and max pooling. (3) Mapping the bag-level representation to prediction score of each category (P = {P1, P2, …, PC} and C stands for the number of categories) using the linear projection function f (·) with an output activation function σ(·), e.g., sigmoid, and softmax. Take the mean pooling aggregation as an example, the bag-level prediction score P can be represented in the following form: However, representation-based MIL approaches remain a challenge in obtaining individual scores for each instance to ensure interpretability, which is of paramount importance in medical-related applications [38]. An ingenious solution to address this issue is the utilization of attention-based MIL pooling [17], named ABMIL, which incorporates an auxiliary network to generate attention weights for the representation of each instance. This mechanism enables fully trainable and highly flexible MIL pooling, while also providing easily interpretable individual scores for each instance. Consequently, the bag-level prediction score P of RAMIL can be reformulated as: Despite the remarkable achievements of RAMIL, some researchers argued that their visual interpretations are inexact and incomplete [13, 14]. Alternatively, the instance-based MIL approaches adopt the strategy of “mapping first then aggregation”. The bag-level prediction score P with the mean pooling aggregation can be formulated as: Following this strategy, scores for each instance can be naturally obtained, ensuring high interpretability. Nonetheless, the utilization of instance-based MIL for WSI analysis remains limited. This is attributed to the inaccurate instance labels, e.g., the pseudo label of non-discriminative instances, during MIL training, which leads to insufficient training of the linear projection function f (·) responsible for predicting instance scores in Eq. (3), thereby affecting the bag-level prediction [39]. A straightforward and reasonable solution is to introduce the attention mechanism into the aggregation of instance-based MIL. It differs from RAMIL, where attention scores are generated for the representation of each instance. In this case, the attention mechanism is applied directly to the instance scores. This solution assumes that the attention scores could mitigate the negative impact of inaccurate instance scores. As a result, the bag-level prediction score P of IAMIL can be calculated as: where the logit of f (·) for kth instance is denoted as lk = f (hk). This formulation is essentially the same as the Multiple Instance Detection Network (MIDN) [15] commonly used in WSOD. MIDN treats the image as a bag and considers the regions generated by object proposal methods as instances. The network is trained with two branches, one for detection and the other for classification, corresponding to g(·) and f (·) in IAMIL. From Eq. (2) and Eq. (4), the only distinction between RAMIL and IAMIL is identified in the integration stage of the attention mechanism (Fig. 2b). Considering observations from substantial existing related works (Section 1), along with the spatial quantification in both methodologies fundamentally rely on the attention scores (A = {a1, a2, …, aK}), it is reasonable to assume that even this slight distinction between RAMIL and IAMIL can result in notable differences in how attention scores are allocated. This insight prompts us to explore how these two methodologies diverge in their mechanisms for assigning attention scores to individual instances, despite sharing a foundational MIL framework.
4.1.2 Connecting RAMIL with IAMIL
We start by converting the formulation of RAMIL (Eq. 2) into a form that parallels with IAMIL.
RAMIL is rigorously equivalent to a specialized variant of IAMIL, wherein aggregation is conducted at the logit level,
This specific form of MIL termed Logits-based Attention MIL (LAMIL), was initially proposed in [40] (without attention pooling) for WSOD (see Supplementary Proof 1 for details).
4.1.3 The Gradient Descent Process of Attention MIL
Building upon Theorem 1, the divergence between RAMIL and IAMIL is equivalent to the difference between Eq. (4) and the right side of Eq. (5), i.e., where the activation function σ(·) is applied (Fig. 2). To simplify the following derivation, we extend the RAMIL and IAMIL to more general frameworks. This involves utilizing the class-wise attention mechanism and taking the sigmoid function as the output activation function σ(·). Consequently, the attention score ak is expanded to a vector, where denotes the softmax normalized attention score of the k-th instance specific to category c. This configuration broadens the applicability of attention MIL across various scenarios, including multi-label classification tasks.
Moreover, since the softmax function leads to entangled relationships between different attention scores across instances, we use the expression of softmax normalized attention scores to replace each of them and assess the gradient descent process of in RAMIL and IAMIL. Let represent the raw outputs (logits) of the linear pro-jection function f (·), where lk = f (hk) and each signifies the logits value specific to category c. Furthermore, let Lc denote the aggregated logits for category c, it can be defined as: accordingly, the predicted score for category c in RAMIL is calculated as follows: and the prediction score for category c in IAMIL is given by: where. To theoretically explore the variation in the attention score allocation under conditions where RAMIL and IAMIL are optimized to the same level, we posit the following assumptions:
Assumption 1: Except for the two attention networks, denoted as ğ (·) and ĝ (·), both models share identical parameters in their pre-trained encoders e(·) and linear projection functions f (·), which means for any instance xi and category c.
Assumption 2: Upon training with the same dataset and employing an identical loss function, ∃ ğ c(·), ĝ c(·) with weight for any c ∈ [1, C] and k ∈ [1, K − 1], the following two conditions are satisfied: (1) the bag-level predictions of RAMIL and IAMIL for the same inputs are identical, i.e., , for any category c. (2) The gradients of the final losses with respect to the respective network parameters are equivalent,.
If Assumptions 1 and 2 hold, the relationship in magnitude between and is directly proportional to the corresponding magnitude relationship between and , which is valid for any instance xi, i ∈ {1, 2, …, K − 1}:
Theorem 2 addresses the first question, asserting that the difference in the attention score allocation between RAMIL and IAMIL is intrinsically linked to the properties of the activation function σ(·) across the relevant interval (see Supplementary Proof 2 for details).
4.1.4 Properties of Activation Function in Local Intervals
Based on Theorem 2, we can infer that if the absolute distance surpasses the projection of on the tangent line t(l), then the attention score allocated to the i-th instance by RAMIL exceeds that assigned by IAMIL , where t(l) denotes the tangent line to σ(l) at Lc. Conversely, if the projection on the tangent line is greater, the result is reversed.
Also, the activation function (sigmoid) exhibits concavity in the interval [0, ∞) and convexity in the interval (−∞, 0]. This dichotomous property markedly affects the relative magnitudes of and . As a result, the magnitude relationship between them is not fixed but is intricately linked to the interval demarcated by Lc and .
If and , then .Conversely, if , max(Lc, Lint)), then. where Lint is the additional intersection point between t(l) and σ(l).
Typically, the i-th instance is considered to possess high discriminability if and . In contrast, the instances are regarded as relatively low-discriminative or non-discriminative instances. Furthermore, owing to the specific properties of the additional intersection point, the conditions outlined in Theorem 3 cover nearly all instances, and few instances are likely to satisfy the rest condition, i.e., and (see Supplementary Proof 3 for details).
Theorem 3 responds to the second question, highlighting that IAMIL, in contrast to RAMIL, tends to give higher attention scores to high-discriminative instances. Simultaneously, it assigns lower attention scores, which are several times lower, to instances with relatively low discriminability and those non-discriminative instances. Therefore, the instance-level classification results derived from the distribution of attention scores, IAMIL exhibits a very low false positive rate in comparison to RAMIL. However, this also results in only a small part of high-discriminative instances receiving high attention scores, leading to a relatively lower recall rate for positive instances of each category.
4.1.5 The Synthetic Experiment
In the previous section, we assert the near improbability of the condition and . This assertion plays a critical role in delineating the differences in attention score allocation between RAMIL and IAMIL. Consequently, to empirically observe the distributions of Lc, and in both RAMIL and IAMIL, as well as to examine the distributions of the raw and softmax normalized attention scores (, and ), we construct a synthetic dataset that follows the MIL setting (Fig. 2a, Supplementary Table 1).
For this synthetic MIL dataset, we generate a set of bags with corresponding bag-level labels where each bag consists of K instances, denoted as . Each instance is a vector with d dimensional features, . To simplify, this synthetic MIL dataset is defined with two categories. Consequently, we establish three types of instance distributions, used to generate discriminative (positive) instances for two categories, and non-discriminative (negative) instances. Each instance distribution is comprised of Gaussian distributions with different means and variances for each of the d dimensions, from which we sample each feature of instances. Therefore, the bag of each category is composed of a random proportion r of positive instances belonging to that category, supplemented with the remaining negative instances. It is important to note that the positive instance distributions for the two categories are significantly different, while their distributions, compared to the negative instances, are similar but non-overlapping. This setup is designed to synthesize positive instances with high and low discriminative features. Ultimately, the dataset comprises a total of 2,00 bags, each containing 1,000 instances. The ratio of positive instances within each bag is randomly sampled from a range of 0.1 to 1. The dataset is evenly distributed across the two categories, with each category consisting of 100 bags.
For model training, we utilize the architecture of AB-MIL[17] as the basis for RAMIL and modify it for IAMIL by conducting attention pooling at the instance level. Given that the instance is not an image, we omit the embedding step. Only the linear projection function f (·), with 4 hidden nodes, and the gated attention network g(·), with 8 hidden nodes in the first layer and 4 hidden nodes in the second layer, are used and trained. The loss function employed is binary cross entropy computed for each category. The Adam optimizer is used, with the learning rate set to 0.001. To ensure a consistent level of convergence between RAMIL and IAMIL, the training process is stopped when the loss decreases below a fixed threshold, specifically 0.05.
We can observe that when the model converges, the values of |Lc| for most bags are greater than 5 (Supplementary Fig. 3(a)), resulting in corresponding |Lint| values surpassing 144 (Supplementary Fig. 2(c)). The values of for most negative instances are less than 5 (Supplementary Fig. 3(b)), which means they are positioned within the interval demarcated by Lc and Lint. Also, the values of for a few positive instances are greater than |Lc| (Supplementary Fig. 3(c)). Following Theorem 3, this indicates that IAMIL assigns lower attention scores to the majority of negative instances, while a minority of positive instances receive higher attention scores compared to those assigned by RAMIL. This distinction is also evident from the distribution differences of and in RAMIL and IAMIL (Supplementary Fig. 3(d-g)).
4.2 SMMILe
Reiterating the definition in Section 4.1, a set of patches {x1, x2, …, xK}, each of size D × D, are extracted from a WSI X of size W × H, with the corresponding label Y . The objective of SMMILe is to learn a transformation function that maps the set of patches {x1, x2, …, xK} to the WSI-level label Y. Concurrently, it also aims to predict instance-level labels y1, y2, …, yK for each individual patch.
In MIL settings, supervision is exclusively available at the WSI level. Previous research has predominantly focused on binary or multi-class classification, where each WSI is assigned to a single category, i.e., Y is represented as a scalar in binary classification or as a C-dimensional onehot vector in multi-class classification. In this paper, we expand SMMILe to accommodate multilabel classification. Consequently, Y = {Y1, Y2, …, YC} is configured as a C-dimensional vector, with each element Yc being a binary indicator that is independently distributed, representing the presence or absence of each category in this WSI. This adaption aligns SMMILe with more general CPath scenarios, capturing multiple phenotypic categories that may concurrently exist in a single WSI.
4.2.1 Network Architecture
The proposed network architecture (Fig. 5) begins with a pre-trained encoder e(·), mapping all instances {x1, x2, …, xK} to a uniform embedding space {h1, h2, …, hK}. Given the large number of instances contained within each WSI, the encoder training becomes time-consuming and computationally strenuous, leading to the parameters of the encoder generally being kept frozen. In this paper, we employ the ResNet-50 [41], pre-trained on ImageNet [42], as our encoder e(·). Feature maps are extracted following the third residual block and aggregated through global average pooling, producing a 1024-dimensional embedding for each instance. This pre-trained encoder has been demonstrated to be effective in numerous RAMIL studies for CPath [28, 30, 31, 43].
Subsequently, a convolutional layer cov(·) is introduced, which further maps the instance embeddings to a lower dimension. The parameters of cov(·) are trainable for increasing the flexibility of the entire MIL framework. While existing works often employ a linear projection layer for this function, we replace it with a convolutional layer to enhance the representation ability of our framework for downstream tasks by introducing WSI-level local receptive fields. Similar to NIC [16], we reposition the instance embeddings {h1, h2, …, hK} according to their positions in the WSI, and fill other positions with zero embeddings, creating a compressed WSI to enable the convolution operation. We utilize 128 convolutional kernels with size 3 × 3 and padding operation, ensuring that the size of the compressed WSI is main-tained after convolution, i.e., , where . Then, the compressed embeddings of all instances can be obtained from based on their respective positions. It is important to note that when the size of convolutional kernels is set to 1 × 1, cov(·) degrades to a linear projection layer, which is particularly effective for specific MIL tasks with very limited positive instances in each bag, such as cancer metastasis detection.
In the final stage, the compressed embedding of each instance , serves as the input for an instance detector g(·) and an instance classifier f (·). The instance detector, analogous to the attention network in RAMIL frameworks, utilizes the gated attention mechanism [17]. It com-prises three linear projection layers with 64, 64, and C hidden nodes, respectively, assigning category-wised raw attention scores to each instance. Then these raw attention scores of each category are normalized via softmax function across all instances, resulting in detection (attention) scores , where the sum of them remains invariant to K and equals to 1. The instance classifier f (·) is a linear projection layer with C hidden nodes. It maps the compressed embedding of each instance to category-related scalars . These scalars are then normalized over categories using the softmax function for multi-class classification, or the sigmoid function for binary and multi-label classification, yielding the classification scores for each instance.
The bag-level prediction score Pc for each category is then obtained by taking the dot product of classification and detection scores, and summing them up, as defined in the aggregation formula in Eq. (4).
4.2.2 Instance-based Comprehensive Attention
To enhance the comprehensive attention capability of SMMILe toward all discriminative instances, we adhere to the traditional MIL by categorizing bags into two types: negative bags and positive bags. Negative bags do not contain any positive instances of any category, as exemplified by WSI of normal tissue; whereas positive bags include positive instances of one or several categories. We propose (1) an attention consistency constraint for negative bags; (2) a parameter-free instance dropout module; and (3) a superpatches-based delocalised instance sampling module for positive bags. It is worth noting that the instance dropout and instance sampling modules presented in this section introduce diversity in instance combinations of each bag, effectively serving as two forms of bag-level augmentation. This, in turn, enhances the performance of bag-level predictions.
Consistency Constraint
A bag is classified as negative only if none of the instances within it belong to a positive category. Thus, the classification of negative bags should not rely on a subset of instances but rather on all instances. Instead of applying the same attention mechanism to both negative and positive bags like previous RAMIL approaches, we introduce a consistency constraint for the attention mechanism, which restricts all instances in a negative bag should having the same attention score. This consistency loss is defined as follows: where . By applying this MSE loss penalty, the attention scores across all instances in a negative bag become uniform, effectively ensuring that no individual instance makes special contributions to any category. This uniformity explicitly boosts the classification accuracy for negative bags as well as the recognition ability of SMMILe to negative instances, thereby implicitly improving the comprehensive attention for discriminative instances.
Parameter-Free Instance Dropout
The comprehensive attention capability of SMMILe is limited by focusing mainly on high-discriminative instances and neglecting others. An intuitive idea is that during the training process, omitting high-discriminative instances while retaining bag-level supervision could encourage the model to focus on the remaining discriminative instances. This approach introduces an additional challenge: determining the ideal instance dropout rate. Specifically, excessive instance dropout during early, unstable training phases may obstruct model convergence. In contrast, a low dropout rate in stable phases offers limited benefits. Additionally, with substantial variation in positive instance proportions across different bags, selecting a uniform dropout rate effective for all bags is unfeasible.
To resolve this issue, we design a parameter-free instance dropout module (Fig. 5c) Instead of applying dropout to attention scores, which do not directly reflect the contribution of each instance towards the bag-level prediction score, this module targets the instance scores (i.e., the product of classification and detection scores for each instance), which have strict marginal contributions to the bag-level prediction score [13]. For a set of instance scores of a bag, where , and , we first apply Min-Max normalization to obtain normalized instance scores . Then, a corresponding set of random floating-point numbers , η ∈ [0, 1] is generated. We compare them pairwise to obtain the instance drop masks . Finally, these instance drop masks are applied to the instance scores, and the bag-level prediction score with instance dropout for category c is computed as: where denoting an Iverson bracket. It can be observed that the proposed instance dropout module does not require any additional hyperparameters. The decision to drop an instance is based on its instance score . The higher , the less likely it is to meet the condition , making it more prone to be dropped.
Superpatch-based Delocalised Instance Sampling
Instance sampling emerges as another viable solution for enhancing the comprehensive attention capability of SMMILe. By performing multiple rounds of random sampling within a bag, each sampling generates a pseudo-bag composed of a subset of instances. The bag-level supervision is then applied to guide predictions for these pseudo-bags. This approach enables the model to focus on diverse sets of discriminative instances in each pseudo-bag, thus improving its comprehensive attention ability. Nevertheless, this kind of random sampling is uncontrollable and may lead to some pseudo-bags lacking positive instances, thereby introducing substantial noise into the training process of the model. Here, we propose a superpatch-based delocalised instance sampling module (Fig. 5d) to address this issue. Recall the compressed WSI Hnic we constructed for a bag, wherein each pixel corresponds to a patch in the original WSI. We employ the Simple Linear Iterative Clustering (SLIC) [44], a widely-utilized, non-trainable clustering-based partition algorithm, to generate a set of superpatches {SP1, SP2, …, SPS} from Hnic, where S indicates the number of superpatches of each bag ((Fig. 5b)). This leads to patches that are spatially close with similar representations being grouped into the same superpatch. Based on these superpatches, instances of a bag can be divided into S subsets. By conducting T rounds of random sampling with replacement, where each round involves sampling one instance from each subset to create a pseudo-bag with S delocalised instances, a total of T pseudo-bags are generated. The instance sampling is also performed on instance scores directly. For t-th pseudo-bag, we have the sampled instance scores for category c, where is an instance score sampled from superpatch SPs, and the bag-level prediction score of t-th pseudo-bag is calculated as: Owing to the characteristic of superpatch, each pseudo-bag is composed of instances exhibiting varied spatial and representation distributions, providing the necessary diversity to encompass both positive and negative instances. Furthermore, sampling instances randomly from superpatches in each round results in a diverse array of instance combinations. This variety mitigates the overshadowing effect of high-discriminative instances on low-discriminative instances within the same pseudo-bag, consequently encouraging the model to focus more on those low-discriminative instances. Additionally, integrating instance sampling with instance dropout is particularly beneficial in scenarios where positive instances are predominant within a bag, such as subtyping on primary tumor slides. The bag-level prediction score for the t-th pseudo-bag, when instance dropout is applied, is computed as follows: where is the instance drop mask for . This can be regarded as masking high-discriminative superpatches.
Finally, all bag-level predictions generated by SMMILe, i.e., , and , are supervised by bag-level label Yc of each category. The classification loss for each bag is calculated as: where BCE(P, Y) stands for the binary cross entropy loss between the prediction P and the true label Y .
4.2.3 MRF-based Instance Refinement
The aforementioned modules endow SMMILe with the capability to distinguish between positive and negative instances within each bag for different categories. However, due to the diversity among different bags, such as variations in feature distributions and significant differences in the proportions of positive instances, it is nearly impossible to choose a unified decision boundary for instance-level classification across varying bags and categories. Consequently, we design an instance refinement network to align features of instances of the same category within different bags, thereby enabling us to learn a unified instance-level classification decision boundary across diverse bags.
Instance Refinement Network
The instance refinement network is structured with N linear layers {v1(·), v2(·), …, vN (·)}. Each layer is implemented with (C + 1) hidden nodes and employs a softmax activation function for output generation. Here, C represents the number of categories for WSIs, excluding the negative category. For the negative WSI, the label Y is encoded using a C-dimensional vector of zeros. Consequently, the term (C +1) incorporates the negative category to account for instances. These linear layers are assigned the identical task of generating predictions for individual instances but involve different sets of instances with associated compressed embeddings and pseudo-labels for training.
The challenge lies in acquiring pseudo-labels for network training. Leveraging the distinguishing capability of SMMILe for instances of different categories within each bag, we propose an online sample selection and labeling strategy. During each epoch of training, we can obtain a set of instance scores for category c from SMMILe. From this set, we select the top θ percent of instances, labeling them as belonging to category c. This selection is restricted to the categories present in each bag. For negative samples, we first compute the mean score across different categories for each instance, represented as . Subsequently, we select the bottom θ percent of instances from these averages {Ī1, Ī2, …, ĪK }, labeling them as negative, i.e., category (C + 1). Consequently, the first linear layer v1(·) is trained using the compressed embeddings and pseudo-labels of the selected instances, represented as , where J denotes the total number of selected instances, and is the first-round pseudo-label of j-th selected instance.
Building on the concept of self-training, we employ the prediction results of instances from v1(·) to supervise the learning of v2(·), and similarly for subsequent layers, to achieve a higher degree of instance refinement. Specifically, the compressed embeddings of all instances in a bag are fed into vn(·), yielding prediction scores of these instances, denoted by , where is a (C +1)-dimensional vector, where each element represents the probability of the instance belonging to a corresponding category. Then, the proposed online sample selection and labeling strategy is employed here, the selected instances and (n+1)-round pseudo-labels are used to train the subsequent linear layer v(n+1)(·). It is crucial to note that since each linear layer in the instance refinement network can generate predictions for the negative category (C + 1), there is no requirement for a separate selection process for negative samples, except for the training process of v1(·). Also, the instances selected for each linear layer are likely to differ, while the total number J remains constant. Through this strategy, every linear layer in the instance refinement network can be concurrently trained with the SMMILe primary network in each epoch. The refinement loss is defined as: where CE(p, y) stands for the categorical cross entropy loss between the prediction p and the pseudo-label y. The optimization process of the instance refinement network also facilitates learning more uniform discriminative instance features (generated by cov(·)) across different bags, which enhances the bag-level prediction performance of SMMILe.
superpatch-based MRF Constraint
Nevertheless, the proposed sample selection strategy, which treats each instance as an independent entity and only high-scoring instances from each category can be used for supervision, may induce prediction biases in the instance refinement network. Moreover, the current instance refinement network ignores the spatial relationship between instances (patches) within a bag (WSI), which is important for the comprehensive detection of positive instances. Therefore, we introduce a superpatch-based MRF constraint that incorporates local spatial smoothness at the WSI level. This constraint requires the minimization of both the first-order energy within each superpatch and the second-order energy between adjacent superpatches. Consider the n-th linear layer vn(·). The prediction scores for instances within the superpatch SPs are represented as , where |SPs| signifies the count of instances in this superpatch. Furthermore, the prediction score for superpatch SPs is calculated as , and it is surrounded by Ms adjacent superpatches, whose prediction scores are denoted by. The MRF constraint loss for superpatch SPs, incorporating both first-order and second-order energy, is defined as follows: where λ1 and λ2 control the balance between first-order and second-order energy. This constraint can implicitly propagate the pseudo-label information of high-scoring instances to the local regions constrained by superpatches, thereby enhancing the spatial smoothness of instance prediction scores.
4.3 Evaluation Datasets
Breast (Camelyon16) [18], employed for metastasis detection in breast cancer, exemplifies a classic binary classification task in WSI analysis. It comprises 399 WSIs with or without metastasis, each accompanied by detailed pixel-level annotations.
Lung (TCGA-LU) employed for subtyping in non-small cell lung cancer, comprises a total of 937 WSIs, categorizing them into two subtypes: adenocarcinoma (LUAD) [19] and squamous cell carcinoma (LUSC) [20]. The cancerous region of 523 WSIs was entirely annotated at the pixel level by two experienced pathologists and four medical students.
Ovarian (UBC-OCEAN) [25] employed for subtyping in ovarian cancer, comprises a total of 513 WSIs, categorizing them into five subtypes: high-grade serous cancerous (HGSC), low-grade serous cancerous (LGSC), endometrioid cancerous (EC), clear cell carcinoma (CC), and mucinous cancerous (MC). Part of the cancerous, healthy, or necrotic regions of 152 WSIs were annotated, we combined the healthy and necrotic annotations to categorize them as normal tissue.
RCC-3 (TCGA-RCC) includes a total of 660 WSIs, with three subtypes, clear cell RCC (CCRCC) [21], papillary RCC (PRCC) [22], and chromophobe RCC (CHRCC) [23]. The cancerous region of 338 WSIs was entirely annotated at the pixel level by two experienced pathologists and four medical students.
RCC-4 (IH-RCC) collected from the First Affiliated Hospital of Xi’an Jiaotong University, with ethical approval (KYLLSL2021-420). It encompasses 563 WSIs from 168 patients across four RCC subtypes, CCRCC, PRCC, CHRCC, and Renal Oncocytoma (ROCY), with approximately 40 patients per subtype. The cancerous region of 138 WSIs was entirely annotated at the pixel level. Gastric Endoscopy (IH-ESD) collected from the First Affiliated Hospital of Xi’an Jiaotong University, with ethical approval (KYLLSL2022-333). It includes 99 patients with early gastric cancer Endoscopic Submucosal Dissection (ESD) specimens, a total of 286 WSIs. Each WSI was meticulously annotated at the pixel level, with three categories, i.e., tumor, inflammation, and normal tissue.
Gastric (TCGA-STAD) [26], comprising 339 Whole Slide Images (WSIs), was sourced from the TCGA Database. As the TCGA database does not provide detailed classification information for gastric adenocarcinoma, two pathologists with over a decade of experience classified all WSIs following the World Health Organization (WHO) histological classification system [45]. At the same time, they performed a detailed, patch-level annotation on 128 WSIs, identifying three tissue subtypes: highly differentiated (papillary and tubular), poorly differentiated, and mucinous.
Prostate (SICAPv2) [27], a publicly available prostate Gleason grading dataset, includes 153 WSIs labeled with categories G3, G4, G5, and normal tissue. This dataset provides tessellated patches with corresponding coordinates. Although the majority of patches come with annotations, a notable subset is devoid of labels. Two experienced pathologists provided supplementary annotations for these unlabelled patches.
Except for the already tessellated Prostate dataset, we process these WSIs into tiles of 2048×2048 pixels for 40x magnification and 1024×1024 pixels for 20x magnification without overlapping, thereby standardizing the input for consistent model training and evaluation. How-ever, the Breast dataset presents a unique case. Given the variable sizes of tumor metastasis regions in this dataset, where some WSIs exhibit remarkably small tumor regions, we opt for a finer tessellation resolution of 512×512 pixels for the WSIs from this dataset. For the tessellated patches, we assign labels based on the predominant category within their representative regions, provided that pixel-level annotations are available. Additionally, some datasets (Breast, Gastric Endoscopy, and Prostate) include WSIs classified under the “Normal” category, indicating that all tessellated patches from these WSIs are normal (negative instance). Refer to Supplementary Table 1 for statistics of these datasets.
4.4 Implementation details
Patch embeddings used for all methods are extracted from (the third residual block of the ResNet-50, which has been pre-trained on the ImageNet dataset. Where possible, configurations were aligned with the original implementations or their corresponding publications. It is, how-ever, important to underline that the majority of these baselines are not intrinsically designed to accommodate multi-label classification tasks. Thus, modifications were introduced to these methods, with a particular emphasis on enhancing the attention aggregation mechanism, thereby extending their functionality to support multi-label classification. There is an exception, Trans-MIL is based on the self-attention mechanism and cannot be modified to multi-class attention. Therefore, in the experiment, TransMIL is unable to output patch-level predictions in multi-label datasets. Also, a weighted sampling technique is incorporated during the sample selection phase for all baselines, including the proposed SMMILe, to mitigate the issue of class imbalance. Furthermore, except for SMMILe, which possesses an instance refinement network capable of directly generating patch-level predictions, the derivation of patch-level predictions in representation-based attention MIL baselines, such as RAMIL, CLAM, DSMIL, TransMIL, and DTFD-MIL, relies on the raw attention scores. In the case of NIC and NIC-WSS, patch-level predictions are acquired from grad-CAM outputs, whereas for IAMIL, AddMIL, and the variants of SMMILe without integrating with the instance refinement network (Section 2.4), predictions are based on instance scores.
In the configuration of SMMILe, the kernel size of convolutional layer cov(·) is set to 1 × 1 for the Camelyon16 dataset, and 3 × 3 for other datasets. The super-patches are generated using the Simple Linear Iterative Clustering (SLIC) over-segmentation algorithm, as implemented in the scikit-image package. Considering both the high spatial homogeneity and the significant proportion of tumor areas in WSIs of multi-class classification datasets, the initial size of super-patches is set to 5×5 for multi-class datasets and 3×3 for others. Also, the integration of instance sampling with dropout is specifically utilized for multi-class classification datasets. The total number of sampling rounds T is set to 10, and the control parameters for the MRF constraint, λ1 and λ2, are fixed at 0.8 and 0.2, respectively, for all datasets.
The training of SMMILe is divided into two stages. In the first stage, the primary network of SMMILe is trained using the consistency loss ℒcons and the bag-level classification loss ℒcls for a maximum of 200 epochs. In the second stage, the number of linear layers N in the instance refinement network is set to 3, and the sample selection rate θ is established at 10 percent for each category. Both the primary and instance refinement networks are then trained using all loss functions, including ℒcons, ℒcls, ℒref, and ℒmrf, for a maximum of 100 epochs. The Adam optimizer is used with a learning rate of 2e−5, and an early stopping strategy is implemented for both stages. For inference, the bag-level output score Pc for each category and the instance-level output scores from the last linear layer vN (·) are employed as the prediction results for the WSI and the corresponding patches, respectively. For all experiments, we used patient-level 5-fold cross-validation to estimate the predictive performance of each model. For each fold, four-fifths of the data were used to create a train-validation split (90%-10%), and the remaining fifth of the data was used as a test set. We use PyTorch for model implementation and the experiments are performed on a workstation with two NVIDIA 2080 Ti GPUs.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
5 Data availability
TCGA data (TCGA-LU, TCGA-RCC, and TCGA-STAD) including whole-slide images and diagnostic labels are available at https://portal.gdc.cancer.gov. Other publicly available datasets can be accessed in their corresponding data portals: Camelyon16 (https://camelyon16.grand-challenge.org/), UBC-OCEAN (https://www.kaggle.com/competitions/UBC-OCEAN), SICAPv2 (https://data.mendeley.com/datasets/9xxm58dvs3/1), The pixel-level annotations for TCGA-LU and TCGA-RCC are partially available at https://sites.google.com/view/aipath-dataset/home. The rest of the pixel-level annotations for TCGA-LU and TCGA-RCC, patch-level annotations, and fine-grained subtype labels for TCGA-STAD will be released upon publication. Two in-house datasets (IH-RCC and IH-ESD) were collected with IRB approval for the current study, and there are no plans to make them publicly available.
6 Code availability
The code for training SMMILe will be released upon publication. We have documented the details of model architecture and training, guaranteeing the reproducibility of SMMILe within the research community.
Supplementary
1 Proofs
Revisiting f (·), typically designed as a linear projection involving two trainable vectors, denoted as w and b. The attention scores are generated by g(·) with softmax normalization over the instances, thereby ensuring that . Expanding Eq. (2), we derive: where the bag-level logit is aggregated as . The resultant form represents a unique variant of IAMIL, which aggregates the raw outputs (logits) of f (·) followed by the activation function σ(·).
Utilizing the chain rule of differentiation, the partial derivative of with respect to is formulated as: also, the partial derivative of with respect to is described as: Moreover, the derivative of in relation to is computed as: and the derivative of with respect to is calculated as: In light of Assumption 2, we deduce: Where . Furthermore, it can be readily demonstrated that is eauiv-alent to This equivalence arises because the values of these two partial derivatives are exclusively dependent on the architecture of the attention networks and the input vector hi (with and , which remain consistent across RAMIL and IAMIL. Consequently, we can infer that:
Firstly, The tangent line t(l) of σ(l) at Lc can be mathematically expressed as: Given that the sigmoid function σ(l) has a single inflection point at l = 0, and asymptotic property towards positive and negative infinity, it can be inferred that the tangent line t(l) intersects with σ(l) at an additional point, denoted as (Lint, σ(Lint)), where t(Lint) = σ(Lint). Importantly, the product of Lc and Lint is negative, i.e., Lc · Lint < 0. The sole exception to this rule occurs when Lc = 0, in which case there is only one intersection point, Lc = Lint = 0. Moreover, it can be readily demonstrated using the geometric approach (Suppplementary Fig.
1) that if , max(L, L)), then , and vice versa.
Secondly, the intersection point Lint can be expressed as: which is related to Lc. However, it does not have an analytical solution. Alternatively, we analyze the properties of Eq. (25) and give a series of discrete solutions using Newton’s method. The first-order derivative of Eq. (25) with respect to Lc can be expressed as: with Taylor expansion of and, we have: The second-order derivative of Eq. (25) with respect to Lc can be expressed as:
with Taylor expansion of and, we have: Since Lc · Lint < 0, if Lc > 0, it follows that (2σ(Lint) − 1) < 0, indicating that ∂Lint/∂Lc < 0 and ∂2Lint/∂2Lc < 0. Similarly, if Lc < 0, the first-order derivative ∂Lint/∂Lc is still less than 0, but the second-order derivative ∂2Lint/∂2Lc > 0. If and only if Lc = 0, ∂Lint/∂Lc and ∂2Lint/∂2Lc are equal to 0. These properties of Eq. (25) indicate that (1) the distance between Lc and Lint increases as the absolute value of Lc increases, and (2) the rate of change for the absolute value of Lint also increases as the absolute value of Lc increases.
To elucidate the relationship between Lc and Lint more explicitly, we employed Newton’s method to solve Eq. (25). For Lc = {0, 1, 2, 3, 4, 5}, the corresponding Lint values are {0.00, −2.22, −6.37, −18.09, −51.60, −144.41}, respectively. For Lc < 0, the Lint values are the negatives of their respective positive counterparts (Supplementary Fig. 2c). The results reveal that as Lc varies, the distance between Lc and Lint changes at an almost exponential rate.
Furthermore, we calculated the ratio of to across different and Lc values (Supplementary Fig. 2a,b). It can be seen that when the absolute value of Lc is relatively large, the ratio of to significantly exceeds 1 in most intervals, highlighting a substantial diver-gence between RAMIL and IAMIL in terms of attention scores allocated for low-discriminative and non-discriminative instances.
Considering the properties of the additional intersection point Lint, along with the expected convergence of σ(Lc) towards either 0 or 1, a relatively large magnitude of |Lc| is often observed, typically exceeding 5 in most cases (Supplementary Fig. 3a). Consequently, this results in |Lint| being greater than 100. It becomes evident that satisfying the condition and is improbable. Moreover, for the majority of instances that meet the condition , max(Lc, Lint)), the value of is expected to be several times greater than that of .
Although these proofs have been proven using the sigmoid function, their applicability extends universally to the softmax function. Under the assumption that the raw outputs of other categories remain constant, and focusing solely on the relationship between the attention score and the prediction score Pc for a single category, the softmax function can be effectively simplified to, where “cst” represents a constant. Notably, this simplified representation of the softmax function exhibits properties analogous to those of the sigmoid function.
2 Hyperparameter Analysis
Here, we investigate the influences of two important hyperparameters for SMMILe on the Lung and Prostate datasets, (1) the initial size of super-patches for SLIC, from 2 × 2 to 6 × 6, and (2) The total number of instance sampling rounds, {1, 5, 10, 15}. We can see that both the WSI-level and patch-level classification performance of SMMILe is influenced by these two parameters (Supplementary Fig. 5). The impact on WSI-level classification performance is minimal, with variations generally within 1%, in contrast, patch-level classification performance experiences changes in the range of 3-4%. The number of instance sampling rounds has a negligible effect on the performance of SMMILe, except when the rounds are set to 1, where lower diversity leads to decreased performance, and settings of 5, 10, and 15 rounds yield similar results. On the other hand, the initial size of super-patches has a more significant impact on the per-formance of SMMILe. Both excessively large (6 × 6) and small (2 × 2) super-patches result in diminished performance, with the optimal parameters differing across datasets. For instance, the Lung dataset, characterized by larger WSI sizes and better spatial consistency, benefits more from larger sizes like 4 × 4 and 5 × 5, whereas in the Prostate dataset, where biopsy slide sizes are smaller, super-patches of 3 × 3 size perform better.
7 Acknowledgements
We acknowledge funding and support from Cancer Research UK and the Cancer Research UK Cambridge Centre [CTRQQR-2021-100012], The Mark Foundation for Cancer Research [RG95043], GE HealthCare, and the CRUK National Cancer Imaging Translational Accelerator (NCITA) [A27066]. Additional support was also provided by the National Institute of Health Research (NIHR) Cambridge Biomedical Research Centre [NIHR203312] and EPSRC Tier-2 capital grant [EP/P020259/1]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.