Abstract
Lupus nephritis (LN) classification has historically relied on labor-intensive glomerular-level labeling of renal structures in whole slide images (WSIs). This annotation burden is tedious and resource-intensive, which limits its scalability and practicality in clinical settings. In response, our work introduces a methodology that uses only slide-level labels, eliminating the need for granular glomerular-level labeling. We created a comprehensive multi-stained LN digital histopathology WSI dataset from the Indian population, the largest of its kind. We developed LupusNet, a deep learning model based on multiple instance learning (MIL), for LN subtype classification. On our dataset, it achieves an AUC of 91.0%, an F1-score of 77.3%, and an accuracy of 81.1% in distinguishing the membranous and diffuse classes of LN.
1. Introduction
Lupus nephritis (LN) is one of the most severe manifestations of systemic lupus erythematosus (SLE), an autoimmune disease, because of its potential for severe renal damage and its intricate diagnostic and classification process. The complexity of the disease is compounded by substantial inter- and intra-observer variability in the histopathological assessment of renal biopsies (Dasari et al., 2019). Because LN classes differ in aggressiveness, precise classification is crucial for assessing fatality risk, predicting long-term prognosis, and determining an effective therapeutic approach.
Deep learning has recently emerged as a powerful tool in medical AI and healthcare, revolutionizing various aspects of medicine, from diagnosis and treatment to drug discovery and patient monitoring (Rajkomar et al., 2018). Digital pathology has significantly advanced due to its capacity to extract intricate patterns and features from complex medical data (Wu and Moeckel, 2023; Ahmed et al., 2022). Improvements in image analysis have led to significant advancements in various aspects of renal pathology, including automated detection and classification of glomerular lesions (Sheehan and Korstanje, 2018; Ginley et al., 2019), and identification of interstitial fibrosis (Zheng et al., 2021a). Advanced imaging techniques and molecular analyses may assist, but standardization and consensus in interpretation remain ongoing challenges.
Traditional LN classification follows a two-step process: first identifying glomeruli types, then classifying LN based on these types. This process depends heavily on detailed glomeruli annotations (Sheehan and Korstanje, 2018; Zheng et al., 2021b). Yet annotating glomeruli on large-scale WSIs is impractical in clinical settings, and the massive size of WSIs combined with memory limitations has led to patching and streaming solutions (Campanella et al., 2019; Pinckaers et al., 2020). Previous studies mainly differentiated LN from non-LN and did not address subtype classification (Wang et al., 2023), which is complicated by similar glomerular types across subtypes and the unequal contribution of glomeruli to classification. Cicalese et al. (2020) proposed an end-to-end LN subtype classification method, but it required manual segmentation on mice biopsies and is not directly applicable to human samples because of differences in physiology and pathology.
In contrast, our work simplifies this process with an end-to-end pipeline that does not rely on glomeruli class labels at any intermediate stage. Multiple Instance Learning (MIL) has been extensively explored in other areas of digital histopathology (Campanella et al., 2019), but little has been reported in renal pathology. While digital pathology has made strides, LN classification research faces challenges such as limited access to datasets and a lack of consensus among medical professionals regarding classification. In light of these considerations, the principal contributions of our work are as follows:
We create a valuable LN dataset to drive computational and medical research in kidney diseases. Featuring multi-stained whole slide images, it is one of the largest collections for lupus nephritis and is part of the India Pathology Dataset (IPD) consortium.
We introduce a novel architecture, LupusNet, an explainable MIL-based model that improves LN subtype classification by integrating gated and multi-head attention, addressing the need to learn the morphological differences between LN classes 4 and 5.
To the best of our knowledge, we present the first end-to-end pipeline for LN subtype classification, designed to achieve efficient diagnosis and classification by relying only on slide-level labels, easing clinical workload and facilitating practical integration.
2. Materials and Method
2.1. Data Acquisition & Description
In this study, biopsy specimens of 166 patients (retrospective and prospective cases) spanning LN classes 1 to 6 from the Nizam Institute of Medical Sciences (NIMS) in Hyderabad, India, were digitized. A total of 540 WSIs were digitized using the Morphle Optimus 6X Scanner, with each WSI captured at a maximum magnification of 40x and stored in the widely used TIFF format.
Within this repository of 540 WSIs, there are four distinct categories of stained images, specifically Hematoxylin and Eosin (H&E), Periodic Acid-Schiff (PAS), methenamine silver Periodic Acid-Schiff (mt-PAS), and silver methenamine Periodic Acid-Schiff (sm-PAS). In this dataset, LN classes 4 (diffuse proliferative) and 5 (membranous) have the highest representation, with 62 and 53 cases, respectively. Class 4 LN displays a varied glomerular appearance characterized by widespread inflammation, cellular proliferation, and diverse lesions, whereas class 5 LN demonstrates a uniform appearance due to immune complex deposition, resulting in a membranous pattern (Weening et al., 2004). Consequently, our study focuses on classifying these two prominent LN classes using PAS-stained slides. PAS staining highlights carbohydrates, glycogen, and glycoproteins, which aids the identification of renal structures.
This India-specific dataset was created to support global collaboration in lupus nephritis research. It adds diversity to existing cohorts and offers insights into potential regional and ethnic variations in the disease.
2.2. Methodology
We aim to learn a function that can predict the presence or absence of a condition within a WSI based on its constituent patches. Mathematically, this problem can be defined as follows: we are provided with a dataset of bag-label pairs $\{(X_i, Y_i)\}$. Each $X_i$ is a collection of instances (patches) within a bag, and $Y_i$ is the label assigned to that bag. Each bag $X_i$ contains a variable number of instances $\{x_1, x_2, \ldots, x_N\} \in X_i$. These instances have labels $\{y_1, y_2, \ldots, y_N\}$ with $y_n \in \{0, 1\}$. However, the labels of individual instances are unknown during the training phase. If any instance in a bag belongs to the positive class, the bag is considered positive. Conversely, if all the instances in a bag belong to the negative class, the bag is considered negative.
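The bag-labeling rule above can be illustrated with a minimal sketch. The function name `bag_label` is hypothetical and is used here only for illustration; in practice the instance labels are unknown at training time, so this rule describes the assumption rather than a step in the pipeline.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """Standard MIL assumption: a bag is positive (1) if at least one
    instance is positive, and negative (0) only if every instance is negative."""
    return int(any(y == 1 for y in instance_labels))


# A bag with one positive patch is positive; an all-negative bag is negative.
print(bag_label([0, 0, 1]))
print(bag_label([0, 0, 0]))
```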
Our methodology extends this formulation to multiple positive classes for LN subtype classification. Unlike lung, brain, and breast datasets, renal pathology primarily focuses on a limited region of interest, particularly the glomerular area, allowing us to use recurrent networks. Glomeruli play a pivotal role in various renal diseases, including LN. Instead of providing MIL with all WSI patches, we exclusively use glomerular patches, enhancing precision by avoiding potential noise. Because glomerular-level labeling is laborious, we aimed to eliminate the need for intermediate glomerular-level labels, which makes weakly supervised approaches an appropriate choice.
Our novel end-to-end MIL architecture for LN classification, LupusNet, works on raw glomerular patches extracted using a fine-tuned YOLOv4 model (Hemmatirad et al., 2023). It has two jointly trained components: (a) a Feature Extractor (f) and (b) a Feature Aggregator (g). f transforms inputs into an information-rich feature space using a ResNet-50 network pre-trained on histopathology images (Kang et al., 2023). We built on the principles of CLAM (Lu et al., 2021), which uses gated attention pooling and instance-level clustering to distinguish positive from negative samples. Gated attention alone, however, cannot fully exploit the uniformity of class 5 lupus nephritis glomeruli, which limits how well it captures their consistent patterns. We hypothesize that adding contextual information among all glomeruli patches will improve performance. To this end, we integrate self-attention and a Bi-LSTM into the MIL framework, enhancing contextual understanding among instances (patches) in a WSI.
Suppose a WSI bag X contains N glomerular patches. The Feature Extractor f transforms each image $x_n \in \mathbb{R}^{224 \times 224 \times 3}$ into a feature vector $h_n \in \mathbb{R}^{1 \times d}$. For N such images, we obtain a matrix $H \in \mathbb{R}^{N \times d}$ (eq. 1). Our feature aggregator can be further divided into three branches: (1) Gated Attention Pooling, (2) Self-Attention + LSTM, and (3) Instance-level Clustering. In Branch 1, the gated attention block assigns an attention score to every instance (eq. 2), followed by instance-level clustering that uses $A_g$ as pseudo-labels for confident instances (Branch 3).
where $W_a$, $W_b$ and $W_c$ are trainable parameters, and the attention scores $A_g$ can be interpreted as the positive-class probability of the instances. $\sigma$ represents the sigmoid function and $\odot$ represents element-wise multiplication. $C_g$ is the output context vector of Branch 1 (eq. 3).
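The gated attention pooling of Branch 1 can be sketched as follows. This is a minimal NumPy illustration of the commonly used gated-attention form (as in CLAM), not the authors' PyTorch implementation. Which of $W_a$, $W_b$, $W_c$ plays which role, the hidden width `m`, and the softmax normalization over instances are assumptions made for this sketch.

```python
import numpy as np


def gated_attention_pool(H, W_a, W_b, W_c):
    """Gated attention pooling over N instance features.

    H:   (N, d) instance feature matrix
    W_b: (d, m) weights of the tanh branch      (role assumed)
    W_c: (d, m) weights of the sigmoid gate     (role assumed)
    W_a: (m, 1) weights producing one score per instance (role assumed)
    Returns attention scores A_g of shape (N,) and context vector C_g of shape (d,).
    """
    tanh_branch = np.tanh(H @ W_b)                 # (N, m)
    gate = 1.0 / (1.0 + np.exp(-(H @ W_c)))        # (N, m), sigmoid
    scores = ((tanh_branch * gate) @ W_a).ravel()  # (N,), element-wise gating
    scores = scores - scores.max()                 # numerical stability
    A_g = np.exp(scores) / np.exp(scores).sum()    # softmax over instances
    C_g = A_g @ H                                  # attention-weighted sum, (d,)
    return A_g, C_g


rng = np.random.default_rng(0)
N, d, m = 5, 8, 4  # illustrative sizes only
H = rng.normal(size=(N, d))
A_g, C_g = gated_attention_pool(
    H, rng.normal(size=(m, 1)), rng.normal(size=(d, m)), rng.normal(size=(d, m))
)
print(A_g.shape, C_g.shape)
```

Because the scores are normalized across the bag, each entry of `A_g` expresses the relative importance of one glomerulus, which is what makes the heatmap-style interpretation possible.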
In Branch 2, H first passes through multi-head attention (MHA), yielding an output $A_s$ that is contextualized across instances. Self-attention (eq. 4) considers the context between every pair of instances, and the multi-head mechanism models several such contextual relationships and dependencies among instances. The outputs of the $n_h$ heads are concatenated, and a linear transformation is applied so that the result matches the shape of the input, $\mathbb{R}^{N \times d}$ (eq. 5). To further process this contextualized information, we employ an LSTM, which uses gating mechanisms and outputs the hidden state of the last time step, in $\mathbb{R}^{1 \times d}$.
where $Q_i$, $K_i$, and $V_i$, for the $i$-th head, are derived using trainable parameters $W_i^Q$, $W_i^K$, and $W_i^V$, and $W_o$ linearly transforms the multi-head outputs. $d_k$ is used for scaling to prevent the dot product from becoming too large, and $C_s$ is the Bi-LSTM-processed output context vector from Branch 2 on $A_s$.
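The self-attention step of Branch 2 can be illustrated with a short NumPy sketch of standard scaled dot-product multi-head attention. It is an illustration under assumptions, not the authors' code: the per-head dimension `d_k`, the weight shapes, and the random initialization are chosen here for the example, and the subsequent recurrent step is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(H, W_q, W_k, W_v, W_o):
    """Scaled dot-product self-attention with several heads.

    H:             (N, d) instance features
    W_q, W_k, W_v: (n_h, d, d_k) per-head projection weights
    W_o:           (n_h * d_k, d) output projection
    Returns a contextualized matrix of shape (N, d).
    """
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = H @ Wq_i, H @ Wk_i, H @ Wv_i     # each (N, d_k)
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))  # (N, N) pairwise attention
        heads.append(weights @ V)                  # (N, d_k)
    # Concatenate the heads and project back to the input width d.
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
N, d, n_h, d_k = 6, 16, 4, 4  # illustrative sizes; n_h = 4 matches the paper
H = rng.normal(size=(N, d))
A_s = multi_head_self_attention(
    H,
    rng.normal(size=(n_h, d, d_k)),
    rng.normal(size=(n_h, d, d_k)),
    rng.normal(size=(n_h, d, d_k)),
    rng.normal(size=(n_h * d_k, d)),
)
print(A_s.shape)
```

Each row of the `(N, N)` weight matrix sums to one, so every glomerulus representation becomes a mixture of all glomeruli in the slide, which is the slide-level context that gated attention alone does not provide.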
Furthermore, we use softmax-normalized learnable parameters $s_0$ and $s_1$ to adaptively aggregate the contributions of each branch's output. A learnable scaling parameter $\gamma$ fine-tunes the contribution of the merged output, introducing an additional degree of freedom in the weighting process (eq. 6). Inspired by attention principles, this approach facilitates contextual understanding and dynamic weighting for effective information extraction from both branches. It parallels the fusion of contextual embeddings from multiple layers in ELMo for downstream tasks (Peters et al., 2018).
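A minimal sketch of this adaptive aggregation, assuming the ELMo-style form in which the two context vectors are mixed with softmax-normalized weights and then scaled by $\gamma$. The function name `adaptive_merge` and the exact placement of $\gamma$ are assumptions for illustration.

```python
import numpy as np


def adaptive_merge(C_g, C_s, s, gamma):
    """Merge two branch outputs with softmax-normalized scalar weights.

    C_g, C_s: context vectors from Branch 1 and Branch 2, same shape
    s:        array [s_0, s_1] of raw (learnable) mixing parameters
    gamma:    scalar (learnable) scale applied to the merged vector
    """
    s = np.asarray(s, dtype=float)
    w = np.exp(s - s.max())
    w = w / w.sum()                    # softmax, so w[0] + w[1] == 1
    return gamma * (w[0] * C_g + w[1] * C_s)


# With equal raw weights and gamma = 1, the merge is a plain average.
merged = adaptive_merge(np.ones(4), 3 * np.ones(4), s=[0.0, 0.0], gamma=1.0)
print(merged)
```

During training, `s` and `gamma` would be updated by gradient descent together with the rest of the network, letting the model decide how much each branch contributes.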
After adaptive aggregation, a binary classifier with a single neuron and a sigmoid activation function estimates the probability, y, of a slide being positive. Binary cross-entropy loss is then computed at the slide level (Branches 1 and 2), while Smooth SVM loss (Lu et al., 2021) is applied for instance-level clustering (Branch 3). The Smooth SVM loss, a generalization of the traditional cross-entropy classification loss, accommodates diverse margin values and temperature scaling strategies, providing flexibility to mitigate overfitting. We chose Smooth SVM loss because it is robust to potential noise in the pseudo-labels. The total loss (eq. 7) is calculated as the weighted sum of both losses, where $H'$ and $A_g'$ are subsets of $H$ and $A_g$ respectively, ŷ is the ground truth, and β is a hyper-parameter.
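The weighting of the two losses can be sketched as below. The exact form of eq. 7 is not reproduced in this text, so the convex combination with β on the slide-level term (the convention used in CLAM) is an assumption. The instance-level loss is passed in as a plain number and the Smooth SVM loss itself is not implemented here.

```python
import math


def binary_cross_entropy(p, target, eps=1e-7):
    """Slide-level BCE for a predicted probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


def total_loss(slide_loss, instance_loss, beta=0.8):
    """Assumed form: beta * slide-level loss + (1 - beta) * instance-level loss."""
    return beta * slide_loss + (1.0 - beta) * instance_loss


slide_loss = binary_cross_entropy(p=0.9, target=1)
# 0.35 stands in for an instance-level (Smooth SVM) loss value.
print(total_loss(slide_loss, instance_loss=0.35, beta=0.8))
```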
3. Results
3.1. Experimentation Details
For a robust evaluation of classification performance, we employed 10-fold cross-validation. All methods were implemented in PyTorch and trained on a single NVIDIA RTX 3080 Ti GPU. The patch size for the YOLOv4-based glomerulus detector was set to 6000 × 6000, and MIL training ran for 50 to 200 epochs with early stopping. We used $n_h = 4$, β = 0.8, a Bi-LSTM hidden dimension of 512, and the Adam optimizer with a learning rate of 1e-4. The batch size was set to 1 for all models.
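A minimal sketch of a 10-fold split is shown below, using only the Python standard library. Whether the original folds were built at the patient or slide level, and whether they were stratified by class, is not stated, so this sketch simply shuffles item indices; the count of 115 corresponds to the 62 class 4 and 53 class 5 cases reported above, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, val_indices) for each of k folds.

    Indices are shuffled once with a fixed seed, then dealt round-robin
    into k folds so that fold sizes differ by at most one.
    """
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


for fold, (train, val) in enumerate(k_fold_indices(115, k=10)):
    print(fold, len(train), len(val))
```

Splitting at the patient level rather than the slide level would avoid leakage when one patient contributes several slides; that choice is left open here because the source does not specify it.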
3.2. Quantitative analysis
We established baselines using a pseudo-labeling approach: lacking detailed glomerulus-level labels, we assigned the whole-slide label to every glomerulus in the slide and tested models such as AlexNet, ResNet, and DenseNet, with ResNet-101 performing best (Table 1). These experiments underscored the challenge of label inconsistency among similar glomeruli in lupus classes 4 and 5, which affects model accuracy and emphasizes the need for alternative methods in the absence of precisely labeled datasets.
We then evaluated a weakly supervised single-branch CLAM variant (CLAM-SB) and our proposed LupusNet on the in-house dataset. Results are presented for two scenarios, in which we input either all the WSI patches or only the glomeruli patches. As shown in Table 1, LupusNet outperforms all baseline models. For CLAM-SB, we observe a marked performance improvement when only glomeruli patches are provided, which reduces the noise seen by the model. LupusNet also outperforms CLAM-SB (GP), improving the F1-score for class 5 LN from 65.17% to 77.03%, highlighting its efficacy in distinguishing the two classes, reducing false positives, and enhancing precision.
L=LSTM; G=Gated Attention; C=Clustering (Instance level)
3.3. Qualitative analysis
Figure 3 illustrates the interpretability of the model on a test sample that contains multiple glomerulus images, showing the attention weight distributions of the model's two branches as heatmaps. Here, MHA (Branch 2) focuses on glomeruli patterns, prioritizing context at the WSI level. This contextualization is crucial for capturing the uniform membranous patterns of class 5 LN (Figure 2) and highlights the importance of MHA for improved classification performance compared to relying only on gated attention (Branch 1). Gated attention exhibits a more diverse focus, which is necessary for class 4 LN and its diffuse proliferation pattern.
Figure 1: LupusNet: Proposed architecture for our lupus nephritis classifier. Gated attention identifies each glomerulus’s importance, while multi-head attention (MHA) discerns their contextual relationships.
Figure 2: Comparison of visual features between subtype samples. (a) involves proliferative changes in the glomeruli, whereas (b) shows thickening of the glomerular basement membrane.
Figure 3: Attention weights of both branches for a class 5 sample: (a) Gated Attention and (b) Multi-head Attention.
4. Ablation Study
In our ablation study, we methodically introduced architectural components to evaluate their individual and combined effects on the model’s performance. Starting from a basic LSTM model, we then integrated Gated Attention and Instance-level clustering. Each addition led to noticeable improvements in performance, as shown in Table 2, with our final model, LupusNet, outperforming all other configurations. This step-by-step process helped us identify the contribution of each component to the model’s overall effectiveness in classifying the two LN classes. We further tuned LupusNet by adjusting the learning rate and the number of attention heads (Figure 4).
Figure 4: Hyperparameter tuning of LupusNet with respect to the learning rate (left) and the number of attention heads (right).
5. Discussion and Conclusion
Our study introduces LupusNet, a MIL-based model for lupus nephritis classification that uses only slide-level labels, eliminating the need for glomeruli-level labels. Because of the limited data size, other MIL-based models that incorporate transformers (Shao et al., 2021) were deemed sub-optimal for our case. However, we recognized the need for self-attention among glomeruli to include context. Our work therefore incorporates this aspect without increasing network complexity, while retaining interpretability for pathologists. This study can serve as a reference for pathologists in addressing inter- and intra-observer variability. It is also relevant to researchers studying renal diseases beyond LN, and it contributes to renal pathology research by creating a digital whole slide image dataset. While LupusNet exhibits promising results, there is room for improvement. Our future work involves improving the glomeruli detection models and feature aggregators, which could extract better contextual information from glomeruli.
Compliance with Ethical Standards
Procedures in studies with human participants adhered to ethical standards set by institutional (NIMS) and/or national research committees (ICMR).
Acknowledgments
We acknowledge IHub-Data, IIIT Hyderabad (H1-002) for financial assistance. We also thank Dr. Manasa Kondamadugu for project coordination, Ms. Ramya Alugam, and Mr. Akula Rajesh Goud for data digitalization and organization.
Footnotes
ekansh.chauhan{at}research.iiit.ac.in, megha_harke{at}yahoo.co.in, lizarajasekhar{at}gmail.com, jawahar{at}iiit.ac.in, vinod.pk{at}iiit.ac.in