Abstract
Recent studies indicate that bladder cancer is among the 10 most common cancers in the world [1]. Bladder cancer frequently recurs, and prognostic judgments may vary among clinicians. Classification of histopathology slides is essential for accurate prognosis and effective treatment of bladder cancer patients, as a favorable prognosis might help to inform less aggressive treatment plans. Developing automated and accurate histopathology image analysis methods can help pathologists in determining the prognosis of bladder cancer. In this study, we introduced Bladder4Net, a deep learning pipeline to classify whole-slide histopathology images of bladder cancer into two classes: low-risk (combination of PUNLMP and low-grade tumors) and high-risk (combination of high-grade and invasive tumors). This pipeline consists of four convolutional neural network (CNN) based classifiers to address the difficulties of identifying PUNLMP and invasive classes. We evaluated our pipeline on 182 independent whole-slide images from the New Hampshire Bladder Cancer Study (NHBCS) [22] [23] [24] collected from 1994 to 2004 and 378 external digitized slides from The Cancer Genome Atlas (TCGA) database [26]. The weighted average F1-score of our approach was 0.91 (95% confidence interval (CI): 0.86–0.94) on the NHBCS dataset and 0.99 (95% CI: 0.97–1.00) on the TCGA dataset. Additionally, we computed Kaplan-Meier survival curves for patients predicted as high-risk versus those predicted as low-risk. For the NHBCS test set, patients predicted as high-risk had worse overall survival than those predicted as low-risk, with a Log-rank P-value of 0.004. If validated through prospective trials, our model could be used in clinical settings to improve patient care.
Introduction
Recent studies indicate that bladder cancer is among the 10 most common cancers in the world [1]. Most bladder cancer cases are urothelial carcinoma. Approximately 75-85% of patients with bladder cancer have non-muscle-invasive bladder cancer (NMIBC). Furthermore, around 50% of NMIBC patients experience one or more disease recurrences, and their treatment differs from that of patients diagnosed with muscle-invasive bladder cancer (MIBC) [2]. Urothelial carcinomas are graded according to the degree of tumor cellular and architectural atypia. The cancer grade has an important role in deciding the treatment plan, so if it is not determined accurately, the patient might undergo unnecessary treatments. The World Health Organization (WHO) 1973 and World Health Organization/International Society of Urological Pathology (WHO/ISUP) classifications are widely used for tumor grading, but these methods have relatively high intra- and inter-observer variabilities [3] [4]. Several studies compared different grading systems and their effect on choosing the best treatment [5] [6] [7] [8]. One study evaluated the WHO 1973 classification among 11 pathologists and found that inter-observer agreement was slight to moderate (κ = 0.19–0.44) [9]. Another study measured inter-observer agreement among 6 pathologists and showed that the WHO/ISUP classification is slightly better than WHO 1973 [10]. Therefore, new methods should be sought to help pathologists diagnose bladder cancer.
Stage and grade of bladder tumors are important criteria in cancer treatment. The cancer stage describes the location of the cancer cells and how far they have grown. Higher stages indicate that the tumor has grown farther away from the surface. Urothelial carcinoma pathologic stages are named Ta (papillary tumor without invasion), Tis (carcinoma in situ (CIS)), T1 (tumor invades the connective tissue under the surface lining), T2 (tumor invades the muscle layer), T3 (tumor invades perivesical soft tissue), and T4 (extravesical tumor directly invades other organs or structures). According to the WHO 2016 classification, NMIBC is divided into three groups: Ta, Tis, and T1, while T2, T3, and T4 are MIBC. In low-grade cases, the cancer cells show less atypia, more closely resemble normal urothelial cells, and grow slowly. Clinical research has demonstrated that the most common bladder tumors are low-grade [11]. On the other hand, high-grade cancer cells show more irregular and atypical morphology and can be found in both NMIBC and MIBC. It is essential to accurately differentiate between low and high grades because different treatments are available for tumors of different grades. For example, high-grade cells in NMIBC need prompt treatment to avoid the spread of cancer.
Papillary urothelial neoplasm of low malignant potential (PUNLMP) was first introduced by the WHO/ISUP in 1998 as a new entity of bladder cancer [12]. PUNLMP and low-grade urothelial carcinoma are two bladder cancer types that are not easily distinguishable based on cell morphology under the microscope. Because of similarities between the two cancer types, the pathologic diagnostic accuracy for their differentiation is about 50% [13]. While the distinction between PUNLMP and low-grade is deemed essential by some pathologists, recent studies showed that separating PUNLMP and low-grade is not clinically crucial [5]. Of note, high-grade tumor cells are found in various tumor stages, and MIBC cancer type is considered high-grade.
Histological classification of bladder cancer has significant implications in the prognosis and treatment of patients. Moreover, detecting and classifying histologic patterns such as PUNLMP and low-grade urothelial carcinoma under the microscope is a time-consuming and challenging task for pathologists. Manual classification of bladder cancer histological patterns has a high error rate due to the similarity of histological features. Therefore, clinical information such as the cancer stage is commonly used for a more accurate prognosis. Automated image analysis using deep learning techniques can assist pathologists in providing faster and more consistent results. Additionally, these techniques can be improved by providing new data and associated annotated labels by several pathologists so that the model can be trained based on the expert opinion of multiple pathologists. In recent years, deep learning models, such as convolutional neural networks (CNNs), have been applied to a variety of computer vision tasks as well as biomedical applications [14] [15] [16]. CNN-based models have shown great promise in learning morphological characteristics of different cancer types from histological images [17] [18] [19] [20] [21].
Automated image analysis methods to classify and visualize various cancer patterns in high-resolution whole-slide images can help pathologists avoid errors and reduce their assessment time. In this study, we introduced a CNN-based model for the classification of urothelial bladder cancer based on whole-slide histopathology images to distinguish between low-risk and high-risk groups where low-risk class includes PUNLMP and low-grade cases, and the high-risk class includes high-grade and invasive cases.
Materials and methods
Datasets
For the model development and evaluation, we used images from the New Hampshire Bladder Cancer Study or NHBCS [22] [23] [24]. Risk factors for bladder cancer have been widely explored in previous reports from this study [25]. For external evaluation, we utilized histology images from The Cancer Genome Atlas (TCGA) [26]. The details of these datasets are included below.
New Hampshire Bladder Cancer Study (NHBCS) Dataset
This dataset contains 838 whole-slide images from 1994 to 2004 as part of the NHBCS [27]. These hematoxylin and eosin (H&E) stained surgical resection slides were digitized by Aperio AT2 scanners (Leica Biosystems, Wetzlar, Germany) at 20× magnification (0.50 μm/pixel).
The Cancer Genome Atlas (TCGA) Dataset
We collected 378 whole-slide images from TCGA for external validation. The distribution of these whole-slide images used in this study is summarized in Table 1.
Data Annotation
The tumor histologic subtypes in the NHBCS dataset were confirmed independently by two expert pathologists (A.S. & B.R.) from the Department of Pathology and Laboratory Medicine at Dartmouth–Hitchcock Medical Center (DHMC) based on a standard histopathology review. In the NHBCS dataset, 637 whole-slide images were categorized into Papillary Urothelial Neoplasm of Low Malignant Potential (PUNLMP), Low-Grade Papillary Urothelial Carcinoma (low-grade, noninvasive), High-Grade Papillary Urothelial Carcinoma (high-grade, noninvasive), and Invasive Urothelial Carcinoma (IUC). Among these slides, 34 were classified as Carcinoma in Situ (CIS). Because of the small number of available CIS cases, we removed them from our study. In addition, 31 cases were labeled as "others" and were excluded from our study. We used 607 whole-slide images from four classes (PUNLMP, low-grade, high-grade, and IUC) in our analysis. We combined PUNLMP and low-grade whole-slide images into a single class because of their similarity, as they are both noninvasive and low-risk cancers. Likewise, high-grade and IUC were merged into one group because they are considered high-risk cancers. We established the ground truth labels for each whole-slide image in our NHBCS dataset based on the consensus opinion of the two pathologists. If there was any disagreement, an expert pathologist (B.R.) re-reviewed the whole-slide image and resolved the disagreement. We randomly partitioned these slides into an internal training set of 425 slides (∼70% of the NHBCS dataset) and an internal test set of 182 slides (∼30% of the NHBCS dataset).
Two pathologists (R.R. & B.R.) manually annotated the whole-slide images in our internal NHBCS training set using the Automated Slide Analysis Platform (ASAP) [28]. Regions of interest in each whole-slide image in our training set were annotated with bounding boxes at the highest resolution for each image. The annotated areas were split into smaller patches for training a patch-level classifier. As noted above, the ground truth labels for whole-slide images in our internal NHBCS test set were based on the independent classification of two pathologists (B.R. and A.S.). The labels for the external TCGA test set were established based on the provided metadata from the TCGA database and additional confirmation by our study’s expert pathologist (B.R.).
Bladder4Net: Deep Learning Pipeline
In this study, we developed a deep learning-based model to distinguish between low- and high-risk bladder cancer cases, where the low-risk class includes PUNLMP and low-grade cases, and the high-risk class includes high-grade and invasive cases. This deep learning pipeline, named Bladder4Net, is shown in Figure 1. We classify each patch in a whole-slide image with binary classifiers. The proportion of patches classified as each subtype in a whole-slide image is stored in a vector with one entry per class. Of note, the ratios of PUNLMP and low-grade patches are summed to represent low-risk patches, and the ratios of high-grade and invasive patches are summed to represent high-risk patches. A Gaussian process classifier was trained on low-risk and high-risk patch ratios using the same training and test set partitioning used for training the CNN classifiers. The details of this pipeline are included below.
Patch classification
Analyzing large histology images using deep learning models requires substantial memory resources. Therefore, we split each whole-slide image into fixed-size patches (224×224 pixels) with 1/3 overlap. The Bladder4Net pipeline consists of four binary ResNet-18 [29] deep learning models, one per class, that operate at the patch level. We randomly selected 10% of the whole-slide images in the training partition for hyperparameter tuning during the training process. We selected patches in annotated areas for training and evaluating the patch-level classifiers. We normalized the color intensity of patches and used standard data augmentation methods, including random vertical and horizontal flips and color jittering, whose parameters were selected based on random subsampling of patches in each class. Our model was trained on 260,610 patches (an average of 613 patches per whole-slide image), including 107,379 high-risk and 153,231 low-risk patches. To address the class imbalance, we used a weighted random sampler to generate the training batches. Each ResNet-18 [29] model was initialized using normal distribution initialization. All four models used the cross-entropy loss function and were trained for 100 epochs with an initial learning rate of 0.005, decayed by a factor of 0.9 each epoch.
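The tiling and learning-rate schedule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's implementation: the function names are hypothetical, the 1/3 overlap is assumed to be rounded down to 74 pixels (giving a stride of 150 pixels), and partial patches at the right and bottom edges are assumed to be dropped.

```python
def patch_origins(width, height, patch=224, overlap_fraction=1 / 3):
    """Top-left (x, y) corners of fixed-size patches covering an image.

    Assumption: the overlap is rounded down to whole pixels and partial
    patches at the image border are discarded.
    """
    overlap = int(patch * overlap_fraction)  # 74 pixels for a 224-pixel patch
    stride = patch - overlap                 # 150 pixels between patch origins
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]


def learning_rate(epoch, initial=0.005, decay=0.9):
    """Learning rate after `epoch` completed epochs of per-epoch decay."""
    return initial * decay ** epoch


# A 524 x 224 pixel region yields three horizontally overlapping patches.
origins = patch_origins(524, 224)
```

With these assumptions, adjacent patches share 74 pixels, and the learning rate falls from 0.005 to 0.0045 after the first epoch.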
Whole-Slide Inference
To classify whole-slide images, we aggregated patch-level prediction outputs. For each whole-slide image, we pre-processed the image by removing the white background and color markers. To aggregate the patch-level predictions, the ratio of patches from each class to the total number of patches from a slide is computed per whole-slide image. Bladder cancer is progressive, and many whole-slide images contain mixed types of cancer cells. Therefore, there are some low-risk patches in high-risk whole-slide images. We used a Gaussian process classifier for whole-slide inference. The classifier is trained on the patch ratios of whole-slide images from the training set and evaluated on the same validation set used in the patch-level analysis. Patch ratios of each classifier are given as input to the whole-slide inference classifier. Low-risk images usually have a higher ratio of patches labeled as PUNLMP and low-grade. High-risk images typically have a higher ratio of high-grade and invasive patches. Figure 2 shows patch ratios of various classes in four sample whole-slide images from different classes.
We integrated the output of four binary patch-level classifiers to keep high-confidence patches and exclude low-confidence and normal patches in the whole-slide inference step. This process in our proposed pipeline is shown in Figure 3. If a patch is assigned to more than one of the labels, it indicates that the patch class label is unreliable and should be eliminated from the inference process. If all classifiers assign the label “others” to a patch, the patch is also eliminated. Our proposed inference method does not require hyper-parameter tuning as it does not rely on a threshold to eliminate low-confidence patches.
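The patch filtering rule and the ratio aggregation can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the study's code: the function names and the dictionary representation of the four binary outputs are hypothetical, and the ratios are assumed to be computed over the retained (high-confidence) patches only. The resulting low-risk and high-risk ratios correspond to the features passed to the Gaussian process classifier, which is not shown here.

```python
from collections import Counter

CLASSES = ("PUNLMP", "low-grade", "high-grade", "invasive")


def confident_label(votes):
    """Return the class name if exactly one binary classifier fired, else None.

    `votes` maps a class name to True when that class's binary classifier
    labeled the patch as positive (False or missing means "others").
    """
    positives = [c for c in CLASSES if votes.get(c, False)]
    return positives[0] if len(positives) == 1 else None


def slide_risk_ratios(patch_votes):
    """Aggregate per-patch votes into low-risk and high-risk patch ratios.

    Assumption: the denominator is the number of retained patches.
    """
    kept = [label for label in map(confident_label, patch_votes) if label]
    if not kept:
        return {"low_risk": 0.0, "high_risk": 0.0}
    counts = Counter(kept)
    n = len(kept)
    return {
        "low_risk": (counts["PUNLMP"] + counts["low-grade"]) / n,
        "high_risk": (counts["high-grade"] + counts["invasive"]) / n,
    }


example = [
    {"PUNLMP": True},
    {"low-grade": True},
    {"high-grade": True},
    {"high-grade": True, "invasive": True},  # two labels: discarded
    {},                                      # all "others": discarded
]
ratios = slide_risk_ratios(example)
```

In this toy slide, two of the five patches are discarded, and the remaining three give a low-risk ratio of 2/3 and a high-risk ratio of 1/3. No confidence threshold appears anywhere in the rule, which matches the statement that the method needs no threshold tuning.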
Patient Survival Prediction
We analyzed the survival time of patients for low- and high-risk classes. Survival time was calculated from the date of diagnosis to the date of death for patients who did not survive or to the date when the Death Master File was queried for patients who survived [27]. We generated Kaplan-Meier survival curves for patients predicted as high-risk versus those predicted as low-risk. A Log-rank test was used to compare the survival between two predicted groups, considering the follow-up time. We used the Cox proportional hazards model [30] to estimate the effect size of our predicted risk group on patient survival.
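For readers unfamiliar with the Kaplan-Meier estimator, the following Python sketch shows the product-limit calculation on toy data. It is illustrative only and is not the analysis code used in this study; the function names are hypothetical, and the Log-rank test and Cox model are not covered.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    `times` are follow-up times; `events` are 1 for death and 0 for
    censoring. Returns a list of (time, survival probability) pairs, one
    per distinct time at which at least one death occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which survival drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data: four patients, the second one censored at month 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

A return value of `None` from `median_survival` corresponds to a group in which more than half of the patients are still alive at the end of follow-up, so the median is not reached.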
Evaluation Metrics and Statistical Analysis
To measure the efficacy and generalizability of our approach, we evaluated our trained model on 182 independent whole-slide images (WSIs) from the NHBCS dataset and 378 WSIs from the TCGA dataset. We used precision, recall, and the F1-score as evaluation metrics. The confusion matrix is also shown for error analysis. In addition, 95% confidence intervals were computed using the bootstrapping method with 10,000 iterations for all the metrics.
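The weighted F1-score and its bootstrap confidence interval can be sketched as follows. This is a minimal illustration, assuming a percentile bootstrap that resamples slides with replacement; the exact bootstrap variant used in the study is not stated, and the function names, the fixed random seed, and the toy labels are hypothetical.

```python
import random


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * (tp + fn) / total
    return score


def bootstrap_ci(y_true, y_pred, n_iter=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the weighted F1-score."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        scores.append(weighted_f1([y_true[i] for i in idx],
                                  [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_iter)]
    upper = scores[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lower, upper


y_true = ["low", "low", "high", "high"]
y_pred = ["low", "high", "high", "high"]
point = weighted_f1(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, n_iter=200)
```

The default of 10,000 iterations mirrors the number reported above; the usage example uses 200 only to keep the toy run fast.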
Results
Classification of Low-Risk and High-Risk Groups
Table 2 summarizes our model’s per-class and average evaluation metrics and the associated 95% CI for detecting low- and high-risk groups based on whole-slide images in the NHBCS test set. Our model achieved a weighted mean accuracy of 0.91, weighted mean precision of 0.91, weighted mean recall of 0.91, and weighted mean F1-score of 0.91 on the NHBCS test set.
Table 3 shows the performance summary of our model on whole-slide images in the TCGA database under the study of urothelial bladder carcinoma. Of note, each case in the TCGA dataset may have more than one whole-slide image. Therefore, the images for these patients are aggregated in our study. The cancer stage of all TCGA images was T2 and above, i.e., high-risk, based on the patient metadata in the TCGA dataset. Although most patients in the TCGA cohort were in the high-risk group, high-risk histological patterns were absent on histology slides of a few patients based on the evaluation of our expert pathologist (B.R.). This is likely because only selected slides of each case were uploaded to the TCGA database, and the selected slides may not represent the entire tumor. Therefore, based on the tumor morphology of WSIs available for these cases, we considered these cases as low-risk. On the TCGA dataset, our model achieved a weighted mean accuracy of 0.99, weighted mean precision of 0.99, weighted mean recall of 0.99, and weighted mean F1-score of 0.99. The confusion matrices for our model on the NHBCS and TCGA test sets are shown in Figure 4.
Prediction of Patient Survival
Figure 5 shows the Kaplan-Meier survival curve for patients from the internal NHBCS test set. The hazard ratio of overall survival using the predicted risk group by our model versus the tumor grade-defined risk groups for these patients is shown in Table 4. Figure 6 shows the Kaplan-Meier survival curve of patients from the external TCGA test set.
For the internal NHBCS test set, patients predicted as high-risk had worse overall survival than those predicted as low-risk, with a Log-rank P-value of 0.004 (Figure 5). The patients in the NHBCS dataset were followed up to 216 months after the initial diagnosis, with a mean follow-up time of 123.8 months. The median survival time of predicted high-risk patients was 177 months, while more than 50% of the predicted low-risk patients survived until the end of their follow-up. In the univariate Cox proportional hazards analysis using our predicted risk groups, the predicted high-risk group had an estimated hazard ratio of 1.958 (95% CI: 1.222–3.137, P-value=0.005) compared to the predicted low-risk group. This hazard ratio was slightly higher than the hazard ratio using the labels defined by the WHO/ISUP grading (Table 4).
Among 378 patients from the TCGA dataset, 367 were predicted as high-risk and 11 as low-risk. Due to the small number of low-risk patients and the limited follow-up time in the TCGA data, we limited our survival analysis to the first 24 months after the initial diagnosis. No death events were reported in the low-risk group, and 35 death events were reported in the high-risk group in the first 24 months of follow-up, with a Log-rank P-value of 0.04 (Figure 6). Of note, because there were no events in the low-risk group, we could not estimate the hazard ratio and did not conduct Cox proportional hazards analysis on the TCGA dataset.
Discussion
The WHO has updated its bladder cancer grading guidelines several times since 1973 to align them more closely with disease recurrence and progression [3]. Based on the most recent update from WHO in 2016, PUNLMP, low-grade and high-grade with stage T1 bladder cancers are categorized as NMIBC, and high-grade cases with stage T2 and above are classified as MIBC. Detection and classification of bladder cancer histologic patterns under the microscope is critical for accurate prognosis and the appropriate treatment of patients; however, this histopathological assessment is a time-consuming and challenging task and suffers from a high variability rate among pathologists. Therefore, patients with NMIBC can incorrectly be diagnosed as high-grade cases, which might result in unnecessary treatment or even surgery which can affect patients’ quality of life [13] [31] [32]. In this study, we developed and evaluated a deep learning model to classify patients as high- or low-risk based on their whole-slide images to inform their prognosis and treatment. Our evaluation results on both internal and external datasets showed that this approach could potentially assist pathologists in their histopathological assessment, improve their accuracy and efficiency of diagnosis, and ultimately improve patient health outcomes.
As part of this study, we investigated developing a multi-class CNN model with four labels, including PUNLMP, low-grade, high-grade, and invasive. Although this model achieved a reasonable performance at the patch-level (See Table S1 in Supplemental Material), some classes, such as PUNLMP and invasive, achieved sub-optimal results. This outcome indicates that a single multi-class model cannot effectively handle the complexities of this task and achieve a good performance and generalization for all four classes. Notably, differentiating between PUNLMP and low-grade types has the lowest accuracy rate among clinicians due to their morphological similarities [13]. In addition, high-grade bladder cancer cells are found in any stage of the disease. Therefore, in our study, we used the ResNet-18 architecture as a backbone for binary classifications to differentiate between various classes instead of a single multi-class model. Each of our four binary CNN-based patch-level classifiers focuses on one class and differentiates that class from other subtypes. Our patch-level classification results (Table S1) indicate the high performance of our approach for this patch-level classification, as all binary classifiers achieved an F1-score of more than 0.79.
As the primary whole-slide level classification outcomes, we focused on identifying low- and high-risk groups for bladder cancer based on histology slides, where low-risk class includes PUNLMP and low-grade cases and the high-risk class includes high-grade and invasive cases. The differentiation between these two risk groups has a significant clinical impact on patient prognosis and treatment. For whole-slide inferencing, we built a Gaussian process classifier based on the distribution of the classified patches from each slide.
To demonstrate the generalizability of our model, in addition to evaluating it on 182 whole-slide images in our internal test set from NHBCS, we also evaluated our approach on 378 whole-slide images from TCGA as an external test set. Our approach achieved a weighted average F1-score of 0.91 (95% CI: 0.86–0.94) on the internal NHBCS test set and 0.99 (95% CI: 0.97–1.00) on the external TCGA test set. Of note, TCGA metadata showed that all the cases in this dataset belong to the high-risk class. That said, a few cases were identified by our study’s expert pathologist (B.R.) as low-risk based on their histology images, likely because the selected slides included in the TCGA dataset may not represent the entire tumor.
Finally, we computed Kaplan-Meier survival curves for patients predicted as high-risk versus those predicted as low-risk for both NHBCS and TCGA test sets. Patients predicted as high-risk had worse overall survival than those predicted as low-risk, with Log-rank P-values of 0.004 and 0.039 on the NHBCS and TCGA test sets, respectively. Also, our predicted high-risk group in the NHBCS test had an estimated hazard ratio of 1.958 (95% CI: 1.222–3.137, P-value=0.005) compared to the predicted low-risk group, which was slightly higher than the hazard ratio using the labels defined by the WHO/ISUP grading.
In future work, we plan to expand our model to distinguish between high-grade invasive and high-grade noninvasive classes, which is clinically helpful to determine the progression and recurrence of bladder cancer. Because we had a limited number of muscle-invasive cases in our datasets, building a model for this differentiation was not feasible in the current study. We plan to collect additional data and develop new data augmentation techniques, such as generative adversarial networks (GANs), to tackle the dataset imbalance. Such techniques can mitigate the effects of unbalanced data by preventing overfitting and thus improving overall performance [33]. In addition, we will consider including vision transformers in our future pipeline to improve our high-resolution image encoding approach [34]. We also plan to deploy the developed model for histopathological characterization of bladder cancer whole-slide images as a computer-aided diagnosis system in clinical settings, and to conduct a prospective study to validate our approach in clinical practice and evaluate its impact on health outcomes.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Acknowledgements
This research was supported in part by grants from the US National Library of Medicine (R01LM012837), the US National Cancer Institute (R01CA249758), and the US National Institute of General Medical Sciences (P20GM104416).