Abstract
Background Although clinical features of multi-parametric magnetic resonance imaging (mpMRI) have been associated with biochemical recurrence in localized prostate cancer, such features are subject to inter-observer variability.
Objective To evaluate whether the volume of the dominant intraprostatic lesion (DIL), as provided by a deep learning segmentation algorithm, could provide prognostic information for patients treated with definitive radiation therapy (RT).
Design, Setting, and Participants Retrospective study of 438 patients with localized prostate cancer who underwent an endorectal coil, high B-value, 3-Tesla mpMRI and were treated with RT between 2010 and 2017.
Intervention RT.
Outcome Measurements and Statistical Analysis Biochemical recurrence and metastasis risk, assessed with a cause-specific Cox regression and time-dependent receiver operating characteristic analysis.
Results and Limitations The artificial intelligence (AI) model identified DILs with an area under the receiver operating characteristic curve (AUROC) of 0.827 at the patient level. For the 233 patients with available PI-RADS scores, with a median follow-up of 5.6 years, AI-defined DIL volume was significantly associated with biochemical failure (adjusted hazard ratio 1.54, 95% confidence interval 1.09-2.17, p=0.014) after adjustment for PI-RADS score. Among all 438 patients with a median follow-up of 6.9 years, the AUROC for predicting 7-year biochemical failure for AI volume (0.790) was similar to that for an expanded National Comprehensive Cancer Network (NCCN+) category (p=0.17). The AUROC for predicting 7-year metastasis for AI volume trended towards being higher compared to NCCN+ categories (0.854 vs 0.769, p=0.06).
Conclusions A deep learning algorithm could identify the DIL with good performance. AI-defined DIL volume may be able to provide prognostic information independent of the NCCN+ risk group or other radiologic factors for patients with localized prostate cancer treated with RT.
Introduction
Multi-parametric magnetic resonance imaging (mpMRI) has revolutionized the management of localized prostate cancer. mpMRI-based biopsy strategies have been shown to improve the detection of clinically significant disease while reducing the detection of clinically insignificant disease1–4. Furthermore, the characteristics of dominant intraprostatic lesions (DILs) on mpMRI, including PI-RADS scores5, radiologic T-stage6,7, and lesion size8–11, have been shown to be prognostic. Recently, a nomogram based on clinical and radiologic parameters was found to be highly prognostic for early biochemical recurrence after radical prostatectomy12.
However, the assignment of various clinical features on mpMRI is subject to significant inter-observer variability. For example, multiple grading systems13–15 exist for assigning extracapsular extension (EPE), each with differing sensitivities for the detection of histologic EPE extent16. Multi-reader studies have reported moderate inter-observer variability in the reporting of Prostate Imaging Reporting and Data Systems (PI-RADS) v2.0 scores17–19. A recent study demonstrated that the positive predictive value of PI-RADS scores for detecting high-grade prostate cancer was low (35%) and variable across 26 centers20. The varied performance of mpMRI is likely multifactorial and could be partially attributed to technical and interpretative factors, as well as benign conditions that mimic malignancy21.
Given these limitations, there has been considerable interest in the development of artificial intelligence (AI) algorithms for supporting the radiologist’s workflow. Although the performance of AI algorithms has not been shown to match that of radiologists for the detection of clinically significant prostate cancer22, recent work suggests that deep learning algorithms could be used as an adjunct for assisting radiologists23. However, less is known about the prognostic information provided by deep learning algorithms, particularly in comparison to current staging systems. The purpose of this study is to evaluate whether the mpMRI DIL volume, as determined by a deep learning segmentation algorithm24, could provide prognostic information for patients with localized prostate cancer treated with definitive radiation therapy (RT).
Methods
Clinical datasets
We conducted a retrospective study of 438 patients with cT1cN0M0 prostate cancer who underwent an endorectal coil, high B-value, 3-Tesla mpMRI and were treated with definitive prostate RT, both at our institution, between 2010 and 2017 (Table S1). mpMRI were obtained with either General Electric (GE) Signa HDxt (n=237) or DISCOVERY MR750w (n=201) (Table S2). Apparent diffusion coefficient (ADC)/diffusion-weighted imaging (DWI) were obtained at the same resolution (0.70 × 0.70 × 3-4 mm) as the T2-weighted images. All research was conducted with approval from the institutional review board.
We captured baseline clinical characteristics and treatment characteristics (Table 1). We stratified patients into an expanded National Cancer Center Network (NCCN+) classification, based on the 4-tiered NCCN stratification (low, favorable intermediate-risk, unfavorable intermediate-risk, high-risk)25, with the addition of a very high-risk category per eligibility criteria from recent STAMPEDE trials (≥2 of the following: cT3-T4, Gleason score 8-10, or PSA ≥40 ng/mL)26. We captured radiologic staging parameters, including PI-RADS v2.0 scores27, which were assessed by a subspecialty abdominal radiologist (L.K.L.) for 83 patients who were scanned in 2010-2013 and available for all patients after mid-2015, as well as radiologic T-stage (T1: no visible tumor on mpMRI, T2: organ-confined, T3a: EPE, T3b: seminal vesicle invasion [SVI]) based on mpMRI. Lastly, we captured outcomes including biochemical failure (nadir+2 ng/mL), development of metastasis (non-regional nodal or distant disease on bone scan, computed tomography, or positron emission tomography), and survival.
We divided patients into 3 cohorts. Of the 233 patients with available PI-RADS scores, we randomly allocated them into training (TrainPIRADS) and testing (TestPIRADS) sets. The remaining 205 patients were allocated into a second testing (TestNoPIRADS) set. For the TrainPIRADS and TestPIRADS sets, a radiation oncologist contoured the prostate transitional zone (TZ), peripheral zone (PZ), and DIL based on the presence of PI-RADS 3-5 scores. For the TestNOPIRADS set, the radiation oncologist contoured only the DIL based on radiology reports and biopsy data.
nnUNet algorithm
We utilized the publicly available nnUNet24 deep learning algorithm for automated delineations of the TZ, PZ, and DIL. First, T2, ADC, and DWI images were downloaded onto a research workstation. Images were anonymized and cropped into smaller arrays (128 × 128 × 15 or 20 pixels). T2-weighted images were registered onto the ADC images, utilizing the best performing registration among 4 algorithms (rigid, rigid + B-spline, affine, affine + B-spline) of prostate segmentation distance maps28. The median registration Dice coefficient was 0.92 (interquartile range [IQR] 0.90-0.94).
We trained a 3D full-resolution nnUNet model utilizing 5-fold cross-validation on the TrainPIRADS set. We then applied the selected configuration onto the TestPIRADS and TestNoPIRADS sets. With the utilization of 5-fold cross-validation on the training set and application of the model on the 2 testing sets, we generated TZ, PZ, and DIL segmentations for all 438 patients.
Assessment of AI DIL volume performance
We reported the Dice coefficients for the TZ and PZ, as noted by nnUNet. For the DIL, we analyzed the detection performance (patient-level area under the receiver operator characteristic curve [AUROC] and lesion-level specificity at 0.5 false positive per image) utilizing PICAI_eval29. To assess whether false positive (FP) or false negative (FN) lesions were smaller and less conspicuous than true positives (TP), we classified AI lesions as TP, FP, or FN, based on whether a reference DIL was present and correctly identified (TP: Dice coefficient ≥10%29), identified but not present (FP), or present but not identified or with insufficient Dice coefficient overlap (<10%) (FN). We analyzed lesion-level DIL volumes and contrast ratios (ratio of ADC intensities within the DIL versus a surrounding 2mm ring) based on classification status. Finally, we reported the similarity metrics (Dice coefficient) for TP lesions. All processing was performed using Python version 3.8.12 with the SimpleITK toolkit.
Comparison of AI and reference DIL volumes for sextant subset
For the 303 patients with sextant biopsies among all 3 cohorts, we compared the ability of the AI and reference DIL contours to detect clinically significant (Gleason grade group ≥2) disease within each sextant. We compared area under the curve (AUC) values utilizing ROC analysis (R package pROC). AI and reference DIL contours were allocated according to sextants based on the reference prostate contour.
Comparison of DIL volume with radiologic staging for PI-RADS subset
For the 233 patients (TrainPIRADS and TestPIRADS) with available PI-RADS scores among all 3 cohorts, we conducted cause-specific Cox regression models (R package coxph) for predicting biochemical failure with log-transformed AI DIL volume in cm3 (VAI), clinical T-stage, radiologic T-stage, PI-RADS scores (0-4 vs 530), age, whether patients received standard of care (SOC, i.e. ≥75.6 Gy if receiving external beam RT and androgen deprivation therapy [ADT] duration of 6 months for unfavorable intermediate-risk disease and ≥18 months for high/very high-risk disease), year of treatment, and scanner model (Signa HDxt vs DISCOVERY MR750w) as independent variables. An additional Cox regression model for metastasis was created using the above independent variables with the addition of salvage ADT use as a time-dependent variable.
Comparison of DIL volume with NCCN+ risk category for entire cohort
For the entire cohort, we created ROC curves for NCCN+, reference DIL volume (VREF), and VAI for predicting 7-year biochemical failure and metastasis, utilizing the timeROC package. We also created cause-specific Cox regression models (R package coxph) for predicting risk of biochemical failure, local failure, and metastasis. Independent variables included log-transformed VAI, NCCN+ classification, age, receipt of SOC, year of treatment, scanner model, and for local failure and metastasis models, salvage ADT use as a time-dependent variable. Finally, we estimated cumulative incidences of biochemical failure, local failure, and metastasis, based on partitions of VAI intervals, utilizing the R package prodlim. All statistical analysis was conducted with R version 4.1.2.
Results
Assessment of AI DIL performance
For the TestPIRADS cohort, median Dice coefficients for the TZ and PZ were 0.89 and 0.84, respectively (Table S3). The nnUNet exhibited a patient-level AUROC of 0.827 and a lesion-level sensitivity of 72.7% at a false positive rate of 0.5 case per image (Table S4). For patient-level analysis (Table S5), there was no significant difference between VAI (median 1.02cm3, IQR 0.40-2.34) and VREF (median 1.31cm3, IQR 0.37-2.61, p=0.66), with Pearson’s correlation coefficient of 0.79. For lesion-level analysis (Table S6), the total volume of AI TP lesions (median 1.06cm3, IQR 0.47-1.85) was significantly greater than for FP (median 0.25cm3, IQR 0.09-0.58; p<0.001) and FN (median 0.37cm3, IQR 0.21-0.88, p=0.007) lesions. The TP lesions were also more conspicuous (median contrast ratio 0.72, IQR 0.68-0.76) than FP (median 0.83, IQR 0.79-0.86; p<0.001) and FN (median 0.81, IQR 0.79-0.88; p<0.001) lesions. The median Dice coefficient for TP lesions was 0.74. Examples of TP, FP, and FN lesions are shown in Fig. S1.
Comparison of AI and reference DIL contours for sextant subset
For the subset of 303 patients with sextant biopsies among all 3 cohorts, we did not detect a difference in AUROC values associated with AI versus reference contours for any sextant (Table S7).
Comparison of DIL volume with radiologic staging for PI-RADS subset
For the 233 patients with PI-RADS scores (TrainPIRADS and TestPIRADS cohorts), with median follow-up of 5.6 years, there were 28 biochemical failures. AI algorithm detected 41.2%, 68.3%, and 87.1% of PI-RADS3, 4, and 5 lesions, respectively (Table S8). PIRADS 5 vs 0-4 (HR 5.90, 95% confidence interval [CI] 2.05-17.03, p=0.001) and VAI (HR 1.89 per log cm3 increase, 95% CI 1.41-2.53, p<0.001) were significantly associated with biochemical failure on univariable analysis (Table 2). However, only VAI retained significance on multivariable analysis (adjusted HR 1.54, 95% CI 1.09-2.17; p=0.014). The association between VAI and metastasis was similarly significant (adjusted HR 1.71, 95% CI 1.04-2.80, p=0.04; Table S9).
Comparison of DIL volume with NCCN+ risk category for the entire cohort
Among all 438 patients with a median follow-up of 6.9 years, there were 49 biochemical failures, 14 local failures, and 22 metastases. The AUROC for 7-year biochemical failure for VAI (0.790) was similar to that for VREF (0.779; p=0.42) and NCCN+ category (0.740; p=0.17). The sensitivity and specificity for 7-year biochemical failure at DIL volumes >0.5 mL were 90.5% and 45.8%, respectively, and at DIL volumes >2.0 mL, were 57.7% and 84.4%, respectively. The AUROC for predicting 7-year metastasis for VAI (0.854) was similar to that for VREF (0.817; p=0.15) but trended towards being higher compared to NCCN+ category (0.769; p=0.06). The sensitivity and specificity for predicting 7-year metastasis at DIL volumes >0.5 mL were 94.3% and 44.1%, respectively, and at DIL volumes >2.0 mL were 68.1% and 82.9%, respectively (Fig. 1).
VAI was significant on both univariable and multivariable analyses for predicting biochemical failure (Table 3; adjusted HR 1.61, 95% CI 1.23-2.12, p<0.001). For metastasis, only VAI was significant on univariable and multivariable analysis (Table 4; adjusted HR, 1.71, 95% CI 1.08-2.70, p=0.02). VAI was also associated with local recurrence risk on univariable analysis (HR 1.41, 95% CI 1.00-1.98, p=0.05; Table S10).
Fig. 2 shows cumulative incidences of biochemical failure, local failure, and metastasis for VAI of 0-0.4, 0.5-1.9, and ≥2.0cm3. Respective incidences of 7-year biochemical failure were 2.7%, 9.2%, and 23.7%. Respective incidences of 7-year local failure were 1.0%, 5.8%, and 5.3%. Respective incidences of 7-year metastasis were 0.7%, 2.9%, and 11.8%. Table S11 depicts the association of DIL volume ranges with NCCN+ and radiologic factors. Cumulative incidence curves for NCCN+ categories are shown in Fig. S2.
Discussion
Our study demonstrates that the prostate and DIL can be accurately delineated on mpMRI using nnUNet,24 an out-of-the-box self-configuring deep learning segmentation algorithm. Our findings suggest that AI-determined DIL volume provides prognostic information comparable to both NCCN+ risk category and human-generated DIL volume for estimating biochemical recurrence and metastasis risk for localized prostate cancer treated with RT. Furthermore, for the subset of patients with PI-RADS scores, VAI was more strongly associated with biochemical failure than other mpMRI parameters, including radiologic staging and PI-RADS score.
Though other studies have reported on the prognostic significance of tumor size for biochemical recurrence8–11, ours is among the first to show that DIL volume can be reliably obtained with an AI algorithm. This is clinically meaningful because DIL volume determination can be time-consuming and is subject to significant inter-observer variability31. The AI algorithm, on the other hand, was able to generate the DIL volume in an efficient, automated, and standardized manner. Additionally, this work may be clinically impactful as it suggests that VAI could provide important prognostic information even in the absence of a biopsy. With further validation, VAI could have the potential to provide unique prognostic information in addition to that provided with current clinical and radiologic staging systems.
VAI shows exceptional promise as a potential prognostic factor as it is a single, well-defined entity which may be generated in a systematic manner from mpMRI. Unlike NCCN risk categories25, VAI does not require the presence of a systematic biopsy, which may be important as MR-only biopsies are increasingly utilized4. Unlike radiomic approaches, VAI does not require the selection and regression of multiple features, which may have limited generalizability due to inter- and intra-scanner reproducibility32. Unlike genomic classifiers33 or digital pathology approaches34, no specific retrieval or handling of clinical specimens are required. Rather, prognostic information could be obtained directly from the mpMRI.
Additionally, this study provides support for extreme dose-escalation to large DILs, either through external beam RT35 or brachytherapy36,37 techniques. We found that VAI to be associated with local recurrence risk. Given emerging evidence which strongly suggests that local failures seed distant metastases38, extreme dose-escalation in appropriately selected patients with large DILs may be clinically impactful.
Our study showed that FP and FN lesions were smaller and less conspicuous than TP lesions. Our AI algorithm may have retained prognostic significance, despite missing such lesions, because of the weaker association between smaller lesions8–11 and less conspicuous ADC values39 with biochemical recurrence.
Key strengths of this study include the utilization of a well-annotated contemporary patient cohort treated with modern RT approaches. Furthermore, this study shows that mpMRI obtained after prior biopsy and with an endorectal coin contains important prognostic information that can be extracted by deep learning approaches. At our institution and likely many others, most mpMRIs acquired before 2018 had a pre-existing biopsy1.
It should be noted that the algorithm developed within this study may not be generalizable to MRIs obtained using different scanning parameters or without endorectal coils. External validation would be necessary to establish generalizability. It is our hope that deep learning algorithms with similar levels of performance may be built using nnUNet (or other deep learning segmentation approaches) for MRIs obtained under different conditions. Additionally, the AI algorithm missed significant disease in a subset of cases, including 12.9% of PI-RADS 5 lesions. As a result, AI DIL volume should not be used as a standalone entity.
Further research involves extending this work to multi-institutional datasets involving other scanning configurations and treatment types (e.g. surgery). We are very encouraged by the excellent performance of deep learning algorithms across multi-institution datasets21,28 and look forward to the results of the PICAI-Eval challenge29.
Only with additional information can we better understand the full potential of deep learning AI algorithms for providing potentially valuable prognostic information for patients.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.