Abstract
Background It remains unclear if deep-learning-based white matter lesion (DL-WML) volume can predict outcome after ischemic stroke.
Purpose We aimed to develop, validate, and evaluate DL-WML volume in NCCT as a risk factor and IVT effect modifier compared to the Fazekas scale (WML-Faz) in patients also receiving EVT in an EVT-capable center.
Methods A deep-learning model for WML segmentation in NCCT was developed and validated internally and externally. The volumetric correspondence of DL-WML volume per mL was reported relative to expert annotation with the intraclass correlation coefficient (ICC) and a Bland-Altman analysis reporting bias and limits of agreement (LoA). In a post-hoc analysis of the MR CLEAN No-IV trial, univariable and multivariable regression models were used to report (un)adjusted common odds ratios ([a]cOR) to associate DL-WML volume and WML-Faz with the occurrence of symptomatic-intracerebral hemorrhage (sICH) and 90-day functional outcome with the modified Rankin Scale (mRS).
Results DL-WML volumes were comparable with those of the ground truth for both the internal test set (10/20(50%) male, age median:72[IQR:67-85], ICC mean:0.91 95%CI:[0.87;0.94];bias:-3mL LoA:[-12mL;7mL]) and the external test set (36/101(36%) male, age median:59[IQR:42-73], ICC mean:0.87 95%CI:[0.71;0.95];bias:-2mL LoA:[-11mL;7mL]). 516 patients from the MR CLEAN No-IV trial (291/516(56%) male, age median:71 IQR:[62-79],) were analyzed. Both DL-WML volume and WML-Faz were associated with sICH (DL-WML volume acOR:1.31 95%CI[1.08;1.60], WML-Faz acOR:1.53 95%CI[1.02;2.31]) and mRS (DL-WML volume acOR:0.84 95%CI[0.76;0.94], WML-Faz acOR:0.73 95%CI[0.60;0.88]). Only for the unadjusted analysis, WML-Faz was an IVT effect modifier (p=0.046), DL-WML was not (p=0.274).
Conclusion DL-WML volume and WML-Faz had a similar relationship with functional outcome and sICH.
Summary statement Deep-learning-based white matter lesion volume in non-contrast CT can substitute human ratings to prognosticate symptomatic intracerebral hemorrhage and functional outcome at 90-days in acute ischemic stroke patients.
Key points
This was the first study to show that white matter lesion volume in non-contrast CT based on deep-learning segmentations (DL-WML volume) was associated with symptomatic intracerebral hemorrhages and a worse functional outcome.
Compared to the Fazekas scale, regression models using DL-WML volume had a similar fit to the data.
White matter lesion load might be associated with more symptomatic intracerebral hemorrhages if IVT was given before EVT.
Introduction
The benefit of intravenous thrombolysis (IVT) before endovascular treatment (EVT) for large vessel occlusion acute ischemic stroke (AIS) in EVT-capable stroke centers remains unclear (1–6). An intracerebral hemorrhage (ICH) is a common complication after AIS that can be induced by IVT (7–10) and frequently results in severe neurologic disability and death (11,12). In patients with a high white matter lesion (WML) burden, ICH occurs more often and is aggravated by IVT (11,12). Therefore, accurate quantification of the WML burden might be used to guide safe administration of IVT in patients also eligible for EVT in an EVT-capable stroke center.
Currently, visual rating scores such as the Fazekas scale are used to grade WML (WML-Faz) in non-contrast CT (NCCT) or MRI (13) but suffer from imprecision and inter-rater variability(14). In a post-hoc analysis of the Chinese DIRECT-MT trial, randomizing patients for IVT before EVT in EVT-capable stroke centers, WML-Faz was associated with functional outcome while no association was found with ICH and effect modification of IVT for ICH or functional outcome (15). Thus, in patients with high WML-Faz scores, no increased rate of IVT-induced ICH was observed. Besides visual rating scales, several automated methods have been developed for WML segmentation in MRI and NCCT allowing for a fast and quantitative evaluation of WMLs (16,17). These automated WML segmentation methods for NCCT have only been analyzed for diagnostic accuracy relative to visual rating scales and based on spatial and volumetric correspondence compared to an expert-based ground truth segmentation (16,17). It remains unclear how such an automated segmentation method relates to patient outcomes and if it is a feasible alternative to a visual rating scale for WML. Finally, recent improvements in deep-learning led to superior performance over currently used automated methods but have not been considered for WML segmentation in NCCT (19).
We aimed to develop a state-of-the-art deep-learning model to segment WML in NCCT. We hypothesized that automatically extracted deep-learning-based WML (DL-WML) volume has a similar predictive association as WML-Faz with symptomatic or asymptomatic intracerebral hemorrhage (sICH/aICH) and functional outcome. Furthermore, we hypothesized that both DL-WML volume and WML-Faz result in more ICH and a worse functional outcome if IVT is given.
Methods
Data and study design
Three retrospective imaging databases were used. Data from two centers (A and B) were used to train, internally, and externally validate segmentations obtained through a nnUnet deep-learning model (19). For the center A-1 cohort, we included 118 patients from a previous multicenter study (17) and added 147 patients retrospectively collected (A-2 cohort). For the center B cohort, 101 patients with FLAIR and NCCT imaging acquired within 3 months were included (16). Subsequently, data from the MR CLEAN No-IV trial was used to compare the value of WMLs according to the Fazekas scale and automatically extracted DL-WML volume (6). The MR CLEAN No-IV was a randomized controlled trial for intravenous alteplase (0.9mg/kg) before EVT or direct EVT in EVT-capable stroke centers including patients between January 2018 and October 2020 (6). Patients from the MR CLEAN No-IV were included if a baseline NCCT was available (n=516). Figure 1 and Online supplement A describe the data specifics and reuse of previously reported data.
Lesion annotation and data processing
In the center A cohort, ground-truth WMLs were segmented in NCCT by three experts (17). A single ground truth segmentation was created using the stapler algorithm which is based on a weighted majority voting between raters (20). WML-Faz was rated by any member of the independent core-lab. From the center A cohort (A1-train), we used 98 patients for training the nnUnet and 20 (center A-test set) for the internal validation test set. From the center A-2 cohort, 147 patients with single rater WML annotated slices were included for training (A2-train). For the center B data, NCCTs were co-registered to the FLAIR images using Elastix (21). Subsequently, WMLs in FLAIR were automatically segmented using medseg.ai (22) and manually adjusted by any of four raters (HV, LV, LP, and MM) using 3D-Slicer software, version 4.11 (23). All 101 patients from center B were used as an external validation test set (B-test).
Lesion segmentation evaluation
The best nnUnet model was determined using the Dice similarity coefficient (DSC) for each of the 5 cross-validation folds. Evaluation metrics were reported based on a majority voting ensemble of these 5 models. To exclude false positive segmentations, we only considered WML segmentations inside a 10mm dilated mask of the ventricles (further described in Online Supplement B). DL-WML segmentations were evaluated with the DSC and Hausdorff Distance (HD) in mm for spatial agreement, and volumetric correspondence using the bias and limits of agreement (LoA) in a Bland-Altman-plot and the two-way mixed effects intra-class correlation coefficient (ICC). The center A and B datasets were described with the total WML volume, average volume per WML, and the total WML count. Additionally, the NCCT quality was described with the signal-to-noise ratio (average voxel values in NCCT/standard deviation of background voxel values) and contrast-to-noise ratio ([average voxel values in WML – average background voxel values]/standard deviation of background voxel values) considering the entire brain tissue and a 5mm radius surrounding the WML as background.
Post-hoc analysis MR CLEAN NO-IV
The primary outcome measure was functional outcome measured with the ordinal modified Rankin Scale (mRS) 90 days after AIS. Secondary outcome measures were functional independence (mRS≤2), mortality, sICH, and aICH according to the Heidelberg criteria (24). WML-Faz as a continuous variable and DL-WML volume were studied as independent variables. These WML measures were considered as risk factors for the outcome measures using binary and ordinal logistic regression models. Subsequently, a multiplicative interaction term of the WML measures with an intention-to-treat for IVT was introduced to the regression models to analyze treatment effect modification. Additional analyses for treatment effect modification were added in a per-protocol setting, excluding all patients that did not receive IVT and EVT according to the randomization. Model fit with DL-WML volume or the Fazekas scale was described with the likelihood ratio (LL). Odds ratios (ORs) with 95% confidence interval (95%CI) were reported for unadjusted univariate models and statistically adjusted multivariable models considering the following potential confounders: age, sex, pre-stroke mRS, collateral score, national institute of health stroke scale (NIHSS), Alberta stroke program early CT score (ASPECTS), IVT before EVT, time from onset of neurological deficit to arterial puncture, occlusion location, secondary hemostasis (INR: international normalized ratio), history of hypertension, history of diabetes mellitus, and previous cardiovascular disease consisting of peripheral artery disease, myocardial infarction, or ischemic stroke.
Baseline characteristics and WML measures were compared between the two treatment arms of the MR CLEAN No-IV with ANOVA, Kruskal-Wallis, and Chi-squared test for normal and non-normal distributed continuous, and categorical distributed baseline variables. Online supplement A describes the scoring of the imaging measures. All statistical analyses were performed in R version 3.6.3. For logistic regression analyses, missing values were imputed using multiple imputations for five datasets with the R-package MICE considering all potential confounders, outcome measures, and independent variables.
Results
Descriptive statistics
Table 1 describes the characteristics of the NCCTs and the WML in the scans for the center A1-train set (55/98[56.1%] male sex, age median:76[IQR:66;85]), the A2-train set (66/147[44.8%] male sex, age median:78[IQR:68;85]), the center A-test set (10/20[50.0%] male sex, age median:72[IQR:67;85]), the center B-test set (36/101[35.6%] male sex, age median:59[IQR:42;73], and the included patients from the MR CLEAN No-IV trial (291/516[56.4%] male sex, age median:71[IQR:62;79]). All characteristics differed significantly across the datasets. The total WML volume and average volume per WML were higher in the center A datasets than in the B-test set. Furthermore, the number of lesions per patient was higher in the B-test set. Since the center A2-train set contained annotated slices and not volumes, the volume measures and number of WML are not directly comparable with other datasets. The signal-to-noise ratio was lower for the B-test set compared to other datasets. The contrast-to-noise ratio of the center A datasets was more negative than that of the B-test set. This indicates that in NCCTs from the B-test set WMLs were often in an area with similar Hounsfield Unit values inside the lesion as in the local background surrounding the WML. Patients in the B-test set and MR CLEAN No-IV had more often a thin slice (≤0.5mm) NCCT. Table 2 describes the baseline characteristics of included and excluded patients from the MR CLEAN No-IV trial, per randomization arm and Fazekas scale sub-score. From the IVT before IVT arm 257 patients were included, 11 did not receive EVT or IVT. From the EVT without IVT arm 259 patients were included, in this arm, 28 patients did not receive EVT or did receive IVT. Patients with more severe WML according to the Fazekas scale were older, had a worse mRS before AIS, more often had a history of ischemic stroke, peripheral artery disease, diabetes mellitus, and hypertension, had a longer time from onset of neurological deficit to groin puncture, and had a larger DL-WML volume. Online supplementary Table S1 describes the baseline characteristics between the included and excluded patients from the MR CLEAN No-IV trial.
White matter lesion segmentation
Figure 2. depicts NCCTs from a patient in the A- and B-test sets. Compared to the A-test set, in the B-test set DSC was lower (B-test median DSC:0.23 IQR:[0.06;0.43] vs. A-test median DSC:0.68 IQR:[0.37;0.77]), and the HD was higher (B-test median HD:32.3 IQR:[26.9;48.3] mm vs. A-test median HD:17.9 IQR:[13.6;23.9] mm). Figures 3A and 3B describe the relationship between the number of WML, contrast-to-noise ratio, and DSC. Average volume per WML and the contrast-to-noise ratio were (log)linearly associated with DSC. For a similar average volume per WML and contrast-to-noise ratio both the A- and B-test sets had a similar DSC. Volumetric correspondence between the deep-learning-based segmentations and the ground truth was similar for both test sets in terms of ICC (B-test median ICC:0.91 95%CI:[0.87;0.94] vs. A-test median ICC:0.87 95%CI:[0.71;0.95]) and for the Bland-Altman analysis (B-test bias:-2mL LoA:[-11mL;7mL] vs. A-test bias:-3mL LoA:[-12mL;7mL]) (Figures 3C and 3D). For larger volumes, DL-WML underestimated the ground truth volume in both test sets. Online supplementary Figures S1-6 contain plots relating DSC with all other NCCT quality metrics, results for DL-WML segmentations not using the dilated ventricle mask, and the DL-WML volume per WML-Faz score.
Post-hoc analysis MR CLEAN No-IV
Table 3 describes all the associations between WML-Faz and DL-WML volume with the outcome measures without and with adjustment of potential confounders. Both the WML-Faz and DL-WML volume were associated with functional outcome, with and without adjustment for potential confounders for ordinal mRS (DL-WML volume: cOR:0.76[95%CI:0.68;0.83], acOR:0.84[95%CI:0.76;0.94]; WML-Faz: cOR:0.57[95%CI:0.48;0.67], acOR:0.73[95%CI:0.60;0.88]), functional independence (DL-WML volume: cOR:0.65[95%CI:0.55;0.76], acOR:0.76[0.64;0.91]; WML-Faz: cOR:0.51[95%CI:0.41;0.63], acOR:0.62[0.49;0.80]), and sICH (DL-WML volume: cOR:1.26[95%CI:1.08;1.46], acOR:1.31[95%CI:1.08;1.60]; WML-Faz: cOR:1.59[95%CI:1.13;2.24], acOR:1.53[95%CI:1.02;2.31]). WML-Faz and DL-WML volume were both associated with death without adjusting for potential confounders but not after adjusting (DL-WML volume: cOR:1.21[95%CI:1.08;1.46], acOR:1.01[95%CI:0.87;1.17]; WML-Faz: cOR:1.64[95%CI:1.32;2.24], acOR:1.19[95%CI:0.91;1.57]). Neither WML measure was associated with aICH (DL-WML volume: cOR:1.06[95%CI:0.92;1.22], acOR:1.03[95%CI:0.87;1,22]; WML-Faz: cOR:1.24[95%CI:0.99;1.56], acOR:1.16[95%CI:0.91;1.49]). Online supplementary tables S2-3 contain model fit parameters for all the regression models and a cross-table with sICH for each sub-score on the Fazekas scale. Regression models using DL-WML volume or the WML-Faz had comparable LL for the univariable unadjusted models (ordinal mRS -unadjusted: DL-WML volume LL: 36.14, WML-Faz LR: 45.28; sICH - unadjusted: DL-WML volume LL: 7.23, WML-Faz LR:6.67) and multivariable-adjusted models (ordinal mRS - adjusted: DL-WML volume LL: 189.34, WML-Faz LL: 189.53; sICH - adjusted: DL-WML volume LL: 42.88, WML-Faz LL: 43.98).
Table 4 describes the ORs and p-values for IVT effect modification analyses using an intention-to-treat approach for receiving IVT and EVT. WML-Faz-based IVT effect modification for sICH was statistically significant before adjusting for potential confounders (cOR:8.77 95%CI:[1.04;74.09], p=0.046), but not after adjusting for potential confounders (acOR:7.24 95%CI:[0.71;73.96], p=0.095). DL-WML volume-based IVT effect modification for sICH was not statistically significant (cOR:1.08 95%CI:[0.94;1.25], p=0.274; acOR:1.09 95%CI:[0.93;1.28], p=0.295). Compared to models without WML-Faz-based IVT effect modification for sICH, models with WML-Faz-based IVT effect modification had a higher LL for the unadjusted analyses (LL no WML-Faz-IVT effect modification: 6.67, LL with WML-Faz-IVT effect modification: 11.20) and a similar LL for the adjusted analyses (LL no WML-Faz-IVT effect modification: 43.98, LL with WML-Faz-IVT effect modification: 43.38). For the remainder of outcome measures, no IVT effect modification due to WML-Faz or DL-WML was observed. Figure 4 depicts the probability of sICH for each Fazekas scale sub-score and by the DL-WML volume for both the intention-to-treat and per-protocol analyses. sICH occurred in 28 (5.4%) and in 25 (5.1%) all patients in the intention-to-treat and per-protocol analysis respectively. Online supplementary tables S4-6 describe the IVT effect modification results for the per-protocol analysis, a table with model fit statistics, and a cross-table with sICH for each sub-score on the Fazekas scale. The IVT effect modification for sICH due to WML-Faz was higher in the per-protocol analysis and statistically significant with and without adjustment for confounders (cOR: 37.11 95%CI:[3.03;454.13]), p=0.005; acOR:34.52 95%CI:[2.18;545.41], p=0.012). For DL-WML volume-based IVT effect modification for sICH the multiplicative term was non-significant (cOR: 1.20 95%CI:[1.0;1.43]), p=0.052; acOR:1.22 95%CI:[0.99;1.50], p=0.065).
Discussion
An automated deep-learning-based method was developed for the volumetric segmentation of WML in NCCT and validated internally and externally. Differences in DSC between the internal (median DSC:0.68 IQR:[0.37;0.77]) and external (median DSC:0.23 IQR:[0.06;0.43]) validation test sets could be explained by varying contrast-to-noise ratios in the NCCT and smaller WMLs in the external validation test set. The volumetric correspondence between the internal (bias:-3mL LoA:[-12mL;7mL]; mean ICC:) and external (bias:-2mL LoA:[-11mL;7mL]; mean ICC) validation test sets was comparable. DL-WML volumes were lower than the WML volumes (semi-automatically) segmented by experts. This study was the first to show that DL-WML volume and the Fazekas visual rating scale have a similar association with sICH and functional outcome after EVT with or without prior IVT irrespective of patient age and cardiovascular history. We did not find a higher probability of sICH if IVT was given before EVT for patients with a higher WML burden (adjusted interaction: IVT*WML-DL p=0.295 ; IVT*WML-Faz p=0.095).
The post-hoc analyses of the Chinese DIRECT-MT trial only found associations of WML-Faz with functional outcome and not with effect modification for sICH in patients receiving IVT before EVT due to WML-Faz (15). In contrast, our findings suggest that DL-WML volume and WML-Faz were associated with worse functional outcomes and more sICH. This contradiction might be due to differences in cardiovascular risk profiles between the Chinese and Dutch populations (25). Alternatively, the association with mortality was stronger in the DIRECT-MT analyses, potentially indicating missed sICH patients. Furthermore, our regression analyses used WML-Faz as a continuous variable while in the DIRECT-MT analyses the statistical power was reduced by using dichotomized WML-Faz variables (15,26). We observed good volumetric correspondence and statistically significant associations with outcomes even though the spatial correspondence was suboptimal in the external test set compared to the internal test set. Although the performance difference between the internal and external validation test sets could indicate overfitting of our deep-learning model on the training- and internal validation test-sets, this difference was probably related to more patients with few and small WML and lower NCCT quality in the external test set. Specifically, Pitkänen et al. achieved a higher DSC (16), which might be attributed to observed higher average Fazekas score, and thus WML volume, compared to the present work. In line with previous research, the suboptimal DSC could be attributed to difficulties with ground truth segmentation for WML in NCCT (16). Whereas other studies only performed internal validation (16,17), we present an external validation and evaluation of DL-WML volume as a risk factor.
Our study has shortcomings. This study is a post-hoc analysis of a randomized trial with several patients receiving IVT in the group randomized not to receive IVT. In the intention-to-treat analysis, we did not find IVT effect modification due to WML, while in the per-protocol analysis IVT effect modification was present for WML-Faz but not for DL-WML volume. This difference might be due to a subset of patients with a sICH that received IVT after the EVT procedure due to clot-fragmentation in the no IVT arm. Nevertheless, our findings on sICH are based on a small set of patients with sICH. Ideally, a prospective study should be performed selecting patients based on WML for IVT. Lastly, the use of WML volume in the dilated ventricle mask might not fully comprehend the complexity, and thus the etiology, of WML patterns that are more clearly described in some visual WML rating scores (13). Future research should validate the diagnostic accuracy and the associations with sICH and functional outcome of DL-WML volume in a patient-level meta-analysis considering similar studies as the MR CLEAN No-IV trial (12,19–23). Although we reported the contrast-to-noise ratio of an NCCT and the number of WMLs in an NCCT as determinants that affect spatial correspondence, future research should more clearly describe cases where deep-learning-based WML segmentation is reliable or not. Since DL-WML volume is on a continuous scale, it might be used to define a DL-WML volume threshold for excluding patients for IVT before EVT. Such a threshold could prove to be more precise for identifying patients with IVT-induced ICH and poor functional outcome than a threshold based on the Fazekas scale. Other characteristics that can be extracted from WML segmentations such as the number of WML, average WML volume, and WML shapes should be studied for outcome prediction and treatment effect estimation. Finally, automatically extracted WML characteristics could be considered for optimizing anti-platelet and anti-coagulant therapies used as secondary prevention for thrombo-embolic events at the cost of more intracerebral hemorrhages (24,25).
Data Availability
Data will be made available upon reasonable request to the authors.
Conclusion
DL-WML volume and WML-Faz had a similar relationship with functional outcome and sICH. Thus, DL-WML volume is a reliable alternative to WML-Faz. A patient-level pooled meta-analysis is required to assess whether DL-WML volume and WML-Faz are associated with a higher probability of sICH and worse functional outcome.
Acknowledgements
We would like to thank NVIDIA Corporation (Santa Clara, California, USA) for providing a GPU for model development.
Abbreviations
- AIS
- Acute ischemic stroke
- NCCT
- non-contrast CT
- DL-WML volume
- deep-learning-based white matter lesion volume
- WML-Faz
- White matter lesion load according to the Fazekas Scale
- IVT
- intravenous thrombolysis
- EVT
- endovascular treatment
- sICH/aICH
- (a)symptomatic intracerebral hemorrhage
- mRS
- modified Rankin Scale
- DSC
- Dice similarity coefficient
- HD
- Hausdorff distance in mm