Abstract
Introduction Diffusion-weighted imaging (DWI) on MRI-linear accelerator (MR-linac) systems can potentially be used for monitoring treatment response and adaptive radiotherapy in head and neck cancers (HNC) but requires extensive validation. We perform technical validation to compare a total of six DWI sequences on an MR-linac and an MR simulator (MR sim) in patients, volunteers, and phantoms.
Methods Ten human papillomavirus-positive oropharyngeal cancer patients and ten volunteers underwent DWI on a 1.5T MR-linac with three DWI sequences: echo planar imaging (EPI), split acquisition of fast spin echo signals (SPLICE), and turbo spin echo (TSE). Volunteers were also imaged on a 1.5T MR sim with three sequences: EPI, BLADE, and RESOLVE. Participants underwent two scan sessions per device and two repeats of each sequence per session. Repeatability and reproducibility within-subject coefficient of variation (wCV) of mean ADC were calculated for tumors and lymph nodes (patients) and parotid glands (volunteers). Differences in measured ADC values between sequences were quantified using Bland-Altman analysis. ADC bias, repeatability/reproducibility metrics, and SNR were quantified using a phantom.
Results In vivo repeatability/reproducibility wCV of mean ADC for parotids were 5.41%/6.72%, 3.83%/8.80%, 5.66%/10.03%, 3.44%/5.70%, 5.04%/5.66%, 4.23%/7.36% for EPIMR-linac, SPLICE, TSE, EPIMR sim, BLADE, RESOLVE. Repeatability/reproducibility wCV for EPIMR-linac, SPLICE, TSE were 9.64%/10.28%, 7.84%/8.96%, 7.60%/11.68% for tumors and 7.80%/9.95%, 7.23%/8.48%, 10.82%/10.44% for nodes. Bland-Altman analysis revealed significant differences between all sequence pairs except BLADE-EPIMR-linac and RESOLVE-SPLICE. All sequences except TSE had phantom ADC biases within ±0.1×10−3 mm2/s for most vials. MR-linac sequences had inconsistent ADC values between different vials with the same known ADC value, indicating spatial inhomogeneities. SNR of b=0 images was 87.3, 180.5, 161.3, 171.0, 171.9, 130.2 for EPIMR-linac, SPLICE, TSE, EPIMR sim, BLADE, RESOLVE.
Conclusion MR-linac DWI sequences demonstrate near-comparable performance to MR sim sequences and warrant further clinical validation for treatment response assessment in HNC.
Introduction
Diffusion-weighted imaging (DWI) is a quantitative magnetic resonance imaging (MRI) technique that measures diffusion of water molecules in tissue, a surrogate of tissue cellularity. DWI has many applications for head and neck cancer (HNC) imaging, including lesion characterization, prediction, and treatment response assessment [1–5]. Several recent studies have focused on understanding how serial DWI throughout chemotherapy and/or radiotherapy (RT) can be used to monitor response and adapt treatments based on individual response [5–12]. However, longitudinal imaging is burdensome to patients and clinicians and is generally infeasible outside of specialized research studies.
The clinical implementation of hybrid MRI/linear accelerator (MR-linac) devices has made it possible to acquire quantitative MRI sequences during every RT treatment fraction [13–17]. Current MR-linac systems enable on-line treatment plan adaptation based on changes in tumor size and shape and anatomical deformations. With further software development and validation of quantitative MRI sequences, biological image-guided adaptive RT on MR-linac systems may soon become a clinical reality [18,19].
Still, hardware modifications of current MR-linac systems to accommodate linear accelerator integration introduce additional challenges for acquiring robust quantitative MRI information. The 1.5T MR-linac employs a split gradient coil design to allow radiation beam passage, which may contribute to magnetic field gradient non-linearities [20,21]. The maximum gradient strength and slew rate of this system are lower than those of conventional MRI scanners, which necessitates longer diffusion times for the same b-value and reduces the signal-to-noise ratio (SNR) [20]. The radiolucent 2×4 channel body coil array also reduces SNR compared to other commonly used coils [22]. In light of these challenges, the MR-Linac Consortium has released guidelines for acquiring DWI on this system [20], which have informed the selection of sequence parameters in this study.
In RT, spatial accuracy of images is crucial to ensure the precise delivery of radiation. Single-shot echo planar imaging (EPI), the most commonly used readout method for DWI, is prone to severe geometric distortions and susceptibility artifacts, especially in the head and neck [23]. Turbo spin echo (TSE)-based DWI sequences have been shown to improve spatial fidelity [23,24] and are of interest for biological image-guided adaptive RT applications. However, destructive interference between spin echoes and stimulated echoes can reduce SNR in TSE-DWI. An alternative TSE-based method, “split acquisition of fast spin echo signals” (SPLICE), acquires the spin echo and stimulated echo contributions separately to preserve SNR while maintaining the spatial accuracy of TSE [25,26].
In this study, we investigate the performance of EPI, TSE, and SPLICE DWI sequences on the 1.5T MR-linac and compare them to three DWI sequences on a 1.5T diagnostic-quality MR simulation (MR sim) scanner. The MR sim sequences include EPI and two additional low-distortion sequences: “BLADE,” which uses a radial blade k-space acquisition, and “readout segmentation of long variable echo trains” (RESOLVE), a multi-shot EPI sequence. In this R-IDEAL stage 2a¹ study [27], we perform technical validation of these six DWI sequences using data from human papillomavirus-positive (HPV+) oropharyngeal cancer patients, healthy volunteers, and a diffusion phantom.
Methods
Participants and Imaging
Ten patients and ten healthy volunteers were included in this study. All participants provided written informed consent; patients were consented to the MOMENTUM observational clinical trial [28] and volunteers to an internal volunteer imaging protocol, both approved by MD Anderson Cancer Center’s institutional review board. Inclusion criteria for patients included non-recurrent, histologically confirmed HPV+ oropharyngeal cancer with no prior history of cancer therapy. All imaging occurred between diagnosis and the start of treatment. Clinical demographics are in Table 1.
Patients and volunteers were imaged on a 1.5T MR-linac (Unity; Elekta AB; Stockholm, Sweden) with a 3-D fat-suppressed T2-weighted MRI sequence and three DWI sequences: EPIMR-linac, SPLICE, and TSE. Volunteers were also imaged on a 1.5T MR sim (MAGNETOM Aera; Siemens Healthcare; Erlangen, Germany) with a multi-slice fat-suppressed T2-weighted MRI sequence and three DWI sequences: EPIMR sim, BLADE, and RESOLVE. Sequence descriptions and parameters are shown in Supplementary Tables S1 and S2. The MR-linac acquisitions used a rigid radiolucent 2×4 channel array coil [29], and the MR sim acquisitions used two 4-channel flex coils. Acquisition times (minutes) were 3.07, 7.38, 4.90, 2.93, 7.13, and 6.75 for EPIMR-linac, SPLICE, TSE, EPIMR sim, BLADE, and RESOLVE, respectively.
The diffusion b-values used were 0, 150, 500 s/mm2 for the MR-linac, 0, 500 s/mm2 for EPI on the MR sim, and 0, 800 s/mm2 for BLADE and RESOLVE. The choice of b-values for the MR sim was based on scan protocols used in clinical trials at our institution, which were used in this study without modification.
MR-linac b-values were chosen based on the MR-Linac Consortium’s recommendations [20]. However, because only two b-values were used for the MR sim images, ADC maps for the MR-linac images were reconstructed only with the 0 and 500 s/mm2 images for direct comparison to the MR sim. ADC maps were reconstructed using the built-in software on each scanner. An analysis in the Supplementary Data explores differences in ADC values and repeatability/reproducibility metrics between ADC maps reconstructed with b=0,500 s/mm2 (b0,500) and b=150,500 s/mm2 (b150,500).
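The ADC maps in this study came from each scanner's built-in software, whose internals are not described here. As a rough illustration only, the sketch below assumes the standard mono-exponential two-point model, ADC = ln(S_low/S_high)/(b_high − b_low), applied to a single hypothetical voxel. The function name `two_point_adc` and the signal values are illustrative assumptions, not part of the study.

```python
import math

def two_point_adc(s_low, s_high, b_low, b_high):
    """Mono-exponential ADC (mm^2/s) from signals at two b-values (s/mm^2)."""
    if s_low <= 0 or s_high <= 0:
        return float("nan")  # the logarithm is undefined for non-positive signal
    return math.log(s_low / s_high) / (b_high - b_low)

# Hypothetical noiseless voxel with a true ADC of 1.0e-3 mm^2/s (assumed values).
true_adc = 1.0e-3
s0 = 1000.0
s150 = s0 * math.exp(-150 * true_adc)
s500 = s0 * math.exp(-500 * true_adc)

adc_b0_500 = two_point_adc(s0, s500, 0, 500)        # b0,500 pairing
adc_b150_500 = two_point_adc(s150, s500, 150, 500)  # b150,500 pairing
print(adc_b0_500, adc_b150_500)
```

In this idealized voxel both b-value pairings recover the same ADC, so a systematic difference between b0,500 and b150,500 maps in real data would point to noise or to signal behavior that the simple mono-exponential model does not capture.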
Each study participant underwent two scan sessions per device. The first and second time points occurred at least one day apart, depending on clinical scheduling availability; mean (range) number of days between scans was 8 (1-15) for patients and 6 (1-21) for volunteers. All participants were imaged in custom RT immobilization masks to minimize motion and ensure setup reproducibility. During each session, participants were scanned twice with each DWI sequence, with a short “coffee break” out of the mask between each set to test repeatability.
In Vivo Data Analysis
A radiologist with 5 years of experience delineated the primary tumor and pathological lymph nodes (patients) and parotid glands (volunteers). One patient did not have an MR-visible primary tumor. A total of 9 primary tumors, 30 lymph nodes, and 20 parotid glands were analyzed. Regions of interest were delineated on T2-weighted images, rigidly copied to the high-b-value image of each DWI sequence, and then manually edited to account for any distortion. Segmentations were rigidly copied to corresponding ADC maps.
Repeatability/reproducibility metrics (within-subject coefficient of variation (wCV) of mean ADC) were calculated for each DWI sequence and structure type according to the Quantitative Imaging Biomarker Alliance (QIBA) consensus recommendations [30,31]. 95% confidence intervals for wCV were calculated using a chi-square statistic with n(K-1) degrees of freedom, where n is the number of sets of replicate measurements and K is the number of repeats [32]. For the repeatability (i.e. short-term) wCV calculation, there were two pairs of replicate measurements per patient/volunteer (two replicate images from the first scan session and two replicate images from the second scan session). For the reproducibility (i.e. long-term) wCV calculation, there were also two pairs of replicate measurements per patient/volunteer (the first images from the first and second scan sessions were paired, and the second images from the first and second scan sessions were paired).
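The wCV calculation and its chi-square confidence interval can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: it assumes wCV² is the mean over replicate sets of (within-set variance / squared mean), and, to stay within the Python standard library, it replaces the exact chi-square quantile with the Wilson-Hilferty approximation. The function names and ADC values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def wcv(replicates):
    """Within-subject CV from a list of replicate sets (each a list of K values)."""
    ratios = [variance(r) / mean(r) ** 2 for r in replicates]
    return math.sqrt(mean(ratios))

def chi2_quantile(p, dof):
    """Wilson-Hilferty approximation to the chi-square quantile.

    Assumption made for this sketch; it is inaccurate for very small dof,
    and an exact quantile from a statistics package would be preferable.
    """
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3

def wcv_ci(replicates, level=0.95):
    """Approximate confidence interval for wCV with n(K-1) degrees of freedom."""
    n = len(replicates)        # number of replicate sets
    k = len(replicates[0])     # repeats per set
    dof = n * (k - 1)
    w2 = wcv(replicates) ** 2
    alpha = 1.0 - level
    lower = math.sqrt(dof * w2 / chi2_quantile(1 - alpha / 2, dof))
    upper = math.sqrt(dof * w2 / chi2_quantile(alpha / 2, dof))
    return lower, upper

# Hypothetical mean ADC values (x10^-3 mm^2/s) for four replicate pairs.
pairs = [[1.00, 1.10], [0.90, 0.86], [1.20, 1.26], [1.05, 1.01]]
w = wcv(pairs)
lo, hi = wcv_ci(pairs)
print(f"wCV = {100 * w:.2f}% (95% CI {100 * lo:.2f}% to {100 * hi:.2f}%)")
```

For the repeatability calculation, each replicate set would hold the two images from one session; for reproducibility, it would hold the matched images from the two sessions.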
Bland-Altman analysis was performed between all pairs of DWI sequences to measure differences in calculated ADC values. Values from all four imaging time points were included. Mean difference (bias) values, 95% confidence intervals for the mean differences, and 95% Bland-Altman limits of agreement were calculated in JMP (v15.0.0; SAS Institute Inc.; Cary, NC, USA) using the Method Comparison add-in. Bland-Altman analysis was also performed to assess differences in ADC values between b0,500 and b150,500 ADC maps (Supplementary Data).
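For readers without JMP, the Bland-Altman quantities named above can be approximated in a few lines. The sketch below is an assumption-laden illustration rather than a reproduction of the Method Comparison add-in: it treats every paired measurement as independent (ignoring the repeated-measures structure of four time points per structure) and uses a normal approximation for the confidence interval of the bias. The function name `bland_altman` and the ADC values are hypothetical.

```python
import math
from statistics import mean, stdev

def bland_altman(a, b):
    """Bias, 95% CI of the bias, and 95% limits of agreement for paired values."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    bias = mean(diffs)
    sd = stdev(diffs)
    # 95% limits of agreement: bias +/- 1.96 standard deviations of the differences.
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)
    # Normal approximation for the CI of the mean difference (assumption;
    # a t-quantile would give a slightly wider interval for small n).
    half_width = 1.96 * sd / math.sqrt(n)
    return {"bias": bias, "bias_ci": (bias - half_width, bias + half_width), "loa": loa}

# Hypothetical mean ADC values (x10^-3 mm^2/s) from two sequences on the same structures.
seq_a = [1.10, 1.25, 0.95, 1.40, 1.05]
seq_b = [1.00, 1.20, 0.90, 1.20, 1.00]
result = bland_altman(seq_a, seq_b)
print(result)
```

One common reading of a "significant" bias is that the confidence interval of the mean difference excludes zero; whether the add-in applies exactly this criterion is not stated here.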
Phantom Data Acquisition and Analysis
Four sequential repeats of each DWI sequence were acquired on the QIBA diffusion phantom (model 128; CaliberMRI; Boulder, CO) for ADC bias, repeatability wCV and repeatability coefficient (RC), and SNR calculations. ADC bias, wCV, RC, and SNR were calculated for each sequence using methods described in the QIBA guidelines [30,31]. ADC bias was calculated for each vial by measuring the mean ADC value in each region of interest (ROI) and subtracting the manufacturer-provided ADC. The SNR calculation involved creating a “signal image” from the voxel-wise average of the four images and a “temporal noise image” from the voxel-wise standard deviation. SNR is equal to the ROI mean for the “signal image” divided by the ROI mean for the “temporal noise image.” A single repeat of each sequence was also acquired at a separate time point for reproducibility measurements, i.e. wCV and reproducibility coefficient (RDC). ADC bias was calculated for each vial, while the other quantities were calculated only in the central phantom vial, per the guidelines.
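The phantom metrics described above can be sketched as follows. This is an illustrative reading of the text, not the study's code: images are represented as flat lists of voxel intensities, the ROI as a list of voxel indices, and the repeatability coefficient uses the commonly quoted form RC = 2.77 × wSD (2.77 ≈ 1.96 × √2), which is assumed here rather than taken from the paper. All function names and numbers are hypothetical.

```python
from statistics import mean, stdev

def temporal_snr(repeats, roi):
    """ROI mean of the voxel-wise mean image divided by the ROI mean of the
    voxel-wise standard-deviation ("temporal noise") image."""
    signal = [mean([img[i] for img in repeats]) for i in roi]
    noise = [stdev([img[i] for img in repeats]) for i in roi]
    return mean(signal) / mean(noise)

def adc_bias(roi_adc_values, reference_adc):
    """Mean ADC in the ROI minus the manufacturer-provided reference ADC."""
    return mean(roi_adc_values) - reference_adc

def repeatability_coefficient(replicate_means):
    """RC = 2.77 * wSD for one vial measured repeatedly (assumed formula)."""
    return 2.77 * stdev(replicate_means)

# Hypothetical b=0 intensities: four repeats of a three-voxel ROI.
repeats = [
    [100.0, 102.0, 98.0],
    [104.0, 100.0, 102.0],
    [96.0, 98.0, 100.0],
    [100.0, 100.0, 100.0],
]
snr = temporal_snr(repeats, roi=[0, 1, 2])

# Hypothetical vial: four repeat mean ADCs (x10^-3 mm^2/s) against a reference of 1.10.
vial_means = [1.12, 1.11, 1.13, 1.12]
bias = adc_bias(vial_means, reference_adc=1.10)
rc = repeatability_coefficient(vial_means)
print(snr, bias, rc)
```

Because the noise image is built from only four repeats, the voxel-wise standard deviation is itself noisy; averaging it over the ROI, as described in the text, stabilizes the estimate.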
Results
Representative images from the six DWI sequences for a patient and a volunteer are shown in Figure 1. In vivo mean ADC values and repeatability/reproducibility wCV values are shown in Table 2. Reproducibility wCV values were higher than repeatability wCV values for all sequences and structure types except TSE in lymph nodes (repeatability 10.82% vs. reproducibility 10.44%). For the MR-linac sequences with which both patients and volunteers were imaged, repeatability and reproducibility wCV values were generally higher for tumors and nodes than for parotid glands; the one exception was SPLICE reproducibility in nodes (8.48% vs. 8.80% in parotid glands). In parotid glands, wCV values were slightly higher overall for the MR-linac sequences compared to the MR sim sequences.
Differences in mean ADC and wCV between b0,500 and b150,500 ADC maps are shown in Supplementary Table S3. Mean ADC values were consistently higher for b0,500 than for b150,500 ADC maps. Both repeatability and reproducibility wCV values were higher for the b150,500 ADC maps in nearly all cases.
Bland-Altman analysis (Figure 2, Supplementary Figure S1) revealed statistically significant biases for all pairs of DWI sequences except BLADE-EPIMR-linac and RESOLVE-SPLICE. Mean differences between sequences ranged from 0.00 (BLADE-EPIMR-linac) to 0.56×10−3 mm2/s (EPIMR sim-TSE). For the MR-linac sequences where both patient and volunteer data were compared, the SPLICE-EPIMR-linac and TSE-EPIMR-linac combinations showed different biases among structure types.
Bland-Altman plots showing the differences in calculated ADC values between b0,500 and b150,500 ADC maps are shown in Supplementary Figure S2. b0,500 overestimated b150,500 by 0.34, 0.48, and 0.31 ×10−3 mm2/s for EPIMR-linac, SPLICE, and TSE, respectively.
Phantom ADC bias results across the range of phantom ADC values are shown in Figure 3. For all MR sim sequences, the range of ADC bias values fell within ±0.1×10−3 mm2/s (except BLADE at 1.127×10−3 mm2/s). For the MR-linac sequences, EPIMR-linac and SPLICE had all ADC bias values fall within ±0.1×10−3 mm2/s for the first four vials, but both overestimated ADC at the lowest two ADC values. TSE underestimated ADC by more than 0.1×10−3 mm2/s for nearly all vials. In general, the MR sim sequences were more precise than the MR-linac sequences. A similar graph showing the ADC bias results for each individual vial in the phantom is shown in Supplementary Figure S3. This alternative representation reveals that ADC values for the MR-linac sequences tend to be consistent for each vial across replicate images but inconsistent across different vials with the same true ADC value.
Phantom ADC bias (in the central vial), %CV, repeatability and reproducibility metrics, and SNR are shown in Table 3. Tolerance values from the QIBA Profile [31] are also included for reference. All sequences except SPLICE and TSE met the ±40×10−6 mm2/s criterion for ADC bias, while SPLICE and TSE had values of -57.4 and -123.0×10−6 mm2/s, respectively. For %CV, only EPIMR sim and BLADE met the 2% threshold, but all MR-linac sequences were close (≤3.62%). RESOLVE had a much higher %CV (8.30%). For RC, all MR-linac sequences and EPIMR sim fell under 15×10−6 mm2/s, but BLADE and RESOLVE did not (26.88 and 27.93 ×10−6 mm2/s, respectively). All sequences were within the RDC limit. All sequences exceeded the SNR threshold of 50, but EPIMR-linac had the lowest SNR (87.3 for b=0).
Discussion
The goals of this study were 1) to compare the performance of DWI on the 1.5T MR-linac with a 1.5T diagnostic quality MR sim and 2) to select an optimal DWI sequence for HNC on the MR-linac. To accomplish these goals, we quantified the ADC repeatability and reproducibility, ADC bias, and SNR of three DWI sequences each on a 1.5T MR-linac and a 1.5T MR sim in vivo and in a phantom.
ADC repeatability was previously quantified for various HNCs on diagnostic MRI systems by Paudyal et al. [35]. They measured a wCV of 2.38% for lymph nodes in a mixed cohort of 9 HNSCC patients imaged with EPI on a 3T MRI. This value is substantially lower than the wCV values measured for lymph nodes with the three MR-linac sequences in our study but is also lower than the values measured for parotid glands on the MR sim, which may be attributable to field strength, gradient, and coil hardware differences.
While the current study is the first to measure ADC repeatability/reproducibility for HNC on the MR-linac, repeatability/reproducibility of quantitative imaging biomarkers have been previously quantified on the MR-linac for other disease sites. Lawrence et al. [16] measured the ADC repeatability and reproducibility of brain tumors and healthy tissues on the MR-linac and a 1.5T diagnostic scanner. wCV values on the MR-linac were within 5% and were comparable to the diagnostic system. Kooreman et al. [15] assessed the reproducibility of intravoxel incoherent motion parameters in prostate cancer patients on the MR-linac and found the RDC of the diffusion coefficient to be 0.09×10−3 mm2/s for non-cancerous prostate and 0.44×10−3 mm2/s for tumors. For a HNC cohort on a 0.35T MR-linac, Yang et al. [17] did not explicitly calculate reproducibility metrics but found the ADC of the brainstem to be within 0.47-0.57×10−3 mm2/s across repeat imaging sessions. These studies demonstrate the robustness of quantitative imaging biomarkers on the MR-linac, suggesting their potential for longitudinal quantitative imaging and biological image-guided adaptive treatments. However, extensive validation of quantitative imaging biomarkers and standardization of scan protocols across sites is necessary for large-cohort, multi-center studies [13,19].
The QIBA Diffusion Profile [31] provides acceptability criteria for phantom metrics, which are included in Table 3 for comparison to measured values. All sequences except EPIMR sim violated at least one tolerance value. However, it is important to note that these criteria are defined for system performance evaluation using an EPI sequence with specific parameters. Thus, they are not directly applicable to evaluate the sequences used in this paper but may serve as starting points for acceptability criteria based on clinical needs. In particular, the lowest ADC components of the phantom are best characterized using b=2000 s/mm2; use of lower maximum b-values can result in higher wCVs [36].
Our phantom data showed that in MR-linac sequences only, ADC values varied substantially across different vials with the same diffusivity but not across replicate images within the same vial, suggesting that spatial inhomogeneities are more significant for the MR-linac than the MR sim. Kooreman et al. [20] found similar results with a homogeneous diffusion phantom imaged on an MR-linac and MR sim. They attributed the differences to the split gradient coil design on the MR-linac, which is necessary to accommodate the radiation treatment beam but induces eddy currents closer to isocenter, causing magnetic field inhomogeneities.
Parotid gland wCV values were higher overall for the MR-linac compared to the MR sim, but differences were not severe and values were within clinically acceptable ranges. However, the spatial dependence of phantom ADC values was more substantial on the MR-linac, which may have implications for longitudinal in vivo studies as patient anatomy changes and setup uncertainty increases throughout treatment. Still, these results demonstrate that in vivo DWI of HNC is possible on the 1.5T MR-linac with acceptable repeatability/reproducibility and lay the foundation for future clinical studies.
Comparing the MR-linac sequences, EPIMR-linac and SPLICE had more accurate phantom ADC values than TSE, with the exception of SPLICE at low ADC values. For repeatability, SPLICE had the lowest overall wCV values and highest overall SNR. Based on these data and our clinical preference for a low-distortion DWI sequence, SPLICE is the optimal sequence for HNC imaging on the MR-linac. However, a disadvantage of SPLICE is the long acquisition time (7 minutes for 3 b-values).
One surprising result is that the SNR of EPIMR-linac was much lower than that of SPLICE and TSE. A likely explanation is that short inversion time inversion recovery (STIR) was used for fat suppression for EPIMR-linac and spectral presaturation with inversion recovery (SPIR) for SPLICE and TSE. STIR uses inversion recovery to nullify fat signal but reduces signal from all tissues, resulting in reduced SNR. SPIR uses a spectrally selective inversion pulse to improve SNR from non-fat tissues [37]. We used the consensus EPI protocol that had been distributed among MR-Linac Consortium [38] sites without modification for comparison across sites. Because no consensus SPLICE or TSE protocols existed, these sequences were optimized in-house, and we chose SPIR over STIR to maximize SNR. If SPIR were used for EPI, SNR and reproducibility would likely improve. However, poor fat suppression in EPI causes large chemical shift artifacts, while the non-suppressed fat in TSE and SPLICE sequences only shows a small chemical shift.
One limitation of this study was that clinical scheduling constraints prevented patients from undergoing two scans each on both the MR-linac and MR sim between the time of simulation and the start of treatment. While patients would provide the ideal comparison between the devices, healthy volunteers were included to assess differences between the systems in parotid glands. However, our data reveal that the repeatability and ADC bias behavior of parotid glands differs from that of tumors and nodes, so future investigations using patients on both the MR-linac and MR sim should confirm the findings in this study. Furthermore, the ideal comparison between the MR-linac and a diagnostic-quality scanner would be to use a 1.5T Philips MRI with the same DWI sequences, but we were limited by the MR sim and sequences available at our institution. Finally, we did not assess geometric distortion in this study, which is a major consideration for RT applications.
Conclusion
We have assessed the repeatability/reproducibility, ADC bias, and SNR of DWI sequences on a 1.5T MR-linac and MR sim for HNC both in vivo and in phantoms and demonstrated near-comparable performance between the MR-linac and MR sim. These results show that the MR-linac DWI sequences are robust and worthy of further evaluation as a quantitative method of assessing treatment response in HNC.
Data Availability
The images and segmentations used for this project will be anonymized and made available in a public data repository by the time of publication.
Footnotes
¹ R-IDEAL is an evaluation framework for radiation oncology technological advancements. Stage 2a is the “development” phase, including “technical improvements, feasibility, and safety.”