Abstract
Measurements of liver volume from MR images can be valuable for both clinical and research applications. Automated methods based on convolutional neural networks have been used successfully for this purpose with a variety of MR image types as input. In this work, we sought to determine which types of magnetic resonance images give the best performance when used to train convolutional neural networks for liver segmentation and volumetry.
Abdominal MRI scans were performed at 3 Tesla on 42 adolescents with obesity. Scans included Dixon imaging (giving water, fat, and T2* images) and low-resolution T2-weighted anatomical scans. Multiple convolutional neural network models using a 3D U-Net architecture were trained with different input images. Whole-liver manual segmentations were used for reference. Segmentation performance was measured using the Dice similarity coefficient (DSC) and 95% Hausdorff distance. Liver volume accuracy was evaluated using bias, precision, and normalized root mean square error (NRMSE).
The models trained using both water and fat images performed best, giving DSC = 0.94 and NRMSE = 4.2%. Models trained without the water image as input all performed worse, including in participants with elevated liver fat. Models using the T2-weighted anatomical images underperformed the Dixon-based models, but provided acceptable performance (DSC ≥ 0.92, NRMSE ≤ 6.6%) for use in longitudinal pediatric obesity interventions. The model using Dixon water and fat images as input gave the best performance, with results comparable to inter-reader variability and state-of-the-art methods.
1. Introduction
Making accurate measurements of liver volume is valuable for several clinical and research needs. Preparation for bariatric surgery often involves prescribed weight loss intended to reduce liver volume, which can simplify the surgical procedure and improve some outcomes (1–3). Accurate volumetry is also essential for assessing potential living liver donors to ensure that an appropriate graft size can be collected while leaving a sufficient liver remnant (4). Research studies focusing on non-alcoholic fatty liver disease (NAFLD) can benefit from quantitative measurements of liver size and fat content. NAFLD is the most prevalent chronic liver disease, affecting ∼25% of adults worldwide (5). Early stages of NAFLD may be asymptomatic, but later stages may progress to liver failure (6). NAFLD is also associated with obesity, and in adults weight loss has been associated with a decrease in both liver volume and fat content (7–9). Volume measurements are a valuable adjunct to fat content measurements because, while related, they are independent measures of the liver condition and its response to treatment (10).
Manual delineation of liver borders on either CT or MR images is currently the standard method for determining liver volume, even though it is time-consuming and subjective. Methods to automate liver segmentation have long been a subject of research. Past competitions such as SLIVER07 (11) encouraged participants to develop algorithms to accurately segment the liver from publicly available CT images, with the winning entrant giving a volume estimation error of −2.9% and a Dice similarity coefficient (DSC, a measure of volume overlap) of 0.97. Segmentation based on MR images has generally been a more difficult problem. Methods have been developed to work with a variety of MR image types as input, including anatomical images (12,13), quantitative T1 or T2* maps (14,15), and water images reconstructed from multi-echo Dixon imaging (also called chemical shift encoded imaging) (16). Techniques using convolutional neural networks (CNNs) based on the U-Net architecture (17,18) have generally shown the best performance. The no-new U-Net technique (19) won the liver MR volume task of the recent CHAOS challenge (12) with a DSC of 0.95 and error of 2.85%, using either T1-weighted anatomical images or Dixon in/out of phase images as input. In an analysis of the Hepat1ca dataset (20), which included quantitative T1 and T2* maps, Owler et al. used a 3D U-Net to produce a median DSC of 0.97 (15).
Our long-term goal is to use liver volumetry in longitudinal obesity treatment studies among adolescents. Studies in adults have shown changes in liver fat content and volume with dietary, medical, and surgical interventions (7,21,22). Such studies commonly use Dixon imaging to quantify liver steatosis (23). Dixon imaging methods generate multiple image contrasts from a single acquisition, including water, fat, T2* or R2* maps, and maps of the proton density fat fraction (PDFF). Based on prior work it is evident that these can be used as input for CNNs to segment the liver and estimate its volume, but it is not clear which image or combination of images would give the best performance. A water image alone can be used, and can give volumes that are not greatly biased by the degree of fat content (16). However, these images have low contrast where the liver adjoins the chest wall, stomach, and heart, and therefore it may be possible to improve performance by adding additional contrasts.
The primary goal of this work was to compare the performance of CNNs trained with different images derived from Dixon acquisitions to determine the best-performing set of input images. A secondary goal was to determine whether the anatomical images acquired during these studies could be used to train CNNs with comparable performance. Low-resolution T2-weighted images are commonly acquired for scout imaging in both clinical and research settings, but these have highly anisotropic spatial resolution and often feature blurring and banding artifacts. If these image series could provide accurate volume estimates, it would enable analyses of large retrospective image series that were not prospectively designed to include quantitative Dixon scans. In this study, we trained CNNs using both Dixon and anatomical images as inputs, compared their performance using standard segmentation metrics and liver volume accuracy, and evaluated their sensitivity to liver fat content. This manuscript extends the initial findings we reported in an earlier conference abstract (24).
2. Material and Methods
A total of 42 adolescents (age 13-18 years) with obesity (body mass index > 95th percentile by age and sex) were consented and completed a clinical trial approved by our Institutional Review Board (clinicaltrials.gov NCT02496611). Participants were imaged prior to any intervention on a 3 T Siemens Prismafit MR system (Siemens, Erlangen, Germany) using standard spine and body receive coils. Participants received repeated scans in the study, but only the baseline MRI scan for each subject was used for this analysis.
Imaging included two single breath-hold T2-weighted scans used for anatomical reference: an axial balanced steady-state free precession (TrueFISP) scan (TR/echo time = 399/1.56 ms, flip angle 60°, matrix 320×260, FOV 440×360 mm, 8.0 mm slice thickness, 40 slices with a 20% gap, GRAPPA acceleration R=2, readout bandwidth 1420 Hz/pixel, reconstructed to 1.4 × 1.4 mm in-plane, 18 s acquisition time) and a coronal half-Fourier acquisition single-shot turbo spin-echo (HASTE) scan (TR/echo time = 1000/86 ms, flip angles 90/124°, matrix 256×256, FOV 440×440 mm, 8.0 mm slice thickness, 17 slices with a 75% gap, GRAPPA acceleration R=3, readout bandwidth 698 Hz/pixel, reconstructed to 1.7 × 1.7 mm in-plane, 17 s acquisition time). Liver fat was quantified using a single breath-hold 3D Dixon scan with six echo times (TE = 1.15-7.23 ms, ΔTE = 1.23 ms, TR = 9 ms, flip angle = 4°, matrix 224×198×64, FOV 450×398×224 mm, CAIPIRINHA acceleration R=2×2, readout bandwidth 1060 Hz/pixel, reconstructed to 2 × 2 × 3.5 mm, 18 s acquisition time). From this acquisition the water (W), fat (F), fat fraction (FF), and T2-star (T2*) images were reconstructed using the vendor’s qdixon software.
To provide objective measurements of liver volume, the Dixon source images were deidentified and sent to a commercial provider of medical imaging processing services (3DR Labs, LLC, Louisville KY) with expertise in liver segmentation. For all datasets the liver boundaries were manually delineated on each axial slice containing liver tissue. These manual whole-liver segmentations were used as the ground truth for calculating liver volumes and fat fraction (by averaging FF over the volume) and for training CNNs.
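As a minimal sketch of how a volume and a mean fat fraction can be read off a binary segmentation, the snippet below counts mask voxels and scales by the voxel volume. It assumes the mask lives on the reconstructed Dixon grid (2 × 2 × 3.5 mm, as stated above); the function names and the toy arrays are hypothetical and are not taken from the study's code.

```python
import numpy as np

def liver_volume_ml(mask, voxel_size_mm=(2.0, 2.0, 3.5)):
    """Volume of a binary mask in millilitres (1 mL = 1000 mm^3).

    voxel_size_mm defaults to the reconstructed Dixon resolution (assumption).
    """
    voxel_volume_mm3 = float(np.prod(voxel_size_mm))
    return float(np.count_nonzero(mask)) * voxel_volume_mm3 / 1000.0

def mean_fat_fraction(ff_map, mask):
    """Average the fat fraction map over all voxels inside the mask."""
    return float(ff_map[np.asarray(mask, dtype=bool)].mean())

# Toy data (not study data): a 20 x 50 x 60 voxel block with uniform FF of 8%.
mask = np.zeros((40, 80, 80), dtype=bool)
mask[10:30, 15:65, 10:70] = True
ff_map = np.full(mask.shape, 8.0)

volume = liver_volume_ml(mask)          # 60000 voxels * 14 mm^3 = 840 mL
mean_ff = mean_fat_fraction(ff_map, mask)
```

Counting whole voxels ignores partial-volume effects at the liver boundary, which is one reason small boundary errors translate into volume errors.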
The coronal HASTE and axial TrueFISP images required interpolation and registration to align them with the Dixon images and manual segmentations. Preprocessing included interpolation of the images to match the Dixon resolution, manual cropping to reduce regions beyond the liver, and coarse manual shifting in the superior-inferior direction to correct for inconsistent breath holds. Registration was performed with an affine registration followed by a multi-resolution non-rigid B-spline registration, implemented in Python using the SimpleElastix (25) library. The registration used a gradient descent optimizer, mutual information as the agreement metric, and a bending energy penalty to constrain the degree of warping. After reviewing the results of automatic segmentation, five cases required additional warping using 20-40 manually selected control points to produce acceptable registration. Figure 1 shows a set of all images used for this study, including the co-registered anatomical images.
CNNs were trained to produce liver segmentations using the manual segmentations as ground truth. Seven distinct CNNs were trained using the same ground truth but with different input images. Three networks used a single image type from the Dixon acquisition (W, F, or T2*), and two used combinations that were input as multiple channels (WF = water and fat; WFT2* = water, fat, and T2*). Additionally, two networks were trained using the anatomical images (TrueFISP and HASTE). Note that PDFF images were not evaluated as input because they can be algebraically derived from W and F, and could result in a redundant network. For brevity, each distinct CNN will be identified by its input image set; e.g., the CNN trained using water and fat as input images will be referred to as the WF model.
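The redundancy argument can be illustrated with a short sketch: a fat fraction map follows voxel-wise from W and F as F / (W + F), so it adds no information beyond those two channels. The code also shows one way to stack images into a multi-channel input. The channels-last layout, the small `eps` guard against division by zero, and the function names are assumptions for illustration; the vendor's PDFF reconstruction may apply corrections not reproduced here.

```python
import numpy as np

def fat_fraction_percent(water, fat, eps=1e-6):
    """Voxel-wise fat fraction F / (W + F) in percent.

    eps avoids division by zero in background voxels (illustrative choice).
    """
    water = np.asarray(water, dtype=float)
    fat = np.asarray(fat, dtype=float)
    return 100.0 * fat / (water + fat + eps)

def stack_channels(*images):
    """Stack co-registered 3D images into one (x, y, z, channels) array."""
    return np.stack(images, axis=-1)

# Toy volumes (not study data): water signal 3, fat signal 1 everywhere.
water = np.full((4, 4, 4), 3.0)
fat = np.full((4, 4, 4), 1.0)

ff = fat_fraction_percent(water, fat)   # about 25% in every voxel
wf_input = stack_channels(water, fat)   # two-channel input, as for the WF model
```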
All modeling, training, and inference were performed in Python using Keras with TensorFlow (26,27). All models used the same 3D U-Net architecture (18), but with varying numbers (1-3) of input channels, as detailed in Figure 2. Training was performed with five-fold cross validation with random augmentation (3D rotation, translation, and scaling) of the input data. Each network was trained over 350 epochs with batch size 2, using DSC as the loss function and Adam (28) as the optimizer with a learning rate of 10⁻⁴.
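Using DSC as a loss requires a differentiable ("soft") form that accepts probabilities rather than binary labels. The sketch below writes one common formulation, 1 minus soft Dice, in NumPy for clarity. The exact form used in this study is not specified, so the smoothing constant and the "1 − Dice" convention are assumptions; in Keras the same expression would be written with tensor operations.

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, smooth=1.0):
    """1 - soft Dice between a binary target and predicted probabilities.

    smooth keeps the ratio defined when both inputs are empty (assumed value).
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (y_true.sum() + y_pred.sum() + smooth)
    return 1.0 - dice

# Toy targets and predictions (not study data).
target = np.array([1.0, 1.0, 0.0, 0.0])
loss_perfect = soft_dice_loss(target, target)                 # 0.0
loss_disjoint = soft_dice_loss(target, [0.0, 0.0, 1.0, 1.0])  # 1 - 1/5 = 0.8
```

Because the loss is computed on overlap rather than per-voxel accuracy, it is less sensitive to the large background class than a plain cross-entropy would be.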
The performance of the segmentations from each set of image inputs was assessed using DSC, which measures the relative intersection between true and predicted segmentations, and the 95% Hausdorff distance (HD95), which characterizes the distance between true and predicted segmentation boundaries. The accuracy of estimating total liver volume was assessed using both bias (mean error) and precision (standard deviation of error), and also by using the normalized root-mean-square error (NRMSE, RMSE divided by mean ground truth volume) as an aggregate metric of accuracy (29).
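The metrics above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the study's code: HD95 is taken as the larger of the two directed 95th-percentile surface distances (definitions vary between toolkits), surfaces are extracted by a one-voxel erosion, and bias and precision are expressed as a percentage of the mean ground-truth volume to match the NRMSE normalization.

```python
import numpy as np
from scipy import ndimage

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def surface(mask):
    """Boundary voxels: mask voxels removed by a one-voxel erosion."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~ndimage.binary_erosion(mask)

def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95% Hausdorff distance (max of the two directed 95th percentiles)."""
    sa, sb = surface(a), surface(b)
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_b = ndimage.distance_transform_edt(~sb, sampling=spacing)
    dist_to_a = ndimage.distance_transform_edt(~sa, sampling=spacing)
    return float(max(np.percentile(dist_to_b[sa], 95),
                     np.percentile(dist_to_a[sb], 95)))

def volume_accuracy(v_pred, v_true):
    """Bias, precision, and NRMSE in percent of the mean true volume."""
    v_pred = np.asarray(v_pred, dtype=float)
    v_true = np.asarray(v_true, dtype=float)
    err = v_pred - v_true
    scale = 100.0 / v_true.mean()
    return {"bias": scale * err.mean(),
            "precision": scale * err.std(ddof=1),
            "nrmse": scale * np.sqrt(np.mean(err ** 2))}

# Toy masks (not study data): a cube and the same cube shifted by 2 voxels.
a = np.zeros((30, 30, 30), dtype=bool)
b = np.zeros((30, 30, 30), dtype=bool)
a[10:20, 10:20, 10:20] = True
b[12:22, 10:20, 10:20] = True
dsc_ab = dice(a, b)
hd95_ab = hd95(a, b)
acc = volume_accuracy([1100.0, 2000.0, 2900.0], [1000.0, 2000.0, 3000.0])
```

With these definitions, squared NRMSE is approximately the sum of squared bias and squared precision, which is why NRMSE serves as an aggregate of the two.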
Analyses were performed with all data and within mean liver fat content subgroups of normal (mean FF < 5.5%) and NAFLD (mean FF ≥ 5.5%). The performance metrics (DSC and HD95) were compared between each pair of models in all three groups (all participants, normal, and NAFLD) using paired t-tests (2 metrics × 3 groups × 21 model pairs = 126 comparisons in total). Additionally, performance was compared between the NAFLD and normal groups for each metric and model using independent t-tests. Correlations between volumes, body mass index (BMI), and fat fraction were assessed using Pearson’s correlation. All statistical tests were performed in Python using SciPy 1.6.0 (30). A P value < 0.05 was considered statistically significant.
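The comparison count can be made concrete with a short sketch: seven models give 21 unordered pairs, and testing each pair for two metrics in three groups yields 126 paired tests. The metric values below are random placeholders, not study data, and the group assignment is an arbitrary stand-in; only the SciPy calls (`ttest_rel`, `ttest_ind`, `pearsonr`) correspond to the tests named in the text.

```python
from itertools import combinations

import numpy as np
from scipy import stats

models = ["W", "F", "T2*", "WF", "WFT2*", "TrueFISP", "HASTE"]
model_pairs = list(combinations(models, 2))   # 7 choose 2 = 21 pairs

rng = np.random.default_rng(0)
n_subjects = 42
# Placeholder per-subject metrics (synthetic, for illustration only).
metrics = {
    "DSC": {m: rng.normal(0.93, 0.02, n_subjects) for m in models},
    "HD95": {m: rng.normal(6.0, 1.5, n_subjects) for m in models},
}
is_nafld = np.arange(n_subjects) < 19         # stand-in grouping: 19 vs 23
groups = {"all": np.ones(n_subjects, dtype=bool),
          "normal": ~is_nafld,
          "NAFLD": is_nafld}

# Paired t-tests between models: the same subjects are scored by both models.
paired_p = {}
for metric, values in metrics.items():
    for group, sel in groups.items():
        for m1, m2 in model_pairs:
            _, p = stats.ttest_rel(values[m1][sel], values[m2][sel])
            paired_p[(metric, group, m1, m2)] = p

# Independent t-tests between NAFLD and normal groups, per metric and model.
group_p = {(metric, m): stats.ttest_ind(v[m][is_nafld], v[m][~is_nafld])[1]
           for metric, v in metrics.items() for m in models}

# Pearson correlation between two per-subject quantities (synthetic here).
bmi = rng.normal(39.0, 5.0, n_subjects)
volume = 50.0 * bmi + rng.normal(0.0, 200.0, n_subjects)
r, p_corr = stats.pearsonr(volume, bmi)
```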
3. Results
In this population of 42 participants, fat fraction ranged from 2.6% to 23.4% (mean 7.9%, standard deviation 5.7%), and BMI ranged from 31.3 to 50.8 kg/m² (mean 39.2, standard deviation 4.9 kg/m²). Twenty-three participants had liver fat < 5.5% and were classified as normal, while 19 had ≥ 5.5% and were classified as having NAFLD. Liver volumes based on the manual segmentations ranged from 1327 to 3252 mL (mean 1970.1 mL, standard deviation 421.6 mL), and were moderately correlated with both BMI (R=0.66, p<0.001) and mean fat fraction (R=0.66, p<0.001).
The segmentation performance of all seven models is shown in Figure 3. Considering all subjects, models based on Dixon WF and WFT2* gave the highest DSC values, which were each significantly higher than any other model but not significantly different from each other. These two models also gave significantly lower HD95 values than other models, with the exception that the difference between WF and W did not reach statistical significance (p=0.122). The performance of WF and WFT2* was closely followed by that of W. Figures 3a and 3b show a trend toward slightly better segmentation performance, measured by both DSC and HD95, in NAFLD cases than in normal liver, although this difference reached statistical significance in only one comparison (HD95 for the F model, p=0.032). The models based on anatomical images (TrueFISP and HASTE) gave poorer segmentation performance compared to Dixon image models, as expected due to the lower spatial resolution.
The primary goal of this study was to evaluate accuracy of liver volume measurements, as shown in Figure 4. The NRMSE (Figure 4a) and precision (Figure 4b) results show that the top three models all include the Dixon-W image (W, WF, WFT2*), with no statistically significant differences between them in pairwise t-tests. Note that each of the predicted liver segmentations tends to overestimate the liver volume (Figure 4c), although this effect is small, on the order of 1%. In these accuracy measurements, several of the models have notably different performance depending on the NAFLD status. This is an undesirable characteristic for a predictor of liver volume, as it could introduce bias in a longitudinal study where liver fat levels change over time.
Based on segmentation performance and accuracy metrics, it can be argued that the WF model is the most appropriate for estimating liver volumes in longitudinal studies of steatosis, giving overall performance of DSC=0.940, HD95=5.44 mm, NRMSE=4.2%, bias=1.1%, precision=4.1%. This model has slightly higher bias than the WFT2* model (DSC=0.939, HD95=5.31 mm, NRMSE=4.8%, bias=0.4%, precision=4.8%), but it gives greater consistency between normal and NAFLD cases in accuracy metrics.
For the models based on anatomical images, the TrueFISP model (DSC=0.921, HD95=6.82 mm, NRMSE=6.3%, bias=-2.1%, precision=6.0%) generally outperforms the HASTE model (DSC=0.920, HD95=6.98 mm, NRMSE=6.6%, bias=-0.6%, precision=6.7%). While the TrueFISP model underestimates liver volumes by ∼2%, it has slightly better precision and overall accuracy, and its bias does not vary greatly with NAFLD status.
The best-performing models based on segmentation performance (W, WF, and WFT2*) also gave the best performance based on accuracy and precision. However, the overall correlation between segmentation performance and accuracy was modest. Pooling across all models and subjects, the correlation coefficient between NRMSE and DSC was R=0.19 (p=0.001), and R=−0.27 for HD95 (p<0.001).
Two representative examples of segmentation performance with the Dixon-based images are shown in Figure 5. These examples compare the results of the best- and worst-performing models (WF and T2*, respectively) with the gold-standard manual segmentation. The two cases are representative of cases with high (left column, A, B) and low (right column, C, D) accuracy, based on the error across all models. Note that the errors in these cases are predominantly at the borders with the heart and chest wall; this trend was observed throughout the full dataset.
Figure 6 shows examples of segmentations with the T2-weighted anatomical images. While these models do not perform as well as the Dixon-based models, the segmentations have good overall performance and accuracy despite the low spatial resolution and clear presence of imaging artifacts.
4. Discussion
The best-performing CNNs based on both accuracy and segmentation metrics were the three that included the Dixon-W image as input: W, WF, and WFT2*. Relative to using just the W image, adding F and T2* images to the input set gave a modest improvement in segmentation performance but no statistically significant improvement in accuracy. Using multiple Dixon images as model inputs adds minor complexity and slightly increases training duration, but does provide a modest performance benefit. Of all the models evaluated, the WF model would be most useful for longitudinal studies in which liver fat content may change over time, as this model had the smallest bias difference between subjects with and without NAFLD. These models had moderate but statistically non-significant performance differences between normal participants and those with NAFLD, similar to the findings reported by Stoker et al., who used Dixon-W as input for their model (16).
The overall performance of the WF model in this work (DSC=0.94, error 4.2%) was comparable to but slightly lower than that of the state-of-the-art nnUNet (DSC=0.95, error 2.9%), and that of human inter-reader performance (DSC=0.95, error 1.5%), as reported in the CHAOS challenge data set (12). This can be partly attributed to the smaller training dataset used in this study (n=42 vs n=60). This level of performance is suitable for use in longitudinal studies of NAFLD treatments. Using manual segmentation, Luo et al. reported liver volume decreases of 12.2% over a 2-week liquid diet, and 21% from baseline to 1-month post bariatric surgery (7). Notably, in the 6 months post-surgery, liver volume was unchanged while PDFF continued to decrease, underscoring the independence of PDFF and volume measurements.
While the performance of models based on T2-weighted anatomical images was lower than that of the Dixon-based models, the segmentation performance (DSC≥0.92) and volume accuracy (NRMSE≤6.6%) were better than expected. Despite the thick slices, gaps between slices, and substantial blurring (in HASTE) and banding artifacts (in TrueFISP), the trained models were still able to produce good volumetric segmentations. These results can be used to design retrospective studies of larger clinical datasets which do not contain high-resolution anatomical images or Dixon acquisitions.
This study had several limitations. All images were acquired in a pediatric population, and the results may not generalize to an older population with a greater incidence of iron retention conditions, fibrosis, or other liver disease. Furthermore, this comparison used a single CNN architecture for consistent comparisons between input image types. While this architecture is widely used for medical image segmentation, other architectures may lead to different relative performance between the input image types. Finally, the ground truth segmentations used in this work were generated by one reader per case. Our review of these segmentations found them to be of good quality, but this ground truth could be strengthened by merging segmentations from multiple readers.
5. Conclusions
In conclusion, this work compared the liver segmentation performance of CNNs trained with different sets of MR images. A model using Dixon W and F as input gave the best performance (DSC=0.94, error 4.2%), with results comparable to inter-reader variability and state-of-the-art methods.
Data Availability
All data produced in the present study are available upon reasonable request to the authors.
Footnotes
Funding: This work was supported in part by NIH NIDDK R01DK105953, NIH NIBIB P41EB027061, and NIH NCATS UL1TR002494.
Abbreviations
- BMI: body mass index
- CAIPIRINHA: Controlled Aliasing in Parallel Imaging Results in Higher Acceleration
- CNN: convolutional neural network
- CT: computed tomography
- DSC: Dice similarity coefficient
- FOV: field of view
- GRAPPA: GeneRalized Autocalibrating Partial Parallel Acquisition
- HASTE: half-Fourier single-shot turbo spin echo
- HD95: 95th percentile Hausdorff distance
- MRI: magnetic resonance imaging
- NAFLD: non-alcoholic fatty liver disease
- NRMSE: normalized root mean square error
- PDFF: proton density fat fraction