Abstract
Introduction Magnetic resonance imaging (MRI) enables the investigation of pathological changes in gray and white matter at the lumbosacral enlargement (LSE) and conus medullaris (CM). However, conducting group-level analyses of MRI metrics in the lumbosacral spinal cord is challenging due to variability in CM size, lack of established image-based landmarks, and unknown scan-rescan reliability. This study aimed to improve inter-subject alignment of the lumbosacral cord to facilitate group-level analyses of MRI metrics. Additionally, we evaluated the scan-rescan reliability of MRI-based cross-sectional area (CSA) measurements and diffusion tensor imaging (DTI) metrics.
Methods Fifteen participants (10 healthy volunteers and 5 patients with spinal cord injury) underwent axial T2*-weighted and diffusion MRI at 3T. We assessed (i) the reliability of spinal cord and gray matter based landmarks for consistent inter-subject alignment of the lumbosacral cord, (ii) the inter-subject variability of MRI metrics before and after adjusting for the CM length, (iii) the intra- and inter-rater reliability of CSA measurements, and (iv) the scan-rescan reliability of CSA measurements and DTI metrics.
Results The slice with the largest gray matter CSA as an LSE landmark exhibited the highest reliability, both within and across raters. Adjusting for the CM length greatly reduced the inter-subject variability of MRI metrics. The intra-rater, inter-rater, and scan-rescan reliability of MRI metrics were the highest at and around the LSE (scan-rescan coefficient of variation <3% for CSA measurements and <7% for DTI metrics within the white matter) and decreased considerably caudal to it.
Conclusion To facilitate group-level analysis of corresponding spinal cord levels, we recommend using the slice with the largest gray matter CSA as a reliable LSE landmark, along with an adjustment for the CM length. We also stress the significance of the anatomical location within the lumbosacral cord in relation to the reliability of MRI metrics. The scan-rescan reliability values serve as valuable guides for power and sample size calculations in future longitudinal studies.
1. Introduction
The lumbosacral spinal cord (SC) contains nuclei that innervate the lower limbs and pelvic organs (Fowler et al., 2008; Krassioukov & Elliott, 2017; Sharrard, 1964). Pathological changes in the gray matter (GM) or white matter (WM) of the lumbosacral cord can lead to various dysfunctions such as motor and sensory impairments in the lower limbs, as well as lower urinary tract, sexual and bowel dysfunction (Ahuja et al., 2017; Panicker et al., 2015; Park et al., 2017).
Pathological changes in the SC can be investigated in vivo by utilizing magnetic resonance imaging (MRI) (David et al., 2019a; Seif et al., 2019). Cross-sectional area (CSA) measurements of SC, GM, and WM, derived from multi-echo gradient-echo sequences, has been utilized as indirect measures of atrophy in the cervical cord and lumbosacral enlargement (LSE) (David et al., 2021, 2022; David et al., 2019b; Huber et al., 2018; Paquin et al., 2018; Vallotton et al., 2021; White et al., 2011). Furthermore, advanced MRI techniques, such as diffusion MRI, have provided complimentary information on WM integrity (Ciccarelli et al., 2007; Cohen-Adad et al., 2011, 2013; Combes et al., 2022; David et al., 2021, 2022; David, et al., 2019b; Huber et al., 2018; Vallotton et al., 2021). In contrast, there have been only a few studies investigating the conus medullaris (CM) extending caudally from the LSE (Büeler et al., 2022; Yiannakas et al., 2019). Imaging the entire lumbosacral cord is of particular interest, as tissue damage may also occur (Konno et al., 1986) or even originate within the CM (Panicker et al., 2020).
In the lumbosacral cord, SC level-specific and group-level analyses are challenged by the mismatch between vertebral and neurological levels (Canbay et al., 2014), which precludes the use of vertebral levels as neuroanatomical landmarks. Given the difficulty of identifying neurological levels in-vivo in the lumbosacral cord (Nunès et al., 2023), previous studies have suggested the use of the slice with the largest SC CSA, here referred to as the "LSE landmark", as an image-based neuroanatomical landmark (Yiannakas et al., 2014, 2019). However, the intra- and inter-rater reliability of image-based landmarks have not been reported. Additionally, slice-wise group comparisons of MRI metrics are increasingly challenging toward the tip of the spinal cord due to the high inter-subject variability in the CM length (Yiannakas et al., 2019).
Another challenge is related to the use of automatic SC and GM segmentation techniques, which have been shown to perform well for the cervical cord (Prados et al., 2017) but have not yet been optimized for the lumbosacral cord. This is primarily attributed to the smaller size and lower signal-to-noise ratio compared to the cervical cord, as well as the close proximity of nerve roots, which can compromise SC segmentation. Consequently, manual segmentation is still the standard segmentation technique for the lumbosacral cord. For healthy volunteers, the feasibility of manual GM and WM segmentation has been demonstrated within the LSE (Yiannakas et al., 2014) and CM (Yiannakas et al., 2019), while the feasibility of diffusion tensor imaging (DTI) has been shown within the LSE (Yiannakas et al., 2016). However, none of these studies reported scan-rescan values for the CM.
The aim of this study is twofold. First, to facilitate group-level analyses, we aimed to improve the inter-subject alignment of the lumbosacral cord by comparing GM- and SC-based landmark definition methods and proposing a method to adjust for the individual CM length. Second, we computed intra-rater, inter-rater, and scan-rescan reliability of the SC, GM, and WM CSA, as well as scan-rescan reliability of the DTI metrics, within both LSE and CM, and both in healthy volunteers and patients with spinal cord injury (SCI). These patients represent a challenging imaging cohort as they often experience higher levels of involuntary motion (e.g., due to spasticity) (Holtz et al., 2017) and may have spinal instrumentation (e.g., fixative orthopedic implants) that affects image quality.
2. Methods
2.1 Study participants
Fifteen participants including 10 healthy volunteers (4 females, age (mean ± standard deviation (SD)): 32.9 ± 10.8 years) and 5 SCI patients with cervical or thoracic injuries, but without injury-related radiological structural abnormalities in the lumbosacral cord (1 female, age: 46.1 ± 20.3 years, time since injury: 5.9 ± 0.1 months) participated in this study. SCI patients had neurological impairments with mixed aetiologies, injury levels, and severities (see Table 1 for demographic and clinical information). The patient cohort was part of a randomized controlled trial investigating the effect of transcutaneous tibial nerve stimulation on the emerge of neurogenic lower urinary dysfunction following SCI (ClinicalTrials.gov identifier: NCT03965299) (Birkhäuser et al., 2020). The main inclusion criteria for all participants were: (i) no MRI contraindications, (ii) no pre-existing neurological and mental disorders, and (iii) > 18 years of age. For a comprehensive list of inclusion and exclusion criteria specific to SCI patients, refer to (Birkhäuser et al., 2020). Healthy volunteers were scanned twice, with an interval of (mean ± SD) 8.4 ± 3.0 weeks (range: 6-15 weeks) between scans. The study was approved by the local ethics committee (Kantonale Ethikkommission Zürich, BASEC-Nr. 2019-00074) and conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from all participants.
2.2 MRI acquisition
Scanning was performed on a 3T Siemens Prisma MRI scanner (Siemens Healthineers, Erlangen, Germany) equipped with a body transmit coil and a 32-channel spine matrix coil. Foam wedges were placed beneath the knees to minimize the lower spine natural lordotic curve and maximize contact between the lower back and the spine matrix coil. Motion artifacts in the lower back area were reduced by placing the legs onto a spine vacuum cushion and applying Velcro straps around the knees, hips, and chest.
A sagittal T2-weighted turbo spin echo sequence was acquired as an anatomical reference of the lumbosacral cord with 15 slices of 4 mm thickness (10% slice gap), in-plane resolution of 0.7×0.7 mm2, field of view (FOV) of 330×330 mm2, repetition time (TR) of 3 s, echo time (TE) of 89 ms, flip angle of 154°, and acquisition time of 00:59 min.
Subsequently, axial T2*-weighted images were acquired using a 3D spoiled multi-echo gradient-echo sequence (ME-GRE, Siemens FLASH) with 20 axial-oblique slices of 5 mm thickness (no gap). To account for the positional variation of the lumbosacral cord in relation to the vertebral levels (Kim et al., 2003; Saifuddin et al., 1998; Soleiman et al., 2005), the FOV was not fixed to certain vertebral levels. Instead, it was set such that the 6th most rostral slice (highlighted in red in Fig. 1A) corresponded with the maximum width of the spinal cord as observed in the sagittal T2-weighted image. This ensured coverage of both the LSE and the entire CM in all participants (indicated by the green box in Fig. 1A). A saturation band was placed anterior to the spine to suppress potential artifacts from the abdomen. The sequence parameters were as follows: in-plane resolution of 0.5×0.5 mm2, in-plane FOV of 192×192 mm2, TR of 38 ms, echo train length of 5, first TE of 6.85 ms, echo spacing of 4 ms, flip angle of 8°, 8 repetitions, GRAPPA 2x acceleration, no partial Fourier, anterior-posterior phase-encoding direction, bandwidth of 260 Hz/pixel, and acquisition time of 17:56 min.
Images for diffusion MRI were acquired using a reduced-FOV single-shot spin-echo echo planar imaging (EPI) sequence with 15 slices of 5 mm thickness (no gap) and the central slice positioned at the 9th most rostral slice of the axial T2*-weighted scan (indicated by the yellow box in Fig. 1A). To prevent fold-over artifacts, two saturation bands were respectively placed anterior and posterior to the FOV. The sequence consisted of 180 diffusion-weighted (b=800 s/mm2) and 6 T2-weighted (b=0 s/mm2) images with an in-plane resolution of 0.9×0.9 mm2, in-plane FOV of 86×32 mm2, TR of 440 ms, TE of 56 ms, no GRAPPA, 7/8 partial Fourier in the anterior-posterior phase-encoding direction, and bandwidth of 1270 Hz/pixel. The acquisition was cardiac gated, acquiring 3 slices per cardiac cycle with a trigger delay of 120 ms. The total acquisition time depended on the heart rate and was approximately 12 min.
2.3 Processing of ME-GRE images
For each repetition, the first three echoes of the ME-GRE scan were combined via root-mean-squares, as it has been demonstrated to be the optimal combination for segmenting the SC and GM within the same image (Büeler et al., 2022). The resulting combined echo was then averaged across all repetitions. The SC and GM were manually segmented according to a standard operating procedure (SOP), made available on GitHub1, using the sub-voxel segmentation tool in JIM 7.0 (Xinapse systems), providing corresponding CSA values. WM CSA was obtained by subtracting GM CSA from SC CSA. Sub-voxel segmentations were also binarized at an inclusion threshold of 100% for SC and 50% for GM to create binary SC and GM masks. A binary WM mask was created by subtracting the binary GM mask from the binary SC mask.
The raters were instructed to segment the SC and GM by drawing an isointense contour within the partial volumes along the edges of the SC and GM, respectively, while also taking into account the anatomical shape and smoothness of the SC and GM. The three raters initially segmented a separate training set consisting of three healthy volunteers, which was not included in the main analysis. These initial segmentations were then compared, and any disagreements were resolved through discussion among the raters. The resulting consensus guidelines were added to the SOP.
2.4 Processing of diffusion MRI images
The diffusion MRI images were processed using the ACID toolbox (David et al., 2023). All images were cropped to an in-plane FOV of 36×36 mm2. Eddy current and motion correction was performed using ECMOCO (Mohammadi et al., 2010) with a 3-degrees-of-freedom (DOF) volume-wise registration (translation along x and y; scaling along y (with x and y being the left-right and anterior-posterior direction, respectively)), followed by a 2-DOF slice-wise registration (translation and scaling along y). Images underwent adaptive smoothing using msPOAS (Tabelow et al., 2015), with parameters k*=5 and lambda=10, to increase the signal-to-noise ratio without introducing blurring across tissue edges. The diffusion tensor model was then fitted on the corrected images using a robust tensor fitting algorithm (David et al., 2017; Mohammadi et al., 2013) to generate maps of fractional anisotropy (FA), mean diffusivity (MD), axial diffusivity (AD), and radial diffusivity (RD).
The mean corrected diffusion-weighted image was manually segmented for SC in FSLeyes and was spatially normalized to the PAM50 template (De Leener et al., 2018) resulting in both forward (native to template space) and backward (template to native space) warping fields. Normalization was aided by labeling both the slice with the largest SC CSA and the most caudal slice of the SC (determined from the ME-GRE sequence), which were aligned with the corresponding labels in the PAM50 template. The probabilistic WM atlas, integrated into the PAM50 template, was warped into the native space using the obtained backward warping field. Using sct_extract_metric from the Spinal Cord Toolbox (v.5.8) (De Leener et al., 2017), slice-wise weighted average values of DTI metrics were extracted within the GM, WM, as well as the dorsal, lateral, and ventral WM columns. As the PAM50 atlas does not encompass the caudal half of the CM, DTI metrics were extracted only from the upper half of the CM.
2.5 Comparison of neuroanatomical landmark definition methods
The LSE landmark as an image-based neuroanatomical landmark was defined either as the slice with the largest SC CSA or the largest GM CSA. The curves of slice-wise CSA values (see Fig. 3 for examples) were either unsmoothed or underwent smoothing by moving window averaging across three adjacent slices (corresponding to 15 mm). The choice of three slices was driven by the intrinsic smoothness of the curves of slice-wise CSA values and the need for selecting an uneven number of slices (thus avoiding the need for interpolation). Overall, this resulted in a total of four LSE landmark definition methods:
SCmax: slice with the largest SC CSA without moving window averaging
SCmax,mw: slice with the largest SC CSA with moving window averaging
GMmax: slice with the largest GM CSA without moving window averaging
GMmax,mw: slice with largest GM CSA with moving window averaging
To assess the reliability of the LSE landmarks, the mean absolute deviation (MAD) of the determined slices was calculated across three sets of segmentations performed by the same rater (intra-rater reliability) and across segmentations performed by three different raters on the same set (inter-rater reliability) (see Section 2.7 for segmentation procedures).
Identifying the tip of the spinal cord (CMtip landmark) was difficult due to the large slice thickness (5 mm). Therefore, it was determined by extrapolating the curve of slice-wise SC CSA values to zero. Note that the most caudal slice where SC was segmented was usually 1-2 slices above the CMtip landmark.
2.6 Adjusting for the conus medullaris length
To adjust for the individual CM length, we divided the CM, i.e., space between the two landmarks (LSE and CMtip), into five segments of equal thickness in each subject. For example, if the CM consisted of 8 slices (40 mm) in length, each of the five segments comprised 8/5=1.6 slices (equivalent to 8.0 mm). Without reslicing the images, we extracted CSA values and DTI metrics within each segment, computed as a weighted average of the slice-wise values, where the weights represent the spatial contribution of each slice to the particular segment. This allowed us to determine MRI metrics within segments centered at the LSE and the CMtip landmarks, respectively, four segments located between them, and three segments rostral to the LSE landmark (see Fig. 2 for segments).
2.7 Intra- and inter-rater reliability of cross-sectional area measurements
A total of 10 participants including 5 healthy volunteers (1 female, age (mean ± SD): 41.4 ± 8.6 years) and 5 patients with SCI (1 female, age: 46.1 ± 20.3 years) were included in this analysis. No images had to be discarded due to motion or other artifacts. Segmentation was performed by three experienced raters (S.B., M.D.L., G.D). Two raters performed the segmentation three times and one rater performed it once, with a minimum of two weeks between each round of segmentation. The order of segmentation in each round was pseudo-randomized into three blocks using a computer-generated randomization list to ensure a balanced distribution of patients and healthy volunteers. The LSE landmark, defined as the GMmax,mw slice was determined in each subject as the median across three raters.
Slice-wise coefficient of variation (CV) of SC, GM, and WM CSA was calculated as the percent ratio between the standard deviation and mean of CSA values either (i) across three sets of segmentations performed by the same rater (intra-rater reliability) or (ii) across segmentations performed by three different raters on the same set (inter-rater reliability). The CV was then averaged across all subjects (n=10) and separately for patients (n=5) and healthy volunteers (n=5).
Intra- and inter-rater intraclass correlation coefficients (ICCs) were simultaneously computed using Eliasziw’s two-way mixed effects (Eliasziw et al., 1994), absolute agreement, single-measure model, as implemented in the relInterIntra function of the irr package in R. Two-sided (upper and lower bounds) 95% confidence intervals were additionally calculated according to the formulas in (Hayen et al., 2007). The Dice coefficient (Dice, 1945) represents the spatial overlap between masks and was computed as the size of the union of two (binary) segmentation masks, created either by the same rater (intra-rater Dice coefficient) or different raters (inter-rater Dice coefficient), divided by the average size of the two segmentations masks. Values range from 0 (no overlap) to 1 (perfect overlap). Single Dice coefficients were calculated by averaging pairwise Dice coefficients for each subject, and then further averaged across all subjects.
2.8 Scan-rescan reliability of cross-sectional area measurements and diffusion tensor imaging
The reliability analysis across both imaging sessions (scan and rescan) included the 10 healthy volunteers. Patients were not included, as their scan-rescan data might be affected by disease-related longitudinal changes. No images had to be discarded due to motion or other artifacts. Segmentation was performed by a single rater (S.B.). The LSE landmark was defined as the GMmax,mw slice.
We calculated the mean of the scan-rescan differences along with the 95% limits of Agreement Scan-rescan CV was calculated as the percent ratio between the standard deviation and mean of MRI metrics across scan and rescan. Scan-rescan ICC was computed using a two-way mixed effects, absolute agreement, single-measure model using the icc function of the irr package in R. The minimal detectable change (MDC), or smallest real difference represents the smallest longitudinal change that be confidently attributed to a meaningful change rather than mere scan-rescan variability. The MDC with 95% confidence threshold were calculated as where SDtotal denotes the total standard deviation (taken across all measurements) and ICC denotes the scan-rescan ICC (Beckerman et al., 2001). While MDC describes the minimal detectable longitudinal change in an individual, it is important to note that group studies can detect even smaller longitudinal changes by utilizing multiple subjects (Dontje et al., 2019).
3. Results
3.1 Comparison of neuroanatomical landmark definition methods
Visual comparison of different LSE landmarks is provided in Fig. 3. The inter-rater MAD values of the LSE landmarks were consistently the same or higher than the corresponding intra-rater values (Table 2). For both intra- and inter-rater analysis, the lowest MAD values were achieved when using the GMmax,mw slice as LSE landmark. The GMmax,mw slice was located (mean ± SD) 8.5 ± 4.5 mm more caudally than the SCmax,mw slice. The inter-subject variability of SC and GM CSA was lower at caudal locations when using the GMmax,mw slice as LSE landmark; however, no clear trend was observed at rostral locations (Table 3, Fig. 4).
3.2 Adjusting for the conus medullaris length
After adjusting for the CM length, the inter-subject CV of CSA measurements remained similar at and around the GMmax,mw slice, but decreased substantially caudal to it (Table 4). For example, the inter-subject CV of SC CSA was 31.5% five slices (25 mm) below the GMmax,mw slice (before adjustment), whereas it reduced to 17.4% after adjustment at roughly the same anatomical location (segment LSE-3) (compare values highlighted in bold in Table 4).
3.3 Intra- and inter-rater reliability of cross-sectional area measurements
Fig. 5 illustrates the intra- and inter-rater variability in SC and GM segmentations, while Table 5 lists the intra- and inter-rater reliability metrics for the CSA measurements. Intra-rater reliability was higher than inter-rater reliability for all CSA values and slices, indicated by lower CV and higher ICC and Dice coefficients. Intra- and inter-rater reliability was in general higher for SC CSA than for GM and WM CSA. Intra- and inter-rater CV increased considerably from the GMmax,mw slice toward the tip of the spinal cord, accompanied by a concurrent decrease in Dice coefficients. Notably, such a trend could not be observed in the ICC. Intra-rater CV values were similar between healthy volunteers and patients, but inter-rater CV values were higher in the patient group, especially at caudal locations (Table S1).
3.4 Scan-rescan reliability of cross-sectional area measurements and diffusion tensor imaging
There was no systematic bias, as measured by between the scan and rescan values for any of the MRI metrics (Tables 6-8, Fig. 6). In general, scan-rescan reliability was higher for SC CSA than for GM and WM CSA, indicated by lower CV and MDC, and higher ICC values (Table 6). Among the DTI metrics, RD tended to have the lowest scan-rescan reliability. In general, reliability was higher for DTI metrics when extracted within the entire WM rather than separate WM columns. Reliability values (with the exception of ICC values for DTI metrics) decreased considerably from the LSE segment toward the tip of the SC. Without adjusting for the CM length, the scan-rescan CV values for DTI metrics were higher (compare Tables 6-8 with Tables S2-4, respectively). The scan-rescan reliability decreased only minimally (i.e., the increase in CV was, on average, below 1 percentage point for MRI metrics extracted within the WM) when the landmarks were determined independently for scan and rescan, as opposed to when they were determined in the first scan (compare Tables 6-8 with Tables S5-7, respectively).
4. Discussion
In this study, we investigated methods for improving the inter-subject alignment of axial slice stacks within the lumbosacral cord. We found that the slice with the largest gray matter (GM) cross-sectional area (CSA) can serve as a reliable image-based neuroanatomical landmark for the lumbosacral enlargement (LSE). Adjusting for the conus medullaris (CM) length by dividing the CM into a fixed number of segments and extracting MRI metrics from those segments substantially reduced inter-subject variability, facilitating group-level analysis. Additionally, we report different aspects of reliability for CSA measurements and diffusion tensor imaging (DTI) metrics within the lumbosacral cord at 3T. We found that intra-rater, inter-rater, and scan-rescan reliability was the highest at and around the LSE landmark, and decreased toward the tip of the spinal cord.
4.1 Slice with the largest gray matter cross-sectional area is a reliable image-based neuroanatomical landmark
We found that the slice with the largest GM CSA, as opposed to the slice with largest SC CSA, can be identified more consistently across raters and across repeated assessments of the same rater. This is likely because of the sharper peak in the curves of slice-wise GM CSA values at the lumbosacral enlargement compared to the flatter peak in SC CSA (Fig. 4). Sharper peaks are easier for raters to identify than flatter peaks. Further improvement in the reliability of image-based landmarks can be achieved by smoothing the curves of slice-wise CSA values through moving window averaging across three adjacent slices, which reduces fluctuations inherent in the slice-wise CSA values (see Fig. 3 for examples).
Furthermore, when aligning the individual axial slice stacks at the LSE landmark determined based on GM CSA, as opposed to SC CSA, we noticed a slight reduction in inter-subject variability of CSA values at caudal locations. The reason for this reduction remains unclear. We argue that the slice with the largest GM CSA is more closely associated with the neurological level than the slice with the largest SC CSA, considering that the lumbosacral enlargement is neuroanatomically attributed to the enlargement of the GM.
Notably, a previous study based on post-mortem MRI utilized distinct morphological features of the ventral GM horns to characterize the lumbosacral cord across species (Toossi et al., 2019). Another study revealed a strong link between these morphological features and the motoneuron pools located within the ventral GM horns (Gross et al., 2017). We anticipate that also for in vivo studies the shape of GM would serve as an even better neuroanatomical landmark than its size. However, a higher resolution would be necessary to perform in vivo morphological analyses of GM in the lumbosacral cord. In a recent study, a distinct approach was used to directly identify neurological levels through nerve root tracing (Mesbah et al., 2023); however, this approach necessitates acquiring an additional image optimized for nerve segmentation, thereby increasing the scan time, and might not work reliably in all subjects.
Considering these observations, we argue that the slice with the largest GM CSA, rather than the slice with the largest SC CSA, serves as a superior image-based neuroanatomical landmark in the lumbosacral cord.
4.2 Improved inter-subject alignment after adjusting for the conus medullaris length
To adjust for the considerable inter-subject variation in CM length, we divided the CM into a fixed number of segments and extracted MRI metrics from these segments using a weighted average of the slice-wise values. This adjustment substantially reduced the inter-subject variability of MRI metrics at caudal locations (Table 4). This is beneficial for group-level analyses, as reduced inter-subject variability increases statistical power for the same sample size or necessitates a smaller sample size for the same statistical power. We note that another line of research is concerned with reducing inter-subject variability of MRI metrics that arise from biological variability using regression models incorporating demographic information, spine, and SC metrics (Papinutto et al., 2020a; Papinutto et al., 2020b; Yiannakas et al., 2019); however, this was not the focus of our investigation.
The decision of dividing the CM into five segments was determined by both the slice thickness (5 mm) and the range of CM lengths observed within the healthy population (Yiannakas et al., 2019). By using five segments, we ensured that segments were thicker than slices in all participants, avoiding the need for interpolating within slices. While thinner slices would allow for a higher number of segments and increased spatial specificity, this comes at the cost of reduced scan-rescan reliability. Conversely, dividing the CM into fewer segments would likely enhance scan-rescan reliability, but it would come at the expense of specificity along the rostro-caudal axis.
Spinal cord segments, created by correcting for the individual CM length (Fig. 2), do not correspond to specific neurological levels. Instead, each segment consists of a combination of neurological levels, where more caudal segments encompass more neurological levels due to their progressively shorter length (Frostell et al., 2016). Nevertheless, this approach is still consistent with neurological levels as long as their distribution within each segment remains constant across subjects.
4.3 Excellent intra-rater reliability of cross-sectional area measurements at the lumbosacral enlargement
We demonstrated excellent intra-rater reliability for CSA measurements, when using the previously optimized ME-GRE imaging protocol (Büeler et al., 2022), with intra-rater CV values below 4%, 5%, and 6% for SC, GM, and WM CSA, respectively, at and within a 15 mm radius of the LSE landmark. Notably, the inter-rater CV values for CSA measurements were approximately twice as high as the corresponding intra-rater CV values. Consequently, we strongly advocate for the segmentation of all images by the same rater.
We also highlight the importance of anatomical location for the reliability of CSA measurements. More caudal slices within the CM had lower intra- and inter-rater reliability, in line with a previous report (Yiannakas et al., 2019). This observation is likely attributed to the conus medullaris becoming thinner; for example, a minor difference between two segmentations has a larger impact when the area being segmented is smaller. Despite the increasing CV values toward the tip of the spinal cord, the intra- and inter-rater ICC values exhibit relatively consistent trends along the CM. This is because higher CV values are offset by higher inter-subject variability at more caudal locations (Table 4).
While intra-rater CV values for SC, GM, and WM CSA measurements were, on average, less than 1 percentage point higher in patients with spinal cord injury compared to controls, the inter-rater CV values were, on average, between 2 and 5 percentage points higher, with the largest differences occurring caudal to the LSE landmark. This finding suggests that, for patients, the lower quality of the ME-GRE images increases the inter-rater, but not the intra-rater variability of SC and GM segmentations. The lower quality for patients may be attributed to a higher level of involuntary motion (4 of 5 patients reported spasticity in the lower limbs). While 4 of 5 patients had spinal instrumentation, they were not in close proximity to our FOV and are therefore unlikely to affect the image quality.
The higher values for SC CSA, in comparison to GM and WM CSA, reflect the raters’ experience that SC segmentation is an "easier" task than GM segmentation. This is due to the fact that the contrast between the GM and WM is typically lower than between the WM and cerebrospinal fluid on ME-GRE images, and GM exhibits more ambiguity owing to its irregular shape (Büeler et al., 2022). The reliability of CSA measurements depends on the confidence with which manual segmentation is performed, referred to as "segmentability". SC and GM segmentability is influenced by various factors, including the rater’s experience, the subject being investigated, the imaging hardware, pulse sequence, and sequence parameters (Büeler et al., 2022). Consequently, the comparability of reliability values across studies is limited by variations in these factors. In comparison to values obtained on a 3T Philips MRI scanner (Yiannakas et al., 2019), our study showed lower intra-rater (1.6% vs. 3.5%) but higher inter-rater CV for SC CSA (4.7% vs. 3.0%) at the LSE landmark within healthy volunteers. The CV values for GM CSA were found to be lower in our study for both intra-rater (4.0% vs. 7.7%) and inter-rater (6.1% vs. 9.9%) analyses.
4.4 Scan-rescan reliability depends on the anatomical location within the lumbosacral cord
Scan-rescan reliability was generally lower compared to the corresponding intra-rater reliability. This is expected, as in addition to the variabilities captured by intra-rater reliability, scan-rescan reliability also encompasses variabilities arising from subject positioning, the position of the imaged organ, FOV positioning, and potential imaging hardware instabilities (e.g., scanner drift). The time interval between scan and rescan is another important factor; longer intervals are associated with lower reliability. In our study, we deliberately selected a relatively long time interval of 6 to 15 weeks to mimic the time frames commonly encountered in longitudinal clinical studies. Scan-rescan reliability was generally higher compared to the corresponding inter-reliability, which further emphasizes the importance of employing the same rater for segmentation.
The scan-rescan reliability of DTI metrics was found to be higher (i) in slices with larger cross-sectional area, particularly at and around the LSE landmark, as opposed to more caudal slices, and (ii) within larger ROIs (such as WM), compared to smaller ROIs (such as dorsal and ventral WM). This is because averaging DTI metrics within a larger region reduces the impact of individual outliers, small anatomical variations, and partial volume effects. Scan-rescan reliability can be further improved by averaging DTI metrics across several adjacent slices or segments, although this comes at the cost of decreased spatial specificity. For example, when averaging across three adjacent slices around the LSE landmark, as in (Yiannakas et al., 2014, 2016), we observed a reduction in scan-rescan CV from 3.4% to 2.2% for WM CSA and from 4.2% to 2.7% for FA within the WM. When adjusting for the CM length, a small improvement in the scan-rescan reliability for DTI metrics was observed, which is probably due to the larger thickness of the segments (1-2 slices) compared to a single slice.
The DTI metrics exhibited lower scan-rescan reliability in comparison to CSA measurements. This is because diffusion MRI is inherently noisier than structural MRI, and, unlike CSA measurements, the average DTI metrics within a ROI are not solely determined by the size of the ROI, but also by its location. For example, if two SC segmentations do not fully overlap but have the same area, they would yield the same SC CSA, yet different average DTI metrics within the SC. Nevertheless, we demonstrated a very high overlap between segmentation masks created by the same rater, as measured by the Dice coefficient, with values above 91% for SC, 90% for GM, and 76% for WM segmentation.
Despite differences in the cohort, MRI scanner, sequence, and processing pipeline, our scan-rescan CV for SC CSA at the LSE landmark is similar to that reported previously in healthy volunteers scanned on a 3T Philips MRI scanner (vs. Yiannakas et al., 2014; 1.8% vs. 2.0%). However, our CV for GM CSA was notably lower (2.9% vs. 7.8%). For DTI metrics extracted within the WM at the LSE landmark, our CV values were in line with previously reported values for FA (vs. Yiannakas et al., 2016: 2.8% vs. 6.0%), MD (5.5% vs. 5.0%), AD (4.7% vs. 5.4%), and RD (6.4% vs. 8.3%).
4.5 Limitations
In contrast to CSA values, we did not separately investigate intra- and inter-rater reliability for DTI metrics. This is because while obtaining CSA values involves a single manual step (SC and GM segmentation in the ME-GRE images), the processing pipeline for DTI requires several manual interventions (selection of SC midpoint, SC segmentation in both the EPI and ME-GRE images), making it challenging to isolate individual effects. However, we argue that the reported scan-rescan reliability values encompass the combined effect of all these sources of variability and are therefore most relevant for planning future studies. Additionally, DTI metrics could not be extracted from the caudal half of the CM, as this region is not covered by the PAM50 atlas. This is a recognized issue within the community and is likely to be addressed in future developments.
5. Conclusions
We provide recommendations for improved inter-subject alignment within the lumbosacral cord to facilitate group-level analyses of MRI metrics. Specifically, we propose using the slice with the largest gray matter cross-sectional area as a reliable image-based neuroanatomical landmark, along with an adjustment method for the length of the conus medullaris. We emphasize the importance of anatomical location for intra-rater, inter-rater, and scan-rescan reliability, which were the highest at lumbosacral enlargement and decreased toward the tip of the spinal cord. The provided scan-rescan reliability values serve as valuable guides for power and sample size calculations in future longitudinal studies.
Data and Code availability
Data and scripts used to generate the results may be shared with other research groups pending a formal data sharing agreement.
Author Contributions
Silvan Büeler: Conceptualization, Software, Formal analysis, Investigation, Writing - Original Draft, Visualization, Project administration
Patrick Freund: Conceptualization, Writing - Review & Editing, Supervision
Thomas M. Kessler: Conceptualization, Writing - Review & Editing, Supervision
Martina D. Liechti: Conceptualization, Investigation, Writing - Review & Editing, Supervision
Gergely David: Conceptualization, Software, Formal analysis, Investigation, Writing - Original Draft, Visualization, Supervision
Funding
This work is financially supported by the Swiss National Science Foundation (SNSF) (33IC30_179644). PF is funded by a SNSF Eccellenza Professorial Fellowship grant (PCEFP3_181362/1).
Declaration of competing interests
The authors declare no competing interests.
Supplementary materials
Acknowledgments
We thank all the volunteers who participated in this study. We also thank Veronika Birkhäuser and Oliver Gross, as well as the entire clinical team of the Department of Neuro-Urology, Balgrist University Hospital, University of Zürich for recruiting the patients and screening the healthy volunteers. Furthermore, we thank Collene Anderson and Adrian Cathomen for pseudo-randomizing the images for segmentation. Imaging was performed with support of the Swiss Center for Musculoskeletal Imaging, SCMI, Balgrist Campus AG, Zürich.