Abstract
Current Alzheimer’s disease (AD) research has a major focus on validating and discovering noninvasive biomarkers that can detect AD, benchmark disease severity, and aid in testing the efficacy of interventions. Structural magnetic resonance imaging (sMRI) is a well-validated tool used in diagnosis and for monitoring disease progression in AD. Much of the sMRI literature centers around hippocampal and other medial temporal lobe structure atrophy, which are strongly associated with cognition and diagnosis. Because atrophy patterns are complex and vary by patient, researchers have made efforts to condense more brain information into validated metrics. Many of these methods use machine learning (ML), which can be difficult to interpret clinically, hampering clinical adoption. Here, we introduce a practical, clinically meaningful and interpretable index which we call an “AD-NeuroScore.” Our approach is automated and uses multiple regional brain volumes associated with cognitive decline. We used a modified Euclidean inspired distance function to calculate the differences between each participant and a cognitively normal (CN) older adult template, adjusting for intracranial volume, age, sex, and scanner model. Here we report validation results, including sensitivity to diagnosis (CN, mild cognitive impairment (MCI), and AD) and disease severity (Clinical Dementia Rating Scale Sum of Boxes (CDR-SB), Mini Mental State Exam (MMSE), and Alzheimer’s Disease Assessment Scale-Cognitive Subscale (ADAS-11) in 929 older adults (mean age=72.7 years, SD=6.3, Range=55.1-91.5, 50% Female) drawn from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. To determine if AD-NeuroScore might be predictive of disease progression, we assessed the relationship between the calculated AD-NeuroScore at baseline and change in both diagnosis and disease severity scores at 12, 24, 36, and 48-months. We performed additional validation in all analyses, benchmarking AD-NeuroScore against adjusted hippocampal volume (AHV). We found that AD-NeuroScore was significantly associated with diagnosis and all disease severity scores at baseline. Associations between AD-NeuroScore and disease severity (CDR-SB and ADAS-11) were significantly stronger than with AHV. Baseline AD-NeuroScore was also associated with change in diagnosis and changes in disease severity scores at all time points. Performance was equivalent, or in some cases superior, to AHV. These early validation results suggest that AD-NeuroScore has the potential to be a clinically meaningful biomarker for dementia.
1. Introduction
The prevalence of Alzheimer’s disease (AD) is increasing with the aging population. According to the World Health Organization, there are an estimated 36 million patients with AD in 2021 with approximately 6.5 million new cases reported annually (WHO, 2021). AD is a progressive neurodegenerative disease characterized by cognitive decline and early atrophy within the medial temporal lobe (MTL) followed by progression to the rest of the brain. Neurodegeneration often precedes cognitive decline (Gómez-Isla et al., 1996a), making atrophy assessed with structural magnetic resonance imaging (sMRI) an ideal clinical biomarker for detecting AD early and forecasting future cognitive and functional deterioration. Detecting AD prior to the point of irreversible neurodegeneration could improve the efficacy of currently available treatments (Coupé et al., 2015a) and improvements to current quantitative sMRI would add value in both clinical and research settings.
At later stages of AD, additional and validated quantitative sMRI metrics could add clinical utility while monitoring disease progression. Neurocognitive assessments become more burdensome and less valuable as patients convert to moderate and severe AD. Regardless, there is still high clinical value in tracking a patient’s disease progression and anticipating future deterioration throughout all stages of illness. sMRI is a passive, relatively short and inexpensive measurement, making it an ideal modality for monitoring disease progression in general, but especially in later stage AD patients.
sMRI has wide adoption in both clinical and research settings. The majority of the sMRI literature centers around hippocampal or other MTL measures. The magnitude of hippocampal volume atrophy is strongly associated with cognition, diagnosis, and AD-etiology (Csernansky et al., 2004; Gosche et al., 2002; Jack et al., 2002; Silbert et al., 2003). Hippocampal shape (Achterberg et al., 2014), texture (Sørensen et al., 2016), grading (Coupé et al., 2015b) and subfield volumes (Khan et al., 2015) are all validated hippocampus-based markers of AD-related etiology. The MTL has generally been a focus due to its critical role in memory and because MTL atrophy is often detectable in cognitively normal (CN) individuals before symptoms of cognitive decline arise (Jack et al., 1997).
However, other parts of the brain also provide important information, especially in the differential diagnosis of mild cognitive impairment (MCI), AD, and frontotemporal lobe dementia (FTD), which have overlapping atrophy within the MTL (Rabinovici et al.,2007). A large body of research on AD-related etiology shows volume loss (Rabinovici et al., 2007) that quickly spreads from the MTL (Braak et al., 1997; Dickerson et al., 2001; Frisoni et al., 1999; Gómez-Isla et al., 1996b; Laakso et al., 1998, 1996; Rabinovici et al., 2007; Thompson et al., 2007, 2003; Vercelletto et al., 2002) to the parietal lobe (Jacobs et al., 2012a), where grey matter (GM) and white matter (WM) loss likely occurs relatively early in cortical progression and is linked functionally to important AD symptoms such as impaired memory retrieval, naming ability, attention, and executive function (Cabeza et al., 2008; Jacobs et al., 2012b; Lindeboom and Weinstein, 2004). Over time and with variability in the order and rate of decline amongst patients, neurodegeneration spreads to posterior temporal, lateral occipital (Rabinovici et al., 2007), and left frontal lobes as well as limbic structures including the thalamus, cingulate gyrus, and nucleus accumbens (Nie et al., 2017). Because the underlying pattern of atrophy is important to consider in each patient, clinical imaging reports that quantify regional atrophy are standard in many dementia clinics and regional volumes are common endpoints in clinical research.
Several medical imaging companies have implemented regional brain volume reports in clinical settings (Ahdidan et al., 2016; Brewer et al., 2009; Cavedo et al., 2022). These tools are helpful in understanding the overall pattern of neurodegeneration. However, the spatiotemporal pattern of atrophy varies widely in patients, and it is difficult to quickly assess a patient’s overall neurodegeneration from several numbers, reducing utility in diagnosis, which is most commonly done using neurocognitive assessment. Further, patients often receive scans at different facilities or on different scanners, and many of the current commercial tools do not account for these differences, affecting the validity of longitudinal follow-up reports. A valid, harmonized summary score with norms and cutoffs would be an important addition to existing, clinically-implemented regional brain volume reports. In research settings, several regions are frequently used as endpoints (e.g., hippocampal, entorhinal cortical, lateral ventricle, and whole brain volumes). Depending on the design and sample size of the study, reducing the number of regions to be considered as endpoints could enhance sensitivity. Additionally, a validated, harmonized score that could be compared across multiple research studies could be valuable for meta-analyses and help researchers build on each other’s findings.
Because atrophy patterns are complex and vary by patient, making the qualitative interpretation of several individual region of interest (ROI) results laborious, some researchers have condensed sMRI information into validated metrics for AD-related neurodegeneration. Early sMRI-based, multivariate analysis approaches have focused on the classification of CN and AD (Riciotti et al., 2012a, Vermuri, 2009). As the field has matured, researchers have aimed to classify more disease stages (i.e. CN, MCI, and AD) (Rallabandi et al., 2020; Popuri et al., 2020) as well as predict who will stabilize and who will decline (Coupé et al., 2015c; Ezzati et al., 2019; Popuri et al., 2020b). Most of these multivariate scoring methods use machine learning (Casanova et al., 2018; Diciotti et al., 2012a; Vermuri, 2009) and several combine sMRI with other imaging, biochemical, or clinical features to improve performance (Dukart et al., 2013; Salvatore et al., 2018; Vemuri et al., 2009a).
The objective of this study is to develop and validate a practical, clinically meaningful, and interpretable biomarker, which we call an “AD-NeuroScore” by summarizing AD-related neurodegeneration using only sMRI-based measurements. Our approach uses a vector of multiple sMRI ROI-based features that are associated with cognitive decline in an intuitive manner that is easy to implement. ROI-based features from only one type of image were selected because they are widely adopted by researchers and clinicians and because their significance is well understood. This is in contrast to multivariate models that successfully use machine learning approaches to generate probability scores (Casanova et al., 2018b; Diciotti et al., 2012b; Lu et al., 2021; Popuri et al., 2020b; V P.Subramanyam Rallabandi et al., 2020; Vemuri et al., 2009a) or more narrowly focused regional scores (Achterberg et al., 2014; Ahdidan et al., 2016; Brewer et al., 2009; Coupé et al., 2015d; Elahi et al., 2015; Sean M Nestor et al., 2008; Sørensen et al., 2016). Machine learning-based, black box models are powerful but have been have not yet been widely implemented in neurological practices. Model interpretability is an important barrier to clinical adoption despite the need for quality control in ROI-based approaches (Pinto et al., 2022). Further, while it is possible to understand the algorithms behind the ML models, this is not the same as having a clinically interpretable and meaningful metric. Here we developed an automated approach to assist in clinical practice, resulting in a summary score calculated using only well-understood sMRI regional volumes, where the relative contribution of each ROI is transparent. More narrowly focused ROI-based approaches are interpretable (Achterberg et al., 2014; Ahdidan et al., 2016; Archetti et al., 2021; Brewer et al., 2009; Casanova et al., 2013; Coupé et al., 2015d; Davatzikos et al., 2009; Sørensen et al., 2016), however, they, in some cases have difficulty capturing a significant portion of sMRI features due to the loss of information in the process of combining to a single metric. The biomarker we propose here, AD-NeuroScore, is automated, computationally inexpensive, and relies on widely adopted and validated sMRI features. We believe these attributes could enable widespread research and clinical deployment.
2. Methods
2.1. Participants
2.1.1 Data Acquisition and Demographics
CN individuals and patients with MCI or AD diagnosis were drawn from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) (Jack et al., 2008; Mueller et al., 2005a, 2005b). The ADNI is a global research study launched in 2003, primarily aimed at evaluating biological markers to measure the progression of MCI and early AD. Biomarkers will be derived from MRI, PET, and blood and cerebrospinal fluid (CSF)-based biospecimen assays and their discovery and validation are intended to facilitate the development of treatments to slow or halt AD progression. For up-to-date information, see www.adni-info.org.
All ADNI studies were conducted according to the Good Clinical Practice guidelines, the Declaration of Helsinki, the U.S. 45 Code of Federal Regulations (CFR) Part 46 and 21 CFR Part 50 – Protection of Human Subjects, and 21 CFR Part 56 - Institutional Review Boards. Written informed consent and HIPAA authorizations were obtained from all participants or authorized representatives prior to the conducting of protocol-specific procedures. The ADNI protocol was approved by the Institutional Review Boards of all participating institutions, listed in File 1 in Supplementary Materials (Mukherji et al., 2021).
In this work, a total of 1,619 subjects with available 3T, accelerated, T1-weighted MR images at baseline were collated from the ADNI-GO, ADNI-2, and ADNI-3 study phases; of these, 388 participants were discarded due to incomplete data used to calculate or validate the biomarker, including diagnosis (n=42), sex (n=45), age (n=8), scanner manufacturer (n=3), and at least one brain volume (n=298). Acquisition methods are detailed by Chow et al. (Chow et al., 2015). The remaining cohort of 1,231 individuals was comprised of 488 CN, 564 MCI, and 179 mild AD. Participants were diagnosed at baseline and reassessed at each study visit (Petersen et al., 2010). The study sample was subdivided into the following three cohorts: region of interest (ROI) selection, cognitively normal template creation, and experimental (see Table 1 for a full description of all cohort demographics).
2.1.2 Region of Interest Selection Cohort
To determine the regional volumes that were associated with cognitive impairment, a sample of 150 age- and sex-matched individuals from the overall participant pool were pseudo-randomly selected using the numpy random module (Harris et al., 2020) with equivalent proportions of each diagnostic category (see Table 1 for a full description of ROI selection cohort demographics) and grouped into a cohort in order to quantitatively determine ROI inclusion and weighting (see Methods 2.3.2. ROI Selection).
2.1.3 Cognitively Normal Template Cohort
A sample of cognitively normal, age- and sex-matched individuals were randomly selected from the remaining participant pool to generate a CN template. The purpose of this template was to represent the average, healthy older adult brain (see Table 1 for a full description of CN template demographics). The sample size for this cohort was set to 152 individuals, to match that which was drawn from a normative adult population to construct the MNI152 template, a classic and widely adopted structural brain atlas based on the group-wise registration of 152 3D, T1-weighted MR images (Mazziotta et al., 1995a, 1995b).
2.1.4 Experimental Cohort
The remaining 929 participants were grouped into an experimental cohort for sMRI metric extraction and testing (see Table 1 for a full description of experimental cohort demographics). Baseline analyses were conducted for all 929 participants of the experimental cohort.
Longitudinal analyses were conducted on all participants who had follow-up data available. These participants were classified as those who worsened in diagnosis (denoted as Diagnosisdecline) and those who did not change in diagnosis or, in rare cases, improved (denoted as Diagnosisstable), based on their diagnosis at 12, 24, 36 and 48-month follow-up sessions. Diagnosisdecline is thus composed of those who declined from CN to MCI or CN to AD (CNdecline) and from MCI to AD (MCIdecline); Diagnosisstable is the group who either did not change in diagnosis or improved (composed of MCIstable, CNstable, and ADstable). There was no later stage diagnostic category than AD in this sample, and therefore all AD participants were ADstable, with no ADdecline group (see Figure 1 for a flowchart detailing the construction of longitudinal cohorts).
2.2 Neurocognitive Assessments
Participant scores on the Clinical Dementia Rating Scale Sum of Boxes (CDR-SB), Alzheimer’s Disease Assessment Scale Cognitive Subscale (ADAS-11), and Mini-Mental State Exam (MMSE) were collated for all experimental cohort participants with scores available at each time point. The original ADAS-Cog (Rosen et al., 1984), referred to as the ADAS-11, is comprised of 11 subtests whose scores are summed (unweighted) to generate a total raw score from 0 to 70. Higher scores on the ADAS-11 correspond with a larger number of errors and poorer performance, as rated by clinicians (Grochowalski et al., 2015). The MMSE is a short, 11 items questionnaire routinely used to measure cognitive impairment (Folstein et al., 1975). The MMSE has six subdomains: time orientation, place orientation, registration, attention, recall, language, and visual construction and yields a score from 0 to 30, with lower scores representing greater cognitive dysfunction The CDR-SB (Balsis et al., 2015; Hughes et al., 1982; Morris, 1993; O’Bryant et al., 2008) similarly measures cognitive function, along with functional ability, as rated by clinicians across six domains (also referred to as boxes): memory, orientation, judgement and problem-solving, community affairs, home and hobbies, and personal care. Individual scores for each domain (or box) are summed to produce a total score ranging from 0 to 18, where higher scores stage more severe dementia (O’Bryant et al., 2008)
2.3 AD-NeuroScore
Cortical reconstruction and volumetric segmentation of the 3T, accelerated, T1-weighted MR images was performed with version 7.1 of the Freesurfer image analysis suite, which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu/). The details of these procedures are described in previous publications (Dale et al., 1999; Desikan et al., 2006; Fischl et al., 2004a, 2004b, 2002, 2001, 1999a, 1999b; Jovicich et al., 2006; Reuter et al., 2010; Segonne et al., 2007, 2004). With these methods, eighty-four cortical and subcortical regional volumes were estimated at baseline for each subject. Additionally, results were reviewed using the ENIGMA structural imaging quality control protocols (http://enigma.usc.edu/) (Stein et al., 2012).
2.3.2. ROI Selection
In order to determine the brain regions most sensitive to diagnosis at baseline, we first performed an Analysis of Variance (ANOVA) in our ROI selection cohort to test for an effect of diagnosis for each of the 84 regions (using a Bonferroni-corrected alpha of 0.05/84; see Supplementary File 2 for a complete list of the 84 cortical and subcortical regions tested). Forty-one regions were found to be significant. All these regions were consistent with the existing literature (Harper et al., 2017; Yin et al., 2013; Zanchi et al., 2017) and were used to compute AD-NeuroScore (Table 2).
2.4.1. Cognitively Normal Template
To generate a CN template vector representative of the healthy average older adult brain, each volume that was determined from the ROI selection process was averaged across all CN template cohort participants (see Table 2). Data harmonization procedures were applied to each region in the CN template vector (see Methods 2.4.2. Data Harmonization). This CN template vector had the same dimensions as vectors extracted from experimental cohort participants. These two vectors were used to compute the distance metrics evaluated in this work (see Methods 2.4.3. Calculating the Z-Weighted Euclidean Distance).
2.4.2. Data Harmonization
In order to account for factors such as interindividual variations in head size, age, sex, and MRI acquisition features like scanner model and manufacturer, we utilized a w-score, a data harmonization approach previously validated on sMRI data (Ma et al., 2019; Popuri et al., 2020a). This approach uses a generalized linear model (GLM) framework, where the structural volume for a given region of each participant is modeled as the linear combination of all covariates as shown in Equation (1): Here, the volume of region r for subject i, is modeled as the sum over NC covariates (age, sex, scanner model, and total ICV) of coefficient , for covariate c, multiplied by the value of the covariate, , plus the residual term, .
To compute the w-score, , also known as the standardized residual, we took the z-transformation of the residual term from Eq. (1), , as shown in Equation (2): where and represent the mean and standard deviation, respectively, of the residual term for a given subject’s regional brain volume. In this way, regional volumes used to calculate AD-NeuroScore were adjusted for the effects of age, sex, head size, and scanner model.
2.4.3 Calculating the Z-Weighted Euclidean Distance
To compute the difference between each experimental cohort participant and the CN template, we employed a novel modified Euclidean inspired distance function which we call the Z-weighted Euclidean distance (ZWE). A traditional Euclidean distance metric is calculated by treating the list of significant, harmonized regional volumes as a vector in n-dimensional space, where n is the number of regions, and computing the Euclidean distance between each participant and the CN template vectors. Our ZWE distance function differed only in that each region was multiplied by a weight resulting from each region’s level of significance (z-score) as determined during ROI selection process as shown in Equation (3): where Zi refers to the average of the z-scores associated with the p-values across the three possible pairwise comparisons (CN vs MCI, CN vs AD, and MCI vs AD) for a given region, i, and NR denotes the total number of regions. The ZWE distance was computed between the harmonized residual of each subject, , and the CN template, . Thus, the more significant a region was in ROI-selection, the greater its contribution to the computed AD-NeuroScore. The concept of weighing multivariate sums by measures of significance has been used in a number of applications since its inception in the proposal of Hotelling’s T2 distance in 1947 (Gutman et al., 2013; Hotelling, 1947; Hua et al., 2013). The significant ROIs listed in Table 2 are visualized in Figure 2, with a color scale indicating respective Zi weighting.
We investigated several other methods for computing the difference between each experimental cohort subject and the CN template. Each of these algorithms similarly involved computing a distance function between like objects constructed from the harmonized, significant regional brain volumes, which we determined in the ROI selection step. Generally, the distance functions fell into one of two categories (curves and points). For the distance functions that computed distances between curves, in addition to projecting the curve in one dimensional space, the k-false nearest neighbors method was implemented to embed each curve in n-dimensions (see Supplementary File 3 for further details). The functions which computed a distance between curves included the Fréchet distance and the Hausdorff distance. The Fréchet Distance is one method used to quantify the similarity between two curves or sets of points in space, with an emphasis on the location and ordering of points (Dumitrescu and Rote, 2004). Similarly, the Hausdorff distance measures how far two subspaces of a metric space are from each other; however, it does not account for the flow or order of points (Maiseli, 2021). The vector-based distance functions included the Euclidean and ZWE distance. Further details along with equations modeling these mathematical distances are included in Supplemental File S3.
Statistical analyses tested the performance of each distance metric based on sensitivity to diagnosis, disease severity, and progression (See Tables 1-3 in Supplementary Materials). All metrics evaluated were benchmarked using AHV, an NIA-AA diagnostic biomarker for Alzheimer’s disease (Jack et al., 2018) (Supplemental Table 1). The optimal biomarker, the ZWE distance, was selected as our “AD-NeuroScore.” (See 2.5 Validation Procedures).
2.5. Validation Procedures
2.5.1. Baseline Validation Procedures
At baseline, we evaluated sensitivity to diagnosis of AD-NeuroScore by calculating the area under the Receiver Operating Characteristic curve (AUC-ROC) and the associated 95% confidence intervals, using a logistic regression model for each pairwise comparison of diagnostic groups (CN vs MCI, MCI vs AD, and CN vs AD). Sensitivity to diagnosis was further assessed using pairwise, two-tailed t-tests for each possible diagnostic group comparison, using a Holm-Bonferroni-corrected alpha of 0.05. Results were then converted to z-scores and Cohen’s d was calculated to indicate effect size.
Baseline associations of AD-NeuroScore and AHV with disease severity, operationalized as MMSE, ADAS-11, and CDR-SB scores, were tested using linear regression, both in the overall baseline experimental cohort and in each diagnostic sub-group. In addition to examining the significance of the slope using a Holm-Bonferroni-corrected alpha of 0.05, we also estimated the correlation coefficients and compared the performance of AD-NeuroScore and AHV by conducting a Fisher’s z-test of the z-transformed correlation coefficients, using the Holm-Bonferroni method to adjust for the 3 comparisons at baseline.
2.5.2. Longitudinal Validation Procedures
To determine if AD-NeuroScore might be predictive of disease progression, we assessed the relationship between AD-NeuroScore at baseline and both the change in diagnosis and change in disease severity at 12, 24, 36, and 48 months. Logistic regression was used to examine whether baseline AD-NeuroScore was predictive of change in diagnosis (Diagnosisdecline vs. Diagnosisstable); AUC-ROC (and associated 95% CI) was used as a metric of the predictive ability. We also compared the baseline distribution of AD-NeuroScore between Diagnosisdecline and Diagnosisstable groups using pairwise, two-tailed t-tests in the full experimental cohort and subsequently further stratified by baseline diagnosis (MCI or CN) to investigate if the ability to predict decline is driven by a specific patient population. Similar to the baseline validation procedures, the results for all comparisons were converted to z-scores, and Cohen’s d was calculated to assess effect size.
Longitudinal association with disease severity was tested using linear regression between baseline metric scores and the change in the neuropsychological assessment scores (MMSE, ADAS-11, and CDR-SB) from baseline, at each respective longitudinal session. Significance was assessed using a Holm-Bonferroni-corrected alpha of 0.05. All absent longitudinal comparisons were excluded due to insufficient sample size.
2.6. Validation Using Clinically Implemented ROI
To validate AD-NeuroScore in the context of the alternative imaging analysis tools frequently used by and accessible to clinicians, a vector of Neuroreader®-analagous ROI volumes was approximated by transforming the FreeSurfer results into a less granular vector of regions closely matching the Neuroreader® atlas. To investigate how these changes might impact biomarker performance, 80 of the 84 brain regions estimated by FreeSurfer were surjectively mapped to construct the 22 Neuroreader® ROI structures. Four of the eighty-four FreeSurfer-estimated regions, the left and right nucleus accumbens and insula, had no corresponding Neuroreader® ROI, and were thus dropped from this branch of the analysis. The calculation and validation methods used to compute and validate the Freesurfer-based AD-NeuroScore were applied to the pseudo-Neuroreader® ROI vector (Supplementary Tables S4-S7).
2.7. Data Availability
Full, open access to all de-identified ADNI imaging and clinical data is publicly and freely available to individuals who register with the ADNI and agree to the conditions in the “ADNI Data Use Agreement,” upon approval of a request that includes the proposed analysis and the named lead investigator (contact via http://adni.loni.usc.edu/data-samples/access-data/). Additional details about the ADNI data acquisition and sharing policies can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_DSP_Policy.pdf.
The code supporting the findings of this study is openly available in [repository name: “AD-NeuroScore”] at https://github.com/jbramen/AD-NeuroScore. The Python programming language (version 3.9) was used for all analyses (Python Software Foundation, https://www.python.org/).
3. Results
3.1. Baseline Validation
3.1.1. Baseline Sensitivity to Diagnosis
AD-NeuroScore was significantly associated with diagnosis at baseline (p<0.001 for all comparisons, Holm-Bonferroni corrected; Table 3). AD-NeuroScore performed best at distinguishing AD from other groups (CN and MCI) and least well at distinguishing CN and MCI participants. AD-NeuroScore performed as well as our benchmark, AHV (p<0.001 for all comparisons, Holm-Bonferroni corrected), in this cross-sectional validation (Table 3). AUC-ROC values of AD-NeuroScore and AHV were similar across all group comparisons. Visual inspection of the overlaid AD-NeuroScore and AHV AUC-ROC curves for AD comparisons (CN vs AD and MCI vs AD) indicated that at low false positive rates, the AD-NeuroScore true positive rate tended to be higher (Figure 3). Examining the distribution of AD-NeuroScore and AHV for each diagnostic group at baseline (Figure 4), AD-NeuroScore qualitatively demonstrated greater separation between diagnostic category medians, with more centrally concentrated distributions and longer tails in CN and MCI groups at the edges of the distribution.
3.1.2. Baseline Association with Disease Severity (MMSE, ADAS-11, and CDR-SB)
In the overall baseline experimental cohort, we found that AD-NeuroScore was significantly associated with disease severity, as measured by MMSE, ADAS-11, and CDR-SB scores (Table 4; Figure 5; all p-values < 0.001, Holm-Bonferroni corrected). In sub-analyses stratified by baseline diagnosis, we found significant associations between AD-NeuroScore and metrics of disease severity in participants with MCI (MMSE, ADAS-11, and CDR-SB) and AD (ADAS-11, and CDR-SB). Conversely, we found no significant association with disease severity in CN individuals.
AD-NeuroScore generally performed as well or better than our benchmark, AHV, in this cross-sectional validation using the overall baseline experimental cohort (Table 4; Figure 5). Results from z-tests conducted using Fisher z-transformed correlation coefficients revealed that associations between both CDR-SB and ADAS-11 and AD-NeuroScore were significantly stronger than with AHV (p=0.006 for CDR-SB and p=0.024 for ADAS-11). Sub-analyses stratified by baseline diagnosis revealed that AD-NeuroScore and AHV performed the most similarly in participants with MCI, and the most differently in participants with AD (AD-NeuroScore: p<0.01 for ADAS-11 and p<0.05 for CDR-SB, both Holm-Bonferroni corrected, AHV: NS for all).
3.2. Longitudinal Validation
3.2.1. Metric Sensitivity to Progression
In the overall baseline experimental cohort, we found that baseline AD-NeuroScore differentiated Diagnosisstable from Diagnosisdecline significantly at all timepoints (12, 24, 36, and 48-months; Table 5; p<0.001, Holm-Bonferroni corrected, for all). Sub-analyses stratified by baseline diagnosis revealed that AD-NeuroScore’s ability to predict decline was primarily driven by individuals with MCI at baseline (p< 0.001, Holm-Bonferroni corrected, for all). We found no significant effects in CN individuals.
We found that AD-NeuroScore performed as well as AHV in the overall experimental cohort (AHV: p<0.01 at 12-months, p<0.001 at all successive sessions, Holm-Bonferroni corrected). While there were no significant differences between AD-NeuroScore and AHV, qualitatively, AD-NeuroScore tended to do modestly better than AHV at differentiating Diagnosisstable, from Diagnosisdecline participants at 12- and 24-months, and slightly worse at 36- and 48-months. Sub-analysis by baseline diagnosis revealed similar patterns; AD-NeuroScore performed equivalently to AHV in individuals with MCI. However, AHV demonstrated somewhat
3.2.2. Longitudinal Association with Disease Severity (MMSE, ADAS-11, and CDR-SB)
In the overall longitudinal cohort, we found that baseline AD-NeuroScore was also significantly associated with the change in disease severity scores (Table 4; p <0.001, Holm-Bonferroni corrected, for all), as measured by the change in MMSE, ADAS-11, and CDR-SB scores from baseline to each time point (12, 24, 36, and 48-months). Sub-analysis by baseline diagnosis revealed that AD-NeuroScore’s association with the change in disease severity scores was also primarily driven by individuals with MCI at baseline. We found significant associations between AD-NeuroScore and the change in MMSE, ADAS-11, and CDR-SB scores in participants with MCI at all time points (p<0.001, Holm-Bonferroni corrected for all). Conversely, we found no significant association with the change in disease severity in AD individuals, and only two significant associations in CN individuals (CDR-SB at 12- and 24-months; p<0.05, Holm-Bonferroni corrected, for both).
AD-NeuroScore generally performed as well or better than our benchmark, AHV, in this longitudinal validation (Table 4). As with AD-NeuroScore, correlations between AHV at baseline and change in disease severity scores were significant in the overall longitudinal cohort (Table 4; p<0.001, Holm-Bonferroni corrected, for all timepoints). Results from z-tests conducted using Fisher z-transformed correlation coefficients revealed that correlations between AD-NeuroScore and 24-month change in ADAS-11 were significantly stronger than with AHV (p=0.003, Holm-Bonferroni corrected). In the MCI group, AD-NeuroScore performed as well or somewhat better than AHV (AD-NeuroScore: p<0.001 for all; AHV: p<0.05 for MMSE across all time points, p<0.01 for ADAS-11 at 12-months, and p<0.001 for all other scores and time points, all Holm-Bonferroni corrected).
In the AD group, neither AD-NeuroScore nor AHV were significantly associated with change in disease severity scores. The AD sub-analysis was limited by sample size longitudinally. In the CN group, both AD-NeuroScore and AHV had generally weak correlations, none of which were significant aside from AD-NeuroScore and CDR-SB at 12- and 24-months (both p<0.05, Holm-Bonferroni corrected).
3.3. Alternative Atlas Analysis
3.3.1 Baseline and Longitudinal Validation
Results from analyses repeated using the Neuroreader® ROI-based AD-NeuroScore were similar to those obtained with the Freesurfer-based AD-NeuroScore, including all cross-sectional and longitudinal analyses. Benchmarking the pseudo-Neuroreader® ROI-based AD-NeuroScore against AHV also produced results analogous to those of the original analysis (Supplementary Tables S5-S7).
4. Discussion
We developed a Euclidean inspired sMRI-based distance metric, AD-NeuroScore, which was significantly associated with diagnosis (CN, MCI, and AD) and disease severity (MMSE, ADAS-11 and CDR-SB scores) at baseline. AD-NeuroScore performed as well as AHV, the most commonly used sMRI measure of AD-related neurodegeneration, and demonstrated comparably strong correlations with disease severity scores. AD-NeuroScore was also significantly associated with changes in the three disease severity scores over a relatively long follow-up period of 48 months. We found that these associations were largely driven by the MCI group. It is possible that the limited sample size of CN participants and patients with AD with longitudinal scores available resulted in the CN and AD sub-analyses being underpowered to detect a relationship.
Moreover, AD-NeuroScore was also able to differentiate between participants who declined in cognitive status (from CN to MCI or AD; from MCI to AD) and those who remained stable in longitudinal analyses; this was also true of AHV. Qualitatively, AD-NeuroScore tended to perform modestly better than AHV at earlier time points (12- and 24-months) and slightly worse at later sessions (36- and 48-months). Stratifying this analysis by starting diagnosis revealed that the MCI patient population was primarily driving these results as well. This finding was likely the result of two major factors: the combined sample of stable and declining patients at each time point was always much larger for the patients with a baseline MCI diagnosis and the greatest and most consistent changes in neuropsychological scale scores across sessions would be expected to occur in the MCI sub-population. Because patients more frequently seek monitoring and intervention after the onset of cognitive deficits (i.e., with MCI) rather than prior (i.e., CN), the strength of our metric in this patient population is of importance. Similarly, in clinical trials aimed at assessing the efficacy of interventions in slowing AD progression, a measure that is optimally sensitive in patients with MCI is particularly well-suited for usage.
A particularly attractive feature of AD-NeuroScore is the ease with which it could be integrated into current clinical and research workflows. One of the challenges in the full clinical implementation of AI applications is the difficulty integrating these into daily workflows (Hwang and Park, 2020). Adding AD-NeuroScore to existing, clinically implemented regional brain volume reports, and calculating AD-NeuroScore based on the individual regional volumes obtained in research settings would be straight-forward. Further, by replacing different anatomical ROIs with a single measure as an endpoint, AD-NeuroScore could enhance the sensitivity of many studies by reducing the number of statistical tests, while meaningfully encapsulating as much or more information. Future directions of this work include creating validated AD-NeuroScore norms and diagnostic cutoffs to aid in integration with clinical workflows.
An unexpected result of the present study, inconsistent with previous literature, is that both the left and right lateral ventricular volumes were not retained during ROI selection, despite the frequent observation of ventricular dilation in patients with AD (Attier-Zmudka et al., 2019; Ott et al., 2010). Absolute ventricular enlargement has proven sufficient to discriminate between MCI patients who progressed to AD or remained stable over a 6-month interval in a smaller ADNI cohort (n=504) and, consequently, has been proposed as a potential biomarker for disease progression (Sean M. Nestor et al., 2008). However, at baseline, in line with previous cross-sectional volumetric studies, a large overlap was found between the total ventricular volumes of pathological groups and controls (Giesel et al., n.d.; Sean M. Nestor et al., 2008; Schott et al., 2005). Thus, one potential explanation may be that absolute ventricular change is a more sensitive measure of AD-pathology than total ventricular volume. Given that our investigation only examined regional volumes at cross-section and employed a conservative correction to adjust for the 84 brain regions tested during ROI selection, this discrepancy may be plausibly attributed to differences in study design.
It is also important to address the distinctions between AD-NeuroScore and other biomarkers that have been developed and studied in the literature. Many of the ML-based metrics successfully classify participants as AD or not, using similar input features (perhaps from different data sets) and slightly different ML algorithms. The STructural Abnormality iNDex (STAND)-score uses a support vector machine classifier which takes in sMRI normalized input features such as GM, WM, and CSF tissue densities extracted from SPM5 tissue segmentation as well as demographic information and assigns a numerical value to classify a patient as CN or AD based on a 280 sample (140 each) training set from the Alzheimer’s disease Patient Registry (Vemuri et al., 2009b, n.d.). AD Pattern Similarity (AD-PS) uses the same sMRI input features (GM, WM, and CSF tissue maps) derived from the ADNI data set (Casanova et al., 2013) and high dimensional regularized logistic regression to classify a patient as CN or AD. AD-PS and AD-NeuroScore perform similarly (with significantly overlapping AUC-ROC confidence intervals); while AD-PS was also shown to correlate negatively with cognitive scores obtained through clinic visits and telephone interviews, lack of details regarding the correlations makes direct comparison with AD-NeuroScore difficult. Another ML-based metric, Subtype and Stage Inference (SuStaIn) (Young et al., 2014), primarily subtypes individuals based on their likelihood to stabilize or decline (Fonteijn et al., 2012) and excels at differentiating AD from CN. AD-NeuroScore performs at least as well as SuStaIn at subtyping MCI patients (AUC=0.72 for SuStaIn; AUC=0.77 for AD-NeuroScore) (Archetti et al., 2021). SuStaIn utilizes multi-modal sporadic disease data rather than a single data source, making it more complicated than AD-NeuroScore to integrate into clinical workflows, potentially a barrier to full clinical implementation.
Amongst the more interpretable ML models, MRDATS and SPARE-AD (Davatzikos et al., 2009; Popuri et al., 2020b) deserve mention. MRDATS uses w-corrected brain ROI volumes in an ensemble-learning algorithm to arrive at a score between 0 and 1 to indicate CN to dementia progression (Popuri et al., 2020b). Using a threshold of 0.5, MRDATS performed well at binary classification of CN and dementia, however this was assessed only in the training set. As with SuStaIn, MRDATS was developed primarily to differentiate stable from declining patients. MRDATS performed equivalently to AD-NeuroScore in distinguishing stable and progressive MCI groups (0–3-year AUC=0.75 for MRDATS and AUC=0.75 for AD-NeuroScore). The SPARE-AD (Davatzikos et al., 2009) index uses high-dimensional pattern classification of volumetric atrophy and rate of change of SPARE-AD was highly predictive of cognitive status. A future direction of this work is to determine if utilizing rate of change of AD-NeuroScore (versus a single time point) improves its performance. Compared to MRDATS and SPARE-AD, AD-NeuroScore can be calculated in a computationally simple way using FDA-approved, clinically implemented, automated ROI-metrics such as Neuroreader®.
The most similar metric to AD-NeuroScore is RVI or the ENIGMA Dot Product (Kochunov et al., 2022), which also uses the concept of weighing a multivariate sum by a metric describing the effect of a group status with harmonized phenotypes and performs well as a biomarker in several neuropsychiatric conditions. There are many notable differences between RVI and AD-NeuroScore, the first of which pertains to the phenotypes used within the metric. AD-NeuroScore exclusively uses regional volumes while RVI uses several brain-derived metrics including gray matter thicknesses and fractional anisotropy values, which are not presently part of popular clinical workflows. Moreover, AD-NeuroScore retains both hemispheric volumes compared to RVI’s averaging of the hemispheres. AD-NeuroScore additionally harmonizes with respect to scanner model, improving its utility in multi-scanner circumstances, which are common in large cohort data sets and in clinical practice.
Most of the important differences arise in the construction of AD-NeuroScore. The core concept of the metric is the Euclidean distance, where each harmonized volume is subtracted from the corresponding normal template harmonized volume. Unlike RVI, it does not take the absolute value of the transformed residual after harmonization. For this reason, each multivariate term in the sum is squared to account for possible cancellations of differences and to emphasize differences that are greater while minimizing smaller ones. Moreover, AD-NeuroScore quantifies the difference between groups, weighs each term in the sum with a z-score rather than effect size and does not normalize by the number of terms. These differences could explain why AD-NeuroScore better distinguishes CN and AD groups from the ADNI database than RVI, with statistically significant differences in their Cohen’s d values (p<0.001) (Kochunov et al., 2021).
An important advantage of AD-NeuroScore is its potential to accurately monitor disease advancement and be compared across multiple sites by using standard ranges to generate probabilities of diagnosis or decline. Similarly, ranges of AD-NeuroScore that are standard for different groups could be used to aid early differential diagnosis of common neurodegenerative diseases that cause dementia, such as vascular dementia, dementia with Lewy bodies, and frontotemporal dementia, which demonstrate unique patterns of atrophy that may be difficult to assess by AHV and other standard approaches. In support of exploring this functionality further is previous literature which has found that sMRI alone can be used to differentiate between these four dementia-causing neurodegenerative diseases groups and healthy controls with a high accuracy (Koikkalainen et al., 2016). A future direction of this research is to determine if AD-NeuroScore can aid in the differential diagnosis of dementia.
While the findings of this work encourage further investigation, there are several limitations that should be carefully considered and addressed in successive studies. First, the sample of patients with AD was much smaller than the CN and MCI cohorts across all sessions, with this number falling across successive visits. The limited number of participants across all diagnostic groups with scores available on the MMSE, CDR-SB, and ADAS-11 also dwindled longitudinally, rendering some of the later tests for the CN and AD groups particularly underpowered. Additionally, the AD subset was confined to patients with mild AD dementia, as defined in the ADNI protocol, which leaves more progressed patients with AD understudied with regards to our metric. Repeating this study in a real-world clinical sample that better represents the AD patient class at various stages of disease advancement would solidify the findings presented in this study.
Data Availability
Full, open access to all de-identified ADNI imaging and clinical data is publicly and freely available to individuals who register with the ADNI and agree to the conditions in the “ADNI Data Use Agreement,” upon approval of a request that includes the proposed analysis and the named lead investigator (contact via http://adni.loni.usc.edu/data-samples/access-data/). Additional details about the ADNI data acquisition and sharing policies can be found at http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_DSP_Policy.pdf. The code supporting the findings of this study is openly available in [repository name: "AD-NeuroScore"] at https://github.com/jbramen/AD-NeuroScore. The Python programming language (version 3.9) was used for all analyses (Python Software Foundation, https://www.python.org/).
Declarations of Interest
The Pacific Neuroscience Institute Foundation (PNIF) is a public charity that holds a patent for AD-NeuroScore. JB, EP, DM, and SP, are employees of PNIF. GK, SB, and PS served as consultants/advisors for PNIF. PT received partial grant support from Biogen, Inc., for research unrelated to this manuscript. ST, CC, and HZ have nothing to disclose.
Acknowledgements
We thank the participants and families volunteering for this research. We thank the Pacific Neuroscience Institute and Foundation staff and leadership, including the CEO and Founder, Dan Kelly, MD, Vice President, Melissa Coleman, Director of Research Administration and Operations Melanie Lampa, Director of Marketing, Zara Jethani, Executive Assistant, Danielle Wozniak, and Accounting Manager, Bersabel Belay and the Pacific Brain Health Clinic’s Practice Manager, Brenda Smith for their support. This work is supported by the Pacific Neuroscience Institute Foundation, including the generous support of the Barbara and John McLoughlin Family as well as the Cary and Will Singleton Family. Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Footnotes
↵* Gavin Kress and Emily Popa shared equally to the joint first authorship
‡ Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
minor phrase change in acknowledgments
Abbreviations
- (AD)
- Alzheimer’s Disease
- (ADAS-11)
- Alzheimer’s Disease Assessment Scale-Cognitive Subscale
- (aMCI)
- Amnestic Mild Cognitive Impairment
- (AUC-ROC)
- Area Under the Receiver Operator Characteristic Curve
- (CSF)
- Cerebrospinal Fluid
- (CDR-SB)
- Clinical Dementia Rating Scale Sum of Boxes
- (GM)
- Grey Matter
- (ML)
- Machine Learning
- (MTL)
- Medial Temporal Lobe
- (MCI)
- Mild Cognitive Impairment
- (MMSE)
- Mini-Mental State Exam
- (PHI)
- Protected Health Information
- (ROC)
- Receiver Operator Characteristic
- (ROI)
- Region of Interest
- (SD)
- Standard Deviation
- (sMRI)
- Structural Magnetic Resonance imaging
- (WM)
- White Matter