Abstract
Accurate Alzheimer’s Disease (AD) progression prediction is essential for early intervention. The TADPOLE challenge, involving 92 algorithms, used multimodal biomarkers to predict future clinical diagnosis, cognition, and ventricular volume. The winning algorithm, FROG, utilized a Longitudinal-to-Cross-sectional (L2C) transformation to convert variable longitudinal histories into fixed-length feature vectors, which contrasted with most existing approaches that fitted models to entire longitudinal histories, e.g., AD Course Map (AD-Map) and minimal recurrent neural networks (MinimalRNN). The TADPOLE challenge only utilized the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset. To evaluate FROG’s generalizability, we trained it on the ADNI dataset and tested it on three external datasets covering 2,312 participants and 13,200 timepoints. We also introduced two FROG variants. One variant, L2C feedforward neural network (L2C-FNN), unified all XGBoost models used by the original FROG with an FNN. Across external datasets, L2C-FNN and AD-Map were the best for predicting cognition and ventricular volume. For clinical diagnosis prediction, L2C-FNN was the best, while AD-Map was the worst. L2C-FNN compared favorably with other approaches regardless of the number of observed timepoints, and when predicting from 0 to 6 years into the future, underscoring its potential for long-term dementia progression prediction. Pretrained ADNI models are publicly available: GITHUB_LINK.
1 Introduction
Alzheimer’s disease (AD) is a devastating neurodegenerative disorder (Jack et al., 2018; Hampel et al., 2021), which transitions slowly from a preclinical phase to a fully manifested clinical syndrome (Villemagne et al., 2013). There is no cure for AD, although medications exist to slow down cognitive decline in early AD (Van Dyck et al., 2023). The growing consensus is that early intervention is critical for slowing or stopping disease progression (Dubois et al., 2016; Scheltens et al., 2016, 2021). Therefore, predicting AD progression is an important clinical task (Zhang et al., 2017; Venkatraghavan et al., 2019) that can enable early treatment and caregiver planning (de Vugt & Verhey, 2013; Rasmussen & Langerman, 2019), decrease clinical trial costs by enriching enrollment (Burns et al., 2021; Oxtoby et al., 2022), and decrease costs by prioritizing expensive drugs for patients most at risk for dementia (Cummings et al., 2019).
Most disease progression prediction studies focus on a “static” setup, using cross-sectional (baseline) data to predict outcomes at a single future timepoint (Qiu et al., 2020; Hebling Vieira et al., 2022), such as whether individuals with mild cognitive impairment will develop dementia within 3 years (Basaia et al., 2019; El-Sappagh et al., 2021). However, real-world data often involves a variable number of observed timepoints, reflecting irregular clinical observations. Therefore, in The Alzheimer’s Disease Prediction Of Longitudinal Evolution (TADPOLE) challenge (Marinescu et al., 2019, 2021), multimodal AD biomarkers at one or more timepoints are used to predict cognition, ventricular volume and clinical diagnosis of each individual every month into the future. This setup presents three challenges: (1) variable number of observed timepoints per participant, (2) predicting indefinitely into the future1, and (3) significant missing data since not all biomarkers are available at every timepoint.
To tackle the TADPOLE challenge, dynamical state models, such as linear state space models and recurrent neural networks (RNNs), have been proposed (Ghazi et al., 2019; Nguyen et al., 2020; Jung et al., 2021; Xu et al., 2022). In these models, an individual’s latent state is represented by a vector, thus providing a rich encoding of an individual’s “disease state” beyond a single integer (as in the case of discrete state hidden Markov models). At each timepoint, observations are used to update the latent state of the individual at that timepoint. The latent state is in turn used to predict observations at the next time point. If some (or even all) observations are missing at the next time point, the model predictions can be used to fill in the missing data. Therefore, predicting missing data and future disease progression are unified into a single prediction task. One early algorithm in this category was the MinimalRNN model, which achieved second place on the TADPOLE leaderboard at the time of publication, behind the FROG algorithm (Nguyen et al., 2020).
Second, inspired by theoretical models of a sigmoidal evolution of AD biomarkers (Jack et al., 2010; Villemagne et al., 2013; Selkoe & Hardy, 2016), approaches have emerged to fit parametric sigmoid-like functions to longitudinal biomarkers. Mixed-effects models capture group-level trends as fixed effects and individual variations as random effects (Iddi et al., 2019; Li, Iddi, et al., 2019; Oxtoby, 2023). Jedynak and colleagues used dynamic time-warping to align individual biomarker data to a group template represented by sigmoid curves (Jedynak et al., 2012). A Bayesian extension incorporated individual-specific latent time shifts (Bilgel et al., 2019). This line of research was extended to fit a constrained generalized sigmoidal function (Ghazi et al., 2021). A disease course mapping framework integrating Riemannian geometry and mixed-effects modeling with time reparameterization, known as AD course map (AD-Map), has demonstrated great promise (Koval et al., 2021), outperforming MinimalRNN for predicting cognition (Maheux et al., 2023).
By contrast, the TADPOLE winner FROG utilized a longitudinal-to-cross-sectional (L2C) transformation technique (Nanopoulos et al., 2001; Deng et al., 2013; Barandas et al., 2020) to convert participants’ variable visit histories into a cross-sectional format, thus reducing the TADPOLE problem into a traditional “static” prediction problem with fixed length input features. XGBoost (eXtreme Gradient Boost; Chen & Guestrin, 2016) was then used to predict disease progression with the L2C features, training separate models for different forecast windows and target variables. FROG was the best for predicting clinical diagnosis and the overall TADPOLE winner. However, FROG has only been tested in the ADNI dataset, so it remains unclear how well it generalizes to new datasets. Furthermore, as far as we know, the L2C approach is relatively unique in the medical imaging community, with most approaches falling into either parametric model fitting (e.g., AD-Map) or dynamic state modeling (e.g., MinimalRNN).
Therefore, in the current study, we evaluated the FROG algorithm by training it on the ADNI dataset and evaluating its cross-dataset performance in three external datasets comprising 2312 participants with 13200 timepoints from the United States, Australia and Singapore. In addition, we considered two additional FROG variants. One FROG variant unified all XGBoost models with a single feedforward neural network (FNN) model, which we refer to as L2C-FNN. We also compared the FROG variants with a representative parametric approach (AD-Map) and a representative dynamic state modeling approach (MinimalRNN).
2 Methods
2.1 Problem overview
The problem setup followed the TADPOLE challenge (Marinescu et al., 2019, 2021). Given multimodal biomarkers and diagnostic history (Table 1) at one or more timepoints of an individual, we aimed to predict the cognitive state, ventricle volume normalized by intracranial volume (ICV), and clinical diagnosis of the individual for every subsequent month beyond the last observed timepoint up to 120 months into the future. Cognitive state was measured with the Alzheimer’s Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13) in the original TADPOLE challenge. However, not all the external datasets had ADAS-Cog13, so we switched to predicting mini mental state examination (MMSE) in this study.
We used four datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, the Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) dataset, the Memory, Ageing and Cognition Centre (MACC) Harmonization Cohort, and the Open Access Series of Imaging Studies (OASIS) dataset. These datasets consisted of longitudinal multimodal data, such as T1-weighted structural MRI data, cognitive measurements and clinical diagnosis, as well as baseline demographics. The diagnostic categories corresponded to cognitively normal (CN), mild cognitive impairment (MCI), as well as dementia with various etiologies (DEM). Data collection was approved by the Institutional Review Board (IRB) at each corresponding institution. The analysis in the current study is approved by the National University of Singapore IRB.
2.2 Datasets, preprocessing and participant selection
2.2.1 Datasets
The ADNI dataset (Jack et al., 2008; Petersen et al., 2010) is a comprehensive multicenter research initiative in the United States with three phases: ADNI1 (2004-2009), ADNI-GO/2 (2010-2016), and ADNI3 (2017-2023), with a primary focus on advancing the understanding of Alzheimer’s disease dementia. Each phase incorporates newly enrolled participants and individuals transitioning from earlier phases. Notably, there are variations in MRI scanner models and protocols across the phases, with ADNI1 primarily employing 1.5T scanners and subsequent phases adopting 3T scanners (see Table S1 for details). T1 images were downloaded from the USC Laboratory of Neuro Imaging’s Image and Data Archive (IDA). The ADNIMERGE spreadsheet (the ADNI team, 2023) containing various phenotypic data (e.g., demographics, clinical diagnoses, cognitive measurements) was also downloaded.
The AIBL study (Fowler et al., 2021) is an Australian flagship initiative that shares a similar goal and technical infrastructure with ADNI. MRI scans were acquired using both 1.5T and 3T (Avanto, Tim Trio and Verio) scanners (see Table S2 for details). MRI scans and phenotypic data were obtained from the USC Laboratory of Neuro Imaging’s Image and Data Archive (IDA).
The MACC Harmonization dataset (Hilal et al., 2020) focuses on a memory clinic population in Singapore. T1 images in this dataset were acquired exclusively using 3T (Tim Trio and Prisma) scanners (see Table S3 for details). Note that this dataset contained participants with vascular dementia and/or Alzheimer’s Disease dementia, which we have grouped together as Vascular/Alzheimer’s disease dementia (DEM). The mixed pathology allowed us to evaluate the generalizability of these models beyond AD dementia. However, for completeness, we will also report results for only AD dementia.
The OASIS dataset (LaMontagne et al., 2019) serves as a multimodal resource for studying normal aging and cognitive decline. It consists of four releases: OASIS-1 (cross-sectional) and OASIS-2 (longitudinal) as smaller-scale studies, OASIS-3 as the primary large dataset that includes OASIS-1 and OASIS-2 subjects, and OASIS-4, which encompasses a separate clinical cohort. For this study, we utilized OASIS-3 data. Note that this dataset included both AD and non-AD dementia, which we have grouped together as DEM (similar to MACC). However, for completeness, we will also report results for only AD dementia. The imaging data in OASIS were acquired using both 1.5T and 3T scanners (see Table S4 for details). MRI scans and phenotypic data were downloaded from the XNAT Central (Herrick et al., 2016).
2.2.2 Preprocessing
All T1 images were de-obliqued and reoriented to the RPI orientation. Subsequently, we used the FreeSurfer 6.0 recon-all pipeline (Fischl et al., 2002; Desikan et al., 2006) to derive the volumes of various regions of interest (ROIs). Following the FROG algorithm, we derived five brain ROI volumetric features associated with AD-dementia, namely Hippocampal volume, Fusiform volume, Middle Temporal (MidTemp) volume, Ventricle volume, and Whole Brain volume (see Table S5 for details). We also incorporated intracranial volume (ICV) as an additional feature and standardized the five brain ROI features with respect to ICV (Table 1).
The generated brain ROI features were then merged with downloaded phenotypic data (e.g., demographics, clinical diagnoses, cognitive measurements). Notably, non-MRI phenotypes (e.g., clinical diagnoses and cognitive measurements) and MRI scans might not have been performed on the same day. Following the ADNIMERGE convention, if the non-MRI and MRI dates were within 6 months of each other, then the non-MRI and MRI phenotypes were merged into one timepoint corresponding to the non-MRI date.
We systematically eliminated empty or duplicate entries, along with those displaying outliers or errors. Certain datasets used inconsistent coding for missing values, such as NaN or special integers (e.g., -1, -4, 999). To ensure consistency, we replaced all special integers with NaN. Consequently, we obtained a clean longitudinal data table where each row represents one timepoint of a participant, containing MRI features and/or cognitive features and/or diagnostic features (Table 1).
2.2.3 Participant selection and characteristics
Our objective was to predict the longitudinal progression of dementia, so across all four datasets, we only included participants with recurring features (Table 1) at two or more timepoints. We note that under this criterion, the recurring features did not need to all occur in the same timepoints. For example, a participant with only MRI features in timepoint 1 and only cognitive features in timepoint 2 was considered acceptable.
With the above selection criterion, the final ADNI dataset comprised 2111 participants with a total of 15791 timepoints, including 9668 timepoints with MRI features. In the case of AIBL, the final dataset comprised 402 participants with a total of 1220 timepoints, including 940 timepoints with MRI features. In the case of MACC, the final dataset comprised 650 participants with a total of 3067 timepoints, including 1453 timepoints with MRI features. In the case of OASIS, the final dataset comprised 1260 participants with a total of 8913 timepoints, including 2519 timepoints with MRI scans.
The demographics, disease severity and number of timepoints vary significantly between ADNI and the three external datasets (Table 2). Actual distributions are plotted in Figures S1 and S2. Compared with ADNI, the AIBL participants were younger and had higher MMSE scores. There were also proportionally more female and CN participants in the AIBL dataset than the ADNI dataset. Compared with ADNI, the MACC participants had lower MMSE scores. Furthermore, there were proportionally more female and participants with DEM diagnosis in MACC than ADNI. Finally, compared with ADNI, the OASIS participants were younger and had higher MMSE scores. There were also proportionally more female and CN participants in the OASIS dataset than the ADNI dataset.
Furthermore, both the AIBL and MACC datasets typically have fewer than 7 timepoints collected within a span of 7 years. In contrast, some participants in the ADNI and OASIS datasets have up to 20 to 30 timepoints, covering a tracking period of over 15 years. The percentage of timepoints with missing data was also highly different across the datasets (Table 2).
2.3 Training, validation and test procedure
We compared five different models: MinimalRNN, AD-Map, original FROG (L2C-XGBw), FROG variant 1 (L2C-XGBnw) and FROG variant 2 (L2C-FNN). We trained different models using ADNI and evaluated their performance within the ADNI dataset (within-dataset evaluation) and in AIBL, MACC and OASIS datasets (cross-dataset evaluation). More specifically, we randomly divided the ADNI participants into 20 groups. As all participants contributed data collected at multiple timepoints, care was taken to ensure that all timepoints for any given participant were exclusively assigned to a single group, precluding any division across multiple groups.
To train a given model, 18 groups were used for training, while 1 group was used as a validation set to tune the hyperparameters. The remaining group was used as test set to evaluate the within-cohort performance of the model. To ensure stability of results (Kong et al., 2019; Li, Kong, et al., 2019; Varoquaux, 2018), this procedure was repeated 20 times with a different group being the test set (e.g., group 5) and the group next to it being the validation set (e.g., group 6). Therefore, we ended up with 20 sets of trained models together with 20 sets of within-cohort evaluation results. The 20 sets of trained models were applied to all participants in AIBL, MACC, and OASIS for cross-cohort evaluation (Figure 1).
Following TADPOLE convention, for participants in the ADNI validation and test sets, as well as all participants in AIBL, MACC, and OASIS, the first half of each participant’s timepoints were used to predict the second half of the same participant’s timepoints. For example, if a participant had 10 timepoints, then the first 5 timepoints were used as input (observed) timepoints, and we sought to predict the second 5 timepoints (unobserved). On the other hand, the entire longitudinal time series of training participants were used during training to increase data efficiency.
The Optuna library (Akiba et al., 2019) was utilized to find the best hyperparameters by maximizing model performance on the validation set. We note that this optimization was performed independently for each training/validation/test split of the dataset. The hyperparameter search spaces for each algorithm are described in their respective method sections.
2.4 MinimalRNN
MinimalRNN is a recurrent neural network (RNN) with less parameters than LSTM (Long short-term memory) to mitigate overfitting. In our previous study (Nguyen et al., 2020), we found that MinimalRNN performed better than the more complex LSTM, as well as a simpler linear state space model in the TADPOLE challenge. As such, the MinimalRNN struck the perfect complexity balance, yielding the best prediction performance among the RNN models we tested.
In RNNs, the same computational unit is repeated at each time step, where the output at the current step becomes the input at the next step. Therefore, the longitudinal data of a participant is analyzed sequentially, where the input features at a particular timepoint is used to update the internal “disease” state of the participant. This internal state is then used to predict the input features at the next time point.
For our experiments, we utilized the publicly available code from https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/predict_phenotypes/Nguyen2020_RNNAD. However, we replaced the HORD hyperparameter search algorithm (Eriksson et al., 2019; Ilievski et al., 2017) employed in our previous study (Nguyen et al., 2020) with Optuna (Akiba et al., 2019) because of its ease of use.
Table 3 summarizes the hyperparameters optimized by Optuna and their corresponding search range. Consistent with the original study (Nguyen et al., 2020), MinimalRNN only utilized the recurring MRI features, cognitive features and clinical diagnosis (Table 1), but not any baseline features (sex, education, marital status and number of APOE-ε4) and age. Our previous experiments (not shown) found that these additional information did not improve prediction performance.
2.5 AD Course Map (AD-Map)
AD-Map is a parametric Bayesian non-linear mixed-effects model designed to predict cognition and brain atrophy. It was shown to outperform MinimalRNN (Maheux et al., 2023). AD-Map assumes that each biomarker follows a logistic curve, with different biomarkers exhibiting distinct progression rates and ages at inflexion point. The model adjusts these curves for each individual by learning individual-specific shifts in disease onset, progression rates, and the timing/ordering of biomarker progression. As a result, the model predicts an individual-specific set of logistic curves, which show the value of each biomarker at any age of the participant.
For our experiments, we used the Leaspy software (https://gitlab.com/icm-institute/aramislab/leaspy) and optimized hyperparameters (Table 4) via Optuna (Akiba et al., 2019). We note that intracranial volume (ICV) was not included as a feature since AD-Map requires time-dependent features, and ICV shows minimal change with time (Courchesne et al., 2000; Jenkins et al., 2000). However, we remind the reader that the other MRI volumetric features were normalized with respect to ICV, consistent with other algorithms.
Consistent with the original study (Maheux et al., 2023), we did not include clinical diagnosis as a biomarker. Instead, to predict clinical diagnosis, we converted the predicted CDR score into probabilities for CN, MCI, and DEM using standard cut-off points (O’Bryant et al., 2010; Tariot et al., 2024): CDR 0 was mapped to CN, CDR 0.5 was mapped to MCI, and CDR 1, 2, and 3 were mapped to DEM. For predicted CDR scores between 0 and 0.5, we used linear interpolation, so for instance, a CDR score of 0.1 resulted in 80% probability for CN, and 20% probability for MCI, and 0% probability for DEM. For scores between 0.5 and 1, we again used linear interpolation, so for instance, a CDR score of 0.6 resulted in 0% probability for CN, and 80% probability for MCI, and 20% probability for DEM. CDR scores of 1 or higher are fully assigned to DEM.
We also explored including clinical diagnosis directly into AD-Map by treating it as a score (CN = 0, MCI = 1, DEM = 2). This approach improved clinical diagnosis prediction, but led to very poor MMSE and ventricular volume prediction. Therefore, consistent with the original study (Maheux et al., 2023), we did not include clinical diagnosis as a feature in the AD-Map algorithm. Furthermore, the AD-Map package does not take in any baseline features (sex, education, marital status and number of APOE-ε4), so in summary, the AD-Map only used recurring MRI features (excluding ICV), cognitive features and age.
2.6 Original FROG: Longitudinal-to-Cross-sectional XGBoost with Windows (L2C-XGBw)
2.6.1 Longitudinal-to-Cross-sectional (L2C) transformation
The TADPOLE problem set up is challenging for standard machine learning algorithms (e.g., support vector machine) because of the variable length of observed timepoints. The winner of the TADPOLE challenge FROG used a feature engineering technique (Nanopoulos et al., 2001; Deng et al., 2013; Barandas et al., 2020) that transformed the longitudinal visit history of participants into a cross-sectional format, which we will refer to L2C (Longitudinal-to-Cross-sectional) transformation.
More specifically, suppose for a given participant, we observed data at m timepoints t1, t2, t3, t4, …, tm, and we would like to predict clinical diagnosis (or MMSE or ventricular volume) at a future timepoint tf. Note that these timepoints might not be equally spaced in time. To convert the variable length input features, FROG proposed the following L2C transformation, in which each continuous input modality (i.e., MMSE, CDR_GLOBAL, six anatomical ROI volumes) is converted into seven features (Table 5) and clinical diagnosis is converted into eight features (Table 6), resulting in 8 x 7 + 8 = 64 features. These 64 features were augmented by age at future timepoint tf, baseline sex, baseline education level, baseline marital status, APOE status and number of months between future timepoint tf and the first (baseline) visit. In total, there were 64 + 6 = 70 features. Please refer to Supplementary Table S6 for the complete set of L2C features used by FROG.
2.6.2 Data augmentation
In addition to the L2C transformation, FROG proposed the following data augmentation strategy during training. Suppose we observed data at m timepoints t1, t2, t3, t4, …, tm for a particular training participant. FROG then generated m − 1 training samples by using t1 to predict t2, or t3 or t4 or t5 etc. FROG also generated another m − 2 training samples by using t1and t2 to predict t3 or t4 or t5 etc. In total, given m timepoints, FROG generated m ∗ (m − 1)/2 training samples.
2.6.3 L2C eXtreme Gradient Boost with separate windows (L2C-XGBw)
The L2C transformation converted variable length input features into fixed length input features (Section 2.6.1), while the data augmentation procedure generated more training samples (Section 2.6.2). The original FROG team used eXtreme Gradient Boost (XGBoost; Chen & Guestrin, 2016) to predict the target variables from the L2C features. Gradient boosting is a model ensemble of individual decision trees that are trained sequentially such that a new tree improves the error of the previous tree ensemble. XGBoost is an optimized distributed gradient boosting library. The original FROG submission used the XGBoost R library, while we reimplemented the FROG algorithm in python. We performed 5 repetitions of train/validation/test split in the ADNI dataset to ensure our python implementation yields numerically the same results as the R code.
Furthermore, consistent with the original FROG submission to the TADPOLE challenge, we trained separate XGBoost models for each target variable (clinical diagnosis, MMSE, ventricle volume). Following the original FROG submission, we also trained separate models based on specific forecast interval ranges, with the assumption that certain models may excel in short-term predictions while others in long-term forecasts. The forecast interval ranges (i.e., forecast windows) for each target variable (measured in months) adhere to the FROG team’s settings (Table 7). Hence, we referred to this algorithm as L2C eXtreme Gradient Boost with separate windows (L2C-XGBw).
Three important hyperparameters were tuned in the ADNI validation sets using Optuna (Akiba et al., 2019). The three hyperparameters and search ranges are detailed in Table 8. We note that there is no extra feature normalization or missing data imputation since the XGBoost package handles such issues internally.
2.7 FROG variant 1: L2C eXtreme Gradient Boost with no window (L2C-XGBnw)
L2C-XGBw (FROG) involved training a separate XGBoost model for each forecast window. This is not ideal because the forecast windows are themselves hyperparameters, which might be hard to pick for new target variables. We hypothesized that the multiple forecast windows might not be necessary because L2C features like "time since baseline" and "time since most recent measurement" already encode the necessary temporal information for the model. Therefore, we considered a variant of FROG, where a single XGBoost model was trained for all future timepoints, as opposed to a separate model for each time window. We refer to this baseline as L2C eXtreme Gradient Boost with no window (L2C-XGBnw). All other implementation details remain consistent with those of L2C-XGBw.
2.8 FROG variant 2: L2C Fully-Connected Feedward Neural Network (L2C-FNN)
L2C-XGBnw trained a separate XGBoost model for each target variable. Previous studies have suggested that predicting multiple target variables can potentially improve prediction performance. By learning shared representations to capture common patterns among related tasks, these shared representations might enhance data efficiency, accelerate learning, and mitigate overfitting issues (Rahim et al., 2017; Crawshaw, 2020).
A natural choice to incorporate multi-task learning is to replace XGBoost with a fully connected feedforward neural network (FNN) model, with the output layer predicting all target variables jointly. Adding new target variables only increases the dimension of the output layer, which eliminates the need for separate models and simplifies the coding and hyperparameter tuning. Similar to L2C-XGBnw, we will train a single FNN model to predict all future timepoints, instead of the original FROG implementation which trained a separate XGBoost model for each forecast window. We will refer to this model as the L2C Fully-Connected Feedward Neural Network (L2C-FNN).
Figure 2 illustrates the L2C-FNN architecture. LeakyReLU (Maas, 2013) was chosen as the activation function, and dropout (Srivastava et al., 2014) was applied after each activation function to enhance model generalizability. The FNN output is a 5-dimensional vector: the first three elements represent the individual’s probabilities of being diagnosed as CN, MCI, or DEM at a future timepoint, computed using a SoftMax function, while the fourth and fifth elements correspond to MMSE and ventricle volume predictions.
Input features are similar to L2C-XGBw (FROG), with additional preprocessing steps. Due to FNN sensitivity to input scale and missing data compared to tree-based models such as XGBoost, we performed Gauss Rank normalization, a special form of quantile normalization (Zhao et al., 2020), with a Gaussian reference distribution. The transformation was performed using the Scikit-learn quantile transform function (Pedregosa et al., 2011). Discrete features, such as APOE, sex, and most recent diagnosis (mr_dx) were encoded using one-hot encoding. To handle missing data, an "unknown" class was introduced for all discrete features, and missing values were assigned to this class. Numeric feature imputation involved replacing missing values with the median of the training set before Gauss Rank transformation. During inference, the learned Gauss Rank transformations and statistics from the training set were used to impute and transform validation and test data. Notably, the "time since milder clinical diagnosis" feature was removed due to its high proportion of NaN values (91.57%) which primarily stems from two causes: either because there is no milder clinical diagnosis in the patient’s history or the clinical diagnosis data is missing. Overall, this yields a 101-dimensional input vector for L2C-FNN (see Supplementary Table S7 for details).
The loss function is computed by comparing the predictions with the ground truth. Similar to MinimalRNN, cross-entropy loss was used for clinical diagnosis prediction, while mean absolute error (MAE) loss was employed for MMSE and ventricle volume prediction (based on the Gauss Rank values). Because MAE was based on Gauss Rank values, the three losses were of similar magnitude, and so the three losses were added together with equal weighting. Changing the relative weights of the three terms could potentially influence the model performance. However, this would increase the number of hyperparameters, so we did not experiment with different weights in this study.
Finally, stochastic gradient descent (SGD) with momentum (Qian, 1999; Sutskever et al., 2013) was chosen as the optimizer, with Optuna utilized to search for optimal hyperparameters. Table 9 shows all the hyperparameters considered and their corresponding search range. The ExponentialLR scheduler was employed to regulate learning rate behavior.
2.9 Further analyses
We performed two additional analyses to study the effectiveness of all five models (MinimalRNN, AD-Map and the three FROG variants).
2.9.1 Impact of the number of observed timepoints on cross-cohort prediction accuracy
For a disease progression model to be effective in early detection of AD-dementia risk, it should ideally perform well with a small number of input timepoints. We evaluated the performance of all four models on external test datasets using only 1, 2, 3, or 4 input timepoints. This contrasts with the main benchmarking analysis (Section 3.4), where half the total number of timepoints (for each participant) were used to predict the remaining timepoints.
For the OASIS dataset, test subjects with fewer than 4 input timepoints were discarded so that the same test subjects were evaluated across the four conditions (i.e., 1, 2, 3, or 4 input timepoints). In contrast, the maximum number of input timepoints for each subject is less than 4 in the AIBL and MACC datasets (2 for AIBL and 3 for MACC). Consequently, we discarded test subjects with fewer than 2 input timepoints for AIBL and fewer than 3 input timepoints for MACC. Because we excluded some test subjects, the results of this analysis are not directly comparable to those of the main benchmarking analysis (Section 3.4).
2.9.2 Breakdown of cross-cohort prediction in yearly intervals
We extended our investigation of cross-cohort prediction performance by breaking down the prediction results into yearly intervals up to 6 years into the future. Each participant’s future timepoints were categorized into yearly intervals based on the duration between the last input timepoint and the target future timepoint for prediction. For instance, considering a participant with 10 timepoints, if the last input timepoint (5th timepoint) was at month 60 and the 6th timepoint was at month 70, the prediction at the 6th timepoint would be classified as 1 year into the future due to the 10-month duration.
We anticipated that all tested algorithms would experience a decline in performance as the prediction horizon extended further into the future. Nevertheless, for effective early detection of AD-dementia, a robust algorithm was expected to maintain relatively high performance even in later years, ensuring clinical utility.
2.10 Deep neural network implementation
MinimalRNN and L2C-FNN were implemented using PyTorch (Paszke et al., 2019) and computed on NVIDIA RTX 3090 GPUs with CUDA 11.0.
2.11 Performance evaluation and statistical tests
In the preceding sections, we utilized a 20-fold cross-validation procedure to train the five models (MinimalRNN. AD-Map and three FROG variants) on data from ADNI, predicting clinical diagnosis, MMSE score, and ventricle volume. Our evaluation aims to assess the performance of these algorithms within the ADNI dataset (within-dataset evaluation) and across external test datasets (cross-dataset evaluation), with a specific focus on evaluating their generalizability. This section provides a detailed description of the statistical evaluation procedure.
Diagnosis classification accuracy was evaluated using the multiclass area under the operating curve (mAUC; Hand & Till, 2001) following the TADPOLE challenge. The mAUC was computed as the average of three two-class AUC (DEM vs not DEM, MCI vs. not MCI, and CN vs not CN). For mAUC, higher values indicate better performance. The mAUC is a group-level metric whereby the predictions were first pooled over all test participants across their entire forecast horizon into a vector of length # total future timepoints, before calculating the mAUC, resulting one value per test set.
MMSE and ventricles prediction accuracy was evaluated using mean absolute error (MAE). Lower MAE indicates better performance. The MAE were averaged across all forecast timepoints within each participant, resulting in a vector of length #test_participant values per test set.
For within-cohort evaluation, because of the 20-fold cross-validation, there were 20 mAUC values, 20 MAE values for MMSE and 20 MAE values for ventricle volumes. Although the test sets do not overlap, the participants used for training do overlap across the test sets. Therefore, the prediction metrics were not independent across the 20 test sets. To account for the non-independence, we utilized the corrected resampled t-test (Bouckaert & Frank, 2004) to assess performance differences between algorithms. Separate tests were performed for mAUC, MMSE and ventricle volume.
For cross-cohort evaluation, the final performance was computed by averaging the performance metrics across 20 trained models of each algorithm. To assess performance differences between algorithms, since each participant in the external datasets were independent, we performed paired sample t-test (Cohen, 1988) for MMSE MAE and ventricle volume MAE, as well as a permutation test (Good, 2000) for group-level metrics (mAUC). Supplementary Figures S3 and S4 illustrate the t-test and permutation test respectively.
Multiple comparisons were corrected with a false discovery rate (FDR) of q < 0.05 (Benjamini & Hochberg, 1995) for both within and cross-cohort evaluations.
3 Results
3.1 FROG variants perform the best for within-cohort clinical diagnosis prediction
Figure 3 illustrates the performance of MinimalRNN, AD-Map and three FROG variants (L2C-XGBw, L2C-XGBnw and L2C-FNN) for within-cohort (ADNI) clinical diagnosis, MMSE and ventricle volume prediction. All models exhibited similar performance for predicting MMSE and ventricle volume. However, for predicting future clinical diagnosis, the three FROG variants were better than MinimalRNN, which was in turn better than AD-Map.
3.2 AD-Map and L2C-FNN performed the best for cross-cohort MMSE prediction
Figure 4 illustrates the prediction error of MinimalRNN, AD-Map and three FROG variants (L2C-XGBw, L2C-XGBnw and L2C-FNN) for cross-cohort MMSE prediction in three external datasets (AIBL, MACC, and OASIS). In the AIBL dataset, L2C-FNN was the best, followed by AD-Map and MinimalRNN. In the MACC dataset, AD-Map was the best followed by L2C-FNN. In the OASIS dataset, L2C-FNN was the best, followed closely by AD-Map. Overall, in this analysis, L2C-FNN and AD-Map were the best. Similar conclusions were obtained if we only considered AD dementia, with non-AD dementia set to NaN (Figure S5).
3.3 L2C-FNN, L2C-XGBnw & AD-Map performed the best for cross-cohort ventricle volume prediction
Figure 5 illustrates the prediction error of MinimalRNN, AD-Map and three FROG variants (L2C-XGBw, L2C-XGBnw and L2C-FNN) for cross-cohort ventricle volume prediction in three external datasets (AIBL, MACC, and OASIS). In the AIBL dataset, all approaches had similar performance, although L2C-XGBnw was the best, followed closely by L2C-FNN. In the MACC dataset, L2C-FNN was the best, followed by AD-Map. In the OASIS dataset, AD-Map performed the best, followed by L2C-XGBnw. Overall, in this analysis, L2C-FNN, L2C-XGBnw and AD-Map performed the best. The original FROG algorithm (L2C-XGBw) and MinimalRNN performed the worst. Similar conclusions were obtained if we only considered AD dementia, with non-AD dementia set to NaN (Figure S6).
3.4 L2C-FNN outperformed other models for cross-cohort diagnosis prediction
Figure 6 illustrates the prediction accuracy of MinimalRNN, AD-Map and three FROG variants (L2C-XGBw, L2C-XGBnw and L2C-FNN) for cross-cohort clinical diagnosis prediction in three external datasets (AIBL, MACC, and OASIS). In the AIBL dataset, L2C-FNN was numerically the best, but there was no statistical difference among the FROG variants and MinimalRNN. However, AD-Map was statistically worse than the three FROG variants. In the MACC dataset, the three FROG variants performed similarly well and were all statistically better than MinimalRNN, which was in turn better than AD-Map. Finally, in the OASIS dataset, L2C-FNN was the best, while the original FROG algorithm (L2C-XGBw) was the worst. Overall, in this analysis, L2C-FNN performed the best. Similar conclusions were obtained if we only considered AD dementia, with non-AD dementia set to NaN (Figure S7).
3.5 Further analyses
3.5.1 L2C-FNN compared favorably to other models across varying input timepoints
Figures 7 to 9 show the cross-dataset prediction performance of MinimalRNN, AD-Map and three FROG variants (L2C-XGBw, L2C-XGBnw and L2C-FNN) with varying number of input time points. Due to the constraints of the datasets, the maximum number of input timepoints for each participant is only 2 for AIBL and 3 for MACC. Therefore, results for AIBL with 3 and 4 timepoints and for MACC with 4 timepoints are marked as “N.A.”
Table 10 shows the results of statistical tests comparing L2C-FNN and other approaches. Overall, L2C-FNN consistently matched or outperformed other approaches across all datasets and different number of observed timepoints, with only two exceptions (Table 10). The first exception was that AD-Map was statistically better than L2C-FNN when predicting ventricle volume with 1 input timepoint in the MACC dataset (Table 10). The second exception was that L2C-XGBnw was statistically better than L2C-FNN when predicting ventricle volume with 1 input timepoint in the OASIS dataset (Table 10).
3.5.2 L2C-FNN compared favorably with other methods for all yearly intervals
Figures 10 to 12 show the yearly breakdown in prediction performance from Figures 4 to 6, extending up to 6 years into the future. As anticipated, the prediction accuracy for all algorithms declines as the prediction horizon increases. Table 11 shows the results of statistical tests comparing L2C-FNN with other approaches. L2C-FNN consistently matched or outperformed other methods from year 1 to year 6 across all datasets with two exceptions (Table 11). The exception was that AD-Map was statistically better than L2C-FNN when predicting MMSE in year 0-1 and year 1-2 in the MACC dataset.
4 Discussion
In this study, we evaluated the winning algorithm of the TADPOLE challenge FROG, two FROG variants, MinimalRNN and AD-Map in the ADNI dataset, as well as three external datasets. In the ADNI dataset, all three FROG variants performed similarly well and outperformed MinimalRNN and AD-Map for clinical diagnosis. The excellent performance of FROG in clinical diagnosis was consistent with the outcome of the TADPOLE challenge, where FROG was ranked 1st in clinical diagnosis prediction (Marinescu et al., 2021). In the three external datasets, the FROG variant L2C-FNN compared favorably with the other two FROG variants, MinimalRNN and AD-map.
An inherent challenge in the TADPOLE problem set up is the pervasive missing data in each participant. Missing data occurs when participants fail to show up or fail to complete certain tests or scans during visits. In most longitudinal datasets, not all data is collected at all timepoints by design. For example, one visit might only involve MRI scans, while another visit might only involve detailed neuropsychological exams. Therefore, the implication is that in every participant, there is missing data at almost every observed time point.
State-based models, such as MinimalRNN, require specialized techniques to handle the missing data. By contrast, the L2C feature transformation significantly reduces the ratio of missing data. As a result, the input to models, comprising L2C features, contains substantially fewer missing data. This advantage of L2C transformation might explain the advantage of FROG variants over MinimalRNN. Another theoretical disadvantage of MinimalRNN is that each future prediction is based on previous predictions, which might lead to error accumulation (Fan et al., 2019). This error accumulation becomes particularly pronounced in longer-term predictions. This might be another reason why FROG variants outperformed MinimalRNN.
The original FROG algorithm (L2C-XGBw) constructs separate XGBoost models for each target variable and different forecast windows, resulting in a total of 15 models. However, the optimal window ranges might vary significantly between datasets. Furthermore, dividing training samples into different bins based on the forecast interval range reduces the available training samples in each bin. We hypothesized that eliminating forecast window stratification and enabling the model to implicitly leverage temporal information within L2C features (e.g., time since baseline, time since most recent measurements) could enhance model generalizability. To test this hypothesis, we trained a single FNN for all training samples with varying forecast intervals. Experimentally, L2C-FNN outperformed L2C-XGBw in nearly every prediction task (i.e., clinical diagnosis, MMSE, ventricle volume) across all external datasets.
Another feature of L2C-FNN is the use of multi-task learning, such that a single FNN is used to predict all three target variables (ventricle volume, cognition, clinical diagnosis) simultaneously. Multi-task learning leverages shared representations to capture common patterns among related tasks, which might enhance data efficiency, accelerate learning, and mitigate overfitting (Crawshaw, 2020). The FROG variant (L2C-XGBnw) helped to dissociate the effects of multi-task learning and the elimination of forecast window stratification by training a separate XGBoost model for each target variable. L2C-FNN again outperformed L2C-XGBnw in nearly every prediction task (i.e., clinical diagnosis, MMSE, ventricle volume) across all external datasets. Therefore, these results suggest the potential advantage of multi-task learning.
In our evaluation, AD-Map was highly competitive in terms of predicting MMSE and ventricle volume, but performed poorly for clinical diagnosis. Because AD-Map utilized a sigmoid-like parameterization, it might predict continuous variables (e.g., MMSE and ventricle volume) better than categorical variables (e.g., clinical diagnosis). Consistent with the original study (Maheux et al., 2023), we did not directly model clinical diagnosis in the AD-Map algorithm. We have also experimented with including clinical diagnosis in the AD-Map model, which improved clinical diagnosis prediction, but resulted in much worse MMSE and ventricle prediction (not shown).
A limitation of the current study is that the models are trained from a single dataset (ADNI). We expect that models trained from multiple datasets might lead to better generalization to new populations and scanners (Dou et al., 2019; Liu et al., 2020; Chen et al., 2024). Therefore, a potential future work is to collate multiple datasets and train a single L2C-FNN model for future usage. Another limitation is that the current study only considered biomarkers that existed in all four datasets, so blood and PET biomarkers were excluded. Blood and PET biomarkers have been shown to be important markers of AD dementia (Nordberg et al., 2010; Mattsson et al., 2017; Chételat et al., 2020; Chong et al., 2021), therefore future studies could likely benefit from the incorporation of these additional biomarkers.
5 Conclusion
In this study, we evaluated three FROG variants, MinimalRNN and AD-Map in predicting future dementia progression in the ADNI dataset and three external datasets. We found that a FROG variant (L2C-FNN) performed the best in the three external datasets. L2C-FNN maintained better prediction performance regardless of the number of observed timepoints in a participant. L2C-FNN also consistently matched or outperformed other approaches from year 1 to year 6 across all external datasets, underscoring its potential for reliable long-term prediction in dementia progression.
7 Data availability statement
The ADNI and the AIBL datasets can be accessed via the Image & Data Archive (https://ida.loni.usc.edu/). The MACC dataset can be obtained via a data-transfer agreement with the MACC (http://www.macc.sg/). The OASIS dataset can be requested from (https://www.oasis-brains.org/).
8 Code availability statement
Code for all five models can be found here (GITHUB_LINK). Two co-authors (L.A. and N.W.) reviewed the code before merging it into the GitHub repository to reduce the chance of coding errors.
9 Author contribution statement
C.Z.: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Software; Visualization; Writing – original draft; Writing – review & editing. L.A.: Software; Validation; Visualization; Writing – review & editing. N.W.: Software; Validation; Visualization; Writing – review & editing. K.N.: Investigation; Software; Data curation; Visualization; Writing – review and editing. C.O.: Visualization; Writing – review and editing. P.C.: Visualization; Writing – review & editing. C.C.: Resource; Writing – review & editing. J.H.Z.: Resource; Writing – review & editing. K.L.: Methodology; Software; Writing – review & editing. B.T.T.Y.: Conceptualization; Formal analysis; Funding acquisition; Investigation; Methodology; Resource; Supervision; Visualization; Writing – original draft; Writing – review & editing.
10 Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
6 Acknowledgment
We would like to thank Christina Rabe, and Paul Manser from Team FROG for sharing their code with us, which significantly facilitated the current study. This research is supported by the NUS Yong Loo Lin School of Medicine (NUHSRO/2020/124/TMR/LOA), the Singapore National Medical Research Council (NMRC) LCG (OFLCG19May-0035), NMRC CTG-IIT (CTGIIT23jan-0001), NMRC STaR (STaR20nov-0003), NMRC OF-IRG (OFIRG24jan-0030), Singapore Ministry of Health (MOH) Centre Grant (CG21APR1009), the Temasek Foundation (TF2223-IMH-01), and the United States National Institutes of Health (R01MH120080 & R01MH133334). Our computational work was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the Singapore NMRC, MOH, Temasek Foundation or USA NIH.
Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.;Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.;Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.
Data were provided in part by OASIS-3: Longitudinal Multimodal Neuroimaging: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P30 AG066444, P50 AG00561, P30 NS09857781, P01 AG026276, P01 AG003991, R01 AG043434, UL1 TR000448, R01 EB009352. AV-45 doses were provided by Avid Radiopharmaceuticals, a wholly owned subsidiary of Eli Lilly.
Footnotes
↵* Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) and the Australian Imaging Biomarkers and Lifestyle Study of Aging (AIBL) database (https://aibl.org.au/). As such, the investigators within the ADNI and AIBL contributed to the design and implementation of ADNI and AIBL and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI and AIBL investigators can be found at https://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf and https://aibl.org.au/
↵1 Although the goal is to predict indefinitely into the future, evaluation can only rely on the available timepoints in the test set.