Abstract
Background Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease characterised by a highly variable clinical presentation and multifaceted genetic and biological bases that translate into great patient heterogeneity. The identification of homogeneous subgroups of patients in terms of both clinical presentation and biological causes, could favour the development of effective treatments, healthcare, and clinical trials. We aimed to identify and characterise homogenous clinical subgroups of ALS, examining whether they represent underlying biological trends.
Methods Latent class clustering analysis, an unsupervised machine-learning method, was used to identify homogenous subpopulations in 6,523 people with ALS from Project MinE, using widely collected ALS-related clinical variables. The clusters were validated using 7,829 independent patients from STRENGTH. We tested whether the identified subgroups were associated with biological trends in genetic variation across genes previously linked to ALS, polygenic risk scores of ALS and related neuropsychiatric traits, and in gene expression data from post-mortem motor cortex samples.
Results We identified five ALS subgroups based on patterns in clinical data which were general across international datasets. Distinct genetic trends were observed for rare variants in the SOD1 and C9orf72 genes, and across genes implicated in biological processes relevant to ALS. Polygenic risk scores of ALS, schizophrenia and Parkinson’s disease were also higher in distinct clusters with respect to controls. Gene expression analysis identified different altered biological processes across clusters reflecting the genetic differences. We developed a machine learning classifier based on our model to assign subgroup membership using clinical data available at first visit, and made it available on a public webserver at http://latentclusterals.er.kcl.ac.uk.
Conclusion ALS subgroups characterised by highly distinct clinical presentations were discovered and validated in two large independent international datasets. Such groups were also characterised by different underlying genetic architectures and biology. Our results showed that data-driven patient stratification into more clinically and biologically homogeneous subtypes of ALS is possible and could help develop more effective and targeted approaches to the biomedical and clinical study of ALS.
Introduction
Amyotrophic lateral sclerosis (ALS) is a devastating neurodegenerative disease characterised by progressive neuromuscular degeneration leading to death, typically from respiratory failure within three years of onset (1). The disease affects upper and lower motor neurons in the brain and spinal cord and has an estimated lifetime risk of between 1 in 300-400 people.
ALS is clinically defined, yet its clinical presentation varies greatly. The mean age of onset is around 60 years although people may develop ALS at almost any age during adulthood (2). ALS normally leads to death within 3-5 years of onset, however, in some patients it can occur within a year, and 5-10% of people live for over 10 years (3–5). Between 60 and 70% of people first develop symptoms in the spinal innervated muscles, with others having bulbar, mixed or, in about 3%, respiratory onset (2, 6–8). The extent of involvement of upper and lower motor neurons varies, ranging between a pure upper motor neuron (primary lateral sclerosis; PLS) and pure lower motor neuron phenotype (progressive muscular atrophy; PMA), with most presentations being a mixture of the two (9, 10). An overlap with frontotemporal dementia (FTD) is also recognised, with a joint diagnosis for up to 15% of people in some studies and cognitive dysfunction in around 50%, according to disease stage (11, 12).
The genetic landscape of ALS is similarly heterogenous, with recognised monogenic, oligogenic, and polygenic contributions to disease (1, 13–17). Known genetic variation explains disease for around 15-20% of people, implicating variants in over 40 genes as causal for or modifiers of ALS (15, 18, 19). Likewise, ALS is associated with disruption to various biological processes, including cytoskeletal transport, RNA function, autophagy, and proteostasis (1, 7, 20).
Unfortunately, the efficacy of existing treatments for ALS is limited; the most effective drug therapies extend life expectancy from onset by no more than a few months (21, 22). This poor efficacy is partially due to the heterogeneity of the disease, and therefore more effective treatments may be discovered within a precision medicine framework. Existing research supports this hypothesis. For instance, treatment with lithium appears to extend survival trajectories specifically among people with an UNC13A variant (23). Further research examining the utility of gene therapies that aim to offset aberrant gene function associated with specific genetic variation is ongoing (24). This is valuable for SOD1-ALS which appears biologically separate from non-SOD1 ALS (25) but may be suboptimal when the biological disease signature is indistinguishable between people with and without a given variant, as for C9orf72 and non-C9orf72 associated ALS (26), given the breadth of processes implicated in the disease.
Patient stratification is an effective avenue for discovery of biological mechanisms relevant to particular subgroups (27). Such stratification has been attempted previously. For instance, clusters of ALS have been identified based on biological trends in transcriptomic and neuroanatomical data, with evidence suggesting that these groups may be reflected in the phenotype (28–33).
Previous attempts to clustering in ALS have been limited to individual national cohorts and were not generalisable. For example, clinical clusters, predictive of disease duration, were identified in a British cohort using latent class cluster analysis (LCA) (34). Independently, semi-supervised machine-learning applied to high-dimensional clinical data from two independent Italian cohorts identified clusters conforming to the following clinical subgroups: bulbar, respiratory, flail arm, classical, pyramidal, and flail leg (35).
No study to date has validated the subgroups they identified using independent samples from different populations which greatly limits their applicability. Moreover, despite ALS being clinically defined, no study has examined whether data-driven clinical subgroups differ biologically beyond subgroups defined by individual gene variants. The identification of clinically and biologically homogenous subgroups that are robust and consistent across populations of patients will likely benefit development of precision medicine approaches for ALS extending beyond those targeting specific genetic vulnerabilities.
Accordingly we examined: (A) whether data-driven clusters defined by widely collected clinical measures could be identified and validated across international ALS cohorts; (B) clinical characteristics defining these clusters; (C) whether clusters differ biologically in terms of (i) the frequency of rare variants in genes previously associated with ALS, (ii) common genetic variation captured within polygenic risk scores (PRS) for risk of ALS and related neuropsychiatric disorders, and (iii) molecular signatures via gene expression levels; (D) the extent to which clinically driven clusters could be identified using data attainable around the time of diagnosis. Across these investigations we used data of over 14,000 ALS patients from the Project MinE (36), the Survival, Trigger and Risk, Epigenetic, eNvironmental and Genetic Targets for motor neuron Health (STRENGTH) consortia, and the King’s College London (KCL) Brain Bank (37, 38).
Methods
Sample
People with a diagnosis of ALS, PMA, or PLS were sampled from two international consortia: Project MinE (36) and STRENGTH. All samples underwent pre-processing (see Figure 1) to determine the sample entered into LCA and subsequent analyses (N: Project MinE = 6,523, STRENGTH = 7,829). Missingness was present in both cohorts (see Figure S1). This is handled in LCA using a full information maximum likelihood approach, which enables model fitting for records with incomplete data (39). Subsequent analyses employed a complete case analysis approach, retaining only people with information recorded for all variables used in that analysis.
Age- and sex-matched control participants from Project MinE (N = 2,414 after pre-processing; see Figure 1) were also sampled as a comparison group for analysis of trends in common genetic variation.
Study design
Clinical data
Phenotypic information from the Project MinE and STRENGTH datasets were used as features for clustering. The included variables are all frequently collected for people with ALS: sex at birth (male or female), site of onset (not-bulbar or bulbar), clinical diagnosis (ALS, PLS, or PMA), age of onset (years), disease duration (years) from onset until death or last status update with associated censoring status (alive or deceased), and delay from onset until diagnosis. Diagnostic delay was standardised by country to account for any inter-country differences (see Figure S2). Standardisation was performed by centring each person’s diagnostic delay by the per-country mean and scaling relative to the per-country standard deviation (see Table S1).
Genetic data
Associations between clinically-defined clusters and biological trends were performed using data from Project MinE. Whole-genome sequence data were generated as previously described (36).
Information on rare genetic variation was extracted for a panel of 36 genes previously implicated in ALS: ALS2, ANG, ANXA11, ATXN1, ATXN2, CFAP410 (formerly C21orf2), CHCHD10, CHMP2B, C9orf72, DAO, DCTN1, ERBB4, FIG4, FUS, hnRNPA1, MATR3, MOBP, NEFH, NEK1, OPTN, PFN1, SCFD1, SETX, SIGMAR1, SOD1, SPG11, SQSTM1 (p62), TAF15, TARDBP, TBK1, TUBA4A, UBQLN2, UNC13A, VAPB, VCP, VEGFA. Summaries of variants occurring across Project MinE samples are available in the databrowser (40).
Variation in C9orf72 and ATXN2 was reported regarding presence or absence of known pathogenic short tandem repeat expansions in each gene. Repeat expansion lengths were determined using Expansion Hunter (41). C9orf72 expansions were confirmed using repeat-primed PCR, being classified as inconsistent if the dry and wet lab results did not match (42, 43). The minimum number of repeat units denoting presence of a repeat expansion was 30 for C9orf72 and 28 for ATXN2 (44). For C9orf72, inconsistent expansions were reported in 32 people and coded as missing.
The presence or absence of rare variants (MAF <0.01 in both Project MinE controls and Gnomad controls v2.1.1) predicted by the Ensembl Variant Effect Predictor (45) to have a high or moderate impact upon gene function was recorded across the remaining 34 genes. Moderate impact variants included missense, in-frame insertions and deletions, and protein altering variants. High impact variants included stop lost and gained, start lost, transcript amplification, frameshift, transcript ablation and splice acceptor and donor variants.
Variants across all genes except C9orf72 and SOD1 were aggregated into burden groups (see Table S2). The main burden groups described three functional pathways related to cellular processes disrupted in ALS: ‘autophagy and proteostasis’, ‘RNA function’, and ‘cytoskeletal dynamics and axonal transport’. Genes were assigned to pathways according to the involvement of their protein products within them and to the processes which are disrupted by deleterious variation in each gene (see Table S2); this was determined through literature review. A further ‘any pathway’ group, aggregating across the three functional pathways, was also defined. Each burden group was binary-coded according to presence or absence of variants in at least one gene assigned to the group.
The SOD1 and C9orf72 genes were analysed individually since their variants are the most frequent genetic causes of ALS, likely reflecting that they are implicated in various disease pathways (46, 47), and each occurred with sufficient frequency to be tested individually.
Associations between clusters and common genetic variation were examined using PRS indicating risk for ALS and related neuropsychiatric diseases. PRS were derived from European ancestry genome-wide association study (GWAS) summary statistics of risk for ALS (14), FTD (48), Alzheimer’s disease (49), Parkinson’s disease (50), and schizophrenia (51). Scores were calculated with SBayesR (52) under the reference-standardised approach of GenoPredPipe (53) (see Supplementary Methods 1.1).
Since samples from Project MinE are included within the ALS GWAS, PRS for this trait were generated based on GWAS summary statistics that exclude meta-analysis ‘stratum 6’, which includes most people from Project MinE. To ensure no sample overlap, analyses including the ALS PRS were performed using only those who were sampled within GWAS stratum 6. All available Project MinE samples were included in analyses based on PRS for other traits.
Gene expression data
The KCL BrainBank expression dataset consists of post-mortem bulk RNA-sequencing samples from the Medical Research Council (MRC) London Neurodegenerative Diseases Brain Bank at KCL. Frozen human post-mortem tissue was taken from the primary motor cortex of patients and controls whose genomes were sequenced as part of Project MinE. The protocols for RNA-sequencing (37, 38) has been described previously.
Procedure
1. Clustering of ALS clinical data
Latent Class Cluster Analysis (LCA) was applied to identify data-driven subgroups in clinical variables (see Supplementary Methods 1.2.). LCA is an unsupervised machine-learning approach with various benefits: information returned enables the immediate inspection of fit quality; data across categorical, ordinal, and continuous modalities can be combined, including time-to-event variables with censoring; an in-built full information maximum likelihood approach (39) enables class (cluster) assignment for people with incomplete records.
Project MinE was used as the discovery cohort, holding back independent samples from STRENGTH for model validation. After validation, the independent Project MinE and STRENGTH samples were pooled together and LCA was repeated within the joint dataset. Consistency between the discovery and joint dataset latent class models was established by inspecting cluster assignment overlap between them. The joint-dataset model was used for subsequent analyses.
2. Clinical characterisation of clusters
Characteristics of the identified clusters predicted by the features used in LCA were first examined using linear discriminant analysis. This algorithm derives linear axes that maximise separation between the classes. The associations between these axes and the predictor variables indicate which variables best distinguish the classes. Although the method performs adequately with categorical data (55, 56), linear discriminant analysis assumes that predictors are continuous and normally distributed. Multinomial logistic regression (57), which makes no such assumption, was therefore applied with stepwise feature selection to support linear discriminant analysis.
Differences in disease duration between classes were examined first via pairwise log-rank tests and second within a cox proportional-hazards model, including class and the other LCA model features as covariates.
Additional procedural information is provided in supplementary methods 1.3.
3. Biological trends across clusters
Associations between the k clusters identified and rare variation in ALS-implicated genes were investigated using kx2 Fisher’s exact tests, with a significance threshold of false-discovery rate adjusted p<0.05. Odds ratios for having a variant in a given cluster vs all other clusters were also determined.
Binary logistic regression models were used to compare PRS for each cluster against all other clusters and against healthy controls. Reflecting the non-independence of these tests, significance was defined by a nominal p<0.05. Each analysis was performed with a given PRS as a univariate predictor and after including the first five principal components of ancestry as covariates.
To determine whether the clusters displayed alterations in biological processes, we performed differential expression and gene enrichment analysis for the samples from Project MinE which had matching motor cortex expression data available (N = 88) and 59 controls. The processing pipeline used for the expression analysis is documented and available on GitHub (https://github.com/rkabiljo/RNASeq_Genes_ERVs and https://github.com/rkabiljo/DifferentialExpression_Genes). Briefly, reads are first interleaved using reformat.sh from BBtools (58), adapters and low-quality reads are trimmed using bbduk.sh (58). The data is then aligned to Hg38 using STAR (59). Transcripts are quantified using HTSeq (60). DESeq2 (61) is used for normalisation and differential expression analysis. An extensive description of the pipeline was recently published (62). Gene enrichment analysis was performed using the most significant 500 differentially expressed genes with clusterProfiler (63).
Further procedural details are provided in supplementary methods 1.4.
4. Prediction of cluster membership using baseline data
Random forest and eXtreme Gradient Boosting classification algorithms were trained to examine whether cluster membership could be predicted using only information accessible around the time of first diagnosis. Six algorithms were trained across the two machine-learning methods with a multiclass classification objective, predicting assignments to the LCA-identified clusters, across three data configurations: first, all clinical features used in LCA except disease duration (which cannot be assessed at this time); second, clinical features from model 1 alongside rare and common genetic variables used to assess biological trends across classes; third, clinical features from model 1, sample-matched with model 2.
Shapley Additive exPlanations (SHAP) (64, 65) were used to examine which features were more influential upon machine-learning algorithm class membership predictions. To further examine which features were important for prediction of exclusively Class 1 versus 2, the largest LCA clusters, 6 additional binary classification algorithms were trained across the two machine-learning methods and three data configurations after restricting only to people assigned to these classes.
Only people without missingness in included features were used in each multiclass (binary) analysis, therefore the total sample size was 12,508 (11,109) in the first data configuration and 3,226 (2,996) in the second and third. The Algorithms were trained with 10-fold cross-validation, repeated 10 times using pseudorandom seeding. Out-of-fold prediction performance was evaluated using the metrics of sensitivity, specificity, precision, and balanced accuracy.
Results
1. Clustering of ALS clinical data
We identified that clinical subgroups of ALS were well described within a 5-class latent class model (see Table S3; Table 1).
The 5-class model was first identified in the Project MinE discovery sample, with lower Akaike information criterion (AIC) and Bayesian Information Criterion (BIC) values than models with fewer classes (Table S3). Latent class models did not converge when considering a higher number of classes. Acceptance of the 5-class model was supported by its entropy (0.791) indicating that the model had reasonable certainty in classification, and by people having a high probability of belonging to their assigned classes (Table 1). An equivalent model, with high entropy (0.850), was identified when repeating LCA after restricting to only people without missingness in diagnostic delay and disease duration (see Table S4).
The external validity of this solution was affirmed through application to independent samples from STRENGTH. High entropy (0.830) and high average class probability of belonging to assigned classes (Table 1) indicated that the solution fit these data well. Accurate prediction of class membership for people from STRENGTH within a k-nearest neighbours algorithm trained upon the clinical data from Project MinE samples further supported the validity of the subgroups identified (see Table S5).
Since the initial 5-class model fitted both the discovery and validation datasets well, LCA was repeated using a joint dataset, pooling across independent samples from Project MinE and STRENGTH. Models of 1 to 9 classes were fitted and the 5-class model was accepted above those with additional classes since the external validity of a 5-class model had already been demonstrated and because improvements to AIC and BIC were not substantial with additional classes (Table S3). Entropy in the 5-class solution remained high (0.836) and people had high probability of belonging to their assigned classes (Table 1). As before, a highly comparable model, with 0.881 entropy, was identified when repeating LCA after restricting to only people with diagnostic delay and disease duration reported (see Table S4).
Equivalence between the 5-class models fitted to the Project MinE and joint datasets was determined by examining cluster similarity: ∼91% of people from Project MinE and ∼92% of STRENGTH were assigned to the equivalent cluster in the joint dataset model (see Figure S4).
Accordingly, the joint dataset 5-class solution was accepted as the final model and used for subsequent analyses.
2. Clinical characterisation of clusters
We examined the clinical characteristics of the clinically-defined ALS subgroups identified with LCA. Table 2 and Figure 2 present descriptive statistics for clinical features across each class. Linear discriminant and multinomial logistic regression analyses highlight that diagnostic delay and disease duration were the main class delineators (See Figure 3; Table 3; Table S7-S8). Interestingly, although both linear discriminant 1 and 2 (LD1 and LD2) presented high correlations with diagnostic delay and disease duration, LD1 positively correlated with these clinical measures, while LD2 positively correlated with disease duration and negatively correlated with diagnostic delay (Table 3). All features were retained in the multinomial logistic regression model analysing only people without censored disease duration; sex was dropped as a predictor when people with censored disease duration were included in the analysis (Table S7-S8).
Survival analysis further demonstrated the relationship between class and disease duration, indicating that the classes are each associated with distinct survival trajectories (Figure 2), and class remains the most influential predictor of survival within a cox proportional-hazards model after adjusting for other clinical features (Table S9).
3. Biological trends across clusters
Fisher’s exact tests were applied to examine associations between rare genetic variation and class for variants occurring within genes previously associated with ALS (see Table 4). SOD1 variants and the C9orf72 expansion differed in frequency across classes. C9orf72 expansions were overrepresented in Class 1 and underrepresented in Class 2; the opposite was observed for SOD1 variants. The frequency of variants in genes linked to RNA processing and to cytoskeletal dynamics and axonal transport also differed across classes.
Binary logistic regression models were fitted to examine associations between class and PRS for ALS and related neuropsychiatric disorders (see Figure 4). Class 1 was significantly associated with higher PRS for risk of ALS, compared both to patients in other classes and to healthy controls. Interestingly, PRSs for schizophrenia and Parkinson’s disease were lower in Class 1 compared to other classes, and were higher in Classes 2 and 5. PRS for schizophrenia were higher in most classes compared to controls, and those for Parkinson’s disease were higher for Classes 2 and 5.
RNA-sequencing data from 88 KCL BrainBank samples (a subset of Project MinE) assigned to classes via LCA (n: Class 1 = 70, Class 2 = 18) and KCL BrainBank controls (n = 59) was used to perform gene enrichment analysis of the 500 most significant differentially expressed genes present in three designs (Class 1 vs Controls, Class 1 vs Class 2, and Class 2 vs Controls). When comparing Classes 1 and 2, we found that the 500 most significant differentially expressed genes were enriched for cytoskeletal, extracellular matrix, calcium and neurotransmitter-specific synaptic signalling and muscle-function related processes, whilst both classes shared enrichments for oxidative phosphorylation and postsynaptic activity-based pathways (Figure 5). Comparisons of each class with controls highlighted processes implicated exclusively in ALS; Class 1 was enriched for ubiquitin and unfolded protein binding and nucleocytoplasmic transport, whereas Class 2 displayed enrichment for multiple neurodegenerative diseases in addition to Golgi-ER transport and T-cell signalling (Figure 5). Interestingly, these analyses showed clear reflections of the genetic differences that we found among clusters. For example, gene expression highlighted cytoskeletal and tubulin processes in the differentiation of Class 1 and 2 and so did the occurrence of rare variants in patients from these two classes. Furthermore, PRS for Parkinson’s disease was higher in Class 2 and differential expression analysis for Class 2 versus controls showed enrichment for genes involved in Parkinson’s disease pathways (KEGG). The full results are available in Tables SX1-3.
4. Prediction of cluster membership using baseline data
We examined the extent to which machine-learning algorithms could predict class membership using data available around the time of diagnosis. Random Forest and eXtreme Gradient Boosting algorithms were trained using clinical data only or a combination of clinical and genetic variables. Comparison of the area under the receiver operating characteristic curve for prediction of each class versus all other classes determined that the two approaches performed comparably (see Figures S7-S9). Table 5 presents performance metrics for the multiclass eXtreme Gradient Boosting algorithms trained upon each data configuration. Class membership was predicted with high accuracy for Classes 3-5 (see Table 5; Figure 6). Prediction of Classes 1 and 2 still performed reasonably, but with poorer specificity in Class 1 and sensitivity in Class 2; most misclassification in these groups was for people in the opposing class (see Figures S7-S9). The algorithm performance was comparable across sample-matched datasets when using clinical features alone and using clinical and genetic measures.
Evaluation of feature importance using SHAP values identified diagnostic delay as the most influential feature upon predictions from either approach and that various other features contributed more greatly to the prediction of Class 1 and 2 in particular (see Figure S10).
Discussion
Latent class clustering analysis was applied to clinical data from large international ALS datasets to identify data-driven clinical subgroups and investigate whether biological differences exist between groups. Five distinct disease subgroups were identified, delineated primarily by diagnostic delay and disease duration, with importance also attributed to age and site of disease onset and clinical phenotype (ALS, PLS, or PMA). Notably, these clusters generalise across 13 countries (see Figure S6).
Survival analysis indicated that the clusters were distinct regarding disease duration; duration was greater for each respective class between 1 and 5. Diagnostic delay also generally increased with respect to class number, but, unlike disease duration, was similar in Class 1 and 2. Although diagnostic delay is commonly regarded as directly correlated with disease duration and often considered its proxy, our results suggest a more complex relationship between these two clinical measures. Indeed, the opposite signs of their correlation coefficients with LD1 and LD2 suggest the existence different patterns of progression across patients that require both diagnostic delay and disease duration to be discerned. Other features also aggregated unequally between clusters; bulbar onset was more frequent in Class 1; age of onset was later in Class 1 and earlier in Class 5, but comparable for Classes 2-4; the PLS clinical subtype was more frequent in Classes 4 and 5; the PMA subtype was more evenly distributed, but less frequent in Class 1. Figure 2 visualises the clinical characteristics of each class.
The main clinical variables discriminating subgroups from this study are consistent with those distinguishing subgroups from a previous application of LCA in a UK ALS cohort (34). Both studies find diagnostic delay as an important discriminator of ALS subgroups and that subgroups have distinct survival trajectories. However, comparing class assignments of people from STRENGTH using the latent class model from each study demonstrates that the clusters are not the same (see Figure S11). The most striking distinction is that most people assigned to Classes 1 and 2 using the present model are assigned to a single class by the model from the previous study, resulting in low resolution model that assigns the great majority of patients to only one group. This difference might be the result of two key differences between the models: the inclusion of disease duration in present model, which was withheld in the previous study and instead used as a tool for clinical validation, and the use of larger international ALS cohorts, instead of data from only one country, to train our model, which underpins its generalisability to all ALS patients.
Interestingly, the identified subgroups do not correspond to clinically defined ALS subtypes (e.g., PMA or PLS). Although this might support the hypothesis that the validity of current clinically defined subtypes should be reconsidered, it may also reflect a high rate of misdiagnosis or that the variables required to distinguish between clinical subtypes were not captured within these models. The use of high-dimensional clinical data identified ALS subgroups congruent with the classification system in a recent study (35). However, these subgroups would have fallen within the pure ALS diagnosis in our study and extreme subtypes such as PMA and PLS were not represented.
Analysis of rare genetic variants showed that the pathogenic C9orf72 repeat expansion was more frequent in Class 1, consistent with evidence that the variant is associated with shorter disease duration (14, 67–69). Putatively deleterious SOD1 variants were overrepresented in Class 2, the characteristics of which were more representative of the disease phenotype associated with variants in this gene than for than Class 1 (70).
Variants in genes associated with cytoskeletal dynamics and axonal transport cell processes were more frequent in Class 2. Variants in genes associated with RNA function were also unequally distributed across groups, although no individual group appeared to drive the association. These findings suggest that the ALS phenotype may present differently according to disruption of certain cellular processes. Recent studies support this possibility, reporting differences in disease progression and survival according to common variants associated with antioxidant and inflammatory disease pathways (71) and according to expression-based clusters (31, 33). Associations between disruption to particular biological processes and the ALS phenotype warrant further investigation.
Trends in common genetic variation were examined using PRS for risk of ALS or related neuropsychiatric traits. Class 1 was associated with higher ALS PRS. Conversely, Classes 2 and 5 were associated with higher schizophrenia PRS, which was also generally higher across ALS subgroups compared to healthy controls. Classes 2 and 5 were also associated with higher, and Class 1 with lower, Parkinson’s disease PRS. No associations emerged between class and PRS for FTD or Alzheimer’s disease. The lack of association with FTD PRS may reflect the limited sample size of the only available GWAS for this trait (48), and thus an underpowered PRS.
The finding that ALS PRS were only higher than controls in Class 1, the largest cluster (∼49% of all ALS), suggests that the GWAS primarily captures variants relevant to disease risk in this subgroup. Future ALS GWASs could benefit from prioritizing patients from this class to maximise their statistical power. Different variants may therefore be relevant for disease risk in the other subgroups. The possibility is supported by the association between Classes 2 and 5 and higher Parkinson’s disease PRS. These associations also suggest that genetic overlaps between ALS and other traits could be driven by certain disease subgroups (13, 14). Interestingly, the schizophrenia PRS was consistently higher than controls across all classes, supporting the hypothesis that common mechanisms are shared by all ALS patients and that these overlap at least partially with those underlying the development of schizophrenia.
Analysis of matching post-mortem motor cortex transcriptomic data available for a subset of Project MinE supported the differences in biological trends between Classes 1 and 2 in several ways. Firstly, gene enrichment analysis identified many significant processes and pathways linked to cytoskeletal dynamics and axonal transport, congruent with our finding that variants in genes linked to this pathway were more frequent in Class 2. Secondly, Class 2 was uniquely enriched for genes involved in Parkinson’s disease and schizophrenia (ErbB signalling (72, 73)), which supports our finding that Class 2 displays higher polygenic risk scores of these diseases with respect to both controls and other classes. Finding differences in oxidative stress and apoptotic processes between Classes 1 and 2 aligns well with our finding that variants in the SOD1 gene (which codes for an antioxidant enzyme that has been proposed to affect ALS via both gain and loss of function mechanisms (47, 74)) were more frequent in Class 2. The variability in biological trends observed across these clinical subgroups supports the perspective that patient stratification may be important for identifying biological disease mechanisms (27).
Lastly, we found that class membership could be predicted with only information attainable around the time of diagnosis using machine-learning classification algorithms (see Table 5; Figure 6). Classes 3 to 5 were clearly distinguished (AUC = 0.999) and lower performance for Classes 1 and 2 (AUCclass1 = 0.847 and AUCclass2 = 0.827) is expected since their main delineator is disease duration which was excluded as a feature (see Table 2; Figure 2; Figure 3). Algorithms making predictions using both genetic and clinical features performed comparably to those trained using clinical features only. For rare variants in particular, this may reflect that the presence of a given variant only informs predictions for a small proportion of the whole cohort.
A limitation of this study is that rare variant analysis was conducted under the assumption that identified variants are relevant for the risk or modification of the ALS phenotype. We included rare variants predicted to have a functional effect upon genes previously implicated in ALS. Such broad inclusion criteria will likely identify a range of variants with a spectrum of relevance to the disease. Without further supporting evidence, or larger sample sizes, it is difficult to ascertain the role of each individual variant (75).
A further limitation is that biological trends could only be tested to a limited degree. Although the genetic architecture of all classes was investigated, analysis with transcriptomic data was constrained by the availability of data for relatively limited samples from only Classes 1 and 2. The genetic analyses were similarly limited by the small sample size for certain clusters. However, our investigation drew upon one of the richest genomic resources for ALS currently available (36) and this constraint should lessen in future studies as resources continue to expand.
Future studies should refine the disease classifications we have developed. The identified subgroups were partly and differentially separated by diagnostic delay and disease duration. Progression has been identified as an indicator of disease duration in ALS (76, 77), and the present patterns suggest non-linearity of progression across the disease course, which has been recognised previously (78). Our measurement of diagnostic delay is likely an imperfect proxy for disease progression. Therefore, future studies should include measures of patterns in disease progression. Recognising overlaps between ALS and other conditions (11, 14), it is also pertinent to measure non-motor features of ALS, such as cognitive or behavioural change.
In conclusion, we illustrated that our data-driven approach can identify distinct clinical subtypes of ALS and suggests that differences between these subgroups reflect the role of distinct biological mechanisms, beyond individual gene variants, upon the phenotype. The study supports the perspective that data-driven patient stratification may aid identification of biological disease mechanisms and, therefore, that such approaches should be considered in the design of future ALS studies. Defining clinically and biologically meaningful subtypes of ALS has important implications for future research and clinical practice regarding: the inherent utility of an early and reliable prediction of disease prognosis for improving and personalising patient care; improved matching of people across clinical trial placebo and active arms to facilitate testing of treatment efficacy; precision medicine development owing to easier identification of disease processes relevant for particular subgroups; the design of genetic and biological studies.
Future research should aim to develop detailed understanding of ALS subtypes by employing multi-omics datasets to examine how they are reflected across the spectrum of genome to phenome.
Data availability
The Project Mine clinical and genetic data is available upon request at https://www.projectmine.com/research/data-sharing/. Accessing the STRENGTH clinical data requires a collaboration and data sharing agreement with the EU JPND STRENGTH consortium, administered through King’s College London (contact ammar.al-chalabi@kcl.ac.uk).
Funding
HM was supported by GlaxoSmithKline and the KCL funded centre for Doctoral Training (CDT) in Data-Driven Health. GPH was supported by the Perron Institute for Neurological and Translational Science and the KCL funded centre for Doctoral Training (CDT) in Data-Driven Health. AAK was funded by ALS Association Milton Safenowitz Research Fellowship (grant number 22-PDF-609.doi: 10.52546/pc.gr.150909), The Motor Neurone Disease Association (MNDA) Fellowship (Al Khleifat/Oct21/975-799), The Darby Rimmer Foundation, and The NIHR Maudsley Biomedical Research Centre. This project was also funded by the MND Association and the Wellcome Trust. This is an EU Joint Programme-Neurodegenerative Disease Research (JPND) project. The project is supported through the following funding organizations under the aegis of JPND–http://www.neurodegenerationresearch.eu/ [United Kingdom, Medical Research Council (MR/L501529/1 and MR/R024804/1) and Economic and Social Research Council (ES/L008238/1)]. AAC was a NIHR Senior Investigator. AAC received salary support from the National Institute for Health Research (NIHR) Dementia Biomedical Research Unit at South London and Maudsley NHS Foundation Trust and King’s College London. The work leading up to this publication was funded by the European Community’s Health Seventh Framework Program (FP7/2007–2013; grant agreement number 259867) and Horizon 2020 Program (H2020-PHC-2014-two-stage; grant agreement number 633413). This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement no. 772376–EScORIAL. This study represents independent research part funded by the NIHR Maudsley Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, King’s College London, or the Department of Health and Social Care. Funding was provided by the King’s College London DRIVE-Health Centre for Doctoral Training and the Perron Institute for Neurological and Translational Science. AI is funded by South London and Maudsley NHS Foundation Trust, MND Scotland, Motor Neurone Disease Association, National Institute for Health and Care Research, Spastic Paraplegia Foundation, Rosetrees Trust, Darby Rimmer MND Foundation, the Medical Research Council (UKRI) and Alzheimer’s Research UK. OP is supported by a Sir Henry Wellcome Postdoctoral Fellowship [222811/Z/21/Z]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Project MinE Belgium was supported by a grant from IWT (n° 140935), the ALS Liga België, the National Lottery of Belgium and the KU Leuven Opening the Future Fund.
This research was funded in whole or in part by the Wellcome Trust [222811/Z/21/Z]. For the purpose of open access, the author has applied a CC-BY public copyright licence to any author accepted manuscript version arising from this submission.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Acknowledgments
Samples used in this research were in part obtained from the UK National DNA Bank for MND Research, funded by the MND Association and the Wellcome Trust. We thank people with MND and their families for their participation in this project. We acknowledge sample management undertaken by Biobanking Solutions funded by the Medical Research Council at the Centre for Integrated Genomic Medical Research, University of Manchester. The authors acknowledge use of the research computing facility at King’s College London, Rosalind (https://rosalind.kcl.ac.uk), which is delivered in partnership with the National Institute for Health Research (NIHR) Biomedical Research Centres at South London and Maudsley and Guy’s and St. Thomas’ NHS Foundation Trusts, and part-funded by capital equipment grants from the Maudsley Charity (award 980) and Guy’s and St. Thomas’ Charity (TR130505). The authors also acknowledge the use of the CREATE research computing facility at King’s College London (79). We also acknowledge Health Data Research UK, which is funded by the UK Medical Research Council, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Department of Health and Social Care (United Kingdom), Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Health and Social Care Research and Development Division (Welsh Government), Public Health Agency (Northern Ireland), British Heart Foundation and Wellcome Trust. The Authors acknowledge the ERN Euro-NMD for support. PVD holds a senior clinical investigatorship of FWO-Vlaanderen (G077121N) and is supported by the E. von Behring Chair for Neuromuscular and Neurodegenerative Disorders, the ALS Liga België and the KU Leuven funds “Een Hart voor ALS”, “Laeversfonds voor ALS Onderzoek” and the “Valéry Perrier Race against ALS Fund”. Several authors of this publication are member of the European Reference Network for Rare Neuromuscular Diseases (ERN-NMD).
Footnotes
↵$ co-senior authors.