Abstract
Parkinson’s Disease is the second most common neurodegenerative disorder in the United States, and is characterized by a largely irreversible worsening of motor and non-motor symptoms as the disease progresses. A prominent characteristic of the disease is its high heterogeneity in manifestation as well as the progression rate. For sporadic Parkinson’s Disease, which comprises 90% of all diagnoses, the relationship between the patient genome and disease onset or progression subtype remains largely elusive. Machine learning algorithms are increasingly adopted to study the genomics of diseases due to their ability to capture patterns within the vast feature space of the human genome that might be contributing to the phenotype of interest. In our study, we develop two machine learning models that predict the onset as well as the progression subtype of Parkinson’s Disease based on subjects’ germline mutations. Our best models achieved an ROC of 0.77 and 0.61 for disease onset and subtype prediction, respectively. To the best of our knowledge, our models present state-of-the-art prediction performances of PD onset and subtype solely based on the subjects’ germline variants. The genes with high importance in our best-performing models were enriched for several canonical pathways related to signaling, immune system, and protein modifications, all of which have been previously associated with PD symptoms or pathogenesis. These high-importance gene sets provide us with promising candidate genes for future biomedical and clinical research.
1 Background
Parkinson’s Disease (PD) is the second most common neurodegenerative disease next to Alzheimer’s Disease in the United States, with an occurrence of 160 cases per 100,000 in adults age 65 years or older. [1] The main symptoms of PD can be categorized into motor symptoms, including resting tremor, bradykinesia, and rigidity, and non-motor symptoms, including neuropsychiatric decline, sleep decline, and sensory decline. [2] These symptoms are exacerbated as the disease progresses in a largely irreversible manner. There is no cure for PD to date, and the only available treatments are for relieving the symptoms. Since these treatments are more effective when begun at an earlier stage of the disease, [3] an early assessment of risk for PD is critical.
A person’s genomic profile, typically represented as a list of germline variants, is often rich with information regarding their disease risk, and offers a promising path for risk assessment of PD. A recent increase in publicly available genomic datasets from large-scale PD cohorts has resulted in many studies that provide useful insights on the genomic risk factors as well as the disease etiology of PD. [4, 5] However, the currently known genomic risk factors only explain a small fraction of the disease risk, often referred to as “missing heritability”. [6] This poses a need for better risk models. There are several approaches to improve on existing models, including the incorporation of previously unused features and adoption of a higher-complexity model. Recently, Bandres-Ciga et al. achieved the former by performing a systematic, bias-free approach with polygenic-risk-scoring and transcriptomic network analysis to identify certain canonical pathways as risk factors of PD in a large-scale cohort. [4] This work provides us with highly interpretable and actionable results. One limitation of such polygenic risk modeling approach, however, is that it generally does not capture nonlinear relationships between features.
Another prominent research question in the field of PD is its heterogeneity in the type of manifestation and progression rate among patients. [7, 8] Although there have been many attempts at characterizing distinct progression subtypes of PD, [9, 10] a gold standard is yet to be established. A method for assessing a patient’s progression subtype at an early stage of the disease will help clinicians make a more informed decision about a patient’s treatment plan.
Machine learning (ML) has recently been increasingly adopted for studying the genomics of diseases such as cancer, owing to its ability to capture intricate relationships between the numerous components of the genome. [11, 12] The method is particularly advantageous in that it can effectively model nonlinear relationships that might not have been captured by traditional risk modeling approaches. An accurate model to predict either the onset or the progression subtype of PD would prove valuable not only in its predictive capability, but also in providing potential insight into disease etiology by highlighting features that are predictive of the outcome of interest.
Many existing ML studies in PD utilize known clinical, demographic, and genetic risk factors as their model input. [13, 14] Although such models often have satisfactory performance in small-scale cohort studies, this approach fails to explain the “missing heritability” as it cannot find patterns in unknown or uncharacterized features that might be contributing to disease pattern. Therefore, when modeling PD risk based on genomic features, it is crucial to fully utilize whole-exome sequencing (WES) or whole-genome sequencing (WGS) data to develop a model with the best possible performance.
In this study, we utilize WES and WGS data from large-scale cohort studies to train machine learning models for two tasks: PD disease onset prediction and PD progression subtype prediction. We then interpret our best performing models to derive insights that are of clinical and biomedical relevance. To the best of our knowledge, our work presents the state-of-the-art predictive performance of PD onset and progression subtpye solely based on the patients’ germline genomic data.
2 Results
2.1 Data Description
2.1.1 Acquisition Source
We obtained our data from the Accelerating Medicines Partnership: Parkinson’s Disease (AMP-PD) integrated database. AMP-PD is a harmonized collection of multiple cohort studies of PD including the Michael J. Fox Foundation (MJFF) and National Institutes of Neurological Disorders and Stroke (NINDS) BioFIND study, the NINDS Parkinson’s disease Biomarkers Program (PDBP), and the MJFF Parkinson’s Progression Markers Initiative (PPMI). The database includes genomic sequences, longitudinal clinical test results, and biospecimen data from PD patients as well as healthy controls from multiple, multi-institute studies.
2.1.2 Target Labels
Disease Onset Prediction Target
The AMP-PD database provides binary labels for PD patients (cases) and healthy individuals (controls). We discarded observations from subjects whose case/control label changes overtime for simplicity. We utilized WGS data for this task as it was the type of sequencing data most widely available in this cohort. Among the AMP-PD subjects that had germline sequencing data available, the disease onset label was available for 2800 subjects. We refer to this group of subjects as our Onset Cohort.
Progression Subtype Prediction Target
As there are no agreed-upon progression subtypes for PD, we adopted the progression subtypes characterized in Zhang et al.. [13] This study used a deep learning algorithm, Long-Short Term Memory, to represent each patient’s disease progression as a multi-dimensional time-series data from motor and non-motor clinical exams commonly used to measure the severity of PD. This representation was then used to categorize the progression patterns into three distinct subtypes with different manifestations as well as progression rate. The patients’ genomic data was not used in this study. Subtype I was characterized by moderate functional decline in motor ability but a relatively mild cognitive decay, while Subtype II was characterized by mild functional decay in both motor and non-motor abilities, and Subtype III was characterized by a rapid progression in both motor and non-motor symptoms. We utilized WES data for this task as this was the type of sequencing data most widely available in subjects with subtype labels. This cohort consisted of 351 patients, all of whom belong to the PPMI cohort study. We will refer to this group of subjects as the Subtype Cohort.
2.1.3 Input Data
Our input data comprises of the number of likely-disruptive variants per gene for each subject. (See Methods for a detailed description of the processing approach.) The variant count distributions of our two cohorts are summarized in Fig. 1. Since WGS sequences a larger proportion of the genome compared to WES, this is likely the reason why the Onset Cohort appears to have a generally higher number of variants compared to the Subtype Cohort. This disparity is not an issue in our study as we train entirely separate models for our two cohorts.
2.2 Cohort Summary
The case-control composition of our two cohorts are summarized in Table. 1. The demographic information of our two cohorts are summarized in Table. 2. For both our Onset Cohort and Subtype Cohort, the age is recorded at the beginning of the longitudinal study.
Onset Cohort
For our Onset Cohort, there was a statistically significant difference in age between the case and control groups (two-sample Student’s t-test p=5e-21). This is likely due to the fact that, since age is a strong risk factor in PD, older subjects are recruited for the case group. Since our predictor variables are germline mutations and are not affected by age, this difference should not confound our models. The gender composition in the case/control groups also showed a statistically significant imbalance (Fisher’s exact test p=1.5e-15). The larger proportion of male subjects is somewhat representative of the overall diagnosed gender ratio of PD. [15] The class imbalance is corrected for during model training by inversely weighting the observations with their class counts in the loss function.
Subtype Cohort
We observed statistically significant differences in age between our subtype groups (one-way ANOVA p=2.6e-10), although it is important to note that subject age was taken into consideration when defining the three subtypes in Zhang et al.. Gender shows a roughly consistent 2:1 ratio in all subtypes, which is a sharper skew than overall gender ratio in the diagnoses of PD. [15]
2.3 Model Performance
The classification performance of our trained models is summarized in Table. 3. The metrics all indicate performane on the held-out test set, and is macro-averaged for the multiclass classification in the Subtype Cohort. For both the disease onset prediction task and the subtype prediction task, Random Forest and Support Vector Machine algorithms had the overall best ROC-AUC as well as balanced accuracy. The best performance for onset prediction achieved and ROC-AUC of 0.7699 for the onset prediction with SVM, and 0.6090 for the subtype prediction with RF.
2.4 Model Interpretation
The Random Forest model provides us with interpretable variable importance measures, calculated as the total reduction of sum of squared errors across all splits at which a given variable is used. Since RF was among the best performing algorithms for both onset prediction and subtype prediction tasks, we interpret our results by selecting the top 500 features with the highest importance of the RF models for the two tasks respectively. First, we split up any feature that contains more than one gene; some features contain more than one gene due to the fact that some genes overlap, and therefore some variants can be hosted by more than one gene. After this splitting, we are left with 520 important genes for the onset prediction task, and 547 important genes for the subtype prediction task. The overlap between these two gene sets are shown in Fig. 2. We then performed the Gene Set Enrichment Analysis https://www.gsea-msigdb.org/gsea/index.jsp) to these gene sets to see whether any biological pathways are enriched in these sets. The resulting pathways and their relative importance to the classification model are shown in Table. 4.
Pathways with the highest total importance for both the onset prediction and subtype prediction tasks included immune-related pathways such as innate immune system and neutrophil degranulation pathways. Other common hits with high importance included signaling pathways such as olfactory signaling and GPCR signaling, as well as pathways related to diseases associated with O-glycosylation.
Pathways that were enriched only in the onset-related gene set included ones related to nucleotide metabolism such as nucleotide catabolism and purine salvage pathways, and other immune-related hits such as Dectin-2 family and antigen presentation pathways. The O-glycan synthesis pathway was also a unique hit for the Onset Genes. Pathways that were enriched only in the Subtype Genes included additional immune-related pathways, namely interferon-signaling and endosomal/vacuolar pathways.
3 Discussions
In this study, we trained machine learning models for two tasks, PD onset prediction and PD subtype prediction, using the number of deleterious variants per gene as the input features. The onset prediction task achieved a satisfactory performance of 0.77 ROC, and the subtype prediction task achieved a moderate performance of 0.60 macro-averaged
ROC, indicating that there seems to be some relationship between a person’s deleterious germline variants and their PD risk and prognosis. We interpreted our results by evaluating the variables of high importance in our top performing models.
For both of our tasks, Random Forest was one of the top-performing algorithms, with an ROC of 0.77 and 0.60 for the onset and subtype prediction, respectively. We derived high-importance gene sets for each task, and performed Gene Set Enrichment Analysis to obtain canonical biological pathways that were enriched in our high-importance gene sets.
One of the most notable patterns of our enrichment analysis result was the high number of immune-related pathway hits, some common between the Onset Genes and Subtype Genes, while others unique to one gene set. The adaptive immune system, innate immune system, and neutrophil degranulation were all pathways that were also defined in Bandres-Ciga et al. as pathways indicative of PD risk. Although some studies claim it is yet unknown whether the immune response is directly involved in the etiology of PD or simply a consequence of other PD processes such as α-synucleinopathy, [16] our results, together with others, suggests that the immune pathways likely play a role in the etiology of PD. Some ways in which the immune pathway could be related to PD etiology include an autoimmune response potentially responsible for dopaminergic neuronal death through neuroinflammation and degeneration. [17]
Interestingly, Signaling by GPCR pathway had a high contribution to both the onset and subtype classification tasks, which is again consistent with the results of Bandres-Ciga et al.. Most all genes that contributed to the Olfactory Signaling pathway as a hit were also a part of the GPCR pathway. This is a fascinating result, considering olfactory dysfucntion has long been known as a prodromal symptom of PD and have been used in its diagnosis. [18] In spite of being one of the earliest symptoms of PD, it is unknown whether olfactory decline is merely a symptom or is involved in disease etiology. One potential explanation behind olfactory decline is the accumulation of Lewy bodies in the olfactory bulb and its surrounding areas. Lewy pathology in the olfactory bulb is highly specific to PD, and its severity is known to correlate well with motor-symptoms. [19] It is hypothesized by some that the olfactory bulb is the induction site of Lewy pathology, which then spreads to the brain via the brainstem. [20] Taken together with our results, this highlights the possibility that certain germline mutations of genes related to olfactory function could have some causative relationship with disease course.
Hits in pathways related to protein glycation and glycosylation were also intriguing, considering the pathogenesis of PD. Protein glycation is widely known for its deleterious effects in aging [21]. Protein glycosylation affects protein structure and stability, and elevated levels of protein glycosylation have been associated with neurodegeneration for its effect on impairing clearance of dysfunctional proteins. [22] Several studies have established connections between PD and glycosylation as well as glycation. [23, 24]. Our results, taken together with these previous findings, imply that protein glycation and glycosylation could be involved in the development and progression of PD.
Another interesting aspect of the Gene Set Enrichment Analysis is that, although there was only 23-24% overlap between the two gene sets, the two gene sets shared many of the enriched pathways. It is possible that, although similar biologicaly pathways are involved in disease etiology, different genes, as components of pathways, contribute to separate aspects of the disease such as onset or subtype. Future studies might benefit from taking biological pathways into consideration when training predictive models.
Although our study provides a state-of-the-art PD prediction models based on germline variants, it still suffers from several limitations. The biggest limitation is the available sample size, particularly for the progression subtype prediction task. The sample size was limited to what was published in Zhang et al. due to the lack of standard definitions of progression subtypes in PD. The particular underrepresentation of Subtype II in our Subtype Cohort must also be taken into consideration when interpreting our models. However, the high performance of our Onset Prediction task is a promising implication that there indeed is some relationship behind PD disease course and a patient’s germline variants. Future studies would benefit from clearly defined progression subtype information in an integrated cohort study with a larger sample size such as AMP-PD. Another limitation of our current study is the lack of racial diversity in the AMP-PD cohorts. Many studies have shown that race plays a large role in PD risk, and therefore it is imperative to recruit a diverse cohort in future studies.
Overall, our study results provide insights that are of relevance to future biomedical and clinical research. Our high-importance gene sets call for further controlled laboratory experiments as well as clinical cohort studies to further confirm their role in the risk of PD development or defining its disease course. This could lead us not only to a better understanding of PD pathogenesis, but also to stronger clinical insights such as PD risk screening and patient-specific prognoses.
4 Conclusions
In this study, we developed predictive models for Parkinson’s Disease onset as well as progression subtypes using patients’ WGS and WES data, specifically using the counts of deleterious germline variants in each gene. We obtained gene sets that were top-importance in predicting the onset or subtype of PD, and interpreted the enriched pathways to draw connections to what is known about PD etiology. These gene sets provide us with promising candidates for future biomedical and clinical research.
5 Methods
5.1 Data processing and feature engineering
The files obtained from the AMP-PD and PPMI databases had been processed into Variant Call Format (VCF) files using the Genome Annotation Toolkit (GATK). [25] We first annotated the VCF files using the ANNOVAR toolkit, which provides attributes for each variant with information such as hosting gene, the variant’s functional effect, and its functional region. [26] We chose the well-established refGene database for this annotation. We then filtered out any variant with a low quality score (Phred-scaled base quality score<30, GQ<10) or low sequencing depth (DP<10). We only selected variants that are highly likely to be deleterious based on the variant’s functional effect, by selecting variants that are either frameshift insertions/deletions, splicing variants, stop-gain, or stop-loss variants. We did not perform any feature elimination based on known genomic features of PD to keep an unbiased approach. Finally, we generated a variant count matrix by counting the number of deleterious variants per gene in a given subject. Any feature (gene) in the resulting matrix that was either extremely common or rare (present in <5% or <95% of subjects) was eliminated as likely not contributing to disease pattern.
5.2 Model Training
We split our Onset Cohort and Subtype Cohort data into a training and held-out test set in a stratified manner to maintain the class composition in each set. We performed a 8:2 split on the Onset Cohort data and 9:1 split on the Subtype Cohort. We set the training set to be larger in the Subtype Cohort due to its small sample size. We trained seven different machine learning algorithms for our two tasks, respectively: logistic regression with L2 regularization (LR), LASSO, ElasticNet, Support Vector Machine (SVM), Random Forest Classifier (RF), Gradient Boosting Tree (GB), and XGBoost (XGB). We perform the a 5-fold cross-validation on the training set to tune the hyperparameters for our list of algorithms. Model training and evaluation were performed using a Python environment with Anaconda3, specifically with Scikit-learn 0.23.2 and the Python API of XGBoost 1.2.0. We account for class imbalance during model training by weighting the observations inversely proportional to their class count.
5.3 Model Evaluation and Interpretation
We evaluate our trained models by making predictions on our held-out test set. We calculate a number of evaluation metrics including balanced accuracy, area under the curve of receiver operating characteristic (ROC-AUC), precision, recall, and f1 score. Since there is class imbalance in our datasets, we macro-average our metrics across classes.
We then derive a set of 500 genes that had the highest importance in our best-performing models for the onset prediction task and the subtype prediction task, respectively. We ran Gene Set Enrichment Analysis to find out which biological pathways were enriched in our gene sets. We narrowed our search to canonical pathways characterized in the Reactome Pathway Database, and set a threshold of 0.05 for the FDR-adjusted p-value.
For each pathway that is significantly enriched in our gene sets, we evaluate the overall importance of the pathway by summing the importance scores of the genes that comprised that pathway.
The details of data processing and analysis have been made available at https://github.com/sayadennis/amppd as well as https://github.com/sayadennis/ppmi.
Data Availability
All data used in this study are available from Accelerating Medicines Partnership: Parkinson's Disease (https://amp-pd.org/) and the Parkinson's Progression Marker's Initiative (https://www.ppmi-info.org/).