Abstract
Parkinson’s disease (PD) is a complex neurodegenerative disease with a range of causes and clinical presentations. Over 76 genetic loci (comprising 90 SNPs) have been associated with PD by the most recent GWAS meta-analysis. Most of these PD-associated variants are located in non-coding regions of the genome and it is difficult to understand what they are doing and how they contribute to the aetiology of PD. We hypothesised that PD-associated genetic variants modulate disease risk through tissue-specific expression quantitative trait loci (eQTL) effects. We developed and validated a machine learning approach that integrated tissue-specific eQTL data on known PD-associated genetic variants with PD case and control genotypes from the Wellcome Trust Case Control Consortium, the UK Biobank, and NeuroX. In so doing, our analysis ranked the tissue-specific transcription effects for PD-associated genetic variants and estimated their relative contributions to PD risk. We identified roles for SNPs that are connected with INPP5P, CNTN1, GBA and SNCA in PD. Ranking the variants and tissue-specific eQTL effects contributing most to the machine learning model suggested a key role in the risk of developing PD for two variants (rs7617877 and rs6808178) and eQTL associated transcriptional changes of EAF1-AS1 within the heart atrial appendage. Similarly, effects associated with eQTLs located within the brain cerebellum were also recognized to confer major PD risk. These findings warrant further mechanistic investigations to determine if these transcriptional changes could act as early contributors to PD risk and disease development.
Introduction
Parkinson’s disease (PD) is a complex neurodegenerative disease with a range of causes and clinical presentations. The diagnosis of PD is based on the presence of the cardinal motor symptoms (bradykinesia; muscular rigidity; 4-6 Hz resting tremor; postural instability)1. Genome wide association studies (GWAS) have identified human genetic variants that are associated with the risk of developing PD2,3. There are 290 PD-associated GWAS SNPs listed in the GWAS catalog. In the most recent PD GWAS meta-analysis, Nalls et al. identified 90 independent single nucleotide polymorphisms (SNPs) that are significantly associated with PD risk2. However, it is difficult to understand how these variants confer PD risk because the majority of the PD SNPs are located in non-coding regions of the genome4–6.
Non-coding SNPs have been shown to be enriched at regulatory loci and can act as expression quantitative trait loci (eQTLs)7–11. eQTLs typically explain a fraction of the variation in mRNA expression levels for target genes, either in cis (<1Mb apart in the linear sequence) or trans (>1Mb apart or located on a different chromosome). Regulatory variants (i.e. eQTLs) can impact different genes in different tissues, making it challenging to determine how SNPs convey risk for a phenotype. Determining the relative contributions of the eQTLs to the risk of developing a disease would help identify the eQTL-gene-tissue combinations that convey the risk associated with the variant. We have demonstrated that the three-dimensional structure of the genome can be used to help identify eQTL-gene pairs and thus the biological pathways that putatively contribute to disease etiology12,13. Yet, approaches that calculate relative estimates of the tissue specific contributions that SNPs make to disease development remain elusive.
We reasoned that if PD-associated SNPs contribute to disease development through gene regulatory effects, then the tissue-specificity of these eQTLs may be an important consideration for the aetiology of the disease14,15. Therefore, we developed a machine-learning predictor model for PD disease status that utilises and selects SNPs (without eQTLs in GTEx) and tissue-specific eQTL data, for case and control cohorts, to reveal the tissue-specific regulatory effects that are associated with PD risk. Briefly, we used a matrix of: 1) PD-associated SNPs that act as eQTLs, 2) the genes regulated by these eQTLs; 3) the tissues in which the eQTL effects were observed; and 4) SNPs that do not have eQTLs in GTEx to build a logistic predictor that was validated using genotype data from three independent studies3,16,17. The logistic predictor model that had the highest PD predictive ability, was trained and selected using the Wellcome Trust Case Control Consortium (WTCCC) cohort. The predictor model was then validated using two datasets derived from the UK Biobank17 and NeuroX-dbGap16. Our predictor ranked the relative contributions of six non-eQTL PD SNPs, and additional eQTLs that modulated gene regulation specifically within the heart atrial appendage as making the largest contributions to PD risk development.
Methods
Workflow for developing the PD predictor
We developed a machine learning approach that incorporates feature selection and cross validation, to calculate the additive tissue-specific contributions of spatial eQTLs within genotypes from individuals who developed PD (Figure 1).
Generation of tissue specific PD eQTL reference table
GWAS SNPs associated with PD (n=290, p-value <1.0 × 10−5; Supplementary Table 1) were obtained from the GWAS catalogue (www.ebi.ac.uk/gwas, downloaded 27th August 2020). This SNP set included young adult-onset Parkinsonism SNPs18 and the 90 SNPs identified by the most recent meta-analysis by Nalls et al.2.
The PD-associated SNPs (Supplementary Table 1) were analysed using the Contextualize Developmental SNPs in 3-Dimensions (CoDeS3D) algorithm6, with the beta effect calculation option, to identify: a) the genes that physically interact with the PD-associated SNPs; and b) which of these SNP-gene interactions are eQTLs. Physical interactions between PD-associated SNPs and genes were identified using Hi-C chromatin contact libraries (Supplementary Table 2) captured from:
Cell lines from primary human tissues (e.g. brain, skin and spinal cord)
Immortalised cell lines that represent the embryonic germ layers (i.e. HUVEC, NHEK, HeLa, HMEC, IMR90, KBM7, K562, and GM12878)
These libraries were chosen to represent tissues with the largest number of possible interactions that occur in the human system.
The potential regulatory effects (normalized effect size [NES]) of the spatial connections were mapped by leveraging the eQTL information from 49 human tissues (Genotype-Tissue Expression database [GTEx] v8; www.gtexportal.org). eQTL significance levels were adjusted for multiple testing [Benjamini–Hochberg FDR]19 and considered significant if q<0.05.
WTCCC cohort cleaning and genotype imputation
The PD genotype dataset was acquired from the WTCCC (Request ID 10584). The WTCCC PD genotype dataset, generated using Illumina microarrays, contained one case cohort (2197 individual samples) and two control cohorts (58C: 2930 individual samples and NBS: 2737 individual samples). Thus, the total number of control samples (n=5667) was more than double the number of cases. It is likely that the use of an imbalanced training dataset would create a biased disease status predictor. Therefore, only the control samples for the 58C 1958, British Birth cohort were used in this study.
SNPs and individual samples that were of poor quality and were recommended for study exclusion by the WTCCC were removed (Supplementary Table 3). SNPs within individual genotypes were converted to dbSNP rsIDs and genomic positions mapped (GRCh37, hg19) by Python scripts. PLINK (v1.90b6.2, 64-bit)20 was used for quality control. Genotypes were cleaned using the Method-of-moments F coefficient estimate to remove case homozygosity outliers (F values < -0.02 or 0.02 < F values) and the control outliers (F values < -0.016 or 0.19 < F values). Related individuals were identified and removed using proportion IBD (PI_HAT > 0.08). Ancestry outliers (identified by principal component analysis [PCA] plotting), individuals with sex genotype errors (identified by PLINK), or individuals with missing genotype data (missing rate > 5%) were also removed. Finally, SNPs that were significantly outside of Hardy-Weinberg Equilibrium (p < 10−6) or had a minor allele frequency < 1% were also removed.
The WTCCC PD case and control genotype data were obtained using two different Illumina microarrays (Human670-QuadCustom and Human1-2M-DuoCustom_v1_A)3. Therefore, we only used the 526,576 SNPs that were present in both microarrays for imputation. SNP data imputation was performed to recover a total of 27,590,399 SNPs using the Sanger imputation service (https://imputation.sanger.ac.uk), EAGLE+PBWT pipeline21,22, and Haplotype Reference Consortium(r1.1)23. Imputation was performed according to the default instructions (https://imputation.sanger.ac.uk/?instructions=1). Following imputation, PLINK was used to update the genotype data with rsIDs and remove SNPs with an: impute2 score < 0.3; missing data rate > 5%; or those that were not in Hardy-Weinberg Equilibrium (p < 10−6). The genotypes for 281 of the 290 PD SNPs used in this study (Supplementary Table 1) were extracted from the imputed PD genotype data.
Creation of a weighted WTCCC PD genotype eQTL matrix
We created a matrix that combined individual genotypes with the eQTL effects for the PD-associated SNPs. There were three groups of data fields in the PD genotype eQTL table:
Individual sample information (sex, and disease status)
Individual sample PD-associated SNP genotype (SNP minor allele count) weighted by GTEx tissue-specific eQTL normalised effect sizes
Individual PD-associated SNP genotype for the SNPs with no eQTL effect information
The tissue-specific eQTL normalised effect size (NES) for the PD-associated SNPs were extracted from the GTEx eQTL summary table of significant eQTLs (Supplementary Table 4). The NES for each tissue-specific eQTL was weighted by the number of alternative alleles (0, 1 or 2) at the eQTL SNP position in each individual’s genome. 54 of the 290 PD-associated SNPs had no identifiable eQTL effects (Supplementary Table 5) and were input into the model unweighted, using solely SNP allele count from the imputed genotype.
We created two regularised logistic regression models (see below): for model-1, we created a weighted WTCCC PD genotype eQTL matrix for all 290 SNPs that were imputed and represented in the PD eQTL reference table. By contrast, for model-2 we created a weighted WTCCC PD genotype eQTL matrix for the subset of 90 SNPs from the PRS identified in Nalls et al.2.
Generation, training, and validation of the regularised logistic regression models (model-1 and model-2)
We developed a regularised logistic regression predictor that incorporated a: 1) Mann-Whitney U tests in combination with Benjamini-Yekutieli procedure for controlling False Discovery Rate (FDR) to identify relevant information; and 2) multivariate prediction step that considers all features in context and removes redundant information, to identify the best combination of features for prediction of PD. Regularised logistic regression was incorporated into the models to enable the identification of features that contribute most to the final score.
The weighted WTCCC PD genotype eQTL matrix that contained all the case and control genotypes that passed quality control (4366 individual samples: 1698 cases and 2668 controls) was used to train model-1 (using all 290 SNPs), or model-2 (using the 90 Nalls et al. SNPs).
The Mann-Whitney U test (tsfresh version 0.16.0)24 was used to select the individual feature columns within the weighted WTCCC PD genotype eQTL matrix from the full training dataset that were the most relevant attributes for predicting PD status (i.e. the relevant attribute subset; FDR = 0.05)25. The relevant attribute subset was then used to train a multivariate logistic regression model (Scikit-learn version 0.23.2)26 implemented with elastic net regularisation using the SAGA solver to predict PD disease status. The machine learning elastic net regularisation prevented overfitting the predictor model by further sub-selecting the essential features for delivering the best prediction.
Training was optimised (measured by area under the receiver operating characteristic curve [AUC])27 using a Scikit-learn Grid Search algorithm26,28 with 10-fold cross-validation setting to select the optimised predictor model hyperparameters from the training stage (90% of cohort used for training, 10% used for cross validation). The optimised predictor hyperparameters for model-1 were: C=0.5, l1_ratio=0.6, max_iter=800, penalty=‘elasticnet’, random_state=1, solver=‘saga’ from the search space of following:
‘C’: 0.01, 0.05, 0.1, 0.5,1, 10, 20, 30,
‘max_iter’: 200, 500, 800, 1000, 1200, 1400, 1500, 1600,
‘l1_ratio’: 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1.
The optimised predictor hyperparameters for model-2 were: C=0.6, l1_ratio=0.1, max_iter=130, penalty=‘elasticnet’, random_state=1, solver=‘saga’ from the search of following:
‘C’: 0.0001, 0.001, 0.01, 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8, 1, 3,
‘max_iter’: 1, 5, 70, 100, 130, 150, 170, 180, 200, 300, 500, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 3000,
‘l1_ratio’: 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1.
The search space of models-1 and 2 did not include l1_ratio = 0 for excluding L2 regularisation and implementing feature selection. To calculate the variation in AUCs of the models with the optimised parameters we undertook 5 repeats of 10-fold cross-validation of model generation and validation by the Scikit-learn RepeatedKFold algorithm26. The 10-fold cross-validation started with the random generation of 10 equal parts from the full dataset. Nine parts of the data were used for training, and the remaining data were for validation. Mann Whitney U test filtering controlled by FDR = 0.05 was applied to the training set. Subsequently, the filtered training data were modelled by the multiple regularised logistic regression algorithm with the optimised predictor hyperparameters of (model-1 or model-2). This process was repeated until all parts of the data were used for validation. The result of this process is the final PD predictor for each model, with in-sample (training data) predictive performance as assessed by AUC.
Calculation of tissue-specific contributions to PD risk
The 50 PD regularised logistic regression predictors created from the 5 repeats of 10-fold cross-validation above were used to test the predictive power of the model created with the optimised predictor hyperparameters from the tissue-specific eQTL effects. Tissue-specific contributions to the PD risk were extracted from each of the 50 PD regularised logistic regression predictors as the sum of the absolute values of the model weights associated with each tissue.
Validation of model-1 and model-2
The generalising PD predictive power of models-1 and -2 was validated by testing on two independent test datasets derived from the UK Biobank and NeuroX-dbGap genotype data.
As the UK Biobank only has a small number of PD case samples, we created 30 different test cohorts of individual samples (without missing data for those SNPs included in model-1 or 2) using the same PD cases with 30 independently and randomly chosen non-PD diagnosed controls. Each cohort was used to create a weighted eQTL-genotype matrix for testing the predictive power and evaluating by AUC. The mean AUC of the 30 predictive tests was used as the validation result of the models.
NeuroX-dbGap was the largest PD single array study2,16,29, and we selected all the PD case and control samples of NeuroX-dbGap (without missing data for those SNPs included in model-1 or 2) to build a weighted eQTL-genotype matrix for validating the PD predictive power of each model.
Model-1 was validated using the following two independent datasets: 1) 30 cohorts of 2384 individual samples (928 cases and 1456 controls) derived from the UK Biobank; and 2) the 5,224 cases and 5,563 controls from NeuroX-dbGap (the largest PD single array study)2,16,29.
Model-2 was also validated by the same two independent datasets: 1) the 30 cohorts of 3812 individual samples (1484 cases and 2319 controls) derived from the UK Biobank; and 2) the cases and control from the NeuroX-dbGap dataset (5,224 cases and 5,563 controls) 2,16,29. Note the number of cases and controls in the UK Biobank differed as there were fewer cases excluded due to missing SNP data (see below).
UK Biobank cohort definition and genotype imputation
Genotypes (case and control) that were used from the UK Biobank were selected as follows. European Caucasian samples identified by genetic clustering methods were selected and imputed (487,411 individual samples). The genomic relatedness analysis excluded SNPs that the UK Biobank recommended were removed from the selected case and control data.
The cases (model-1: 928 cases or model-2:1484 cases) were selected using the following criteria:
PD patient identified by the UK Biobank developed algorithm (field 42033)
PD patient identified by hospital records G20
PD patient had no missing data for any SNPs within the predictor model (model-1 or model-2). The greater number of SNPs used in model-1 meant that more cases were excluded due to missing data.
Control genotypes, not having records of Parkinsonism and without missing data for any of the SNPs included in the final predictor, were randomly selected from the healthy controls within the UK Biobank data for each of the 30 test cohorts. As model 2 had more cases, more controls were also included so as to match the ratio of case:control.
Genotype data of the UK Biobank case and control samples in each test cohort were used to build a weighted eQTL-genotype matrix for testing model-1 or 2 to recognise the disease status of individual samples correctly.
NeuroX-dbGap cohort definition and genotype imputation
Genotypes were also obtained from the NeuroX-dbGap dataset. Genotypes were cleaned by removing all insertion and deletion variants. SNP IDs were converted to dbSNP rsIDs. Variants in chromosome 24(Y), 25(XY) and 26(MT) that are not included in the study due to the inconsistency with the Sanger imputed SNP data were also removed. Ancestry outliers (identified by principal component analysis [PCA] plotting), individuals with sex genotype errors (identified by PLINK), or individuals with missing genotype data (missing rate > 5%) were also removed. Finally, SNPs that were not in Hardy-Weinberg Equilibrium (p < 10−5) or had a minor allele frequency < 1% were removed. All the variants in the final model (model-1 or model-2) which were not present in the NeuroX-dbGap data were replaced with proxy SNPs using linkage disequilibrium information (r2> 0.5)29 calculated by PLINK from European 1000 genome genotype data (https://www.internationalgenome.org/about)30. The European 1000 genome genotype data were downloaded from (https://ctg.cncr.nl/software/magma)31 on 10th August 2020.
Mann-Whitney U test filtering on 290 PD and 313 T1D SNPs derived eQTL matrix
We generated a GTEx eQTL summary table of significant eQTLs (Supplementary Table 7) using 313 Type 1 Diabetes (T1D) associated SNPs (Supplementary Table 6). We mixed the 290 PD and 313 T1D SNPs to create a weighted WTCCC PD and T1D genotype eQTL matrix, as outlined above. Mann-Whitney U test filtering (controlled by FDR = 0.05) was applied to the weighted WTCCC PD and T1D genotype eQTL matrix to determine the filtering power for removing non-related PD features.
Data analysis
All statistical tests were performed with Scikit-learn (version 0.23.2)26, and tsfresh (version 0.16.0)24
Code Availability
The CoDeS3D pipeline is available at: https://github.com/Genome3d/codes3d-v2.
The Python scripts and machine learning code used in this analysis are available at: https://github.com/Genome3d/PD_lg_predictor_analysis
Python version 3.7.3 was used for all the Python scripts.
Data availability
The data that support the findings of this study are available were derived from the following resources available in the public domain:
The SNPs are listed in Supplementary table 1.
The HiC datasets are listed in Supplementary table 2.
eQTL information from 49 human tissues were obtained from the Genotype-Tissue Expression database [GTEx] v8; www.gtexportal.org
PD genotype datasets were acquired from the: Wellcome Trust Case Control Consortium (Request ID 10584); UKBioBank (Application Number 61507); NeuroX-dbGap (dbGap project#98581-1)
Results
PD-associated SNPs are tissue specific eQTLs for 1,334 eGenes
We hypothesised that PD SNPs modulate disease risk through tissue-specific eQTL effects (i.e. eQTL-eGene)14,15. We analysed 290 PD-associated GWAS SNPs (Supplementary Table 8) for spatial eQTL interactions7,32,33 across 49 GTEx tissues14. 231 of the 290 (79.7%) PD SNPs tested were involved in 18,041 tissue-specific eQTL associations (Benjamini– Hochberg FDR < 0.0519; Supplementary Table 4), regulating 1,334 eGenes across the 49 GTEx tissues. Gene ontology analysis (David Functional Annotation)34 identified that the regulated genes were significantly enriched for intracellular signal transduction, antigen processing and presentation of peptides, among other pathways (Supplementary Table 9).
Modelling genotype data to identify the genetic risk associated with tissue-specific eQTL effects for PD disease status
Understanding the impacts and complex networks associated with eQTLs is challenging. We hypothesised that regularised logistic regression models could be used to identify and rank the tissue-specific eQTLs that were significant contributors to PD risk.
We integrated the CoDeS3D eQTL analysis of the 290 PD SNPs with the genotype data for individuals within the WTCCC35 PD cohort (4366 individual samples: 1698 cases and 2668 controls; methods)3. Of the 290 PD SNPs, 281 SNPs were present in the WTCCC data. This resulted in the generation of a PD-SNP derived weighted WTCCC PD genotype eQTL effect matrix containing 17,829 tissue-specific eQTL-eGene pairs (227 SNPs, 1310 eGenes, 49 tissues) and 54 (of the 281) SNPs that had no known eQTL effects following our CoDeS3D analysis. Uninformative features for PD prediction were removed using a Mann-Whitney U test36 (FDR < 0.05) (Methods). After filtering, 11,288 PD SNP derived features (53 SNPs, 245 eGenes, 49 tissues) remained within the relevant attribute subset of the weighted WTCCC PD genotype eQTL effect matrix.
To test the effectiveness of the Mann-Whitney U test filter, we generated a PD and type 1 diabetes (T1D) SNP derived eQTL effect matrix using a mixed set of 290 PD and 313 T1D-associated SNPs and integrating with the WTCCC PD cohort genotypes (Supplementary Table 6). The PD + T1D SNP derived tissue-specific eQTL effect matrix included 25,052 SNP related data fields (556 SNPs, 1927 eGenes, 49 tissues). After the Mann-Whitney U test filtering (FDR < 0.05), 11,147 of the data fields (45 SNPs, 209 eGenes, 49 tissues) were selected using PD as the phenotypic outcome. Only one of the 313 (0.32%) T1D-associated SNP, rs1052553, remained following the Mann-Whitney U test filtering. Although rs1052553 has not previously been associated with PD in GWA studies, it has been implicated in PD as part of a PD risk haplotype37,38. Therefore, these results confirm that the Mann-Whitney U test filters uninformative data while preserving valuable PD information for our modelling.
We created regularised logistic regression models for PD risk using the Mann-Whitney U test filtered PD variant derived eQTL effect matrix (11,288 PD-SNP derived features [53 SNPs, 245 eGenes, 49 tissues]). The AUCs of the 50 PD regularised logistic regression predictors had a mean of 0.565 (distributed from 0.516 to 0.637) and a standard deviation of 0.024 (generated with the optimised predictor model hyperparameters by 5 repeats of 10-fold cross validation). The final PD predictor model (model-1) was trained using the entire WTCCC PD cohort. After the Mann-Whitney U test filtered WTCCC PD variant derived eQTL effect matrix contained 17,829 variant derived features. Model-1 selected 827 tissue-specific eQTLs and 6 SNPs with no eQTL effect (Supplementary Table 10). Model-1 had an enhanced diagnostic ability as represented by an AUC of 0.627 obtained using the training data.
We validated the predictive power of model-1 using two independent PD cohorts (UK Biobank17 (30 datasets of 923 cases and 1456 controls) and NeuroX-dbGap16,29). Model-1 was validated in both cohorts, producing mean AUCs of 0.572 and 0.571 in the UK BioBank and NeuroX-dbGap cohorts, respectively. These two validation results are highly consistent and within the range of the model AUCs (0.516 to 0.637) estimated by the 50 optimised logistic regression predictor models.
eQTLs specific to the heart atrial appendage contribute to genetic risk in PD
We used the magnitude of the model weights (coefficients) for the genetic features, grouped by tissue-specificity of the effects, in the logistic regression model-1 as proxies for the contribution of the features to PD risk.
Six SNPs that had no identified eQTL effects (from CoDeS3D analysis of GTEx) made the most significant group contribution (18% of the total model weight) to the risk of PD development (Table 1; Figure 2). The six non-eQTL SNPs are: rs117896735, rs144210190, rs35749011, rs12726330, rs356220 and rs5019538 (Table 1). Note that the GTEx study12 removed rs356220 and rs5019538 from the tissue-specific eQTL data as part of their QC processing. Therefore, we were unable to test if rs356220 and rs5019538 were eQTLs. rs117896735 also has no eQTL effect information found in GTEx database. The other three SNPs (rs144210190, rs35749011 and rs12726330) were not detected by CoDeS3D to have spatial eQTL and eGene interactions within the Hi-C libraries used in this study.
The next most significant contributions to the risk of PD development involved eQTLs that affected the heart atrial appendage (9%) and brain cerebellum (4%; Figure 2). The substantia nigra is viewed as a central brain region in PD yet eQTL gene regulation specific to the substantia nigra contributed ∼1.5% of the risk of PD development. We repeated the calculation of the tissue-specific contribution ranking using data from the 50 optimised predictor models, generated with model-1’s hyperparameters by 5 repeats of 10-fold cross validation (randomizing the full Mann-Whitney U test filtered PD variant derived eQTL effect matrix),. Again, SNPs lacking known eQTL effects, heart atrial appendage, and brain cerebellum were identified as the top three genetic contributors to the risk of PD development (Figure 3).
Fifteen eQTLs contributed to the heart atrial appendages contribution to the risk of developing PD measured in model-1 (Table 1). Notably, the two biggest eQTL contributors, rs7617877 and rs6808178, each accounted for approximately 3% to the total model weight. rs7617877 and rs6808178 are in high linkage disequilibrium (R2 = 0.86)39 within European populations. rs7617877 and rs6808178 do not show detectable spatial regulatory associations with their nearest genes and instead both act as eQTLs for a gene > 13Mb downstream, EAF1-AS1, in the heart atrial appendage. EAF1-AS1 is a long antisense non-coding RNA gene transcribed in antisense to EAF1, that undergoes an isoform switch, and has a significantly different transcript usage in the brains of patients with Parkinson’s disease40. Interestingly, rs6808178 also acts as a heart atrial appendage eQTL for TMEM161B-AS1 (Table 1), which has also been implicated in neurodegeneration41.
Creating a PD logistic regression predictor model using the 90 SNPs from the PRS calculated by Nalls et al
In the latest PD GWAS meta-analysis, Nalls et al. identified 90 SNPs that contribute to a PRS model for PD risk2. We therefore sought to understand the PD risk contribution that was specific to these 90 SNPs and created a logistic regression predictor model using only this subset. 88 of the 90 variants passed quality control (post-imputation data cleaning and quality checking). The 88 SNPs were integrated with the WTCCC PD genotype data to create a PD SNP derived eQTL effect matrix of WTCCC individual samples (4366 individual samples: 1698 cases and 2668 controls). The PD SNP derived eQTL effect matrix contained 3,206 features consisting of related tissue-specific eQTL-eGene pairs (76 SNPs, 518 genes, 49 tissue types) and 12 SNPs that lacked CoDeS3D detectable eQTL effects. Mann-Whitney U test filtering (FDR < 0.05) left 920 features (12 SNPs, 95 genes, 49 tissue types) that were used in the subsequent logistic regression modelling26. Model training was repeated using the optimised predictor hyperparameters and the eQTL effect matrix for the full WTCCC cohort to create predictor model-2. Model-2 achieved in-sample PD prediction with an AUC = 0.604 using 311 features (12 SNPs, 46 genes, 49 tissue types) (Supplementary Table 11) that included 308 tissue-specific eQTLs and 3 SNPs without known eQTL effects. Model-2 was validated using the UK Biobank17 (AUC = 0.554) and NeuroX-dbGap16,29 (AUC = 0.568) genotype data.
We determined the tissue-specific distribution for the 50 predictors that were created with model-2’s hyperparameters. The results we observed were consistent with what we observed using model-1 (Figure 4). Specifically, three SNPs (rs117896735, rs35749011 and rs5019538) with no identifiable eQTL effects (Table 2) and the eQTLs within the heart atrial appendage were the top contributors to the risk of developing PD (Figure 4 and Table 2). The three non-eQTL SNPs appeared in both Model-1 and Model-2 and were observed to have similar effect sizes (both magnitude and direction) across both models. Also consistent with model-1, model-2 identified rs6808178 as the top eQTL contributing to the heart atrial appendage signal.
Discussion
The mechanisms by which PD-associated genetic variants4,16,42,43 contribute to disease risk and development have not been fully elucidated. Yet, it is critical that we identify the mechanisms by which they impact on PD because this will allow patient stratification and the development of therapeutics that target disease progression and not just pathology. We used machine learning to understand the genetic architecture of PD risk, by identifying and ranking the pivotal risk variants and tissue-specific eQTL effects that contribute to such risk. Curated PD-associated SNPs from the GWAS catalogue44 were analysed to identify their tissue-specific eQTL effects. Regularised logistic regression predictor models that evaluated PD risk were built and validated across three independent case:control cohorts3,16,17. Model-1 achieved superior predictivity in comparison to Model-2, and delivered an in-sample predictive AUC = 0.627, and was subsequently validated in two independent test datasets derived from the UK Biobank17 (AUC = 0.572) and NeuroX-dbGap16 (AUC = 0.571). Although greater PRS predictivity has been achieved for PD by other groups, our main aim was to determine the SNPs-genes-tissue combinations that have the greatest contribution. Model-1 (generated from 290 SNPs) identified 6 SNPs without known eQTL effects and the SNP modulated gene regulation within the heart atrial appendage as being the major contributors to the predicted risk of developing PD. A second model (Model-2) that was generated using only 90 SNPs2 (which were previously identified to have the greatest predictive power with a PRS analysis) confirmed a subset of the top predictors we observed with model-1. Collectively, our results confirm roles for SNPs that are significantly connected with INPP5P, CNTN1, GBA and SNCA in PD and separately suggest a key role for transcriptional changes within the heart atrial appendage in the risk of developing PD. Effects associated with eQTLs located within the brain cerebellum were also recognized to confer major PD risk in the more extensive model (model-1) consistent with current hypotheses suggesting the brain cerebellum plays a role in PD development45–47.
For the top six contributing SNPs to the model, our analyses did not identify any spatial eQTL interactions. However, previous research has shown connections between these SNPs and three well-known PD-associated genes (INPP5F, GBA, SNCA)48–51, and an additional gene (CNTN1). rs117896735, the top contributor to model-1, is an intronic variant of INPP5F and has previously been identified as eQTL for INPP5F transcript levels (the IPDGC locus browser52). INPP5F is a known risk gene for PD49 that regulates STAT3 intracellular signalling pathways53 and has functional roles in cardiac myocytes and axons54,55. rs1442190 is an intronic variant within CNTN1, a known risk gene for dementia with Lewy bodies56,57 that encodes a cell adhesion protein, which is important for axon connections and nervous system development58. rs35749011 and rs12726330 are linked to the well-known PD-associated gene GBA50 through strong linkage disequilibrium connections (R2 = 0.7739) with rs223028850,59, a missense coding variant located within GBA. rs35749011 has eQTL effects on GBA gene identified by the IPDGC database52. The final two SNPs, rs356220 and rs5019538, are located downstream of SNCA. SNCA encodes α-synuclein, which is central to PD pathogenesis51. The IPDGC database52 indicates that rs5019538 has eQTL effects on SNCA. Notably, rs356220 had the strongest association to PD in the original WTCCC GWAS3. Therefore, there is sufficient evidence that has previously associated these six variants with PD through connections to PD risk genes.
Allele specific regulatory changes in the heart atrial appendage confer PD risk
We identified that eQTLs specific to the heart atrial appendage make a reproducible and substantial (2nd highest) contribution to the risk of developing PD. The heart atrial appendage is a trigger site of atrial fibrillation (AF)60 and highly associated with hypertension and stroke61–64. There is a growing body of research indicating a close relationship between cardiovascular health and PD development65–70. For example, AF has been strongly related to early-stage PD65. Moreover, Moon et al. identified that patients with PD have an increased risk of AF, with a threefold increased risk (HR: 3.06, 95% CI: 1.20-7.77) of AF in younger PD patients (age: 40-49 years)71. Observations of a cross-sectional PD patient cohort have identified abnormal blood flow patterns in brains72 and it is argued that AF-associated perturbation of the brain blood supply networks promotes tissue inflammation and damage leading to PD pathogenesis73.
Amongst the 15 eQTL features that combined to make the heart atrial appendage’s contribution to the risk of developing PD (Table 1 & 2), eQTL up-regulation of EAF1-AS1 (a long non-coding mRNA) made the greatest contribution. EAF1-AS1 has different isoforms some of which overlap EAF1 and COLQ (collagen like tail subunit of asymmetric acetylcholinesterase). Elevated EAF1-AS1 transcript levels have previously been identified by differential gene expression analyses in brain tissue samples from PD patients40. It is interesting to speculate that the impact of this change is mediated through the interaction of EAF1-AS1 with EAF1. Notably, EAF1 has been associated with both neural development74 and TGF-β signalling75, which is a key pathway in many cardiac physiological processes76. As such, the deregulation of EAF1-AS1 might impact on cardiac health. However, the anti-sense overlap is limited to the 3’ UTR of EAF1 (UCSC Genome browser GRCh38/hg38). Therefore, we propose that future studies should investigate the regulatory impacts of EAF1-AS1 on EAF1 and the consequences of alterations in expression levels on heart function and PD disease. We contend that understanding this relationship may help to decipher the complex interactions connecting cardiovascular fitness and PD pathogenesis.
Similar to our work, Li et al.77 used linkage disequilibrium score regression (LDSC) analysis78,79 to identify enrichments of PD risk signals in six GTEx14 central nervous system tissues. However, three subsequent studies using LDSC have failed to reproduce Li et al.’s results 80–82. LDSC focuses on measuring the risk enrichment of genes uniquely expressed in each GTEx tissue78,79. By contrast, our model does not assume unique tissue expression. Rather, it identifies the risk associated with the PD-SNP, or the expression of all genes modulated specifically by PD-SNPs in different or multiple GTEx tissues. We therefore hypothesise that the fact that Li et al. did not identify any signals in heart tissues is likely due to the differences in the assumptions underlying the methodologies.
We acknowledge several limitations within our work. Firstly, the low predictive power of the models, in part, is due to the sample sizes and SNPs that were present within the cohorts we used to train and validate our models. We also acknowledge that the individuals in the included datasets are predominantly of European descent, and thus the significance of our findings are limited to this ethnicity. One limitation that impacts the vast majority of PD research is the lack of consistency in diagnostic criteria from one cohort to the other, and our study is not exempt from this.
The limitations within our study do not detract from the strengths of our model which included the fact that contributing features were: 1) easily identifiable; 2) validated across three independent cohorts; and 3) consistently identified genomic regions that are unanimously recognised as being associated with PD (e.g. SNCA).
Our approach provides a significant advance over other previously reported methods. The novelty revolves around the ability of our method to: 1) rank the contributions that SNPs make to a phenotype through regulatory changes; 2) identify the tissues in which these changes are occurring; and 3) include effects from variants that do not have detectable eQTLs in the reference library that is used in the assay. Finally, the consistency between models and ability to filter extraneous SNPs (e.g. T1D eQTLs) out of the final predictor is another strength of this study. The higher predictive power observed for model-1 (Supplementary Table 12) may be explained by the observation that the final model included more features (827 vs 308). However, given that model-1 leveraged 290 PD-associated SNPs, the result also suggests that the 90 SNPs, originally identified as part of the Nalls et al. PRS analysis2, do in fact contain the major genetic components that are associated with the risk of developing PD. Therefore, while other genetic signals clearly remain to be identified, the finding that both models consistently identified the same SNPs and heart atrial appendage eQTLs as the top contributors to the risk of developing PD further confirms the significance of these observations.
Conclusion
In conclusion, we applied machine learning algorithms to rank the pivotal variants and tissue-specific eQTL effects that may contribute to the risk of developing PD by integrating PD-associated SNPs with information on genome organisation, tissue-specific eQTLs and the genotypes of PD cases and controls. Across our two models we consistently identified the same SNPs and heart atrial appendage eQTLs, linked to EAF1-AS1 regulation, as the top contributors to the risk of developing PD. These findings warrant further mechanistic investigations to determine if they are early contributors to PD risk and disease development.
Data Availability
The data that support the findings of this study are available from the following resources in the public domain: The SNPs are listed in Supplementary table 1. The HiC datasets are listed in Supplementary table 2. eQTL information from 49 human tissues were obtained from the Genotype-Tissue Expression database [GTEx] v8; www.gtexportal.org PD genotype datasets were acquired from the: Wellcome Trust Case Control Consortium (Request ID 10584); UKBioBank (Application Number 61507); NeuroX-dbGap (dbGap project#98581-1)
Authors’ roles
DH was responsible for research project execution (1A), statistical analysis, design, execution (2A & B); manuscript preparation, writing of first draft (3A). WS, AKL, and JO co-supervise DH and were responsible for statistical analysis review and critique (2C). WS, AKL, AC, SF, JOS were responsible for manuscript review and critique (3B). JO was responsible for research project conception, organization and execution (1A, B and C).
Financial Disclosures of all authors (for the preceding 12 months)
Full Financial Disclosures of all Authors for the Past Year: Information concerning all sources of financial support and funding for the preceding twelve months, regardless of relationship to current manuscript, must be submitted with the following categories suggested.
Funding sources
WS, SF, AC, and JO were funded by the Michael J. Fox Foundation for Parkinson’s research and the Silverstein Foundation for Parkinson’s with GBA – grant ID 16229 to JOS. SF was funded by a Liggins Institute Doctoral Scholarship and Dines Family Trust. AC received grant funding from the Australian government. DH was funded by an MBIE Catalyst grant (The New Zealand-Australia Life-Course Collaboration on Genes, Environment, Nutrition and Obesity (GENO); UOAX1611; to JOS).
Acknowledgements
The authors would like to thank Tayaza Fadason and the Genomics and Systems Biology Group at the Liggins Institute for their helpful discussions. The eQTL data used for the analyses were obtained from Genotype-Tissue Expression (GTEx) Portal. We would like to acknowledge the funders of GTEx Project – common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.
This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113, 085475 and 090355.
This research has been conducted using the UK Biobank Resource under Application Number 61507.
Footnotes
Relevant conflicts of interest/financial disclosure: Nothing to report.
Funding sources: WS, SF, AC, and JO were funded by the Michael J. Fox Foundation for Parkinson’s research and the Silverstein Foundation for Parkinson’s with GBA – grant ID 16229 to JOS. SF was funded by a Liggins Institute Doctoral Scholarship and Dines Family Trust. AC received grant funding from the Australian government. DH was funded by an MBIE Catalyst grant (The New Zealand-Australia Life-Course Collaboration on Genes, Environment, Nutrition and Obesity (GENO); UOAX1611; to JOS).
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.
- 6.↵
- 7.↵
- 8.
- 9.
- 10.
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.
- 63.
- 64.↵
- 65.↵
- 66.
- 67.
- 68.
- 69.
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.
- 82.↵