Abstract
Polygenic risk scores (PRS) are relative measures of an individual’s genetic propensity to a particular trait or disease. Most PRS methods assume that mutation effects scale linearly with the number of alleles and are constant across individuals. While these assumptions simplify computation, they increase error, particularly for less-represented racial groups. We developed and provide Delphi (deep learning for phenotype inference), a deep-learning method that relaxes these assumptions to produce more predictive PRS. In contrast to other methods, Delphi can integrate up to hundreds of thousands of SNPs as input. We compare our results to a standard, linear PRS model, lasso regression, and a gradient-boosted trees-based method. We show that deep learning can be an effective approach to genetic risk prediction. We report a relative increase in the percentage variance explained compared to the state-of-the-art by 11.4% for body mass index, 18.9% for systolic blood pressure, 7.5% for LDL, 35% for C-reactive protein, 16.2% for height, 29.6 % for pulse rate; in addition, Delphi provides 2% absolute explained variance for blood glucose while other tested methods were non-predictive. Furthermore, we show that Delphi tends to increase the weight of high-effect mutations. This work demonstrates an effective deep learning method for modeling genetic risk that also showed to generalize well when evaluated on individuals from non-European ancestries.
Introduction
The total genetic component of common traits and diseases is attributable, at least in part, to a combination of small effects from a large number of mutations on the entire genome (1). Genome-wide association studies (GWAS) can identify univariate relationships between common single nucleotide polymorphisms (SNPs) and a given trait. A GWAS output comprises an estimated effect size coupled with a P-value of association for each tested SNP. A single scalar indicating relative genetic risk can be obtained by summing up the number of alleles weighted by the estimated effect size of SNPs, with or without non-genetic risk factors (2). These so-called polygenic risk scores (PRS) are commonly used to quantify an individual’s genetic propensity for a particular trait or disease and have potential clinical applications in prevention, diagnosis, and treatment (3; 4; 5).
Methods for PRS estimation have evolved considerably over the past decade. It was first found that including mutations below GWAS statistical significance would increase predictive power (6; 7). Taking linkage disequilibrium (LD) into account by either clumping and thresholding (C+T) or by using a shrinkage method also improved performance (8; 9). More recent work has included advancements in statistical learning and an improved understanding of biology to increase the predictive performance of PRS. For instance, Bayesian approaches can also consider minor allele frequency (MAF) (10) or incorporate functional priors (11) to modify the effect size estimates. These different methods generally offer marginal improvements over one another and suffer from similar limitations: effects are constant and scale linearly with the number of alleles.
PRS typically become significantly less predictive when applied to other less represented ancestries (12). This performance drop can be partly attributed to allele frequency differences between cohorts and other genetic and environmental factors. These limitations hinder the application of PRS in medical settings (13), and this performance gap can only be bridged with additional data collection from under-represented ancestries. Multiple approaches have been proposed to increase the generalizability of PRS, for instance, by aggregating results from multiple GWAS studies (14; 15) or prioritizing functional variants (16). Recently, increased prediction performance was observed through the use of a gradient-boosted model taking a standard PRS and a selection of high-impact SNPs as inputs (17). Very recently, GenoBoost (18), another gradient-boosted approach, showed improved performance by modeling non-additive mutation effects.
Deep learning (DL) offers the ability to learn complex patterns directly from large labeled datasets with minimal assumptions. In genetics, DL has been applied for many problems such as variant calling (19), motif discovery (20), and image-derived phenotyping for GWAS (21; 22). Explainable DL approaches (23) could provide additional insight into the genetic factors influencing the disease. Abe et al. recently constructed a knowledge graph (24) to generate text-based explanations for individual variants. Using deep learning for genetic risk prediction could provide unique advantages, as overparametrization has recently been shown to improve generalization (25), which is important for PRS to be applicable in under-represented populations.
Using DL for PRS estimation has been attempted before, although the proposed approaches consisted in using shallow networks (max. 4 fully connected or convolutional layers) on a small set of SNPs (max. 5K) (26; 27; 28; 29; 30; 31). In those examples, DL was shown only marginally to improve results, if at all. For instance, Badré et al. (29) found that including 5273 high-impact SNPs in a deep neural network slightly improved the predictive performance of PRS for breast cancer over logistic regression, and including more SNPs did not improve performance. Zhou et al.(31) showed that a small neural network with three fully connected layers improved Alzheimer’s disease genetic risk prediction in a small (N≈10K) cohort.
In this work, we propose Delphi (deep learning for phenotype inference), a deep learning method that alleviates some of the issues of PRS mentioned above by tuning risk score estimates in a data-driven and hypothesis-free manner. In contrast to previous methods, we use a transformer architecture to capture non-linear interactions. Unlike other approaches, we modify effect sizes before the summation, allowing allele effects to depend on sex,, and other mutations. Our method can fine-tune effects from any classical PRS method such as LDpred (8) and Lassosum (9). We report state-of-the-art results for 5 phenotypes from the UK Biobank dataset (UKBB) (32), and show that Delphi tends to increase the estimated effect of high-impact mutations. We also validate our predictions on individuals from under-represented ancestries and show that Delphi generalizes better than other tested approaches.
Results
The Delphi Method
At a high level, Delphi (Figure 1) uses genotyping and covariate information to learn perturbations of mutation effect estimates. Our approach contained two main steps. (1) the dataset was split into training, validation, and test sets before pre-processing. Mutation effect sizes were estimated with standard PRS techniques, and genotyping data was converted into a format enabling fast loading during training. (2) In the training step, a covariate model based on gradient-boosted trees (33) estimated the phenotype from age, sex, and genetic principal components, and a deep neural network learned to perturb individual effect sizes for all mutations included in the PRS summation. The modified effect sizes were then summed up to form a personalized PRS. The covariate model outputs and the PRS summation were finally linearly combined to form the final prediction.
GWAS and PRS
485’231 UKBB subjects were randomly split into different sets. The training set was used for principal component analysis (PCA), GWAS, PRS computation, and deep neural network training. The validation set was used for PRS validation and model selection after training. The held-out test set remained unseen until the final evaluation. We only considered 1.3M SNPs from the HapMap3 set (34) with an INFO score > 0.8 and MAF > 0.01. PCA on the genotype matrix was used to capture population structure.
GWAS for all phenotypes only included subjects within the training set from British-white ancestry (UKBB field 22006) to reduce spurious associations, and any subjects further than three standard deviations away from the first six principal components were removed. Sex, age, and the first 20 principal components were used as covariates. Classical PRS methods use LD, MAF, and other measures to reweight the effect estimates. We found some performance improvement by using these re-weighted effect estimates as a baseline instead of the GWAS summary statistics. PRS were obtained with three different methods: C+T, Lassosum (9), and LDpred (8). The pre-processing step was implemented in R, using the bigsnpr (35) library.
Learning perturbations of mutation effects
The second step consists of learning individualized effect perturbations. As in GWAS, covariates were age, sex, and the first 20 PC loading. Before training, an XGBoost model was fitted on covariate data and is referred to as the covariate model. Separately, the genotype data was converted from .bgen files to a hierarchical format that allows for fast data retrieval of all HapMap3 SNPs of a small number of subjects. We trained the neural network on the residuals of this model, which made convergence easier when some covariates had a high impact on the phenotype. The neural network’s architecture was a standard 8-layer transformer with variable sequence length depending on the number of input SNPs. SNPs were aggregated into fixed-size groups and linearly mapped to form a sequence of embeddings of size 512. In addition, covariates were included as the first embedding in the sequence, and zero padding was used when necessary. The transformer’s output was then mapped back into a vector the size of the number of input SNPs. This vector represents individualized variations in the SNP effect. As in traditional PRS methods, these modified effects were then summed up and linearly mapped in combination with the output of the covariate model to form a final prediction. A graphical overview of the method is presented in figure 1.
Baseline PRS results
Three PRS methods (C+T, Lassosum2, and LDpred2) were compared to provide baselines. We compared the proportion of explained variance (EVR) for all phenotypes. Predictions were made with a linear or logistic model, using age, sex, and the first 20 genetic PCs as covariates and the estimated score. Our results are displayed in Figure 2. LDpred2 outperformed the other two tested methods on all tested phenotypes. Thus, we chose the effect estimates from LDpred2 as the baseline for our method in all further analyses. We also compared the performance of three variants of LDpred2: LDpred2-grid and LDpred2-auto. We found that LDpred2-auto was superior to LDpred2-grid for all tested phenotypes and used this variant.
Trait Prediction
We evaluated the performance of Delphi on ten continuous phenotypes, using three different train/test splits, and used explained variance as performance metrics. We also compared Delphi with linear and Lasso regressions and an approach using XGBoost to modify effect sizes (17) with the base weights from LDpred. Hyperparameters for all three methods were tuned with three-way cross-validation on the same validation set. Results showed that Delphi resulted in lower error than other approaches on all phenotypes Figure 3 provides detailed results. It should be noted that the explained variance from benchmarked PRS methods is lower than the ones shown on the validation set in Figure2, as the validation set was used for parameter tuning on this set for all 3 methods.
We then compared Delphi prediction to the next best method, XGboost, in terms of the distribution of errors for ten phenotypes. Delphi generally tended to have fewer large prediction errors, as shown by the ratio of quartile difference between the two methods (Figure 4). This is especially visible for height, for which ratios had to be bounded between 0.9 and 1.1 for visibility. We suppose the sharp gain in performance for height is due to the fact that this phenotype is known to be highly polygenic and to exhibit SNP-sex and SNP-PC interactions (36), which makes this phenotype particularly suitable for our approach. This difference in prediction distribution is also visible from histograms of distances between predicted and ground truth decile values, as shown in figure 5.
Predictions for Delphi showed, in general, lower absolute error than XGboost prediction (supplementary section 1). The relative increase in the percentage variance explained compared to the state-of-the-art was 11% for body mass index, 19% for systolic blood pressure, and 35% for C-reactive protein; in addition, Delphi provided 2% absolute explained variance for blood glucose while other tested methods were non-predictive.
Performance comparison on non-white British for multiple phenotypes
We also compared performance on the subset of the test set with non-British white ancestries (N≈13K, depending on split and phenotype). Ancestry was determined according to Field 22006, which indicates subjects who self-identified as ‘White British’ according to Field 21000 and have very similar genetic ancestry based on a principal components analysis of the genotypes. Results are shown in Figure 6. The relative increase in the percentage variance explained compared to the state-of-the-art on the set with non-British white ancestries was 18% for body mass index, 25% for systolic blood pressure, 2% for LDL, and 15% for C-reactive protein; in addition, Delphi provided 2% absolute explained variance for blood glucose while other tested methods were non-predictive. Notably, the EVR is higher for non-British white individuals for some phenotypes (BMI and LDL). This predictive gain is due to the reduction of the total variance and is not reflected in the mean absolute error (see supplementary section 1). Results on the British white set were otherwise very similar to the total set shown in figure 3 as these individuals from approximately 95% of the held-out test set.
We also compared performance on subsets of the test set that self-reported as either African (N≈560), Chinese (N≈290), or Indian (N≈920). Results are shown in figure 7. Despite the low number of subjects in each group, Delphi outperforms other tested approaches on most phenotypes.
Observed Trends in Effect modulation
We observed interesting patterns when inspecting the average effect modulations before the summation. As shown in Figure 8, Delphi tends to down-weight the absolute effect of SNPs with low absolute effect. Interestingly, we do not observe the same trend when grouping SNPs by minor allele frequency decile. Other SNP-heritability estimation methods such as LDAK (37) include MAF, LD estimates and functional anotations to refine predictions. As LDPred modifies the effect estimates before any modification by the deep neural network, we expect the LD structure to be included in the effect estimates. This observation might indicate that the absolute effect may be an additional parameter of interest for future Bayesian methods.
Discussion
We introduced Delphi, a deep-learning-based method for trait prediction from genetic data. We demonstrated that deep learning enhances the predictive power of polygenic risk scores. Treating genetic risk estimation as a deep prediction problem allowed us to relax the usual assumptions of traditional PRS methods, yielding significant performance improvements on multiple phenotypes over previous PRS computation methods.
Several studies have tested deep learning approaches for phenotype inference from genetic data. These approaches all have a similar structure: use GWAS summary statistics to select a subset of SNPs, then use these as input for the neural network. Uppu et al. (26) used a 3-layer feed-forward network applied to breast cancer data. To our knowledge, this study contains the first use of a neural network for genetic risk prediction. In contrast, Bellot et al. (27) did not find any performance gain when comparing convolutional and fully connected neural networks to traditional methods. Recently, Huang et al. proposed DL-PRS (28), a method that also uses a shallow network to predict COPD, achieving marginal performance gains over traditional methods on UKBiobank data. A similar approach has been used by Badre et al. (29) with a 4-layer FC neural network on breast cancer data. Very recently, Zhou et al. (31) used a graph neural network for Alzeihmer’s prediction by constructing a graph from a few correlated locis. Elgart et al. (17) showed the strongest evidence for the superiority of non-linear methods for PRS by using gradient-boosted trees. This publication obtained robust results across multiple traits, which motivated our study.
All previously mentioned approaches only consider a small subset of SNPs (typically less than 1000) as input and become less predictive when including more small-effect SNPs. Training becomes difficult as smaller effects add noise to the input due to their minimal individual impact on the phenotype and a lack of clearly exploitable patterns. This is particularly a problem for PRS, which can include tens of thousands of SNPs. To guide the neural network towards meaningful predictions, we chose to perturb the estimated effect sizes rather than predicting the phenotype directly. As a result, we can effectively integrate up to a hundred thousand SNPs as input, which would not be feasible with other methods.
We have also shown that our method generalizes well when evaluated on individuals from non-European ancestries, although our training set is composed of 95 % European. This is an essential point for the success of PRS in any clinical setting, or their application can potentially reinforce racial bias (38). Our approach could be combined with other methods for the standardization of PRS, for instance, by combining summary statistics from multiple GWAS studies (39) or through some debiasing measure (40). The performance and fairness of PRS is an ongoing problem and requires more data acquisition from non-European cohorts. To reduce these disparities, it is necessary to assess and maintain prediction performance for all populations thoroughly.
Our study presents several limitations. The high dimensionality of the data, combined with the sizeable but still limited number of samples, means some trade-offs had to be taken to maintain consistent prediction performance across all traits. Similarly to other PRS methods, the most crucial hyperparameter to tune is the minimum threshold probability for SNP inclusion. This threshold also affects the number of SNPs we batch in a single embedding vector, and training can diverge when including too many non-significant SNPs. A threshold of > 1% in MAF also limits our ability to generalize to other ancestries but is required for estimating effect sizes. Similarly, we removed individuals from non-European ancestry for GWAS to avoid spurious associations but kept them during training. Our approach also takes significantly more computational power and time than the other compared methods.
Delphi tends to increase the effect of SNPs with high effect estimates and down-weights low effect SNPs. Similar heuristics have been shown to improve heritability estimates by tuning effects based on minor allele frequency (MAF) (10). Although MAF is correlated with effect size, we found no such association by inspecting variations modulation and MAF quantiles. Unfortunately, we also found that the effect estimates of individual SNPs would vary drastically between different data splits, making the interpretation of SNP effect modulation challenging, as the variations of the neural network were much smaller than the differences from data randomization. This limitation might be alleviated by including summary statistics from another cohort.
Methods
UK Biobank
The UK Biobank (UKBB) (32) is a large-scale ongoing prospective study including over half a million individuals from across the United Kingdom. Participants were first recruited between 2006 and 2010 and underwent extensive testing, including blood biomarkers, health and lifestyle questionnaires, and genotyping. Longitudinal hospitalization data for any disease represented by an ICD-10 code is also provided between the recruitment date and the present time. UKBB contains genotypes for 488,377 individuals at the time of download (March 2023), 409,519 of which are from ‘white British’ ancestry. Ancestry was inferred using the data field 22006, which uses self-reports and principal component analysis of the genotypes. Variant quality control included the removal of SNPs with imputation info score < 0.8 and retaining SNPs with hard-call genotypes of > 0.9 probability and MAF > 0.01. To reduce the initial dimensionality of the data, we only considered 1,054,330 HapMap3 (HM3) (41) SNPs as they have shown to be a sufficient set for traditional PRS methods (42) and are the standard set for polygenic risk score evaluation.
Data Splits and Phenotypes
We evaluated the performance on all ten traits of our method using three independent train/validation/test splits. For the quality control of our samples, we only considered 407,008 subjects used in the principal components analysis of the UKB dataset (field 22020). These subjects are unrelated, did not withdraw consent from the study, and passed some genotyping quality control tests. Subjects were not selected based on ancestry at this stage. We used 80% (325,606) of the dataset for training, 5000 subjects for validation, and the rest (76,402 subjects) for testing. Some individuals were further removed depending on missing data for each phenotype. We kept this exact split for the preprocessing and the training of the neural network. The same training set was used to compute the GWAS and train the neural network. The validation set was used to select the best hyperparameters for polygenic risk scores and benchmark algorithms and to stop the deep neural network training. We assessed the performance of our method on ten continuous phenotypes. BMI, height, SBP, LDL, and C reactive protein values were taken directly from the UK Biobank first time point measurements.
The LD reference panel used for LDpred was previously computed (42) with some individuals from the test set. The choice of LD reference panel was shown to have a limited impact on performance (42), and the same weights from LDpred were used for all benchmarked methods. Finally, this panel did not contain individuals from non-British white ancestry, ensuring that performance results on non-British white (see section Performance comparison on non-white British for multiple phenotypes) are unbiased.
PCA and GWAS
PCA eigenvectors were obtained from HM3 SNPs using only genotype information from the training set for each data split. As recommended (43), we used a truncated PCA method with initial pruning that iteratively removes long-range LD regions. For GWAS computation, the training set was pruned by removing individuals with no British white ancestry (field 22006) and who were beyond two standard deviations of the Mahalanobis distance of the first 6 PCs. This additional subject selection was only applied for the GWAS to prevent spurious relationships that can arise with heterogeneous cohorts. Covariates for the regression included age, sex, the first 20 principal components, age2, age·sex and age2·sex. PCA and GWAS were computed using the bigsnpr (35) R package (version 1.9.10).
PRS Computation
Polygenic risk scores were computed for each phenotype and data split using clumping and thresholding (C+T), lassosum2, and LDpred2. In C+T, correlated variants are first clumped together, leaving only the ones with the lowest P-values while others are removed. We used 50 P-value thresholds combined with stacking to learn an optimal linear combination of C+T scores in a 10-fold cross-validation on the train set. The remaining variants are then pruned by discarding the ones with a P-value larger than a chosen significance level. Lassosum uses L1 and L2 regularization on the effect sizes and a linkage disequilibrium (LD) correlation matrix to penalize correlated and low-effect variants. The regularization coefficients were chosen by measuring model performance on the validation set. LDpred is a Bayesian method that uses a prior on effect sizes and an LD correlation matrix to re-weight effect estimates. LDpred2-auto (44) is a variant of LDpred in which two key model parameters, the SNP heritability and polygenicity, are estimated from the data. The LD correlation matrix was obtained from a reference panel (42). We used the validation set to identify optimal hyper-parameters such as P-value cutoffs and regularization coefficients for each method. We used the C+T, Lassosum2, and LDpred2 implementations of the bigsnpr (35) R package (version 1.9.10), using the default hyperparameters ranges for each PRS method.
Network Architecture
We designed a deep learning neural network (DNN) to predict phenotypes from genetic data, illustrated in Figure 9. During training, SNPs are loaded in memory as a matrix of size B×S, where B represents the batch size and S is the number of SNPs. To reduce the dimensionality of the data, SNPs were filtered by a tunable p-value threshold T. The neural network’s architecture is an 8-layer pre-norm transformer (45) with two attention heads and GELU activation function (46). Similarly to vision transformers (45), we batched SNPs into arbitrary patches of length L to form a sequence of embeddings, using zero-padding to complete the last embedding. Inputs were then linearly mapped to match the input size of the transformer (512 in all experiments), and a vector containing covariate information was added as the first embedding of the sequence.
We found that training directly on the phenotype would result in divergent training due to the large dimensionality of the data, the low impact of individual SNPs, and the fact that phenotypes are not fully described by their inputs. To remedy this problem, we used the effect sizes βi from a classical PRS method to guide the neural network’s predictions. We decoded the output of the transformer using a linear layer to match the original input size and predict variations of each effect using a tanh activation function: where βi is the effect size from the PRS, is the effect modified by the neural network , and x is the input to the neural network (covariates + 100K SNPs).
Each output represents an individual modification of the estimated effect that depends on covariates and the presence of other SNPs. Modified effect estimates were then summed up to form the modified PRS prediction of the DNN, . We designed the DNN such that, were it to output only zeros, we would recover the unmodified PRS score. We found that using the effect estimates as a guide during training to be the only way for the neural network to output predictive results.
We used the effect sizes from LDpred2 to train the neural network in all our experiments. We used a batch size of B = 512, transformer input size of 512, feed-forward dimension of 512, and 0.3 dropout during training. The covariate vector included the same covariates from the GWAS for each trait. The patch size (L∈{128, 2048}), P-value thresholds (range 0.01-10−6), and learning rate (range 0.05-5·10−4) were individually tuned for each trait and were the only parameters that varied between traits. For a specific trait, we used the same patch size and p-value threshold for each data split. We used a linear decay for the learning rate with 300 warmup steps. Models were trained on a single NVIDIA GeForce RTX 3090 (24 GB) and were composed of approximately 14M parameters. We used the AdamW optimizer (47) with ϵ = 10·10−8, β = (0.9, 0.999). Training averaged between 8 to 12 hours for each trait. For binary traits, we used the ROC AUC as the evaluation metric.
For all phenotypes, we used explained variance (Equation 2) as the evaluation metric: where y is the ground truth and ŷ is the prediction.
We used these metrics to select the best-performing model on the validation set and for the final evaluation of the held-out test set. We used smooth L1 loss (48) during training.
Interestingly, we observed different convergence patterns for each phenotype. Some, like BMI and SBP, tended to converge after one epoch despite a very low learning rate and even overfit after this point. On the other hand, height required a much larger learning rate and converged after around 20 epochs. Binary traits included fewer SNPs due to differences in the distribution of GWAS p-values. Consequently, binary traits required a smaller patch size. We tuned the patch size such that the sequence length lay between 20 and 70, keeping patch sizes multiples of 2 between 256 and 2048.
Covariate Model
As some covariates can greatly impact the phenotype (e.g., sex and height), we found that directly using the phenotype as a ground truth would make the neural network diverge during training for some phenotypes. To solve this problem, we used another model that only used the covariates as input to predict the phenotype and trained the deep neural network on the residuals. We chose XGBoost, a gradient-boosted trees method similar to the method we used to benchmark, without the additional high-impact SNPs. To be consistent, we used the same hyperparameters across data splits and phenotypes with 3-fold cross-validation for optimal model selection. For the XGBoost hyperparameters, we used a maximum depth of 5, α = 0, γ = 0, η = 0.01, a subsample of 80%, and a minimum chid weight of 10 in all our experiments. The weighted sum of effect estimates with P-values lower than 0.05 but higher than the P-value threshold of the deep neural network was then added back to the output of the covariate model. Finally, The DNN predictions yDNN were linearly combined with the covariate model to form the final prediction.
Data Loading During Training
Gene sequence variations formats such as bgen and pgen are compressed and optimized to query a single variant at a time to enable fast GWAS analysis. For our purposes, we needed a format that could allow us to efficiently load in memory all HM3 variants for a small number of subjects. We chose to convert the HM3 SNPs in bgen format to a Hierarchical Data Format (HDF5) with a Python script using the h5py (49) and bgen_reader (50) libraries. When loading the data, we implemented an efficient dataloader that merges genotype and phenotype information. This data format allowed us to load a batch of 512 samples containing 1.1M SNPs in memory in less than 10 ms, which was acceptable for training. Bgens were converted to a single HDF file with a Python script, which only needed to be done once. We encoded allele dosage as 0, 1, and 2 for homozygous reference, heterozygous, and homozygous allele. The samples were ordered as in the sample file from the .bgen of the first chromosome.
Adding in low effects as constants
To keep the inputs’ dimensionality relatively low and avoid including extremely small effects, we summed up the effects that were lower than 0.05% of the maximum and only included the others in the input of the neural network. The sum of the smaller effects was then added to the yDNN output. Assuming that the deep neural network output only zeros, the network’s architecture is such that output would be the same as the LDpred weighted sum.
Model performance evaluation and comparison to existing methods
We compared the performance of our approach to three other state-of-the-art methods. We used the polygenic risk score predictions from LDpred2 for each method and included the same covariates as our approach. It was recently found (17) that including high-impact SNPs in a non-linear model can increase the quality of genetic prediction. We modified this existing method to be computationally feasible while enabling fair comparison. To be precise, instead of filtering SNPs with LASSO regression before XGboost, which we found to be computationally expensive due to the size of UKB, we filtered them by P-value thresholding. We considered for inclusion in the model all SNPs with a p value < 10−4 using our GWAS summary statistics for the corresponding trait, keeping the same data splits as previously described. We then used eight relative thresholds α values between 0 and 1 and kept SNPs with a P-value in the top T percentile.
We fitted XGBoost and LASSO models by including covariates (sex, age, first 20 PCs), selected SNPs, and the LDPred2 risk score prediction. We selected the model that minimized the MSE for each phenotype. For XGBoost, we always used a learning rate of 0.01, maximum depth of 5, minimum child weight of 10, and subsample of 80%. Each model was fit using 3-fold cross-validation on the training set, allowing up to 2000 boosted trees with early stopping after 20 rounds. We repeated this process for all 10 traits and 3 data splits. Analysis was conducted using Python 3 and the scikit-learn and xgboost packages.
Data Availability
Data referred to in the present work is available by application at https://www.ukbiobank.ac.uk/.
Data Analysis with R
Data analysis was performed with publicly available packages: tidyverse v1.3.1 (51), and dplyr v1.0.8 (52).
Data Analysis with Python
The covariate model was implemented using the xgboost python library (53). The deep learning model was implemented in Pytorch (54).
Code Availability
Code used for processing genetic data, GWAS analyses, and training of the neural network for this manuscript is provided on a dedicated GitLab repository https://gitlab.com/cgeo/delphi.
Data availability
No data were generated in the present study. UK Biobank data are publicly available by application (https://www.ukbiobank.ac.uk/enable-your-research/register).
Acknowledgements
This research has been conducted using the UK Biobank resource under application number 80108, with funding from the Swiss National Science Foundation (Sinergia CRSII5_202276/1).
Footnotes
Added information in main script