Abstract
When quantitative longitudinal traits are risk factors for disease progression, endogenous, and/or subject to random errors, joint model specification of multiple time-to-event and multiple longitudinal traits can effectively identify direct and/or indirect genetic association of single nucleotide polymorphisms (SNPs) with time-to-event traits. Here, we present a joint model that integrates: i) a linear mixed model describing the trajectory of each longitudinal trait as a function of time, SNP effects and subject-specific random effects, and ii) a frailty Cox survival model that depends on SNPs, longitudinal trajectory effects, and a subject-specific frailty term accounting for unexplained dependency between time-to-event traits. Inference is based on a two-stage approach with bootstrap joint covariance estimation. We develop a hypothesis testing procedure to identify direct and/or indirect SNP association with each time-to-event trait. Motivated by complex genetic architecture of type 1 diabetes complications (T1DC) observed in the Diabetes Control and Complications Trial (DCCT), we show by realistic simulation study that joint modelling of two time-to-T1DC (retinopathy, nephropathy) and two longitudinal risk factors (HbA1c, systolic blood pressure) reduces bias and improves identification of direct and/or indirect SNP associations, compared to alternative methods ignoring measurement errors in intermediate risk factors. Through analysis of DCCT, we identify two SNPs with indirect associations with multiple time-to-T1DC traits and obtain similar conclusions using alternative formulations of time-dependent HbA1c effects on T1DC. In total, joint analysis of multiple longitudinal and multiple time-to-event traits provides insight into aetiology of complex traits.
Introduction
Despite their known ability to improve inference in clinical and epidemiological studies, particularly in the presence of informative censoring/dropout or when longitudinal traits are measured with error1–3, joint models for longitudinal and time-to-event outcomes have received limited attention in genetic association studies. Genome-wide association studies (GWAS) of intermediate quantitative traits (QTs) typically require follow-up studies to identify whether SNP associations detected with each of the QT(s), analyzed separately, also affect related outcomes through direct and/or indirect effects induced by those traits. Such intermediate traits may include established risk factors for related clinical outcomes and be measured with errors (e.g. biological variation). By accounting for QT measurement error and dependencies among traits, joint models can improve accuracy and efficiency of effect estimation, as well as detection of SNP associations.
The so-called shared-random-effects joint model4,5 consists of a sub-model for a single longitudinal trait and a sub-model for a single right-censored time-to-event trait. The longitudinal sub-model describes the QT as an underlying smoothed trajectory that depends on fixed effects of time and baseline covariates, as well as subject-specific random effects. The joint model association structure is induced via the functional dependence between the hazard of an event at any time t and the longitudinal trait trajectory6,7. This relationship can be based on prior biological knowledge of the link between that longitudinal and time-to-event traits. As elucidated by Ibrahim et al1, this class of joint models lends itself to interpretations of direct/indirect effects because the relationship between a baseline covariate, such as a SNP genotype, and each of the longitudinal and time-to-event traits, as well as the relationship between the longitudinal and time-to-event traits can be specified via model parameters corresponding to direct, indirect and overall effects. Extensions of joint models for multiple longitudinal and/or multiple time-to-event traits can further improve inference by borrowing information among related traits. Extensions of joint models have been reviewed for multiple longitudinal traits6,7 or multiple time-to-event traits8. A few extensions for both multiple longitudinal and multiple time-to-event traits have been developed (9–11, among others), but these models are often formulated for a specific study question, and thus can lack generalizability. Such extensions raise computational challenges for maximisation of the likelihood under the multivariate random effect’s distribution. Two-stage approaches12–15 for joint model fitting appear more computationally efficient and lends itself to more flexible model formulation but inference can be mis-calibrated because parameter estimates and predictions from Stage 1 are obtained from the longitudinal model without consideration of the time-to-event outcome, and the uncertainty in Stage 1 estimates is ignored during Stage 2 estimation, a problem known as propagation of errors16.
Motivated by the complex genetic architecture of long-term type 1 diabetes complications (T1DC), we propose joint-model extension to evaluate genetic associations with multiple time-to-event traits and multiple longitudinal risk factors. Risk of development of T1DC, including diabetic retinopathy (DR) and diabetic nephropathy (DN), is hypothesized to result from multiple genetic factors with potential direct and/or indirect effects via multiple - shared and/or specific - risk factors17. In addition to genetic factors, hyperglycemia (measured by Hemoglobin A1c, referred as HbA1c) represents the major QT risk factor for T1DC. Other measured longitudinal traits, that can also be influenced by genetic factors, have postulated associations with T1DC, for example, association of systolic blood pressure (SBP) with DN. Because conventional analysis methods in GWAS ignores dependency between related traits, SNPs can appear associated with multiple traits (including SNPs with direct and/or indirect effects induced by intermediate QT(s) measured or unmeasured). Accurately distinguishing between direct and/or indirect associations is crucial to elucidate the biological pathways through which genetic factors operate into the etiology of those complex traits, like in our motivating example of the T1DC genetic architecture. A naïve approach to disentangle direct from indirect associations can be achieved by including the intermediate QT as a time-dependent covariate in a Cox Proportional Hazard (PH) model. However, measurement errors in intermediate QT(s), presence of informative censoring in longitudinal trait, unmeasured shared risk factors between longitudinal/time-to-event traits can challenge accurate classification of direct and/or indirect SNP associations.
The major contribution of our work is a general formulation for the joint analysis of multiple time-to-event traits and multiple longitudinal QT risk factors in genetic association studies. We develop inference methods for statistical genetic analysis based on joint model parameters estimation, including a hypothesis testing procedure to classify direct and/or indirect SNP associations with each time-to-event trait. A second contribution of the paper is the implementation of a data-informed simulation algorithm to generate multiple causal SNPs with various direct effects on simulated time-to-event traits and/or indirect effects via observed (measured) longitudinal QTs in the Diabetes Control and Complications Trial (DCCT) study18,19 and unobserved (simulated) longitudinal QTs. This algorithm also provides a general approach to estimate achievable power given sample size. Our simulation results show bias reduction and improved classification of direct and/or indirect SNP associations in comparison with alternative approaches ignoring measurement errors in longitudinal QT(s). Thirdly, we apply the joint model in DCCT data and clarifies two SNPs as having indirect associations via the HbA1c longitudinal risk factor, with consistent conclusions using alternative time-dependent association structures that account for established cumulative and time-weighted effects of HbA1c on T1DC traits20,21. Example R codes for data simulation and for application of the proposed joint model are available on GitHub.
Methods
Joint Modelling Approach
Model Formulation
We assume that a set of M SNPs are available with K (1 ≤ k ≤ K) unordered and non-competing time-to-event traits, such as multiple disease complications, and L (1 ≤ l ≤ L) longitudinal QT traits (i.e. intermediate risk factors) possibly measured with error (i.e. biological variation) in N unrelated individuals indexed by i (1 ≤ i ≤ N). For each SNPm (1 ≤ m ≤ M), with minor allele frequency (MAF) pm, the genotypes are coded as the number of copies of the minor allele, under an additive genetic model. For each individual denotes the vector of quantitative measures collected over scheduled visits at tij for j=1,…, ni for the lth longitudinal trait (1≤l≤ L) with ni represents the maximum number of visits recorded. Let (Ti(k), δi(k)) be the vector of observed right-censored event time Ti(k) and event indicator δi(k) for the kth time-to-event trait.
We assume , where is the latent (uncensored) time-to-event k and Ci is the censoring time (e.g. administrative censoring). We define , with δi(k) = 1 if the event occurs during the observation period and δi(k) = 0 otherwise.
To characterize the genetic architecture of a system of multiple time-to-events and multiple longitudinal risk factors, we formulate a shared-random-effects joint model, as specified in Figure 1, combining: (i) longitudinal and (ii) time-to-event sub-models connected by (iii) specified time-dependent association structures.
(i) Longitudinal sub-model
The longitudinal sub-model is specified by a multivariate mixed model for the L longitudinal QTs (or risk factors), based on the multivariate extension of the Laird and Ware linear mixed model22. It describes each vector of observed longitudinal QT measures Yi(l)(tij) as noisy observations of a true and unobserved smoothed subject-specific longitudinal trajectory, , with noise terms . Each smooth trajectory describes subject-specific evolution and depends on time, SNP effect, individual-level random effects bi(l) and can also include effects of other baseline covariates, for example confounding factors or ancestry-related principal components. To simplify the presentation, we assume a linear trajectory, but the longitudinal sub-models can be adapted for nonlinear trajectories using, for example, higher order polynomials or splines23.
Multivariate mixed model
Where:
β0(l), β1(l), βg(l) and βh(l) denote, respectively, the fixed intercept, and baseline effects of time, SNP and other covariate(s) on each longitudinal trait l.
bi = ((bi0(l), bi1(l))′, …, (bi0(L), bi1(L))′) ′ is the 2xL vector of random effects with bi∼N2L(0, D).
denotes the overall variance-covariance matrix for random effects, accounting for serial dependencies within longitudinal traits (Dl,l, unstructured) and cross-dependencies between longitudinal traits (Dl,m for two traits l≠m).
is the vector of error terms where is assumed independent and identically distributed.
is the residual variance of the QT l. We further assume that εij(l) and bi(l) are independent22.
(ii)Time-to-event sub-model
The time-to-event sub-model is specified by a Proportional Hazards (PH) mixed effects model (also known as frailty PH model) where the hazard function of each time-to-event trait k depends on the SNP effect adjusted for a function of the subject’s longitudinal trajectories Wi(k)(t). We introduce a subject-specific random effect (frailty term, ui) to capture unexplained dependencies (e.g. due to unmeasured shared factors) among the K time-to-event traits. Given the common frailty term ui, the K time-to-event traits are assumed independent24.
Proportional Hazards mixed effects sub-model
Where:
λ0(k)(t) is a (parametric or non-parametric) baseline hazard function for the time-to-event trait k.
γg(k) denotes the SNP effect on the time-to-event k adjusted for the indirect SNP effects via longitudinal QT(s), taken into account by Wi(k)(t).
γv(k) is the effect of the baseline covariate vector(s). Those covariates can be shared or trait-specific covariates. They can also be time-dependent covariates for all or a subset of the K time-to-event traits (in this case, Vi(k) = Vi(k)(t)).
ui and bi, random effects from the time-to-event and longitudinal sub-models respectively, are assumed independent. We assume ui∼Γ(a, b), with a, b>0.
(iii) Time-dependent association structures
The time-dependent association structures, Wi(k)(t), that connect the longitudinal and time-to-event sub-models (Figure 1), account for the specific temporal relationships between each set of Lk (1 ≤ Lk ≤ L) associated QT(s) and each time-to-event trait k.
Where:
1 ≤ Lk ≤ L is the set of longitudinal risk factors of the time-to-event trait k.
denotes the functional form of the exposure effect on time-to-event trait k corresponding to the l-th longitudinal risk. In the case of a contemporaneous parametrization, the hazard of an event k at a time t depends on the longitudinal trait value at the same time t (i.e. . Other parametrizations have been described in the literature6,7,25.
αl(k) denotes the association parameter of the function of the smooth longitudinal trajectory with the time-to-event k.
, with 0 ≤ s ≤ t, 1 ≤ l ≤ L} denotes the history of the underlying longitudinal trajectory for trait l up to time t.
Interpretation
When the joint model is correctly specified, the parameters accurately represent multiple relationships between each SNP and the correlated traits. Note that βg(l) is the SNP effect on the longitudinal risk factor l, γg(k) is the direct SNP effect on the time-to-event trait k via other mechanisms than the indirect SNP association induced by the observed longitudinal risk factors. In this context, θk(l) = µg(k,l) + γg(k) is interpreted as the overall SNP effect on the time-to-event trait k, where µg(k,l) = αl(k)βg(l) represents the indirect1 SNP effect on the time-to-event trait k via the longitudinal QT l.
Under the proposed joint model formulation, when a longitudinal risk factor l is associated with a time-to-event trait k (αl(k) ≠ 0), a SNP that has an effect on the longitudinal QT risk factor l (βg(l) ≠ 0), but no effect on the time-to-event trait k (γg(k) = 0), is interpreted as having an indirect SNP association and its overall effect is equal to its indirect effect (θk(l) = µg(k,l), with µg(k,l) = αl(k)βg(l)). A SNP with an effect on the time-to-event trait k (γg(k) ≠ 0), but no effect on the longitudinal QT l (βg(l) = 0), is interpreted as having a direct SNP association and its overall effect is equal to its direct effect (θk(l) = γg(k)). A SNP with an effect on the longitudinal risk factor l (βg(l) ≠ 0) and an effect on the time-to-event trait k (γg(k) ≠ 0) is interpreted as having both direct and indirect SNP associations via distinct genetic pathways. In this case, its overall effect is an aggregation of both direct and indirect effects (θk(l) = µg(k,l) + γg(k), with µg(k,l) = αl(k)βg(l)). It is obvious that when an associated longitudinal risk factor is omitted from a time-to-event model, as occurs in separate analyses of longitudinal and time-to-event traits, the marginal SNP effect on the time-to-event trait will reflect partially the overall SNP effect θk(l). This is particularly the case in GWAS when intermediate risk factors are ignored in the analysis models.
Effect estimation and test statistic construction
To mitigate the computational complexity involved in the maximization of the joint likelihood and allow more flexibility in the model formulated, we estimate the parameters using a two-stage approach (see Appendix). We work within the framework defined by Tsiatis and Davidian26,27 which guarantees that our estimators are consistent and asymptotically normal. Specifically, in Stage 1 we fit the multivariate mixed model (Equation 1) using the mvlme() function from the R package JoineRML28 to estimate the parameters of the longitudinal trajectories of the risk factors (Equation 2), and obtain fitted values of the smoothed trajectories. In Stage 2, we fit a Cox PH frailty time-to-event model (Equation 3) adjusting for functions of the smoothed trajectories as time-dependent covariates using the coxph() function from the R survival29,30 package. We assume a Gamma distribution for the frailty term (ui) and a separate baseline hazard function for each time-to-event trait using the strata argument in coxph(). To account for propagation of errors, due to the uncertainty in Stage 1 estimates ignored during Stage 2 parameter estimation16 and empirically estimate the joint covariance matrix of SNP and trajectories effects we apply a nonparametric bootstrap31. For each bootstrap sample b (1 ≤ b ≤ B, B is the total number of bootstraps), we generate a new dataset by randomly sampling N individuals with replacement and refitting the joint model on each new dataset b. We compute the empirical joint covariance matrix of the estimated parameters using the B bootstrap parameter estimates. Wald statistics for each βg(l) are computed as using the empirical bootstrap standard errors , and similarly for each γg(k) as .
Procedure to classify direct/indirect SNP associations
In Table 1, we present a practical hypothesis testing procedure to classify SNPs as having direct and/or indirect association with a time-to-event trait k and each associated longitudinal risk factor l. This procedure requires two significance thresholds, and for βg(l) and γg (k) respectively, to be specified prior to the analysis and adjusted for the number of SNPs tested. Depending on the research question, we can choose different values for and , or the same one . The latter is applicable for instance, when we want to systematically classify direct/indirect association among a set of M SNPs, and the former when assessing which SNPs, among the ones reported to be associated with the longitudinal risk factor, have a direct effect on a time-to-event trait.
Motivation: The DCCT Studies
The DCCT randomized-controlled trial was pivotal in demonstrating that intensive insulin therapy prevents and delays progression of long-term T1DC18. Due to significant outcome differences between the intensive and conventional treatment groups at interim analysis, the DCCT trial was stopped early. At the closeout visit, patients were administratively censored and had a mean follow-up time of 6.5 years (range 3 to 9), 99% had completed the study, and more than 95% of all scheduled examinations were completed18. DCCT participants continued to be followed for disease progression after the trial in a long-term epidemiologic cohort study and were genotyped for GWAS in the DCCT Genetics Study. Because the goal of intensive therapy was to reduce HbA1c into the non-diabetic range, which produced treatment differences in HbA1c values, we focus on N=667 unrelated individuals of European descent ancestry from the Conventional treatment group. Longitudinal measurements for HbA1c and SBP were collected at up to 39 quarterly visits during DCCT, while DR and DN events were diagnosed at annual and semi-annual visits respectively. HbA1c and SBP were recorded irrespective of the occurrence of any complication event(s). Although the DCCT implemented robust quality assurance procedures to minimize potential sources of error during and after data collection32,33, HbA1c and SBP are subject to measurement error (e.g. individual variability). The first GWAS to analyze the DCCT study phenotypes19 identified two SNPs associated with cumulative HbA1c at genome-wide significance in the Conventional treatment arm, namely rs10810632 (in BNC2, 9p22.2) and rs1358030 (near SORCS1, 10q25.1), and reported consistent associations with secondary outcomes of time-to-DR and/or time-to-DN. Subsequent association studies also reported suggestive evidence for pleiotropy between DR and DN34. To preserve the inherent within-individual variability, we designed simulation study evaluations of the joint modelling approach that generates event times from HbA1c and SBP measurements in the DCCT Study data.
Simulation study
Design of the DCCT-data-based simulation study
To assess the performances of the methods, we simulate the data under a genetic causal scenario that imitates the complex genetic architecture of T1DC (Figure 2). The genetic association model involves N=667 DCCT subjects with K=2 non-independent simulated time-to-T1DC traits (DR, DN) that depend on M=5 causal SNPs with direct effects on each time-to-T1DC and/or indirect effects via L=3 longitudinal QTs: two as measured in DCCT (HbA1c, SBP) and one other simulated (U) to induce some unexplained shared dependency between the T1DC traits. We also introduce an effect of sex on SBP, and effects of T1D duration (at baseline) on both T1DC traits, as observed in the original DCCT data.
Algorithm for realistic data generation under a complex genetic architecture
To generate such a data structure that combines observed and simulated traits, we formulate a data generating model including: (i) L=3 linear mixed models linking each of the SNPs with an indirect effect to each longitudinal risk factor, and (ii) K=2 non-independent parametric time-to-event models depending on SNPs with direct effects and on functions of the longitudinal QT trajectories. For each DCCT individual i with observed HbA1c (Yi(1)), SBP (Yi(2)), visit times (ti) vectors at ni time points, and covariates (Hi(l), Vi(k)), we simulate longitudinal trait vector Ui, time-to-event traits and genetic data at M causal SNPs as illustrated in Figure 3 and detailed in Supplementary Information SI-1. All SNPs are simulated under Hardy-Weinberg and linkage equilibrium assumptions. Particularly, SNPs with indirect effects are simulated from the observed (SNP1, SNP5) or simulated (SNP3) longitudinal QTs, while SNPs with direct effects (SNP2, SNP4) on time-to-T1DC traits are simulated independently of the longitudinal QTs and are included in the specified hazard function used to simulate each time-to-event trait k (Figure 3).
For each SNPm, we specify MAF (pm) and the SNP effects (βg(l), γg(k)), as well as the other parameter values according to the DCCT Genetics Study and the T1DC literature (Table 2, Tables S1). We assume contemporaneous association structures for HbA1c and SBP effects on simulated time-to-T1DC traits. We simulate each time-to-event trait k under a Weibull model, with shape and scale parameters specified to generate ∼54% DR events and ∼25% DN events in each of the R=1000 replicated datasets generated under the causal scenario from Figure 2. Under the global null genetic scenario, where none of the SNPs is associated with any traits, we simulate M SNPs independently of the traits with the same MAF as for the causal SNPs.
Scenarios for DCCT-based complex genetic architecture
Overall, the simulated complex genetic architecture covers multiple types of SNP-trait associations (Figure 2 and Table 2): direct association with each T1DC trait (SNP2, SNP4), indirect associations with both T1DC traits via measured (SNP1) and unmeasured (SNP3) longitudinal QTs; and direct and indirect associations via a measured longitudinal QT (SNP5); all longitudinal risk factors are subject to measurement error. Except for SNP3, all other SNP scenarios represent GWAS discovery SNPs detectable by separate GWAS of a longitudinal risk factor (SNP1, SNP5) or a time-to-event trait (SNP2, SNP4, SNP5).
SNP1, SNP3 and SNP5 have indirect effects on T1DC traits, such that their associations with the T1DC traits are detectable with the marginal Cox PH time-to-event models (Tables 2 and S2). SNP1 corresponds roughly to rs10810632 and rs1358030 reported in the motivating DCCT GWAS of HbA1c19, while SNP5 represents a top hit detected by each GWAS (longitudinal and time-to-event traits), analyzed separately. Because of their association with longitudinal and time-to-event traits, SNP1 and SNP5 associations with time-to-event outcomes are conventionally investigated using the Cox PH model adjusting for the observed longitudinal trait as a time-dependent risk factor to determine if SNP association with the time-to-event trait is fully (or partially) explained via the associated longitudinal QT(s).
Analysis of the simulated data
For the analysis of each SNP, done separately, we compare four analysis models on the simulated data:
JM-cmp: a completely specified joint model that includes all other non-SNP variables used for the data simulation.
JM-mis: includes the same variables as JM-cmp, except U, the unobserved longitudinal QT.
JM-sep(k): includes the same variables as JM-mis but is fitted separately for the two time-to-event traits. JM-sep(k) does not account for the shared dependency between the time-to-event traits.
CM-obs: a Cox PH frailty model that includes the same variables as JM-mis, but adjusts for the observed longitudinal QT values as time-dependent covariates.
Due to the latent nature of U, JM-cmp cannot be fitted in practice because U is unobserved, but we include it as a benchmark for comparison against the observable models fitted without U. For all these models, empirical covariance matrices are computed using 500 bootstrap iterations. We assess the performance of the hypothesis testing procedure to classify direct/indirect association for each of the 5 causal SNPs analyzed separately with each joint model (JM-cmp, JM-mis). To assess the impact of longitudinal trait measurement errors on classification of SNPs, we substitute test for γg(k) estimated by the joint model by obtained with CM-obs. In our evaluations, we assume the level of significance P∗ such that with P∗ varying from 0.05 to 5×10−8. Under the causal genetic scenario, we estimate for each SNP the proportion of replicates that detect the correct direct and/or indirect association simulated (correct classification rate) and the proportion of replicates where the causal SNP association is misclassified (misclassification rate). Under the global null genetic scenario, we assess the classification test for each SNP and each type of association at P∗ = 0.05 and P∗ = 0.01 to assess the type I error control of the hypothesis testing procedure for each test of direct and/or indirect association.
Availability
DCCT data are available to authorized users at https://repository.niddk.nih.gov/studies/edic/ and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000086.v3.p1. Example R codes for DCCT-data-based simulation and analysis of the simulated data are provided on GitHub (https://github.com/brossardMyriam/Joint-model-for-multiple-trait-genetics). File S1 includes the Supplementary Information; File S2 includes the Supplementary Figures and File S3 includes the Supplementary Tables.
Results
Simulation results
Estimation accuracy and single-parameter tests validity
When the longitudinal QT(s) are observed and measured with random errors, the simulation results confirm that the proposed joint model reduces the bias of the parameter estimates for all types of SNP association (Figure S1 and Tables S3-S6), even when the analysis model is not fully specified. This applies when: (i) two correlated time-to-event traits are analyzed jointly rather than separately in the joint model (JM-cmp/JM-mis compared to JM-sep(k)); (ii) a SNP has a direct effect on a time-to-event trait with a low event rate (SNP4/SNP5, JM-cmp/JM-mis compared to CM-obs); (iii) a SNP has an indirect effect via a longitudinal QT measured with errors (for example SNP5, JM-cmp/JM-mis compared to CM-obs). Particularly, for SNP5, CM-obs overestimates the direct SNP5 effect on time-to-DN due to its inability to fully account for the indirect SNP5 effect induced via SBP. Direct SNP2 and SNP4 effects on time-to-T1DC traits show overall the larger relative bias reduction using joint models versus CM-obs (Ranges with JM-mis 0.702-0.983, with JM-cmp 0.934-0.966, see Tables S4 and S6).
When a SNP has an indirect effect on both T1DC traits fully explained via an unmeasured longitudinal QT (SNP3), all methods that ignore the longitudinal QT (JM-mis, JM-sep(k) or CM-obs compared to JM-cmp) produce biased direct SNP3 estimates for both T1DC towards the overall SNP3 effect (θk(U) = αU(k)βg(U) =0.36, Figure S1-c and S1-d, Table S5). In this scenario, the use of the frailty term in JM-mis does not reduce this bias (JM-mis vs JM-sep).
Classification of direct/indirect SNP associations with time-to-event traits
Under the global null genetic scenario of no genetic association with any of the traits, single-parameter SNP test P-values and from the joint model do not show departure from the expected large sample distributions (χ2 with 1 degree of freedom (df), see Figure S2) and the type I error of each test under the null is reasonably well controlled (Table S7). The hypothesis testing procedure applied to single-parameter SNP tests, classifies each SNP as direct or indirect association with each T1DC trait with rates close to the nominal level (Table 3 and Figure S3). However, the classification rates for test of direct and indirect SNP association test appears less than nominal (see Table 3 for SNP5 with SBP/DN traits and Figure S3 for the other SNPs with other QT/time-to-event trait pairs).
Under the causal scenario, the hypothesis testing procedure has high correct classification rates for indirect (SNP1) or direct associations (SNP2, SNP4), as shown in Figures 4-a, 4-b and S4. Conclusions are similar for JM-mis, with attenuated correct classification rates compared to JM-cmp for direct SNP associations but still higher than classification rates based on CM-obs (particularly for SNP4). Compared to results based on CM-obs, the largest relative classification improvement is found for direct SNP4 association with DN (ranges: 1.18-251.9 with JM-cmp and 1.18-110.98 with JM-mis, Figure 4).
On the other hand, SNP3 with an indirect effect via the unobserved longitudinal trait U, is most frequently detected incorrectly as a direct association with each T1DC trait by both JM-mis and CM-obs (Figure 4-b). SNP5, which has both direct and indirect effects on DN induced via SBP, is most frequently misclassified by JM-cmp and JM-mis as an indirect association at significance levels varying from P∗ = 0.01 to 5×10−8 (Figure 4-c and 4-d). SNP5 misclassification rate even increases with P∗ decreasing to 5×10−8, due to the direct SNP5 . In comparison, the procedure based on CM-obs results most frequently correctly classify SNP5 association in more than 75% of the replicates at the same significance levels (Figures 4-c and 4-d); this is explained by the overestimated direct SNP5 effect, arising from measurement error in SBP as noted before.
In summary, our simulations show that by accounting for measurement errors in the longitudinal QT risk factors and for dependencies between/within the traits, the proposed joint model improves estimation accuracy and correct classification of associations of SNPs directly associated with each time-to-event trait or indirectly associated via a measured longitudinal QT in comparison to classification using the CM-obs approach. However, when a SNP has both direct and indirect effects on a time-to-event trait, the proposed testing procedure can be conservative since it requires the joint significance of the two SNP effects, βg(l) and γg(k), where the power of each test depends on effect size and the trait distribution. As a result, a SNP with a direct and an indirect association can be misclassified as either a direct or an indirect association. Finally, when a SNP has an indirect effect on both T1DC traits via an unmeasured QT, the testing procedure based on JM-mis, that captures some of the unexplained dependency between time-to-event traits through the frailty term, does not prevent misclassification as a direct association. This observation also shows the importance of the correct specification of the joint model - including all the intermediate QT(s) - to avoid misclassification of direct and/or indirect SNP associations.
Application in the DCCT Genetics Study data
We demonstrate the feasibility of the proposed approach by an application in the DCCT Genetics Study data. We use time to mild DR and time to persistent microalbuminuria, for DR and DN outcomes respectively, as previously defined in the motivating GWAS of HbA1c19 (see Supplementary information SI-2 for details). Genome-wide genotyping in DCCT subjects was performed using HumanCoreExome Bead Array (Illumina, San Diego, CA, USA) and standard quality controls procedures were applied to individuals and genetic markers19,35. Ungenotyped autosomal SNPs were imputed using 1000 Genomes data phase 336 (v5) and minimac337 (v.1.0.13), as previously described35.
Out of the 667 DCCT individuals, we analyze N=516 subjects with genetic data, without mild to moderate non-proliferative retinopathy or without DN event at DCCT baseline. By the time of the DCCT close-out visit, 297 (57.6%) experienced a DR event, 61 (11.8%) a DN event, including 47 subjects (9.1%) that experienced both events. After SNP filtering and pruning on linkage disequilibrium (see Supplementary Information SI-2 for details), we analyze 307 candidate SNPs reported as associated with HbA1c, SBP, and multiple definitions of DR and/or DN19,34,38–42 (see Table S8 for the full list of SNPs).
Compared to the DCCT-based simulated datasets, the proportion of DN events observed in the application is lower. We expect that the test power for direct SNP association with DN under the joint model to be reduced; this implies reduce chances to correctly classify a SNP with either a direct or a direct and indirect association with DN. We also observe larger effects of HbA1c on T1DC traits (α1(k), k=1, 2) and a smaller effect of SBP on DN (α2(2)), as shown in Tables 2 vs S9. Larger αl(k) values increase the contribution of the indirect effect (µg(l,k)) to the overall SNP effect (θk(l)), which can result in a direct SNP effect (γg(k)) more severely biased if the longitudinal QT is not accounted properly in the joint model, while lower αl(k) values reduce µg(l,k) and thus γg(k) is less severely biased. Overall, in the DCCT application, we expect a SNP with an indirect effect via HbA1c on any T1DC trait to be more subject to be misclassified as having both a direct and indirect association, compared to the simulation results, while a SNP with an indirect effect via SBP on DN is less subject to this misclassification.
Given prior evidence for cumulative effects of HbA1c on T1DC traits20,21, we compare joint model results obtained with contemporaneous, updated cumulative mean, and time-weighted cumulative HbA1c effects on T1DC (Supplementary Information SI-2 for details). We present results under the latter association structure since it has a stronger prior association with T1DC and in the DCCT individuals analyzed here (Table S9). We obtain similar conclusions with alternative association structures (Figure S5-a). As shown in Figure 5-a, rs10810632 and rs1358030 are classified as indirect associations with both T1DC traits via their association with HbA1c shared risk factor ( and , P∗=1.7×10−4 after Bonferroni correction for the 289.02 effective SNPs tested43). These two SNPs have calculated indirect effects (µg(1,k)) and 95% bootstrap confidence intervals that do not include 0 (Figure 5-b). As noted before, rs10810632 and rs1358030 were discovered previously in a GWAS of HbA1c in DCCT19. Classification of these two SNPs as direct and/or indirect associations may be affected by the Winner’s curse bias, due to the modest sample size in this analysis, and the same individuals contributing to discovery and classification44,45. For the other candidate SNPs which were discovered in larger and/or independent studies19,34,38–42, we find little evidence of association with the discovery traits in marginal analyses (Table S8).
Conclusions regarding classification of direct and/or indirect associations based on the procedure using from CM-obs are similar to those based on the joint model results for all SNPs (Figure S5-b), although we noticed some differences in ; that could be explained by the attenuated association of HbA1c and SBP with T1DC traits due to measurement errors ignored from CM-obs (Tables S10 vs S9).
Discussion
We present new methods for statistical genetic analysis under a joint model specification of multiple time-to-event traits and multiple longitudinal risk factors designed to characterize the complex genetic architecture of related traits in longitudinal studies of disease progression. The proposed model accounts for dependencies within and between traits and can account for trait-specific and shared covariates, measurement errors in the longitudinal traits, as well as effects of unobserved baseline confounding factors between the time-to-event traits through the subject-specific frailty term.
Evaluation by realistic data-informed simulation study of complex T1DC genetic architecture shows that the proposed joint model improves estimation accuracy and efficiency compared to the Cox PH frailty model adjusted for observed longitudinal trait values as time-dependent covariates. This improvement holds particularly when the longitudinal risk factors are measured with errors or the time-to-event outcome has a low event rate. Indeed, measurement error attenuates the association of the longitudinal risk factor with the time-to-event trait, and as a result, if a SNP has an indirect effect induced by the QT, the estimated direct SNP effect on a time-to-event is biased away from zero. The magnitude of this bias tends to be worse with larger measurement error14,15. Identification of direct and/or indirect SNP associations is also facilitated under joint modelling in the presence of risk factor measurement error. When, however, a SNP has an indirect effect via an unmeasured longitudinal risk factor, both the proposed and Cox PH model approaches produce biased estimates of the direct SNP effects. The extent of this bias depends on the magnitude of the effect sizes βg(l) and αl(k). Although the frailty term in the joint model time-to-event sub-model captures some of the unexplained shared dependency between the time-to-event traits, we did not observe bias reduction of the direct SNP estimate in the case of an unmeasured risk factor, which can lead to misclassification of the SNP as a direct association. This may be a limitation of using a time-invariant frailty term that does not fully account for the indirect SNP effect via the unmeasured longitudinal risk factor. Because joint significance of both SNP effects is required to classify a SNP as having both direct and indirect associations, a SNP with both direct and an indirect association can be misclassified as having either a direct or an indirect association if one of the SNP effects is too small to be detected at the specified level. The proposed procedure can be adapted to other research questions, for example to identify SNPs that have direct association among the SNPs found associated with a longitudinal risk factor. In this case, it may be desirable to apply a less stringent significance threshold for the test of the direct SNP effect while controlling the overall type I error. Alternatively, intersection-union tests, as based on the likelihood ratio test46, could be used to assess joint significance of βg(l) and γg(k) but would require computation of the full joint likelihood.
Application of the proposed joint model approach in longitudinal studies of disease progression, such as in the DCCT Genetics Study, improves classification of direct and/or indirect SNP association which can help to elucidate the genetic architecture of complex traits. In the context of mediation analysis, Liu et al47 discussed various formulations and interpretations of joint models, with shared-random-effects accounting for potential unmeasured baseline confounding factors between one longitudinal and one time-to-event traits. Using applications in datasets from two clinical trials, they illustrate interpretation of sensitivity analysis to unmeasured baseline confounders. Adaptation of the joint model we propose for multiple longitudinal and multiple time-to-event traits for mediation analysis requires extension of the mediation assumptions48,49 to the case of multiple mediators and multiple time-to-event traits. Specific evaluations of the proposed model under these assumptions are also warranted.
Although our primary aim in this report is to develop statistical methods to accurately distinguish among direct and/or indirect SNP associations with each time-to-event trait, the multi-trait aspect of the joint model lends itself to development of multi-trait SNP association testing for SNP discovery. In Supplementary Information SI-3, we present a joint-parameter test based on a generalized Wald statistic. In application to the simulated DCCT-based complex genetic architecture, we observe good type I error control under the global genetic null scenario, improved power for SNP discovery when a SNP has multiple trait effects, and power maintenance in other SNP association scenarios.
We acknowledge several features of the proposed joint model approach that warrant examination in further work. Firstly, to reduce computational complexity and improve model flexibility, we used two-stage parameter estimation. In some circumstances, this approach can produce biased estimates and/or underestimated standard errors16. Biased estimates can result from non-random censoring of the longitudinal trait values due to the occurrence of an event or from informative dropout50,51. Our simulation results show minimal biases in the absence of informative censoring, even when the joint model is mis-specified. In the DCCT application, characterized by administrative censoring and a high completion rate, these biases are of minimal concern because longitudinal trait values continued to be recorded regardless of the occurrence of any T1DC events; we estimated the trajectories using all the available measurements. Furthermore, we obtain robust bootstrap estimates of the covariance matrix, and simulation results under the null did not show deviation from expected distributions. Use of the bootstrap also facilitates joint testing of parameters from both stages. In the presence of informative censoring, we recommend sensitivity analysis using existing implementations joint likelihood estimation. To our knowledge those implementations only exist for simpler joint model formulation with either one longitudinal and one time-to-event trait52 or multiple longitudinal traits and one time-to-event outcome28,53. Secondly, because the joint model integrates longitudinal and time-to-event sub-models, model misspecification can occur in multiple ways and lead to invalid inference54. Therefore, we recommend a careful assessment of the model assumptions using joint model diagnostic tools23. Thirdly, patient visits were scheduled with high frequency in DCCT, so we ignored the modest degree of interval censoring in the current implementation of the joint model; when there are longer gaps between visits, extended methods are needed to account for interval censoring with additional simulation studies to assess impact on joint model estimates. Lastly, joint models are very computational demanding, particularly for genetic association studies that test millions of variants. In DCCT application, it took ∼60 seconds to fit the joint model for each SNP and ∼1090.32 more seconds (∼ 18 minutes) to estimate the covariance matrix with 500 bootstraps run in parallel on 4 nodes (each node with 40 CPU and 202 GB RAM). While analysis at the genome level involving, up to 85 million 1000G-imputed SNPs in DCCT, seems computationally unrealistic, a screening approach without bootstrap to select the SNPs based on their association P-values, followed by the bootstrap refinement would reduce the computational burden. Recent computationally efficient algorithms have been developed to improve the efficiency of the linear mixed model55 and Cox PH model56,57 for genetic association studies, but to date, they remain to be implemented for multivariate outcomes.
With the increasing development of national biobanks58–60 consisting of a large collection of phenotypes, environmental factors, biomarkers and other risk factors collected over time and linked with genetic data, we anticipate that joint model methods can be applied to characterize the genetic architecture of complex traits using these data resources61. In addition, the results of such analysis could contribute towards the translation of human genetic findings to personalized medicine by providing more efficient SNP effect estimates for increased precision in polygenic risk score development62, and causal inference using mediation and mendelian randomization studies. Lastly, the joint model framework can also enable dynamic prediction beneficial for dynamic risk assessment7,63 and optimization of intervention strategies64,65.
Data Availability
DCCT data are available to authorized users at https://repository.niddk.nih.gov/studies/edic/ and https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000086.v3.p1. Example R codes for DCCT-data-based simulation and analysis of the simulated data are provided on GitHub (https://github.com/brossardMyriam/Joint-model-for-multiple-trait-genetics).
https://repository.niddk.nih.gov/studies/edic/
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000086.v3.p1.
https://github.com/brossardMyriam/Joint-model-for-multiple-trait-genetics
Contributions
All named authors affirm that authorship is merited based on the International Committee of Medical Journal Editors (ICMJE) authorship criteria. MB, SBB and RVC designed the study. MB conducted the simulation study and data analysis and drafted the manuscript. SBB, RVC and OEG contributed to the data analysis and interpretation of the results and reviewed/edited the manuscript. ADP contributed to the acquisition of the data and reviewed/edited the manuscript. All authors approve the final version to be published. SBB and MB are the guarantors of this work.
Acknowledgements
This study uses the data provided by the Diabetes Control and Complications Trial / Epidemiology of Diabetes Interventions and Complications (DCCT/EDIC) Research Group which is sponsored by the National Institute of Diabetes and Digestive and Kidney Diseases contract (#N01-DK-6-2204), National Institute of Diabetes and Digestive and Kidney Diseases grants (#R01-DK-077510, #R01-DK077489, and #P60-DK20595). Funding for genotyping by Illumina HumanCoreExome was provided by JDRF (#17-2013-9). The authors are grateful to the subjects in the DCCT/EDIC cohort for their long-term participation. A complete list of the individuals and institutions participating in the DCCT/EDIC Research Group can be found in Supplementary Information. This project was supported by: CIHR Operating/Project Grants (#MOP-84287, #PJT159509), CANSSI postdoctoral fellowship (MB), CIHR STAGE fellowships (MB and OEG, #GET-101831). Computations were performed on the Niagara supercomputer at the SciNet HPC Consortium and Galen, the HPC facility at the Lunenfeld-Tanenbaum Research Institute (LTRI). SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto. The LTRI HPC facility was also supported by the Canada Foundation for Innovation.
Appendix
Inference under the joint likelihood can be based on joint maximum-likelihood estimation, Bayesian estimation of the joint likelihood parameters, or a two-stage strategy23. Joint maximum-likelihood and Bayesian methods enable joint parameter estimation but can be computationally challenging, especially for multiple traits. Two-stage approaches12–15 sequentially fit the longitudinal (Stage 1) and time-to-event (Stage 2) sub-models, with trajectory estimates from Stage 1 incorporated into Stage 2; this can be relatively computationally efficient and lends itself to flexible model formulation. However, as noted in introduction, inference can be mis-calibrated because of propagation of errors16.
Joint likelihood function
Let Ω be the full parameter vector containing all fixed parameters from the longitudinal and time-to-event sub-models respectively. We assume:
Given (A1)-(A4), it is appropriate to assume that the random effects bi account for: the association between the L longitudinal traits; the association between both the longitudinal and time-to-event outcomes66; and the serial correlation between the repeated measurements in each longitudinal process22; and that given (A4), the frailty term accounts for the residual dependency between the time-to-event traits24. Under these conditional independence assumptions, the joint likelihood function of the model parameters given the observed data is: where:
(Ti δi) = ((Ti(1), δi(1)) ′,…, (Ti(k), δi(k)) ′, …, (Ti(K), δi(K)) ′) ′
Yi = (Yi(1) ′, …, Yi(l) ′, …, Yi(L) ′) ′
, where q is the dimension of the D matrix
i.e. we assume ui∼Γ(a, b), with a, b>0.
Calculation of this full likelihood requires multivariate integration with respect to the random effect’s distribution, which can lead to demanding computation. When the random effect vector bi, has a small dimension, say less than 3, the integral can be evaluated via Gaussian quadrature which approximates the integral by a weighted sum of the target function evaluated at pre-specified sample points. However, when the dimension is larger, it is demanding to calculate the integrals with satisfactory approximation accuracy. Although a full likelihood specification enables rigorous study of asymptotic properties, its large sample approximation may not be accurate when sample size is small. In comparison, the Bayesian paradigm does not require asymptotic approximations, but the design of an efficient sampling algorithm to study the posterior distribution is challenging.
Likelihood functions under the two-stage approximation
Let Ω and Г be the vectors containing all fixed parameters from the longitudinal and time-to-event sub-models respectively.
Stage1: Multivariate mixed model
Stage2: Multivariate Cox PH model adjusted for fitted values of the trajectories
Where:
With
Unlike the joint likelihood function, where the shared random effects bi account for the dependencies between the longitudinal and the time-to-event traits, the two-stage approach accounts for the dependencies between the longitudinal and time-to-event traits via the fitted values of the longitudinal trajectories. As mentioned previously, this approximation can produce biased estimates and/or underestimated standard errors16, particularly in presence of non-random censoring of the longitudinal trait values due to the occurrence of an event or from informative dropout50,51 and/or because of propagation errors of Stage 1 parameters in Stage216. Indeed, informative missingness/dropouts lead in differential follow-up between patients with and without an event, the time-to-event processes are related to the length of the follow-up, and thus the random effects bi(l) can depend on the event times (e.g. patients who have an event early are more likely to have positive random slopes). However, in absence of informative dropouts/missingness, as we showed in our simulation studies, this approach has low bias and is computationally feasible for genetic association studies.
Abbreviations
- DCCT
- Diabetes control and complications trial
- DR
- diabetic retinopathy
- DN
- diabetic nephropathy
- GWAS
- genome-wide association study
- HbA1c
- Hemoglobin A1c
- LD
- linkage disequilibrium
- MAF
- minor allele frequency
- PH
- proportional hazards
- QT(s)
- quantitative trait(s)
- SBP
- systolic blood pressure
- SNP
- single nucleotide polymorphism
- T1DC
- type 1 diabetes complications