ABSTRACT
Multivariable Mendelian randomization (MVMR) is an instrumental variable technique that generalizes the MR framework for multiple exposures. Framed as a linear regression problem, it is subject to the pitfall of multi-collinearity. The bias and efficiency of MVMR estimates thus depends on the correlation of exposures. Dimensionality reduction techniques such as principal component analysis (PCA) provide transformations of all the included variables that are effectively uncorrelated. We propose the use of sparse PCA (sPCA) algorithms that create principal components of subsets of the exposures and may provide more interpretable and reliable MR estimates. The approach consists of three steps. We first apply a sparse dimension reduction method and transform the variant-exposure summary statistics to principal components. We then choose a subset of the principal components based on data-driven cutoffs, and estimate their strength as instruments with an adjusted F-statistic. Finally, we perform MR with these transformed exposures. This pipeline is demonstrated in a simulation study of highly correlated exposures and an applied example using summary data from a genome-wide association study of 118 highly correlated lipid metabolites. As a positive control, we tested the causal associations of the transformed exposures on CHD. Compared to the conventional inverse-variance weighted MVMR method and a weak-instrument robust MVMR method (MR GRAPPLE), sparse component analysis achieved a superior balance of sparsity and biologically insightful grouping of the lipid traits.
Key Messages
In multivariable MR, investigation of multiple highly correlated exposures can hinder the efficiency of the estimators and mask true associations.
Dimensionality reduction approaches such as principal component analysis (PCA) appear to be effective in summarising the variant-exposure summary statistics data in an example of correlated metabolite data.
Sparse PCA approaches have the additional benefit of providing interpretable PCs as only a few exposures contribute to each. This benefit is shown in simulation studies where there is a gain in accuracy over PCA.
In a positive control analysis, the sparse PCs were representing biologically meaningful groups of metabolites (VLDL, LDL, HDL) and in general were associated with coronary heart disease in an anticipated mannert.
Introduction
Mendelian randomization (MR) uses genetic variants as instrumental variables (IVs) to investigate the causal effect of a genetically predicted exposure on an outcome of interest1. The random allocation of genetic variants at conception helps overcome the environmental confounding that can hinder traditional epidemiological study designs based on observational associations. Genetically predicted levels of lifelong exposures are also less likely to be affected by reverse causation, as genetic variants are allocated before onset of the outcomes of interest.
When evidence suggests that multiple correlated phenotypes may contribute to a health outcome, multi-variable MR (MVMR), an extension of the basic univariable approach, can disentangle more complex causal mechanisms and shed light on mediating pathways. An example of this would be investigation into the effect of various lipid traits on coronary heart disease (CHD) risk2. While MVMR can model correlated exposures, it performs suboptimally when there are highly correlated exposures due to multi-collinearity. This can be equivalently understood as a problem of conditionally weak instruments3. That is, the genetic variants used to proxy all exposures have to be strongly associated with each exposure conditionally on all the other included exposures. An assessment of the extent to which this assumption is satisfied can be made using the conditional F-statistic, with a value of 10 being considered sufficiently strong3. In settings when multiple highly correlated exposures are analysed, a set of genetic instruments are much more likely to be conditionally weak instruments. In this event, causal estimates can be subject to extreme bias and are therefore unreliable. Estimation bias can be addressed to a degree by fitting weak-instrument robust MVMR methods4,5, but usually at the cost of a further reduction in precision. Furthermore, MVMR models investigate causal effects for each individual exposure, under the assumption that it is possible to intervene and change each one whilst holding the others fixed. In the high dimensional, highly correlated exposure setting, this is potentially an unachievable intervention in practice.
Our aim in this paper is instead to use dimensionality reduction approaches to concisely summarise a set of highly correlated genetically predicted exposures into a smaller set of independent principal components (PCs). We then perform MR directly on the PCs, thereby estimating their effect on health outcomes of interest. We additionally suggest employing sparsity methods to reduce the number of exposures that contribute to each principal component, in order to improve their interpretability in the resulting factors.
Using summary genetic data for multiple hightly correlated lipid fractions and CHD6,7, we first illustrate the pitfalls encountered by the standard MVMR approach. We then apply a range of sparse PCA methods within an MVMR framework to the data. Finally, we examine the comparative performance of the sparse PCA approaches in a detailed simulation study, in a bid to understand which ones perform best in this setting.
Methods
Workflow Overview
Our proposed analysis strategy is presented in Figure 1. Using summary statistics for the singlenucleotide polymorphism (SNP)-exposure (γ) and SNP-outcome (Γ) associations, where γ exhibits strong correlation, we initially perform a principle component analysis (PCA) on γ. Additionally, we perform multiple sparse PCA modalities that aim to provide sparse loadings that are more interpretable (block 3, Fig. 1). The choice of the number of principal components (PCs) is guided by permutations or an eigenvalue threshold. Finally, the PCs are used in place of γ in an IVW MVMR meta-analysis to obtain an estimate of the causal effect of the PC on the outcome. Similar to PC regression and in line with unsupervised methods, the outcome (SNP-outcome associations (Γ) and corresponding standard error (SEΓ)) is not transformed by PCA and is used in the second-step MVMR in the original scale. In the real data application and in the simulation study, the best balance of sparsity and statistical power was observed for the method of sparse component analysis (SCA)8. This favoured method and the related steps are coded in an R function and are available at github (https://github.com/vaskarageorg/SCA_MR/).
Data
The risk factor dataset reports the associations of 148 genetic variants (SNPs) with 118 NMR-measured metabolites6. This study reports a high-resolution profiling of mainly lipid subfractions. Fourteen size/ density classes of lipoprotein particles (ranging from extra small (XS) high density lipoprotein (HDL) to extra-extra-large (XXL) very low density lipoprotein (VLDL)) were available and, for each class, the traits of total cholesterol, triglycerides, phospholipids, and cholesterol esters content, and the average diameter of the particles were additionally provided. Through the same procedure, estimates from NMR on genetic associations of amino acids, apolipoproteins, fatty and fluid acids, ketone bodies and glycerides were also included. Instrument selection for this dataset has been previously described10. Namely, 185 variants were selected, based on association with either one of: LDL-cholesterol, HDL-cholesterol or triglycerides in the external sample of the Global Lipid Genetics Consortium at the genome-wide level p < 5 × 10−8)11. Then, this set was pruned to avoid inclusion of genes in linkage disequilibrium (threshold: r2 < 0.05) and filtered (variants in distance less than 1 megabase pairs were excluded) resulting in the final set. This pre-processing strategy was performed with a view to study CHD risk.
Positive control outcome assessment is recommended in MR as an approach of investigating a risk factor that has an established causal effect on the outcome of interest12. We used CHD as a positive control outcome, given that lipid fractions are known to modulate its risk, with VLDL-and LDL-related traits being positively associated with CHD and HDL-related traits negatively12. Summary estimates from the CARDIoGRAMplusC4D Consortium and UK Biobank meta-analysis7,13 were used. For both datasets, a predominantly European population was included, as otherwise spurious results could have arisen due to population-specific, as opposed to CHD-specific, differences14.
MR Assumptions
The first assumption (IV1) states that IVs should be associated with the exposure. The second assumption (IV2) states that the IVs should be independent of all confounders of the risk factor-outcome association (IV2) and, finally, independent of the outcome conditional on the exposure and the confounders (IV3). The validity of the final assumption cannot be directly tested with the available data. For the inferences to be valid, it is necessary that the three IV assumptions apply15. In situations when IV3 is not deemed likely, additional risk factors that are possible mediators can be introduced in a multivariable MR (MVMR) model2. Additional assumptions should hold for MVMR results to be valid. Particularly, a) all included exposures have to be associated with at least one of the IVs, and, b) there should be no evidence of multicollinearity. If there is significant overlap (if G1 and G2 associate with both the exposures in Fig. 2), this may result in conditionally weak instruments4. The latter assumption and the way it limits eligibility of high-dimensional, correlated exposures is a key motivation for the present work.
Univariable (UVMR) & Multivariable MR (MVMR)
To examine how each of the metabolites is causally associated with CHD, a univariable MR (UVMR) approach was first undertaken under the two-sample summary data framework. This uses SNP-exposure and SNP-outcome GWAS summary statistics ( and respectively), obtained by regressing the exposure or outcome trait on each SNP individually. Usually, an additional adjustment has been made for variables such as age, gender, and genetic principal components. A UVMR analysis is the most straightforward way to obtain an estimate for the causal effect of the exposure, but is only reliable if all SNPs are valid instruments. However, in the Kettunen dataset, where the exposure set comprises 118 highly correlated lipid fractions, one variant may influence multiple exposures. This will generally invalidate a UMVR analysis and a multivariable MR (MVMR) approach is more appropriate.
Estimating equation (1) provides a general means for representing the mechanics of a UMVR or MVMR analysis: where
represents the association of SNP j with exposure k, with variance ;
represents the association of SNP j with the outcome, with variance ;
βk represents the causal effect of exposure k on the outcome to be estimated.
σlk j represents .
In a UMVR analysis there is only one exposure, so that K=1, whereas in a MVMR analysis K ≥ 2 (or in the case of the Kettunen data, K=118). In an IVW UMVMR or MVMR analysis, the causal effect parameters estimated are obtained by finding the values that minimise QA, under the simplifying assumption that . This is justified if all causal effects are zero, or if the uncertainty in the SNP-exposure associations is negligible (the No Measurement Error (NOME) assumption). In the general MVMR context, the NOME assumption is approximately satisfied if the conditional F-statistics (CFS) for each exposure are large, but if it is not, then IVW MVMR estimates will suffer from weak instrument bias. For exposure k, CFSk takes the form where: and where the are obtained by OLS regression. CFS statistics will generally be small (indicating weak instrument bias) whenever there is a high degree of correlation between the SNP-exposure association estimates. Such a high correlation is an indication of multi-colinearity. This would be the case in an MVMR analysis of all 118 lipid fractions in the Kettunen data. One option to address this would be to use weak instrument robust MVMR, such as MR-GRAPPLE5. This method performs estimation using the full definition of in (1). It can work well for MVMR analyses of relatively small numbers of exposures (e.g. up to 10) and relatively weak instruments (CFS statistics as low as 3), but the dimension and high correlation of the Kettunen data is arguably too challenging. This motivates the use of dimension reduction techniques, such as PCA.
Dimension reduction via PCA
The PCA method used was a singular value decomposition of the p × K matrix of SNP-exposure associations as: where U and V are orthogonal matrices and D is a square matrix whose diagonal values are the variances explained by each component and all off diagonal values are 0. V is the loadings matrix and serves as an indicator of the contribution of each metabolite to the transformed space of the PCs. The matrix UD (PCs/ scores matrix) is used in the second-step IVW regression in place of γ. As V estimation does not necessarily guarantee sparsity, certain exposures can contribute to multiple components, making the interpretation more complicated. Therefore, we assessed multiple sparse PCA methods that facilitate this.
Sparse PCA (Zou et al)
Sparse PCA by Zou et al.16 estimates the loadings matrix through an iterative procedure that progressively penalises exposures so that they do not contribute to certain PCs. In principle, this leads to a more clear picture for the consistency of each PC. This is performed as follows
1. Setting a fixed matrix, the following elastic net problem is solved
ξj = argminξ (αj − ξ)T γT γ(αj − ξ) + λ1 j∥ξ ∥ + λ ∥ξ ∥2, where j is the PC.
2. For a fixed Ξ, γT γΞ = UDVT is estimated and update A = UVT
3. Repeat until convergence
As a result of the additional λ1∥ξ ∥ norm, there is sparsity in the loadings matrix and only some of the SNP-exposure associations γ contribute to each PC, specifically a particular subset of highly correlated exposures in γ.
RSPCA
This approach differs in that it employs a robust measure of dispersion (Qn estimator)17,18. As above, an L1 norm is used to induce sparsity. For optimisation, the Tradeoff Product Optimization (TPO) was maximised. Its advantage is that it does not impose a single λ value on all PCs, thus allowing different degrees of sparsity. The maximum number of iterations was set at 100. In our application of a robust measure of dispersion for transforming the Kettunen data, this could result in data such that outlying associations may not distort the resulting transformations. Outliers in this context are defined only in γ, in contrast with the median and mode methods19,20.
Sparse Fused Principal Component Analysis (SFPCA)21
A method that can in theory exploit distinct correlation structures. The goal is to derive a loadings matrix in which highly positively correlated variables are similar in sign and highly negative ones are opposite. Similar magnitudes also tend to be obtained for those variables that are in the same blocks in the correlation matrix. Like the optimisation in Zou et al.16, the following criterion is used s.t. AT A = IK. The ‘fused’ penalty (last term) purposely penalises discordant loadings for variables that are highly correlated. The choice of the sparsity parameters is based on a BIC criterion.
Sparse Component Analysis (SCA)
SCA8 is motivated by the relative inadequacy of the classic approaches in promoting significant sparsity. The authors present a solution of the variance maximisation problem in which the eigenvectors are rotated in a way that approximate sparsity is achieved. This is prompted by the fact that the potential loadings for the problem of variance maximisation in sPCA form an entire space if the orthogonality constraint does not hold and, thereby, a loadings solution can be rotated so that it’s (approximately) sparse and explains the same proportion of variance. Then, a soft-thresholding approach is applied on the loadings matrix through constraint with a sparsity parameter. The simulation studies support the increased sparsity in a dataset of gene expression, among other examples8.
Choice of Components
In all dimensionality reduction approaches, there is no upper limit to how many transformed factors can be estimated. However, only some of them are informative in the sense of explaining a meaningful proportion of variance in the original dataset. For this purpose, a permutation-based approach was implemented22. The γ matrix was randomly reshuffled (permuted) and the sparse PCA method of interest was applied in the permuted set. In this case, the eigenvalue that is computed reflects the eigenvalue obtained under the null of a non-informative component. This process is repeated multiple times (perm = 1000) and the mean eigenvalue for all components is stored. Finally, the sparse PCA method is performed in the original γ matrix and whichever component has an eigenvalue larger than the mean of the permuted sets is considered informative.
Due to the computational complexity of the permutation method, particularly for SFPCA, an alternate method -the Karlis-Saporta-Spinakis (KSS) criterion9 - was also used. This is based on a simple correction on the minimum non-trivial eigenvalue (). The authors show that the method is not sensitive to non-normal distributions9. Although KSS was not compared with the above described permutation approach, it performed better than simpler approaches, such as choosing those PCs whose eigenvalue is larger than 1 (Kaiser criterion), the broken stick method23 and the Velicer method24.
Instrument Strength of PCs
In MVMR, the IV 1 assumption requires a set of genetic variants that robustly associate with at least one of the exposures X (MR Assumptions). This is quantified by the F-statistic and CFS4. As a rule of thumb, if the F-statistic surpasses the cutoff of 10, the transformed exposure is strongly instrumented. With summary statistics of the SNP-X associations γp,k (p: SNP, k: exposure), the mean F−statistic for the exposure k is estimated as Since we transform γ and obtain a matrix of lower dimensionality, formula 3 can’t be used as there is no longer a one-to-one correspondence of the SEs with the PCs. Likewise, a conditional F−statistic for the PCs also cannot be computed for this reason. We aim to arrive at a modified formula that bypasses this issue. For this purpose, we take advantage of two concepts, first an expression of the F−statistic for an exposure k (Fk) in matrix notation and, second, the use of this expression to estimate F−statistics for the PCs (FPC) from γ decomposition.
We make the assumption that the uncertainty in the γG,XK estimates is similar in all K exposures, i.e. γG,X uncertainty estimates do not substantially differ among exposures. This is not implausible as the uncertainty is predominantly driven by sample size and minor allele frequency25. In specific, the authors of25 show that , where MAF is the minor allele frequency, nk is the sample size in exposure k and Var(Xk) is the phenotypic variance. What this means is that, in experiments such as6 where nk is the same across all exposures and Var(Xk) can be standardised to 1, the main driver of differences in Var(γXK) is differences in MAF. As MAF is the same for each SNP across all exposures, the collation of SEs across exposures per SNP is well motivated.
We can then define a matrix Σ as follows. The elements in the diagonal represent the mean variance of γ for each SNP and all off-diagonal elements are zero. What is achieved through this is a summary of the uncertainty in the SNP-X associations that is not sensitive on the dimensions of the exposures. Instead of Eq. 3, we can then express the vector of the mean F-statistics for each exposure F1−K = [F1, F2, …, FK] as where γ is the matrix of the SNP-exposure associations. In a simulation study, we generate data under the mechanism in Figure 3a. The strength of association is different in the three exposures. It is observed that the estimates with both methods (Eq. 4 and Eq. 3) align well (Figure 3b), supporting the equivalence of the two formulae.
Our second aim is to use this matrix notation formula for the F−statistic to quantify the instrument strength of each PC with the respective F−statistic (FPC). As presented above, we are not limited by the dimensions of point estimates and uncertainty matching exactly and we can use the formula in Eq. 4 and substitute γ with the PCs. For the PCA approach, where we decompose γ as γ = UDVT and carry forward M << K non-trivial PCs, we use the matrix UD in place of γ. Then, the mean FPC can be estimated as follows. The vector contains the FPC statistics for the M PCs. In a similar manner, we estimate FPC for the sparse PCA methods but, instead of the scores matrix UD, we use the scores of the sparse methods. We illustrate the performance of this approach in a simulation study with an identical configuration for exposure generation as the one presented in Figure 8. In a configuration with b = 6 blocks of p = 30 genetically correlated exposures (Figure 7), we vary the strength of association γ per block. This way, the first block has the highest strength of association and the last block the lowest, quantified by a lower mean F−statistic in the exposures of this block (red, Figure 4). The instrument strength of the PCs and the sPCs follow closely the corresponding F−statistics of the individual exposures; in other word, in a PC of five exposures, and F1−5 align well (Figure 4).
Results
Univariable MR (UVMR) & Multivariable MR (MVMR)
A total of 67 traits were associated with CHD at or below the Bonferroni-corrected level (p = 0.05/118, Table 2). Two genetically-predicted lipid exposures (M.HDL.C, M.HDL.CE) were negatively associated with CHD and 65 were positively associated (Table 1). In a MVMR model including only the 67 Bonferroni-significant traits, fitted with the purpose of illustrating the instability of IVW-MVMR in conditions of high collinearity, the conditional F-statistic (CFS)4 was lower than 2.2 for all exposures (with a mean CFS of 0.81), highlighting the weak instrument problem. The causal estimates for two traits (ApoB, M.LDL.PL) reached significance (OR (95% CI): 1.031(1.012, 1.37) and 0.904(0.881, 0.923) (Fig. 9). For M.LDL.PL, its UVMR estimate (1.52(1.35, 1.71), p < 10−10)) was of opposite sign to the MVMR estimate. Since these traits are correlated (r2 = 0.89), we interpret this change of effect direction as an artefact of multicollinearity. Additionally, the ability to detect any further causal effects is likely masked by variance inflation. To see if the application of a weak-instrument robust MVMR method could improve the analysis, we applied MR GRAPPLE5). Although the method did not identify any of the exposures as significant at Bonferroni-adjusted significance level, the estimate for M.LDL.PL is still negative but closer to zero and not statistically significant. The only trait that retains statistical significance is ApoB.
PCA
Standard PCA with no sparsity constraints was used as a benchmark. PCA estimates a square loadings matrix of coefficients with dimension equal to the number of exposures K. The coefficients in the first column define the linear combination of exposures with the largest variability (PC1). Column 2 defines PC2, the linear combination of exposures with the largest variability that is also independent of PC1, and so on. This way, the resulting factors seek to reduce redundant information and project highly correlated SNP-exposure associations to the same PC.
In PC1, VLDL-and LDL-related traits were the major contributors (Figure 5a). ApoB received the 8th largest loading (0.1343, maximum was 0.137 for cholesterol content in small VLDL) and LDL.C received the 49th largest (0.1122). In PC2, HDL-related traits were predominant. The first 18 largest positive loadings are HDL-related and 12 describe either large or extra-large HDL traits. PC3 receives its scores mainly from VLDL traits.
Seven components were deemed significant through the permutation-based approach (Fig. 1, Methods). This method randomly permutes γ and performs the PCA decomposition on each rearranged dataset. Eigenvalues are calculated for each PC based on the permuted matrix. The PCs with an eigenvalue higher than the mean of the permuted sets are deemed informative/ non-trivial and are retained for further analyses. In the second-step IVW regression (Fig. 1), a modest yet significant (OR = 1.002(1.0014, 1.0024), p < 10−10) association of PC1 with CHD was observed. Conversely, PC3 was marginally significantly associated with CHD (OR = 0.998 (0.998, 0.999), p = 0.049). Since γ has been transformed with linear coefficients (visualized in loadings matrix, Fig. 5), estimation of the magnitude of an association is not straightforward; however, significance and orientation of effects can be interpreted. Specifically, when positive loadings are applied to a block of correlated exposures that positively associate with the outcome, the MR estimate is positive; conversely, if negative loadings are applied, the MR estimate is negative.
Sparse PCA methods
We next employed multiple sparse PCA methods (Table 3) that each shrink a proportion of loadings to zero. The way this is achieved differs in each method. Their underlying assumptions and details on differences in optimisation are presented in Table 3 and further described in Methods.
The loadings for all methods (Figures 5a-e) are presented for LDL.c, a previously prioritised trait27. We do not expect the loadings for the same component to be numerically identical across methods, eg. an HDL-dominant component may be ranking second (in variance explained) in one method and fourth in another. Ideally, we would like to see metabolites contributing to a a small number or only one component. Using a visualisation technique proposed by Kim et al.28, this is indeed observed in Fig. 12. This visualisation was also repeated for urea, a metabolite that was assumed to be of little contribution in the overall variance of the (mainly lipid trait-focused) NMR dataset. Urea receives loadings of exactly zero or near zero in all methods (Fig. 12), thus serving as a successful negative control. In the second-step IVW meta-analysis, it appears that the PCs that comprise of predominantly VLDL/LDL and HDL traits robustly associate with CHD, with differences among methods (Table 4).
Sparse PCA (Zou et al., 2006)
This method adapts the PCA decomposition to a ridge regression and adds an L1 norm to introduce sparsity16. In order to get an estimate for the degree of sparsity to be induced, a grid search was performed. Multiple sPCA models with different numbers of sparse traits per component were generated and the lowest Bayesian Information Criterion (BIC) was achieved for a total of six metabolites not regularised to 0 per each sPC. While the minimum BIC is achieved when one metabolite is not regularised to 0, 6 non-zero loadings were preferred as this appears to be an inflection point (Figure 11). Under this level of sparsity, the permutation-based approach suggests that seven sPC’s are to be retained. Seventy-one exposures received a zero loading across all components. The first sPC is comprised of VLDL traits, ApoB and triglycerides (Fig. 5b) and is estimated to exert a significant causal effect on CHD (OR (95% CI): 1.003(1.0003, 1.005)). The sPCs 2 to 7 were not associated with CHD (range: 0.999(0.998, 1.001), 1.003(1.000, 1.062), all p values > 0.05).
RSPCA
This approach is due to Croux et al.26. Apart from sparsity, the authors also use an alternative measure of dispersion, the Qn estimator18, that is robust to outliers in γ. Optimisation and the Karlis-Saporta-Spinakis (KSS) criterion pick seven components to be informative9. KSS is a less computationally expensive method (compared to the permutation method) for defining an eigenvalue threshold for PC inclusion9. The overlap of the PCs is visualised in Figure 5c. Large positive loadings are observed for VLDL-and LDL-related traits in sPC1 and five HDL-related traits are regularised to 0. LDL.C and ApoB received the 9th and 12th largest positive loadings. In the second sPC, HDL traits are mainly represented. None of the sPCs are robustly associated with CHD.
Sparse Fused PCA (SFPCA)
SFPCA works by assigning highly correlated variables the exact same loadings as opposed to numerically similar ones (Figure 5d). This is achieved with two norms in the objective function: λ1 which regulates the L1 norm that induces sparsity and λ2 for the L2 regularisation (squared magnitude of γ) to guard against singular solutions. A grid search is used to identify appropriate parameters for λ1 and λ2. Applying the KSS criterion indicated that 6 sPCs should be retained. The loadings matrix is presented in Figure 5d and the identical coloring visualises the unique loadings in sPCs 1, 3 and 4, with only one metabolite (M.HDL.P) received a zero loading in sPC 1. Two large clusters are apparent; one with HDL-related molecules and one with VLDL-, LDL-, IDL-related and other molecules. Both ApoB and LDL.C received the largest negative loading (−0.11). This was also received by 43 other metabolites (26 VLDL-related, 9 LDL-related, 6 IDL related, Serum TG, S.HDL.TG). The first 2 sPC’s were significantly negatively associated with CHD (0.998(0.997, 0.999), p value = 6.63x10−7 and 0.9989(0.9980, 9998), p value = 0.016). Our results show that we generally retain the expected direction of effects by examining the factor loadings. Here, the component that consists of the weighted sum of VLDL and LDL related traits (sPC1) is negatively associated with CHD because of the negative loadings (Fig. 5d, in red).
Sparse Component Analysis (SCA)
SCA performs an initial rotation of the obtained loadings in order to achieve approximate sparsity8. Seven components were retained after a permutation test. In the final model, more than 60 metabolites were regularised to zero in all components. In the heatmap, little overlap is noted among the metabolites (Figure 5e). The first sPC mainly receives loadings from LDL and IDL, sPC2 from VLDL and sPC3 from large and extra-large HDL traits. sPC6 gathers the largest loadings for LDL, the fifth for VLDL and the contribution of HDL is split in sPC’s 3 (large and extra-large HDL traits) and 4 (small and medium HDL traits). sPC1 was positively associated with CHD (OR (95% CI), p value:1.013(1.009, 1.017), < 10−8). sPC4, receiving loadings mainly from medium-sized HDL molecules, was negatively associated with CHD (0.997(0.995, 0.999), 0.02), while sPC3 (large and extra-large HDL traits), was positively associated.
Comparison with Univariable MR
In principle, all PC methods derive independent components. This is strictly the case in PCA, where subsequent PCs are orthogonal, and is relaxed to some degree in sparse implementations. We hypothesised that UVMR and MVMR could provide similar causal estimates of the associations of metabolite PCs with CHD. The results are presented in Figure 6 and concordance between UVMR and MVMR is quantified with the R2. The largest agreement of the causal estimates is observed in PCA. In the sparse methods, SCA8 and sPCA16 provide similarly consistent estimtates, whereas some disagreement is observed in the estimates of PC4 and
Instrument Strength
Instrument strength for the chosen PCs was assessed via an F-statistic, calculated using a bespoke formula that accounts for the PC process (see Methods and Appendix). The F-statistics for all transformed exposures cross the cutoff of 10. There was a trend for the first components being more strongly instrumented in all methods (see Figure 10). In the individual exposures, the conditional F-statistic for all exposures is lower than three for all exposures. Thus it appears that, while individual exposures are weakly jointly instrumented with the current instrument, PCs in contrast exhibit an improved instrument strength.
Simulation Study
We investigated the efficiency and reliability of various PCA methods compared to standard MVMR and Bonferroni-corrected MVMR in a simulation setting. K = 30−60 exposures within b = 4−6 blocks were generated. Exposures within blocks were genetically correlated and all exposures were correlated due to shared environmental confounding. A simplified causal diagram consistent with the data generating mechanism is presented in Fig. 7 for the case of only K = 6 exposures and b = 2 blocks. Only some of the exposures exert a true causal effect on the outcome. (e.g. X 1 and X 2 in Fig. 7).
In order to compare the results of a PC regression with the results from an MVMR with all exposures, we interpret the dimensionality reduction methods as tests. We then define true positive, true negative, false positive and false negative results. For the MVMR method, a true positive (TP) is an exposure that is causal in the underlying model (i.e X1 and X2 in Fig. 7) and is identified as statistically significant by MVMR. Likewise, a true negative (TN) would be a non causal exposure (e.g. X4) that is not statistically significant in the MVMR model.
In the PCA and sparse PCA methods, this classification is determined with respect to the causal estimate of a specific PC. If a PC is determined by the correlated exposures X1, X2 and X3 and it has a statistically significant causal estimate, then X1 and X2 are true positives and X3 is false positive. Similarly, if the PC constructed by X4, X5 and X6 is not significant in an MVMR analysis, all three would count as true negatives as they do not causally affect Y.
Across a range of simulation scenarios we collated summary data on TP, FP, TN and FN rates for each method, by averaging over the individual results for each exposure. This included varying the sample sizes (n = 1000 40000, which in turn corre-sponded to a mean instrument strength in γ of between and mean conditional F statistic . We then derived the sensitivity (SNS), specificity (SPC) and false positive rate (FPR) as and compared all methods by their area under the receiver-operating characteristic (ROC) curve (AUC) and the Youden’s index J (J = SNS + SPC−1). Specifically, the AUC was derived from the extrapolated ROC curve using the method of Reitsma et al.29, which jointly meta-analyses SNS and SPC.
Two sparse PCA methods (SCA8, sPCA16) consistently achieve the highest AUC (Fig. 8). This advantage is mainly driven by an increase in sensitivity for both these methods compared with MVMR. A closer look at the individual simulation results corroborates the discriminatory ability of these two methods, as they consistently achieve high sensitivities and most 14. Both standard and Bonferroni-corrected MVMR performed poorly in terms of AUC (AUC 0.660 and 0.885 respectively), due to poor sensitivity. While there was a modest improvement with PCA (AUC 0.755), it appears that PCA and RSPCA do not accurately identify negative results (median specificity 0 and 0.28 respectively). This extreme result can be understood by looking at the individual simulation results in Figure 14; both PCA and RSPCA cluster to the upper right end of the plot, suggesting a consistently low performance in identifying true negative exposures. In specific, the estimates with both these methods were very precise across simulations and this resulted in many false positive results and low specificity. We note a differing performance among the top ranking methods (SCA, sPCA); while both methods are on average similar, the results of SCA are more variable in both sensitivity and specificity (Table 5). The Youden indexes for these methods are also the highest (Fig. 8a).
Even with large sample sizes ( for a sample size n = 6000), MVMR can still not discriminate between positive and negative exposures as robustly as the sparse PCA methods. A major determinant of the accuracy of these methods appears to be the number of truly causal exposures, as in a repeat simulations with only four of the exposures being causal there was a drop in sensitivity and specificity across all methods. Sparse PCA methods still outperformed other methods in this case, however (Table 6).
Discussion
We propose the use of sparse PCA methods in MVMR in order to reduce high-dimensional exposure data to a low number of PCs and infer the latter’s causal contribution. As the dimensionality of available data sets for MR investigations increases (e.g. in NMR experiments30 and imaging studies), such approaches are becoming ever more useful. Our results support the notion that sparse PCA methods retain the information of the initial exposures. Specifically, the SCA8 and sPCA16 methods, performed best in simulation studies, whereas the SCA approach performed best in the positive control example of lipids and CHD.
By employing sparse PCA methods in a real dataset6, we show that the resulting PCs group VLDL, LDL and HDL traits together, whilst metabolites acting via alternative pathways receive zero loadings. This is a desirable property and indicates that the second-step MR enacted on the PCs obtains causal estimates for intervening on biologically meaningful pathways.31. This is in contrast with unconstrained PCA, in which all metabolites contribute to all PCs. Previously, Sulc et al. used PCA in MR to summarise highly correlated anthropometric variables32. To our knowledge, this is the first investigation of different sparse PCA modalities in the context of MR. We additionally provide a number of ways to choose the number of components in a data-driven manner. Our proposed approach of a sparse PCA method could potentially reduce overlap across components; for instance, in the paper by Sulc et al., there are PCs that are driven in common by trunk, arm and leg lean mass, basal metabolic rate and BMI. This is an important topic for further research.
This approach is conceptually different from the robust methods that have been developed for standard multivariable MR in the presence of weak instruments, such as MR GRAPPLE5. Our method reduces the need for a pre-selection of which exposures to include in a MVMR model. We present a complementary workflow through which we can include all available exposures with no prior selection, collate them in uncorrelated and interpretable components and then investigate the causal contribution of these groups of exposures and avoids the risk of generating spurious results in such an extreme setting of high collinearity compared with MVMR IVW and MR GRAPPLE formulations. For example, a 2019 three-sample MR study that assessed 82 lipoprotein subfraction risk factors’ effects on CHD used a UVMR and a robust extension of MVMR. A positive effect of VLDL-and LDL-related subfractions on CHD was reported, consistent in magnitude across the sizes of the subfractions33. Results were less definitive on the effect of HDL subfractions of varying size on CHD. Both positive and negative effects were observed. In our study, the HDL subfractions were uniformly projected to similar subspaces, yielding a single component that was mainly HDL populated in all models, except for the SCA model 15 which projected the small/ medium and large/ extra-large HDL traits in two different components. In all cases, the association of the sPCs with CHD was very low in magnitude. Nevertheless, the direction of effects was in line with the established knowledge on the relationship between lipid classes and CHD.
In the sparse PCA methods, there were differences in the results. The sPCA method16 favoured a sparser model in which less than 10 metabolites per PC were used. This observation is also made by Guo et al21. The SCA method8 achieved good separation of the traits and very little overlap was observed. A separation of HDL-related traits according to size, not captured by the other methods, was noted. Clinical relevance of a more high-resolution HDL profiling, with larger HDL molecules mainly associated with worse outcomes, has been previously reported34.
Limitations
In the present study, many tuning parameters needed to be set in order to calibrate the PCA methods. We therefore caution against extending our conclusions on the best method outside the confines of our simulation. Not all available sparse dimen-sionality reduction approaches were assessed in our investigation and other techniques could have provided better results.
The sparse PCA approach outlined in this paper enables the user to perform a MVMR analysis with a large number of highly correlated exposures, but one downside is that the effect sizes are not as interpretable. Of course, if a proven pharma-ceutical or lifestyle intervention exists which simultaneously affects a number of lipid fractions, and those fractions are also combined within a single principle component, then arguably the PC approach’s causal estimate is more relevant than that of a single lipid fraction in a multivariable model. Further foundational work on lipid metabolism pathways could shed light on this.
Conclusion
In the present study, we underline the utility of sparse dimensionality reduction approaches for highly correlated exposures in MR. We present a comprehensive review of methods available to perform dimensionality reduction and describe their performance in a real data application and a simulation.
Data Availability
This is a computational study and no new data were generated. The code for the analyses is publicly available.
Data Availability
The GWAS summary statistics for the metabolites (http://www.computationalmedicine.fi/data/NMR_GWAS/) and CHD (http://www.cardiogramplusc4d.org/) are publicly available. We provide code for the SCA function, the simulation study and related documentation on github (https://github.com/vaskarageorg/SCA_MR/).
Appendix
MVMR with IVW and GRAPPLE
A small negative effect for M.LDL.PL is noted as nominally significant in Fig. 9. This is not concordant with the UVMR direction of effect. In GRAPPLE, no traits surpass the nominal significance threshold.
Simulation Study Design
We generate a minor allele frequency MAF ∼U (0.01, 0.5) and an unmeasured confounder U ∼N(0, 5). Conditionally on the MAF, we generate the genes G ∼Bin(2, MAF). We define Q blocks and divide the P genes of G to q blocks, each containing P/Q genes. This approach aims to model the biological phenomenon of multiple traits being influenced by a common set of variants. For instance, the traits of LDL cholesterol content and LDL phospholipid content appear to be associated with the variants in a largely similar manner6.
For each block q = 1, 2, …, Q, we define a matrix γq whose elements γ1−γP/Q are non-zero and ∼N(4, 1). This matrix is what will direct only certain variants to influence certain exposures, leading to multicollinearity. We then generate the K exposures sequentially per block, following the parametric equation Xq = γqGq + U + εX. This way, specific genetic variants generate blocks of exposures as shown in Fig. 7 and the exposures X1 −XK/Q are highly correlated. We derive the SNP-exposure associations from this dataset as . We set K=50-100 and P=100-150 and the blocks 5-8. We let these values vary across simulations in order to generate more varying values in diagnostic accuracy. For a given sample size, s = 1000 simulation studies were performed.
To retain the workflow of a two-sample MR, we generate a second exposure set identically as above but independent. This step is important as it guarantees the no sample overlap assumption of two-sample MR35. Based on this second X ′ matrix, we generate the outcome Y = β X ′ + U + εY. The vector of causal estimates β is generated based on any number of exposures being causal in the two blocks. This includes the null. We obtain the SNP-outcome associations as .
The effect of the data generating mechanism in a single dataset is visualised in Fig. 13. The methods employed are
MVMR IVW with and 36.
MVMR IVW with Bonferroni correction.
PCA with the scores from .
sparse PCA as implemented in the elasticnet package16.
RSPCA26.
SCA8.
Due to computational complexity, sparse PCA in the elasticnet package (spca_en in Fig. 8) was implemented with a simplification regarding the sparsity parameter. In specific, it was assumed that the number of non-zero exposures per component was P/Q. For SCA, we use the cross-validation method in PMA R package37.
To generate the summary ROC curves presented in Fig. 8, we treated simulation results as individual studies and meta-analysed them with the bivariate method of Reitsma et al.29 in the R package. The logit sensitivity and specificity (which are correlated) are jointly meta analysed by modelling them as a bivariate normal distribution and employing a random-effects model for accomodating this correlation and framing it as heterogeneity.
The results for a set simulations in N = 3000 individuals are presented in Figure 14. We observe that MVMR and MVMR with Bonferroni correction estimates cluster to the bottom left of the plot, suggesting a low sensitivity. The estimates from SCA and sPCA_EN are spread in the upper left half of the plot and often achieve a sensitivity of 1.00. The PCA and RSPCA mainly provide highly sensitive estimates but perform relatively worse in specificity.
Footnotes
↵* e-mail: vaskarageorg{at}hotmail.com