A Two-Sample Robust Bayesian Mendelian Randomization Method Accounting for Linkage Disequilibrium and Idiosyncratic Pleiotropy with Applications to the COVID-19 Outcome ======================================================================================================================================================================== * Anqi Wang * Zhonghua Liu ## ABSTRACT Mendelian randomization (MR) is a statistical method exploiting genetic variants as instrumental variables to estimate the causal effect of modifiable risk factors on an outcome of interest. Despite wide uses of various popular two-sample MR methods based on genome-wide association study summary level data, however, those methods could suffer from potential power loss or/and biased inference when the chosen genetic variants are in linkage disequilibrium (LD), and also have relatively large direct effects on the outcome whose distribution might be heavy-tailed which is commonly referred to as the idiosyncratic pleiotropy phenomenon. To resolve those two issues, we propose a novel Robust Bayesian Mendelian Randomization (RBMR) model that uses the more robust multivariate generalized *t*-distribution (Arellano-Valle and Bolfarine, 1995) to model such direct effects in a probabilistic model framework which can also incorporate the LD structure explicitly. The generalized *t*-distribution can be represented as a Gaussian scaled mixture so that our model parameters can be estimated by the EM-type algorithms. We compute the standard errors by calibrating the evidence lower bound using the likelihood ratio test. Through extensive simulation studies, we show that our RBMR has robust performance compared to other competing methods. We also apply our RBMR method to two benchmark data sets and find that RBMR has smaller bias and standard errors. Using our proposed RBMR method, we find that coronary artery disease is associated with increased risk of critically ill coronavirus disease 2019 (COVID-19). We also develop a user-friendly R package *RBMR* ([https://github.com/AnqiWang2021/RBMR](https://github.com/AnqiWang2021/RBMR)) for public use. Key Words * COVID-19 outcome * Mendelian randomization * Idiosyncratic pleiotropy * Linkage disequilibrium * Multivariate generalized *t*-distribution * EM-type algorithm ## 1. Introduction Mendelian randomization (MR) is a useful statistical method that leverages genetic variants as instrumental variables (IVs) for assessing the causal effect of a modifiable risk factor on a health outcome of interest even in the presence of unmeasured confounding factors (Ebrahim and Smith, 2008; Lawlor et al., 2008; Evans and Davey Smith, 2015). Because of the inborn nature of genetic variants, the associations between genetic variants and phenotypes after adjusting for possible population stratification will not be confounded by the environmental factors, socio-economic status and life styles after birth. Genome-wide association studies (GWAS) have identified tens of thousands of common genetic variants associated with thousands of complex traits and diseases (MacArthur et al., 2017). Those GWAS summary level data contain rich information about genotype-phenotype associations ([https://www.ebi.ac.uk/gwas/](https://www.ebi.ac.uk/gwas/)), and thus provide us valuable resources for MR studies. Therefore, we have seen a boost of two-sample MR method developments and applications based on GWAS summary statistics recently due to the increasing availability of candidate genetic variant IVs for thousands of phenotypes. (Burgess et al., 2013; Bowden et al., 2015; Pickrell et al., 2016). In particular, a genetic variant serving as a valid IV must satisfy the following three core assumptions (Martens et al., 2006; Lawlor et al., 2008): 1. **Relevance:** The genetic variant must be associated (not necessarily causally) with the exposure; 2. **Effective Random Assignment:** The genetic variant must be independent of any (measured or unmeasured) confounders of the exposure-outcome relationship; 3. **Exclusion Restriction:** The genetic variant must affect the outcome only through the exposure, that is, the genetic variant must have no direct effect on the outcome not mediated by the exposure. When these three core IV assumptions hold, the inverse variance weighted (IVW) (Ehret et al., 2011) method can be simply used to obtain unbiased causal effect estimate of the exposure on the outcome. However, among those three core assumptions, only the IV relevance assumption can be empirically tested, for example, by checking the empirical association strength between the candidate IV and the exposure using the GWAS catalog ([https://www.ebi.ac.uk/gwas/](https://www.ebi.ac.uk/gwas/)). The association between the IV and the exposure must be strong enough (the IV explains a large amount of the variation of the exposure variable) to ensure unbiased causal effect estimate. The problem of weak IVs has been studied previously in the econometric literature (Bound et al., 1995; Hansen et al., 2008). In MR settings, the method that uses genetic score by combining multiple weak IVs together to increase the IV-exposure association strength to reduce weak IV bias has also been proposed (Evans et al., 2013). Unfortunately, the other two IV core assumptions cannot be empirically tested and might be violated in practice. Violation of the exclusion restriction assumption can occur when the genetic variant indeed has a non-null direct effect on the outcome not mediated by the exposure, referred to as systematic pleiotropy (Solovieff et al., 2013; Verbanck et al., 2018; Zhao et al., 2020b). However, very often, genetic variants might have relatively large direct effects whose distribution exhibit heavy-tailed pattern, a phenomenon referred to as the idiosyncratic pleiotropy in this paper. To address those possible violations of the IV core assumptions and potential risk, many efforts have been made recently. The MR-Egger regression method introduced an intercept term to capture the presence of unbalanced systematic pleiotropy under the Instrument Strength Independent of Direct Effect (InSIDE) assumption (Bowden et al., 2015). However, MR-Egger would be biased when there exists idiosyncratic pleiotropy. Zhu et al. (2018) proposed the GSMR method that removes suspected genetic variants with relatively large direct effects and also takes the LD structure into account by using the generalized least squares approach. However, removal of a large number of relatively large direct effects might lead to efficiency loss. Zhao et al. (2020b) proposed MR-RAPS to improve statistical power for causal inference and limit the influence of relatively large direct effects by using the adjusted profile likelihood and robust loss functions assuming that those SNP IVs are independent. However, this independent IV assumption might not hold in practice because SNPs within proximity tend to be correlated. Cheng et al. (2020) proposed a two-sample MR method named MR-LDP that built a Bayesian probabilistic model accounting for systematic pleiotropy and LD structures among SNP IVs. One drawback of the MR-LDP method is that it cannot handle relatively large direct effects well. To overcome the limitations of those aforementioned methods, we propose a more robust method named ‘Robust Bayesian Mendelian Randomization (RBMR)’ accounting for LD, systematic and idiosyncratic pleiotropy simultaneously in a unified framework. Specifically, to account for LD, we first estimate the LD correlation matrix of SNP IVs and then explicitly include it in the model likelihood. To account for idiosyncratic pleiotropy, we propose to model the direct effects using the more robust multivariate generalized *t*-distribution (Arellano-Valle and Bolfarine, 1995; Frahm, 2004) which will be shown to have improved performance than using the Gaussian distribution when the idiosyncratic pleiotropy is present. Moreover, this more robust distribution can be represented as a Gaussian scaled mixture to facilitate model parameter estimation using the parameter expanded variational Bayesian expectation maximization algorithm (PX-VBEM) (Yang et al., 2020) which combines the VB-EM (Beal et al., 2003) and the PX-EM (Liu et al., 1998) together. We further calculate the standard error by calibrating the evidence lower bound (ELBO) according to a nice property of the likelihood ratio test (LRT). Both extensive simulation studies in Section 3 and analysis of two real benchmark data sets in Section 4 show that our proposed RBMR method outperforms competitors. We also find that coronary artery disease (CAD) is associated with increased risk of critically ill COVID-19 outcome. ## 2. Methods ### 2.1 The Linear Structural Model Suppose that we have *J* possibly correlated genetic variants (for example, single-nucleotide polymorphisms, or SNPs) *G**j*, *j* = 1, 2, …, *J*, the exposure variable *X*, the outcome variable *Y* of interest and unknown confounding factors *U*. Let *δ**X* and *δ**Y* denote the effects of confounders *U* on exposure *X* and outcome *Y* respectively. The coefficients *γ**j* (*j* = 1, 2, …, *J*) denote the SNP-exposure true effects. Suppose that all the IVs are valid, then the exposure can be represented as a linear structural function of the SNPs, confounders and an independent random noise term *e**X*. The outcome can be represented as a linear structural function of the exposure, confounders and the independent random noise term *e**Y*. The true effect size of the exposure on the outcome is denoted as *β*. Then, we have the following linear structural equation models (Bowden et al., 2015): ![Formula][1] Let Γ*j* (*j* = 1, 2, …, *J*) be the true effects of SNPs on the outcome. With valid IVs, we have ![Formula][2] To accommodate possible violations of the exclusion restriction assumption, we now consider the following modified linear structural functions (Bowden et al., 2015): ![Formula][3] where the coefficients *α**j* (*j* = 1, 2, …, *J*) represent the direct effects of the SNPs on the outcome. Then we have ![Formula][4] So far, many existing MR methods assign the Gaussian distribution on each direct effect *α**j*, that is ![Graphic][5] (Zhao et al., 2020b; Cheng et al., 2020; Zhao et al., 2020a), where ***α*** = [*α*1, …, *α**J*]T is a *J*-dimensional vector of direct effects. However, real genetic data might contain some relatively large direct effects whose distribution can be heavy-tailed, and thus the Gaussian distribution might not be a good fit. Therefore, we propose to assign the multivariate generalized *t*-distribution on ***α*** (Arellano-Valle and Bolfarine, 1995; Kotz and Nadarajah, 2004), which is a robust alternative to the Gaussian distribution (Frahm, 2004). ### 2.2 The Robust Bayesian MR Model Let ![Graphic][6] and ![Graphic][7] be the GWAS summary statistics for the exposure and the outcome respectively, where ![Graphic][8] are the corresponding estimated standard errors. Many existing MR methods assume that IVs are independent from each other (Ehret et al., 2011; Bowden et al., 2015; Zhao et al., 2020b), and the uncorrelated SNPs can be chosen by using a tool called LD clumping (Hemani et al., 2016; Purcell et al., 2007), which might remove many SNP IVs and thus cause efficiency loss. To include more SNP IVs even if they are in LD, we need to account for the LD structure explicitly. To achieve this goal, we use a reference panel sample to assist with reconstructing LD matrix, such as the 1000 Genome Project Phase 1 (*N* =379) (Consortium et al., 2012). We first apply the LDetect method to partition the whole genome into *Q* blocks (Berisa and Pickrell, 2016) and then estimate the LD matrix **Θ** using the estimator ![Graphic][9] first proposed by Rothman (2012). Then, the distributions of ![Graphic][10] and ![Graphic][11] are given by ![Formula][12] ![Formula][13] where ![Graphic][14] and ![Graphic][15] are both diagonal matrices (Zhu and Stephens, 2017). To account for the presence of idiosyncratic pleiotropy, we propose to model the direct effects ***α*** using the more robust multivariate generalized *t*-distribution (Arellano-Valle and Bolfarine, 1995; Kotz and Nadarajah, 2004; Ala-Luhtala and Piché, 2016) whose density function is given by ![Formula][16] where 𝒩 (***α***|**0, Σ***/w*) denotes the *J*-dimensional Gaussian distribution with mean **** and covariance ![Graphic][17] is a *J* × *J* diagonal matrix, and *𝒢* (*w*|*α**w*, *β**w*) is the Gamma distribution of a univariate positive variable *w* referred to as a weight variable ![Formula][18] where *f* denotes the Gamma function. When *α**w* = *β**w* = *ν/*2 in equation (2.8), the distribution in equation (2.7) reduces to a multivariate *t*-distribution, where *ν* is the degree of freedom. Gaussian scaled mixture representation enables the use of EM-type algorithms for statistical inference, such as the PX-VBEM (Yang et al., 2020) described in Section 2.3. Then we denote the distribution of the latent variable ***γ*** as ![Formula][19] where ***σ***2 = *σ*2**I****J** is a *J* × *J* diagonal matrix. By assuming that ***γ, α*** and *w* are latent variables, the complete data likelihood can be written as ![Formula][20] ### 2.3 Estimation and Inference The standard expectation-maximization (EM) algorithm (Dempster et al., 1977) is a popular choice for finding the maximum likelihood estimate in the presence of missing (latent) variables. However, one difficulty for implementing the EM algorithm is to calculate the marginal likelihood function which might involve difficult integration with respect to the distributions of the latent variables. In addition, the original EM algorithm might be slow (Liu et al., 1998). To address these numerical issues, we utilize a parameter expanded variational Bayesian expectation-maximization algorithm, namely, PX-VBEM (Yang et al., 2020), by replacing the EM algorithm in VB-EM (Beal et al., 2003) with PX-EM algorithm (Liu et al., 1998) to accelerate the speed of convergence. To start with, for the purpose of applying the PX-EM algorithm, the distribution of ![Graphic][21] in equation (2.5) can be rewritten as follows: ![Formula][22] We also rewrite the complete data likelihood in equation (2.10) as: ![Formula][23] where the expanded model parameters for RBMR are ![Graphic][24]. Let *q*(***γ, α***, *w*) be a variational posterior distribution. The logarithm of the marginal likelihood can be decomposed into two parts, ![Formula][25] Where ![Formula][26] Given that the *ℒ* (*q*) is an evidence lower bound (ELBO) of the marginal log-likelihood, the non-negative Kullback-Leibler (KL) divergence 𝕂 𝕃 (*q*‖/*p*) is equal to zero if and only if the variational posterior distribution is equal to the true posterior distribution. Minimizing the KL divergence is equivalent to maximizing ELBO. Before calculating the maximization of ELBO, due to the fact that latent variables are independent of each other, the decomposition form of the posterior distribution *q*(***γ, α***, *w*) is obtained using the mean field assumption (Blei et al., 2017), ![Formula][27] In the PX-VB-E step, the optimal variational posterior distributions for ***γ, α*** and *w* can be written as: ![Formula][28] The updating equations for the parameters are given by ![Formula][29] where ![Graphic][30] and ![Graphic][31]. In the PX-VB-M step, by setting the derivate of the ELBO to be zero, the model parameters ***θ*** can be obtained as: ![Formula][32] Where ![Graphic][33] and ![Graphic][34]. Finally, we use the updated model parameters ***θ*** to construct the evidence lower bound to check the convergence. Since we adopt PX-EM algorithm, the reduction step should be used to process the obtained parameters. More technical details can be found in the Supplementary Materials. After obtaining an estimate of the causal effect, we further calculate the standard error according to the property of likelihood ratio test (LRT) statistics which asymptotically follows the ![Graphic][35] under the null hypothesis (Van der Vaart, 2000). We first formulate the statistical tests to examine the association between the risk factor and the outcome. ![Formula][36] the likelihood ratio test (LRT) statistics for the causal effect is given by: ![Formula][37] Where ![Graphic][38] and ![Graphic][39] are collections of parameter estimates obtained by maximizing the marginal likelihood under the null hypothesis *ℋ* and under the alternative hypothesis *ℋ**a*. We utilize PX-VBEM algorithm to maximize the ELBO to get the ![Graphic][40] and ![Graphic][41] instead of maximizing the marginal likelihood to overcome the computational intractability. Although PX-VBEM produces accurate posterior mean estimates (Blei et al., 2017; Dai et al., 2017; Yang et al., 2018), it would underestimate the marginal variance because we use the estimated posterior distribution from the ELBO to approximate the marginal likelihood in equation (2.20) (Wang and Titterington, 2005). Thus, we calibrate ELBO by plugging our estimates (![Graphic][42] and ![Graphic][43]) from PX-VBEM into the equation (2.20) to construct the test statistics (Yang et al., 2020): ![Formula][44] Then, we can get the well-calibrated standard error as ![Graphic][45]. ## 3. Simulation Studies Although our proposed method is based on summary level data, we still simulate the individual-level data to better mimic real genetic data sets. Specifically, the data sets are generated according to the following models: ![Formula][46] Where ![Graphic][47] is the exposure vector, ![Graphic][48] is the outcome vector, ![Graphic][49] and ![Graphic][50] are the genotype datasets for the exposure **X** and the outcome **Y**, ![Graphic][51] and ![Graphic][52] are matrices for confounding variables, *n**X* and *n**Y* are the corresponding sample sizes of exposure **X** and outcome **Y**, *J* is the number of genotyped SNPs. The error terms ***ε******X*** and ***ε******Y*** are independent noises from ![Graphic][53] and ![Graphic][54] respectively. In model (3.1), *β* is the true causal effect and ***α*** represents the direct effect of the SNPs on the outcome not mediated by the exposure variable. An external reference panel ![Graphic][55] is chosen for estimating the LD matrix among SNPs, where *n**r* is the sample size of the chosen reference panel. The R package named *MR*.*LDP* is available to generate genotyped matrices **G*****X***, **G*****Y*** and **G*****r***. We fix *n**X* = *n**Y* = 20000 and *n**r* = 2500. The total number of SNPs is *J* = 300. For confounding variables, each column of **U*****X*** and **U*****Y*** is sampled from a standard normal distribution while each row of corresponding coefficients ![Graphic][56] and ![Graphic][57] of confounding variables is obtained from a multivariate normal distribution 𝒩 (**0, *S******η***) where the diagonal elements of ***S******η*** ∈ ℝ2×2 are 1 and the off-diagonal elements are 0.8. We simulate the following three cases of idiosyncratic pleiotropy: 1. **case 1:** ![Graphic][58]. We randomly select {5%, 10%, 20%, 25%} of IVs so that their direct effects *α**j*s have mean 0 and variance ![Graphic][59]. 2. **case 2:** ![Graphic][60]. We randomly select {5%, 10%, 20%, 25%} of IVs so that their direct effect *α**j*s have mean 10*σ* and variance ![Graphic][61]. 3. **case 3:** ![Graphic][62], the values of freedom *df* are {10, 15, 20}. The ![Graphic][63] in case 1 and case 2 is controlled by the heritability *h****α*** due to systematic pleiotropy, ![Graphic][64] which is set to be ![Graphic][65] and 0.07 respectively. The signal magnitude for ***γ*** is chosen such that the heritability ![Graphic][66] The true causal effect *β* is set to be 0.2. We first run single-variant genetic association analysis for the exposure and the outcome respectively, and then we obtain the summary-level statistics ![Graphic][67] with their corresponding standard errors ![Graphic][68] for all the three cases. Then we use the summary-level data to conduct MR analyses using the proposed RBMR, MR-LDP, MR-Egger, RAPS, GSMR and IVW methods. As the prerequisite for MR-Egger, RAPS and IVW methods is that the instrumental variables are independent of each other, we adopt a step-wise GSMR method to remove SNPs with LD structure. We repeat such experiment for 100 times. The simulation results are shown in Figure 3.1. In all the three cases considered, we find that the proposed RBMR and MR-LDP methods are more stable than the other four methods: RAPS, GSMR MR-Egger and IVW. However, we found that our proposed RBMR method has smaller bias and root mean square error (RMSE) than the MR-LDP method. More detailed results are provided in the Supplementary Materials. ![Figure 3.1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/03/29/2021.03.02.21252801/F1.medium.gif) [Figure 3.1:](http://medrxiv.org/content/early/2021/03/29/2021.03.02.21252801/F1) Figure 3.1: Comparisons of MR methods at heritability level of 0.05 and 0.07: The Figure (a), (c) and (e) represent comparisons of the causal estimates ![Graphic][69] of RBMR, MR-LDP, RAPS, GSMR, MR-Egger and IVW methods at heritability level of 0.05 for three cases of idiosyncratic pleiotropy, respectively. The Figure (b), (d) and (f) represent comparisons of the causal estimates ![Graphic][70] of RBMR, MR-LDP, RAPS, GSMR, MR-Egger and IVW methods at heritability level of 0.07 for three cases of idiosyncratic pleiotropy, respectively. ## 4. Real Data Analysis In this section, we analyzed three real data sets to demonstrate the performance of our proposed method. The 1000 Genome Project Phase 1 (1KGP) is used as the reference panel to compute the LD matrix (Consortium et al., 2012). We first analyze two benchmark data sets commonly used for method comparison purpose, then we will estimate the causal effect of coronary artery disease (CAD) on the risk of critically ill COVID-19 outcome defined as those who end up on respiratory support or die from COVID-19. The first benchmark data analysis is based on the summary-level data sets from two non-overlapping GWAS studies for the coronary artery disease (CAD), usually referred to as the CAD-CAD data. The true causal effect should be exactly one. The selection data set is from the Myocardial Infarction Genetics in the UK Biobank, the exposure data is from the Coronary Artery Disease (C4D) Genetics Consortium (Consortium et al., 2011), and the outcome data is from the transatlantic Coronary Artery Disease Genome Wide Replication and Meta-analysis (CARDIoGRAM) (Schunkert et al., 2011). We first filter the genetic variants using the selection data under different association *p*-value thresholds (*p*-value ≤ 1 × 10*−*4, 5 × 10*−*4, 1 × 10*−*3). Then we applied our proposed RBMR method and the MR-LDP to all the selected and possibly correlated SNPs by accounting for the LD structure explicitly. We applied the GSMR, IVW, MR-Egger and MR-RAPS methods using the independent SNPs after LD pruning because those methods require independent SNPs. We obtain causal effect point estimates and the corresponding 95% confidence intervals (CI) as shown in Figure 4.1(a). We found that our proposed RBMR method outperforms other methods because it has the smallest bias and shortest confidence intervals for a range of *p*-value thresholds. ![Figure 4.1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/03/29/2021.03.02.21252801/F2.medium.gif) [Figure 4.1:](http://medrxiv.org/content/early/2021/03/29/2021.03.02.21252801/F2) Figure 4.1: The results of CAD-CAD and BMI-BMI using 1KGP as the reference panel with shrink-age parameter λ = 0.15 and screening the corresponding SNPs under three thresholds (p-value ≤ 1 × 10−4, 5 × 10−4, 1 × 10−3),, RBMR, MR-LDP, RAPS, GSMR, MR-Egger and IVW methods use SNPs selected to calculate the casual effect estimate ![Graphic][71]. To further investigate the performance of our proposed RBMR method, we consider the case that both the exposure and outcome are body mass index (BMI). We select SNPs based on previous research (Locke et al., 2015). The exposure is the BMI for physically active men and the outcome is the BMI for physically active women, both are of European ancestry ([https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT\_consortium\_data\_files#2018\_GIANT\_and\_UK\_BioBank\_Meta\_Analysis\_for\_Public\_Release](https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT\_consortium\_data\_files#2018\_GIANT\_and\_UK\_BioBank\_Meta\_Analysis_for_Public_Release)). The point estimates and the corresponding 95% confidence intervals are shown in Figure 4.1(b). We found that our proposed RBMR method has smaller bias than other competing methods. More numerical results are provided in the Supplementary Materials. We apply our proposed RBMR method together with other competing methods to estimate the causal effect of CAD on the risk of critically ill coronavirus disease 2019 (COVID-19) defined as those who end up on respiratory support or die from COVID-19. Specifically, the selection data set is the Myocardial Infraction Genetics in the UK Biobank and the exposure data set is from Consortium et al. (2011). The outcome is obtained from Freeze 5 (January 2021) of the COVID-19 Host Genetics Initiative (COVID-19 HGI) Genome-Wide Association Study (Initiative et al., 2020) ([https://www.covid19hg.org/results/](https://www.covid19hg.org/results/)). The data combines the genetic data of 49562 patients and two million controls from 46 studies across 19 countries (Initiative et al., 2021). We mainly consider the GWAS data on the 6179 cases with critical illness due to COVID-19 and 1483780 controls from the general populations in our analysis. We use the selection data with *p*-value ≤ 1 × 10*−*4 threshold to select genetic variants as IVs. As shown in Figure 4.2(a), we found a significant effect of CAD on the risk of critically ill COVID-19 using our RBMR method (![Graphic][72], *p-*value = 0.0273, 95% CI = [0.0834, 0.3648]), MR-LDP (![Graphic][73], *p-*value = 0.0325, 95% CI = [0.0773, 0.3621]), RAPS (![Graphic][74], *p-*value = 0.0412, 95% CI = [0.0074, 0.3596])and MR-Egger (![Graphic][75], *p-*value = 0.012, 95% CI = [0.0607, 0.4967]) However, the results of GSMR (![Graphic][76], *p-*value = 0.0558, 95% CI = [− 0.0034, 0.2719]) and IVW (![Graphic][77], *p-*value = 0.0700 95% CI = [− 0.0111, 0.2796]) are not significant (*p*-value *>* 0.05). Although MR-LDP and RBMR give similar point estimate, however, our RBMR is more accurate as its confidence interval is slightly shorter and its *p*-value is more significant. ![Figure 4.2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/03/29/2021.03.02.21252801/F3.medium.gif) [Figure 4.2:](http://medrxiv.org/content/early/2021/03/29/2021.03.02.21252801/F3) Figure 4.2: The results of CAD-COVID-19 using 1KGP as the reference panel with shrinkage parameter λ = 0.15 and screening the corresponding SNPs under thresholds (p-value ≤ 1 × 10−4), RBMR, MR-LDP, RAPS, GSMR, MR-Egger and IVW methods use SNPs selected to calculate the casual effect estimate ![Graphic][78]. Each point of scatter plot in Figure (b) is augmented by the standard error of ![Graphic][79] and ![Graphic][80] on the vertical and horizontal sides. Dashed lines are the slopes fitted by six methods. ## 5. Discussion In this paper, we propose a novel two-sample robust MR method RBMR by accounting for the LD structure, systematic pleiotropy and idiosyncratic pleiotropy simultaneously in a unified framework. Specifically, we propose to use the more robust multivariate generalized *t*-distribution rather the less robust Gaussian distribution to model the direct effects of the IV on the outcome not mediated by the exposure. Moreover, the multivariate generalized *t*-distribution can be reformulated as Gaussian scaled mixtures to facilitate the estimation of the model parameters using the parameter expanded variational Bayesian expectation-maximum algorithm (PX-VBEM). Through extensive simulations and analysis of two real benchmark data sets, we found that our method outperforms the other competing methods. We also found that CAD is associated with increased risk of critically ill COVID-19 outcome using our RBMR method. We make the following two major contributions. First, our method can account for the LD structure explicitly and thus can include more possibly correlated SNPs to reduce bias and increase estimation efficiency. Second, our RBMR method is more robust to the presence of idiosyncratic pleiotropy. One limitation of our proposed method is that it cannot handle correlated pleiotropy where the direct effect of the IV on the outcome might be correlated with the IV strength. We leave it as our future work. ## Data Availability All data are publicly available as indicated or provided in the article. ## Acknowledgements We would like to thank the GIANT Consortium and COVID-19 host genetics initiative for providing publicly available summary statistics to support our analysis. * Received March 2, 2021. * Revision received March 29, 2021. * Accepted March 29, 2021. * © 2021, Posted by Cold Spring Harbor Laboratory The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission. ## References 1. Ala-Luhtala, J. and Piché, R. (2016). Gaussian scale mixture models for robust linear multivariate regression with missing data. Communications in Statistics-Simulation and Computation, 45(3). 2. Arellano-Valle, R. B. and Bolfarine, H. (1995). On some characterizations of the t-distribution. Statistics & Probability Letters, 25(1):79–85. 3. Beal, M. J. et al. (2003). Variational algorithms for approximate Bayesian inference. University of London London. 4. Berisa, T. and Pickrell, J. K. (2016). Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2):283. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/btv546&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26395773&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 5. Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(528):859–877. 6. Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430):443–450. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2307/2291055&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1995RA10400009&link_type=ISI) 7. Bowden, J., Davey Smith, G., and Burgess, S. (2015). Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. International Journal of Epidemiology, 44(2):512–525. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dyv080&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26050253&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 8. Burgess, S., Butterworth, A., and Thompson, S. G. (2013). Mendelian randomization analysis with multiple genetic variants using summarized data. Genetic Epidemiology, 37(7):658–665. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/gepi.21758&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24114802&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 9. Cheng, Q., Yang, Y., Shi, X., Yeung, K.-F., Yang, C., Peng, H., and Liu, J. (2020). MR-LDP: a two-sample mendelian randomization for gwas summary statistics accounting for linkage disequilibrium and horizontal pleiotropy. NAR Genomics and Bioinformatics, 2(2):qaa028. 10. Consortium, . G. P. et al. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature11632&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23128226&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000310434500030&link_type=ISI) 11. Consortium, C. A. D. C. G. et al. (2011). A genome-wide association study in europeans and south asians identifies five new loci for coronary artery disease. Nature Genetics, 43(4):339. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.782&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21378988&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000288903700014&link_type=ISI) 12. Dai, M., Ming, J., Cai, M., Liu, J., Yang, C., Wan, X., and Xu, Z. (2017). IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies. Bioinformatics, 33(18):2882–2889. 13. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from in-complete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.2517-6161.1977.tb01600.x&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1977DM46400001&link_type=ISI) 14. Ebrahim, S. and Smith, G. D. (2008). Mendelian randomization: can genetic epidemiology help redress the failures of observational epidemiology? Human Genetics, 123(1):15–33. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s00439-007-0448-6&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=18038153&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000252799200002&link_type=ISI) 15. Ehret, G. B., Munroe, P. B., Rice, K. M., Bochud, M., Johnson, A. D., Chasman, D. I., Smith, A. V., Tobin, M. D., Verwoert, G. C., Hwang, S.-J., et al. (2011). Genetic variants in novel pathways influence blood pressure and cardiovascular disease risk. Nature, 478(7367):103. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature10405&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21909115&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000295575400043&link_type=ISI) 16. Evans, D. M., Brion, M. J. A., Paternoster, L., Kemp, J. P., McMahon, G., Munafo, M., Whitfield, J. B., Medland, S. E., Montgomery, G. W., Timpson, N. J., et al. (2013). Mining the human phenome using allelic scores that index biological intermediates. PLoS Genet, 9(10):e1003919. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1003919&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24204319&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 17. Evans, D. M. and Davey Smith, G. (2015). Mendelian randomization: new applications in the coming age of hypothesis-free causality. Annual Review of Genomics and Human Genetics, 16:327–350. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1146/annurev-genom-090314-050016&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25939054&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 18. Frahm, G. (2004). Generalized elliptical distributions: theory and applications. PhD thesis, Universitätsbibliothek. 19. Hansen, C., Hausman, J., and Newey, W. (2008). Estimation with many instrumental variables. Journal of Business & Economic Statistics, 26(4):398–422. 20. Hemani, G., Zheng, J., Wade, K. H., Laurin, C., Elsworth, B., Burgess, S., Bowden, J., Langdon, R., Tan, V., Yarmolinsky, J., et al. (2016). MR-Base: a platform for systematic causal inference across the phenome using billions of genetic associations. BioRxiv, page 078972. 21. Initiative, C.-. H. G. et al. (2020). The covid-19 host genetics initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the sars-cov-2 virus pandemic. European Journal of Human Genetics, 28(6):715. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 22. Initiative, C.-. H. G. et al. (2021). Mapping the human genetic architecture of covid-19 by worldwide meta-analysis. MedRxiv. 23. Kotz, S. and Nadarajah, S. (2004). Multivariate t-distributions and their applications. Cambridge University Press. 24. Lawlor, D. A., Harbord, R. M., Sterne, J. A., Timpson, N., and Davey Smith, G. (2008). Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Statistics in Medicine, 27(8):1133–1163. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.3034&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17886233&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 25. Liu, C., Rubin, D. B., and Wu, Y. N. (1998). Parameter expansion to accelerate em: the px-em algorithm. Biometrika, 85(4):755–770. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biomet/85.4.755&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000077944900001&link_type=ISI) 26. Locke, A. E., Kahali, B., Berndt, S. I., Justice, A. E., Pers, T. H., Day, F. R., Powell, C., Vedantam, S., Buchkovich, M. L., Yang, J., et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538):197–206. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature14177&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25673413&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 27. MacArthur, J., Bowler, E., Cerezo, M., Gil, L., Hall, P., Hastings, E., Junkins, H., McMahon, A., Milano, A., Morales, J., et al. (2017). The new NHGRI-EBI Catalog of published genome-wide association studies (gwas catalog). Nucleic Acids Research, 45(D1):D896– D901. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkw1133&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27899670&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 28. Martens, E. P., Pestman, W. R., de Boer, A., Belitser, S. V., and Klungel, O. H. (2006). Instrumental variables: application and limitations. Epidemiology, pages 260–267. 29. Pickrell, J. K., Berisa, T., Liu, J. Z., Ségurel, L., Tung, J. Y., and Hinds, D. A. (2016). Detection and interpretation of shared genetic influences on 42 human traits. Nature Genetics, 48(7):709. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3570&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27182965&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 30. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., De Bakker, P. I., Daly, M. J., et al. (2007). PLINK: a tool set for wholegenome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1086/519795&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17701901&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 31. Rothman, A. J. (2012). Positive definite estimators of large covariance matrices. Biometrika, 99(3):733–740. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biomet/ass025&link_type=DOI) 32. Schunkert, H., König, I. R., Kathiresan, S., Reilly, M. P., Assimes, T. L., Holm, H., Preuss, M., Stewart, A. F., Barbalic, M., Gieger, C., et al. (2011). Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nature Genetics, 43(4):333– 338. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.784&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21378990&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 33. Solovieff, N., Cotsapas, C., Lee, P. H., Purcell, S. M., and Smoller, J. W. (2013). Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics, 14(7):483–495. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nrg3461&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23752797&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 34. Van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press. 35. Verbanck, M., Chen, C.-y., Neale, B., and Do, R. (2018). Detection of widespread horizontal pleiotropy in causal relationships inferred from mendelian randomization between complex traits and diseases. Nature Genetics, 50(5):693–698. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-018-0099-7&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29686387&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F03%2F29%2F2021.03.02.21252801.atom) 36. Wang, B. and Titterington, D. (2005). Inadequacy of interval estimates corresponding variational bayesian approximations. In AISTATS. Citeseer. 37. Yang, Y., Dai, M., Huang, J., Lin, X., Yang, C., Chen, M., and Liu, J. (2018). LPG: A four-group probabilistic approach to leveraging pleiotropy in genome-wide association studies. BMC Genomics, 19(1):503. 38. Yang, Y., Shi, X., Jiao, Y., Huang, J., Chen, M., Zhou, X., Sun, L., Lin, X., Yang, C., and Liu, J. (2020). CoMM-S2: a collaborative mixed model using summary statistics in transcriptome-wide association studies. Bioinformatics, 36(7):2009–2016. 39. Zhao, J., Ming, J., Hu, X., Chen, G., Liu, J., and Yang, C. (2020a). Bayesian weighted mendelian randomization for causal inference based on summary statistics. Bioinformatics, 36(5):1501–1508. 40. Zhao, Q., Wang, J., Hemani, G., Bowden, J., and Small, D. S. (2020b). Statistical inference in two-sample summary-data mendelian randomization using robust adjusted profile score. Annals of Statistics, 48(3):1742–1769. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1214/19-aos1866&link_type=DOI) 41. Zhu, X. and Stephens, M. (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. The Annals of Applied Statistics, 11(3):1561. 42. Zhu, Z., Zheng, Z., Zhang, F., Wu, Y., Trzaskowski, M., Maier, R., Robinson, M. R., McGrath, J. J., Visscher, P. M., Wray, N. R., et al. (2018). Causal associations between risk factors and common diseases inferred from gwas summary data. Nature Communications, 9(1):1–12. [1]: /embed/graphic-1.gif [2]: /embed/graphic-2.gif [3]: /embed/graphic-3.gif [4]: /embed/graphic-4.gif [5]: /embed/inline-graphic-1.gif [6]: /embed/inline-graphic-2.gif [7]: /embed/inline-graphic-3.gif [8]: /embed/inline-graphic-4.gif [9]: /embed/inline-graphic-5.gif [10]: /embed/inline-graphic-6.gif [11]: /embed/inline-graphic-7.gif [12]: /embed/graphic-5.gif [13]: /embed/graphic-6.gif [14]: /embed/inline-graphic-8.gif [15]: /embed/inline-graphic-9.gif [16]: /embed/graphic-7.gif [17]: /embed/inline-graphic-10.gif [18]: /embed/graphic-8.gif [19]: /embed/graphic-9.gif [20]: /embed/graphic-10.gif [21]: /embed/inline-graphic-11.gif [22]: /embed/graphic-11.gif [23]: /embed/graphic-12.gif [24]: /embed/inline-graphic-12.gif [25]: /embed/graphic-13.gif [26]: /embed/graphic-14.gif [27]: /embed/graphic-15.gif [28]: /embed/graphic-16.gif [29]: /embed/graphic-17.gif [30]: /embed/inline-graphic-13.gif [31]: /embed/inline-graphic-14.gif [32]: /embed/graphic-18.gif [33]: /embed/inline-graphic-15.gif [34]: /embed/inline-graphic-16.gif [35]: /embed/inline-graphic-17.gif [36]: /embed/graphic-19.gif [37]: /embed/graphic-20.gif [38]: /embed/inline-graphic-18.gif [39]: /embed/inline-graphic-19.gif [40]: /embed/inline-graphic-20.gif [41]: /embed/inline-graphic-21.gif [42]: /embed/inline-graphic-22.gif [43]: /embed/inline-graphic-23.gif [44]: /embed/graphic-21.gif [45]: /embed/inline-graphic-24.gif [46]: /embed/graphic-22.gif [47]: /embed/inline-graphic-25.gif [48]: /embed/inline-graphic-26.gif [49]: /embed/inline-graphic-27.gif [50]: /embed/inline-graphic-28.gif [51]: /embed/inline-graphic-29.gif [52]: /embed/inline-graphic-30.gif [53]: /embed/inline-graphic-31.gif [54]: /embed/inline-graphic-32.gif [55]: /embed/inline-graphic-33.gif [56]: /embed/inline-graphic-34.gif [57]: /embed/inline-graphic-35.gif [58]: /embed/inline-graphic-36.gif [59]: /embed/inline-graphic-37.gif [60]: /embed/inline-graphic-38.gif [61]: /embed/inline-graphic-39.gif [62]: /embed/inline-graphic-40.gif [63]: /embed/inline-graphic-41.gif [64]: /embed/inline-graphic-42.gif [65]: /embed/inline-graphic-43.gif [66]: /embed/inline-graphic-44.gif [67]: /embed/inline-graphic-45.gif [68]: /embed/inline-graphic-46.gif [69]: F1/embed/inline-graphic-47.gif [70]: F1/embed/inline-graphic-48.gif [71]: F2/embed/inline-graphic-49.gif [72]: /embed/inline-graphic-50.gif [73]: /embed/inline-graphic-51.gif [74]: /embed/inline-graphic-52.gif [75]: /embed/inline-graphic-53.gif [76]: /embed/inline-graphic-54.gif [77]: /embed/inline-graphic-55.gif [78]: F3/embed/inline-graphic-56.gif [79]: F3/embed/inline-graphic-57.gif [80]: F3/embed/inline-graphic-58.gif