Deciphering causal protein biomarkers in Alzheimer’s disease: Integrating a novel robust Mendelian randomization method for proteomics data analysis and AlphaFold3 for predicting 3D structural alterations
==============================================================================================================================================================================================================

* Minhao Yao
* Gary W. Miller
* Badri N. Vardarajan
* Andrea A. Baccarelli
* Zijian Guo
* Zhonghua Liu

## Abstract

Hidden confounding bias is a major threat in identifying causal protein biomarkers for Alzheimer’s disease in non-randomized studies. The Mendelian randomization (MR) framework holds the promise of removing such hidden confounding bias by leveraging protein quantitative trait loci (pQTLs) as instrumental variables (IVs) for establishing causal relationships. However, some pQTLs might violate core IV assumptions, leading to biased causal inference and misleading scientific conclusions. To address this urgent challenge, we propose a novel MR method called MR-SPI that first Selects valid pQTL IVs under the Anna Karenina Principle and then performs valid Post-selection Inference that is robust to possible pQTL selection error. We further develop a computationally efficient pipeline by integrating MR-SPI and AlphaFold3 to automatically identify causal protein biomarkers and predict protein 3D structural alterations. We apply this pipeline to analyze genome-wide summary statistics for 912 plasma proteins in 54,306 participants from UK Biobank and for Alzheimer’s disease (AD) in 455,258 samples. We identify seven proteins associated with Alzheimer’s disease (TREM2, PILRB, PILRA, EPHA1, CD33, RET, and CD55) whose 3D structures are altered by missense genetic variations. Our findings offer novel insights into their biological roles in AD development and may aid in identifying potential drug targets.

## 1. Introduction

Alzheimer’s disease (AD) stands as the primary cause of dementia globally, exerting a considerable strain on healthcare resources1,2. Despite extensive efforts, the etiology and pathogenesis of AD are still unclear, and strategies aimed at impeding or delaying its clinical advancement have largely remained challenging to achieve1,3,4. The amyloid cascade hypothesis posits that AD begins with the accumulation and aggregation of amyloid-beta (Aβ) peptides in the brain, culminating in the formation of β-amyloid fibrils, leading to tau hyperphosphorylation, neurofibrillary tangle formation and neurodegeneration5,6. However, current AD therapies targeting Aβ production and amyloid formation offer only transient symptomatic relief and fail to halt disease progression, resulting in a lack of effective drugs for AD1,7. Therefore, it is imperative and urgent to identify causal protein biomarkers to elucidate the underlying mechanisms of AD, and to expedite the development of effective therapeutic interventions for AD.

In causal inference, randomized controlled trials (RCTs) serve as the gold standard for evaluating the causal effect of an exposure on the health outcome of interest. However, it might be neither feasible nor ethical to perform RCTs where protein levels are considered as the exposures. Mendelian randomization (MR) leverages the random assortment of genes from parents to offspring to mimic RCTs to establish causality in non-randomized studies8-10. MR uses genetic variants, typically single-nucleotide polymorphisms (SNPs), as instrumental variables (IVs) to assess the causal association between an exposure and a health outcome11. Recently, many MR methods have been developed to investigate causal relationships using genome-wide association study (GWAS) summary statistics that consist of effect estimates of SNP-exposure and SNP-outcome associations from two sets of samples, a setting commonly referred to as the two-sample MR design12-15. Since summary statistics are often publicly available and provide abundant information on associations between genetic variants and complex traits/diseases, two-sample MR methods have become increasingly popular14,16-18. In particular, recent studies with large-scale proteomics data have unveiled numerous protein quantitative trait loci (pQTLs) associated with thousands of proteins19,20, facilitating the application of two-sample MR methods, where pQTLs serve as IVs and protein levels serve as exposures, to identify proteins as causal biomarkers for complex traits and diseases.

To employ MR for identifying causal protein biomarkers, conventional MR methods require the pQTLs included in the analysis to be valid IVs for reliable causal inference. A pQTL is called a valid IV if the following three core IV assumptions hold9,21:

(A1). **Relevance**: The pQTL is associated with the protein;

(A2). **Effective Random Assignment**: The pQTL is not associated with any unmeasured confounder of the protein-outcome relationship; and

(A3). **Exclusion Restriction**: The pQTL affects the outcome only through the protein in view.

Among the three core IV assumptions (A1) - (A3), only the first assumption (A1) can be tested empirically by selecting pQTLs significantly associated with the protein. Assumptions (A2) and (A3), however, cannot be empirically verified in general and may be violated in practice, which may lead to a biased estimate of the causal effect. For example, violation of (A2) may occur due to the presence of population stratification9,22; and violation of (A3) may occur in the presence of horizontal pleiotropy9,23, a widespread biological phenomenon in which the pQTL IV affects the outcome through other biological pathways that do not involve the protein in view, for example, through alternative splicing or micro-RNA effects24-26.

Recently, several MR methods have been proposed to handle invalid IVs under certain assumptions, as summarized in Table 1. Some of these additional assumptions for the identification of the causal effect in the presence of invalid IVs are listed below:

[Table 1:](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/T1)

Table 1: Comparison of MR methods and the underlying assumptions for handling invalid IVs. Balanced pleiotropy means that, on average, the pleiotropic effects have zero mean. The NOME assumption refers to NO Measurement Error in the exposure data.

(i) The Instrument Strength Independent of Direct Effect (InSIDE) assumption: the pQTL-protein effect is asymptotically independent of the horizontal pleiotropic effect when the number of pQTL IVs goes to infinity. Methods relying on this assumption include the random-effects inverse-variance weighted (IVW) method27, MR-Egger28, MR-RAPS (Robust Adjusted Profile Score)16, and the Mendelian randomization pleiotropy residual sum and outlier (MR-PRESSO) test29.

(ii) Majority rule condition: up to 50% of the candidate pQTL IVs are invalid. Examples include the weighted median method30 and MR-PRESSO.

(iii) Plurality rule condition or the ZEro Modal Pleiotropy Assumption (ZEMPA)15,31: a plurality of the candidate IVs are valid, which is weaker than the majority rule condition. Examples include the mode-based estimation31, MRMix32 and the contamination mixture method33.

(iv) Other distributional assumptions. For example, MRMix and the contamination mixture method impose normal mixture distribution assumption on the genetic associations and the ratio estimates, respectively.
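To make the difference between conditions (ii) and (iii) concrete, the following minimal Python sketch checks both rules for a hypothetical set of candidate IVs. It assumes that the ratio of each IV's direct effect on the outcome to its effect on the protein is known, which is never the case in practice. The function name `check_rules`, the input values, and the reading of the majority rule as "strictly more than half of the IVs are valid" are illustrative assumptions, not part of any method cited above.

```python
from collections import Counter

def check_rules(pleiotropy_ratios):
    """Check the majority and plurality rules for a set of candidate IVs.

    pleiotropy_ratios[j] is the (hypothetical, normally unknown) ratio of
    IV j's direct effect on the outcome to its effect on the protein.
    A valid IV has ratio 0; invalid IVs sharing a ratio form one group.
    """
    # Group IVs by ratio, rounding to absorb floating-point noise.
    groups = Counter(round(r, 9) for r in pleiotropy_ratios)
    n_valid = groups.get(0.0, 0)
    largest_invalid_group = max(
        (count for ratio, count in groups.items() if ratio != 0.0), default=0
    )
    # Assumption: majority rule read as "strictly more than half are valid".
    majority = n_valid > len(pleiotropy_ratios) / 2
    # Plurality rule: valid IVs outnumber every single group of invalid IVs.
    plurality = n_valid > largest_invalid_group
    return majority, plurality

# Hypothetical example: 3 valid IVs out of 7, invalid IVs split into small groups.
example = [0.0, 0.0, 0.0, 0.4, 0.4, -0.3, 0.7]
print(check_rules(example))
```

In this example fewer than half of the candidate IVs are valid, so the majority rule fails, yet the valid IVs still form the largest group, so the plurality rule holds.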

Despite many existing efforts, current MR methods still face new challenges when dealing with pQTL IVs for analyzing proteomics data. First, the number of pQTLs for each protein tends to be small. For example, in two proteomics studies, the median number of pQTLs per protein is four20,34. With such a limited number of IVs, MR methods based on the InSIDE assumption, which requires a large number of IVs, or on other distributional assumptions might yield unreliable results in the presence of invalid IVs15,28. Second, current MR methods require an ad-hoc set of pre-determined genetic IVs, which is often obtained by selecting genetic variants with strong pQTL-protein associations in proteomics data35. Since this traditional way of selecting IVs requires only the proteomics data, the same set of selected IVs is used for assessing the causal relationships between the protein in view and different health outcomes. This one-size-fits-all strategy for selecting IVs might not work well for different outcomes because the underlying genetic architecture may vary across outcomes. For example, the pattern of horizontal pleiotropy might vary across different outcomes. Therefore, it is desirable to develop an automatic and computationally efficient algorithm to select a set of valid genetic IVs for a specific protein-outcome pair to perform reliable causal inference, especially when the number of candidate pQTL IVs is small.

In this paper, we develop a novel all-in-one pipeline for causal protein biomarker identification and 3D structural alteration prediction using large-scale genetics, proteomics and phenotype/disease data, as illustrated in Figure 1. Specifically, we propose a two-sample MR method and algorithm that can automatically Select valid pQTL IVs and then perform robust Post-selection Inference (MR-SPI) for the causal effect of proteins on the health outcome of interest. The key idea of MR-SPI is based on the Anna Karenina Principle which states that all valid instruments are alike, while each invalid instrument is invalid in its own way – paralleling Leo Tolstoy’s dictum that “all happy families are alike; each unhappy family is unhappy in its own way”36. In other words, valid instruments will form a group and should provide similar ratio estimates of the causal effect, while the ratio estimates of invalid instruments are more likely to be different from each other. With the application of MR-SPI, we can not only identify the causal protein biomarkers associated with disease outcomes, but can also obtain missense genetic variations (used as pQTL IVs) for those identified causal proteins. These missense pQTL IVs will induce changes of amino acids, leading to 3D structural changes of these proteins. A classic example of a missense mutation was found in sickle cell disease, where the mutation at SNP rs334, located on chromosome 11 (11p15.4), results in the change of codon 6 of the beta globin chain from [GAA] to [GTA]37-39. This substitution leads to the replacement of glutamic acid with valine at position 6 of the beta chain of the hemoglobin protein, altering the structure and function of the hemoglobin protein. Consequently, red blood cells assume a crescent or sickle-shaped morphology, impairing blood flow to various parts of the body40,41.
Moreover, to further offer novel biological insights into the interpretation of the causal effect at the molecular level, we incorporate AlphaFold342-45 into our pipeline to predict the 3D structural alteration resulting from the corresponding missense pQTL IVs for the causal proteins identified by MR-SPI. Our pipeline can elucidate the mechanistic underpinnings of how missense genetic variations translate into 3D structural alterations at the protein level, thereby advancing our understanding of disease etiology and potentially informing targeted therapeutic interventions.

![Figure 1.](https://www.medrxiv.org/content/medrxiv/early/2024/10/02/2023.02.20.23286200/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/F1) Overview of the pipeline. First, we apply MR-SPI for each protein to (1) select valid pQTL IVs under the plurality rule condition, and (2) estimate the causal effect on the outcome of interest. Second, we perform the Bonferroni correction procedure for causal protein identification. Third, for each causal protein biomarker, we apply AlphaFold3 to predict the 3D structural alterations due to missense pQTL IVs selected by MR-SPI.

Our pipeline integrates, for the first time, the identification of causal protein biomarkers for health outcomes and the subsequent analysis of their 3D structural alterations into a unified framework, leveraging increasingly available public GWAS summary statistics for health research. Within our framework, the proposed MR-SPI serves dual purposes: (1) identifying causal protein biomarkers; and (2) selecting valid missense pQTL IVs for subsequent 3D structural analysis. Compared to existing two-sample MR methods, MR-SPI is the first MR method that utilizes both exposure and outcome data to automatically select a set of valid IVs, especially when the number of candidate IVs is small in proteomics data, which is a prominent challenge with no satisfactory solution to date. We note that while MR-PRESSO also selects valid IVs for MR analysis, it requires both the stronger majority rule condition and the InSIDE assumption, as well as a minimum number of four candidate IVs for implementation29. In contrast, our proposed MR-SPI does not require the InSIDE assumption, requires only the plurality rule condition, which is weaker than the majority rule condition, and requires a minimum of only three IVs for the proposed voting procedure. Therefore, MR-SPI is more suitable for analyzing proteomics data. Extensive simulations show that our MR-SPI method outperforms other competing MR methods under the plurality rule condition. We employ MR-SPI to perform omics MR (xMR) with 912 plasma proteins using the large-scale UK Biobank proteomics data in 54,306 UK Biobank participants20 and find 7 proteins significantly associated with the risk of Alzheimer’s disease.
We further use AlphaFold342-45 to predict the 3D structural alterations of these 7 proteins due to missense genetic variations, and then illustrate the structural alterations graphically using the PyMOL software ([https://pymol.org](https://pymol.org)), which provides new biological insights into their functional roles in AD development and may aid in identifying potential drug targets.

## 2. Results

### 2.1 Overview of the pipeline

Our proposed all-in-one pipeline for the identification and 3D structural alteration prediction of causal protein biomarkers consists of three primary steps, as illustrated in Figure 1. First, for each protein biomarker, we employ our proposed MR-SPI to select valid pQTL IVs by incorporating the proteomics GWAS and disease outcome GWAS summary data together, and then estimate the causal effect of each protein on the outcome using the selected valid pQTL IVs. The main idea and more detailed implementation steps for MR-SPI are described in Section 2.2. Second, we perform Bonferroni correction46 for the *p*-values of the estimated causal effects to identify putative causal protein biomarkers associated with the outcome. Third, for each identified protein, we apply AlphaFold3 to predict and compare the 3D structures of both the wild-type protein and the mutated protein resulting from missense pQTL IVs.
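The second step can be sketched in a few lines of Python. The sketch below applies a Bonferroni correction at a family-wise level of 0.05 across 912 tests, the number of proteins analyzed in this paper. The protein names and *p*-values are hypothetical placeholders rather than results, and `bonferroni_select` is an illustrative helper name, not part of any released MR-SPI software.

```python
def bonferroni_select(p_values, alpha=0.05):
    """Return the Bonferroni threshold and the names whose p-value passes it."""
    threshold = alpha / len(p_values)
    selected = sorted(name for name, p in p_values.items() if p < threshold)
    return threshold, selected

# Hypothetical p-values for 912 proteins: 910 unremarkable values plus two small ones.
p_values = {f"protein_{i}": 0.5 for i in range(910)}
p_values["protein_A"] = 1e-6   # hypothetical, below the threshold
p_values["protein_B"] = 1e-4   # hypothetical, above the threshold

threshold, selected = bonferroni_select(p_values)
print(f"threshold = {threshold:.3g}")
print(selected)
```

With 912 tests the threshold is 0.05/912, roughly 5.48e-05, so only the first of the two hypothetical small *p*-values is retained.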

### 2.2 MR-SPI selects valid genetic instruments by a voting procedure

MR-SPI is an automatic procedure to select valid pQTL instruments and perform robust causal inference using two-sample GWAS and proteomics data. In summary, MR-SPI consists of the following four steps, as illustrated in Figure 2:

![Figure 2.](https://www.medrxiv.org/content/medrxiv/early/2024/10/02/2023.02.20.23286200/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/F2) The MR-SPI framework. First, MR-SPI selects relevant IVs with strong pQTL-protein associations. Second, each relevant IV provides a ratio estimate of the causal effect and then receives votes on its validity from the other relevant IVs whose degrees of violation of (A2) and (A3) are small under this ratio estimate of the causal effect. For example, by assuming pQTL 1 is valid, the slope of the line connecting pQTL 1 and the origin represents the ratio estimate of pQTL 1, and pQTLs 2 and 3 vote for pQTL 1 to be valid because they are close to that line, while pQTLs 4, 5 and 6 vote against it since they are far away from that line. Third, MR-SPI estimates the causal effect by fitting a zero-intercept OLS regression of pQTL-outcome associations on pQTL-protein associations and constructs the robust confidence interval using the selected valid pQTL IVs in the maximum clique of the voting matrix, which encodes whether two pQTLs mutually vote for each other to be valid IVs.

*   (1). select relevant pQTL IVs that are strongly associated with the protein;

*   (2). each relevant pQTL IV provides a ratio estimate of the causal effect, and then each of the other relevant pQTL IVs votes for it to be a valid IV if its degree of violation of assumptions (A2) and (A3) is smaller than a data-dependent threshold as in equation (4);

*   (3). select valid pQTL IVs by majority/plurality voting or by finding the maximum clique of the voting matrix that encodes whether two relevant pQTL IVs mutually vote for each other to be valid (the voting matrix is defined in equation (6) in STAR (structured, transparent, accessible reporting) Methods);

*   (4). estimate the causal effect using the selected valid pQTL IVs and construct a robust confidence interval with guaranteed nominal coverage even in the presence of possible IV selection error in finite samples.

Most current two-sample MR methods only use step (1) to select (relevant) pQTL instruments for downstream MR analysis, while the selected pQTL instruments might violate assumptions (A2) and (A3), leading to possibly unreliable scientific findings. To address this issue, MR-SPI automatically selects valid pQTL instruments for a specific protein-outcome pair by further incorporating the outcome GWAS data. Our key idea for selecting valid pQTL instruments is that, under the plurality rule condition, valid IVs will form the largest group and should give “similar” ratio estimates according to the Anna Karenina Principle (see STAR Methods). More specifically, we propose the following two criteria to measure the similarity between the ratio estimates of two pQTLs *j* and *k* in step (2):

**C1** We say the *k* th pQTL “votes for” the *j* th pQTL to be a valid IV if, by assuming the *j* th pQTL is valid, the *k* th pQTL’s degree of violation of assumptions (A2) and (A3) is smaller than a data-dependent threshold as in equation (4);

**C2** We say the ratio estimates of two pQTLs *j* and *k* are “similar” if they mutually vote for each other to be valid.

In step (3), we construct a symmetric binary voting matrix to encode the votes that each relevant pQTL receives from other relevant pQTLs: the (*k, j*) entry of the voting matrix is 1 if pQTLs *j* and *k* mutually vote for each other to be valid, and 0 otherwise. We propose two ways to select valid pQTL IVs based on the voting matrix (see STAR Methods): (1) select the relevant pQTLs that receive majority or plurality votes as valid IVs; and (2) use the pQTLs in the maximum clique of the voting matrix as valid IVs47. Our simulation studies show that the maximum clique method can empirically offer a lower false discovery rate (FDR)48 and a higher true positive proportion (TPP), as shown in Table S4 and Supplementary Section S6.
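The voting and maximum clique steps can be illustrated with a minimal Python sketch. It is a simplified stand-in, not the MR-SPI implementation: the data-dependent threshold of equation (4) is replaced by a fixed multiple `crit` of an approximate standard error, the standard error formula assumes independent estimation errors, and all names (`gamma`, `Gamma`, `se_gamma`, `se_Gamma`, `votes_for`, `voting_matrix`, `maximum_clique`) and all numbers are hypothetical. The toy data mimic Figure 2, with three pQTLs on a common line through the origin and three pQTLs off that line.

```python
import math

def votes_for(j, k, gamma, Gamma, se_gamma, se_Gamma, crit):
    """True if pQTL k votes for pQTL j to be valid (simplified version of C1)."""
    beta_j = Gamma[j] / gamma[j]           # ratio estimate assuming j is valid
    pi_k = Gamma[k] - beta_j * gamma[k]    # k's degree of violation under beta_j
    ratio = gamma[k] / gamma[j]
    # Approximate variance of pi_k, assuming independent estimation errors.
    var = (se_Gamma[k] ** 2 + beta_j ** 2 * se_gamma[k] ** 2
           + ratio ** 2 * (se_Gamma[j] ** 2 + beta_j ** 2 * se_gamma[j] ** 2))
    return abs(pi_k) <= crit * math.sqrt(var)

def voting_matrix(gamma, Gamma, se_gamma, se_Gamma, crit=2.0):
    """Symmetric 0/1 matrix: entry (k, j) is 1 if j and k mutually vote (C2)."""
    p = len(gamma)
    args = (gamma, Gamma, se_gamma, se_Gamma, crit)
    V = [[0] * p for _ in range(p)]
    for j in range(p):
        for k in range(p):
            if j == k or (votes_for(j, k, *args) and votes_for(k, j, *args)):
                V[k][j] = 1
    return V

def maximum_clique(V):
    """Largest set of pQTLs that all mutually vote for each other (Bron-Kerbosch)."""
    p = len(V)
    nbrs = [{k for k in range(p) if k != j and V[j][k]} for j in range(p)]
    best = set()

    def expand(R, P, X):
        nonlocal best
        if not P and not X:
            if len(R) > len(best):
                best = set(R)
            return
        for v in list(P):
            expand(R | {v}, P & nbrs[v], X & nbrs[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(range(p)), set())
    return sorted(best)

# Hypothetical summary statistics for six relevant pQTLs (0-based indices).
gamma = [0.10, 0.20, 0.30, 0.15, 0.25, 0.20]    # pQTL-protein associations
Gamma = [0.05, 0.10, 0.15, 0.20, -0.05, 0.36]   # pQTL-outcome associations
se_gamma = [0.005] * 6
se_Gamma = [0.005] * 6

V = voting_matrix(gamma, Gamma, se_gamma, se_Gamma)
valid = maximum_clique(V)
print(valid)
```

An exhaustive clique search is affordable here because the number of relevant pQTLs per protein is small; for larger candidate sets a dedicated graph library would be a more sensible choice.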

In step (4), we estimate the causal effect by fitting a zero-intercept ordinary least squares regression of pQTL-outcome associations on pQTL-protein associations using the set of selected valid pQTL IVs, and then construct a standard confidence interval for the causal effect using standard linear regression theory. In finite samples, some invalid IVs with small (but still nonzero) degrees of violation of assumptions (A2) and (A3) might be incorrectly selected as valid IVs, commonly referred to as “locally invalid IVs”49. To address this possible issue, we propose to construct a robust confidence interval with a guaranteed nominal coverage even in the presence of IV selection error in finite-sample settings using a searching and sampling method49, as described in Supplementary Figure S17 and STAR Methods.
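The point estimate and the standard (non-robust) interval of step (4) can be sketched as follows. This is an illustration under stated assumptions rather than the MR-SPI implementation: the residual-based standard error and the normal quantile are simplifications chosen for this sketch, the function name `zero_intercept_ols` and the input values are hypothetical, and the robust searching and sampling interval is not reproduced here.

```python
import math
from statistics import NormalDist

def zero_intercept_ols(gamma, Gamma, level=0.95):
    """Regress pQTL-outcome associations on pQTL-protein associations, no intercept."""
    m = len(gamma)
    sxx = sum(g * g for g in gamma)
    beta = sum(g * G for g, G in zip(gamma, Gamma)) / sxx
    # Residual variance with m - 1 degrees of freedom (one slope, no intercept).
    rss = sum((G - beta * g) ** 2 for g, G in zip(gamma, Gamma))
    se = math.sqrt(rss / (m - 1) / sxx)
    # Simplification: normal quantile instead of a t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return beta, se, (beta - z * se, beta + z * se)

# Hypothetical associations for three selected valid pQTL IVs.
gamma_valid = [0.10, 0.20, 0.30]
Gamma_valid = [0.052, 0.098, 0.151]

beta_hat, se_hat, ci = zero_intercept_ols(gamma_valid, Gamma_valid)
print(beta_hat, se_hat, ci)
```

Because the line is forced through the origin, the slope is simply the sum of the cross-products divided by the sum of squared pQTL-protein associations, which is a weighted average of the individual ratio estimates.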

### 2.3 Comparing MR-SPI to other competing MR methods in simulation studies

We conduct extensive simulations to evaluate the performance of MR-SPI in the presence of invalid IVs. We simulate data in a two-sample setting under four setups: (**S1**) majority rule condition holds, and no locally invalid IVs exist; (**S2**) plurality rule condition holds, and no locally invalid IVs exist; (**S3**) majority rule condition holds, and locally invalid IVs exist; (**S4**) plurality rule condition holds, and locally invalid IVs exist. More detailed simulation settings are described in STAR Methods. We compare MR-SPI to the following competing MR methods: (1) the random-effects IVW method27, (2) MR-RAPS16, (3) MR-PRESSO29, (4) the weighted median method30, (5) the mode-based estimation31, (6) MRMix32, and (7) the contamination mixture method33. We exclude MR-Egger in this simulation since it is heavily biased in our simulation settings. For simplicity, we shall use IVW to represent the random-effects IVW method hereafter.

In Figure 3, we present the percent bias, empirical coverage, and average lengths of 95% confidence intervals of those MR methods in simulated data with a sample size of 5,000 for both the exposure and the outcome. Additional simulation results under a range of sample sizes (n=5,000, 10,000, 20,000, 40,000, 80,000) can be found in Supplementary Figure S1 and Tables S1-S3. When the plurality rule condition holds and no locally invalid IVs exist, MR-SPI has small bias and short confidence intervals, and the empirical coverage can attain the nominal level. When locally invalid IVs exist, the standard confidence interval might suffer from finite-sample IV selection error, and thus the empirical coverage is lower than 95% if the sample sizes are not large (e.g., 5,000). In practice, we can perform sensitivity analysis of the causal effect estimate by changing the threshold in the voting step (see STAR Methods and Supplementary Figure S14). If the causal effect estimate is sensitive to the choice of the threshold, then there might exist finite-sample IV selection error. In such cases, the proposed robust confidence interval of MR-SPI can still attain the 95% coverage level and thus is recommended for use. We also examine the performance of MR-SPI in overlapped samples mimicking real data settings, with the simulation set-up and results given in Supplementary Section S8, and we find that MR-SPI can still provide valid statistical inference.
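The three performance measures reported in Figure 3 can be computed from simulation replicates as in the short Python sketch below. The exact definitions used in the paper are given outside this section, so the formulas here (percent bias as 100 × (estimate − truth) / truth, coverage as the fraction of intervals containing the truth) are assumptions, and the replicate values and the helper name `summarize` are hypothetical.

```python
def summarize(estimates, intervals, beta_true):
    """Percent bias per replicate, empirical coverage, and average CI length."""
    n = len(estimates)
    percent_bias = [100 * (b - beta_true) / beta_true for b in estimates]
    coverage = sum(lo <= beta_true <= hi for lo, hi in intervals) / n
    avg_length = sum(hi - lo for lo, hi in intervals) / n
    return percent_bias, coverage, avg_length

# Hypothetical results from four simulation replicates with a true effect of 0.5.
estimates = [0.45, 0.50, 0.55, 0.60]
intervals = [(0.40, 0.50), (0.42, 0.58), (0.51, 0.59), (0.50, 0.70)]

percent_bias, coverage, avg_length = summarize(estimates, intervals, beta_true=0.5)
print(percent_bias, coverage, avg_length)
```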

![Figure 3.](https://www.medrxiv.org/content/medrxiv/early/2024/10/02/2023.02.20.23286200/F3.medium.gif)

[Figure 3.](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/F3) Empirical performance of MR-SPI and the other competing MR methods in simulated data with sample size 5,000. (a) Boxplot of the percent bias in causal effect estimates. (b) Empirical coverage of 95% confidence intervals. The black dashed line in (b) represents the nominal level (95%). (c) Average lengths of 95% confidence intervals.

### 2.4 Identifying plasma proteins associated with the risk of Alzheimer’s disease

Omics MR (xMR) aims to identify omics biomarkers (e.g., proteins) causally associated with complex traits and diseases. In particular, xMR with proteomics data enables the identification of disease-associated proteins, facilitating crucial advancements in disease diagnosis, monitoring, and novel drug target discovery. In this section, we apply MR-SPI to identify putative causal plasma protein biomarkers associated with the risk of Alzheimer’s disease (AD). The proteomics data used in our analysis comprise 54,306 participants from the UK Biobank Pharma Proteomics Project (UKB-PPP)20. In the UKB-PPP consortium, up to 22.6 million imputed autosomal variants across 1,463 proteins post quality control were analyzed, discovering 10,248 primary associations through LD (Linkage Disequilibrium) clumping ±1Mb around the significant variants, including 1,163 in the *cis* region and 9,085 in the *trans* region20. As described in Sun et al.20, the following filtering steps are used to retain pQTLs in the UKB-PPP summary-level proteomics data: (1) genome-wide significance (*p*-value < $3.40 \times 10^{-11}$) after Bonferroni correction; and (2) independence of pQTLs, obtained using LD clumping ($r^2 < 0.01$). Thus, all these candidate pQTL IVs are independent and strongly associated with the proteins. Summary statistics for AD are obtained from a meta-analysis of GWAS studies for clinically diagnosed AD and AD-by-proxy, comprising 455,258 samples in total50. For MR method comparison, we analyze 912 plasma proteins that share four or more candidate pQTLs within the summary statistics for AD, because the implementation of MR-PRESSO requires a minimum of four candidate IVs29.
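The last selection rule, keeping only proteins with at least four candidate pQTLs that are also present in the AD summary statistics, can be sketched as a simple set intersection in Python. The protein names, variant identifiers, and the helper name `eligible_proteins` are hypothetical placeholders, and the sketch assumes that the significance filter and LD clumping have already been applied upstream.

```python
def eligible_proteins(pqtls_by_protein, outcome_snps, min_ivs=4):
    """Keep proteins with at least `min_ivs` pQTLs found in the outcome GWAS."""
    shared = {
        protein: sorted(set(snps) & outcome_snps)
        for protein, snps in pqtls_by_protein.items()
    }
    return {p: snps for p, snps in shared.items() if len(snps) >= min_ivs}

# Hypothetical candidate pQTLs per protein (already significant and clumped).
pqtls_by_protein = {
    "protein_A": ["snp1", "snp2", "snp3", "snp4", "snp5"],
    "protein_B": ["snp2", "snp6", "snp7"],
    "protein_C": ["snp1", "snp3", "snp8", "snp9"],
}
# Hypothetical set of variants available in the outcome summary statistics.
outcome_snps = {"snp1", "snp2", "snp3", "snp4", "snp6", "snp7", "snp8"}

kept = eligible_proteins(pqtls_by_protein, outcome_snps)
print(kept)
```

The threshold of four reflects the MR-PRESSO requirement used for the method comparison; MR-SPI itself needs only three IVs, so `min_ivs=3` would be the corresponding choice for an MR-SPI-only analysis.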

As presented in Figure 4(a), MR-SPI identifies 7 proteins that are significantly associated with AD after Bonferroni correction, including CD33, CD55, EPHA1, PILRA, PILRB, RET, and TREM2. The detailed information of the selected pQTL IVs for these 7 proteins can be found in Supplementary Table S6. Among them, four proteins (CD33, PILRA, PILRB, and RET) are positively associated with the risk of AD while the other three proteins (CD55, EPHA1, and TREM2) are negatively associated with the risk of AD. We also note that some competing MR methods may detect additional proteins, which are likely spurious due to invalid pQTL IVs, as demonstrated in Supplementary Section 11. Previous studies have revealed that some of those 7 proteins and the corresponding protein-coding genes might contribute to the pathogenesis of AD51-56, as shown in Supplementary Table S8. For example, it has been found that CD33 plays a key role in modulating microglial pathology in AD, with TREM2 acting downstream in this regulatory pathway53. Besides, a recent study has shown that a higher level of soluble TREM2 is associated with protection against the progression of AD pathology57. Additionally, RET at mitochondrial complex I is activated during ageing, which might contribute to an increased risk of ageing-related diseases including AD55. Using the UniProt database58, we also find that genes encoding these 7 proteins are overexpressed in tissues including hemopoietic tissues and brain, as well as cell types including microglia, macrophages and dendritic cells. These findings highlight the potential therapeutic opportunities that target these proteins for the treatment of AD. Furthermore, in the Therapeutic Target Database (TTD)59 and DrugBank database60, we find existing US Food and Drug Administration (FDA)-approved drugs that target these proteins identified by MR-SPI.
For example, gemtuzumab ozogamicin is a drug that targets CD33 and has been approved by the FDA for acute myeloid leukemia therapy61,62. Besides, pralsetinib and selpercatinib are two RET inhibitors that have been FDA-approved for the treatment of non-small-cell lung cancers63,64. Therefore, these drugs might be potential drug repurposing candidates for the treatment of AD.

![Figure 4.](https://www.medrxiv.org/content/medrxiv/early/2024/10/02/2023.02.20.23286200/F4.medium.gif)

[Figure 4.](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/F4) **(a)** Volcano plot of associations of plasma proteins with Alzheimer’s disease using MR-SPI. The horizontal axis represents the estimated effect size (on the log odds ratio scale), and the vertical axis represents $-\log_{10}$(*p*-value). Positive and negative associations are represented by green and red points, respectively. The size of a point is proportional to $-\log_{10}$(*p*-value). The blue dashed line represents the significance threshold using Bonferroni correction (*p*-value < $5.48 \times 10^{-5}$). **(b)** 3D structural alterations of CD33 predicted by AlphaFold3 due to missense genetic variation of pQTL rs2455069. The ribbon representations of the 3D structures of CD33 with Arginine and Glycine at position 69 are colored in blue and red, respectively. The amino acids at position 69 are displayed in stick representation, with Arginine and Glycine colored in green and yellow, respectively. The predicted template modeling (pTM) yields a score of 0.6 for both structures, which suggests that AlphaFold3 provides good predictions for these two 3D structures. **(c)** Forest plot of significant associations of proteins with Alzheimer’s disease identified by MR-SPI. Confidence intervals are clipped to vertical axis limits. **(d)** Bubble plot of GO analysis results using the 7 significant proteins detected by MR-SPI. The horizontal axis represents the *z*-score of the enriched GO term, and the vertical axis represents $-\log_{10}$(*p*-value) after Bonferroni correction. Each point represents one enriched GO term. The blue dashed line represents the significance threshold (adjusted *p*-value < 0.05 after Bonferroni correction).

In Figure 4(b), we present the 3D structural alterations of CD33 due to missense genetic variation of pQTL rs2455069, as predicted by AlphaFold342,43,45. The 3D structures are shown in blue when the allele is A, and in red when the allele is G at pQTL rs2455069 A/G, which is a cis-SNP located on chromosome 19 (19q13.41) and is selected as a valid IV by MR-SPI. The presence of the G allele at pQTL rs2455069 results in the substitution of the 69th amino acid of CD33, changing it from Arginine (colored in green if the allele is A) to Glycine (colored in yellow if the allele is G), consequently causing a local change in the structure of CD33 (R69G). Previous studies have found that CD33 is overexpressed in microglial cells in the brain65, and the substitution of Arginine to Glycine in the 69th amino acid of CD33 might lead to the accumulation of amyloid plaques in the brain66, thus the presence of the G allele at pQTL rs2455069 might contribute to an increased risk of AD. We also apply AlphaFold3 to predict the 3D structures of the other proteins that are detected to be significantly associated with AD by MR-SPI, which are presented in Supplementary Figure S16.
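A structure prediction for a missense variant needs the mutated amino acid sequence as input alongside the wild-type sequence. The minimal Python sketch below applies a single substitution written in the usual reference-position-alternative form (such as R69G) to a sequence. How sequences are actually prepared in the pipeline is not described in this section, so the helper name `apply_missense` is an assumption, and the short toy sequence is a hypothetical placeholder, not the CD33 sequence.

```python
def apply_missense(sequence, position, ref, alt):
    """Return `sequence` with residue `ref` at 1-based `position` replaced by `alt`."""
    idx = position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    if sequence[idx] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {sequence[idx]}"
        )
    return sequence[:idx] + alt + sequence[idx + 1:]

# Hypothetical 9-residue toy sequence with Arginine (R) at position 6.
wild_type = "MKTAYRLGE"
mutant = apply_missense(wild_type, position=6, ref="R", alt="G")
print(wild_type)
print(mutant)
```

Checking the reference residue before substituting guards against off-by-one errors and against mismatches between the variant annotation and the protein isoform being modeled.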

In Figure 4(c), we present the point estimates and 95% confidence intervals of the causal effects (on the log odds ratio scale) of these 7 proteins on AD using the other competing MR methods. In Figure 4(c), these proteins are identified by most of the competing MR methods, confirming the robustness of our findings. To the best of our knowledge, there may be two reasons for the differences in results between MRMix and other MR methods for some proteins: (1) MRMix assumes that the pQTL-protein and pQTL-outcome associations follow a bivariate normal-mixture model with four mixture components while the contamination mixture models assume that the ratio estimator follows a normal distribution with two mixture components, and therefore it may be more challenging to obtain reliable causal effect estimates using the MRMix model with a small number of pQTLs per protein; and (2) the default grid search values implemented in the MRMix R package might not be optimal for some proteins. Notably, MR-SPI detects one possibly invalid IV pQTL rs10919543 for TREM2-AD relationship, which is associated with red blood cell count according to PhenoScanner67. Red blood cell count is a known risk factor for AD68,69, and thus pQTL rs10919543 might exhibit pleiotropy in the relationship of TREM2 on AD. After excluding this potentially invalid IV, MR-SPI suggests that TREM2 is negatively associated with the risk of AD (![Graphic][1]</img>, *p*-value= 1.20 × 10−18). Additionally, we perform the gene ontology (GO) enrichment analysis using the g:Profiler web server70 ([https://biit.cs.ut.ee/gprofiler/gost](https://biit.cs.ut.ee/gprofiler/gost)) to gain more biological insights for the 7 proteins identified by MR-SPI, and the results are presented in Figure 4(d) and Supplementary Table S7. 
After Bonferroni correction, the GO analysis indicates that these 7 proteins are significantly enriched in 20 GO terms, notably the positive regulation of phosphorus metabolic process and major histocompatibility complex (MHC) class I protein binding. It has been found that increased phosphorus metabolites (e.g., phosphocreatine) are associated with aging, and that defects in metabolic processes for phospholipid membrane function are involved in the pathological progression of Alzheimer’s disease71,72. In addition, MHC class I proteins may play a crucial role in preserving brain integrity during post-developmental stages, and modulation of the stability of MHC class I proteins emerges as a potential therapeutic target for restoring synaptic function in AD73-75.

## 3. Discussion

In this paper, we develop a novel integrated pipeline that combines our proposed MR-SPI method with AlphaFold3 to identify putative causal protein biomarkers for complex traits/diseases and to predict the 3D structural alterations induced by missense pQTL IVs. Specifically, MR-SPI is an automatic algorithm that selects valid pQTL IVs under the plurality rule condition for a specific protein-outcome pair from two-sample GWAS summary statistics. MR-SPI first selects relevant pQTL IVs with strong pQTL-protein associations to minimize weak IV bias, and then applies the proposed voting procedure to select valid pQTL IVs whose ratio estimates are similar to each other. In the possible presence of locally invalid IVs in finite-sample settings, MR-SPI further provides a robust confidence interval constructed by the searching and sampling method49, which is immune to finite-sample IV selection error. The valid pQTL IVs selected by MR-SPI serve dual purposes: (1) facilitating more reliable scientific discoveries in identifying putative causal proteins associated with diseases; and (2) shedding new light on the molecular-level mechanism of causal proteins in disease etiology through the 3D structural alterations of mutated proteins induced by missense pQTL IVs. We employ MR-SPI to conduct xMR analysis with 912 plasma proteins using the proteomics data in 54,306 UK Biobank participants and identify 7 proteins significantly associated with the risk of Alzheimer’s disease. The 3D structural changes in these proteins, as predicted by AlphaFold3 in response to missense genetic variations of selected pQTL IVs, offer new insights into their biological functions in the etiology of Alzheimer’s disease. We also found existing FDA-approved drugs that target some of our identified proteins, which provides opportunities for repurposing existing drugs for the treatment of Alzheimer’s disease.
These findings highlight the great potential of our proposed pipeline for identifying protein biomarkers as new therapeutic targets and drug repurposing for disease prevention and treatment.

We emphasize three main advantages of MR-SPI. First, MR-SPI incorporates both proteomics and outcome data to automatically select a set of valid pQTL instruments in genome-wide studies, and the selection procedure neither relies on additional distributional assumptions on the genetic effects nor requires a large number of candidate IVs. Therefore, MR-SPI is the first method to offer such a practically robust approach to selecting valid pQTL IVs for a specific exposure-outcome pair from GWAS for more reliable MR analyses, which is especially advantageous in the presence of widespread horizontal pleiotropy and when only a small number of candidate IVs are available in xMR studies. While our real data application focuses on identifying putative causal protein biomarkers for Alzheimer’s disease through the integration of MR-SPI with AlphaFold3, MR-SPI is more broadly applicable to elucidating causal relationships across complex traits and diseases. For additional data analysis results and insights into the utility of MR-SPI in this context, please refer to Supplementary Sections S9 and S10. Second, we propose a robust confidence interval for the causal effect using the searching and sampling method, which is immune to finite-sample IV selection error. Therefore, when locally invalid IVs are incorrectly selected, MR-SPI can still provide reliable statistical inference for the causal effect using the proposed robust confidence interval. Third, MR-SPI is computationally efficient. The average computation time for constructing the standard CI and the robust CI with 20 candidate IVs is 0.02 seconds and 10.60 seconds, respectively, on a server equipped with an Intel Xeon Silver 4116 CPU and 64 GB of RAM.

MR-SPI has some limitations. First, MR-SPI uses independent pQTLs as candidate IVs after LD clumping, which might exclude strong and valid pQTL IVs. We plan to extend MR-SPI to include correlated pQTLs with arbitrary LD structure to increase statistical power. Second, the proposed robust confidence interval is slightly more conservative than the standard confidence interval, which is the price to pay for the gained robustness to finite-sample IV selection error. We plan to construct less conservative confidence intervals with improved power to detect more putative causal proteins. Third, we will incorporate colocalization analysis76-79 into our pipeline to better understand the shared genetic architecture between proteins and disease outcomes when unfiltered GWAS summary statistics are available in future studies.

In conclusion, MR-SPI is a powerful tool for identifying putative causal protein biomarkers for complex traits and diseases. The integration of MR-SPI with AlphaFold3 as a computationally efficient pipeline can further predict the 3D structural alterations caused by missense pQTL IVs, improving our understanding of molecular-level disease mechanisms. Therefore, our pipeline holds promising implications for drug target discovery, drug repurposing, and therapeutic development.

## STAR Methods

### Key resources table

View this table:
[Table2](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/T2)

### Resource availability

#### Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Zhonghua Liu (zl2509{at}cumc.columbia.edu)

#### Materials availability

The materials that support the findings of this study are available from the corresponding authors upon reasonable request. Please contact the lead contact, Zhonghua Liu (zl2509{at}cumc.columbia.edu) for additional information.

#### Data and code availability

All the GWAS data analyzed are publicly available with the following URLs:

*   GWAS for Alzheimer’s disease: [https://ctg.cncr.nl/software/summary\_statistics](https://ctg.cncr.nl/software/summary_statistics);

*   UK Biobank proteomics data: [https://www.biorxiv.org/content/10.1101/2022.06.17.496443v1.supplementary-material](https://www.biorxiv.org/content/10.1101/2022.06.17.496443v1.supplementary-material)

The R package **MR.SPI** is publicly available at [https://github.com/MinhaoYaooo/MR-SPI](https://github.com/MinhaoYaooo/MR-SPI).

### Method details

#### Two-sample GWAS summary statistics

Suppose that we obtain *p* independent pQTLs ***Z*** = (*Z*1, ⋯, *Z**p*)⊤ by using LD clumping that retains one representative pQTL per LD region81. We also assume that the pQTLs are standardized82 such that 𝔼*Z*j = 0 and Var(*Z*j) = 1 for 1 ≤ *j* ≤ *p*. Let *D* denote the exposure and *Y* denote the outcome. We assume that *D* and *Y* follow the exposure model *D* = ***Z***⊤***γ*** + *δ* and the outcome model *Y* = *Dβ* + ***Z***⊤*π* + *e*, respectively, where *β* represents the causal effect of interest, ***γ*** = (*γ*1, ⋯, *γ**p*)⊤ represents the IV strength, and *π* = (*π*1, ⋯, *π**p*)⊤ encodes the violation of assumptions (A2) and (A3)83,84. If assumptions (A2) and (A3) hold for pQTL *j*, then *π*j = 0; otherwise *π*j ≠ 0 (see Supplementary Section S1 for details). The error terms *δ* and *e* with respective variances ![Graphic][2]</img> and ![Graphic][3]</img> are possibly correlated due to unmeasured confounding factors. By plugging the exposure model into the outcome model, we obtain the reduced-form outcome model *Y* = ***Z***⊤(*β****γ*** + *π*) + *ϵ*, where *ϵ* = *βδ* + *e*. Let **Γ** = (Γ1, ⋯, Γ*p*)⊤ denote the pQTL-outcome associations; then we have **Γ** = *β****γ*** + *π*. If *γ*j ≠ 0, then pQTL *j* is called a relevant IV. If both *γ*j ≠ 0 and *π*j = 0, then pQTL *j* is called a valid IV. Let 𝒮 = {*j*: *γ*j ≠ 0, 1 ≤ *j* ≤ *p*} denote the set of all relevant IVs, and 𝒱 = {*j*: *γ*j ≠ 0 and *π*j = 0, 1 ≤ *j* ≤ *p*} denote the set of all valid IVs. The majority rule condition can be expressed as ![Graphic][4]</img>, and the plurality rule condition can be expressed as |𝒱| > max*c*≠0 |{*j* ∈ 𝒮: *π*j/*γ*j = *c*}| 83. If the plurality rule condition holds, then valid IVs with the same ratio of pQTL-outcome effect to pQTL-protein effect will form a plurality. 
Based on this key observation, our proposed MR-SPI selects the largest group of pQTLs as valid IVs with similar ratio estimates of the causal effect using a voting procedure described in detail in the next subsection.
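As a small illustration of the two identification conditions, the following Python sketch checks them for known (hypothetical) parameter vectors. The plurality rule follows the expression given above; the majority rule is taken in its usual form, |𝒱| > |𝒮|/2, which is an assumption here because the expression appears only as an image in this text. The function name `rule_conditions` and the example vectors are illustrative and are not part of the MR.SPI package.

```python
from collections import Counter


def rule_conditions(gamma, pi, tol=1e-12):
    """Return (majority_holds, plurality_holds) for true parameters gamma and pi."""
    # Relevant IVs: gamma_j != 0. Valid IVs: relevant and pi_j == 0.
    relevant = [j for j, g in enumerate(gamma) if abs(g) > tol]
    valid = [j for j in relevant if abs(pi[j]) <= tol]
    # Group invalid relevant IVs by their ratio c = pi_j / gamma_j (c != 0).
    groups = Counter(
        round(pi[j] / gamma[j], 10) for j in relevant if abs(pi[j]) > tol
    )
    largest_invalid_group = max(groups.values(), default=0)
    # Majority rule (assumed form): more than half of the relevant IVs are valid.
    majority = len(valid) > len(relevant) / 2
    # Plurality rule: valid IVs outnumber every group of invalid IVs sharing a ratio.
    plurality = len(valid) > largest_invalid_group
    return majority, plurality


# Hypothetical example: 4 valid IVs and two invalid groups of size 3 each.
gamma_example = [0.2] * 10
pi_example = [0.0] * 4 + [0.2] * 3 + [-0.2] * 3
print(rule_conditions(gamma_example, pi_example))
```

In this hypothetical example only 4 of 10 relevant IVs are valid, so the majority rule fails, but the valid IVs still form the largest group and the plurality rule holds.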

Let ![Graphic][5]</img> and ![Graphic][6]</img> be the estimated marginal effects of pQTL *j* on the protein and the outcome, and ![Graphic][7]</img> and ![Graphic][8]</img> be the corresponding estimated standard errors respectively. Let ![Graphic][9]</img> and ![Graphic][10]</img> denote the vector of estimated pQTL-protein and pQTL-outcome associations, respectively. In the two-sample setting, the summary statistics ![Graphic][11]</img> and ![Graphic][12]</img> are calculated from two non-overlapping samples with sample sizes *n*1 and *n*2 respectively. When all the pQTLs are independent of each other, the joint asymptotic distribution of ![Graphic][13]</img> and ![Graphic][14]</img> is ![Formula][15]</img>  where the diagonal entries of **V***γ* and **V**Г are ![Graphic][16]</img> and ![Graphic][17]</img>, respectively, and the off-diagonal entries of **V***γ* and **V**Г are ![Graphic][18]</img> and ![Graphic][19]</img>, respectively. The derivation of the limit distribution can be found in Supplementary Section S2. Therefore, with the summary statistics of the protein and the outcome, we estimate the covariance matrices ![Graphic][20]</img> and ![Graphic][21]</img> as: ![Formula][22]</img>  After obtaining ![Graphic][23]</img>, we then perform the proposed IV selection procedure as illustrated in Figure 2 in the main text.

#### Selecting valid instruments by voting

The first step of MR-SPI is to select relevant pQTLs with large IV strength using proteomics data. Specifically, we estimate the set of relevant IVs 𝒮 by: ![Formula][24]</img>  where ![Graphic][25]</img> is the standard error of ![Graphic][26]</img> in the summary statistics, Φ^−1(·) is the quantile function of the standard normal distribution, and α* is the user-specified threshold with the default value of 1 × 10^−6. This step is equivalent to filtering the pQTLs in the proteomics data with *p*-value < *α**, and is adopted by most of the current two-sample MR methods to select (relevant) genetic instruments for downstream MR analysis. Note that the selected pQTL instruments may not satisfy the IV independence and exclusion restriction assumptions and thus may be invalid. In contrast, our proposed MR-SPI further incorporates the outcome data to automatically select a set of valid genetic instruments from ![Graphic][27]</img> for a specific protein-outcome pair.
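The relevance filter can be sketched in a few lines of Python. The sketch assumes a two-sided test, so that a pQTL is kept when its absolute z-score exceeds the (1 − α*/2) quantile of the standard normal distribution; the exact form of the selection rule appears as an image in this text, so this reading is an assumption. The function name `select_relevant` and the input values are hypothetical.

```python
from statistics import NormalDist


def select_relevant(gamma_hat, se_gamma, alpha_star=1e-6):
    """Indices of pQTLs whose pQTL-protein association has p-value < alpha_star."""
    # Critical value of the two-sided z-test (assumed form of the threshold).
    z_crit = NormalDist().inv_cdf(1 - alpha_star / 2)
    return [
        j
        for j, (g, s) in enumerate(zip(gamma_hat, se_gamma))
        if abs(g) / s > z_crit
    ]


# Hypothetical summary statistics: z-scores of 10, 2 and -8.
print(select_relevant([0.10, 0.02, -0.08], [0.01, 0.01, 0.01]))
```

With the default threshold, only pQTLs with an absolute z-score of roughly 4.9 or more are retained, so the second pQTL in this example is dropped.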

Under the plurality rule condition, valid pQTL instruments with the same ratio of pQTL-outcome effect to pQTL-protein effect (i.e., Γj/*γ*j) will form a plurality and yield “similar” ratio estimates of the causal effect. Based on this key observation, MR-SPI selects a plurality of relevant IVs whose ratio estimates are “similar” to each other as valid IVs. Specifically, we propose the following two criteria to measure the similarity between the ratio estimates of two pQTLs *j* and *k* :

**C1**: We say the *k*th pQTL “votes for” the *j*th pQTL to be a valid IV if, by assuming the *j*th pQTL is valid, the *k*th pQTL’s degree of violation of assumptions (A2) and (A3) is smaller than a threshold as in equation (4);

**C2**: We say the ratio estimates of two pQTLs *j* and *k* are “similar” if they mutually vote for each other to be valid IVs.

The ratio estimate of the *j*th pQTL is defined as ![Graphic][28]</img>. By assuming the *j*th pQTL is valid, the plug-in estimate of the *k*th pQTL’s degree of violation of (A2) and (A3) can be obtained by ![Formula][29]</img>  as we have Γ*k* = *βγ**k* + *π**k* for the true causal effect *β*, and ![Graphic][30]</img> for the ratio estimate ![Graphic][31]</img> of the *k*th pQTL. From equation (3), ![Graphic][32]</img> has two noteworthy implications. First, ![Graphic][33]</img> measures the difference between the ratio estimates of pQTLs *j* and *k* (multiplied by the *k*th pQTL-protein effect estimate ![Graphic][34]</img>), and a small ![Graphic][35]</img> implies that the difference scaled by ![Graphic][36]</img> is small. Second, ![Graphic][37]</img> represents the *k*th IV’s degree of violation of assumptions (A2) and (A3) by regarding the *j*th pQTL’s ratio estimate ![Graphic][38]</img> as the true causal effect, thus a small ![Graphic][39]</img> provides strong evidence that the *k*th IV supports the *j*th IV being valid. Therefore, we say the *k*th IV votes for the *j*th IV to be valid if: ![Formula][40]</img>  where ![Graphic][41]</img> is the standard error of ![Graphic][42]</img>, which is given by: ![Formula][43]</img>  and the term ![Graphic][44]</img> in equation (4) ensures that the violation of (A2) and (A3) can be correctly detected with probability one as the sample sizes go to infinity, as shown in Supplementary Section S3.

For each relevant IV in ![Graphic][45]</img>, we collect all relevant IVs’ votes on whether it is a valid IV according to equation (4). Then we construct a voting matrix ![Graphic][46]</img> to summarize the voting results and evaluate the similarity of two pQTLs’ ratio estimates according to criterion C2. Specifically, we define the (*k, j*) entry of ![Graphic][47]</img> as: ![Formula][48]</img>  where *I*(·) is the indicator function such that *I*(*A*) = 1 if event *A* happens and *I*(*A*) = 0 otherwise. From equation (6), we can see that the voting matrix ![Graphic][49]</img> is symmetric, and the entries of ![Graphic][50]</img> are binary: ![Graphic][51]</img> represents pQTLs *j* and *k* vote for each other to be a valid IV, i.e., the ratio estimates of these two pQTLs are close to each other; ![Graphic][52]</img> represents that they do not. For example, in Figure 2, ![Graphic][53]</img> since the ratio estimates of pQTLs 1 and 2 are similar, while ![Graphic][54]</img> because the ratio estimates of pQTLs 1 and 4 differ substantially, as pQTLs 1 and 4 mutually “vote against” each other to be valid according to equation (4).

After constructing the voting matrix ![Graphic][55]</img>, we select the valid IVs by applying majority/plurality voting or finding the maximum clique of the voting matrix47. Let ![Graphic][56]</img> be the total number of pQTLs whose ratio estimates are similar to pQTL *k*. For example, **VM**1 = 3 in Figure 2, since three pQTLs (including pQTL 1 itself) yield similar ratio estimates to pQTL 1 according to criterion **C2**. A large **VM***k* implies strong evidence that pQTL *k* is a valid IV, since we assume that valid IVs form a plurality of the relevant IVs. Let ![Graphic][57]</img> denote the set of IVs with majority voting, and ![Graphic][58]</img> denote the set of IVs with plurality voting, then the union ![Graphic][59]</img> can be a robust estimate of 𝒱 in practice. Alternatively, we can also find the maximum clique in the voting matrix as an estimate of 𝒱. A clique in the voting matrix is a group of IVs who mutually vote for each other to be valid, and the maximum clique is the clique with the largest possible number of IVs47.
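The voting procedure described above can be sketched as follows. Several ingredients appear only as images in this text, so the sketch makes explicit assumptions: the standard error of the plug-in violation estimate is a first-order delta-method approximation that treats all association estimates as independent; the threshold multiplier is taken to be √(log *n*min); the majority set contains IVs receiving votes from more than half of the relevant IVs; and the plurality set contains IVs with the maximum number of votes. The function name `select_valid_by_voting` and the input values are hypothetical, and the maximum-clique alternative is not implemented.

```python
import math


def select_valid_by_voting(gamma_hat, Gamma_hat, se_gamma, se_Gamma, n_min):
    """Return (selected IV indices, votes per IV) for the relevant IVs supplied."""
    p = len(gamma_hat)
    scale = math.sqrt(math.log(n_min))  # assumed threshold multiplier
    votes = [[False] * p for _ in range(p)]  # votes[k][j]: k votes for j
    for j in range(p):
        beta_j = Gamma_hat[j] / gamma_hat[j]  # ratio estimate of pQTL j
        var_j = se_Gamma[j] ** 2 + beta_j ** 2 * se_gamma[j] ** 2
        for k in range(p):
            # Plug-in violation of pQTL k when pQTL j is assumed valid.
            pi_kj = Gamma_hat[k] - beta_j * gamma_hat[k]
            var_k = se_Gamma[k] ** 2 + beta_j ** 2 * se_gamma[k] ** 2
            # Delta-method standard error (assumption, independence ignored for k == j).
            se_kj = math.sqrt(var_k + (gamma_hat[k] / gamma_hat[j]) ** 2 * var_j)
            votes[k][j] = abs(pi_kj) <= scale * se_kj
    # Symmetric voting matrix: entry is 1 only if the two pQTLs vote for each other.
    vm = [[votes[k][j] and votes[j][k] for j in range(p)] for k in range(p)]
    counts = [sum(row) for row in vm]
    majority = {k for k in range(p) if counts[k] > p / 2}
    plurality = {k for k in range(p) if counts[k] == max(counts)}
    return sorted(majority | plurality), counts


# Hypothetical data: pQTLs 0-2 have ratio estimates near 1, pQTL 3 has ratio 3.
selected, counts = select_valid_by_voting(
    gamma_hat=[0.20, 0.25, 0.30, 0.20],
    Gamma_hat=[0.20, 0.26, 0.29, 0.60],
    se_gamma=[0.01] * 4,
    se_Gamma=[0.01] * 4,
    n_min=10000,
)
print(selected, counts)
```

In this example the first three pQTLs mutually vote for each other, while the fourth only receives its own vote and is excluded.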

#### Estimation and inference of the causal effect

After selecting the set of valid pQTL instruments ![Graphic][60]</img>, the causal effect *β* is estimated by ![Formula][61]</img>  where ![Graphic][62]</img> and ![Graphic][63]</img> are the estimates of pQTL-protein associations and pQTL-outcome associations of the selected valid IVs in ![Graphic][64]</img>, respectively. The MR-SPI estimator in equation (7) is the regression coefficient obtained by fitting a zero-intercept ordinary least squares regression of ![Graphic][65]</img> on ![Graphic][66]</img>. Since the pQTLs are standardized, the genetic associations ![Graphic][67]</img> and ![Graphic][68]</img> are scaled by ![Graphic][69]</img> (compared to the genetic associations calculated using the unstandardized pQTLs, denoted by ![Graphic][70]</img> and ![Graphic][71]</img>), where *f*j is the minor allele frequency of pQTL *j*. As *f*j(1− *f*j) is approximately proportional to the inverse variance of ![Graphic][72]</img> when each pQTL IV explains only a small proportion of variance in the outcome85, the MR-SPI estimator of the causal effect in equation (7) is approximately equal to the inverse-variance weighted estimator27 calculated with ![Graphic][73]</img>.

Let α ∈ (0,1) be the significance level and *z*1−α/2 be the (1− α/2)-quantile of the standard normal distribution, then the (1− α) confidence interval for *β* is given by: ![Formula][74]</img>  where ![Graphic][75]</img> is the estimated variance of ![Graphic][76]</img>, which can be found in Supplementary Section S4. As min{*n*1, *n*2} → ∞, we have ![Graphic][77]</img> under the plurality rule condition, as shown in Supplementary Section S5. Hence, MR-SPI provides a theoretical guarantee for the asymptotic coverage probability of the confidence interval under the plurality rule condition.
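The point estimate and the standard confidence interval can be illustrated with a short Python sketch. The estimator is the zero-intercept least squares slope described above. The variance of the estimator is derived in Supplementary Section S4 and is not reproduced here, so the sketch takes it as an input (`var_beta`, a hypothetical argument) and assumes the interval has the usual Wald form. Function names and numbers are illustrative only.

```python
from statistics import NormalDist


def mr_spi_point_estimate(gamma_hat, Gamma_hat):
    """Zero-intercept OLS slope of Gamma_hat on gamma_hat over the selected IVs."""
    num = sum(g * G for g, G in zip(gamma_hat, Gamma_hat))
    den = sum(g * g for g in gamma_hat)
    return num / den


def wald_interval(beta_hat, var_beta, alpha=0.05):
    """Wald-type (1 - alpha) interval; var_beta must be supplied by the caller."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * var_beta ** 0.5
    return beta_hat - half_width, beta_hat + half_width


# Hypothetical associations for three selected IVs and a made-up variance.
beta_hat = mr_spi_point_estimate([0.20, 0.25, 0.30], [0.20, 0.26, 0.29])
ci_low, ci_high = wald_interval(beta_hat, var_beta=0.0025)
print(beta_hat, ci_low, ci_high)
```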

We summarize the proposed procedure of selecting valid IVs and constructing the corresponding confidence interval by MR-SPI in Algorithm 1.

#### A robust confidence interval via searching and sampling

In finite-sample settings, the selected set of relevant IVs ![Graphic][78]</img> might include some invalid IVs whose degrees of violation of (A2) and (A3) are small but nonzero, and we refer to them as “locally invalid IVs”49. When locally invalid IVs exist and are incorrectly selected into ![Graphic][79]</img>, the confidence interval in equation (8) becomes unreliable, since its validity (i.e., the coverage probability attains the nominal level) requires that the invalid IVs are correctly filtered out. In practice, we can multiply the threshold ![Graphic][80]</img> on the right-hand side of equation (4) by a scaling factor *η* to examine whether the confidence interval calculated by equation (8) is sensitive to the choice of the threshold. If the confidence interval varies substantially with the choice of the scaling factor *η*, then there might be finite-sample IV selection error, especially in the presence of locally invalid IVs. We demonstrate this issue with two numerical examples presented in Supplementary Figure S14. Supplementary Figure S14(a) shows an example in which MR-SPI provides robust inference across different values of the scaling factor, while Supplementary Figure S14(b) shows an example in which MR-SPI might suffer from finite-sample IV selection error, as the causal effect estimate and the corresponding confidence interval are sensitive to the choice of the scaling factor *η*. This issue motivates us to develop a more robust confidence interval.

To construct a confidence interval that is robust to finite-sample IV selection error, we borrow the idea of searching and sampling49, with main steps described in Supplementary Figure S17. The key idea is to sample the estimators of ***γ*** and **Γ** repeatedly from the following distribution: ![Formula][81]</img>  where *M* is the number of sampling times (by default, we set *M* = 1,000). Since ![Graphic][82]</img> and ![Graphic][83]</img> follow distributions centered at ***γ*** and **Γ**, there exists *m** such that ![Graphic][84]</img> and ![Graphic][85]</img> are close enough to the true values ***γ*** and **Γ** when the number of sampling times *M* is sufficiently large, and thus the confidence interval obtained by using ![Graphic][86]</img> and ![Graphic][87]</img> instead of ![Graphic][88]</img> and ![Graphic][89]</img> might have a larger probability of covering *β*.

For each sampling, we construct the confidence interval by searching over a grid of *β* values such that more than half of the selected IVs in ![Graphic][90]</img> are detected as valid. As for the choice of grid, we start with the smallest interval [*L, U*] that contains all the following intervals: ![Formula][91]</img>  where ![Graphic][92]</img> is the ratio estimate of the *j* th pQTL IV, ![Graphic][93]</img> is the variance of ![Graphic][94]</img>, and ![Graphic][95]</img> serves the same purpose as in equation (4). Then we discretize [*L, U*] into ℬ = {*b*1, *b*2, ⋯, *b**K*} as the grid set such that *b*1 = *L, b**K* = *U* and ![Graphic][96]</img> for 1 ≤ *k* ≤ *K*− 2, where *n*min = min(*n*1, *n*2). We set the grid size ![Graphic][97]</img> so that the error caused by discretization is smaller than the parametric rate ![Graphic][98]</img>.

For each grid value *b* ∈ ℬ and sampling index 1 ≤ *m* ≤ *M*, we propose an estimate of *π*j by ![Graphic][99]</img>, where ![Graphic][100]</img> is a data-dependent threshold, Φ^−1(·) is the inverse of the cumulative distribution function of the standard normal distribution, *α* ∈ (0,1) is the significance level, and ![Graphic][101]</img> when *M* is sufficiently large) is a scaling factor to make the thresholding more stringent so that the confidence interval in each sampling is shorter, as we will show shortly. Here, ![Graphic][102]</img> indicates that the *j*th pQTL is detected as a valid IV in the *m*th sampling if we take ![Graphic][103]</img> as the estimates of genetic associations and *b* as the true causal effect. Let ![Graphic][104]</img>, then we construct the *m*th sampling’s pseudo confidence interval pCI(*m*) by searching for the smallest and largest *b* ∈ ℬ such that more than half of pQTLs in ![Graphic][105]</img> are detected to be valid. Define ![Graphic][106]</img> and ![Graphic][107]</img>, then the *m*th sampling’s pseudo confidence interval is constructed as ![Graphic][108]</img>.

From the definitions of ![Graphic][109]</img> and pCI(*m*), we can see that, when *λ* is smaller, there will be fewer pQTLs in ![Graphic][110]</img> being detected as valid for a given *b* ∈ ℬ, which leads to fewer *b* ∈ ℬ satisfying ![Graphic][111]</img>, thus the pseudo confidence interval in each sampling will be shorter. If there does not exist *b* ∈ ℬ such that the majority of IVs in ![Graphic][112]</img> are detected as valid, we set pCI(*m*) = ∅. Let ℳ = {1 ≤ *m* ≤ *M*: pCI(*m*) ≠ ∅} denote the set of all sampling indexes corresponding to non-empty searching confidence intervals, then the proposed robust confidence interval is given by: ![Formula][113]</img>  We summarize the procedure of constructing the proposed robust confidence interval in Algorithm 2.
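The searching and sampling construction can be sketched in Python as below. This is a simplified illustration under several assumptions, because the key formulas appear only as images in this text: the resampling distribution uses diagonal covariance matrices built from the reported standard errors; the data-dependent threshold is taken to be Φ^−1(1 − *α*/(2*s*)) times the standard error of Γ*j* − *bγ*j, where *s* is the number of selected IVs; the scaling factor is taken to be *λ* = (log *n*min / *M*)^(1/(2*s*)); and the grid step is taken to be *n*min^(−0.6). The function name `robust_interval` and the input values are hypothetical, and the example uses *M* = 200 rather than the default to keep the run short.

```python
import math
import random
from statistics import NormalDist


def robust_interval(gamma_hat, Gamma_hat, se_gamma, se_Gamma, n_min,
                    alpha=0.05, M=1000, seed=0):
    """Union of per-sample pseudo confidence intervals, or None if all are empty."""
    rng = random.Random(seed)
    s = len(gamma_hat)
    log_n = math.log(n_min)
    z = NormalDist().inv_cdf(1 - alpha / (2 * s))  # assumed threshold quantile
    lam = (log_n / M) ** (1 / (2 * s))             # assumed scaling factor

    # Initial range [L, U]: hull of ratio estimates +/- sqrt(log n * variance).
    lows, highs = [], []
    for g, G, sg, sG in zip(gamma_hat, Gamma_hat, se_gamma, se_Gamma):
        ratio = G / g
        sd = math.sqrt(sG ** 2 + ratio ** 2 * sg ** 2) / abs(g)
        lows.append(ratio - math.sqrt(log_n) * sd)
        highs.append(ratio + math.sqrt(log_n) * sd)
    L, U = min(lows), max(highs)
    step = n_min ** -0.6                           # assumed grid step
    grid = [L + k * step for k in range(int((U - L) / step) + 1)] + [U]

    lower, upper = math.inf, -math.inf
    for _ in range(M):
        # Resample the association estimates around the observed values.
        g_m = [rng.gauss(g, sg) for g, sg in zip(gamma_hat, se_gamma)]
        G_m = [rng.gauss(G, sG) for G, sG in zip(Gamma_hat, se_Gamma)]
        accepted = []
        for b in grid:
            # Count IVs whose residual violation is below the shrunken threshold.
            n_valid = sum(
                abs(G - b * g) <= lam * z * math.sqrt(sG ** 2 + b ** 2 * sg ** 2)
                for g, G, sg, sG in zip(g_m, G_m, se_gamma, se_Gamma)
            )
            if n_valid > s / 2:
                accepted.append(b)
        if accepted:  # non-empty pseudo confidence interval for this sample
            lower = min(lower, accepted[0])
            upper = max(upper, accepted[-1])
    return (lower, upper) if lower <= upper else None


# Hypothetical data: four IVs with ratio estimates near 1 and one with ratio 3.
ci = robust_interval(
    gamma_hat=[0.20, 0.25, 0.30, 0.22, 0.20],
    Gamma_hat=[0.20, 0.26, 0.29, 0.22, 0.60],
    se_gamma=[0.01] * 5,
    se_Gamma=[0.01] * 5,
    n_min=10000,
    M=200,
)
print(ci)
```

Because the interval is the hull of the pseudo confidence intervals over all samples, it tends to be wider than a single Wald interval, which reflects the conservativeness discussed in the text.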

#### Simulation settings

We set the number of candidate IVs *p* = 10, as the average number of candidate pQTL IVs for the plasma proteins in the UK Biobank proteomics data is around 7.4. We set the sample sizes *n*1 = *n*2 = 5,000, 10,000, 20,000, 40,000, or 80,000. We generate the *j* th genetic instruments *Z*j and *X*j independently from a binomial distribution Bin (2, *f*j), where *f*j ∼ *U*(0.05,0.50) is the minor allele frequency of pQTL *j*. Then we generate the protein level ![Graphic][114]</img> and the outcome ![Graphic][115]</img> according to the exposure model and the outcome model, respectively. Finally, we calculate the genetic associations and their corresponding standard errors for the protein and the outcome, respectively. As for the parameters, we fix the causal effect *β* = 1, and we consider 4 settings for *γ* ∈ ℝ*p* and *π* ∈ ℝ*p* :

(**S1**): set ***γ*** = 0.2 · (**1**5,−**1**5)⊤ and *π* = 0.2 · (**0**6, **1**4)⊤.

(**S2**): set ***γ*** = 0.2 · (**1**5,−**1**5)⊤ and *π* = 0.2 · (**0**4, **1**3,−**1**3)⊤.

(**S3**): set ***γ*** = 0.2 · (**1**5,−**1**5)⊤ and *π* = 0.2 · (**0**6, **1**2, 0.25,0.25)⊤.

(**S4**): set ***γ*** = 0.2 · (**1**5,−**1**5)⊤ and *π* = 0.2 · (**0**4, **1**2, 0.25, **1**2,−0.25)⊤.

Settings (**S1**) and (**S3**) satisfy the majority rule condition, while (**S2**) and (**S4**) only satisfy the plurality rule condition. In addition, (**S3**) and (**S4**) simulate cases where locally invalid IVs exist, as we shrink the degrees of violation of assumptions (A2) and (A3) for some pQTLs to 0.25 times their original magnitude in these two settings. In total, we run 1,000 replications in each setting.
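A single replication of the data-generating process can be sketched in Python as follows, using a setting with six valid IVs and four invalid IVs as in (**S1**). Some details are assumptions made for illustration: genotypes are standardized with their population mean and variance, the error terms are standard normal with correlation `rho` = 0.5 to mimic unmeasured confounding, and the marginal associations are obtained by simple linear regression. The helper names are hypothetical and the code is not taken from the authors' simulation scripts.

```python
import math
import random
from statistics import median


def draw_sample(n, freqs, gamma, pi, beta, rho, rng):
    """Draw standardized genotypes, exposure D and outcome Y for n individuals."""
    cols = [[] for _ in freqs]
    D, Y = [], []
    for _ in range(n):
        z = []
        for j, f in enumerate(freqs):
            g = (rng.random() < f) + (rng.random() < f)    # Bin(2, f) genotype
            zj = (g - 2 * f) / math.sqrt(2 * f * (1 - f))  # standardize (assumed)
            z.append(zj)
            cols[j].append(zj)
        delta = rng.gauss(0, 1)
        # Correlated errors stand in for unmeasured confounding (assumed form).
        e = rho * delta + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        d = sum(zj * gj for zj, gj in zip(z, gamma)) + delta
        D.append(d)
        Y.append(beta * d + sum(zj * pj for zj, pj in zip(z, pi)) + e)
    return cols, D, Y


def marginal_assoc(cols, y):
    """Marginal regression slope and standard error of y on each genotype column."""
    n = len(y)
    ybar = sum(y) / n
    est, se = [], []
    for col in cols:
        zbar = sum(col) / n
        szz = sum((z - zbar) ** 2 for z in col)
        b = sum((z - zbar) * (v - ybar) for z, v in zip(col, y)) / szz
        rss = sum(((v - ybar) - b * (z - zbar)) ** 2 for z, v in zip(col, y))
        est.append(b)
        se.append(math.sqrt(rss / (n - 2) / szz))
    return est, se


def simulate_two_sample(gamma, pi, beta=1.0, n1=5000, n2=5000, rho=0.5, seed=1):
    """Summary statistics from two non-overlapping samples."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.05, 0.50) for _ in gamma]
    Z, D, _ = draw_sample(n1, freqs, gamma, pi, beta, rho, rng)  # exposure sample
    X, _, Y = draw_sample(n2, freqs, gamma, pi, beta, rho, rng)  # outcome sample
    gamma_hat, se_gamma = marginal_assoc(Z, D)
    Gamma_hat, se_Gamma = marginal_assoc(X, Y)
    return gamma_hat, se_gamma, Gamma_hat, se_Gamma


gamma = [0.2] * 5 + [-0.2] * 5
pi = [0.0] * 6 + [0.2] * 4
gamma_hat, se_gamma, Gamma_hat, se_Gamma = simulate_two_sample(gamma, pi)
ratios = [G / g for g, G in zip(gamma_hat, Gamma_hat)]
print(median(ratios[:6]), median(ratios[6:]))
```

The ratio estimates of the six valid IVs should cluster around the true causal effect *β* = 1, while those of the four invalid IVs cluster around a different value, which is the pattern the voting procedure exploits.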

#### Implementation of existing MR methods

We compare the performance of MR-SPI with eight other MR methods in simulation studies and real data analyses. These methods are implemented as follows:

*   Random-effects IVW, MR-Egger, the weighted median method, the mode-based estimation and the contamination mixture method are implemented in the R package “MendelianRandomization” ([https://github.com/cran/MendelianRandomization](https://github.com/cran/MendelianRandomization)). The mode-based estimation is run with “iteration = 1000”. All other methods are run with the default parameters.

*   MR-PRESSO is implemented in the R package “MR-PRESSO” ([https://github.com/rondolab/MR-PRESSO](https://github.com/rondolab/MR-PRESSO)) with outlier test and distortion test.

*   MR-RAPS is performed using the R package “mr.raps” ([https://github.com/qingyuanzhao/mr.raps](https://github.com/qingyuanzhao/mr.raps)) with the default options.

*   MRMix is run with the R package “MRMix” ([https://github.com/gqi/MRMix](https://github.com/gqi/MRMix)) using the default options.

### Algorithm 1: Selecting pQTL IVs and Performing Causal Inference by MR-SPI

![Figure5](https://www.medrxiv.org/content/medrxiv/early/2024/10/02/2023.02.20.23286200/F5.medium.gif)

[Figure5](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/F5)

### Algorithm 2: Constructing A Robust Confidence Interval via Searching and Sampling

![Figure6](https://www.medrxiv.org/content/medrxiv/early/2024/10/02/2023.02.20.23286200/F6.medium.gif)

[Figure6](http://medrxiv.org/content/early/2024/10/02/2023.02.20.23286200/F6)

## Data Availability

All data produced in the present study are available upon reasonable request to the authors. 

## Footnotes

*   Figure revised; Manuscript structure revised.

*   Received February 20, 2023.
*   Revision received October 1, 2024.
*   Accepted October 2, 2024.


*   © 2024, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  1.Self, W.K. & Holtzman, D.D. Emerging diagnostics and therapeutics for Alzheimer disease. Nature medicine 29, 2187–2199 (2023).
    
    
2.  2.Nichols, E. et al. Estimation of the global prevalence of dementia in 2019 and forecasted prevalence in 2050: an analysis for the Global Burden of Disease Study 2019. The Lancet Public Health 7, e105–e125 (2022).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S2468-2667(21)00249-8&link_type=DOI) 

3.  3.Van Dyck, C.H. et al. Lecanemab in early Alzheimer’s disease. New England Journal of Medicine 388, 9–21 (2023).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJ-Moa2212948&link_type=DOI) 

4.  4.Sims, J.R. et al. Donanemab in early symptomatic Alzheimer disease: the TRAILBLAZER-ALZ 2 randomized clinical trial. Jama 330, 512–527 (2023).
    
    
5.  5.Hardy, J.A. & Higgins, G.G. Alzheimer’s disease: the amyloid cascade hypothesis. Science 256, 184–185 (1992).
    
    [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6MzoiUERGIjtzOjExOiJqb3VybmFsQ29kZSI7czozOiJzY2kiO3M6NToicmVzaWQiO3M6MTI6IjI1Ni81MDU0LzE4NCI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDI0LzEwLzAyLzIwMjMuMDIuMjAuMjMyODYyMDAuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 

6.  6.Karran, E. & De Strooper, B. The amyloid hypothesis in Alzheimer disease: new insights from new therapeutics. Nature reviews Drug discovery 21, 306–318 (2022).
    
    
7.  7.Khoury, R., Rajamanickam, J. & Grossberg, G.G. An update on the safety of current therapies for Alzheimer’s disease: focus on rivastigmine. Therapeutic advances in drug safety 9, 171–178 (2018).
    
    
8.  8.Davey Smith, G. & Ebrahim, S. ‘ Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology 32, 1–22 (2003).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dyg070&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=12689998&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F02%2F2023.02.20.23286200.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000182341300001&link_type=ISI) 

9.  9.Lawlor, D.A., Harbord, R.M., Sterne, J.A.C., Timpson, N. & Davey Smith, G. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology. Statistics in Medicine 27, 1133–1163 (2008).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/sim.3034&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17886233&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F02%2F2023.02.20.23286200.atom) 

10. 10.Davey Smith, G. & Hemani, G. Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Human Molecular Genetics 23, R89–R98 (2014).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/hmg/ddu328&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25064373&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F02%2F2023.02.20.23286200.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000349825700013&link_type=ISI) 

11. 11.Davey Smith, G. & Ebrahim, S. Mendelian randomization: prospects, potentials, and limitations. International Journal of Epidemiology 33, 30–42 (2004).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dyh132&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=15075143&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F02%2F2023.02.20.23286200.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000220615000009&link_type=ISI) 

12. 12.Burgess, S., Butterworth, A. & Thompson, S.S. Mendelian randomization analysis with multiple genetic variants using summarized data. Genetic Epidemiology 37, 658–665 (2013).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/gepi.21758&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24114802&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F02%2F2023.02.20.23286200.atom) 

13. 13.Pierce, B.L. & Burgess, S. Efficient design for Mendelian randomization studies: subsample and 2-sample instrumental variable estimators. American Journal of Epidemiology 178, 1177–1184 (2013).
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/aje/kwt084&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23863760&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2024%2F10%2F02%2F2023.02.20.23286200.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000325151700023&link_type=ISI) 

14. Lawlor, D.A. Commentary: Two-sample Mendelian randomization: opportunities and challenges. International Journal of Epidemiology 45, 908 (2016).

15. Slob, E.A.W. & Burgess, S. A comparison of robust Mendelian randomization methods using summary data. Genetic Epidemiology 44, 313–329 (2020).

16. Zhao, Q., Wang, J., Hemani, G., Bowden, J. & Small, D.S. Statistical inference in two-sample summary-data Mendelian randomization using robust adjusted profile score. The Annals of Statistics 48, 1742–1769 (2020).
    
    
17. Morrison, J., Knoblauch, N., Marcus, J.H., Stephens, M. & He, X. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics. Nature Genetics 52, 740–747 (2020).

18. Cheng, Q., Zhang, X., Chen, L.S. & Liu, J. Mendelian randomization accounting for complex correlated horizontal pleiotropy while elucidating shared genetic etiology. Nature Communications 13, 1–13 (2022).

19. Yao, C. et al. Genome-wide mapping of plasma protein QTLs identifies putatively causal genes and pathways for cardiovascular disease. Nature Communications 9, 3268 (2018).
    
    
20. Sun, B.B. et al. Genetic regulation of the human plasma proteome in 54,306 UK Biobank participants. BioRxiv, 2022-06 (2022).
    
    
21. Didelez, V. & Sheehan, N. Mendelian randomization as an instrumental variable approach to causal inference. Statistical Methods in Medical Research 16, 309–330 (2007).

22. Sanderson, E., Richardson, T.G., Hemani, G. & Davey Smith, G. The use of negative control outcomes in Mendelian randomization to detect potential population stratification. International Journal of Epidemiology 50, 1350–1361 (2021).

23. Solovieff, N., Cotsapas, C., Lee, P.H., Purcell, S.M. & Smoller, J.W. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics 14, 483–495 (2013).

24. Sivakumaran, S. et al. Abundant pleiotropy in human complex diseases and traits. The American Journal of Human Genetics 89, 607–618 (2011).

25. Parkes, M., Cortes, A., Van Heel, D.A. & Brown, M.A. Genetic insights into common pathways and complex relationships among immune-mediated diseases. Nature Reviews Genetics 14, 661–673 (2013).

26. Schmidt, A.F. et al. Genetic drug target validation using Mendelian randomisation. Nature Communications 11, 3255 (2020).
    
    
27. Bowden, J. et al. A framework for the investigation of pleiotropy in two-sample summary data Mendelian randomization. Statistics in Medicine 36, 1783–1802 (2017).

28. Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. International Journal of Epidemiology 44, 512–525 (2015).

29. Verbanck, M., Chen, C.-Y., Neale, B. & Do, R. Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics 50, 693–698 (2018).

30. Bowden, J., Davey Smith, G., Haycock, P.C. & Burgess, S. Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genetic Epidemiology 40, 304–314 (2016).

31. Hartwig, F.P., Davey Smith, G. & Bowden, J. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. International Journal of Epidemiology 46, 1985–1998 (2017).

32. Qi, G. & Chatterjee, N. Mendelian randomization analysis using mixture models for robust and efficient estimation of causal effects. Nature Communications 10, 1–10 (2019).

33. Burgess, S., Foley, C.N., Allara, E., Staley, J.R. & Howson, J.M.M. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nature Communications 11, 1–11 (2020).

34. He, B., Shi, J., Wang, X., Jiang, H. & Zhu, H.-J. Genome-wide pQTL analysis of protein expression regulatory networks in the human liver. BMC Biology 18, 1–16 (2020).
    
    
35. Swerdlow, D.I. et al. Selecting instruments for Mendelian randomization in the wake of genome-wide association studies. International Journal of Epidemiology 45, 1600–1616 (2016).

36. Zaneveld, J.R., McMinds, R. & Vega Thurber, R. Stress and stability: applying the Anna Karenina principle to animal microbiomes. Nature Microbiology 2, 1–8 (2017).

37. Shriner, D. & Rotimi, C.C. Whole-genome-sequence-based haplotypes reveal single origin of the sickle allele during the Holocene wet phase. The American Journal of Human Genetics 102, 547–556 (2018).

38. Karki, R., Pandya, D., Elston, R.C. & Ferlini, C. Defining “mutation” and “polymorphism” in the era of personal genomics. BMC Medical Genomics 8, 1–7 (2015).
    
    
39. Ashley-Koch, A., Yang, Q. & Olney, R.S. Sickle hemoglobin (Hb S) allele and sickle cell disease: a HuGE review. American Journal of Epidemiology 151, 839–845 (2000).

40. Rees, D.C., Williams, T.N. & Gladwin, M.T. Sickle-cell disease. The Lancet 376, 2018–2031 (2010).
    
    
41. Kato, G.J. et al. Sickle cell disease. Nature Reviews Disease Primers 4, 1–22 (2018).
    
    
42. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

43. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nature Methods 19, 679–682 (2022).
    
    
44. Wayment-Steele, H.K. et al. Predicting multiple conformations via sequence clustering and AlphaFold2. Nature, 1–3 (2023).
    
    
45. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 1–3 (2024).
    
    
46. Dunn, O.J. Multiple comparisons among means. Journal of the American Statistical Association 56, 52–64 (1961).

47. Ouyang, Q., Kaplan, P.D., Liu, S. & Libchaber, A. DNA solution of the maximal clique problem. Science 278, 446–449 (1997).

48. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300 (1995).

49. Guo, Z. Causal inference with invalid instruments: post-selection problems and a solution using searching and sampling. Journal of the Royal Statistical Society Series B: Statistical Methodology 85, 959–985 (2023).
    
    
50. Jansen, I.E. et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nature Genetics 51, 404–413 (2019).

51. Naj, A.C. et al. Common variants at MS4A4/MS4A6E, CD2AP, CD33 and EPHA1 are associated with late-onset Alzheimer’s disease. Nature Genetics 43, 436–441 (2011).

52. Rathore, N. et al. Paired Immunoglobulin-like Type 2 Receptor Alpha G78R variant alters ligand binding and confers protection to Alzheimer’s disease. PLoS Genetics 14, e1007427 (2018).
    
    
53. Griciuc, A. et al. TREM2 acts downstream of CD33 in modulating microglial pathology in Alzheimer’s disease. Neuron 103, 820–835 (2019).

54. Helgadottir, H.T. et al. Somatic mutation that affects transcription factor binding upstream of CD55 in the temporal cortex of a late-onset Alzheimer disease patient. Human Molecular Genetics 28, 2675–2685 (2019).
    
    
55. Rimal, S. et al. Reverse electron transfer is activated during aging and contributes to aging and age-related disease. EMBO reports 24, e55548 (2023).
    
    
56. Winfree, R.L. et al. TREM2 gene expression associations with Alzheimer’s disease neuropathology are region-specific: implications for cortical versus subcortical microglia. Acta Neuropathologica 145, 733–747 (2023).
    
    
57. Yang, X. et al. Functional characterization of Alzheimer’s disease genetic variants in microglia. Nature Genetics, 1–10 (2023).
    
    
58. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Research 51, D523–D531 (2023).

59. Zhou, Y. et al. Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Research 50, D1398–D1407 (2022).

60. Wishart, D.S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research 46, D1074–D1082 (2018).

61. Bross, P.F. et al. Approval summary: gemtuzumab ozogamicin in relapsed acute myeloid leukemia. Clinical Cancer Research 7, 1490–1496 (2001).

62. Norsworthy, K.J. et al. FDA approval summary: Mylotarg for treatment of patients with relapsed or refractory CD33-positive acute myeloid leukemia. The Oncologist 23, 1103–1108 (2018).

63. Kim, J. et al. FDA approval summary: pralsetinib for the treatment of lung and thyroid cancers with RET gene mutations or fusions. Clinical Cancer Research 27, 5452–5456 (2021).

64. Bradford, D. et al. FDA approval summary: selpercatinib for the treatment of lung and thyroid cancers with RET gene mutations or fusions. Clinical Cancer Research 27, 2130–2135 (2021).

65. Griciuc, A. et al. Alzheimer’s disease risk gene CD33 inhibits microglial uptake of amyloid beta. Neuron 78, 631–643 (2013).

66. Tortora, F. et al. CD33 rs2455069 SNP: correlation with Alzheimer’s disease and hypothesis of functional role. International Journal of Molecular Sciences 23, 3629 (2022).
    
    
67. Kamat, M.A. et al. PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations. Bioinformatics 35, 4851–4853 (2019).

68. Faux, N.G. et al. An anemia of Alzheimer’s disease. Molecular Psychiatry 19, 1227–1234 (2014).

69. Winchester, L.M., Powell, J., Lovestone, S. & Nevado-Holgado, A.J. Red blood cell indices and anaemia as causative factors for cognitive function deficits and for Alzheimer’s disease. Genome Medicine 10, 1–12 (2018).
    
    
70. Raudvere, U. et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Research 47, W191–W198 (2019).

71. Rijpma, A., van der Graaf, M., Meulenbroek, O., Rikkert, M.G.O. & Heerschap, A. Altered brain high-energy phosphate metabolism in mild Alzheimer’s disease: A 3-dimensional 31P MR spectroscopic imaging study. NeuroImage: Clinical 18, 254–261 (2018).
    
    
72. Parasoglou, P. et al. Phosphorus metabolism in the brain of cognitively normal midlife individuals at risk for Alzheimer’s disease. Neuroimage: Reports 2, 100121 (2022).
    
    
73. Lazarczyk, M.J. et al. Major Histocompatibility Complex class I proteins are critical for maintaining neuronal structural complexity in the aging brain. Scientific Reports 6, 26199 (2016).
    
    
74. Kim, M.-S. et al. Neuronal MHC-I complex is destabilized by amyloid-β and its implications in Alzheimer’s disease. Cell & Bioscience 13, 181 (2023).
    
    
75. Le Guen, Y. et al. Multiancestry analysis of the HLA locus in Alzheimer’s and Parkinson’s diseases uncovers a shared adaptive immune response mediated by HLA-DRB1*04 subtypes. Proceedings of the National Academy of Sciences 120, e2302720120 (2023).

76. Giambartolomei, C. et al. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS Genetics 10, e1004383 (2014).
    
    
77. Barfield, R. et al. Transcriptome-wide association studies accounting for colocalization using Egger regression. Genetic Epidemiology 42, 418–433 (2018).

78. Hukku, A. et al. Probabilistic colocalization of genetic variants from complex and molecular traits: promise and limitations. The American Journal of Human Genetics 108, 25–35 (2021).

79. Zuber, V. et al. Combining evidence from Mendelian randomization and colocalization: Review and comparison of approaches. The American Journal of Human Genetics 109, 767–782 (2022).

80. Yavorska, O.O. & Burgess, S. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data. International Journal of Epidemiology 46, 1734–1739 (2017).

81. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81, 559–575 (2007).

82. Bulik-Sullivan, B.K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics 47, 291–295 (2015).

83. Guo, Z., Kang, H., Cai, T.T. & Small, D.S. Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 793–815 (2018).
    
    
84. Kang, H., Zhang, A., Cai, T.T. & Small, D.S. Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association 111, 132–144 (2016).

85. Burgess, S., Dudbridge, F. & Thompson, S.G. Combining information on multiple instrumental variables in Mendelian randomization: comparison of allele score and summarized data methods. Statistics in Medicine 35, 1880–1906 (2016).

 [1]: /embed/inline-graphic-1.gif
 [2]: /embed/inline-graphic-2.gif
 [3]: /embed/inline-graphic-3.gif
 [4]: /embed/inline-graphic-4.gif
 [5]: /embed/inline-graphic-5.gif
 [6]: /embed/inline-graphic-6.gif
 [7]: /embed/inline-graphic-7.gif
 [8]: /embed/inline-graphic-8.gif
 [9]: /embed/inline-graphic-9.gif
 [10]: /embed/inline-graphic-10.gif
 [11]: /embed/inline-graphic-11.gif
 [12]: /embed/inline-graphic-12.gif
 [13]: /embed/inline-graphic-13.gif
 [14]: /embed/inline-graphic-14.gif
 [15]: /embed/graphic-8.gif
 [16]: /embed/inline-graphic-15.gif
 [17]: /embed/inline-graphic-16.gif
 [18]: /embed/inline-graphic-17.gif
 [19]: /embed/inline-graphic-18.gif
 [20]: /embed/inline-graphic-19.gif
 [21]: /embed/inline-graphic-20.gif
 [22]: /embed/graphic-9.gif
 [23]: /embed/inline-graphic-21.gif
 [24]: /embed/graphic-10.gif
 [25]: /embed/inline-graphic-22.gif
 [26]: /embed/inline-graphic-23.gif
 [27]: /embed/inline-graphic-24.gif
 [28]: /embed/inline-graphic-25.gif
 [29]: /embed/graphic-11.gif
 [30]: /embed/inline-graphic-26.gif
 [31]: /embed/inline-graphic-27.gif
 [32]: /embed/inline-graphic-28.gif
 [33]: /embed/inline-graphic-29.gif
 [34]: /embed/inline-graphic-30.gif
 [35]: /embed/inline-graphic-31.gif
 [36]: /embed/inline-graphic-32.gif
 [37]: /embed/inline-graphic-33.gif
 [38]: /embed/inline-graphic-34.gif
 [39]: /embed/inline-graphic-35.gif
 [40]: /embed/graphic-12.gif
 [41]: /embed/inline-graphic-36.gif
 [42]: /embed/inline-graphic-37.gif
 [43]: /embed/graphic-13.gif
 [44]: /embed/inline-graphic-38.gif
 [45]: /embed/inline-graphic-39.gif
 [46]: /embed/inline-graphic-40.gif
 [47]: /embed/inline-graphic-41.gif
 [48]: /embed/graphic-14.gif
 [49]: /embed/inline-graphic-42.gif
 [50]: /embed/inline-graphic-43.gif
 [51]: /embed/inline-graphic-44.gif
 [52]: /embed/inline-graphic-45.gif
 [53]: /embed/inline-graphic-46.gif
 [54]: /embed/inline-graphic-47.gif
 [55]: /embed/inline-graphic-48.gif
 [56]: /embed/inline-graphic-49.gif
 [57]: /embed/inline-graphic-50.gif
 [58]: /embed/inline-graphic-51.gif
 [59]: /embed/inline-graphic-52.gif
 [60]: /embed/inline-graphic-53.gif
 [61]: /embed/graphic-15.gif
 [62]: /embed/inline-graphic-54.gif
 [63]: /embed/inline-graphic-55.gif
 [64]: /embed/inline-graphic-56.gif
 [65]: /embed/inline-graphic-57.gif
 [66]: /embed/inline-graphic-58.gif
 [67]: /embed/inline-graphic-59.gif
 [68]: /embed/inline-graphic-60.gif
 [69]: /embed/inline-graphic-61.gif
 [70]: /embed/inline-graphic-62.gif
 [71]: /embed/inline-graphic-63.gif
 [72]: /embed/inline-graphic-64.gif
 [73]: /embed/inline-graphic-65.gif
 [74]: /embed/graphic-16.gif
 [75]: /embed/inline-graphic-66.gif
 [76]: /embed/inline-graphic-67.gif
 [77]: /embed/inline-graphic-68.gif
 [78]: /embed/inline-graphic-69.gif
 [79]: /embed/inline-graphic-70.gif
 [80]: /embed/inline-graphic-71.gif
 [81]: /embed/graphic-17.gif
 [82]: /embed/inline-graphic-72.gif
 [83]: /embed/inline-graphic-73.gif
 [84]: /embed/inline-graphic-74.gif
 [85]: /embed/inline-graphic-75.gif
 [86]: /embed/inline-graphic-76.gif
 [87]: /embed/inline-graphic-77.gif
 [88]: /embed/inline-graphic-78.gif
 [89]: /embed/inline-graphic-79.gif
 [90]: /embed/inline-graphic-80.gif
 [91]: /embed/graphic-18.gif
 [92]: /embed/inline-graphic-81.gif
 [93]: /embed/inline-graphic-82.gif
 [94]: /embed/inline-graphic-83.gif
 [95]: /embed/inline-graphic-84.gif
 [96]: /embed/inline-graphic-85.gif
 [97]: /embed/inline-graphic-86.gif
 [98]: /embed/inline-graphic-87.gif
 [99]: /embed/inline-graphic-88.gif
 [100]: /embed/inline-graphic-89.gif
 [101]: /embed/inline-graphic-90.gif
 [102]: /embed/inline-graphic-91.gif
 [103]: /embed/inline-graphic-92.gif
 [104]: /embed/inline-graphic-93.gif
 [105]: /embed/inline-graphic-94.gif
 [106]: /embed/inline-graphic-95.gif
 [107]: /embed/inline-graphic-96.gif
 [108]: /embed/inline-graphic-97.gif
 [109]: /embed/inline-graphic-98.gif
 [110]: /embed/inline-graphic-99.gif
 [111]: /embed/inline-graphic-100.gif
 [112]: /embed/inline-graphic-101.gif
 [113]: /embed/graphic-19.gif
 [114]: /embed/inline-graphic-102.gif
 [115]: /embed/inline-graphic-103.gif