Network Assisted Analysis of De Novo Variants Using Protein-Protein Interaction Information Identified 46 Candidate Genes for Congenital Heart Disease ====================================================================================================================================================== * Yuhan Xie * Wei Jiang * Weilai Dong * Hongyu Li * Sheng Chih Jin * Martina Brueckner * Hongyu Zhao ## Abstract *De novo* variants (DNVs) with deleterious effects have proved informative in identifying risk genes for early-onset diseases such as congenital heart disease (CHD). A number of statistical methods have been proposed for family-based studies or case/control studies to identify risk genes by screening genes with more DNVs than expected by chance in Whole Exome Sequencing (WES) studies. However, the statistical power is still limited for cohorts with thousands of subjects. Under the hypothesis that connected genes in protein-protein interaction (PPI) networks are more likely to share similar disease association status, we develop a Markov Random Field model that can leverage information from publicly available PPI databases to increase power in identifying risk genes. We identified 46 candidate genes with at least 1 DNV in the CHD study cohort, including 18 known human CHD genes and 35 highly expressed genes in mouse developing heart. Our results may shed new insight on the shared protein functionality among risk genes for CHD. Keywords * *De novo* variants * protein-protein interactions * congenital heart disease * network modeling ## Background Congenital heart disease (CHD) is the most common birth defect affecting ∼ 1% of live births and accounts for one-third of all major congenital abnormalities [1-3]. There is substantial evidence that CHD has a strong genetic component [4]. Although it is estimated that aneuploidies and copy number variations account for about 23% of CHD cases, few individual disease-causing genes have been identified in published studies [5-8]. Therefore, the limited knowledge on the underlying genetic causes poses an obstacle to the reproductive counseling of CHD patients [9]. Whole Exome Sequencing (WES) studies have successfully boosted novel causal genes identification for both Mendelian and complex disorders [10, 11]. To narrow down the pool of candidate variants from WES, family-based studies have been conducted to scan for *de novo* variants (DNVs) from parent-offspring trios. DNV studies have been shown to play an important role in risk gene identification for CHD [1, 3, 5, 6, 12-15]. From the analysis of 1,213 CHD parent-offspring trios, Homsy et al. identified greater burden of damaging DNVs, especially in genes with likely functional roles in heart and brain development [12]. Recently, Jin et al. inferred that DNVs in ∼440 genes were likely contributors to CHD [5]. Despite these advances, it remains challenging to capture the causal genes with only DNV data as CHD is very genetically heterogeneous [6]. It is believed that genes interact with each other in biological processes and may jointly affect the disease risk through biological pathways. In recent studies on CHD, researchers reported significant enrichment of genes related to histone modification, chromatin modification, cilia function, transcriptional regulation, neural tube development, and cardiac development and enrichment in pathways including Wnt, Notch, Igf, HDAC, ErbB and NF-kb signaling [1, 3, 12, 14, 16]. These results suggest that considering functional relationships of genes through pathway-level analyses may complement gene-level analyses and improve risk gene identification for CHD. Recently, Nguyen et al. proposed a framework to conduct pathway-level analysis on schizophrenia DNV data [17]. Their method is built on the extTADA framework and integrates pathway information by multiplying a gene-set related term in the likelihood function. However, this method requires curated pathways related to schizophrenia and treats genes as exchangeable without considering gene-gene interactions. To characterize the functional connectivity between genes, we can consider genes as a network by modeling the topology of pathways. Network-based approaches have been successful in prioritizing risk genes for downstream analysis of Genome-Wide Association Studies (GWAS) and gene expression studies [18-20]. Chen et al. [19] proposed a Markov Random Field (MRF) model to incorporate pathway topology structure for GWAS. They showed that their method is more powerful than single gene-based methods through both simulation and real data analyses. Following the idea of this framework, Hou et al. [20] proposed a method that integrates co-expression networks and GWAS results. By applying their method to Crohn’s disease and Parkinson’s disease, they showed that their method could lead to more replicable results and find potential disease-associated pathways. More recently, Liu et al. adopted a similar idea as Chen et al. and Hou et al. to analyze DNV data from WES studies [21]. Their framework, namely DAWN, combines TADA p-values with estimated network from gene co-expression data. In their real data analysis for autism, 333 genes were prioritized by integrating DNV summary statistics and expression data from brain tissue. However, all of these above methods require summary statistics (Z scores or p-values) from genetic association analysis as their input, which may not be provided for DNV analysis results [22, 23]. In addition, it is difficult to apply DAWN directly to CHD because of the lack of ideal co-expression data from relevant tissues (such as developmental heart) for CHD. As there is a limited number of co-expression data sets for human developmental heart, a natural choice for network information would be human protein-protein interaction (PPI) databases. There are multiple public sources of PPI network databases such as BioGRID [24], IntAct [25], DIP [26], MINT [27], and HPRD [28]. Most of network-based studies apply their real data on two or more of the databases to obtain their results. Nonetheless, it is hard to check the overlapping information between two PPI databases and interpret the divergent results. The STRING [29] database provides a platform to resolve the above problems. It imports protein association knowledge from physical interaction and curated knowledge from the aforementioned PPI databases and other pathway information knowledge such as KEGG and GO. In addition, it provides a score to measure the likelihood of interactions. Some studies have used STRING in their post-association analysis for gene-based DNV studies and showed significant enrichment of candidate CHD risk genes in the STRING PPI network [30, 31]. One recent study [3] constructed a priority score based on canonical pathways and PPI networks and identified 23 novel genes for CHD. These results suggest that incorporating PPI network information from STRING may identify additional risk genes with more biological interpretability. In addition, in the post-association analysis of CHD DNV analysis from our previous multi-trait method M-DATA [32], we found that the number of edges formed by candidate CHD genes (47 edges, green line in Fig 1A) identified by M-DATA multi-trait model (33 genes with false discovery rate (FDR) q-values < 0.1) is far larger than the upper tail of the empirical distribution sampled from 33 randomly selected genes in the STRING V11.0 database (score threshold: 400) for 10,000 times (Fig 1A). This suggests that the candidate CHD genes are highly enriched in terms of their interactions in the STRING database. To further illustrate that PPI information may contribute to CHD gene discovery, we use the CHD result from M-DATA single-trait analysis and use the result of autism as a comparison. Fig 1B shows the number of edges formed by the top genes (ranked by FDR q-values) for CHD and autism from M-DATA single-trait model with a more stringent selection of PPI edges in the STRING database (score threshold: 950), respectively. To compare with the number of edges formed by randomly selected genes, we plot the 95% quantile of the empirical distribution sampled from random genes in the STRING v11.0 database (score threshold: 950) for 10,000 times as a baseline. When more than 25 top CHD genes are selected, the number of edges formed by these genes is significantly more than that from randomly selected genes, whereas the number of edges formed by the autism top genes did not differ much from randomly selected genes. This suggests top genes in CHD tend to be neighbors in the STRING PPI network. ![Fig 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/12/02/2021.11.30.21267069/F1.medium.gif) [Fig 1.](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/F1) Fig 1. CHD top genes are more connected than randomly selected genes in the STRING PPI network (A) Empirical distribution of the number of edges formed by 33 randomly selected genes. Green line represents the number of edges formed by the 33 CHD top genes from M-DATA. (B) Number of edges formed by CHD top genes, autism top genes from single-trait analyses and randomly selected genes, respectively. Motivated by the observation from Fig 1, we develop a **N**etwork assisted model for **D**e novo **A**ssociation **T**est using protein-protein inter**A**ction information, named N-DATA, to leverage prior information of interactions among genes from the PPI network to boost statistical power in identifying risk genes for CHD. In the following, we first introduce the inference procedure for our model, and then demonstrate the performance of our method through simulation studies and real data applications. ## Methods In this section, we introduce the statistical model for the proposed framework. The network information in the PPI database is represented by an undirected graph *G* = (*V, E*), where *V* = {1,…,*n*} is a set of *n* genes in the network, and *E* = {< *I, j* >: *i* and *j* are genes connected by the edges}. The degree of a gene *i* is defined as the number of direct neighbors (*N**i*) for gene *i* in the network and denoted as *d**i* We denote the latent association status of gene *i* with a disease of interest, e.g., CHD, as *S**i*, where *S**i* =1 if gene *i* is associated with the disease, *S**i* = −1 if gene *i* is not associated with the disease. *S =*{*S*1,..,*S**n*} are the corresponding latent states for genes in *V* = {1, …, *n*}. The DNV count of the cohort is defined as *Y**i*. To formalize the assumption that genes that have PPIs with risk genes are more likely to be risk genes, we apply a nearest neighbor Gibbs measure [33] to arrive at the following model: ![Formula][1] where *W**i* is the weight for gene *i* and will be chosen based on the characteristics of the network, *θ* = (*h, τ*, *τ*1) are hyperparameters related to the network. Specifically, *h* determines the marginal distribution of *S**i* when all genes are independent i.e., ![Graphic][2]. τ and τ1 characterize the prior weights of edges between non-associated genes and associated genes, respectively. *N* is the sample size of the case cohort, *μ**i* is the mutability of gene *i* estimated using the framework in Samocha et al. [34], and *θ*1 (*γ*) is the relative risk of the DNVs in the risk gene. To reduce the computational burden from a fully Bayesian solution for maximizing the marginal likelihood, we propose an empirical Bayes method to estimate the parameters *θ* and *θ*1, and To reduce the computational burden from a fully Bayesian solution for maximizing the marginal the latent association status *S* by maximizing the pseudo conditional likelihood (PCLK) for *n* genes as follows ![Formula][3] It has been shown that the estimator from the PCLK in a general Markov random field setting is consistent under mild regularity conditions [19] [35]. When maximizing the PCLK, we can estimate the hyperparameters *θ*0, *θ*1 and latent status *S* iteratively. We can obtain an empirical estimate for *θ* by maximizing ![Graphic][4], which is equivalent to maximizing the parameters in the following logistic regression model: ![Formula][5] where ![Graphic][6] and ![Graphic][7] To make sure the estimated *θ* is finite, we can add a ridge penalty term ![Graphic][8] to the likelihood function to solve the maximization problem by the Newton-Raphson’s method [36]. We then update the latent status *S* by maximizing the PCLK using the iterative conditional mode method [35]. After we obtain the updated values *θ* and *S* we can estimate the hyperparameter *θ*1 by maximizing ![Graphic][9] by using the following closed-form expression: ![Formula][10] Algorithm 1: Procedure for Parameter Estimation ![Figure2](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/12/02/2021.11.30.21267069/F2.medium.gif) [Figure2](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/F2) Finally, after we obtain the estimated ![Graphic][11] and ![Graphic][12], we use Gibbs sampling based on the conditional distribution ![Graphic][13] This method has been proved to be valid for multiple testing under dependence in a compound decision theoretic framework [37, 38]. Then, we can estimate the marginal posterior probability *q**i* = *P* (*S**i* = −1|*Y*).Let *q*(*i*) be the sorted values of *q**i* in descending order. For each gene *i* the null hypothesis and alternative hypothesis are ![Formula][14] As shown by Jiang and Yu [39], the relationship between global FDR and local FDR (lfdr) is FDR = *E* (Ifdr|*Y* ∈ ℛ), where the rejection region ℛ is the set of *Y* such that the null hypothesis can be rejected based on a specific rejection criterion. To control the expected global FDR less than *α*, we propose the following procedure: let ![Graphic][15], we reject all the null hypotheses corresponding to *H*(1), …, *H*(*m*). ## Results ### Simulation Studies For DNV and network data, we considered similar settings as our real data. We randomly selected 2,000 genes, retrieved their mutability from the real data, and extracted the corresponding PPI network formed by these 2,000 genes. First, we simulated the true model with parameter values similar as those from real data analysis to see whether N-DATA can control FDR. Then, we evaluated the power under various settings of sample sizes *N* and relative risk parameter *γ*. We set true network parameter *θ* as (−4, 0.2, 0) to make the total number of risk genes in the network of 2,000 random genes to be around 100. We varied the sample size *N* at 2,000, 5,000 and 10,000 to evaluate the performance of N-DATA in small, medium, and large WES cohorts, respectively. In addition, we varied *β* (log relative risk parameter *γ*) at 3.5, 4 and 4.5 to and 10,000 to investigate the performance of N-DATA around the burden estimated results from real data (In real data analysis, ![Graphic][16]) Each simulation setting was replicated 100 times. For Gibbs sampling-based inference, we used 2,000 MCMC iterations, and set the first 1,000 iterations as burn-ins. These numbers were chosen empirically based on the diagnostic plots for convergence. We report the performance under FDR threshold 0.05 in the main text (Fig 2) and FDR threshold 0.01 and 0.1 in Additional File 1. ![Fig 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/12/02/2021.11.30.21267069/F3.medium.gif) [Fig 2.](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/F3) Fig 2. Comparison of N-DATA w/o and w/ PPI network models. (A) Power comparison of the two models. Three panels from left to right represent cohorts with small, medium, and large sample sizes, respectively. (B) FDR control for the two models. Red dash line represents the preset FDR threshold 0.05. Three panels from left to right represent cohorts with small, medium, and large sample sizes, respectively. First, we compared the performance of N-DATA model with and without the PPI network as input. For N-DATA model without the PPI network, we assigned the weight of gene for inference. Both models controlled FDR well under all the settings. N-DATA model with PPI network had much better power than the model without PPI network when the sample size and were both low. When the sample size and were larger, the power of N-DATA without PPI network improved as expected. Then, we compared the power of TADA-*De novo*, DAWN, and N-DATA using the same parameter settings. Hyperprior of TADA-*De novo* was estimated from the function *denovo*.*MOM* based on the recommendation from the authors [40]. Power of TADA was calculated based on TADA q-values. DAWN v1.0 was downloaded from [http://www.compgen.pitt.edu/DAWN/DAWN\_homepage.htm](http://www.compgen.pitt.edu/DAWN/DAWN_homepage.htm). We adapted the code of DAWN to use adjacency matrices of networks as its input. We used TADA-*De novo* p-values and PPI network as the input of DAWN. The parameter trim threshold was set to 4 based on the observation in our real data that, which corresponds to select the top 5 genes in TADA-*De novo* as fixed risk genes. We compared the performance of TADA-*De novo*, TADA-*De novo* p-values + DAWN and N-DATA under different simulation settings. We reported the performance under FDR threshold 0.05 in the main text (Fig 3) and FDR threshold 0.01 and 0.1 in Additional File 1. We first checked if all three methods could control the global FDR when the threshold is 0.05. As discussed above, N-DATA controlled FDR well under all settings, while there was slight FDR inflation of TADA-*De novo* and DAWN under a couple of settings, especially when both and were very small or very large. N-DATA had the best power among the three methods under all settings. With an increase of and, the performance of TADA-*De novo* and DAWN also became better. For settings with FDR well controlled, DAWN performed better than TADA-*De novo*. ![Fig 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/12/02/2021.11.30.21267069/F4.medium.gif) [Fig 3.](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/F4) Fig 3. Comparison of TADA-*De novo*, TADA-*De novo* p-values + DAWN and N-DATA. Error bars represent standard errors estimated from 100 replications of simulation. (A) Power comparison of the three methods. Three panels from left to right represent cohorts with small, medium, and large sample sizes, respectively. (B) FDR control for the three methods. Red dash line represents the preset FDR threshold 0.05. Three panels from left to right represent cohorts with small, medium, and large sample sizes, respectively. ### Real Data Applications We applied N-DATA to DNV data from 2,645 CHD trios reported in Jin et al [5]. We only considered damaging variants (loss of function (LoF) and deleterious missense (Dmis) variants by the MetaSVM algorithm) in our analysis as the number of non-deleterious variants is not expected to provide information to differentiate cases from controls biologically [41]. For network information, we first downloaded STRING v11.0 with medium edge likelihood via interface from STRINGdb package in R, and call this original network from STRING 𝒢 obtained the curated list of known human CHD genes from Jin el al [5] and expanded the gene list by including additional candidate genes (FDR<0.1) from the single-trait analysis in our previous work [32]. Then, we extracted the subnetwork including the aforementioned gene list and the direct neighbors with likelihood score larger than 950 of those genes, and call this subnetwork 𝒢1. We only kept overlapping genes with our DNV data in 𝒢1 and called the final network used in our real application as 𝒢2. There were in total 1,818 genes and 21,534 edges in 𝒢2 To show that our method can leverage network information to boost risk gene identification, we applied our algorithm without using the network as an input. When there was no prior information from the network, we identified 18 significant genes with FDR<0.05. To include the network information from 𝒢2, we denote the degree of gene *i* in network 𝒢2 as *d**i*, and let the weight in the prior as ![Graphic][17] following Chen et al. [19]. After adding the network information from 𝒢2, we identified 46 genes with at least 1 DNV, and 26 genes harboring at least 2 DNVs with FDR<0.05 in the CHD cohort. We also compared our results with TADA-*De novo* test [40] and DAWN [21, 42]. We downloaded the software of DAWN and adapted its code by substituting the adjacency matrix inferred from its Partial Neighborhood Selection algorithm to the adjacency matrix from network 𝒢2. The parameter settings were discussed in the simulation section. Specifically, we also compared two approaches to control the global FDR for TADA-*De novo* test (FDR q-values and p-values with FDR adjustment). FDR q-values given by TADA-De novo identified 25 significant genes, and p-values with FDR adjustment identified the same 25 genes. After integrating the p-values with the 𝒢2 network, 36 genes were identified by the adapted DAWN method. In total, N-DATA identified 325 genes with FDR<0.05. As some of the genes may be prioritized due to high prior probability, but did not have DNV count in the study cohort, we further filtered out genes without DNV and considered the 46 genes identified with FDR<0.05 and at least 1 DNV as the candidate genes. (Table 1) View this table: [Table 1.](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/T1) Table 1. Comparison of N-DATA with other methods We visualized the overlap of 253 known human CHD genes [5], genes that were identified by TADA-*De novo* q-values, DAWN, and N-DATA in Fig 4. Fig 4A shows the 325 genes identified by N-DATA, while Fig 4B shows the 46 genes with at least 1 DNV. From Fig 4B, N-DATA found all genes that can be identified by TADA, and identified 11 different genes compared with DAWN. ![Fig 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/12/02/2021.11.30.21267069/F5.medium.gif) [Fig 4.](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/F5) Fig 4. Venn diagram of known human CHD genes, TADA genes, DAWN genes and N-DATA genes. (A) Overlapping genes between known human CHD genes, TADA*-De novo*, TADA-*De novo* +DAWN and all 325 genes identified by N-DATA. (B) Overlapping genes between known human CHD genes, TADA*-De novo*, TADA-*De novo* +DAWN and 46 candidate genes identified by N-DATA. Among the 46 genes, 18 are known human CHD genes and 35 are in the top 25% in mouse developing heart (HHE) at E14.5 [12]. Further, we calculated the overlap of the 46 N-DATA candidate genes, TADA genes, DAWN genes and 872 HHE genes that were analyzed in the 1,818 gene network. (Fig 5). ![Fig 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/12/02/2021.11.30.21267069/F6.medium.gif) [Fig 5.](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/F6) Fig 5. Venn diagram of HHE genes, TADA genes, DAWN genes and the 46 N-DATA candidate genes. N-DATA candidate genes identified 11 additional HHE genes compared with other methods. We also visualized these 46 genes in the network to demonstrate that the PPI network information can help boost statistical power and provide biological interpretation. Among the 46 candidate genes, *PTPN11, RAF1* and *RIT1* had 2 recurrent DNVs, and *CHD7, NOTCH1, NSD1* and *PYGL* also had recessive genotypes in the CHD cohort [5]. The 46 candidate genes form 4 clusters (Fig 6) in the 𝒢2 network. The biggest cluster includes seven known CHD genes *TBX5, KMT2D, PTPN11, SOS1, ACTB, NOTCH1*, and *PTEN*, which are involved in transcriptional regulation and the early cell growth or differentiation processes. The six new genes *SMAD2, KLF4, CTNNB1, CDC42, ITSN2*, and *WWTR1* also function in similar pathways and have varied implications in cardiac development. For instance, *KLF4* and *CTNNB1* have been implicated in cardiac cell differentiation [43]. *Cdc42* cardiomyocyte knock-out mice presented heart defects such as ventricular septum defects and thin ventricular walls [44]. *WWTR1* encodes a transcription regulator, which serves as an effector of Hippo pathway and regulates cardiac wall maturation in zebrafish [45]. ![Fig 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/12/02/2021.11.30.21267069/F7.medium.gif) [Fig 6.](http://medrxiv.org/content/early/2021/12/02/2021.11.30.21267069/F7) Fig 6. N-DATA model identified 46 candidate genes with at least 1 DNV. Green labels indicate the 18 genes identified when no network information was provided for N-DATA, and red labels indicate the additional 28 genes identified when the 𝒢 2 network was integrated. Circles indicate the 18 known human CHD genes, and squares indicate the 28 novel genes identified by N-DATA. The second biggest cluster is constituted of 7 new genes, all of which are involved in mRNA splicing. Specifically, *SART1, SRRM2, PRPF38A, PRPF8*, and *SF3B1* are associated factors or components of spliceosome; *HNRNPK* encodes a pre-mRNA-binding protein; *DHX9* encodes an RNA helicase which promotes R-loop formation while RNA splicing is perturbed [46]. Alternative splicing plays an essential role in heart development, homeostasis, and disease pathogenesis. Mouse knockouts of multiple splice factors had impaired cardiogenesis [47]. *SF3B1*, specifically, has been shown to upregulate to induce heart disease in both human and mice [48]. Thus, though not fully investigated, DNVs in those mRNA splicing-related genes may contribute to CHD pathogenesis. The third cluster contains genes involved in protein synthesis, includes the known gene *RPL5* and genes not previously associated with CHD (*EIF4, EIF*5, *EEF2*, and *RPL10). RPL5* and *RPL10* encode the ribosome subunits. Mutations in *RPL5* and other ribosomal genes can lead to multiple congenital anomalies, including CHD [49]. *EIF4* and *EIF5* encode translation initiation factors while *EEF2* encodes the elongation factor that regulate peptide chain elongation during protein synthesis. A recent study reported that the deficiency in ribosome associated NatA complex reduces ribosomal protein and subsequently impact cell development as a mechanism to cause CHD [50]. Thus, DNVs in the above genes may lead to CHD via impairment of protein synthesis. The last cluster contains the known CHD genes *BRAF* and *RAF1*, both of which encode key kinases in Ras signaling and are related to Noonan syndrome with CHD as a common feature. Among the un-clustered genes, six are identified after using the network information: *ABCE1, UBE2B, SDC1, PYGL, KDM5B, MED20. UBE2B* and *KDM5B*, encoding epigenetic modifiers, have shown suggestive evidence in cardiac development or CHD [51] [52] and might be potential CHD genes. ## Discussion In this article, we have introduced a Bayesian framework to integrate PPI network information as the prior knowledge into DNV analysis for CHD. This approach adopts MRF to model the interactions among genes. We apply an empirical Bayes strategy to estimate parameters in the model and conduct statistical inference based on the posterior distribution sampled from a Gibbs sampler. The simulation studies and real data analysis on CHD suggest that the proposed method has improved power to identify risk genes over methods without integrating network information. Our proposed framework is innovative for the following aspects. First, it can directly infer disease-associated genes using a network-based model without relying on summary statistics from other DNV association software. Second, it does not need to estimate hyperprior based on other sources compared to the existing pathway-based test for DNV data [17, 31]. Third, it does not require external expression data for the DNV cohort and uses the publicly available PPI database instead, which makes it more applicable to different diseases. This method will not only increase power in risk gene identification, but will also assist in biological interpretation by visualizing clusters of risk genes with functional relevance in the network. However, there are some limitations in the current N-DATA model. In real application, it is important to conduct an initial analysis on the enrichment of top genes identified from *de novo* association test in the network like our motivating example. Another limitation is that the likelihood-based inference may suffer from local maxima [19]. Thus, we recommend to initiate the labels of genes from a known risk gene set or run with multiple starts. Also, we observe the Gibbs sampler tends to move around local maxima for some time before convergence. We suggest running at least 2,000 times of iterations and discard the first 1,000 iterations as burnins. In addition, we only considered damaging DNVs and assumed the relative risk parameter *³*, is the same across all genes in N-DATA, which may cause our model to lose power if it varies across variants with different functions (e.g., LoF and Dmis). Future studies may explore adding functional annotation of variants as a layer in the model to further improve statistical power. ## Conclusions The topologic information in a pathway may be informative to identify functionally interrelated genes and help improve statistical power in DNV studies. Under the hypothesis that connected genes in PPI networks are more likely to share similar disease association status, we developed a novel statistical model that can leverage information from publicly available PPI databases. Through simulation studies under multiple settings, we proved our method can increase statistical power in identifying additional risk genes compared to methods without using the PPI network information. We then applied our method to the CHD DNV data, and then visualized the subnetwork of candidate genes to find potential functional gene clusters for CHD. Our results may shed new insight on the shared protein functionality among risk genes for CHD. ## Supporting information Additional File 1 [[supplements/267069_file02.docx]](pending:yes) ## Data Availability All data produced in the present work are available online at [https://github.com/JustinaXie/NDATA](https://github.com/JustinaXie/NDATA) ## Funding This work was supported in part by NIH grant R03HD100883-01A1 (Y.X. and H.Z.) and R01GM134005-01A1 (W.J., H.L., and H.Z.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. ## Ethics Declaration ### Ethics approval and consent to participate This manuscript is a mainly methodology paper. The data used in this manuscript is all public available. The research has no procedure related to collecting human subject data. ## Consent for publication Not applicable. ## Competing interests The authors declare that they have no competing interests. ## Supplementary Information Additional File 1. Supplementary figures 1-4 ## Acknowledgements We thank Jin et al. [5] for sharing the *de novo* variant data of CHD. We thank Andrew Xu and Dr. Min Chen for discussions on coding, and Ziyu Jiang for discussions on diagnostic for Bayesian inference. ## List of abbreviations DNV : *de novo* variant CHD : congenital heart disease WES : Whole Exome Sequencing PPI : protein-protein Interaction FDR : false discovery rate MRF : Markov Random Field GWAS : Genome-Wide Association Studies * Received November 30, 2021. * Revision received November 30, 2021. * Accepted December 2, 2021. * © 2021, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## References 1. 1.Zaidi S, Choi M, Wakimoto H, Ma L, Jiang J, Overton JD, Romano-Adesman A, Bjornson RD, Breitbart RE, Brown KK, et al: De novo mutations in histone-modifying genes in congenital heart disease. Nature 2013, 498:220–223. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature12141&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23665959&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000320283400046&link_type=ISI) 2. 2.Postma AV, Bezzina CR, Christoffels VM: Genetics of congenital heart disease: the contribution of the noncoding regulatory genome. Journal of Human Genetics 2016, 61:13–19. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/jhg.2015.98&link_type=DOI) 3. 3.Sevim Bayrak C, Zhang P, Tristani-Firouzi M, Gelb BD, Itan Y: De novo variants in exomes of congenital heart disease patients identify risk genes and pathways. Genome Med 2020, 12:9. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s13073-019-0709-8&link_type=DOI) 4. 4.Diab NS, Barish S, Dong W, Zhao S, Allington G, Yu X, Kahle KT, Brueckner M, Jin SC: Molecular Genetics and Complex Inheritance of Congenital Heart Disease. Genes (Basel) 2021, 12. 5. 5.Jin SC, Homsy J, Zaidi S, Lu Q, Morton S, DePalma SR, Zeng X, Qi H, Chang W, Sierant MC, et al: Contribution of rare inherited and de novo variants in 2,871 congenital heart disease probands. Nat Genet 2017, 49:1593–1601. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3970&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28991257&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 6. 6.Zaidi S, Brueckner M: Genetics and Genomics of Congenital Heart Disease. Circ Res 2017, 120:923–940. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTA6ImNpcmNyZXNhaGEiO3M6NToicmVzaWQiO3M6OToiMTIwLzYvOTIzIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTIvMDIvMjAyMS4xMS4zMC4yMTI2NzA2OS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 7. 7.Glessner JT, Bick AG, Ito K, Homsy J, Rodriguez-Murillo L, Fromer M, Mazaika E, Vardarajan B, Italia M, Leipzig J, et al: Increased frequency of de novo copy number variants in congenital heart disease by integrative analysis of single nucleotide polymorphism array and exome sequence data. Circ Res 2014, 115:884–896. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTA6ImNpcmNyZXNhaGEiO3M6NToicmVzaWQiO3M6MTA6IjExNS8xMC84ODQiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8xMi8wMi8yMDIxLjExLjMwLjIxMjY3MDY5LmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 8. 8.Soemedi R, Wilson IJ, Bentham J, Darlay R, Töpf A, Zelenika D, Cosgrove C, Setchfield K, Thornborough C, Granados-Riveron J, et al: Contribution of global rare copy-number variants to the risk of sporadic congenital heart disease. Am J Hum Genet 2012, 91:489–501. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ajhg.2012.08.003&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22939634&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 9. 9.Pierpont ME, Brueckner M, Chung WK, Garg V, Lacro RV, McGuire AL, Mital S, Priest JR, Pu WT, Roberts A, et al: Genetic Basis for Congenital Heart Disease: Revisited: A Scientific Statement From the American Heart Association. Circulation 2018, 138:e653–e711. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1161/CIR.0000000000000606&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 10. 10.Teer JK, Mullikin JC: Exome sequencing: the sweet spot before whole genomes. Human Molecular Genetics 2010, 19:R145–R151. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/hmg/ddq333&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=20705737&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000284456100005&link_type=ISI) 11. 11.Rabbani B, Tekin M, Mahdieh N: The promise of whole-exome sequencing in medical genetics. Journal of Human Genetics 2014, 59:5–15. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/jhg.2013.114&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24196381&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 12. 12.Homsy J, Zaidi S, Shen Y, Ware JS, Samocha KE, Karczewski KJ, DePalma SR, McKean D, Wakimoto H, Gorham J, et al: De novo mutations in congenital heart disease with neurodevelopmental and other congenital anomalies. Science 2015, 350:1262–1266. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEzOiIzNTAvNjI2NS8xMjYyIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTIvMDIvMjAyMS4xMS4zMC4yMTI2NzA2OS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 13. 13.Richter F, Morton SU, Kim SW, Kitaygorodsky A, Wasson LK, Chen KM, Zhou J, Qi H, Patel N, DePalma SR: Genomic analyses implicate noncoding de novo variants in congenital heart disease. Nature genetics 2020, 52:769–777. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41588-020-0652-z&link_type=DOI) 14. 14.Watkins WS, Hernandez EJ, Wesolowski S, Bisgrove BW, Sunderland RT, Lin E, Lemmon G, Demarest BL, Miller TA, Bernstein D: De novo and recessive forms of congenital heart disease have distinct genetic and phenotypic landscapes. Nature communications 2019, 10:1–12. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41467-019-11876-5&link_type=DOI) 15. 15.Sifrim A, Hitz M-P, Wilsdon A, Breckpot J, Turki SHA, Thienpont B, McRae J, Fitzgerald TW, Singh T, Swaminathan GJ, et al: Distinct genetic architectures for syndromic and nonsyndromic congenital heart defects identified by exome sequencing. Nature Genetics 2016, 48:1060–1065. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3627&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27479907&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 16. 16.Sifrim A, Hitz MP, Wilsdon A, Breckpot J, Turki SH, Thienpont B, McRae J, Fitzgerald TW, Singh T, Swaminathan GJ, et al: Distinct genetic architectures for syndromic and nonsyndromic congenital heart defects identified by exome sequencing. Nat Genet 2016, 48:1060–1065. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3627&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27479907&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 17. 17.Nguyen TH, He X, Brown RC, Webb BT, Kendler KS, Vladimirov VI, Riley BP, Bacanu SA: DECO: a framework for jointly analyzing de novo and rare case/control variants, and biological pathways. Brief Bioinform 2021. 18. 18.Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM: Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 2011, 21:1109–1121. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NjoiZ2Vub21lIjtzOjU6InJlc2lkIjtzOjk6IjIxLzcvMTEwOSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzEyLzAyLzIwMjEuMTEuMzAuMjEyNjcwNjkuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 19. 19.Chen M, Cho J, Zhao H: Incorporating biological pathways via a Markov random field model in genome-wide association studies. PLoS Genet 2011, 7:e1001353. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1001353&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21490723&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 20. 20.Hou L, Chen M, Zhang CK, Cho J, Zhao H: Guilt by rewiring: gene prioritization through network rewiring in genome wide association studies. Hum Mol Genet 2014, 23:2780–2790. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/hmg/ddt668&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24381306&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000334694800022&link_type=ISI) 21. 21.Liu L, Lei J, Roeder K: Network assisted analysis to reveal the genetic basis of autism. The Annals of Applied Statistics 2015, 9:1571-1600, 1530. 22. 22.Nguyen HT, Bryois J, Kim A, Dobbyn A, Huckins LM, Munoz-Manchado AB, Ruderfer DM, Genovese G, Fromer M, Xu X, et al: Integrated Bayesian analysis of rare exonic variants to identify risk genes for schizophrenia and neurodevelopmental disorders. Genome Med 2017, 9:114. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s13073-017-0497-y&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29262854&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 23. 23.Nguyen T-H, Dobbyn A, Brown RC, Riley BP, Buxbaum JD, Pinto D, Purcell SM, Sullivan PF, He X, Stahl EA: mTADA is a framework for identifying risk genes from de novo mutations in multiple traits. Nature Communications 2020, 11:2929. 24. 24.Oughtred R, Stark C, Breitkreutz BJ, Rust J, Boucher L, Chang C, Kolas N, O’Donnell L, Leung G, McAdam R, et al: The BioGRID interaction database: 2019 update. Nucleic Acids Res 2019, 47:D529–d541. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gky1079&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30476227&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 25. 25.Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, et al: The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res 2014, 42:D358–363. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkt1115&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24234451&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000331139800054&link_type=ISI) 26. 26.Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Res 2004, 32:D449–451. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkh086&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=14681454&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000188079000106&link_type=ISI) 27. 27.Licata L, Briganti L, Peluso D, Perfetto L, Iannuccelli M, Galeota E, Sacco F, Palma A, Nardozza AP, Santonico E, et al: MINT, the molecular interaction database: 2012 update. Nucleic Acids Res 2012, 40:D857–861. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkr930&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22096227&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000298601300128&link_type=ISI) 28. 28.Mishra GR, Suresh M, Kumaran K, Kannabiran N, Suresh S, Bala P, Shivakumar K, Anuradha N, Reddy R, Raghavan TM, et al: Human protein reference database--2006 update. Nucleic Acids Res 2006, 34:D411–414. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkj141&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=16381900&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000239307700090&link_type=ISI) 29. 29.Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, et al: STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 2019, 47:D607–d613. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gky1131&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30476243&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 30. 30.Nguyen TH, Dobbyn A, Brown RC, Riley BP, Buxbaum JD, Pinto D, Purcell SM, Sullivan PF, He X, Stahl EA: mTADA is a framework for identifying risk genes from de novo mutations in multiple traits. Nat Commun 2020, 11:2929. 31. 31.Nguyen HT, Dobbyn A, Charney AW, Bryois J, Kim A, Mcfadden W, Skene NG, Huckins LM, Wang W, Ruderfer DM, et al: Integrative analysis of rare variants and pathway information shows convergent results between immune pathways, drug targets and epilepsy genes. bioRxiv 2018:410100. 32. 32.Xie Y, Li M, Dong W, Jiang W, Zhao H: M-DATA: A statistical approach to jointly analyzing de novo mutations for multiple traits. PLoS Genet 2021, 17:e1009849. 33. 33.Kindermann R: Markov random fields and their applications. American mathematical society 1980. 34. 34.Samocha KE, Robinson EB, Sanders SJ, Stevens C, Sabo A, McGrath LM, Kosmicki JA, Rehnström K, Mallick S, Kirby A, et al: A framework for the interpretation of de novo mutation in human disease. Nat Genet 2014, 46:944–950. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ng.3050&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25086666&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 35. 35.Besag J: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society: Series B (Methodological) 1986, 48:259–279. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.2517-6161.1986.tb01412.x&link_type=DOI) 36. 36.Le Cessie S, Van Houwelingen JC: Ridge estimators in logistic regression. Journal of the Royal Statistical Society: Series C (Applied Statistics) 1992, 41:191–201. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.2307/2347628&link_type=DOI) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1992GW30300017&link_type=ISI) 37. 37.Sun W, Tony Cai T: Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2009, 71:393–424. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1467-9868.2008.00694.x&link_type=DOI) 38. 38.Li H, Wei Z, Maris J: A hidden Markov random field model for genome-wide association studies. Biostatistics 2010, 11:139–150. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biostatistics/kxp043&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19822692&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000272925500011&link_type=ISI) 39. 39.Jiang W, Yu W: Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies. Bioinformatics 2016, 33:500–507. 40. 40.He X, Sanders SJ, Liu L, De Rubeis S, Lim ET, Sutcliffe JS, Schellenberg GD, Gibbs RA, Daly MJ, Buxbaum JD, et al: Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genet 2013, 9:e1003671. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1003671&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23966865&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 41. 41.Li M: Gene-based Association Analysis for Genome-wide Association and Whole-exome Sequencing Studies. Yale University, Biostatistics; 2020. 42. 42.Liu L, Lei J, Sanders SJ, Willsey AJ, Kou Y, Cicek AE, Klei L, Lu C, He X, Li M, et al: DAWN: a framework to identify autism genes and subnetworks using gene expression and genetics. Mol Autism 2014, 5:22. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/2040-2392-5-22&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24602502&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) 43. 43.Kami D, Kitani T, Kawasaki T, Gojo S: Cardiac mesenchymal progenitors differentiate into adipocytes via Klf4 and c-Myc. Cell death & disease 2016, 7:e2190–e2190. 44. 44.Liu Y, Wang J, Li J, Wang R, Tharakan B, Zhang SL, Tong CW, Peng X: Deletion of Cdc42 in embryonic cardiomyocytes results in right ventricle hypoplasia. Clinical and translational medicine 2017, 6:40–40. 45. 45.Lai JKH, Collins MM, Uribe V, Jiménez-Amilburu V, Günther S, Maischein HM, Stainier DYR: The Hippo pathway effector Wwtr1 regulates cardiac wall maturation in zebrafish. Development 2018, 145. 46. 46.Chakraborty P, Huang JTJ, Hiom K: DHX9 helicase promotes R-loop formation in cells with impaired RNA splicing. Nat Commun 2018, 9:4346. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41467-018-06677-1&link_type=DOI) 47. 47.Zahr HC, Jaalouk DE: Exploring the Crosstalk Between LMNA and Splicing Machinery Gene Mutations in Dilated Cardiomyopathy. Front Genet 2018, 9:231. 48. 48.van den Hoogenhof MM, Pinto YM, Creemers EE: RNA Splicing: Regulation and Dysregulation in the Heart. Circ Res 2016, 118:454–468. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTA6ImNpcmNyZXNhaGEiO3M6NToicmVzaWQiO3M6OToiMTE4LzMvNDU0IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTIvMDIvMjAyMS4xMS4zMC4yMTI2NzA2OS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 49. 49.Gazda HT, Sheen MR, Vlachos A, Choesmel V, O’Donohue MF, Schneider H, Darras N, Hasman C, Sieff CA, Newburger PE, et al: Ribosomal protein L5 and L11 mutations are associated with cleft palate and abnormal thumbs in Diamond-Blackfan anemia patients. Am J Hum Genet 2008, 83:769–780. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ajhg.2008.11.004&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19061985&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F12%2F02%2F2021.11.30.21267069.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000261822100012&link_type=ISI) 50. 50.Ward T, Tai W, Morton S, Impens F, Van Damme P, Van Haver D, Timmerman E, Venturini G, Zhang K, Jang MY, et al: Mechanisms of Congenital Heart Disease Caused by NAA15 Haploinsufficiency. Circ Res 2021, 128:1156–1169. 51. 51.Robson A, Makova SZ, Barish S, Zaidi S, Mehta S, Drozd J, Jin SC, Gelb BD, Seidman CE, Chung WK, et al: Histone H2B monoubiquitination regulates heart development via epigenetic control of cilia motility. Proc Natl Acad Sci U S A 2019, 116:14049–14054. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTE2LzI4LzE0MDQ5IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMTIvMDIvMjAyMS4xMS4zMC4yMTI2NzA2OS5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 52. Audain E, Wilsdon A, Breckpot J, Izarzugaza JMG, Fitzgerald TW, Kahlert AK, Sifrim A, Wünnemann F, Perez-Riverol Y, Abdul-Khaliq H, et al: Integrative analysis of genomic variants reveals new associations of candidate haploinsufficient genes with congenital heart disease. PLoS Genet 2021, 17:e1009679. [1]: /embed/graphic-2.gif [2]: /embed/inline-graphic-1.gif [3]: /embed/graphic-3.gif [4]: /embed/inline-graphic-2.gif [5]: /embed/graphic-4.gif [6]: /embed/inline-graphic-3.gif [7]: /embed/inline-graphic-4.gif [8]: /embed/inline-graphic-5.gif [9]: /embed/inline-graphic-6.gif [10]: /embed/graphic-5.gif [11]: /embed/inline-graphic-7.gif [12]: /embed/inline-graphic-8.gif [13]: /embed/inline-graphic-9.gif [14]: /embed/graphic-7.gif [15]: /embed/inline-graphic-10.gif [16]: /embed/inline-graphic-11.gif [17]: /embed/inline-graphic-12.gif