Abstract
Although rare neurodevelopmental conditions have a large Mendelian component, common genetic variants also contribute to risk. However, little is known about how this polygenic risk is distributed among patients with these conditions and their parents, its interplay with rare variants, and whether parents’ polygenic background contributes to their children’s risk beyond the direct effect of variants transmitted to the child (i.e. via indirect genetic effects potentially mediated through the prenatal environment or ‘genetic nurture’). Here, we addressed these questions using genetic data from 11,573 patients with rare neurodevelopmental conditions, 9,128 of their parents and 26,869 controls. Common variants explained ∼10% of variance in overall risk. Patients with a monogenic diagnosis had significantly less polygenic risk than those without, supporting a liability threshold model, while both genetically undiagnosed patients and diagnosed patients with affected parents had significantly more risk than controls. In a trio-based model, using a polygenic score for neurodevelopmental conditions, the transmitted but not the non-transmitted parental alleles were associated with risk, indicating a direct genetic effect. In contrast, we observed no direct genetic effect of polygenic scores for educational attainment and cognitive performance, but saw a significant correlation between the child’s risk and non-transmitted alleles in the parents, potentially due to indirect genetic effects and/or parental assortment for these traits. Indeed, as expected under parental assortment, we show that common variant predisposition for neurodevelopmental conditions is correlated with the rare variant component of risk. Our findings thus suggest that future studies should investigate the possible role and nature of indirect genetic effects on rare neurodevelopmental conditions, and consider the contribution of common and rare variants simultaneously when studying cognition-related phenotypes.
Main
Rare conditions affect 3.5%-6% of the global population1 and of these, the majority involve the central nervous system2. While genomic sequencing has revolutionized the diagnosis of rare neurodevelopmental conditions, which typically include intellectual disability and/or developmental delay, a monogenic diagnosis is only identified for about 30-40% of patients3–5. Common variants also contribute to risk for rare neurodevelopmental conditions6,7. In particular, this common variant contribution overlaps with polygenic risk for schizophrenia and for predisposition to reduced educational attainment and cognitive performance6. Accordingly, rare damaging variants in constrained genes, which play a major role in risk of rare neurodevelopmental conditions, are also associated with reduced educational attainment and cognitive performance and increased risk of mental health conditions in UK Biobank 8–12. In this work, we seek to better understand the nature of common variant risk for rare neurodevelopmental conditions, its interplay with rare variants and its distribution amongst different patients and their parents.
We begin by leveraging new, larger genome-wide association studies (GWASs) than were previously available6 to explore the extent to which common variant effects on rare neurodevelopmental conditions are correlated with their effects on a broad range of mental health conditions. This is motivated by findings that some psychiatric conditions have a partial neurodevelopmental origin 13–15, and that people with rare neurodevelopmental conditions16, as well as their relatives 17–19, are more likely to have psychiatric conditions. Furthermore, some of this overlap appears to be driven by certain rare copy number variants with variable expressivity20–22, suggesting some shared etiology between psychiatric and rare neurodevelopmental conditions. Here we explore whether shared common variant effects may also contribute, and whether this is independent of the genetic overlap between these conditions and cognitive traits.
Little is known about the interplay between rare and common variants in the context of rare neurodevelopmental conditions, and dissecting this will be key to fully understanding their genetic architecture and improving genetic diagnosis and risk prediction. Here we address two hypotheses in this space, testing the liability threshold model and whether common variants modify the penetrance of rare variants. The liability threshold model predicts that an individual will develop a condition once the sum of independent genetic and environmental risk factors exceeds some threshold23–26. Under this model, one might expect that patients with neurodevelopmental conditions who have a highly penetrant damaging variant would require, on average, less polygenic load to cross a diagnostic threshold than those without such variants (Extended Data Figure 1). We previously saw no significant difference in polygenic scores between patients with versus without a monogenic diagnosis6, but in this work, we anticipated that increased sample size and improved diagnostic rate5,27 might improve power. Since rare variants associated with neurodevelopmental conditions appear to act additively with polygenic scores in affecting cognitive ability in UK Biobank9,10, we hypothesized that polygenic background would modify the penetrance of these inherited rare variants in families with neurodevelopmental conditions, as it does, for example, in the context of BRCA1/2 variants predisposing to breast cancer28.
A) shows the estimates from Linkage Disequilibrium Score Regression for the DDD GWAS (orange) and the meta-analysis of neurodevelopmental conditions between DDD and GEL (blue). B) shows the estimates for the meta-analysis after conditioning on the GWAS summary statistics for educational attainment (green) or cognitive performance (purple) using GenomicSEM. Error bars show 95% confidence intervals.
Finally, we explore the extent to which common variants predisposing to rare neurodevelopmental conditions act directly on the affected individuals carrying them (“direct genetic effects”). Many studies have shown that genetic associations between common genetic variants and educational and cognitive phenotypes shrink when estimated within families 29–33. One possible explanation for this is that variants associated with these traits have indirect genetic effects, i.e. they have some effect on the parents, and this then affects the offspring through the family environment 29,33–35. These indirect genetic effects are under-explored in the context of rare diseases, but we hypothesized that they may play a role in rare neurodevelopmental conditions given the genetic overlap with educational attainment.
We address these questions using two large UK-based cohorts of individuals with rare neurodevelopmental conditions, the Deciphering Developmental Disorders study (DDD; N=7,955 patients with genotype array and exome sequence data) and the Genomics England 100,000 Genomes project (GEL; N=3,618 patients with genome sequence data), combined with several control cohorts (Supplementary Table 1). We have included a “Frequently Asked Questions” document in less technical language to explain the study, and to address some possible misunderstandings.
Results
GWAS meta-analysis for neurodevelopmental conditions reveals novel genetic correlations with other brain-related traits and conditions
We first sought to replicate the key findings from our previous GWAS for neurodevelopmental conditions6 in a large independent cohort. We identified a subset of GEL rare disease families with neurodevelopmental conditions and removed families overlapping with the DDD study (Methods). Almost all probands with neurodevelopmental conditions in GEL (97%) had intellectual disability or global developmental delay, versus 88% of those in DDD. The cohorts were broadly phenotypically similar (Extended Data Figure 2; Supplementary Note 1). To avoid spurious results due to population stratification, all genetic analyses were conducted in a genetically homogeneous subset of individuals with genetic similarity to British individuals from the 1000 Genomes Project 36, henceforth referred to as having GBR ancestry.
A) Average polygenic scores in probands with (“diagnosed”; N=3,821; dark blue) versus without (“undiagnosed; N=6,345; red) a monogenic diagnosis, from DDD and GEL combined. Subsets of diagnosed probands from trios are in light blue. The polygenic scores have been standardized such that the controls (UK Household Longitudinal Study+GEL for all polygenic scores except PGSNDD,DDD, for which only GEL controls were used) have mean 0 and variance 1. See Supplementary Table 5 for results of statistical tests comparing the various groups. See also Extended Data Figure 4. B) Associations between various factors and diagnostic status within the full DDD cohort5, with or without correcting for the proband’s PGSEA, calculated within GBR-ancestry probands with neurodevelopmental conditions from DDD using logistic regression (see Supplementary Table 8). C) Associations between these factors and DDD probands’, mothers’ or fathers’ PGSEA, assessed via linear regression. Two asterisks indicate that the association passed Bonferroni correction for seven factors. Error bars show 95% confidence intervals. FROH: the fraction of the genome in runs of homozygosity; the expected value is 0.0625 for individuals whose parents are first cousins. EA: educational attainment; CP: cognitive performance; NonCogEA: the non-cognitive component of EA42; SCZ: schizophrenia; NDC,DDD: neurodevelopmental conditions, with the GWAS conducted in DDD versus the UK Household Longitudinal Study, and the polygenic score tested only in samples excluded from the GWAS (GEL and DDD Omni chip).
When comparing 3,618 unrelated patients with neurodevelopmental conditions to 13,667 unrelated controls within GEL, polygenic scores (PGSs) for educational attainment37, cognitive performance37, and schizophrenia38 each explained a significant but small amount of variance on the liability scale (<1%; logistic regression p<3.9x10-4). This was similar to that observed when comparing 6,397 unrelated patients from DDD with 9,270 independent unrelated controls (Supplementary Table 2). The polygenic score for neurodevelopmental conditions derived from our GWAS in DDD6 was also associated with neurodevelopmental conditions within GEL (p=1.1x10-6, R2=0.11% on the liability scale; Supplementary Table 2).
These results suggested that the polygenic contribution to rare neurodevelopmental conditions was similar between these two cohorts. Thus, to increase power to study common variant effects on these conditions, we conducted a GWAS in GEL, then meta-analyzed the results with the DDD GWAS (Extended Data Figure 3; Supplementary Data 1, 2 and 3). No single nucleotide polymorphism (SNP) passed genome-wide significance (p<5x10-8) in either DDD or GEL alone, but in the meta-analysis, six SNPs were significant in two independent loci on chromosomes 15 and 22, respectively (lead SNPs: rs113446150, p=4.0x10-8; rs2284084, p=1.7x10-8; Supplementary Note 2). Variants at one of these loci are associated with cognitive traits39,40. The fraction of phenotypic variance explained by genome-wide common variants - the SNP heritability - was estimated at between 3.7% (95% CI: 1.7–5.7%) and 11.2% (8.5–13.8%), depending on the method used (Supplementary Table 3).
A) Polygenic transmission disequilibrium test (pTDT) in undiagnosed probands with unaffected parents. We tested if probands’ polygenic score deviated from mean parental polygenic score in trios from GEL (N=1,343) and DDD (N=1,523, or N=224 for testing PGSNDC,DDD). Plotted is the mean pTDT deviation (difference between the child’s polygenic score and the mean parental polygenic score, in units of the SD of the latter), with error bars showing 95% confidence intervals. B) Mean polygenic score difference from control samples (GEL+UK Household Longitudinal Study, or GEL alone when testing PGSNDC,DDD). This includes only the samples in the trios used in the pTDT analysis. Error bars indicate 95% confidence intervals estimated from two-sided t-tests. See also Extended Data Figure 4 and Supplementary Table 7.
To test for possible shared genetic contributors to rare neurodevelopmental conditions and other brain-related traits and conditions, we calculated genetic correlations (rg) between them using our own and published GWAS meta-analyses. We observed the expected negative genetic correlations between neurodevelopmental conditions and educational attainment 37 (EA; rg=- 0.65 [-0.84, -0.47], p=4.9x10-12) and cognitive performance 37 (CP; rg=-0.56 [-0.73, -0.39], p=1.6x10-10), stronger in magnitude than those observed with the DDD GWAS alone, and a positive genetic correlation with schizophrenia38 (SCZ; rg=0.27 [0.13, 0.40], p=9.7x10-5) (Figure 1A; Supplementary Table 4). Additionally, we detected significant genetic correlations (p<0.0038=0.05/13 traits) with several other mental health conditions including Attention-Deficit Hyperactive Disorder (ADHD)41 (rg=0.46 [0.28, 0.64], p=5.2x10-7), and with the non-cognitive component of educational attainment derived from GWAS-by-subtraction (NonCogEA) 42 (rg=- 0.37 [-0.52, -0.22], p=1.2x10-6) (Figure 1A). We hypothesized that the genetic correlations with mental health conditions could be explained at least in part by their relationship with educational attainment42,43, given the strong negative genetic correlation between that and neurodevelopmental conditions. To explore this, we used Genomic Structural Equation Modelling 44 (GenomicSEM) to re-estimate the genetic correlations while conditioning on the educational attainment GWAS summary statistics (Figure 1B). This significantly attenuated the genetic correlation with ADHD (rg=0.14 [-0.06, 0.34], p=0.18; two-sided z-test p=0.021 compared to unconditional rg), but the genetic correlations with the other conditions did not significantly change.
These results confirmed that common variants collectively associate with rare neurodevelopmental conditions in two independent cohorts, and that these common variant effects are shared with other brain-related conditions and cognitive traits. To further explore the contribution of polygenic background, below we used polygenic scores for neurodevelopmental conditions from the DDD-derived GWAS6 (PGSNDC,DDD) and for the most significantly genetically correlated traits (PGSEA, PGSCP, PGSNonCogEA, PGSSCZ) for which much larger GWASs and thus more powerful polygenic scores are available. All polygenic scores were corrected for principal components and standardized such that the controls from GEL and the UK Household Longitudinal Study have mean 0 and variance 1, except PGSNDC,DDD which was standardized in the GEL controls alone. We note that several of these polygenic scores are significantly correlated with each other (Supplementary Figure 1), so, in the analyses below, our correction for multiples of five tests is conservative.
Probands with monogenic diagnoses have less polygenic risk
Since 36% of patients in these cohorts have a molecular monogenic diagnosis (including de novo, recessive, X-linked or inherited dominant diagnoses), we next tested whether these diagnosed patients differed from undiagnosed patients in terms of their polygenic risk. Consistent with the liability threshold model (Extended Data Figure 1), we observed significantly higher PGSEA (DDD and GEL combined; average difference Δ=0.12 SD, two-sided t-test p=3.0x10-9), PGSCP (Δ=0.068 SD, p=1.2x10-3), and PGSNonCogEA (Δ=0.085 SD, p=3.7x10-5) in probands with versus without a monogenic diagnosis (all passing Bonferroni significance i.e. p<0.05/5; Figure 2A). Despite this, we observed that for all polygenic scores except for PGSNonCogEA, the diagnosed probands still had significantly more polygenic risk than the controls (p<0.05/5; Figure 2A; Supplementary Table 5). Sensitivity analyses suggest that this observation is not driven by ascertainment bias in the controls, although the effect size is sensitive to the choice of control cohort, particularly for PGSEA (Supplementary Note 3, Extended Data Figure 4, >Extended Data Figure 5, Supplementary Table 6). The difference between the diagnosed probands and controls is driven by those with affected parents (i.e. those reported by clinicians to show a similar phenotype to their child), who had significantly more polygenic risk for several traits than those with unaffected parents (e.g. PGSEA Δ=0.26 SD, p=3.4x10-3) (Extended Data Figure 4; Supplementary Table 5). However, amongst undiagnosed probands, both those with affected parents and with unaffected parents showed significantly more polygenic risk than controls (Extended Data Figure 4, Supplementary Table 7).
The plot shows effect sizes of PGSs on case/control status, testing either the child’s PGS alone (“proband only”) amongst trio probands, or while additionally controlling for the parents’ PGSs (“trio model”). These were obtained from a logistic regression comparing undiagnosed proband with neurodevelopmental conditions from 2,866 trios in which parents are unaffected with 4,804 control trios from GEL (N=872), the Avon Longitudinal Study of Children and Parents (N=1,434) and the Millennium Cohort Study (N=2,498). Error bars indicate 95% confidence intervals.
Correlation coefficients between the number of rare inherited rare damaging coding (left) or synonymous variants (right; negative control) in constrained genes and polygenic scores within/between different sets of individuals. In blue are the correlations within probands with neurodevelopmental conditions whose parents are unaffected (i.e. the child’s rare variant burden score, RVBS, with their own polygenic score, PGS), and in purple are the correlations within their parents. In orange is the cross-parental correlation i.e. one parent’s RVBS correlated with the other parent’s PGS. We calculated the correlations in a combined sample of trios with neurodevelopmental conditions from DDD and GEL (N=3,999 or 2,553 for PGSNDC,DDD excluding samples from the original GWAS6). Note that both the RVBSs and PGSs have been corrected for 20 genetic principal components. Error bars represent 95% confidence intervals.
We next explored whether the difference in polygenic risk between diagnosed and undiagnosed probands was related to various technical, clinical and prenatal factors that are associated with receiving a monogenic diagnosis in DDD5 (Figure 2B). For example, diagnosed probands were more likely to be in a trio (probably due to the ability to distinguish de novo from inherited variants) and to have severe intellectual disability, and less likely to have been born prematurely (a known epidemiological risk factor for neurodevelopmental conditions45–47) (Supplementary Table 8). We hypothesized that some of these associations might be confounding, or be confounded by the association between PGSEA and diagnostic status, since, for example, single-parent households and premature birth are associated with higher levels of deprivation/lower parental educational attainment48,49. Indeed, we observed that the probands’ PGSEA was significantly associated with several of these factors (Figure 2C): a higher chance of being in a trio and having more severe intellectual disability, and a lower chance of being born prematurely and having any affected first-degree relatives (Extended Data Figure 6). However, it was not associated with sex (Supplementary Note 4; Extended Data Figure 7) or maternal diabetes (Figure 2C; Supplementary Table 8). Controlling for PGSEA minimally altered the association between these factors and diagnostic status (Figure 2B). Only a small part of the association between PGSEA and diagnostic status was mediated by the effects of trio status (11%, 95% CI: 6.2–20.8%) and prematurity (3.1%, 95% CI: 0.4–7.3%). Thus, the observation that diagnosed patients tend to have lower polygenic risk than undiagnosed probably largely reflects the liability threshold model under which both common and rare variants contribute to risk (Extended Data Figure 1).
Limited evidence for over-transmission of polygenic risk from unaffected parents to probands
Common variants are inherited from parents, and most of the parents in our sample are reported by clinicians to be clinically unaffected (89.2% in DDD and 95.4% in GEL, although the clinical annotation of parental affected status may be imperfect). Given this, and results in autism50, we hypothesized that probands without monogenic diagnoses inherit higher common variant risk for neurodevelopmental conditions from unaffected parents than one would expect given their parents’ mean risk. Applying the polygenic transmission disequilibrium test (pTDT)50 to undiagnosed trios with unaffected parents (Figure 3A) (1,343 in GEL plus 1,523 in DDD), we saw nominally significant over-transmission of PGSNDC,DDD in 1,567 families not included in the original GWAS (pTDT deviation = 0.062; paired t-test p=0.014). This over-transmission was significant in females (pTDT deviation = 0.10, p=0.0078 in 589 trios) but not in males (pTDT deviation = 0.036, p=0.27 in 978 trios) (Extended Data Figure 7C; Supplementary Note 4). However, we saw no significant transmission disequilibrium for the other polygenic scores (paired t-test p>0.05), in either sex or in both sexes combined. Given the known over-transmission of PGSEA to autistic individuals50, we excluded autistic individuals from our sample and repeated the pTDT, but conclusions were unchanged (Supplementary Figure 2). When focusing on probands with a monogenic genetic diagnosis, we saw no significant transmission distortion for any polygenic score tested (Supplementary Figure 3).
To put the pTDT results in context, we compared average polygenic scores between unaffected parents of undiagnosed patients and controls (Figure 3B; Supplementary Table 7). For all five polygenic scores tested, the parents had more polygenic risk than controls (two-sided t-test p<5.2x10-6 for all polygenic scores except PGSSCZ which had p=0.0093). Given this observation and the results from the pTDT, we conclude that risk for neurodevelopmental conditions is affected both by familial polygenic background, or factors correlated with it, and by polygenic risk (specifically, PGSNDC,DDD) that is over-transmitted from unaffected parents to affected children.
Non-transmitted common alleles in unaffected parents are associated with their children’s risk
We next explored one way in which familial polygenic background might affect children’s risk of neurodevelopmental conditions, namely indirect genetic effects, i.e. effects of alleles in parents on parental phenotypes that affect their offspring’s risk through the family environment. Indirect genetic effects have been argued to explain around ∼30-45% of the association between polygenic predictors of educational attainment and school grades33,51 and educational attainment29,40,52, although these inferences have been contested as confounded by parental assortment and population stratification51,53. To investigate the possible role of indirect genetic effects in risk of neurodevelopmental conditions, we compared 2,866 affected trio probands whose parents are unaffected with 4,804 control trios from two UK birth cohorts (N=3,932 trios) and from GEL (N=872 trios without neurodevelopmental conditions). Using logistic regression, we tested whether the children’s polygenic scores for traits related to neurodevelopmental conditions were significantly associated with case status (“proband only” model), and whether this held after conditioning on the parents’ polygenic scores (i.e. including all three trio members’ polygenic scores as covariates in the “trio model”)54 (Figure 4). The idea of this model is to isolate the environmentally-mediated portion of polygenic risk in the parents from the direct effects of alleles transmitted to their children. Following Young et al.31, we refer to the coefficients on the parental polygenic scores in the trio model as the “non-transmitted coefficients”, since they are mathematically equivalent to the coefficients on the polygenic score constructed from the non-transmitted alleles in a joint regression with the proband polygenic score (Methods).
For PGSEA, PGSCP, and PGSNonCogEA, we found that undiagnosed probands’ polygenic scores were no longer significantly associated with having a neurodevelopmental condition after conditioning on their parents’ polygenic scores in the trio model (implying limited or no direct genetic effects), whereas the non-transmitted coefficients were highly significant (Figure 4). This result held for PGSEA and PGSNonCogEA when analyzing trios with versus without neurodevelopmental conditions from GEL alone, and when using different combinations of control cohorts (Supplementary Figure 4); for PGSCP, results from these sensitivity analyses were more equivocal, but no evidence of direct genetic effects was seen. This finding could imply that there are aspects of the environment — including the prenatal environment — that are correlated with these non-transmitted alleles and that affect risk of neurodevelopmental conditions, including genetically-influenced parental phenotypes. However, our observations could also be due to the effects of parental assortment (i.e. phenotypic correlation between partners), which we discuss further below.
For PGSNDC,DDD, we found that the probands’ polygenic scores were still nominally significantly associated with having a neurodevelopmental condition after controlling for their parents’ polygenic scores in the trio model (Figure 4). This implies that there is a direct genetic effect of PGSNDC,DDD on the probands’ risk of neurodevelopmental conditions, consistent with the over-transmission observed in Figure 3A. For schizophrenia, we saw no significant effect of the probands’ PGSSCZ (p=0.089) in the trio model, whereas the mothers’ PGSSCZ was significant (p=8.6x10-3). Thus, in summary, there is evidence for direct genetic effects of the polygenic score for rare neurodevelopmental conditions, but not for polygenic scores for related traits.
Exploring the role of prenatal factors
We explored whether prenatal factors might mediate the effects of non-transmitted parental alleles on risk of neurodevelopmental conditions (Supplementary Note 5). Preterm delivery (i.e. giving birth prematurely)55, which is a risk factor for neurodevelopmental conditions in the offspring45–47, showed significant genetic correlations with lower educational attainment (rg=- 0.30 [-0.39, -0.21], p=2.3x10-10), mirroring the epidemiological association56, and with neurodevelopmental conditions (rg=0.58 [0.18, 0.97], p=0.004) (Extended Data Figure 8A, Supplementary Table 9). Premature birth was also associated with lower PGSEA in DDD (Extended Data Figure 8B). In theory, the genetic correlation between educational attainment and premature delivery could reflect a causal effect of lower educational attainment on premature birth, and/or a causal effect of premature birth on lower educational attainment. Using Mendelian randomization, we found some evidence that lower educational attainment causally increases the risk of giving birth prematurely and of neurodevelopmental conditions (p=1.5x10-5 and p=6.5x10-19 respectively, using the inverse variance-weighted method; Extended Data Figure 9; Supplementary Note 5). The fact that the neurodevelopmental conditions of the kind studied in this paper are, by definition, childhood-onset, implies that individuals’ own educational attainment is unlikely to causally influence their risk of developing a condition; instead, our finding of a causal effect of educational attainment on these conditions is more likely to reflect a causal effect of parents’ educational attainment on their children’s risk, consistent with the presence of indirect genetic effects. However, we did not find significant evidence that prematurity explained the association between neurodevelopmental conditions and non-transmitted common variants in the parents that are associated with educational attainment (Supplementary Note 5; Supplementary Figure 5).
Parental assortment obscures the true nature of common variant effects
Another factor that may contribute to the significant correlation between non-transmitted alleles in parents and neurodevelopmental conditions in their children is parental assortment, the phenomenon whereby people are more likely to choose partners with similar traits to themselves. Parental assortment is known to be particularly strong for educational attainment and cognitive ability, with estimates of phenotypic correlation between spouses ranging from 0.25 to 0.657–64. It is also observed for psychiatric conditions63,65–67, including in parents of autistic individuals and of individuals with neurodevelopmental conditions due to the 16p12.1 deletion 68. One consequence of parental assortment is that it induces a correlation between alleles that act in the same direction on a trait, both between parents and, in their descendents, within and between loci57. Thus, parental assortment on cognitive ability or correlated traits (e.g. educational attainment) would be expected to lead to individuals with inherited rare variants associated with reduced cognitive ability8,9,12,69 also having a polygenic background of common variants associated with reduced cognitive ability57,68. In the context of our polygenic score analyses in Figure 4, in the proband-only model, the proband’s polygenic score would statistically capture (‘tag’) the correlated effects of these rare variants (which causally impact neurodevelopmental conditions69,70). However, in the trio model, the proband’s polygenic score would no longer be correlated with the rare variant component after conditioning on the parents’ polygenic scores, because the rare and common variant components segregate approximately independently within-family; instead, this correlation with the rare variant component would be reflected by the non-transmitted coefficients on the parents’ polygenic scores53.
To explore this potential genetic consequence of parental assortment in our cohorts, we tested whether the common and rare variant components contributing risk of neurodevelopmental conditions are indeed correlated. From the sequencing data in DDD and GEL, we extracted rare (MAF<1x10-4) protein-truncating variants (PTVs) and damaging missense variants in genes intolerant of loss-of-function (LoF) variation (“constrained genes”), which are associated with reduced cognitive ability9 and risk of neurodevelopmental conditions69,70. Consistent with the effects of parental assortment, amongst unaffected parents of probands with neurodevelopmental conditions, we observed that the number of rare damaging coding variants in constrained genes (the “rare variant burden score”, RVBS) in one parent was significantly negatively correlated with the other parent’s PGSEA (r=-0.065, p=5.5x10-9), PGSCP (r=-0.036, p=1.4x10-3) and PGSNonCogEA (r=-0.046, p=4.3x10-5) (Figure 5), after correcting for genetic principal components. As expected, a similar correlation was seen within the probands themselves (Figure 5), regardless of whether including all probands, undiagnosed probands, or probands with de novo diagnoses, and if restricting RVBS to haploinsufficient genes associated with developmental disorders (Supplementary Figure 6). We also saw a similar result amongst control children from the Millennium Cohort Study, including after applying weights to adjust for non-random sampling and attrition in that cohort, indicating that this correlation is not only observed in patients with neurodevelopmental conditions (Supplementary Figure 7). We saw no significant correlation between any of the polygenic scores and the burden of rare synonymous variants in constrained genes or dominant genes associated with developmental conditions (Figure 5, Supplementary Figure 6), confirming that the result observed for deleterious variants is unlikely to be due to population structure artifacts. The correlations between polygenic scores and rare damaging variants may explain why we saw very limited evidence that these polygenic scores modify the penetrance of such variants in families with neurodevelopmental conditions (Supplementary Note 6, Supplementary Figure 8).
To explore whether the correlation between common and rare variants associated with neurodevelopmental conditions could be driving the association between non-transmitted common alleles and children’s risk shown in Figure 4, we extended the trio model to control for the probands’, mothers’ and fathers’ rare variant burden scores as well as polygenic scores (Extended Data Figure 10). This did not change our original conclusion from the trio regression, namely that the risk of neurodevelopmental conditions is correlated with non-transmitted common alleles in the parents that are associated with cognitive performance, educational attainment and the non-cognitive component thereof, but not with the transmitted common alleles. However, we cannot rule out that this association with non-transmitted common alleles is primarily driven by the assortment-induced correlation between common and rare variants, since the rare variant burden score we have used likely only captures a small proportion of the total rare variant component (just as the polygenic score only captures a small fraction of SNP heritability).
In summary, we find that parents’ non-transmitted alleles at common variants ascertained for their association with educational attainment and cognitive performance are correlated with their children’s risk of neurodevelopmental conditions, but we do not see evidence for direct genetic effects from transmitted alleles. Further work is needed to confirm whether this association with the non-transmitted alleles is due to true indirect genetic effects and/or parental assortment.
Discussion
Here we combined two large cohorts of patients with rare neurodevelopmental conditions to explore the contribution of common variants to risk. After first demonstrating that polygenic scores for neurodevelopmental conditions and several related traits were significantly associated with case/control status within both DDD and GEL (Supplementary Table 2), we conducted a GWAS meta-analysis of patients with neurodevelopmental conditions from the two cohorts and revealed significant genetic correlations with several psychiatric conditions which had not been previously reported6 (Figure 1A). Conditional genetic correlations (Figure 1B) show that several of these (e.g. schizophrenia, Tourette’s) are not simply driven by the component of polygenic risk for neurodevelopmental conditions that is shared with educational attainment. This suggests that these mental health conditions share underlying biology with neurodevelopmental conditions that is independent of that captured by effects of common variants on educational attainment, although we acknowledge that estimates of genetic correlations can be biased by cross-trait parental assortment and other confounding factors71.
We showed that polygenic scores for several traits that are genetically correlated with neurodevelopmental conditions were significantly associated with having a monogenic diagnosis, with the strongest effect observed for educational attainment (Figure 2A). Our previous work had found no such difference in polygenic background between diagnosed and undiagnosed probands in DDD 6, and it is likely that power has been improved here by our larger sample size and better definition of which probands truly have a monogenic diagnosis 5,27. Our result is consistent with a liability threshold model for rare neurodevelopmental conditions; children without a large-effect monogenic variant may require higher polygenic load (or a major environmental contribution such as a teratogenic infection e.g. Zika virus) to move their phenotype over the threshold required to be clinically diagnosed with a neurodevelopmental condition (Extended Data Figure 1). Perhaps important for consideration in clinical settings, we find probands with more affected first-degree relatives had both a lower PGSEA and a lower chance of getting a monogenic diagnosis in DDD (Extended Data Figure 6), emphasizing that if there are multiple first-degree relatives with neurodevelopmental conditions in a family, this may not necessarily be due to a monogenic cause. Our observation that diagnosed patients with affected parents (most of whom have inherited dominant diagnoses), and their parents, have lower average PGSEA than those with unaffected parents (Extended Data Figure 4) is consistent with the effects of parental assortment (Figure 5).
Since most parents of the patients we studied are annotated as clinically unaffected, we hypothesized that they might be over-transmitting polygenic risk to their affected offspring. We saw nominally significant over-transmission of PGSNDC,DDD from unaffected parents to undiagnosed probands, but saw no significant transmission distortion for PGSEA or PGSCP (Figure 3A), despite these polygenic scores explaining much more variance in risk than PGSNDC,DDD (Supplementary Table 2). Consistent with this, in a two-generation model (Figure 4), we found evidence for a direct genetic effect of PGSNDC,DDD on risk of neurodevelopmental conditions, but no evidence for direct genetic effects of the other polygenic scores tested. Instead, we observed that the parents’ PGSEA, PGSCP and PGSNonCogEA were significantly associated with their children’s risk even after controlling for the children’s PGS, indicating a correlation between non-transmitted alleles and the children’s phenotype. This may be due to indirect genetic effects and/or the consequences of parental assortment.
Previous papers have shown that non-transmitted alleles in the parents are associated with children’s educational attainment and school grades, explaining a third to a half of the overall association between educational outcomes and PGSEA that is seen in population-based samples29,33,34,51. However, the interpretation of this finding is still a matter of debate, since most of these papers use models that can give spurious or inflated indirect genetic effect estimates due to population stratification and/or parental assortment51,53,72. Parental assortment induces a correlation between the polygenic score associated with the trait under assortment and the remaining genetic component of the phenotype with which the polygenic score would be uncorrelated under random assortment. This includes the component due to rare variants, which could have a much stronger effect on risk of neurodevelopmental conditions than the common variant component. We demonstrated (to our knowledge, for the first time) a correlation between the rare and common variant components affecting cognitive and educational outcomes, both between parents and within both offspring and parents (Figure 5 and Supplementary Figure 7). This supports the hypothesis that the association of PGSEA with lower risk of neurodevelopmental conditions is at least partly due to the assortment-induced correlation of PGSEA with rare variants affecting both neurodevelopmental conditions and educational attainment. Although these observed correlations are small in magnitude (|r|<0.1), it is likely that the correlation between the total common and rare variant components of educational outcomes and neurodevelopmental conditions is substantially higher than this53, since only small fractions of these components are likely to be captured by the polygenic scores and our rare variant burden score, respectively. Very large whole-genome sequenced datasets will be required to better characterize the total rare variant component of these traits and estimate this correlation more accurately.
With the current study design, we were unable to demonstrate the presence of indirect genetic effects on risk of neurodevelopmental conditions unambiguously, and nor could we test whether, if present, these are mediated by parenting behaviors. However, we did explore whether common genetic variants might influence risk by affecting prenatal risk factors (a form of indirect genetic effects). We found that educational attainment showed a significant negative genetic correlation with preterm delivery, whereas neurodevelopmental conditions showed a significant positive genetic correlation with it even after conditioning on educational attainment with GenomicSEM (rg=0.47) (Extended Data Figure 8A). This is consistent with epidemiological studies that found an association between prematurity and poorer neurocognitive outcomes even after controlling for socioeconomic confounders45,73–79. We found some evidence from Mendelian randomization that lower educational attainment is causally associated with preterm delivery (Extended Data Figure 9B; Supplementary Note 5); this may be because lower educational attainment is associated with several factors that increase the risk of preterm delivery in the mother (such as a short inter-pregnancy interval80, exposure to tobacco smoke during pregnancy81,82, and pre-eclampsia56). We acknowledge that causal estimates from Mendelian Randomization analyses may be biased when using population-based GWASs, as we have done, so these findings should be considered tentative until confirmed using sufficiently well-powered within-family GWASs32. Although we did not find evidence for a causal effect of prematurity on neurodevelopmental conditions (Extended Data Figure 9C), several factors may have reduced the power of this analysis (Supplementary Note 5). We also saw no significant evidence that prematurity mediates indirect genetic effects of common alleles associated with educational attainment (Supplementary Note 5). However, it may be that our analysis was simply underpowered at this sample size, since we did see some attenuation (albeit not significant) of the non-transmitted coefficients for PGSEA when removing premature probands (Supplementary Figure 5). Nonetheless, our results emphasize how genetics may confound epidemiological associations between risk factors and neurodevelopmental conditions83,84, and also suggest that studies seeking to characterize the nature of indirect genetic effects on educational outcomes should consider the contribution of prenatal factors.
Our study has several limitations. Firstly, the overall variance in risk of neurodevelopmental conditions explained by common variants is low (∼10%) and the polygenic scores tested here explain only a fraction of this. Having said that, these polygenic scores are statistically significant predictors of neurodevelopmental conditions (Supplementary Table 2) and are likely to explain more variance as GWAS sample sizes grow. Secondly, the reported significance of detected PGS effects does not simply reflect the strength of the real associations, but also the power of the original GWAS from which SNP effect sizes were derived. Thus, one must be cautious when comparing effects between polygenic scores for different traits. Thirdly, the phenotypic heterogeneity of the cohorts likely limits our power and may confound results. For example, missed diagnoses of autism amongst DDD and GEL participants with neurodevelopmental conditions (perhaps due to the young average age; Supplementary Note 1) could be confounding our result of there being no apparent under-transmission of PGSEA (Figure 3A; Supplementary Figure 2), since PGSEA may be over-transmitted to autistic individuals26,50 but under-transmitted to patients with intellectual disability who are not autistic. Fourthly, the fact that probands in trios tend to have higher polygenic scores for educational attainment than those not in trios (Extended Data Figure 5B) suggests that the trio probands are a non-random sample, which could potentially induce biases in trio-based analyses; for example, the undiagnosed trio probands may be enriched for monogenic causes in as-yet-undiscovered genes, which could reduce power when assessing over-transmission of polygenic risk. Additionally, many of our analyses are predicated on the assumption that the “unaffected parents” (i.e. those reported by the clinician not to have a similar phenotype to the proband) do not have phenotypes related to neurodevelopmental conditions. It may be that some fraction of them do have (or did have, earlier in life) relevant phenotypic features (e.g. learning difficulty, speech delay), but that these were not detected and recorded by the clinicians. The inclusion of these parents could be reducing power or confounding results in several analyses. Another caveat is that our estimates of effect size when comparing to controls are sensitive to the choice of control cohort, likely reflecting differences in educational-related ascertainment bias between them 85 (Supplementary Note 3; Extended Data Figure 5). Despite this, sensitivity analyses suggest that our main conclusions are robust and not driven by ascertainment bias of a particular control cohort (Supplementary Note 3; Extended Data Figure 4; Supplementary Figure 4; Supplementary Table 6). Finally, the correlation between the rare and common variant components of neurodevelopmental conditions (Figure 5), which is likely due to parental assortment, may have confounded several of these analyses.
In future, as GWAS discovery cohorts for both rare neurodevelopmental conditions and related traits increase in size, we will have more power to explore these common variant effects on risk, penetrance and phenotypic expressivity of these conditions. These studies should seek to confirm whether there really are no direct genetic effects of common variants influencing educational attainment and cognitive performance on risk of neurodevelopmental conditions, or whether these are just small. To disentangle the contribution of indirect genetic effects and parental assortment to common variant associations with neurodevelopmental conditions, future studies will need to use extended genealogies and/or more sophisticated modeling of the influence of parental assortment on common and rare variants than is currently possible 51,53,72. If these studies also had measures of epidemiological and prenatal risk factors such as prematurity, and of parental phenotypes and nurturing behaviors, one could explore how indirect genetic effects (if present) are mediated, which has potential implications for assessing the modifiability of risk. Finally, it will be important for future studies to explore the role of polygenic background in neurodevelopmental conditions in families with non-European genetic ancestries.
Author contributions
QQH and EMW conducted most of the analyses, with the remainder being conducted by PC and DSM. QQH and EMW carried out data preparation and quality control, with assistance from KES, VKC, PD, SL, TM, MKM, SA and DB. ES, CFW and HVF helped supervise the DDD study, together with MEH. EJR, VW, ASY and MEH provided key intellectual input. HCM supervised the analyses and directed the study. QQH, EMW, and HCM wrote the first draft of the manuscript, with input from PC, DSM, JCB, VW, ASY and MEH. All authors read and commented on the final manuscript.
Online Methods
Cohort Descriptions and phenotypes
Deciphering Developmental Disorders (DDD)
The aim of the DDD study is to find molecular diagnoses for families and patients affected by previously genetically undiagnosed, severe developmental conditions. Recruitment was conducted from 2011 to 2015 across twenty-four clinical genetics services in the United Kingdom (UK) and Ireland86. The DDD study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South Research Ethics Committee and GEN/284/12, granted by the Republic of Ireland Research Ethics Committee). The clinical inclusion criteria included neurodevelopmental conditions, congenital, growth or behavioral abnormalities and dysmorphic features. Probands were systematically phenotyped via DECIPHER87 using Human Phenotype Ontology (HPO)88 terms and a bespoke online questionnaire that collected information on developmental milestones, growth measurements, number of affected relatives, prematurity, maternal diabetes, and other clinically-relevant parameters. The cohort has been described extensively5,70,86,89.
We focused on probands in the DDD cohort who had neurodevelopmental conditions (NDCs), which were defined previously by Niemi et al.6. Briefly, these were probands who had at least one of the following neurodevelopmental HPO terms or their descendant terms: abnormality of higher mental function (HP:0011446), neurodevelopmental abnormality (HP:0012759), abnormality of the nervous system morphology (HP:0012639), behavioural abnormality (HP:0000708), seizures (HP:0001250), encephalopathy (HP:001298), abnormal synaptic transmission (HP:0012535), or abnormal nervous system electrophysiology (HP:0001311).
Genomics England (GEL) 100,000 Genomes Project
The 100,000 Genomes project is an initiative by the UK Department of Health and Social Care to whole-genome sequence individuals with rare conditions and cancer in the National Health Service90,91. The 100,000 Genomes project was approved by the East of England—Cambridge Central Research Ethics Committee (REF 20/EE/0035). The rare disease branch of the project consists of sequencing data from ∼72,000 patients with rare conditions and their relatives, in ∼34,000 families with a variety of structures. There are over 190 rare conditions represented in the cohort, and about 23% of the patients have NDCs. The cohort was sequenced at around 35x coverage, and variant calling and quality control (QC) were performed by Genomics England91,92.
GEL NDC patients were defined as those recruited under the “Neurodevelopmental disorders” disease sub-category, or with more than one HPO term that was a descendent of “Neurodevelopmental Abnormality” (HP:0012759). We removed probands whose age of onset was >16 years or who had neurodegenerative conditions.
The set of unrelated GEL controls included cancer patients over 30 years old (N=10,469) and unaffected relatives (N=3,198) of probands with rare conditions who were not in the NDC set and did not have phenotypes similar to probands from DDD (“DDD-like”). The “DDD-like” probands were defined as those who:
1) were recruited into a disease model which was also used to recruit probands who had previously been recruited into DDD (see section below on identifying probands overlapping between the two cohorts), or
2) had one the top five HPO terms used in DDD and their descendants, namely HP:0000729 (autistic behaviour), HP:0001250 (seizure), HP:0000252 (microcephaly), HP:0000750 (delayed speech and language development) and HP:0001263 (global developmental delay).
Probands recruited into the neurodegenerative disorders subcategory or with an age of onset >16 years were removed from the DDD-like set, as were probands recruited into a disease subcategory for which the average age of probands was >16 years.
To define relatedness, we used a file generated by GEL consisting of a pairwise kinship matrix produced using the PLINK293,94 implementation of the KING robust algorithm95 and a --king-cutoff of 0.0442 .
Control cohorts
The UK Household Longitudinal Study (UKHLS) cohort consists of a continuation of the British Household Panel Survey (BHPS) of individuals living in the UK96,97. The Avon Longitudinal Study of Parents and Children (ALSPAC) is a birth cohort study of children born in Avon, England with expected dates of delivery between 1st April 1991 and 31st December 199298. Eligible pregnant women (N=13,761) were recruited and their children have been phenotyped extensively over the last 30 years. Ethical approval for the study was obtained from the ALSPAC Ethics and Law Committee and the Local Research Ethics Committees. Please note that the study website (http://www.bristol.ac.uk/alspac/researchers/our-data/) contains details of all the data that is available through a fully searchable data dictionary and variable search tool. The Millennium Cohort Study (MCS) is a birth cohort study of children born across the UK during 2000 and 2001 from 18,552 families 99,100. Further information about recruitment of these cohorts is given in Supplementary Note 3.
Preparation of genetic data
Individuals from DDD, UKHLS, ALSPAC, and MCS were genotyped on various arrays, whereas GEL individuals were whole-genome sequenced. The available data are summarized here briefly:
● A subset of the DDD cohort (all children and several thousand parents) was genotyped on three genotype array chips: the Illumina HumanCoreExome chip (CoreExome), the Illumina OmniChipExpress (OmniChip), and the Illumina Infinium Global Screening Array (GSA). Some probands were genotyped on more than one chip, as shown in Supplementary Figure 9. In downstream analysis, we used the CoreExome and OmniChip data for analyses of probands, and the GSA and OmniChip data for analyses of trios. QC of CoreExome (including DDD patients and 9,270 UKHLS controls genotyped on the same chip) and Omnichip data were performed by Niemi et al. 6 and we performed QC in the GSA data specifically for this paper (Supplementary Tables 10 and 11). The DDD cohort was also exome sequenced, and those data were used for the analyses involving rare variants. QC and processing of the exome data are described below in the section “Extracting and annotating rare variants”.
● GEL individuals were whole genome sequenced with 150bp paired-end reads using Illumina HiSeqX. Variant calling and QC were performed by Genomics England. We used 78,195 post-QC germline genomes from the Aggregated Variant Calls (aggV2) prepared by the GEL team. We kept variants that passed the QC filters shown in Supplementary Table 12.
● Data we received from ALSPAC were processed in two batches97. In the first batch, we received post-QC array data for G0 mothers (N=8,884) who were genotyped on the Illumina Human 660W chip and G1 children (N=8,932) genotyped on the HumanHap550 quad chip. In the second batch, we received another 2,198 parents (G0 mothers and G0 partners101) who were genotyped on the CoreExome array.
● We received data for 21,181 MCS samples who were genotyped using the GSA array chip102.
We applied standard QC filters in each dataset separately, described further in Supplementary Methods.
Genetically predicted ancestry
The Supplementary Methods provide detailed information on ancestry inference, but we summarize it briefly here. The identification of GBR-ancestry samples from the DDD CoreExome and OmniChip data was described previously6. To identify individuals of genetically inferred GBR ancestry in DDD GSA samples, we first projected post-QC samples onto 1,000 Genomes phase 3 individuals36 (Supplementary Figure 10). We then performed another principal component analysis (PCA) within the loosely defined European ancestry subset and identified a homogeneous subgroup (Supplementary Figure 11) using Uniform Manifold Approximation and Projection (UMAP)103. Since we merged parent-offspring trios genotyped on GSA and Omnichip array chips in downstream analysis, we kept GSA individuals who were similar to Omnichip individuals in terms of genetic ancestry in PCA space (Supplementary Figure 12). In GEL, we used individuals with genetically inferred European ancestry, which were identified by the GEL bioinformatics team. We further restricted to a homogeneous subset (N=56,249) that represents white British individuals (Supplementary Figure 13). Array data received from the ALSPAC all had genetically predicted European ancestry, so we did not perform any filtering based on genetic ancestry. We performed similar PCA and UMAP clustering to identify GBR-ancestry individuals in MCS (Supplementary Figure 14; Supplementary Figure 15), and further filtered to individuals who self-reported as being of White ethnicity.
Identifying and removing relatives within and across cohorts
Within each dataset, we identified up to third-degree relatives (kinship coefficient > 0.0442 by KING v2.2.495) using post-QC genotyped array data or WGS data. We always used a subset of unrelated individuals (i.e. more distant than third-degree relatives) in downstream analysis. In analyses using trios, we made sure probands in trios were unrelated and parents were unrelated with parents from other families.
In analyses combining DDD and GEL, we removed from GEL any participants who were also recruited into DDD and or who were related to DDD participants, and also removed Scottish samples from DDD since we were unable to check whether GEL samples were related to them (Supplementary Methods). We removed individuals from the two birth cohorts who were related to each other or to DDD participants, which left 1,434 and 2,498 trios from ALSPAC and MCS, respectively (Supplementary Methods).
Imputation and post-imputation QC
Imputation of array data was performed in each genotyped cohort separately using the maximum number of variants available after QC. Prior to imputation, we removed palindromic SNPs, SNPs that were not in the imputation reference panel, and SNPs with mismatched alleles. DDD samples and UKHLS controls who were genotyped on the CoreExome array were imputed with the HRC r1.1 reference panel by Niemi et al.6. DDD GSA and Omnichip samples and ALSPAC samples were imputed to the TOPMed r2 reference panel using the TOPMed imputation server, and the MCS samples to the HRC r1.1 reference panel 104–106. We kept well imputed common variants with Minimac4 R2 >0.8 and minor allele frequency (MAF) >1%. For polygenic score analyses, we subsequently restricted to common variants that passed these QC filters in all genotyped cohorts and also passed QC in the GEL WGS data.
Defining patients with versus without monogenic diagnoses
DDD
The DDD study identified clinically relevant rare variants from exome sequencing and chromosome microarray data using a filtering procedure described in Wright et al.86. The procedure focuses on identifying rare damaging variants that fit an appropriate inheritance mode in a set of genes that cause developmental disorders (DDG2P, https://www.deciphergenomics.org/ddd/ddgenes). Variants that pass clinical filtering are uploaded to DECIPHER87, where the patients’ clinicians are asked to classify them as definitely pathogenic, likely pathogenic, uncertain, likely benign or benign. We defined “diagnosed” probands as those with one or more variants either annotated as pathogenic/likely pathogenic in DECIPHER by their referring clinician, or predicted as pathogenic/likely pathogenic using autocoded ACMG diagnoses as described in 5. All remaining probands were classed as “undiagnosed”. Probands with a de novo diagnosis are those with a de novo mutation in a monoallelic or X-linked DDG2P gene that was either annotated or predicted as pathogenic/likely pathogenic.
GEL
The probands assigned diagnostic status were those included in the Genomic Medicine Service exit questionnaire, in which a clinician evaluated the pathogenicity of variants of interest identified through GEL’s custom pipeline. We defined “diagnosed” probands as those that had a pathogenic or likely pathogenic variant that is annotated as partially or fully explaining their phenotype in this exit questionnaire. Probands with a de novo diagnosis are those whose pathogenic/likely pathogenic variants from the exit questionnaire were annotated as de novo protein truncating or missense variants in DDG2P monoallelic or X-linked genes. We defined “undiagnosed” probands as those that were present in the exit questionnaire but not annotated as having a pathogenic or likely pathogenic variant and not annotated as “yes” or “partially” in the “case_solved_family” column. We further removed from this undiagnosed set any probands who have potential diagnoses in the Diagnostic Discovery data in GEL, which is a list of varian ts submitted by researchers that are thought likely to be pathogenic by the GEL clinical team.
Extraction and quality control on rare variants
Quality control of DDD exome sequencing data and extraction of rare single nucleotide variants (SNVs), and insertion and deletions (indels) is summarized in Supplementary Table 13. Indels in the same gene and sample were removed (4% of indels with MAF < 1%), since these were often part of complex mutational events that would require haplotype-aware annotation.
For GEL, details of the QC of SNVs and indels in the WGS data are provided by the GEL team91,92 and variant QC is summarized in Supplementary Table 12. We use a custom python script to extract rare variants from GEL aggregated WGS variant call format files (aggV2). We filtered genotypes to those with genotype quality (GQ) ≥ 20 and read depth (DP) ≥ 10. We removed heterozygous genotypes that did not pass a binomial test of balanced REF and ALT alleles (p < 1x10-3) or for which ALT/(REF+ALT) (AB ratio) was not between 0.2 and 0.8. We further removed variants with missing high quality genotypes in more than 5% of all samples in aggV2. We removed indels in the same gene and sample for the same reason described above for DDD.
For MCS, details of the QC of exome sequencing data are in Supplementary Methods.
Defining trio sample sets in DDD and GEL
The procedure used for filtering trios used in DDD and GEL is shown in Supplementary Figure 16. Briefly, in DDD, we combined data across GSA and OmniChip arrays and kept trios in which all three members had GBR ancestry and the proband had an NDC. We excluded trios recruited from Scottish centres and kept unrelated trios. We then split trios into those with both parents unaffected and those with one or both parents affected. These were then categorized as genetically diagnosed or undiagnosed. We applied similar filtering in GEL trios. See Supplementary Methodsfor more information.
GWAS of neurodevelopmental conditions in GEL and meta-analysis with DDD
We used PLINK v1.9 to conduct a GWAS comparing individuals with NDCs (N=3,618) to controls (N=13,667) in GEL, controlling for 20 genetic principal components (PCs), age, and sex. Prior to running the GWAS, we removed variants with MAF < 1%, missingness > 2% or Hardy-Weinberg equilibrium (HWE) P-value < 1x10-5, and performed a differential missingness test between the NDC patients and controls and removed variants with p-value < 1x10-5. We repeated the GWAS comparing DDD patients with neurodevelopmental conditions on the CoreExome array (N=6,397) to UKHLS controls (N=9,270) using PLINK v1.9, after excluding DDD patients recruited from Scottish centres.
We used METAL107 to conduct an inverse variance-weighted GWAS meta-analysis between the DDD-UKHLS and GEL NDC GWASs. We removed palindromic SNPs with MAF > 0.4 since the strand could not be easily inferred using MAF. We also excluded SNPs with discordant AF (difference > 0.05) between the two cohorts. This left 5,451,801 overlapping SNPs in the meta-analysis.
Heritability and genetic correlations
We used Linkage Disequilibrium Score Regression (LDSC)108 to estimate SNP heritability using summary statistics from the GWAS of NDCs in DDD, in GEL, and a meta-analysis of the two cohorts. We used ∼1 million common SNPs from HapMap3 with precomputed LD scores. SNP heritability on the liability scale was estimated assuming a cumulative population prevalence of 1% for rare neurodevelopmental conditions6. We used the effective sample size (4/(1/Ncases + 1/Ncontrols)) or the sum of two effective sample sizes for the meta-analysis and a sample prevalence of 50% in LDSC, as recommended previously109. In addition, we also applied two methods to estimate SNP heritability using individual-level data in DDD and GEL separately. We performed GREML-LDMS110 stratified by LD (two bins of equal size) and MAF (three bins: 1%–5%, 5%–10%, >10%). We also ran phenotype correlation–genotype correlation (PCGC) regression111, using the LDAK-Thin Model to compute the kinship matrix using the direct method. We corrected for sex, and ten genetic principal components as covariates in both methods. We then meta-analyzed the SNP heritability estimates from DDD and GEL using an inverse-variance weighted method.
We used LDSC to estimate genetic correlations between the DDD NDC GWAS or the meta-analyzed NDC GWAS and various brain-related traits and conditions listed in Supplementary Table 14. We did not use the GEL NDC GWAS to calculate genetic correlations as the SNP heritability was not significantly different from zero according to LDSC.
Estimating conditional genetic correlations with GenomicSEM
To estimate the genetic correlations between various traits/conditions (Supplementary Table 14) and NDCs independent of cognitive performance or educational attainment signals, we used genomic structural equation modelling (GenomicSEM)42,44. We estimated the genetic correlation between the target trait and a latent variable representing the non-cognitive component of NDCs, which was genetic influences on NDCs that were not explained by cognitive skills. We applied the GenomicSEM model without SNP effects. We also estimated genetic correlation with the “non-educational attainment” latent variable, which represented genetic influences on NDCs that were not accounted for by the educational attainment latent variable.
Calculating polygenic scores
For calculating PGSs, we used the set of SNPs that were well-imputed in all array cohorts (Minimac4 R2 > 0.8), passed QC in GEL aggV2 samples, and had MAF >1% in all cohorts. We used LDPred112 to estimate weights for calculating PGSs and an LD reference panel composed of HapMap3113 common variants based on 5,000 unrelated individuals of white British genetically-inferred ancestry from the UK Biobank114 (Supplementary Methods). GWAS summary statistics for years of schooling (a measure for EA)37, the non-cognitive component of educational attainment (NonCogEA)42, cognitive performance (CP)37, schizophrenia (SCZ)38, and NDCs6 were matched with the list of overlapping SNPs (Supplementary Table 14). PGSNDC,DDD was evaluated in the DDD Omnichip samples and the GEL samples which were not in the DDD GWAS. To make PGSs comparable across cohorts (DDD, GEL, UKHLS, MCS and ALSPAC), we performed a joint PCA across all cohorts and adjusted the raw scores for 20 PCs. For all analyses, residuals were scaled so that the combined set of unrelated control samples from GEL and UKHLS (or GEL controls only for PGSNDC,DDD) had mean = 0 and SD = 1.
Analyses of polygenic scores
Evaluating variance explained by the PGSs
We evaluated how much variance in risk of NDCs was explained by the PGS on the liability scale111,115,116. We compared 6,397 NDC probands from DDD to 9,270 controls from UKHLS, and 3,618 NDC probands from GEL to 13,667 GEL controls defined as described above. We assumed the population prevalence of NDCs to be 1%6.
Comparison of PGS between different subsets of probands and parents
We used two-sided t-tests to compare PGSs between different groups of probands, parents and controls seen in Figure 2A, Figure 3B, Extended Data Figure 4, Extended Data Figure 5, and Supplementary Tables 5, 6 and 7. We report the mean difference in PC-corrected PGS between groups. Groups who were compared with each other include:
● Combined set of controls from GEL and UKHLS
● Control individuals from UK birth cohorts, ALSPAC and MCS
● Undiagnosed NDC probands regardless of trio status
● Diagnosed NDC probands regardless of trio status
● Undiagnosed NDC probands for whom both parents are unaffected
● Unaffected parents of undiagnosed NDC probands
● Undiagnosed NDC probands with one or both parents affected
● Affected parents of undiagnosed NDC probands
● Diagnosed NDC probands for whom both parents are unaffected
● NDC probands with de novo diagnoses for whom both parents are unaffected
● Unaffected parents of diagnosed NDC probands
● Diagnosed NDC probands with one or both parents affected
● Affected parents of diagnosed NDC probands
The sample size of each subset is listed in Supplementary Table 1. We excluded controls from UKHLS as well as DDD CoreExome and GSA probands when testing the DDD-derived NDC PGS (since these had been included in the original NDC GWAS, whereas the individuals genotyped on the Omnichip had not). All the t-tests involving NDC probands were performed in samples from DDD and GEL combined.
We also compared female probands versus male probands without a monogenic diagnosis regardless of trio status (2,427 and 1,574 male probands from DDD and GEL, and 1,426 and 918 female probands from DDD and GEL), and unaffected mothers versus unaffected fathers (1,523 trios from DDD and 1,343 trios from GEL) using two-sided t-tests (Extended Data Figure 7AB).
Associations between PGS and diagnostic status
We compared average PGSs in NDC probands with and without a monogenic diagnosis using two-sided t-tests, combining NDC probands from DDD and GEL regardless of whether they were in a trio or not. We compared NDC subgroups to the combined control set from UKHLS and GEL, as well as to unrelated children from the MCS cohort who were reweighted using available sociodemographic data to make them more representative of the general UK population (Supplementary Note 3).
Within DDD (N=7,549 without excluding Scottish samples or samples who were related to GEL participants), we tested whether the proband’s PGSEA was associated with factors affecting getting a diagnosis in linear regression models:
![Embedded Image](https://www.medrxiv.org/sites/default/files/highwire/medrxiv/early/2024/03/06/2024.03.05.24303772/embed/graphic-6.gif)
We investigated the following binary factors: trio status (N=5,507 with both parents exome sequenced but not necessarily genotyped), proband sex (N=4,421 male probands), whether the proband had any affected first-degree relatives (N=1,623), whether the proband was born preterm (N=1,098 with gestation <37 weeks), whether the mother had diabetes (N=242), and whether the proband had severe intellectual disability or developmental delay (ID/DD; N=941) versus mild or moderate ID/DD (N=1,887). We compared probands with the above mentioned characteristics to all other probands, except when comparing probands with severe versus mild or moderate ID/DD for which we excluded probands without ID/DD or with ID/DD of unknown severity. We also investigated a continuous factor, the degree of consanguinity, quantified by the fraction of the genome in runs of homozygosity (FROH) divided by 0.0625, which is the expected fraction given a first-cousin marriage.
We also tested whether the mother’s or father’s PGSEA was associated with the above factors, in a total of 2,497 samples; we did not test for association with trio status since parental genotype data were only available for full trios anyway.
See the Supplementary Methods for a description of estimation of the odds ratio of diagnosis for different configurations of affected relatives shown in Extended Data Figure 6, and of the mediation analysis to explore whether trio status and prematurity were mediating the association between PGSEA and diagnostic status.
Evaluating over-transmission of PGS: the polygenic transmission disequilibrium test
We conducted polygenic transmission disequilibrium tests (pTDT) in undiagnosed and diagnosed probands from DDD (N=1,523 undiagnosed, 443 diagnosed) and GEL (N=1,343 undiagnosed, 507 diagnosed) combined. We also conducted pTDT in these trios excluding autistic probands.
The pTDT is a two-sided one-sample t-test of the probands’ PGS deviation from expectation, which is their parents’ mean PGS. The pTDT deviation is defined as:
![Embedded Image](https://www.medrxiv.org/sites/default/files/highwire/medrxiv/early/2024/03/06/2024.03.05.24303772/embed/graphic-7.gif)
To evaluate whether the pTDT deviation is significantly different than 0, the pTDT test statistic (𝑡𝑝𝑇𝐷𝑇) is defined as:
![Embedded Image](https://www.medrxiv.org/sites/default/files/highwire/medrxiv/early/2024/03/06/2024.03.05.24303772/embed/graphic-8.gif)
Analyses of non-transmitted coefficients
We evaluated direct genetic effects and effects of non-transmitted common alleles on NDC case status using logistic regression on PGSs:
![Embedded Image](https://www.medrxiv.org/sites/default/files/highwire/medrxiv/early/2024/03/06/2024.03.05.24303772/embed/graphic-9.gif)
where 1𝑁𝐷𝐶 𝑠𝑡𝑎𝑡𝑢𝑠is an indicator variable for whether the individual is an NDC case (1) or control (0).
Since a child’s PGS is calculated using transmitted alleles and the difference between the sum of parents’ PGS and the child’s PGS is equivalent to a PGS derived from non-transmitted alleles, this model can be rewritten as:
![Embedded Image](https://www.medrxiv.org/sites/default/files/highwire/medrxiv/early/2024/03/06/2024.03.05.24303772/embed/graphic-10.gif)
where 𝛽𝑁𝑇 indicates the non-transmitted coefficient and 𝛽𝑇 indicates the coefficient on transmitted alleles. The regression coefficient on child PGS in this trio model represents an unbiased estimate of direct genetic effect (difference between 𝛽𝑇 and 𝛽𝑁𝑇 ).
NDC probands were from DDD and GEL trios where the proband was undiagnosed and both parents were unaffected (N=2,866 trios). Control samples were trios from the two birth cohorts (ALSPAC and MCS, N=1,434 and N=2,498, respectively) as well as trios from GEL where the proband did not have DDD-like developmental disorders or NDCs (N=872).
We verified that the PGSs in the trio model did not exhibit excessive collinearity (see Supplementary Methods).
We performed various sensitivity analyses in the following subsets (Supplementary Figure 4): NDC probands versus controls from GEL trios only, and NDC patients from GEL and DDD versus each of the three control cohorts separately (GEL, MCS or ALSPAC). We also conducted the analysis while controlling for the rare variant burden score (RVBS) in GEL trios (Extended Data Figure 10; see the section below on “Analyses of PGSs and rare protein-coding variants”).
![Embedded Image](https://www.medrxiv.org/sites/default/files/highwire/medrxiv/early/2024/03/06/2024.03.05.24303772/embed/graphic-11.gif)
We restricted this latter analysis to GEL trios to minimize artifactual differences in rare variant calling and QC between cases and controls, which could otherwise create spurious associations.
See the Supplementary Methods for a description of how we modified the running of this trio model to investigate the hypothesis that the effects of non-transmitted alleles associated with educational attainment and cognition might be mediated by prematurity.
Analyses of PGSs and rare protein-coding variants
Sequence data from DDD, GEL, and MCS were annotated with the Variant Effect Predictor (VEP) 117. We kept the ‘worst consequence’ annotation across transcripts. From parents and probands, we extracted autosomal heterozygous protein-truncating variants (transcript ablation, frameshift, splice acceptor, splice donor and stop gained) annotated as high-confidence by LOFTEE118 (HC PTVs), as well as variants in the following classes which we grouped as “missense”: missense, stop lost, start lost, inframe insertion, inframe deletion, and loss-of-function variants annotated as low-confidence by LOFTEE118. We retained rare variants with MAF < 1 x 10-5 in each gnomAD super-population and MAF < 1 x 10-4 in the respective cohorts.
We considered four (non-mutually-exclusive) groups of damaging rare variants:
i) HC PTVs in constrained genes (pLI > 0.9119)
ii) HC PTVs and missense variants (MPC ≥ 2120) in constrained genes (pLI > 0.9)
iii) HC PTVs in monoallelic DDG2P genes with a loss-of-function mechanism (i.e. “absent gene product”)
iv) HC PTVs and missense variants (MPC ≥ 2) in monoallelic DDG2P genes with a loss-of-function mechanism.
We retained probands and parents who were heterozygous for these variants. We required the variants in the children to have been inherited from a parent.
To investigate whether parental assortment leads to correlated rare and common variant burden, we calculated rare variant burden scores (RVBSs) as the number of rare variants in the classes described above, then calculated the Pearson’s correlation coefficients between RVBSs and PGSs using the “cor” function in R. We used trios in which both parents were unaffected in this analysis. RVBSs were corrected for 20 genetic principal components using linear regression models. We then calculated the correlation coefficients between the PC-adjusted RVBSs in parents and the PC-adjusted PGSs in their partners. We also assessed the correlation within the same person amongst either children or parents. We repeated the analysis in subsets of trios where the proband was undiagnosed as well as in trios where the proband had a monogenic de novo diagnosis (Supplementary Figure 6). The main analysis in Figure 5 and the sensitivity analysis in Extended Data Figure 10 is based on group (ii) above, whereas Supplementary Figure 6, 7 and 8 show the results for all four groups of variants. To investigate whether the results were affected by uncorrected population structure, we also calculated RVBSs using rare synonymous variants in either monoallelic DDG2P genes with a loss-of-function mechanism or constrained genes, and assessed their correlation with PGS.
To assess whether PGS modify penetrance of rare inherited variants, we conducted one-sided paired t-tests comparing the PGS between unaffected parents transmitting a damaging variant to their affected offspring who inherited the variant (Supplementary Figure 8). We hypothesized that the unaffected parents would have a more protective polygenic background than their affected offspring (i.e. higher PGSEA, PGSCP, PGSNonCogEA and lower PGSSCZ, PGSNDC,DDD). If more than one parent transmitted a variant to a proband, one parent-child pair was chosen at random from the trio. We used trios where the proband was undiagnosed and both parents were unaffected in this analysis.
Construction and incorporation of weights for MCS
We were concerned that control cohorts might not be random samples of the population with respect to educational attainment, and that this might bias our effect sizes for the difference in PGSs between cases and controls (Supplementary Note 3). We decided to use MCS, for which extensive sociodemographic data are available, to calculate a mean PGS that would be representative of the general population, using inverse-probability weighting. MCS deliberately oversampled minority ethnic and disadvantaged individuals 121 (sampling bias), and they provide sampling weights to account for this. Additionally, missingness in each wave of data collection, including the collection of DNA for genotyping, was nonrandom (non-response bias). To correct for non-response bias, we produced non-response weights per individual using the inverse of the probability of being genotyped estimated from a logistic regression, considering covariates collected at the first study sweep, as previously described121,122 (Supplementary Methods). We fitted the model to predict who was within the sample of unrelated GBR-ancestry children with genotype data (N=5,884 of 6,036 children who had complete data for these covariates), and separately to predict who was within the subset of these that additionally had genotype data on both parents (N=2,445 of 2,498 trio children who had no missingness). To produce weights that account for both sampling bias and non-response bias, we multiplied the non-response weight from regression models by the sampling weights provided by MCS. These weights were then used to calculate adjusted PGSs shown in Extended Data Figure 4 and Extended Data Figure 5C and adjusted correlation between PGS and RVBS shown in Supplementary Figure 7.
Data availability
The raw and post-quality control genotype array data and exome sequence data from DDD are available through European Genome-phenome Archive, under EGAS00001000775.
Whole-genome sequence data and phenotypic data from the 100,000 Genomes project can be accessed by application to Genomics England (https://www.genomicsengland.co.uk/research/academic/join-gecip). GWAS summary statistics of neurodevelopmental conditions generated in this study are available in Supplementary Data. Researchers can apply to access genotype array data from ALSPAC (https://www.bristol.ac.uk/alspac/researchers/access/) and MCS (https://cls.ucl.ac.uk/data-access-training/data-access/). Publicly available GWAS summary statistics can be accessed at various resources: http://www.thessgac.org/data, https://pgc.unc.edu/for-researchers/download-results/, and https://egg-consortium.org/Gestational-duration-2023.html.
Code availability
We used publicly available software: LDpred (https://github.com/bvilhjal/ldpred), LDSC (https://github.com/bulik/ldsc), GCTA-LDMS (https://yanglab.westlake.edu.cn/software/gcta/#GREMLinWGSorimputeddata), PCGC regression (https://dougspeed.com/pcgc-regression/), GenomicSEM (https://github.com/PerlineDemange/non-cognitive/blob/master/GenomicSEM/Genetic%20correlations/Without%20using%20SNP%20effects/function_rG_woSNP.R), and LHC-MR (https://github.com/LizaDarrous/lhcMR).
Extended Data Figures
Schematic of the liability threshold model for rare neurodevelopmental disorders, illustrating why one might expect patients with a monogenic diagnosis to have less polygenic (common variant) risk than those without a monogenic diagnosis. The normal distribution represents the underlying distribution of liability in the population, which is assumed to be Gaussian. Both genetic and environmental factors of different effects contribute to this total liability. Each panel represents a hypothetical example of one individual, either unaffected (A), affected and diagnosed with a monogenic cause (B), or affected and without a monogenic diagnosis (C). The red line indicates a threshold for being diagnosed with neurodevelopmental conditions. Circles represent different genetic factors, and diamonds represent environmental factors. The size of circles and diamonds represents their impact on disease risk. The undiagnosed patient (C) has more green circles (i.e. risk-increasing common variants) than the patient with a monogenic diagnosis (B), in whom the orange circle (i.e. diagnostic large-effect variant) is sufficient on its own to push the patient over the diagnostic threshold.
Distribution of age at assessment (A) and number of HPO terms (B) in both DDD and GEL probands with neurodevelopmental conditions who have GBR ancestry. The vertical lines indicate the means. A small number of probands in each program were aged over 50 and had more than 30 HPOs, and these have been omitted from the plot due to data sharing restrictions. C) Proportion of probands from each cohort with at least one HPO term within the indicated chapter (black text) or specific phenotype (green text), ordered by the prevalence in DDD. The asterisks indicate results from a logistic regression testing whether there was a significant difference in phenotype prevalence between cohorts after controlling for sex and age (** indicates p-value < 0.05/43; * indicates p-value < 0.05). D) Proportion of probands recruited to both DDD and GEL (N=789) with at least one HPO term within the indicated chapter (black) or specific phenotype (green text) from the phenotype information from each program, ordered by the prevalence in DDD. The same logistic regression was used as in C).
Manhattan plot (A) and quantile-quantile plot (B) of GWAS meta-analysis of neurodevelopmental conditions. We meta-analyzed the GWASs derived from DDD-UKHLS (6,397 cases with neurodevelopmental conditions and 9,270 controls from UKHLS) and GEL (3,618 cases and 13,667 controls). We used overlapping SNPs with MAF >1% in both cohorts. The red line indicates the genome-wide significance threshold (5x10-8).
Average polygenic scores in undiagnosed (red) and diagnosed (blue) probands with neurodevelopmental conditions from DDD and GEL combined, as well as in MCS children reweighted to adjust for sampling bias and non-response bias (yellow). Subsets of probands with neurodevelopmental conditions and their parents from trios are shown in light red (undiagnosed subsets) and light blue (diagnosed subsets). The polygenic scores have been standardized such that the UKHLS+GEL controls have mean = 0 and standard deviation = 1 (except for PGSNDC,DDD for which only GEL controls were used to standardise). Yellow horizontal lines indicate weighted average polygenic scores in MCS children, which should reflect an unbiased estimate for the background population. PGSNDC,DDD was tested in a held-out set of patients in DDD. Error bars show 95% confidence intervals. See also Supplementary Table 5 and 7 for results of statistical tests of differences between groups.
A) Average polygenic score for educational attainment (PGSEA) in different control cohorts and subsets thereof, subsets of probands with neurodevelopmental conditions, and their unaffected parents. B) Comparing average PGSEA in trio probands and probands who did not have genetic data on both parents in ALSPAC, MCS, and affected patients from DDD and GEL. Note that in the case of DDD, “in trios” refers to those who had exome sequence data on both parents (only a subset of which also had genotype array data, since we prioritized genotyping full trios for which the child was undiagnosed), whereas in the rest of the manuscript (except for Figure 2B which uses the same definition as here), “trio proband” refers to those who had genotype data on both parents. C) Average polygenic scores for all five traits in MCS before and after reweighting to adjust for sampling bias and attrition. Note that the PGS are corrected for 20 PCs and then normalized so that a combined set of unrelated controls from UKHLS and GEL have mean = 0 and standard deviation = 1. Error bars show 95% confidence intervals.
Association between different configurations of affected relatives and the child’s PGSEA (left) or average diagnostic rate (right). Left: Average proband PGSEA in subgroups with different configurations of affected relatives based on the number of affected parents, siblings, and more distant relatives. Right: Odds ratio for having a monogenic diagnosis, compared to probands with no affected relatives. See Supplementary Methods for a description of how this was calculated.
A) Comparison of polygenic scores between undiagnosed male and female probands in DDD and GEL combined. We used all undiagnosed probands with neurodevelopmental conditions regardless of trio status in this analysis (N=1,426 females and N=2,427 males in DDD; N=112 females and N=146 males in DDD excluding GWAS samples; N=918 females and N=1,574 males in GEL). A positive difference indicates that female probands have higher PGS than male probands. B) Comparison of polygenic scores between unaffected mothers and fathers of undiagnosed probands from a combined sample of 1,523 trios and 1,343 trios from DDD and GEL, respectively. A positive difference indicates that mothers have higher PGS than fathers. C) pTDT results in undiagnosed female and male probands with unaffected parents (N=586 females and N=937 males in DDD; N=99 females and N=125 males in DDD excluding GWAS samples; N=490 females and N=853 males in GEL). Error bars show 95% confidence intervals. The significant result that passes Bonferroni correction of five tests is highlighted by two asterisks.
Exploring prenatal factors that may influence risk of neurodevelopmental disorders. (A) Genetic correlations between neurodevelopmental conditions and prenatal risk factors, before and after conditioning on educational attainment or cognitive performance. Genetic correlations with our GWAS meta-analysis for neurodevelopmental conditions was estimated using Linkage Disequilibrium Score Regression. Those conditioned on the GWAS summary statistics for educational attainment or cognitive performance were estimated using GenomicSEM. (B) Association between PGSs and prematurity, a risk factor for neurodevelopmental conditions, estimated in DDD. See Supplementary Table 8 for sample sizes. Note that for PGSNDC,DDD, probands who were included in the GWAS were not tested, which left 703 probands, of which 83 were born prematurely. A negative estimate indicates that probands who were born prematurely or their parents had a lower polygenic score. Associations that pass Bonferroni correction for five traits in (A) or five polygenic scores in (B) are indicated by two asterisks and nominally significant results by one asterisk. Error bars show 95% confidence intervals.
Causal effect estimates between educational attainment, neurodevelopmental conditions, and preterm birth from Mendelian randomization. The top panels show bi-directional relationships between educational attainment and giving birth prematurely, and between educational attainment and neurodevelopmental conditions, inferred by the Latent Heritable Confounder-Mendelian randomization method (LHC-MR), which uses all genome-wide SNPs. αX->Y indicates the causal effect of the exposure (X) on the outcome (Y) and αY->X indicates the reverse causal effect. The causal effects of the heritable confounder on the exposure and the outcome are annotated as tX and tY, respectively. The forest plots on the bottom show the causal effects inferred using the standard Mendelian randomization methods. Up to four different methods were used, as indicated in the legend, but not all were used to test each hypothesis, depending on the number of instruments available (see Supplementary Methods). The dots show point estimates and the lines are 95% confidence intervals calculated using standard errors. Estimates that are significant are highlighted with an asterisk and exact p-values are annotated.
Association coefficients of polygenic scores (PGSs) and rare variant burden scores (RVBS) in the ‘proband only’ and ‘trio’ models, from logistic regressions of case/control status within GEL (N=1,343 trios in which the proband with a neurodevelopmental condition is undiagnosed and parents are unaffected and 872 trios without neurodevelopmental conditions). Case/control status was regressed on either the child’s PGS, the child’s PGS and child’s RVBS, all three trio members’ polygenic scores (trio model), or all three trio members’ polygenic scores and RVBSs (trio model+RVBS). The RVBS was defined as the number of rare damaging PTVs and missense variants in constrained genes (requiring these to be inherited in the child), corrected for genetic principle components.
Supplementary Data
Supplementary Data 1. Summary statistics from the GWAS of neurodevelopmental conditions comparing cases to controls within the Genomics England (GEL) 100,000 Genomes Project.
Supplementary Data 2. Summary statistics from the GWAS of neurodevelopmental conditions comparing DDD cases to UKHLS controls, excluding the Scottish samples from DDD.
Supplementary Data 3. Summary statistics from the GWAS meta-analysis of neurodevelopmental conditions combining the DDD and GEL GWASs.
Supplementary Tables
Supplementary Methods
Supplementary Figures
Supplementary Notes
Supplementary Note 1: Phenotypic comparisons of the cohorts
Supplementary Note 2: Genome-wide significant hits from the GWAS meta-analysis of neurodevelopmental conditions
Supplementary Note 3: Potential ascertainment biases in control cohorts and their effects
Supplementary Note 4: Examining sex differences in polygenic risk
Supplementary Note 5: Exploring the role of prenatal risk factors in mediating common variant risk
Supplementary Note 6: Role of PGS in modifying the penetrance of rare variants
Acknowledgements
We are extremely grateful to families for their participation and engagement in the DDD study and 100,000 Genomes projects; without them, this research would not be possible. We also thank their clinicians and our colleagues (including the Sanger Human Genetics Informatics team, particularly Iaroslav Popov and Ruth Eberhardt) who assisted in the generation and processing of data. We are very grateful to Jillian Hastings-Ward, Hannah Podd and Hannah Humphrey from the Participant Panel for the 100,000 Genomes project, Ana Lisa Taylor Tavares from Genomics England, and Sarah Wynn from the patient organization Unique, for their assistance with writing the FAQ. We thank Athanasios Kousathanas and Loukas Moutsianas from Genomics England Bioinformatics Research Services for help with data quality control, Hilary Wong for useful discussions on prematurity, Michel Nivard for advice on use of GenomicSEM, and Angelica Ronald, Naomi Wray, and Nick Martin for helpful discussions.
DDD: The DDD study presents independent research commissioned by the Health Innovation Challenge Fund (grant no. HICF-1009-003). The full acknowledgements can be found at www.ddduk.org/access.html. This study makes use of DECIPHER, which is funded by the Wellcome Trust.
GEL: This research was made possible through access to data in the National Genomic Research Library, which is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The National Genomic Research Library holds data provided by patients and collected by the NHS as part of their care and data collected as part of their participation in research. The National Genomic Research Library is funded by the National Institute for Health Research and NHS England. The Wellcome Trust, Cancer Research UK and the Medical Research Council have also funded research infrastructure.
UK Household Longitudinal Study: We used data from ‘Understanding Society: The UK Household Longitudinal Study’, which is led by the Institute for Social and Economic Research at the University of Essex and funded by the Economic and Social Research Council (grant number ES/M008592/1). The data were collected by NatCen and the genome-wide scan data were analysed by the Wellcome Trust Sanger Institute. Data governance was provided by the METADAC data access committee, funded by ESRC, Wellcome and MRC (grant number MR/N01104X/1).
ALSPAC: We are extremely grateful to all the families who took part in ALSPAC, the midwives for their help in recruiting them, and the whole ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses. The UK Medical Research Council and Wellcome (Grant ref: 217065/Z/19/Z) and the University of Bristol provide core support for ALSPAC. This publication is the work of the authors and Hilary Martin will serve as a guarantor for the contents of this paper. Genome-wide genotyping data was generated by Sample Logistics and Genotyping Facilities at the Wellcome Sanger Institute and LabCorp (Laboratory Corporation of America) using support from 23andMe.
MCS: We are grateful to the Centre for Longitudinal Studies (CLS), UCL Social Research Institute, for the use of these data and to the UK Data Service for making them available. However, neither CLS nor the UK Data Service bear any responsibility for the analysis or interpretation of these data.
This research was funded in part by Wellcome (grant no. 220540/Z/20/A, “Wellcome Sanger Institute Quinquennial Review 2021–2026”). For the purpose of open access, the authors have applied a CC-BY public copyright license to any author accepted manuscript version arising from this submission. DB thanks the University of Cambridge Amgen Scholar Program for support.
Abbreviations
- DDD
- Deciphering Developmental Disorders study
- GEL
- Genomics England 100,000 Genomes Project
- GWAS
- genome-wide association study
- GenomicSEM
- Genomic Structural Equation Modelling
- PGS
- polygenic score
- NDCs
- neurodevelopmental conditions
- EA
- educational attainment
- CP
- cognitive performance
- SCZ
- schizophrenia
- NonCogEA
- the non-cognitive component of educational attainment
- pTDT
- polygenic transmission disequilibrium test
- RBVS
- rare variant burden score
- UKHLS
- United Kingdom Household Longitudinal Study
- ALSPAC
- Avon Longitudinal Study of Parents and Children
- MCS
- Millennium Cohort Study
- QC
- quality control
- MAF
- minor allele frequency
- GSA
- Global Screening Array from Illumina
- SNP
- single nucleotide polymorphism
- HWE
- Hardy-Weinberg Equilibrium.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.
- 12.↵
- 13.↵
- 14.
- 15.↵
- 16.↵
- 17.↵
- 18.
- 19.↵
- 20.↵
- 21.
- 22.↵
- 23.↵
- 24.
- 25.
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.
- 59.
- 60.
- 61.
- 62.
- 63.↵
- 64.↵
- 65.↵
- 66.
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.
- 75.
- 76.
- 77.
- 78.
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.↵
- 104.↵
- 105.
- 106.↵
- 107.↵
- 108.↵
- 109.↵
- 110.↵
- 111.↵
- 112.↵
- 113.↵
- 114.↵
- 115.↵
- 116.↵
- 117.↵
- 118.↵
- 119.↵
- 120.↵
- 121.↵
- 122.↵