ABSTRACT
Individuals with autism spectrum disorder (ASD) or related neurodevelopmental disorders (NDDs) often carry disruptive mutations in genes that are depleted of functional variation in the broader population. We build upon this observation and exome sequencing from 154,842 individuals to explore the allelic diversity of rare protein-coding variation contributing risk for ASD and related NDDs. Using an integrative statistical model, we jointly analyzed rare protein-truncating variants (PTVs), damaging missense variants, and copy number variants (CNVs) derived from exome sequencing of 63,237 individuals from ASD cohorts. We discovered 71 genes associated with ASD at a false discovery rate (FDR) ≤ 0.001, a threshold approximately equivalent to exome-wide significance, and 183 genes at FDR ≤ 0.05. Associations were predominantly driven by de novo PTVs, damaging missense variants, and CNVs: 57.4%, 21.2%, and 8.32% of evidence, respectively. Though fewer in number, CNVs conferred greater relative risk than PTVs, and repeat-mediated de novo CNVs exhibited strong maternal bias in parent-of-origin (e.g., 92.3% of 16p11.2 CNVs), whereas all other CNVs showed a paternal bias. To explore how genes associated with ASD and NDD overlap or differ, we analyzed our ASD cohort alongside a developmental delay (DD) cohort from the deciphering developmental disorders study (DDD; n=91,605 samples). We first reanalyzed the DDD dataset using the same models as the ASD cohorts, then performed joint analyses of both cohorts and identified 373 genes contributing to NDD risk at FDR ≤ 0.001 and 662 NDD risk genes at FDR ≤ 0.05. Of these NDD risk genes, 54 genes (125 genes at FDR ≤ 0.05) were unique to the joint analyses and not significant in either cohort alone. Our results confirm overlap of most ASD and DD risk genes, although many differ significantly in frequency of mutation. Analyses of single-cell transcriptome datasets showed that genes associated predominantly with DD were strongly enriched for earlier neurodevelopmental cell types, whereas genes displaying stronger evidence for association in ASD cohorts were more enriched for maturing neurons. The ASD risk genes were also enriched for genes associated with schizophrenia from a separate rare coding variant analysis of 121,570 individuals, emphasizing that these neuropsychiatric disorders share common pathways to risk.
Background
Autism spectrum disorder (ASD) affects over 1.7% of children in the United States1. Epidemiological studies have repeatedly demonstrated that ASD is highly heritable2, with the majority of risk stemming from common genetic variants, each of small effect, acting additively across the genome3,4. However, in at least 10% of cases, rare and de novo genetic variants confer substantial risk5–9. Exome sequencing provides an efficient method to detect these rare and de novo variants, which has led to the association of numerous genes with ASD through repeated observations of such variants across independent cases10.
Beyond ASD, exome sequencing has enabled the discovery of genes associated with overlapping and distinct genetic architectures across a spectrum of developmental and neuropsychiatric disorders6,10–14. These exome studies have largely focused on analyses of single nucleotide variants (SNVs) and insertion/deletion variants (indels), in particular de novo protein-truncating variants (PTVs) and missense variants, with several studies noting modest enrichment of rare inherited variants as well10,15. The relative enrichment of de novo PTVs in cases varies by ascertainment strategy: burden is greatest in individuals with developmental delay (DD), intellectual disability, or multi-system congenital anomalies; moderate in individuals with ASD or isolated developmental anomalies; and lowest in schizophrenia and other neuropsychiatric disorders10,11,13,14,16,17. Indeed, hundreds of genes have now been discovered across this spectrum of developmental disorders, with associations driven by phenotype severity and cohort size18–20.
Furthermore, it has been well established that ASD and DD cases harbor an excess of very large copy number variants (CNVs) compared to unaffected siblings21–27. While these CNVs of large genomic segments represent an approximately 3.5-fold increase in ASD risk28, their incorporation into genomic studies has long represented a significant technical challenge. Early studies using microarrays were limited to CNVs of hundreds of kilobases to megabases12,21,25,29. Among the most significant findings from these studies were recurrent genomic disorder (GD) loci associated with syndromic phenotypes that arose due to mispairing of long homologous segments, a mechanism known as non-allelic homologous recombination (NAHR)12,25,28,30. Due to their high mutation rate, these reciprocal GD regions collectively represent a significant source of genetic risk for neurodevelopmental disorders (NDDs)12,25,28,30.
What remains less clear is the relative burden of CNVs at the resolution of individual genes or exons in ASD etiology, as studies to date have been limited by the relatively small samples assessed by whole-genome sequencing (WGS) and the high false positive rates for existing exome-based CNV discovery methods31. Prior studies by the Exome Aggregation Consortium (ExAC32), 1000 Genomes Project33–35, and Genome Aggregation Database (gnomAD36) have established size distributions and mutation rates for structural variants (SVs) across human populations, while studies leveraging long-read WGS have provided further insights into the broader swath of human SVs, over 99% of which are not readily detectable by microarray37,38. Here, we demonstrate that significant improvements in computational models to normalize exome sequencing read depth, combined with appropriate quality control filters, enable discovery of even small rare coding CNVs with validation rates comparable to indels. These methods allow us to build upon prior studies22 to now jointly analyze rare coding SNVs, indels, and CNVs at the resolution of individual genes and exons from large-scale ASD datasets.
Rare variant discovery can also be enhanced through the integration of functional data, particularly in cohorts such as ASD that display strong selective pressure and reduced fecundity10. One such measure is the recently developed ‘loss-of-function observed/expected upper bound fraction’ (LOEUF) score39, which is a continuous measure of selective pressure against PTVs in each gene. Similarly, the ‘missense badness, PolyPhen-2, and constraint’ (MPC) score40 is one of several measures of the estimated deleteriousness of missense variation. In this study, we used a Bayesian statistical framework, the Transmission and De Novo Association (TADA) model41, to incorporate these functional annotations and jointly analyze coding SNVs, indels, and CNVs across the largest exome-sequenced ASD and DD cohorts to date, comprising 63,237 individuals from ASD cohorts (20,627 ASD-affected individuals) and 91,605 samples from DD cohorts (31,058 DD-affected individuals). Our analyses identify hundreds of genes associated with these disorders and reveal significant overlap, as well as substantial heterogeneity, in the genes associated with each phenotype and in the cell types in which they are enriched during early neuronal development. Overall, these analyses provide new insights into the contributions of rare coding variation in NDDs, including broad overlap and nuanced distinctions of genetic risk and its influence on specific pathways and developmental trajectories.
RESULTS
Patterns of coding variants in a large ASD exome dataset
We aggregated whole exome sequencing (WES) data across 33 ASD research cohorts, totalling 63,237 individuals: 15,036 affected offspring, 5,492 unaffected offspring, 28,522 parents, as well as 5,591 affected and 8,597 unaffected individuals from case-control studies (one sample exists in our dataset as both mother and affected offspring, Fig. 1a, Supplementary Tables 1, 2, 3, 4.1). Data from 35% of these individuals had not previously been analyzed or published. All accessible DNA sequence reads were aligned to the GRCh38 human reference genome and coding SNVs, indels, and CNVs were identified using GATK haplotype caller42 while a new method, GATK-gCNV43, was used for CNV delineation. Variant counts were consistent across cohorts, with an average of 1.64 (1.66/affected, 1.57/unaffected) de novo SNVs, 0.18 (0.18/affected, 0.16/unaffected) de novo indels, and 0.035 (0.042/affected, 0.014/unaffected) de novo CNVs per offspring.
Consistent with prior studies, we observed more PTVs and damaging missense variants in individuals with ASD compared to unaffected individuals (Fig. 1b-c). Enrichment of de novo and inherited PTVs in ASD cases was greatest in genes with the strongest evidence of selective constraint, represented by low LOEUF scores39 (Supplementary Tables 4.2-4.7). Both de novo and inherited PTVs were enriched in the top three deciles of LOEUF (binomial test; 5,446 total genes; Fig. 1b). We grouped missense variants into three bins based on MPC score, which we termed MisB (MPC ≥ 2) and MisA (2 > MPC ≥ 1), with MPC < 1 for all remaining missense variants. MisB variants and, to a lesser degree, MisA variants were significantly enriched in ASD individuals (Fig. 1c). Our aggregated cohort demonstrated the greatest risk from de novo variation, with modest but significant ASD risk observed in rare case/control (for which de novo status cannot be determined) and inherited PTVs in the top three deciles of constrained genes, as well as in MisB and MisA variants.
Detection and analysis of rare and de novo CNVs in ASD
Previous studies using chromosomal microarrays have established a clear etiological role for large, rare and de novo CNVs in ASD cases compared to their unaffected siblings and the general population21,22,25–27,44–46. Despite their considerable impact on ASD risk, CNV discovery at the resolution of individual exons from exome sequencing has represented a significant technical challenge. To overcome these challenges, we applied and benchmarked a pipeline built around GATK-gCNV, a novel Bayesian read depth-based CNV discovery tool for short-read sequencing data42. Our pipeline includes a PCA-based method for inferring large systematic differences, such as different exome capture kits (Supplementary Fig. 1), and we performed extensive benchmarking of this method from 7,165 samples from ASD quartet families for which matching gold-standard CNV calls from microarray and WGS were available47,48. We measured sensitivity and positive predictive value (PPV) as a function of the number of captured exons from canonical protein-coding transcripts. Our application of GATK-gCNV achieved a sensitivity of 82% and a PPV of 89% for rare CNVs (site frequency < 1%) that spanned 3 or more such exons (Fig. 1d, Supplementary Fig. 2). When we considered de novo CNVs detected by WGS, GATK-gCNV achieved 82% sensitivity and 97.5% PPV for de novo CNV discovery at three-exon or greater resolution (Supplementary Fig. 3).
We applied GATK-gCNV to the 57,294 samples from the ASD cohort for which raw sequencing data were available (Supplementary Methods, Supplementary Table 3), including 13,694 affected and 5,007 unaffected offspring with data from both parents and 608 cases and 11,312 controls from non-family based cohorts. We focused our analysis on high quality CNVs spanning at least three captured protein-coding exons (Supplementary Methods), resulting in 16,889 rare (frequency < 1%) and 650 de novo autosomal CNVs, corresponding to an average of 0.90 (0.91/affected, 0.89/unaffected) rare and 0.035 (0.042/ affected, 0.014/unaffected) de novo CNVs per offspring.
Consistent with de novo coding CNVs mediating substantial risk for ASD, we observed that 3.93% of ASD cases and 1.40% of unaffected siblings harbored at least one de novo coding CNV, respectively (odds ratio [OR]: 2.88, p=9.6 × 10−21, Fisher’s exact). Splitting the CNVs by dosage (deletions versus duplications) and assigning each event to a LOEUF decile based on the most constrained gene within its boundaries, we observed that de novo deletions with at least one PTV intolerant gene showed the greatest degree of enrichment in ASD cases across all variant classes evaluated (OR: 9.10, p=2.81×10−20, binomial test), with a relative difference approximately three-fold higher than that observed for de novo PTVs in the same constraint decile (p=2.9×10−4, simulation test). Enrichment of deletions was also observed in case/ control but not in rare inherited analyses. A similar but less significant pattern was observed for duplications across constraint deciles and modes of inheritance (Fig. 1e-f). Additionally, duplications with breakpoints residing within a gene have been suggested to harbor risk differently than duplications of an entire gene, likely through a loss of function mechanism28,36. Leveraging our de novo duplication data, we found that partially duplicated genes in cases harbored more de novo PTV evidence compared to partially duplicated genes in controls (1.3 fold, p=1.7×10−3, binomial test).
To characterize the impact of exon and gene-level resolution in CNV discovery, we first considered 75 large genomic segments (Supplementary Table 4.8) associated with NDDs from curated literature28,49–54. Of the 650 de novo CNVs discovered from WES here, 223 (34.3%) matched one of these loci with at least 50% reciprocal overlap, including 109 de novo GD deletions in cases compared to 6 in controls (OR: 6.7, p=3.9×10−9, Fisher’s exact) and 100 de novo GD duplications in cases compared to 8 in controls (OR: 4.6, p=5.4×10−7, Fisher’s exact, Fig. 2b). Considering rare inherited CNVs, we observed a non-significant trend towards enrichment of GD CNVs in cases (OR:1.3, p=0.07, Fisher’s exact). Excluding the GD loci, the remaining 427 de novo CNVs were also enriched in cases but with more modest effect sizes (de novo non-GD deletion: OR = 2.8, p=1.5×10−8, Fisher’s exact; de novo non-GD duplications: OR = 1.8, p=5.1×10−3, Fisher’s exact). A greater degree of enrichment was observed for non-GD deletions that overlapped a constrained gene (OR: 5.10, p=9.7×10−11, Fisher’s exact, Fig. 2c) than for those that did not (OR: 1.32, p=0.32, Fisher’s exact, Fig. 2c) or for de novo PTVs in constrained genes (OR: 2.74, p=8.5×10−34, Fisher’s exact, Fig. 2c). Effect sizes for de novo duplications were consistently smaller, although similar patterns were observed in relation to GD loci and constrained genes (Fig. 2c). Finally, when we considered de novo, non-GD CNVs that altered one of 71 genes associated with ASD risk (FDR ≤ 0.001, see below), we observed 46 variants in cases (29 deletions, 17 duplications) compared to none in siblings (p=8.5×10−7, Fisher’s exact). Considering the list of 373 genes implicated in both ASD and DD risk (see below) this increased to 130 de novo CNVs in cases and one in siblings (OR: 48.0, p=1.13×10−16, Fisher’s exact).
Consistent with a female protective effect in ASD, which proposes that a higher burden of risk factors is required for an ASD diagnosis in females27,55, the burden of de novo CNVs overall was higher in female cases than male cases (5.94% vs. 3.47%, OR:1.76, p=3.4×10−8, Fisher’s exact). This effect was more pronounced for CNVs than PTVs (p=2.6×10−2, logistic regression) or missense variants (p=2.0×10−5, logistic regression, Fig. 2d). While de novo SNVs and indels frequently arise on the paternal allele9,44,56, a maternal allele bias has been observed from de novo CNVs in ASD57. Using parental SNP data, we estimated the allelic parent-of-origin for 330 de novo CNVs (Supplementary Methods) and observed no overall bias (Fig. 2e, 130 maternal vs. 144 paternal, p=0.43, binomial test). However, restricting analyses to de novo CNVs at NAHR-mediated GD loci, we recapitulated prior findings with 70% arising on the maternal allele (70 maternal vs. 30 paternal, p = 7.9×10−5, binomial test). Including 20 additional samples from the Simons Searchlight project58, we find that 94.3% of 16p11.2 CNVs occur on the maternal allele (Fig. 2e, 33 maternal vs. 2 paternal, p=3.7×10−8, binomial test). In contrast, CNVs at non-NAHR-mediated GD loci showed a 65.5% paternal bias (Fig. 2e, 60 maternal vs. 114 paternal, p=5.2×10−5, binomial test), consistent with a mechanistic bias in CNV formation and previous findings of a paternal origin across all non-recurrent/non-pathogenic CNVs48.
Integrated discovery of ASD-associated genes across variant types and inheritance classes
Our analyses identified enrichment of rare protein-coding variants in ASD cases with effect sizes varying by mode of inheritance, variant class (PTV, misB, misA, deletion, duplication), and evolutionary constraint. We sought to utilise these insights and the size of our cohort (63,237 samples) to refine gene discovery in ASD by extending the TADA analytic framework6,10,22,41 to include: (1) rare and de novo CNVs, (2) variants present in unaffected offspring, and (3) a continuous measure of evolutionary constraint from gnomAD (LOEUF39, Supplementary Methods, Supplementary Fig. 4). A Bayes Factor (BF) was calculated to represent evidence of association for each autosomal protein-coding gene across variant types and modes of inheritance, taking into account null mutation rates and prior relative risks, when appropriate (Fig. 3).
Applying this model to the aggregated ASD data (TADA-ASD) we identify 71 genes associated with ASD at a FDR ≤ 0.001 and 183 genes associated at FDR ≤ 0.05 (Supplementary Table 4.9). Notably, the FDR ≤ 0.001 threshold is approximately equivalent to Bonferroni significance when back-calculating a p-value and correcting for 18,128 autosomal genes (p < 2.8×10−6), making it comparable to other recent studies of schizophrenia13 and DD11. We focus on these 71 genes here, and illustrate in Fig. 3 that 93% of these genes have contributions from multiple variant classes (Fig. 3b). They also represent a meaningful increase in ASD discovery -a prior study from Satterstrom et al. included a subset of 11,986 of the cases aggregated here, producing 32 ASD risk genes at this same FDR ≤ 0.001 threshold using the TADA parameters applied in that study, which increases to 51 genes when incorporating the updated model parameters and the CNVs discovered here (Fig. 3c). The BFs within TADA allow us to assess genomic architecture through the relative contributions across variant types and modes of inheritance (Fig. 3d-e, Fig. 4a-b). In keeping with haploinsufficiency as the primary pathogenic mechanism, PTVs and deletions account for over 90% of the evidence in 20 of the 71 ASD-associated genes (28.2%). However, for 10 genes (14.1%), over 90% of the evidence comes from missense variants and duplications (e.g., DEAF1, SLC6A1; Fig. 4a). Included within these 10 genes is PLXNA1, for which the evidence comes primarily from inherited missense variation localized within the Plexin domain of the encoded protein (Fig. 4b-c).
We next sought to estimate the magnitude of the ASD risk imparted by the genes discovered here and described in Fig. 3. Within the 71 genes, de novo PTVs or damaging missense variants were detected in 4.0% of cases and 0.5% of controls, giving a combined OR of 8.19. Using cross validation (Supplementary methods), we refined this estimate by variant type and mode of inheritance with OR estimates ranging from 16.0 for de novo PTVs to 1.02 for inherited MisA (Supplementary Table 4.10). We also applied this cross validation approach to estimate the yield of future gene discovery efforts, predicting a greater than linear increase in gene discovery at this same statistical threshold from a two-fold increase in the current sample size (150 genes; 95% CI: 108-278) or a three-fold increase (344 genes; 95% CI: 218-500).
Comparing the genetic architectures of ASD and general NDDs
ASD is frequently comorbid with other syndromic and non-syndromic NDDs, including intellectual disability (ID), and neurological disorders, including seizures59,60. Prior analyses have clearly demonstrated both overlapping and distinct genetic architectures across ASD and DD, with significant heterogeneity across these broadly defined NDDs. To explore commonalities in genes that impact such NDDs, we integrated our ASD data with an independent cohort of 91,605 family-based samples encompassing 31,058 offspring with DD, the vast majority of which (>95%) were diagnosed with at least one NDD11. Exome sequencing from this DD cohort was recently analyzed by the Deciphering Developmental Disorders (DDD) project using DeNovoWest, a permutation-based frequentist method, and reported exome-wide significant associations between rare SNVs and indels in 252 autosomal genes. However, these analyses did not include case-control samples, rare inherited SNVs and indels, or CNVs. We sought to re-analyze these DDD data to enable direct comparisons between the ASD and DD cohorts using uniform statistical models and significance thresholds between studies.
We applied the extended TADA framework to de novo variants in the DD cohort (TADA-DD) and found 309 autosomal genes associated at an FDR threshold of ≤ 0.001, including 94% of the initial 252 autosomal genes in Kaplanis et al. 202011 (Supplementary Table 4.9, TADA-DD found 477 at FDR ≤ 0.05). We observed the FDR values generated by our implementation of TADA (TADA-DD) to be highly correlated to those derived from the DeNovoWest significance values in Kaplanis et al. 202011 (r=0.95, p<10×10-22, Supplementary Fig. 5). Given the enrichment of cases with severe and syndromic disorders in this DD cohort compared to the ASD cohort10,11,61, the de novo PTV, MisB, and MisA counts in this DD cohort showed the expected stronger but overall similar trends in variant enrichment in probands across the top three deciles of LOEUF (Fig. 5a-b).
Since the first unbiased whole-genome analyses in ASD62, a notable overlap has been observed between genes affecting ASD and those affecting development more broadly, including intellectual development. Thus, it is reasonable to conjecture that integrating evidence for association from ASD and DD cohorts could yield additional genes influencing risk for ASD and broader development. Still, a cardinal rule of such meta-analysis is that the data should not be too heterogeneous. To determine whether the genes identified in the ASD cohort were also associated in the DD cohort, and vice versa, we converted the distribution of TADA q-values to p-values for each study. If the genes identified as associated in the DD cohort were also associated with ASD, the distribution of their ASD association p-values would be skewed toward zero; if they were not associated, the p-values would follow a uniform distribution; and, if some were associated and others were not, the p-values would follow a mixture distribution. Selecting the genes associated in the DD cohort at FDR ≤ 0.05, we evaluated the distribution of their p-values in the ASD cohort. The estimated proportion of associated genes (i.e., mixing parameter) was 0.701 (limma::propTrueNull, Fig. 5c), indicating that 70.1% of these genes affect risk for ASD. The converse analysis conditioning on the ASD-associated genes suggests that 86.3% of ASD risk genes have some broad effects on development (limma::propTrueNull, Fig. 5d).
Having observed that the results from the ASD and DD cohorts are somewhat complementary, we used the Bayesian framework underpinning TADA to integrate the genetic evidence for each gene across the cohorts by combining the BFs, which is conceptually similar to a frequentist meta-analysis. This combined analysis (TADA-NDD) revealed 373 genes associated with general NDDs at FDR ≤ 0.001 and 662 genes at FDR ≤ 0.05 (Supplementary Table 4.9). We observed that 54/373 and 124/662 genes did not meet the matching threshold in either the TADA-ASD or TADA-DD analyses alone, demonstrating a 15-19% increase in yield by combining evidence from these cohorts.
Heterogeneity of mutation patterns between ASD and DD associated genes
Due to the frequent comorbidity of ASD and DD phenotypes, isolating genes that exert a greater effect on ASD than they do on other DDs has remained challenging. Still, as documented above, 13.7% of the TADA-ASD FDR ≤ 0.05 genes show little evidence for association in the DD cohort at the same threshold (Fig. 5d). Moreover, while the remaining 86.3% of the genes are likely pleiotropic63, some could have a far greater impact on risk for ASD while others have greater impact on risk for other features of development. To evaluate the overall gene-set heterogeneity between the ASD and DD cohorts, we retained only de novo SNVs/indels for independent gene-level BF calculations. For the 373 genes at TADA-NDD FDR ≤ 0.001, we observed a Pearson’s correlation of 0.78 (p=9.1×10−78) of the gene-level log BF between the two major subcomponents of the ASD cohort (ASC+SSC versus SPARK), compared to only 0.42 (p=2.4×10−17) between the BFs of the ASD and DD cohorts. These observations signal much more consistent gene-level evidence between the ASD subcohorts than between the ASD and DD cohorts (Supplementary Fig. 6) and reflect both differential and shared genetic architecture underlying these often-comorbid phenotypes.
Next, we determined if certain genes were observed to be more commonly mutated in one cohort or the other. For these analyses, we first identified 464 “signal genes,” defined as any gene with FDR ≤ 0.05 in either TADA-ASD or TADA-DD from de novo PTV and MisB variants (which showed the greatest association and therefore were less likely to be benign). Of these signal genes, 120 belonged to TADA-ASD, 428 to TADA-DD, and 84 to both. Notably, even for the 84 genes significant in both cohorts, we also observed significant variant count heterogeneity (X2=317.6, DF=83, p=3.8×10−23) between the cohorts.
Given this heterogeneity in mutational patterns, we sought to capture which of the 464 genes have more variation in either the ASD or the DD cohorts. A common way to assess this is by a standardized chi-squared test statistic (C statistic), but its power to discriminate is abrogated by the much higher burden of risk variants in the DD than ASD samples (Fig. 5a-b). We therefore adjusted for the difference in mutational burden between the cohorts by downsampling the DD mutations to be equal to the count of ASD mutations, then computed the C statistic. Genes with fewer than expected mutations in the ASD cohort relative to the DD cohort achieved a negative C statistic; a positive C statistic indicated overrepresentation in the ASD cohort. Next, a mixture model was adopted to disentangle the two commingled distributions, which assigns posterior probabilities that a gene is from the ASD or DD component of the statistical distribution (Fig. 5e-f, Supplementary Table 4.11). Using a posterior probability cutoff of greater than 0.99, we find 36 genes to be a part of the ASD mixture component and 82 genes to be a part of the DD component (Fig. 5f).
Mutational enrichment identifies differential neuronal layers impacted by ASD and DD genes
To gain insight into differences in expression of genes identified from the ASD versus DD cohorts, we examined single-cell gene expression patterns from human fetal brains. Two studies provided more than 37,000 cortical cells ranging from 6-27 weeks post-conception64,65. To integrate these data, we first removed strong batch effects due to different protocols using cFIT and then obtained integrated factor loadings and gene expression for all measured cells66. UMAP plots visualizing all cells from two sources, after removal of batch effects, showed that similar cell types from the different batches group together, while cells unique to either batch were also preserved66 (Fig. 6a, Supplementary Fig. 7).
With the integrated data, we applied unsupervised clustering to identify cell subtypes in the context of a hierarchical tree to illustrate the relationships of major and minor cell type clusters. Using the MRtree method67, we observed that cells of each labeled type were merged across both datasets into common clusters. Visualizing the tree, the major branches corresponded to glial and progenitor cells, excitatory neurons, deep layer enriched excitatory neurons, and inhibitory neurons. Likewise, minor splits reflected the expected relationships between cell types (Supplementary Fig. 8a). Based on the trajectory analysis of Polioudakis et al.65, the ExN clade is less mature than the ExM clade, which in turn is less mature than the ExMU clade.
We subsequently assessed any differentials in strength of enrichment of the DD- and ASD-predominant risk genes within cell clusters for genes meeting the posterior probability 0.99 threshold (Fig. 5f). Of the 36 genes classified as ASD-predominant, 18 were expressed in these cell types; and, of the 82 classified as DD-predominant genes, 51 were expressed. Using odds ratio to reflect the strength of signal, we observed that both DD- and ASD-predominant genes were enriched in both interneurons and excitatory neurons compared to glial and progenitor cells (Fig. 6b, Supplementary Fig. 9, Supplementary Tables 4.12-4.13). While the enrichment was broadly distributed, the signal for DD-predominant enrichment tended to be found for cell types appearing early in the lineages of neurons. DD-predominant genes highlight four cell types, namely intermediate progenitors (IP), precursors of interneurons from the caudal ganglionic eminence (InCGE), migrating excitatory neurons (ExN3), and maturing excitatory neurons (ExM2). By contrast, ASD-predominant genes are strongly enriched in only one cell type, maturing excitatory neurons (ExMU1) and its clade. In keeping with a more mature neuron type, the ASD-predominant genes play roles in cytoskeleton organization, synaptic signaling, and activity-dependent programs. Our results regarding ASD agree with the conclusion of Polioudakis et al.65, which was based on a subset of these data. If we judge enrichment solely by significance after Bonferroni correction for 21 cell types, these conclusions still hold: ExMU1 remained significant for enrichment of ASD-predominant genes; likewise ExN3 remained significant for enrichment of DD-predominant genes. Our results are consistent with DD-predominant genes being expressed earlier in development and in less mature cells than ASD-predominant genes.
Emergence of shared ASD schizophrenia risk genes
The genetic risk for ASD has also been suggested to overlap with that of other neuropsychiatric disorders, especially schizophrenia68, such that the joint study of ASD and schizophrenia might reveal additional insights into both disorders. The Schizophrenia Exome Meta-Analysis (SCHEMA) Consortium recently analyzed exome sequencing from 24,248 schizophrenia cases and 97,322 controls, identifying 10 genes where ultra-rare coding variants were associated with schizophrenia after Bonferroni correction, and 244 genes at p < 0.0113. We compared our ASD- and NDD-associated genes to the SCHEMA findings to determine if there was any overlap between ASD and schizophrenia at the level of individual risk genes and, if so, whether it was related to ASD-DD overlap. Among the 71 ASD genes we discovered at an FDR ≤ 0.001, we found that 63 show an association with DD (using FDR ≤ 0.05, based on TADA-DD), and 8 show an association with schizophrenia (using p ≤ 0.01, based on values reported by Singh et al.). If the two associations were independent, we would expect ∼7 of the 8 ASD-schizophrenia genes to also show an association with DD (based on 63/71 = ∼88% of the ASD genes overlapping DD). However, we find 4 of the ASD-schizophrenia genes (NRXN1, ANK2, BRSK2, and DSCAM) lack an association with DD, which is a significant overrepresentation compared to random chance alone (p = 0.008, binomial test). This suggests that one subset of ASD risk genes may overlap DD, while a different subset overlaps schizophrenia.
We also analyzed these findings with an alternative approach: the distribution of genes in Fig. 5e (the 464 genes with FDR ≤ 0.05 for ASD and/or DD in our heterogeneity analysis) gives 36 ASD-enriched genes and 82 DD-enriched genes. Of the 244 genes identified by SCHEMA as schizophrenia-associated at p ≤ 0.01, 6 genes (ANK2, ASH1L, BRSK2, CGREF1, DSCAM, and NRXN1) overlap the 36 ASD genes, while only 3 (ATP2B1, GRIN2A, and HIST1H1E) overlap the 82 DD genes. If we compare to the null hypothesis that each of the genes in our TADA model has an equal chance of being schizophrenia-associated, then the ASD-schizophrenia overlap is significantly enriched (p = 8.5×10−6, binomial test), while the DD-schizophrenia overlap is not (p = 0.10, binomial test). The two outcomes (6/36 vs. 3/82) are also significantly different when compared to each other (p = 0.023, Fisher’s exact test), reinforcing the idea that the shared genetic risk between ASD and schizophrenia may be distinct to that observed between ASD and DD.
Discussion
We report here a gene discovery effort in ASD that integrates coding SNVs, indels, and exome-based CNVs in ASD. These analyses revealed an allelic spectrum associated with ASD that was dominated by de novo PTVs, damaging missense variants, and deletions within genes that display evidence of intolerance to loss-of-function variation in the general population. Nonetheless, of the 71 genes identified as associated at FDR ≤ 0.001, all included association evidence from multiple inheritance classes and 10 displayed the strongest evidence from de novo missense variants and duplications, including one gene (PLXNA1) with domain-specific transmission distortion. The proportion of log BF evidence arising from PTVs and DELs steadily decreased from 69.2% among the 71 TADA-ASD risk genes to 51.1% and 41.0% for FDR ranges of (0.001, 0.01) and (0.01, 0.05) respectively, while inherited evidence increased from 4.5% to 10.1% and to 20% for the same FDR ranges, indicating that future discoveries are likely to arise from the more subtle effects associated with missense and inherited variation as sample sizes continue to increase.
This study also presents the largest exploration to date of individual gene and multi-exon resolution CNVs from exome sequencing and their contributions to ASD architecture. In a subset of the SSC quartets in this study, we demonstrated that >80% of all CNVs discovered by WGS and spanning three or more exons could be recaptured by exome-based CNV detection, while >90% of all predicted de novo CNVs from exome sequencing at this resolution were confirmed by WGS and/or molecular validation. We readily recapitulated the observed excess of CNVs in known GD regions characterized from microarray studies, and quantified the relative effect size of CNVs outside of GD regions in comparison to PTVs. Prior studies have reported an enrichment of CNVs of maternal origin in ASD probands57. Here, we found a strong bias in the gamete-of-origin giving rise to de novo CNVs based on the mechanism of formation. Most NAHR-mediated CNVs across all GDs arose on the maternal chromosome (e.g., 95% of 16p11.2 CNVs), whereas the majority of all CNVs, which are driven by non-homologous or microhomology mediated repair, were preferentially paternal in origin and consistent with WGS analysis in controls48. We further highlighted the relative value of gene-level CNV analyses, as disruption of one or more highly constrained genes conferred comparable risk to alteration of dosage-sensitive GD loci. Overall, approximately 8% of the BF association evidence in ASD risk genes was derived from CNVs, while 0.95% of all cases and 0.02% of unaffected siblings harbored a de novo CNV that altered one of the 373 broadly defined NDD risk genes. These results emphasize the value of routine joint analysis of all classes of genomic variation in gene discovery analyses, and the relative impact of gene level CNV analyses in diagnostic testing.
By analyzing both the DD and ASD datasets under the same framework, we were also able to directly quantify the overlapping and distinct genetic architectures across these datasets. We showed the increased rate of de novo PTVs and damaging missense variants in the DD cohort in both raw variant enrichments as well as in association statistics. Applying the same statistical model to both the DD and ASD datasets independently reinforced that the analytic framework and statistical thresholds used were robust and highly correlated with the permutation-based approach applied in the DDD study. Integrating the two cohorts together with TADA, we implicated 373 genes associated with risk across a spectrum of syndromic and non-syndromic NDDs at FDR ≤ 0.001 (662 risk genes at FDR ≤ 0.05). Among these 373 genes, 54 genes were unique to the joint analyses and were not captured by either dataset alone.
We expect these gene sets to shed light on the neurobiological origins of ASD, including its key features, namely deficits in social interaction and the presence of repetitive behaviors and/or interests, versus broader DD phenotypes. One might reasonably ask that if there is substantial overlap between the genes implicated in NDDs writ large and those implicated directly in ASD, why is it important to focus on ASD at all? To see how a larger set of risk genes and one originating from analysis of an ASD cohort could be important, consider two of the 183 genes, specifically ARID1B and DSCAM. While both are highly associated with ASD, with FDRs of 5.5×10−12 and 3.7×10−4 respectively, statistical evidence for ARID1B’s association with ASD is stronger. Nine of 15,036 ASD probands carry PTV or misB de novo mutations in ARID1B, while 4 of 15,036 carry such mutations in DSCAM. These genes, however, are also distinguished by their association with broader phenotypes. Individuals with mutations in ARID1B are often given a diagnosis of Coffin-Siris syndrome and, according to recent work69, they typically have “ID, feeding difficulties, laryngomalacia, speech delay, motor delay, hypertrichosis, and cryptorchidism.” While some individuals with mutations in ARID1B also have comorbid ASD, it is only one of a wide range of rarer phenotypes69. ARID1B’s profound impact on development is apparent by the contrast of de novo mutations in the DD and ASD cohorts: 132 carriers out of 31,058 DD probands versus 9 carriers out of 15,036 ASD probands, a sevenfold higher rate in DD. This raises a challenge for neurobiologists. If they perturb ARID1B and identify neurodevelopmental features associated with that perturbation, are those features relevant to ASD or are they non-specific features of DD? One might instead select DSCAM because its evidence for association arose solely from the ASD cohort; no PTV or misB mutations were reported from 31,058 DD probands. Yet, as we develop here, there is evidence DSCAM is also involved in risk for schizophrenia, so it is not uniquely an “autism gene” and studies such as ours continue to demonstrate the pleiotropic consequences of many genes implicated in ASD and NDD risk. To identify the key neurobiological features of ASD will likely require convergence of evidence from many ASD genes and studies. Careful selection of candidates among the genes implicated here based on their mutational and functional features could inform these future studies.
We have taken a step in that direction with this study. First, we asked which of the ASD and DD genes had a predominance of mutations in the ASD versus DD cohorts. Gene lists identified from cohorts ascertained for ASD and DD show substantial overlap, with roughly 70 to 90% of genes contributing risk for both disorders. This genetic result matches clinical experience with individuals who carry damaging DNA variants in conserved, developmentally important genes. This population tends to show substantial comorbidity between the two diagnoses, as well as phenotypic data, with higher rates of cognitive impairment and walking delay in individuals with ASD10,11,15,22. We would predict elevated rates of social impairment in those with DD who carry ASD risk-mediating variants, although this remains to be assessed. Such an assessment is important because, while overlap between genes identified in ASD and DD is substantial, the risk mediated to each disorder is not equal for many of the genes. Broadly, genes expressed at earlier stages of cortical development, such as progenitor genes, display greater DD enrichment, while those expressed later, such as maturing neurons, lean towards ASD. This is consistent with the expectation that earlier and more generalized impairment leads to severe global developmental delay, and later, neuron-specific impairments affect more isolated developmental domains such as social communication. This is supported by our analyses incorporating single-cell genomic data, where we identify ExMu1 among maturing excitatory neurons as preferentially perturbed among ASD-predominant risk genes, while highlighting ExN3, a migrating excitatory neuron, in DD-predominant risk genes.
In conclusion, we have greatly expanded the rare variant information that is simultaneously accounted for in our statistical framework, including the use of accurate WES-based CNVs, to significantly increase the number of genes we can implicate for ASD and NDD risk. We have also used our framework to study the shared and distinct genetic risks between ASD and related NDDs, highlighting differential enrichment of associated genes at different neuronal timepoints. Importantly, much of the work presented in this study was powered by the large-scale collection of ASD datasets from the SPARK, ASC, SSC, and NHGRI Centers for Common Disease Genetics, and the commitment of these programs to make these data rapidly accessible to all qualified individuals. These studies have catalyzed a rapid evolution in genetic architecture studies in ASD, including a number of recent preprints that have leveraged these data for analyses of de novo, ultra-rare, and rare inherited variants in ASD, the combined impact of rare and common variant polygenic risk across males and females, and combined gene discovery from ASD and DD datasets15,70–72. As sample sizes continue to grow rapidly, we expect that the framework presented here will continue to yield returns in both gene discovery and improved understanding of the differential risks to the disorders on the neurodevelopmental and neuropsychiatric spectrum posed by variants within the same genes.
Data Availability
All data used in this study are deposited at: Repository/DataBank Accession: NHGRI AnVIL Accession ID: phs000298 Databank URL: https://anvilproject.org/data Repository/DataBank Accession: Simons Foundation for Autism Research Initiative SFARIbase Accession ID: SPARK/Regeneron/SPARK_WES_2/ Databank URL: https://www.sfari.org/resource/spark/
AUTHOR CONTRIBUTIONS
Study design: M.E.T, S.J.S, K.R., B.D., M.J.D, J.D.B, S.B.G; Sample contribution and data generation: M.E.T., M.J.D., J.D.B., S.D.R., S.B.G., S.D., C.C., C.R.S., M.B, A.B, B.H.Y.C., M.L.C., E.D, G.B.F., J.J.G., G.E.H., I.H., P.M., D.S.M., M.R.P., A.M.P., A.R., F.T., E.T., G.C., M.C.Y.C., C.F., E.G., A.C.G., E.H., S.L.L., C.L., Y.L., R.N., L.P., M.P-V., I.P., E.R., R.S., M.S., C.I.C.S., S.T., J.Y.T.W., M.H.C.Y.; Project management of coordination: M.E.T., S.J.S., M.J.D., J.D.B., S.D.R., L.S., B.M., C.R.S., C.S., B.C., J.J.G.; Methods development and analysis: M.E.T., S.J.S., K.R., B.D., M.J.D., D.J.C., A.S., A.L., L.L., S.K.L., L.G., E.B., M.B., L.K., S.Dong, R.L.C., H.B., M.P., F.K.S., J.M.F.; Writing: M.E.T., S.J.S., K.R., B.D., M.J.D., J.D.B., H.B., M.P., K.F.S., J.M.F.
DECLARATION OF INTERESTS
B.M.N. is a member of the scientific advisory board at Deep Genomics and consults for Biogen, Camp4 Therapeutics Corporation, and Takeda Pharmaceutical. C.M. Freitag has been a consultant to Desitin and Roche and receives royalties for books on ASD, ADHD, and MDD.
ACKNOWLEDGEMENTS
We thank all of the individuals who participated in this research. We also thank all contributing investigators to the consortia datasets used here from the Autism Sequencing Consortium (ASC), the Simons Simplex Collection (SSC), Simons Powering Autism Research for Knowledge (SPARK) project, the iPSYCH project, the Deciphering Developmental Disorders (DDD) project, and Schizophrenia Exome Meta-Analysis (SCHEMA).
This work was supported by grants from the Simons Foundation for Autism Research Initiative (SSC-ASC Genomics Consortium #574598 to S.J.S., #575097 to B.D. and K.R., #573206 to M.E.T. and M.J.D., #571009 to J.D.; the SPARK project and SPARK analysis projects #606362 and #608540 to M.E.T., M.J.D., J.B., B.D., K.R., S.J.S.; SFARI #402281 to S.J.S., M.W.S., B.D., and K.R. and #647371 to S.J.S.), NHGRI (HG008895 to M.J.D., S.G., M.E.T.), NIMH (MH115957 to M.E.T., MH111658 and MH057881 to B.D., MH111661 and MH100233 to J.D.B., R01 MH109900 to K.R., MH111660 to M.J.D., and MH111662 and MH100027 to S.J.S. and M.W.S.), NICHD (HD081256 to M.E.T.), AMED (JP21WM0425007 to N.O). and the Seaver Foundation. J.M.F. was supported by an Autism Speaks Postdoctoral Fellowship and R.L.C. was supported by NSF GRFP #2017240332.
Footnotes
↵+ Senior author: joseph.buxbaum{at}mssm.edu; mjdaly{at}atgu.mgh.harvard.edu; devlinbj{at}upmc.edu; roeder{at}andrew.cmu.edu; stephan.sanders{at}ucsf.edu; mtalkowski{at}mgh.harvard.edu