Abstract
Coding de novo mutations (DNMs) contribute to the risk for autism spectrum disorders (ASD), but the contribution of noncoding DNMs remains relatively unexplored. Here we use whole genome sequencing (WGS) data of 12,411 individuals (including 3,508 probands and 2,218 unaffected siblings) from 3,357 families collected in Simons Foundation Powering Autism Research for Knowledge (SPARK) to detect DNMs associated with ASD, while examining Simons Simplex Collection (SSC) with 6383 individuals from 2274 families to replicate the results. For coding DNMs, SCN2A reached exome-wide significance (p=2.06×10−11) in SPARK. The 618 known dominant ASD genes as a group are strongly enriched for coding DNMs in cases than sibling controls (fold change=1.51, p =1.13×10−5 for SPARK; fold change=1.86, p =2.06×10−9 for SSC). For noncoding DNMs, we used two methods to assess statistical significance: a point-based test that analyzes sites with a Combined Annotation Dependent Depletion (CADD) score ≥15, and a segment-based test that analyzes 1kb genomic segments with segment-specific background mutation rates (inferred from expected rare mutations in Gnocchi genome constraint scores). The point-based test identified SCN2A as marginally significant (p=6.12×10−4) in SPARK, yet segment-based test identified CSMD1, RBFOX1 and CHD13 as exome-wide significant. We did not identify significant enrichment of noncoding DNMs (in all 1kb segments or those with Gnocchi>4) in the 618 known ASD genes as a group in cases than sibling controls. When combining evidence from both coding and noncoding DNMs, we found that SCN2A with 11 coding and 5 noncoding DNMs exhibited the strongest significance (p=4.15×10−13). In summary, we identified both coding and noncoding DNMs in SCN2A associated with ASD, while nominating additional candidates for further examination in future studies.
Introduction
Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by impaired social interaction and communication, and restrictive interests or repetitive behaviors1–3. ASD begins before the age of 3 years and can last throughout a person’s life, though symptoms may improve over time4. A recently review conducted by Tomoya et al. reported that ASD affects approximately 2.3% of children aged 8 years and approximated 2.2% of adults in the US5. Additionally, it has a strong male bias, with boys being affected nearly 4 times more common than girls among children aged 8 years6. ASD is often accompanied by other psychiatric disorders, such as intellectual disability, attention-deficit hyperactivity disorder, anxiety, irritability and aggression1. With the increasing disease burden of ASD, early diagnosis and treatment become an urgent public health issue.
ASD exhibits extensive clinical and genetic heterogeneity with high heritability. Common variants are estimated to play an important role, accounting for ∼40–80% of the overall liability for ASD7–11. Despite the evidence of a significant role for common variants in ASD risk, rare genetic variation (MAF<1%) confers higher individual risk12,13. The rare inherited variants account for a portion of the heritability2,7, while de novo mutations (DNMs) identified from parent–offspring trios are the underlying cause for many cases of ASD, explain additional proportions of the overall liability7. Researchers have identified hundreds of high-confidence ASD genes enriched with likely deleterious protein-coding DNMs14,15. A large-scale exome sequencing study, using an enhanced analytical framework to integrate de novo and inherited rare coding variants, identified 102 putative ASD-associated genes (e.g., CHD8, SCN2A, ADNP)14. Furthermore, Zhou et al. performed a two-stage analysis of rare de novo and inherited coding variants and identified 60 genes with exome-wide significance (p < 2.5×10−6), including five new risk genes (NAV3, ITSN1, MARK2, SCAF1, and HNRNPUL2)15. Additionally, DNMs in the non-coding genome can contribute to ASD risk16–20. Ryan et al. performed whole-genome sequencing of 200 ASD parent–child trios to characterize DNMs and found a significant enrichment of predicted damaging DNMs in ASD cases, of which 15.6% were non-coding17. Meanwhile, the authors revealed that non-coding elements most enriched for DNM were untranslated regions of genes, regulatory sequences involved in exon-skipping and DNase I hypersensitive regions. Kim et al. generated 813 whole-genome sequences from 242 Korean simplex families and found that target genes, including ARHGEF2, BACE1, CDK5RAP2, CTNNA2, GRB10, IKZF1, and PDE3B, affected by the non-coding DNMs in chromatin interactions led to early neurodevelopmental disruption implicated in ASD risk20. However, the pathogenic effects of noncoding DNMs related to ASD remain largely unexplored and poorly understood. Therefore, identifying noncoding DNMs that regulate gene function could provide important insights into ASD pathophysiology, which may have implications for targeted therapeutics.
Simons Foundation Powering Autism Research for Knowledge (SPARK) is an autism research initiative that aims to recruit and retain a community of 50,000 autistic individuals and their family members to advance understanding of the genetic basis of ASD21. As of March 2023, SPARK generated whole-genome sequencing data for 12,519 individuals from over 3000 families, offering the opportunity to assay the contribution of noncoding DNMs for ASD risk. Here, we performed whole-genome sequencing to identify both coding and noncoding DNMs in 3509 ASD trios comprised of affected probands and unaffected parents and in 2218 unaffected sibling-parent trios from the SPARK cohort. Additionally, we replicated our findings in the Simons Simplex Collection (SSC) cohort with 6383 individuals from 2274 families 22.
Methods
Sample selection
The whole genome sequencing data of SPARK was made available via Simons Foundation Autism Research Initiative (SFARI) and can be requested through SFARI Base (https://www.sfari.org/resource/sfari-base/). All participants were recruited to SPARK under a centralized institutional review board (IRB) protocol (Western IRB Protocol no.L20151664). Written informed consent was obtained from all legal guardians or parents for all participants aged 18 and younger, as well as for those aged 18 and older who have a legal guardian. Assent was also obtained from dependent participants aged 10 and older. In total 3385 families were selected from the SPARK cohort. We excluded 28 families with missing data on paternal and/or maternal whole genome sequencing, leaving 3357 families (12,411 individuals) for analysis. Our study was approved by the Institutional Review Board of the Children’s Hospital of Philadelphia.
The Simons Simplex Collection (SSC) cohort also represents an important data resource from the SFARI, to identify de novo genetic risk variants that contribute to the ASD22. Up to date, more than 2000 families with whole genome sequencing and clinical data have been collected. Our study included 6383 individuals (including 2274 affected probands and 1835 unaffected siblings) with whole genome sequencing from 2274 families to replicate results from the SPARK cohort. Probands were excluded who were younger than 4 years of age or older than 18. Informed consent was obtained at each data collection site included in the SSC.
Variant calling
Sequencing and genotyping of all samples for SPARK and SSC were performed at New York Genome Center (NYGC). Alignment of reads to the human reference genome version GRCh38, duplicate read marking, and Base Quality Score Recalibration (BQSR) were performed using the standard pipeline from the Centers for Common Disease Genomics (CCDG). Variants calls for all samples from the SPARK and SSC cohorts are provided from NYGC.
DeepVariant gVCFs are provided for all samples enrolled in SPARK. DeepVariant version 1.3.0 was used to call SNVs and INDELs to produce sample-level gVCFs using the default WGS configuration profile (“--model_type=WGS”). All samples were then jointly called using GLnexus version 1.4.1 and the default DeepVariant WGS configuration (“--config DeepVariantWGS”). Genomic “chunks” were processed in parallel for computational feasibility, then subsequent chunks were combined to form project VCFs (pVCFs) by chromosome. Post-calling BCFtools (version 1.17) norm was used to left-align and normalize indels.
For SSC cohort, the SNVs and indels in families were called using four different callers: GATK HaplotypeCaller v.3.5.0, FreeBayes v1.1.0, Platypus v0.8.1, and Strelka2 v2.9.2. In addition, multi-nucleotide variants were called using FreeBayes and Platypus. Post-calling BCFtools (version 1.3.1) norm was used to left-align and normalize indels. They partitioned the genome into the high-quality regions, consisting of unique space as well as ancient repeats, and the recent repeat regions, which consisted of repeats <10% diverged from the consensus in RepeatMasker. Variants were only assessed in high quality portions of the genome and those in recent repeat regions were removed from the study. Our study selected project VCFs (pVCFs) generated by GATK HaplotypeCaller to replicate de novo analysis.
DNMs detection
We developed a custom pipeline for DNM analysis. Candidate DNMs, defined as variants present in the offspring and absent in both parents, were identified from per-family VCFs generated by DeepVariant. Bcftools and Bedtools (v2.31.0) were used to further filter the candidate DNMs. The filtering criteria for DNMs as follows: 1) Allelic depth (AD) ≥ 6 in the offspring, 2) Genotype quality (GQ) ≥ 25 in the offspring and GQ ≥ 20 in the parents, 3) Read depth (DP) > 8 in the parents, 4) Annotated the candidate DNMs using a custom pipeline based on ANNOVAR23 (human genome hg38), and filtering the calls with an allele frequency (genomAD) < 0.1%, 5) Removed DNMs located in low mappability regions, 6) Filtering the fraction of reads supporting the alternate allele (AB.ALT) at 0.25∼0.75 in the offspring, 7) removed variants located in regions known to be difficult for variant calling (e.g., HLA gene and MUC gene), and 8) when an individual carries multiple DNMs within 100Lbp in the same gene, only one variant with the most severe effects was included in the analysis. Combined Annotation-Dependent Depletion (CADD, v1.6)24 was used to score the deleteriousness of noncoding mutations. To identify potentially pathogenic variants, we extracted noncoding DNMs with CADD score thresholds of 15, which rank among the top ∼1.8% of variants with the most severe predicted effect. Pipelines for analysis were blind with respect to affected probands and unaffected siblings.
Enrichment of coding DNMs
We further determined whether any individual genes carry an excess of DNMs. To compute the expected number of coding DNMs (including exonic and canonical splicing regions) in the cohort, we used pre-computed tabulation of the probability of DNMs arising in each gene based on RefSeq transcript definitions25. The number of expected coding DNMs equals to a gene’s inferred DNMs mutation rate multiply by the number of trios and by 2 (for the number of chromosomes for autosomes). Given that the number of DNMs per trio follows a Poisson distribution26, we use the Poisson test to evaluate the excesses of de novo events:
Enrichment of noncoding DNMs
For noncoding DNMs, we used two methods to assess statistical significance: a point-based test that analyzes sites with a CADD score ≥15, and a segment-based test that uses genomic non-coding constraint of haploinsufficient variation (Gnocchi) constraint scores in 1kb genomic segments to infer the background mutation rates.
Point-based test: to compute the expected number of noncoding DNMs, we assumed that the probability of the noncoding mutations within a given gene aligns with that of coding DNMs. Rodriguez-Galindo et al. supported that the de novo mutation rate is similar in exons compared to introns in the germline, after accounting for trinucleotide sequence composition and an excess of nonsynonymous exonic variation arising from sampling bias27. Given that gene length is an obvious factor in a gene’s mutability, the expected number of noncoding DNMs is calculated as follows: multiply the gene’s DNM mutation rate by the number of trios, then by 2 (for autosomes), and finally by the ratio of the noncoding gene length to the coding gene length. Here, the noncoding variants are classified as those in genic (intronic, upstream, downstream, ncRNA exonic, ncRNA intronic, ncRNA splicing, UTR5, and UTR3) regions. We separately calculated statistical significance for intergenic regions by assigning noncoding variants to the nearest genes. Among the noncoding DNMs, we further classified mutations as severe if they had a CADD score greater than 15. Consequently, the noncoding gene length were calculated based on mutations with CADD score ≥15. Finally, we computed the p value for the observed number of noncoding DNMs compared to the expected number of noncoding DNMs:
Segment-based test: Chen et al. built a genome-wide genomic constraint map (Gnocchi) per 1-kb genomic windows by utilizing the Genome Aggregation Database (gnomAD version 3.1.2)28. The authors derived Gnocchi score by comparing the observed variation to an expectation from1,984,900 tiling 1kb genomic windows (passing all quality control checks) on autosomes. We annotated 1kb genomic windows and excluded variants located within coding and intergenic regions, leaving 876,086 1kb noncoding bins. We used Gnocchi genome constraint in 1-kb genomic segments to calculate the expected number of noncoding mutations in our cohorts as follows:
Where average count 1kb is the average number of DNM per 1kb, which is calculated as the average number of noncoding DNMs in a test dataset among 876,086 noncoding 1 kb genomic windows. β is inferred directly from Gnocchi calculation 28 which includes a score of the expected number of rare variants per 1kb genomic segments based on genomic contexts, and it is calculated as the ratio of the counts of the expected mutations in this 1kb bin divided by the average number of expected mutations in all 1kb bins in gnomAD.
P value is calculated as Poisson distribution using expected and observed counts in all bins assigned to each given gene. A Bonferroni corrected pL<L2.5L×L10−6 (adjusting for ∼20,000 genes) was regarded as statical significant given the theoretical number of genes tested.
To characterize the enrichment of groups of genes with DNMs contributing to ASD risk, we defined 3 gene sets, including 3054 LoF constrained genes (pLI>0.9), 1339 known NDD genes from Developmental Disorders Genotype-to-Phenotype database (DDG2P)29, and 618 known dominant ASD genes (Supplementary Table 1). This set of 618 known ASD genes are compiled in Zhou et al15, and they encompass known NDD genes from DDG2P, high-confidence ASD genes collected by the SFARI, and dominant ASD genes included in the SPARK genes list. Additionally, for highly constrained noncoding 1kb regions based on Gnocchi threshold (for example, Gnocchi>4), we calculated the total counts of expected and observed DNMs with score over this threshold.
Functional prediction of noncoding variants
Genome browser (https://genome.ucsc.edu) was used to visualize and browse noncoding DNMs with a CADD score ≥15, and candidate cis-regulatory elements (cCREs) derived from ENCODE30 integrated DNase-based sequencing (DNase-seq) and chromatin immunoprecipitation with sequencing (ChIP–seq) data. Genes with an pLIL> 0.9 are defined as LoF constrained genes. We annotated potential splice site variants using SpliceAI31, which predicts splice site gain or loss events, and we set delta scores of ≥0.8 as the high precision threshold.
Results
Cohort characteristics and study workflow
Our custom DNMs analytic pipeline was shown in Fig 1. A total of 3508 affected probands and 2218 unaffected sibling control from 3357 family trios were included in SPARK cohort (March 2023 release). Over 65% of the families are quartets. The mean (standard deviation) age of the SPARK cohort in this analysis was 9.0Lyears (5.9 years) for affected probands, 7.9 (4.5) for unaffected siblings, and 40.1 (8.4) for parents. The breakdown of sex in the full SPARK cohort was 58.2% male and 41.8% female, while among ASD cases, the breakdown was 79.7% male and 20.3% female. The characteristics of the individuals included in our analysis was shown in Table 1. Additionally, the SSC cohort, comprising of 6383 individuals (including 2274 affected probands and 1835 unaffected siblings) from 2274 families (over 80% were quartets), were applied to replicate the de novo variants analysis.
(a) DNMs analytic pipeline. For noncoding DNMs, this study used two methods to assess statistical significance (b) point-based test that analyzes sites with a Combined Annotation Dependent Depletion (CADD) score ≥15, and (c) segment-based test that uses Gnocchi genome constraint scores in 1kb genomic segments to infer background mutation rates.
Identification of coding DNMs for ASD
Through our custom pipeline designed for de novo analysis, we identified an average of 1.27 coding DNMs per affected offspring and 1.21 per unaffected siblings in the final call set (Table 2). In the ASD case group, 9.5% (424/4,446) of the coding DNMs were classified as likely gene disrupting (LGD), and 61.4% (2,732/4,446) were characterized as missense. While in the unaffected sibling control group, 7.4% (196/2,650) of the DNMs were classified as LGD, and 62.6% (1,660/2,650) were missense mutations. The number of coding DNMs observed per trio approximately follows Poisson distribution (Fig. 2). SSC was used to replicated DNM analysis. The average of coding DNMs is 1.38 for probands and 1.30 for unaffected siblings. Additionally, 9.3% of the coding DNMs were classified as LGD in the ASD group, compared to 6.1% of the coding DNMs classified as such in the unaffected sibling control group. The distribution of coding DNMs from SSC was shown in Supplementary Fig 1.
(a) distribution of coding DNMs. (b) distribution of missense DNMs. (c) distribution of loss of function DNMs. (d) distribution of noncoding DNMs. (e) distribution of gene-relate noncoding DNMs. (f) distribution of intergenic noncoding DNMs. (g) distribution of noncoding DNMs have a CADD score ≥15. (h) distribution of gene-relate noncoding DNMs have a CADD score ≥15. (i) distribution of intergenic noncoding DNMs have a CADD score ≥15.
We applied de novo enrichment analysis to compare the observed number of DNMs to the expected numbers in ASD case trios using a one tailed Poisson test. After excluding genes that showed evidence of fold change (FC) enrichment less than one for candidate DNMs in ASD cases compared to unaffected siblings, we identify 116 genes (Supplementary Table 1) that harboring an excess of coding DNMs (p <0.05, DNM count ≥ 3). Of note, two genes with the excess of coding DNVs met Bonferroni corrected significance (p <L2.5×10−6), including SCN2A and CCDC168. When coding DNMs were categorized into missense and LGD variants, 33 genes were enriched for missense DNMs (e.g., SCN2A, BRD4, KDM5B, CHD2) and 11 genes were enriched for LGD DNMs (e.g., SCN2A, BRSK2, CCDC168, ADNP, PRICKLE2, RAI1, KDM6B, SHANK3, AUTS2, SRCAP, and WDFY3; p <0.05, FC > 1, DNM count ≥ 3). Among these, SCN2A, BRSK2, CCDC168, ADNP, and PRICKLE2 with LGD DNMs met Bonferroni corrected significance. Meanwhile, we identify 72 genes with an excess of coding DNMs (P <0.05, FC > 1, DNM count ≥ 3) from the SSC cohort, three of which (CHD8, AHNAK2, and FLG2) reached the Bonferroni corrected significance. Furthermore, of the 72 genes examined, 35 were identified as constrained. Nevertheless, 23 genes with missense DNMs and 3 genes with loss of function DNMs reach suggested significance (p <0.05, enrichment FC > 1, DNM count ≥ 3), of which, CHD8 harboring loss of DNMs met Bonferroni corrected significance. Additionally, the overlapping genes that are enriched for coding DNMs between SPARK and SSC cohorts includes SCN2A, ANKRD11, GRIN2B, KDM6B, ARID1B, SPEN, PTPRF, and DNMT3A, with PTPRF being a previously unreported candidate gene.
Identification of noncoding DNMs for ASD
Besides coding DNMs, we also utilized our custom pipeline to identify noncoding DNMs. By comparing each affected and unaffected offspring to their parents, 302,603 noncoding DNMs (148,716 in genic regions and 153,887 in intergenic regions) were identified from 3508 affected offspring trios, while 196,898 DNMs (95,875 in genic regions and 101,023 in intergenic regions) were identified from 2218 unaffected sibling trios in the SPARK cohort. The average of noncoding DNMs is 86.3 per proband and 90.0 per unaffected sibling.
Among the noncoding DNMs, we further classified 5,343 mutations in proband trios and 3,424 in unaffected sibling trios as likely functional, based on a CADD score ≥15, which account for the top ∼1.8% of variants. The distribution of noncoding DNMs is shown in Fig 2. The average number of noncoding DNMs had a CADD ≥ 15 is 1.52 per proband and 1.56 per unaffected sibling. We further replicated our pipeline in SSC, identifying 209,396 noncoding DNMs in 2274 affected offspring trios and 171,258 in 1835 unaffected sibling trios. The average of noncoding DNMs is 92.1 per proband and 93.3 per unaffected sibling. After filtering CADD score ≥ 15, we identified an average of 1.60 noncoding DNMs per ASD case and 1.59 per unaffected siblings in the final call set. The distribution of noncoding DNMs for SSC shown in Supplementary Fig 1.
We further evaluated the excesses of noncoding de novo events over expectation and identified 21 genes with suggestive evidence (p <0.05, CADD score ≥15, FC >1, DNM count ≥3) by point-based test. For further replication, we examined the genes in the SSC cohort, and found that five out of 21 genes (TSHZ2, SCN2A, NELL1, ZEB2, and AGMO) had higher FC in ASD case than unaffected siblings in SSC cohort, though none reached statistical significance. Among them, TSHZ2 and NELL1 were not reported as candidate genes before.
Finally, by combining evidence from case-only tests for both coding and noncoding DNMs, we identified 89 genes enriched for DNMs (p < 0.05). Notably, SCN2A, with 11 coding DNMs and 5 noncoding DNMs in probands and none in unaffected siblings (Table 3), shows the most significant DNM burden (p =4.15×10−13). In the SSC cohort, 65 genes had an excess of DNMs (p < 0.05). Among them, SCN2A had one noncoding DNM in probands and none in unaffected sibling control in SSC. Furthermore, the overlapping genes that reached threshold of 0.05 between SPARK and SSC cohorts include SCN2A, DNMT3A, ARID1B, GRIN2B, and KDM6B. The top genes in SPARK and SSC cohorts showed in Table 4.
In addition to point-based test, we also used a segment-based test to evaluate the observed number noncoding DNMs over expectation. We identified 627 genes exhibiting excesses of noncoding DNMs (p <0.05, DNM count ≥3; Supplementary Table 2). Within this group, four genes (CSMD1, WWOX, RBFOX1, and CDH13) achieved Bonferroni corrected significance (p <L2.5×10−6). We applied SSC for further replication and found that 535 genes are enriched for noncoding DNMs (p <0.05, DNM count ≥3) and four of these genes (CSMD1, CDH13, RBFOX1, SGCZ) met exome-wide significance (Bonferroni pL<2.5×10−6). Taken together, 47 genes (e.g. AGMO, GRIN2A, TEK, WWOX, MEMO1) have excesses of noncoding DNMs in both SPARK and SSC cohorts. Notably, CDH13, CSMD1, and RBFOX1 reached exome-wide statistical significance (Bonferroni pL<2.5×10−6) across both cohorts (Table 5).
Examination of noncoding DNMs in known ASD risk genes via case-sibling comparison
To evaluate the contribution of noncoding DNMs within established ASD risk genes, we examined 618 well-documented dominant ASD genes (Supplementary Table 3). Among these, 22 genes (e.g., SCN2A, NRXN3, and MEIS2) that harbored three or more noncoding DNMs had higher enrichment in probands compared to unaffected siblings (CADD score ≥15, FC >1, and DNM count ≥3), including SCN2A, NRXN1, BCL11A, CASZ1, MEIS2, PBX1, EBF3, MEF2C, FOXP2, HDAC4, CUX2, SOX5, SATB1, EPB41L1, ZEB2, GLI3, NRXN3, RARB, MAGI2, CAMTA1, MACF1, and TRPM3. Remarkably, three out of these 22 genes—SCN2A, HDAC4, and ZEB2—are enriched for noncoding DNMs (p <0.05). For further replication, we applied the finding to the SSC. This corroborative analysis revealed that, five genes (MEIS2, ZEB2, NRXN3, RARB, and MAGI2) exhibited increased enrichment in SSC cohort (FC >1 and DNM count ≥3). Furthermore, among the three genes (SCN2A, HDAC4, and ZEB2) identified with a significant burden of noncoding DNMs in the SPARK, two genes (SCN2A and ZEB2) showed a higher burden in SSC, though not reaching statistical significance.
When we applied a segment-based test to evaluate the excesses of noncoding DNMs for 618 known ASD risk genes, 23 genes are enriched for noncoding DNMs in SPARK and 22 genes are enriched in SSC (p < 0.05). Moreover, the overlapping genes that are enriched for noncoding DNMs between SPARK and SSC cohorts includes MAGI2 and GRIN2A (Table 5; Supplementary Table 4).
Gene set analysis by case vs sibling control comparisons
We next characterized the enrichment of gene set with DNMs contributing to ASD risk in probands compared to unaffected sibling controls. Constrained genes (pLI>0.9) as a group are enriched for coding DNMs in cases than sibling controls (fold change=1.14, p =0.018 for SPARK, fold change=1.44, p =1.21×10−7 for SSC; Table 6). Moreover, the set of 618 known ASD genes had higher enrichment in both cohorts (fold change=1.51, p =1.13×10−5 for SPARK, fold change=1.86, p =2.06×10−9 for SSC). However, these three gene sets (LoF constrained, NDD, and known ASD genes) do not showed higher burden of noncoding DNMs in cases versus sibling controls (CADD≥15).
Gene sets were grouped as 3054 LoF constrained gene (pLI >0.9), 1339 neurodevelopmental disorders (NDD) genes and 618 known ASD genes.
We further evaluate the enrichment of noncoding DNMs identified by segment-based test. We failed to find any significant enrichment of noncoding DNMs in cases over sibling controls in any of the gene sets. Additionally, when we focus on 1kb segments with Gnocchi score>4, we still do not see increased burden of noncoding DNMs in cases over controls (Table 6).
Function prediction of noncoding DNMs in SCN2A, ZEB2, and AGMO
Three noncoding DNMs located in SCN2A (chr2:165296095, G>C; chr2:165297139, T>C; and chr2:165370303, A>T) are situated less than 20 base pairs (bp) away from an exon and are highly conserved (Fig. 3). The probability that the position 2:165296095 (5 bp away from an exon) being splice donor loss is 0.98 (delta score ≥ 0.8 is high precision). Likewise, the probability that the position 2:165370303 (4 bp away from exon) being splice donor loss is 0.81. In addition, the noncoding DNM found at position chr2:165294336 (T>C) of SCN2A is located within a candidate cis-Regulatory Element (cCRE) with a proximal enhancer like signature.
(a) point variation of chr2:165275094 (T>G). (b) point variation of chr2:165294336 (T>C). (c) point variation of chr2:165296095 (G>C). (d) point variation of chr2:165297139 (T>C). (e) point variation of chr2:165370303 (A>T). (f) probability of the mutations situated less than 20 base pairs (bp) distant from the exon being splice-altering.
In the ZEB2 gene, 10 noncoding DNMs were identified in probands and 3 in unaffected sibling controls (2.07-fold). Three of the 10 noncoding DNMs in probands were identified as cCREs with a distal enhancer-like signature. One noncoding DNM, located at position chr2:144404388 (T>C), showed higher enrichment of the H3K27Ac histone mark on HSMM Cells, as determined by a ChIP-seq assay. Another one, located at position chr2:144504145 (G>A), exhibited higher enrichment of the H3K4Me1 histone mark on K562 Cells. Additionally, one noncoding DNM (chr2: 144396373, G>C) is situated 39 bp away from the exon, while the probability that the position being a splice site is 0.
The AGMO gene harbored 4 noncoding DNMs in ASD cases. One noncoding de novo mutation, located at position chr7:15556988 (T>C), shows characteristic of a distal enhancer-like signature and higher enrichment of the H3K4Me1 histone mark on HUVEC cells (Supplementary Fig. 4). Additionally, mutation in chr7:15502483 (G>A) shows slightly higher enrichment of the H3K4Me1 histone mark on HUVEC cells.
Discussion
In the current study, we developed a custom pipeline to identify both coding and noncoding DNMs across 3357 families (12,411 individuals) with whole genome sequencing data from the SPARK cohort. We identified 116 genes that are enriched for coding DNMs in cases, of which SCN2A and CCDC168 reached exome-wide significance. Furthermore, when integrating evidence from both coding and noncoding DNMs, SCN2A exhibited the most significant enrichment in SPARK with replication in SSC.
ASD is characterized by its clinical and etiological heterogeneity, which makes it difficult to elucidate the neurobiological mechanisms underlying its pathogenesis. Recently, DNMs have been recognized as strong source of genetic causality, and the characterization of DNMs allows additional ASD risk genes to be identified. DNMs in non-coding regions have become of interest in recent years. Previous whole exome sequencing studies were unable to detect these variants due to the lack of coverage and sequencing depth across non-coding regions. However, there is evidence that ASD genes harbor hotspots of hypermutability in non-coding regions and besides, deleterious mutations across them are subjected to strong negative selection just like the loss of function mutations located in the coding region12. Werling et al. present an analytical framework to evaluate rare and de novo noncoding mutations from whole genome sequencing of 519 ASD families unable to demonstrate a rare noncoding variant contribution to ASD risk, but found that noncoding de novo indels category showed a greater number of nominally significant results than expected19. The authors19 demonstrated that the contribution of de novo noncoding variation is probably modest compared to de novo coding variants. Although whole genome sequencing now allows the identification of noncoding DNMs in affected probands, their contribution to risk remains relatively unexplored and demands further investigation. Our study suggested that noncoding DNMs of SCN2A were associated with ASD risk.
The SCN2A gene encodes the voltage-gated NA+ channel Nav1.2, one of the major neuronal sodium channels that play a role in the initiation and conduction of action potentials32. Pathogenic mutations in SCN2A, have been associated with a spectrum of epilepsies and neurodevelopmental disorders33–37. Moreover, multiple studies have confirmed the contribution of de novo SCN2A mutations to the risk for ASD. Stephan et al. using whole-exome sequencing of 928 individuals, including 200 phenotypically discordant sibling pairs confirmed that de novo coding mutations across SCN2A were associated with ASD risk37 . Our discovery of coding DNMs events in SCN2A related to the risk of ASD are supporting previous studies. Intriguingly, three noncoding DNMs events observed in SCN2A are located within 20 base pairs of exons, further implicating them in ASD risk. Notably, two of these noncoding DNMs demonstrated a high likelihood of affecting splice donor/loss sites, potentially leading to significant alterations in gene expression. Also, one noncoding DNM in SCN2A displays characteristic of a proximal enhancer like signature. This suggests a critical role for both coding and proximal noncoding regions of SCN2A in ASD pathogenesis, highlighting the intricate interplay between genetic variants and their potential impact on splicing and gene function.
CHD8 encodes chromodomain helicase DNA-binding protein 8 and its mutation is a highly penetrant risk factor for ASD14,38. While DNMs of CHD8 are associated with ASD risk in the SSC cohort, but not in the SPARK cohort. One possible explanation may be the different sample characteristics and family fractures. In SSC cohort, over 80% family types are quartet families and around 20% are simplex. While in SPARK cohort, 65% family types are quartet families and 33% are simplex families. Additionally, the stochastic nature of rare mutations may also explain the general lack of DNMs in CHD8 in the SPARK cohort.
To our best knowledge, this study was among the first to evaluate the significance of non-coding DNMs implicated in the risk of ASD from whole genome sequencing data across more than 3300 families. Our findings also suggested the contribution of noncoding DNMs in known ASD risk genes, especially SCN2A. This study also has several limitations. The sample size from SPARK is relatively small, and the size of the ASD cases was around double that of the unaffected sibling controls, which may hinder the discovery of ASD risk genes. Furthermore, the probability of de novo mutations per gene may differ between coding and noncoding regions. To compute the expected number of noncoding DNMs, we assume that the mutation rate of the noncoding DNMs occurring within the same gene is comparable to that of coding DNMs39,40. Rodriguez-Galindo et al. demonstrated that the rate of generation of new genetic variants, the mutation rate, does not significantly vary between exons and adjacent introns when accounting for sequence context27. However, Sankar et al. revealed mutation rates in exons are 30%–60% higher than in noncoding DNA due to the relative overabundance of synonymous sites involved in CpG dinucleotides41. Studies also reported that mutation rate in non-coding regions is highly heterogeneous and can be affected by local sequence context as commonly modelled in gene constraint metrics as well as by a variety of genomic features at larger scales42,43. Therefore, we used two different methods (point-based and segment-based) to calculate the noncoding DNMs because of the probability of DNMs in noncoding regions still limited. In the near future, repeating these analyses with large-scale whole genome sequencing data, potentially utilizing long-read sequencing technologies, and investigating inherited variants will aid in identifying additional ASD risk genes.
Data availability
Whole genome sequencing data was made available via SFARI and can be requested through SFARI Base (https://www.sfari.org/resource/sfari-base/).
The SPARK data is accessible as follow:
SPARK_iWGS_v1.1 includes whole genome data from 12519 individuals from 3417 families.
The SSC data is accessible as follow:
SFARI_SSC_WGS_2a includes whole genome data from 6383 individuals from 2274 families.
Competing Interests
The authors declare no competing interests.
Acknowledgements
We are grateful to all of the families at the participating Simons Simplex Collection (SSC) sites, as well as the principal investigators (A. Beaudet, R. Bernier, J. Constantino, E. Cook, E. Fombonne, D. Geschwind, R. Goin-Kochel, E. Hanson, D. Grice, A. Klin, D. Ledbetter, C. Lord, C. Martin, D. Martin, R. Maxim, J. Miles, O. Ousley, K. Pelphrey, B. Peterson, J. Piggot, C. Saulnier, M. State, W. Stone, J. Sutcliffe, C. Walsh, Z. Warren, E. Wijsman). We are grateful to all of the families in SPARK, the SPARK clinical sites and SPARK staff. We appreciate obtaining access to genetic data on SFARI Base. Approved researchers can obtain the SSC population dataset described in this study by applying at https://base.sfari.org. Approved researchers can obtain the SPARK population dataset described in this study by applying at https://base.sfari.org. We appreciate obtaining access to recruit participants through SPARK research match on SFARI Base. We thank Dr. Yufeng Shen (Columbia University) for his insightful comments on the analysis strategies. The study is supported in part by a Foerderer grant, NIH grant HG013031 and the CHOP Research Institute.