Abstract
Autism spectrum disorder (ASD) is a neurodevelopmental disorder that affects about 1 in 55 children worldwide, imposing enormous economic and socioemotional burden on families and communities. Genetic studies of ASD have identified de novo copy number variants (CNVs) and point mutations that contribute significantly to the genetic architecture of ASD, but the majority of these studies were conducted in outbred populations, which are not ideal for detecting autosomal recessive (AR) inheritance. However, several studies of ASD in consanguineous populations point towards AR as an under-appreciated source of ASD variants. Here, we used trio whole exome sequencing to look for rare variants for ASD in 115 proband- mother-father trios from populations with high rates of consanguinity, namely Pakistan, Iran, and Saudi Arabia. We report 87 candidate sequence variants, with 57% biallelic, 21% autosomal dominant/de novo, and the rest X-linked. 52% of the variants were loss of function (LoF) or putative LoF (splice site, stop loss), and 47% nonsynonymous. Our analysis indicates an enrichment of biallelic variants. These include variants in genes previously reported for AR ASD and/or intellectual disability (ID) (AGA, ASL, ASPA, BTN3A2, CC2D1A, DEAF1, HTRA2, KIF16B, LINS1, MADD, MED25, MTHFR, RSRC1, TECPR2, VPS13B, ZNF335), and 32 previously unreported candidates, including 15 LoF or splice variants, in genes such as DAGLA, EFCAB8, ENPP6, FAXDC2, ILDR2, PKD1L1, SCN10A, and SLC36A1. We also identified several candidate biallelic exonic loss CNVs. The enrichment shown here for biallelic variants confirms that the genetic architecture for ASD among consanguineous populations is different to non-consanguineous populations. However, as has been shown in many recessive disorders, the genes reported here may also be relevant to outbred populations.
Introduction
Autism spectrum disorder (ASD) is characterized by deficits in social communication, also repetitive and restricted behaviours and interests. Apart from a small percentage of individuals who recover (1), ASD is a life-long condition and only about 20% have good outcome as reported in a recent meta-analysis of longitudinal studies (2). ASD is quite heterogeneous in terms of level of intellectual functioning, co-morbid psychiatric and behavioural problems, and these factors play an important role in the long term outcome of the diagnosis (3). In a systematic review its prevalence was estimated to be 0.76% internationally (4). In another systematic review of international data, the median estimate was 62/10,000 (5). In this review, the prevalence of ASD showed little variation by geographic region, ethnic, cultural and socioeconomic factors (5).
The reported prevalence of ASD has increased with time. In the 1960s it was estimated to affect as few as 1 in 10000 individuals (5), prevalence studies from the 1980s suggested that as many as 72 in 10000 individuals had ASD, rising to 1% in the 2000s (5, 6) More recent studies report prevalence rates of more than 2%. (7–9). In the United States, the prevalence across eleven sites in 2012 was 14.5 per 1,000 (one in 69) children aged 8 years (10) and in 2016 it increased to 18.5 per 1,000 (one in 54) children (11). This increase at least partly is because of changes in diagnostic criteria, reporting practices, increased awareness (12–15). For example, a Swedish study showed that although ASD diagnoses became more common over time among individuals born between 1992 and 2002, the underlying level of autistic traits did not (9).
Studies that examined the differences between individuals with and without ASD on symptom and cognitive measures indicated a decrease in the differences between the two groups (15, 16). However, in a recent study from Denmark only a third of ASD diagnoses could be explained because of broadened diagnostic criteria (13). The search for alternative explanations for the increase in the prevalence includes the possibility of a bigger contribution of the environmental factors that might have become more common with time and have caused a genuine increase in the numbers of affected individuals (17). Many of the reported environmental exposures associated with ASD, such as air pollution exposure, paternal age, and maternal psychotropic medication use during pregnancy tend to occur during the prenatal period (18, 19). This has been further examined in a longitudinal twin study where there was an increase in the ASD traits while the contribution of environmental factors had not changed (20).
Describing the genetic architecture of ASD is the first step towards understanding the pathogenesis of the disorder. There has been significant progress to date, and in genetic studies of ASD it has been shown that de novo copy number variants (CNVs) and point mutations play a prominent role, however, the majority of these studies have been conducted in outbred populations (21)
A systematic review of published literature from South Asian countries reported a prevalence of ASD in the range of 0.09-1.07 % (22). To date, there has been no community study of prevalence from Pakistan. Studies from child psychiatry clinics reported rates of 2.4% and 3.2%, 4.5%, 5% (23–26). Clinical services for ASD are available but limited to big cities. There are special schools in every big city for children with ASD (27). The literature on clinical presentation is limited. One study with a small sample size shows symptoms consistent with studies elsewhere in the world (28).
Several projects have highlighted recessive inheritance as an important component of the genetic architecture of ASD. A large study of consanguineous versus non-consanguineous families in India concluded that consanguinity increases risk for ASD with an odds ratio of 3.22 (29). Morrow and colleagues used SNP microarrays to map homozygous loci in 104 small ASD families from the Middle East, Turkey and Pakistan, finding homozygous deletions implicating SLC9A9, PCDH10, CNTN3 and others (30). Research in outbred populations also support AR inheritance as an important piece of the genetic puzzle for ASD. For example, previous work has estimated that loss-of-function (LoF) recessive mutations contribute 3% of ASD genetics in two US-based case-control cohorts (31). These findings are not limited to population isolates or ethnic subgroups (32, 33). Consanguineous marriages lead to a marked increase in the frequency of rare recessive disorders (34, 35). Many genetic variants will only be pathogenic in recessive form; this includes variants for ID, and almost certainly for ASD too. Identification of recessive genes in outbred populations is problematic, as the analysis pipelines for WES/WGS data are non-optimal for discovering compound heterozygous mutations. Populations with a high proportion of consanguinity have been important for describing autosomal recessive genes in ID in Pakistan (36, 37), Iran (38, 39), Syria (40), and Saudi Arabia (41), yet, to date, relatively few reports of AR genes underlying ASD in consanguineous populations have been published (30,42,43). Despite the higher prevalence of recessive inheritance in consanguineous populations, data from studies of developmental disorder (DD) suggest that de novo variants are also prominent, albeit in a lower proportion e.g. in a UK study of 6040 families from the Deciphering Developmental Disorders (DDD) study, individuals of Pakistani ancestry had 29.8% de novo compared with 49.9% in the European ancestry UK population (44). In an Iranian ID study, the de novo rate was 27.86% (45).
Here, we report on a study of genetic and genomic variants in a cohort of 115 ASD trios from three countries with a high frequency of consanguineous marriages (Pakistan, Iran, and Saudi Arabia). Of the candidate sequence variants we report (not including CNVs), the proportion that are de novo is ∼24%, similar to the rates reported among Pakistani ancestry DD individuals in the UK and in ID individuals in Iran. We report 57% of the candidate sequence variants identified in the study are AR. This is somewhat higher than the 31% AR reported for Pakistani ancestry subjects in the UK DDD study (3.6% for those of European ancestry). Of the 28 sequence variants identified in genes previously linked to neurodevelopmental disorders, 25% were de novo, and 55% AR. This compares well with rates in a study of Iranian ID families (28% de novo, 69% AR) (45).
Materials and Methods
Trio Family Ascertainment
Institutional Research Ethics Board approval was received for this study through the Centre for Addiction and Mental Health (CAMH) and other institutional recruiting sites. Twenty-seven trios were ascertained by Dr. Sasanfar by the Children’s Health and Evaluation project (CHEP) sponsored by the Special Education Organization (SEO), as described previously (46, 47).
Children in the cohort were between ages 4 and 11, and received a clinical diagnosis of ASD by a child psychiatrist. All parents from Iranian cohorts filled out the Social Communication Questionnaire (SCQ) (48), using the Diagnostic and Statistical Manual of Mental Disorders (fourth edition-text revision (DSM-IV-TR) (49). The diagnosis of ASD was confirmed in all subjects using ADI-R (50) and ADOS (51) by a certified physician. Another 13 trios were ascertained through clinical services in Iran: Dr. Bita Bozorgmehr clinically assessed 10 trios and Dr. Abolfazl Heidari assessed three trios. Fifty-four Pakistani trios were recruited from the University of Health Sciences and Children Hospital, Lahore, Pakistan. Pakistani ASD trio probands were assessed using the Childhood Autism Rating Scale (CARS)(52). The clinical assessment has been performed by Drs. Ansa Rabia Saqib Mahmood, and Shazia Maqbool. A further eight Pakistani trios were recruited through COMSATS University, Islamabad, by Drs Raheel Qamar and Zehra Agha, with ascertainment through the National Rural Support Program and the Ghazi Barotha Taraqiata iDara. Diagnosis of ASD was performed using combined DSM-5 (53) and ICD-10 criteria (54), after assessment using CARS (52). The 13 Saudi Arabian trios were provided by Dr. Laila Al Ayadhi through the Autism Treatment Centre, King Saud University, Riyadh, Saudi Arabia, and included subjects aged 2-12 years old who had a diagnosis of ASD confirmed using the ADI-R (50), Autism Diagnostic Observation Schedule (ADOS) (51) and 3DI (Developmental, dimensional diagnostic interview) scales (55). Disease severity was also measured using the Childhood Autism Rating Scale (CARS)(52). Saudi subjects were excluded if they had dysmorphic features, or diagnosis of fragile X or other serious neurological (e.g., seizures), psychiatric (e.g., bipolar disorder) or known medical conditions.
Overall, we have sequenced 115 trios: 62 trios from Pakistan, 40 trios from Iran, and 13 trios from Saudi Arabia. In total, we have sequenced 345 samples from the different institutions, with collaborator contributions listed in Table 1.
Whole Exome Sequencing, Alignment and Variant Calling
Whole Exome Sequencing (WES) was performed using the Thruplex DNA-Seq (Rubicon Genomics) Library Preparation Kit with the Agilent SureSelect V5 Exome Capture kit. Samples were sheared using the Covaris ME220 Focused-Sonicator to ensure that the DNA was sheared to 200 bp. Samples were then analyzed using the Agilent 2100 Bioanalyzer System for fragment length distribution and quantification. Library preparation and adapter ligation was performed using the Thruplex DNA-seq kit. All trios, representing the proband, mother and father, were sequenced on the Illumina HiSeq 2500 or the Illumina NovaSeq sequencing system.
After sequencing, FastQC was used to assess the quality of sequences and determine the adapter contamination and read quality. Unligated adapters and low quality sequences were removed using FastX Toolkit (Updated Aug 4th 2017: https://github.com/agordon/fastx_toolkit) (56) and Trimmomatic (v 0.33: Updated Mar. 10th, 2015: https://github.com/timflutre/trimmomatic) (57) . Cleaned fastq files were then aligned to the hg19 reference genome using BWA 0.7.17 (58) and variants were called using the GATK best practices for variant calling and quality control (59). Briefly, after alignment, duplicate reads were removed using Samtools version 1.4 (60) and the GATK v 3.8 (61) was used for base recalibration, indel realignment and then variant calling using the Global Haplotyper (61). This created raw Variant Call Format (VCF) files. The variants called per sample were then checked for their transition-transversion ratio which was approximately 2.5.
Annotation and Variant Prioritization
VCF files were annotated using ANNOVAR (62), integrating allele frequencies, from gnomAD (63), The Greater Middle Eastern Variome Server (64) and ExAC (65), as well as functional pathogenicity scores from Polyphen2 (66), SIFT (67) and MutationTaster (68), MutationAssessor (69), FATHMM (70), and CADD (71). Variants were assessed, firstly, if they passed the quality metrics as set by the Convolutional Neural Network scoring algorithm by GATK (61). Briefly, variants were compared against known variants from dbSNP and a convolutional neural network was trained to filter out poor variant calls based on several metrics (61) and minor allele frequency (MAF) in gnomAD (https://gnomad.broadinstitute.org/) of equal to or lower than 1 x 10-3 for autosomal recessive, or 1 x 10-5 for autosomal dominant/de novo, with a preference for variants that had no homozygotes in the gnomAD non-Neuro Cohort. For autosomal recessive variants, as the MAF used was on the conservative side, reanalysis was also performed using less stringent values (1 x 10-3, 1 x 10-2), to see if any variants in known ASD/ID genes were missed under the stringent analysis. Variants were then split into either homozygote or heterozygote categories and then prioritized by functional impact of mutation type. For example, LoF variants would carry more weight than missense variants leading to assessing stopgain, frameshift, splicing, and missense variants, in that order. The LoF mutations were given highest priority as it was assumed that LoF variants will disrupt protein function more severely and, depending on the position within the transcript, cause nonsense mediated mRNA decay (NMD). The variants were then assessed for their functional pathogenicity score, where variants that met at least two functional prediction scores between Polyphen2 (66), and MutationTaster (68) and CADD (71) were further investigated. Special attention was given to splicing with additional functional annotation scores being used for splicing events with the addition of FATHMM (70) and dbscSNV (72). Due to the large number of missense variants for autosomal dominant/de novo inheritance, stricter criteria were used where a variant would only be considered if it was predicted as pathogenic by all software used including Polyphen2 (66), MutationTaster (68), MutationAssessor (69), FATHMM (70), CADD(71) and M-CAP(73). The overall prioritization method is outlined in Figure 1. Variants were also analyzed to determine if they are in genes associated with any previous studies reporting association with ASD or ID. Variants in genes in pathways such as neuron growth and guidance were also considered as a higher priority.
Finally, variants were Sanger-sequenced and segregation checked with both parents for validation.
Splice site annotation
Standard annotation software has been noted to be suboptimal for splicing variants and cryptic splice sites (74). To improve detection of splicing variants, particularly in neurodevelopmental disorders, SpliceAI was used to splicing variants or cryptic splice sites (75). All VCF files were run through the standard SpliceAI algorithm as per documentation and filtered for variants with a SpliceAI score greater than 0.5. Once variants were extracted, they were annotated with ANNOVAR (76) to add additional meta data for biological interpretation. Variants were then binned into heterozygous, X-linked and homozygous and filtered by gnoMAD (77) allele frequency (MAF < 0.01). Variants that survived filtering were then investigated for appropriate inheritance patterns (i.e. de novo for heterozygous variants and both parents contributing one allele for homozygous variants).
Copy Number Variant (CNV) analysis
Microarrays were run using DNA from 103 ASD probands from the 115 trios (where sufficient DNA was available), including 19 using Affymetrix CytoScanHD, and 84 using Illumina CoreExome. CNV analysis for the CytoScanHD arrays was performed using ChAS and PennCNV; for CoreExome using Illumina Genome Suite/CNVpartition and PennCNV. Evidence of homozygous loss CNVs was cross-references with WES data, using the Integrated Genome Viewer (IGV: https://software.broadinstitute.org/software/igv/; (78), and CNVs corroborated in this manner were then checked by PCR.
Also, for WES data we performed CNV analysis using CLAMMS (79), XHMM (80) and CoNIFER (81). The analysis of exome sequencing data for CNVs has drawbacks due to the capture technology used. A majority of these methods employ a read based exome calling strategy after normalizing for the depth of sequencing across different regions. CLAMMS was used because it offered the ability to create subsets of the samples and construct more specific references through principal component analysis (PCA) of sequencing metrics. All softwares used were run according to standard procedures described on their respective documentation. Filtering was done for quality metrics outlined in the documentation. Once CNVs were called they were also flited against the gnomAD structural variation call set (82) and the Database of Genomic Variants (DGV: http://dgv.tcag.ca/dgv/app/home). CNVs that overlapped more than 50% with gnomAD or DGV SVs were filtered out. Once the final call set after filtration was created, CNVs were then binned into homozygous, heterozygous or X-linked variants and further validated with qPCR to check whether they followed expected inheritance patterns for their respective zygosity. For instance, homozygous CNVs in the proband would have to have both parents that were heterozygous, and heterozygous proband variants would have to be de novo. X-linked variants were also further validated in males, as they lack a second copy of the X chromosome and are thus more likely to be damaging. CNV validations were also performed for family members, to confirm the inheritance pattern of candidate variants.
Cross-referencing with other datasets
In order to evaluate the strength of candidacy of the variants/genes identified here or to find supporting evidence, we attempted to cross-reference with other available datasets. We attempted where possible to identify rare variants in the genes we report here (Tables 2-4) that were either: 1. De novo, 2. Heterozygous LoF but status either de novo or unknown; 3. Biallelic (i.e. homozygous); 4. Putative biallelic (possibly compound heterozygous, but phasing unknown); 5. X-linked in males, maternally inherited. Data are shown in Supplementary Tables S1 to S3.
MSSNG: More than 10,000 individuals from the Autism Genetic Research Exchange (AGRE) repository and other cohorts have been whole-genome sequenced, through Autism Speaks, and data made available through the MSSNG database (https://research.mss.ng; (83). MSSNG data was accessed in April 2020.
SFARI: SSC biallelic: The Simons Simplex Collection (SSC) includes 2600 simplex autism or related developmental disorder families. Whole exome sequence data is available for ∼2,500 of these families, biallelic variants for our gene list were searched using the GPF browser (https://gpf.sfari.org/gpf19/datasets/SSC/browser). Putative compound heterozygous variants criteria for minor allele frequency (MAF) in SSC exome, gnomAD exome and genome frequencies, of <0.001, or, for missense variants, MCP scores >1 and CADD scores >18. SVIP biallelic: the autism Simons Variation in Individuals Project (SVIP) dataset was searched for putative biallelic variants through the GPF browser (https://gpf.sfari.org/gpf19/datasets/SVIP/browser). Putative compound heterozygous variants checked for criteria including minor allele frequency (MAF) in SSC exome, gnomAD exome and genome frequencies, of <0.001, or, for missense variants, MCP scores >1 and CADD scores >18.
Deciphering Developmental Disorders (DDD) study: includes sequence data for ∼14,000 children; this dataset was searched for variants in genes overlapping with our study through the known developmental genes list (https://decipher.sanger.ac.uk/ddd#ddgenes), as well as the list of research variants (https://decipher.sanger.ac.uk/ddd#research-variants), which are 2,723 variants of unknown significance from 4,293 children. For de novo variants, all are constitutive, unless specifically noted as mosaic here. CNV variants are not included here.
Autism Sequencing Consortium: genes with de novo variants, either in the control set or case set, from the Autism Sequencing Consortium (ASC) exome analysis browser (https://asc.broadinstitute.org/). This dataset lists numbers of de novo protein truncating variants, as well as de novo missense variants with MCP scores either 1-2 or ≥2. The ASC dataset includes 6,430 probands, and includes the SSC data.
DeRubeis: genes with de novo variants reported in the DeRubeis et al, 2014 study, with either LoF or missense variants (84). As subject IDs were not reported, overlap with other studies recorded here is not possible.
AutismKB: additional studies where variants have been reported for the genes are taken from the Autism Knowledge Base (http://db.cbi.pku.edu.cn/autismkb_v2/quick_search.php), (85).
Neuroanatomical Enrichment Analysis
The Allen Adult Human Brain Atlas (86), Brainspan (87), and a single-cell atlas of the mouse nervous system(88) were used to test for neuroanatomically specific expression. For each gene, expression was standardized across regions or cell-types. For these compartments, each gene was then ranked from most specific to most depleted. The area under the receiver operating statistic (AUC) was used to quantify specific expression for the genes of interest within a region or cell cluster. The Mann-Whitney U test was used to test statistical significance, and the Benjamini–Hochberg procedure (89)was used to correct for multiple tests.
Results
Our study used whole exome sequence data for a total of 115 proband/mother/father trios where the proband had a diagnosis of ASD. These included 62 from Pakistan, 40 from Iran, and 13 from Saudi Arabia. Of the probands, 91 were male, 24 were female (Table 1).
Whole exome sequencing
Of the 115 trios, 18 were found to have an autosomal biallelic mutation in a gene previously reported for ASD, ID or other neurodevelopmental disorder (see Table 2). Of these, eight were loss-of-function (LoF), or putative LoF (e.g. canonical splice site) mutations. De novo autosomal or X-linked mutations in known or previously reported ASD, ID, or other neurodevelopmental disorder were identified in a further seven trios (see Tables 3 and 4). A known de novo Rett syndrome-associated mutation (NM_004992: c.763C>T; p.Arg255*) was identified in one trio.
Biallelic mutations
We identified mutations in several genes that have been previously reported to be associated with autosomal recessive non-syndromic ID, such as TECPR2 (41), MTHFR (40), MADD (41), BTN3A2 (41), LINS1 (39), CC2D1A (MIM 608443; MRT3; (90)), RSRC1 (91, 92) and ZNF335 (38), and a number known for syndromic or metabolic forms of autosomal recessive ID, including VPS13B (Cohen syndrome), AGA (aspartylglucosaminuria), ASL (arginosuccinic aciduria), HTRA2 (3-methylglutaconic aciduria) ASPA (aspartoacylase deficiency/Canavan disease), and MED25 (Basel-Vanagaite-Smirin-Yosef syndrome). CC2D1A, VPS13B, and DEAF1 have also previously been associated with ASD or autistic features (93–95). Biallelic mutations in DEAF1 are known to cause neurodevelopmental disorder with hypotonia, impaired expressive language, with or without seizures (MIM 617171). LoF mutations were reported for new candidate genes HRNR, ILDR2, SCN10A, DDIT4L, SLC36A1, FAXDC2, GIMAP8, DAGLA, B4GALNT1, VPS16, and EFCAB8, and potentially LoF (such as canonical splice site mutations) in DENND1B, LRRC34, PKD1L1, CNPY4, and PPP1R36. VPS16 has previously been reported for autosomal dominant dystonia (MIM 608559), B4GALNT1 for AR spastic paraplegia (MIM 609195), and PKD1L1 for visceral heterotaxy 8 (MIM 617205). Heterozygous missense and LoF mutations in SHH are known to be involved in autosomal dominant holoprosencephaly 3 (HPE3; MIM 142945), and so it is somewhat surprising to find a homozygous missense variant associated with a milder phenotype with no obvious HPE3-related dysmorphic features. The Asn69Ile variant in SHH substitutes an asparagine for isoleucine at a residue that is conserved across vertebrates, is predicted to be damaging, and is found in heterozygous form in only 6 out of 125,748 gnomAD exomes.
We report a homozygous missense variant in SGSM3. Previously de novo variants in SGSM3 have been reported for ASD (96), and SFARI lists this an ASD gene with a score of 3 (https://gene.sfari.org/).
De novo mutations
De novo dominant mutations have been shown to be a major causal factor in ASD etiology. In this set of 87 identified WES variants from 115 trios, 21 variants were identified as de novo variants (18 autosomal and 3 X-linked). Among the de novo variants identified, variants in three of the genes have been identified in previous studies (MECP2, SCN2A, MYT1L, ZNF292). 12 were LoF variants in the genes MECP2, TMIGD3, ECM1, SLAMF7, MYT1L, SCN2A, NCL, FAM53C, ZNF292 (2 trios), ATP2B1, and RETN (Tables 3 and 4). Four were putative LoF (splice site or stop loss) variants, including ADGRF2, DGKZ, CLEC4E, and PPIL2, and the remainder were nonsynonymous. There were also several other genes that were listed as de novo variants that have not been associated with ASD (or ID) previously. These variants have met the criteria for filtering as described in the Materials and Method, and may represent novel putative variants for ASD.
ZNF292 (Mirzaa et al, 2020(46))
Our study identified a de novo LoF mutation in ZNF292 (NM_015021:c.6159_6160del; p.Glu2054Lysfs*14) in two trios (IAU-65 (Iranian) and Autism-10 (Pakistani). These two trios and 26 additional families (from multiple studies and cohorts) with mutations in ZNF292 were reported recently (97). Out of 28 families reported in the study, the mutations were de novo in all but one, in which the mutation was inherited in dominant fashion. The clinical features associated with ZNF292 mutations spanned a range of severities and diagnoses, including ASD and ID.
X-linked
A number of mutations were identified in well-established ASD or ID genes, for instance MECP2 (Rett syndrome: MIM 312750), AIFM1 (Cowchock syndrome; MIM 310490), MID2 (XLID101: MIM 300928). For other trios there are mutations in genes that have been implicated in other recent studies of ASD and/or ID. For instance, we report a hemizygous nonsynonymous mutation in CLTRN- hemizygous mutations in CLTRN have recently been reported in several males with neutral aminoaciduria accompanied with autistic features, anxiety, depression, compulsions and motor tics (98)- a clinical picture reminiscent of the autosomal recessive Hartnup disorder (MIM 234500). For other trios there are variants in genes linked with other nervous system disorders, such as DRP2 in Charcot-Marie-Tooth disease (99).
Splice site variants
Eight of the biallelic autosomal mutations (16%), four of the de novo autosomal variants (22%), and two of the X-linked recessive variants (11%) were at canonical splice positions. In addition to detection of variants at canonical splice donor and splice acceptor sites, use of SpliceAI suggests that, for the biallelic nonsynonymous variant in the known ASD/ID gene ASL in trio SA7, an alternative explanation could be the generation of an alternative splice donor by the cytosine to thymine transition at this site. However, other splice algorithms such as https://www.fruitfly.org/seq_tools/splice.html do not support this, and molecular experimental procedures may be needed to corroborate or refute this prediction.
Overlap with studies in non-consanguineous populations
Although not focused on consanguineous populations, WGS data from the MSSNG study, as well as WES studies such as DDD and SFARI, shows possible support for several of the candidate AR genes, with possible compound heterozygous damaging rare variants in DNAH8 (in three MSSNG individuals), CLCA4, ANO10, SCN10A, and WDR90 (see Supplementary Table S1). There was also evidence of disease-causing biallelic variants in known neurodevelopmental genes MTHFR (one homozygous), AGA (one homozygous), DEAF1 (one homozygous), ZNF335 (one homozygous, one putative compound heterozygous), and VPS13B (six homozygous, nine putative compound heterozygous). There was also support for new candidate dominant/de novo autosomal genes, including ECM1, SLAMF7, NCL, DGKZ, ATP2B1 (5 individuals), CBFA2T3, RETN, and PPIL2 (see Supplementary Table S2). There was strong support for X-linked recessive (in males, maternally inherited or de novo) candidate gene CNGA2 (16 individuals), GUCY2F (13 individuals), SAGE1 (eight individuals), as well as support for EGFL6 (six individuals), NRK (six individuals), PIR (five individuals), MSN (five individuals), ACTRT1 (five individuals), TRPC5 (four individuals), and ZNF185 (two individuals).
Copy Number Variants
A number of autosomal biallelic loss CNVs were identified using microarray data, which were then validated through observation of WES reads using IGV, and through molecular methods, using PCR and/or Sanger sequencing. Of these implicated genes, DHRS4 and KLK15 both have supportive evidence of neurobehavioral phenotype in mouse models (Supplementary Table S5). DNAH7 shows strong transcription in brain tissues, and both SHPK and WDR73 have high protein levels in cerebellum. Of the biallelic CNVs identified, WDR73 has previously been associated with the autosomal recessive Galloway-Mowat syndrome 1 (MIM 251300), which includes microcephaly, delayed psychomotor development and cerebellar atrophy. A biallelic WDR73 mutation has also been reported for ID (37). In other ASD or ID genome or exome datasets, a homozygous stop-gain and two homozygous missense mutations in SIRPB1 were found in affected individuals in the MSSNG data (1-0157-003: Chr20: 1585484C>A; NM_001135844.2:c.655G>T:p.Glu219*; AU3915301: Chr20:1584614T>G; NM_001135844.2:c.926A>C:p.Tyr309Ser; 3-0410-000: Chr20:1585517T>A; NM_001135844.2:c.622A>T:p.Ile208Phe). None of these variants are present in the gnomAD v3.1.2 control database. For CYP2A7, the MSSNG data affected individuals included a homozygous frameshift insertion (1-1093-003: Chr19: 41383105dupA; NM_000764.2: c.1151dupT: p.Leu385Profs*3) and two with homozygous missense mutations (both AU4302301 and AU008505: Chr19: 41382543C>T; NM_000764.2:c.1192G>A: p.Val398Met), neither variant present in gnomAD. While there is evidence of brain expression for SIRPB1, expression of CYP2A7 appears to be restricted to liver and testes (Supplementary Table S9). However, the WRD73, CYP2A7, and KLK15 CNVs were all identified in a proband with a biallelic splice mutation in the Canavan gene ASPA, the SHPK CNV is in an individual with a biallelic LoF mutation in known ID gene DEAF1, and the SIRPB1 CNV is in an individual with a biallelic non-frameshift deletion in known gene B4GALNT1, and are thus less likely to be the main causative variants in these individuals (see Supplementary Table S4). A biallelic CNV loss implicates the gene SEMG1, however expression of this gene is restricted to the seminal vesicle (see Supplementary Table S9), and thus unlikely to be related to the ASD phenotype.
There are also three large, multi-genic de novo CNV losses among the trios, which may also be pathogenic (Table 6).
Neuroanatomical Enrichment Analysis
Testing for regional and cell-type specific expression did not indicate clear anatomical targets with higher expression of the candidate genes. Enrichment in the cerebellum and cerebral cortex was observed in the human developmental atlas, but this was not supported by either the mouse or adult human atlases. We found no consistent neuroanatomical expression pattern for the identified genes, suggesting heterogeneity of neural circuits disrupted.
Discussion
Many recent studies involving NGS in ASD have involved large cohorts, focusing predominantly on dominant/de novo inheritance. This is largely driven by the fact that autosomal recessive variants in novel genes are more difficult to identify in the outbred population, and informatics pipelines are not sufficiently tuned to identify compound heterozygote variants. Compound heterozygous variants would need to have variants on opposite alleles, and in order to identify them would require informatics pipelines that focus on multiple damaging heterozygous variants in the same gene in a proband, and comparing with genotypes from the maternal and paternal WES or WGS data. This requires extra care when analyzing variants. The use of genomewide association studies (GWAS) to identify variants is also inadequate for finding recessive variants because the sample size required would be significantly larger than current Psychiatric Genomics Consortium studies to have sufficient power to find biallelic variants. Also, currently most ASD genetics research emanates from North American and European countries, where larger families, which are optimal for recessive variant mapping are not as readily available.
Using whole exome trio analysis in consanguineous families, we have enriched for recessive variants to assess the role these variants play in ASD in populations where endogamy is common. This study of 115 trios has identified WES variants for 28 trios in genes previously associated with ASD or other neurodevelopmental disorders, resulting in a diagnostic yield of 24%. In our ASD cohort, 18 recessive, five de novo, and five X-linked known genes. With a further three trios with large, multi-genic, loss CNVs validated experimentally as de novo and thus likely pathogenic (Table 6), this yield increases to 27%. There were many other variants identified that met filtering criteria which could potentially represent novel ASD genes or targets. Cumulatively, 18 candidate autosomal de novo variants were identified in the 115 trios and 50 candidate autosomal recessive variants identified in this cohort. Within this cohort we identified variants in five genes that are associated with known metabolic syndromes (4.3% of the diagnosis cohort): AGA, ASL, HTRA2, ASPA, and MTHFR. These genes may represent better clinical management opportunities for patients and potentially better therapies.
We identified a biallelic nonsynonymous mutation in calcyphosine-like protein gene, CAPSL. Biallelic mutations have previously been reported for ID in both of the calcyphosine–related genes, CAPS (36) and CAPS2 (41). Other than being calcium-binding EF-hand domain proteins, remarkably little is currently known about the role of calcyphosine. EFCAB8, in which a bialleic nonsense mutation was identified in trio SMPA9, is also a calcium-binding EF-hand domain protein. Given the possible involvement of all three calcyphosine related genes in neurodevelopmental disorders, further investigation into this biological pathway is clearly warranted.
A number of the ID genes identified here among ASD trios have also previously been reported for ASD or ASD-like features, e.g. CC2D1A (94, 100), VPS13B (43), DEAF1 (93), ZNF335 (101), ZNF292 (97,102,103), MYT1L (103, 104), SCN2A (96,105,106), and PLXNA3 (107, 108). Of the candidate genes identified here, MSSNG, SFARI, and other datasets provide putative support (although it is not possible to determine whether two reported mutations, i.e. putative compound heterozygotes, in the same gene are on the same alleles or not, unless occurring within the same sequence read) for the biallelic genes DNAH8, SCN10A, CLCA4, ANO10, WDR90 (Supplementary Table S1), and for de novo/dominant genes NCL, SLAMF7, ADGRF2, DGKZ, ATP2B1, CBFA2T3, RETN, and PPIL2 (Supplementary Table S2).
There are two genes identified here for which multiple trios from our cohort with mutations were identified: ZNF292 and SCN2A. De novo mutations in ZNF292 in were found in two families, including two frameshift deletions (Autism-10 and IAU65). These two trios, one from Pakistan, the other from Iran, and the frameshift mutation they share, were reported previously in Mirzaa et al, 2020 (97). ZNF292 encodes a zinc finger protein, a transcription factor that binds to the promoter of the growth hormone gene, GH (109). Mutations in the known ASD gene SCN2A (105) were identified in two trios-one de novo and the other paternally inherited. In addition, a biallelic LoF mutation was identified in another voltage-gated sodium channel member (SCN10A), in which heterozygous gain of function missense mutations have previously linked to familial episodic pain syndrome 2 (FEPS2; MIM 615551)(110).
For some of the autosomal recessive candidate genes, there is additional support from animal models. Rasal2 (MGI:2443881) and Vps16 (MGI:2136772) knockout mice have behavioral/neurological and nervous system phenotypes (http://www.informatics.jax.org/; Supplementary Table S5). Note that VPS16 is a vesicle mediated trafficking protein similar to VPS13B. Homozygous knockout of Ephb1 results in impaired contextual and cued conditioning, as well as abnormal freezing behaviour, and homozygous knockout of Slc36a1 results in embryonic growth retardation, decreased freezing behaviour, and preweaning lethality (www.mousephenotypes.org; Supplementary Table S5). Biallelic knockout of Dagla results in decreased brain size, hypoactivity, abnormal behaviour, and decreased thigmotaxis (Supplementary Table S5). In general, there is a high rate of neurodevelopmental phenotypes in knockout mice available for the genes with biallelic variants (∼70%), and in biallelic or hemizygous knockouts for the X-linked genes (∼58%). However, it is a remarkably contrasting story for the de novo/dominant variants, for which there is little or no support by way of neurodevelopmental or behavioural phenotypes in heterozygous mouse models (0%). Moreover, where there are biallelic knockout mouse models for genes from the de novo/dominant set, the phenotypes described are mainly unrelated to CNS and behaviour (<10%; Supplementary Table S5). This would suggest that the autosomal biallelic (and X-linked) mutations among our cohort are much more likely to be etiopathologically relevant to ASD.
There are a number of trios for which there are two or more candidate variants (Supplementary Materials Table S4). For some of these, there may be a variant that clearly delivers a more plausible etiopathological explanation. For instance, trio SMPA3 carries a biallelic nonsynonymous change in the gene MTHFR, mutations in which have been reported to cause recessive homocystinuria and a known cause of neurodevelopmental and psychiatric disorders. For the same trio, there is a biallelic nonsynonymous change in RASAL2, and a maternally inherited hemizygous LoF mutation in MAGEB2, as well as a biallelic loss CNV encompassing PCDHA9 and 10. The likelihood of disease due to mutation in MTHFR thus diminishes the candidacy of alternative RASAL2, MAGEB2 and PCDHA9/10. Similarly, for trio IAH2, a biallelic LoF in TECPR2 – a gene which has been reported for autosomal recessive spastic paraplegia 49 (SPG49; MIM 615031) as well as for autosomal recessive ID (41) and hereditary sensory neuropathy with ID (111) – also lessens the case for the PCDHA9/10 biallelic loss CNV. For trio IABB2, the biallelic LoF mutation in DEAF1 – a gene associated with an AR neurodevelopmental disorder with hypotonia, impaired expressive language, with or without seizures (NEDHELS; MIM 617171) – seems a highly probable candidate, and thus diminishes the possible involvement of biallelic loss CNVs at UBE2U and SHPK. The initial report for DEAF1 was, in fact, in a consanguineous Omani family with autism, ID and epilepsy (112). A biallelic splice mutation in the Canavan disease (a severe neurodevelopmental disorder; MIM 271900) gene, ASPA, is the likely cause in trio IABB14, making candidate CNVs/genes WDR73, KLK15, and CYP2A7 more likely to be benign.
For other trios with two or more variants reported, the decision process was not as clear-cut or unequivocal. Seven de novo autosomal mutations reported for trio RQPA20 were all validated by Sanger sequencing, however, none of these candidate genes are particularly convincing in terms of the biology, support from other human disease studies, or support from mouse models. However, expression of the genes DGKZ and CBFA2T3 are associated with numerous electrophysiological phenotypes (113)(Supplementary Table S6), and show high transcription and protein expression in brain regions, which could warrant further investigation into the effects of the variants identified.
Since our initial methodology used relatively stringent MAF cut off criteria for biallelic variants (<0.001), we attempted to justify this level by reiterating the analysis but with lower thresholds (<0.01). Using the more relaxed criteria resulted in just two additional candidate variants, in the genes CLCA4 (gnomAD South Asian MAF=1.72E-3), and RASAL2 (gnomAD South Asian MAF=6E-3), but none in known neurodevelopmental genes. Our methodology was particularly stringent in regard to de novo nonsynonymous variants, owing to the large number of likely spurious calls, and the inclusion of only variants predicted as damaging by all algorithms used may have excluded some bona fide mutations. Other variants, particularly intronic or intergenic, would also likely be missed by our approach. Also, a proportion of exons, and particularly in GC-rich regions, are either missed or have low coverage in WES. Follow up with whole genome sequencing could be considered as a possible next step. However, parsing and assessing rare non-coding variants is particularly challenging.
It was recently estimated in the DDD cohort that recessive variants make up 3.6% of diagnoses, whereas de novo variants would contribute 48.6% of variants (44). The same study also compared homozygous variants in a narrowly defined group of Pakistani Ancestry in the British Isles (PABI), which is most like our cohort (44). The PABI cohort had 356 (333 undiagnosed) probands, 110 of whom (30.9%) had biallelic coding candidate variants, approximately half of which are in known DD genes and half in novel candidates (44). For comparison, our study reports very similar yields for recessive variants, with 16% of probands with biallelic variants in known NDD genes (18/115), or 34% including those with biallelic variants in novel candidate genes (39/115). We report 24 de novo variants (18 autosomal, three X-linked, three CNVs) in 15 of 115 individuals (13%), which is somewhat lower than those reported for the PABI cohort (29.8%). This difference is most likely due to differences in the methodologies or stringency used for calling de novo variants in the respective studies. Overall, the findings demonstrate the importance of autosomal recessive mutations in ASD in countries, or in cohorts, with high rates of consanguinity, as well as identifying genes that should be examined for autozygous and allozygous mutations in outbred populations, just as would be done for more established autosomal recessive gene disorders.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Conflict of Interest Statement
The authors declare no conflicts of interest in relation to the current study.
Acknowledgements
The authors thank the families for their participation in this study. This work was supported by grants from the Canadian Institutes of Health Research to JBV (#MOP-102758 and #PJT-156402), also a National Alliance for Research on Schizophrenia and Depression (NARSAD) Young Investigator Award to NV. RH was supported by a Peterborough K.M. Hunter Charitable Foundation Graduate Scholarship. AR and AH were supported by International Research Fellowship Program scholarships from the Pakistani Higher Education Commission. CS was supported by a Summer University Research Program award from the Institute of Medical Science, University of Toronto. Computations were performed on the CAMH Specialized Computing Cluster. The SCC is funded by The Canada Foundation for Innovation, Research Hospital Fund.
The authors wish to acknowledge the resources of MSSNG (research.mss.ng), Autism Speaks and The Centre for Applied Genomics at The Hospital for Sick Children, Toronto, Canada. We also thank the participating families for their time and contributions to this database, as well as the generosity of the donors who supported this program.
DDD: The DDD study presents independent research commissioned by the Health Innovation Challenge Fund [grant number HICF-1009-003], a parallel funding partnership between
Wellcome and the Department of Health, and the Wellcome Sanger Institute [grant number WT098051]. The views expressed in this publication are those of the author(s) and not necessarily those of Wellcome or the Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South REC, and GEN/284/12 granted by the Republic of Ireland REC). The research team acknowledges the support of the National Institute for Health Research, through the Comprehensive Clinical Research Network.
Footnotes
↵§ Equal contribution
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.
- 25.
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.↵
- 104.↵
- 105.↵
- 106.↵
- 107.↵
- 108.↵
- 109.↵
- 110.↵
- 111.↵
- 112.↵
- 113.↵
- 114.
- 115.