Abstract
Neuropsychiatric disorders are highly complex conditions and the risk of developing a disorder has been tied to hundreds of genomic variants that alter the expression and/or products (isoforms) made by risk genes. However, how these genes contribute to disease risk and onset through altered expression and RNA splicing is not well understood. Here we show our current understanding of gene isoforms is far from complete and reveal the precise splicing profiles of neuropsychiatric disorder risk genes. Combining our new bioinformatic pipeline IsoLamp with nanopore long-read amplicon sequencing, we deeply profiled the RNA isoform repertoire of 31 high-confidence neuropsychiatric disorder risk genes in human brain. We show most risk genes are more complex than previously reported, identifying 440 novel isoforms and 28 novel exons, including isoforms which alter protein domains, and genes such as ATG13 and GATAD2A where most expression was from previously undiscovered isoforms. The greatest isoform diversity was present in the schizophrenia risk gene ITIH4. Mass spectrometry of brain protein isolates confirmed translation of a novel exon skipping event in ITIH4, suggesting a new regulatory mechanism for this gene in brain. Our results emphasize the widespread presence of previously undetected RNA and protein isoforms in brain and provide an effective approach to address this knowledge gap. Uncovering the isoform repertoire of neuropsychiatric risk genes will underpin future analyses of the functional impact these isoforms have on neuropsychiatric disorders, enabling the translation of genomic findings into a pathophysiological understanding of disease.
Introduction
Over 90% of multi-exonic human genes undergo alternative splicing (AS), a process that enables genes to produce multiple mRNA products (RNA isoforms) [1]. Common AS events include exon skipping, intron retention and alternative 5’ and 3’ exonic splice sites [2]. These mRNA alterations can impact the open reading frame (ORF) and/or alter post-transcription regulation and translation of an RNA, increasing both transcriptomic and proteomic diversity [1, 3, 4]. AS has been established as an important regulator of organ development and physiological functions and is highly regulated under normal conditions [5, 6]. Conversely, aberrant RNA splicing has been linked to the development of cancer, autoimmune and neurodevelopmental disorders [7–11]. AS plays an especially important role in the brain, which has a distinct splicing program including the largest number of tissue specific exons and frequent use of microexons [12]. Numerous studies have reported crucial roles for AS in brain development and dysregulation in disease [13, 14].
Neuropsychiatric or mental health disorders (MHDs) including schizophrenia (SZ), major depressive disorder (MDD), autism spectrum disorder (ASD) and bipolar disorder (BD) can carry significant morbidity for affected individuals [15]. Comorbidities, delayed diagnoses and stigma surrounding MHDs also present a significant challenge to individuals and their families [16]. However, treatment options remain limited or are not well tolerated or effective in some individuals and the underlying aetiology of disease and risk remains poorly understood [17–19]. Recently, large genome wide association studies (GWAS) have revealed hundreds of common single nucleotide polymorphisms (SNPs) that are associated with risk of developing neuropsychiatric disease [20–25]. The vast majority of these variants are found in non-coding parts of the genome and are expected to be regulatory, impacting gene expression levels or which RNA isoforms are produced. For example, risk variants could impact splicing factor binding potentially resulting in novel RNA isoforms, gross transcript alterations or changes to isoform splicing ratios [8]. Confirmatory studies including transcriptome wide association studies (TWAS), summary data–based Mendelian randomization (SMR) [26], multimarker analysis of genomic annotation (MAGMA) and variants (H-MAGMA [27], nMAGMA [28]) and functional genomics have helped to identify the risk genes at these loci and also showed a considerable number of risk loci are shared between disorders [29]. However, there is a current lack of understanding about how risk gene expression and splicing are altered by the risk variants and therefore profiling both their expression and RNA isoforms is essential to link genetic changes to disease pathophysiology.
Current sequencing technologies including Illumina short-reads perform well at detecting novel AS, however the lack of long-range exon connectivity information inherent in short-reads means these approaches are limited in their ability to identify and quantify full-length isoforms and this issue is exacerbated in longer, more complex genes [30, 31]. In contrast, long-read technologies including nanopore sequencing from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) single-molecule, real-time sequencing can sequence entire isoforms in a single read enabling more accurate isoform profiling [7, 32]. Such technologies now make it feasible to comprehensively examine gene isoform profiles. Initial investigations of SNX19 and CACNA1C demonstrated the incomplete knowledge of isoform profiles in humans and the likely importance of novel MHD risk gene isoforms in disease [33, 34].
In this study we addressed the lack of knowledge surrounding MHD risk gene isoform expression using nanopore amplicon sequencing. We developed a new bioinformatic tool, IsoLamp, to identify known and novel RNA isoforms from long-read data. Analysis of the RNA splicing profiles of 31 MHD risk genes identified 360 novel RNA isoforms and 28 novel exons. We identified several genes where most expression is from novel isoforms, including ATG13 and GATAD2A, where the most highly expressed isoforms were novel. Our results show the transcript structure for most risk genes is more complex than current annotations, containing additional exon skipping events, retained introns, novel splice sites and novel exons, including novel isoforms that alter the protein and potentially its function. This work lays the foundation for a better understanding of how risk gene isoforms may play a role in disease pathophysiology.
Methods
Sample preparation and QC
Healthy control post-mortem human brain samples were obtained from six individuals collected through the Victorian Brain Bank (VBB) under HREC approval #12457. Brain samples from two further individuals with a non-control phenotype were also obtained from the VBB for protocol testing and validation purposes. Age, sex and additional details including the post-mortem interval (PMI), pH and tissue weight are shown in Supplementary Table 1. Briefly, samples comprised 5 males and 1 female, age range: 51 – 72 yrs, PMI range: 31 – 64 hrs and pH range: 5.7 – 6.7. Frozen tissue (weight range: 57 – 135 mg) was cut from seven brain regions including Brodmann areas (BA), BA9 (dorsolateral prefrontal cortex (DLPFC)), BA46 (medial prefrontal cortex (MPFC)), BA10 (fronto-parietal cortex (FPC)), Brodmann Area 24 (dorsal anterior cingulate cortex (dACC)), caudate, cerebellum and temporal cortex. Total RNA was extracted from bulk tissue in eight randomised batches of 3 – 6 samples. First, frozen brain tissue was homogenised on ice, using a manual tissue grinder (Potter-Elvehjem, PTFE), whilst immersed in 1 mL QIAzol Lysis Reagent (QIAGEN). Lysate was then processed using a RNeasy Lipid Tissue Kit (QIAGEN, 74804), according to the manufacturers’ instructions. Isolated RNA quality and quantity was checked using a Qubit 4 Fluorometer (2 μL), TapeStation 4200 (RNA integrity number equivalent (RINe), cut-off = 6) and Nanodrop 2000.
Database curation and risk gene selection
MHD risk genes were selected for long-read amplicon sequencing using an internal database that aimed to collate evidence from the literature of gene involvement in disease risk from the original GWAS, meta-analyses including MAGMA (and variants including eMAGMA, hMAGMA, nMAGMA), TWAS, SMR and follow-up studies including fine mapping, protein-protein interaction (PPI), epigenetic (DNA methylation) and targeted experimental validation (Supplementary Figure 1). The foundation of this database was a list of significant GWAS SNPs for SZ, BD, MDD and ASD. Association data was downloaded from the NHGRI-EBI GWAS Catalog [35]. MHD GWAS associations were filtered on the ‘Disease/Trait’ column to exclude effects of treatments including pharmaceutical, mixed disorder studies and associations with behavioural traits like smoking or alcohol intake. Associations were excluded if both the ‘reported gene’ column was ‘not reported (NR)’ and the ‘mapped gene’ column was blank. Date data was downloaded, filters applied, and percentage associations retained are detailed in Supplementary Table 2.
Follow-up studies and experiments were then identified in the literature and the reported genes were manually collated and assigned to an ‘evidence’ category as described above. Each evidence category had the following information headers including the PubMed ID of the reporting manuscript, the first author and the reported SNP and gene. An R script (Supplementary File 1) was used to curate risk genes and appearances in unique studies. Counts of reported risk genes and unique studies, for each evidence category, were then combined with the original GWAS table. Risk genes were then sorted by evidence (high to low), separately for each MHD. A multi-trait evidence list was also made by combining each MHD table together and again sorting by descending evidence. This gave us flexibility to focus on risk genes that appeared to be specific to a single MHD or those with shared risk across disorders.
Primer design, cDNA synthesis and long-range PCR
Thirty-one (31) MHD risk genes were selected from our database and the full coding sequence (CDS) from the canonical isoform was downloaded from the UCSC Genome Browser [36]. Primers, located in the 5’ and 3’ UTRs, were designed to amplify the CDS using Primer3 Plus [37]. Additional primers were made to amplify alternative start or end sites that were not captured by a single primer pair.
Additional UCSC track sources including expressed sequence tags (EST), transcript support level (TSL), APPRIS designation, human mRNA support, cap-analysis of gene expression (CAGE) peaks, CpG islands and H3K4Me3 marks were examined to ensure there was enough evidence that alternative start or end sites were real before a primer was designed [38]. All primers, primer combinations and modified Primer3 Plus settings are listed in Supplementary Table 3. Risk gene primers from Primer3 Plus were aligned to tracks on the UCSC Genome Browser using BLAT for visualisation and tested using the In-Silico PCR [39].
To amplify risk gene CDS, 1 μg of total RNA was used as template for cDNA synthesis using Maxima H Minus Reverse Transcriptase (Thermo Fisher Scientific, EP0752, 200 U/μL) according to the manufacturers’ instructions. Two duplicate cDNA plates were generated simultaneously to reduce variability and provide enough template for multiple risk gene PCRs. Risk genes were amplified using one of following DNA polymerases; LongAmp® Taq 2X Master Mix (NEB, M0287S), Platinum™ SuperFi II PCR Master Mix (Thermo Fisher Scientific, 12368010) or PrimeSTAR GXL (TakaraBio, R050B). Each set of gene primers were individually optimised by adjusting PCR cycling conditions (Supplementary Table 3) until sufficient pure template (∼1 – 10 ng) could be produced for input to barcoding. Short-fragments and primer-dimer were removed prior to barcoding using AMPure XP beads (Beckmann Coulter) at 0.5 – 0.8x ratios to PCR volume. An overview of the experimental protocol is shown in Figure 1A.
Long-read amplicon sequencing
Barcoding conditions for sample multiplexing (N=35, EXP-PBC096, ONT) and library preparation for long-read sequencing followed the recommended ligation sequencing protocol (Figure 1A) (SQK-LSK109/110, ONT). All barcoding PCR was done using LongAmp® Taq 2X Master Mix with an amplicon specific extension time (approximately 1 min/Kb) and 10 – 15x cycles. AMPure clean-up following adaptor ligation was adjusted from the default ratio of 0.4X depending on the length of the target amplicon. Adaptor ligated libraries were loaded (25 – 35 fmol) onto MinION (FLO-MIN106) flow cells and a minimum of 10,000 reads per sample were targeted before flushing and storing the flow cell. All runs were re-basecalled using the super-accurate (SUP) basecalling model (Guppy v6.0.17, 2022) and minimum qscore = 10.
Isoform discovery from long-read amplicon sequencing with IsoLamp
We developed a new bioinformatic pipeline, IsoLamp, for the analysis of long-read amplicon data (Figure 1B). First, pass reads were down sampled [40] to a consistent number (8000) per barcode and mapped to the reference genome with minimap2 (v2.24) [41]. Then, low accuracy reads (< 0.95) were removed and samples were merged prior to isoform identification. Read accuracy was calculated using CIGAR strings in the BAM files and is defined as: (‘X’+’=’+’I’+’D’-‘NM’)/(‘X’+’=’+’I’+’D’). Next, the merged BAM file of high accuracy reads was used as input for isoform discovery with Bambu (v3.2.4) using the following parameters: novel discovery rate (NDR) = 1 and min.fractionByGene = 0.001 [42]. Next, isoforms identified with Bambu were filtered to remove any with zero expression and to retain only isoforms with start and end positions overlapping the known primer coordinates using bedtools intersect (v2.30) [43] to retain only isoforms with start and end positions overlapping the known primer coordinates. Finally, the remaining isoforms were used to create an updated transcriptome with GffRead (v0.12.7) [44], and reads from each barcode were quantified with salmon in alignment-based mode (v0.14.1) [45, 46] and isoforms were annotated with GffCompare (v0.12.6) [44]. The pipeline outputs a list of isoform annotations (.gtf), isoform expression as transcripts per million (TPM) and proportion of overall gene expression, as well as a report summarising the results. An optional TPM filter was also applied to further filter isoforms which required a minimum TPM of 5000. If the user specifies a grouping variable for their input samples, a t-test is performed between isoform proportions between groups and p-values are adjusted for the false discovery rate (0.05).
We benchmarked the performance of the IsoLamp pipeline using Spike-in RNA Variant (SIRV) Set 1 synthetic RNA controls (Lexogen). SIRV isoforms are present in three mixes (E0, E1, E2) that contain isoforms in varying known concentrations. Primers were designed to amplify from the first to the last exon (as described above) of SIRV5 and SIRV6 genes from cDNA generated in triplicate from each SIRV mix (N=27) (Supplementary Figure 2). PCR amplification conditions for SIRV amplicons are shown in Supplementary Table 4. Samples were barcoded and sequenced as described above, and subsequent basecalling and demultiplexing was performed with Guppy (v6.0.17, SUP, 2022). The IsoLamp pipeline was compared against three other isoform discovery tools: StringTie2 [47], FLAIR [48] and Bambu (using both default parameters and the optimised parameters used in IsoLamp). The sensitivity, specificity, accuracy and correlations (expected versus observed counts) of the programs were compared using three SIRV reference annotations: complete (C), incomplete (I) (missing isoforms, to test ability to recover unannotated true positive isoforms) and over (O) (extra isoforms, to test ability to minimise false positive annotated isoforms). Novel isoforms were categorised using SQANTI (v4.2) against the human reference (GENCODE release 41, GRCh38.p13). Finally principal component analyses (PCA) were done on a combined dataset of expression values for each known and novel isoform and its associated metadata including brain region, gene, RINe, pH, individual, PMI and age.
Novel exon validation
Nanopore long-read supported novel exons were validated by RT-PCR. Amplicons were designed from the known 5’ flanking exon into the novel exon and from the novel exon into the known 3’ flanking exon. An amplicon spanning the known 5’ and 3’ flanking exons was used as a positive control. Primers were designed using Primer3 [37] and checked using Primer BLAST [49] and are listed in Supplementary Table 5. In some cases, the primer design space was restricted by the novel exon sequence length and/or nucleotide composition. Novel exon amplifications were done using Taq 2X MasterMix (NEB, M0270L) and cycling conditions can also be found in Supplementary Table 5. PCR products were visualised via gel electrophoresis using GelGreenⓇ Nucleic Acid Stain (Biotium, 41005) and GeneRuler 100 bp ladder (TFS, SMN0243). PCR products in the expected size range were cleaned up using AMPure XP Reagent (Beckmann Coulter, A63881) at a 1.8X ratio to remove fragments <100 bp and sent for Sanger sequencing (100 – 200 bp, AGRF).
Protein isolation and novel sequence detection using targeted mass spectrophotometry (MS)
Post-mortem frontal cortex and cerebellum brain samples (mean weight = 53.25 mg) were used for protein isolation. Samples were from a healthy 59 yo female and 74 yo male with PMIs of 30 and 22 hrs respectively who had no known neurological or neuropsychiatric conditions. Briefly, samples were lysed in 500 μL of guanidinium-HCl buffer using tip-probe sonication, heated briefly to 95 ℃ and diluted 1:1 with LC-MS water before 4 mL of ice-cold acetone was added to precipitate protein overnight at −30 ℃. Supernatant following a wash (3 mL 80% cold acetone) and incubation (−30 ℃, 1 hr) was discarded and protein air-dried (RT, 30 mins). The protein pellet was resuspended in 500 μL 10% TFE in 100 mM HEPES (pH 7.5) and sonicated. Protein concentration was estimated with BCA (1 μL sample + 9 μL 2% SDS). Normalised protein (10 μg/10 μL) was then digested using a combination of LysC/trypsin or GluC for all samples. Finally, digested peptides were analysed using an OrbiTrap Eclipse mass spectrophotometer using known peptide targets informed by long-reads. Protein structure prediction was done using AlphaFold accessed through UCSF ChimeraX (v1.5) [50–52].
Data visualisation
The publicly available web-tool IsoVis (v1.1, https://isomix.org/isovis/) was used to visualise RNA isoforms and associated expression data (Wan et al., 2024, under review). Known and novel RNA isoforms are represented as a stack to compare alternative splicing events between different isoforms. Read counts assigned to each RNA isoform for each of the 35 samples were visualised as a heatmap.
Results
Experimental overview
To identify the RNA isoforms expressed from genes of interest we aimed to perform long-read amplicon sequencing, which provides a highly sensitive means for comprehensive isoform discovery and relative quantification (Figure 1A) [33]. We selected seven regions of post-mortem human brain, encompassing both transcriptionally divergent regions as well as those highly implicated in MHDs. Amplicons were designed cover the full coding region of target genes and, where possible, run from the first to the last exon. Multiple set of primers were used for genes with alternative transcriptional initiation and termination exons and/or alternative coding sequence initiation and termination sites.
IsoLamp: a tool for RNA isoform discovery from long-read amplicon sequencing
While there are several long-read isoform discovery and quantification tools, these are not generally optimised for amplicon sequencing of single genes at high depth. Therefore, we created ISOform discovery with Long-read AMPlicon sequencing (IsoLamp), a custom pipeline designed for isoform profiling from amplicon sequencing (Figure 1B). In contrast to previous tools [33] IsoLamp can be applied to any gene, provides flexible filtering options and provides a simpler, unified, output of isoforms.
We benchmarked the performance of IsoLamp using synthetic Spike-in RNA variants (SIRVs) that provide a known ground truth for isoform exonic structures and abundances. We performed long-read amplicon sequencing on SIRV5 and SIRV6, targeting five isoforms per gene, as these SIRVs allowed targeting of the largest number of isoforms with a single primer pair and so best recapitulated human genes. The SIRV dataset comprised nine replicates from each of the three SIRV mixes (E0, E1, E2) for each gene. 99% of reads mapped to the SIRV genome with minimap2 [41], confirming on-target amplification. We benchmarked the performance of IsoLamp with Bambu [42], FLAIR [48], FLAMES [53] and Stringtie2 (-L) [47]. We assessed the precision, recall and expression correlations of the five tools using three different reference annotations (Figure 2A-C, Supplementary Figure 3, Supplementary Table 6): 1. Complete - contains all SIRV isoforms, 2. Insufficient - missing SIRV isoforms known to be present, and 3. Over – contains additional isoforms that are not present in the SIRV mixes.
Our benchmarking results demonstrated IsoLamp had the highest precision and recall values, consistently outperforming other isoform discovery tools by correctly identifying true isoforms and minimising false positives (Figure 2A, Supplementary Table 6). This included maintaining high performance with the more challenging, but also more realistic, insufficient and over annotation references. IsoLamp expression quantification was also consistently accurate and maintained performance irrespective of the annotation provided (Figure 2B, C).
Bambu, which is also utilised within the IsoLamp pipeline, was the next best performing tool although it identified more false positives and had poorer recall and quantification results using the insufficient annotation (Figure 2A-C). IsoLamp utilises Bambu parameters optimised for amplicon-sequencing, including a novel discovery rate (NDR) of 1. Adjusting the Bambu ‘NDR’ to 1 improved its recall but didn’t improve precision (Figure 2A-C, Supplementary Table 6). These results demonstrate how IsoLamp outperforms tools designed for whole-transcriptome analysis, including when Bambu is provided with optimised isoform discovery parameters for amplicon sequencing.
FLAIR had the highest number of isoforms of all tools tested identifying 261, 181, and 278 novel transcripts in the complete, insufficient, and over-annotated reference-based analyses, respectively. This high level of false-positive novel transcripts led to inaccuracies in transcript abundance assignments, resulting in low correlations compared to other tools (Figure 2A-C). FLAMES exhibited 100% recall for SIRV5 across all annotations, however, its performance with SIRV6 was suboptimal, indicating a higher degree of variability in the FLAMES isoform discovery pipeline. FLAMES also performed poorly for isoform quantification. Lastly, while Stringtie2 did not introduce large numbers of false positives, it had the highest number of false negatives, including when provided with complete annotations (Figure 2A-C, Supplementary Table 6).
IsoLamp employs an optimised expression-based filter to remove lowly expressed isoforms that are likely to be false positive detections. Applying this filter to Bambu, FLAIR, and FLAMES substantially reduced false positive novel isoforms and enhanced overall precision (Supplementary Figure 3, Supplementary Table 6), though IsoLamp was still the top performing tool. Beyond synthetic benchmarking data, reference annotations are typically a combination of insufficient and over annotations. In such scenarios, IsoLamp demonstrated better or comparable correlations with all other tools (Figure 1A-C, Supplementary Figure 3), suggesting its superiority for amplicon-sequencing based isoform discovery and quantification from real biological data.
Post-mortem human brain RNA quantity and quality
Total RNA for long-range amplicon sequencing was extracted from 7 brain regions from 5 healthy individuals (Ind01 - 05) and subject to sample QC (Supplementary Figure 4A-D). RINe (mean = 7.4, range = 6 - 8.1) did not differ by brain region, however Ind04 had significantly lower RINe scores (Supplementary Figure 4B). No trend between the PMI (mean = 44.25 hrs) and RINe was observed (Supplementary Figure 4C). RINe appeared to worsen with decreasing brain tissue pH levels (Supplementary Figure 4D). A principal component analysis (PCA) showed separation of Ind04 (likely driven by lower sample pH and RINe) and of cerebellum and caudate samples from cortical regions in PC1 and PC2 (Supplementary Figure 5AB). A relatively small proportion of variance (4.7%) was attributed to control donor age in PC5 (Supplementary Figure 5C).
Long read sequencing identifies 360 novel RNA isoforms
A total of 31 risk genes were selected for amplicon sequencing based on the accumulated evidence for their involvement in neuropsychiatric disorder risk. A custom database of risk genes and their evidence levels was created and genes ranked (Methods). In a reflection of current GWAS cohort sizes, 21 of the selected genes had the highest evidence for involvement in risk for SZ, 7 for MDD, 2 for ASD and 1 for BPD (Figure 3A). Evidence from GWAS, TWAS and other studies show that some genes appear to be risk factors for multiple disorders including KLC1 for SZ, MDD and ASD (Figure 3B).
The full RNA isoform profile for each gene was sequenced using nanopore long-read amplicon sequencing. Mapping accuracy ranged from 93.7% (CLCN3) to 97.5% (SORCS3) (Supplementary Figure 6A, B). Each novel isoform and its predicted impact on known protein domains, open reading frame (ORF) and associated instability index was recorded (Supplementary File 2) and visualised using IsoVis (Supplementary File 3) (Wan et al., 2024, under review). With no TPM filter set in IsoLamp we identified 872 known and novel isoforms across all 31 neuropsychiatric disorder risk genes. To filter this list for more highly expressed novel isoforms we applied a TPM filter which resulted in 440 known and novel isoforms across all genes (Figure 4A). Of these, SQANTI [54] classified 80 as known (full splice match (FSM)), 266 as novel but using only known splice sites or junctions (novel in catalogue (NIC)), and 95 as containing at least one novel splice site (novel not in catalogue (NNC)) (Figure 4A).
We next asked what proportion of reads for each gene were assigned to novel isoforms (Figure 4B). This ranged widely from approximately 96.9% for GATAD2A to 0% for GRIN2A, which was the only gene for which no novel RNA isoforms were detected. Approximately one quarter (7/31) of genes investigated had most of their gene expression assigned to novel isoforms, demonstrating how isoforms and their expression profiles for many genes are still poorly understood. As our amplicon-sequencing does not encompass all variations in transcriptional initiation and termination sites, these results can be seen as a lower bound for novel isoforms and their expression proportions. Linear regression of gene isoform counts (Supplementary Figure 7) and novel isoform proportion did not reveal a significant relationship with amplicon length or canonical exon count, indicating that detection of novel isoforms is largely gene dependent (Supplementary Figure 8A-D). To determine what was different about the splicing pattern of each novel isoform we further sub-classified them using SQANTI, based on the use of a combination of known exon junctions (COJ) or splice sites (COS), retained intron (RI) or containing at least one novel splice site (ALO) (donor, acceptor or pair) [54]. Overall, the most reads were assigned to “novel combination of known junctions”, where all individual exon combination were known but the entire chain of exons was novel. The type and proportion of novel isoforms from each category was highly gene specific, demonstrating a wide variety of novel RNA types missing from current gene annotations.
The impact of each novel isoform on the encoded ORF was examined using Expasy [55] and recorded as retaining the canonical or other known reading frame (coding), likely-NMD or unknown. Novel isoforms were classified as coding for 54.2%, 67.3% and 75.4% for ALO, COJ and COS subcategories respectively. We identified 49 novel isoforms that contained retained introns, 39 (83%) of which were predicted to lead to NMD (Figure 4C).
Isoforms that contained ‘at least one novel splice site’ (ALO) generally contained a novel deletion within a known exon or had novel donors and/or acceptors. All novel junctions in ALO isoforms were canonical GT-AG, GC-AG or AT-AC junction pairs, though often only the splice donor (GT) or acceptor (AG) was novel (Supplementary Figure 9A). We found that ∼47% of ALO isoforms contained either a single novel splice donor or acceptor. Novel GC-AG pairing was detected in two SZ risk genes, within the 5’UTR of GABBR2 and the donor site of a novel exon in RFTN2. These results show a clear advantage of using long-read sequencing to contextualise novel splice sites which aids in predicting the outcome on the isoform and ORF.
Detection of highly expressed novel isoforms
A key question regarding novel isoforms is whether they are expressed at high enough level to impact the biological function of a gene. This is a complex question, because a novel isoform could be low at the tissue level but highly abundant in a specific cell type, or multiple expressed novel isoforms can be significant cumulatively, especially if they all encode the same change to a protein. Therefore, we focused on genes with significant individual or cumulative expression of novel isoforms (analysis on all gene isoforms is available in Supplementary File 2).
We identified 22 novel isoforms for the schizophrenia risk gene autophagy-related protein 13 (ATG13). Novel isoforms represented 64% of gene expression, compared to 36% for full-splice matches. The most abundant class of novel isoforms (15/22) were COJ, which made up 55.4% of gene expression (Figure 4B). ATG13 had two alternative splicing hotspots, firstly with the 5’UTR and secondly around a predicted disordered region involving exons 12 and 13 in the canonical isoform.
Across all brain regions the most highly expressed isoform was the novel COJ transcript 26 (Tx26), which represented 23% of total reads, surpassing the canonical transcript ENST00000683050 (12.8%). Tx26 differs from the canonical transcript by skipping of exon 12. It contains the same CDS as ENST00000359513 but includes an additional exon in the 5’UTR (exon 3). Novel COJ transcripts 6 and 8 also had high read counts and together accounted for 16.6% of expression. These isoforms were novel due to a combination of 5’UTR exons not previously seen within full-length GENCODE annotations (Figure 5A, B).
The schizophrenia risk gene CUB and sushi multiple domains 1 (CSMD1) was the longest CDS we amplified at approximately 10,838 nt encompassing 70 coding exons. In total 8/9 detected isoforms were classified as novel. Following the canonical isoform (ENST00000635120, 51.5% of reads), novel transcripts 26 (COJ) and 33 (ALO) accounted for 38.3% and 7.1% of assigned reads, respectively (Figure 5C, D). Novel Tx26 skipped known exon 65 which encodes a sushi 28 extracellular domain and glycosylation site. The ORF of Tx26 retained the reading frame encoding a 3549 amino acid (aa) protein. Novel Tx33 contained a novel splice donor (GT, −8 nt) in canonical exon 21 predicted to lead to a PTC in canonical exon 22. The full Tx33 mRNA also skipped canonical exon 65. CSMD1 also provides a useful example of the benefit of long read for profiling isoforms.
GTEx isoform expression data (https://www.gtexportal.org/home/gene/CSMD1) for CSMD1 in brain is almost exclusively assigned to isoforms with downstream transcriptional initiation sites (including the two-exon fragment ENST00000521646), despite splice junction level expression largely supporting expression from the canonical start site. The emphasises the difficulty of assembling and quantification expression of full-length isoforms from long, complex genes, which can be achieved using long isoform spanning reads.
The chromatin remodelling subunit and shared SZ and BPD risk gene GATA zinc finger domain containing 2A (GATAD2A) had one of the highest proportions of reads (96.9%) assigned to novel isoforms. Most novel isoforms were predicted to be coding COJs (10/24) and these also accounted for the majority of novel read assignment (66.6%). Novel Tx17 had the highest expression level (22.7%) of any GATAD2A isoform and skipped canonical exon 10 which overlaps a CpG island (212 nt, 21.7% CpG) and contains a disordered, polar residue biased region and a phosphorylation site (Figure 5E). Two additional novel isoforms (Txs 8 and 12), together accounting for 19.9% of expression, incorporated a known 89 nt 5’UTR exon (ENST00000494516) into full-length isoforms for the first time, clarifying the isoforms expressed from this gene.
Our results were also useful for several genes in identifying the probable isoforms represented by GENCODE transcript fragments. For the SZ risk gene clusterin (CLU), the novel Tx1 (COJ) extended ENST00000520796 to the canonical stop codon and suggested this isoform is moderately abundant (8.2% of TPM) across all brain tissues. The ASD risk gene microtubule associated protein tau (MAPT) novel Tx5 extended ENST00000703977 and further demonstrated that inclusion of canonical exon 7 (chr17:45,989,878-45,990,075) does not always exhibit coordinated splicing with canonical exon 5. This isoform had a moderate read count comprising 3.2% of MAPT expression.
The full RNA isoform profile for the SZ risk gene calcium voltage-gated channel subunit alpha1 C (CACNA1C) has previously been reported and was repeated in this study [33]. In total, we identified 5 annotated and 22 novel isoforms. The most highly expressed novel isoform identified in our previous study, novel 2199, now known as ENST00000682835.1 was also identified in our samples and exhibited similar cerebellar specific expression. Importantly, 10 novel isoforms replicated one of two alternative splicing events in a hotspot identified previously in canonical exon 7 [33]. This hotspot contains the canonical splice site and two alternative 3’SS acceptors over only 12 nucleotides (chr12: 2,493,190-2,493,201) (Supplementary Figure 10).
Novel isoforms alter predicted protein structures
Novel isoforms have the potential to affect post-transcriptional regulation, protein sequence, structure and function or both. We next investigated a selection of isoforms that would be predicted to lead to protein changes to understand their possible impact.
Several novel isoforms (including 5 of the top 20 by expression, Supplementary File 3R) predicted a novel exon 22 skipping event in the SZ risk gene ITIH4. Targeted mass spectrophotometry confirmed a novel junction between exons 21 and 23 (ETLFSVMPG//PVLPGGALGISSSIR) created due to skipping of exon 22 (Figure 6A). This event was predicted to encode a PTC <50 nt from the final exon junction, indicating it may not be directed to NMD. Protein structure prediction of the canonical (ENST00000266041.9) and a representative novel isoform (Tx71) indicated a loss of 106 aa (∼44%) of the 35 kDa heavy chain domain but retention of three O-glycosylation sites (Thr:719, 720, 722) (Figure 6B-D). Novel transcript 71 accounted for ∼3.7% of ITIH4 TPM and this skipping event was found in an additional 24/68 (35.2%) novel isoforms which together accounted for 23.4% of reads.
Tx71 also skipped canonical exons 15 and 16, which contain a protease susceptibility region (residues 633 - 713) and a MASP-1 cleavage site (645 - 646: RR) [56]. Cleavage at this site and subsequent formation of an ITIH4-MASP complex can inhibit complement activation via the lectin pathway [56]. However, skipping of canonical exon 22 was not mutually inclusive with skipping of exons 15/16 as other novel isoforms with exon 22 skipping retained exons 15/16. The absence of much of the 35 kDa heavy chain domain is likely to impact on ITIH4 protein function and further studies will be required to examine if it plays a role in neuronal phenotypes.
Ten novel isoforms were identified for the SZ risk gene glutamate metabotropic receptor 3 (GRM3). Three novel isoforms (Txs 6, 7, 9) skip exon 2 which contains the canonical translation start site and instead could use an alternative, frame retaining, translation initiation site in exon 1, extending the truncated reference isoform ENST00000454217.1 which is also supported by human amygdala mRNA (AK294178) (Supplementary File 3Q). Translation of these isoforms would likely cause significant disruption to the resultant protein with removal of the signal peptide, transmembrane domain and disulphide bonds. Cumulatively these novel isoforms accounted for a relatively low 8.8% of expression when compared to the canonical isoform (86.5%).
Both novel isoforms and exons were identified for the shared MDD and ASD risk gene neuronal growth regulator 1 (NEGR1). Most reads (83.5%) were assigned to the canonical NEGR1 isoform (ENST00000357731, 354 aa). Three novel isoforms were identified, two of which (Txs 1 and 2) contained novel exons (Supplementary Figure 11A, B). These transcripts accounted for 9.4% (Tx2) and 5% (Tx1) of read counts. Both novel exons were located between cassette exons 6 - 7 and were validated using Sanger sequencing. The novel exon within Tx1 was 42 nt (14 aa) in length, had high 100 vertebrate conservation (UCSC) and was predicted to be frame retaining (Supplementary Figure 11C). Protein structure prediction of the 368 aa Tx1 using AlphaFold [50] showed a 14 aa extension near the C-terminal prior between the GPI anchor (G:324 aa) and the three immunoglobulin-like domains (Supplementary Figure 11D). In contrast, the 58 nt novel exon within Tx2 encoded a PTC (TAG) only 35 nt distant to the final exon junction complex, suggesting it might not trigger NMD. Truncation of the protein at this position (313 + 7 novel aa) would remove the GPI anchor potentially creating a near complete protein (320 aa) that is unable to attach to the cell membrane (Supplementary Figure 11E).
Brain region specific expression of novel isoforms
Many isoforms have brain region enriched or specific expression [57, 58]. Our amplicon sequencing approach identifies the presence and relative expression proportion of different isoforms. We next asked if any novel risk genes isoforms showed expression differences between brain regions. Overall, cerebellum exhibited most differences in isoform expression, consistent with previous whole transcriptome results [12].
Depression risk gene DCC netrin 1 receptor (DCC) novel isoform Tx9 had significantly higher TPM in cerebellum (Figure 7A, Supplementary File 3J). TPMs of Tx9 in CBM were approximately 10x higher than the average for cortical regions and 3x higher than in caudate. This isoform, classified as a COJ and predicted to encode a 1425 aa protein, accounted for ∼5% of DCC expression. Tx9 uses an alternative 3’SS (−60 nt) in cassette exon 17 and the skipped nucleotides cover an extracellular region and fibronectin type-III domain (UniProt). The SZ risk gene double C2 domain alpha (DOC2A) had two novel isoforms with significant variation in brain specific expression including Tx8 in cerebellum and Tx41 in caudate (Figure 7B, C, Supplementary File 3K). Novel Tx8 used a novel splice donor in canonical 5’UTR exon 1 (GT, +158 nt) and was predicted to encode a 400 aa protein unchanged from the canonical transcript. Tx41 was the only novel transcript that showed moderate but specific expression in caudate samples or any tissue other than cerebellum. Tx41 extends the known isoform ENST00000574405 to the canonical stop and is predicted to encode a 400 aa protein. Overall, 28 novel isoforms in 11 risk genes were found to have variable expression amongst brain tissues supporting a role for these isoforms within specific brain regions or potentially in a subset of cells.
Sequencing and validation of novel exons
Our amplicon sequencing approach detected a total of 28 novel exons in 13 MHD risk genes. Using RT-PCR followed by Sanger sequencing we validated 21/21 targeted novel exons (Table 1). The SZ risk gene chloride voltage-gated channel 3 (CLCN3) contained four novel exons within six novel isoforms and an example of PCR validation is shown in Supplementary Figure 12A. Validated novel exon mean length was 99 nt, ranging from 41 nt (CLCN3) to 231 nt (GRM3). 16 (76%) of validated novel exons were classified as ‘poison exons’ as they encoded a PTC (Supplementary Figure 12B), although two of these poison exons, within NEGR1 and XRN2, were <50 nt from the final exon junction and therefore may not undergo NMD. The novel exon contained within Tx3 for XRN2 had the second highest isoform read count for the gene, following the canonical transcript (ENST0000037191), with 4.7% of assigned reads. If translated, this transcript would omit an omega-N-methylarginine modification site (ARG:946) within a disordered region at the C-terminus (Supplementary Figure 13, Supplementary File 3AE).
Three novel exons were in untranslated regions and two were predicted to retain the ORF, including the 42 nt exon in NEGR1 mentioned previously and a 60 nt exon within SORCS3. SORCS3 is a member of the VPS10 transmembrane protein family and assists with neuronal protein trafficking and sorting and a lack of SORCS3 in the hippocampus in mice has been associated with impaired learning and fear memory in mice [59–61]. The novel exon in Tx1, encoding 20 aa (AMCGRAQWFTPVILALWETE), falls within the SORCS3 lumenal region (position: LYS:956/PRO:957) and did not appear to disrupt the transmembrane or cytoplasmic domains.
Comparison of protein prediction models of the canonical (ENST00000369701.8) and novel isoform (Tx1) showed the addition of an unstructured loop with a partial alpha helix within the second polycystic kidney disease (PKD2) domain (Figure 8A-D), though the prediction confidence was low, so the structural impact on the PKD2 domain remains uncertain [52].
Discussion
In this study we used long-read sequencing to profile 31 neuropsychiatric disorder risk genes identifying 360 novel RNA isoforms. We also present a new bioinformatic tool, IsoLamp, that can accurately identify and quantify novel RNA isoforms from long-read amplicon data. The recent proliferation of GWAS studies examining increasingly large population-wide data has identified hundreds of genomic variants associated with risk of developing a mental health disorder [23, 62]. Evidence suggests that some risk variants, specifically those that are non-coding, play a role in pre-mRNA splicing and our current understanding of the transcriptomic profile for these risk genes is limited [33, 63]. A key finding of our research is both the high number of novel expressed RNA isoforms and, for some risk genes, the high expression of novel isoforms both individually and collectively. This finding reflects both the known complexity of alternative splicing in the human brain [64] and the current incompleteness of the reference transcriptome. As a result of the relatively deep sequencing afforded by this long-read approach, we have shown that there is a much higher level of RNA isoform diversity for these genes than reported in the current reference annotations. These findings provide new insight into the repertoire of RNA isoforms expressed in brain that could be important for understanding the risk and onset of neuropsychiatric disorders.
RNA isoform discovery, classification and visualisation
We generated a set of high-confidence RNA isoforms from nanopore long-read data using IsoLamp. IsoLamp optimises and streamlines transcript identification, quantification and annotation from long-read amplicon data and outperformed other methods. IsoLamp improves upon our previously published pipeline TAQLoRe [33] by expanding analysis capabilities to any gene and identifying all isoforms within a single pipeline. Our overall approach also overcomes the significant challenge of re-assembling and classifying RNA isoforms using short reads [65–67]. The primary outputs from IsoLamp, filtered transcripts (GTF) and transcript expression (TPM) is designed to be compatible with multiple downstream tools, including our visualisation tool IsoVis (https://isomix.org/isovis) (Wan et al., 2024, under review).
Taken together, our benchmarking results highlight that IsoLamp’s optimised isoform discovery parameters, coupled with its specific filters (expression, overlapping primers, and full-length reads), yield significant improvements in both precision and recall compared to Bambu, FLAIR, FLAMES and StringTie2. The TPM filter applied to the data presented in this study is conservative and for long and complex genes may need to be tested to yield a balance of novel isoform detection and acceptable expression levels. IsoLamp also output consistent expression quantification that was robust to the quality of the annotations provided.
Novel RNA isoforms in neuropsychiatric disorders
The results presented in this study confirm our current limited understanding of RNA isoform profiles in the human brain and demonstrates how long-read sequencing technologies are a powerful tool to address this issue [32, 33, 68].
Several novel isoforms and exons were identified for ion homeostasis and channel genes CLCN3, CACNA1C and SLC30A9 which have shared risk for SZ, BPD and ASD [21, 23, 62]. Voltage-gated ion channels are widely distributed in the brain and regulate neuronal firing. Mutations to these genes have been associated with disease and the emerging role of these channels in neuropsychiatric disorders has been previously reviewed [69]. A study of SZ, BPD and MDD patients versus controls reported gene expression changes in human striatum for potassium, calcium and chloride channel genes including CACNA1C and CLCN3 [70]. CLCN3 belongs to the CLC family of anion channels and transporters and has an established role in human neurodevelopment [71, 72]. We identified and validated four novel exons in CLCN3, three of which were predicted to encode a PTC which could lead to NMD. The fourth was located within the 5’UTR, an area known to impact translation regulation in humans potentially through structural interference with the ribosome [73]. Splice variants of CLCNC3 have been shown to impact intracellular localisation and our results confirmed these splice variants and further add to them, in particular a novel RNA isoform (Tx9) which is similar to ENST0000613795, but includes the 76 bp exon 12 [71]. Twenty-two novel isoforms were identified for calcium channel gene CACNA1C and supporting previous findings [33] the top ten novel isoforms, by assigned read count, were classified as frame retaining, supporting their potential to generate functional proteins. SLC30A9 (first known as HUEL) encodes the zinc transporter protein 9 (ZnT9) which is involved in zinc transport and homeostasis in the endoplasmic reticulum and also localises to the cytoplasm and nucleus [74]. Whilst the function of the protein is not fully understood, a 3 nt familial deletion (c.1047_1049delGCA) in the highly conserved cation efflux domain (CED) has been recorded to result in changes to protein structure, intracellular zinc levels and intellectual disability [75, 76]. Critically, we identified and validated a novel exon (Tx3-9a:73 nt) within this CED providing evidence that this region may be alternatively spliced more commonly than previously understood, potentially impacting protein function [74].
Several novel exons were detected in the long reads for post-mortem brains used in this study. Previous studies have identified approximately 250 conserved neuronal micro-exons (3-27 nt) that typically preserve the ORF and are involved in neuronal differentiation [77, 78]. Although not classified as micro-exons, a third (N=7) of the validated novel exons found in post-mortem brain in our study were <65 nt, similar to previously reported exon lengths that are difficult for the splicing machinery to process, indicating that these exons may also be inefficiently spliced [79]. Micro-exons are known to be disrupted, usually through increased skipping events, in the brains of individuals with ASD [78, 80]. Importantly, we identified novel exons in two ASD risk genes, SORCS3 and XRN2.
SORCS3, which has remarkable shared neuropsychiatric disorder risk, contained a novel exon (22 aa) predicted to preserve the reading frame [59, 81]. How these novel exons impact expression or function of a mature protein remains unknown. Further studies with a focus on the transcriptional landscape in the developing brain will be crucial in furthering our understanding of these splicing events [80].
Limitations and future directions
The results of our study are limited by the sample size of available control post-mortem brain tissue. The nanopore long-read data for each risk gene was generated from five elderly, male control individuals, with a single female sample removed from further analyses due to quality constraints.
These results may not be representative of all populations, particularly when it comes to novel exons and lowly expressed novel RNA isoforms. Individual transcriptome-wide variation has been observed between age, sex and pathology [82, 83], despite this we did not detect high levels of individual isoform expression variation and novel isoforms were almost always present in all individuals. The small number of available individuals means this dataset was not powered to investigate genotype impacts on isoform expression, though this will be an important area of investigation to determine which risk genotypes act through changes in isoform structure and/or expression.
Sample and RNA quality, as measured by RINe, is critical to high-quality sequencing and this is especially true for long reads [33, 84]. Supporting previous findings in mRNA, our data suggest that pH values <6.3 impacted the quality of post-mortem human brain RNA, which is especially critical for robust amplification of longer (>5 Kb) CDS [85]. In general, PCR cycling was kept as low as possible to avoid PCR bias towards shorter isoforms and other artifacts. However, we noted that lower RINs, as recorded for Ind04, appeared to impact amplicons of longer CDS. To help overcome such issues, future long-read amplicon sequencing could incorporate unique molecular identifiers to tag molecules prior to PCR to ensure an accurate representation of the original RNA isoforms [86].
The risk genes profiled in this study were selected based on multiple levels of evidence for their involvement in risk, not only from GWAS but from meta-analyses and further independent studies [22, 25]. Whilst this approach was expected to produce a set of genes with high-confidence of their involvement in disorder risk, it is not exhaustive and it will be important to ensure risk gene lists are updated as more evidence from GWAS and other studies becomes available [25, 63]. Genes that are thought to confer resistance to the development of neuropsychiatric disease are also beginning to emerge and the addition investigation of their expression and isoform profiles may also provide valuable insight into disease risk and progression [87]. Additionally, combining whole or amplicon transcriptomic data, large-scale proteomic data and machine-learning predictive models like TRIFID can help to identify and prioritise functional proteomic isoforms [88].
Conclusion
In conclusion, we identified several hundred unreported RNA isoforms and novel exons, many of which could impact the function of known neuropsychiatric risk genes that also play crucial roles in normal neuronal development and activity. An understanding of the regulatory and functional impacts of these novel isoforms and incorporating long-read Nanopore data into existing repositories will help form an important knowledge base of alternative splicing in the human brain [89, 90]. Some novel isoforms or exons may also be future therapeutic targets through the modulation of splice isoforms using antisense oligonucleotides or CRISPR technology.
Conflict of interest
RDP, YP, YY, JG and MBC have received financial support from Oxford Nanopore Technologies (ONT) to present their findings at scientific conferences. ONT played no role in study design, execution, analysis or publication.
Data Availability
All raw nanopore long-read data (fastq) generated for each of the genes reported in this manuscript will be uploaded to The European Genome-Phenome Archive (EGA) upon publication. Isoform GTFs and counts tables from IsoLamp analysis are available upon request. Scripts used for long-read data preparation and downstream data analysis are also available in supplementary material or upon request. IsoLamp is open source and freely available (https://github.com/ClarkLaboratory/IsoLamp). IsoVis is freely available (https://isomix.org/isovis/).
Supplementary Figures
Acknowledgements
The authors would like to acknowledge that brain tissues were received from the Victorian Brain Bank (VBB), supported by The Florey, The Alfred and the Victorian Institute of Forensic Medicine and funded in part by Parkinson’s Victoria, MND Victoria and FightMND. The authors would also like to thank Geoff Pavey at the VBB for his assistance with frozen tissue preparation. This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.
- 10.
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.