ABSTRACT
Cross-species genome comparisons have revealed a substantial number of ultraconserved non-coding elements (UCNEs). Several of these elements have proved to be essential tissue- and cell type-specific cis-regulators of developmental gene expression. Here, we characterized a set of UCNEs, as candidate CREs (cCREs) during retinal development and evaluate the contribution of their genomic variation to rare eye diseases, for which pathogenic non-coding variants are emerging. Integration of bulk and single-cell retinal multi-omics data revealed 594 genes under potential cis-regulatory control of UCNEs, of which 45 are implicated in rare eye disease. Mining of candidate cis-regulatory UCNEs in WGS data derived from the rare eye disease cohort of Genomics England revealed 178 ultrarare variants within 84 UCNEs associated with 29 disease genes. Overall, we provide a comprehensive annotation of ultraconserved non-coding regions acting as cCREs during retinal development which can be targets of non-coding variation underlying rare eye diseases.
INTRODUCTION
The ‘dark matter of the genome’ harbors functional cis-regulatory elements (CREs) such as promoters, enhancers, silencers, and insulators, whose orchestrated activity is essential to provide spatial and temporal patterns of gene expression that ensure proper tissue development and homeostasis (Field & Adelman, 2020; Spielmann & Mundlos, 2016). The influence of these regulatory elements on their target genes spans within architectural chromatin units known as topologically associating domains (TADs), demarcated by boundaries enriched in CTCF and cohesin proteins (Dixon et al., 2012). Due to their high context specificity, the annotation of candidate CREs (cCREs) poses an attractive yet challenging task. In this regard, the Encyclopedia of DNA Elements (ENCODE) (Abascal et al., 2020; Dunham et al., 2012) represents a powerful tool to characterize functional cCREs, as it provides a robust inventory of well-defined cCREs supported by epigenetic data derived from a wide variety of both human tissues and cell types.
Over the past decades, comparative genomics have been made progress to map functional cCREs by establishing cross-comparison between species. As a result, several databases of non-coding sequences that exhibit extremely high conservation have been generated, including UCEs (Bejerano et al., 2004; Lomonaco et al., 2014), UCNEs (Dimitrieva & Bucher, 2013), ANCORA (Engström et al., 2008) and CONDOR (Woolfe et al., 2007). One of the most comprehensive resources is UCNEbase, which comprises exclusively >200bp-long non-coding genomic regions that exhibit ≥95% sequence identity between human and chicken. In total, 4,135 regions across the genome comply with the criteria and are listed as ultraconserved non-coding elements (UCNEs). The vast majority of these highly constrained elements cluster around key developmental genes, such as PAX6, and have been long hypothesized to act as tissue-specific transcriptional regulators (Dimitrieva & Bucher, 2013; Polychronopoulos et al., 2017). Ultraconserved CREs have been shown to be consistently depleted of common variants (Drake et al., 2006; Snetkova, Ypsilanti, et al., 2021), indicating that purifying selection has shifted ultraconserved CRE-derived allele frequencies towards magnitudes similar to those observed for missense variants (Drake et al., 2006), hence reinforcing that variation within these regions is more likely to have functional consequences.
Due to the wide implementation of whole genome sequencing (WGS) in human genetic studies and ambitious initiatives such as the 100,000 Genome Project (100kGP) launched by Genomics England (GEL) (Smedley et al., 2021), previously overlooked regions within the vast non-coding fraction of the genome have been investigated in patients with rare diseases and have shown an emerging role of non-coding variants in disease pathogenesis. Two remarkable examples of functional evidence of disease driven by disruption of conserved CREs can be found precisely in the rare eye disease field, in which mutations located within highly conserved non-coding regions have been linked to developmental defects. A first example is a de novo single nucleotide variant (SNV) in an ultraconserved enhancer (SIMO element) 150 kb downstream of an intact PAX6 transcriptional unit, found in a case with aniridia (Bhatia et al., 2013). Second, tandem duplications within a gene desert downstream the IRXA cluster (Cipriani et al., 2017; Small et al., 2016) have been found in patients affected by North Carolina Macular Dystrophy (NCMD); the shared duplicated region harbors an enhancer element defined as UCNE that could act as a cis-regulator during retinal development (Van de Sompele et al., 2022).
The retina is a heterogeneous tissue composed of neuronal (rod and cone photoreceptors, bipolar cells, ganglion cells, horizontal cells, and amacrine cells) and non-neuronal cell types (astrocytes, Müller cells and resident microglia) that arise from the common pool of retinal progenitor cells (RPCs) (Turner et al., 1990; Turner & Cepko, 1988). This cellular complexity is the result of spatiotemporally controlled gene expression programs during retinal development, requiring concerted action of thousands of CREs. In the wide spectrum of retinopathies, inherited retinal diseases (IRD) represent a leading cause of early-onset vision impairment affecting over 2 million people worldwide. Despite targeted panel-based sequencing and whole exome sequencing (WES)-based genetic testing 30-50% of IRD cases remain unsolved (Ellingford et al., 2016; Haer-Wigman et al., 2017), raising the hypothesis that disease-causing mutations are found in the non-coding genome that are mostly not covered by standard genetic testing.
In this study we set out to annotate UCNEs as cCREs that modulate gene expression during retinal development and that can play a role in retinopathies such as IRD. Firstly, we performed a comprehensive integration of publicly available multi-omics datasets derived from human retina based on biochemical features associated with regulation of gene expression. Secondly, UCNEs co-localizing with genomic regions showing regulatory activity were linked to target genes that exhibit expression in the retina. Finally, we evaluated the contribution of genomic variation within these regions to rare eye diseases. This allowed us to identify an ultrarare SNV in a candidate cis-regulatory UCNE located upstream PAX6 in a family displaying autosomal dominant foveal abnormalities. Overall, this work has improved the functional annotation of UCNEs in human retina representing understudied targets of non-coding variation that may explain missing heritability in rare eye diseases.
RESULTS
Integration of bulk and single-cell multi-omics data enables the identification of 1,487 UCNEs with a candidate cis-regulatory role in human retina
The wealth of publicly available data allowed the integration of transcriptomic (bulk and scRNA-seq) and epigenomic (DNase-seq, scATAC-seq, ChIP-seq) data to evaluate the regulatory capacity of UCNEs over the major developmental stages of human retina (Fig. 1A). More specifically, to identify which UCNEs might have a potential cis-regulatory role, an intersection was performed with genomic regions characterized by an open chromatin context in the retina at different developmental stages, as supported either by DNase (ENCODE) or scATAC-seq (Thomas et al., 2022) peaks. This resulted in a total of 1,487 UCNEs, i.e., approximately one third of UCNEs display a retinal open context (Fig. 1B, Supplementary Table 1). Approximately 80% of the DNase-seq peaks were also supported by scATAC-seq, hence defining high-confidence candidate cis-regulatory UCNEs (Fig. 1B). Nevertheless, to account for differences in time points between datasets, all 1,487 UCNEs were included in downstream analyses. Apart from the display of a retinal open chromatin context, we also evaluated the overlap between UCNEs and genomic regions featured by markers associated with regulation of gene expression. A total of 834 UCNEs were found to display at least one of the assessed features in the interrogated retinal developmental stages; more specifically 111 UCNEs were identified to display the active enhancer mark H3K27ac, of which most (95%) also exhibit other histone marks related to active enhancers (H3K4me1 and H3K4me2) (Fig. 1B). Table 1 and Supplementary Table 2 include summarized and detailed overviews of the number of identified UCNEs per marker and stage, respectively.
A) Overview of the integrative multi-omics analysis for the prediction of UCNEs with potential cis-regulatory role in human retina. B) Venn diagrams illustrating the CREs displaying open chromatin features based on scATAC-seq and DNase-seq (left) and their overlap of active enhancer marks (H3K27ac, H3K4me1 and H3K4me2). C) Barplot showing the proportions of elements from the VISTA enhancer browser (V), UCNEs (U), and putative regulatory UCNEs (characterized by retinal datasets) (U+) displaying reporter expression (positive). D) Proportional stacked barplot showing the distribution of tissues (eye, forebrain, hindbrain, midbrain, neural tube, limb and tail) in which the putative regulatory UCNEs display reporter expression. E-F) Illustration of one of the characterized UCNEs (NR2F1_Hector_1/2) displaying open chromatin supported by DNase-seq (embryonic day 74-85, 89 and 103-125) and scATAC-seq (AC/HC/GC Precursors Cells, Early Progenitor Cells, Ganglion Precursor Cells, Late Retinal Progenitor Cells) and enhancer reporter expression in the eye. Figures obtained from the UCSC VISTA (hs1170) enhancer (E) and genome (F) browsers. Abbreviations: FW: fetal weeks; V: VISTA enhancer elements; U: UCNEs; U+: Putative active regulatory UCNE; ∩: intersection.
Abbreviations: FW: fetal weeks.
To investigate the functional relevance of the putative regulatory UCNEs, we performed a correlation with the 1,942 experimentally validated human non-coding fragments of the VISTA Enhancer Browser (Fig. 1C). Out of these 1,942 elements, approximately 50% (998/1,942) display enhancer activity. When overlapped with all 4,135 UCNEs, a slight increase of positive elements (505/922) is observed; however, when only the identified retinal accessible UCNEs were considered, roughly 68% were found to be positive (272/402), thereby reinforcing their predicted regulatory role (Fig. 1C). Interestingly, evaluation of the anatomical description of the reporter gene expression patterns of these 272 elements revealed a potential enrichment in eye (Fig. 1D). An illustrative example of one of these elements and its corresponding multi-omics-based characterization is shown in Fig. 1E-F.
In silico prediction and integration of retinal chromatin conformation data identify putative target genes under UCNE cis-regulation
We annotated the identified active UCNEs to assign them potential target genes and thus assess their association with active gene expression in the retina. A total of 724 target genes were assigned to the initially identified 1,487 UCNEs displaying candidate cis-regulatory activity. The vast majority (74.5%) of queried regions were assigned as putatively regulating two genes based on the used association rule (Fig. 2A). Furthermore, most of these regions were found to be located at 50-500kb from the TSS of the regulated target genes regardless of their orientation (Fig. 2B). Out of these 724 target genes, 594 (81.9%) are expressed in the retina in at least one of the interrogated developmental stages (52 to 136 days post-conception). To evaluate further whether chromatin contacts between the UCNEs and the promoters of their target genes can occur in retina, we integrated the 2,948 retinal TADs identified by Marchal et al. (Marchal et al., 2022). A large majority (393/594) of the target genes were found to be in TADs also harboring their associated UCNE (Supplementary Table 1). These genes were then used as input for Gene Ontology analysis, which revealed an overrepresentation of terms related to regulation of gene expression and differentiation, thereby confirming the expected roles of UCNEs as tissue-specific CREs during development, particularly of the nervous system (Fig. 2C).
A) Bar plot showing the number of associated genes per retinal UCNE. B) Distance distribution from the TSS and its associated UCNE. C) Gene ontology for the UCNE-associated target genes. Abbreviations: GO: Gene Ontology; TSS: transcription start site.
As a last layer of characterization of these active UCNEs and their candidate target genes, we assigned to each target gene cell-type specificity information based on expression data. Out of the 1,487 active UCNEs, 808 (54.3%) displayed a cell-type specific open chromatin context consistent with their target gene(s) expression signature(s) (Supplementary Table 1).
Mining of WGS data from a rare eye disease cohort reveals rare variants within active UCNEs
Out of the 594 putative target genes under UCNE cis-regulation, 45 were found to be previously linked to a rare eye disease phenotype, of which 19 are also associated with other developmental disorder phenotypes (Supplementary Table 4). Considering the extreme selective constraints of UCNEs, we hypothesized that genetic changes in these regions could contribute to disease. Therefore, we evaluated the genomic variation within the active UCNEs associated with these 45 disease genes in a sub-cohort of individuals affected by rare eye diseases (n = 3,220, Supplementary Table 5) from the 100,000 Genomes Project (Fig. 3A).
A) Overview of a sub-cohort comprising n=3,220 participants of the 100,000 Genomes Project affected by posterior segment abnormalities (n=2,615), anterior segment abnormalities (n=360), and ocular malformations (n=245). B) Variant population frequencies within putative retinal UCNEs normalized to a background composed of randomly selected sequences (see Methods).
As expected, a depletion of common variation (MAF>1%) was observed across these regions (Fig. 3B). A total of 431 (426 SNVs and 5 SVs) ultrarare variants, i.e., absent from reference population public databases, were identified in 199 active UCNEs linked to these 45 genes (Supplementary Table 6). Notably, out of the 5 identified SVs, one corresponds to the known shared duplicated region downstream of IRX1, located within the NCMD-linked MCDR3 locus [MIM: 608850]; in particular, this duplication, identified in 8 affected individuals of 4 different families, involves a UCNE exhibiting chromatin accessibility in developing horizontal cells. Finally, out of these ultrarare variants, 178 are located within 84 UCNEs displaying histone modification marks (in at least one of the interrogated stages), associated with 29 genes. This set defined our primary search space for variants with potential functional effects and further assessment.
An ultrarare SNV in an active UCNE upstream of PAX6 found in a family segregating autosomal dominant foveal anomalies
We identified an ultrarare SNV (chr11:31968001T>C) within a candidate cis-regulatory UCNE located ∼150kb upstream of PAX6. This variant was found in four individuals of a family segregating autosomal dominant foveal abnormalities (Fig. S1A-B; Supplementary Table 7). A CFH missense variant (c.1187A>G, p.Asn396Ser) was initially reported in the affected individuals but could not explain the foveal anomaly. Given the phenotype, a variant screening was performed with a particular focus on the PRDM13 and IRX1 NCMD loci, both associated with foveal or macular mis-development. A total of three and two SNVs within the IRX1 and PRDM13 loci respectively were found to segregate with disease. However, these SNVs are present in individuals from reference population databases and are not located in genomic regions displaying cis-regulatory features (Supplementary Table 8). Equally, loci associated with foveal hypoplasia, nystagmus and hypopigmentation (AHR, FRMD7, GPR143, and SLC38A8) were assessed for SNV or SV segregating with disease. No other (likely) pathogenic variants in these loci or any other known IRD/rare eye disease gene were identified in the affected cases (Supplementary Table 8).
The identified chr11:31968001T>C variant affects a nucleotide residue that is conserved for at least 360 million years of evolution that separate humans from X. tropicalis. In silico assessment of this variant and flanking sequence pointed to a likely deleterious effect and revealed a predicted disruption of several TF binding motifs (Supplementary Table 8_2). More specifically, this variant is located within a UCNE that is catalogued as a cCRE in ENCODE and featured by distal enhancer-like signatures in bipolar neurons (EH38E1530321). Regarding its retinal context, this UCNE displays accessible chromatin in the early stages of retinal development (7/8 gestational weeks), and in particular in ganglion cell precursors (Fig. 4A-B). Importantly PAX6 is expressed within this pool of cells and, interestingly, this expression appears to be enriched within a specific sub-population, as observed from the distribution of expression values across all the cells composing this cluster (Fig. S2). Additionally, this UCNE was found to display the active enhancer mark H3K4Me1 at the earliest time points available for this dataset (13/14 and 15/16 gestational weeks).
A) Visualization of the epigenomic context of the identified PAX6-associated UCNE (highlighted in gray) in relation to the PAX6 locus (top; chr11:31,490,261-32,075,591) and B) zoomed-in of the identified element (bottom; chr11:31,965,000-31,972,000). Figures are obtained from the UCSC genome browser.
This region has also been functionally validated in transgenic mouse assays (VISTA element hs855), which revealed gene enhancer activity mainly in the forebrain and, to some extent, in the retina (3/6 embryos) (Fig. 5A). A closer assessment of this UCNE with regard to its proximity to the PAX6 promoter in other species revealed its syntenic conservation up to zebrafish; interestingly, this element localized in closer proximity to the PAX6 promoter in species like X. tropicalis (60kb distance from the TSS) or Danio rerio (15kb distance from the pax6a TSS) (Fig. S3).
A) This UCNE displays reporter activity in the human embryonic forebrain (6/9 embryos) as well as in other structures including eye (3/6 embryos) and neural tube. Figure obtained from the VISTA enhancer browser (hs855). B) Zebrafish enhancer assays at 3 different time points (1 dpf, 2 dpf, 3 dpf) showing GFP-positive tissues (optic fissure -1 dpf-, lens cells -2,3 dpf- and forebrain -2,3 dpf-; white arrows). Abbreviations. dpf: days post-fertilization.
In vivo enhancer assays in zebrafish display retinal activity for the active UCNE upstream of PAX6
Given the putative retinal activity suggested by the VISTA enhancer assay for this UCNE, we performed zebrafish enhancer assays to gain further insights into its regulatory range of action. We found consistent reporter expression in several regions, including the forebrain, hypothalamus/otic vesicle, somites, nosepit and eye. Importantly, we observed that the GFP expression in the eye initiated in a small fraction of the embryos at 2 dpf (4/22 GFP+ embryos) increasing to a larger proportion of the embryos by3 dpf (21/22 GFP+ embryos) and eventually disappeared from 4dpf (Fig. 5B). A detailed overview of the results of these assays can be found in Supplementary Table 10. Altogether, these observations are consistent with the epigenomic characterization based on retinal datasets and provides further evidence of a potential cis-regulatory role of this UCNE on PAX6 expression.
DISCUSSION
Since the landmark study of Bejerano et al. (Bejerano et al., 2004) almost twenty years ago, ultraconservation of non-coding elements and their paradoxical functional (in)dispensability (Ahituv et al., 2007; Chiang et al., 2008; Dickel et al., 2018; Drake et al., 2006; Snetkova, Pennacchio, et al., 2021) have equally stirred controversy and fascination among scientists from diverse disciplines. A recent study demonstrated that UCNEs with robust enhancer activity during embryonic development appear to be unexpectedly resilient to mutation (Snetkova, Pennacchio, et al., 2021). There are, however, clear instances in which variation within ultraconserved CREs can be a driver of human disease (Benko et al., 2009; Bhatia et al., 2013; Ghiasvand et al., 2011; Klopocki et al., 2008; Kvon et al., 2020; Lettice et al., 2003; Martínez et al., 2010; Plaisancié et al., 2018; Wieczorek et al., 2010). Despite the substantial body of literature on these intriguing elements, thus far there are no comprehensive studies based on the integration of multi-omics to map the regulatory capacity of UCNEs in a specific tissue or cells, and on human genetic data to assess their contribution to disease. Apart from a few well-known examples, variants in CREs are an underrepresented cause of Mendelian diseases. A major obstacle hampering their identification is the need to define the search space of the CREs they affect in a tissue- and even cell-type specific manner. Here we set out to address this challenge in human retina, of which the cis-regulatory architecture has been well-studied (Cherry et al., 2020; Marchal et al., 2022; Thomas et al., 2022) and for which there is emerging cis-regulatory variation implicated in disease (Bhatia et al., 2013; Ghiasvand et al., 2011; Plaisancié et al., 2018; Van de Sompele et al., 2022).
Although ENCODE provides a well-annotated inventory of cCREs, its retinal datasets do not include the earliest stages of RPC differentiation. To overcome this and to incorporate cell-type specificity information, we integrated scATAC-seq, which allowed us to identify 1,487 UCNEs characterized by accessible chromatin in at least one major stage of retinal development. We also interrogated the epigenetic landscape of these elements by analyzing histone modification patterns associated with e.g. active/poised enhancers (H3K27ac, H3K4me1, H3K4me2) (Creyghton et al., 2010; Rada-Iglesias et al., 2011; Santos-Rosa et al., 2002) and highly packed chromatin (H3K27me2 and H3K27me3) (Juan et al., 2016). This analysis provided a useful approach to trace the activity of elements that are active at a specific stage during retinal development. In total, 111 UCNEs were found to display active enhancer marks during retinal development. Similarly to ENCODE, the analyzed ChIP-seq datasets do not cover the earliest stages and hence these 111 elements are more likely to be related to stages of differentiation corresponding to later-born cell types (i.e., rod photoreceptors, bipolar cells, Müller glial, and some varieties of amacrine cells) (Rapaport et al., 2004). Overall, our data integration spans all the major stages of retinogenesis and thus provides a comprehensive framework for the systematic characterization of ultraconserved retinal CREs.
Establishing an association between cCREs and their putative target genes is essential for the interpretation of genomic variation that could disrupt cell type-specific binding sites of TFs and/or long-range chromatin contacts that can lead to ectopic expression of genes (de Bruijn et al., 2020; Lupiáñez et al., 2015). Here we combined GREAT (McLean et al., 2010) with retinal expression and Hi-C derived retinal data to assign a range of action to candidate cis-regulatory UCNEs. In total, 594 retina-expressed genes can be under UCNE cis-regulation; interestingly most of the UCNEs were found to be distributed in distal position from the TSS of their assigned target genes, as observed for enhancers promoting expression of developmental genes (Benko et al., 2009; Long et al., 2020; Pachano et al., 2022). In this regard, a large fraction of the identified ultraconserved putative enhancers clusters around key developmental genes (e.g. MAB21L2, OTX2, PAX6, SOX2, ZEB2) known to be controlled by complex regulatory landscapes (Polychronopoulos et al., 2017). Besides, 45 of these genes are implicated in rare eye diseases. Interestingly, coding variants in these genes lead in most cases to syndromic developmental phenotypes with other manifestations apart from the ocular ones as collated in the G2P database 23. As has been shown before, the phenotype caused by a coding mutation of a developmental gene can be different from the phenotype caused by a mutation in a CRE controlling spatiotemporal expression of this gene. This is exemplified by the PRDM13 locus, for which bi-allelic coding PRDM13 variants result in hypogonadotropic hypogonadism and perinatal brainstem dysfunction in combination with cerebellar hypoplasia (Whittaker et al., 2021) while cis-regulatory variants in retinal CREs leads to NCMD, a developmental macular disease (Van de Sompele et al., 2022). From our analyses we could identify a cCRE for PRDM13 that displays an open chromatin context in early RPCs and developing amacrine cells, which are precisely the retinal cell types in which PRDM13 is expressed (Goodson et al., 2018; Watanabe et al., 2015). Although no disease associations have been established thus far for other genes for which we identified candidate cis-regulatory UCNEs, some of them are known to play important roles in retina development, e.g., NFIB (Clark et al., 2019), and therefore represent potential target regions for non-coding variants contributing to missing heritability.
A limitation of this analysis is the usage of a pre-defined set of genomic regions, in particular UCNEs, as this database is static and rarely updated (Polychronopoulos et al., 2017). Our approach could thus be expanded using more flexible criteria, thus including other less conserved, albeit disease-relevant loci (Ghiasvand et al., 2011). Nonetheless, the advantage of working with highly conserved genomic regions is the availability of already validated experimental data (Visel et al., 2007); additionally, as shown here, the integration of tissue-specific epigenomic data provides a robust framework to identify potential bona fide enhancers and their corresponding target genes, and clinically relevant variants therein.
To further investigate the contribution of genomic variation within candidate cis-regulatory UCNEs to disease, we mined WGS data from a sub-cohort of 100,000 Genome Project Genomics England participants affected by different rare eye diseases. As reported previously (Katzman et al., 2007; Snetkova, Pennacchio, et al., 2021), we observed a substantial depletion of common variation across these regions when compared to a background comprising randomly selected genomic regions. This apparently strong purifying selection has been proven difficult to reconcile with recent findings demonstrating remarkable resilience of UCNEs to variation (Snetkova, Ypsilanti, et al., 2021). Moreover, it has been suggested that weaker but uniform levels of purifying selection across hundreds of bases and different species could bring together these otherwise contradictory observations and explain why rare variants are not significantly depleted within UCNEs (Dukler et al., 2022). Our primary search space for variants with potential functional effects comprised 178 ultrarare variants located within 84 retina-active UCNEs associated with 29 genes. Interestingly, most of them were predicted to have a likely functional and/or deleterious effect, which could be explained by a skewed cumulative importance towards evolutionary conservation-related features in the predictive models (Rogers et al., 2018; Schwarz et al., 2019; Shi et al., 2021; Zhou & Troyanskaya, 2015).
Out of all variants found, one ultrarare variant (chr11:31968001T>C) within a UCNE displaying open chromatin in progenitor ganglion cells and linked to PAX6 was further dissected. This variant was exclusively found within the studied family, displaying isolated foveal abnormalities, and initially solved with an ultrarare missense variant in CFH, which could not fully explain the clinical presentation (Raychaudhuri et al., 2011; Taylor et al., 2019). As no other likely pathogenic variants were found within relevant loci (Ehrenberg et al., 2021; Kuht et al., 2020, 2022; Small et al., 2016; Van de Sompele et al., 2022), and the observed phenotypes of the affected individuals match within the PAX6 disease spectrum, mis-regulation of PAX6 expression cannot be excluded as a pathogenetic mechanism.
Genetic defects of PAX6 have been found in aniridia, a pan-ocular disorder characterized by the absence or hypoplasia of the iris, nystagmus and foveal hypoplasia (Cunha et al., 2019), the latter comprising thinning of macular inner and outer retinal layers consistent with misdirected foveal development (Pedersen et al., 2020). Recently, a phenotype characterized by isolated foveal hypoplasia with nystagmus has been also linked to PAX6 variation (Lima Cunha et al., 2021). Given the known inter- and intra-familial phenotypic variability observed in PAX6-associated disorders (Dubey et al., 2015; Yokoi et al., 2016), variable expressivity cannot be discarded for the identified variant. Thus far, in terms of regulatory variation in CREs, only a single SNV located within the ultraconserved SIMO element has been associated with PAX6-disease. This subtle change (SNV), identified de novo in an individual with aniridia and foveal hypoplasia, was found to disrupt an autoregulatory PAX6 binding site (Bhatia et al., 2013). Importantly, we could establish in vivo, using the zebrafish as animal model, a developmental expression pattern in the eye driven by this UCNE in tissues for which PAX6 expression is relevant (Ashery-Padan et al., 2000; Collinson et al., 2003; Georgala et al., 2011; Kimura et al., 2005; Pituello et al., 1999) and within a time window consistent with the period in which the zebrafish retina has become fully laminated (Morris & Fadool, 2005; Schmitt & Dowling, 1999). However, to confirm the association of this UCNE and impact of the identified variant on PAX6 expression, further functional assays are needed. The phenotypic consequence of the variant itself, nevertheless, constitutes a limitation for modelling since currently no suitable models, including retinal organoids, fully recapitulate foveal patterning (Hussey et al., 2022).
Overall, our work is exemplar for how the wealth of publicly available multi-omics data can be used to exploit the regulatory capacity of UCNEs in a tissue- and cell type-specific way. As demonstrated here, UCNEs can represent understudied regions of non-coding variation underlying missing heritability in Mendelian diseases. With the increasing implementation of WGS in rare disease research and diagnosis, the delineation of tissue and cell type-specific CREs will be a prerequisite to identify and fully interpret the pathogenic nature of non-coding variants. With this study, we have illustrated how the creation of a comprehensive set of functionally annotated UCNEs in a target tissue can represent a powerful initial strategy to narrow down the variant search space, particularly for developmental phenotypes.
METHODS
Integration of UCNEs with bulk and single-cell epigenomic, regulatory and transcriptional datasets from human developing and adult retina
The 4,135 genomic regions defined as UCNEs (Dimitrieva & Bucher, 2013) were used as the basis for the integration of multiple publicly available multi-omics datasets derived from embryonic and adult human retina. More specifically, to evaluate the potential function of UCNEs as cis-regulatory modules, we made use of DNase-seq (Abascal et al., 2020), ChIP-seq of histone modifications (Aldiri et al., 2017), and single-cell (sc) ATAC-seq (Thomas et al., 2022) derived from retinal tissue collected at distinct stages of development. Furthermore, to correlate potential cis-regulatory activity with gene expression, we mined both bulk (Hoshino et al., 2017) and scRNA-seq (Thomas et al., 2022) generated at stages overlapping or extending the ones of the epigenomic datasets. Additionally, as a last layer of functional characterization, we integrated into our analyses the experimental data of the VISTA Enhancer Browser (Visel et al., 2007), which allows the classification of candidate regulatory elements based on in vivo enhancer reporter assays tested in transgenic mice at embryonic day 11.5 (Pennacchio et al., 2006).
Identification of candidate cis-regulatory UCNEs in retina
Data generated by scATAC-seq of embryonic (53, 59, 74, 78, 113, and 132 days) and adult (25, 50, and 54 years old) human retinal cells were obtained (GSE183684) and imported into R (v4.0.5). Count matrices were processed using the ArchR single-cell analysis package (v1.0.1) (Granja et al., 2021) as described in Thomas et al., 2022(Thomas et al., 2022) with minor modifications. Briefly, the total number of cells after filtering out doublets was 61,313. Single-nucleus RNA-seq of the same tissue and timepoints (GSE183684) were integrated using the unconstrained integration method. Both datasets were then used to assign retinal cell class identities to the different clusters and subsequent peak calling. BigWig files from each annotated cell cluster were extracted and converted into bedGraph files using bigWigtoBedGraph UCSC utility; peak detection was performed using bdgpeakcall (MACS2.2.7.1) (Zhang et al., 2008) with default parameters and a value of 0.4 as cut-off. Similarly, bigWig files corresponding to histone modification patterns (H3K27Ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9/14Ac, H3K9me3, PolII, CTCF) were retrieved (GSE87064) and converted into bedGraph files; in this case, the cut-off value for peak calling by bdgpeakcall was set at 20. Additionally, the subset of ENCODE DNase hypersensitivity sites (rDHSs) identified in embryonic retina (74-85, 89, 103-105 days) were obtained (ENCSR786VSQ, ENCSR666FML, ENCSR632UXV) and elements featured by Low-DNase filtered out. Finally, to identify putative cis-regulatory UCNEs, we screened for overlaps between UCNEs and the retrieved sc-ATAC/ChIP/DNase-seq peaks using bedtools intersect (BEDTools (2.30.0)(Quinlan & Hall, 2010) with default parameters; of note bedtools window including a ±250bp-long window was used when overlapping the ChIP-seq peaks to characterize more broadly the chromatin status in the vicinity of the accessible UCNEs (Fig. S3).
Identification and characterization of target genes under active UCNE regulation
To assign potential target genes to the identified active UCNEs we used the Genomic Regions Enrichment of Annotations Tool (GREAT) (McLean et al., 2010). Briefly, each gene in the genome was assigned a basal regulatory domain (5kb and 1kb upstream and downstream of the transcription start site {TSS}, respectively) with an extension of up to 1Mb to the nearest gene’s basal domain. Curated regulatory domains were also included. Each active UCNE was then associated with all genes whose regulatory region it overlaps.
Potential target genes were subsequently filtered based on retinal expression. To do so, we retrieved RNA-seq paired-end FASTQ files derived from fetal retina samples (52 to 136 days post-conception) characterized in Hoshino et al. (Hoshino et al., 2017). Transcripts were quantified through pseudo-alignment by Kallisto (v.0.46.1)(Bray et al., 2016) using default parameters for both index build (derived from the Ensembl human release 101) and transcript abundance estimation. A custom R script was then used to filter out target genes exhibiting no expression (TPM<0.5) at any of the interrogated stages. Additionally, to evaluate the reliability of the regulatory domains assigned by GREAT to these target genes, we integrated the 2,948 retinal TADs described by Marchal et al. (Marchal et al., 2022). To characterize further the putatively regulated target genes, we made use of the integration of the scATAC-seq gene score and the scRNA-seq gene expression matrices generated previously. For each target gene, its expression was ranked by percentile of expression across all clusters. A gene expression signature was then assigned by retrieving the cluster identities exhibiting an expression value above the 80th percentile threshold.
Finally, to identify overrepresented Gene Ontology (GO) terms and infer possible enriched pathways among the potential target genes under UCNE regulation, GO enrichment analyses were performed using Enrichr (Xie et al., 2021).
Interrogation of WGS data in a rare eye disease cohort
Active UCNEs associated with genes expressed in retina were filtered further based on their implication in disease as retrieved from the comprehensive gene-disease pairs and attributes list provided by G2P (Eye and Developmental Disease –DD– Panels; 2022-03-17) (Lenassi et al., 2021; Thormann et al., 2019) extended with the NCMD-associated IRX1 locus(Small et al., 2016). To assess the contribution of genomic variation within these loci to disease, an analysis was performed to detect small variants (SNVs, and indels < 50bp), large structural variants (SVs) including copy number variants (CNVs) overlapping these disease-gene associated UCNE loci through query of a sub-cohort of participants with rare eye disease phenotypes (n = 3,220) from the 100,000 Genomes Project (100KGP, Genomics England).
Retrieved variants were annotated with VEP (v.107). As a first filter, only variants with filtering allele frequency <0.5% and no observed homozygotes in gnomAD v3.1 were considered for further assessment. Variants were evaluated for their potential pathogenic/modifying effect using in silico prediction tools focusing on transcription factor (TF) binding site disruption -QBiC-Pred- and chromatin state effects -DeepSea, CARMEN, FATHMM-XF, RegulationSpotter. Furthermore, we annotated these variants with our integrated analyses in order to evaluate them within their regulatory context and potential target genes.
For each candidate variant, we compared the similarities between the participant phenotype (HPO terms) and the ones known for its target gene through literature search and clinical assessment by the recruiting clinician when possible. Finally, for each candidate variant identified in participants whose cases were not solved through 100KGP, a variant screening of known retinal disease genes (Genomics England Retinal Disorder panel)(Martin et al., 2019) was performed to discard (likely) pathogenic variants, both SNVs and SVs, that could provide an alternative molecular diagnosis. For each instance for which only the UCNE variant remained as candidate, we placed a clinical collaboration request with Genomics England.
Targeted sequencing and reverse phenotyping
We performed segregation testing of two ultrarare SNVs initially identified in three affected individuals of a 3-generation family displaying autosomal dominant foveal abnormalities, namely an SNV in a candidate cis-regulatory UCNE located upstream of PAX6 and a missense CFH variant. Genomic DNA was extracted using Oragene-DNA saliva kits (OG-500, DNA Genotek) according to manufacturer’s instructions. Targeted sequencing of the variants was performed on genomic DNA by PCR amplification followed by Sanger sequencing using the BigDye Terminator v3.1 kit (Life Technologies). Primer sequences can be found in Supplementary Table 9.
Following their genetic assessment, four members of this 3-generation family were clinically re-evaluated. The examination included visual acuity assessment, slit-lamp examination for anterior and posterior segment anomalies, and intraocular pressure measurement. Detailed imaging involving ultra-wide field fundus photography, ultra-wide field autofluorescence imaging and optical coherence tomography (OCT) was performed.
Generation of in vivo reporter constructs and functional characterization of candidate cis-regulatory UCNE upstream of PAX6 using enhancer assays in zebrafish embryos
Primers were designed to amplify the sequence of the candidate cis-regulatory UCNE located upstream of PAX6 from human genomic DNA (Roche). The PCR product was then cloned into the E1b-GFP-Tol2 enhancer assay vector containing an E1b minimal promoter followed by the Green Fluorescent Protein (GFP) reporter gene(Li et al., 2010) by restriction-ligation cloning. The primer sequences can be found in Supplementary Table 9. The recombinant vector containing the cCRE-PAX6-UCNE was then amplified in One Shot TOP10 Chemically Competent E. coli cells (Invitrogen) and purified using the QIAprep Spin Miniprep Kit (Qiagen). The sequence of the insert was confirmed by Sanger sequencing using the BigDye Terminator v3.1 kit (Life Technologies). The constructs were then microinjected into the yolk of at least 70 wild-type zebrafish embryos at single-cell stage using the Tol2 transposase system for germline integration of the transgene according to Bessa et al. (Bessa et al., 2009) with minor modifications. As readout, GFP fluorescence was observed and photographed with a Leica M165 FC Fluorescent Stereo Microscope (Leica Microsystems) and its localization annotated at 1, 2 and 3 days post fertilization (dpf) to evaluate enhancer activity.
COMPETING INTEREST STATEMENT
The authors declare that they have no competing interests.
FUNDING
This work was supported by the Ghent University Special Research Fund (BOF20/GOA/023) (EDB); H2020 MSCA ITN grant (No. 813490 StarT) (EDB, FC, MB), EJPRD19-234 Solve-RET (EDB). EDB is a Senior Clinical Investigator (1802220N) of the Research Foundation-Flanders (FWO); VLS and ADR are an Early Starting Researcher of StarT (grant No. 813490). EDB is member of ERN-EYE (Framework Partnership Agreement No 739534-ERN-EYE).
Data Availability
The data that support the findings of this study are available within the Genomics England (protected) Research Environment but restrictions apply to the availability of these data, as access to the Research Environment is limited to protect the privacy and confidentiality of participants. De-identified data as well as analysis scripts are, however, available from the authors upon reasonable request. Extended data generated in this study are available in the supplementary materials.
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
The 100,000 Genomes Project Protocol has ethical approval from the HRA Committee East of England – Cambridge South (REC Ref 14/EE/1112). This study was registered with Genomics England under Research Registry Projects 465.
AUTHORS’ CONTRIBUTIONS
V.L.S.: Conception and project design, acquisition of data, analysis and interpretation of data, drafting and revising the manuscript.
A.D.R.: Conception and project design, acquisition of data, analysis and interpretation of data, drafting and revising the manuscript.
R.M: Acquisition of data, analysis and interpretation of data, revising the manuscript. G.E: Acquisition of data, revising the manuscript.
F.C: Acquisition of data, revising the manuscript.
M.B: Project supervision, acquisition of data, revising the manuscript.
A.W: Acquisition of data, analysis and interpretation of data, revising the manuscript.
E.D.B.: Conception and project supervision, acquisition of data, analysis and interpretation of data, drafting and revising the manuscript.
CONSENT FOR PUBLICATION
Not applicable. We present only de-identified data.
AVAILABILITY OF DATA AND MATERIALS
The data that support the findings of this study are available within the Genomics England (protected) Research Environment but restrictions apply to the availability of these data, as access to the Research Environment is limited to protect the privacy and confidentiality of participants. De-identified data as well as analysis scripts are, however, available from the authors upon reasonable request. Extended data generated in this study are available in the supplementary materials.
SUPPLEMENTARY MATERIAL
A) Pedigree of the family, indicating segregation of the PAX6-associated UCNE variant (V1) and of the CFH variant (V2). B) Ophthalmological assessment (fundus examination, FAF, and OCT) revealed tessellated fundi with atrophic areas at the macula involving the fovea in II-1, III-1 and III-2, while I-1 displayed only an area of pallor inferior to the left fovea and corresponding hyper-autofluorescence (see Table S7).
Table S1. Characterization of UCNEs based on: UCNE ID – UCNE name extracted from the original library elaborated by Dimitrieva et al., 2013; g.coordinates (hg38) – genomic coordinate of the UCNE; Target gene – gene association extracted from GREAT analysis; Bulk RNA-seq (target gene mean expression) – expression value (TPM) obtained from the RNA-seq based on samples related to the human retinal development from Hoshino et al., 2018; Bulk RNA-seq (target gene expression rank) – expression rank obtained from the RNA-seq based on samples related to the human retinal development from Hoshino et al., 2018; Bulk RNA-seq (target gene maximum expression) – maximum values of gene expression (TPM) obtained from the RNA-seq based on samples related to the human retinal development from Hoshino et al., 2018; Bulk RNA-seq (stage of maximum target gene expression) – stage where the maximum levels of gene expression where observed, obtained from the RNA-seq based on samples related to the human retinal development from Hoshino et al., 2018; scRNA-seq target gene expression – characterization of the cell identity where the expression pattern of the target gene is expected, to perform correlation between enhancer-target gene, obtained from the scRNA-seq based on samples related to human retinas from Thomas et al., 2022; DNase-seq (stage) – specific stages where the open chromatin context was identified, obtained from the DNase-seq experiments related to the retinal development from ENCODE; scATAC-seq – cell clusters where the peak identification was retrieved; Retinal TAD Support – qualitative estimation of the association UCNE-target gene; ChIP-seq - epigenomic marks, CTCFs and PolII peaks observed within the genomic context of the characterized UCNE; VISTA (Element ID) – assessment of the inclusion of the UCNE element within the VISTA enhancer browser and its unique ID; VISTA (Assay result) – reporter assay result from the VISTA enhancer browser; VISTA (Expression pattern) – includes the tissues where the expression pattern for the reporter assay was observed.
A) UMAP of the analyzed single cell dataset (Thomas et al., 2022).B) Feature plots (scATAC-seq and scRNA-seq) for the PAX6 locus.
Location of the characterized UCNE (in gray) linked to PAX6, found exclusively in the pax6a locus of zebrafish at a shorter distance (≈15kb) when compared to the human PAX6 locus (≈150kb) with respect to the PAX6 transcription start site.
A) Peak identification within the NR2F1 locus for the following marks: H3K4me1, H3K4me2, H3K4me3, CTCF, RNAPolII and H3K27ac across the different time points.
Table S2. Characterization of UCNEs based on chromatin accessibility. UCNE_id – Ultraconserved non-coding element name; Stage – retinal development stages where a peak from DNase is retrieved; #Stages – number of stages for the specific element where it was retrieved; scATAC-seq – cell cluster where the peak identification was retrieved; #Peaks – total number of cell clusters where the peak identification was retrieved.
Table S3. Characterization of UCNEs based on epigenomic marks.
Table S4. Disease association of UCNE target genes. Gene; UCNE_id; g.coordinates; #OMIM disease name; confidence category; allelic requirement; mutation consequence; phenotypes; organ specificity list; PMIDs.
Table S5. Overview of the eye disease sub-cohort of 100,000 Genomes Project (Genomics England). Normalized Disease Group; Normalized Disease Sub Group; Normalized Specific Disease; Participant Count.
Table S6. Variants retrieved within the UCNEs that are linked to an eye or retinal disease phenotype. It includes the gene, the UCNE_id or name and the retrieved variant (hg38).
Table S7. Phenotypic description of the studied family segregating autosomal dominant foveal abnormalities. Participant information (ID, origin and sex), the molecular findings (carriers of the V1 (chr11:31968001T>C) and/or V2 (CFH, c.1187A>G (p.Asn396Ser)), diagnosis and age of onset (Clinical diagnosis, Age at diagnosis, Age of onset visual field loss, Age of onset BCVA loss) and clinical findings (Age at examination, Slit-lamp examination, BCVA OD/OS (LogMAR), Fundus Examination, Visual field, FAF, OCT, Other findings).
Table S8. Analysis of the TFBS motif disruption potentially exerted by the chr11:31968001T>C variant (qBiC-PRED) and retrieved variants within genes associated with macular developmental defects and foveal hypoplasia (IRX1, PRMD13, SLC38A8, GPR143, FRMD7 and AHR). All coordinates are in the hg38 assembly.
Table S9. Set of primers used in this study for cloning and segregation.
Table S10. Overview of the transgenic enhancer assays in zebrafish for the PAX6-associated UCNE.
ACKNOWLEDGEMENTS
This research was made possible through access to the data and findings generated by the 100,000 Genomes Project. The 100,000 Genomes Project is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The 100,000 Genomes Project is funded by the National Institute for Health Research and NHS England. The Welcome Trust, Cancer Research UK and the Medical Research Council have also funded research infrastructure. The 100,000 Genomes Project uses data provided by patients and collected by the National Health Service as part of their care and support. We acknowledge Chris Inglehearn (Leeds Institute of Medical Research, School of Medicine, University of Leeds, Leeds, UK) for his helpful advice and comments for the manuscript. We also would like to thank the Zebrafish Facility Ghent (ZFG) Core at Ghent University. Hanna De Saffel, Quinten Mahieu, Angelika Jürgens and Lies Vantomme are thanked for their excellent technical assistance.