Abstract
Background Primary carnitine deficiency (PCD) is a rare autosomal recessive disorder of the carnitine cycle and carnitine transport caused by mutations in the SLC22A5 gene. The prevalence of PCD is unclear. This study aimed to estimate the carrier frequency and genetic prevalence of PCD using Genome Aggregation Database (gnomAD) data.
Methods The pathogenicity of SLC22A5 variants was interpreted according to the American College of Medical Genetics and Genomics (ACMG) standards and guidelines. The minor allele frequency (MAF) of SLC22A5 gene disease-causing variants in 807,162 unique individuals was examined to estimate the global prevalence of PCD in five major ethnicities: African (afr), Admixed American (amr), East Asian (eas), Non-Finnish European (nfe) and South Asian (sas). The global and population-specific carrier frequencies and genetic prevalence of PCD were calculated using the Hardy–Weinberg equation.
Results In total, 195 pathogenic/likely pathogenic variants (PV/LPV) were identified according to ACMG standards and guidelines. The global carrier frequency and genetic prevalence of PCD were 1/88 and 1/31,260, respectively.
Conclusions The prevalence of PCD is estimated to be 1/30,000 globally, with a range of between 1/20,000 and 1/70,000 depending on ethnicity.
Introduction
Systemic primary carnitine deficiency (OMIM: 212140, ORPHA: 158, GARD: 5104) is a rare inborn error of fatty acid metabolism [1]. It presents with a broad spectrum of clinical signs and symptoms, of which cardiac symptoms (predominantly cardiomyopathy) are the most prevalent. Neurological, hepatic and metabolic symptoms also occur. Symptoms predominantly develop in early childhood; however, adult onset can occur, and some patients suffer a severe event without any preceding symptoms. Newborn screening (NBS) can detect most cases of PCD, including newborns and mothers who remain asymptomatic [2]. The diagnosis can be suspected on newborn screening and is established through genetic and/or functional (carnitine transporter activity) testing [3]. Primary carnitine deficiency can be treated with L-carnitine supplementation [4].
PCD is an autosomal recessive disorder caused by homozygous or compound heterozygous pathogenic mutations in the SLC22A5 gene [5–7], which encodes the plasma membrane sodium-dependent high-affinity carnitine transporter OCTN2 [8]. OCTN2 is necessary for L-carnitine transport across the plasma membrane, and L-carnitine is necessary for transporting long-chain fatty acids into the mitochondria for fatty acid β-oxidation. The SLC22A5 gene, located on chromosome 5q31.1, spans 10 exons and comprises 25,903 base pairs, encoding the OCTN2 protein of 557 amino acids. More than 200 SLC22A5 variants have been identified in patients with PCD since 1999 [7, 9–11].
The exact prevalence of primary carnitine deficiency (PCD) is unknown and varies with ethnicity. Newborn screening indicates an incidence of 1/10,000 to 1/20,000 in China [12–14] and 1/20,000 to 1/70,000 newborns in Europe and the USA [1], while the estimated incidence in Japan is 1/40,000 births [11]. The prevalence of PCD in the Faroe Islands is 1/297, which is the highest reported in the world [15].
As high-throughput sequencing technology has evolved, re-evaluation guidelines for interpreting and classifying the pathogenicity of identified variants have been implemented. In addition, large-scale population databases have become widely available and can be used for assessment of genetic variants in rare diseases. In fact, for several rare diseases, there is evidence that these databases have improved the interpretation and classification of variants in patients with monogenic disease and allowed better prediction of which variants are likely to cause disease. The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects. The v4 data set (GRCh38) spans 730,947 exome sequences and 76,215 whole-genome sequences from unrelated individuals of diverse ancestries.
We attempted to obtain a more reliable estimate of the global prevalence and genetic spectrum of PCD from the Genome Aggregation Database (gnomAD) dataset using a well-established pipeline [16, 17]. Additionally, we aimed to generate a curated machine learning training dataset of SLC22A5 variants for pathogenicity classification and interpretation.
Methods
Simulation of single nucleotide variants and missense variants of the SLC22A5 gene
To simulate all single nucleotide variants (SNVs) in the SLC22A5 gene, we restricted the generated SNVs to the coding sequence (CDS) of the Matched Annotation from NCBI and EMBL-EBI (MANE) transcript and its corresponding protein [18], using the Variation Simulation Python package (https://github.com/liu-sun/VariationSimulation) with Human Genome Variation Society (HGVS) nomenclature [19]. HGVS notation was then translated to all possible variant IDs (refSNP, SPDI, VCF, etc.) using a Python client for the Ensembl REST API (https://pypi.org/project/ensemblrestpy/). Simulated missense, nonsense and synonymous variants were included in the subsequent variant annotation and curation.
All missense variants (SNVs and MNVs) in the SLC22A5 coding sequence (CDS) were simulated using the Variation Simulation Python package and annotated with state-of-the-art variant effect prediction scores: EVE [20], AlphaMissense [21], ESM1b [22], CADD [23] and PrimateAI-3D [24].
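The enumeration described above can be illustrated with a minimal sketch. It is not the Variation Simulation package itself; the function name `simulate_snvs` and the toy sequence are assumptions made for illustration, and only the standard genetic code is used.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def simulate_snvs(cds):
    """Yield (HGVS-style c. name, consequence) for every SNV in an in-frame CDS."""
    for pos, ref in enumerate(cds):
        offset = pos % 3
        ref_codon = cds[pos - offset:pos - offset + 3]
        for alt in "ACGT":
            if alt == ref:
                continue
            alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
            ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
            if alt_aa == ref_aa:
                consequence = "synonymous"
            elif alt_aa == "*":
                consequence = "nonsense"
            else:
                consequence = "missense"
            yield f"c.{pos + 1}{ref}>{alt}", consequence


# Hypothetical in-frame fragment (Gly-Arg-Trp), not the SLC22A5 sequence.
toy_cds = "GGTCGATGG"
snvs = list(simulate_snvs(toy_cds))
counts = Counter(consequence for _, consequence in snvs)
print(len(snvs), dict(counts))
```

Every coding position has three alternative bases, so a CDS of 557 codons (stop codon excluded) yields 557 × 9 = 5013 SNVs. A real annotation pipeline would additionally treat changes to the initiator codon as start-loss rather than missense.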
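As a rough sketch of how missense substitutions can be split into SNVs and MNVs, the example below enumerates all 19 alternative amino acids per codon and labels each by the minimum number of nucleotide changes needed. The function name, the simplified one-letter `p.` notation and the toy sequence are assumptions for illustration; counting here is at the amino-acid level, and the counting convention of the Variation Simulation package may differ.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
RESIDUES = sorted(set(AMINO_ACIDS) - {"*"})  # the 20 amino acids


def hamming(a, b):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))


def simulate_missense(cds):
    """Return (simplified p. name, 'SNV' or 'MNV') for every amino acid substitution."""
    results = []
    for start in range(0, len(cds), 3):
        ref_codon = cds[start:start + 3]
        ref_aa = CODON_TABLE[ref_codon]
        if ref_aa == "*":
            continue  # skip a terminal stop codon if present
        for alt_aa in RESIDUES:
            if alt_aa == ref_aa:
                continue
            # Fewest nucleotide changes that turn ref_codon into a codon for alt_aa.
            changes = min(hamming(ref_codon, codon)
                          for codon, aa in CODON_TABLE.items() if aa == alt_aa)
            kind = "SNV" if changes == 1 else "MNV"
            results.append((f"p.{ref_aa}{start // 3 + 1}{alt_aa}", kind))
    return results


# Hypothetical in-frame fragment (Gly-Arg-Trp), not the SLC22A5 sequence.
missense = simulate_missense("GGTCGATGG")
n_snv = sum(kind == "SNV" for _, kind in missense)
n_mnv = sum(kind == "MNV" for _, kind in missense)
print(len(missense), n_snv, n_mnv)
```

With 19 alternative residues per position, a 557-residue protein has 557 × 19 = 10583 possible missense substitutions in total.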
Identification and annotation of previously reported SLC22A5 disease-causing variants
To evaluate the genomic spectrum of PCD, a comprehensive literature search was performed to identify all previously reported disease-causing SLC22A5 gene variants. Searches were conducted in the literature databases PubMed, Scopus and Web of Science using combinations of the following search terms: primary carnitine deficiency (PCD), carnitine transporter defect (CTD), carnitine uptake deficiency (CUD), SLC22A5, OCTN2, mutations, variants, variations and mutants.
Two authors screened publications according to inclusion and exclusion criteria. Original case reports and newborn screening (NBS) studies reporting disease-causing variants of the SLC22A5 gene were included, and variants in the title, abstract, full text, tables, figures or supplementary material were extracted. Non-English-language case reports, articles, reviews, comments, editorials and letters, as well as cell-based assays and animal model studies, were excluded. Common SLC22A5 polymorphisms associated with common traits in genome-wide association studies were also excluded.
Reported SLC22A5 variants were also identified from LitVar [25], Ensembl Variation [26] and ClinVar [27]. Prediction and functional annotation for all SLC22A5 nonsynonymous single-nucleotide variants were compiled using Ensembl Variant Effect Predictor (VEP).
Splicing consequences of SLC22A5 synonymous single-nucleotide variants were predicted using SpliceAI [28] and Pangolin [29].
Identification and prediction of novel SLC22A5 loss of function variants
The gnomAD database (https://gnomad.broadinstitute.org/) was searched for novel loss of function (LoF) SLC22A5 variants that had not yet been reported, and protein-truncating variants (PTVs) were examined (frameshifts, stop codons, initiator codons, splice donors and splice acceptors).
Interpretation of the pathogenicity of SLC22A5 variants
The pathogenicity of variants was interpreted following the Clinical Interpretation of Sequence Variants protocol [30]. The pathogenicity of all missense, synonymous, in-frame indel, nonsense, frameshift and splicing SLC22A5 variants was classified according to the Standards and Guidelines of the American College of Medical Genetics and Genomics (ACMG) [31] and the ClinGen Sequence Variant Interpretation Working Group (SVI) recommendations [32–38], as implemented in the ClinGen Variant Curation Interface (VCI) [39].
For missense, in-frame indel and synonymous variants, we performed an additional literature search to curate in vitro or in vivo functional studies supporting a functional effect of these variants. Variants reported in a peer-reviewed journal were assigned PS3-level evidence and classified as likely pathogenic if they retained less than 20% of wild-type (WT) OCTN2 carnitine transport activity [40, 41].
Variants classified as pathogenic or likely pathogenic were included in the subsequent carrier frequency and prevalence calculation, whereas variants classified as benign, likely benign or of uncertain significance were excluded.
Annotation of variants with minor allele frequencies (MAFs)
Data for four consequence categories of SLC22A5 variants (predicted loss-of-function (pLoF) variants, missense/in-frame indel variants, synonymous variants and other variants) were downloaded directly from the gnomAD browser. Variants flagged by the loss-of-function transcript effect estimator (LOFTEE), such as variants near the end of transcripts, variants in non-canonical splice sites and multi-nucleotide variants (MNVs), were then manually filtered out.
The canonical SLC22A5 transcript (NM_003060.4 / ENST00000245407.8, 3,277 bp) and protein (NP_003051.1 / ENSP00000245407.3, 557 aa) were selected by Matched Annotation from NCBI and EMBL-EBI (MANE) [18].
The minor allele frequency (MAF) of pathogenic/likely pathogenic SLC22A5 variants was obtained from gnomAD for the following ethnic groups: African/African American (AFR), American Admixed/Latino (AMR), East Asian (EAS), Non-Finnish European (NFE) and South Asian (SAS).
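A minimal sketch of pooling per-population allele frequencies is shown below. The record layout (`ac` and `an` dictionaries keyed by population code), the variant identifiers and all counts are hypothetical; they do not reproduce the gnomAD export format or any real SLC22A5 data.

```python
POPULATIONS = ("afr", "amr", "eas", "nfe", "sas")

# Hypothetical pathogenic/likely pathogenic variants with per-population
# allele counts (ac) and allele numbers (an).
variants = [
    {"id": "variant_1",
     "ac": {"afr": 0, "amr": 1, "eas": 10, "nfe": 50, "sas": 2},
     "an": {"afr": 70000, "amr": 60000, "eas": 40000, "nfe": 1000000, "sas": 90000}},
    {"id": "variant_2",
     "ac": {"afr": 7, "amr": 0, "eas": 30, "nfe": 150, "sas": 0},
     "an": {"afr": 70000, "amr": 60000, "eas": 40000, "nfe": 1000000, "sas": 90000}},
]


def pooled_allele_frequency(records, population):
    """Sum AC/AN over variants for one population, skipping uncalled sites."""
    return sum(r["ac"][population] / r["an"][population]
               for r in records if r["an"][population] > 0)


pooled = {pop: pooled_allele_frequency(variants, pop) for pop in POPULATIONS}
for pop, q in pooled.items():
    print(f"{pop}: pooled allele frequency = {q:.6f}")
```

Computing AC/AN per variant, rather than dividing summed counts by a single allele number, allows for sites where the number of genotyped alleles differs between variants.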
Carrier frequency and genetic prevalence calculation
The genetic prevalence and carrier frequency of PCD were calculated based on the Hardy–Weinberg equation. For a monogenic autosomal recessive disorder, the genetic prevalence is given by $[1-\prod_i(1-q_i)]^2$, where $q_i$ is the minor allele frequency (MAF) of the $i$-th pathogenic/likely pathogenic variant. Because each $q_i$ is small, the genetic prevalence is approximately $(\sum_i q_i)^2$, and the carrier frequency is $2(1-\sum_i q_i)\sum_i q_i \approx 2\sum_i q_i$. The observed allele frequency of each pathogenic/likely pathogenic variant in the gnomAD v4.1 database was used as the direct estimator of $q_i$, so the disease prevalence can be estimated as $[1-\prod_i(1-AC_i/AN_i)]^2 \approx (\sum_i AF_i)^2$, where $AC_i$ is the allele count, $AN_i$ is the allele number and $AF_i = AC_i/AN_i$ is the allele frequency [16, 17].
The Python statistics package statsmodels and the scientific computing package NumPy were employed to calculate the 95% confidence interval (95% CI) for the binomial proportion of the carrier frequency and the genetic prevalence using the Clopper–Pearson interval. Graphics were plotted using the Python graphics packages seaborn and matplotlib [16, 17].
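The Hardy–Weinberg expressions above can be sketched in a few lines of Python. The function names are illustrative assumptions; the pooled allele frequency is the global value reported in the Results, used here only to show how the carrier frequency and prevalence follow from it.

```python
import math


def genetic_prevalence(qs):
    """Exact form: [1 - prod_i(1 - q_i)]^2."""
    return (1 - math.prod(1 - q for q in qs)) ** 2


def genetic_prevalence_approx(qs):
    """Small-frequency approximation: (sum_i q_i)^2."""
    return sum(qs) ** 2


def carrier_frequency(qs):
    """2 * (1 - sum_i q_i) * sum_i q_i."""
    total = sum(qs)
    return 2 * (1 - total) * total


# Global pooled allele frequency reported in the Results section.
pooled_q = 0.005655974
prevalence = genetic_prevalence_approx([pooled_q])
carriers = carrier_frequency([pooled_q])
print(f"genetic prevalence: about 1 in {1 / prevalence:,.0f}")  # about 1 in 31,260
print(f"carrier frequency: about 1 in {1 / carriers:.1f}")      # about 1 in 88.9
```

For rare variants the exact and approximate forms differ only marginally, which is why the pooled sum of allele frequencies is usually sufficient.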
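For readers unfamiliar with the Clopper–Pearson interval, the sketch below computes it with the standard library only, by bisection on the binomial tail probabilities. It stands in for, and is not, the statsmodels routine used in the study; the function names are assumptions, and the direct summation is practical only for small sample sizes. How the interval was propagated to the prevalence is not specified in the text, so only a plain binomial proportion is shown.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation (small n only)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, target, increasing, tol=1e-12):
    """Solve f(p) = target on [0, 1] for a monotone function f."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper


# Hypothetical example: 6 carriers observed among 500 individuals.
low, high = clopper_pearson(6, 500)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

The interval is conservative by construction: its coverage is at least the nominal level, which makes it a common choice for rare events.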
Results
Simulation of SLC22A5 single nucleotide variants and missense variants
Simulation of single nucleotide variants (SNVs) in the SLC22A5 coding sequence generated 5013 variants, including 3633 missense, 162 nonsense and 1218 synonymous variants. Simulation of missense variants in the SLC22A5 coding sequence generated 10583 variants, including 3633 SNVs and 6950 multi-nucleotide variants (MNVs). All single nucleotide variants and multi-nucleotide variants are listed in the Science Data Bank (ScienceDB) online data repository. Three MNVs, c.35_36delinsAT (p.Gly12Asp), c.56_57delinsCT (p.Arg19Pro) and c.1324_1325delinsAT (p.Ala442Ile), were previously reported [41–45].
Identification of SLC22A5 disease-causing variants
Comprehensive retrieval of PCD disease-causing variants identified 2209 variants in the gnomAD v4.1 database, including 108 pLoF variants, 802 missense/in-frame indel variants, 273 synonymous variants and 1026 other variants. After filtering, 94 pLoF variants remained. For synonymous, other and in-frame variants, we retained only previously reported variants: c.453G>A (p.Val151=) [41, 46], c.-149G>A (rs57262206) [47, 48], and p.Phe23del [49] and p.Leu394del [41], respectively. The curation pipeline for disease-causing variants is shown in Fig. 1.
Fig. 1 Flowchart of curation and classification of SLC22A5 pathogenic/likely pathogenic variants and prevalence calculation.
In total, 195 disease-causing variants in the SLC22A5 gene, including 98 missense variants, 2 in-frame variants, 1 UTR variant and 94 protein-truncating variants (PTVs), were classified as pathogenic/likely pathogenic according to the standards and guidelines of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (missense variants are shown in Table 1; all variants are listed in the Science Data Bank (ScienceDB) online data repository).
gnomAD allele frequencies were available for 178 of the 195 disease-causing variants, including 81 missense, 2 in-frame, 1 UTR and 94 pLoF variants. Pooling the allele frequencies of these variants gave a global allele frequency of 0.005655974, which is equivalent to a prevalence of 3.2 per 100,000 population (95% confidence interval (95% CI): [3.07, 3.33]), or about 1/31,260. The five major populations had distinct prevalences (Fig. 2 and Table 1), ranging from 1/20,000 to 1/70,000 depending on ethnicity.
Fig. 2 A. PCD genetic prevalence estimated from gnomAD allele frequencies. B. PCD carrier frequency among diverse populations. African/African American (afr), American Admixed/Latino (amr), East Asian (eas), Non-Finnish European (nfe) and South Asian (sas).
Discussion
PCD is a rare autosomal recessive disorder caused by homozygous or compound heterozygous mutations in the solute carrier family 22 member 5 (SLC22A5) gene on chromosome 5q31.1, which encodes the organic cation/carnitine transporter 2 (OCTN2) protein. PCD typically manifests in infancy, between the ages of 3 months and 2 years. Infants often present with hypoketotic hypoglycaemia, poor feeding, irritability, lethargy and hepatomegaly, which are triggered by fasting stress or common illnesses, including gastroenteritis and respiratory tract infections. Approximately half of the patients who present clinically have muscle hypotonia and progressive childhood cardiomyopathy, which can lead to heart failure. Anemia is occasionally observed in patients with this condition, as carnitine plays a role in red blood cell metabolism. In adults, the presentation is often associated with minor symptoms such as fatigue and decreased stamina. However, dilated cardiomyopathy, arrhythmias and sudden cardiac death (SCD) have also been reported, as have asymptomatic adults. During pregnancy, minor symptoms as well as cardiac arrhythmias can worsen.
The prevalence is uncertain and varies according to ethnicity. The estimated prevalence is 1/20,000 to 1/70,000 newborns in Europe and the USA, while the estimated incidence in Japan is 1/40,000 births. The prevalence of PCD in the Faroe Islands is the highest reported in the world (1/297). In this study, we sought to estimate the prevalence of PCD using 800k-scale population genome data and to deepen our understanding of SLC22A5 genetic variation. Using gnomAD data, the prevalence is 3.2 per 100,000 (1/31,260) in the global population, 3.6 per 100,000 (1/27,388) in the European population, 3 per 100,000 (1/34,143) in the East Asian population and 1.4 per 100,000 (1/71,269) in the South Asian population.
Multi-nucleotide variants (MNVs), defined as two or more nearby variants existing on the same haplotype in an individual, are a clinically and biologically significant consequence of genetic variation [50]. Previously, three MNVs in the SLC22A5 gene have been reported. Of the 10583 simulated missense variants, 3633 were SNVs and 6950 were MNVs.
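To relate the "1 in N" figures to rates per fixed population size, a short conversion sketch is given below. The dictionary keys and the function name are illustrative; the denominators are those quoted in the text.

```python
# "1 in N" genetic prevalence estimates quoted in the text.
one_in_n = {
    "global": 31260,
    "European": 27388,
    "East Asian": 34143,
    "South Asian": 71269,
}


def per_100k(n):
    """Convert a prevalence of 1 in n to cases per 100,000 population."""
    return 100_000 / n


rates = {population: per_100k(n) for population, n in one_in_n.items()}
for population, rate in rates.items():
    print(f"{population}: {rate:.2f} per 100,000")
```

A prevalence of 1 in 31,260 therefore corresponds to roughly 3.2 cases per 100,000 population.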
In addition, recent efforts in newborn screening (NBS) have highlighted the importance of second-tier genetic screening for better detection of PCD [3, 14, 51]. Traditional NBS methods, which measure free carnitine (C0) levels, have limitations due to the influence of maternal carnitine levels on the newborn. By incorporating genetic testing, the detection rate of PCD has improved. This combined approach has shown that the actual incidence of PCD may be higher than previously reported.
The prevalence of PCD in the Faroe Islands is the highest reported in the world (1:297) based on 26,462 individuals from the nationwide screening program for primary carnitine deficiency [15]. One study showed a strong association between sudden death and untreated PCD in the Faroe Islands, especially in females [52]. Another study showed PCD in adults can cause serious symptoms, but adult Faroese patients identified through a screening program were predominantly asymptomatic with a normal cardiac structure and function [53]. A 10-year follow-up in the Faroe Islands showed patients with primary carnitine deficiency treated with L-carnitine are alive and doing well more than 10 years after diagnosis [54].
PCD is on the first national list of rare diseases issued by China, and the prevalence of PCD in China has been estimated [12, 13]. The National Rare Diseases Registry System (NRDRS) has registered PCD cases [55]. We demonstrated the power and limitations of the 800k-scale population genome database for calculating the prevalence of rare diseases. Our estimates provide a useful comparison with newborn screening, and for other rare diseases for which screening data are not available, estimates based on genomic data can serve as a valuable reference [16, 17].
In conclusion, through a comprehensive analysis of genetic variation in SLC22A5, we expanded our recognition of disease-causing mutations to 195 variants. These data can be used as a training set for pathogenicity prediction of novel variants and genetic diagnosis of PCD.
Data Availability
The data are available in the Science Data Bank (ScienceDB) repository.