Abstract
The emergence of a heavily mutated SARS-CoV-2 variant (Omicron; B.1.1.529/BA.1/BA.2) and its rapid spread globally created public health alarms. Characterizing the mutational profile of Omicron is necessary to interpret its shared or distinctive clinical phenotypes with other SARS-CoV-2 variants. We compared the mutations of Omicron with prior variants of concern (Alpha, Beta, Gamma, Delta), variants of interest (Lambda, Mu, Eta, Iota and Kappa), and ∼1500 SARS-CoV-2 lineages constituting ∼5.8 million SARS-CoV-2 genomes. Omicron’s Spike protein has 26 amino acid mutations (23 substitutions, two deletions and one insertion) that are distinct compared to other variants of concern. Whereas the substitution and deletion mutations have appeared in previous SARS-CoV-2 lineages, the insertion mutation (ins214EPE) has not been previously observed in any other SARS-CoV-2 lineage. Here, we discuss various mechanisms through which the nucleotide sequence encoding for ins214EPE could have been acquired and highlight the plausibility of template switching via either the human transcriptome or prior viral genomes. Analysis of homology of the inserted nucleotide sequence and flanking regions suggests that this template switching event could have involved the genomes of SARS-CoV-2 variants (e.g. B.1.1 strain), other human coronaviruses that infect the same host cells as SARS-CoV-2 (e.g. HCoV-OC43 or HCoV-229E), or a human transcript expressed in a host cell that was infected by the Omicron precursor. Whether ins214EPE impacts the epidemiological or clinical properties of Omicron (e.g. transmissibility) warrants further investigation. There is also a need to understand whether human host cells are being exploited by SARS-CoV-2 as an ‘evolutionary sandbox’ for inter-viral or host-virus genomic interplay to produce new SARS-CoV-2 variants.
Introduction
A new SARS-CoV-2 variant with an extensively mutated Spike protein was first reported to the World Health Organization (WHO) from South Africa on November 24, 2021, with the first sample collected on November 9, 2021. This strain has since been denoted as the Omicron variant (WHO nomenclature) and B.1.1.529/BA.1/BA.2 (PANGO lineage)1. The rapid assessment of the variant by The Technical Advisory Group on SARS-CoV-2 Virus Evolution and classification of Omicron as a variant of concern by the WHO within 48 hours facilitated timely epidemiological surveillance. After its initial discovery, this variant rapidly spread across the globe and was detected in over 75 countries across six continents by December 16, 2021.2,3
Thoroughly characterizing the mutational profile of Omicron is the necessary first step to begin interpreting its shared or distinctive clinical phenotypes, sensitivity or resistance to existing vaccines, and whether Omicron-like variants that evolve in the future may have heightened virulence. Indeed, SARS-CoV-2 has evolved into different variants of concern and variants of interest through a combination of missense, deletion, insertion, and other mutations. For example, the D614G substitution in the Spike (S) protein, which emerged early and has been detected in nearly all SARS-CoV-2 genomes in GISAID since mid 2020, increases the replication capacity and infectivity of SARS-CoV-2.4,5 Other substitutions (e.g. E484K and E484A) have led to significant changes in the Spike-ACE2 binding affinity, and deletions (e.g. ΔY144) have modulated the effect of neutralizing anti-Spike antibodies.6–13 Insertion mutations have been less prevalent in SARS-CoV-2 evolution.14 However, one of the most functionally consequential mutations in the evolutionary history of SARS-CoV-2 till date was the “PRRA” Spike protein insertion in the S1/S2 cleavage site, which introduced the polybasic FURIN cleavage site that mimics the RRARSVAS peptide in human ENaC-alpha.15–19 This insertion serves as an important determination of SARS-CoV-2 transmission, at least in part by facilitating an endosome-independent entry pathway into respiratory epithelial cells which bypasses important innate antiviral responses.18 It is also mechanistically required for the syncytium-mediated lymphocytes death which may contribute to the lymphopenia that is often clinically observed in COVID-19 patients.20–23 This insertion is an important determinant for SARS-CoV-2 transmission The availability of 5.8 million SARS-CoV-2 genomes covering ∼1500 lineages from over 203 countries/territories in the GISAID database from the beginning of the pandemic gives an opportunity to characterize the mutational profile of the Omicron variant in comparison to other SARS-CoV-2 variants.
In this study, we compare the mutational profile of Omicron with all other SARS-CoV-2 lineages including the variants of concern and variants of interest. We highlight that Omicron’s Spike protein harbors an insertion mutation ins214EPE that is absent in all other SARS-CoV-2 lineages. Given the salience of viral genetic recombination and the debated plausibility of host genome integration by SARS-CoV-224–26, we considered a variety of host-viral and inter-viral genomic matter exchange scenarios that may have contributed to the adoption of this insertion mutation in the precursor variant of Omicron. We discuss potential sources for the origin of the ins214EPE and highlight the need to experimentally characterize the role of ins214EPE for transmission and immune evasion.
Results
Comparison of mutations in Omicron to previous SARS-CoV-2 lineages shows the presence of a unique insertion mutation in Omicron’s Spike protein
The Omicron variant harbors 37 mutations in the Spike protein, which include six deletion mutations, one insertion mutation and 30 substitution mutations.27 16 of the 37 mutations are surge-associated mutations28 (Table 1; see Methods), i.e. their mutational prevalence increased during any three-month window when COVID-19 cases surged. Comparing these Spike protein mutations in Omicron with pre-existing variants of concern (VOCs: Alpha, Beta, Gamma and Delta) shows that 26 mutations are distinct to Omicron and seven mutations overlap between Omicron and Alpha (Figure 1A). When compared to the Delta variant, the mutational load of Omicron is particularly high in the Spike protein sequence, with more similar rates of mutation in regions of the viral genome encoding other proteins (Figure S1).
We next analyzed which of the 26 Omicron mutations appeared in the prior variants of interest (VOIs: Lambda, Mu, Eta, Iota and Kappa) or prior SARS-CoV-2 lineages by comparing with mutations from 5,781,715 genomes corresponding to ∼1500 lineages from the GISAID database (GISAID) (Table 1; Table S1). Interestingly, while other insertions at or near this position have been observed in other lineages,14,29 the specific insertion mutation ins214EPE (Figure S2) has not been previously observed in any SARS-CoV-2 lineage other than Omicron (Figure 1B). Specifically, of the 1168 SARS-CoV-2 genomes in GISAID harboring this insertion, 1164 are classified as Omicron. Of the remaining four genomes, three have not yet been assigned a Pango lineage, while the other one (which was deposited on November 24, 2021) is labeled as B.1 (a parent lineage of Omicron). On the other hand, all of the Omicron substitution and deletion mutation have appeared in previous SARS-CoV-2 lineages (Figure 1B, Table S1).
The EPE insertion (Ins214EPE) on Omicron maps to the Spike protein N-terminal domain (NTD) distal from the antibody binding supersite.11 However, the loop where the insertion is present maps to a known human T-cell epitope on SARS-CoV-2.30 Further studies will be necessary to understand whether this insertion may help SARS-CoV-2 escape T-cell immunity.14 A recent study also suggests that insertions in the NTD, including ins214EPE in Omicron, may increase viral transmissibility by enhancing sialic-acid receptor binding.31 Such a functional consequence may be consistent with the observation that multiple SARS-CoV-2 lineages have acquired insertions of 1 to 8 additional amino acids at or near this same site (i.e. between positions 213 and 216).29 Given the well-established potential for insertions to impact SARS-CoV-2 virulence (e.g. the PRRA insertion giving rise to a polybasic FURIN cleavage site in the original SARS-CoV-2 strain), it is important to understand the functional significance and evolutionary origins of the ins214EPE insertion in the Omicron variant.15–17,19
Template switching is a plausible mechanism for the origin of ins214EPE in Omicron
Ins214EPE is an in-frame insertion of 9 nucleotides that occurred between nucleotide positions 22204 and 22207. Based on the local sequence alignment, there are three candidate insertions: (1) 5’-GAGCCAGAA-3’ between 22204 and 22205, (2) 5’-AGCCAGAAG-3’ between 22205 and 22206, and (3) 5’-AGCCAGAAG-3’ between positions 22206 and 22207 (Figure S2). According to a recent analysis of the secondary structure of the reference SARS-CoV-2 genome, the nucleotides between positions 22205 and 22208 in (5’-GAUC-3’) constitute an RNA loop32, which is notable given that RNA loops appear to be more prone to insertions than stems.
A previous analysis of sequences deposited in GISAID concluded that insertions in the SARS-CoV-2 genome most likely arise from one of three mechanisms: (1) local duplications, (2) polymerase slippage, or (3) template switching.14 While local duplications were found to explain some short (<9 nucleotides) and long (>= 9 nucleotides) insertions, the observation that these groups had distinct nucleotide compositions, phyletic patterns, and genomic localizations led to the conclusion that short and long insertions typically arise from distinct mechanisms.14 Specifically, it was suggested that polymerase slippage is the best explanation for most short insertions, which typically have an excess of uracil nucleotides and are non-monophyletic. It is proposed that slippage is most likely to occur during runs of uracils owing to the slow processing of polyU tracts that has been demonstrated for the RNA-dependent RNA polymerase (RdRp) of SARS-CoV-1 and is hypothesized to be true for the RdRp of SARS-CoV-2 as well.14,33 On the other hand, it was posited that long insertions typically arise from template switching, given that their nucleotide composition is consistent with the SARS-CoV-2 genome, they are typically monophyletic, and they tend to occur at or near sites that were previously identified as hotspots for template switching.14,34 It is notable that template switching is a normal part of the life cycle for Coronaviridae, as discontinuous transcription via template switching is responsible for the synthesis of subgenomic RNAs (sgRNAs).35,36 In this light, we asked which mechanism is the most fitting explanation for the origin of ins214EPE in Omicron.
As mentioned above, exact duplication of an adjacent nucleotide sequence has been observed previously as a mechanism for both short and long insertions in the SARS-CoV-2 genome.14 For example, in three GISAID sequences of the B.1.429 lineage, there is a duplication of 24 nucleotides (5’-AAAAGAAGAAGGCTGATGAAACTC-3’) resulting in the sequence 5’-AAAAGAAGAAGGCTGATGAAACTCAAAAGAAGAAGGCTGATGAAACTC-3’ at position 29387 (corresponding to the Nucleocapsid protein amino acid position 372).14 However, the inserted nucleotide sequence resulting in the Omicron ins214EPE is not a result of such a local duplication, as it is not identical or closely homologous to the preceding or subsequent nucleotide sequences in the original reference genome sequence of SARS-CoV-2 or that of the Omicron variant (Figure S2).
We thus asked whether polymerase slippage or template skipping was a more plausible explanation for the origin of ins214EPE. For several reasons, it appears that template switching is the more plausible hypothesis. First, this is a long insertion per the definition described previously (≥9 nucleotides), and long insertions are more likely to arise from template switching than polymerase slippage.14 That said, we recognize that with exactly 9 nucleotides, this is a borderline case between short and long insertions, and so its length alone may have limited value in distinguishing between these mechanisms. Second, this insertion has no uracil nucleotides, contrary to the expected excess uracils in slippage-mediated insertions.14 Third, this insertion is monophyletic, although it is worth noting that other insertions at the same location have been observed in several other SARS-CoV-2 lineages.14,29 Finally, the insertion occurs near previously described sites of potential non-canonical template switching. Specifically, there were non-canonical junctions observed 30 nucleotides upstream and 60 nucleotides downstream of this site (at positions 22183 and 22276, respectively).34
Candidate templates for the origin of ins214EPE in Omicron
It is of interest to determine candidate RNA molecules from which SARS-CoV-2 template switching-mediated insertions have arisen. We reasoned that there are broadly three categories of most likely templates: (1) genomic material of SARS-CoV-2 itself (i.e., the positive-sense genomic RNA or the negative-sense anti-genomic RNA), (2) genomic or anti-genomic material of other viruses that have the capacity to co-infect the same cells as SARS-CoV-2; and (3) human transcripts that are expressed in cells infected with SARS-CoV-2 (Figure 2A-B). Although the last category is not a well-described method of template switching,37 it has been suggested previously that insertions in SARS-CoV-2 genomes could be derived from the host transcriptome.38 We identified exact matches for the forward and/or reverse complement sequences in all three categories (Table 2, Figure 2C).
There are no exact matches in any Omicron genomes collected to date for the three candidate sequences outside of the insertion site itself. This is notable because in previous template switch-mediated insertions in the SARS-CoV-2 genome, the putative insertion template (“origin”) has typically been present in the insertion-containing genome (Table S2).14 For the Omicron insertion, on the other hand, the only SARS-CoV-2 sequences in GISAID containing exact forward or reverse complement matches are assigned to other SARS-CoV-2 lineages (or were not assigned to any lineage at the time of this analysis). There are several possible implications or interpretations of this finding.
First, substitutions can be introduced within the inserted sequence during template switching itself or during subsequent rounds of viral replication, which would result in imperfect matching between the insertion and template sequences. This is particularly relevant for Omicron, which represents a long phylogenetic branch that arose after presumably several months of unobserved evolution.39,40 We indeed found that the reference Omicron genome (EPI_ISL_6640916)39 harbors several nucleotide 9-mers that differ from the candidate insertions by only a single nucleotide (Table S3). These should be considered as possible templates for ins214EPE. Second, it is possible that the utilized template was derived from a co-infecting SARS-CoV-2 variant that does harbor the exact inserted sequence. Recombination between SARS-CoV-2 lineages in the context of simultaneous co-infection has been described previously, with particularly high recombination rates seen in the Spike protein sequence.35,41 The distributions of lineages comprising the matched genomes for each candidate insertion are shown in Table S4. Finally, we noticed that several of the non-Omicron genomes with exact matches to one or more of the candidate insertions have been assigned to the B.1.1 Pango lineage (or sublineages thereof) (Table S4). Given that Omicron (originally Pango lineage B.1.1.529) is a phylogenetic descendant of B.1.1, this suggests that the ancestral genome which evolved into Omicron could have provided the necessary template for this insertion.
We also identified several genomes (or anti-genomes) of seasonal or enteric human-infecting coronaviruses that contain one or more of the putative insertion sequences. For example, multiple human coronavirus OC43 (HCoV-OC43) and human enteric coronavirus strain 4408 genomes both contain 5’-GCCAGAAGA-3’ and 5’-GAGCCAGAA-3’ in their nucleocapsid and replicase polyprotein genes, respectively (Table S5). Further, 5’-GAGCCAGAA-3’ was present in the 33 HCoV-229E anti-genomes, and 5’-GCCAGAAGA-3’ was present in two HCoV-NL63 anti-genomes (Table S5). The importance of recombination between coronaviruses has been highlighted recently,42 and its potential is supported by clinical reports showing that COVID-19 patients are co-infected with other respiratory pathogens, including non-SARS-CoV-2 viruses of the Coronaviridae family, at relatively high frequencies.43–45 Further, host receptors utilized by other coronaviruses (e.g. ANPEP and DPP4) are co-expressed with the SARS-CoV-2 receptor (ACE2) at the single cell level in respiratory and/or gastrointestinal epithelial cells (Tables S6–S7),46 which could facilitate co-infection at the cellular level (a prerequisite for genomic recombination). Intestinal co-expression is relevant given the evidence that SARS-CoV-2 and other coronaviruses can infect enterocytes.47–49 That said, it is worth noting the possibility that genomic material of pathogens outside of the Coronaviridae family (perhaps even non-viral pathogens) could also serve as substrates for template switching in the context of such cellular co-infection.
Finally, there were 4677 human transcripts (from 1534 genes) containing the forward sequence 5’-GAGCCAGAA-3’ and 3264 human transcripts (from 1220 genes) containing the reverse complement sequence. Similar summary statistics are shown for the two other potentially inserted 9-nucleotide sequences in Table 2.
Consideration of local homology for the candidate templates
This landscape of possible templates, particularly among human transcripts, is expectedly quite vast given the reasonably small total space of possible 9-mer nucleotide combinations (49 = 262144). This raises the question of whether the most likely candidates may be those transcripts which have more homology surrounding the inserted sequence, as complementary base pairing resulting from such local similarity can increase the likelihood of serving as a template for recombination.37 Indeed, in the normal process of coronavirus genomic replication, the prevailing model is that subgenomic RNAs are generated during negative strand synthesis via a template switching mechanism that relies on homology between conserved transcription regulatory sequence (TRS) elements dispersed strategically throughout the genome.34,36,50,51 However, in the context of SARS-CoV-2, non-canonical transcripts with junctions that are not derived from TRS sequences and that share little homology between the 5’ and 3’ sites suggests that template switching guided by partial complementarity or other mechanisms may also play a role.34,52
To study the homology between the regions surrounding template switch-mediated insertions and their origins (i.e. the putative template that was copied to generate the insertion), we first considered a “positive control” set of four 12-15 nucleotide insertions in SARS-CoV-2 that were previously attributed to template switching with high confidence (Table S2).14 We calculated a normalized homology score (NHS) between the 7 or 35 nucleotides upstream or downstream of the insertion and origin sequences in these genomes (see Methods). Surprisingly, the degree of homology observed was generally not higher than expected by chance, as assessed via the NHS distribution of 10,000 randomly paired non-overlapping 7-mer or 35-mer nucleotide sequences from the SARS-CoV-2 genome (Figure S3, Table S8). This suggests that local homology may not be a prerequisite for the generation of template switch-mediated insertions in SARS-CoV-2.34
Nevertheless, we still assessed the homology between the 35 nucleotides upstream or downstream of the Omicron insertion and the 35 nucleotides upstream or downstream of all candidate templates (see Methods). The NHS distributions for these candidates were generally similar to the distribution observed previously for randomly selected SARS-CoV-2 35-mers (Figure S4). Candidates in each category with the highest degrees of homology in the flanking upstream or downstream sequences included SARS-CoV-2 genomes from the lineages B.1.609 (NHS = 69) and AY.103 (NHS = 66), the HCoV-229E Spike protein (NHS = 63), and human transcripts of ACTN1 (NHS = 74) and EMC4 (NHS = 71). Some candidate templates also showed more homology in shorter sequences directly upstream and/or downstream of the inserted sequence. For example, in the reverse complement of the human TMEM245 transcript, there is a 17-nucleotide stretch with exact homology to the insertion-containing region of the Omicron genome (i.e. the 9-nucleotide inserted sequence plus 8 exactly matched flanking nucleotides) (Figure S5).
Discussion
Omicron is more highly transmissible than prior variants,53,54 is less susceptible to neutralization by monoclonal antibodies and sera of vaccinated individuals,55–61 and is more likely to cause re-infections and vaccine breakthrough infections.62–64 Among the many mutations in its Spike protein, ins214EPE is the only one that has not previously been observed in any other lineages. Whether this insertion, singularly or in concert with other mutations, contributes to high transmissibility or lower susceptibility of neutralization by antibodies warrants further investigation.
While it is challenging to definitively determine the mechanism that gave rise to ins214EPE, we propose that template switching is a plausible explanation (Figure 2). Although the RNA-dependent RNA polymerases of SARS-CoV-2 and other coronaviruses do normally utilize template switching to generate subgenomic RNAs,34,50 it appears that template switch-mediated insertions may derive from a non-canonical form of this process in which a high degree of local homology (e.g. homologous TRS core sequences in the leader and body regions of the genome) is not essential.34,52 Here we have highlighted several possible sources of the template for this insertion, including the SARS-CoV-2 genome itself along with the genomes of other viruses or human transcripts. The use of SARS-CoV-2 genomic material as the template is supported by the finding that SARS-CoV-2 genome replication occurs in organelles which spatially concentrate the viral genomic material and replication machinery.65 That said, it might indeed be possible for non-SARS-CoV-2 viral genomes or host mRNAs to be aberrantly included in these organelles, thus rendering them accessible for utilization during template switching as well.
There may also be other mechanisms that could lead to SARS-CoV-2 insertions in addition to those which are considered here. For example, it has been suggested that Omicron could have evolved in a non-human species such as mice or deer, in which case other viral genomes and transcripts should be considered as possible templates.66,67 Also, a recent analysis suggests that the proofreading exoribonuclease (encoded in nonstructural protein 14, or nsp14) is required for at least some of the genetic recombination observed in SARS-CoV-2,68 but the potential mechanisms described here do not account for this. Finally, it is reasonable to question whether there is a relationship between ins214EPE and the 3-nucleotide deletion (ΔN211) that occurs shortly upstream of it in most Omicron sequences. However, because most sequences in GISAID with other insertions at position 214 do not possess such neighbording deletions (Figure S6), we believe that this proximal deletion was not a mechanistic prerequisite for the generation of the Omicron insertion.
It is not clear whether any one of these insertion-generating mechanisms would have more far-reaching consequences than the others. Template switch offers the intriguing possibility of new SARS-CoV-2 lineages borrowing protein domains or sub-domains from previous variants, other viruses, or human proteins. However, it is worth noting that several of the possible templates that we identified for ins214EPE were derived from viral anti-genomes (i.e. from HCoV-229E or SARS-CoV-2 anti-genomes) or the reverse complement sequences of human transcripts (e.g. TMEM245). In such cases, the polypeptide encoded in the Omicron genome would differ from that encoded by the positive-sense strand of the template. The genomic location and/or amino acid content of the inserted sequence may be more important than its origin, and we thus highlight the need to characterize the functional impact of ins214EPE on the clinical and epidemiologic properties of the Omicron variant. Importantly, the data and hypotheses presented here are not sufficient to make inferences about such properties including transmissibility, immune evasion, or disease severity. Even if Omicron did acquire an insertion by utilizing the genome of a common cold-causing coronavirus (e.g. HCoV-OC43, HCoV-229E) or a host transcript, we do not propose that this would explain the reduced severity of COVID-19 observed in patients infected with Omicron compared to prior VOCs.69–72
The rapid rise in COVID-19 cases attributed to the Omicron variant, including among fully vaccinated individuals, raised alarm globally. In this context, it is important to better understand both the origins and consequences of new genomic alterations that distinguish Omicron from prior VOCs and VOIs. Here, we begin to address the former by providing several plausible hypotheses on the origins of a 9-nucleotide insertion in the N-terminal domain of the Omicron Spike protein. We suggest that genomic surveillance strategies should include an emphasis on sequencing SARS-CoV-2 genomes from immunocompromised patients and individuals with viral co-infections (including co-infections with multiple SARS-CoV-2 variants), as such individuals may provide unique contexts for genomic recombination and the evolution of new variants.
Methods
Analysis of surge-associated mutations
Core mutations were derived from parental lineages from the CoV-RDB database for each Variant of Interest or Variant of Concern: Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), Lambda (C.37), Mu (B.1.621), Eta (B.1.525), Iota (B.1.526), Kappa (B.1.617.1) and Omicron (BA.1).27 All SARS-CoV-2 genome sequences corresponding to the Omicron variant were directly derived from the GISAID database.2,3 There were 1448 SARS-CoV-2 genomes annotated as B.1.1.529/BA.1/BA.2 as of December 13, 2021).
To understand which SARS-CoV-2 mutations correlate with COVID-19 test positivity, we identified “surge-associated mutations” using genomic data from GISAID (5,781,715 sequences from 203 countries/territories between December 2019 to December 2021) and epidemiology data from Our World in Data (OWID)73 as described in our previous study.28 A mutation is considered “surge associated” if it satisfies the following criteria: (1) it is present in at least 100 SARS-CoV-2 sequences in GISAID; (2) in a period of three consecutive months during which there was a monotonic increase of PCR positivity by at least 5% in a given country, the prevalence of the mutation in that country also monotonically increased by at least 5%.
Nucleotide 9-mer search to identify candidate viral and human templates for ins214EPE
A exact 9-mer search for all the three possible inserts (5’-AGCCAGAAG-3’, 5’-GAGCCAGAA-3’, 5’-GCCAGAAGA-3’) and their reverse complement sequences was performed across three different databases: (i) human transcriptome, (ii) SARS-CoV-2 genomes from GISAID, and (iii) human-infecting Coronaviridae family viruses. For a reference Omicron genome (EPI_ISL_6640916),39 we also performed a modified search to identify all 9-nucleotide sequences that differed from the possible insertion sequences by only one nucleotide (i.e. allowing for a single mismatch). The human transcript sequences (n = 244,939) were downloaded from the Gencode database74 (version 39; GRCh38.p13 of human genome). Coding sequences (CDS) for 5,781,715 SARS-CoV-2 genomes were accessed from GISAID2 on 13th December, 2021. All available sequences from human-infecting Coronavirdae family (taxid:11118) viruses were accessed from the NCBI virus database 75 (https://www.ncbi.nlm.nih.gov/labs/virus on 13, December 2021 (n = 574,178 complete genomes with 4,028,478 CDS). In the final analysis, we excluded genomes that were labeled as SARS-CoV, SARS-CoV-2, and Middle East Respiratory Syndrome (MERS)-CoV (or otherwise named strains of these viruses) to consider only seasonal and enteric coronavirus strains: human coronavirus (HCoV)-229E, HCoV-OC43, HCoV-NL63, and human enteric coronavirus strain 4408.
Assessment of homology between regions flanking insertion and origin sites
For a given insertion, we defined a putative origin as the matched sequence in a viral genome, a viral anti-genome, or a human transcript. We then assessed the homology between the 35 nucleotides upstream of the insertion and the 35 nucleotides upstream of the origin, and similarly we assessed the homology between the 35 nucleotides downstream of the insertion and the 35 nucleotides downstream of the origin. To assess the similarity between any given pair of nucleotide sequences, we defined a function using Bio.pairwise2 module from biopython v1.76 that performs a global alignment of nucleotide sequence using a custom scoring scheme (i.e. +5 for a match, −4 for a mismatch, 0 for gap start and extension). This score ranges from 0 (no matches) to 175 (perfect match for all 35 nucleotides). We further defined a Normalized Homology Score (NHS), ranging from 0 to 100, which is simply calculated by dividing the Homology Score by 175 and multiplying by 100. We also applied a similar protocol to assess the homology between shorter upstream and downstream sequences (7 nucleotides), where the NHS was calculated by dividing the Homology Score by 35 and multiplying by 100.
A set of “positive control” template switch-mediated insertions was obtained from Supplementary Data 4 from the prior analysis by Garushyants et al.14 We filtered this table to only those insertions that were assigned a mechanism of “Template switch” and which were 12 or more nucleotides long. This set of insertions is shown in Table S3. For each insertion, the genomic positions of the origin and insertion sequences were provided in this table, along with the Pango lineage assigned to the genome(s) harboring the insertion. We searched SARS-CoV-2 genomes in GISAID to identify those genomes that contained the provided origin and insertion sequences at approximately the provided genomic positions. The GISAID identifiers for the corresponding insertion-containing genomes are provided in Table S3. For each of these genomes, we then obtained the 35-nucleotide sequences upstream and downstream of the insertion and origin. Finally, we calculated the NHS for each relevant pair of sequences: (i) the 35 nucleotides upstream of the insertion versus the 35 nucleotides upstream of the origin, and (ii) the 35 nucleotides downstream of the insertion versus the 35 nucleotides downstream of the origin. These scores, and the local alignments contributing to them, are shown for each individual genome in Table S4.
Single cell analysis of coronavirus receptor coexpression
Publicly available single cell RNA-sequencing datasets were obtained and processed as described previously.46,76 The data is hosted at https://academia.nferx.com/dv/202011/singlecell/ and includes approximately 2.8 million cells which are derived from dozens of independent studies that cover most major human tissues. We determined the total number and percentage of cells (out of these 2.8 million total cells) expressing each gene encoding a coronavirus receptor of interest (ACE2, ANPEP, DPP4), along with the number and percentage of cells co-expressing ACE2 and ANPEP or ACE2 and DPP4. We also performed similar analyses on two specific studies of interest: (i) a study of nasopharyngeal and bronchial samples from COVID-19 patients and healthy controls (approximately 33,000 cells),77 and (ii) a study of ileal biopsies from Crohn’s Disease patients (approximately 136,000 cells).78 For these studies, we also evaluated co-expression in the cell types that showed the strongest expression of these genes. Specifically, to evaluate whether the degree of co-expression for ACE2 and ANPEP or DPP4 was more than expected by chance, we calculated the Observed to Expected Ratio of co-expression, assuming that expression of each gene is distributed randomly across the analyzed cells. Specifically, this value was calculated by dividing the co-expressing percentage by the product of the individual expression percentages and multiplying by 100.
Data Availability
All data produced in the present work are contained in the manuscript and accessible via the GISAID and NCBI resources.
Declaration of Interests
AJV, PA, PJL, RS, MJMN and VS are employees of nference and have financial interests in the company. nference collaborates with bio-pharmaceutical, medical device, diagnostics and software companies, as well as public health agencies, academic medical centers and health systems, on data science initiatives unrelated to this study. These collaborations had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.