Abstract
The COVID-19 pandemic has initiated an unprecedented worldwide effort to characterize its evolution through the mapping of mutations of the coronavirus SARS-CoV-2. The early identification of mutations that could confer adaptive advantages to the virus, such as higher infectivity or immune evasion, is of paramount importance. However, the large number of currently available genomes precludes the efficient use of phylogeny-based methods. Here we establish a fast and scalable early warning system based on Topological Data Analysis for the identification and surveillance of emerging adaptive mutations in large genomic datasets. Analyzing millions of SARS-CoV-2 genomes from GISAID, we demonstrate that topologically salient mutations are linked with an increase in infectivity or immune escape. We report on emerging potentially adaptive mutations as of January 2022, and pinpoint mutations in Variants of Concern that are likely due to convergent evolution. Our approach can improve the surveillance of mutations of concern, guide experimental studies, and aid vaccine development.
Introduction
The COVID-19 pandemic, caused by the coronavirus SARS-CoV-2, has led to millions of lost human lives and devastating economic impact worldwide. As the virus continues to spread through the world, it is acquiring new mutations in its genome, and although most mutations will be deleterious or neutral, a few of them could be advantageous for the virus, for instance, by increasing its infectivity or by helping it to avoid the immune system. As more people develop immune protection by previous viral infections or through vaccination, it is important to rapidly and effectively identify mutations that could confer the virus some adaptive advantage [1, 2].
One approach to identify potential adaptive mutations consists of experimentally mutating many positions and testing the effect on certain phenotypes, like binding to the human receptor or immune evasion [3–6]. However, experimental approaches are limited by the vast number of possible variations.
A more data-driven approach, which solely relies on the genomic information of the virus, is to look for mutations in a particular genomic locus that occur multiple times. If a mutation gives some sort of advantage to the virus, we expect it to occur in several places independently and its frequency to increase with time. In this approach, one usually reconstructs an estimated phylogenetic tree and identifies mutations that occur in independent branches [7–12]. For instance, the D614G mutation in the Spike gene was identified early in the pandemic and is now found in virtually all virus isolates [13].
In the COVID-19 pandemic, an unprecedented worldwide effort to sequence viruses resulted in a growing number of millions of genomes available to the scientific community [14]. Ideally, one would like to leverage all this genomic information at real-time to rapidly report the emergence of potential mutations of concern [1, 2]. Phylogenetic approaches, however, become daunting as the number of sequences increases, and are computationally prohibitive when the number of genomes exceeds the tens of thousands [12, 15, 16]. In addition, the independent reemergence of mutations gives rise to homoplasies that confuse the generation of phylogenetic trees [17]. Moreover it has been observed that constructing a single optimal phylogeny for SARS-CoV-2 is generally problematic, as the number of sequences is large while the genetic diversity is low [15].
Here we establish a novel method based on Topological Data Analysis (TDA) that can efficiently identify emerging adaptive mutations without the need to choose a single optimal phylogenetic tree in a vast range of equally plausible tree reconstructions. It systematically detects convergent events in viral evolution merely by their topological footprint, overcoming limitations of current phylogenetic inference techniques. Thanks to our use of highly optimized algorithms, it easily scales to hundreds of thousands of distinct genomes.
Analyzing millions of genomic sequences shared via GISAID, the global data science initiative [14, 18], we first characterize the convergent evolution of the coronavirus SARS-CoV-2 in the early phase of the pandemic from December 2019 until February 2021. We demonstrate that our method can detect adaptive mutations at an early stage, often several weeks before they become recognizable by their prevalence in the population, and explain how this can serve as an early warning and surveillance system. We further report on potentially adaptive mutations in the current phase of the pandemic as of January 2022, and identify mutations in Variants of Concern, most notably in the Omicron variant, that are likely due to convergent evolution.
Results
Quantification of topological recurrence
Chan et al. [19] initiated the use of persistent homology, a method from Topological Data Analysis, to extract global evolutionary features from large genomic datasets. This method detects topological cycles in the dataset, which correspond to reticulate events in the phylogeny and may cause phylogenetic inference methods to produce ambiguous tree topologies. All information is compiled into a stable and unbiased descriptor known as a persistence barcode (see Figure 1, Figure 2 and Methods).
Here we define a novel index of recurrence that is based on persistent homology and does not rely on a possibly ambiguous tree reconstruction. We use a specifically designed algorithm, implemented in Ripser [20], that associates to relevant bars in the persistence barcode explicit SNV cycles, given by a series of isolates that approximates all evolutionary steps as faithfully as possible in terms of single nucleotide variations (SNV). We define the topological recurrence index (tRI) of a specific mutation as the total number of SNV cycles in the reticulate phylogeny that contain an edge corresponding to this mutation. This index provides a lower bound for the number of independent occurrences of the mutation in the phylogeny and is therefore a measure for convergent evolution (see Figure 2 and Methods).
Standard phylogenetic methods for the detection of convergent evolution are based on the reconstruction of a phylogenetic tree and therefore have an unfavorable scaling, due to the rapid growth of the number of trees representing evolutionary histories that are compatible with the observations [15]. Persistent homology provides a new approach by measuring convergent evolution purely in terms of topological cycles, without the need to construct any phylogenetic trees. It enables a rapid and scalable analysis of large datasets with hundreds of thousands of sequences. A performance analysis showed that when applied to sample alignments of up to 5,000 SARS-CoV-2 sequences, the topological method was at least an order of magnitude faster than phylogeny-based methods [7, 8, 21] (see Figure 3). Moreover, the topological recurrence analysis of sequence alignments with millions of SARS-CoV-2 genomes was accomplished in less than a day (see Methods).
Topological analysis of the first year of the pandemic
We analyzed topological signals for convergent evolution of the coronavirus SARS-CoV-2, both across the whole genome and on the Spike gene, during the first year of the pandemic from its beginning in December 2019 until February 2021. To that end, we performed a topological recurrence analysis for a curated alignment of 303,651 high-quality SARS-CoV-2 whole genome sequences from GISAID [14, 18] (see Methods). The resulting persistence barcode features 2,899 bars, 58% of which (corresponding to 1,684 SNV cycles) concentrate at small genetic distance scales ≤ 2 and are therefore expected to be associated mainly with homoplasic events (see Supplementary Information Figure 1). In large genomic datasets, homoplasies may arise randomly, causing a certain amount of noise in persistent homology. However, from simulations we inferred that at least 60% of the topological cycles found in the GISAID dataset must be real features due to increased mutation probabilities, selection, recombination, or sequencing errors (see Supplementary Information Figure 6 and Methods). We detected 401 non-synonymous and 299 synonymous mutations across the whole genome showing significant tRI signal (see Figure 4 and Supplementary Information Table 7). Here signals with tRI ≥ 2 are statistically significant (p < 0.05; see Methods). Most of the mutations with strong tRI signal, such as ORF1ab:G11083T (tRI = 141; see Figure 5), are known to be highly homoplasic and cause stability issues in the construction of phylogenetic trees [10, 15, 17, 23].
Performing the topological recurrence analysis for the Spike gene alignment, we found a total of 322 non-synonymous and 196 synonymous Spike mutations with significant tRI signal (see Figure 4, Supplementary Information Table 8 and Methods). Here signals with tRI ≥ 8 are statistically significant (p < 0.05; see Methods). We observed a distinct accumulation of topological signals in the S1 subunit, which is associated with host receptor recognition and contains epitopes for antibody binding [24], and in the Spike protein signal peptide as well as in the S1-S2 cleavage region [25] (see Figure 4). We found a strong signal (tRI = 400) for S:D614G, which indicates that this mutation appeared independently at least 400 times during the pandemic. In fact, the mutation increases transmissibility [13, 26] and in vitro infectiousness [27–29]. We performed a comparative time series analysis (tRI vs. prevalence) which revealed a steady increase in both tRI and prevalence until the new variant eventually superseded the wild type in the population (see Figure 5 and Methods).
The mutations S:A222V and S:S477N are among those with strongest topological signal for recurrence (see Figure 5). Both are associated with lineage B.1.177 / 20E (EU1) which emerged in Europe in mid-2020, and S:S477N is now also seen in the Omicron variant B.1.1.529 [30]. While S:S477N is known to affect the binding affinity to the ACE2 receptor [3, 31] leading to a slight increase in fitness, there is no conclusive evidence yet whether or not S:A222V also results in higher transmissibility [32]. Our time series analysis for the latter shows that the particularly strong tRI signal is still rising after the initial surge in prevalence in the European summer of 2020 (see Figure 5). This suggests that S:A222V is notably recurrent, independently of its impact on viral fitness.
We noticed that in particular on the receptor-binding domain (RBD), several of the topologically significant amino acid changes (tRI ≥ 8) are found in Variants of Interest (VOI) or Variants of Concern (VOC) [30], with a distinct accumulation in the receptor-binding motif (see Table 1, Supplementary Information Figure 2 and Supplementary Information Figure 3). Specifically, the substitutions S:N501Y, S:E484K, S:L452R, S:K417N, S:F490S and S:S494P all result in reduced binding of polyclonal convalescent plasma [4] and exhibit a distinct increase in tRI and prevalence starting in late 2020 (see Supplementary Information Figure 3). This pattern is likely due to selective pressures induced by immune evasion in a host population with rising immunity. While both S:N501Y and S:N501T produce comparable tRI signals and induce a slight antibody escape, the fact that S:N501Y has a comparatively high prevalence of 19%, and is seen in several VOCs, is probably due to the additional increase in ACE2-binding affinity caused by the asparagine-to-tyrosine substitution. Topological signals for the mutations S:Y453F and S:F486L exclusively originated from a small subpopulation in minks [33–35]. The fact that tRI signals remain low (tRI = 8) despite becoming significant already in June/May 2020 suggests that both mutations have an adaptive effect in minks but do not confer a significant fitness advantage in humans.
Correlation with host adaptation
Our analysis of the first year of the pandemic revealed that on the receptor-binding domain, there is a strong correlation between significant topological signals (tRI ≥ 8) and an increase in ACE2-binding affinity [3] compared to the wild type (Fisher’s exact test, p < 0.01; Figure 6). We did not find a similar correlation between significant tRI and an increase in plasma antibody escape [4], which is plausible as immune evasion has become a relevant factor for the evolution of the virus only towards the end of 2020. However, among those mutations with strong topological signal of convergence (tRI > 12) we found a correlation between increased mean plasma antibody escape > 0.025 and the initial acquisition of a significant tRI signal after October 2020 (Fisher’s exact test, p < 0.05; Figure 6). This provides evidence for a shift, beginning in late 2020, towards immune escape as the driving force behind adaptation during the first year of the pandemic.
An early warning and surveillance tool
We found that the sensitivity of the topological recurrence index (tRI) is sufficient to detect signals of convergent evolution already at very low mutation frequencies, often several weeks before the respective mutation becomes recognizable by its prevalence in the population. More specifically, we observed that the onset of significant tRI signals precedes the actual rise in prevalence by several weeks for the adaptive mutations S:L18F [36], K417N [4], L452R [4, 37], S:E484K [4, 37], S:N501Y [3, 4] and S:P681H [38, 39] seen in the VOCs Alpha, Beta, Gamma, Delta and Omicron [30] (see Figure 7). This demonstrates that the topological recurrence index serves as an early indicator for emerging adaptive mutations.
In order to assess the reliability of this indicator, we retrospectively compared mutations that had been identified as potentially adaptive in February 2021 with the actual evolution of the virus, and the emergence of VOIs/VOCs in particular, during the later year 2021. Focusing on the RBD, we found that the amino acid changes S:A348S, S:N354D, S:P384L, S:N439K, S:G446V, S:A475V, S:E484G, S:F490S, S:N501T and S:Y508H developed a significant tRI signal over the course of the first year of the pandemic, with a rising tendency in February 2021, while they are associated with an increased mean plasma antibody escape > 0.01 [4], but had low prevalence < 5% and had not been seen in any VOI/VOC as of February 2021 (see Table 1, Supplementary Information Figure 2 and Supplementary Information Figure 4). Mutations at these RBD residues are likely to confer a fitness advantage to the virus and might therefore appear in future variants. In fact, several months after our analysis was completed, the immune escape mutations S:F490S [4, 37] and S:G446S [4, 40], which had not been observed in notable variants before, occurred in the Lambda [41, 42] and Omicron [43] variants, which were designated as VOI/VOC in June/November 2021 [30]. Similarly, the Spike substitutions S:T95I, S:L452R, S:S477N, S:T478K, S:N679K, S:P681R and S:D796Y later occurred in the Delta, Kappa and Omicron variants, which were designated as VOC/VOI/VOC not until May/April/November 2021 [30]. While S:T478K had not yet reached a statistically significant tRI, the tRI of all of these mutations showed a rising tendency in February 2021 (see Table 1 and Supplementary Information Figure 3).
By investigating explicit representatives of topological cycles in the genomic dataset, we were able to extract geographic and temporal information about the independent acquisition of topologically salient recurrent mutations during the pandemic (see Figure 7).
Updated analyses are available at https://tdalife.github.io/covtrec.
Ongoing convergent evolution and the Omicron variant
We analyzed the ongoing evolution of the coronavirus by applying our early warning and surveillance pipeline to a curated alignment of 3,928,116 high-quality SARS-CoV-2 Spike gene sequences from GISAID [14, 18] that have been collected during the year from January 2021 until January 2022 (see Methods).
A total of 343 non-synonymous and 189 synonymous Spike mutations with significant tRI signal were found (see Figure 8, Supplementary Information Table 11 and Methods). These mutations are potentially adaptive in the current phase of the pandemic and might therefore appear in future variants. Here signals with tRI ≥ 88 are statistically significant (p < 0.05; see Methods). We observed that on the receptor-binding domain, amino acid changes with significant topological signal accumulate in the receptor-binding motif (see Figure 8).
Of particular interest are the defining amino acid substitutions in the Omicron variant B.1.1.529 [43, 44], which was designated as VOC in November 2021 [30] and is rapidly spreading all over the world in January 2022 [45]. Our analysis revealed that 47% (18 out of 38) of the Spike amino acid substitutions seen in B.1.1.529—with 53% (16 out of 30) in the sublineage BA.1 and 56% (15 out of 27) in the sublineage BA.2—show a significant topological signal of recurrence (see Figure 8). This includes, among others, the following Spike amino acid substitutions: S:G339D, S:N440K, S:S477N, S:T478K and S:N501Y, which are known to enhance binding to the human ACE2 receptor; S:K417N, S:G446S and S:E484A, which may reduce polyclonal antibody binding; S:N679 and S:P681H at the S1-S2 cleavage region [25, 46]. Our findings provide further insight into the possible origins of the Spike mutations observed in the Omicron variant [47]. In fact, all 18 substitutions with significant topological recurrence index tRI ≥ 88 likely arose from convergent evolution, while the emergence of the remaining substitutions showing weak topological signals may be due to other reasons.
Discussion
In the current COVID-19 pandemic, the early identification of emerging adaptive mutations in large SARS-CoV-2 genomic datasets is of paramount importance. Such mutations could be associated with vaccine resistance or higher transmissibility, among other concerning attributes [48]. We present here an effective and unbiased method, based on a technique from Topological Data Analysis known as persistent homology, that can rapidly identify the presence of these mutations and is able to efficiently deal with the ever-increasing wealth of genomic sequencing data created by global public health surveillance. Our method does not rely on the prior reconstruction of an optimal phylogenetic tree and therefore (i) outperforms phylogeny-based methods [7, 8, 21] by at least an order of magnitude, and (ii) accomplishes the analysis of current datasets with millions of SARS-CoV-2 genomes within just one day. However, this gain in performance comes at the cost that, in contrast to phylogeny-based methods, our method will resolve homoplasies only on small genetic distance scales. We further remark that although persistent homology is robust with respect to noise in the genomic dataset [49], systematic sequencing errors [50, 51] might tamper topological inference to some extent.
Drawing from data from GISAID [14, 18], we applied our analysis pipeline to monitor the adaptive evolution of the coronavirus SARS-CoV-2 in two phases of the pandemic—the early phase from December 2019 until February 2021, and the current phase as of January 2022. The topological analysis of the early phase revealed a total of 700 (518) topologically recurrent mutations distributed across the whole genome (Spike gene), which shows that convergence plays a significant role in the evolution of SARS-CoV-2. We found that our method reliably identifies highly homoplasic sites across the whole genome, making it a useful tool in the design of unbiased masking schemes in phylogenetic inference. Including experimental data obtained by Starr et al. [3] and Greaney et al. [4] into our analysis, we inferred that on the receptor-binding domain topological recurrence correlates with host adaptation. Specifically, we observed a beginning shift in late 2020 from host receptor binding towards immune evasion as the main selective pressure. The topological analysis of the current phase of the pandemic revealed a total of 532 topologically recurrent mutations on the Spike gene.
We demonstrated that our method (i) is sufficiently sensitive to detect emerging adaptive mutations already at an early stage when mutation frequencies are still very low, and (ii) provides geographic and temporal information about virus isolates involved in concrete reticulate events. We identified potentially adaptive mutations on the Spike gene that showed significant topological signals of recurrence with a rising tendency in February 2021. These mutations are likely to appear in future variants, and in fact, several of these candidate mutations have later received special attention as features of the Delta, Lambda and Omicron Variants of Interest/Concern [30], not until several months after our analysis was completed. We explained how our results can aid the understanding of possible reasons for the occurrence of certain mutations in Variants of Concern such as the Omicron variant. We found that as of January 2022, at least half of the Spike gene mutations observed in the Omicron variant are likely due to convergent evolution. Updated analyses are available at https://tdalife.github.io/covtrec.
Based on these insights, we propose persistent homology as an early warning system for the emergence of new adaptive mutations in the ongoing COVID-19 pandemic and foresee this capability also in future pandemics of various pathogens. Basically, our method allows for a targeted mapping of recurrent mutations in any region of the genome, which can guide and motivate the experimental study of adaptive effects of specific mutations also in genes other than the Spike gene. Our method could in particular aid the development and rapid adaptation of vaccines for new emerging variants [2, 52]. We envision a combined effort between public health organizations with extensive sequencing of viral genomes, the computational characterization of potentially adaptive variants, and the experimental phenotypic characterization of these variants.
Methods
Data acquisition and data preparation
We use two separate datasets of SARS-CoV-2 genome sequences provided by the GISAID Epi-CoV Database [14, 18]. The first dataset was obtained by downloading all available SARS-CoV-2 whole genome sequences as of 28 February 2021, isolated from human and animal hosts, that carried the following attributes: “complete”, “high coverage”, “low coverage excluded”, “collection date complete”. This dataset comprised 450,473 sequences. We aligned all sequences with MUSCLE (Version 3.8.31) [53], using as reference genome the sequence Wuhan/Hu-1 with accession number EPI_ISL_402125, truncated at the start codon of ORF1ab (reference site position 266) and the stop codon of ORF10 (reference site position 29,674). Subsequently all sequences containing at least one ambiguous site (character “N”) were removed. This resulted in an alignment comprising 303,651 complete SARS-CoV-2 genomes of length 30,130nt. The second dataset is the alignment msa 0117.fasta which we downloaded from GISAID on 19 January 2022. It comprises 6,475,061 SARS-CoV-2 whole genome sequences that have been aligned to the reference sequence Wuhan/WIV04 with GISAID accession number EPI_ISL_402124 using MAFFT (Version 7.497) [54]. Sequences with collection date between 1 January 2021 and 17 January 2022 were selected, then truncated to the Spike gene (reference site positions 21,563 to 25,384), and finally sequences containing any characters other than A, C, T or G were removed. This resulted in an alignment comprising 3,928,116 complete SARS-CoV-2 Spike genes of length 4,669nt. A list of accession numbers of all sequences in these two alignments, along with an acknowledgement of the contributions of both the submitting and the originating laboratories, can be retrieved through the Data Acknowledgement Locator at https://www.gisaid.org with IDs EPI_SET_20220127bo and EPI_SET_20220124he.
The experimental data on viral phenotypes by Starr et al. and Greaney et al. was downloaded from [3, Table S2] and [4, Table S3]. The values for the mean plasma antibody escape used in Supplementary Information Figure 2 were computed by averaging mutation escape values for every substitution in [4, Table S3] over all subjects.
Distance matrices
We used Hammingdist (Version 0.13.0) [22] to compute the genetic distance matrix of a given alignment of genome sequences. For any pair of sequences in the alignment, this matrix gives the Hamming distance between the two sequences, which is the number of site positions at which the nucleotides in the two aligned sequences differ. Noteworthy, our convention in this work is that insertions and deletions (dashes “-” in aligned sequences) do not contribute to the genetic distance.
From the whole genome alignment covering the first year of the pandemic we created 15 time buckets, each ranging from December 2019 to one of the months between December 2019 and February 2021. For each time bucket, a time bucket sub-alignment of all genetically distinct sequences whose collection dates belong to the given time bucket was created by selecting isolates by their date stamp and removing genetically identical sequences (Hamming distance = 0). The largest time bucket sub-alignment ranging from December 2019 to February 2021 contained 161,024 genetically distinct sequences. Then for each such sub-alignment the corresponding genetic distance matrix, which is a sub-matrix of the distance matrix of the whole alignment, was derived. We obtained 15 distance matrices of whole genome time bucket sub-alignments. This process was repeated for all sub-alignments after truncating sequences to the Spike gene (reference site positions 21,563 to 25,384). We obtained 15 distance matrices of Spike gene time bucket sub-alignments. The Spike gene alignment covering the year from January 2021 until January 2022 contained 291,141 genetically distinct sequences. We computed the genetic distance matrix for this alignment.
Topological Data Analysis and viral evolution
Topological Data Analysis (TDA) is a field of data science that aims to study the shape of large datasets, by extracting topological structures and patterns. Such topological structures have associated dimensions: structures of dimension zero can be thought of as the connected components, and structures of dimension one are essentially the loops, or topological cycles, of the dataset. Structures of higher dimensions can also be defined, but are also more difficult to interpret. Here we are interested in reticulate evolutionary processes, thus we choose to focus on topological structures in dimension one, since topological cycles can be interpreted as signals of divergence from phylogenetic trees (see Figure 2).
Datasets often come as point clouds: in our setting, each point corresponds to a virus genome sample, and lies in a high-dimensional space where each nucleotide of the genome is a dimension. A common way to extract the phylogenetic network from this point cloud simply amounts to connecting samples as soon as their genetic distance is less than a given threshold r > 0. This results in a (neighborhood) graph, whose set of cycles provides candidates for the topological structures in dimension one of the true underlying network. However, a main limitation of this approach comes from the fact that relevant topological structure typically appears at multiple scales (see Supplementary Information Figure 5).
The most common way to handle this issue in Topological Data Analysis is to actually compute and track the cycles for all possible values of r ranging from 0 to + ∞. As r increases, some cycles can appear, and some already existing cycles can disappear, or get filled in. The whole point of Topological Data Analysis is to record, for each cycle, its radius of appearance, or birth time, and radius of disappearance, or death time. This is called the persistent homology of the point cloud. The construction, based on a scale parameter r, can be summarized as follows. The input is a distance matrix describing the dataset, considered as a finite metric space. First, consider the geometric graph at scale r, whose vertices are the data points, with any two points connected by an edge whenever their distance is at most r. Generalizing this construction, the Vietoris–Rips complex at scale r connects any subset of the data points with a simplex (an edge, a triangle, a tetrahedron, or a higher-dimensional generalization thereof) whenever all pairwise distances in the subset are at most r. A Vietoris–Rips complex is a particular type of simplicial complex, a higher-dimensional generalization of graphs which is of crucial interest in algebraic topology, in particular in homology theory. The family of Vietoris–Rips complexes for all parameters r is called the Vietoris–Rips filtration. It provides a multiscale method to extract cycles of various sizes, and to encode them in a so-called persistence barcode: each bar, or interval, in this barcode corresponds to a cycle representing a topological feature (a reticulate evolutionary process in our case), and the bar endpoints correspond to its radii of birth and death (the maximum genetic distance between consecutive samples forming the cycle, and, roughly, the maximum pairwise genetic distance between samples forming the cycle).
Each bar indicates the presence of a reticulate event, implying that the evolutionary history cannot be fully explained in terms of a single phylogenetic tree. The mathematical background of this phenomenon is a classical theorem due to Rips, which asserts that trees have trivial persistent Vietoris–Rips homology [55]. The corresponding cycle in the associated reticulate phylogeny can then be localized in the sequence alignment by tracing it back to the isolates that constitute the reticulate event. Moreover, the length of the bar represents the cycle size. In our case, this corresponds to the length of the reticulate evolutionary process, which allows to distinguish, for instance, between homoplasies and recombinations (see Figure 2). There are several scenarios that can lead to reticulate events. For instance, if a genome of an organism imports genetic material from a different genome, in lateral gene transfer for instance, we will observe that parts of the newly generated genome resemble the parent, while others resemble the genome of the organism that exported the material. Recombination and reassortments are common phenomena observed in viruses where two parental strains co-infect the same host cell generating a new virus containing genetic material from both parental strains. But similarity between genomes can also be generated at smaller scales, when the same mutation occurs independently twice, making the two strains more similar than expected. Persistent homology captures all these events, and also the scale of the events. Although in general it requires care to infer the biological origin of a given topological cycle, in viral evolution one expects bars at small genetic distance scales to correspond mostly to homoplasies, while well-supported recombination signals typically produce topological features at larger scales, as entire blocks of genetic material are exchanged in the process.
Computation of persistent homology
Ripser is a state-of-the-art software for the computation of persistent homology based on the topological construction of Vietoris–Rips complexes, developed by one of the authors [20]. For the computation of the persistence barcode, Ripser resorts to various optimizations, which are crucial when handling datasets of the size considered here. Notably, Ripser computes persistent cohomology, which is not based on cycles but instead on cocycles, often described intuitively as cuts that tear open a hole. In order to obtain the requisite cycles representing the features in persistent homology, we used a custom version of Ripser that subsequently carries out a second computation, this time based on cycles instead of cocycles. While a naive computation based on homology would be prohibitively expensive, the previous computation of the persistence barcode based on cocycles makes the subsequent computation of representative cycles feasible.
For our computations, we used a customized version of Ripser to compute the representative cycles for the persistent homology of the Vietoris–Rips filtration associated to the genetic distance matrix for each time bucket sub-alignment (whole genome and Spike gene). As we are only interested in SNV cycles, the computation of persistence barcodes for the time bucket sub-alignments was restricted to small genetic distance scales (Ripser scale parameter threshold set to 2), which greatly increases the speed of the computation.
The homological features identified by persistent homology admit different representative cycles. In order to obtain cycles that fit tightly to the data points, our customized version of Ripser uses a method called exhaustive reduction [56, 57], which can be roughly summarized as follows. Whenever a representative cycle contains an edge that also appears in another cycle as the longest edge, a tighter representative can be obtained by replacing the edge with the remaining edges from the other cycles, which all have shorter length.
Topological features are statistically significant
We estimated the expected number of topological cycles in persistent homology that are created by random homoplasic events in the GISAID dataset covering the first year of the pandemic with 161,024 genetically distinct whole genome sequences collected from December 2019 until February 2021. To this end, we simulated several evolutionary scenarios under the following assumptions: uniform probability distribution for substitutions across the genome, no variations in fitness, and zero recombination rate.
We generated forward simulations of viral evolution based on a Wright-Fisher model using SANTA-SIM (Version 1.0) [58] with fixed parameters: number of generations (N = 10, 000), number of sequences sampled from the population per time step (n = 15), recombination rate (ρ = 0), and variable parameters: mutation rate per site per generation, effective initial population, carrying capacity, population growth rate per generation. We considered five scenarios: In scenarios I-III we varied the mutation rate under the assumption of fixed population size, while in scenarios IV and V we investigated the effects of logistic growth of the viral population (see Supplementary Information Figure 6). The range of mutation rates in scenarios I-III were chosen such that the diversity in the simulated phylogenies are in close correspondence to the observed diversity in the GISAID dataset. While a mutation rate of μ = 0.75E − 7 substitutions per generation per site systematically underestimates the maximal distances to the root, the highest value of μ = 1.25E − 7 produces slightly larger maximal values. In fact, scenario II with μ = 1.00E − 7 reproduced the observed maximal distance accurately and provides a good approximation of the GISAID dataset (see Supplementary Information Figure 6). Major differences between simulations and the GISAID dataset are likely due to epidemiological phenomena in the ongoing pandemic such as variable population size and sequencing rate, and the spread of certain variants.
For each of the simulated datasets, we computed its persistent homology in dimension one using Hammingdist [22] and Ripser [20]. In order to keep overall computational expenses at a reasonable level, we resorted to extrapolations from smaller simulated datasets to the size of the GISAID dataset with 161,024 sequences. For all scenarios we produced 100 simulations for each of the following values of the effective population p (resp. carrying capacity c): 100, 500, 1000, 2500, 5000, 7500, 104, 105. Additionally, we included five simulations for p = 106 to achieve a better support of the extrapolation fit. For each value of p we randomly chose 60% of the simulations as training data, used to determine the parameters of different models in a non-linear least squares fit, while the remaining 40% were reserved for later validation and comparison of the models.
For each scenario we considered a quadratic, cubic, powerlaw and exponential model for the observed points (xi, yi), and linear and powerlaw fits for the squared residuals (yi − yfit(xi))2 in the training data (see Supplementary Information Figure 13). In each model, we then used the resulting fits mean(x) and var(x) as estimators for the mean and variance of an underlying Panjer (a, b, 0)-class distribution [59, 60]. The quantiles of the observed number of cycles in the training data fit the quantiles of the Panjer distribution with corresponding mean and variance remarkably well (see Supplementary Information Figure 12). We then determined the likelihood L = Πi PPanjer(y = yi|mean(xi), var(xi)) to observe the validation data {(xi, yi)}. For each model, the corresponding log-likelihoods are listed alongside the corresponding fits in Supplementary Information Figure 13. According to the log-likelihoods, the variance of the Panjer distribution is generally best described by a powerlaw behaviour. An exception is scenario II, for which the small sample of 5 simulations at p = 106 has an uncharacteristically small variance that skews the fits and corresponding likelihoods. Among the models that assume a powerlaw dependence of the variance, again with exception of scenario II, the cubic-powerlaw model yields maximum likelihoods. Finally, we determined the 95% prediction intervals for the expected numbers of random cycles by use of the cubic-powerlaw extrapolation of mean and variance of a Panjer distribution (see Supplementary Information Figure 6). The validation data of scenarios I, IV and V, which were all based on the same mutation rate, are well described by the prediction intervals of scenario I. The prediction intervals of scenario V differ significantly from the other two scenarios only at high numbers of distinct sequences. This difference arises because simulations in scenario V generally produce fewer distinct sequences than scenario I and IV, such that a steeper extrapolation is not sufficiently penalized. Hence, the prediction intervals of scenario V illustrate the error margins of the extrapolations, but are not likely to faithfully represent the expected number of one-dimensional cycles. We also observe that higher mutation rates in scenarios II and III lead to smaller numbers of one-dimensional cycles in the dataset.
In conclusion, the 95% prediction interval of scenario V yields an upper bound between 1023 and 1171 expected random cycles in a dataset comparable to the GISAID dataset with 161,024 distinct sequences. Moreover, since the diversity of the GISAID dataset is better approximated by scenario II than by scenarios I, IV or V, it is reasonable to rely on the prediction interval of scenario II, which predicts that in 95% of the cases we expect between 362 and 408 random cycles (see Supplementary Information Figure 6).
Topological recurrence analysis
We performed topological recurrence analyses for the whole genome and for the Spike gene. Notably, the analysis can also be carried out for single genes, based on alignments of appropriately truncated genome sequences. In this case, evolutionary processes outside the specific gene are ignored, leading to the creation of more topological cycles and hence to a more detailed picture of the ongoing convergent evolution in the respective gene.
Regarding the alignments covering the first year of the pandemic, we proceeded as follows: For each time bucket sub-alignment (whole genome and Spike gene) a complete list of SNV cycles (topological cycles all of whose edges correspond to single nucleotide variations) in this alignment was generated from the corresponding Ripser output. For each edge in an SNV cycle the endpoints of the edge correspond to a pair of uniquely determined sequences in the alignment that differ in exactly one nucleotide site position and hence determine an SNV. Then for each such SNV, its topological recurrence index (tRI) is by definition the total number of all SNV cycles containing an edge that gives rise to the given SNV. We restricted our analysis to SNVs with the following two properties: (i) one of the two nucleotides involved in the SNV agrees with the nucleotide in the reference sequence EPI_ISL_402125 at that site position, and (ii) the SNV is isolated in the sense that at the two preceding and following site positions the nucleotides are the same as in the reference sequence. These two conditions ensure that the corresponding SAAV is uniquely determined by the SNV. We used custom code implemented in Python to compute the tRI of each such SNV for every time bucket sub-alignment (whole genome and Spike gene). Moreover, for every whole genome time bucket sub-alignment the prevalence of every SNV was computed as the quotient of the number of all sequences carrying that SNV by the number of all sequences in that sub-alignment. Note that the sub-alignments entering into this computation consisted of genetically distinct sequences. Finally, for every SNV the measurements of both tRI (whole genome and Spike gene) and prevalence for all time buckets were combined into a time series analysis chart.
Even if all SNV cycles arose through random processes, it is expected that the resulting tRIs are distributed uniformly among all observed mutations. So the probability for a given mutation to have tRI ≥ k is given by a binomial distribution where the number of trials corresponds to the number of mutations in SNV cycles, and the probability for success is the inverse of the number of mutations that are realized in the dataset. From this we deduce that in the whole genome analysis a tRI ≥ 2 is highly significant (p < 0.01), while for the Spike gene analysis any signal with tRI ≥ 8 is significant (p < 0.05).
An analogous analysis was carried out for the Spike gene alignment covering the year from January 2021 until January 2022. In that case, a tRI ≥ 88 is significant (p < 0.05).
Performance analysis
We performed a basic runtime comparison between Topological Data Analysis (TDA)-based methods and standard phylogeny-based methods for random samples of up to 5,000 SARS-CoV-2 genomes drawn from the GISAID alignment covering the first year of the pandemic. We used IQTree [21] to reconstruct phylogenetic trees (with default settings and fast search option). The subsequent homoplasy analysis was performed with TreeTime [7] and HomoplasyFinder [8] (with default settings). For the TDA-approach we used Hammingdist [22] (with OpenMP multithreading disabled) to generate genetic distance matrices, and Ripser [20] (with scale parameter threshold set to 2) for the subsequent computation of persistence barcodes. All computations were carried out on one kernel of an Intel Xeon E7-4870 processor. The resulting runtimes for each sample are shown in Figure 3.
The computation of the genetic distance matrix for the whole genome alignment covering the first year of the pandemic with 303,651 sequences was carried out with Hammingdist [22] (with OpenMP multithreading enabled) on a virtual machine with Intel Xeon Gold 6230R processors and 52 kernels. The runtime was 57 minutes and the memory usage was 36 gigabytes for the whole genome analysis (49 seconds and 2 gigabytes for the Spike gene analysis).
The computation of the persistence barcodes for all monthly sub-alignments of the corresponding alignment with 161,024 genetically distinct sequences was carried out with Ripser [20] (with scale parameter threshold set to 2) on an Intel Xeon Gold 6230R processor. Runtime and memory usage for each sub-alignment are shown in Figure 3. For the largest time bucket sub-alignment ranging from December 2019 to February 2021 with 161,024 genetically distinct sequences, the runtime was 45 minutes and the memory usage was 49 gigabytes for the whole genome analysis (59 seconds and 2.1 gigabytes for the Spike gene analysis).
Similarly, for the Spike gene alignment with 3,928,116 sequences covering the year from Janaury 2021 until January 2022, the computation of the genetic distance matrix for 291,141 genetically distinct sequences took 4 hours, and the computation of the corresponding persistence barcodes was completed within 2.2 hours.
Ancestral state reconstruction analysis
For the study of the evolutionary histories of topologically highly recurrent mutations (see Figure 5) we performed ancestral state reconstruction analyses using Mesquite (Version 3.61) [61]. As input we used a curated alignment of 3,507 genome sequences and its corresponding Maximum-Likelihood tree, downloaded from Nextstrain [16] on 3 March 2021. The tree was rooted using the oldest sequence available (EPI_ISL_406798, collected on 26 December 2019). We inferred the evolution of each amino acid of interest along this SARS-CoV-2 tree using a parsimony approach.
Data availability
The SARS-CoV-2 genome data used in this work are available from the GISAID EpiCov Database [14, 18] at https://www.gisaid.org and can be retrieved through the Data Acknowledgement Locator with IDs EPI_SET_20220127bo and EPI_SET_20220124he. Experimental data on viral phenotypes by Starr et al. and Greaney et al. is available from [3, Table S2] and [4, Table S3].
Code availability
Code used for the analyses is available at https://github.com/ssciwr/hammingdist and https://github.com/Ripser/ripser/tree/tight-representative-cycles. All other code is available from the corresponding authors upon request.
Data Availability
SARS-CoV-2 genome data used in this work are available from the GISAID EpiCov Database at www.gisaid.org.
Author contributions
M.B., L.H., A.O., J.P.G., R.R. designed the study; M.B., L.H., A.O. curated data; M.B., M.C., L.H., A.O., J.P.G. performed computational analyses; U.B., M.B., L.H., A.O. developed and implemented software; M.B., L.H., A.O. acquired computing resources; M.B., L.H., A.O. drafted the manuscript; all authors contributed to the final version of the paper.
Competing interests
R.R. is a founder of Genotwin, he is member of the Scientific Advisory Board of AimedBio and consults for Arquimea Research. The other authors declare no competing interests.
Additional information
Supplementary Information is available for this paper.
Correspondence and requests for materials should be addressed to M.B., L.H. or A.O.
Supplementary Information
Supplementary Information Table 7. External spreadsheet containing full results of the topological recurrence analysis for the whole genome in the first year of the pandemic (December 2019 until February 2021). The table lists mutations together with their topological recurrence index (tRI) and prevalence. All mutations with statistically significant tRI ≥ 2 are included.
Supplementary Information Table 8. External spreadsheet containing full results of the topological recurrence analysis for the Spike gene in the first year of the pandemic (December 2019 until February 2021). The table lists mutations together with their topological recurrence index (tRI) and prevalence. All mutations with tRI ≥ 2 are included, but only a tRI ≥ 8 is statistically significant.
Supplementary Information Table 9. External spreadsheet containing a complete list of the topological recurrence index (tRI) for all variable amino acid site positions on the Spike gene in the first year of the pandemic (December 2019 until February 2021). All variable site positions with tRI ≥ 2 are included, but only a tRI ≥ 8 is statistically significant.
Supplementary Information Table 10. External spreadsheet containing a sublist of the list in Supplementary Information Table 8 featuring all mutations on the receptor-binding domain together with their topological recurrence index (tRI) and prevalence.
Supplementary Information Table 11. External spreadsheet containing full results of the topological recurrence analysis for the Spike gene as of January 2022. The table lists mutations together with their topological recurrence index (tRI). All mutations with tRI ≥ 2 are included, but only a tRI ≥ 88 is statistically significant.
Acknowledgements
The authors gratefully acknowledge all data contributors, i.e. the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative [14, 18], on which this research is based. Acknowledgement tables can be retrieved through the Data Acknowledgement Locator at https://www.gisaid.org with IDs EPI_SET_20220127bo and EPI_SET_20220124he. The authors acknowledge the use of de.NBI Cloud and the support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen and the German Federal Ministry of Education and Research (BMBF) through grant no 031 A535A. They thank M. Hanussek for IT support and early access to VALET [62]. The authors further acknowledge support from the Interdisciplinary Center for Scientific Computing at Heidelberg University and the development work of the Scientific Software Center of Heidelberg University carried out by L. Keegan and D. Kempf. This research was supported by the DFG Collaborative Research Center SFB/TRR 109 “Discretization in Geometry and Dynamics”. M.B. was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy EXC 2181/1 - 390900948 (the Heidelberg STRUCTURES Excellence Cluster). L.H. thanks the Evangelisches Studienwerk Villigst for their support. A.O. acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 281869850 (RTG 2229).