ABSTRACT
Full genome sequences are increasingly used to track the geographic spread and transmission dynamics of viral pathogens. Here, with a focus on Israel, we sequenced 212 SARS-CoV-2 sequences and use them to perform a comprehensive analysis to trace the origins and spread of the virus. A phylogenetic analysis including thousands of globally sampled sequences allowed us to infer multiple independent introductions into Israel, followed by local transmission. Returning travelers from the U.S. contributed dramatically more to viral spread relative to their proportion in incoming infected travelers. Using phylodynamic analysis, we estimated that the basic reproduction number of the virus was initially around ~2.0-2.6, dropping by two-thirds following the implementation of social distancing measures. A comparison between reported and model-estimated case numbers indicated high levels of transmission heterogeneity in SARS-CoV-2 spread, with between 1-10% of infected individuals resulting in 80% of secondary infections. Overall, our findings underscore the ability of this virus to efficiently transmit between and within countries, as well as demonstrate the effectiveness of social distancing measures for reducing its spread.
INTRODUCTION
In December 2019 an outbreak of severe respiratory disease was identified in Wuhan, China (Huang, et al. 2020). Shortly later, the etiological agent of the disease was identified as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Zhou, et al. 2020; Zhu, et al. 2020), and the disease caused by the virus was named coronavirus disease 19 (COVID-19). The virus has since spread rapidly across the globe, causing a WHO-declared pandemic with social and economic devastation in many regions of the world. The infectious disease research community has quickly stepped up to the task of characterizing the virus and its replication dynamics, describing its pathogenesis, and tracking its movement through the human population. Parameterized epidemiological models have been particularly informative of how this virus has spread with and without control measures in place (e.g., Tian, et al. 2020), and have been used to project viral spread both in the short-term (Flaxman, et al. 2020) and in the more distant future (Kissler, et al. 2020).
Along with epidemiological analysis based on case reports and COVID-19 death data, sequencing of viral genomes has become a powerful tool in understanding and tracking the dynamics of infections (Volz, et al. 2013; Gardy and Loman 2018). So-called “genomic epidemiology” allows for effective reconstruction of viral geographical spread as well as estimation of key epidemiological quantities such as the basic reproduction number of a virus, its growth rate and doubling time, and patterns of disease incidence and prevalence. Such insights have been used to inform policy makers during various pathogen outbreaks, as occurred for example in the 2014-2016 outbreak of Ebola virus in West Africa (Khoury, et al. 2018; Armstrong, et al. 2019), and during this current SARS-CoV-2 pandemic (Bedford, et al. 2020; Fauver, et al. 2020).
Here, we set out to sequence SARS-CoV-2 from samples across the state of Israel, with the aim of gaining a better understanding of introductions of the virus into Israel, spread of the virus inside the country, and the epidemiology of the disease, including (a) the basic reproduction number of the virus before and after social distancing measures were implemented, and (b) the extent of viral superspreading within Israel. We sought to gain this understanding within the context of existing epidemiological knowledge, including that the first confirmed cases of SARS-CoV-2 infection in Israel were reported in early-February, followed by many identified SARS-CoV-2 cases in travelers returning to Israel mainly from Europe and the United States. Growth in the number of verified cases rapidly ensued, which led to increased measures of social distancing, including the cessation of passenger flights to Israel, school closure, and eventually a near complete lockdown across the entire state of Israel. Quarantining of returning travelers from Europe was implemented between February 26 and March 4, 2020, and subsequently all incoming travel to Israel was arrested on March 9. In the meantime, the rate of testing was ramped up, eventually reaching a rate of more than 1,500 tests per million people per day. The reported daily incidence and reported numbers of daily severe cases peaked around mid-April and have dropped steadily since then. Despite this knowledge, many questions remain: Which of the multiple SARS-CoV-2 introductions resulted in sustained local transmission? How did the virus spread across the state? What was the magnitude of the virus’s reproduction potential within Israel, and to what extent did control measures mitigate its spread? Here, through a comprehensive set of phylogenetic and phylodynamic analyses, we quantitatively address these questions.
RESULTS & DISCUSSION
In order to gain a better understanding of the dynamics of SARS-CoV-2 spread into and within Israel, we sequenced the virus from a cohort of patients representing a random sample across Israel, resulting in 212 full-genome SARS-CoV-2 sequences (Methods). A total of 224 unique single nucleotide variants (SNVs) were identified between the Wuhan reference sequence and this set of sequences from Israel. Figure 1 shows the distribution of identified SNVs along the genome and their counts in the sequenced samples. Of these SNVs, 141 were non-synonymous, 72 were synonymous, and the remaining 11 were in non-coding regions. One of the most abundantly detected SNVs was a non-synonymous variant D614G found in the spike protein, which was present in 90% of the sequences. This variant has generated much interest as it has been reported to potentially increase the transmissibility of the virus (Korber, et al. 2020). Of note however, the alternative hypothesis that the observed increases in this variant’s frequency is due to demographic considerations and genetic drift has not been ruled out.
We also found five different high confidence genomic deletions, spanning between one and eighteen nucleotides (Fig. 2) (Methods), each of which was found in one to two samples. Three of these five deletions occurred in multiples of three and were in-frame deletions or affected non-coding regions. Of the remaining two deletions, deletion #3 spanned ten nucleotides, and likely prevents the translation of ORF7a. Deletion #4 occurred at the end of ORF8 and causes the replacement of the last amino acid with an additional five amino acids. Notably, an 81 nt in-frame deletion in ORF7a has been previously reported (Holland, et al. 2020), as has been a 382 nt deletion in ORF8 (Su, et al. 2020), suggesting that the virus is to some extent tolerant to deletions in these ORFs.
When focusing on deletions that occurred in two samples, we noted that the two samples bearing deletion #1 were located in very remote clades on the phylogeny, suggesting that the deletion occurred two independent times. Deletion #5 on the other hand was present in two related samples that were sampled five days apart from each other. Deletions #2, #3 and #4 revealed an intriguing pattern: three independent deletions (one of which was present in two samples) were all part of the same clade that included eighteen samples (Fig. 2). One non-synonymous SNV defined this clade: S2430R in ORF1b which affects the non-structural protein NSP16. This protein has been reported to be a 2’O-methyl-transferase that enhances evasion of the innate immune system (Menachery, et al. 2014).
While further in-depth investigation of SARS-CoV-2 indels is clearly needed, at this point we conjecture that the deletions we detected are neutral or to some extent deleterious, and that deletions in SARS-CoV-2 are likely to occur frequently given the number of deletions detected in our samples.
Origins and transmission patterns in Israel
We next set out to explore patterns of SARS-CoV-2 introduction into Israel. Figure 3 shows the time-resolved phylogeny inferred using the 214 Israeli sequences (212 sequenced here and two additional ones sequenced previously) in addition to 4,693 representative sequences from across the world. This phylogeny allowed us to characterize the major viral clades circulating within Israel and to infer the geographic sources and timing of virus introductions into the state. We found multiple introductions into Israel from both the U.S and Europe, the latter including mainly the U.K., France and Belgium. Over 70% of the clade introductions into Israel (confidence intervals ranging from ~50% to ~80%, see Methods) were inferred to have occurred from the U.S., while the remaining were mainly from Europe. During the entire epidemic in Israel, very close monitoring of all incoming infected travelers was imposed, and reports show that only ~27% of infected returning travelers were from the U.S. The discrepancy between these estimated proportions suggests that the travelers returning from the U.S. contributed substantially more to the spread of the virus in Israel than would be proportionally expected. This may have occurred due to the gap in policy that allowed returning non-European travelers to avoid quarantine until March 9, or due to different contact patterns of those who returned from the U.S. Moreover, this suggests that had flights from the U.S. been arrested at the same time that flights from Europe were arrested (between February 26 and March 4, instead of by March 9), a substantial fraction of the transmission chains in Israel would have been prevented.
As the pandemic spread, entry into Israel was restricted, and local transmissions became dominant. Transmission patterns into and between six various geographical regions in Israel (North district, Tel Aviv district, South coast district, Jerusalem district, and South district) are shown in Figure 4. While most local transmission occurred inside defined regions, transmission among distinct regions was also observed, such as, for example, movement between Jerusalem and the north district of Israel.
Phylodynamic modeling of viral spread in Israel
To estimate the basic reproduction number of SARS-CoV-2 in Israel initially and then following the implementation of social distancing measures, we performed coalescent-based phylodynamic inference using PhyDyn (Methods). We note that existing phylodynamic analyses of SARS-CoV-2 have shown that the effective reproduction number RE of the virus has decreased over time, as quarantine and social distancing measures have been implemented within specific regions (Danesh, et al. 2020; Vaughan, et al. 2020; Volz, Baguelin, et al. 2020). However, many of these analyses have to date modeled reductions in RE as stemming from the depletion of susceptible individuals (Volz, Baguelin, et al. 2020), rather than from reductions in R0, the latter of which would be consistent with lowering of contact rates. Other analyses, particularly those that use the birth-death model approach for phylodynamic inference, have allowed for changes in R0 over time (Danesh, et al. 2020; Vaughan, et al. 2020), but cannot as easily accommodate structure in the infected host population (e.g., that some individuals are exposed but not yet infectious, and that transmission heterogeneity exists between infected individuals). Our phylodynamic analysis here, based heavily on existing coalescent-based model structures that have been applied to SARS-CoV-2 (Volz 2020), instead allows for this structure to be accommodated and for R0 to change in a piecewise fashion over time.
Our phylodynamic analysis assumes an underlying susceptible-exposed-infected-recovered (SEIR) epidemiological model for SARS-CoV-2 transmission dynamics and explicitly incorporates transmission heterogeneity (Methods). Recent epidemiological analyses have estimated considerable levels of SARS-CoV-2 transmission heterogeneity, with ~7-10% of infected individuals estimated to be responsible for 80% of secondary infections (Bi, et al. 2020; Endo, et al. 2020). Instead of assuming a given level of transmission heterogeneity for Israel, we instead performed phylodynamic inference of the SEIR model across a range of transmission heterogeneities ranging from ph = 1% to 20% of infected individuals being responsible for 80% of secondary infections. We estimated R0 prior to March 19 to be approximately 2.0 across this superspreading range, with estimates increasing towards R0 = 2.4 to 2.6 at extremely high levels of superspreading (ph = 1-2%) (Figure 5A). Across this superspreading range, we robustly estimated that quarantine measures had the effect of reducing R0 by approximately two-thirds (α = ~30%; Figure 5B; see Table S2 for model parameter priors and estimated values).
Figure 5C shows the cumulative number of SARS-CoV-2 cases by April 22, estimated by our phylodynamic analyses across the considered superspreading range. Estimates of the cumulative number of cases is highly sensitive to the level of assumed transmission heterogeneity, particularly at high levels of superspreading (ph = 1-5%). Comparison between these inferred cumulative cases and reported case numbers (dotted line in Figure 5C) indicates that SARS-CoV-2 transmission dynamics were driven by an extremely high level of viral superspreading. Specifically, if we assume almost complete case reporting, our phylodynamic analysis indicates that between 5-9% of infections are responsible for 80% of secondary infections; with lower assumed levels of case reporting, between 1-5% of infections would be responsible for this 80% of secondary infections.
Phylodynamic analysis further allows us to visualize inferred epidemiological dynamics. In Figure 6, we show inferred patterns of prevalence (Fig. 6A) and incidence (Fig. 6B) for three different assumed levels of viral superspreading. Inferred patterns of prevalence corroborate epidemiological findings that the number of cases started to decline in early April. Inferred patterns of cumulative incidence indicate that reporting rates were initially low, but improved considerably over the time course of viral spread. The ‘leveling off of cumulative incidence around late March/early April observed in both the reported case data and in our inferred epidemiological dynamics, ground-truthing the results of our phylodynamic analyses.
Conclusions
Overall, our findings highlight the use of genomic data to effectively track the spread of an emerging virus using phylogenetic and phylodynamic approaches that have been developed to study viral outbreaks. We have hereby succeeded in tracking the main transmission chains that led to SARS-CoV-2 spread in Israel, and applied phylodynamic analysis to infer key epidemiological parameters governing its spread. Our results suggest that superspreading events are a main feature of SARS-CoV-2 spread, suggesting that focused measures to reduce contacts of select individuals/social events could dramatically mitigate viral spread. Finally our results highlight how global connectivity allows for massive introductions of a virus, and emphasize how border control and shelter-in-place restrictions are crucial for halting viral spread.
METHODS
Ethics statement
An exemption from institutional review board approval was determined by the Israeli Ministry of Health as part of an active epidemiological investigation, based on use of anonymous data only and no medical intervention. The study was further approved by the Tel-Aviv University ethics committee (approval 0001274-1).
Details of samples & virus genome sequencing
With the aim of generating a random sample of viral infections across the entire country, a total of 213 samples were retrieved from six major hospitals in Israel spanning the entire geography of Israel from south to north (Table 1, Table S1).
We obtained RNA extracted from nasopharyngeal samples. Sequencing was performed based on the V3 Arctic protocol (https://artic.network/ncov-2019). Briefly, reverse transcription and multiplex PCR of 109 amplicons was performed, and adapters were ligated to allow for sequencing. All samples were run on an Illumina Miseq using 250-cycle V2 kits in the Technion Genome Center (Israel).
Determining genome consensus sequences
Sequencing reads were trimmed using pTrimmer, a multiplexing primer trimming tool (Zhang, et al. 2019), and then aligned to the reference genome of the SARS-CoV-2 (GenBank ID MN908947) using our AccuNGS pipeline (Gelbart, et al. 2019), which is based on BLAST (Altschul, et al. 1997), using an e-value of 10-9. The pipeline allows for consensus determination and variant calling. We considered substitutions at the consensus sequence (as compared to the reference) only if a given base was present in 80% of the aligned reads, and five or more reads aligned to the reference; bases where the majority of reads showed a substitution but that did not fulfill these two conditions were deemed uncertain. Similarly positions to which no reads were mapped were also deemed uncertain, and such positions were assigned with an “N”. All deletions were manually verified: (a) over 98% of the reads covering the deletion site mapped to both ends of the deletion (i.e, bore evidence of the deletions), (b) the deletion was based on over 40 independent reads (on average >1,000 reads), and (c) coverage was high at both ends of the deleted region. Only sequences that spanned 90% of the reference genome were retained, leading to the removal of one sequence (Table S1), and hence a new set of 212 Israeli sequences was generated here. Another two Israeli sequences already available on GISAID were added to the phylogenetic analysis, leading to a total of 214 sequences from Israel.
The collection dates of the 214 Israeli sequences used in our analysis ranged from February 23 through April 22, 2020. The number of sequences is thus approximately 1.5% of the total number of reported cases on April 22.
Phylogenetic analyses
All available full-length SARS-CoV-2 genomes from outside of Israel (a total of 16,403 sequences) were retrieved from GISAID on May 5, 2020. All sequences from a non-human host as well as sequences with incomplete sampling date (YYYY-MM or YYYY-MM-XX) or a high level of uncertainty (>10% ambiguous bases marked as N) were removed. All available sequences were then down-sampled to 4,693 representative sequences across the globe using the latest build of NextStrain ncov pipeline (Hadfield, et al. 2018; Hadfield, et al. 2019) (https://github.com/nextstrain/ncov). 1195 of these 4,693 sequences were from the U.S., while 1991 were from Europe. The 212 new Israeli sequences were added to the tree.
Confidence in numbers and fractions of importation events
Confidence in the relative number of importation events from the U.S. vs. Europe was assessed using two measures of confidence intervals, which were aimed at testing whether the set of exogenous sequences was biased, or whether the set of Israeli sequences was biased. First, we generated 1,000 samples of the exogenous (non-Israeli) sequences using a bootstrap approach and assessed the fraction of importation events in each sample. Second, we similarly bootstrapped the local (Israeli) sequences only and assessed the fraction of importation events in each sample. The reported confidence interval includes the lower bound and higher bound of both bootstrapping schemes.
Down-sampling of global tree for phylodynamic analysis
Following the initial sampling of the global tree described above, we applied a second sampling specifically for the phylodynamic analysis. The down-sampling followed the recommended guidelines described previously for SARS-CoV-2 (Volz, Boyd, et al. 2020). We used two sampling techniques:
Random time stratified sampling – we sampled a total of 100 sequences from outside of Israel across ν=5 time intervals such that each time interval contained approximately 20 sequences.
Closest sequence match – we will define SISR as the set of all sequences from Israel. We sample the exogenous set of sequences from the global tree with the minimal cophenetic distance between each Israeli sequence belonging to SISR as based on the maximum likelihood phylogeny. This allow including sequences closely related to sequences from Israel.
We next manually curated the sequences from Israel to ensure they represent a random sample across Israel. To this end we removed samples suspected to be from the same household, samples with consecutive identifiers, or identical samples with similar identifiers and similar dates. Only one sample from a given household was chosen randomly. This led to a removal of 6 sequences.
Following down-sampling and manual curation, a phylogenetic tree was inferred using the NextStrain pipeline (Hadfield, et al. 2018). The tree topology was validated as a legitimate representative of the global tree by performing 1,000 random samples containing 373 sequences from the global tree. The Kendall-Colijn metric (Kendall and Colijn 2016) was used to assess distance between each random sample and the original tree, allowing us to create a null distribution. The λ parameter, which determines the trade-off between topology and branch length, was set to zero, thus accounting for the tree topology alone. The significance of down-sampling procedure was thus obtained by comparing the Kendall-Colijn metric of the down-sampled tree to the null distribution.
Rate of importations
As described previously, strong global connectivity can result in a high number of independent seeding events (Chinazzi, et al. 2020), and we thus aimed at generating the distribution of importation dates (Volz, Boyd, et al. 2020). This was achieved based on the time-resolved phylogeny, which was then used to estimate the date of initial importation as well as the date on which the rate of new importations dropped. Given a down sampled ML tree TML, we re-estimate a forest of trees using IQ Tree (Nguyen, et al. 2015) such that each tree is generated by randomly resolving polytomies in TML. Each is used to produce a time-based tree, using TreeTime (Sagulenko, et al. 2018) with molecular clock rate r~U[0.0009 − 0.0015] substitutions/site/year. We deduce the distribution of all seeding events by taking the mid-branch date for each node leading to an Israeli tip for all trees. Both IQ Tree and TreeTime were executed using the Augur python package.
Phylodynamic Analysis
Phylodynamic analyses were conducted using BEAST2 v2.6.2 (Bouckaert, et al. 2019) and PhyDyn v1.3.6 (Volz and Siveroni 2018). An HKY substitution model with a lognormal prior for κ with mean log(κ) = 1.0 and standard deviation of log(κ) = 1.25 was used. We assumed no sites to be invariant and used an exponential prior for γ with a mean of 1.0. A strict molecular clock with a uniform prior between 0.0007 and 0.002 substitutions/site/year was used. A uniform prior was used for nucleotide frequencies. The down-sampled maximum likelihood tree generated using IQ Tree was used as a starting tree.
PhyDyn is a coalescent-based inference approach implemented in BEAST2, allowing for the integration over phylogenetic uncertainty (Volz 2012; Volz and Siveroni 2018). The program requires specification of an underlying epidemiological model, as well as any priors on parameters that will be estimated. In line with recent analyses (sarscov2.phylodynamics.org), we assumed that the epidemiological dynamics of SARS-CoV-2 were governed by Susceptible-Exposed-Infected-Recovered (SEIR) dynamics. Transmission heterogeneity has previously been described for viral pathogens including SARS-CoV-1 (Lloyd-Smith, et al. 2005) and appears to be important in the transmission dynamics of SARS-CoV-2 (Bi, et al. 2020; Endo, et al. 2020). To account for the possibility of transmission heterogeneity, as in previous work (sarscov2.phylodynamics.org), we modeled two classes of infected individuals: one with low transmissibility II and one with high transmissibility Ih. Mathematically, the epidemiological model is given by
We set as fixed the host population size to the population size of Israel, according to the European Centre for Disease Prevention and Control (N = 8,883,800), the average duration of time an individual spends in the exposed class (1/γE = 3 days), and the average duration of time an individual spends in the infected (infectious) class (1/γI = 5.5 days). These rates are based on a study that inferred transmissibility over the course of infection based on data from established SARS-CoV-2 transmission pairs (He, et al. 2020). R0 in this model is given by (βhph. + βl(1−ph))/γI, where ph is the fraction of exposed individuals who transition to the Ih class. Instead of independently parameterizing βh and βl, we defined (as in previous work) the relative transmissibility of infected individuals in the Ih and II classes by the parameter τ = βh/βl. We defined a parameter P as the fraction of secondary infections that were caused by a fraction ph of the most transmissible infected individuals and set P to 0.8. Based on set values of P and ph, we calculated r as . As such, we could easily parameterize the model across various levels of transmission heterogeneity, with a fraction ph of infected individuals being responsible for 80% of secondary infections. Existing epidemiological analysis indicate that ph is on the order of 7-10% (Bi, et al. 2020; Endo, et al. 2020) indicative of even more transmission heterogeneity than given by the 20/80 rule (Woolhouse et al., PNAS, 1997). We focused on the range of ph between 1-10% in our phylodynamic analyses.
Again based on existing analyses (sarscov2.phylodynamics.org), we included an external reservoir, Y, in our analysis to allow for multiple introduced clades into Israel to be jointly considered. We assume symmetric migration of individuals in the E, II, and Ih classes in and out of Israel based on a per capita migration rate η. An exponential prior with a mean of 10 and an upper limit of 10 was used for η. Based on the results of the importation analysis described above, we assumed that the migration rate decreased to 25% of its original value after March 20, 2020. As migration is assumed to be symmetrical in and out of Israel it does not affect the focal SEIR model dynamics, however, it influences the probability that a given lineage’s geographic state is assigned to Israel. We fix the rate of removal of individuals from Y to be 1/(1/γE + 1/γI) and estimate the rate of entry into Y, ρ, using a lognormal distribution with mean log(ρ) = 3.6 and standard deviation log(ρ) = 1.
In our model, we estimated a piecewise R0 by estimating an initial R0 that was in effect until March 19, 2020 slightly after strong social distancing measures were implemented, along with a factor α by which R0 changed on March 19. We set the prior on R0 to a lognormal distribution with mean log (R0) of 1.5 and standard deviation of log (R0) of 0.5. We set the prior on α to be uniform between 0 and 2, thereby allowing R0 to either increase, decrease, or remain unchanged after March 19. An exponential prior with mean 1.0 was used for the initial size of the E class. The II and Ih were assumed to be negligibly small at the beginning of the SEIR dynamics. The PhyDyn t0 parameter was set to 2019.7 and a constant population size coalescent model was used prior to this date when proposed trees had earlier root dates. SEIR dynamics were assumed to begin on February 1st. Sequences sampled from Israel were randomly assigned to Ih with probability ph and to II with probability 1 − ph. XML files to run both BEAST2 and PhyDyn were generated using a custom Python 3 script which was designed to edit a template XML file originally generated with BEAUti and manually edited. MCMC chains for each parameter set were run for at least 1.9 million steps. Convergence was assessed based on visual inspection of parameter traces. Longer MCMC chains and additional replicates are currently being conducted. The first 50% of MCMC steps were discarded as burn-in. BEAST2 log and tree files were combined using LogCombiner. Maximum clade credibility trees were generated using TreeAnnotator. BEAST2 and PhyDyn outputs were visualized using Python 3, Matplotlib (Hunter 2007), Seaborn, and Baltic (https://github.com/evogytis/baltic).
Data Availability
The assembled SARS-CoV-2 genomes (consensus sequences) were uploaded to GISAID (Table S1). Submission of the raw sequencing data to Sequence Read Archive (SRA) is pending. Code for phylodynamic analysis and model XML configuration, as well as scripts to analyze outputs are available on github.
Data and code availability
The assembled SARS-CoV-2 genomes (consensus sequences) were uploaded to GISAID (Table S1). Submission of the raw sequencing data to Sequence Read Archive (SRA) is pending. Code for phylodynamic analysis and model XML configuration, as well as scripts to analyze outputs are available at: https://github.com/SternLabTAU/SARSCOV2NGS.
Acknowledgements
We wish to thank Dr. Boaz Lev at the Israeli Ministry of Health and Dr. Tal Katz-Ezov at the Technion Genome Center, as well as Stern lab members for their support during an ongoing pandemic and various stages of lockdown. This work was funded by the Israeli Science Foundation (1333/16) and by a generous donation from the Milner foundation and from AppFlyer. This study was supported in part by a fellowship to DM, TK and OT from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University.
Footnotes
↵* Co-equal authorship