ABSTRACT
SARS-CoV-2 virus genomes are currently being sequenced at an unprecedented pace. The choice of sequences used in genetic and epidemiological analysis is important as it can induce biases that detract from the value of these rich datasets. This raises questions about how a set of sequences should be chosen for analysis, and which epidemiological parameters derived from genomic data are sensitive or robust to changes in sampling. We provide initial insights on these largely understudied problems using SARS-CoV-2 genomic sequences from Hong Kong and the Amazonas State, Brazil. We consider sampling schemes that select sequences uniformly, in proportion or reciprocally with case incidence and which simply use all available sequences (unsampled). We apply Birth-Death Skyline and Skygrowth methods to estimate the time-varying reproduction number (Rt) and growth rate (rt) under these strategies as well as related R0 and date of origin parameters. We compare these to estimates from case data derived from EpiFilter, which we use as a reference for assessing bias. We find that both Rt and rt are sensitive to changes in sampling whilst R0 and date of origin are relatively robust. Moreover, we find that the unsampled datasets (opportunistic sampling) provided, overall, the worst Rt and rt estimates for both Hong Kong and the Amazonas case studies. We highlight that sampling strategy may be an influential yet neglected component of sequencing analysis pipelines. More targeted attempts at genomic surveillance and epidemic analyses, particularly in resource-poor settings which have a limited genomic capability, are necessary to maximise the informativeness of virus genomic datasets.
INTRODUCTION
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is an enveloped single-stranded zoonotic RNA virus belonging to the Betacoronavirus genus and Coronaviridae family (Gorbalenya et al., 2020). It was first identified in late 2019 in a live food market in Wuhan City, Hubei Province, China (Zhu et al., 2020). Within a month, SARS-CoV-2 had disseminated globally through sustained human-to-human transmission. It was declared a public health emergency of international concern on the 30th of January 2020 by the World Health Organisation (World Health Organisation, 2020). Those infected with SARS-CoV-2 have phenotypically diverse symptoms ranging from mild fever to multiple organ dysfunction syndromes and death (Verity et al., 2020).
Despite the implementation of non-pharmaceutical interventions (NPIs) by many countries to control their epidemics, to date over 300 million SARS-CoV-2 cases and 5.4 million deaths have been reported worldwide (World Health Organisation, 2022). These NPIs can vary within and between countries and include restrictions on international and local travel, school closures, social distancing measures and the isolation of infected individuals and their contacts (European Centre for Disease Prevention and Control, 2020). The key aim of NPIs is to reduce epidemic transmission, often measured by epidemiological parameters such as the time-varying reproduction number (Rt at time t) and growth rate (rt) (Supplementary Table 1) (Anderson et al., 2020; UK Health Security Agency, 2022). However, there is currently great difficulty in estimating and comparing epidemiological parameters derived from case and death data globally due to disparities in molecular diagnostic surveillance and notification systems between countries. Further, even if data are directly comparable, the choice of epidemiological parameter can implicitly shape insights into how NPIs influence transmission potential (Dushoff and Park, 2021; Parag, Thompson and Donnelly, 2021). As such, there is a need to use alternative data sources, such as genomic data (World Health Organisation, 2021a), to gain improved insights into viral transmission dynamics (Jombart et al., 2014; Duchene et al., 2020).
Phylodynamic analysis of virus genome sequences have increasingly been used for studying emerging infectious diseases, as seen during the current SARS-CoV-2 pandemic (Faria et al., 2021; Nadeau et al., 2021; Romano and Melo, 2021; Volz et al., 2021), recent Ebola virus outbreaks in Western Africa (Dudas et al., 2017) and Zika outbreaks in Brazil and the Americas (Faria et al., 2017; Grubaugh et al., 2017). Transmissibility estimates such as the basic reproduction number (R0), Rt and rt can be directly inferred from genomic sequencing data in addition to other epidemiological parameters like the date of origin of a given viral variant which can only be inferred from genomic data. This is of particular importance for variants of concern (VOC), genetic variants with evidence of increased transmissibility, more severe disease, and/ or immune evasion. VOC are typically detected through virus genome sequencing and there is often a limited understanding of their epidemiological characteristics from epidemiological data alone (Harvey et al., 2021). To maximise the use of additional epidemiological information from genomic data, clear guidelines on sampling need to be provided (Lesley et al., 2021).
Currently, SARS-CoV-2 virus genomes from COVID-19 cases are being sequenced at an unprecedented pace providing a wealth of virus genomic datasets (Rambaut et al., 2020). There are currently over 7.4 million genomic sequences available on GISAID, an open-source repository for influenza and SARS-CoV-2 genomic sequences (Shu and McCauley, 2017). These rich datasets can be used to provide an independent perspective and can help validate or challenge parameters derived from epidemiological data. Moreover, the use of genomic data can overcome some of the limitations and biases of using epidemiological data alone. For example, it is less susceptible to changes at the government level such as alterations to the definition of a confirmed case and changes to notification systems (de Souza et al., 2020; Tsang et al., 2020). Inferences from virus genomic data improve our understanding of underlying epidemic spread and can facilitate better-informed infection control decisions (Dolan, Whitfield and Andino, 2018).
The most popular approaches used to investigate changes in virus population dynamics include the Bayesian Skyline Plot (Drummond et al., 2005) and Skygrid (Gill et al., 2013) models and the birth-death skyline (BDSKY) (Stadler et al., 2013). These integrate Markov Chain Monte Carlo (MCMC) procedures and often converge slowly on large datasets (Hall, Woolhouse and Rambaut, 2016). As such, currently available SARS-CoV-2 datasets containing thousands of sequences become computationally impractical to analyse and sub-sampling is necessary. Although there have been some previous studies (Stack et al., 2010; de Silva, Ferguson and Fraser, 2012; Hall, Woolhouse and Rambaut, 2016; Karcher et al., 2016; Parag, du Plessis and Pybus, 2020), the effects of sampling strategies on phylogenetic and phylodynamic inferences of pathogens is currently a neglected area of study (Frost et al., 2015), particularly concerning SARS-CoV-2. To our knowledge, there are no published studies concerning SARS-CoV-2 which explore the effect that sampling strategies have on the phylodynamic reconstruction of key transmission parameters. This is important as incorrectly implementing a sampling scheme or ignoring its importance can mislead inferences and introduce biases (Hall, Woolhouse and Rambaut, 2016; Hidano and Gates, 2019). This raises the important question of how a set of sequences should be selected for analysis and which parameters are sensitive or robust to changes in sampling.
Here we aim to explore how diverse sampling strategies in genomic sequencing may affect the estimation of key epidemiological parameters from genomic data. To do this, we estimate R0, Rt, and rt from genomic sequencing data under different sampling strategies from a location with high genomic coverage represented by Hong Kong, and a location with low genomic coverage represented by the Amazonas region, Brazil. Moreover, we compare epidemiological parameters derived from genomic data to those estimated from corresponding epidemiological data which we considered here as our gold standard. By getting genomic inferences close to the case data we can then draw better inferences of transmission estimates and parameters that cannot be derived from case data alone. This will help us to understand the impact that sampling strategies have on phylodynamic inference and aid in the interpretation of epidemiological parameters from areas with differing genomic coverage.
METHODS
Empirical Estimation of the Reproduction Number, Time-varying Effective Reproduction Number, and Growth Rate
Epidemiological Datasets
Two sources of data from the Amazonas region, Brazil and one source of data from Hong Kong were used in calculating empirical epidemiological parameters. For the Amazonas region, mortality, and case data from the SIVEP-Gripe (Sistema de Informação de Vigilância Epidemiológica da Gripe) SARI (severe acute respiratory infections) database, including both class 4 and 5 death records (corresponding to confirmed and suspected COVID-19 deaths), from the 30th of November 2020 up to 7th of February 2021, were used. Here we were interested in cases caused by the P.1/Gamma VOC first detected in Manaus, the number of P.1 cases was calculated by using the proportion of P.1/Gamma viral sequences uploaded to GISAID within each week (Supplementary Figure 1). For Hong Kong, all case and mortality data were extracted from the Centre of Health Protection, Department of Health, the Government of the Hong Kong Special Administrative region up to the 7th of May 2020. Due to lags in the development of detectable viral loads, symptom onset and subsequent testing (Gostic et al., 2020); the date in which each case was recorded was left shifted by 5 days within our models (Pullano et al., 2021) to account for these delays in both datasets.
Basic Reproduction Number
The R0 was estimated using a time series of confirmed SARS-CoV-2 cases from both Hong Kong and the Amazonas region. To avoid the impact of NPIs on R0 estimates, only data up to the banning of mass gathering in Hong Kong (27th March 2020) and up to the imposition of strict restrictions in the Amazonas region (12th January 2021) were used. Weekly counts of confirmed cases were modelled using maximum likelihood methods. The weekly case counts were assumed to be Poisson distributed and were fitted to a deterministic closed Susceptible-Exposed-Infectious-Recovered (SEIR) model (Equation 1) by maximising the likelihood of observing the data given the model parameters (Table 1).
Subsequently, the log-likelihood was used to calculate the R0 by fitting β, the effective contact rate (Equation 2). To generate approximate confidence intervals for R0, bootstrapping was used with 1000 iterations.
Time-varying Effective Reproduction Number
To estimate the Rt from empirical line list data the EpiFilter model (Parag, 2021) was used. To estimate Rt, EpiFilter uses a renewal transmission model; a general and popular framework used in the modelling of infectious diseases (Fraser, 2007). This model describes how the number of new cases (incidence) at time t depends on the Rt at that specified time point and the past incidence, which is summarised by the cumulative number of cases up to each time point weighted by the generation time distribution. Moreover, EpiFliter integrates both Bayesian forward and backward recursive smoothing. This improves Rt estimates by leveraging the benefits of two of the most popular Rt estimation approaches: EpiEstim (Cori et al., 2013) and the Wallinga-Teunis equation (Wallinga and Teunis, 2004). Both methods only utilise a proportion of the information available with either past or future incidence being informative. EpiFilter combines both past and future information and consequently minimises the mean squared error in estimation and reduces dependence on prior assumptions. We assume the generation time distribution is well approximated by the serial interval (SI) distribution (Flaxman et al., 2020). EpiFilter was used as a reference for parameters estimated from genomic data.
Growth Rate
After the Rt has been inferred, its relationship with rt as described by the Wallinga-Lipsitch equation for a gamma distributed generation time (Equation 3) was used to estimate rt (Wallinga and Lipsitch, 2007). The SI and variance for Hong Kong were derived from a systematic review and meta-analysis exploring these values (Rai, Shukla and Dwivedi, 2021) and a study exploring SI in Brazil was used for the Amazonas datasets (Prete et al., 2021). The SI was assumed to be gamma distributed. The gamma distribution is represented by gamma = (ε, γ).
SARS-CoV-2 Brazilian Gamma VOC and Hong Kong datasets
All high-quality, complete SARS-CoV-2 genomes were downloaded from GISAID (Shu and McCauley, 2017) for Hong Kong (up to 7th May 2020) and the Amazonas state, Brazil (from 30th November 2020 up to 7th February 2021). Using the Accession ID of each sequence, all sequences were screened and only sequences previously analysed and published in PubMed, MedRxiv, BioRxiv, virological or Preprint repositories were selected for subsequent analysis. For both datasets, sequence alignment was conducted using MAFFTV.7 (Katoh et al., 2002). The first 130 base pairs (bp) and last 50 bps of the aligned sequences were trimmed to remove potential sequencing artefacts in line with the Nextstrain protocol (Hadfield et al., 2018). Both datasets were then processed using the Nextclade pipeline for quality control (https://clades.nextstrain.org/). Briefly, the Nextclade pipeline examines the completeness, divergence, and ambiguity of bases in each genetic sequence. Only sequences deemed ‘good’ by the Nextclade pipeline were selected for. Subsequently, all sequences were screened for identity and in the case of identical sequences, for those with the same location, collection date, only one such isolate was used. Moreover, PANGO lineage classification was conducted using the Pangolin (Rambaut et al., 2020) software tool (http://pangolin.cog-uk.io) on sequences from the Amazonas region and only those with the designated P.1/Gamma lineage were selected for (Supplementary Figure 1).
Maximum Likelihood tree reconstruction
Maximum likelihood phylogenetic trees were reconstructed using IQTREE2 (Minh et al., 2020) for both datasets. A TIM2 model of nucleotide substitution with empirical base frequencies and a proportion of invariant sites was used as selected for by the ModelFinder application (Kalyaanamoorthy et al., 2017) for the Hong Kong dataset. For the Brazilian dataset, a TN model of nucleotide substitution (Tamura and Nei, 1993) with empirical base frequencies was selected for. To assess branch support, the approximate likelihood-ratio test based on the Shimodaira–Hasegawa-like procedure with 1,000 replicates (Anisimova et al., 2011), was used.
Root-to-tip regression
To explore the temporal structure of both the Brazilian and Hong Kong dataset, TempEst v.1.5.3 (Rambaut et al., 2016) was used to regress the root-to-tip genetic distances against sampling dates (yyyy-mm-dd). The ‘best-fitting’ root for the phylogeny was found by maximising the R2 value of the root-to-tip regression. Several sequences showed incongruent genetic diversity and were discarded from subsequent analyses. This resulted in a final dataset of N = 117 Hong Kong sequences and N = 196 Brazilian sequences. The gradient of the slopes (clock rates) provided by TempEst were used to inform the clock prior in the phylodynamic analysis.
Subsampling for analysis
Four retrospective sampling schemes were used to select a subsample of Amazonas and Hong Kong sequences. Each sampling period was broken up into weeks with each week being used as an interval according to a temporal sampling scheme (without replacement). This temporal sampling scheme was based on the number of reported cases of SARS-CoV-2.
Temporal sampling schemes explored were:
Uniform sampling: All weeks have equal probability.
Proportional sampling: Weeks are chosen with a probability proportional to the value of the number of cases in each epi-week.
Reciprocal-proportional sampling: Weeks are chosen with a probability proportional to the reciprocal of the number of cases in each epi-week.
No sampling strategy applied: All sequences were included without a sampling strategy applied.
These sampling schemes were inspired by those recommended by the WHO for practical use in different settings and scenarios (World Health Organisation, 2021b). Proportional sampling is equivalent to representative sampling, uniform sampling is equivalent to fixed sampling whilst the unsampled data includes all sampling strategies. Reciprocal-proportional sampling is not commonly used in practice as was used as a control within this study.
Bayesian Evolutionary Analysis
Date molecular clock phylogenies were inferred for all sampling strategies applied to the Amazonas and Hong Kong dataset using BEAST v1.10.4 (Suchard et al., 2018) with BEAGLE library v3.1.0 (Ayres et al., 2019) for accelerated likelihood evaluation. For both the Amazonas and Hong Kong datasets, a HKY substitution model with gamma-distributed rate variation among sites and four rate categories was used to account for among-site rate variation (Hasegawa, Kishino and Yano, 1985). A strict clock molecular clock model was chosen. Both the Amazonas and Hong Kong dataset were analysed under a flexible non-parametric skygrid tree prior (Hill and Baele, 2019). Four independent MCMC chains were run for both datasets. For the Amazonas dataset, each MCMC chain consisted of 250,000,000 steps with sampling every 50,000 steps. Meanwhile, for the Hong Kong dataset, each MCMC chain consisted of 200,000,000 steps with sampling every 40,000 steps. For both datasets, the four independent MCMC runs were combined using LogCombiner v1.10.4 (Suchard et al., 2018). Subsequently, 10% of all trees were discarded as burn in, and the effective sample size of parameter estimates were evaluated using TRACER v1.7.2 (Rambaut et al., 2018). An effective sample size of over 200 was obtained for all parameters. Maximum clade credibility (MCC) trees were summarised using Tree Annotator (Suchard et al., 2018).
Phylodynamic Reconstruction
Estimation of the Reproduction Number and Time-varying Effective Reproduction Number The Bayesian birth-death skyline (BDSKY) model (Stadler et al., 2013) implemented within BEAST 2 v2.6.5 (Bouckaert et al., 2019) was used to estimate time-varying rates of epidemic transmission, measured as changes in Rt (Table 2). A HKY substitution model with a gamma-distributed rate variation among sites and four rate categories (Hasegawa, Kishino and Yano, 1985) was used alongside a strict molecular clock model. A lognormal distribution was used for Rt. The selected number of intervals for both datasets was 5, representing Rt changing every 2.5 weeks for the Hong Kong datasets and every 2 weeks for the Brazilian datasets, with equidistant intervals per step. An exponential distribution was used with a mean of 36.5y−1 for the rate of becoming infectious, assuming a mean duration of infection of 10 days (Nadeau et al., 2021). A uniform distribution was used for the sampling proportion. Four independent MCMC chains were run for 50 million MCMC steps with sampling every 5000 steps for each dataset. The four independent MCMC runs were combined using LogCombiner v2.6.5. (Bouckaert et al., 2019) and the effective sample size of parameter estimates were evaluated using TRACER v1.7.2 (Rambaut et al., 2018). An effective sample size of over 200 was obtained for all parameters. The bdskytools R package (https://github.com/laduplessis/bdskytools) was used to plot the BDSKY results.
Estimation of Growth Rates
For each dataset, a scaled proxy for rt was estimated through time using the skygrowth model (Volz and Didelot, 2018) within R. Skygrowth uses MCMC to apply a first-order autoregressive stochastic process, founded on a non-parametric Bayesian approach, on the growth rate of the effective population size. The MCMC chains were run for one million iterations for each dataset on their MCC tree with an Exponential (10−5) prior on the smoothing parameter. The skygrowth model was parameterised assuming that the effective population size of SARS-COV-2 could change every two weeks. To enable comparisons of rt estimated by skygrowth and rt estimated by EpiFilter, the rt provided by the skygrowth model was converted to the exponential growth rate. To do this, the Rt was calculated from rt by adding a gamma rate variable which assumed a mean duration of infection of 10 days (Nadeau et al., 2021). Subsequently, the Wallinga-Lipsitch equation (Equation 3) was used to convert Rt into the exponential growth rate (Wallinga and Lipsitch, 2007).
Comparing Parameters Estimates from Genetic and Epidemiological Data
To compare parameters estimates from epidemiological and genetic data the Jensen-Shannon divergence (DJS) (Lin, 1991), which measures the similarity between two probability mass functions (PMFs), was applied. The DJS offers a formal information theoretic evaluation of distributions and is more robust than comparing Bayesian credible intervals (BCIs) since it considers both the shape and spread of a given distribution. The DJS is essentially a symmetric and smoothed version of the Kullback-Leibler divergence (DKL) and is commonly used in the fields of machine learning and bioinformatics. The DKL between two PMFs, P and Q, is defined as DKL in Equation 4 below (Kullback and Leibler, 1951). To calculate the PMF for each epidemiological parameter, the cumulative probability density function (PDF) was extracted for each model, converted to a probability density function (PDF), and a discretisation procedure then applied (Equation 5). τ represents the PDF and is discretized via Equation 4, where s = 0.05, 0.01….and τ(v) is the cumulative probability density of τ. The Jensen-Shannon distance (JSD) metric quantifies the square-root of the total DJS to the average probability distribution and is the metric that we used to compare parameter estimations from differing sampling strategies. The DJS can be calculated using Equation 6 with P and Q representing the two probability distributions and DKL as the KL divergence. A smaller JSD metric indicates that P and Q are more similar with a Jensen-Shannon distance of 0 indicating equivalence of the two distributions. The mean JSD was taken over all intervals for the BDSKY and Skygrowth models to obtain an overall measure of the level of estimated similarity.
RESULTS
Sampling Schemes
Hong Kong
Hong Kong reacted rapidly upon learning of the emergence of SARS-CoV-2 in Wuhan, Hubei province, China by declaring a state of emergency on the 25th of January 2020 and by mobilising intensive surveillance schemes in response to initial cases (Cowling et al., 2020). This appeared to be successful in controlling the first wave of cases. However, due to imported cases from Europe and North America, a second wave of SARS-CoV-2 infections emerged prompting stricter NPIs such as the closure of borders and restrictions on gatherings (Cowling et al., 2020). Following these measures, the incidence of SARS-CoV-2 rapidly decreased (Figure 1). Hong Kong has a high sampling intensity with 11.6% of confirmed cases sequenced during our study period.
The number of cases within Hong Kong for each week was used to inform the sampling schemes used within this study. This resulted in the unsampled scheme having N = 117 sequences, the proportional sampling scheme having N = 54 sequences, the uniform sampling scheme having N = 79 and the reciprocal-proportional sampling scheme having N = 84 sequences (Supplementary Figure 2).
Amazonas
The Amazonas state of Brazil had its first laboratory confirmed case of SARS-CoV-2 in March 2020 in a traveller returning from Europe (Nascimento et al., 2020). The first wave of SARS-CoV-2 infections within the state peaked in early May 2020 (Figure 2). From then, the epidemic waned, cases dropped, remaining stable until mid-December 2020. The number of cases then started growing exponentially, ushering in a second epidemic wave. This second wave peaked in January 2021 (Figure 2) and was caused by the emergence of a new SARS-CoV-2 VOC, designated P.1/Gamma (Faria et al., 2021).
To combat this second wave, the Government of the Amazonas state suspended all non-essential commercial activities on the 23rd of December 2020 (http://www.pge.am.gov.br/legislacao-covid-19/). However, in response to protests, these restrictions were reversed, and cases continued to climb. On the 12th of January, NPIs were re-introduced (http://www.pge.am.gov.br/legislacao-covid-19/) which seemed to be successful in reducing the case incidence in the state. However, cases remain comparatively high (Figure 2). Amazonas has a low sampling intensity with 2.4% of suspected P.1/gamma cases sequenced during our study period.
The number of cases within the Amazonas region informed the sampling schemes used within this study. This resulted in the unsampled scheme having N = 196 sequences, the proportional sampling scheme having N = 168 sequences, the uniform sampling scheme having N = 150 and the reciprocal-proportional sampling scheme having N = 67 sequences (Supplementary Figure 3).
Root-to-tip Regression
The correlation (R2) between genetic divergence and sampling dates for the Hong Kong datasets ranged between 0.36 and 0.52 and between 0.13 and 0.20 for the Amazonas datasets. This implies that the Hong Kong datasets have a stronger temporal signal. This is likely due to the Hong Kong datasets have a wider sampling interval (106 days) compared to the Amazonas datasets (69 days). A wider sampling interval can lead to a stronger temporal signal (Drummond et al., 2003). No association between the number of sequences in each sampling scheme and the R2 was found. This implies that the data has a high degree of non-independence which is an unexpected finding as more independent data should reduce the effects of stochasticity. The gradient (rate) of the regression ranged from 1.24×10−3 to 1.72×10−3 s/s/y for the Hong Kong datasets and 4.41×10−4 to 5.28×10−4 s/s/y for the Amazonas datasets.
Estimation of Evolutionary Parameters
The mean substitution rate (measured in units of number of substitutions per site per year, s/s/y) and the time to most common recent ancestor (TMRCA) was estimated in BEAST, for both datasets, and the estimation from all sampling schemes was compared.
Hong Kong
For Hong Kong, the mean substitution rate per site per year ranged from 9.16×10−4 to 2.09×10−3 with sampling schemes all having overlapped BCI (Supplementary table 2; Supplementary Figure 4A). This indicates that the sampling scheme did not have a significant impact on the estimation of the clock rate. Moreover, the clock rate is comparable to estimations from the root-to-tip regression and to early estimations of the mean substitution rate per site per year of SARS-CoV-2 (Duchene et al., 2020).
Molecular clock dating of the Hong Kong dataset indicates that the estimated time of the most common recent ancestor was mid-November 2019 and early January 2020 (mean, 10th December 2019; 95% BCI interval, 14th November 2019 – 1st January 2020, Figure 3B; Supplementary Table 2). This is around 5 weeks before the first confirmed case which was reported on the 18th of January 2021. Once again, all sampling strategies have overlapped BCIs suggesting that the sampling scheme does not significantly impact the estimation of the TMRCA.
Brazil
For the P.1 lineage in the Amazonas region, the mean substitution rate ranged from 4.00 x 10-4 to 5.56 x 10-4 with all sampling schemes having overlapped BCIs (Figure 3D, Supplementary Table 2; Supplementary Figure 4B). This indicates that sampling strategy does not impact the estimation of the clock rate, supporting findings from the Hong Kong dataset. This supports estimations from the root-to-tip analysis.
Molecular clock dating estimated a TMRCA between mid-September and mid-November (mean, 23rd October 2020; 95% BCI interval, 16th September 2020 – 18th November 2020, Figure 3D; Supplementary Table 2). This is around five weeks before the date of the first P.1 case identified in Manaus used in our study. All sampling schemes have overlapping BCI supporting the inference form the Hong Kong datasets that TMRCA is robust to sampling.
Estimation of Basic Reproduction Number
We found that Hong Kong had a significantly lower R0 of 2.17 (95% credible interval (CI) = 1.43 - 2.83) when compared to Amazonas which had a R0 of 3.67 (95% CI = 2.83 – 4.48). All sampling schemes for both datasets were characterised by similar R0 values (Figure 3) indicating that the estimation of R0 is robust to changes in sampling scheme.
Time-varying Reproduction number and Growth rate
We examine the Rt and rt estimated for local SARS-CoV-2 epidemics in Hong Kong and Amazonas, Brazil. Our main results showing these two parameters and JSD are in figures 4-8.
Hong Kong
The BDSKY model was used alongside the EpiFilter model to estimate the Rt for each dataset subsampled according to the different sampling strategies (Figure 4). Based on the proportional sampling scheme, which had the lowest JSD (Figure 4E), we initially infer a super-critical Rt value, with a mean around an Rt value of 2, that appears to fall swiftly in response to the state of emergency and the rapid implementation of NPIs. A steady transmission rate subsequently persisted throughout the following weeks around the critical threshold (Rt = 1). This period is succeeded by a sharp increase in Rt, peaking at a mean Rt value of 2.6. This is likely due to imported cases from North America and Europe (Cowling et al., 2020). This led to a ban on international travel resulting in a sharp decline in Rt (Figure 2). However, this decline lasted around a week with the mean Rt briefly increasing until more stringent NPIs such as the banning of major gatherings were implemented. Following this, the Rt continued its sharp decline falling below the critical threshold, with transmission becoming sub-critical (Figure 4).
These results were mirrored in the estimation of rt. (Figure 5) for which the uniform and proportional sampling schemes showed the least divergence (Figure 5E). There was an initial decline in the rt, which steadied at a value of ∼ 0, indicating that epidemic stabilisation has occurred. This stable period is followed by an increase in rt peaking at around a 5% increase in case incidence per day (Figure 5). In response to NPIs, the rt starts to decrease, falling below 0, indicating a receding epidemic. The rate of this decline peaks at around a 7.5% reduction in case incidence per day (Figure 5).
Brazil
Based on the uniform sampling scheme, which had the lowest JSD (Figure 6E), we initially infer a super-critical Rt (Rt > 1) value with a mean value of Rt = 3 (Figure 6). From this point, the Rt declines, although it remains above the critical threshold (Rt = 1) for much of the study period. Sub-critical (Rt < 1) transmission was only reached after the re-imposition of NPIs. This implies that initial restrictions, such as the suspension of commercial activities, were ineffective in lowering the Rt below its critical threshold. Only after more stringent restrictions were imposed did Rt become sub-critical. However, there is no evidence of a sharp decrease in Rt once restrictions were re-imposed, indicating they may have not had a significant impact on Rt.
Based on the uniform sampling which had the lowest JSD (Figure 7E) we infer a steady decline in rt which matches the pattern seen with the Rt value (Figure 7). The initial rt implied a 23% mean increase in case incidence per day. Subsequently, the rt falls over the study period. rt falls below 0 after the re-imposition of NPIs with a 3% reduction in mean case incidence per day by the end of the study period. There is no evidence of any noticeable declines in rt when interventions were introduced indicating that they may have had a minimal impact on the growth rate of P.1/gamma.
Discussion
In this study, phylodynamic methods have been applied to available SARS-CoV-2 sequences from Hong Kong and the Amazonas region of Brazil to infer their relevant epidemiological parameters and to compare the impact that various sampling strategies have on the phylodynamic reconstruction of these parameters.
We estimated the basic reproductive number of SARS-CoV-2 in Hong Kong to be 2.17 (95% CI = 1.43-2.83). This supports previous estimates of the initial R0 in Hong Kong (Cowling et al., 2020; Zhao et al., 2020) which estimates R0 to be 2.23 (95% CI = 1.47-3.42). For the Amazonas region in Brazil, we estimated the R0 to be 3.67 (95% CI = 2.83 – 4.48). Whilst the population of Amazonas State may not be fully susceptible to P.1/gamma (Faria et al., 2021), this shouldn’t affect the comparison between sampling schemes. Comparisons of different sampling schemes have revealed the R0 is robust to changes in sampling schemes (Figure 3A and C).
For the Hong Kong dataset, the proportional sampling scheme was superior to all other sampling schemes in estimating Rt. It successfully predicted the initial super-critical Rt, its decline in response to rapid NPIs, and subsequent increase and decline during the second wave of infections (Figure 4B). This was in comparison to the reciprocal-proportional that provided the worst JSD (Figure 4D) and in which the Rt remained relatively constant throughout the period. In addition, the proportional sampling scheme, alongside the uniform sampling scheme, best estimated rt (Figure 5B and C). In contrast, for the Amazonas dataset, the uniform sampling scheme best estimated the Rt and was joint best for rt (Figure 6C and Figure 7C). It captured both its initial super-critical Rt and high rt alongside their subsequent decline. Our estimations for Rt are consistent with previous estimates of P.1 in Amazonas state (Faria et al., 2021). This contrasted with the unsampled data in which the rt increased at the end of the period (Figure 7A). This highlights that unlike R0, both Rt and rt are sensitive to changes in sampling and that even related epidemiological parameters like Rt and rt may require different sampling strategies to optimise inferences.
Molecular clock dating of the Hong Kong and Amazonas dataset has revealed that the date of origin is robust to changes in sampling schemes. For Hong Kong, SARS-CoV-2 likely emerged in mid-December 2019 around 5 weeks before the first reported case on the 22nd of January 2020 (Cowling et al., 2020). The Amazonas dataset revealed that the date of the common ancestor of the P.1 lineage emerged around late October 2020, around 5 weeks before the first reported case on the 6th of December (Faria et al., 2021), with all BCI’s overlapping for each sampling strategy. Like the molecular clock dating, we found that the molecular clock rate was robust to changes in sampling strategies in both datasets with all sampling strategies having overlapped BCI’s (Supplementary Table 2 and Supplementary Figure 4). For the Hong Kong dataset, its clock rate is comparable to early estimations of the mean substitution rate per site per year of SARS-CoV-2 (Duchene et al., 2020). However, the clock rate estimated for the Brazilian dataset is lower than initial 8.00×10−4 s/s/y which is used in investigating SARS-CoV-2 (Andersen et al., 2020) and that has been used in previous analyses of P.1 (Naveca et al., 2021). This initial estimation of evolutionary rate was estimated from genomic data taken over a short time span at the beginning of the pandemic introducing a time dependency bias (Ghafari et al., 2022). By using a more appropriate clock rate it can improve tree height and rooting resulting in more robust parameter estimations (Boskova, Stadler and Magnus, 2018).
Treating sampling times as uninformative has been shown to be inferior to including them as dependent on effective population size and other parameters by several previous studies (Hall, Woolhouse and Rambaut, 2016; Karcher et al., 2016; Liu et al., 2020; Parag, du Plessis and Pybus, 2020). Whilst these studies did not consider the estimation of epidemiological parameters, they highlight the potential of systematic biases being introduced into the phylodynamic reconstruction by not using a sampling scheme or by assuming an incorrect model for how sampling schemes introduce information. This was supported by our results as phylodynamic inferences with no sampling strategy applied had the poorest performance for both Hong Kong and the Amazonas region. This implies that sampling has a significant impact on phylodynamic reconstruction, and that exploration of sampling strategies is needed to obtain the most robust parameter estimates.
While our results provide a rigorous underpinning and insight into the dynamics of SARS-CoV-2 and the impact of sampling strategies in the Amazonas region and Hong Kong, there are limitations. The Skygrowth and BDSKY models do not explicitly consider imports into their respective regions. This is particularly relevant for Hong Kong as most initial sequences from the region were sequenced from importation events (Adam et al., 2020) which can introduce error into parameter estimation. However, as the epidemic expanded, more infections were attributable to autochthonous transmission (Adam et al., 2020), and the risk of error introduced by importation events decreased. Moreover, while sampling strategies can account for temporal variations in genomic sampling fractions there is currently no way to account for non-random sampling approaches in either the BDSKY or Skygrowth models (Vasylyeva et al., 2020). It is unclear how network-based sampling may affect parameter estimates obtained through these models (Volz, Koelle and Bedford, 2013) presenting a key challenge in molecular and genetic epidemiology. Spatial heterogeneities were also not explored within this work. This represents the next key step in understanding the impact of sampling as spatial sampling schemes would allow the reconstruction of the dispersal dynamics and estimation of epidemic overdispersion (k), a key epidemiological parameter.
This work has highlighted the impact and importance that applying temporal sampling strategies can have on phylodynamic reconstruction. Whilst more genomic datasets from a variety of countries and regions with different sampling intensities and proportions are needed to create a more generalisable sampling framework and to dissect any potential cofounders, it has been shown that genomic datasets with no sampling strategy applied can introduce significant uncertainty and biases in the estimation of epidemiological parameters. This finding identifies the need for more targeted attempts at performing genomic surveillance and epidemic analyses particularly in resource-poor settings which have a limited genomic capability.
Data Availability
All data produced in the present study are available upon reasonable request to the authors
Role of the Funding Sources
N.R.F. acknowledges support from Wellcome Trust and Royal Society Sir Henry Dale Fellowship (204311/Z/16/Z), Bill and Melinda Gates Foundation (INV-034540) and Medical Research Council-Sao Paulo Research Foundation (FAPESP) CADDE partnership award (MR/S0195/1 and FAPESP 18/14389-0) (https://caddecentre.org). K.V.P. acknowledges support from grant reference MR/R015600/1, jointly funded by the UK Medical Research Council (MRC) and the UK Department for International Development (DFID) and from the NIHR Health Protection Research Unit in Behavioural Science and Evaluation at University of Bristol.
CRediT authorship contribution statement
R.P.D.I, K.V.P and N.R.F conceived and designed the study, R.P.D.I wrote and performed the analyses. R.P.D.I wrote the manuscript which was edited and supervised by K.V.P and N.R.F. All authors have contributed to and approved the manuscript for submission.
Supplementary Figures and Tables
Footnotes
↵5 Jointly supervised this work