Abstract
The wealth of genomic data that was generated during the COVID-19 pandemic provides an exceptional opportunity to obtain information on the transmission of SARS-CoV-2. Specifically, there is great interest in better understanding how the effective reproduction number Re and the overdispersion of secondary cases, which can be quantified by the negative binomial dispersion parameter k, changed over time and across regions and viral variants. The aim of our study was to develop a Bayesian framework to infer Re and k from viral sequence data. First, we developed a mathematical model for the distribution of the size of identical sequence clusters, in which we integrated viral transmission, the mutation rate of the virus, and incomplete case detection. Second, we implemented this model within a Bayesian inference framework, allowing the estimation of Re and k from genomic data only. We validated this model in a simulation study. Third, we identified clusters of identical sequences among all SARS-CoV-2 sequences from Switzerland, Denmark, and Germany in 2021 that were available on GISAID. We obtained monthly estimates of the posterior distribution of Re and k, with the resulting Re estimates slightly lower than those obtained by other methods, and the k estimates comparable with previous results. We found comparatively higher estimates of k in Denmark, which suggests fewer opportunities for superspreading and more controlled transmission compared to the other countries in 2021. Our model included an estimation of the case detection and sampling probability, but the estimates obtained had large uncertainty, reflecting the difficulty of estimating these parameters simultaneously. Our study presents a novel method to infer information on the transmission of infectious diseases and its heterogeneity using genomic data. With increasing availability of sequences of pathogens in the future, we expect that our method has the potential to provide new insights into the transmission and the overdispersion in secondary cases of other pathogens.
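To make the cluster model described in the abstract more concrete, the following is a minimal simulation sketch in Python. It is not the authors' implementation (their functions are provided in the R package estRodis). It assumes that each case generates a negative binomial number of secondary cases with mean Re and dispersion k, that each secondary case carries a mutated sequence with probability mu and then leaves the cluster, and that each case in the cluster is independently detected and sequenced with probability p_detect. All function names and parameter values are hypothetical and chosen for illustration only.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_offspring(r_e, k, rng):
    """Negative binomial draw (mean r_e, dispersion k) as a gamma-Poisson mixture."""
    rate = rng.gammavariate(k, r_e / k)  # shape k, scale r_e / k, so the mean is r_e
    return sample_poisson(rate, rng)


def simulate_cluster(r_e, k, mu, p_detect, rng, max_size=10_000):
    """Return (true_size, observed_size) for one identical sequence cluster.

    max_size is a safety cap (an assumption of this sketch) that stops
    clusters from growing without bound when r_e * (1 - mu) >= 1.
    """
    size, active = 1, 1  # the cluster starts with a single index case
    while active > 0 and size < max_size:
        new_cases = 0
        for _ in range(active):
            offspring = sample_offspring(r_e, k, rng)
            # Each secondary case keeps the parental sequence with probability 1 - mu.
            new_cases += sum(rng.random() >= mu for _ in range(offspring))
        size += new_cases
        active = new_cases
    # Incomplete detection: each case is observed with probability p_detect.
    observed = sum(rng.random() < p_detect for _ in range(size))
    return size, observed


rng = random.Random(1)
clusters = [simulate_cluster(0.9, 0.3, 0.3, 0.5, rng) for _ in range(2000)]
# Clusters in which no case was detected never appear in the data.
observed_sizes = [obs for _, obs in clusters if obs > 0]
print(f"{len(observed_sizes)} of {len(clusters)} clusters observed")
```

The list observed_sizes plays the role of the identical sequence cluster sizes that an inference framework would take as input. In this sketch, mutation simply thins the offspring distribution, so the expected true cluster size is 1 / (1 - Re (1 - mu)) whenever Re (1 - mu) < 1.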
Author summary
Pathogen transmission is a stochastic process that can be characterized by two parameters: the effective reproduction number Re, which is the average number of secondary cases per infectious case under the current conditions of transmission and immunity, and the overdispersion parameter k, which captures the variability in the number of secondary cases. While Re can be estimated well from case data, k is more difficult to quantify since detailed information about who infected whom is required. Here, we took advantage of the enormous number of SARS-CoV-2 sequences available to identify clusters of identical sequences, which provide indirect information about the size of transmission chains at different times in the pandemic, and thus about epidemic parameters. We then extended a previously defined method to estimate Re, k, and the probability of detection from these sequence data. We validated our approach on simulated and real data from three countries, and our resulting estimates were compatible with previous estimates. In a future with increased pathogen sequence availability, we believe this method will pave the way for the estimation of epidemic parameters in the absence of detailed contact tracing data.
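As a small illustration of what the overdispersion parameter k means, the sketch below evaluates the negative binomial distribution of secondary cases with mean Re and dispersion k, and reports the probability that a case causes no secondary cases at all. The function names and the chosen values of Re and k are illustrative assumptions, not values taken from the study.

```python
import math


def nb_pmf(n, r_e, k):
    """P(n secondary cases) for a negative binomial with mean r_e and dispersion k."""
    log_p = (
        math.lgamma(n + k) - math.lgamma(k) - math.lgamma(n + 1)
        + k * math.log(k / (k + r_e))
        + n * math.log(r_e / (k + r_e))
    )
    return math.exp(log_p)


def share_without_secondary_cases(r_e, k):
    """Probability that a case infects nobody, equal to (1 + r_e / k) ** (-k)."""
    return nb_pmf(0, r_e, k)


# Same mean number of secondary cases, different amounts of overdispersion.
for k in (0.1, 0.5, 5.0):
    share = share_without_secondary_cases(1.0, k)
    print(f"k = {k}: P(0 secondary cases) = {share:.2f}")
```

For a fixed Re, smaller values of k mean that more cases cause no onward transmission, so a small number of cases must account for most secondary cases. Larger values of k correspond to more homogeneous transmission.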
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
EH, RH, and CA were supported by or received funding from the Swiss National Science Foundation (No 196046). MW, JR, and CA were supported by or received funding from the Multidisciplinary Center for Infectious Diseases, University of Bern, Bern, Switzerland. JR was supported by the Swiss National Science Foundation (No 189498). CA received funding from the European Union's Horizon 2020 research and innovation program - project EpiPose (No 101003688). This project was supported by the ESCAPE project (101095619), funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them. This work was funded by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee (grant number 10051037). This work has received funding from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number 22.00482. EH was supported by a Swiss National Science Foundation Starting Grant (TMSGI3_211225).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
In addition to multiple small edits to improve the clarity and flow of the manuscript, this version includes the following major edits. First, we highlight the conceptually novel aspect of our study, namely that identical sequences can efficiently inform epidemiological parameter inference even if the data sets are too large to allow phylodynamic analysis (pages 11-12, lines 307-324), further differentiating it from previous work. Second, we have reparameterized the model to use an estimate of the mean number of mutations per transmission to inform the mutation probability μ, instead of determining μ based on estimates of the yearly mutation rate M and the mean generation interval D. Since the mutation probability is higher in the new version of the model, the estimates of the effective reproduction number Re are higher as well. The estimates of the dispersion parameter and the testing probability remain almost identical, however. Estimates with the original mutation rate can be found in the Supplementary Material, section 7.4. Third, we have reworked Figure 2 to display the error in a more readily interpretable way, using the mean and 95% credible interval of the pooled samples of the posterior distribution. The original Figure 2 using RMSE can now be found in section 5.2 of the Supplementary Material as Figure S3.
Data Availability
Sequence data are available via GISAID after registration, in EPI_SET_240326pm (doi.org/10.55876/gis8.240326pm) for Switzerland, EPI_SET_240326mz (doi.org/10.55876/gis8.240326mz) for Denmark, and EPI_SET_240326uh (doi.org/10.55876/gis8.240326uh) for Germany; see also supplementary tables 1-3. Code to generate identical sequence clusters from the starting alignments is available via github.com/emmahodcroft/sc2_rk_public. Functions for the estimation of parameters and the simulation of identical sequence clusters are available via the R package estRodis (github.com/mwohlfender/estRodis). Code used for the analysis of data and results as well as the creation of plots and tables is available via github.com/mwohlfender/R_overdispersion_cluster_size.
http://doi.org/10.55876/gis8.240326pm
http://doi.org/10.55876/gis8.240326mz
http://doi.org/10.55876/gis8.240326uh
https://github.com/emmahodcroft/sc2_rk_public
https://github.com/mwohlfender/estRodis
https://github.com/mwohlfender/R_overdispersion_cluster_size