ABSTRACT
SARS-CoV-2 virus genomes are currently being sequenced at an unprecedented pace. The choice of sequences used in genetic and epidemiological analysis is important as it can induce biases that detract from the value of these rich datasets. This raises questions about how a set of sequences should be chosen for analysis, and which epidemiological parameters derived from genomic data are sensitive or robust to changes in sampling. We provide initial insights on these largely understudied problems using SARS-CoV-2 genomic sequences from Hong Kong and the Amazonas State, Brazil. We consider sampling schemes that select sequences uniformly, in proportion to case incidence, or in reciprocal proportion to case incidence, as well as a scheme that simply uses all available sequences (unsampled). We apply Birth-Death Skyline and Skygrowth methods to estimate the time-varying reproduction number (Rt) and growth rate (rt) under these strategies, as well as the related R0 and date of origin parameters. We compare these to estimates from case data derived from EpiFilter, which we use as a reference for assessing bias. We find that both Rt and rt are sensitive to changes in sampling whilst R0 and date of origin are relatively robust. Moreover, we find that the unsampled datasets (opportunistic sampling) provided, overall, the worst Rt and rt estimates for both the Hong Kong and the Amazonas case studies. We highlight that sampling strategy may be an influential yet neglected component of sequencing analysis pipelines. More targeted attempts at genomic surveillance and epidemic analyses, particularly in resource-poor settings with limited genomic capability, are necessary to maximise the informativeness of virus genomic datasets.
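As a rough illustration of the four sampling schemes named in the abstract, the sketch below assigns each week a sampling weight from its case count and then draws sequences accordingly. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `week_weights` and `subsample`, the weekly binning, and the draw-without-replacement procedure are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def week_weights(cases_per_week, scheme):
    """Return normalised per-week sampling weights (hypothetical helper).

    cases_per_week: dict mapping a week label to its reported case count.
    scheme: "uniform", "proportional" or "reciprocal".
    """
    if scheme == "uniform":
        # Every week is equally likely to contribute a sequence.
        raw = {w: 1.0 for w in cases_per_week}
    elif scheme == "proportional":
        # Weeks with more cases contribute more sequences.
        raw = {w: float(c) for w, c in cases_per_week.items()}
    elif scheme == "reciprocal":
        # Weeks with fewer cases contribute more sequences.
        # Assumption: weeks with zero cases get zero weight.
        raw = {w: (1.0 / c if c > 0 else 0.0) for w, c in cases_per_week.items()}
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw.values())
    if total == 0:
        raise ValueError("all weights are zero")
    return {w: v / total for w, v in raw.items()}


def subsample(sequences, cases_per_week, scheme, n, seed=0):
    """Draw up to n sequence IDs without replacement (hypothetical helper).

    sequences: list of (sequence_id, week) pairs.
    The "unsampled" scheme returns every available sequence unchanged.
    """
    if scheme == "unsampled":
        return list(sequences)
    weights = week_weights(cases_per_week, scheme)
    rng = random.Random(seed)
    pools = defaultdict(list)
    for seq_id, week in sequences:
        pools[week].append(seq_id)
    chosen = []
    while len(chosen) < n:
        # Only weeks that still hold sequences and have positive weight.
        weeks = [w for w in pools if pools[w] and weights.get(w, 0) > 0]
        if not weeks:
            break
        week = rng.choices(weeks, weights=[weights[w] for w in weeks])[0]
        chosen.append(pools[week].pop(rng.randrange(len(pools[week]))))
    return chosen
```

For example, with case counts `{1: 10, 2: 30}`, the proportional scheme weights week 2 three times as heavily as week 1, while the reciprocal scheme reverses that ratio.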
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
N.R.F. acknowledges support from a Wellcome Trust and Royal Society Sir Henry Dale Fellowship (204311/Z/16/Z), the Bill and Melinda Gates Foundation (INV-034540) and the Medical Research Council-Sao Paulo Research Foundation (FAPESP) CADDE partnership award (MR/S0195/1 and FAPESP 18/14389-0) (https://caddecentre.org). K.V.P. acknowledges support from grant reference MR/R015600/1, jointly funded by the UK Medical Research Council (MRC) and the UK Department for International Development (DFID), and from the NIHR Health Protection Research Unit in Behavioural Science and Evaluation at the University of Bristol.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
All data used within this study are open source with details of how to obtain them found in the methods section.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
5 Jointly supervised this work
Data Availability
All data produced in the present study are available upon reasonable request to the authors.