Abstract
The unprecedented SARS-CoV-2 global sequencing effort has suffered from an analytical bottleneck. Many existing methods for phylogenetic analysis are designed for sparse, static datasets and are too computationally expensive to apply to densely sampled, rapidly expanding datasets when results are needed immediately to inform public health action. For example, public health is often concerned with identifying clusters of closely related samples, but the sheer scale of the data prevents manual inspection and the current computational models are often too expensive in time and resources. Even when results are available, intuitive data exploration tools are of critical importance to effective public health interpretation and action. To help address this need, we present a phylogenetic summary statistic which quickly and efficiently identifies newly introduced strains in a region, resulting clusters of infected individuals, and their putative geographic origins. We show that this approach performs well on simulated data and is congruent with a more sophisticated analysis performed during the pandemic. We also introduce Cluster Tracker (https://clustertracker.gi.ucsc.edu/), a novel interactive web-based tool to facilitate effective and intuitive SARS-CoV-2 geographic data exploration and visualization. Cluster-Tracker is updated daily and automatically identifies and highlights groups of closely related SARS-CoV-2 infections resulting from inter-regional transmission across the United States, streamlining public health tracking of local viral diversity and emerging infection clusters. The combination of these open-source tools will empower detailed investigations of the geographic origins and spread of SARS-CoV-2 and other densely-sampled pathogens.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
JM was supported by T32HG008345. This work was funded in part by CDC award BAA 200-2021-11554 to RBC.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
N/A
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
All data was obtained from the public repositories GISAID, COG-UK, and GenBank. Full sample credit is included in Supplementary Data 1. Code to replicate our analysis is available at https://github.com/jmcbroome/cluster-heuristic. Code for complete simulation of covid-like phylogenetic trees is available at https://github.com/jmcbroome/pandemic-simulator Our implementation of our heuristic is implemented as part of matUtils https://github.com/yatisht/usher with additional documentation at https://usher-wiki.readthedocs.io/en/latest/ Our website source code is available at https://github.com/jmcbroome/introduction-website.