Abstract
Repeated emergence of SARS-CoV-2 variants with increased fitness necessitates rapid detection and characterization of new lineages. To address this need, we developed PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative prevalence of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to fitness. Applying PyR0 to all publicly available SARS-CoV-2 genomes, we identify numerous substitutions that increase fitness, including previously identified spike mutations and many non-spike mutations within the nucleocapsid and nonstructural proteins. PyR0 forecasts growth of new lineages from their mutational profile, identifies viral lineages of concern as they emerge, and prioritizes mutations of biological and public health concern for functional characterization.
One Sentence Summary
A Bayesian hierarchical model of all SARS-CoV-2 viral genomes predicts lineage fitness and identifies associated mutations.
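To make the phrase "multinomial logistic regression" concrete, the following is a minimal sketch of the forward model only, not the authors' implementation. It assumes that each lineage's growth rate is a linear sum of per-mutation effects and that lineage proportions at time t are a softmax of intercept plus growth rate times t. All names and numbers (`mutation_effects`, `lineage_mutations`, `intercepts`) are hypothetical. The hierarchical structure across regions and the Bayesian inference of coefficients are omitted.

```python
import math


def softmax(logits):
    """Convert a list of log-odds into proportions that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lineage_growth_rates(mutation_effects, lineage_mutations):
    """Assumed form: a lineage's growth rate is the sum of the effects
    of the mutations it carries."""
    return [sum(mutation_effects[m] for m in muts) for muts in lineage_mutations]


def predicted_prevalence(intercepts, growth_rates, t):
    """Multinomial logistic growth: proportions = softmax(intercept + rate * t)."""
    logits = [a + b * t for a, b in zip(intercepts, growth_rates)]
    return softmax(logits)


# Hypothetical toy values, chosen only for illustration.
mutation_effects = {"mut_A": 0.05, "mut_B": 0.02}
lineage_mutations = [[], ["mut_A"], ["mut_A", "mut_B"]]  # three toy lineages
intercepts = [0.0, -3.0, -6.0]  # later lineages start out rare

rates = lineage_growth_rates(mutation_effects, lineage_mutations)
for t in (0, 50, 100, 200):
    props = predicted_prevalence(intercepts, rates, t)
    print(t, [round(p, 3) for p in props])
```

In this toy setting the lineage carrying both mutations starts rare but eventually dominates, which is the qualitative behavior a fitness-based forecast relies on. In a Bayesian treatment the mutation effects and intercepts would be given priors and inferred from observed lineage counts rather than fixed by hand.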
Competing Interest Statement
The authors have declared no competing interest.
Clinical Trial
Study is based on SARS-CoV-2 genetic sequences publicly available at GISAID.org.
Clinical Protocols
https://github.com/broadinstitute/pyro-cov
Funding Statement
This work was sponsored by the U.S. Centers for Disease Control and Prevention (BAA), with additional support from the Doris Duke Charitable Foundation (J.E.L.), the Howard Hughes Medical Institute (P.C.S.), and the Evergrande COVID-19 Response Fund Award from the Massachusetts Consortium on Pathogen Readiness (J.L.).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The study was conducted using data from a public database (GISAID). No IRB approval is necessary.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
1. We used our model to conduct a global analysis of variants and case counts during the entire first two years of the SARS-CoV-2 pandemic, extending the analysis to include data from the beginning of the outbreak through January 2022 (rather than through July 2021).
2. Our data now include 6.4 million genomes, more than three times the size of our original study, and by far the largest genomic analysis for SARS-CoV-2 or any virus to date.
3. The updated model closely covers the emergence of Omicron and includes a detailed analysis of the relative fitness of the Omicron sublineages BA.1, BA.2, and BA.3, predicting BA.2 as the fittest lineage, a prediction that seems in line with current events.
4. We have completely rewritten the initial clustering method for the model to use a fine-grained, phylogenetic approach, improving on the previous approach that was based on PANGO lineages. The new clustering method recapitulates our earlier, PANGO-based findings, reveals heterogeneity within lineages, and captures these effects in the model to improve the inference.
5. We have added experimental data to probe and validate our findings, including assessment of several high-scoring mutations in cellular infectivity assays. We found that high-scoring spike RBD mutations do not consistently enhance infectivity; rather, they appear to confer immune escape.
6. We have further expanded our analysis of potential mechanisms driving fitness by correlating immune escape predictions with our mutational fitness predictions.
7. On the basis of these new data and analyses, we have broadened the scope of our discussion to highlight the major forces dominating the pandemic during its first two years, and in particular the recent transition to an immune escape phase.
8. We have added new structural analyses and figures, showing that mutational fitness changes are highly concentrated in specific structural regions of proteins, especially spike, nucleocapsid, and ORF1a.
9. We have rewritten key parts of the manuscript to speak to the broader implications of our work for basic science and public health.
Data Availability
All data was gathered from other public resources. Data preprocessing scripts are open source.