ABSTRACT
Accurate, reliable, and timely estimates of pathogen variant risk are essential for informing effective public health responses to infectious diseases. Despite decades of use for influenza vaccine strain selection and PCR-based molecular diagnostics, data on pathogen variant prevalence and growth advantage has only risen to its current prominence during the SARS-CoV-2 pandemic. However, such data are still often sparse: a novel variant is initially rare or a region has limited sequencing. To ensure real-time estimates of risk are available in these types of data-sparse conditions, we develop a hierarchical modeling approach that estimates variant fitness advantage and prevalence by pooling data across geographic regions. We apply this method to estimate SARS-CoV-2 variant dynamics at the country level and assess its stability with retrospective validation. Our results show that more stable and robust estimates can be obtained even when sequencing data are sparse, as compared to established, single-country estimation approaches. We discuss how this method can inform risk assessment of novel emerging variants and provide situational awareness on currently circulating variants, for a range of pathogens and use-cases.
Introduction
Virus emergence, circulation and diversity can impact outbreak dynamics and control efforts both globally and locally. Over the last two decades, the use of viral genome sequencing has shifted from primarily retrospective research toward near-real-time analyses. As we have seen during the SARS-CoV-2 pandemic, real-time variant characterization using genomic sequencing has the potential to inform public health practice1–4, enhance disease forecasting models5 and aid vaccines, therapeutics, and molecular diagnostic assay development6, 7. An understanding of variant properties (i.e., immune evasion, transmissibility, and therapeutic efficacy) combined with accurate regional estimates of each variants’ prevalence can be used to design, apply, and evaluate public health strategies to mitigate transmission and disease1, 3, 8–13.
Disease systems as diverse as influenza, HIV, and dengue virus produce spatially and temporally structured competing populations14–16. These particularities lead to significant challenges in both the reliability of early characterization of emerging variants and the accessibility of real-time variant prevalence estimates in countries with limited sequencing capacity. However, modeling of the resulting complex dynamics is generally constrained by data sparsity and heterogeneous diagnostic and sequencing capacity across the globe. Robust methods that leverage data from across regions can improve our understanding of strain dynamics in real time for any pathogen and geographic context. Established analytic methods, such as multinomial or logistic regression, can provide reliable estimates of variant growth rates that account for uncertainty due to changes in sampling intensity over time17. However, these approaches often require large numbers of sequences and cannot readily handle geographic heterogeneity in sampling intensity. As a consequence, many existing methods for estimating variant growth rates are inaccurate at small sample sizes18, 19. These limitations can lead to under- or over-estimates of variant fitness advantage and/or risk posed by a variant.
Because genomic data are often quite sparse, these methods can be challenging to apply in practice. In addition to being limited in numbers, genomic data are often concentrated into a few geographies, both for emerging variants and for many countries as a whole20. For example, 90% of publicly available SARS-CoV-2 sequences come from just 10% of countries (Figure 1). These disparities are not only in the raw number of sequences available, but also in the percent of cases sequenced and the turnaround time from sample collection to sequence availability20. Because of the concentration of sequencing into a few countries, sample size limitations often make country-specific, regression-based approaches unreliable. We note that this sparsity can arise through a number of different mechanisms, including insufficient local capacity to sequence genomes (i.e., “sequencing capacity”) or limited ability to divert existing sequencing capacity from existing use cases toward a pathogen of public health interest (i.e., “sequencing effort”). Established regression-based approaches are therefore most appropriate for monitoring variants in countries or regions with high sequencing capacity and effort and rapid turnaround times.
Although the available sample size can be increased by pooling sequences from multiple countries, different populations have different immunological landscapes and seeding dynamics of novel variants4, 21. Consequently, relative variant fitness can differ substantially across regions. To address this challenge, we developed a hierarchical modeling approach that jointly estimates the trajectories of variants’ growth over time. By pooling all the available sequences, it estimates variant properties globally even in regions with limited sequencing efforts and/or capacity (Figure 2). From this model, we identify the key quantities of: i.) within-country prevalence, ii.) within-country growth rate, and iii.) global fitness relative to circulating variants. The method accounts for the observed heterogeneity in variant fitness across geographic regions and sparsity in sequencing to provide more robust estimates of variant fitness shortly after emergence. By using sequences more efficiently, this approach complements investments aimed at increasing global sequencing capacity and/or can offset reductions in sequencing effort (e.g., as happened throughout 2022 with SARS-CoV-2 genomic sequencing22).
Results
Global landscape of SARS-CoV-2 genomic surveillance
In Figure 1, we examine the landscape of global SARS-CoV-2 genomic sequencing, identifying systemic variability in sequencing capacity/throughput. The majority of the world’s SARS-CoV-2 sequencing has occurred in the United States and United Kingdom, which had contributed 54.9% of all SARS-CoV-2 sequences shared via the GISAID Initative23 as of July 1, 2022. This disparity extends beyond these two countries — 10% of the countries that have shared sequences with GISAID have produced 90.3% of the SARS-CoV-2 sequences (Figure 1A). These high sequencing-capacity countries are unevenly geographically distributed; with the exception of India, all are members of the Global North. In contrast, most countries in the African continent and the Middle East have submitted only a few thousand SARS-CoV-2 sequences in total, with the notable exceptions of South Africa and Kenya (Figure 1A, inset).
In Figure 1B, we demonstrate a pattern in sequencing effort targeted towards emerging variants, finding that for each variant the rate of sequencing accelerates over time. Early on, when a variant is at low prevalence, the total number of sequences of that variant grows slowly — at a linear rate. In the early stages of variant emergence, the amount of information about the variant is limited, with only a few sequences available and the benefit provided by pooling information is highest. Later on, when that variant begins to spread more widely, we see exponential growth in the total number of sequences. This pattern of an increasing rate of sequencing over time leads to the characteristic elbow shape in the plot.
In Figure 1C, we show that this second phase of rapid sequence collection is largely driven by the concentration of genomic sequencing within a few high capacity countries. Shortly after variant emergence, the majority of the BA.1, BA.4, and BA.5 lineage sequences collected were from outside of these high sequencing capacity countries. However, once these variants reached high sequencing capacity countries, sequences were collected more rapidly, becoming a majority of the sequences collected. This change highlights that genomic surveillance for emerging variants has been largely limited to a few high-resource geographies.
We describe a general method to estimate multi-strain dynamics and relative fitness advantages by partially pooling information across patches (geographic subunits, i.e. countries) and strains (Fig 2). This statistical approach leverages a hierarchical mixed-effects Bayesian framework. The model has two levels of hierarchy. In the first level, country-specific variant fitness advantages are structured such that the fitness advantage of a variant in one location informs the expected fitness advantages of variants in other locations to formalize the assumption that variants’ properties in one location are likely to be similar to others (Fig 2A). In the second level, variants’ mean fitness advantages, averaged over countries, consist of a shared (hierarchical) normal distribution (Fig 2B). This approach shares information across variants to formalize the ecological assumption that most variants will be similarly fit to their recent ancestors and observing extreme deviations in fitness is uncommon. This assumption leads to shrinkage on extreme fitness advantage estimates for novel or otherwise infrequently observed variants for which we might otherwise overfit to noise in the data (Fig 1C).
Because of this sharing of information, robust estimates can be produced in settings with extremely sparse data. Estimated quantities include region-specific variant proportions over time with associated uncertainty as well as global and region-specific characterization of variant properties. The framework presented here is general enough to be applied to any geographic scale and can be extended to estimate multi-strain pathogen dynamics beyond SARS-CoV-2.
Identifying SARS-CoV-2 variant fitness and global emergence dynamics
In Figure 3A, we present the estimated SARS-CoV-2 variant proportions alongside the number of collected sequences over time, colored by the most frequently observed variant on each day, for a subset of countries. Global prevalence of the BA.2 lineage declined during the period from April to July 2022,as the BA.4 and BA.5 variants rapidly took over. The invasion dynamics of BA.4 and BA.5 were heterogeneous across countries: both took over in parallel in South Africa, but BA.5 had substantially higher observed growth in Israel and Bangladesh. Notably, in Bangladesh we estimate that BA.5 had reached a higher proportion of cases (56.5% CI: 21.2-83.4%) than in India as of July 1, 2022 (13.8%, CI: 1.1-17.6%) despite their geographic proximity. In contrast, Senegal had not yet experienced substantial BA.5 invasions as of July 1, 2022. Although BA.2 prevalence declined, this change was driven by BA.4 and other lineages rather than by BA.5.
In Figure 3B, we present country-specific variant growth rates, finding high country-specific fitness advantages for BA.4, BA.5, and BA.2.12.1. Country-specific fitness advantages are heterogeneous within variant type: the relative fitnesses of BA.4 and BA.5 are much higher in the United States (BA.5: 1.31 CI: 1.29-1.33, BA.4: 0.95 CI: 0.92-0.97) and the United Kingdom (BA.5: 1.21 CI: 1.17-1.24, BA.4: 0.93 CI:0.91-0.96) than in India (BA.5: 0.42 CI: 0.34-0.52, BA.4: 0.25 CI: 0.18-0.36) and South Africa (BA.5: 0.35 CI: 0.31-0.41, BA.4: 0.25 CI: 0.18-0.36). We note that while these estimated fitness advantages are often similar between the United States and United Kingdom (e.g., in the cases of BA.4 and BA.5), this relationship does not always hold true. In the case of BA.2.12.1, estimates of relative fitness advantage in the United States (0.29 CI: 0.29-0.30) are much lower than in the United Kingdom (0.66 CI: 0.64-0.69).
In Figure 3C, we present global relative fitness advantages for multiple variants of concern, finding that BA.5 is meaningfully more fit than BA.4 and BA.2.12.1. These expectations are the means of the Gaussian posterior distributions of country-specific estimates (μhierarchical) in Figure 2 (i.e., the random effect distribution) and a robust measure of overall variant fitness advantage. Most of the posterior density of the expected relative fitness advantage for BA.5 (0.94 CI: 0.85-1.03%) is higher than that for BA.4 (0.67 CI: 0.59-0.73) which is itself higher than that for BA.2.12.1 (0.45 CI: 0.40 - 0.50) (Figure 3C). A weekly fitness advantage of 100% means that in one week the proportion of total cases caused by BA.5 is expected to double, relative to those of BA.2 (i.e. BA.5 would go from 1% to 2% of cases if the proportion of those due to BA.2 stay constant). All three variants (BA.5, BA.4, and BA.2.12.1) have fitness advantages substantially above 0, indicating that each is more fit than BA.2, while the expected fitness advantage of BA.1 is below 0 (−0.24 CI: -0.27 to -0.21) — indicating that it is less fit than BA.2.
Retrospective validation and comparison
We performed a retrospective validation of the multicountry model estimates from 5 successive reference dates. We compared these estimates to those from single country fits to a Maximum Likelihood Estimation (MLE) multinomial model (referred to as the single country model). Figure 4A depicts the results of this analysis, using BA.5’s emergence in Portugal as a case study. Portugal was chosen for display because it had an early seeding of BA.5 prior to its widespread circulation in higher sequencing capacity countries, thus representing a good reliability test for our models ability to provide early, robust estimates of variant fitness. The top panels of Figure 4A depict the sequences each model was fit to, with shading corresponding to the calibration period (i.e., the date the last available sequence from that dataset was collected). We compare the model estimates (both of the past and current variant proportions, and the 21-day forecasted variant proportions) to the observed data as of July 1st, 2022. In Figure S1, we depict the same figure but compared to the data the model was fit to as of that reference date. This data changes over time as sequences from earlier collection dates get submitted to GISAID and “back-fill” observed variant proportions on a specific date.
In Fig. 4B, we show the posterior distribution of the global estimates of the relative fitness advantage of BA.5 over time, demonstrating that it is stable. Fig. S2 depicts how the multicountry fitness advantage estimates of BA.5 in Portugal compare to the single country model fitness advantage estimates over time. The single country model estimates experience wider uncertainty at early reference dates and a sharper decline.
In Fig. 4C, we evaluate the model predicted variant proportions from each reference date using the Brier score. Lower Brier scores indicate a more accurate probabilistic prediction. We compared the model predicted BA.5 proportion in Portugal to the observed proportions from the data as of July 1st 2022, using the Brier score over both the calibration and forecast period combined.The multicountry model exhibits lower Brier scores for the early reference dates during variant emergence, with scores of 0.34 [95% CI:0.30-0.31] and 0.32 [95% CI:0.31-0.33] as of April 30th, 2022 and May 16th, 2022 respectively compared to scores of 0.42 [95% CI:0.37-0.82] and 0.73 [95% CI:0.74-0.82] for the single country model on those same dates.
In Figs. 4D - E, we assess the stability of estimates from both the multicountry model (D) and the single county model (E) by showing the difference between the country-specific variant fitness estimate at each reference date compared to the country-specific variant fitness estimated at the final date (July 1st), for the two models. However, because the single country model cannot estimate country-specific variant fitness advantage if the particular variant has not been observed in sufficient quantities (<3 sequences of a variant) in that country, there are gaps in the single country model estimates indicated by the gray boxes in Fig. 4E. In contrast, the multicountry model is able to provide country-specific variant fitness advantage estimates, even before the variant has been observed in each country. Figure S3 provides the absolute transmission advantage estimates for each country-variant combination for both models.
The estimated fitness advantages from the multicountry model are more stable than those produced from the single country model (Figures S4 and S5). The estimated coefficients from the single country model are initially higher than those from the multicountry model and they decline to a stable estimate more slowly than those of the multicountry model (Figure S4). Although lower, the estimated country-specific fitness advantages from the multicountry model also have an initial transient positive bias, as can be seen in Figure 4D and Figure S5. However, the estimated global mean fitness advantages are quite stable over time (Figure S6), likely due to the additional shrinkage on these estimates from the partial pooling across global variant fitnesses (see Methods for additional information).
Discussion
Enabling broader access to information from genomic surveillance for emerging and re-emerging pathogens is and will remain a global health priority5, 7, 20, 24–26. In addition to the crucial work of building global sequencing capacity, methodological advances in how sequences are analyzed can address gaps in the surveillance landscape. Here, we demonstrate a method to reliably estimate the growth of competing viral variants, generating both global and local estimates. We apply this method to emerging SARS-CoV-2 variants (Fig 3, 4, and Fig S3, S4, S5), highlighting its robustness both in regions with limited sequencing and for emerging variants with few sequences available. We also demonstrate its applicability across a number of settings, including influenza strain dynamics to show suitability across pathogens (Fig S7) and at the AL1 level in Brazil and Argentina to demonstrate relevance to local public health surveillance (Fig S8). We believe this approach is particularly relevant for lower and middle income countries (LMICs), increasing the information available for local public health decisions in the present while capacity building continues to strengthen surveillance systems for the future22.
Our method builds on the epidemiological (e.g.,27) and phylogenetic tools (e.g.,28) characterizing emerging variants, making two major contributions: stable characterization of emerging variant growth rate and improved nowcasting of variant prevalence. Characterizing the growth of emerging variants and accurate nowcasting of prevalence are challenging problems. There are often long lag times from sequence collection to submission and reporting, especially in locations with limited sequencing capacity (20). Emerging variants have few sequences available and, usually, have been detected in only a few countries. Consequently, characterizing expected growth and prevalence involves extrapolating from the limited information available and projecting yet-to-be-observed growth in new locales. Our method helps fill this information vacuum by pooling the information from all sequencing done globally to build a picture of the relationships between variant dynamics and countries’ individual characteristics (i.e., empirically observed correlations between variant fitness in specific countries, potentially driven by similarities in the immune landscape or contact patterns). This global landscape of variant fitness is then applied to individual countries, informing our understanding of variant dynamics in data-sparse regions. It produces more reliable estimates of fitness advantage and improves the calibration of uncertainty intervals, particularly in the early days after a variant’s emergence. It also generates more robust and stable estimates of variant dynamics over time than the standard approach of independent multinomial regressions2, 17, 29, 30.
Our estimates are robust even in settings with extremely sparse data (Fig 3 and Fig S3, S4, S5), such as LMICs with limited genomic surveillance. Early identification of high growth for emerging viral strains can be used to highlight potential new variants of concern or prepare health systems for the importation and impact of new SARS-CoV-2 variants. Identifying the viral strains likely to circulate in the upcoming months could also lead to improved vaccine strain selection for pathogens like influenza and SARS-CoV-2. The framework presented here is also general enough to be applied at the sub-country level, improving local estimates of variant prevalence at the municipality or state level and informing situational awareness for public health (Fig. S8).
We evaluate the pooling approach employed here by comparing it to the standard approach of country-independent multinomial regression models (Fig 4E)17, 31. We show that our method outperforms this standard approach at estimating the growth of an emerging variant (Fig 4). Notably, the statistical regularization reduces overfitting to the noisy emergence data, reducing the biased overestimates of growth rates for novel variants produced by standard multinomial regression (Fig 4, Fig S3, S4, S5). While the mechanisms driving this transient bias are not fully understood, we speculate that this behavior is likely driven in part by potential biases in the observation process (e.g., prioritization of samples for sequencing), stochasticity in the disease-ecological process (e.g., superspreading dominating initial transmission), and overfitting to noisy data. However, these mechanisms are unlikely to completely explain the consistency in the direction of the bias across the observed lineages and settings and we hypothesize that additional mechanisms that have not yet been elucidated likely contribute as well (Fig S3, S5).
Although the modeling approach developed here can be more robust than standard approaches, it relies upon a number of assumptions and results and should be interpreted with care. We assume that sequences are randomly sampled from each geographic unit of cases (e.g., country), that samples of one variant are not prioritized for sequencing over those of another variant (e.g., due to S-gene target failure in PCR testing), and that sequences are perfectly assigned to Pango lineages. Further we assume even mixing of variants within a patch (i.e., country), that the relative fitness of one variant to another is independent over time, and that the generation interval of each variant is the same. We note, however, that estimates of the fitness advantage can be robust to biases in sampling or sub-patch structure in disease dynamics provided that the sequences come from a consistent subset of the population. These estimated fitness advantages are likely also robust to misspecification of the generation interval distribution family8, but the idealization of a constant generation interval across variants can be inaccurate32, 33 and can thus bias estimates of transmission advantage34. Further, we don’t aim to explicitly identify the very first introductions of any given variant.
In future developments of our method, we intend to extend our preliminary work on influenza and extend to additional pathogens. Further we plan to explicitly include drivers of the immune landscape of a population, including previous infections and vaccine coverage, as well as mobility and assortative mixing. This will enable us to disentangle drivers of fine-scale epidemiological trends.
We believe this method complements investments in genomic sequence capacity building by better leveraging data across regions, to both incentivize data sharing and provide more equitable access to real-time localized variant dynamics. The global variant dynamics we uncover here improve our understanding of variant emergence and growth as well as provide an avenue to further develop our understanding across pathogens and geographic scales in the future. Methods such as the one we present here are critical for situational awareness, evidence-based decision-making, and policy design for pathogen spread. New analysis methods, such as the one we present here, are crucial to the continued maturation of genomic sequencing into real-time surveillance informing public health guidance.
Methods
Data processing
Line list SARS-CoV-2 sequence metadata was accessed via the GISAID EpiCov database23. The findings of this study are based on metadata associated with 2,032,779 sequences available on GISAID up to July 1, 2022, via https://doi.org/10.55876/gis8.230118ka. Using the Pango lineage annotations in the sequence metadata, we aggregated the observed sequences by country, collection date, and pango lineage to get daily counts of the number of lineages observed. We truncated lineage assignments to their root assignment (e.g., BA.5.1 to BA.5), but preserved specified lineages of interest (BA.1, BA.2, BA.2.12.1, BA.4, and BA.5). Lineages with fewer than 50 observed sequences in the time period of interest were marked as “Other”. Additional details on data processing are available in the Supplemental Methods.
Hierarchical generalized linear modeling approach
We modeled the dynamics of competing variants of a directly transmitted infectious disease using a hierarchical multinomial regression approach (Figure 2). This approach can be used generally for competing strains, but here we used the example of SARS-CoV-2 variants using sequence metadata from GISAID to produce estimates of relative variant growth rates and true proportion of total cases in a given country.
We modeled the observed sequence counts Yi, j,t of variant i in country j on day t as drawn from a multinomial distribution (Eq.1), with the probability pi, j,t of observing a given variant in a country on a given day defined as the softmax of the linear predictor (Eq. 2). The intercepts of the linear predictor (β0,i, j) are defined as random effects drawn from a normal distribution. The intercepts describe the initial variant prevalence on the scale of the linear predictor on day 0. The country-specific relative variant growth rates (β1,i, j) are defined as j vectors of random effects drawn from a multivariate normal distribution (Eq. 3). The country-specific relative growth rates β1,i, j describe the difference in intrinsic growth rates for the variant of interest and the reference variant (i.e., where βik j = ϕk j −ϕk j, where ϕkJ the intrinsic/Malthusian growth rate of the dominant variant in country K).
The elements of the vector of the mean relative variant growth rates in the multivariate normal distribution are themselves random effects, drawn from a normal distribution with mean μhierarchical (Eq. 4). The hierarchical structure and its implications are described schematically in Figure 2. For interpretability, we presented the global and country-specific estimates of the relative growth rates as relative weekly fitness advantages using the relation as described by Davies et al.30 Where r is the relative growth rate (either β1,i, j or ) and f is the weekly fitness advantage displayed in the results.
This modeling approach made a number of assumptions about the data generating process. These included random sampling of sequences within countries, that variants are correctly assigned to lineages, and that sequence collection dates are correctly reported. The modeling approach also assumed that relative variant fitness does not change over the modeled time period, such as could potentially occur due to a vaccination campaign changing the immune landscape of the host population (i.e., during the 90-day window).
Additional detail on the observation process model, model structure, prior specification, and model limitations is available in the Supplemental Methods.
Retrospective validation of country-specific variant prevalence projections
Processing and fitting of historical datasets
SARS-CoV-2 line-list sequence data from GISAID was accessed on the following dates, which we refer to as “reference” dates: April 30th, 2022, May 16th, 2022, May 27th, 2022, June 4th, 2022, June 27th, 2022 and July 1st, 2022. A consensus set of lineages was found by identifying all lineages that exceed the global threshold of 50 or more observed sequences in the past 90 days across any of the 6 reference datasets. For the dataset from each reference date, we defined a calibration period: the time period up to and including the day the last sequence was collected. For the multicountry model all sequences in the dataset were included and for the single country model only sequences from the country of interest were included. We also defined an associated forecast period: the days after the last sequence was collected. For example, if we accessed the data on April 30th, 2022, and the most recent collection date in that dataset was April 27th, 2022, than the calibration period would be any time before and including April 27th, and the forecast period would be any time after April 27th.To test the model’s predictive power, we forecasted a maximum of 21 days out and compared this to any data observed by the last reference dataset from July 1st, 2022.
Estimation model comparison
For each reference dataset, we compared variant prevalence estimations from the multicountry model with estimations from the single country multinomial model. In order to get an estimate of the uncertainty of these outputs, non-parametric bootstrapping with replacement was performed, generating 100 bootstrapped datasets with time points randomly sampled with replacement from the true data. To fit the single country model to the data and the boot-strapped datasets, we used the nnet package35 in R which returns the maximum likelihood estimation (MLE) of the multinomial model parameters, which we transform into variant fitness advantages and variant prevalence dynamics using equations 2 and 5.
Evaluation of model estimates
For both the multicountry and the single country model, we used 100 draws from the distributions of lineage prevalence estimates to evaluate the accuracy of the model predictions compared to the observed daily lineage prevalence from the July 1st, 2022 reference dataset. For countries that observed no sequences of a particular lineage in the consensus dataset during the 90 day time window, the single country estimation model does not estimate a prevalence for that lineage. To enable a fair comparison of the two model outputs at the country level, we collapsed all prevalence estimations from the multicountry model for these unobserved lineages into “other” for that country. We used the Brier score to evaluate the accuracy of each draw of the model output. The Brier score was calculated at the country-level for both the calibration period and the forecast period, and the combination. The result was a distribution of Brier scores at each reference time point.
Data Availability
All code is made publicly available at this Github repository: https://github.com/PandemicPreventionInstitute/ppi-variant-tracker-manuscript. We do not include any data in the repository, but all results can be reproduced using the GISAID SARS-CoV-2 and flu metadata for authenticated users.
https://github.com/PandemicPreventionInstitute/ppi-variant-tracker-manuscript
Author contributions statement
ZS, KEJ, SVS, and AIB designed the study, ZS, KEJ, SVS, LW, and AIB conceptualized the work, ZS and KEJ wrote the code, conducted the analyses and wrote the first draft. ZS, KEJ, SVS, and AIB contributed to the model formulations and interpreted the results. All authors contributed to editing the manuscript and gave final approval for publication and agree to be held accountable for the work performed therein.
Acknowledgements
We acknowledge the financial support of The Rockefeller Foundation, who funded this work. We thank Nick Reich, Spencer Fox and Moritz Kraemer for their valuable comments. We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative23, 36, 37, on which this research is based. We also gratefully acknowledge Our World in Data38, who curated data enabling this work.
Footnotes
↵* zsusswein{at}rockfound.org; abento{at}rockfound.org
This version of the manuscript has been updated to clarify the timeline and implementation of the variant dashboard based on the method in this work. It also makes modifications to use GISAID's preferred styling, citation, and method of crediting originating labs.