Abstract
Detecting novel pathogens at an early stage requires robust early warning that is both sensitive and pathogen-agnostic. Wastewater metagenomic sequencing (W-MGS) could meet these goals, but its sensitivity and financial feasibility depend on the relative abundance of novel pathogen sequences in W-MGS data. Here we collate W-MGS data from a diverse range of studies to characterize the relative abundance of known viruses in wastewater. We then develop a Bayesian statistical model to integrate these data with epidemiological estimates for 13 human-infecting viruses, and use it to estimate the expected relative abundance of different viral pathogens for a given prevalence or incidence in the community. Our results reveal pronounced variation between studies, with estimates differing by one to three orders of magnitude for the same pathogen: for example, the expected relative abundance of SARS-CoV-2 at 1% weekly incidence varied between 10-7 and 10-10. Integrating these estimates with a simple cost model highlights similarly wide inter-study and inter-pathogen variation in the cost of W-MGS-based early detection, with a mean yearly cost estimate of roughly $19,000 for a Norovirus-like pathogen and $2.9 million for a SARS-CoV-2-like pathogen at 1% incidence. The model and parameter estimates presented here represent an important resource for future investigation into the performance of wastewater MGS, and can be extended to incorporate new wastewater datasets as they become available.
Introduction
Biosurveillance of viruses present in wastewater has been used as a public health tool for decades1. After Poliovirus was first identified in sewage in 19392, the virus was frequently tracked in wastewater to aid eradication efforts1, and polio wastewater surveillance remains an important tool for monitoring and suppressing reintroductions of the disease3. Similarly, wastewater surveillance of norovirus has enabled both advance warning3 and phylogenetic analyses of norovirus evolution4.
Nevertheless, until very recently, wastewater biosurveillance was largely restricted to a few specific pathogens in particular locales. This changed with the COVID-19 pandemic, when the need to reliably track SARS-CoV-2 led to a surge in wastewater surveillance efforts across the world5. Public health agencies and private companies employed qPCR to track the spread of COVID-196 and amplicon sequencing to identify and trace SARS-CoV-2 variants7. Monitoring efforts are now expanding to track other pathogens, such as Chikungunya8, Dengue virus8, respiratory syncytial virus (RSV)9 and influenza9. However, most such efforts remain targeted, using qPCR9, amplicon sequencing10, or other assays11 to sensitively monitor for a defined list of target pathogens. Targeted methods thus limit the utility of wastewater surveillance for identifying novel pathogens.
In contrast, methods like untargeted metagenomic sequencing are in principle able to detect any nucleic acid present in a sample12. A wide diversity of enteric and non-enteric viruses have been found to shed via human feces into wastewater13, including many pathogens that are not routine targets for wastewater monitoring at present. This broad swath of pathogens therefore presents a promising target for metagenomic sequencing. Additionally, wastewater-based MGS (W-MGS) could detect novel pathogens spreading asymptomatically, thus complementing traditional surveillance approaches like syndromic hospital surveillance.
However, the utility of W-MGS for pathogen surveillance will depend on its sensitivity as a detection tool. This is heavily dependent on the number of pathogen reads present in a given dataset, which in turn is driven by two key parameters: the sequencing effort (number of sequencing reads produced) and the relative abundance of the pathogen (fraction of all reads in the dataset that arise from that pathogen). To a first approximation, the sequencing effort and relative abundance required to achieve a given sensitivity are inversely proportional: the lower the relative abundance, the more sequencing needed to achieve detection. The cost-effectiveness of W-MGS surveillance thus depends critically on pathogen relative abundance in W-MGS data: even given rapidly falling sequencing costs14, sufficiently low relative abundances would still make W-MGS cost-prohibitive as a surveillance tool.
To better understand the cost of W-MGS for pathogen surveillance, we need to understand the relationship between a disease’s epidemiological status (in particular, its incidence or prevalence in the community) and pathogen relative abundance in wastewater. Such relative abundance estimates, however, have not previously been generated in any systematic manner. To address this important gap in the biosurveillance literature, we developed a hierarchical Bayesian model that combines publicly available metagenomic datasets with epidemiological data for a range of known viral pathogens, allowing us to estimate the relative abundance of a given virus in wastewater at a given prevalence or incidence. Integrating these estimates with a simple sequencing cost model, we then calculated the cost of pathogen detection with W-MGS at a given target level of sensitivity.
Our results highlight pronounced variability in the relative abundance of different human-infecting viruses across studies, with corresponding variability in the predicted cost of detection with W-MGS. Under more optimistic assumptions, W-MGS appears a viable approach for pathogen detection in the medium-term, while under more pessimistic ones it faces severe obstacles to feasibility.
Results
Relative abundance of human-infecting viruses varies widely among wastewater studies
We performed a literature search for large (>100M read pairs), untargeted W-MGS datasets obtained from raw treatment-plant influent. We identified 10 such datasets (Table 1), obtained the data, and processed it with a Kraken2-based computational pipeline to estimate the relative abundance of human-infecting viruses (Methods). Figure 1a shows the relative abundance of different viral classes in each dataset, calculated as the fraction of all reads that map to the relevant taxa.
The relative abundance of viruses as a whole varied by several orders of magnitude between studies, from roughly 1 in 4 reads (Yang et al. 202123) to 1 in 11,000 (Ng et al. 201922) (Table S1).
Unsurprisingly, the relative abundance of RNA viruses was much higher in studies that conducted RNA sequencing, while that of DNA viruses was higher in studies that conducted DNA sequencing; however, the effect was much less dramatic for DNA than RNA viruses (Fig. S1), likely reflecting the presence of DNA virus transcripts in RNA data.
Relative abundance of human-infecting viruses also varied substantially between studies, from roughly 1 in 1,500 reads (Yang et al. 202123) to 1 in 8.3 million (Bengtsson-Palme et al. 201618) (Table S1). On average, human-infecting viruses accounted for roughly 1 in 440,000 reads across all studies (Table S1). The taxonomic composition of human-infecting viruses also varied dramatically (Fig. 1b); for example, the fraction of human-infecting virus reads mapping to Caliciviridae varied from 52% (Rothman et al. 202116) to 3% (Spurbeck et al. 202317), and Polyomaviridae relative abundance varied from 87% (Brumfield et al. 202219, DNA Subset) to 7% (Munk et al. 202221).
A multi-level Bayesian model enables flexible combination of epidemiological and metagenomic data
These large inter-study differences in viral abundance and composition could arise from a variety of factors, including differences in catchment demographics, sewershed hydrology, and sample processing methods. One especially critical factor, however, is the number of infected people contributing to the sewershed. We therefore collated public-health data on incidence (number of new infections per unit time) or prevalence (number of infected people as a fraction of the population) for a range of human-infecting viruses, and constructed a hierarchical Bayesian model in Stan24 to link these metrics to viral relative abundance for a subset of studies (Methods, Table 1).
For this purpose, we selected 16 human-infecting viruses (Table 2), based on availability of relevant public-health estimates as well as either public-health importance (Group 1) or high read counts in our selected datasets (Group 2; see Methods). For viruses that cause chronic infections, we collected estimates of prevalence, while for acutely infecting viruses, we estimated weekly incidence over the studies’ coverage period (Fig. 2, Table 2, Methods). We used these estimates to parameterize our model (Fig. 3) to predict expected viral relative abundance in wastewater at a given prevalence or incidence for each virus and study, abbreviated as RAp and RAi respectively.
Inter-study viral abundance differences persist after incorporating epidemiological data
To allow comparisons between studies and viruses, we focus here on RAi at 1% incidence (1/100 people infected per week) and RAp at 1% prevalence (1/100 people currently infected), denoted RAi(1%) and RAp(1%) respectively. See Figures S2-S4 for results at other incidence/prevalence values.
Even when controlling for viral prevalence and incidence in this way, our results highlight substantial differences in expected relative abundance among viruses and studies (Fig. 4). For example, median estimates of RAi(1%) vary by four orders of magnitude between studies for norovirus GI and three orders of magnitude for SARS-CoV-2 (Fig. 4a & S5, Table S5), while median estimates of RAp(1%) vary by four orders of magnitude for MCV and two orders of magnitude for BKV (Fig. 4b & S5, Table S5).
These differences between viruses across studies were not consistently uniform in direction; for example, Rothman shows higher RAi(1%) than Crits-Cristoph for norovirus GI and GII, but lower RAi(1%) for SARS-CoV-2 (Fig. 4a, Table S5). Exceptions to the general rule include HIV and CMV, both of which showed relatively uniform RAp(1%) across studies that contained reads for these two viruses (Fig. 4b & S5).
Many chronic viruses (e.g. HCV, CMV, and HPV) showed very low abundance in wastewater data, with zero mapped reads in at least two of the four studies. Influenza showed a similar pattern. In these cases it was only possible to estimate an upper bound on RAp/i(1%), representing the information we obtain from knowing that a study did not detect a virus at some sequencing depth. Such upper bounds differ between studies and viruses based on differences in total read count and epidemiological indicators; for example, HCV and HPV both had zero reads in Crits-Christoph, but their median RAp(1%) differed by two orders of magnitude (1.5 × 10-10 vs 3.9 × 10-12) (Fig. 4, Table S5).
In addition to estimating study-level RA(1%) values for different viruses, we also estimated these coefficients for different sampling locations within each study (Fig. S6). While more consistent than between studies, intra-study RA(1%) estimates still showed moderate variation between locations (Fig. S6, Table S6-S7).
Projected costs of metagenomic wastewater surveillance span several orders of magnitude
To learn more about the potential viability of untargeted wastewater-based biosurveillance, we developed a simple model to convert our estimates of RAp/i(1%) into estimates of the weekly number of sequencing reads required to detect a pathogen by the time it reaches a certain cumulative incidence in the population (Methods). We considered a virus to be detected when the cumulative reads associated with that virus crossed a predefined threshold. Given this detection threshold, a target cumulative incidence, and an RAi(1%) value, we can estimate the weekly sequencing depth V required for detection (Methods, Fig. 5):
Averaging across studies, we find that detection at 1% cumulative incidence, with a read threshold of 100 reads, requires an average of 7×107 sequencing reads for norovirus and 1×1010 reads for SARS-CoV-2 per week. However, these average values conceal substantial variation in estimates between studies (Fig. 5) which could alter the required read depth by multiple orders of magnitude in either direction. Changing the detection threshold alters the required read depth approximately proportionally.
The results of this model can be used to calculate initial cost estimates for effective pathogen detection via wastewater metagenomics. For example, given a cost of $5,500 per billion reads (Methods) and using the average read depth estimates above, detection at 1% cumulative incidence with a read threshold of 100 would cost roughly $19,000/year for a norovirus-like pathogen and $2,900,000/year for a SARS-CoV-2–like pathogen. Again, however, individual studies different substantially in their projected costs: for the same parameters, Rothman and Crits-Christoph would require $390,000 and $200,000 in sequencing per year to detect SARS-CoV-2, while Spurbeck’s sensitivity would translate into a prohibitively expensive $320,000,000. For norovirus, the cost span increases to four orders of magnitude, ranging from $200 (Rothman) to $960,000 (Spurbeck) per year.
Discussion
Metagenomic surveillance of wastewater (W-MGS) could enable monitoring of a wide range of known and unknown pathogens, strains, and variants without reliance on clinical presentation. However, any untargeted wastewater-based surveillance effort that hopes to detect novel viruses while they are still rare will need to sequence samples to high depth, incurring substantial sequencing costs. Estimating the scale of these costs for different pathogens and sequencing protocols is an important component of any attempt to estimate the cost-effectiveness of metagenomic biosurveillance.
To answer this question, we combined existing W-MGS datasets with epidemiological estimates of different pathogens to estimate the expected relative abundance of different viral pathogens at a given incidence or prevalence for a range of studies and sampling locations. We find that predicted relative abundance varies widely across both pathogens and studies, suggesting that non-epidemiological sources of inter-study variation (e.g. protocol choice, sampling methodology, and sewershed characteristics) will have a dramatic impact on the performance of metagenomic sequencing as a biosurveillance tool.
Using a simple model linking cumulative incidence to sequencing depth, we converted our relative-abundance predictions into estimates of the cost of pathogen detection using W-MGS-based biosurveillance. As with the relative-abundance estimates themselves, these cost estimates varied by orders of magnitude between studies and pathogens. At the optimistic end of this variation, the projected yearly cost to detect 100 cumulative reads of a SARS-CoV-2 type virus at a single location at 1% cumulative incidence was around $200,000 per year, while at the pessimistic end it exceeded $300 million. Further research into the driving factors behind this variability will be important for determining the near- and medium-term feasibility of W-MGS-based biosurveillance; under higher-cost regimes, successful implementation will require improvements in sample preparation methods, highly sensitive threat detection algorithms, and sustained drops in sequencing cost.
While this study represents an important step forward in quantifying the efficacy and cost of metagenomic wastewater surveillance, it nevertheless has important limitations. Most notably, there is the limited amount of available sequencing data from regions with robust public-health estimates. In total, the studies we selected for in-depth analysis comprised roughly 3B RNA-sequencing reads and 4B DNA-sequencing reads: a significant amount compared to most individual studies, but inadequate to quantify viruses at very low relative abundances. Many viruses returned 0 mapped reads in some or all of the included studies, a problem exacerbated by a lack of DNA studies that deliberately enriched for viruses. To address these issues, future research should obtain or generate significantly deeper sequencing data, generated with protocols optimized for viral W-MGS.
The available public health also had important limitations. Many RNA viruses of interest either had little epidemiological information available (metapneumovirus, rhinovirus) or remained close to zero incidence during the periods covered by our data (respiratory syncytial virus), making it impractical to generate estimates for these viruses. Where robust estimates were available, they were often presented as point estimates without adequate or consistent representation of uncertainty. Lastly, many of our estimates incorporated underreporting factors that were applied broadly across time and space (likely underrepresenting real-world variability in reporting) and varied in their methodology between pathogens. Improvements in available public-health estimates would substantially aid future research in this domain.
We see two avenues for future research building on our results. First, our parameter estimates should be integrated into more advanced cost-effectiveness models comparing the feasibility of different approaches to pathogen detection while incorporating other factors influencing W-MGS sensitivity (e.g. detection noise). Second, future work should improve those parameter estimates themselves, by incorporating larger viral MGS datasets, higher-quality public-health data, and additional sequencing platforms. Generating and incorporating large, virally-enriched DNA-sequencing datasets from wastewater would be especially valuable. As sequencing prices continue to decline over the coming years, future research on W-MGS will play a central role in determining whether and when this promising approach to biosurveillance merits widespread implementation.
Materials and Methods
Metagenomic Data
Study selection
We performed a literature search for studies that conducted untargeted shotgun metagenomic sequencing of municipal wastewater influent and generated a large amount of data (>100M read pairs). We identified and obtained data from 10 such studies (Table 1). For selected studies, we discarded data from samples from sources other than municipal influent (e.g. treated sludge, effluent) and from samples sequenced with methods other than untargeted shotgun MGS (e.g. target enrichment).
After investigating viral relative abundance and composition in the 10 selected studies (Fig. 1), we selected a subset of four studies for inclusion in our hierarchical Bayesian model (Fig. 3). Studies were included in this second subset on the basis of (i) taking place in areas with good public health data available for analysis, and (ii) using sample preparation methods well-suited for broad enrichment of viruses, such as size selection with a 0.22 µm filter16.
Three RNA studies from the US met these criteria: Crits-Cristoph et al. 202110, Rothman et al. 202116, and Spurbeck et al. 202317. We additionally included Brinch et al. 2020 from Denmark, as this represented the largest eligible DNA study in our set. Inclusion criteria for studies can be found in Table S3. All four of these studies conducted composite sampling of municipal influent (the three RNA studies all used 24-hour composite samples, while Brinch used 12-hour composites) and sequenced processed samples with paired-end Illumina technology.
Data Analysis
FASTQ files for each included study were obtained from the Sequencing Read Archive48 and analyzed with a custom computational pipeline (see “Data and Code Availability”) as follows:
1. Raw reads were cleaned with AdapterRemoval249 v2.3.1 (default settings) to detect and remove adapters, trim low-quality bases, and merge overlapping forward and reverse reads.
2. Cleaned reads were mapped using Kraken250 (default settings) to its Standard 16 GB reference database51(2022-12-09 build).
3. Taxonomic IDs corresponding to human-infecting viruses were identified using the Kyoto University Bioinformatics Center’s Virus-Host Database52 and used to subset the Kraken2 results to human-infecting virus (HV) reads.
4. Kraken2 results were validated by mapping HV reads to their corresponding RefSeq genome53 using BowTie254 (bowtie2 --local --very-sensitive-local --score-min G,1,0 --mp 2,0). Alignment scores were normalized by dividing by the natural logarithm of the read length, and reads with normalized alignment scores below a threshold value of 22 (Fig. S7) were discarded.
5. Finally, relative abundances were calculated as the number of validated reads assigned to human-infecting viruses for a given dataset, divided by the total number of reads in that dataset.
Epidemiological Data
Virus Selection
To select viruses for which to perform epidemiological estimates, we performed an exploratory literature review choosing viruses based on public-health importance and availability of applicable incidence or prevalence estimates (Table 2 & S4). We chose a set of 16 “Group 1” viruses for analysis based on these initial criteria. We later identified five further “Group 2” viruses (Table 2) that had both good public health data and significant read counts in our chosen studies. While this allows intra-study, inter-sample site comparison analysis (Fig. S6b), the inclusion of these viruses introduces selection bias55: similarly prevalent viruses that shed less into stool or were less amenable to the sequencing approaches used would not have been selected for study. Inter-study comparisons for Group 2 viruses are marked accordingly.
Data Analysis
For viruses that cause chronic infections we collected estimates of prevalence, while for acutely infecting viruses we estimated weekly incidence estimates. SARS-CoV-2 incidence estimates were based on daily confirmed and probable county-level COVID-19 cases (provided by CDC)25, and adjusted upwards by a uniform underreporting factor26. Influenza incidence was estimated using state-level, weekly positive testing data27, which is adjusted by a custom yearly underreporting factor based on the ratio between reported tests27 and CDC estimates of total symptomatic flu infections28.
Norovirus incidence rates were based on the number of monthly, nationwide outbreaks per year30, which were transformed into case counts using a 2006 US-wide incidence estimate29. Both Norovirus30 outbreak data and Influenza27 and COVID-19 testing data25 were provided by CDC. No incidence estimates were generated for the region and period covered by Brinch et al. 202015; this study conducted DNA sequencing, while all incidence viruses had RNA genomes.
For the United States, publicly available prevalence estimates were available for human immunodeficiency virus (HIV)37, hepatitis C virus (HCV)39, and herpes simplex virus 131. When prevalence estimates weren’t available, seroprevalence data—the share of individuals with antibodies against the virus—was used instead. This was the case for cytomegalovirus (CMV)33, and epstein-barr-virus (EBV)35. PCR testing data was available for human papilloma virus (HPV)41. Seroprevalence and PCR data was collected in the National Health and Nutrition Examination Survey (NHANES), a biannual health survey of a US population sample that includes testing for common chronic infections56.
For viruses of lower public-health concern, estimates were not generally available from the sources above. We thus expanded our search to individual seroprevalence studies for adeno-associated virus 2 (AAV-2)43, John-Cunningham virus (JCV)44, BK virus (BKV)44 and merkel cell virus (MCV)46.
For Denmark, publicly-available prevalence estimates were available for HCV39 and HIV38. For EBV36 and HPV42, we used published seroprevalence and qPCR positivity estimates, respectively, as proxies for overall prevalence. When in-country estimates weren’t available, we resorted to the best data source from a country or region with similar demographics, such as Switzerland (BKV)45, the Netherlands (MCV, CMV)34,47, Northern Europe (AAV-2)43, or Germany (HSV-1)32. Table 2 describes the method of estimate generation for each virus in detail.
Statistical Model
Derivation
We are interested in modeling the relative abundance of a given virus (henceforth the “focal virus”) in W-MGS data as a function of its incidence or prevalence (henceforth “public health predictor”) in the population. Relative abundance is the fraction of reads assigned to a focal virus and, as such, is constrained to the interval [0, 1]. We assume that when relative abundance is low, it increases proportionally with the public health predictor. With these properties in mind, a natural model is logistic regression with unit slope: where the second line holds as the argument of the inverse logit approaches zero. We estimate the parameter b, which governs the relationship between relative abundance and the public health predictor, as described below. We can transform an estimate of b into an estimate of RA(x) by substituting x for the value of the public health predictor into the equation above. For example, We estimate b and RA(1%) for each virus in each metagenomic study separately.
In each study, there are S metagenomic samples taken at various times from L sampling locations. For sample s ∈ {1,…, S}, the data consist of the sampling location l(s), the total number of reads ns, the number of focal viral reads ys, and the public health predictor xs. Because sampling locations vary along several dimensions including sample preparation methods, we use a hierarchical model with a separate term bl for each sampling location. The model thus produces a joint estimate of location-specific effects and an overall coefficient for each study and virus.
We model the focal viral counts in each sample as a binomial random variable: where θs is a latent variable, centered around the estimated log public health predictor, that accounts for three factors:
error in the public health predictor
differences between the population the public health predictor is derived from (e.g., the entire United States for all of 2021) and the population contributing to the sample (e.g., Orange County on May 21, 2021)
unbiased noise in the read counts that is not accounted for by the binomial model. For each study and virus, we infer the combined magnitude of these effects, σ, from the data.
Finally, we define a location-specific term bl(s), where l(s ∈ {1,…, L} is the location of the sample. The hierarchical model of intercept terms has the structure: Here, b is the overall intercept for the study and virus and τ is the standard deviation of the location-specific terms around this overall value. The prior on b is centered on so that b = µ roughly represents the naive estimate given by dividing the relative abundance in all of the data by the average log public health predictor estimate across samples. The prior on b is weakly informative: it supposes that the naive approximation is not off by more than a few orders of magnitude, but does not contain any other substantive scientific information. We chose the prior standard deviation of b to be as large as possible while still allowing for efficient sampling.
As with σ, τ is learned from the data, allowing for some studies to have a lot of variation between sampling locations and others to have little.
We fit the model using the Stan probabilistic programming language15. The code for the model is publicly available online (See “Data and Code Availability”).
Pseudocounts
Sometimes the incidence of a virus is zero for a particular sample. This happens when no new cases were reported in the study region during the period overlapping the sample. In order to obtain a finite log-incidence, for the model, we introduced pseudocounts of 0.1 total cases in a given region per week. In order to ensure that our choice of pseudocount did not strongly influence the results, we performed a sensitivity analysis. We found that for most viruses and studies, the effect of pseudocounts is minimal. However, we found that our inference for influenza in Crits-Christoph was dominated by the pseudocounts, so we dropped it from Figure 4.
Pathogens with zero reads
Many viruses of interest returned 0 mapped reads in some or all of the included studies. Based on our model, such an observation still enables us to estimate weak upper-bound estimates of RAp/i(1%): If two studies sequence at different depths, without detecting a virus with equal prevalence or incidence, the study that sequences more deeply can be expected to have lower RAp/i(1%).
Model Checking
To assess the posterior sampling, we examined plots of the marginal and joint posterior distributions of the parameters. Posterior distributions that have irregular “spikes” at particular values or are strongly multimodal indicate that the Hamiltonian Monte Carlo sampler may not be exploring the parameter space efficiently, likely due to model specification issues. We also compared the prior and posterior distributions to ensure that the priors did not have undue influence on the posteriors. For example, we found that our initial choice of prior b ∼ Normal(µ, 2) was too narrow, significantly constraining the posterior distribution even for the virus/study combinations with a lot of data. Conversely, we found that b ∼ Normal(µ, 10) was too broad, leading the sampler to mix poorly. Similarly, we adjusted the priors on σ and τ to avoid the lower bound at zero.
To check the fit of the model, we used the posterior samples of the model to generate simulated datasets. We compared the simulated data to the observed data, looking for features not accurately captured by the model. For example, earlier versions of the model did not include terms for sampling locations within studies. Posterior predictive checks revealed that this model failed to predict the low read counts for SARS-CoV-2 in the Point Loma (PL) location of the Rothman dataset, inspiring us to add location-specific effects.
Cost Model
Derivation
The sequencing effort required to detect a pathogen is a key determinant of the cost of genomic surveillance. To estimate the cost of detecting a virus, we created a simple model predicting the total sequencing reads required for detection as a function of viral incidence and RA(1%). In making this estimate, we did not consider fixed and per-sample costs that do not scale with the number of reads sequenced.
While detection algorithms differ, they all require the presence of reads originating from the virus of interest. Certain methods—such as mapping to a known reference—may detect a pathogen from just a few reads, while less targeted methods may require far more. To represent different computational detection methods, we model a virus as “detected” when the cumulative focal viral reads exceed some chosen threshold.
The expected number of viral reads observed over a given time period can be modeled as a function of viral incidence, which is converted into an estimate of relative abundance using the RAi values estimated by the model described above.
Let i(t) be the weekly incidence of the focal virus during week t at the location of interest. If n total reads are sequenced at that location each week, then we expect to find n × RAi (i(t)) reads from the focal virus. The cumulative read count from that virus is calculated as a sum over all previous weeks of monitoring: When relative abundance is low, the value of RAi estimated by our statistical model scales roughly proportionally with incidence. As a result, we can compute RAi(i(t’)) as a linear function of RAi(1%): Pulling out the constant terms gives The summation term on the right is just the cumulative incidence at week t. Rearranging the terms, the weekly total reads required for the expected cumulative viral reads to equal the detection threshold is given by: Note that we have not made any assumptions about the incidence time course, i(t). In particular, this last equation is independent of the growth rate. Thus, in our model, the required sequencing depth depends only on the sensitivity of the method (through the detection threshold) and the ability to sample reads from the focal virus (through RAi (1%)).
Sequencing Cost
Approximate cost of metagenomic sequencing was estimated using pricing information by MIT’s BioMicroCenter, Harvard University’s Bauer Core Facility, the Dana-Farber Cancer Institute’s Molecular Biology Core Facilities (MBCF). The MIT BioMicroCenter charges $23,760 per flow cell for NovaSeq S4 300nt57, Harvard’s Bauer Core facility charges $19,40958, and Dana Farber’s MBCF charges $36,00059. Averaging this cost (geometric mean) gives a cost of roughly $25,500. A NovaSeq 6000 with an S4 flow cell is advertised as giving 8B-10B read pairs. Doubling this cost when accounting for library preparation and personnel costs equates to roughly $5,500 per billion read pairs.
Data and Code Availability
Analysis and visualization was performed using Python60 with the Pystan61, Numpy62 and Pandas63,64 package, and Matplotlib65 and Seaborn66 visualization libraries. All sequencing data in this study is available through the European Nucleotide Archive67, using BioProject accessions listed in Table S2. The raw data used for estimating pathogen prevalence and incidence are listed in the respective Python script for each pathogen. All code used in this study can be found in two Github repositories, https://github.com/naobservatory/mgs-pipeline/tree/p2ra-manuscript (metagenomic data analysis pipeline) and https://github.com/naobservatory/p2ra-manuscript (epidemiological analysis, statistical models, figure generation).
Data Availability
All data produced are available online at https://github.com/naobservatory/p2ra-manuscript and https://github.com/naobservatory/mgs-pipeline/tree/p2ra-manuscript
Author contributions
J.T.K. and M.R.M. conceived the study; S.L.G. and J.T.K. performed literature reviews on pathogen incidence and prevalence and collected and evaluated epidemiological datasets with input from D.P.R.; J.T.K. and W.J.B. performed literature reviews on metagenomics data with feedback from M.R.M.; J.T.K. imported and processed sequencing data; D.P.R. created the study’s statistical models with feedback from M.R.M., W.J.B., and J.T.K.; S.L.G., J.T.K., and D.P.R. created figures with feedback from W.J.B.; S.L.G., W.J.B., J.T.K. & D.P.R. wrote the manuscript, with feedback from all authors.
Funding
S.L.G., J.T.K., D.P.R., W.J.B., and M.M.M. were funded for this research project by gifts from Open Philanthropy (to SecureBio) and the Musk Foundation (to MIT). S.L.G. was additionally supported through a grant by the Swiss Scholarship Foundation. C.W. was supported by Sir Henry Wellcome Postdoctoral Fellowship, reference 224190/Z/21/Z.
Supplementary Information
Supplementary Figures
Supplementary Tables
Acknowledgements
We thank Asher Parker-Sartori for help obtaining public-health data and generating prevalence and incidence estimates. We also gratefully acknowledge Lenni Justen for providing helpful feedback on earlier drafts of this preprint.
Footnotes
Corrected the middle initial of one author.