Reviewer #3:
The relative contributions of asymptomatic infections and superspreading events to the ongoing SARS-CoV-2 pandemic are critical, controversial questions. As far as I know, this may be the first paper to combine phylogenetic inference from genomic data with time series case data to estimate these parameters for the ongoing SARS-CoV-2 pandemic. However, with so many papers coming out so quickly, it is possible I missed an earlier one.
Here, the authors combine viral phylogenetics with time series case data to estimate parameters (including temporally structured estimates of the reproductive number) of the SARS-CoV-2 pandemic in 12 locations globally. They find that the proportion of undetected infections varies substantially by location, from 13% to 92%, and that the precision of their estimates improves substantially with the number of viral genomes included from each location; this is visualized in Figure 2.
However, in its current form the manuscript suffers from some shortcomings.
SARS-CoV-2 evolves slowly relative to other viruses, which can lead to high levels of phylogenetic uncertainty in the recovered trees, and this uncertainty can strongly influence parameter estimates. According to the methods and the supplemental material, the authors inferred a single phylogenetic tree for each location. I encourage the authors to infer a distribution of trees for each location and to condition their analyses across this additional uncertainty. If this has already been done, the manuscript needs to make this clear.
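To illustrate what conditioning across a distribution of trees could look like in practice, here is a minimal sketch. It assumes the downstream analysis has been run separately on each of several trees (for example, bootstrap replicates) and that each run yields posterior samples of a parameter of interest. The function names, the synthetic samples, and the use of an equal-tailed interval in place of an HPD interval are all my own assumptions for illustration, not the authors' pipeline.

```python
import random

def pool_across_trees(samples_per_tree, draws_per_tree=1000, seed=0):
    """Pool posterior samples from per-tree analyses, one equal share per tree."""
    rng = random.Random(seed)
    pooled = []
    for samples in samples_per_tree:
        # Resample with replacement so every tree contributes the same number
        # of draws, regardless of how long its own chain was.
        pooled.extend(rng.choices(samples, k=draws_per_tree))
    return pooled

def equal_tailed_interval(samples, level=0.95):
    """Return the central interval covering `level` of the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    lower = ordered[int((1 - level) / 2 * (n - 1))]
    upper = ordered[int((1 + level) / 2 * (n - 1))]
    return lower, upper

# Synthetic stand-in for posterior samples of one parameter (hypothetical
# values), obtained from three different trees for the same location.
rng = random.Random(42)
samples_per_tree = [
    [rng.gauss(mean, 0.05) for _ in range(2000)] for mean in (0.4, 0.5, 0.6)
]

single_widths = []
for samples in samples_per_tree:
    lower, upper = equal_tailed_interval(samples)
    single_widths.append(upper - lower)

pooled = pool_across_trees(samples_per_tree)
pooled_lower, pooled_upper = equal_tailed_interval(pooled)
pooled_width = pooled_upper - pooled_lower
```

When the trees disagree, the pooled interval is wider than the interval from any single tree, which is the additional uncertainty a single-tree analysis leaves out.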
Abstract:
This section requires a thorough edit to improve clarity. In its current form it is rather disjointed and needs to better link aims to results to conclusions.
Introduction:
The first two paragraphs of the introduction should be switched. The introduction should start with the big questions - in this case, why it is important in the big picture of epidemiology to estimate parameters like the total number of infections - and then introduce the study system used to address them, here SARS-CoV-2.
The third paragraph addresses other ways to directly estimate the number of infected through serological surveys. Missing from this paragraph is an acknowledgement of the assumption that markers of immunity last long enough for such surveys to detect previously infected individuals.
The final paragraph of the introduction outlines the aims but is rather lacking in scientific detail. Namely, what are the hypotheses? What are the alternatives? What are the predictions and tests of the hypotheses in play? What specific hypotheses are the authors testing by applying their method? This requires clarification.
Methods:
Generally, the methods lack sufficient detail to replicate what the authors have done.
In the Viral genomes section of the methods, it is stated that several locations were excluded due to "multiple circulating lineages"; however, nearly all of the locations included (e.g. Guangdong, Hubei, Shanghai, UK) also have multiple circulating lineages. What was done here needs to be clarified considerably.
Phylogenetic inference as performed in IQ-TREE is fine; however, as previously mentioned, the authors need, at a minimum, to infer a distribution of trees for each region and condition their subsequent analyses across it.
In the section on sub-sampling the sequences to the dominant lineages, how was lineage assignment done? Using Pangolin? Or another classification system? More detail is needed.
A bit more detail on how the authors determined convergence was achieved would be valuable. For example, how was visual confirmation of convergence done? Via visual inspection of parameter traces? A generalist reader may need more detail than has been provided.
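A numeric diagnostic reported alongside the trace plots would make the convergence claim easier to assess. The manuscript does not say which diagnostic, if any, was used, so the following is only a minimal sketch of one common option, the classic (non-split) Gelman-Rubin statistic computed across independent chains. The function name and the synthetic chains are hypothetical.

```python
import math
import random
import statistics

def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    if len(chains) < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length")
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: between-chain variance, scaled by the chain length.
    between = n * statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled_variance = (n - 1) / n * within + between / n
    return math.sqrt(pooled_variance / within)

# Synthetic traces (hypothetical values): four chains sampling the same
# distribution, then the same set with one chain shifted to mimic a chain
# stuck in a different region of parameter space.
rng = random.Random(1)
mixed = [[rng.gauss(2.0, 0.3) for _ in range(500)] for _ in range(4)]
stuck = mixed[:3] + [[x + 1.0 for x in mixed[3]]]

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

Values close to 1 are consistent with convergence, while values well above 1 indicate that the chains disagree; the exact cut-off is a rule of thumb and should be stated in the methods together with the number and length of the chains.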
Results:
More detail is needed in the figure legend for Figure 1. For example, unless I misunderstand, it is stated that the red lines are HPD intervals on those days, but the figure actually shows a shaded area with a measure of central tendency as a red line.
Discussion:
Overall, the discussion puts the results in appropriate context. However, the caveats associated with these analyses do not seem to have been adequately acknowledged. More thought should be given to acknowledging factors that may affect the authors' estimates and their interpretation of the findings.
On balance, I do think that the approach used in this manuscript makes a potentially useful contribution to addressing the current pandemic, and to my knowledge this approach has not yet been applied to SARS-CoV-2. I would like to see additional analyses (incorporation of phylogenetic uncertainty) and a thorough edit and revision for clarity.