Limited genomic reconstruction of SARS-CoV-2 transmission history within local epidemiological clusters ======================================================================================================= * Pilar Gallego-García * Nair Varela * Nuria Estévez-Gómez * Loretta De Chiara * Iria Fernández-Silva * Diana Valverde * Nicolae Sapoval * Todd Treangen * Benito Regueiro * Jorge Julio Cabrera-Alvargonzález * Víctor del Campo * Sonia Pérez * David Posada ## Abstract A detailed understanding of how and when SARS-CoV-2 transmission occurs is crucial for designing effective prevention measures. Other than contact tracing, genome sequencing provides information to help infer who infected whom. However, the effectiveness of the genomic approach in this context depends on both (high enough) mutation and (low enough) transmission rates. Today, the level of resolution that we can obtain when describing SARS-CoV-2 outbreaks using just genomic information alone remains unclear. In order to answer this question, we sequenced 49 SARS-CoV-2 patient samples from ten local clusters for which partial epidemiological information was available, and inferred transmission history using genomic variants. Importantly, we obtained high-quality genomic data, sequencing each sample twice and using unique barcodes to exclude cross-sample contamination. Phylogenetic and cluster analyses showed that consensus genomes were generally sufficient to discriminate among independent transmission clusters. However, levels of intrahost variation were low, which prevented in most cases the unambiguous identification of direct transmission events. After filtering out recurrent variants across clusters, the genomic data were generally compatible with the epidemiological information but did not support specific transmission events over possible alternatives. We estimated the effective transmission bottleneck size to be 1-2 viral particles for sample pairs whose donor-recipient relationship was likely. Our analyses suggest that intrahost genomic variation in SARS-CoV-2 might be generally limited and that homoplasy and recurrent errors complicate identifying shared intrahost variants. Reliable reconstruction of direct SARS-CoV-2 transmission based solely on genomic data seems hindered by a slow mutation rate, potential convergent events, and technical artifacts. Detailed contact tracing seems essential in most cases to study SARS-CoV-2 transmission at high resolution. ## Introduction In recent years, genomic epidemiology has revealed itself as a powerful tool for tracking viral outbreaks (Grubaugh, Ladner, et al. 2019). Particularly for diseases with a high proportion of asymptomatic infections like COVID-19, the use of genomic information might be especially relevant to understand their dissemination. Several methods have been developed to reconstruct infectious disease outbreaks using genomic information (e.g., Didelot et al. 2014; Jombart et al. 2014; Worby, Chang, et al. 2014; Hall et al. 2015; Hall et al. 2016; Lumby et al. 2018; Didelot et al. 2021). However, these strategies rely on pathogen genomes mutating rapidly between infected individuals (Campbell et al. 2018). Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), responsible for the COVID-19 pandemic, has spread globally in a very short time. SARS-CoV-2 has a mutation rate in the order of 1 × 10−3 mutations per site per year (van Dorp, Richard, et al. 2020; Koyama et al. 2020). For MERS-CoV-2, in principle with a similar mutation rate, the prediction is that in most cases, the consensus sequences sampled from a transmission pair (donor and receptor) will be identical, precluding a complete reconstruction of the outbreak (Campbell et al. 2018). As a counterpart, for SARS-CoV-1, with a mutation rate four times higher, we expect to see several mutations between transmission pairs, which considerably augments the power to resolve transmission history (Campbell et al. 2018). These considerations are based on consensus sequences that represent the dominant viral lineage within a host. However, pathogens with high rates of evolution, such as RNA viruses, accumulate new mutations more or less rapidly as they replicate within the individuals they infect, generating intrahost genomic variation. The generation of this genomic diversity enables viral populations to evade host immune responses (Hensley et al. 2009; Henn et al. 2012; Parameswaran et al. 2017), alter disease severity (Vignuzzi et al. 2006), and adapt to changing environments (Stapleford et al. 2014; Stern et al. 2017). Notably, the study of the shared intrahost genomic variation among individuals can be critical for identifying contagion events and transmission clusters (Didelot et al. 2014; Worby et al. 2014; Park et al. 2015; Worby et al. 2017). Moreover, it also allows for estimating the size of the founding pathogen population transmitted from the donor to the recipient host (i.e., the transmission bottleneck size) (Frise et al. 2016; Sobel Leonard et al. 2017; Sobel Leonard et al. 2019). Several studies have already shown that intrahost genomic variation can be detected in most SARS-CoV-2 infections, generally at low levels, but with some variation among individuals (Kuipers et al. 2020; Seemann et al. 2020; Shen et al. 2020; Tonkin-Hill et al. 2020; Wölfel et al. 2020; Butler et al. 2021; Lythgoe et al. 2021; Valesano et al. 2021; Y. Wang et al. 2021). Most SARS-CoV-2 intrahost mutations appear at low frequencies, often less than 5%, are primarily under purifying selection, and display particular biochemical signatures (Tonkin-Hill et al. 2020; Graudenzi et al. 2021; Sapoval et al. 2021; Y. Wang et al. 2021). A key question is whether SARS-CoV-2 intrahost variation can be transmitted during contagion. The answer is not straightforward, as shared intrahost variants among unrelated individuals can also result from convergent evolution or mutational hotspots (Tonkin-Hill et al. 2020; Valesano et al. 2021). So far, a few studies have used shared genomic variants between putative donor-receptor pairs to infer (narrow) transmission bottlenecks, of 1-10 virions, in SARS-CoV-2 (Li et al. 2021; Lythgoe et al. 2021; San et al. 2021; D. Wang et al. 2021). Limited genomic diversity can prevent the reconstruction of disease outbreaks (Campbell et al. 2018). While distinct SARS-CoV-2 transmission clusters might be identified using consensus sequences (Letizia et al. 2020; Popa et al. 2020; Seemann et al. 2020), its moderate mutation rate and rapid transmission might prevent the detailed reconstruction of the transmission events within these clusters (Tonkin-Hill et al. 2020). Leveraging intrahost variation, San et al. (2021) studied two nosocomial SARS-CoV-2 outbreaks, showing that potential donor-recipient pairs are supported in some cases but not in others by shared intrahost variants. All in all, it is not clear whether the observed levels of inter and intrahost variation in SARS-CoV-2 and the apparently small size of the transmission bottleneck could limit our capability to reconstruct local SARS-CoV-2 outbreaks in detail using only genomic information. Intrahost mutations, typically at very low frequencies, are sensitive to methodological artifacts like sequencing errors (De Maio et al. 2020a; Turakhia et al. 2020; Kubik et al. 2021) and cross-sample contamination, and the occurrence of mutational hotspots can confound the identification of transmission events. Here, we wanted to assess our ability to reconstruct putative transmission chains and to infer reliable transmission bottleneck sizes in SARS-CoV-2. For this, we obtained high-quality genomic data from ten independent epidemiological clusters, with two replicates per sample and with unique oligonucleotide spike-ins to detect potential contamination, leveraging both interhost and intrahost variants and *ad hoc* phylogenetic techniques. Our results confirm the low levels of intrahost variability and the small transmission bottleneck of SARS-CoV-2, suggesting that genomic data alone might not be sufficient to fully resolve direct SARS-CoV-2 transmissions, revealing the need for additional sources of information like detailed contact tracing. ## Material and methods ### Sample collection According to the epidemiological records, we identified 49 patients infected with SARS-CoV-2 conforming ten independent transmission clusters originated in nursing homes, family households, and birthday parties from the same city (**Figure 1**; **Table S1**). After that, we recovered the corresponding diagnostic nasopharyngeal exudates collected. This study was conducted under the approval of the Galician Drug Research Ethics Committee (CEIm-G code 2020-301). ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F1) Figure 1. Transmission clusters and epidemiological information. Black arrows indicate “known” transmission events identified in the epidemiological records. Question marks highlight potential alternatives. Samples from patients in faded color failed at sequencing. ### Epidemiological information Cluster A and B belong to two different nursing homes, and in both cases, the primary case could not be established with confidence (**Figure 1**). Cluster C is a family in which there was a probable transmission from C2 to C4. Cluster D is a large family spanning four different households. D1 came from another Spanish city and likely started the D transmission at a birthday party. Cluster E is a family in which brothers E1 and E2 were infected abroad before infecting their parent, E3. Cluster F is another family that was likely infected by an unsampled case from another city. Cluster G originates in two individuals (G1 and G4) that attended the same event and afterward infected their respective families, G1 to G2 and G3, and G4 to G5 (G5 failed at sequencing). Cluster H is another family in which H3 likely infected H1 and H2. Cluster I starts with two children (I6 and I3; I6 failed at sequencing) that got infected at the same birthday party before infecting their families, I6 to I1 and I2, and I3 to I4 and I5. Cluster J is a family in which J1 infected partner J2 and child J3. After that, either J1 or J3 infected J4 and J5. ### RNA extraction Following the manufacturer’s recommendations, we extracted the viral RNA from the nasopharyngeal exudates using the MagNA Pure 24 Total NA Isolation kit (Roche Diagnostics, Basel, Switzerland). Different team members processed each RNA sample independently to obtain two technical replicates for each patient sample, from retrotranscription to library construction. ### Viral load measurement We measured SARS-CoV-2 genome copy concentration for each sample by real-time RT-PCR of the E gene with the Sarbecovirus E-gene ModularDx (TIB Molbiol, Berlin, Germany) kit in a LightCycler® z480 System (Roche Molecular Systems Inc, Meylan, France). Viral load was estimated using linear regression (R2 >0.99) from the standard curve generated with the Ct values obtained for serial dilutions (log) of RNA standards with known viral RNA genome equivalents / µL (Vogels et al. 2020). ### cDNA synthesis and multiplex amplification We followed the ARTIC sequencing protocol (v.3) (Quick et al. 2017), a multiplex PCR-based target enrichment that produces 400 bp amplicons that span the SARS-CoV-2 genome, with slight modifications. First, we retrotranscribed the RNA samples to cDNA using the SuperScript IV reverse transcriptase (Invitrogen, MA, USA), starting with 10 µL of RNA. Then we ran 30 PCR cycles for all the samples, independently of the Ct value, using the ARTIC primer Pool1 and Pool2 (IDT, CA, USA) and the Q5 Hot Start DNA polymerase (New England Biolabs, MA, USA). Next, we mixed the corresponding PCR products from each sample before cleaning (1.2:1 ratio beads to sample). We eluted the clean PCR products with 35 µL NFW, recovered 33 µL and performed quantification with the Qubit 3.0 using the dsDNA HS or BR kit (Thermo Fisher Scientific, MA, USA) and checked amplicon size with the 2200 TapeStation D1000 kit (Agilent Technologies, CA, USA). ### Addition of individual barcodes We added 1 µL of an X-mer single-stranded oligonucleotide with a unique barcode sequence at 38 fM to each retrotranscription reaction to detect potential sample cross-contamination. To prepare these barcode spike-ins, we used as a template the alcohol dehydrogenase 1 (*adh1*) mRNA (XM_008650471.2) from *Zea mays*, as described in the PrimalSeq v.4.0 protocol (Matteson et al. 2020). After a cleanup step (2:1 ratio beads to sample), we recovered a final volume of 22 µL and performed QC (Qubit 3.0 and 2200 TapeStation). We added F and R primers with the same barcode sequence at the same concentration as the ARTIC primer pools to amplify the barcodes in the multiplex PCR reactions. ### Library construction and genome sequencing We built 98 whole-genome sequencing libraries employing the DNA Prep (M) Tagmentation kit (Illumina, CA, USA) using ¼ of the recommended volume, with approximately 125 ng of input DNA. Finally, we checked the size of the libraries and quantified them as described above. We sequenced the 98 libraries in two high-output (7.5 Gb) runs (60 and 38 samples, respectively) on an Illumina MiniSeq (PE150 reads) at the sequencing facility of the University of Vigo. ### Detection of potential cross-sample contamination To assess the level of cross-sample contamination, we quantified the specific maize barcode content in each fastq file. For this, we aligned the raw reads against the *Zea mays adh1* sequence using BWA-mem (Li 2013) with default settings and counted the number of reverse and forward barcodes with cutadapt (v.2.10) (Martin 2011), with a minimum overlap of 15 and a maximum error rate of 0.1. ### Variant calling and consensus sequences We assessed the quality of the fastq files using FastQC (Andrews 2010). Then we aligned the reads to the reference MN908947.3 from Wuhan using BWA-mem (Li 2013) and trimmed them with iVar (Grubaugh, Gangavarapu, et al. 2019). We evaluated the quality of the aligned trimmed reads using Picard v2.21.8 ([http://broadinstitute.github.io/picard](http://broadinstitute.github.io/picard)). We used SAMtools depth v1.10 [Citation error] to calculate the sequencing coverage along the genome for each replicate. We only kept samples for which ten or more reads covered more than 75% of the viral genome in the two replicates and with less than 2.5% missing bases on the consensus sequence. We used iVar (Grubaugh, Gangavarapu, et al. 2019) to identify single nucleotide changes and indels, with a minimum base quality threshold of 20 and a minimum read depth of 10. The calls obtained were confirmed with LoFreq (Wilm et al. 2012). We only retained variants that appeared in both replicates with a minimum overall variant allele frequency (VAF) of 2%. Based on their frequency, we divided the genomic changes detected into *fixed* (VAF ≥ 0.98; to account for potential sequencing errors) and *intrahost* variants (0.02 ≤ VAF < 0.98). We masked and removed from further analyses positions containing complex variants (i.e., nucleotide changes plus indels) or those deemed as homoplasic (De Maio et al. 2020b), including the sites immediately before and after. To build a consensus sequence for each sample, we merged the reads from the two replicates with SAMtools *mpileup* and fed them to iVar *consensus* with a minimum VAF threshold of 0.5. We assigned the consensus sequences to a SARS-CoV-2 clade with Nextclade ([https://clades.nextstrain.org](https://clades.nextstrain.org)) and to a SARS-CoV-2 PANGO lineage (Rambaut et al. 2020) with Pangolin (O’Toole et al. 2021). ### Delimitation of epidemiological clusters The simplest method for delimiting epidemiological clusters using genomic data alone is estimating a phylogenetic tree using the consensus sequences. For this, we aligned the consensus sequences with the reference using MAFFT v.7 (Katoh and Standley 2013) (*mafft --maxiterate 500 *) and ran IQ-TREE (v.2.0.6) (Nguyen et al. 2015) (*iqtree2 -T AUTO -s -m TEST -b 1000 -o MN908947*.*3)* with the best-fit nucleotide substitution model and 1,000 bootstrap replicates. We also built a timetree based on the output tree of IQ-TREE and the dates of the samples using TreeTime (v.0.8.1) (Sagulenko et al. 2018) (*treetime --aln --tree --dates *). In addition, we tried six heuristics developed explicitly for the reconstruction of epidemiological clusters described in Worby et al. (2017). The weighted distance tree and the minimum distance tree use the genetic distances among consensus sequences. On the other hand, the weighted and maximum variant tree strategies rely exclusively on shared intrahost variants. Finally, the hybrid weighted tree and maximum tree procedures use intrahost variants and consensus genetic distances. Furthermore, we also estimated transmission clusters using the Transcluster algorithm (Stimson et al. 2019), assuming a mutation rate of 1 × 10−3 mutations/site/year. We explored four values for the transmission rate (10, 25, 50, 100 transmissions per year) and six for the transmission cutoff (1–6 transmission events). ### Inference of transmission history Within each cluster, we tried several approaches to estimate which individuals transmitted the virus and in which direction, that is, to learn who infected whom. First, we explored the Worby et al. heuristics, which assume that the donor/s for each sample has the most similar sequence or more shared intrahost variants. In addition, we implemented a simple approach that leverages the intrahost variation along a minimum spanning tree (MST). First, we computed Euclidean pairwise distances among all individuals within a cluster, with the *rdist* R package ([https://github.com/blasern/rdist](https://github.com/blasern/rdist)), using the VAF distributions. Afterward, we built the MSTs based on those distances with the function *mst* from the *ape* R package (Paradis and Schliep 2019). Then, assuming a single source for each cluster, we inferred the transmission direction that minimized the generation of novel variants in the receptor, meaning that in a pair of individuals, the donor should be the one with a higher number of private mutations. Finally, we also explored TransPhylo (Didelot et al. 2017), using the dated phylogeny obtained with TreeTime. We ran the algorithm for 150,000 MCMC iterations and assumed a Gamma distribution for the generation time with shape 1 and scale 0.01917 (Perera et al. 2021). ### Estimation of the transmission bottleneck size To estimate the transmission bottleneck size of SARS-CoV-2 (i.e., the size of the viral population transferred from the donor to the recipient host), we used the beta-binomial method of Sobel Leonard et al. (2017). This method assumes that the intrahost variants detected did not arise *de novo* in different patients. This calculation includes only intrahost donor variants shared with the recipient (note that they can be fixed in the recipient but not in the donor). We identified putative donor-recipient pairs according to the available epidemiological information (**Figure 1**). We lacked epidemiological information for clusters D and F, and we identified possible transmission pairs according to the genomic data (see Results). For the estimation of the transmission bottleneck size, we used the R code at [https://github.com/weissmanlab/BB_bottleneck](https://github.com/weissmanlab/BB_bottleneck) (version of March 24, 2020), under the *approximate* model (given that the sequencing depth per sample was very high, around 6,000X) and setting the maximum bottleneck size to an arbitrarily large value of 600, and the VAF cutoff to 0.02. ### Assessment of selective pressures The ratio of non-synonymous changes per non-synonymous site (*dN*) to the number of synonymous substitutions per synonymous site (*dS*) is one of the most popular statistics for detecting selective pressures at the molecular level. We estimated the *dN/dS* ratio for each sample using the dNdScv package (Martincorena et al. 2017), recently adapted for its application to SARS-CoV-2 (Tonkin-Hill et al. 2020). We used the default substitution model with 192 rate parameters. ## Results ### Viral load and sequencing Twenty-seven out of the 49 samples had a viral load above 103 copies / µL (**Table S1**). Sequencing coverage and breadth were high (mean depth ± sd: 6316.71 ± 2336.99; breadth: 0.999) (**Table S2**), except for three samples (D11, G5, and I6, all with a Ct > 32 for gene E), that we excluded from further analyses. We did not detect appreciable cross-contamination between samples (**Table S2**). ### Inter and intrahost variation Most variants were fixed (**Figure 2, Figure S1**). The number of differences among consensus sequences was, on average, 2.12, 2.28, and 7.57 variants, within early clusters (A-C), late clusters (D-J), and among early and late clusters (see also **Table S3**). We observed on average 19.76 variants per sample, of which 8.17 were intrahost (**Table S4**). Both fixed and intrahost variants were shared among samples at different VAFs. Several intrahost variants appeared recurrently in multiple samples, often corresponding to indels at low frequency (**Table S4**). These recurrent variants may correspond to potential sequencing errors and mutational hotspots, which might confound our analyses. Therefore, we decided to filter out intrahost variants present in more than one cluster. After filtering, there were 2.13 intrahost variants per sample on average, with a maximum of 11 (**Table S4**). Before and after filtering, the number of intrahost variants detected per sample was unrelated to sequencing depth, Ct values, or viral load (**Figure S2**). Furthermore, VAFs between sample replicates were significantly correlated (Pearson correlation coefficient = 0.99, p-value = 5.6 e-113) (**Figure S3**). All samples were assigned to two related clades/lineages (20A/B.1 and 20E(EU1)/B.1.177), which was not particularly surprising as these were the dominant lineages in the area at the time of sampling. ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F2) Figure 2. Variant allele frequencies (VAF) per sample. VAFs were calculated after filtering recurrent variants. Fixed mutations (VAF ≥ 0.98) are in gray, fixed reference alleles (VAF < 0.02) are in white, and positions with missing data (depth below 20) are in light green. ### Delimitation of transmission clusters The maximum likelihood (ML) trees obtained with the consensus sequences showed the epidemiological clusters as distinct groups, mostly monophyletic and with relatively high bootstrap support (**Figure 3**). However, standard phylogenetic approaches do not explicitly inform about the number of clusters or the assignment of the different individuals to clusters. In the absence of additional epidemiological information (i.e., colors in the tree), it is up to the researcher to decide where to “cut”. Remarkably, adding the temporal information with TreeTime improves the phylogenetic resolution of the clusters, which become all monophyletic (**Figure 3B**). ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F3.medium.gif) [Figure 3.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F3) Figure 3. Consensus-sequence phylogenetic trees. (A) Maximum likelihood phylogenetic tree inferred with IQ-TREE. Numbers above branches are bootstraps values (%). Only bootstrap values above 50 are shown. (B) Time-scaled ML tree inferred with TreeTime (using the dates of extraction). The weighted distance tree and the minimum distance tree, which also use consensus sequences but can explicitly delimit clusters, were identical and highly congruent with the epidemiological information (**Figure 4A**). In this case, the only “error” was that cluster D was divided into two, although we might expect it because D1, D3, and D6 share two consensus mutations that the rest of the D individuals do not present. Indeed, cluster D is large and phylogenetically diverse, and we might not have sampled all the infected individuals in this transmission chain. The weighted variant and maximum variant trees, based exclusively on intra-host variants, were also identical and produced a very complex network in which all individuals seemed related to each other (**Figure 4B**). After removing the recurrent intra-host variants common to multiple individuals and clusters (taking advantage of the epidemiological information), these methods identified three clusters primarily compatible with the epidemiological information, plus 33 unconnected individuals (**Figure 4C**). Cluster A was perfectly delimited, while cluster I formed a group with a sample from cluster H. The only other three clusterized samples were from cluster D (again D1, D3, and D6). The hybrid transmission methods, which use the connections established by the weighted variant and maximum variant trees and incorporate consensus information for those samples without a donor or recipient, did not result in any noticeable improvement compared to methods based on consensus sequences (data not shown). Finally, the transmission-based clustering method in Transcluster was able to identify some of the epidemiological clusters but not all (**Figure 4D**). In this case, congruence with the epidemiological data was maximal after setting a transmission rate of 25 and a transmission cutoff of 1. ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F4) Figure 4. Clustering approaches. (A) Weighted distance/minimum distance tree. (B) Weighted variant/maximum variant tree after standard masking. (C) Weighted variant/maximum variant tree after removing recurrent low-frequency variants (D) Transcluster transmission network (transmission rate = 25, transmission cutoff = 1). ### Inference of transmission history #### Transmission in nursing homes For clusters A and B, we had no epidemiological information other than the corresponding nursing home. In cluster A (**Figure 5**), the four samples share what seems to be an intrahost variant (27695-TCTTA). However, given its high VAF (0.93-0.95) and the fact that the individuals do not share other variants, this deletion may be a fixed variant, where sequencing or calling errors prevented its identification in all reads. In any case, the genetic data does not help identify the different transmission events in this cluster with confidence. In cluster B, no shared variation was apparent. B2 has two private fixed variants, suggesting it was infected later than the other cluster members or from a different source. Again, it was not possible to infer the transmission network for this cluster. ![Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F5.medium.gif) [Figure 5.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F5) Figure 5. Variant sharing within clusters. Gray boxes indicate index cases or originating events. Gray dashed lines delimit households (nursery homes for clusters A and B). Shared variants are highlighted with the same color. Fixed variants (VAF ≥ 0.98) common to all members of a cluster are not shown. Crossed samples could not be sequenced. #### Transmission in clusters with partial contract trace information We had partial contact trace information for clusters C, E, G, H, I, and J. However, the lack of shared intrahost variants prevented a detailed reconstruction of their transmission history in most cases (**Figure 5**). The epidemiological record suggests a transmission from C2, the index case, to C4 in cluster C. This event is compatible with the genetic data, as C2 has a single intrahost variant at low frequency (0.05), which could have been lost during transmission to C4, which has no intrahost variants. Private variants with low VAFs in C1 and C3 could have arisen de novo within each individual after transmission, but three of them with higher VAFs (0.27, 0.34, and 0.67) are more difficult to explain in the same way. In cluster E, the genetic information cannot resolve whether E1 or E2 infected E3. In cluster G, we did not observe shared intrahost variants. In contrast, the distribution of the private variants is compatible with the epidemiological information, and it does not help resolve further the transmission history. In cluster H, the three samples share five fixed (or almost fixed) variants. The index case (H3) does not seem to have intrahost variants, contrary to H1 and H2. However, the quality of the sequencing data in the case of H3 is well below average, so it is possible that low-frequency variants were overlooked in this sample. All five members of cluster I share three variants with high VAF –or fixed in several cases. Variants 445C and 25062 could be genuinely fixed in all samples, including cases where the apparent VAF is 0.95-0.97. The distribution of variant 29366T is remarkable, as it appears in all cases with a VAF of 0.68-0.86. Another salient observation is that I3, one of the index cases, has a well-supported variant (9430T) with a VAF of 0.96 that does not appear in the other samples from this cluster. Cluster J lacked shared intrahost variants, so the genetic data neither confirmed nor invalidated the contact tracing information. #### Inferring transmission in the absence of contact trace information ##### Ad hoc approaches We did not have detailed information about contacts in clusters D and F, so we tried to identify transmission events considering just the genomic data (**Figure 6**). In cluster D, the transmission started at a birthday party where the index case was D1. D1 shares two variants with D3 and D6 (4543T and 18431T), both fixed in D1 and D6 and close to fixation in D3 (0.96 and 0.94, respectively). Therefore, we hypothesize that D1 → D3 and D1 → D6, but alternatively D3 → D6, could be transmission pairs. These two variants also appear in D2 but at a very low VAF (0.07 and 0.06, respectively). D3 and D2 further share 4142 A, but this variant has a low VAF in D3 (0.07) and a high VAF (0.89) in D2. Furthermore, D2 has 15857T at high VAF (0.88). Given that we assumed that D1 infected D3, we considered that D3 could have infected D2. However, the explanation for the observed VAF patterns might imply recombination and *de novo* mutation. Finally, D10 shares with D2 variants 4142A and 15857T at high frequency, so we also considered D2 → D10 another likely transmission pair. In cluster F, where we do not have an index case, F1 and F3 share a fixed variant (3737 T). Given that F1 has seven private variants at low frequency, but F3 only two, F1 might be the pair’s donor because it seems easier to lose these variants during the F1 → F3 bottleneck than to arise d*e novo* in F1 after an F3 → F1 transmission. F2 also has 3737 T fixed. Following the same logic, F2 could have been infected by F1, but also by F3. ![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F6.medium.gif) [Figure 6.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F6) Figure 6. Shared variants and inferred transmission events for clusters D and F. Below each sample ID, we show variant site and allele, followed by its VAF in parenthesis. Fixed variants (VAF ≥ 0.98) shared by all members of a cluster are not shown. Dashed arrows indicate putative transmission events. Question marks highlight potential alternatives. ##### Statistical and graphical approaches The Worby et al. approaches were not as helpful in inferring transmission as for delimiting clusters. Assigning the source of each sample to the patient with the highest number of shared intrahost variants or the minimum genetic distance (using weights or absolute values) resulted in samples with multiple potential sources and pairs with bidirectional transmission (**Figure 4A, B**). Relying only on intrahost shared variants proved inefficient, as most samples were not connected to any other (**Figure 4C**). Consensus sequences from the same cluster were very similar, so multiple samples were often equidistant, preventing choosing one of them as the source. The MST analysis (**Figure S4)** was incompatible with the epidemiological information. Apart from tied transmission paths for some of the clusters (clusters D and E, with three and two options, respectively), the starting point of the transmission did not coincide with the epidemiological information in any of the cases. TransPhylo could differentiate the different clusters (**Figure 7, Figure S5**); however, the inferred direct transmission events within clusters were often not compatible with the epidemiological information. ![Figure 7.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F7.medium.gif) [Figure 7.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F7) Figure 7. TransPhylo transmission graph. Gephi (Bastian et al. 2009) depiction of TransPhylo’s consensus transmission tree. Gray dots represent inferred unsampled individuals. ### Transmission bottleneck size We selected individual pairs representing direct transmission events to estimate the transmission bottleneck size according to the epidemiological and genomic information. We discarded clusters A and B (nursing homes) from this analysis because it was impossible to identify likely transmission pairs in these cases. We had contact information about at least a transmission pair for clusters C, E, G, H, I, and J. The situation was more complex for clusters D and F, so we identified possible transmission pairs considering both the epidemiological and the genomic data, as described above. Across the studied transmission pairs, we found an average of 0.38 (range 0-3) shared intrahost variants (**Table S6**). Accordingly, the estimated transmission bottleneck sizes were typically small (1-2 viral particles), except for the pair F1 → F3 (6 viral particles) (**Figure 8, Table S6**). To ensure that our selection of transmission pairs in clusters D and F was not biasing these estimates downwards, we also calculated the transmission bottleneck sizes for all potential pairs within these two clusters. The estimated bottlenecks were consistently 1-2. Note that the bottleneck size can only be estimated when there is at least one variant in the donor (regardless of whether that variant is observed in the recipient). If none of the donor variants appear in the recipient, the estimated bottleneck size will be 1, with a variable confidence interval depending on the variant calling threshold. ![Figure 8.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F8.medium.gif) [Figure 8.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F8) Figure 8. Estimated transmission bottleneck sizes. Labels on the Y-axis represent donor-recipient pairs. Estimates were obtained with the beta-binomial ML method (Sobel Leonard et al. 2017). Horizontal lines represent 95% confidence intervals. The X-axis is on a logarithmic scale. ![Figure 9.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F9.medium.gif) [Figure 9.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F9) Figure 9. Estimated *dN/dS* values per sample for missense variants. Values were estimated using the dNdScv package. Vertical lines represent 95% confidence intervals. ### Assessment of selective pressures We estimated *dN/dS* values for missense variants consistently below 1, suggesting a predominance of intrahost purifying selection across samples (**Figure 8**). ## Discussion Understanding SARS-CoV-2 transmission is crucial to identify which situations minimize or maximize the risk of infection and, therefore, implement more effective control strategies. Previous studies have tried to reconstruct SARS-CoV-2 local transmission chains with more or less success using a combination of epidemiological and genomic data (Popa et al. 2020; Sekizuka et al. 2020; Shen et al. 2020; Hamilton et al. 2021; San et al. 2021). However, it is unclear whether, in situations for which contact tracing information is limited, we can use SARS-CoV-2 genomic information alone to understand who infected whom. Here, we show that while SARS-CoV-2 genomic variation can be helpful to delimit distinct transmission clusters, it might not be enough to resolve with confidence direct transmission events. Using the most likely transmission pairs, we infer a tiny transmission bottleneck size for SARS-CoV-2 in the order of 1-10 viral particles. The level of interhost genomic variation that we detected was generally low. However, this did not prevent the distinction among local clusters sampled in the same month in the same city. When the sampling dates were taken into account, the concordance between genomic and epidemiological clusters was maximized, highlighting the relevance of the temporal information. Methods for cluster delimitation that rely exclusively on intrahost variants did not work well in this regard. In contrast, methods based on differences at the consensus level could differentiate the clusters near perfectly. The latter suggests that in SARS-CoV-2, consensus sequences alone are enough to separate samples belonging to different clusters from the same area. At the same time, intra-host diversity does not seem to be sufficient for this task. We found a limited number of intrahost variants (∼8 before filtering recurrent variants and ∼3 after filtering), as reported in other studies (Kuipers et al. 2020; Seemann et al. 2020; Shen et al. 2020; Tonkin-Hill et al. 2020; Wölfel et al. 2020; Butler et al. 2021; Valesano et al. 2021; Y. Wang et al. 2021). Half of our samples (27/48) had a viral load above 103 copies / µL, which is the threshold determined in Valesano et al. (2021) for reliable identification of intrahost variants with a VAF ≥ 2% in single replicates. Unlike previous studies, we used technical replicates to stress variant calling reproducibility and added unique barcodes to each sample to discard the potential effect of cross-sample contamination. Here, transmission history within nursing homes or households, where most SARS-CoV-2 infections occur (Lee et al. 2020), was complicated to decipher. In general, all the methods we tried, even those relying on intrahost variation, could not provide clear transmission patterns within clusters, as seen before in care homes (Hamilton et al. 2021). The latter can be explained by lack of genetic variation but also by homoplasy, as we observed several shared intrahost variants among apparently unrelated samples. In addition, we noticed that if VAF thresholds are relaxed, unique or uncommon mutations appear in many individuals, suggesting these variants are either recurrent artifacts or correspond to hotspot mutations (Tonkin-Hill et al. 2020). Deciding which shared intrahost variants in SARS-CoV-2 are the result of transmission events is not easy. Much care should be taken regarding reliable genotyping and identifying recurrent events, particularly for samples with low viral loads (van Dorp, Acman, et al. 2020; van Dorp, Richard, et al. 2020; Kubik et al. 2021; Valesano et al. 2021). Recurrent, low-frequency insertions in SARS-CoV-2 have already been detected elsewhere (Kuipers et al. 2020; Rayko and Komissarov 2020; Tonkin-Hill et al. 2020; Turakhia et al. 2020). Although not addressed in this study, another potential complication regarding the identification of shared intrahost variants is the occurrence of significant intrahost evolution. Several studies have reported VAF changes within days in SARS-CoV-2 (Jary et al. 2020; Tonkin-Hill et al. 2020; Voloch et al. 2020; Y. Wang et al. 2021), even faster in immunocompromised individuals (Avanzato et al. 2020; Kemp et al. 2021). On the other hand, more rigorous studies report that diversity does not increase over time, although this does not imply that VAFs cannot change significantly among different time points (Valesano et al. 2021). Significant intrahost evolution would imply that the amount of sharing between samples could change depending on the exact sampling dates, so the inferences derived from it. Our estimates indicate that the SARS-CoV-2 transmission bottleneck size is small or very small, with only a few viral particles being responsible for the successful growth within the recipient, which is consistent with previous studies (Li et al. 2021; Lythgoe et al. 2021; San et al. 2021; D. Wang et al. 2021). Notably, minimal bottleneck estimates have also been obtained for a highly transmissible SARS-CoV-2 lineage like Delta (Li et al. 2021). In contrast, Popa et al. (2020) estimated an average transmission bottleneck size for SARS-CoV-2 of 1,000, but these estimates have been put into question (Martin and Koelle 2021). The advantages of our study in this regard are the technical replicates, cross-contamination controls, and consistent filters for recurrent variants. A small transmission bottleneck size for SARS-CoV-2 is consistent with a dominance of aerosol transmission over direct contact, as seen in influenza (Varble et al. 2014; Frise et al. 2016; McCrone and Lauring 2018). If only one or a few virions are passed during transmission, then most of SARS-CoV-2 intrahost variation has to be due to the accumulation of *de novo* mutations (Voloch et al. 2020; Valesano et al. 2021). We inferred strong intrahost purifying selection across the genome for missense variants, as in previous comprehensive studies of SARS-CoV-2 intrahost variation (Tonkin-Hill et al. 2020). The latter is consistent with a severe transmission bottleneck reducing the efficiency of positive selection within hosts (McCrone and Lauring 2018). In summary, our results suggest that SARS-CoV-2 genomic diversity is helpful to delimitate different transmission clusters within a relatively small area, but that could be insufficient to fully resolve transmissions within a household or in the same social event. Thus, contact tracing data will be essential to study direct SARS-CoV-2 transmission events, as it occurs in typical slow-evolving pathogens (Campbell et al. 2018; Campbell et al. 2019). ## Data Availability Raw FASTQ files will be deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) (Leinonen et al. 2011). Viral consensus genomes will be available at the Global Initiative on Sharing All Influenza Data (GISAID) (Shu and McCauley 2017). Contact authors for details. ## Data availability Raw FASTQ files have been deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) (Leinonen et al. 2011) (Project Accession No. PRJNAXXXXXX). Viral consensus genomes are available at the Global Initiative on Sharing All influenza Data (GISAID) (Shu and McCauley 2017) (accessions in Supplementary Table Sx). ## Author contributions DP conceived the study and designed the analyses. SP, JJC, VdC, and BR obtained the patient samples and the epidemiological information. SP, NEG, LdC, IFS, and DV planned and performed the laboratory experiments. PGG, NV, NS, and TT carried out the bioinformatic analyses. DP wrote the draft manuscript with the help of PGG and NV. All authors read the manuscript and contributed to interpretation and discussion. ## Supplementary Data Supplementary data are available at Virus Evolution online. ## Conflict of interest None declared. ## Supplementary Material View this table: [Table S1.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/T1) Table S1. Transmission clusters. Cluster characteristics, sample ID, date of sampling, qPCR cycle threshold (Ct) (gene E) in the nasopharyngeal sample, viral load (copies / µL) (gene E) in the 1/10 diluted RNA extract, mean coverage and standard deviation (SD), and NextClade clade and PANGO lineage for each sample. Asterisks highlight discarded samples with poor sequencing quality. View this table: [Table S2.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/T2) Table S2. Sequencing quality control of each sample (both replicates). Red-colored samples that we excluded for bad quality. *Reads main barcode* is the number of reads mapped to the main barcode. *% main barcode* is the percentage of reads mapped to the main barcode. *Reads other barcodes* is the number of reads mapped to other barcodes. *% other barcode* is the percentage of reads mapped to another barcode. *Main barcode* is the identifier of the main barcode. *Mean coverage* is the mean coverage of reads mapped to reference. *Breadth 1/10/100X*: is the proportion of positions covered by at least 1/10/100 reads. Samples marked with an asterisk failed during sequencing. View this table: [Table S3.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/T3) Table S3. Average number of differences among consensus sequences within and between clusters. View this table: [Table S4.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/T4) Table S4. Number of intrahost and fixed variants (SNVs and indels) per sample. View this table: [Table S5.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/T5) Table S5. Low-frequency variants (0.5 > VAF ≥ 0.02) evaluated. Number (#) and percentage (%) of samples in which each variant appears. Mean VAF and standard deviation (sd) of the variant and samples ID. The asterisk indicates variants with a VAF > 0.5 in at least another sample of the same cluster. View this table: [Table S6.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/T6) Table S6. Estimated transmission bottleneck sizes. The table shows the number and type of variants used for the calculation of the transmission bottleneck size. CI indicates the 95% likelihood confidence interval. Shared intrahost variants can be fixed in the recipient but not in the donor. Private intrahost donor variants are those that appear in the donor but not in the recipient. “-”: indicates pairs in which the absence of intrahost donor variants precludes the calculation of the bottleneck size. ![Figure S1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F10.medium.gif) [Figure S1.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F10) Figure S1. Variant allele frequency (VAF) distribution across samples. VAFs were calculated after filtering recurrent low-frequency variants. ![Figure S2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F11.medium.gif) [Figure S2.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F11) Figure S2. Number of intrahost variants. The plots describe the variation of the number of intrahost variants regarding Ct values (A, B), viral load (C, D), and sequencing depth (E, F), before (A, C, E) and after filtering recurrent low-frequency variants (B, D, F). ![Figure S3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F12.medium.gif) [Figure S3.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F12) Figure S3. VAF correlation (Pearson) between replicates across samples. (A) after masking and before filtering recurrent low-frequency variants. (B) after masking and filtering. ![Figure S4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F13.medium.gif) [Figure S4.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F13) Figure S4. Minimum spanning trees for each cluster. Arrows indicate the inferred direction of transmission according to the arbitrary rules described in the Methods section. ![Figure S5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/08/09/2021.08.08.21261673/F14.medium.gif) [Figure S5.](http://medrxiv.org/content/early/2021/08/09/2021.08.08.21261673/F14) Figure S5. Reconstruction of transmission history using TransPhylo based on the dated phylogeny. (A) Consensus transmission tree. Filled dots represent sampled individuals, and unfilled dots represent inferred unsampled individuals. (B) Heatmap of pairwise transmission probabilities. ## Acknowledgments This project was funded by grant EPICOVIGAL FONDO SUPERA-COVID19 from Banco Santander-CSIC-CRUE, grant CT850A-2 from ACIS SERGAS from the Consellería de Sanidade Xunta de Galicia, and grant ED431C2018/54-GRC from the Consellería de Cultura, Educación e Ordenación Universitaria of Xunta de Galicia. NS and TT were supported in part by a C3.ai Digital Transformation Institute award. * Received August 8, 2021. * Revision received August 8, 2021. * Accepted August 9, 2021. * © 2021, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), CC BY-NC 4.0, as described at [http://creativecommons.org/licenses/by-nc/4.0/](http://creativecommons.org/licenses/by-nc/4.0/) ## References 1. Andrews S. 2010. FastQC: A Quality Control Tool for High Throughput Sequence Data. [Online] [Internet]. Available from: [http://www.bioinformatics.babraham.ac.uk/projects/fastqc/](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) 2. Avanzato VA, Matson MJ, Seifert SN, Pryce R, Williamson BN, Anzick SL, Barbian K, Judson SD, Fischer ER, Martens C, et al. 2020. Case Study: Prolonged Infectious SARS-CoV-2 Shedding from an Asymptomatic Immunocompromised Individual with Cancer. Cell 183:1901–1912.e9. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 3. Bastian M, Heymann S, Jacomy M. 2009. Gephi: An Open Source Software for Exploring and Manipulating Networks. In: Third International AAAI Conference on Weblogs and Social Media. Available from: [https://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewPaper/154](https://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewPaper/154) 4. Butler D, Mozsary C, Meydan C, Foox J, Rosiene J, Shaiber A, Danko D, Afshinnekoo E, MacKay M, Sedlazeck FJ, et al. 2021. Shotgun transcriptome, spatial omics, and isothermal profiling of SARS-CoV-2 infection reveals unique host responses, viral diversification, and drug interactions. Nat. Commun. 12:1660. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41467-021-21361-7&link_type=DOI) 5. Campbell F, Cori A, Ferguson N, Jombart T. 2019. Bayesian inference of transmission chains using timing of symptoms, pathogen genomes and contact data. PLoS Comput. Biol. 15:e1006930. 6. Campbell F, Strang C, Ferguson N, Cori A, Jombart T. 2018. When are pathogen genome sequences informative of transmission events? PLoS Pathog. 14:e1006885. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.ppat.1006885&link_type=DOI) 7. De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. 2020a. Issues with SARS-CoV-2 sequencing data. [http://virological.org](http://virological.org) [Internet]. Available from: [http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473](http://virological.org/t/issues-with-sars-cov-2-sequencing-data/473) 8. De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N. 2020b. Masking strategies for SARS-CoV-2 alignments. [https://virological.org](https://virological.org) [Internet]. Available from: [https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480](https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480) 9. Didelot X, Fraser C, Gardy J, Colijn C. 2017. Genomic Infectious Disease Epidemiology in Partially Sampled and Ongoing Outbreaks. Mol. Biol. Evol. 34:997–1007. 10. Didelot X, Gardy J, Colijn C. 2014. Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol. Biol. Evol. 31:1869–1879. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/molbev/msu121&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24714079&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 11. Didelot X, Kendall M, Xu Y, White PJ, McCarthy N. 2021. Genomic Epidemiology Analysis of Infectious Disease Outbreaks Using TransPhylo. Curr Protoc 1:e60. 12. van Dorp L, Acman M, Richard D, Shaw LP, Ford CE, Ormond L, Owen CJ, Pang J, Tan CCS, Boshier FAT, et al. 2020. Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infect. Genet. Evol. 83:104351. 13. van Dorp L, Richard D, Tan CCS, Shaw LP, Acman M, Balloux F. 2020. No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2. Nat. Commun. 11:5986. 14. Frise R, Bradley K, van Doremalen N, Galiano M, Elderfield RA, Stilwell P, Ashcroft JW, Fernandez-Alonso M, Miah S, Lackenby A, et al. 2016. Contact transmission of influenza virus between ferrets imposes a looser bottleneck than respiratory droplet transmission allowing propagation of antiviral resistance. Sci. Rep. 6:29793. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/srep29793&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=27430528&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 15. Graudenzi A, Maspero D, Angaroni F, Piazza R, Ramazzotti D. 2021. Mutational signatures and heterogeneous host response revealed via large-scale characterization of SARS-CoV-2 genomic diversity. iScience 24:102116. 16. Grubaugh ND, Gangavarapu K, Quick J, Matteson NL, De Jesus JG, Main BJ, Tan AL, Paul LM, Brackney DE, Grewal S, et al. 2019. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol. 20:8. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s13059-018-1618-7&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30621750&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 17. Grubaugh ND, Ladner JT, Lemey P, Pybus OG, Rambaut A, Holmes EC, Andersen KG. 2019. Tracking virus outbreaks in the twenty-first century. Nat Microbiol 4:10–19. 18. Hall MD, Woolhouse MEJ, Rambaut A. 2016. Using genomics data to reconstruct transmission trees during disease outbreaks. Rev. Sci. Tech. 35:287–296. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.20506/rst.35.1.2433&link_type=DOI) 19. Hall M, Woolhouse M, Rambaut A. 2015. Epidemic Reconstruction in a Phylogenetics Framework: Transmission Trees as Partitions of the Node Set. PLoS Comput. Biol. 11:e1004613. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1004613&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=PMCPMC470101&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 20. Hamilton WL, Tonkin-Hill G, Smith ER, Aggarwal D, Houldcroft CJ, Warne B, Meredith LW, Hosmillo M, Jahun AS, Curran MD, et al. 2021. Genomic epidemiology of COVID-19 in care homes in the east of England. Elife 10:e64618. 21. Henn MR, Boutwell CL, Charlebois P, Lennon NJ, Power KA, Macalalad AR, Berlin AM, Malboeuf CM, Ryan EM, Gnerre S, et al. 2012. Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection. PLoS Pathog. 8:e1002529. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.ppat.1002529&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22412369&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 22. Hensley SE, Das SR, Bailey AL, Schmidt LM, Hickman HD, Jayaraman A, Viswanathan K, Raman R, Sasisekharan R, Bennink JR, et al. 2009. Hemagglutinin receptor binding avidity drives influenza A virus antigenic drift. Science 326:734–736. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEyOiIzMjYvNTk1My83MzQiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wOC8wOS8yMDIxLjA4LjA4LjIxMjYxNjczLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 23. Jary A, Leducq V, Malet I, Marot S, Klement-Frutos E, Teyssou E, Soulié C, Abdi B, Wirden M, Pourcher V, et al. 2020. Evolution of viral quasispecies during SARS-CoV-2 infection. Clin. Microbiol. Infect. 26:1560.e1–e1560.e4. 24. Jombart T, Cori A, Didelot X, Cauchemez S, Fraser C, Ferguson N. 2014. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLoS Comput. Biol. 10:e1003457. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1003457&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24465202&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 25. Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30:772–780. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/molbev/mst010&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23329690&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000317002300004&link_type=ISI) 26. Kemp SA, Collier DA, Datir RP, Ferreira IATM, Gayed S, Jahun A, Hosmillo M, Rees-Spear C, Mlcochova P, Lumb IU, et al. 2021. SARS-CoV-2 evolution during treatment of chronic infection. Nature 592:277–282. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 27. Koyama T, Platt D, Parida L. 2020. Variant analysis of SARS-CoV-2 genomes. Bull. World Health Organ. 98:495–504. 28. Kubik S, Marques AC, Xing X, Silvery J, Bertelli C, De Maio F, Pournaras S, Burr T, Duffourd Y, Siemens H, et al. 2021. Recommendations for accurate genotyping of SARS-CoV-2 using amplicon-based sequencing of clinical samples. Clin. Microbiol. Infect. 27:1036.e1–e1036.e8. 29. Kuipers J, Batavia AA, Jablonski KP, Bayer F, Borgsmüller N, Dondi A, Dragan M-A, Ferreira P, Jahn K, Lamberti L, et al. 2020. Within-patient genetic diversity of SARS-CoV-2. bioRxiv [Internet]:2020.10.12.335919. Available from: [https://www.biorxiv.org/content/10.1101/2020.10.12.335919v1.abstract](https://www.biorxiv.org/content/10.1101/2020.10.12.335919v1.abstract) 30. Lee EC, Wada NI, Grabowski MK, Gurley ES, Lessler J. 2020. The engines of SARS-CoV-2 spread. Science 370:406–407. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjEyOiIzNzAvNjUxNS80MDYiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wOC8wOS8yMDIxLjA4LjA4LjIxMjYxNjczLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 31. Leinonen R, Sugawara H, Shumway M, International Nucleotide Sequence Database Collaboration. 2011. The sequence read archive. Nucleic Acids Res. 39:D19–D21. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkq1019&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=21062823&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000285831700005&link_type=ISI) 32. Letizia AG, Ramos I, Obla A, Goforth C, Weir DL, Ge Y, Bamman MM, Dutta J, Ellis E, Estrella L, et al. 2020. SARS-CoV-2 Transmission among Marine Recruits during Quarantine. N. Engl. J. Med. 383:2407–2416. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 33. Li B, Deng A, Li K, Hu Y, Li Z, Xiong Q, Liu Z, Guo Q, Zou L, Zhang H, et al. 2021. Viral infection and transmission in a large, well-traced outbreak caused by the SARS-CoV-2 Delta variant. bioRxiv [Internet]. Available from: [http://medrxiv.org/lookup/doi/10.1101/2021.07.07.21260122](http://medrxiv.org/lookup/doi/10.1101/2021.07.07.21260122) 34. Li H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [q-bio.GN] [Internet]. Available from: [http://arxiv.org/abs/1303.3997](http://arxiv.org/abs/1303.3997) 35. Lumby CK, Nene NR, Illingworth CJR. 2018. A novel framework for inferring parameters of transmission from viral sequence data. PLoS Genet. 14:e1007718. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1007718&link_type=DOI) 36. Lythgoe KA, Hall M, Ferretti L, de Cesare M, MacIntyre-Cockett G, Trebes A, Andersson M, Otecko N, Wise EL, Moore N, et al. 2021. SARS-CoV-2 within-host diversity and transmission. Science 372:eabg0821. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6Mzoic2NpIjtzOjU6InJlc2lkIjtzOjE3OiIzNzIvNjUzOS9lYWJnMDgyMSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzA4LzA5LzIwMjEuMDguMDguMjEyNjE2NzMuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 37. Martincorena I, Raine KM, Gerstung M, Dawson KJ, Haase K, Van Loo P, Davies H, Stratton MR, Campbell PJ. 2017. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell 171:1029–1041.e21. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.cell.2017.09.042&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29056346&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 38. Martin M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10–12. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.14806.17.1.200&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22955987&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 39. Martin MA, Koelle K. 2021. Reanalysis of deep-sequencing data from Austria points towards a small SARS-COV-2 transmission bottleneck on the order of one to three virions. bioRxiv [Internet]:2021.02.22.432096. Available from: [https://www.biorxiv.org/content/10.1101/2021.02.22.432096v1.abstract](https://www.biorxiv.org/content/10.1101/2021.02.22.432096v1.abstract) 40. Matteson N, Grubaugh N, Gangavarapu K, Quick J, Loman N, Andersen K. 2020. PrimalSeq: Generation of tiled virus amplicons for MiSeq sequencing v1 (protocols.io.bez7jf9n). protocols.io [Internet]. Available from: [https://www.protocols.io/view/primalseq-generation-of-tiled-virus-amplicons-for-bez7jf9n](https://www.protocols.io/view/primalseq-generation-of-tiled-virus-amplicons-for-bez7jf9n) 41. McCrone JT, Lauring AS. 2018. Genetic bottlenecks in intraspecies virus transmission. Curr. Opin. Virol. 28:20–25. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.coviro.2017.10.008&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29107838&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 42. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32:268–274. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/molbev/msu300&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25371430&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 43. O’Toole’Scher E, Underwood A, Jackson B, Hill V, McCrone JT, Colquhoun R, Ruis C, Abu-Dahab K, Taylor B, et al. 2021. Assignment of Epidemiological Lineages in an Emerging Pandemic Using the Pangolin Tool. Virus Evolution [Internet]. Available from: [https://academic.oup.com/ve/advance-article-abstract/doi/10.1093/ve/veab064/6315289](https://academic.oup.com/ve/advance-article-abstract/doi/10.1093/ve/veab064/6315289) 44. Paradis E, Schliep K. 2019. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35:526–528. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/bty633&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30016406&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 45. Parameswaran P, Wang C, Trivedi SB, Eswarappa M, Montoya M, Balmaseda A, Harris E. 2017. Intrahost Selection Pressures Drive Rapid Dengue Virus Microevolution in Acute Human Infections. Cell Host Microbe 22:400–410.e5. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.chom.2017.08.003&link_type=DOI) 46. Park DJ, Dudas G, Wohl S, Goba A, Whitmer SLM, Andersen KG, Sealfon RS, Ladner JT, Kugelman JR, Matranga CB, et al. 2015. Ebola Virus Epidemiology, Transmission, and Evolution during Seven Months in Sierra Leone. Cell 161:1516–1526. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.cell.2015.06.007&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26091036&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 47. Perera D, Perks B, Potemkin M, Gordon P, Gill J, van Marle G, Long Q. 2021. A novel computational approach to reconstruct SARS-CoV-2 infection dynamics through the inference of unsampled sources of infection. bioRxiv [Internet]. Available from: [http://medrxiv.org/lookup/doi/10.1101/2021.01.04.21249233](http://medrxiv.org/lookup/doi/10.1101/2021.01.04.21249233) 48. Popa A, Genger J-W, Nicholson MD, Penz T, Schmid D, Aberle SW, Agerer B, Lercher A, Endler L, Colaço H, et al. 2020. Genomic epidemiology of superspreading events in Austria reveals mutational dynamics and transmission properties of SARS-CoV-2. Sci. Transl. Med. 12:eabe2555. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTE6InNjaXRyYW5zbWVkIjtzOjU6InJlc2lkIjtzOjE1OiIxMi81NzMvZWFiZTI1NTUiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wOC8wOS8yMDIxLjA4LjA4LjIxMjYxNjczLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 49. Quick J, Grubaugh ND, Pullan ST, Claro IM, Smith AD, Gangavarapu K, Oliveira G, Robles-Sikisaka R, Rogers TF, Beutler NA, et al. 2017. Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nat. Protoc. 12:1261–1276. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nprot.2017.066&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28538739&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 50. Rambaut A, Holmes EC, O’Toole ’, Hill V, McCrone JT, Ruis C, du Plessis L, Pybus OG. 2020. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol 5:1403–1407. 51. Rayko M, Komissarov A. 2020. Quality control of low-frequency variants in SARS-CoV-2 genomes. bioRxiv [Internet]:2020.04.26.062422. Available from: [https://www.biorxiv.org/content/10.1101/2020.04.26.062422v2.abstract](https://www.biorxiv.org/content/10.1101/2020.04.26.062422v2.abstract) 52. Sagulenko P, Puller V, Neher RA. 2018. TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evol 4:vex042. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ve/vex042&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29340210&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 53. San JE, Ngcapu S, Kanzi AM, Tegally H, Fonseca V, Giandhari J, Wilkinson E, Nelson CW, Smidt W, Kiran AM, et al. 2021. Transmission dynamics of SARS-CoV-2 within-host diversity in two major hospital outbreaks in South Africa. Virus Evol 7:veab041. 54. Sapoval N, Mahmoud M, Jochum MD, Liu Y, Elworth RAL, Wang Q, Albin D, Ogilvie HA, Lee MD, Villapol S, et al. 2021. SARS-CoV-2 genomic diversity and the implications for qRT-PCR diagnostics and transmission. Genome Res. 31:635–644. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NjoiZ2Vub21lIjtzOjU6InJlc2lkIjtzOjg6IjMxLzQvNjM1IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDgvMDkvMjAyMS4wOC4wOC4yMTI2MTY3My5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 55. Seemann T, Lane CR, Sherry NL, Duchene S, Gonçalves da Silva A, Caly L, Sait M, Ballard SA, Horan K, Schultz MB, et al. 2020. Tracking the COVID-19 pandemic in Australia using genomics. Nat. Commun. 11:4376. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41467-020-18314-x&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32873808&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 56. Sekizuka T, Itokawa K, Kageyama T, Saito S, Takayama I, Asanuma H, Nao N, Tanaka R, Hashino M, Takahashi T, et al. 2020. Haplotype networks of SARS-CoV-2 infections in the Diamond Princess cruise ship outbreak. Proc. Natl. Acad. Sci. U. S. A. 117:20198–20201. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTE3LzMzLzIwMTk4IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDgvMDkvMjAyMS4wOC4wOC4yMTI2MTY3My5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 57. Shen Z, Xiao Y, Kang L, Ma W, Shi L, Zhang L, Zhou Z, Yang J, Zhong J, Yang D, et al. 2020. Genomic Diversity of Severe Acute Respiratory Syndrome–Coronavirus 2 in Patients With Coronavirus Disease 2019.Clin. Infect. Dis. 71:713–720. 58. Shu Y, McCauley J. 2017. GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. [Internet] 22. Available from: [http://dx.doi.org/10.2807/1560-7917.ES.2017.22.13.30494](http://dx.doi.org/10.2807/1560-7917.ES.2017.22.13.30494) 59. Sobel Leonard A, Weissman DB, Greenbaum B. 2017. Transmission bottleneck size estimation from pathogen deep-sequencing data, with an application to human influenza A virus. Journal of Virology 91:e00171–17. 60. Sobel Leonard A, Weissman DB, Greenbaum B, Ghedin E, Koelle K. 2019. Correction for Sobel Leonard et al., “Transmission Bottleneck Size Estimation from Pathogen Deep-Sequencing Data, with an Application to Human Influenza A Virus.” Journal of Virology [Internet] 93. Available from: [http://dx.doi.org/10.1128/jvi.00936-19](http://dx.doi.org/10.1128/jvi.00936-19) 61. Stapleford KA, Cofley LL, Lay S, Bordería AV, Duong V, Isakov O, Rozen-Gagnon K, Arias-Goeta C, Blanc H, Beaucourt S, et al. 2014. Emergence and transmission of arbovirus evolutionary intermediates with epidemic potential. Cell Host Microbe 15:706–716. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.chom.2014.05.008&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24922573&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 62. Stern A, Yeh MT, Zinger T, Smith M, Wright C, Ling G, Nielsen R, Macadam A, Andino R. 2017. The Evolutionary Pathway to Virulence of an RNA Virus. Cell 169:35–46.e19. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.cell.2017.03.013&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28340348&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 63. Stimson J, Gardy J, Mathema B, Crudu V, Cohen T, Colijn C. 2019. Beyond the SNP Threshold: Identifying Outbreak Clusters Using Inferred Transmissions. Mol. Biol. Evol. 36:587–603. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/molbev/msy242&link_type=DOI) 64. Tonkin-Hill G, Martincorena I, Amato R, Lawson ARJ, Gerstung M, Johnston I, Jackson DK, Park NR, Lensing SV, Quail MA, et al. 2020. Patterns of within-host genetic diversity in SARS-CoV-2. Cold Spring Harbor Laboratory [Internet]:2020.12.23.424229. Available from: [https://www.biorxiv.org/content/10.1101/2020.12.23.424229v1.abstract](https://www.biorxiv.org/content/10.1101/2020.12.23.424229v1.abstract) 65. Turakhia Y, De Maio N, Thornlow B, Gozashti L, Lanfear R, Walker CR, Hinrichs AS, Fernandes JD, Borges R, Slodkowicz G, et al. 2020. Stability of SARS-CoV-2 phylogenies. PLoS Genet. 16:e1009175. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pgen.1009175&link_type=DOI) 66. Valesano AL, Rumfelt KE, Dimche? DE, Blair CN, Fitzsimmons WJ, Petrie JG, Martin ET, Lauring AS. 2021. Temporal dynamics of SARS-CoV-2 mutation accumulation within and across infected hosts. PLoS Pathog. 17:e1009499. 67. Varble A, Albrecht RA, Backes S, Crumiller M, Bouvier NM, Sachs D, García-Sastre A, tenOever BR. 2014. Influenza A virus transmission bottlenecks are defined by infection route and recipient host. Cell Host Microbe 16:691–700. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.chom.2014.09.020&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25456074&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 68. Vignuzzi M, Stone JK, Arnold JJ, Cameron CE, Andino R. 2006. Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population. Nature 439:344–348. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature04388&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=16327776&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000234682100047&link_type=ISI) 69. Vogels C, Fauver J, Isabel Grubaugh N. 2020. Generation of SARS-COV-2 RNA transcript standards for qRT-PCR detection assays v1. protocols.io [Internet]. Available from: [https://www.protocols.io/view/generation-of-sars-cov-2-rna-transcript-standards-bdv6i69e](https://www.protocols.io/view/generation-of-sars-cov-2-rna-transcript-standards-bdv6i69e) 70. Voloch CM, Ronaldo da Silva F, de Almeida LGP, Brustolini OJ, Cardoso CC, Gerber AL, de C Guimaraes AP, de Carvalho Leitão I, Mariani D, Ota VA, et al. 2020. Intra-host evolution during SARS-CoV-2 persistent infection. medRxiv [Internet]. Available from: [https://www.medrxiv.org/content/10.1101/2020.11.13.20231217v1.full-text](https://www.medrxiv.org/content/10.1101/2020.11.13.20231217v1.full-text) 71. Wang D, Wang Y, Sun W, Zhang L, Ji J, Zhang Z, Cheng X, Li Y, Xiao F, Zhu A, et al. 2021. Population Bottlenecks and Intra-host Evolution During Human-to-Human Transmission of SARS-CoV-2. Front. Med. 8:585358. 72. Wang Y, Wang D, Zhang L, Sun W, Zhang Z, Chen W, Zhu A, Huang Y, Xiao F, Yao J, et al. 2021. Intra-host variation and evolutionary dynamics of SARS-CoV-2 populations in COVID-19 patients. Genome Med. 13:30. 73. Wilm A, Aw PPK, Bertrand D, Yeo GHT, Ong SH, Wong CH, Khor CC, Petric R, Hibberd ML, Nagarajan N. 2012. LoFreq: a sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 40:11189–11201. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gks918&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23066108&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000313414800010&link_type=ISI) 74. Wölfel R, Corman VM, Guggemos W, Seilmaier M, Zange S, Müller MA, Niemeyer D, Jones TC, Vollmar P, Rothe C, et al. 2020. Virological assessment of hospitalized patients with COVID-2019. Nature 581:465–469. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2196-x&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 75. Worby CJ, Chang HH, Hanage WP, Lipsitch M. 2014. The distribution of pairwise genetic distances: a tool for investigating disease transmission. Genetics 198:1395–1404. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6ODoiZ2VuZXRpY3MiO3M6NToicmVzaWQiO3M6MTA6IjE5OC80LzEzOTUiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wOC8wOS8yMDIxLjA4LjA4LjIxMjYxNjczLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 76. Worby CJ, Lipsitch M, Hanage WP. 2014. Within-host bacterial diversity hinders accurate reconstruction of transmission networks from genomic distance data. PLoS Comput. Biol. 10:e1003549. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pcbi.1003549&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24675511&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F08%2F09%2F2021.08.08.21261673.atom) 77. Worby CJ, Lipsitch M, Hanage WP. 2017. Shared Genomic Variants: Identification of Transmission Routes Using Pathogen Deep-Sequence Data. Am. J. Epidemiol. 186:1209–1216. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/aje/kwx182&link_type=DOI)