Abstract
The efficacy of COVID-19 mRNA vaccines is high, but breakthrough infections still occur. We compared the SARS-CoV-2 genomes of 67 breakthrough cases after full vaccination with BNT162b2 (Pfizer/BioNTech), mRNA-1273 (Moderna), or JNJ-78436735 (Janssen) to unvaccinated controls (February-April 2021) in metropolitan New York, including their phylogenetic relationship, distribution of variants, and full spike mutation profiles. Their median age was 45 years; seven required hospitalization and one died. Most breakthrough infections (54/67) occurred with B.1.1.7 (Alpha) or B.1.526 (Iota). Among the 7 hospitalized cases, 5 were infected with B.1.1.7, including 1 death. Both unmatched and matched statistical analyses considering age, sex, vaccine type, and study month as covariates supported the null hypothesis of equal variant distributions between vaccinated and unvaccinated in chi-squared and McNemar tests (p>0.1) highlighting a high vaccine efficacy against B.1.1.7 and B.1.526. There was no clear association among breakthroughs between type of vaccine received and variant. In the vaccinated group, spike mutations in the N-terminal domain and receptor-binding domain that have been associated with immune evasion were overrepresented. The evolving dynamic of SARS-Co-V2 variants requires broad genomic analyses of breakthrough infections to provide real-life information on immune escape mediated by circulating variants and their spike mutations.
INTRODUCTION
The novel betacoronavirus SARS-CoV-2 arose as a new human pathogen at the end of 2019, and rapidly spread to every corner of the globe, causing a pandemic of enormous proportions, with 179 million cumulative infections and almost 4 million deaths counted worldwide as of mid-June 2021 1. As soon as the causative agent was identified and sequenced 2, massive efforts to develop vaccines were initiated and tested simultaneously in multiple clinical trials. These efforts led to the rapid deployment of several of these vaccines by December 2020, less than a year after the first SARS-CoV-2 viral sequence had been determined. In the United States, vaccination started in December 2020, using two novel mRNA-based vaccines, BNT162b2 (Pfizer/BioNTech), and mRNA-1273 (Moderna), both using the SARS-CoV-2 spike mRNA sequence of the original Wuhan variant. The Janssen COVID-19 vaccine, JNJ-78436735, was also deployed shortly after the mRNA vaccines, and these vaccines were shown to have high efficacy in clinical trials 3-5. With millions of people vaccinated in multiple countries, they have proven to be highly effective in the real world 6-12, although a very small percentage of breakthrough infections have occurred, as expected 13-21. Contemporaneously with the vaccination efforts in several countries, new SARS-CoV-2 variants emerged, four of which were designated by the WHO as variants of concern (VOC), B.1.1.7 or Alpha, B.1.351 or Beta, P.1 or Gamma, and B.1.617.2 or Delta, which arose originally in the UK, South Africa, Brazil, and India, respectively 22, 23. VOCs are classified as such if they are more transmissible, cause more severe disease or a significant reduction in neutralization by antibodies generated via previous infection or after vaccination. In addition, there are variants of interest (VOI), which carry mutations that have been associated with changes to receptor binding, reduced neutralization by anti-SARS-CoV-2 antibodies, or may increase transmissibility or severity. One of these variants is B.1.526 (Iota), which arose in New York City in late December 2020 24, 25. Although it is not yet clear that any particular VOC and VOI is associated with vaccine breakthrough, data from Israel suggest that there might be higher vaccine breakthrough rates with the B.1.351 variant 17. In addition, it is possible that certain amino acid mutations in spike, irrespective of affiliation to a specific VOC or VOI, are critical for vaccine breakthrough, but this has only been rudimentarily studied 18. There is scarcity of data about determinants of vaccine breakthrough. The few reported studies have included breakthrough cases after first or second immunization, however, breakthrough cases after full vaccination remained moderate or low 13-21. Thus, comprehensive studies with site-specific mutation analyses are needed on a larger number of fully vaccinated individuals from different geographic regions. Here, we added such data from the NY metropolitan area. We carried out full SARS-CoV-2 genome sequencing of SARS-CoV-2-positive individuals, 14 days after their completed vaccination series with mostly Pfizer and Moderna vaccines in the multi-center NYU Langone Health System. Analyses included statistical comparison of variant distribution as well as mutation rates at every residue in spike between vaccinated and unvaccinated individuals.
METHODS
Study design and sample collection
The design is an observational case-control study of SARS-CoV-2 vaccine breakthrough infections in the NYU Langone Health system, a large healthcare system in the New York City metro area, with primary care hospitals located in Manhattan (New York County), Brooklyn (Kings County) and Nassau County (Long Island). The case group consisted of individuals who tested positive by real-time RT-PCR for SARS-CoV-2 RNA regardless of cycle threshold (Ct), any time after 14 days of inoculation with the second dose of BNT162b2 (Pfizer/BioNTech) or mRNA-1273 (Moderna) vaccines, or with the single dose COVID Janssen vaccine, according to our electronic health records (EHR). The control group consisted of full-genome sequenced SARS-CoV-2 positive cases in our health system who were randomly selected for SARS-CoV-2 genomic surveillance, had Ct ≤30, and were collected in the same time period as the breakthrough infections. Nasopharyngeal swabs were sampled from individuals suspected to have an infection with SARS-CoV-2 as part of clinical diagnostics or hospital admittance. Samples were collected in 3 mL viral transport medium (VTM; Copan universal transport medium or equivalent). Clinical testing was performed using various FDA emergency use authorization (EUA)–approved platforms for detection of SARS-CoV-2, i.e., the Roche Cobas 6800 SARS-CoV-2 (90% of the samples in this study) and Cepheid Xpert SARS-CoV-2 or SARS-CoV-2/Flu/RSV assays.
RNA extraction, cDNA synthesis, library preparation and sequencing
RNA was extracted from 400 μl of each nasopharyngeal swab specimens using the MagMAX™ Viral/Pathogen Nucleic Acid Isolation Kit on the KingFisher flex system (Thermo Fisher Scientific) following the manufacturer’s instructions. Total RNA (11 μl) was converted to first strand cDNA by random priming using the Superscript IV first-strand synthesis system (Invitrogen, ref# 180901050). Libraries were prepared using Swift Normalase Amplicon SARS-CoV-2 Panel (SNAP) and SARS-CoV-2 additional Genome Coverage Panel (Cat# SN-5×296 core kit, 96rxn), using 10 μl of first strand cDNA, following the manufacturer’s instructions 26. Final libraries were run on Agilent Tapestation 2200 with high sensitivity DNA Screentape to verify the amplicon size of about 450 bp. Normalized pools were run on the Illumina NovaSeq 6000 system with the SP 300 cycle flow cell. Run metrics were paired-end 150 cycles with dual indexing reads. Typically, two pools representing two full 96 well plates (192 samples) were sequenced on each SP300 NovaSeq flow cell.
Sequenced read processing
Sequencing reads were demultiplexed using the Illumina bcl2fastq2 Conversion Software v2.20 and adapters and low-quality bases were trimmed with Trimmomatic v0.36 27. BWA v0.7.17 28 was utilized for mapping reads to the SARS-CoV-2 reference genome (NC_045512.2, wuhCor1) and mapped reads were soft-clipped to remove SNAP tiled primer sequences using Primerclip v0.3.8 29. BCFtools v1.9 30 was used to call mutations and assemble consensus sequences, which were then assigned phylogenetic lineage designations according to PANGO nomenclature 31. Sequences that did not yield a near-complete viral genome (<23,000bp, <4000x coverage) were discarded from further analysis.
Phylogenetic analyses
SARS-CoV-2 full genome sequences were aligned using Mafft v.7 32. The alignment was cropped to base pairs 202-29657 according to the Wuhan-Hu-1 reference to remove N- and C-terminal regions with unassigned base pairs. Maximum likelihood IQ trees were performed using the IQ-TREE XSEDE tool, multicore version 2.1.2, on the Cipres Science gateway v.3.3 33. GTR+I+G was chosen as the best-fit substitution model. Support values were generated with 1,000 bootstrap replicates and the ultrafast bootstrapping method. Phylogenetic trees were visualized in Interactive Tree Of Life (iTOL) v.6 34. The tree was constructed using 67 vaccine breakthrough and 1,187 unvaccinated control SARS-CoV-2 sequences from our NYU cohort (greater NYC area) together with 3,802 global reference sequences for a total of 5,056 SARS-CoV-2 genomic sequences. The reference sequences were retrieved from a Nextstrain build with North America-focused global subsampling [4] and included 1,361 other US sequences.
Mutation analysis
Spike amino acid counts and calculations of site-specific mutation frequencies compared to Wuhan-Hu-1 as reference were done on MAFFT-aligned SARS-CoV-2 sequences using R35 and R Studio36 using scripts based on the seqinr and tidyverse (dplyr, lubridate, magrittr) packages. The calculations exclusively considered residues that were accurately covered by sequencing, i.e., non-ambiguous characters and gaps. Fisher exact tests and multiplicity corrections (Benjamini-Hochberg) were done in Program R/RStudio. Multiplicity corrected P values (q) <0.05 were considered significant. Highlighter analyses were performed on MAFFT-aligned SARS-CoV-2 amino acid sequences of the spike genomic region. SARS-CoV-2 vaccine breakthrough sequences were compared to Wuhan-Hu-1 as master using the Highlighter tool provided by the Los Alamos HIV sequence database 37.
Statistical analysis
To address confounding and other sources of bias arising from the use of observational data, we estimated a propensity score for the likelihood of full vaccination, and matched vaccinated to unvaccinated patients, including age, sex, county of residence, and study month (February, March, April 2021) as covariates 38. Propensity-score matching was implemented using the nearest neighbor strategy using a 1:1 ratio without replacement, with the MatchIt algorithm in RStudio version 1.4.1106 39. Before and after matching, we evaluated the presence of three Pango lineages, B.1.1.7, B.1.526 and P.1 compared to all others lineages combined. On unmatched data, we calculated a Pearson chi-squared test statistic; on matched data we calculated McNemar’s test. The tables used for these calculations are in Supplementary Material (Supplementary Tables 1 and 2). Comparisons of Ct values and mutation counts between groups were made using non-parametric Mann-Whitney tests or Kruskal-Wallis tests with Dunn’s multiplicity correction.
Study Approval
This study was approved by the NYU Langone Health Institutional Review Board, protocol number i21-00493.
RESULTS
Alpha and Iota variants dominate vaccine breakthrough infections in metropolitan New York
A total of 126,367 fully vaccinated individuals were recorded in our electronic health records by April 30th, 2021, of whom the majority (123,511, 98%) were vaccinated with mRNA vaccines (Pfizer/BioNTech = 103,166 and Moderna = 20,345), and the rest (2,856) with the adenovirus-based Janssen COVID-19 vaccine, administered as single dose. We recorded 86 cases of vaccine breakthrough infection (69x Pfizer/BioNTech, 16x Moderna, and 1x Janssen) between February 1st and April 30th of 2021, representing 1.2% of the total SARS-CoV-2 positive cases in our healthcare system and 0.07% of the fully vaccinated population in our medical records. Out of the 86 cases, 67 cases (80%) yielded full SARS-CoV2 genomes (54x Pfizer/BioNTech, 12x Moderna, and 1x Janssen) that passed quality control (QC) and allowed us to determine the Pango lineage and mutations across the viral genome, including the spike gene. As expected, the 19 excluded breakthrough cases with low genome coverage and failed QC had significantly higher Ct values (median: 35, range: 31-42) than the 67 samples with full viral genome coverage (P<0.0001, Mann Whitney test), which had Ct values below 34 (median: 24, range: 13-34) (Figure 1A).
RT-qPCR Ct values in post-vaccine breakthrough infections. A) Ct plots of samples that yielded a full genome with sufficient coverage to determine lineage and mutations (QC passed = >23,000 bp and >4000X coverage; QC failed = <23,000 bp and <4000X coverage). B) Ct plots of all samples that passed QC, by lineages, B.1.1.7 (Alpha) and B.1.526 (Iota), compared to all others. Hospitalized cases are shown in larger sized black symbols. QC: quality control.
Although B.1.1.7 (Alpha) infection has been associated with overall higher viral load/lower Ct values 40, 41, the Ct values in our recorded breakthrough infections were not significantly different for B.1.1.7 compared to B.1.526 (Iota) or all other variants (Figure 1B).
Among the 67 COVID-19 cases post-vaccination with adequate SARS-CoV-2 genome coverage, the median age was 45 years, 33 were male and 24 were female. Seven required hospitalization for COVID-19, among whom there was one death in an elderly patient with multiple comorbidities who already was on in-home oxygen previous to post-vaccination COVID-19 infection and had a lengthy stay at the ICU (Table 1, Supplementary Table 3).
The distribution of PANGO lineages in the 67 vaccine breakthroughs was as follows: 26 (38.8%) had B.1.1.7, 28 (41.8%) had B.1.526 (including sublineages B.1.526.1 and B.1.526.2), 1 (0.2%) had P.1, and 12 (17.9%) had other variants (Supplementary Figure 1). Among the 1,187 sequences from the group of unvaccinated patients, 345 (29.1%) had B.1.1.7, 487 (41.0%) had B.1.526 (including sublineages B.1.526.1, B.1.526.2, and B.1.526.3), 12 (1.0%) had P.1, and 343 (28.9%) had other variants. Of the seven hospitalizations, five were infected with B.1.1.7, including the fatal case, and two with B.1.526.
VOCs and VOIs are equally distributed in vaccinated and unvaccinated infections
To compare vaccinated breakthrough infections with unvaccinated controls statistically, we included 1,187 unvaccinated individuals from the same study cohort who became SARS-CoV-2-infected in the same study months (February-April 2021) as the vaccine breakthrough cases. A chi-squared test for rejecting the null hypothesis of equal Pango lineage distributions (B.1.1.7, B.1.526, and other variants combined) between vaccinated and unvaccinated patients resulted in a p-value of 0.18.
To address confounding and other sources of bias arising from the use of observational data, we estimated a propensity score for the likelihood of full vaccination 38, and successfully matched all 67 vaccinated to unvaccinated patients, including age, sex, county of residence, and study month (February, March, and April 2021) as covariates. The standardized mean difference between the matched pairs was 0.0125, reduced by 99.7 percent from 0.590 prior to matching.
Supplementary Table 1 shows the distribution of variants in the matched pairs. McNemar’s test of the null hypothesis of equal distributions between vaccinated and unvaccinated patients, assessing the three VOC/VOI separately, could not be calculated due to sparse data. When we collapsed the table to reflect all VOCs/VOIs compared to other variants, McNemar’s test resulted in a p-value of 0.82 (Supplementary Table 2). Thus, vaccinated and unvaccinated individuals in the metropolitan New York area are similarly affected by the regionally circulating VOCs and VOIs. In addition, there is no clear association among vaccinated patients between type of vaccine received and Pango lineage (chi-squared test, p=0.36).
Widespread phylogenetic dispersal of vaccine breakthrough sequences among unvaccinated controls
As a way to ascertain potential bias in our sampling, we carried out a phylogenetic analysis of our 67 breakthrough sequences in the context of 1,187 unvaccinated SARS-CoV-2-positive controls (selected randomly as part of our greater New York catchment area genomic surveillance) together with sub-sampled sequences from the United States as well as globally, the latter two groups based on Nextstrain builds of GISAID sequences (Figure 2). The main variants circulating in the New York area between the months of February and April 2021 were B.1.1.7 and B.1.526 (purple). Accordingly, our vaccine breakthrough samples (orange branch symbols and gray rays) mostly engaged B.1.1.7 and B.1.526 clades and were interspersed among the unvaccinated controls as well as other United States sequences. There was no evidence of extensive clustering that might indicate onward transmissions or transmission chains of vaccine breakthrough infections. Instead, they were widely distributed and appeared to mostly involve independent clusters of infections.
IQ tree of 5,056 SARS-CoV-2 full genome sequences (base pairs 202-29,657 according to Wuhan-Hu-1 as reference), including 67 vaccine breakthrough (orange) and 1,187 unvaccinated control SARS-CoV-2 sequences from our NYU cohort (greater NYC area) (purple) together with 1,361 other US (cyan) and 2,441 non-US global reference sequences (black). The tree was generated with a GTR+I+G substitution model and 1,000 bootstrap replicates, and the substitution scale of the tree is indicated at the bottom right. The branches of the tree are colored as indicated. Vaccine breakthrough sequences are additionally highlighted by orange triangles as branch symbols and gray rays radiating from the root to the outer rim of the tree. Hospitalizations due to COVID-19 among the vaccine breakthrough infections are indicated by black triangles (h). The variants responsible for most vaccine breakthrough infections in our study cohort are labeled with respective Pango lineages (WHO classification in parenthesis).
Enrichment of NTD deletions and RBD escape mutations in vaccine breakthrough compared to unvaccinated control sequences
To screen whether vaccine breakthrough preferentially occurred with distinct vaccine escape mutations, we performed a comparative analysis of spike mutations between case and control groups. In our aligned data set of 67 vaccine breakthrough and 1,187 unvaccinated control sequences, spike mutations (compared to Wuhan-Hu-1) occurred at 233 amino acid residues. While most of these sites were more frequently mutated in control cases (191 sites), and one site (D614) being equally mutated in both groups (100% D614G), 41 sites exhibited increased mutation rates in the vaccine breakthrough group (Figure 3A). Interestingly, the degree of enrichment (Δ mut) was higher at the 41 breakthrough-enriched sites compared to the 191 sites enriched in controls. Contributing factors presumably included random mutations in the absence of immune pressure in controls, adaptive selection of immune escape mutations in vaccine breakthroughs, but also uneven case numbers in both groups. When we disregarded unique mutations per data set in our calculations, the mutation analysis yielded 21 distinct spike sites with enriched mutations in breakthrough infections (Figure 3A, Supplementary Table 4). Although individual sites did not achieve significance in Fisher’s exact tests, the array of sieved mutation sites drew a striking pattern of N-terminal domain (NTD) deletions (ΔV69-H70 and ΔY144), receptor-binding domain (RBD) mutations (K417, S477, E484, and N501), a mutation near the furin binding site (P681X), and also C-terminal mutations (T1027X and D1118X), all of which have been associated with enhanced immune evasion or ACE2 receptor binding, and/or recurring mutations in emerging VOCs/VOIs 22, 42.
A) Comparison of site-specific amino acid mutation (mut) frequencies in spike in 67 vaccine breakthrough sequences (Vacc) compared to 1,187 unvaccinated matched controls (Unvacc) from the same cohort. The Wuhan-Hu-1 sequence served as reference to call mutations per site, and all spike mutation sites of the study sequences are shown along the x-axis according to their spike position (n=233). The mirror plot displays differences of mutation frequencies between Vacc and Unvacc groups; orange bars (top) refer to higher mutation rates in Vacc sequences, whereas black bars (bottom) refer to higher mutation rates in Unvacc matched controls. B) Enrichment of spike mutations in SARS-CoV-2 vaccine breakthrough sequences. All sites with greater spike mutation rates in Vacc compared to Unvacc controls are shown; unique occurrences of mutations were disregarded. Mutation sites in the spike N-terminal domain (NTD), receptor binding domain (RBD), and S1/S2 interface region that have been associated with variants of concern and/or neutralization immune escape are highlighted.
DISCUSSION
Our data from a large metropolitan healthcare system in the greater New York City area underline a high vaccine efficacy in fully vaccinated individuals, 14 days or longer post-last dose of vaccination. Efficacy is maintained against several circulating variants including VOCs and VOIs. Compared to the large number of SARS-CoV-2 infections among unvaccinated individuals, the recorded breakthrough cases between February and April 2021 (n=67) remained at∼1% of total infections, with the caveat that both breakthrough cases as well as unvaccinated controls were not exhaustively screened and covered.
Despite the overall effectiveness of vaccination, our full spike mutation analysis revealed a broad set of spike mutations (n=21) to be elevated in the vaccine breakthrough group. It indicates that adaptive selection is in progress that may subsequently come into full effect. At this point, the breakthrough cases and differences in mutation rates between case and control groups are still too low to draw meaningful conclusions. However, the overrepresentation of functionally important spike mutations including NTD deletions ΔH69-V70 and ΔY144 together with RBD mutations K417X, S477X, E484X, and N501X as well as P681X (near the S1/S2 interface) in the absence of a clear elevation of VOCs or VOIs may indicate a starting sieve effect at individual or combinations of functional mutations. Spike mutations and deletions reported to confer neutralization escape in vitro23, 43, 44 might thus be responsible for a sieve effect in a real-life situation, i.e., among vaccinated individuals.
During the time of our sample and data collection, there were two major variants arising in the New York City metro area constituting about two thirds of all cases, B.1.1.7, which was first reported in the UK 45, and B.1.526, which arose in New York City around December of 2020 24, 25. B.1.1.7 was deemed a variant of concern by the WHO and CDC and was associated with higher viral loads in infected individuals 40, 41, enhanced epidemiological spread 45-47, an extended window of acute infection 48, less effective innate and adaptive immune clearance 49, and increased death rates in elderly patients and/or individuals with comorbidities 50-53. The B.1.1.7 variant, e.g., through the RBD N501Y mutation but also through NTD deletions, acquired an enhanced affinity to ACE2, higher infectivity and virulence 54-57, while maintaining sensitivity to neutralization though with slightly impaired nAb titers 54, 58-60. The B.1.526 variant and its derivatives were designated VOI because of the presence of RBD mutations such as E484K or S477N that favor immune evasion 23, 25.
To study whether the more transmissible B.1.1.7 or B.1.526 because of its critical RBD mutations and prevalence/place of origin in our city were overrepresented in the breakthrough cases, we performed a comparative statistical analysis. Extending a few other studies reported so far 13-21, we focused on a strict threshold of 14 days post last dose of vaccination, and we performed both unmatched and matched analyses side-by-side. We adjusted the confounding factors sex, age, study month, and residence area, though we are aware that other confounding factors such as biased sampling or behavioral differences between groups might also play a role.
Our finding that B.1.1.7 and B.1.526 are not preferentially represented in the vaccine breakthrough group but distributed at similar proportions as other variants in case and control groups suggests that the studied mRNA vaccines are comparably effective against these predominant variants in the NYC area. This is in agreement with recent data from Israel evaluating B.1.1.7 in this context 17. A sieve effect of the B.1.526 variant has not been studied in this detail, except for two studies with small sample sizes of n=2 and n=11 that did not allow strong conclusions 14, 19. A recent CDC study reported a total number of 10,262 SARS-CoV-2 vaccine breakthrough infections across the USA as of end-April 2021 causing hospitalization in 10% of cases (n=995) 20, a rate identical to our study.
Interestingly, the majority of our breakthrough infections had low Ct values (Ct <35), implying a moderate to high viral load in these cases, at least in the nasopharynx from where our samples were collected. These high viral loads suggest the possibility of transmission to others 61. Although our phylogenetic analysis does not provide evidence of transmission to others from our breakthrough cases at this time, this should be expected with the growing number of breakthrough cases in the next months. This may have significant epidemiologic and clinical significance if transmissible breakthrough infections carry specific spike mutations associated with immune evasion.
In conclusion, our data indicate that vaccine breakthrough in fully mRNA-vaccinated individuals is not more frequent with the VOC Alpha or the NY local VOI Iota, which underlines the high efficacy of mRNA COVID-19 vaccines against different variants. The Alpha variant appears overrepresented in the hospitalized group, though the low case numbers of hospitalizations do not allow firm conclusions. The increased presence of mutations in key regions of the spike protein in the vaccine breakthrough group is of potential concern and requires continued monitoring. Genomic surveillance of vaccine breakthrough cases should be carried out on a broader scale throughout the United States and the entire word to achieve higher case numbers and the inclusion of different VOCs and VOIs.
Data Availability
The SARS-C0V-2 sequences have been uploaded to GISAID
Author contributions
Ralf Duerr, MD, PhD designed the study, analyzed data and wrote the manuscript; Dacia Dimartino, PhD generated sequencing data; Christian Marier analyzed genomic data; Paul Zappile generated sequencing data; Guiqing Wang, MD, PhD collected samples and performed clinical SARS-CoV-2 detection; Jennifer Lighter, MD reviewed clinical information; Brian Elbel, PhD monitored epidemiological data and generated lists of breakthrough infections; Andrea Troxel, PhD performed statistical analyses; Adriana Heguy, PhD designed the study, generated genomic data and wrote the manuscript. All authors reviewed and edited the manuscript.
Supplementary Tables
Spike mutation statistics at 21 sites with enriched mutation rates in vaccine breakthrough infections in New York.
Supplementary Figures
Spike mutation patterns in vaccine breakthrough sequences. A) Highlighter plot showing spike amino acid mutations of 67 vaccine breakthrough SARS-CoV-2 sequences compared to the Wuhan-Hu-1 reference sequence as master (top line). Mutations are shown as ticks, color-coded according to the legend at the bottom. Key mutations are indicated by triangles and labeled. Study sequences are sorted according to the Neighbor-joining tree to the right (B). The scale bar at the bottom indicates the branch length of two substitutions. B.1.1.7 and B.1.526 sequences are highlighted in blue and purple, respectively.
Supplementary Methods
cDNA synthesis, library preparation and sequencing
Total RNA (11 μl) was converted to first strand cDNA by random priming using the Superscript IV first-strand synthesis system (Invitrogen, ref# 180901050). Random priming was performed as follows: 65°C for 5 mins, cooled on ice followed by addition of mix to run cDNA first strand synthesis 23°C 10mins, 50°C for 1 hour, 80°C for 10 mins, and 4°C hold. Libraries were prepared using Swift Normalase Amplicon Panel (SNAP) SARS-CoV-2 and Sar-Cov-2 additional Genome Coverage (Cat# SN-5×296 core kit, 96rxn), using 10 μl of first strand cDNA, following the manufacturer’s instruction: https://swiftbiosci.com/wp-content/uploads/2021/06/PRT-028-Swift-Normalase-Amplicon-Panels-SNAP-SARS-CoV-2-Panels-Rev-9.pdf). First step incorporated tiled primer pairs which target and enrich for the entire 29.9kb covid-19 viral genome (NCBI Reference Sequence NC_045512.2) during multiplex PCR. Multiplex PCR was run as follows: 98°C 30 secs, 4 repeating cycles of 98°C/ 10secs, 61°C/5mins, 65°C/1min. 24 repeating cycles of 98°C/10secs, 64°C/1min followed by 1 cycle of 65°C for 1 min and held at 4 °C. The multiplexed samples were cleaned with 1x volume room temperature Ampure XP beads (Beckman Coulter, #A63882) for 5 mins, washed with 80% ethanol twice, and beads were resuspended in 17.4ul of TE buffer (part of swift kit). Indexing reagent mix was added to the resuspended beads. One unique SNAP UD dual indexing primer pair was added to each sample. Indexing PCR was run at 37°C/20mins, 98°C/30secs, 9 repeating cycles of 98°C/10secs, 60°C/30secs, 66°C/1min. After indexing PCR was completed, samples were cleaned with 32.5ul of PEG NaCl solution (provided in swift kit, 5min bind time, 80% ethanol wash x2) and eluted in 20ul of post PCR TE buffer. Final libraries were run on Agilent Tapestation 2200 with high sensitivity DNA Screentape to verify amplicon size of about 450 bp. Samples were pooled according to the enzymatic normalase step. The loading concentration for the normalized pool was verified on QPCRusing the Kapa library quantification complete kit (Kapa-Roche, #KK4824) run on the Bio-Rad CFX384 Real-time system. Normalized pools were run on the Illumina NovaSeq 6000 system with the SP 300 cycle flow cell. Run metrics were paired end 150 cycles with dual indexing reads. Typically, 2 pools representing 2 full 96 well plates (192 samples) were sequenced on each SP300 NovaSeq flow cell.
Sequenced read processing
Binary base call files outputted by the Illumina NovaSeq 6000 sequencing instrument (RTA Version 3.4.4) were converted to FASTQ format and demultiplexed according to the unique dual-index adapter sequences of each sample using the Illumina bcl2fastq2 Conversion Software v2.20. Adapters and low quality bases were trimmed using Trimmomatic v0.36 (Bolger et al. 2014) with the settings ‘ILLUMINACLIP:trimmomatic.fa:2:30:10:1:true TRAILING:5 SLIDINGWINDOW:4:15 MINLEN:35’. The BWA-MEM algorithm of the alignment program, BWA v0.7.17 (Li and Durbin 2009), was utilized for mapping reads of each sample to the SARS-CoV-2 reference genome (NC_045512.2, wuhCor1) with the ‘-M’ flag for marking shorter split hits as secondary and the parameter ‘-U’ set to a 17 alignment score penalty for an unpaired read pair. Name-sorted mapped reads in SAM alignment format were soft-clipped to remove primer sequences specific to the short overlapping (tiled) amplicon design of the Swift Normalase Amplicon SARS-CoV-2 Panel using Primerclip v0.3.8 (Swift Biosciences: https://github.com/swiftbiosciences/primerclip), and the Picard v2.18.20 “AddOrReplaceReadGroups” tool (Broad Institute: http://broadinstitute.github.io/picard/) was used in converting the primer-trimmed SAM files to coordinate-sorted and indexed BAM format with read-groups added for downstream analysis. The variant caller, BCFtools v1.9 (Li et al. 2009), was utilized to detect genetic mutations of the collected viral samples from BAM input:
bcftools mpileup -r NC_045512.2 --count-orphans --no-BAQ --max-depth 50000 --max-idepth 500000 –annotate
FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,INFO/ADF,INFO/A
DR --output-type v |
bcftools call --ploidy 1 --keep-alts --multiallelic-caller --output-type v |
bcftools norm --multiallelics -any --output-type v |
bcftools filter -i “(DP4[0]+DP4[1])<(DP4[2]+DP4[3]) && ((DP4[2]+DP4[3])>0) &
FORMAT/AD[0:1]>20” --output-type v |
bcftools filter -e “IMF<0.5” --output-type v
To generate viral sequences, the output in VCF format was applied to the reference sequence using ‘bcftools consensus’ while the BEDTools v2.30.0 utilities (Quinlan and Hall 2010), “genomecov” (with the ‘-bga’ option to report depth at each base including zero coverage sites), “merge” (for creating a BED file of low depth sites with <1000x coverage to be masked), and “maskfasta” in conjunction with the multiple sequence alignment program, MAFFT v4.471 (Katoh et al. 2002) in ‘--auto’ mode, were used in the production of final consensus sequences with all bases below 1000x coverage masked and both non-targeted ends of the sequences trimmed. Sequences with fewer than 23,000 bp with at least 4000x coverage were discarded from further analysis for having not yielded a near-complete SARS-CoV-2 genome, resulting in 1,254 final sequences suitable for downstream analysis (Supplemental Table). Each assembled SARS-CoV-2 genome sequence was assigned the most likely phylogenetic lineage according to Phylogenetic Assignment of Named Global Outbreak Lineages (PANGO) nomenclature (Rambaut et al. 2020) using the PANGO lineage tool (version: 2021-06-05).
Acknowledgements
The authors wish to thank Vincent Setang and NYU Langone Health DataCore for support extracting the data for this study from clinical databases, Dr. Maria Agüero-Rosenfeld and the clinical laboratory technicians for assistance in testing, saving and retrieving specimens, especially Joanna Fung for her assistance with RNA extraction project. We also thank Dr. Joan Cangiarella for her continuous support of genomic surveillance for SARS-CoV-2 at NYULH, including providing institutional funding for this study. We are grateful to the authors and submitting laboratories who deposited data in GISAID, in particular to those whose sequences we used to create the phylogenetic tree. These are all named in Supplemental Table 5. The NYULH Genome Technology Center is partially supported by the Cancer Center Support Grant P30CA016087 at the Laura and Isaac Perlmutter Cancer Center.
Footnotes
Conflict of Interest Statement: The authors declare that no conflict of interest exists