Abstract
A large volume of SARS-CoV-2 genomic data has accumulated. To timely analyze and visualize the exponentially increasing viral genomic sequences, we developed the Coronavirus GenBrowser (CGB) based on the framework of distributed genome alignments and the evolutionary tree built on an existing subtree. Among the 330,942 high-quality SARS-CoV-2 genomic sequences analyzed, 253,798 mutations were identified, and three prevalent European variants were found to have no mutations in three months. This analysis also revealed a strain dated early March 2020 causing an outbreak in Beijing after three months of dormancy. Another strain with S:D614G was found to have 671 identical descendants spreading in six continents between February 2020 and January 2021. Results of this study show that CGB is an efficient platform for monitoring the dynamics of SARS-CoV-2 transmission.
One Sentence Summary CGB enables large-scale analysis of the latest SARS-CoV-2 genomic sequences to timely reveal the dynamics of COVID-19 pandemic.
Main Text
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) 1-3 has infected more than 116 million people, and more than 2.5 million people have died from COVID-19. Several web browsers have been developed to track the COVID-19 pandemic. The browser of the Johns Hopkins Coronavirus Resource Center displays global and regional COVID-19 status, including numbers of confirmed cases, deaths, tests, hospitalizations, and vaccinations (https://coronavirus.jhu.edu/). The COVID-3D browser, developed by the University of Melbourne, shows the three-dimensional structures of various wild-type and mutant proteins of SARS-CoV-2 based on their nucleotide sequences 4. The UCSC SARS-CoV-2 Genome Browser is derived from the well-established UCSC genome-browser for visualization of nucleotide and protein sequences, sequence conservations, potential epitopes for production of antibodies, primers for RT-PCR and sequencing, and many other properties of specific parts of wild-type and variants of SARS-CoV-2 5. The WashU Virus Genome Browser provides Nextstrain-based phylogenetic-tree view and genomic-coordinate, track-based view of genomic features of viruses 6. It also has many functions similar to those of the UCSC SARS-CoV-2 Genome Browser such as visualization of sequence variations, genomic sites for diagnostic tests, and immunogenic epitopes. In addition, it continuously updates repository of SARS-CoV-2 genomic data.
Nextstrain is a browser for real-time tracking of pathogen evolution. It hosts tools for phylodynamic analysis of genomic sequences of pathogens of both endemic and pandemic diseases. Nextstrain has allowed analysis of genomic sequences of approximately 4,000 strains of SARS-CoV-2 and investigation of its evolution 7. As more than 705,000 SARS-CoV-2 genomic sequences have been deposited in public databases, such as National Center for Biotechnology Information (NCBI) GenBank 8 and Global Initiative on Sharing All Influenza Data (GISAID) 9,10, analysis of these data has far exceeded the capacity of Nextstrain. New approaches are needed to accomplish this task. Therefore, we developed the Coronavirus GenBrowser (CGB).
To allow timely analysis of a large number of SAR-CoV-2 genomic sequences, we first solved the problem that all viral genomic sequences have to be re-aligned when nucleotide sequences of new genomes become available. This is extremely time consuming. With the distributed alignment system (Figure 1), we dramatically reduced the total time required for the alignment. We also built the evolutionary tree on the existing tree with new genomic data in order to reduce the complexity of tree construction. With these modifications, hundreds of thousands of SARS-CoV-2 genomes can be timely analyzed with data easily shared and visualized (Figure 1). For genomic sequence alignments, high quality SARS-CoV-2 genomic sequences were obtained from the 2019nCoVR database 11,12, which is an integrated resource based on GenBank, GISAID 9,10, China National GeneBank DataBase (CNGBdb) 13, the Genome Warehouse (GWH) 14, and the National Microbiology Data Center (NMDC, https://nmdc.cn/). The sequences were aligned 15 to that of the reference genome 3 and presented as distributed alignments. Genomic sequences of bat coronavirus RaTG13 16, pangolin coronavirus PCoV-GX-P1E 17, and SARS-CoV-2 strains collected before January 31, 2020 were jointly used to identify ancestral alleles of SARS-CoV-2. Mutations in strains of each branch of the evolutionary tree were indicated according to the principle of parsimony 18. A highly effective maximum-likelihood method (TreeTime) 19 was used to determine the dates of internal nodes with very minor revisions.
In total, 330,942 high quality SARS-CoV-2 genomic sequences were analyzed, and 253,798 (recurrent) mutations were identified. With sliding window analysis, 971 mutation cold spots were found with a false discovery rate (FDR) corrected P-value < 0.01 (Figure S5, Supplemental excel file). The coldest spot is located in ORF1a, which encodes nsp3 phosphoesterase (nucleotides 7,394 – 7,419) (FDR corrected P-value = 4.28 × 10−29). These mutation cold spots may be key functional elements of SARS-CoV-2 and can potentially be used for vaccine development and targets for detection.
The genome-wide mutation rate of coronaviruses has been determined to be 10−4− 10−2per nucleotide per year 20. As this range of mutation rate is too wide, we decided to estimate more precisely the genome-wide mutation rate (μ) of SARS-CoV-2 and determined that μ = 6.8017 × 10−4per nucleotide per year (95% confidence interval: 5.4262 to 8.2721 × 10−4). The estimated μ was lower than that of other coronaviruses, such as SARS-CoV (0.80 to 2.38 × 10−3 per nucleotide per year) 20 and MERS-CoV (1.12 × 10−3 per nucleotide per year) 21. It was slightly lower than that determined by other investigators (9.90 × 10−4per nucleotide per year) 22. Various mutation rates were found in different regions of the SARS-CoV-2 genome. The mutation rate of each gene is shown in Table S1.
Similar to Nextstrain 7, the pre-analyzed genomic data of SARS-CoV-2 variants and the associated metadata on CGB are shared with the general public. The size of distributed alignments is 9,498 Mb for the high-quality 330,942 SARS-CoV-2 genomic sequences. The tree-based data format allows the compression ratio to reach 2,499:1, meaning that the size of compressed core file containing pre-analyzed genomic data and associated metadata is as small as 3.80 Mb (Figure 1). This approach ensures low-latency access to the data and enables fast sharing and re-analysis of a large number of SARS-CoV-2 genomic variants.
To visualize, search, and filter the results of genomic analysis, both desktop standalone and web-based user-interface of CGB were developed. Similar to the UCSC SARS-CoV-2 Genome Browser 5 and the WashU Virus Genome Browser 6, six genomic-coordinate annotated tracks were developed to show genome structure and key domains, allele frequencies, sequence similarity, multi-coronavirus genome alignment, and primer sets for detection of various SARS-CoV-2 strains (Figure S11). To efficiently visualize the results of genomic analysis, movie-making ability was implemented for painting the evolutionary tree, and only elements shown on the screen and visible to the user would be painted. This design makes the visualization process highly efficient, and the evolutionary tree of more than one million strains can be visualized simultaneously.
CGB detects on-going positive selection based on S-shaped frequency trajectory of a selected allele (Figures S18, S19). It has been shown that the SARS-CoV-2 variant with G614 spike protein has a fitness advantage 23. Our analysis using CGB confirmed this finding even when the G614 frequency was very low (< 10%) (Figure 2), indicating that CGB can detect putative advantageous variants before they become widely spread. GCB also predicted an increase in the frequency of S:HV69-deletion of the B.1.1.7 linage 24 (Figure 2), suggesting that variants with the deletion may be advantageous. As an increase in mutation frequency could be due to sampling bias and epidemiological factors 23, putative advantageous variants should be closely monitored.
Using CGB, we analyzed branch-specific accelerated evolution of SARS-CoV-2 and found that 246 internal branches of the evolutionary tree (FDR corrected P < 0.05, Supplemental excel file) had significantly more mutations. All evolution-accelerated variants were not found to spread significantly faster than other variants during the same period of time as the majority (168/225 =74.6%) of these variants had relatively fewer descendants. This observation suggests that these variants are not highly contagious. It is likely that these evolution-accelerated variants were evolved neutrally instead of adaptively.
CGB is also an efficient platform to investigate local and global transmission of COVID-19 (Figure 3). There was an outbreak in Qingdao, China 25 after two dock workers were found to have asymptomatic infections on September 24, 2020. CGB lineage tracing revealed that the sequence of a sample collected from the outer packaging of cold-chain products is identical to that of the most recent common ancestor of the two strains isolated from the two dock workers (Figure 3B), suggesting that infection of these two individuals was cold-chain related.
CGB lineage tracing also revealed the difficulty in controlling COVID-19 pandemic. There was an outbreak in Xinfadi, Beijing, China 26,27. The sequences of two isolates (Beijing/IVDC-02-06 and Beijing/BJ0617-01-Y), collected from two Xinfadi cases on June 11 and 14, 2020, were found to be identical to the sequence of the variant (CGB4268.5142) (Figure 3C) dated March 6, 2020 (95% CI: February 28 – March 17, 2020). This variant was found to spread to Taiwan, India, Czech Republic, England, Denmark, and Colombia and caused the outbreak in Beijing three months later. It was also detected in the United States (NY-Wadsworth-21013812-01/2020) on December 18, 2020. These two Xinfadi strains were found to evolve significantly slowly (P =0.0043 and 0.0051, respectively) because no mutations were detected between March and June 2020, consistent with the hypothesis of cold-chain related transmission 27.
CGB is a powerful tool for identification of global and regional transmission of mutation-dormant strains as it can determine whether the mutation rate of a specific strain is significantly lower than the average mutation rate of the entire set of strains. This lineage-specific reduced mutation rate could be due to a long period of dormancy caused by the yet to be confirmed cold-chain preservation or other reasons. Among the 330,942 SARS-CoV-2 strains, 7,534 strains were found to evolve significantly slowly (FDR corrected P =9.68 × 10−9∼0.04, Supplemental excel file) and did not mutate in 100 days. In addition, three closely related variants were found to have no mutations in 3 months before their descendants were found to widely spread in Europe in late July 2020 (Figure 4A). Another variant with C3037T, A23403G (S:D614G), C14408T (ORF1b:P314L), and C241T, emerged in early December 2019. This variant had 671 identical descendants spreading in six continents between February 2020 and January 2021, indicating that it has no additional mutations in one year (Figure 4B). The most (582/671 = 86.7%) of the descendants were found in Europe and North America. These results (Figures 3 and 4) suggest that the transmission of mutation-dormant variants also plays a major role in COVID-19 pandemic.
All timely-updated data are freely available at https://bigd.big.ac.cn/ncov/apis/. The desktop standalone version (Figure S10A) provides the full function of CGB and has a plug-in module for the eGPS software (http://www.egps-software.net/) 28. Although the web-based CGB is a simplified version (https://www.biosino.org/genbrowser/ and https://bigd.big.ac.cn/genbrowser/) and designed mainly for educational purpose, it provides a convenient way to access the data via a web browser, such as Google Chrome, Firefox, and Safari (Figure S10B, C). The web-based CGB package can be downloaded and reinstalled on any websites. Nine language versions (Chinese, English, German, Japanese, French, Italian, Portuguese, Russian, and Spanish) are available.
Data Availability
The coronavirus genomic sequences used in this study were obtained from the 2019nCoVR database 11. Timely updated data of genomic sequences of SARS-CoV-2 variants are shared with the general public at https://bigd.big.ac.cn/ncov/apis/. The free software (desktop and web-based versions) can be downloaded from http://www.egps-software.net/.
Members of the language translation team
German: Ning He6, Jing Lv6, Ting Peng6
Italian: Ting Zhou6, Nan Yang6, Siyi Hou6
Portuguese: Huang Li6, Jingxuan Yan6, Chenglin Zhu6, Wenjing Liu6
Russian: Yuhong Guan6, Huanxiao Song6
Spanish: Qin Zhou6, Han Gao6, Jinglan He6, Tiantian Li6, Ruiwen Fei6, Shumei Zhang6
French: Yuyuan Guo6
Author contributions
YHP, GQZ, WZ, and HL designed the study; DY, XY, BT, YHP, JY, JZ, GD, ZQH, HM, LD, GQZ, WZ, and HL wrote the code and developed CGB; DY, XY, BT, YHP, JY, JZ, GD, ZQH, WH, XS, GQZ, WZ, and HL acquired, analyzed, and interpreted the data; LM, MZ, YC, GD, TJ, and CL integrated and curated the source data; members of the language translation team translated CGB into multiple languages; DY, YHP, JY, JZ, GQZ, WZ, and HL wrote the manuscript. All authors have approved the submitted version.
Competing interests
The authors declare no competing interests.
Data and materials availability
The coronavirus genomic sequences used in this study were obtained from the 2019nCoVR database 11. Timely updated data of genomic sequences of SARS-CoV-2 variants are shared with the general public at https://bigd.big.ac.cn/ncov/apis/. The free software (desktop and web-based versions) can be downloaded from http://www.egps-software.net/.
Acknowledgments
We thank Ya-Ping Zhang for providing valuable advices and encouragement and the researchers who generated and deposited sequence data of SARS-CoV-2 in GISAID, GenBank, CNGBdb, GWH, and NMDC making this study possible. This work was supported by a grant from the National Key Research and Development Project (No. 2020YFC0847000).