Biomathematical models for genetic diversity analyses in complete genomes of SARS-CoV-2

Robson da Silva Ramos; Pierre Teodósio Felix; Dallynne Bárbara Ramos Venâncio; Cícero Batista do Nascimento Filho; Antônio João Paulino

doi:10.1101/2020.10.01.20205120

Abstract

In this work, we evaluated the levels of genetic diversity in 38 complete genomes of SARS-CoV-2 that are publicly available on the National Center for Biotechnology Information (NCBI) platform and come from six countries in South America (Brazil, Chile, Peru, Colombia, Uruguay and Venezuela, with 16, 11, 1, 2, 1 and 7 haplotypes, respectively), all with a length of 29,906 bp and Phred values ≥ 40. These haplotypes were previously used for phylogenetic analyses following the alignment protocols of the MEGA X software, in which all gaps and unconserved sites were removed before the construction of phylogenetic trees. Paired F_ST estimators, Analysis of Molecular Variance (AMOVA), genetic distance, mismatch, demographic and spatial expansion analyses, molecular diversity and evolutionary divergence time analyses were obtained using 20,000 random permutations.

1. Methodology

Databank: The 38 complete genome sequences of SARS-CoV-2 from South America (Brazil, Chile, Peru, Colombia, Uruguay and Venezuela, with 16, 11, 1, 2, 1 and 7 haplotypes, respectively), all with a length of 29,906 bp and Phred values ≥ 40, which now make up our study PopSet, were recovered from GenBank (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049&Completeness_s=complete&Region_s=South%20America) on August 21, 2020.

Phylogenetic analyses: The nucleotide sequences described above were used for phylogenetic analyses. The sequences were aligned using the MEGA X program (KUMAR et al., 2018) and the gaps were removed before the construction of phylogenetic trees.

Genetic Structuring Analyses: Paired F_ST estimators, Molecular Variance (AMOVA), Genetic Distance, mismatch, demographic and spatial expansion analyses, molecular diversity and evolutionary divergence time were obtained with the Software Arlequin v. 3.5 (EXCOFFIER et al., 2005) using 1000 random permutations (NEI and KUMAR, 2000). The F_ST and geographic distance matrices were not compared. All steps of this process are described below.

FOR GENETIC DIVERSITY

Among the routines of LaBECom, this test is used to measure genetic diversity, which is equivalent to the expected heterozygosity in the groups studied. For this we used the standard gene diversity index H described by Nei (1987), which can also be estimated by the method proposed by PONS and PETIT (1995).
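As an illustration of this index, Nei's (1987) gene diversity is commonly written as H = n/(n - 1) × (1 - Σ p_i²), where n is the number of gene copies and p_i is the sample frequency of haplotype i. The sketch below is a minimal Python version under that assumption; the function name and the toy labels are hypothetical, and it is not the Arlequin implementation.

```python
from collections import Counter


def gene_diversity(haplotypes):
    """Unbiased gene diversity H for a list of haplotype labels (assumed formula)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two gene copies")
    counts = Counter(haplotypes)
    # Sum of squared sample frequencies (the expected homozygosity).
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical toy sample: four gene copies, three distinct haplotypes.
sample = ["h1", "h1", "h2", "h3"]
H = gene_diversity(sample)
```

When every sampled copy carries a different haplotype, H reaches its maximum of 1; when all copies are identical, H is 0.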

FOR SITE FREQUENCY SPECTRUM (SFS)

According to LaBECom protocols, we used the site frequency spectrum (SFS) analysis, which uses DNA sequence data to estimate demographic parameters from the frequency spectrum. Simulations were made using the fastsimcoal2 software, available at http://cmpg.unibe.ch/software/fastsimcoal2/.

FOR MOLECULAR DIVERSITY INDICES

Molecular diversity indices are obtained by means of the average number of pairwise differences, as described by Tajima (1993). In this test we used sequences that do not fit the neutral theory model, which establishes the existence of a balance between mutation and genetic drift.
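The average number of pairwise differences can be sketched as follows. This is a minimal illustration on aligned sequences of equal length; the function names and the toy alignment are hypothetical, and gaps or ambiguity codes are simply treated as ordinary characters.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def mean_pairwise_differences(seqs):
    """Average number of differences over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy alignment of three sequences.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
pi_hat = mean_pairwise_differences(toy)
```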

FOR CALCULATING THETA ESTIMATORS

Theta population parameters are used in our laboratory when we want to quantify the genetic diversity of the populations studied. These estimates comprise Theta Hom, which estimates the expected homozygosity in a population at equilibrium between drift and mutation, together with Theta (S) (WATTERSON, 1975), Theta (K) (EWENS, 1972) and Theta (π) (TAJIMA, 1983).
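As an example of one of these estimators, Theta (S) is usually computed as S / a_n, where S is the number of segregating sites and a_n = Σ 1/i for i from 1 to n - 1. The sketch below assumes that standard form (Watterson, 1975); the function names and toy alignment are hypothetical.

```python
def segregating_sites(seqs):
    """Number of alignment columns that contain more than one state."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Theta (S) = S / a_n for a list of aligned sequences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


# Hypothetical toy alignment: two polymorphic columns among three sequences.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
theta_s = watterson_theta(toy)
```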

FOR THE CALCULATION OF THE MISMATCH DISTRIBUTION

In LaBECom, analyses of the mismatch distribution are always performed on the observed number of differences between pairs of haplotypes, in order to define or establish a pattern of population demographic behavior, as already described (ROGERS; HARPENDING, 1992; SLATKIN; HUDSON, 1991; RAY et al., 2003; EXCOFFIER, 2004).
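The observed mismatch distribution itself is simply the relative frequency of each number of pairwise differences. A minimal sketch, with a hypothetical function name and toy data, could look like this; fitting the expansion models to that distribution is a separate step not shown here.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Map each observed number of pairwise differences to its relative frequency."""
    counts = Counter(
        sum(1 for x, y in zip(a, b) if x != y)
        for a, b in combinations(seqs, 2)
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least two sequences")
    return {k: counts[k] / total for k in sorted(counts)}


# Hypothetical toy alignment of three sequences (three pairs).
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
distribution = mismatch_distribution(toy)
```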

FOR PURE DEMOGRAPHIC EXPANSION

This model is used when we intend to estimate the probability of observing a given number of differences between two non-recombining, randomly chosen haplotypes. In our laboratory, this methodology is applied when we assume that a haploid population at equilibrium expanded from size N_0 to size N_1 τ generations ago. In this case, the probability of observing S differences between two non-recombining, randomly chosen haplotypes is given by the probability of observing two haplotypes with S differences in this population (Watterson, 1975).

FOR SPATIAL EXPANSION

The use of this model in LaBECom is usually indicated if the range of a population is initially restricted to a very small area and there are signs that the population has since grown, in space, over a relatively short time. The resulting population generally becomes subdivided, in the sense that individuals tend to mate with geographically close individuals rather than with random individuals. To follow the dimensions of spatial expansion, we at LaBECom always take into account:

L: Number of loci

Gamma Correction: This correction is always used when mutation rates do not seem uniform for all sites.

nd: Number of substitutions observed between two DNA sequences.

ns: Number of transitions observed between two DNA sequences.

nv: Number of transversions observed between two DNA sequences.

ω: G + C ratio, calculated in all DNA sequences of a given sample.

Paired Difference: Shows the number of loci for which two haplotypes are different.

Percentage difference: The percentage of loci for which two haplotypes are different.
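Several of the quantities in this list can be counted directly from a pair of aligned sequences. The sketch below is a minimal illustration: the function names are hypothetical, only the four unambiguous bases are compared, and the percentage is taken over the full aligned length, which is an assumption.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_counts(seq_a, seq_b):
    """Count substitutions (nd), transitions (ns) and transversions (nv)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    ns = nv = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # identical site, gap or ambiguity code
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            ns += 1  # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            nv += 1  # purine <-> pyrimidine
    nd = ns + nv
    return {"nd": nd, "ns": ns, "nv": nv, "percent_diff": 100.0 * nd / len(seq_a)}


def gc_content(seqs):
    """G + C proportion over all sequences of a sample."""
    bases = "".join(seqs).upper()
    return (bases.count("G") + bases.count("C")) / len(bases)


# Hypothetical pair: A>G (transition), A>T (transversion), C=C, C>T (transition).
result = substitution_counts("AACC", "GTCT")
```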

FOR HAPLOTYPIC INFERENCES

We use these inferences for haplotypic data or for genotypic data with unknown gametic phase. Following our protocol, if haplotype i is observed x_i times in a sample of n gene copies, its estimated frequency is p̂_i = x_i / n. With genotypic data with unknown gametic phase, the frequencies of haplotypes are estimated by the maximum likelihood method, and can also be estimated using the Expectation-Maximization (EM) algorithm.

FOR THE METHOD OF JUKES AND CANTOR

This method, when used in LaBECom, allows estimating a corrected percentage of how different two haplotypes are. This correction allows for several substitutions per site since the most recent common ancestor of the two haplotypes studied. Here, we also assume identical substitution rates for all four nucleotides A, C, G and T.
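In its standard form, the Jukes and Cantor correction converts the observed proportion of differing sites p into a distance d = -(3/4) ln(1 - 4p/3). A minimal sketch under that assumption (the function name is hypothetical):

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor distance from the observed proportion of differing sites p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Example: 10% observed difference between two haplotypes.
d = jukes_cantor_distance(0.10)
```

The corrected distance is always at least as large as the observed proportion, because it accounts for multiple substitutions at the same site.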

FOR KIMURA METHOD WITH TWO PARAMETERS

Much like the previous test, this correction allows for multiple substitutions per site, but takes into account different substitution rates for transitions and transversions.
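The usual form of the Kimura two-parameter distance is d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. The sketch below assumes that form; the function name is hypothetical.

```python
import math


def kimura_2p_distance(P, Q):
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    term_1 = 1.0 - 2.0 * P - Q
    term_2 = 1.0 - 2.0 * Q
    if P < 0 or Q < 0 or term_1 <= 0 or term_2 <= 0:
        raise ValueError("P and Q are outside the range where the distance is defined")
    return -0.5 * math.log(term_1) - 0.25 * math.log(term_2)


# Example: 6% transitions and 2% transversions between two haplotypes.
d = kimura_2p_distance(0.06, 0.02)
```

When transversions are exactly twice as frequent as transitions (Q = 2P), the expression reduces to the Jukes and Cantor distance for p = P + Q.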

FOR TAMURA METHOD

We at LaBECom understand this method as an extension of the 2-parameter Kimura method that also allows for unequal nucleotide frequencies. The transition-transversion ratio as well as the overall nucleotide frequencies are calculated from the original data.

FOR THE TAJIMA AND NEI METHOD

At this stage, we were also able to produce a corrected percentage of nucleotides at which two haplotypes differ. This correction is an extension of the Jukes and Cantor method, with the difference that the nucleotide frequencies are computed from the original data.

FOR TAMURA AND NEI MODEL

As in Kimura's 2-parameter model and the Tajima and Nei distance, this correction allows for different rates of transversions and transitions, and it can also distinguish transition rates between purines and between pyrimidines.

FOR ESTIMATING DISTANCES BETWEEN HAPLOTYPES PRODUCED BY RFLP

We use this method in our laboratory when we need to obtain the number of pairwise differences by counting the number of different alleles between two haplotypes generated by RFLP.

TO ESTIMATE DISTANCES BETWEEN HAPLOTYPES PRODUCED BY MICROSATELLITES

In this case, what applies is a simple count of the number of different alleles between two haplotypes, or alternatively the sum of the squared differences in the number of repeats between two haplotypes (Slatkin, 1995).

MINIMUM SPANNING NETWORK

To calculate the distance between OTUs (operational taxonomic units) from the matrix of pairwise distances between haplotypes, we used a Minimum Spanning Network (MSN) tree, with a slight modification of the algorithm described in Rohlf (1973). We usually use a free program written in Pascal called MINSPNET.EXE, which runs under DOS and was previously available at: http://anthropologie.unige.ch/LGB/software/win/min-span-net/.
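For orientation, the sketch below builds a plain minimum spanning tree from a symmetric distance matrix with Prim's algorithm. It is not the modified Rohlf (1973) procedure nor the MINSPNET program: a full minimum spanning network would also keep alternative links of equal length, which this simplified version does not do. The function name and the toy matrix are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of (i, j, distance) edges connecting all OTUs.
    """
    n = len(dist)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Cheapest edge joining an OTU inside the tree to one outside it.
        d, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, d))
        in_tree.add(j)
    return edges


# Hypothetical pairwise distances between three haplotypes.
toy_matrix = [
    [0, 1, 4],
    [1, 0, 2],
    [4, 2, 0],
]
tree = minimum_spanning_tree(toy_matrix)
```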

FOR GENOTYPIC DATA WITH UNKNOWN GAMETIC PHASE

EM algorithm

To estimate haplotypic frequencies we used the maximum likelihood model with an expectation-maximization algorithm. The use of this algorithm in LaBECom makes it possible to obtain maximum likelihood estimates from multilocus data whose gametic phase is unknown (phenotypic data). It is a slightly more complex procedure because it does not allow a simple gene count, since individuals in a population can be heterozygous at more than one locus.
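The idea can be illustrated on the smallest ambiguous case: two biallelic loci in diploid individuals, where only double heterozygotes have an unknown phase (AB/ab or Ab/aB). The sketch below is a textbook EM iteration under that simplifying assumption, not the Arlequin routine; the function name, the input layout and the toy counts are hypothetical.

```python
def em_two_locus(known_counts, n_double_het, iterations=100):
    """EM estimate of haplotype frequencies for two biallelic loci.

    known_counts: haplotype counts ("AB", "Ab", "aB", "ab") taken from
                  genotypes whose phase is unambiguous.
    n_double_het: number of double heterozygotes (phase unknown).
    """
    haplotypes = ("AB", "Ab", "aB", "ab")
    total = sum(known_counts[h] for h in haplotypes) + 2 * n_double_het
    if total == 0:
        raise ValueError("no data")
    freqs = {h: 0.25 for h in haplotypes}  # uninformative starting point
    for _ in range(iterations):
        # E-step: probability that a double heterozygote is AB/ab rather than Ab/aB.
        cis = freqs["AB"] * freqs["ab"]
        trans = freqs["Ab"] * freqs["aB"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: expected haplotype counts, then renormalize.
        expected = {
            "AB": known_counts["AB"] + w * n_double_het,
            "ab": known_counts["ab"] + w * n_double_het,
            "Ab": known_counts["Ab"] + (1.0 - w) * n_double_het,
            "aB": known_counts["aB"] + (1.0 - w) * n_double_het,
        }
        freqs = {h: expected[h] / total for h in haplotypes}
    return freqs


# Hypothetical counts: only AB and ab are seen unambiguously, plus 5 double heterozygotes.
estimate = em_two_locus({"AB": 10, "Ab": 0, "aB": 0, "ab": 10}, n_double_het=5)
```

With no double heterozygotes the procedure reduces to simple gene counting, which is the case the paragraph above contrasts it with.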

ELB algorithm

Very similar to the previous algorithm, ELB attempts to reconstruct the unknown gametic phase of multilocus genotypes by adjusting the sizes and locations of neighboring loci to exploit some rare recombination events.

FOR NEUTRALITY TESTS

Ewens-Watterson homozygosity test

We use this test in LaBECom for both haploid and diploid data. This test is used only as a way to summarize the distribution of allelic frequency, without taking into account its biological significance. This test is based on the sampling theory of neutral alleles from Ewens (1972) and tested by Watterson (1978). It is now limited to sample sizes of 2,000 genes or less and 1,000 different alleles (haplotypes) or less. It is still used to test the hypothesis of selective neutrality and population equilibrium against natural selection or the presence of some advantageous alleles.

Exact Ewens-Watterson-Slatkin Test

This test, created by Slatkin in 1994 and adapted by the same author in 1996, is used in our protocols when we want to compare the probabilities of random samples with those of observed samples.

Chakraborty’s test of population amalgamation

This test, proposed by Chakraborty in 1990, calculates the probability of observing, in a random neutral sample, a number of alleles equal to or greater than that observed. It is based on the infinite-allele model and on the sampling theory for neutral alleles of Ewens (1972).

Tajima Selective Neutrality Test

We use this test in our laboratory when DNA sequences or haplotypes produced by RFLP are short. It is based on the Tajima (1989) test, using the infinite-sites model without recombination. It compares two estimators of the mutation parameter theta.
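The two estimators are the average number of pairwise differences and the number of segregating sites scaled by a_1 = Σ 1/i. A minimal sketch of the statistic, assuming the constants published by Tajima (1989), is given below; the function name and toy alignments are hypothetical, and no significance test is performed.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for aligned sequences (standardized difference of two theta estimates)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than four sequences")
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if S == 0:
        raise ValueError("no segregating sites")
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(1 for x, y in zip(a, b) if x != y) for a, b in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


# Hypothetical alignments: one with only singleton variants, one with intermediate-frequency variants.
star_like = ["AAAA", "TAAA", "ATAA", "AATA"]
balanced = ["AAAA", "AAAT", "AATT", "ATTT"]
d_star = tajimas_d(star_like)
d_balanced = tajimas_d(balanced)
```

An excess of rare variants, as in the first toy alignment, pulls the statistic below zero, while variants at intermediate frequency push it above zero.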

Fu's Fs Test of Selective Neutrality

Also based on the infinite-sites model without recombination, Fu's test is suitable for short DNA sequences or haplotypes produced by RFLP. However, in this case, it assesses the probability of observing a random neutral sample with a number of alleles equal to or less than the observed value. In this case the estimator used is θ.

FOR METHODS THAT MEASURE INTERPOPULATION DIVERSITY

Genetic structure of the population inferred by molecular variance analysis (AMOVA)

This stage is the most used in the LaBECom protocols because it reveals the genetic structure of populations by measuring their variances. This methodology, first defined by Cockerham (1969, 1973) and later adapted by other researchers, is essentially similar to other approaches based on analyses of gene frequency variance, but takes into account the number of mutations between haplotypes. Once the population groups are defined, we can specify a particular genetic structure to be tested; that is, we can build a hierarchical analysis of variance that divides the total variance into covariance components attributable to intra-individual differences, inter-individual differences and/or inter-population differences.

Minimum Spanning Network (MSN) among haplotypes

In LaBECom, this tree is generated using the operational taxonomic units (OTU). This tree is calculated from the matrix of paired distances using a modification of the algorithm described in Rohlf (1973).

Locus-by-locus AMOVA

We performed this analysis for each locus separately, in the same way as at the haplotypic level; the variance components and F-statistics are estimated locus by locus, which gives a more global picture.

Paired genetic distances between populations

This is the most frequent analysis in the work of LaBECom. It generates pairwise F_ST parameters that are routinely used to estimate the short-term genetic distances between the populations studied. In this model, a slight algorithmic adaptation is applied to linearize the genetic distance with the time of population divergence (Reynolds et al. 1983; Slatkin, 1995).

Reynolds Distance (Reynolds et al. 1983)

Here we measured how much pairs of fixed N-size haplotypes diverged over t generations, based on F_ST indices.

Slatkin’s linearized F_ST’s (Slatkin 1995)

We use this test in LaBECom when we want to know how much two haploid populations of size N have diverged since they split, t generations ago, from a population of identical size and have remained isolated, without migration, ever since. This is a demographic model and applies very well to the phylogeography work of our laboratory.
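Both transformations of F_ST have simple closed forms in the literature: the Reynolds distance is commonly written D = -ln(1 - F_ST) and Slatkin's linearized distance D = F_ST / (1 - F_ST). The sketch below assumes those forms; the function names are hypothetical, and the input 0.03 is used only as an illustrative value.

```python
import math


def reynolds_distance(fst):
    """Reynolds et al. (1983) distance, D = -ln(1 - F_ST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("F_ST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def slatkin_linearized_fst(fst):
    """Slatkin (1995) linearized distance, D = F_ST / (1 - F_ST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("F_ST must lie in [0, 1)")
    return fst / (1.0 - fst)


# Illustrative value only; for small F_ST both distances stay close to F_ST itself.
d_reynolds = reynolds_distance(0.03)
d_slatkin = slatkin_linearized_fst(0.03)
```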

Nei’s average number of differences between populations

In this test we assumed that the relationship between the raw (D) and net (D_A) number of Nei differences between populations is the increase in genetic distance between populations (Nei and Li, 1979).
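Nei's net number of differences is usually obtained by subtracting the average within-population diversity from the raw between-population value: D_A = D_xy - (D_x + D_y) / 2. A minimal sketch under that assumption follows; function names and toy populations are hypothetical.

```python
from itertools import combinations, product


def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def mean_within(pop):
    """Average pairwise differences within one population (0 for a single sequence)."""
    pairs = list(combinations(pop, 2))
    if not pairs:
        return 0.0
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)


def mean_between(pop_x, pop_y):
    """Raw average number of differences between two populations."""
    pairs = list(product(pop_x, pop_y))
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)


def nei_net_differences(pop_x, pop_y):
    """Net differences: raw between-population value minus mean within-population value."""
    return mean_between(pop_x, pop_y) - (mean_within(pop_x) + mean_within(pop_y)) / 2.0


# Hypothetical populations of two sequences each.
pop_1 = ["AAAA", "AAAT"]
pop_2 = ["TTAA", "TTAT"]
raw = mean_between(pop_1, pop_2)
net = nei_net_differences(pop_1, pop_2)
```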

Relative population sizes: divergence between populations of unequal sizes

We use this method in LaBECom when we want to estimate the time of divergence between populations of unequal sizes (Gaggiotti and Excoffier, 2000), assuming that two populations diverged from an ancestral population of size N0 t generations in the past, and that they have remained isolated from each other ever since. In this method we assume that even though the sizes of the two daughter populations are different, their sum always corresponds to the size of the ancestral population. The procedure is based on the comparison of intra- and inter-population diversities (π's), which have a large variance; this means that for short divergence times, the average diversity found within populations may be higher than that observed among populations. These calculations should therefore be made only if the assumptions of a pure fission model are met and if the divergence time is relatively old. The results of this simulation show that this procedure leads to better results than other methods that do not take into account unequal population sizes, especially when the relative sizes of the daughter populations are in fact unequal.

Exact population differentiation tests

We at LaBECom understand that this test is an analog of Fisher's exact test on a 2×2 contingency table, extended to an r×k contingency table. It has been described in Raymond and Rousset (1995) and tests the hypothesis of a random distribution of k different haplotypes or genotypes among r populations.

Assignment of individual genotypes to populations

Inspired by what had been described in Paetkau et al. (1995, 1997) and Waser and Strobeck (1998), this method determines the origin of specific individuals given a list of potential source populations, and uses the allelic frequencies estimated in each sample from their original constitution.

Detection of loci under selection from F-statistics

We use this test when we suspect that natural selection affects genetic diversity among populations. This method follows Beaumont and Nichols (1996), building on the earlier work of Cavalli-Sforza (1966) and of Lewontin and Krakauer (1973).

2. Results

Molecular Variance Analysis (AMOVA) and Genetic Distance

Genetic distance and molecular variance (AMOVA) analyses were not significant for the groups studied, presenting a variance component of 0.12 between populations and 4.46 within populations. The F_ST value (0.03) showed a low fixation index, with non-significant evolutionary divergences within and between groups, with a notable exception for haplotypes from Peru and Uruguay (Table 1; Figures 1 and 2).
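As a quick consistency check, and assuming the two values above are the among-population and within-population variance components of a two-level AMOVA, the fixation index is their ratio to the total variance. The function name below is hypothetical.

```python
def fixation_index(var_among, var_within):
    """F_ST as the share of total variance found among populations."""
    total = var_among + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_among / total


# Variance components as reported in the text (0.12 and 4.46).
fst = fixation_index(0.12, 4.46)
fst_rounded = round(fst, 2)
```

Under this assumption the ratio rounds to the reported F_ST of 0.03.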

Table 1.

Components of haplotypic variation and paired F_ST value for the 38 complete genome sequences of SARS-CoV-2 from South America.

Figure 1.

F_ST-based genetic distance matrix for the complete genome sequences of SARS-CoV-2 from six countries in South America. * Generated by the statistical package in R language using the output data of the Software Arlequin version 3.5.1.2

Figure 2.

Matrix of paired differences between the populations studied: between the groups; within the groups; and Nei distance for the complete genome sequences of SARS-CoV-2 from six countries in South America.

A significant similarity was also evidenced for the time of genetic evolutionary divergence among all populations, supported by τ variations, mismatch analyses and demographic and spatial expansion analyses, with a notable exception for haplotypes from Venezuela (Table 2; Figures 3, 4, 5 and 6).

Table 2.

Demographic and spatial expansion simulations based on the τ, θ, and M indices of sequences of the complete SARS-CoV-2 genomes from six South American countries.

Figure 3.

Comparison between the Demographic and Spatial Expansion of sequences of the complete genomes of SARS-CoV-2 from six countries in South America. (a and b) Graphs of demographic expansion and spatial expansion of haplotypes from Brazil, respectively; (c and d) Graphs of demographic expansion and spatial expansion of haplotypes from Venezuela, respectively. *Graphs Generated by the statistical package in R language using the output data of the Software Arlequin version 3.5.1.2

Figure 4.

Matrix of divergence time between the complete genomes of SARS-CoV-2 from six countries in South America. Note the high τ value between the sequences of Brazil and Venezuela. * Generated by the statistical package in R language using the output data of the Software Arlequin version 3.5.1.2.

Figure 5.

Matrix of inter haplotypic distance in the complete genomes of SARS-CoV-2 from Venezuela. Note the great variation between haplotypes. *Generated by the statistical package in R language using the output data of the Software Arlequin version 3.5.1.2.

Figure 6.

Matrix of inter haplotypic distance and number of polymorphic sites in the complete genomes of SARS-CoV-2 from six countries in South America. Note the great variation between haplotypes from Venezuela in relation to the others. *Generated by the statistical package in R language using the output data of the Software Arlequin version 3.5.1.2.

The molecular diversity analyses estimated by θ reflected a significant level of mutations among all haplotypes (transitions and transversions). Indel mutations (insertions or deletions) were not found in any of the six groups studied (Table 3). Tajima's D and Fu's Fs tests showed disagreements between the estimates of general θ and π, but with negative and highly significant values, indicating, once again, an absence of population expansions (Table 4). The raggedness index (R), with parametric bootstrap, simulated new θ values for before and after a supposed demographic expansion and in this case assumed a value equal to zero for all groups (Table 2; Figure 7).

Table 3.

Molecular Diversity Indices for the complete Genomes of SARS-CoV-2 from six countries in South America

Table 4.

Neutrality Tests for the complete Genomes of SARS-CoV-2 from six countries in South America

Figure 7.

Graph of molecular diversity indices for the complete genomes of SARS-CoV-2 from six countries in South America. In the graph the values of θ: (θk) Relationship between the expected number of alleles (k) and the sample size; (θH) Expected homozygosity in a balanced relationship between drift and mutation; (θS) Relationship between the number of segregating sites (S), sample size (n) and non-recombinant sites; (θπ) Relationship between the average number of paired differences (π) and θ. * Generated by the statistical package in R language using the output data of the Arlequin software version 3.5.1.2.

5. Discussion

Because phylogenetic and population-structure methodologies had not yet been applied to this PopSet, this study was able to detect six distinct groups among the complete genome sequences of SARS-CoV-2 from South America, but with minimal variation among them. The groups described here presented minimal structuring patterns, which were slightly higher for the populations of Brazil and Venezuela. These data suggest that the relative degree of structuring present in these two countries may be related to gene flow. These structuring levels were also supported by simple phylogenetic clustering methodologies such as UPGMA, which in this case showed a discontinuous pattern of genetic divergence between the groups (supporting the idea of possible sub-geographical isolations resulting from past fragmentation events), with relatively few branches in the resulting tree and few mutational steps.

These few mutations have possibly not yet been fixed by drift, owing to the lack of a founder effect, which accompanies the dispersal and/or loss of intermediate haplotypes over the generations. The values found for genetic distance support this continuous pattern of low divergence between the groups studied, since the permutation procedure treats even minimal differences between groups as relevant: haplotypes are exchanged between groups, and the p-value of the test is the proportion of permutations yielding values greater than or equal to the observed one.

The discrimination of the 38 genetic entities by locality was also reflected in their small inter-haplotypic variations, hierarchized across all covariance components (intra- and inter-individual differences, or intra- and intergroup differences). This generated a dendrogram supporting the idea that the significant differences found in countries such as Brazil and Venezuela were shared more in their form than in their number, since the estimates of average evolutionary divergence within these and other countries, where present, were very low.

Based on the high level of haplotypic sharing, tests that measure the relationship between genetic distance and geographic distance, such as the Mantel test, were dispensed with in this study. The θ estimators, even though they are extremely sensitive to any form of molecular variation (FU, 1997), supported the uniformity of the results found by all the methodologies employed. This can be interpreted as a phylogenetic confirmation that there is a consensus in the conservation of the SARS-CoV-2 genome in the South American countries covered by this study, so it seems safe to affirm that the small number of existing polymorphisms should be reflected in all their protein products. This consideration provides assurance that, although there are differences among the haplotypes studied, these differences are minimal across geographically distinct regions. It therefore seems reasonable to extrapolate the levels of polymorphism and molecular diversity found in the samples of this study to genomes from other South American countries, reducing speculation about the existence of rapid and silent mutations which, although they exist, as we have shown in this work, could significantly increase the genetic variability of the virus and make it difficult to work with molecular targets for vaccines and drugs in general.

Data Availability

The data that support the findings of this study are available from the corresponding author, Felix, P.T, upon request.

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049&Completeness_s=complete&Region_s=South%20America

Footnotes

Adapted from “Arlequin suite ver 3.5” in Excoffier, L. and H.E. L. Lischer (2010) Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources. 10: 564-567.

6. References

Beaumont MA, Nichols RA (1996) Evaluating loci for use in the genetic analysis of population structure. Proceedings of the Royal Society London B 263, 1619–1626.
Cavalli-Sforza LL (1966) Population structure and human evolution. Proc R Soc Lond B Biol Sci 164, 362–379.
Chakraborty, R. 1990 Mitochondrial DNA polymorphism reveals hidden heterogeneity within some Asian populations. Am. J. Hum. Genet. 47:87–94.
Cockerham, C. C., 1969 Variance of gene frequencies. Evolution 23: 72–83.
Cockerham, C. C., 1973 Analysis of gene frequencies. Genetics 74: 679–700.
Ewens, W.J. 1972 The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3:87–112.
Excoffier L. 2004. Patterns of DNA sequence diversity and genetic structure after a range expansion: lessons from the infinite-island model. Mol Ecol 13(4): 853–864.
Excoffier, L. and H.E. L. Lischer (2010) Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources. 10: 564–567.
Excoffier, L., Smouse, P., and Quattro, J. 1992 Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131:479–491.
Fu, Y.X. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915–925. (1997)
Gaggiotti, O., and L. Excoffier, 2000. A simple method of removing the effect of a bottleneck and unequal population sizes on pairwise genetic distances. Proceedings of the Royal Society London B 267: 81–87.
GenBank [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1982] - [cited 2020 Aug 21]. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049&Completeness_s=complete&Region_s=South%20America
Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: Molecular Evolutionary Genetics Analysis across computing platforms. (2018). Molecular Biology and Evolution 35:1547–1549.
Lewontin RC, Krakauer J (1973) Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74, 175–195.
Nei, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.
Nei, M., and W. H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc.Natl.Acad.Sci.USA 76:5269–5273.
Paetkau D, Calvert W, Stirling I and Strobeck C, 1995. Microsatellite analysis of population structure in Canadian polar bears. Mol Ecol 4:347–54.
Paetkau D, Waits LP, Clarkson PL, Craighead L and Strobeck C, 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147:1943–1957.
Pons, O.; Petit, J.R. Estimation, Variance and Optimal Sampling of Gene Diversity I. Haploid locus. Theor Appl Genet 90: 462–470, 1995.
Ray N, Currat M, Excoffier L. 2003. Intra-Deme Molecular Diversity in Spatially Expanding Populations. Mol Biol Evol 20(1): 76–86.
Raymond M. and F. Rousset. 1995 An exact test for population differentiation. Evolution 49:1280–1283.
Reynolds, J., Weir, B.S., and Cockerham, C.C. 1983 Estimation for the coancestry coefficient: basis for a short-term genetic distance. Genetics 105:767–779.
Rogers, A. R., and H. Harpending, 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552–569.
Rohlf, F. J., 1973. Algorithm 76. Hierarchical clustering using the minimum spanning tree. The Computer Journal 16:93–95.
Slatkin, M. 1996 A correction to the exact test based on the Ewens sampling distribution. Genet. Res. 68: 259–260.
Slatkin, M. 1994b An exact test for neutrality based on the Ewens sampling distribution. Genet. Res. 64(1):71–74.
Slatkin, M. 1995 A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457–462.
Slatkin, M.; Hudson, R. R. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics, 1991 Oct;129(2):555–62.
Tajima, F. 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460.
Tajima, F. 1989a. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.
Tajima, F. 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology, edited by Takahata, N. and Clark, A.G., Tokyo, Sunderland, MA:Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37–59.
Tamura K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G + C-content biases. Molecular Biology and Evolution 9:678–687. (1992).
Waser PM, and Strobeck C, 1998. Genetic signatures of interpopulation dispersal. TREE 43–44.
Watterson, G. 1978. The homozygosity test of neutrality. Genetics 88:405–417.
Watterson, G., 1975. On the number of segregating sites in genetical models without recombination. Theor.Popul.Biol. 7: 256–276.

Posted October 02, 2020.

Subject Area

Health Informatics

[1] Beaumont MA, Nichols RA (1996) Evaluating loci for use in the genetic analysis of population structure. Proceedings of the Royal Society London B 263, 1619–1626.
OpenUrl CrossRef

[2] Cavalli-Sforza LL (1966) Population structure and human evolution. Proc R Soc Lond B Biol Sci 164, 362–379.
OpenUrl CrossRef

[3] Chakraborty, R. 1990 Mitochondrial DNA polymorphism reveals hidden heterogeneity within some Asian populations. Am. J. Hum. Genet. 47:87–94.
OpenUrl PubMed Web of Science

[4] Cockerham, C. C., 1969 Variance of gene frequencies. Evolution 23: 72–83.
OpenUrl CrossRef PubMed Web of Science

[5] Cockerham, C. C., 1973 Analysis of gene frequencies. Genetics 74: 679–700.
OpenUrl Abstract/FREE Full Text

[6] ↵
Ewens, W.J. 1972 The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3:87–112.
OpenUrl CrossRef PubMed Web of Science

[7] ↵
Excoffier L. 2004. Patterns of DNA sequence diversity and genetic structure after a range expansion: lessons from the infinite-island model. Mol Ecol 13(4): 853–864.
OpenUrl CrossRef PubMed Web of Science

[8] Excoffier, L. and H.E. L. Lischer (2010) Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources. 10: 564–567.
OpenUrl

[9] Excoffier, L., Smouse, P., and Quattro, J. 1992 Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131:479–491.
OpenUrl Abstract/FREE Full Text

[10] ↵
Fu, Y.X. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915–925. (1997)
OpenUrl Abstract/FREE Full Text

[11] ↵
Gaggiotti, O., and L. Excoffier, 2000. A simple method of removing the effect of a bottleneck and unequal population sizes on pairwise genetic distances. Proceedings of the Royal Society London B 267: 81–87.
OpenUrl CrossRef PubMed Web of Science

[12] GenBank [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1982] - [cited 2020 Aug 21]. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%;20taxid:2697049&Completeness_s=complete&Region_s=South%20America

[13] ↵
Kumar S, Stecher G, Li M, Knyaz C; Tamura K. MEGA X: Molecular Evolutionary Genetics Analysis across computing platforms. (2018). Molecular Biology and Evolution 35:1547–1549.
OpenUrl CrossRef PubMed

[14] Lewontin RC, Krakauer J (1973) Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74, 175–195.
OpenUrl Abstract/FREE Full Text

[15] ↵
Nei, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.

[16] ↵
Nei, M., and W. H. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc.Natl.Acad.Sci.USA 76:5269–5273.
OpenUrl Abstract/FREE Full Text

[17] ↵
Paetkau D, Calvert W, Stirling I and Strobeck C, 1995. Microsatellite analysis of population structure in Canadian polar bears. Mol Ecol 4:347–54.
OpenUrl CrossRef PubMed Web of Science

[18] ↵
Paetkau D, Waits LP, Clarkson PL, Craighead L and Strobeck C, 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147:1943–1957.
OpenUrl Abstract/FREE Full Text

[19] Pons, O.; Petit, J.R. Estimation, Variance and Optimal Sampling of Gene Diversity I. Haploid locus. Theor Appl Genet 90: 462–470, 1995.
OpenUrl

[20] ↵
Ray N, Currat M, Excoffier L. 2003. Intra-Deme Molecular Diversity in Spatially Expanding Populations. Mol Biol Evol 20(1): 76–86.
OpenUrl CrossRef PubMed Web of Science

[21] ↵
Raymond M. and F. Rousset. 1995 An exact tes for population differentiation. Evolution 49:1280–1283.
OpenUrl CrossRef Web of Science

[22] ↵
Reynolds, J., Weir, B.S., and Cockerham, C.C. 1983 Estimation for the coancestry coefficient: basis for a short-term genetic distance. Genetics 105:767–779.
OpenUrl Abstract/FREE Full Text

[23] ↵
Rogers, A. R., and H. Harpending, 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552–569.
OpenUrl CrossRef PubMed Web of Science

[24] ↵
Rohlf, F. J., 1973. Algorithm 76. Hierarchical clustering using the minimum spanning tree. The Computer Journal 16:93–95.
OpenUrl

[25] Slatkin, M. 1996 A correction to the exact test based on the Ewens sampling distribution. Genet. Res. 68: 259–260.
OpenUrl CrossRef PubMed Web of Science

[26] Slatkin, M. 1994b An exact test for neutrality based on the Ewens sampling distribution. Genet. Res. 64(1):71–74.
OpenUrl CrossRef PubMed Web of Science

[27] ↵
Slatkin, M. 1995 A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457–462.
OpenUrl Abstract/FREE Full Text

[28] ↵
Slatkin, M.; Hudson, R. R. Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics, 1991 Oct;129(2):555–62.
OpenUrl Abstract/FREE Full Text

[29] ↵
Tajima, F. 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460.
OpenUrl Abstract/FREE Full Text

[30] Tajima, F. 1989a. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.
OpenUrl Abstract/FREE Full Text

[31] Takahata, N. and
Clark, A.G.
Tajima, F. 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology, edited by Takahata, N. and Clark, A.G., Tokyo, Sunderland, MA:Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37–59.

[32] Takahata, N. and

[33] Clark, A.G.

[34] Tamura K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G + C-content biases. Molecular Biology and Evolution 9:678–687. (1992).
OpenUrl CrossRef PubMed Web of Science

[35] ↵
Waser PM, and Strobeck C, 1998. Genetic signatures of interpopulation dispersal. TREE 43–44.

[36] ↵
Watterson, G. 1978. The homozygosity test of neutrality. Genetics 88:405–417.
OpenUrl Abstract/FREE Full Text

[37] ↵
Watterson, G., 1975. On the number of segregating sites in genetical models without recombination. Theor.Popul.Biol. 7: 256–276.
OpenUrl
