SUMMARY
Background The knowledge of clinically relevant markers distribution might become a useful tool in COVID-19 therapy using personalized approach in the lack of unified recommendations for COVID-19 patients management during pandemic. We aimed to identify the frequencies and distribution patterns of rs11385942 and rs657152 polymorphic markers, associated with severe COVID-19, among populations of the world, as well at the national level within Russia. The study was also dedicated to reveal whether population frequencies of both polymorphic markers are associated with COVID-19 cases, recovery and death rates.
Methods We genotyped 1883 samples from 91 ethnic populations from Russia and neighboring countries by rs11385942 and rs657152 markers. Local populations which were geographically close and genetically similar were pooled into 28 larger groups. In the similar way we compiled a dataset on the other regions of the globe using genotypes extracted or imputed from the available datasets (32 populations worldwide). The differences in alleles frequencies between groups were estimated and the frequency distribution geographic maps have been constructed. We run the correlation analysis of both markers frequencies in various populations with the COVID-19 epidemiological data on the same populations.
Findings The cartographic analysis revealed that distribution of rs11385942 follows the West Eurasian pattern: it is frequent in Europeans, West Asians, and particularly in South Asians but rare or absent in all other parts of the globe. Notably, there is no abrupt changes in frequency across Eurasia but the clinal variation instead. The distribution of rs657152 is more homogeneous. Higher population frequencies of both risk alleles correlated positively with the death rate. For the rs11385942 we can state the tendency only (r=0,13, p=0.65), while for rs657152 the correlation was significantly high (r=0,59, p=0,02). These reasonable correlations were obtained on the Russian dataset, but not on the world dataset.
Interpretation Using epidemiological statistics on Russia and neighboring countries we revealed the evident correlation of the risk alleles frequencies with the death rate from COVID-19. The lack of such correlations at the world level should be attributed to the differences in the ways epidemiological data have been counted in different countries. So that, we believe that genetic differences between populations make small but real contribution into the heterogeneity of the pandemic worldwide. New studies on the correlations between COVID-19 recovery/mortality rates and population’s gene pool are urgently needed.
Introduction
A novel coronavirus known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) was first identified in China in December 2019 and due to its fast spread in three months it was identified as a pandemic by the World Health Organization [1]. The pathogen causes disease called coronavirus disease 2019 (COVID-19).
The disease course varies drastically among patients, from asymptomatic infection to death from the respiratory failure. It was reported that patients with severe COVID-19 were older and/or with a greater number of comorbid conditions [2]. Epidemiological indicators show that morbidity and mortality is highly variable not only between individuals but also among countries. So, to the date of paper writing the rate of COVID-19 cases in North European countries (Denmark, Estonia, Finland, Norway) was relatively low, while southern countries have experienced higher morbidity and mortality rates (Italy, Spain, and France) [3]. This can result from different social-demographic factors, environment factors (see [4]–[6] for correlation with the vitamin D levels) and, importantly, genetic factors. The genome-wide association study replicated on Italian and Spanish patients identified two key genome regions caused severe respiratory failure in COVID-19 patients [7]. The first site, rs11385942 (insertion-deletion G>GA) is located within a cluster of six genes, whose functions are relevant to COVID-19. Notably, the association is evident for the entire haplotype block of 50kb length. So high LD in this site, unusual for this segment of chromosome 3, resulted from an evolutionary recent introduction of this haplotype into human gene pool from Neanderthals [8]. The second site, demonstrating less pronounced association with severe Covid-19 (rs657152) is located in the gene encoding ABO blood groups, which may also determine the severity of respiratory failure in COVID-19 patients. The paper presenting both associated sites [7] stated that their frequencies vary among 1000 Genomes project populations, but has not described the pattern of their variation.Thus, in our study we focus on the global population variation of both genomic sites, associated with the severe Covid-19.
We analyzed the variation of both SNPs in worldwide populations, based on this study dataset (1883 individuals) from North Eurasian countries and on the database compiled from the published data on other regions of the globe (3088 individuals). We aimed to unravel the patterns of geographic variation of these polymorphisms and investigate whether these genetic factors contributed to the variety of COVID-19 epidemiological parameters between different populations. In the lack of unified recommendations for COVID-19 patients’ management the knowledge of clinically relevant markers distribution may become a useful tool in COVID-19 therapy using personalized approach.
Therefore, the initial goal of this study was to identify the frequency of rs11385942 and rs657152 markers among populations at the global level as well as at the national level within Russia. The latter level is important due to the high ethnic heterogeneity of populations from Russia and neighbor states. The second goal was to determine whether there is any connection between the population frequencies of rs11385942 and rs657152 polymorphic markers and COVID-19 cases, recovery and deaths rates.
Materials and methods
Generating the dataset on the Russian and neighboring populations
The samples from populations of Russia and neighbor North Eurasian counties (former USSR and Mongolia) were received from Biobank of North Eurasia [9]. All sample donors gave the written informed consent, approved by the Ethical Committee of the Research Centre for Medical Genetics. The set of analyzed samples covers the large area of Russian and neighboring countries and includes most of ethnic groups from this area (Figure 1). The sampling strategy (similar to the one of the 1000 Genomes project) was focused on the indigenous populations; for example in East Siberia we sampled Yakuts and Evenks rather than ethnic Russians who migrated to Siberia within last 1-3 centuries.
In total, we genotyped 1883 samples from 91 ethnic populations from Armenia, Azerbaijan, Belorussia, Georgia, Kazakhstan, Kirgizia, Lithuania, Mongolia, Russia, Tajikistan, Ukraine, and Uzbekistan. We focused on the two markers, rs11385942 and rs657152, associated with the severe Covid-19 [7]. We genotyped these samples by Infinium Omni5Exome-4 v1.3 BeadChip Kit (Illumina, USA) which includes 4,5 million SNPs. For this study we extracted genotypes of rs657152 and genotypes of 250 SNPs located in the vicinity of rs11385942 (positions 45700000-45900000 on the chromosome 3, Human Genome Build 37).
Most populations were represented by few samples each, while we needed sample size about 100 chromosomes per population to estimate the frequencies of the COVID-associated markers. Thus, local populations which were geographically close and genetically similar were pooled into larger groups. This pooling was made based on the genetic relationship between populations identified by means of the entire dataset of 4,5 million SNPs. The pooling procedure included i) standard filtering in PLINK [10] (geno 0.05, maf 0.01, mind 0.1, pruning indep-pairwise 1500 150 0.2), ii) running the set of PCAs by smartpca software [11], [12], and iii) identifying groups of populations with homogeneity within the groups but pronounced variation between the groups; we omitted some populations which could not stand alone because of the low sample size and could not be pooled because of their genetic peculiarities; the details of this pooling procedure are described in [13]. The pooling resulted in the final dataset of 1785 individuals from 28 pooled populations with the average sample size 128 chromosomes (64 individuals). Table 1 presents the list of these populations (“Russian dataset” section). Their geographic locations are present on the map (Figure 1).
Compiling the dataset on the world populations
In addition to the dataset on the Russian populations, we compiled a dataset on the other regions of the globe. The data came from the 16 papers [14] - [29] which studied indigenous populations by using genome-wide Illumina arrays (Illumina700k, Illumina730k, Illumina660k, Illumina650k, Illumina610k, Illumina550k, and Illumina1M). The resulting merged world dataset included 3336 samples, genotyped by rs657152 and 66 SNPs in the vicinity of rs11385942.
Similar to the Russian dataset we pooled some population samples to achieve larger sample sizes. In most cases we merged samples which came from the same country. Sometimes we have to merge data on the neighbor countries, for example data from Spain and France were merged. Two large countries – China and India – were present by 3 merged populations each. The samples from Russia and neighboring counties present in the world dataset were ignored to avoid double counting the same populations/samples. In total, the world dataset included 32 populations with the average sample size 104 chromosomes. Table 1 presents the list of the analyzed populations (“world dataset” section), their geographic locations are shown on the Figure 1.
Imputing
One of the two Covid-associated markers, rs657152, was directly genotyped in both, Russian and world datasets. The frequencies of this marker in all populations were counted using PLINK software and are present in the Table 1. The second marker, rs11385942, is missed in all genome-wide panels available to date. The initial genome-wide association study on COVID-19 [7] genotyped 712,189 SNPs, then imputed 170 million SNPs, and one out of these imputed SNPs, namely rs113859420, demonstrated the highest association with severe COVID-19. Actually there is a haplotype block (approx. 50 kb length [8]) in strong LD, and rs113859420 marks this block nearby as effective as any other SNP in the vicinity. In our study, for the Russian dataset we used a few times more dense panel (4,5 million SNPs) than the panel in [7]. Our panel included 250 SNPs in the vicinity of rs113859420. Imputing was done by the Beagle software [30] by 200 iterations, using 1000 Genomes project dataset as a reference.
As the world dataset was based on the regular (less dense) Illumina panels, including 66 SNPs in the vicinity of rs113859420, we first estimated the imputation quality. For this aim, we compared the genotypes generated for the same samples by using 250 SNPs and 66 SNPs. 1871 out of 1883 genotypes (99,4%) coincided, indicating that regular Illumina panels work well for imputing rs113859420 genotypes. These genotypes were used to count the allele frequencies of rs113859420 in the populations from the world dataset (Table 1).
Cartographic analysis
The maps of the geographic distribution of the two COVID-associated markers were constructed using the GeneGeo software [31], [32]. The maps of North Eurasia (Russia and neighboring counties) were created with the generalized Shepard’s method, the degree of weight function set to 3 and radius of influence of 1,500 km. The world maps were created by the same method with radius enlarged up to 2,500 km (to fill in gaps between studied populations) and the degree of weight function set to 2 to generate a smoother surface, i.e. trend [33]).
Statistical analysis
To provide correlation analysis we searched for the number of cases of COVID-19 per 1 M population, recovery rate per 1 M population and mortality caused by this disease per 1 M population, and mortality per finished COVID-19 cases (last updated on September 18, 2020). Because genetic data in each region described the indigenous populations while recovery/mortality rates were available for the total population of a region, we identified 16 regions of Russia and neighboring countries where indigenous populations constitutes the majority (on average, 85% according to the last census) and run correlation analysis on these 16 groups (Table 3). For correlation analysis at the world level we used the same 16 groups from Russian and neighboring countries as well as all groups from the world dataset except data on Native Americans and Greenlanders (Table 4).
To estimate the differences in allele frequencies between groups the Fisher’s exact test was used using the application of GraphPad InStat software package (GraphPad Software, San Diego, CA, USA); the differences were considered statistically significant at p<0.05 with the Bonferroni correction. To determine the correlation between the alleles frequencies and epidemiological parameters per 1 million the Pearson correlation coefficient was calculated using Statsoft Statistica (Dell Statistica, Tulsa, OK, USA); the differences were considered statistically significant at p<0.05 with the Bonferroni correction.
Results
Global variation of rs11385942
This single nucleotide insertion was proved to have strongest association with the severe course of SARS-COVID-19 [7]. To unravel its worldwide variation we genotyped 1883 samples from Russian and neighboring countries, and compiled the dataset on 3088 samples from other regions of the globe. The final worldwide dataset contains frequencies of rs11385942_GA in 60 populations (Table 1). Figure 2 presents the map of this marker’s variation worldwide. The highest frequency (20-30%) was found in South Asia, followed by West Asia and Europe (5-15%). This marker is rare or not detected in East Asia, North Asia (Siberia), Native Americans, and in Sub-Saharan Africa, while present in North Africa. The data on South-East Asia and Papua are scarce but indicate that this insertion might be present there at elevated frequencies. Generally, the worldwide spread of this marker follows the “West Eurasian” pattern, well-known in human populations genetics for many other genetic markers, including mitochondrial DNA and Y-chromosomal markers [34], [35]. The South Asian populations have much stronger genetic affinities with West Eurasians than with East Eurasians. The specific feature of rs11385942 distribution pattern is that its frequency is higher in South Asian part rather than in West Asian/European part of the West Eurasian gene pool.
This description of a global pattern is inevitable schematic. To compensate this, we coupled the analysis at the worldwide level with the detailed analysis at the national level of the largest country.
Variation of rs11385942 within Russia
Our extensive Russian dataset allows investigate the variation of rs11385942 within the area of Russia and neighboring countries in more details. Frequencies are available in the Table 1 and visualized on the map (Figure 3). The map reveals that the difference between higher frequencies in Europe and absence in North Asia (which is obvious on the world map) is not sharp. Instead, there is a pattern of clinal variation, i.e. gradual decrease in frequency throughout 7 thousand kilometers, from the frequencies 13-16% in Ukraine, Belorussia and westernmost Russian populations to the zero frequency in Kamchatka and Chukotka peninsulas at the Pacific coast. Generally, in populations of European Russia the average frequency is 11% while in Siberia it is only 3% (Table 1). In Central Asian countries the frequencies are generally low (1-4%). Tajikistan is the notable exception (frequency is 14%, significant difference from neighbor countries) because Tajikistan population is geographically and genetically close to South Asian populations which exhibit the world maximum of rs11385942.
Focusing on details of the frequency distribution, one may note that rs11385942 frequency in the southwestern regions of Russia is quite similar in comparison with bordering Belarussia and Ukraine - the differences were insignificant (Table 1). In the Caucasus, the lowest frequency of the risk GA allele variant was found in the central regions (Chechnya, Ingushetia, North Ossetia) - 6%, with higher frequency in western and eastern regions (≈ 10%), and particularly in South Caucasus (14%).
Global variation of rs657152
Figure 4 presents the frequency distribution map of rs657152. In comparison with rs11385942 polymorphic marker, this distribution is more homogeneous. Generally, it is quite frequent among almost all Old World populations. The highest frequencies (above 50%) were observed in Sub-Saharan Africa, while the most Eurasian populations have frequencies between 40 and 50%. At the periphery of Eurasia (Atlantic fridge of Europe, Far East, South-East Asia) the frequency tends to drop below 40% and in Native Americans and Australasians it is nearby absent.
Variation of rs11385942 within Russia
Among populations of European Russia the rs657153 frequency varied from 38 to 42%, which was comparable with Belorussia (33%, p>0.05) and lower than in Ukraine (51%, p<0.05) (Tables 1, 2). In Caucasus rs657152 was most frequently met in eastern regions (Daghestan, 52%), the lowest percentage was found in western regions (27%). Distribution differences in rs657152 among the population of these regions were most significant in comparison with other included regions (Table 2). In Asia, among Tuvans and Mongols, the frequency of rs657152 was 38-39%, while among the populations of Uzbekistan and Kyrgyzstan the frequency was 51% (with the insignificant differences in comparison with Kazakhstan and Tajikistan, Table 2).
Correlation between frequencies of COVID-associated markers and epidemiological parameters
We compared frequencies of both markers in various populations of Russia with the recovery and mortality rates from COVID-19 in the same populations. The epidemiological numbers (number of COVID-19 cases, recoveries and death) and population sizes are present in the Table 3. We have observed (Table 5) a negative insignificant correlations between the frequency of both risk polymorphic markers and the parameters linked with the number of cases of COVID-19 (either absolute numbers or numbers normalized by the population size). Instead, we found positive correlations between both risk alleles and the mortality rate, measured as number of deaths per number of Covid-19 cases, and negative correlation between both risk alleles and the recovery rate (number of recoveries normalized by number of cases). The correlation of mortality with the frequency of rs657152 (r=0,63) was significant with p-value 0,01 (Table 5), while correlation with the frequency of rs11385942 (r=0,24) was insignificant (p=0.38). Correlation between rs657152 and mortality rate remained significant after Bonferroni correction. When analyzed the combined dataset (Russian and world populations), all correlations became insignificant (Table 5).
Discussion
The genome-wide association study by Ellinghaus D. et al. revealed the strongest association signal at locus 3p21.31 comprised six genes (including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1): rs11385942 risk allele GA is associated with a genetic predisposition to COVID-19 mediated acute respiratory failure. The frequency of risk allele was higher among patients who received mechanical ventilation compared to those who received only oxygen supplementation. The most relevant genes to COVID-19 were SLC6A20, LZTFL1 and CXCR6. The gene SLC6A20 encodes sodium-iminoacid transporter 1, which closely interacts with ACE2 – the “entrance gate” of SARS-CoV-2 [36], [37]. LZTFL1 regulates the functions of cilium and intra-flagellar transport in cells [38]. The CXCR6 gene regulates the specific location of resident memory T cells in different parts of the lung, providing a stable immune response to pathogens in respiratory tract [39]. The carriage of GA risk allele of determined the decreased expression of CXCR6 and the increased expression of SLC6A20 and LZTFL1 genes in human lung cells, which determines the severe COVID-19 [7].
Ellinghaus D. et al noted that the frequency of risk alleles rs11385942 varies among populations worldwide, but did not analyze the pattern of this variation. We generated the dataset on the frequencies of this marker in various world populations and identified that its global variation follows the West Eurasian pattern. Like many other polymorphisms, rs11385942 is frequent in West Eurasians (Europeans, West Asians, and South Asians) but rare or absent in all other parts of the globe, including East Asians, North Asians, Native Americans and Sub-Saharan Africans (North African populations are related to West Eurasians and thus carry this markers at moderate frequencies). This marker (along with the linked SNPs in the 50kb length haplotype block) came from admixture with Neanderthals [8]. This explains well its absence in Sub-Saharan Africa (where no Neanderthals are known), while absence in East Eurasia can be attributed to genetic drift after separating West and East Eurasians. Our detailed analysis of populations from Russia demonstrated the absence of abrupt differences between Eurasian subcontinents: for example, elevated frequency of rs11385942 in Europe and absence in the Pacific coast are linked by the chain of populations with intermediate frequencies, gradually decreasing eastward.
The genome site exhibiting the second high association with severe COVID-19, rs657152 is located within the AB0 blood group locus. The importance of AB0 blood groups system for COVID-19 was also reported in other studies, namely the risk of respiratory failure in COVID-19 was highest in patients with blood group A compared to other groups while the lowest risk was in patients with blood group O, [40], [41], though these studies are primarily concerned with the risk of infection and not with disease severity [42]. The protective effect of blood group 0, in contrast to other groups, was explained by the presence of neutralizing antibodies against protein-linked N-glycans [7], [43]. It is also known that there is a linkage between ABO blood group locus and the expression of the von Willebrand factor’s gene (VWF) (12p13.31 locus), which in connection with VIII factor promotes the clots formation on the surface of damaged vessels. Pulmonary endothelial cells in non-O groups are associated with higher levels of VWF compared to O group [44], [45], which may explain the role of AB0 blood group locus in COVID-19 patients.
However, the rs657152 does not directly encode blood group A or any other blood group. Instead, it can be used to distinguish rare variants [46]. The fact that GWAS identified the strongest association with this SNP which does not code the blood group while other studies identified association with the classical blood groups indicates the complexity underlying these associations which might be resolved by future studies involving the sequencing of the entire AB0 locus. Meanwhile, we can consider the frequency distribution of AB0 group A (Figure 5). The most impressive feature of this map is the large amount of data it is based on: 2757 populations, which is almost thirty times more than the dataset for the other SNPs analyzed in this study. This was because the map (Figure 5) is based on the frequencies of blood groups accumulated in 20th century due to works of many scholars. This case exemplifies the value of old-fashion datasets, because using modern datasets (like 1000 Genomes Project or the dataset analyzed in our study) one could obtain a rough, approximate distribution map, while using the data from books published in the second half of 20th century provided the very detailed and reliable picture (Figure 5). This picture demonstrated that highest frequencies (30% and higher) were observed in West Europe, slightly lower in the Volga-Ural region in East Europe, and generally low frequencies (10-20%) in North Asia (Siberia), East Asia, and Sub-Saharan Africa. As for South Asian populations, most of them carry AB0_A at moderate frequencies (15-20%) while neighboring West Central Asians exhibit almost as high frequencies, as West Europeans do. Summarizing, we observed the higher frequency of the first risk allele rs11385942 in South Asia, somewhat lower in Europe and West Asia, and low in other regions of the globe. For the second risk allele rs657152 we observed the highest frequency in Africa and moderate in Eurasia. Finally, the AB0 group A is most frequent in Europe and moderately frequent across Eurasia. So that, Europeans and South Asians carry every risk allele at highest or high frequencies, as compared with populations from other regions.
Therefore, we investigated whether the elevated frequency of risk alleles in a population indeed results in worse epidemiological situation in this country or region. Though we cannot claim the causative effect, we found the reasonable pattern of correlations. First, we found no correlation between frequencies of risk alleles and numbers of COVID-19 cases: this was expected because these alleles are associated with severe course of disease rather than with the susceptibility. Second, the correlations with the outcomes of the disease were as they should be: for both markers, higher population frequency of risk alleles correlated positively with the deaths rate. For the rs11385942 we can state the tendency only, while for rs657152 the correlation was high (r=0,6) and significant (p-value 0,01, Table 5). Notably, we found these reasonable correlations on the “Russian” dataset only, while correlations on the world dataset were zero. This difference could be attributed to the pronounced difference between countries how the not-severe cases are counted: the number of underestimated COVID-19 cases affects the mortality rate dramatically, making it impossible to reveal the correlation with genetic factors. On the contrary, when using epidemiological statistics within one country (in our case, Russia) the ways to count COVID-19 cases become more uniform, and the correlation with variation of the risk alleles frequency becomes evident. The genetic data (allele frequencies) used in our study are also affected by relatively low samples sizes, thus statistical noise is expected to decrease the correlation values. To this end we believe that future studies will identify even stronger correlation between recovery/mortality rates and population’s gene pool, and that genetic variation between populations makes small but real contribution into the heterogeneity of the pandemic worldwide.
Data Availability
Authors declare that all data provided in this manuscript is available and could be submitted on request.
Funding and Disclosure
This work was supported by the Ministry of Science and Education of the Russian Federation (State assignments for the Research Centre for Medical Genetics and Vavilov Institute of General Genetics) and by the Ministry of Health (State assignment for the Russian Medical Academy of Continuous Professional Education).
The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.