Abstract
It is plausible that variants in the ACE2 and TMPRSS2 genes might contribute to variation in COVID-19 severity and that these could explain why some people become very unwell whereas most do not. Exome sequence data was obtained for 49,953 UK Biobank subjects of whom 74 had tested positive for SARS-CoV-2 and could be presumed to have severe disease. A weighted burden analysis was carried out using SCOREASSOC to determine whether there were differences between these cases and the other sequenced subjects in the overall burden of rare, damaging variants in ACE2 or TMPRSS2. There were no statistically significant differences in weighted burden scores between cases and controls for either gene. There were no individual DNA sequence variants with a markedly different frequency between cases and controls. Whether there are small effects on severity, or whether there might be rare variants with major effect sizes, would require studies in much larger samples. Genetic variants affecting the structure and function of the ACE2 and TMPRSS2 proteins are not a major determinant of whether infection with SARS-CoV-2 results in severe symptoms. This research has been conducted using the UK Biobank Resource.
Introduction
There is wide variation in the severity of symptoms in patients infected with SARS-CoV-2 and there are reports in the UK that members of ethnic minorities are more severely affected. An obvious possible explanation for these findings would be that genetic polymorphisms affecting structure or function of key proteins could influence host susceptibility and/or responses to infection. If these polymorphisms varied in frequency between different ethnic groups then this could contribute to differential outcomes.
Two key proteins involved in SARS-CoV-2 infective processes are ACE2, which is expressed on the cell surface and acts as a receptor for the viral S protein, and TMPRSS2, which cleaves the S protein to allow fusion of the viral and cellular membranes (Hoffmann et al., 2020). Variants in the genes coding for these proteins might contribute to different responses to infection.
Methods
The UK Biobank dataset was downloaded along with the variant call files for 49,953 subjects who had undergone exome-sequencing and genotyped using the GRCh38 assembly with coverage 20X at 94.6% of sites on average (Hout et al., 2019). Informed consent from the subjects and ethical approval for research uses of the data had been obtained by UK Biobank. All variants were annotated using VEP, PolyPhen and SIFT (Adzhubei et al., 2013; Kumar et al., 2009; McLaren et al., 2016). To obtain population principal components reflecting ancestry, version 1.90beta of plink (https://www.cog-genomics.org/plink2) was run with the options --maf 0.1 --pca header tabs -- make-rel (Chang et al., 2015; Purcell et al., 2007, 2009).
The COVID-19 results table was downloaded from UK Biobank on 28th April 2020. This contained results for 2,724 subjects who had undergone testing for SARS-CoV-2 infection between 16th March and 14th April 2020 (Armstrong et al., 2020). During this period, testing in the UK was done almost exclusively on patients admitted to hospital and thus patients testing positive can be assumed to have severe disease because patients with milder symptoms were generally left at home. Of the subjects tested, 185 had been exome sequenced, of whom 74 had tested positive, meaning that they had at least one swab which demonstrated the presence of viral RNA at detectable levels. The proportion of infected subjects who require hospitalisation rises with age but is still only 0.18 for those aged 80 or over (Verity et al., 2020). Thus the subjects who tested positive could be regarded as cases with an unusually severe response to infection whereas the subjects who tested negative or who were not tested could be regarded as unscreened controls, most of whom would not have severe symptoms even if infected.
SCOREASSOC was then used to carry out a weighted burden analysis to test whether, in ACE2 or TMPRSS2, sequence variants which were rarer and/or predicted to have more severe functional effects occurred more commonly in cases, i.e. subjects who tested positive for SARS-CoV-2, than all the other sequenced subjects. All available variants in each gene were included in the analyses. As originally described, variants were weighted according to frequency so that rare variants were accorded 10 times the weight of common variants (Curtis, 2012). Variants were additionally weighted according to their functional annotation using the default weights provided with the GENEVARASSOC program, which was used to generate input files for weighted burden analysis by SCOREASSOC (Curtis, 2019, 2016, 2012). For example, a weight of 5 was assigned for a synonymous variant, 10 for a non-synonymous variant and 20 for a stop gained variant. Additionally, 10 was added to the weight if the PolyPhen annotation was possibly or probably damaging and also if the SIFT annotation was deleterious, meaning that a non-synonymous variant annotated as both damaging and deleterious would be assigned an overall weight of 30. The full set of weights is shown in Table 1, copied from the previous reports which used this method (Curtis et al., 2019, 2018). Variants were excluded if there were more than 10% of genotypes missing in the controls or if the heterozygote count was smaller than both homozygote counts in the controls. For each variant, an overall weight was obtained consisting of the product of the frequency-based weight and the annotation-based weight. For each subject a gene-wise weighted burden score was derived as the sum of the variant-wise weights, each multiplied by the number of alleles of the variant which the given subject possessed. ACE2 is located on the X chromosome and males were treated as if they were homozygotes. If a subject was not genotyped for a variant then they were assigned the subject-wise average score for that variant.
A t test was carried out to determine whether the gene-wise burden scores differed between cases and controls and additionally ridge regression analysis with lamda=1 was performed incorporating the first 20 principal components, as described previously (Curtis, 2019; Curtis et al., 2019, 2018). To do this, SCOREASSOC first calculates the likelihood for the phenotypes as predicted by the principal components and then calculates the likelihood using a model which additionally incorporates the gene-wise burden scores. It then carries out a likelihood ratio test assuming that twice the natural log of the likelihood ratio follows a chi-squared distribution with one degree of freedom to produce a p value.
Results
The genotype counts and frequencies of variants are presented in Supplementary Table 1, with variant positions and annotations redacted in order to preserve subject anonymity. There were 512 valid variants in ACE2 and there was no tendency for the weighted burden scores to be different between cases (mean (sd): 25.9 (45.7)) and controls (22.6 (37.8)): t=0.74, 49951 df, p=0.45 and chi-squared=0.33, 1 df, p=0.57. There were 658 valid variants in TMPRSS2 and although the weighted burden scores were lower in cases (64.6 (38.1)) than in controls (74.0 (48.9)) this difference would not meet conventional standards for statistical significance after applying a Bonferroni correction for the fact that two genes were tested: t=-1.6, 49951 df, p=0.10 and, in the ridge regression analysis incorporating principal components, chi-squared=4.17, 1 df, p = 0.04. On visual inspection of the results there were no individual variants with markedly different frequencies between cases and controls. Of course, for both genes there were many rare variants which were observed in controls but not in cases but this is as expected given the disparity in sample sizes.
Discussion
Although the number of severely affected subjects who had been sequenced is very small it is nevertheless possible to draw some preliminary conclusions and given the importance of the topic it seems reasonable to communicate these findings. In general, the results are negative. It is not the case that a large proportion of severely affected subjects have a particular genetic variant in one of these genes which is relatively rare in the general population. Nor is it the case that there is a common variant which confers strong protection against severe infection. It remains possible that there might be rare variants which have a major effect on risk in individual subjects but such effects would only be detected with larger sample sizes.
The fact that the weighted burden scores were higher in controls than in cases is consistent with the hypothesis that rare genetic variants in TMPRSS2 with functional effects disrupting functioning of the protein might be protective against severe infection. Although this is biologically plausible it should be emphasised that the results obtained are not statistically significant. This could be investigated further by carrying out targeted sequencing of this gene in a sample of a few hundred severely affected subjects.
Genetic variants affecting the structure and function of the ACE2 and TMPRSS2 proteins are not a major determinant of whether infection with SARS-CoV-2 results in severe symptoms.
Data Availability
No new data was generated for this study. The data used is available on application from UK Biobank.
Supplementary Table 1 Table showing genotype counts and allele frequencies for variants in ACE2 and TMPRSS2 in cases tested positive for SARS-CoV-2, presumed to have severe illness, against background controls. Each variant is assigned a weight, with higher weights for variants which are rarer and/or with predicted damaging effects. For variants in ACE2, which is on the X chromosome, males are treated as homozygotes. To preserve subject anonymity, variant positions and annotations are redacted.
Acknowledgments
This research has been conducted using the UK Biobank Resource. The author wishes to acknowledge the staff supporting the High Performance Computing Cluster, Computer Science Department, University College London. This work was carried out in part using resources provided by BBSRC equipment grant BB/R01356X/1.