Abstract
Background Several risk factors have emerged for novel 2019 coronavirus disease (COVID-19) infection and severity. Yet, it is unknown to what degree these risk factors alone or in combination can accurately predict who is most at risk. It is also worthwhile to consider serological antibody titers to non COVID-19 infectious diseases, which may influence host immunity to COVID-19.
Methods In this retrospective study of multicenter UK Biobank participants, as of May 26th 2020, all COVID-19 testing data was collected by Public Health England for older adult in- and out-patients (69.6 ± 8.8 years). We used linear discriminant analysis with cross-validation and bootstrapping to determine the accuracy, specificity, and sensitivity of baseline data from 2006-2010 to predict COVID-19 infection and presumptive severity (i.e., testing at hospital). Receiver operating characteristic (ROC) curves were used to derive the area under the curve (AUC).
Findings This retrospective study included 4,510 unique participants and 7,539 testing instances (i.e., test cases). Testing resulted in 5,329 negative cases and 2,210 positive cases, split into 996 mild and 1,214 severe disease outcomes. Baseline data included demographics, bioimpedance-derived body composition, vitals, serum biochemistry, self-reported illness/disability, and complete blood count. A randomized subset of 80 participants with 124 test cases also had antibody titers for 20 common to rare infectious diseases. Among all test cases, accuracy was modest for final diagnostic models of COVID-19 infection (70.2%; AUC=0.570, CI=0.556-0.584) and severity (58.3%; AUC=0.592, CI=0.568-0.615). In the serology sub-group, by contrast, final models predicted infection and severity with an accuracy of 93.5% (AUC=0.969, CI=0.934-1.000) and 74.4% (AUC=0.803, CI=0.663-0.943) respectively. Models included titers to common pathogens (e.g., human cytomegalovirus), age, blood cell counts, lipids, and other biochemical markers.
Interpretation Risk profiles including serological titers and other risk factors could help policy makers and clinicians better identify who may get COVID-19 and require hospitalization.
Introduction
Coronavirus disease 2019 (COVID-19), caused by a novel beta-coronavirus called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)1, has become a worldwide pandemic, severely disrupting the economic, social, and psychological well-being of countless people. Clinical presentation of COVID-19 varies widely, ranging from asymptomatic profiles to mild symptoms like high fever or cough to acute respiratory distress syndrome and death. Given this heterogeneous symptom presentation, as well as difficulties with serology testing and contact tracing, worldwide public health efforts continue to focus on containment and especially isolating adults most at risk for COVID-19 infection and severe disease.
By extension, an expanding body of research has investigated potential factors that increase COVID-19 infection and disease severity risk. It is well known, for example, that adults aged >65 years are much more likely to be hospitalized or die due to COVID-19. Obesity itself and adverse health behaviors like smoking also increase infection risk and likelihood of hospitalization2,3. Several age and obesity-related conditions such as cardiovascular disease, cardiometabolic diseases (e.g., type 2 diabetes), hypertension, and other disease states and syndromes are also of concern4. Non-white ethnicity, particularly being black regardless of country of origin, socioeconomic deprivation, and low levels of education even after adjustment for health factors point to less privilege unfortunately conferring risk5. Among biological markers, COVID-19 infection or severity has been related to higher C-Reactive Protein and more circulating white blood cells and lower counts of lymphocytes or granulocytes (e.g., monocytes)6-8. SARS-CoV-1 has a similar profile except for a relatively normal total white blood cell count9.
These studies are invaluable for establishing or validating risk factors to guide clinical decisions and policymaker choices. However, we ultimately need to develop risk profiles derived from these factors to accurately predict who will and will not develop COVID-19, and if a COVID-19 disease course will be mild or presumptively severe (i.e., require hospitalization). Machine learning, data-driven modelling can be used to create robust, highly accurate prediction models based on routinely collected biomedical data like demographics, a complete blood count, and standard medical biochemistry data. Critically, using non COVID-19 serological data, we may gain insight into the host’s ability to fight COVID-19 by examining antibody titers that detail the host response to past infectious pathogens. This “virome” may affect host innate and adaptive immunity9,10. For example, human cytomegalovirus vastly changes the composition of T and B cells11, and may induce immune senescence that could account for worse SARS-CoV-2 infection outcomes.
Therefore, similar to previous work12, our objective was to use machine learning to determine what combinations of baseline measures, collected 10-14 years ago, could best predict which older adults developed COVID-19 and if disease presentation was mild or severe. In summary, we achieved a 93.5% accuracy for predicting COVID-19 infection based on a combination of age, biochemistry and leukocyte markers, and antibody titers to common pathogens like human cytomegalovirus, human herpesvirus 6, and chlamydia trachomatis. For COVID-19 severity, due to small sample size, only antibody titers loaded for final models that more modestly predicted severe disease (74.4%). Nonetheless, this is the first report to propose retrospective risk factor profiles to clarify and better characterize who is most at risk for COVID-19. In addition, our results suggest that past infection history and antibody response may be an invaluable, novel predictor of host immunity to COVID-19 that warrants further study.
Methods
Study design and participants
This retrospective study involved the UK Biobank cohort13. UK Biobank consists of approximately 500,000 people now aged 50 to 84 years (mean age=69.4 years). Baseline data was collected in 2006-2010 at 22 centers across the United Kingdom14,15. Summary data is listed in Table 1.
Baseline Demographics and Data Characteristics
Our study used the May 26th, 2020 tranche of COVID-19 polymerase chain reaction (PCR) data from Public Health England. The following categories of predictors were downloaded: 1) demographics; 2) health behaviors and long-term disability or illness status; 3) anthropometric and bioimpedance measures of fat, muscle, and water content; 4) pulse and blood pressure; 5) a serum panel of thirty biochemistry markers commonly collected in a hospital setting; and 6) a complete blood count with a manual differential for quantitation of total white blood cells and sub-types. Among a randomized subset of 9,695 participants, as part of a separate pilot project, baseline serum was thawed and tested to determine levels of antibodies to several antigens of 20 infectious diseases. Study subjects provided electronic, signed informed consent at recruitment. Ethics approval for the UK Biobank study was obtained from the National Health Service Health Research Authority North West - Haydock Research Ethics Committee (16/NW/0274). The detailed protocol is outlined at https://www.ukbiobank.ac.uk/.
COVID-19 Testing
Through a May 26th 2020 data upload from UK Biobank, our study was based on COVID PCR test data available from March 16th to May 19th 2020. There were 4,510 unique participants that had 7,539 individual tests administered, hereafter called cases or test cases. For modeling COVID-19 infection data, each test case was coded by UK Biobank as ‘0’ and ‘1’, respectively representing a negative or positive PCR test. For modeling COVID-19 disease severity, each test case was coded as ‘0’ and ‘1’, which represented out-patient testing (i.e., mild case) or hospital in-patient testing with clinical signs of infection (i.e., presumptively severe case).
To offer insight into this frequently updated resource, roughly weekly updates by Public Health England since inception (late April/early May) have so far consisted of new participants tested for the first time (35.6%), or who have follow-up testing (64.4%). As of the May 26th, 2020 upload (see Table 1), a given participant had anywhere between 1 and 20 tests for COVID-19 (mean=2.5 tests), with 1.8 ± 4.1 days between each test. Based on initial modelling, there are not yet enough test cases per participant to robustly model complex changes in disease status. As of June 5th, 2020, an in- vs. out-patient identification issue had been raised for 6 out of 105 hospitals and clinics. Thankfully, we found no evidence in our prediction models that laboratory or clinic of origin influenced our results.
Demographics
These factors included participant age in years at baseline, sex, education qualifications, ethnicity, and Townsend deprivation index. Sex was coded as 0 for female and 1 for male. For education, higher scores roughly correspond to progressively more skilled trade/vocational or academic training and skill needed to attain the qualification. Ethnicity was coded as UK citizens who identified as White, Black/Black British, or Asian/Asian British. The Townsend index16 is a standardized score, based on postal code (i.e., zip code) data taken from the census, indicating the relative degree of deprivation or poverty presumably experienced by the participant based on their permanent address.
Health Behaviors and Conditions
This category consisted of self-reported alcohol status, smoking status, a subjective health rating on a 1-4 Likert scale (“Excellent” to “Poor”), and whether the participant had a self-described long-term medical condition, illness, or disability. As noted in Table 1, 48.4% of participants indicated having such an ailment. We independently confirmed with ICD-10 based, NHS-confirmed diagnoses that this self-report data was accurate. These conditions included all-cause dementia and other neurological disorders, various cancers, major depressive disorder, cardiovascular (e.g., myocardial infarction) or cerebrovascular diseases and events (e.g., stroke), cardiometabolic diseases (e.g., type 2 diabetes), renal and pulmonary diseases (e.g., COPD), and other so called pre-existing conditions. We chose to use this single latent variable for simplicity, and because there were widely varying numbers of cases that severely underpowered our multivariable classifier analyses.
Vital Signs
The first automated readings of pulse and of diastolic and systolic blood pressure at the baseline visit were used.
Body Morphometrics and Compartment Mass
Anthropometric measures of adiposity (Body Mass Index, waist circumference) were derived as described17. Data also included bioelectrical impedance metrics that estimate central body cavity (i.e., trunk) and whole body fat mass, fat-free muscle mass, and/or water content18.
Blood Biochemistry
Serum biomarkers were assayed from baseline samples as previously described19. See Table 1 for data summaries of the full COVID-19 sample and the sub-group with serology data. Briefly, using immunoassay or clinical chemistry devices, spectrophotometry was used to initially quantify values for 34 biochemistry analytes. UK Biobank deemed 30 of these markers to be suitably robust, after rigorous quality control to minimize systematic bias and random error in sample thawing and processing. We downloaded all fully quality-controlled data from the main showcase. We rejected a further 4 markers due to data missingness >70% (estradiol, rheumatoid factor), or because there was strong overlap with other variables that had more stable distributions or trait-like qualities (glucose rejected vs. glycated hemoglobin or HbA1c; direct bilirubin rejected vs. total bilirubin).
Serology Measures for Non COVID-19 Pathogenic Diseases
As described (http://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/infdisease.pdf), among 9,695 randomized UK Biobank participants selected from the full cohort, baseline serum was thawed and pathogen-specific assays run in parallel using flow cytometry on a Luminex bead platform20.
Here, the goal of the multiplex serology panel was to measure multiple antibodies against several antigens for different pathogens, reducing noise and estimating the prevalence of prior infection and seroconversion in at least UK Biobank. All measures were initially confirmed in serum samples using gold-standard assays with median sensitivity and specificity of 97.0% and 93.7%, respectively. Antibody load for each pathogen-specific antigen was quantified using median fluorescence intensity (MFI). CagA titer load to H. pylori was excluded due to lab-based data loss. Because seropositivity is difficult to assess for several pathogens, we did not use pathogen prevalence as a predictor in models.
Table 2 shows the selected pathogens, their respective antigens, estimated prevalence of each pathogen based roughly on antibody titers, and assay values. This array ranges from delta-type retroviruses like human T-cell lymphotropic virus 1 that are rare (<1%) to human herpesviruses 6 and 7 that have an estimated prevalence of more than 90%.
Baseline characteristics of infectious disease serology from 2006-2010
Statistical Analysis
SPSS (Subscription build 1.0.0.1327) was used for all analyses. Due to differences in sample size, Mann-Whitney U and Kruskal-Wallis tests were used to compare quantitative values and categories (e.g., sex) for all 7,539 test cases and the 124 test cases with serology data (i.e., the serology sub-group). Linear discriminant analysis (LDA) was then leveraged, using individual predictors or weighted combinations of predictors, to maximally distinguish between: 1) negative or positive diagnosis for COVID-19; and 2) mild or severe COVID-19 disease status. LDA relies on a regression-like linear set of functions that can combine several data (i.e., features) and create predictive models that are straightforward to interpret. It is recognized that having a small number of test cases, such as in the serology sub-cohort, with many data types and features can lead to overfitting21. To guard against non-robust estimations, parametric violations, and model overfitting, 1-fold cross-validation with non-parametric bootstrapping (95% Confidence Interval, 1000 iterations) was used. While logistic regression is more robust to outliers than LDA, UK Biobank data is rigorously quality-controlled to remove extreme values. Due to the small sample size of the serology sub-group, logistic regression also would be more likely to have model overfitting that inflates true accuracy.
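To make the classifier concrete, the following is a minimal sketch of two-class LDA on a single predictor, scored with leave-one-out cross-validation. It is illustrative only: the analyses in this report were run in SPSS, the function names and toy values are hypothetical, and the choices of leave-one-out validation and of priors taken from group sizes are assumptions rather than the study's settings.

```python
import math

def fit_lda_1d(x, y):
    """Two-class LDA on one predictor: class means, pooled variance, priors."""
    groups = {k: [xi for xi, yi in zip(x, y) if yi == k] for k in (0, 1)}
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    ss = sum((xi - means[k]) ** 2 for k, v in groups.items() for xi in v)
    var = ss / (len(x) - 2)  # pooled within-class variance
    priors = {k: len(v) / len(x) for k, v in groups.items()}  # assumption: priors from group sizes
    return means, var, priors

def predict_lda_1d(model, xi):
    """Assign the class with the larger linear discriminant score."""
    means, var, priors = model
    score = {
        k: xi * means[k] / var - means[k] ** 2 / (2 * var) + math.log(priors[k])
        for k in (0, 1)
    }
    return max(score, key=score.get)

def leave_one_out_accuracy(x, y):
    """Refit without case i, classify case i, and average the hits."""
    hits = 0
    for i in range(len(x)):
        model = fit_lda_1d(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        hits += predict_lda_1d(model, x[i]) == y[i]
    return hits / len(x)

# Hypothetical, well-separated toy values (not UK Biobank data).
x = [1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 5.5, 6.0, 6.5, 7.0]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(leave_one_out_accuracy(x, y))
```

Because the log-prior term shifts the decision boundary toward the smaller group, a weak predictor combined with unequal group sizes can end up assigning every case to the majority class.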
First, LDA was used to examine how useful each baseline predictor was for correctly determining COVID-19 infection classification (negative, positive) and disease severity (mild, severe). This was done separately for all 7,539 test cases and the 124 test cases in the serology sub-group. Next, a series of forced entry models were used to see how well a set of related variables or features (e.g., demographics) predicted COVID-19 infection or disease severity. We recognize that some of these forced entry models are likely overfitted, particularly for modeling disease severity risk. Nonetheless, these models may provide a “best case scenario” for how well (or poorly) a class of predictors can perform in classification. Finally, a stepwise approach (Wilks’ Lambda, F value entry=3.84) was used to combine predictors into a risk profile that best classified COVID-19 infection or separately for severity risk.
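As a rough illustration of the entry criterion, the sketch below computes Wilks’ Lambda for one candidate predictor as the ratio of within-group to total sum of squares, and converts it to an F-to-enter value that can be compared against 3.84. It covers only the first step of a stepwise procedure; later steps use partial statistics conditioned on the variables already entered, which are not shown. The function names and toy values are hypothetical.

```python
def wilks_lambda(groups):
    """Univariate Wilks' Lambda: within-group sum of squares over total sum of squares."""
    all_x = [x for g in groups for x in g]
    grand_mean = sum(all_x) / len(all_x)
    ss_total = sum((x - grand_mean) ** 2 for x in all_x)
    ss_within = sum(
        sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups
    )
    return ss_within / ss_total

def f_to_enter(groups):
    """F statistic for entering the first predictor: ((1 - L) / L) * (N - g) / (g - 1)."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    lam = wilks_lambda(groups)
    return (1 - lam) / lam * (n - k) / (k - 1)

# Hypothetical predictor values for negative and positive test cases.
negative = [1.0, 2.0, 3.0]
positive = [4.0, 5.0, 6.0]

lam = wilks_lambda([negative, positive])
f_value = f_to_enter([negative, positive])
print(round(lam, 4), round(f_value, 2), f_value >= 3.84)
```

A Lambda near 0 means the group means are far apart relative to within-group spread, while a Lambda near 1 means the predictor barely separates the groups.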
For each classification model, the accuracy (i.e., percentage of test cases that were correctly classified), sensitivity (i.e., true positives correctly identified), and specificity (i.e., true negatives correctly identified) were calculated. The area under the curve (AUC) with a 95% confidence interval (CI) was also used. Receiver operating characteristic (ROC) curves plotting sensitivity against 1-specificity were created to visualize differences in prediction accuracy among sets of similar predictors or stepwise models. For stepwise models, the Wilks’ Lambda statistic and standardized coefficients were used to interpret how well and in what direction a given variable discriminated between positive vs. negative COVID-19 infection and mild vs. severe disease. A lower Wilks’ Lambda corresponds to a stronger influence on the canonical classifier. Alpha was set at .05.
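These metrics can be sketched in a few lines. The example below is illustrative and assumes a percentile bootstrap over test cases for the AUC interval; the exact resampling scheme used by SPSS in this study is not specified here, and all names and simulated scores are hypothetical.

```python
import random

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval: resample test cases with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to define an AUC
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(round((1 - level) / 2 * n_boot))]
    hi = stats[int(round((1 + level) / 2 * n_boot)) - 1]
    return lo, hi

# Simulated scores (not study data): positives are shifted upward on average.
rng = random.Random(1)
labels = [0] * 30 + [1] * 20
scores = [rng.gauss(0.0, 1.0) for _ in range(30)] + [rng.gauss(1.0, 1.0) for _ in range(20)]
preds = [1 if s >= 0.5 else 0 for s in scores]

print(classification_metrics(labels, preds))
print(auc(labels, scores), bootstrap_auc_ci(labels, scores))
```

Accuracy, sensitivity, and specificity depend on the chosen cut-off, whereas the AUC summarizes discrimination across all cut-offs.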
Role of the funding source
The funders of the study had no role in the study design, data collection, data analysis or interpretation, or writing of this report. The corresponding author (AAW) had full access to all of the data in this study and had final responsibility for the decision to submit the report for publication.
Results
As shown in Table 1, 7,539 total test cases for COVID-19 were conducted among 4,510 UK Biobank participants (69.6 ± 8.8 years) between March 16th and May 19th 2020, either in outpatient or inpatient settings. There were 5,329 negative cases and 2,210 positive cases. Of the positive cases, there were 996 mild and 1,214 presumptively severe disease outcomes, defined as a test case occurring in a hospital setting. Baseline data from 10-14 years ago (Mean = 11.22 years) was available for demographic, laboratory, biochemistry, and clinical indices. A central theme of this report is the comparison of the 7,539 total test cases to a sub-group of 124 test cases with serology data (Table 2), in order to show that better model fit incorporating serology markers was not merely due to sample size differences or model inflation. Using non-parametric tests, then, Table 1 indicates that the full cohort and serology sub-groups largely did not differ on most measures. A few significant differences were clinically unremarkable for the serology sub-cohort and well within the range of normal values, including lower pulse rate, several markers reflecting better kidney function, and a mean 0.6 × 10^9/L lower total white blood cell count due to fewer lymphocytes.
Next, each baseline variable was used to predict COVID-19 infection for a given test case. For context, 70.6% of the 7,539 test cases were negative. Consequently, any predictor achieving an accuracy of 70.6% would be performing at chance. A better measure of accuracy in this case is the AUC, where 0.5 is at-chance prediction and 1.0 is perfect accuracy. We also focused on how well true COVID-19 positive cases were identified (i.e., sensitivity). Among all participants (Supplementary Table 1), any given significant predictor could not correctly distinguish any true positive test cases (0% sensitivity; AUC mean and range=0.525, 0.515-0.548). For the serology sub-cohort (Supplementary Table 2), several established risk factors that loaded had better overall fit (mean AUC=0.625, AUC range=0.528-0.712), due to better sensitivity of predicting infection. Examples included ethnicity (13.2%), alcohol status (15.4%), apolipoprotein B (10.3%), and two unusually strong biochemistry analytes: urate (25.6%) and testosterone (56.4%). In order to see if biomarkers of past host response to pathogens were useful for predicting a current host response to COVID-19, we then tested each antibody titer for an antigen to a specific pathogen. As shown in Supplementary Table 3, antibody titers to 14 antigens across 12 pathogens each performed as well on average as other types of predictors (mean AUC=0.627, AUC range=0.505-0.707). In particular, sensitivity was notable for antibody to the pp150 Nter antigen to Human Cytomegalovirus (33.3%) and BK VP1 to Human T Lymphotropic Virus 1 (30.8%).
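The class imbalance argument can be checked with simple arithmetic. The sketch below uses the test case counts reported in this section and a hypothetical classifier that labels every test case negative; its accuracy lands close to the negative share quoted above even though it never detects a positive case.

```python
negatives, positives = 5329, 2210   # test case counts reported in the Results
total = negatives + positives       # 7,539 test cases

# Hypothetical classifier that calls every test case negative.
true_negatives, false_positives = negatives, 0
true_positives, false_negatives = 0, positives

accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)

# With one constant score, every positive/negative pair is a tie and counts half.
auc_constant = (0.5 * positives * negatives) / (positives * negatives)

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.1f} "
      f"specificity={specificity:.1f} AUC={auc_constant:.1f}")
```

Accuracy alone therefore cannot separate an informative predictor from one that simply follows the majority class, which is why AUC and sensitivity are emphasized.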
Isolated effect of each non-serology predictor on COVID-19 risk among all test cases
Isolated effect of each non-serology predictor on COVID-19 risk among test cases with serology data
Isolated effect of each 2006-2010 antibody titer on predicting COVID-19 infection risk
Lastly, as listed in Table 3, sets of similar predictors were forced into a classifier model to gauge how well they collectively predicted COVID-19 infection. A stepwise model was also used to create a classifier that only included predictors which each provided unique predictive utility. Among all 7,539 test cases (top row), sets of predictors including the stepwise model were only able to correctly identify COVID-19 positive test cases up to 10% of the time. Supplementary Table 4 illustrates that predictors loading in the stepwise model included lipid and kidney health markers, white cell counts, as well as smoking status, ethnicity, and the Townsend Deprivation Index.
Predictors that loaded into the stepwise models for COVID-19 infection risk
Sets of predictors used to predict classification of COVID-19 test cases as negative or positive
In the serology sub-group (bottom row), some relatively sparse predictor sets had better sensitivity (e.g., 53.3%). While the biochemistry and serology forced entry models were likely overfitted, the analyses may nonetheless provide a “best case scenario” for their usefulness as a group. Notably, the stepwise model achieved 93.5% accuracy by correctly identifying when a COVID-19 test case was negative (94.1% specificity) or positive (92.3% sensitivity). Due to potential concerns with model overfitting, the stepwise model was re-run with only predictors that had individually loaded significantly (Supplementary Tables 2 and 3). This model had 6 variables and still achieved 79.8% accuracy. As shown in Supplementary Table 4, predictors that loaded in the stepwise model included antibody titers for antigens of several common pathogens (e.g., Human Cytomegalovirus, Chlamydia Trachomatis), lipid markers, age in years, white and red cell counts, and testosterone.
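As a consistency check on the stepwise result, the reported rates can be mapped back to whole numbers of test cases. The sketch below assumes the 124 serology test cases comprised the 39 positive cases reported for the severity analyses and therefore 85 negative cases; the correctly classified counts it recovers are inferred from the rounded percentages, not taken from the study tables.

```python
n_cases, n_positive = 124, 39        # serology sub-group sizes reported in the text
n_negative = n_cases - n_positive    # 85 negative test cases (derived)

def count_matching(rate_percent, n):
    """Integer counts k for which k/n, rounded to one decimal place, equals the reported rate."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == rate_percent]

tp_options = count_matching(92.3, n_positive)   # reported sensitivity
tn_options = count_matching(94.1, n_negative)   # reported specificity

tp, tn = tp_options[0], tn_options[0]
accuracy = (tp + tn) / n_cases

print(tp_options, tn_options, round(100 * accuracy, 1))  # prints [36] [80] 93.5
```

Under these assumptions, only 3 positive and 5 negative test cases were misclassified, which shows how few cases separate the headline accuracy from a noticeably lower figure in a sub-group of this size.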
Another set of analyses next determined how each baseline predictor could predict which of the 2,210 positive COVID-19 cases had a mild or severe disease course. For context, 45% and 55% of test cases were mild or severe respectively. Thus, accuracy of 50% would be considered chance prediction. Curiously, while sensitivity was the difficult metric to achieve for COVID-19 infection risk, accurately distinguishing true negatives (i.e., specificity) was problematic for disease severity. Among all 2,210 COVID-19 positive test cases (Supplementary Table 5), significant predictors showed a trade-off between better sensitivity or specificity and in general were only modestly useful (AUC mean and range=0.536, 0.524-0.572). Similarly, for the 39 COVID-19 positive test cases in the serology sub-group, Supplementary Table 6 shows that only alanine aminotransferase and neutrophil count significantly predicted disease severity beyond chance. Likewise, for serology data, Supplementary Table 7 indicates that the only significant antibodies to load were for the U14 antigen to human herpesvirus 7 (accuracy=64.1%; AUC=0.729) and JC VP1 antigen to human JC polyomavirus (accuracy=59%; AUC=0.671).
Isolated effect of each non-serology predictor on COVID-19 severity among all test cases
Isolated effect of each non-serology predictor on COVID-19 severity for the serology sub-group
Isolated effect of each 2006-2010 antibody titer on predicting COVID-19 severity
Table 4 shows the relative predictive value of groups of predictors for COVID-19 severity. First, for the full sample of 2,210 positive test cases, accuracy remained low and the proportion of true negatives (i.e., specificity) identified did not exceed 37%. This was regardless of predictor sets with a sparse or dense number of predictors. Supplementary Table 8 illustrates that the stepwise model included only alanine aminotransferase, age in years, and monocyte count, which may explain its modest predictive utility above chance. For the serology sub-group of 39 test cases, despite strong concerns about model overfitting, the accuracy, sensitivity, and specificity were similarly modest compared to all 2,210 positive test cases for the biochemistry, immunology, and serology panels. The stepwise model was sparse and had better overall accuracy (74.4%) due to improved detection of actual mild cases (61.5%). Indeed, Supplementary Table 8 shows that the stepwise model loaded 2 predictors, antibody titers for HTLV-1 gag antigen to the rare Human T Lymphotropic virus and JC VP1 antigen for the Human Polyomavirus that has an estimated prevalence of 57.5% in at least UK Biobank.
Predictors that loaded into the stepwise models for COVID-19 severity risk
Sets of predictors used to predict classification of COVID-19 positive cases as mild or severe
Discussion
The objective of this study was to determine if baseline data from 2006-2010 could predict which older adults would develop COVID-19 in 2020, and if that infection was presumptively mild or severe due to being at hospital. In summary, using machine learning, we developed separate risk profiles that accurately predicted future host immunity for COVID-19 infection (93.5%) and severity (74.4%). Such profiles only require retrospective, routine self-report and blood tests typically collected in out- and in-patient clinics and hospitals. As proof-of-principle that these profiles work, for example, we confirmed as others have noted with previous UK Biobank COVID-19 data that non-white ethnicity, low socioeconomic status, and smoking can increase infectious risk5.
Our most novel finding was that antibody titers, reflecting pathogen exposure history and past host immunity, were strong predictors of COVID-19 infection and severity, both as a group and especially in concert with established risk factors like age, neutropenia, and dyslipidemia. This virome may consist of beneficial and detrimental pathogens that change how the immune system responds to a novel, persistent viral challenge like COVID-1910. For example, we found antigens to human cytomegalovirus were the strongest predictors of infection risk in our stepwise model. Older adults with prior human cytomegalovirus infection evince exhaustion of the naïve T cell pool and fewer memory versus effector cells22. This may explain why monocyte count was one of the few variables to predict COVID-19 severity among all test cases in this study, as innate immunity must compensate. For COVID-19 severity prediction, antibody titer to the JC polyomavirus was the only serology predictor that loaded significantly in our stepwise model and is expressed in a majority of the general population. This virus can induce hemagglutination in type O blood cells23, which may in some way influence why this blood type may be protective for COVID-19 infection. This may also explain why higher red blood cell count appeared to be an important predictor for infection risk.
For other immunologic factors, mobilization of innate immunity was, not surprisingly, relevant to infection risk and severity. In particular, granulocytes (e.g., neutrophils, monocytes) loaded significantly in stepwise prediction models of COVID-19 infection and severity, but not cytokines such as C-Reactive Protein. C-Reactive Protein has been cited as a strong risk factor for COVID-1924. However, this marker merely reflects signaling of the acute phase response due to systemic infection, typically being initiated by macrophages in contact with the pathogen and monocytes in the blood. Although lymphopenia and suppression of humoral immunity have been noted in COVID-19, lymphocyte cell count was in this study a modest predictor by itself and did not load in final stepwise models.
We also confirmed and extended the importance of age and biological factors related to lipids and kidney health, but curiously not obesity or comorbid conditions. Among now elderly adults in UK Biobank, age was one of the few factors to impact both infection and severity risk. Perhaps in concert, lipoprotein metabolism changes with aging along with sedentary lifestyle can induce hyperlipidemia, which is a risk factor for cardiovascular disease and may increase COVID-19 infection risk25. The lack of association with bioimpedance-derived fat, muscle, and water quantitation, or long-standing illness, was unexpected but may be due to complex interactions that are beyond the scope of this report. Finally, levels of testosterone by itself and in concert with other factors in the serology sub-group could strongly identify adults who would later develop COVID-19. Sex differences favoring COVID-19 infection in men have been noted, and andropause-induced reductions in testosterone occur in aging men. As testosterone normally downregulates inflammation, this loss may increase disease susceptibility26.
Some major limitations should be noted in our study. At this time, the number of UK Biobank participants with COVID-19 and serology data is low, particularly for positive test cases. To temper this issue, we first used k-fold validation and rigorous bootstrapping to avoid model overfitting for the stepwise models. We also rigorously tested each predictor or set of predictors in the main sample or serology sub-group, where we found that model fit was not overly biased in general. Regardless, we acknowledge that sets of predictors with many variables may be overly optimistic in their prediction value. Larger sample sizes and gold standard classification schemes, such as training and testing using separate datasets, will be needed to validate that antibody titers and past pathogen history in general are relevant markers for COVID-19 infection and disease course. Another limitation was the modest predictive value of most variables across all test cases, which stands in contrast to high odds ratios for some of these same factors in UK Biobank and other cohorts. Studies with smaller sample sizes will often show inflated relative risk or prediction accuracy, due simply to less heterogeneous error variance. However, our study only examined so called main effects of predictors instead of complex interactions, such as darker skin, vitamin D content, and COVID-19 infection risk. Such interactions were beyond the scope of this report, which attempted to create relatively straightforward risk profiles that could be used in a clinic or by policymakers.
In summary, this is the first study to systematically use retrospective data in a large community cohort to predict future risk for COVID-19 infection and severity. Despite baseline data having been collected 10-14 years ago, we nonetheless achieved excellent to encouraging accuracy by combining several sets of emerging risk factors together. It is especially interesting that serological data performed as well or better than any other data type. Future work should leverage past pathogen history and host immunity to inform what may happen when the host is challenged by COVID-19.
Data Availability
All data was downloaded from the UK Biobank, a data resource open to all bona fide researchers.
Prediction accuracy of COVID-19 infection risk and severity among sets of predictors
The full sample consisted of all 7,539 test cases for COVID-19 infection risk, while there were 2,210 COVID-19 positive test cases examined for severity risk. Similarly, for the serology sub-group, there were 124 test cases and 39 test cases for COVID-19 infection and severity risk respectively. Test statistics for predictors are provided in Tables 3 and 4.
Acknowledgements
This research was conducted using the UK Biobank Resource under Application Number 25057. The study was funded by NIH AG047282 and AARGD-17-529552. No funding provider had any role in the conception, collection, execution, or publication of this work.