COVID-19 susceptibility and severity risks in a survey of over 500,000 individuals ================================================================================== * Spencer C. Knight * Shannon R. McCurdy * Brooke Rhead * Marie V. Coignet * Danny S. Park * Genevieve H.L. Roberts * Nathan D. Berkowitz * Miao Zhang * David Turissini * Karen Delgado * Milos Pavlovic * AncestryDNA Science Team * Asher K. Haug Baltzell * Harendra Guturu * Kristin A. Rand * Ahna R. Girshick * Eurie L. Hong * Catherine A. Ball ## ABSTRACT **Background** The enormous toll of the COVID-19 pandemic has heightened the urgency of collecting and analyzing population-scale datasets in real time to monitor and better understand the evolving pandemic. **Methods** The AncestryDNA COVID-19 Study collected self-reported survey data on symptoms, outcomes, risk factors, and exposures for over 563,000 adult individuals in the U.S. in just under four months, including over 4,700 COVID-19 cases as measured by a self-reported positive test. **Results** We replicated previously reported associations between several risk factors and COVID-19 susceptibility and severity outcomes, and additionally found that differences in known exposures accounted for many of the susceptibility associations. A notable exception was elevated susceptibility for males even after adjusting for known exposures and age (adjusted odds ratio [aOR]=1.36, 95% confidence interval [CI] = (1.19, 1.55)). We also demonstrated that self-reported data can be used to build accurate risk models to predict individualized COVID-19 susceptibility (area under the curve [AUC]=0.84) and severity outcomes including hospitalization and critical illness (AUC=0.87 and 0.90, respectively). The risk models achieved robust discriminative performance across different age, sex, and genetic ancestry groups within the study. **Conclusion** The results highlight the value of self-reported epidemiological data to rapidly provide public health insights into the evolving COVID-19 pandemic. **What is already known on this subject** * The COVID-19 pandemic has exacted a historic toll on human lives, healthcare systems and global economies, with over 83 million cases and over 1.8 million deaths worldwide as of January 2021. * COVID-19 risk factors for susceptibility and severity have been extensively investigated by clinical and public health researchers. * Several groups have developed risk models to predict COVID-19 illness outcomes based on known risk factors. **What this study adds** * We performed association analyses for COVID-19 susceptibility and severity in a large, at-home survey and replicated much of the previous clinical literature. * Associations were further adjusted for known COVID-19 exposures, and we observed elevated positive test odds for males even after adjustment for these known exposures. * We developed risk models and evaluated them across different age, sex, and genetic ancestry cohorts, and showed robust performance across all cohorts in a holdout dataset. * Our results establish large-scale, self-reported surveys as a potential framework for investigating and monitoring rapidly evolving pandemics. Keywords * COVID-19 * Modeling * Genetic Epidemiology * Statistics ## INTRODUCTION The COVID-19 pandemic has resulted in over 83 million COVID-19 cases and over 1.8 million deaths worldwide [1], including nearly 21 million cases and more than 350,000 deaths in the United States as of early January 2021 [2]. The growing impact of the pandemic intensifies the need for real-time understanding of COVID-19 susceptibility and severity risk factors, not only for public health experts, but also for individuals seeking to assess their own personalized risk. Prior research has indicated that differences in COVID-19 *susceptibility* are related to age [3], sex-dependent immune responses [4], and genetics [5, 6], while heightened *severity* of COVID-19 illness is associated with risk factors such as age [3,7–9], sex [4,10–12], genetic factors [13], and underlying health conditions [7,9,10,14,15]. Self-reported survey data, which can easily be collected in the home, afford the opportunity to dynamically monitor the continually evolving pandemic and allow for real-time estimation of individual-level COVID-19 risk [16–19]. Furthermore, self-reported surveys allow for collection of information about known exposures, of which few epidemiological COVID-19 studies have explicitly accounted for in association analyses to date [20]. In this paper, we aimed to replicate previous literature and to provide new insight into factors associated with susceptibility and severity of COVID-19 using a large survey cohort of 563,141 AncestryDNA customers who have consented to participate in the AncestryDNA COVID-19 Study [5]. We performed association tests of known or suspected COVID-19 risk factors with one susceptibility and two severity phenotypes and report unadjusted ORs and ORs adjusted for potential confounding factors. We additionally investigated associations of COVID-19 symptoms with susceptibility and severity. We further demonstrate that this type of self-reported dataset can be used to build accurate predictive risk models for COVID-19 susceptibility and severity outcomes. For susceptibility, we designed two models and additionally applied two literature-based models [18] to predict COVID-19 cases among respondents reporting a test result. We also designed models to predict two different COVID-19 severity outcomes based on minimal information about demographics, health conditions, and symptoms: hospitalization due to COVID-19 infection and progression of an infection to a life-threatening critical case among those reporting a positive COVID-19 result [14]. To evaluate the potential for generalizability, we assessed performance of all of the risk models across different age, sex and genetic ancestry cohorts. ## METHODS ### Ethics approval All data for this research project were from subjects who have provided informed consent to participate in AncestryDNA’s Human Diversity Project, as reviewed and approved by our external institutional review board, Advarra (formerly Quorum, IRB approval number: Pro00034516). Advarra operates under ethical principles underlying the involvement of human subjects in research, including the Declaration of Helsinki. All data were de-identified prior to use. ### Survey description Survey responses were collected from AncestryDNA customers consented to research in the U.S. between April 22 and July 6, 2020. The survey consisted of 50+ questions about COVID-19 test results, 15 symptoms among those who tested positive or who tested negative and had flu-like symptoms, disease progression for positive testers, age, height, weight, known exposures to biological relatives, household members, patients or any other contacts with COVID-19, and 11 underlying health conditions (Supplementary Tables 1 and 2). Collection of self-reported COVID-19 outcomes from U.S. AncestryDNA customers who consented to research for the study, and the survey design are described in more detail in a genome-wide association study (GWAS) on a very similar AncestryDNA dataset [5]. Here, participants reporting a negative test result were also assessed for symptoms and clinical outcomes. ### Outcome definitions The study assessed three outcomes: one for susceptibility and two for severity of COVID-19 infection. Cases for COVID-19 susceptibility were individuals who responded, “Yes, and was positive” to the question, “Have you been swab tested for COVID-19, commonly referred to as coronavirus?” Responders who answered, “Yes, and was negative” were used as controls for the susceptibility analysis. The hospitalization outcome was defined among COVID-19-positive cases if a participant responded “Yes” to a binary question about experiencing symptoms due to COVID-19 illness and “Yes” to the hospitalization question (“Were you hospitalized due to these symptoms?). Controls were defined by a response of “No” to the symptoms question or a response of “No” to the hospitalization question in addition to reporting a self-reported positive COVID-19 test result [5]. Critical cases of COVID-19 were defined via a response of “Yes” to one or more questions about ICU admittance or, alternatively, self-reported septic shock, organ failure, or respiratory failure resulting from a COVID-19 infection [14]. Controls were defined by a response of “No” across all of these questions in addition to self-reporting a positive COVID-19 test result. ### Genetic sex and ancestry definitions All individuals were genotyped, using previously described general genotyping and quality control procedures [21]. Both sex and genetic ancestry were defined for individuals based on their genotypes. Genetic ancestry was estimated using a proprietary algorithm to estimate continental admixture proportions [22]. All participants were assigned to one of four broad genetic ancestry groups: European ancestry, admixed African-European ancestry, admixed Amerindian ancestry, or other ancestry combinations. ### Data preparation for analysis Multiple-choice categorical questions were one-hot (“dummy”) encoded as binary risk factors. We considered several risk factors and outcomes questions in our association analyses and risk modeling efforts, some of which are summarized in Supplementary Tables 1 and 2. Based on the dependency structure of the survey, we made the following inferences: * Participants reporting “No” to a binary question about symptoms arising from COVID-19 infection were designated as negatives for dependent questions about individual symptoms, hospitalization due to symptoms, and ICU admittance due to symptoms. * Participants reporting “No” to a binary question about hospitalization were assigned to a hospital duration of 0 days and designated as negative for ICU admittance due to symptoms. For association analyses, responses to a question about individual symptoms (“Between the beginning of February 2020 and now, have you had any of the following symptoms?”) were converted to a binary variable based on the following mapping: 0 = None, Very Mild, Mild; 1 = Moderate, Severe, Very Severe. ### Association analysis Analyses were performed either with the *statsmodels* package in Python3 or in base R with the *glm* function. For each susceptibility and severity outcome and risk factor of interest, a simple logistic regression (LR) model was fit using unpenalized maximum likelihood (Supplementary Tables 6-14) [23]. Multiple logistic regression was used to adjust the ORs for known COVID-19 exposures and potentially confounding risk factors. The adjusted model included age, sex, and four known exposures (Y/N if any) for susceptibility outcomes; and age, sex, obesity (binarized if BMI >= 30), and health conditions (binarized if any) for severity outcomes (Figures 1 and 2). Individual adjustment variables were omitted when analyzing associations for risk factors within equivalent categories (e.g., age was not included in adjusted models for age bin risk factors). ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F1) Figure 1. Susceptibility odds ratios (ORs) and 95% confidence intervals (CIs) estimated from simple (“Unadjusted models,” grey) and multiple (“Adjusted models,” black) logistic regression with adjustment for other risk factors. Open circles indicate not significant (p-value > 0.05) after accounting for multiple hypothesis tests using Bonferroni correction. Age, sex, genetic ancestry, and obesity ORs were estimated in relation to the reference variables indicated. Exposure, health, and symptom ORs were each estimated separately as binary variables. Symptom ORs were estimated as binary variables among symptomatic testers only (Methods). Risk factor adjustments for susceptibility include: sex, age, and at least one known COVID-19 exposure. Where applicable, individual adjustment variables were omitted to avoid duplicate adjustment (Methods). ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F2) Figure 2. Severity (hospitalization) odds ratios (ORs) and 95% confidence intervals (CIs) estimated from simple (“Unadjusted models,” grey) and multiple (“Adjusted models,” black) logistic regression with adjustment for other risk factors. Open circles indicate not significant (p-value > 0.05) after accounting for multiple hypothesis tests using Bonferroni correction. Age, sex, genetic ancestry, and obesity ORs were estimated in relation to the reference variables indicated. Exposure, health, and symptom ORs were each estimated separately as binary variables. Symptom ORs were estimated as binary variables among symptomatic testers only (Methods). Risk factor adjustments for severity include: sex, age, obesity (Y/N), and underlying health conditions (Y/N if any). Where applicable, individual adjustment variables were omitted to avoid duplicate adjustment (Methods). See Supplementary Figure 1 for critical case severity ORs. For each risk factor, 95% confidence intervals (CIs) for the log odds ratio were estimated under the normal approximation. The significance threshold was Bonferroni-corrected for the 42 different risk factors examined (adjusted threshold of 0.05/42=0.0012) [23]. ### Risk factor selection and risk model training Prior to model training, the data were split with a fixed random seed into training and holdout datasets. We chose risk factors based on a minimal subset of nominally significant ORs within our training data as well as literature guidance [3,4,7,9,11,12,14,15]. For the susceptibility models without symptoms, we included a subset of exposure-related questions, based on the training OR analyses, as well as two demographic variables (age and sex, Supplementary Tables 15-17). For susceptibility models with symptoms, we additionally included the five symptoms most differentiated between symptomatic negative and positive testers from our training ORs. For the severity models, we included pre-existing conditions, based on the training OR analyses, predictive symptoms within our training dataset, morbid obesity (BMI >= 40), age, and sex (Supplementary Table 18). Once final risk factors were selected, we trained LR models with 5-fold cross-validated grid search on the training dataset to select an optimal lasso regularization parameter lambda [23]. For the grid search, we scanned 8 different values for lambda, equally partitioned geometrically across a 4-log space. We then re-trained on the entire training dataset with the optimal lambda and evaluated the final model on the holdout dataset. ### Model thresholding Phenotypes were predicted from the output of trained models based on a 50% probability threshold (i.e., logistic model output > 0.5). Sensitivity and specificity were then calculated based on the true vs. predicted binary outcomes. ### Estimation of performance error To estimate error in model performances, we bootstrapped our holdout dataset 1,000 times to generate a sampling distribution for each evaluation metric. We estimated the mean and 95% CIs for each metric based on the mean and standard deviation of this sampling distribution [23]. ## RESULTS ### Survey response and study population A total of 563,141 responses were collected, with 4,726 individuals reporting a COVID-19 positive test result, 28,872 a negative test result, 71,761 no COVID-19 test but flu-like symptoms, and 454,542 no COVID-19 and no flu-like symptoms. 3,240 reported pending test results and were excluded from further analyses. The survey completion rate was approximately 95%. In general, the COVID-19 positive test rate and self-reported clinical outcomes were consistent with those reported by the U.S. Centers for Disease Control and Prevention (CDC) over a similar period (Supplementary Note 1) [24]. The majority of participants were female (67.5%) and of European ancestry (75.4%), with some individuals of Admixed Amerindian (6.5%) or Admixed African-European (3.0%) ancestries. The median age of the entire cohort was 56, and the median age of those reporting a positive test result was 49 (Table 1 and Supplementary Tables 1-5). View this table: [Table 1.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T1) Table 1. Study population demographic information ### Susceptibility associations: replicated and novel We replicated many previously reported literature associations for susceptibility. The strongest associations for a positive COVID-19 test result were known COVID-19 exposures, either through a household case (OR=26.03; 95% confidence interval [CI]=(22.26, 30.43)), biological relative (OR=5.77; 95% CI=(4.99, 6.68)), or other source of “direct” exposure (OR=6.94; 95% CI=(6.02, 7.99)) (Figure 1, Supplementary Table 6). In general, adjusting for known exposures, age, and sex resulted in attenuation of the ORs, with many associations becoming insignificant after adjustment (Figure 1, Supplementary Table 7). One novel result was that the OR for males was not attenuated after adjustment, and males remained at elevated odds after adjusting for known exposures and age (aOR=1.36; 95% CI=(1.19, 1.55); Figure 1, Supplementary Table 7). We also note that males and females reported comparable exposure burden, with males slightly more likely to report a household case of COVID-19 but less likely to report a case of COVID-19 among biological relatives (Supplementary Tables 9 and 10). Consistent with previous reports [25–29], younger individuals (ages 18-29; OR=1.51; 95% CI=(1.26, 1.81)) were significantly more likely to test positive compared to older individuals (ages 50-64, the largest age group in this cohort), and individuals of admixed African-European (OR=1.48; 95% CI=(1.18, 1.85)) or admixed Amerindian ancestry (OR=1.49; 95% CI=(1.26, 1.77)) were more likely to test positive compared to those of European ancestry (Supplementary Table 6). Individuals in all three of these groups reported higher levels of COVID-19 cases within the household, cases among biological relatives, and/or other known “direct” COVID-19 exposures (Supplementary Tables 8-10). Adjusting for age (ancestry groups only), sex, and known exposures attenuated the OR for all of these groups (younger aOR=1.28; 95% CI=(1.03, 1.59), African-European aOR=1.23; 95% CI=(0.94, 1.62), and Amerindian aOR=1.27; 95% CI=(1.04, 1.57); Figure 1, Supplementary Table 7). Individuals reporting pre-existing medical conditions (e.g., cancer, cardiovascular disease, chronic kidney disease [CKD], diabetes, hypertension) were less likely to test positive for COVID-19 (Figure 1, Supplementary Table 6). We observed significantly decreased odds of a known “direct” exposure to COVID-19, as well as significantly decreased odds of a household case of COVID-19, among such individuals relative to those without any health conditions (OR=0.71; 95% CI=(0.65, 0.78) and OR=0.74; 95% CI=(0.65, 0.84), respectively; Supplementary Tables 8 and 9). ### Replicated associations for COVID-19 severity Consistent with previous reports [7,9,12,14,15], we observed positive associations between certain health conditions and COVID-19 severity outcomes; many of these associations remained significant after adjustment for age, sex, and obesity (BMI >= 30) (Figure 2, Supplementary Tables 11-14). COVID-19 cases reporting at least one underlying health condition were significantly more likely to progress to a critical case (OR=2.85; 95% CI=(1.78, 4.57); Figure 2, Supplementary Figure 1, Supplementary Table 13). Specific underlying health conditions that were associated with hospitalization and/or critical case progression included CKD, chronic obstructive pulmonary disease (COPD), diabetes, cardiovascular disease, and hypertension (Figure 2, Supplementary Figure 1, Tables S12 and S14). Among individuals testing positive for COVID-19, the oldest (≥65 years) were significantly more likely to be hospitalized compared to those aged 50-64 (OR=1.70; 95% CI=(1.13, 2.56); Figure 2, Supplementary Table 11). Individuals of admixed African-European ancestry who tested positive were significantly more likely to report progression to a critical case, compared to those with European ancestry (OR=2.07; 95% CI=(1.03, 4.17); Supplementary Figure 1, Supplementary Table 13). Among COVID-19 cases, males were significantly more likely than females to report progression to a critical case (OR=1.54, 95%; CI=(1.00, 2.37); Supplementary Figure 1, Supplementary Table 13); these findings are consistent with CDC reports of increased ICU admittance rates in males (3% vs. 2%) [24]. ### Differential symptomology between susceptibility and severity We compared associations between susceptibility and severity to provide a more nuanced view of symptoms associated with a positive test result versus those associated with severe illness progression (Figure 3) [17,18,30]. Among symptomatic people reporting a COVID-19 test result, those reporting change in taste or smell (OR=7.26; 95% CI=(5.54, 9.50)), fever (OR=1.60; 95% CI=(1.28, 2.01)), or feeling tired or fatigue (OR=1.41; 95% CI=(1.05, 1.89)) were more likely to test positive (Figure 3, Supplementary Table 6). Those reporting runny nose (OR=0.59; 95% CI=(0.47, 0.75)) or sore throat (OR=0.49; 95% CI=(0.39, 0.62)) were more likely to test negative, consistent with previous reports that these symptoms are more indicative of influenza or the common cold (Figure 3, Supplementary Table 6) [17,18,30]. Change in taste or smell, a hallmark symptom of COVID-19 infection, was not associated with hospitalization (OR=0.77, 95% CI=(0.55, 1.07); Figure 3, Supplementary Table 11). By contrast, dyspnea (shortness of breath) was strongly associated with hospitalization and critical case progression (OR=7.52; 95% CI=(4.92, 11.49) and OR=11.55; 95% CI=(5.91, 22.59), respectively) [31], but was not associated with a positive test result (OR=1.14; 95% CI=(0.91, 1.44); Figure 3, Supplementary Tables 6, 11, and 13). ![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F3.medium.gif) [Figure 3.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F3) Figure 3. Comparison of susceptibility adjusted ORs (horizontal axis) and severity adjusted ORs (vertical axis) for symptoms in Figures 1 and 2. Severity aORs are for hospitalization. Note that aORs for susceptibility and severity are adjusted differently according to descriptions in Figure 1 and 2 captions. Adjusted ORs are plotted on a log scale for visibility. Shortness of breath is the strongest indicator of increased severity, while change in taste or smell is the strongest indicator for testing positive for COVID-19 among symptomatic individuals (Methods). Refer to Supplementary Figure 2 for demographic, health condition, and exposure aORs. ### Predictive risk models We further developed risk models that predict an individual’s COVID-19 risk (positive test result or severity, Methods) [7,16–18,32,33]. The susceptibility models were designed to predict a COVID-19 result (positive or negative) from risk factors among testers. We compared four models: our model based on demographics and exposures only (“Dem + Exp”); our model based on demographics, exposures, and symptoms (“Dem + Exp + Symp”); and for benchmarking purposes, a replication of a previously published model called “How We Feel” based on nearly identical self-reported symptoms (“HWF Symp”), and one which also included self-reported exposures (“HWF Exp + Symp”) (Supplementary Note 2, Supplementary Table 18) [18]. All four susceptibility models performed robustly; the three models that included one or more symptoms outperformed the model without symptoms (Dem + Exp), underscoring the value of self-reported symptoms for discriminating between cases and controls (Figure 4, see Supplementary Tables 19-22 for detailed model performance data). The model with demographics, exposures, and symptoms (Dem + Exp + Symp) achieved the highest overall performance with an AUC of 0.94 ± 0.02, a sensitivity of 85%, and a specificity of 91% (Supplementary Note 3). Each of the models performed comparably across different age, sex, and genetic ancestry cohorts (Figure 4, Supplementary Tables 19-22). We observed no significant overfitting in any of the models as evidenced by comparable train-test performances (Supplementary Table 23). ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F4) Figure 4. Performance of risk models on independent holdout data. A) Receiver operating characteristic (ROC) curves for susceptibility models to predict COVID-19 cases among testers reporting a result (positive or negative). B) Area under the curve (AUC) for the four susceptibility models in A), stratified by cohort. “All” represents everyone in A). C) ROC curves for severity models to predict either hospitalization (red) or critical illness progression (black) among COVID-19 cases. D) Area under the curve (AUC) for the two severity models in B), stratified by cohort. “All” represents everyone in C). Refer to Methods as well as Supplementary Figure 3 and Supplementary Tables 18-25 for additional model performance data and model risk factor information. We trained two severity models, designed to predict either hospitalization or progression to critical illness among COVID-19 cases. We included a number of risk factors and symptoms most associated with severe COVID-19 outcomes from the literature and/or our training dataset (Figure 3, Supplementary Tables 16 and 17); these included age [7–9,14], sex [4,7,11,12,14], morbid obesity (BMI >= 40) [7, 34], and health conditions [7,9,12,14,15], as well as symptoms including shortness of breath [31], fever, feeling tired or fatigue, dry cough, and diarrhea. Both models performed robustly on an independent holdout dataset (AUCs of 0.87 +/− 0.03 and 0.90 +/− 0.03 for the hospitalization and critical models, respectively, Figure 4). The severity models performed comparably when stratifying by age, sex, and genetic ancestry (Figure 4, Supplementary Tables 24 and 25), and there was no significant overfitting bias as evidenced by comparable train-test performances (Supplementary Table 23). ## DISCUSSION The AncestryDNA COVID-19 Study provides a highly complete, self-reported dataset that contains information about a plethora of risk factors in the context of COVID-19 susceptibility and severity outcomes. The self-report framework provides fast, low-cost, population-scale data that are particularly valuable in a pandemic, where knowledge is both limited and evolving rapidly based on changing circumstances. Additionally, the broad collection mechanism enables data-gathering from many more participants than typically seen in a medical setting, including those with mild or no symptoms, and participants can safely provide data from their homes. The study highlights exposure burden as the primary risk factor for COVID-19 susceptibility, and the importance of accounting for known exposures when assessing differences in susceptibility to COVID-19. Few studies have measured and explicitly adjusted for known COVID-19 exposures at this scale [20]. Importantly, we found elevated susceptibility risk in males after adjusting for age and known exposures, and unlike most of the risk factors we evaluated, the adjusted odds were not attenuated compared to the unadjusted odds. This finding is distinct from previous findings on elevated severity risk in males [4,7,11]. This result could be due to differences between men and women in behaviors, unknown exposures, biology, genetics [4–6], or other risk factors not measured within this dataset and should be investigated in future studies. Another major contribution of this study is the use of self-reported data for the development of novel risk models for predicting an individual’s COVID-19 susceptibility and severity risk. The risk models presented here perform comparably or better than similar and more complex models reported previously [16–18,32,33]. Although some previously reported risk models have been assessed in different age or sex cohorts [16, 17], we are not aware of any that have been assessed across genetic ancestry cohorts [7,16–18,32,33], highlighting the potential utility and generalizability of the models to broader populations [17,18,30]. ### Limitations We note that there are some inherent limitations of self-reported data for studying COVID-19 risk factors. The most severe cases, especially those resulting in mortality, were not sampled. As a result, many of the risk factor effect estimates may be underestimated. Additionally, the AncestryDNA cohort is self-selected, slightly older, more European, and more female than the broader U.S. population. Another potential issue is that those who reported a negative test may have underestimated their exposures and symptoms relative to those who tested positive, leading to upwardly biased exposure effect estimates. Finally, misclassification of COVID-19 positive status is likely given the uneven availability of tests over the time period surveyed, potentially leading to susceptibility effect estimates that are biased toward the null. However, the fact that most of the associations observed in this study were similar to those previously reported in the literature and the fact that risk model performance remained high when data were stratified by age, sex, and genetic ancestry lends confidence to our findings in spite of limitations. ## CONCLUSION The COVID-19 pandemic has exacted a historic toll on healthcare systems and global economies and continues to evolve based on changes in human behavior, public health guidelines, and societal factors. This study demonstrates the power of a self-reported data in a large cohort to rapidly elucidate more details about COVID-19 risk factors and help point the way to minimizing disease burden. ## Data Availability A dataset (EGAC00001001762) is available to qualified scientists through the European Genome-phenome Archive (EGA). The EGA dataset includes the risk factors and outcomes studied here. The EGA dataset is de-identified and comprises ~15,000 individuals who tested for COVID-19, including more than 3,000 individuals who tested positive, many of whom are in this study. The EGA cohort is sufficient to nominally replicate the vast majority of susceptibility and severity associations from this study. [https://ega-archive.org/dacs/EGAC00001001762](https://ega-archive.org/dacs/EGAC00001001762) ## Contributors SCK, SRM, BR contributed equally to the manuscript and wrote the first draft of the paper. ARG provided direct project guidance and lead the COVID-19 research teams. SCK, SRM, BR performed the association analyses. SCK developed and assessed the risk models. MVC and KAR designed the COVID-19 survey questionnaire. SCK and DP supported the dataset creation. GHLR supported the phenotype definitions. MVC and NDB built the demographic tables. SCK, SRM, BR, MVC, GHLR, ARG, and KAR helped with additional analyses and interpretation. MZ, DP, DT, KD, MP, HG, AKHB helped with the EGA dataset. The AncestryDNA Science Team contributed to additional work, allowing for the completion of the COVID-19 research and manuscript. KAR, ELH, and CAB provided additional project guidance. All authors contributed to the final manuscript. ## AncestryDNA Science Team Yambazi Banda, Ke Bi, Robert Burton, Marjan Champine, Ross Curtis, Abby Drokhlyansky, Ashley Elrick, Cat Foo, Michael Gaddis, Jialiang Gu, Shannon Hateley, Heather Harris, Shea King, Christine Maldonado, Evan McCartney-Melstad, Alexandra McFarland, Patty Miller, Luong Nguyen, Keith Noto, Jingwen Pei, Jenna Petersen, Scott Pew, Chodon Sass, Josh Schraiber, Alisa Sedghifar, Andrey Smelter, Sarah South, Barry Starr, Cecily Vaughn, Yong Wang ## Funding All work was supported and funded by Ancestry.com, a privately owned corporation. ## Competing interests Authors affiliated with AncestryDNA may have equity in Ancestry. ## Patient consent for publication Not required. ## Data availability statement A dataset (EGAC00001001762) is available to qualified scientists through the European Genome-phenome Archive (EGA). The EGA dataset includes the risk factors and outcomes studied here. The EGA dataset is de-identified and comprises ∼15,000 individuals who tested for COVID-19, including more than 3,000 individuals who tested positive, many of whom are in this study. The EGA cohort is sufficient to nominally replicate the vast majority of susceptibility and severity associations from this study. Risk models trained within the EGA cohort achieve comparable discriminative performance to the models presented here when evaluated in an independent holdout dataset (Supplementary Figure 5). ## SUPPORTING INFORMATION ### SUPPLEMENTARY NOTES **Supplementary Note 1. Comparison of AncestryDNA data to CDC data.** In general, the COVID-19 testing results are consistent with those reported by the U.S. Centers for Disease Control and Prevention (CDC) over a similar period [24]. For example, the CDC reported a 12% cumulative overall positive test rate from 1 March to 30 May 2020, while 14.1% of AncestryDNA survey participants reported testing positive (the denominator includes tests with “results pending”). For hospitalization, the CDC reported that 14% of individuals testing positive for COVID-19 were hospitalized, and 2% were admitted to an intensive care unit (ICU), between 22 January and 30 May 2020. Among the AncestryDNA survey respondents, 11% of those reporting a positive COVID-19 test were hospitalized, and 4.9% were admitted to ICU between April 22 and July 6, 2020. **Supplementary Note 2. Risk factors for predictive susceptibility models.** The HWF Exp + Symp model includes three risk factors, including two exposures and one symptom (change in taste or smell, Supplementary Table 18). The HWF Symp model includes seven symptoms most commonly associated with COVID-19 according to the CDC (Supplementary Table 18) [18]. For the Dem + Exp and Dem + Exp + Symp models, we incorporated responses to several exposure questions that were primary risk factors in the OR analysis of the training set (Methods and Supplementary Table 15) [18]. We also considered age and sex, given their nominal association with COVID-19 in our ORs in the training dataset, as well as reports about these variables as risk factors (Supplementary Table 15) [3,4,29]. For Dem + Exp + Symp model, we selected the most differentiated symptoms between cases and controls amongst symptomatic COVID-19 testers within the training dataset, including fever, change in taste or smell, feeling tired/fatigued, runny nose, and sore throat [17,18,30]. **Supplementary Note 3. Comparison of AncestryDNA and HWF cohorts and performances.** The HWF models performed better when trained and evaluated on the AncestryDNA dataset as compared with the HWF dataset, particularly for the symptoms-only model (HWF Symp: AncestryDNA-trained AUC=0.87 and HWF-trained AUC=0.76 and HWF Exp + Symp: AncestryDNA-trained AUC=0.90 and HWF-trained AUC=0.87; Figure 4) [18]. This result could be due to differences in survey flow design, demographics, or the relative prevalences of self-reported symptoms between the two datasets. We re-trained and evaluated our models on a subset of symptomatic COVID-19 testers (positives and negatives) reporting at least one symptom of moderate or greater intensity. The performances of all models that included symptoms were slightly attenuated in this cohort, with AUCs for the HWF models approaching previous reports (HWF Symp: AncestryDNA-trained AUC=0.80 and HWF Exp + Symp: AncestryDNA-trained AUC=0.87); the model without symptoms performed the same (Supplementary Figure 4). This suggests that predicting COVID-19 cases on the basis of symptoms alone is more challenging when symptoms are less differentiated between positive and negative testers. ### SUPPLEMENTARY TABLES View this table: [Supplementary Table 1.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T2) Supplementary Table 1. Selected subset of survey questions considered in the context of COVID-19 severity outcomes analysis and modeling efforts. View this table: [Supplementary Table 2.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T3) Supplementary Table 2. Selected risk factors considered in susceptibility or severity outcomes analysis and modeling efforts. View this table: [Supplementary Table 3.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T4) Supplementary Table 3. Demographics for susceptibility modeling cohort View this table: [Supplementary Table 4.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T5) Supplementary Table 4. Demographics for severity modeling cohort View this table: [Supplementary Table 5.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T6) Supplementary Table 5. Demographics for susceptibility models with symptoms View this table: [Supplementary Table 6.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T7) Supplementary Table 6. Unadjusted risk factor odds ratios for COVID-19 susceptibility View this table: [Supplementary Table 7.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T8) Supplementary Table 7. Risk factor odds ratios for COVID-19 susceptibility, adjusted for exposure, age, and sex View this table: [Supplementary Table 8.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T9) Supplementary Table 8. Unadjusted risk factor odds ratios for a known “direct” COVID-19 exposure View this table: [Supplementary Table 9.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T10) Supplementary Table 9. Unadjusted risk factor odds ratios for a household case(s) of COVID-19 View this table: [Supplementary Table 10.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T11) Supplementary Table 10. Unadjusted risk factor odds ratios for a COVID-19 case(s) amongst close biological relatives (parent, grandparent, child, or sibling) View this table: [Supplementary Table 11.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T12) Supplementary Table 11. Unadjusted risk factor odds ratios for COVID-19 hospitalization View this table: [Supplementary Table 12.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T13) Supplementary Table 12. Risk factor odds ratios for COVID-19 hospitalization, adjusted for health conditions, obesity, age, and sex View this table: [Supplementary Table 13.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T14) Supplementary Table 13. Unadjusted risk factor odds ratios for progression of COVID-19 to a critical case. View this table: [Supplementary Table 14.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T15) Supplementary Table 14. Risk factor odds ratios for progression of COVID-19 to a critical case, adjusted for health conditions, obesity, age, and sex. View this table: [Supplementary Table 15.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T16) Supplementary Table 15. Unadjusted risk factor odds ratios for COVID-19 susceptibility (training dataset) View this table: [Supplementary Table 16.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T17) Supplementary Table 16. Unadjusted risk factor odds ratios for COVID-19 hospitalization (training dataset) View this table: [Supplementary Table 17.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T18) Supplementary Table 17. Unadjusted risk factor odds ratios for progression of COVID-19 to a critical case (training dataset) View this table: [Supplementary Table 18.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T19) Supplementary Table 18. Summary of risk factors and performances of different models View this table: [Supplementary Table 19.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T20) Supplementary Table 19. Stratified holdout performance assessment for a COVID-19 susceptibility model built from age, sex, and exposures (Dem + Exp) View this table: [Supplementary Table 20.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T21) Supplementary Table 20. Stratified holdout performance assessment for a COVID-19 susceptibility model built from age, sex, exposures, and symptoms (Dem + Exp + Symp) View this table: [Supplementary Table 21.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T22) Supplementary Table 21. Stratified holdout performance assessment for a literature-based COVID-19 susceptibility model based on two exposures and one symptom (HWF Exp + Symp) View this table: [Supplementary Table 22.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T23) Supplementary Table 22. Stratified holdout performance assessment for a literature-based COVID-19 susceptibility model based on seven symptoms (HWF Symp) View this table: [Supplementary Table 23.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T24) Supplementary Table 23. Comparison between train and holdout performances for different risk models View this table: [Supplementary Table 24.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T25) Supplementary Table 24. Stratified holdout performance assessment for models to predict hospitalization amongst COVID-19 positive cases View this table: [Supplementary Table 25.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/T26) Supplementary Table 25. Stratified holdout performance assessment for models to predict critical COVID-19 cases amongst all COVID-19 positive cases ### SUPPLEMENTARY FIGURES ![Supplementary Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F5.medium.gif) [Supplementary Figure 1.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F5) Supplementary Figure 1. Severity (critical case) odds ratios (ORs) and 95% confidence intervals (CIs) estimated from simple (“Unadjusted models,” grey) and multiple (“Adjusted models,” black) logistic regression with adjustment for other risk factors. Open circles indicate not significant (p-value > 0.05) after accounting for multiple hypothesis tests using Bonferroni correction. Age, sex, genetic ancestry, and obesity ORs were estimated in relation to the reference variables indicated. Exposure, health, and symptom ORs were each estimated separately as binary variables. Symptom ORs were estimated as binary variables among symptomatic testers only (Methods). Risk factor adjustments for severity include: sex, age, obesity (Y/N), and underlying health conditions (Y/N if any). Where applicable, individual adjustment variables were omitted to avoid duplicate adjustment (Methods). ![Supplementary Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F6.medium.gif) [Supplementary Figure 2.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F6) Supplementary Figure 2. Comparison of Susceptibility and Severity adjusted ORs. Comparison of susceptibility ORs (horizontal axis) and severity ORs (vertical axis) from Figures 1 and 2. Severity aORs are for hospitalization. aORs are adjusted as in Figures 1 and 2 (see Figure 3 for adjusted ORs for symptoms). Plotted on a log scale for visibility. A) **Demographics**, broken down by age, sex, and genetic ancestry. B) **Health conditions**: Some pre-existing health conditions have increased odds of critical outcomes but susceptibility is not characterized by health conditions. C) **Exposures** are important for susceptibility but not severity. ![Supplementary Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F7.medium.gif) [Supplementary Figure 3.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F7) Supplementary Figure 3. Risk model confusion matrices for independent holdout data. Numbers represent number of individuals in each category. A) Susceptibility model with demographics, exposures, and symptoms. B) Susceptibility model demographics and exposures (symptoms excluded). C) Literature-based susceptibility model with three risk factors including two exposures and change in taste or smell. D) Literature-based susceptibility model with seven COVID-19 symptoms. E) Hospitalization risk model. F) Critical case risk model. Refer to Methods as well as Supplementary Tables 18-25 for additional model performance data and model risk factor information. ![Supplementary Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F8.medium.gif) [Supplementary Figure 4.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F8) Supplementary Figure 4. Performance of susceptibility models evaluated in a symptomatic cohort of positive and negative testers. Plot depicts receiver operating characteristic (ROC) curves for four susceptibility models (Supplementary Table 18). Here, the modeling cohort has been restricted to positive and negative testers reporting at least one symptom of moderate or greater intensity for model training and evaluation. ![Supplementary Figure 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2021/01/27/2020.10.08.20209593/F9.medium.gif) [Supplementary Figure 5.](http://medrxiv.org/content/early/2021/01/27/2020.10.08.20209593/F9) Supplementary Figure 5. Concordance between the EGA cohort and AncestryDNA cohort (V3) from this study. A) Concordance of susceptibility associations between the EGA dataset (vertical axis) and the AncestryDNA dataset from this study (V3, horizontal axis), plotted as log mean effect estimates (logORs). B) Concordance of hospitalization associations between the EGA dataset (vertical axis) and the AncestryDNA dataset from this study (V3, horizontal axis). C) Independent holdout performance of susceptibility models trained within the EGA dataset. D) Independent holdout performance of severity models trained within the EGA dataset. ## Acknowledgements We thank our AncestryDNA customers who made this study possible by contributing information about their experience with COVID-19 through our survey. Without them, this work would not be possible. We would like to thank Zach Bass, Robert Dowling, Disha Akarte, Swapnil Sneham, Sean Enright and the entire Cyborg team for their tireless work in the release and continued support of the COVID-19 survey. ## Footnotes * Updated discussion section to include limitations and updated supplementary figures. ## Abbreviations aOR : adjusted odds ratio AUC : area under the curve BMI : body mass index CI : confidence interval CDC : U.S. Centers for Disease Control CKD : chronic kidney disease COPD : chronic obstructive pulmonary disease EGA : European Genome-phenome Archive GWAS : genome-wide association studies HWF : “How We Feel” study ICU : intensive care unit LR : logistic regression OR : odds ratio ROC : receiver operating characteristic * Received October 8, 2020. * Revision received January 26, 2021. * Accepted January 27, 2021. * © 2021, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/) ## REFERENCES 1. 1.World Health Organization. Coronavirus Disease (COVID-19)– Weekly Epidemiological Update. 2021.[https://www.who.int/publications/m/item/weekly-epidemiological-update\---|5-january-2021](https://www.who.int/publications/m/item/weekly-epidemiological-update\---|5-january-2021) (accessed 6 Jan 2021). 2. 2.U.S. Centers for Disease Control and Prevention (CDC). Coronavirus Disease 2019 (COVID-19): CDC COVID Data Tracker. 2021.[https://covid.cdc.gov/covid-data-tracker/~cases_casesper100klast7days](https://covid.cdc.gov/covid-data-tracker/~cases_casesper100klast7days) (accessed 6 Jan 2021). 3. 3.CMMID COVID-19 working group, Davies NG, Klepac P, et al. Age-dependent effects in the transmission and control of COVID-19 epidemics. Nat Med 2020;26:1205–11. doi:10.1038/s41591-020-0962-9 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41591-020-0962-9&link_type=DOI) 4. 4.Yale IMPACT research team, Takahashi T, Ellingson MK, et al. Sex differences in immune responses that underlie COVID-19 disease outcomes. Nature Published Online First: 26 August 2020. doi:10.1038/s41586-020-2700-3 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2700-3&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32846427&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 5. 5.Roberts GHL, Park DS, Coignet MV, et al. AncestryDNA COVID-19 Host Genetic Study Identifies Three Novel Loci. Epidemiology 2020. doi:10.1101/2020.10.06.20205864 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoibWVkcnhpdiI7czo1OiJyZXNpZCI7czoyMToiMjAyMC4xMC4wNi4yMDIwNTg2NHYxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjEvMDEvMjcvMjAyMC4xMC4wOC4yMDIwOTU5My5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 6. 6.Zhao Y, Zhao Z, Wang Y, et al. Single-cell RNA expression profiling of ACE2, the receptor of SARS-CoV-2. medRXiv Published Online First: 26 January 2020. doi:10.1101/2020.01.26.919985 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1101/2020.01.26.919985&link_type=DOI) 7. 7.Williamson EJ, Walker AJ, Bhaskaran K, et al. Factors associated with COVID-19-related death using OpenSAFELY. Nature 2020;584:430–6. doi:10.1038/s41586-020-2521-4 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2521-4&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32640463&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 8. 8.Liu Y, Mao B, Liang S, et al. Association between age and clinical characteristics and outcomes of COVID-19. Eur Respir J 2020;55:2001112. doi:10.1183/13993003.01112-2020 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiZXJqIjtzOjU6InJlc2lkIjtzOjEyOiI1NS81LzIwMDExMTIiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wMS8yNy8yMDIwLjEwLjA4LjIwMjA5NTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 9. 9.Li X, Xu S, Yu M, et al. Risk factors for severity and mortality in adult COVID-19 inpatients in Wuhan. J Allergy Clin Immunol 2020;146:110–8. doi:10.1016/j.jaci.2020.04.006 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jaci.2020.04.006&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32294485&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 10. 10.Clark A, Jit M, Warren-Gash C, et al. Global, regional, and national estimates of the population at increased risk of severe COVID-19 due to underlying health conditions in 2020: a modelling study. Lancet Glob Health 2020;8:e1003–17. doi:10.1016/S2214-109X(20)30264-3 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S2214-109X(20)30264-3&link_type=DOI) 11. 11.Gebhard C, Regitz-Zagrosek V, Neuhauser HK, et al. Impact of sex and gender on COVID-19 outcomes in Europe. Biol Sex Differ 2020;11:29. doi:10.1186/s13293-020-00304-9 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s13293-020-00304-9&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 12. 12.Grasselli G, Zangrillo A, Zanella A, et al. Baseline Characteristics and Outcomes of 1591 Patients Infected With SARS-CoV-2 Admitted to ICUs of the Lombardy Region, Italy. JAMA 2020;323:1574. doi:10.1001/jama.2020.5394 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2020.5394&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32250385&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 13. 13.Ellinghaus D, Degenhardt F, Bujanda L, et al. Genomewide Association Study of Severe Covid-19 with Respiratory Failure. N Engl J Med 2020;:NEJMoa2020283. doi:10.1056/NEJMoa2020283 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMoa2020283&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32558485&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 14. 14.Richardson S, Hirsch JS, Narasimhan M, et al. Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area. JAMA 2020;323:2052. doi:10.1001/jama.2020.6775 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2020.6775&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32320003&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 15. 15.Guan W, Liang W, Zhao Y, et al. Comorbidity and its impact on 1590 patients with COVID-19 in China: a nationwide analysis. Eur Respir J 2020;55:2000547. doi:10.1183/13993003.00547-2020 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiZXJqIjtzOjU6InJlc2lkIjtzOjEyOiI1NS81LzIwMDA1NDciO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMS8wMS8yNy8yMDIwLjEwLjA4LjIwMjA5NTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 16. 16.Wynants L, Van Calster B, Collins GS, et al. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ 2020;:m1328. doi:10.1136/bmj.m1328 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjE3OiIzNjkvYXByMDdfMi9tMTMyOCI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIxLzAxLzI3LzIwMjAuMTAuMDguMjAyMDk1OTMuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 17. 17.Menni C, Valdes AM, Freidin MB, et al. Real-time tracking of self-reported symptoms to predict potential COVID-19. Nat Med 2020;26:1037–40. doi:10.1038/s41591-020-0916-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41591-020-0916-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32393804&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 18. 18.Allen WE, Altae-Tran H, Briggs J, et al. Population-scale longitudinal mapping of COVID-19 symptoms, behaviour and testing. Nat Hum Behav 2020;4:972–82. doi:10.1038/s41562-020-00944-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41562-020-00944-2&link_type=DOI) 19. 19.Martin LM, Leff M, Garrett C, et al. Validation of self-reported chronic conditions and health services in a managed care population. Am J Prev Med 2000;18:215–8. doi:10.1016/s0749-3797(99)00158-0 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0749-3797(99)00158-0&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=10722987&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000086044700004&link_type=ISI) 20. 20.Nguyen LH, Drew DA, Graham MS, et al. Risk of COVID-19 among front-line health-care workers and the general community: a prospective cohort study. Lancet Public Health 2020;5:e475–83. doi:10.1016/S2468-2667(20)30164-X [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S2468-2667(20)30164-X&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32745512&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 21. 21.Han E, Carbonetto P, Curtis RE, et al. Clustering of 770,000 genomes reveals post-colonial population structure of North America. Nat Commun 2017;8:14238. doi:10.1038/ncomms14238 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ncomms14238&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28169989&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 22. 22.Ball CA. Ethnicity Estimate 2019 White Paper. 2019.[https://www.ancestrycdn.com/dna/static/pdf/whitepapers/EV2019\_white\_paper\_2.pdf](https://www.ancestrycdn.com/dna/static/pdf/whitepapers/EV2019_white_paper_2.pdf) (accessed 1 Aug 2020). 23. 23.Hastie T. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York:: Springer 2009. 24. 24.Stokes EK, Zambrano LD, Anderson KN, et al. Coronavirus Disease 2019 Case Surveillance — United States, January 22–May 30, 2020. MMWR Morb Mortal Wkly Rep 2020;69:759– 65. doi:10.15585/mmwr.mm6924e2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.15585/mmwr.mm6924e2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 25. 25.Rosenberg A, Keene DE, Schlesinger P, et al. COVID-19 and Hidden Housing Vulnerabilities: Implications for Health Equity, New Haven, Connecticut. AIDS Behav 2020;24:2007–8. doi:10.1007/s10461-020-02921-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s10461-020-02921-2&link_type=DOI) 26. 26.Price-Haywood EG, Burton J, Fort D, et al. Hospitalization and Mortality among Black Patients and White Patients with Covid-19. N Engl J Med 2020;382:2534–43. doi:10.1056/NEJMsa2011686 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMsa2011686&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 27. 27.Millett GA, Jones AT, Benkeser D, et al. Assessing differential impacts of COVID-19 on black communities. Ann Epidemiol 2020;47:37–44. doi:10.1016/j.annepidem.2020.05.003 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.annepidem.2020.05.003&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32419766&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2021%2F01%2F27%2F2020.10.08.20209593.atom) 28. 28. Webb Hooper M, Nápoles AM, Pérez-Stable EJ. COVID-19 and Racial/Ethnic Disparities. JAMA 2020;323:2466. doi:10.1001/jama.2020.8598 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2020.8598&link_type=DOI) 29. 29. Julie Bosman, Mervosch S. As virus surges, younger people account for ‘disturbing’ number of cases. N. Y. Times. 2020.[https://www.nytimes.com/2020/06/25/us/coronavirus-cases-young-people.html](https://www.nytimes.com/2020/06/25/us/coronavirus-cases-young-people.html) 30. 30.Chow EJ, Schwartz NG, Tobolowsky FA, et al. Symptom Screening at Illness Onset of Health Care Personnel With SARS-CoV-2 Infection in King County, Washington. JAMA 2020;323:2087. doi:10.1001/jama.2020.6637 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2020.6637&link_type=DOI) 31. 31.Shi L, Wang Y, Wang Y, et al. Dyspnea rather than fever is a risk factor for predicting mortality in patients with COVID-19. J Infect 2020;:S0163445320302887. doi:10.1016/j.jinf.2020.05.013 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jinf.2020.05.013&link_type=DOI) 32. 32.Jehi L, Ji X, Milinovich A, et al. Individualizing Risk Prediction for Positive Coronavirus Disease 2019 Testing. Chest 2020;:S0012369220316548. doi:10.1016/j.chest.2020.05.580 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.chest.2020.05.580&link_type=DOI) 33. 33.Jehi L, Ji X, Milinovich A, et al. Development and validation of a model for individualized prediction of hospitalization risk in 4,536 patients with COVID-19. PLOS ONE 2020;15:e0237419. doi:10.1371/journal.pone.0237419 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0237419&link_type=DOI) 34. 34.Dietz W, Santos-Burgoa C. Obesity and its Implications for COVID-19 Mortality. Obesity 2020;28:1005–1005. doi:10.1002/oby.22818 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/oby.22818&link_type=DOI)