Abstract
Big data approaches to discovering non-genetic risk factors have lagged behind genome-wide association studies that routinely uncover novel genetic risk factors for diverse diseases. Instead, epidemiology typically focuses on candidate risk factors. Since modern biobanks contain thousands of potential risk factors, candidate approaches may introduce bias, inadequately control for multiple testing, and miss important signals. Bayesian model averaging offers a solution, but classical statistics predominates, perhaps because of concern that the prior unduly influences results. Here we show that simultaneous Bayesian and frequentist discovery of direct risk factors is possible via a model-averaged hypothesis testing approach for large samples called ‘Doublethink’. Doublethink produces interchangeable posterior odds and p-values that control the false discovery rate (FDR) and familywise error rate (FWER). We implement the Doublethink approach in R and apply it to discover direct risk factors for COVID-19 hospitalization in 2020 among 1,912 variables in UK Biobank. We find nine exposome-wide significant variables at 9% FDR and 0.05% FWER. These include several commonly reported risk factors (e.g. age, sex, obesity) but exclude others (e.g. diabetes, cardiovascular disease, hypertension) which might be mediated through variables measuring general comorbidity (e.g. numbers of medications). We identify significant direct effects among infrequently reported risk factors (psychiatric disorders, infection, dementia and aging), and show how testing groups of correlated variables is a useful alternative to pre-analysis variable selection. We discuss the potential for impact and limitations of joint Bayesian-frequentist inference, and the mutual insights afforded into the long-standing differences on statistical approaches to scientific discovery.
Introduction
The big data era has seen the advent of biobank-scale scans for genetic determinants of diverse health outcomes in cohorts like UK Biobank (1, 2). But similar data-driven identification of non-genetic determinants, termed risk factors, has not become commonplace. Instead, current epidemiology typically reports on candidate risk factors. Studies address the question: What is the total effect of a variable on the outcome? Is it non-zero? For instance, more than 100 published studies have analysed dozens of candidate risk factors for COVID-19 outcomes in UK Biobank (Table S1). Synthesizing these findings is difficult because: (i) Other, more important, risk factors that were not analysed may exist among the thousands measured. (ii) It is unclear how to appropriately limit false positives caused by multiple testing. (iii) The processes of selecting candidate risk factors and deciding to publish are vulnerable to bias. The experience of candidate gene studies, largely superseded by genome-wide association studies (GWAS), raises further questions about strength of evidence and reproducibility in candidate risk factor studies (3, 4, 5).
A major complication for systematic studies of non-genetic risk factors, compared to GWAS, is the problem of mediation (6). Mediation occurs when the total effect of a variable (e.g. age) on an outcome (e.g. COVID-19 severity) is wholly or partially mediated through another variable (e.g. prior pneumonia). This conceptually divides the total effect into direct and indirect effects. Mediation is ignored in GWAS because genetic variables are coinherited at conception; they cannot generally cause one another. So the question is effectively: What is the direct effect of a variable on the outcome? Is it non-zero? In GWAS, artefactual signals generated by confounding are instead the major concern.
Controlling for other variables helps avoid confounding (7), but controlling for mediating variables alters the scientific question by shifting attention from total to direct effects. Unfortunately, direct effects can differ in direction and magnitude to total effects, a source of bias known as the Table 2 fallacy (8). Further pitfalls include reverse causation and collider bias (9).
Nevertheless, the demand for GWAS-inspired exposome-wide association studies (10) presents an opportunity, which has been partly filled by machine learning (11, 12). Machine learning offers a data-driven agnostic approach. A major advantage is the ability to analyse high dimensional data with minimal curation, even in the presence of collinearity and widespread correlation between variables.
But the question is different: What is the contribution of a variable to predicting the outcome? Usually there is no formal test. More importantly, a variable can be valuable for prediction due to confounding (13). Machine learning is therefore problematic for risk factor identification. Other concerns have been raised with artificial intelligence approaches in healthcare, particularly in terms of often difficult-to-achieve interpretability and equity (14).
Bayesian methodology offers a solution to the question of identifying direct effects in biobank-scale data while controlling for confounding (15). An important advantage is the ability to account for uncertainty in model choice by averaging over the inclusion or exclusion of other variables when estimating or testing the direct effect of each variable. This uncertainty can strongly influence conclusions. The question is therefore: What is the explanatory value of each variable, over and above all the other variables? Is it non-zero? With a careful approach to feature engineering to mitigate issues around mediation, reverse causality and collider bias, and with independent validation akin to GWAS, Bayesian model averaging (BMA) offers a powerful approach. But Bayesian approaches are seldom used in current epidemiology: none of 127 published studies of risk factors for COVID-19 outcomes in UK Biobank was Bayesian (Table S1). This might be explained by several issues, including lack of familiarity among researchers, high computational requirements, and difficulties specifying prior distributions (16). Many practitioners worry about the role of the prior in Bayesian hypothesis testing, which can lead different researchers to different conclusions from the same data (17).
Here we apply Bayesian model averaging to test for direct risk factors among individual variables and groups of variables. Moreover, we use our new method, Doublethink, that assumes a specific prior and large sample size, to produce a one-to-one correspondence between Bayesian model-averaged posterior odds and traditional p-values (18). This allows simultaneous control of not just the Bayesian false discovery rate (FDR), but also the frequentist familywise error rate (FWER). We apply Doublethink to investigate direct risk factors for COVID-19 hospitalization in UK Biobank, and we compare our results to the literature. This framework provides a highly capable model-averaging approach that can be applied to the systematic evaluation of direct risk factors in biobank-scale resources.
Theory
The Doublethink method (18) considers a general regression setting in which there are n observed outcomes Y1 … Yn and v variables (features) with regression coefficients β1… βv. The aim is to identify which variables directly influence the outcome, i.e. which of the regression coefficients are non-zero. In total there are 2v models which we identify via vector s, the jth element of which indicates whether variable j is included (sj = 1; βj ≠ 0) or excluded (sj = 0; βj = 0) from the model.
We are interested in testing the null hypothesis that the variables indexed by a set (V) all have regression coefficients βj = 0 (for j ∈ V). We can average over the inclusion or exclusion of all other variables to produce a set of models compatible with the null hypothesis, and a complementary set of models compatible with the alternative hypothesis, where S is the state space of s. In the Bayesian setting, we reject the null hypothesis if the posterior odds of AV versus OV, exceed some threshold τ. The Bayesian false discovery rate (FDR), both local and global (19), is then controlled at or below 1/(1 + τ), contingent on the prior.
Lemma 1 (18). The Bayesian procedure defined by Equation 1, which rejects the null hypothesis Ov when POAv:Ov > τ, is a closed testing procedure (20), and therefore controls the frequentist FWER in the strong sense.
To implement simultaneous Bayesian and frequentist inference, we assumed priors of the form where μ are the prior odds that βj≠0, h is a prior scale factor, θs are the unconstrained parameters in model s (the βj for which sj = 1, and any nuisance parameters), and Js is the per-observation Fisher information matrix for model s, evaluated at θs = 0. Fisher’s information matrix has been used widely in the definition of reference priors (e.g. 21, 22), and to generate concordance between Bayesian and frequentist point and interval estimates (see Table 1 of 18).
Subject to further assumptions, principally that (i) the outcomes are independent, given the model, (ii) n is large, and (iii) h is proportional to n (known as ‘local alternatives’), Johnson (23, 24) showed that the posterior odds between model t and nested model s are i.e. proportional to the maximized likelihood ratio Rt:s . These assumptions enable interconversion between Bayesian and frequentist hypothesis tests. From the well-known result that the deviance (2 log Rt:s) follows a chi-squared distribution with |t| 2 |s| degrees of freedom, conditional on model s, one can transform the posterior odds into a p-value as which is a rearrangement of the familiar (25). Note the p-value depends on Rt:s, but not the hyper-parameters μ and h.
Following Johnson (23, 24), we drop the assumption that h is proportional to n, and assume it is fixed. Although this contradicts a motivating assumption, it has the desirable effects of (i) removing the dependency of the prior on n, (ii) achieving statistical consistency and (iii) recapitulating the Bayesian information criterion (BIC) when h = 1, which has been shown to reasonably approximate a wide range of Bayes factors when n is large (26, 27). We further assume that μ < 1, and each variable has one parameter, and one degree-of-freedom. We use the theory of regularly varying random variables (28, 29) to derive the following.
Theorem 1 (18). We find that, under these assumptions, the model-averaged posterior odds, can be transformed into an asymptotically valid p-value for large n. Here RAv:Ov is a weighted mean of the nested maximized likelihood ratios in Ov and Av, and the p-value (unadjusted for multiple testing) is Theorem 2 (18). The level at which this procedure controls the FWER in the strong sense is As a corollary, the Bayesian procedure is equivalent to rejecting the null hypothesis Ov when an adjusted p-value, is smaller than threshold α.
An equivalent interpretation of these results is that the model-averaged deviance (2 log RAv:Ov) follows a chi-squared distribution with one degree of freedom, when large. This means Doublethink p-values cannot be arbitrarily rescaled by the prior parameters μ and h because (i) the null distribution of the model-averaged deviance does not depend on them, and (ii) the realized value depends on them only through weights. Therefore μ and h influence the power of the test, but not its theoretical distribution under the null hypothesis. This makes model-averaged hypothesis testing a viable frequentist procedure by facilitating a prior-agnostic approach to quantifying Bayesian significance thresholds in terms of FWER, for large samples.
Methods
We implemented the Doublethink approach as a Monte Carlo Markov Chain approach (30) in R (31) and Python (32) and applied it to identify risk factors for COVID-19 hospitalization in UK Biobank, following the COVID-19 Host Genetics Initiative definition as applied to UK Biobank.
Outcomes
Cases were identified from Public Health England’s Second Generation Surveillance System (SGSS), the National Health Service’s Hospital Episode Statistics (HES) and the National Health Service’s death registry between January and December 2020 as PCR positive for SARS-CoV-2 in SGSS, and hospitalized with International Classification of Diseases, Tenth Revision (ICD-10) diagnosis code U07.1 or U07.2 in HES. Participants not identified as cases were considered controls.
We excluded participants that died before 2020, non-England residents determined by assessment centre, and those that withdrew before the analysis. The total numbers of controls were down-sampled to 200,000 to speed computation. The total number of cases was 1,917.
Variables
We considered data fields approved for UK Biobank project 53100 ‘Microbiology, disease and genetics’, across the categories Population characteristics, Assessment centre, Biological samples, Online follow-up, Additional exposures and Health-related outcomes. We excluded Compound, Date, Text and Time variables, and variables concerning genetics and sampling processes. For repeated measures, we took the first instance. We excluded factors exceeding 50 levels, except self-reported illnesses, and variables missing in more than 15% of participants. Special values, including negative factor levels, were treated as missing. We imputed missing continuous and integer covariates taking the mean of non-missing values. Missing factor levels were treated as a separate level and excluded. We created binary variables for all levels of every factor observed with frequency above 0.2%. We created a binary variable for every ICD-10 code with frequency above 0.2% recorded before 2020 in HES. Overall, we analysed 184 covariates, binary variables encoding 865 levels across 193 factors, and 863 ICD-10 admission codes, a total of 1,912 variables (Supplementary Table S2).
Model
We fitted the data separately for each outcome via a logistic regression model implemented in R using the glm function, assuming an additive linear predictor with an intercept term. We assumed the prior odds of variable inclusion were μ = 0.0053, independently for the v = 1,912 variables, implying a prior expectation of 10 variables in the model. We assumed a unit information prior (h = 1) for the regression coefficients (27). We disallowed the inclusion of collinear variables by defining a zero likelihood.
Implementation
Like the implementation in (33), we employed a Markov Chain Monte Carlo (MCMC) sampler over the variable inclusion vector s. We ran 100 chains with 25,000 iterations of burn-in and 75,000 iterations of sampling. Chains were initialized using a furthest neighbor algorithm to avoid including correlated variables. For initialization, we clustered variables into 200 groups with the scikit-learn-extra KMedoids algorithm, using rank correlation distance. Each chain was initialized with the medoid of one group, before adding nine more variables iteratively from the next-least correlated variables. Three Metropolis Hastings moves were implemented that respectively added, removed, or swapped pairs of variables with relative proposal probabilities 9:9:2. Variables were swapped preferentially for those with high squared correlation. We simulated regression coefficients directly from conditional Normal distributions by post-processing the MCMC iterations. We calculated posterior odds and parameter estimates by combining chains, computing standard errors across independent chains.
Grouping variables
We could perform valid post-hoc variable grouping while controlling the FDR and FWER, which was useful since correlated variables reduce one-another’s individual posterior inclusion probabilities. We clustered variables in two ways: (i) A posteriori, using the scikit-learn OPTICS algorithm with distances defined by their posterior correlation in inclusion probabilities. This grouped the variables with the strongest negative correlations in inclusion probabilities. (ii) A priori, using the algorithm applied to one minus the squared correlation between variables. We computed the posterior odds of including any variable among each group.
p-value calculation
We used the chi-squared distribution to compute adjusted p-values using Theorem 2. In the case of orthogonal variables with one degree-of-freedom, this is conservative for p < 0.02. Since the large n assumption implies interest in small significance thresholds, we report any p-value larger than 0.02 as n.s. (not significant) or 8-9. This makes Doublethink incompatible with any threshold exceeding α = 0.02. In practice, we did not explicitly set a threshold, instead reporting adjusted p-values alongside posterior odds.
Literature review
We reviewed the variables included in published analyses of COVID-19 risk factors in UK Biobank using the query “UK Biobank” (Abstract) and “COVID” (Abstract) in www.webofscience.com on 19 September 2023. After excluding Review Articles and Editorial Material, this search returned 203 publications. We analysed a subset of 127 of these papers that quantified the effect of non-genetic risk factors on COVID-19 outcomes; this predominantly excluded papers reporting genetic risk factors, two-sample Mendelian randomization, and COVID-19 as an exposure for other outcomes (Supplementary Table S1). We manually categorized the variables analysed by these 127 papers into groups (Supplementary Table S3). We summarized the frequency with which each category of variable was included in the published analysis or abstract.
Results
We aimed to identify risk factors that directly influenced COVID-19 hospitalization in the UK Biobank, to help understand the underlying processes, by using model-averaged hypothesis tests to account for uncertainty in variable selection and deplete for potential confounders, subject to the limitations of the data in terms of unmeasured variables and measurement error. We intended to limit the impact of collider bias by focusing on exposure variables measured before 2020, and by comparing cases to the rest of the biobank. This compounded the case definition with selection bias, for example access to testing, which may affect interpretation (9). We focus on risk factors for hospitalization with COVID-19, because there were more cases than critical illness, and less obvious selection bias than infection (since testing was more widely available in hospitals).
Doublethink facilitates joint Bayesian-frequentist model-averaged hypothesis tests
Figure 1 shows a Manhattan plot displaying the evidence that each of the 1,912 individual variables (points) directly affected the risk of COVID-19 hospitalization in UK Biobank, averaged over uncertainty in the effect of all other variables. Points are plotted against both the log10 posterior odds (left side) and the −log10 adjusted p-value from Theorem 2 (right side). This interconversion allows a Bayesian or frequentist approach to evaluating the strength of evidence.
Comparison of the two vertical axis scales shows that in the Doublethink model, the model-averaged posterior odds and adjusted p-values are approximately linearly related, for small p-values. Significant variables are identified by applying a threshold to either the posterior odds or the adjusted p-value; this simultaneously controls the FWER and – for the assumed prior – the FDR, under the asymptotic approximation. For example, a Bayesian threshold of τ = 10 would control the FDR at 1/(1 + τ) = 0.091 and the FWER at α = 10-3.3 = 0.00047. The latter is much smaller than the conventional threshold of 0.05, partly because of the large sample size.
At a significance threshold of τ = 10 and α = 10-3.3, nine variables were identified as individually exposome-wide significant. Significant variables, such as the ICD-10 codes F03 Unspecified dementia, J22 Unspecified acute lower respiratory infection and R29.6 Tendency to fall, not elsewhere classified, appeared to affect risk of COVID-19 hospitalization, even after controlling for the effects of all other measured variables. This differs from the common practice of testing the significance of a variable in the context of a single model that controls for a limited set of other variables. Model averaging was important here because no single model had high posterior probability.
Several significant variables were indicators or aggregates of presumptive underlying processes, such as 41214 Carer support indicators : 1 : Yes, which indicates a hospital record of past carer support, 137 Number of treatments/medications taken, which summarizes the recruitment interview, and Z86.4 Personal history of psychoactive substance abuse, which indicates a hospital record of past alcohol, tobacco or drug use. The direct effect of these proxies was to increase the risk of COVID-19 hospitalization in all cases (Table 1). In contrast, significant measures of educational attainment, 6138 Qualifications : 3 : O levels/GCSEs or equivalent, and 6138 Qualifications : 1 : College or University degree, had protective direct effects on risk of COVID-19 hospitalization.
The significance of some variables was, at first glance, unexpectedly low, such as the well-established risk factors 31 Sex : 1 : Male (Posterior probability, PP = 49.9; p⋆ = 10−2.23; where posterior odds = PP/(1-PP)) and 34 Year of birth (years) (PP = 40.8; p⋆ = 10−2.05; Table 1). This is explained by the inclusion in the data of the other very highly correlated variables 31 Sex : 0 : Female, 21003 Age when attended assessment centre (years) and 21022 Age at recruitment (years). Including variables that are correlated, whether strongly or weakly, inevitably dilutes the significance of individual variables when testing for the existence of a direct effect, over and above all other variables. For age and sex, an obvious solution would be to exclude these correlated variables. However, correlation is pervasive in biobank-scale data. An alternative solution is to define groups of correlated variables and test whether one or more members of a group affect the outcome. Doublethink allows arbitrary groups of variables to be tested in this way, while controlling the FDR and FWER.
Testing the significance of groups of variables reveals more signals
Nine groups of variables defined a priori were significant at τ = 10 and α = 10-3.3, often when the individual member variables were not. In Figure 1, vertical lines illustrate the boost in the significance of groups of variables compared to their most significant member. The groups are numbered for cross-reference with Table 1. For example, the well-established risk factors age (Group 1; PP = 100; p⋆ < 10−5.95), indices of multiple deprivation (Group 2; PP = 100; p⋆ < 10−5.95) and sex (Group 3; PP = 100; p⋆ = 10−5.95) were significant despite containing no individually significant member variables. In these examples, testing groups of variables recovered signal that was diluted by the inclusion in the data of highly correlated variables.
Finding that a group of variables is significant means there is evidence that one or more of them influence the outcome, after controlling for all other measured variables. This controls confounding caused by variables outside the group, but combines signals within the group, increasing power. For example, Group 8 was strongly significant (PP = 99.5, p⋆ = 10−4.71) while containing variables that were individually less so: Z50.1 Other physical therapy (PP = 79.9, p⋆ = 10−2.89) and Z50.7 Occupational therapy and vocational rehabilitation, not elsewhere classified (PP = 19.6, p⋆ > 0.02). One of these variables, or something they measure that is not captured by other variables, presumably influences the risk of COVID-19 hospitalization, even if we cannot attribute the effect specifically to either.
Testing groups is useful but defining them a priori is not the most effective method of discovering signals, because the groupings might not be relevant to the outcome under investigation. For example, Group 8 also included variable Z50.5 Speech therapy, which appeared to contribute nothing to the group’s overall significance (PP = 0.0, p⋆ > 0.02). Conversely, failure to group relevant variables together can cause signals to be overlooked, as we will see.
Doublethink allows arbitrary groups to be tested
One of the advantages of the Doublethink approach is it motivates the testing of arbitrary groups of variables without inflating the FWER or FDR through a multiple testing ‘fishing expedition’. This is because the thresholds of all possible tests are pre-defined in the closed testing procedure. Therefore we were free to search for the most significant groups of variables. To this end, we grouped variables post-hoc whose posterior inclusion probabilities (PPs) were negatively correlated, because this suggests they ‘competed’ for inclusion in the model.
Figure 2 and Table 2 show that post-hoc grouping revealed signals that were weaker in pre-defined groupings, such as Group D (PP = 99.5, p⋆ = 10−4.75), which captured aspects of obesity by combining the individual variables 48 Waist circumference (cm) (PP = 89.0, p⋆ = 10−3.22), 21001 Body mass index (BMI) (Kg/m2) (PP = 5.5, p⋆ > 0.02) and 23104 Body mass index (BMI) (Kg/m2) (PP = 5.0, p⋆ > 0.02). In contrast, prior Group 10 (PP = 89.0, p⋆ = 10−3.22) had only combined 48 Waist circumference (cm) with the non-significant variables 21002 Weight (Kg) and 23098 Weight (Kg) (Table 1).
Some post-hoc groups overlapped the pre-defined groups but dropped non-significant variables that did not contribute to the significance of the group. For example, Group H (PP = 99.2, p⋆ = 10−4.48), contained only the three most significant deprivation scores of the eight members of Group 2. Other groups revealed new connections between variables, such as Group I (PP = 95.7, p⋆ = 10−3.69), which combined the individually non-significant K59.0 Constipation (PP = 66.1, p⋆ = 10−2.55) and N39.0 Urinary tract infection, site not specified (PP = 58.4, p⋆ = 10−2.40), a combination that might reflect underlying kidney disease.
The post-hoc grouping of 41218 History of psychiatric care on admission : 8 : Not applicable with I10 Essential (primary) hypertension was at first glance surprising from the field descriptions (Group G: PP = 99.3, p⋆ = 10−4.55). However, the former variable indicates a history of non-psychiatric hospital care. This suggests it may act, in a manner interchangeable with I10, as a proxy for a history of underlying poor physical health. The direct effect of both variables was to increase risk of COVID-19 hospitalization (Table 2).
The ability to quantify the evidence for groups of variables offers an alternative to approaches such as pre-analysis selection of representative candidate variables among groups of correlated variables. Doublethink permits all and any groups of variables to be tested while controlling the FDR and FWER. This presents new possibilities for identifying significant groups, and the identification of these groups may help with the interpretation of the role in the individual variables in the outcome.
Comparison to the literature on COVID-19 outcomes in UK Biobank
Since early in the COVID-19 pandemic, before the discovery of effective treatments, there were intense research efforts to understand susceptibility to infection, disease and poor outcomes. Many focused on large established cohorts like UK Biobank that could rapidly link to data on SARS-CoV-2 testing (34), COVID-19 hospitalization (35) and mortality (36). Since then, many risk factors have been reported, including smoking (37, 38, 39, 40, 41, 42), diabetes (38, 43, 44, 45), asthma (46, 47) and vitamin D (48, 49) as predisposing to worse outcomes. We compared our results to the literature on COVID-19 in UK Biobank to identify any differences to standard approaches and find new insights. At the time of analysis, we identified 127 comparable studies through Web of Science. We manually assigned the most common risk factors in published analyses of COVID-19 outcomes to larger categories for comparison to the variables and groups listed in Tables 1 and 2, which we assigned to the same list of categories.
Table 3 shows the most common categories of risk factors included in published analyses of COVID-19 outcomes in the 127 UK Biobank studies. Two summaries are shown: the percentage of papers and the percentage of abstracts in which each category of risk factors appeared. Alongside we show the evidence from our analysis, with values of PP < 50% (corresponding to p⋆ > 10−2.20) omitted, since the Bayesian interpretation is this represents evidence against those risk factors.
Age, Sex, Obesity, Ethnicity, Socioeconomic status (including deprivation indices) and Smoking were included in more than 66-90% of published analyses. These well-established risk factors were mentioned in 6-20% of abstracts. Our analysis supported direct effects of all these categories with PP g 99.5% and p⋆ f 10−4.75 except ethnicity (Supplementary Table S4, S5). Post-hoc Group J had strong support (PP = 94.1, p⋆ = 10−3.54) but combined self-reported ethnicity and country of birth with geographic measures of pollution. However, pre-defined Group 11, which did not contain pollution metrics, was not significant at τ = 10 and α = 10-3.3, despite the evidence being suggestive (PP = 80.0, p⋆ = 10−2.89).
Other reasonably common categories of risk factor for which our analysis found evidence of direct effects included Lung disease, Alcohol intake, General comorbidity, Kidney disease and Educational attainment. Risk factors in these categories featured in 28-46% of published analyses and 4-9% of abstracts. Our analysis supported these categories with PP g 95.7% and p⋆ f 10−3.69.
Many categories of risk factors that appeared commonly in published analyses received no significant support for direct effects in our analysis. Diabetes, Cardiovascular disease and Hypertension were notable for inclusion in 54-63% of published analyses, and 10-12% of abstracts. No variables or groups of variables corresponding to these categories received support for direct effects in our analyses (PP < 50%, p⋆ > 10−2.20). However, no evidence of a direct effect does not imply no evidence of an effect. These common diseases contribute to a general decline of health, and it is possible that their effects were mediated through pathways better represented by variables or groups we categorised under General comorbidity, such as 137 Number of treatments/medications taken and Group G.
Several notable categories of risk factor that we found to have significant direct effects were rarely included in published analyses of COVID-19 outcomes in UK Biobank. Variables representing Psychiatric disorders, Infection, Dementia and Aging were included in 9-15% of published analyses, and 2-9% of abstracts, whereas we found strong evidence of direct effects of variables we assigned to these categories (PP g 99.4% and p⋆ f 10−4.65), including 41214 Carer support indicators : 1 : Yes (which we categorised under Psychiatric disorders), J22 Unspecified acute lower respiratory infection (Infection), F03 Unspecified dementia (Dementia) and R29.6 Tendency to fall, not elsewhere classified (Aging). Therefore a model-averaging big data analysis that accounts for widespread correlations among variables and uncertainty in variable selection can bring new insight to our understanding of well-studied health outcomes like COVID-19 hospitalization in UK Biobank.
Discussion
Doublethink for discovery of non-genetic risk factors
Bayesian model averaging is an approach especially suited to accounting for model uncertainty, such as which variables directly affect an outcome. Here we showed that appropriate construction of the prior allows simultaneous control of the Bayesian FDR and the frequentist FWER, facilitating its use in fields, like epidemiology, where classical statistics predominates. This allowed us to screen 1,912 variables for direct effects on COVID-19 hospitalization and adjust for confounders via model averaging without the need to specify a candidate risk factor (primary exposure) nor select variables in a pre-analysis step such as a univariable scan, stepwise regression or machine learning. Instead of attempting to identify independent variables – a near-impossible task in biobank scale data where correlation is pervasive – we performed tests on groups of correlated variables, which can increase power and, in some cases, interpretability.
This work offers a new approach at a time when there are increasing calls for exposome-wide association studies (e.g. 10). With some exceptions (9, 12, 45, 50, 51, 52, 53), agnostic exposome-wide approaches to discovering new risk factors were absent from published studies of COVID-19 outcomes in UK Biobank (Table S1). Perhaps that is particularly surprising given that COVID-19 was a new disease in which all risk factors were initially unknown.
The approach of this paper had numerous limitations. Principally, we did not assess the total effect of a variable on the outcome, only the direct effect. In some applications it is necessary to estimate the total effect to understand the likely impact of an intervention on the outcome. The direct effect can differ in magnitude and direction to the total effect, and confusing the two is a pitfall known as the Table 2 fallacy (8). Focusing on direct effects therefore limits the interpretation of our conclusions. In particular, we are unable to predict the effect on the outcome of a hypothetical intervention in the study population.
Importantly, we cannot conclude that no evidence of a direct effect implies no evidence of an effect. For example, hypertension was mentioned in 11% of abstracts and included in 54% of published analyses of COVID-19 outcomes in UK Biobank, but we did not find significant evidence of its direct effect on hospitalization. This does not rule out an indirect effect mediated through another variable, such as a general decline in health. Several fields captured general comorbidity, including 137 Number of treatments/medications taken. Not only might they mediate indirect effects, but fields like 137 that pool, aggregate or summarize data from several other sources might be favoured for inclusion by the sparsity-inducing prior, which imposed a penalty μ on every additional parameter. This ability of the Bayesian prior to influence the final results is inevitable and exists despite the ability to control the frequentist FWER.
In practice, the main considerations for running Doublethink are (i) preparation of the outcome data and variables, (ii) choice of hyper-parameters and (iii) computational feasibility. (i) As demonstrated with the UK Biobank analysis, it is not necessary to filter variables a priori to find ‘independent’ sets, usually an impossible task. Instead, variables can be grouped post hoc to sidestep correlation and detect signals. Data quality control is still paramount. (ii) To choose the hyper-parameters, the main consideration is the average number of variables expected to be included in the model, which determines µ; we chose vμ/(1 + μ) = 10. For many purposes, the unit information prior of h = 1 will suffice. For non-Bayesians, these hyper-parameters determine the performance envelope of the analysis, with performance optimal when the ‘truth’ resembles the prior. The aim of the paper is the method should still be useful, and the p-values theoretically well calibrated, at other times. (iii) Computation is the major limitation. The analysis of v = 1,912 variables in n = 201,912 UK Biobank participants required 100,000 iterations of 100 independent chains running for 35 hours each. This is substantially slower than many machine learning algorithms.
The Monte Carlo Markov Chain approach pursued here was computationally intensive, despite restricting our attention only to direct effects. Requiring 3500 CPU hours, its feasibility relied on efficiency gains stemming from (i) asymptotic approximations motivated by an assumption of large sample size and (ii) a convenient prior. The computational demands of the approach prevented us from investigating important phenomena like interactions between variables and non-linear effects such as time-since-exposure. With a less computationally expensive approach, we might have investigated other potential risk factors with large numbers of rare variables, such as occupation and use of specific medicines.
In an analysis of non-genetic direct risk factors, there is no possibility of a definitive approach, even for an agnostic scan. Partly, this is because direct effects are only defined relative to a fixed set of variables: conceptually, a direct effect could be mediated through one or more downstream variables that were not measured or included. Moreover, no method is free of data curation. This includes choice of the exposure variables to uphold quality control, avoid reverse causation and avoid collider bias (we restricted analysis to pre-2020 exposures), and choice of outcome (we restricted attention to 2020 given the time-varying dynamics and likely impact of vaccination status, which we did not know). Methods to impute missing values, handle repeat measures and encode factors can all impact the final results.
Doublethink and simultaneous Bayesian/frequentist hypothesis testing
Other theoretical considerations that may limit the applicability of Doublethink include assumptions of large sample sizes and a specific family of priors (or ‘random effects’) parameterized by µ and h. The prior covariance on the coefficients (β), based on Fisher information, is hard to justify except through its convenience for pursuing joint Bayesian/frequentist inference. This motivating aim is only achieved theoretically as n becomes arbitrarily large. So strictly speaking that aim is not truly met, meaning the theoretical properties of the p-value may not hold precisely, leading to inflation or deflation, particularly when p is not small, and reducing robustness to poor choices of h and particularly µ. The theory contains a technical contradiction, because h is initially assumed to scale with n (the local alternatives assumption), allowing the posterior odds to be written in terms of the maximized likelihood ratio, but later h is assumed to be constant with respect to n, which affords simplifications in deriving a p-value for the model-averaged posterior odds (23, 24).
Beyond its practical utility, Doublethink has wider implications for bridging the gap between Bayesian and classical philosophies to scientific inference. First, Lemma 1 showed that the Bayesian approach to testing a collection of null hypotheses βj = 0, j ∈ V, in which the null hypothesis is rejected when the posterior odds exceed a fixed threshold τ, is a closed testing procedure (20), which therefore controls the frequentist FWER in the strong sense at or below some level α. This result is general and does not depend on the Doublethink model. Importantly, it disproves the idea that the FWER is a fundamentally non-Bayesian quantity, inherently more stringent than the FDR, which the Bayesian approach controls at or below level 1/(1 + τ).
Second, Theorem 2 gave an analytic expression for α, the level at or below which the FWER is controlled in the strong sense, asymptotically under the Doublethink model. From there, we could interconvert model-averaged posterior odds and adjusted p-values, and equivalently, we could interconvert FWER and FDR thresholds. This affords insights by allowing frequentist multiple testing thresholds to be understood in terms of Bayesian prior assumptions. According to Theorem 2, α is asymptotically proportional to v μ (h/n)1/2/ τ. That is, asymptotically proportional to (i) the number of variables, v, which underlies Bonferroni correction, (ii) the prior odds, µ, of the alternative hypothesis versus the null, and (iii) the square root of the prior precision h, higher values of which make the alternative hypothesis more similar to the null; and inversely proportional to (iv) the Bayesian threshold τ, and (v) the square root of the sample size n. The relationship ⍺ = ⍺0 v μ (h/n)1/2, for some constant α0, offers a resolution to the classical paradox about multiple testing (54): should I vary α0 to correct for the number of tests in an analysis, the number of tests in the whole paper, the number of tests I perform in my career, or the number of tests in the scientific literature? The Bayesian response is to fix α0 not α, that is to fix the FDR and allow the FWER to vary depending on v, µ, h and n. As n grows, this controls the FWER far more stringently than the FDR anyway.
Third, Theorem 2 revisits the Jeffreys-Lindley paradox (55) by emphasizing a principal difference between Bayesian and frequentist hypothesis tests, not in philosophical issues like the treatment of parameters as fixed or random, but in the practical choice of significance threshold. The common practice of fixing the FWER irrespective of n, e.g. at 0.05 or 0.005 (56), leads to tests that are inconsistent under the null, because there is a tangible probability, α, of wrongly rejecting the null hypothesis even for arbitrarily big data (57). This is solved by varying α with n-1/2, an alternative starting point from which one could calculate either the FWER or the FDR.
The FDR considered here is related to, but distinct from, some FDR concepts common in the literature. First, we control the Bayesian FDR, rather than a frequentist FDR controlled by procedures like Benjamini and Hochberg’s (58) or Storey’s (59). Second, we control the local FDR (19), meaning we only call an individual variable significant when its posterior inclusion probability exceeds the threshold, τ/(1 + τ). Often, frequentist FDR procedures call as many individual variables significant as possible such that the mean FDR is controlled. On average this will reject more individual variables, but there are two caveats. First, it allows individual variables with lower posterior probability to be called significant, meaning we reject the null hypothesis that their direct effects are zero. Second, it may still miss variables whose individual significance has been diluted by correlation with other variables. As we have seen, such signals can be recovered by testing groups of variables. When a group of variables is called significant, we reject the null hypothesis that all their direct effects are zero, without necessarily pinpointing which variables have a non-zero direct effect.
Further research is needed to determine the generalizability of some of our theoretical findings beyond the Doublethink model. The form of Theorem 2 depends on the model choice prior, in which µ is fixed. The impact of co-estimating µ requires attention. The prior on the coefficients also matters. On one hand, Doublethink may be general in that log Bayes factors for nested hypothesis tests converge asymptotically to the Schwarz criterion (27, 57), which Doublethink recapitulates when n is large. On the other hand, the derivation of Theorem 2 relies heavily on the theory of regular variation (28, 29, 60). In Doublethink, the posterior odds are regularly varying random variables, but slowly varying posterior odds arise in non-nested settings (61, 62). The interconversion of p-values and posterior odds (or equivalently FWER and FDR) should then behave quite differently.
Doublethink is closely related to recent developments in combined hypothesis tests that exploit heavy tailed distributions such as the Cauchy combination test (63) and the harmonic mean p-value (HMP; 64). The HMP provides a model-averaging approach, starting with p-values, whereas Doublethink pursues joint Bayesian/frequentist model-averaging beginning with nested maximized likelihood ratios. These are closely related, but a theoretical advance over the HMP is the ability of Doublethink to average over uncertainty in the null hypothesis, as well as the alternative hypothesis. An interesting result from Doublethink is that under the null hypothesis, the model-averaged deviance asymptotically follows a chi-squared distribution with one degree of freedom. This mirroring of the null distribution of the classical likelihood ratio test statistic emerges from the self-similarity or ‘fractal’ property of sums of heavy tailed random variables. Moreover, the model-averaged deviance could be interpreted instead of the posterior odds or Bayes factor, which are strongly influenced by the prior (17).
Doublethink, like the HMP, enables us to reconsider established positions concerning the philosophy and practice of hypothesis testing. In particular, the multilevel nature of these tests, in which all possible combinations of hypotheses are simultaneously controlled via pre-determined thresholds, like (65), supports refinements to concepts like fishing for significance, data dredging and p-hacking (66). For a fixed set of predetermined null hypotheses, Doublethink allows us to search in arbitrary ways for significant groups of variables, without impacting the FDR and, at least asymptotically, the strong-sense FWER. Since exhaustive searches are not generally practicable, the methods by which signals are sought through grouping variables become important. We examined just two possible methods of grouping variables, but the strong impact of the groupings on the relative prominence of signals in the data means more work is required in this area. From these insights and through new avenues of research, this work has the potential to help advance scientific discovery and bridge the differences between Bayesian and classical hypothesis testing.
Data Availability
To access UK Biobank data, researchers must register and submit a research application (https://www.ukbiobank.ac.uk/register-apply). Registration is open to all bona fide researchers for all types of health-related research that is in the public interest. The registration and application process ensures researchers and projects meet UK Biobank's obligations to its participants and funders.
Funding
This work was funded by the Robertson Foundation, the Wellcome Trust and the Royal Society (grant no. 101237/Z/13/B). This study was supported by the National Institute for Health Research (NIHR) Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance (NIHR200915), a partnership between the UK Health Security Agency (UKHSA) and the University of Oxford. The views expressed are those of the author(s) and not necessarily those of the NIHR, UKHSA or the Department of Health and Social Care. The research was supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
Supplementary Material
Table S1 Literature review of published papers analysing COVID-19 outcomes in UK Biobank, containing fields from Web of Science and curated with information on the type of analysis, exclusion criteria, and outcomes and exposures included in the abstract and analysis.
Table S2 All UK Biobank fields included in the analysis, annotated by field or ICD-10 code, UK Biobank or ICD-10 description, level (if a factor) and numeric encoding in R.
Table S3 Synonyms and categories of variables used in the interpretation of the literature review of published papers analysing COVID-19 outcomes in UK Biobank.
Table S4 Categories applied to results in Table 1, prior groupings, for comparison to the literature review
Table S5 Categories applied to results in Table 2, post hoc groupings, for comparison to the literature review
Acknowledgements
The authors wish to thank Naomi Allen, Jeff Chen, Steven Lin, Gil McVean, Tim Peto and Sarah Walker for comments and advice, and the Mathematisches Forschungsinstitut Oberwolfach, organizers and participants of workshop 2308 Design and Analysis of Infectious Disease Studies.