Abstract
Introduction Artificial intelligence holds promise for individualized medicine. Yet, transitioning models from prototyping to clinical applications poses challenges, with confounders being a significant hurdle. We introduce a two-dimensional confounder framework (Confound Continuum), integrating a statistical dimension with a biomedical perspective. Informed and context-sensitive confounder decisions are indispensable for accurate model building, rigorous evaluation and valid interpretation.
Methods Using prediction of hand grip strength (HGS) from neuroimaging-derived features in a large sample as an example task, we develop a conceptual framework for confounder considerations and integrate it with an exemplary statistical investigation of 130 candidate confounders. We underline the necessity for conceptual considerations by predicting HGS with varying confound removal scenarios, neuroimaging derived features and machine learning algorithms. We use the confounders alone as features or together with grey matter volume to dissect the contribution of the two signal sources.
Results The conceptual confounder framework distinguishes between high-performance models and pure link models that aim to deepen our understanding of feature-target relationships. The biological attributes of different confounders can overlap to varying degrees with those of the predictive problem space, making the development of pure link models increasingly challenging with greater overlap. The degree of biological overlap allows potential confounders to be sorted along a conceptual Confound Continuum. This conceptual continuum complements statistical investigations with biomedical domain knowledge, represented as an orthogonal two-dimensional grid.
Exemplary HGS predictions highlighted the substantial impact of confounders on predictive performance. In contrast, choice of features or learning algorithms had considerably smaller influences. Notably, models using confounders as features often outperformed models relying solely on neuroimaging features.
Conclusion Our study provides a confounder framework that combines a statistical perspective on confounders and a biomedical perspective. It stresses the importance of domain expertise in predictive modelling for critical and deliberate interpretation and employment of predictive models in biomedical applications and research.
Short description The paper explores the challenges of transitioning predictive models from scientific prototyping to clinical use, with a focus on the significant impact of confounders. Using the example of predicting hand grip strength in the UK Biobank, the study introduces a framework that integrates statistical and biomedical perspectives on confounders, emphasizing the vital role of informed confounder decisions for accurate model development, evaluation and interpretation.
1. Confounders in precision medicine
Artificial intelligence (AI) holds promise for personalized medicine and is increasingly employed in biomedical research and applications. Machine Learning (ML) workflows use large, high-dimensional and multimodal data to arrive at predictive models to identify biomarkers of health and disease or to aid in diagnosis, prognosis and treatment choice, targeted to individuals1–3. For instance, deep learning-based models showed promising results for improved cancer diagnosis, subtyping and staging4. Beyond cancer, (chronic) inflammatory diseases stand as significant global contributors to mortality. AI has proven promising in enhancing inflammatory disease risk prediction and facilitating personalized early interventions5. In the field of psychiatry, predictive modelling with neuroimaging data has demonstrated the potential to outperform DSM/ICD-based diagnoses6. However, translation of promising models to real-world clinical applications remains challenging, a gap sometimes referred to as the AI chasm7–10. The AI chasm stems from unreliable predictions11–14, challenges with reproducibility and replicability, non-interpretability8, and limited generalizability15 of models (for further challenges see e.g. 3,7,12,16,17). Confounding effects contribute significantly to these concerns through misleading predictions and interpretations, thereby exacerbating the AI chasm18–20.
In a predictive modelling context, confounders are variables that correlate with features and targets, but are not of primary interest or may even introduce misleading associations21,22 (see e.g.22–28 for in-depth technical elaborations). Confounders can influence predictions, especially when they carry a strong signal about the target. For example, in a neuroimaging context, a model predicting hand grip strength (HGS) from neuroimaging derived features could be primarily driven by sex, i.e. men on average being stronger than women. Other classical examples of confounders include measurement artifacts27,29–31, site effects32, demographics33–35, or lifestyle factors36.
It is essential to deal with confounding effects to obtain models that give valid scientific insights and models that can be deployed in clinical practice. Established tools to control for confounders at the level of study design, such as randomized control trials, restriction or matching23, may not be feasible in observational data6,20,22,37. Consequently, post-hoc statistical approaches, such as (linear) confounder regression are commonly used18,19,24,27,38–40. Alternatively, the contribution of confounders can be quantified by including them as predictors18,27,41.
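As an illustration of the post-hoc linear confounder regression mentioned above, the following is a minimal sketch on synthetic data, not the pipeline used in this study. It assumes NumPy is available; the function names `fit_confound_regression` and `remove_confounds` and all numbers are hypothetical. The confounder model is estimated on the training split only and then applied to both splits, so that no test information enters the adjustment.

```python
import numpy as np


def fit_confound_regression(X_train, C_train):
    """Fit one linear model per feature: feature ~ intercept + confounders."""
    design = np.column_stack([np.ones(len(C_train)), C_train])
    coefs, *_ = np.linalg.lstsq(design, X_train, rcond=None)
    return coefs


def remove_confounds(X, C, coefs):
    """Subtract the confounder-predicted part from each feature."""
    design = np.column_stack([np.ones(len(C)), C])
    return X - design @ coefs


# Synthetic stand-in data (not UK Biobank): one binary confounder, three features.
rng = np.random.default_rng(0)
n = 400
confound = rng.integers(0, 2, size=n).astype(float).reshape(-1, 1)
features = 2.0 * confound + rng.normal(size=(n, 3))

train, test = np.arange(0, 300), np.arange(300, n)
coefs = fit_confound_regression(features[train], confound[train])
X_train_clean = remove_confounds(features[train], confound[train], coefs)
X_test_clean = remove_confounds(features[test], confound[test], coefs)

# Least-squares residuals are uncorrelated with the confounder on the training split.
max_abs_corr_train = 0.0
for j in range(X_train_clean.shape[1]):
    r = np.corrcoef(X_train_clean[:, j], confound[train, 0])[0, 1]
    max_abs_corr_train = max(max_abs_corr_train, abs(r))
print(f"max |corr| with confounder after removal (train): {max_abs_corr_train:.2e}")
```

On the test split the residual correlation is generally small but not exactly zero, because the coefficients were estimated on different data.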
In many biomedical disciplines it is common to correct for a conventionally established set of confounders18,42,43. While for instance in the field of genetics it is common to adjust for a broader set of confounders, in neuroimaging studies sex and age are most prevalently considered12,44. This reliance on convention, however, risks overlooking other potential confounders. Overlooking confounders or insufficient removal of their signal contributions can lead to overestimated effects because predictions are driven by confounding signals rather than the actual signal of interest18. Conversely, removal of too many confounders can eliminate signal of interest and lead to unstable models32,45,46. Adjusting for a variable that is actually a consequence of the features (i.e. not a real confounder) may even induce a non-existent association (Berkson’s paradox)18,47–49. Overall, generic, conventional and non-contextual confound removal (of too many or too few variables) can result in suboptimal models19,22. Consequently, it is important to identify confounders that align with the goals of the modelling task at hand18,19,22. Furthermore, even if a suitable set of context-specific confounders is identified, it often remains unclear, particularly in a research context, whether a “vanilla” model with no confound removal or a confounder-adjusted model should be preferred. Taken together, suboptimal treatment of confounding contributes to the challenges of transitioning models from development to clinical applications, aggravating the AI chasm.
The goal of this paper is to emphasize a better understanding of confounders for a given research endeavour. We elaborate on the necessity of acknowledging domain expertise and biomedical knowledge to form a biomedical dimension of confounder considerations. We introduce a two-dimensional (2D) grid (Confound Continuum), of which the horizontal axis acknowledges the degree of biomedical impact of a confounder on a predictive problem, while the vertical axis evaluates the statistical impact. Adopting such an integrated perspective of statistical and biomedical confounder considerations fosters informed, context-sensitive decisions on confounders as an indispensable step towards accurate and valid model development and interpretation. Our aim is to encourage critical and deliberate employment of AI, both for medical applications and biomedical research.
2. Defining the context of confound removal
2.1. The statistical context of an exemplary GMV-HGS prediction
We illustrate concepts with the example prediction of hand grip strength (HGS) from grey matter volume (GMV) features in the UK Biobank50. HGS is an ideal target variable for this demonstration. It is reliable51,52 and eliminates further complexities associated with latent target measures such as intelligence or executive functioning scores. Additionally, HGS is an objective and cost-effective assessment commonly used in clinical settings53.
Commonly, the relevance of a set of candidate confounders is determined by assessing their statistical association with the data. Variables with strong associations or high shared variance are considered as confounders in the predictive analysis. Understanding such associations is crucial because removing confounders without shared signal may inadvertently introduce confounder information into features or target54. Conversely, removing confounders with high shared variance may enhance the signal-to-noise ratio of the feature-target relationship.
Mimicking such a statistical approach, we exemplarily correlated 130 candidate confounders from the UKB with both HGS and GMV (Figure 1). The correlations revealed that mostly body composition measures, sex and respiratory variables were associated with either the target HGS or the GMV features. Variables such as “length of the working week in the main job”, “systolic blood pressure”, “age” and “bone density” exhibited medium to small correlations with HGS or GMV. For a more comprehensive statistical investigation of confounders in the UKB see e.g.18.
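The statistical screening described above can be sketched as follows. This is a simplified illustration on synthetic data, assuming NumPy; the helper `screen_confounders`, the candidate names and the effect sizes are hypothetical and do not reproduce the analysis behind Figure 1. For each candidate it reports the Pearson correlation with the target and the largest absolute correlation with any feature.

```python
import numpy as np


def screen_confounders(candidates, target, features):
    """Per candidate: correlation with the target and the largest
    absolute correlation with any single feature."""
    summary = {}
    for name, values in candidates.items():
        r_target = np.corrcoef(values, target)[0, 1]
        r_features = []
        for j in range(features.shape[1]):
            r_features.append(np.corrcoef(values, features[:, j])[0, 1])
        summary[name] = {
            "r_target": float(r_target),
            "max_abs_r_feature": float(np.max(np.abs(r_features))),
        }
    return summary


def rank_by_target_association(summary):
    """Candidate names ordered by decreasing |r| with the target."""
    return sorted(summary, key=lambda k: abs(summary[k]["r_target"]), reverse=True)


# Synthetic stand-ins; names and effect sizes are illustrative only.
rng = np.random.default_rng(1)
n = 1000
strong = rng.normal(size=n)      # drives both target and features
unrelated = rng.normal(size=n)   # independent of everything else
target = 1.5 * strong + rng.normal(size=n)
features = strong[:, None] + rng.normal(size=(n, 4))

summary = screen_confounders(
    {"strong_candidate": strong, "unrelated_candidate": unrelated},
    target,
    features,
)
ranked = rank_by_target_association(summary)
for name in ranked:
    print(name, summary[name])
```

Such a ranking only captures linear association; it says nothing about whether removing a variable is biomedically meaningful.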
2.2. High-performance versus pure link approach motivates a conceptual context of confound removal
To develop unbiased models, beyond statistics, it is crucial to understand the context of a prediction task. We therefore introduce the distinction between a high-performance model and a model aimed at investigating a pure link as overarching research goal. Both setups require careful consideration of confounders.
The high-performance approach aims to achieve accurate predictions by utilizing all available information irrespective of its origin. Here, confounders may even be included as features if they improve model accuracy. It nevertheless remains essential to satisfy the fundamental assumption of predictive modelling that training and testing data are drawn from the same distribution and are independent and identically distributed (iid). Satisfying the iid assumption avoids sampling bias and helps build generalizable models that can apply patterns learned from the training set to unseen testing data. Otherwise, a model may perform well on training but fail on testing data, exacerbating the AI chasm. Differences in training and testing data distribution are sometimes referred to as data distribution shift55. For instance in healthcare applications, differences in patient demographics or medical practices between hospitals can cause such a shift. Covariate shift is a specific form thereof, where particularly the distribution of the independent variables (features and/or confounders) changes56. To avoid shift-related issues, even in the high-performance setting, training and testing data must be comparable in their key characteristics, including their relationship with confounders.
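One simple way to check whether training and testing data are comparable with respect to a confounder is to compare its distribution across the two sets. The sketch below uses the standardized mean difference as one possible summary; this choice, the function name and the simulated ages are assumptions made for illustration and are not part of the present study. How large a difference is tolerable remains a judgment call that depends on the application.

```python
import numpy as np


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (a.mean() - b.mean()) / pooled_sd


# Simulated ages for three hypothetical cohorts.
rng = np.random.default_rng(2)
age_train = rng.normal(loc=55.0, scale=7.0, size=500)       # training cohort
age_test_same = rng.normal(loc=55.0, scale=7.0, size=500)   # comparable test cohort
age_test_older = rng.normal(loc=65.0, scale=7.0, size=500)  # shifted test cohort

smd_same = standardized_mean_difference(age_train, age_test_same)
smd_shifted = standardized_mean_difference(age_train, age_test_older)
print(f"SMD, comparable cohorts: {smd_same:+.2f}")
print(f"SMD, shifted cohort:     {smd_shifted:+.2f}")
```

A mean-based summary can miss shifts in variance or shape, so in practice it would be combined with a look at the full distributions.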
The pure link setting aims to deepen our understanding of specific feature-target relations by discovering systematic, biologic mechanisms underlying the feature-target interactions. Such models selectively utilize specific aspects of the available information in the data. Concretely, this approach prioritizes the signal components in the features that hold biomedical meaning to predict the respective outcome, such as a phenotype, behaviour or disease, but aims to exclude encoded information of confounders in the biomedical feature signal (e.g. neuroimaging-derived features). By doing so, it aims to uncover the “pure” biology of the problem space and contribute to a broader comprehension of biomedical mechanisms.
However, achieving such “purity” becomes an idealized goal when dealing with biologically highly linked confounders. To illustrate this challenge, we consider two of the statistically evaluated potential confounding variables for the GMV-HGS prediction task: “Length of working week in the main job” and “sex” (Figure 2A). Unlike “length of working week”, “sex” significantly overlaps in its biological attributes with those of the GMV-HGS problem space, i.e. “sex” and the problem space have a high “shared biology” (not to be confused with a high shared variance in a statistical sense). From biomedical domain knowledge it is known that sex influences testosterone levels, which in turn affect muscle growth, and muscle mass determines HGS. In the pure link setup, confound removal is expected to preserve all meaningful connections between GMV and HGS while eliminating the unwanted influence of confounders, thereby yielding the “pure” biology of the problem space (Figure 2A bottom: middle & left). This expectation of “purity” can be fulfilled for non-overlapping variables, such as “length of working week” (which in this extreme would then not be considered a confounder). However, removing highly overlapping variables, such as “sex”, results in a new (artificial) set of biological attributes of the GMV-HGS problem space (Figure 2A bottom: right, non-circular red outline). This artificial shape is biologically ambiguous and challenging to interpret. Consequently, the more a confounder overlaps in its biology with the problem space, the less its removal can lead to “purity”. This problem particularly arises in the biomedical field due to the low-dimensionality and interconnected nature of many biological phenomena and necessitates acknowledging a conceptual dimension of confound removal.
3. Integrating the statistical and conceptual level of confound removal
3.1. The Confound Continuum
Beyond the extreme examples of “length of working week” and “sex”, numerous further potential confounders exist for the GMV-HGS prediction task. Statistically, these can be ordered along a vertical axis based on increasing (absolute) strength of statistical association with the prediction task, as introduced in Figure 1.
Conceptually, further potential confounders can be ordered along a horizontal axis based on their increasing overlap with the biological attributes of the problem space (Figure 3A). On this continuum, “length of working week” exemplifies a low biological overlap or link, followed by the further potential confounder “bone density”. The latter likely has differing driving factors than both GMV and HGS, yet the possibility of a biological link cannot be entirely ruled out. Advancing in the direction of increasing overlap, “systolic blood pressure” potentially shares driving factors with HGS, such as physical fitness, without a clearly identified pathway. “Sex” and hormonal composition almost reflect a 1:1 mapping of the same underlying biology, forming an example of a high biological link.
Integrating this horizontal conceptual axis (Figure 3B, blue) with the vertical statistical axis (Figure 3B, red) creates a two-dimensional (2D) orthogonal space (Figure 3B), emphasizing the independence of conceptual and statistical considerations. This independence becomes particularly evident for the off-diagonal variables in the 2D grid (Figure 3B, grey shaded areas). For instance, although “systolic blood pressure” only correlated marginally with GMV, biologically both may be influenced by a third factor such as physical fitness. Conversely, “length of working week” was correlated with HGS yet lacks evident overlap of biological attributes. The statistical dimension determines the amount of shared signal and thereby either ensures that no confounder information is inadvertently introduced to the data (no shared signal) or reveals which variables’ removal may enhance the signal to noise ratio (high shared signal). While statistical evaluations are essential, they cannot address the semantic meaningfulness of removing confounders. Put differently, they cannot assess the biomedical validity of confound-adjusted features, targets and resultant models and predictions. In contrast, the conceptual dimension offers valuable insights in the achievable purity of a feature-target (here: brain-behavioural) link, complementing statistical approaches with domain expertise and biological knowledge. Together, these dimensions dissect the different roles of potential confounding variables for a specific predictive problem from complementary viewpoints.
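To make the two axes concrete, the sketch below places three of the variables discussed above on a coarse 2D grid. All numbers are placeholders chosen only to mirror the qualitative ordering described in the text; they are not values from this study. The biological overlap score is assumed to be assigned by domain experts, and the cut-offs `stat_cut` and `bio_cut` are arbitrary illustrative choices, since the continuum itself does not prescribe thresholds.

```python
def grid_region(abs_correlation, biological_overlap, stat_cut=0.1, bio_cut=0.5):
    """Name the region of the 2D grid for one candidate confounder.

    abs_correlation:    statistical axis, e.g. |r| with target or features
    biological_overlap: conceptual axis, expert-assigned score in [0, 1]
    """
    stat = "high statistical" if abs_correlation >= stat_cut else "low statistical"
    bio = "high biological" if biological_overlap >= bio_cut else "low biological"
    return f"{stat} / {bio}"


# Placeholder values for illustration only; NOT results from the study.
candidates = {
    "sex": (0.60, 0.95),
    "systolic blood pressure": (0.05, 0.60),
    "length of working week": (0.20, 0.05),
}

grid = {}
for name, (abs_r, overlap) in candidates.items():
    grid[name] = grid_region(abs_r, overlap)
    print(f"{name:26s} -> {grid[name]}")
```

The two off-diagonal regions ("low statistical / high biological" and "high statistical / low biological") correspond to the cases where statistical and conceptual considerations disagree.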
3.2. Nested and cascadic influences
Potential confounding variables may exhibit nested or cascadic overlaps with the problem space. For instance, sex and age demonstrate a nested overlap with the GMV-HGS problem-space (Figure 2B, top left, “no removal”). Adjusting for age preserves the shared area of sex, age and the problem space (Figure 2B, bottom right, “no age”: green area) because it is encompassed by the sex-problem-space overlap. Consequently, only a small section of the red problem space outline (visually speaking) is missing in Figure 2B (bottom right, “no age”), preserving most of the interpretability of the problem space. With sex adjustment, a comparable scenario emerges, but with a somewhat higher impact due to the larger overlap of sex-problem-space attributes (Figure 2B, top right, “no sex”). Adjusting for multiple confounders results in an additive removal effect in both nested and non-nested settings. However, in a non-nested scenario, removing one confounder alone would reveal the entire impact of its removal. In contrast, in the nested sex-age example, only the joint removal reveals their full impact (Figure 2B, bottom left, “no sex & age”).
Cascadic influences emerge because biomedical mechanisms usually form complex networks (see e.g.57,58 for a formulation using directed acyclic graphs). For example, sex and hormones influence body fat composition. However, body fat composition in conjunction with sex can also influence hormones59, which then can affect the biological cascade sex → testosterone → muscle growth → HGS. Body fat compositions may further overlap with respiratory performance, shaping additional factors such as physical fitness. Consequently, even seemingly unrelated variables may indirectly impact the actual relationship between GMV and HGS.
In summary, statistical and conceptual evaluations of confounder influences are independent but can be integrated as a two-dimensional grid – the Confound Continuum. This framework emphasizes that biomedical and statistical validity are distinct but complementary concepts to enhance our understanding of the role of confounders in a predictive task. The Confound Continuum can facilitate informed decisions on confound removal, acknowledging problem-specific nuances.
4. Confound removal can influence predictions more than feature or algorithm choice
To illustrate the importance of considering confounding variables in predictive workflows, we conducted the GMV-HGS prediction based on cortical60, subcortical61 and cerebellar62 GMV features. The “vanilla” model, without removing confounders and using a linear support vector regression (SVR), yielded a coefficient of determination of R2 = 0.39 and a Pearson correlation between true and predicted HGS of r = 0.63 (Figure 4A, left). We compared this “vanilla” model with models that linearly regressed out confounders prevalent in the field (scan-site, age, and sex)12,44. Additionally, we examined the combined effect of sex and age to illustrate a nested (additive) scenario. The scan-site adjusted model performed similarly to the vanilla model (R2 = 0.40, r = 0.64). Adjusting GMV for sex substantially reduced performance (R2 = 0.03, r = 0.20), while age adjustment had no effect (R2 = 0.39, r = 0.63). However, removing both sex and age resulted in a pronounced drop in performance (R2 = -0.0, r = 0.08, Figure 4A, right), suggesting a nested additive scenario where regressing out sex revealed the signal contributions of age in GMV.
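Because the results above report both R2 and r, it may help to recall that the two metrics measure different things: r captures how well predictions follow the ordering and linear trend of the true values, whereas R2 (computed as 1 - SS_res / SS_tot) also penalizes errors in offset and scale and can therefore become negative. The toy example below, with made-up numbers and assuming NumPy, shows predictions that are perfectly correlated with the truth yet have a strongly negative R2.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Toy numbers: predictions are an exact linear function of the truth
# (y_pred = 3 * y_true - 70), i.e. right ordering but wrong scale and offset.
y_true = np.array([30.0, 32.0, 34.0, 36.0, 38.0])
y_pred = np.array([20.0, 26.0, 32.0, 38.0, 44.0])

r = pearson_r(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(f"Pearson r = {r:.2f}, R2 = {r2:.2f}")  # Pearson r = 1.00, R2 = -3.50
```

This is why a model can show a small positive r alongside an R2 at or below zero on held-out data.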
The choice of both features and learning algorithm plays a crucial role in neuroimaging predictive modelling. Features should provide sufficient information about the target variable, and different learning algorithms can capture different aspects of the feature-target relationship (e.g. linear vs. non-linear relations). Therefore, in neuroimaging predictive workflows, the features and learning algorithms are often tweaked to explore whether other neuroimaging derivatives carry a stronger signal about the target or whether other learning algorithms can better detect the relationship. In our example, using functional connectivity (FC) features instead of GMV maintained comparable accuracy (R2 = 0.34, r = 0.58, Figure 4C), while cortical thickness (CT) features performed worse (R2 = 0.13, r = 0.36). Tweaking the learning algorithm or its fine-tuning had minimal impact (Figure 4C). Importantly, these influences were observed without confound removal. Thus, the lower performance of CT does not necessarily indicate that it contains less information about HGS but could imply that CT carries less information about sex (and age) compared to GMV and FC.
To validate these findings, we additionally used confounders directly as features, with and without neuroimaging-derived features (Figure 5). Age and sex together as features (without “brain” features) outperformed models solely based on neuroimaging derived features (R2 = 0.60, r = 0.77, Figure 5C, left). Adding GMV or CT to “sex & age” or “sex” as confound-features did not improve accuracy (Figure 5B & C). Incorporating FC alongside these two confound-feature setups even resulted in slight performance drops (R2 = 0.37, r = 0.65 and R2 = 0.36, r = 0.64, respectively, Figure 5B & C, right). In contrast, all brain-derived features contributed meaningfully to age as a confound-feature (Figure 5A). These insights align with Figure 4, emphasizing that the high performance of the HGS “vanilla” model is strongly driven by neural encodings of sex (and age) in the neuroimaging derived features.
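The logic of using confounders as features can be sketched with a small simulation. The example below is purely synthetic and deliberately extreme: the target depends only on a binary confounder, and the "brain" features carry nothing but a noisy copy of that confounder. It assumes NumPy and uses ordinary least squares with a single hold-out split instead of the SVR and evaluation scheme of the study; the helper `holdout_r2` and all effect sizes are hypothetical.

```python
import numpy as np


def holdout_r2(X_train, y_train, X_test, y_test):
    """Fit ordinary least squares on the training split, score R2 on the test split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_test - A_test @ beta
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return float(1.0 - np.sum(residual ** 2) / ss_tot)


# Synthetic data (not UK Biobank): "sex" drives the target; the "brain"
# features only encode sex with noise and hold no independent target signal.
rng = np.random.default_rng(3)
n = 2000
sex = rng.integers(0, 2, size=n).astype(float)
hgs = 10.0 * sex + rng.normal(scale=3.0, size=n)
brain = 0.5 * sex[:, None] + rng.normal(size=(n, 5))

tr, te = slice(0, 1500), slice(1500, n)
feature_sets = {
    "confounder only": sex[:, None],
    "brain only": brain,
    "brain + confounder": np.column_stack([brain, sex]),
}

scores = {}
for name, X in feature_sets.items():
    scores[name] = holdout_r2(X[tr], hgs[tr], X[te], hgs[te])
    print(f"{name:20s} R2 = {scores[name]:.2f}")
```

In this constructed setting the confounder alone predicts best, and adding the brain features brings no gain, which is the pattern one would expect when feature-based performance is driven by encoded confounder information.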
While sex and age confounder adjustment significantly impacted predictive performance (r = .63 to r = .08), the most substantial difference due to feature or algorithm choice was only between GMV (r = .63) and CT (r = .36). This underscores that confounders can have a more pronounced impact on predictions than feature or learning algorithm selection. Selecting meaningful features and aligning algorithm choice with the assumed nature of the feature-target relationship is undoubtedly important. However, our results highlight that it is (at least) equally important to consider and understand the role of confounders in a predictive workflow.
5. Discussion
Precision medicine ML workflows are susceptible to context-dependent confounding influences. We differentiate between two overarching research endeavours, high-performance and pure link. Both require a nuanced understanding of confounders, either to avoid generalizability issues and identify potential covariate shifts (in high-performance case) or to determine the achievable purity of the problem space (in pure link case). We elaborated that such purity is difficult to achieve, if even reachable, in the case of biologically highly linked variables. To address the gradual nature of shared biology between potential confounders and a predictive problem space, we introduced a conceptual dimension of confound removal, ordering variables based on increasing biological link. This supplements statistical confounder evaluations by providing insights into biomedical implications of confound removal. The empirical HGS predictions underpinned the pivotal role of confounders in predictive workflows.
The substantial difference between the “vanilla” and the age-sex-adjusted model raises the crucial question of which model is the correct one. This decision depends on the research endeavour (high-performance vs. pure link), and the interpretation must align with it. The high-performance vanilla model predicts HGS decently but does not permit conclusions about neural encodings of HGS. Moreover, even in this high-performance setup, training and test distributions must match to avoid covariate shift and enable transition from development into clinical practice, counteracting the AI chasm. Conversely, the sex-age-adjusted model may show lower performance but elucidates that sex and age encodings in GMV drive linear predictions of HGS. Despite lower accuracy, such models can enhance the understanding of (in this example) brain function beyond biologically overlapping other behavioural and phenotypical measures. Removing the influence of relevant variables, such as sex, uncovers smaller underlying signals and unmasks the necessity for deeper investigations. In fact, the nested additive effect of age in the GMV-HGS prediction would not have been discerned without removing the influence of sex.
The Confound Continuum aims to support informed confound removal decisions in a problem-dependent manner, bridging the gap between statistical and conceptual perspectives. It emphasizes that biomedical and statistical validity are distinct concepts and connects confound removal to model interpretation. In the realm of biology, no variable exists in complete isolation from others. Certain datasets might create the impression of some variables being biologically unrelated, but this likely reflects the inherent limitations of any dataset, which can only capture a finite number of measured variables. Therefore, it is crucial to dissect the interconnectedness of biological variables from a bio-conceptual perspective and combine this perspective with statistical data-insights to derive valid models and corresponding interpretations – a bridge provided by the Confound Continuum.
The necessity for integrating a bio-conceptual dimension with a statistical dimension of confound removal extends beyond neuroimaging predictive scenarios, being relevant for the entire domain of precision medicine. Despite successes in various prediction tasks, including cancer diagnosis and prognosis, inflammatory disease risk prediction, Alzheimer’s disease progression prediction, identification of hyperkalaemia from electrocardiograms or identification of genetic conditions from facial appearance63, the integration of AI in clinical practice still faces significant challenges. Most AI systems are far from achieving reliable generalizability, a prerequisite for clinical applicability63. For example, prognostic breast cancer models or predictive models for schizophrenia treatment outcomes only perform well in internal validation cohorts, but fail in external validation cohorts or trials64,65, i.e. the models fail to generalize to unseen data. This is problematic because accuracy achieved during model development does not necessarily represent clinical efficacy, particularly if high performances were achieved by neglecting confounder influences. While various factors contribute to failure in generalizability, confounding influences, such as technical differences between sites, variations in local clinical practices or differing demographics between patients in different hospitals, represent a major obstacle. Undoubtedly, high performance is crucial for constructing useful clinical AI systems. Nevertheless, there will always be a degree of uncertainty and error in predictive models, so it is essential to understand the strengths and limitations of AI tools66. Recognizing the impact of confounders on predictive models and particularly their biological and clinical meaning, as supported by the conceptual dimension of the Confound Continuum, can contribute to a more nuanced understanding and future development of these tools.
The present study has a limited statistical scope, focusing on correlations for the statistical dimension of the Confound Continuum and on linear regression for confounder adjustment. Exploring non-linear methods was not the intention, as such analyses can be found elsewhere (e.g.18). Although current non-statistical guidance for confound removal in brain-behavioural predictive modelling is limited, the conceptual considerations presented here are not meant as a step-by-step guide for determining which confounders to remove; future research in biomedicine or causal modelling may offer more specific guidance. Instead, they aim to raise awareness of the non-statistical, biomedical dimension of confound removal, emphasizing the importance of appropriate interpretation of models and results and of providing biomedical meaning and validity to predictive outcomes.
6. Conclusion
In data-driven predictive models, confounder decisions often rely solely on statistical and historical criteria. Here we stress the necessity of supplementing statistical approaches with domain expertise and biomedical knowledge. The introduced 2D Confound Continuum integrates statistical and conceptual considerations, aiding in assessing the statistical and biomedical role of specific confounders for a particular research question and predictive context. When a statistical relationship exists between a confounder and the feature(s)/target, both removing and not removing the potential confounder can be valid. However, the chosen strategy must match the intended goal of the model, and the interpretation of outcomes must differ accordingly. While reaching high performance is important, reflecting on the meaning of a model and how it can help to improve the medical field and our understanding of biomedical mechanisms is at least as important. The Confound Continuum fosters such an overall perspective, supporting accurate model interpretation and discouraging uncritical model employment.
Data Availability
All individual data used in this study were obtained from the UK Biobank, a major biomedical database (www.ukbiobank.ac.uk), and are available to all approved UK Biobank researchers.
7. Acknowledgments
This research has been conducted using data from UK Biobank, a major biomedical database (www.ukbiobank.ac.uk). This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 431549029 – Collaborative Research Centre CRC1451 on motor performance, project B05.