Abstract
Background Multimorbidity, the co-occurrence of two or more conditions within an individual, is a growing challenge for health and care delivery as well as for research. Combinations of physical and mental health conditions are highlighted as particularly important. The aim of this study was to investigate associations between physical multimorbidity and subsequent depression.
Methods and Findings We performed a clustering analysis upon physical morbidity data for UK Biobank participants aged 37-73 years at baseline data collection between 2006-2010. Of 502,353 participants, 142,005 had linked general practice data with at least one physical condition at baseline. Following stratification by sex (77,785 women; 64,220 men), we used four clustering methods (agglomerative hierarchical clustering, latent class analysis, k-medoids and k-modes) and selected the best-performing method based on clustering metrics. We used Fisher’s Exact test to determine significant over-/under-representation of conditions within each cluster. Amongst people with no prior depression, we used survival analysis to estimate associations between cluster-membership and time to subsequent depression diagnosis.
The k-modes models consistently performed best, and the over-/under-represented conditions in the resultant clusters reflected known associations. For example, clusters containing an overrepresentation of cardiometabolic conditions were amongst the largest clusters in the whole cohort (15.5% of participants, 19.7% of women, 24.2% of men). Cluster associations with depression varied from hazard ratio (HR) 1.29 (95% confidence interval (CI) 0.85-1.98) to HR 2.67 (95% CI 2.24-3.17), but almost all clusters showed a higher association with depression than those without physical conditions.
Conclusions We found that certain groups of physical multimorbidity may be associated with a higher risk of subsequent depression. However, our findings invite further investigation into other factors, like social ones, which may link physical multimorbidity with depression.
Introduction
Multimorbidity, the simultaneous occurrence of two or more long-term conditions in an individual is increasingly common as populations age, and it challenges existing health systems 1,2. Multimorbidity is more common with increasing age, in women and in the less affluent 3,4. Studying the co-occurrence of multiple long-term conditions in the same individual has the potential to inform understanding of disease causation and support planning of current and future health and care services 5–7.
Depression affects millions of people worldwide 8,9 and is ranked by the World Health Organization as one of the most burdensome diseases 9,10. There is strong evidence that depression co-occurs with other mental health disorders 11,12, and several ongoing studies aim to identify potential shared mechanisms 11,13. However, previous studies have also found depression to be more common in people with particular chronic physical illnesses, such as cardiovascular disease 14, multiple sclerosis 15, and inflammatory bowel disease 16. Physical ill-health might cause depression because it creates psychological disturbance through ‘biographical disruption’ that threatens a sense of identity, or because of impact on physical or social function. Alternatively, physical conditions may cause depression through intermediate biological processes, like inflammation 15,17, in which case we might expect that different combinations or patterns of physical conditions would be more strongly associated with depression than others.
Several studies have used cluster analyses to identify common patterns of physical conditions 18–21, typically using one method, such as agglomerative hierarchical clustering 18,19,22, k-medoids 20,23, Latent Class Analysis 21,24–26, or k-means approaches 27–30. Additionally, since morbidity data is binary (a person has a condition or does not), some common clustering methods are inappropriate since they use similarity measures incompatible with categorical data 18,31. Therefore, the aim of this study was to explore and compare the use of four independent clustering methods appropriate for binary data and to examine whether certain groups of physical conditions are associated with the subsequent diagnosis of depression.
Materials and methods
Data selection and pre-processing
We used data from UK Biobank 32. Participants aged 37-73 years attended a baseline assessment during 2006-2010 which collected data on demography, lifestyle habits, health conditions, and a range of physical and laboratory measurements. Participants provided written informed consent for linkage to national datasets including general practice (GP) (primary care), hospital, cancer registry and death records 32. The UK Biobank has ethical approval from the NHS North West Research Ethics Committee (reference: 21/NW/0157).
To robustly ascertain a broad range of long-term conditions, our study population included participants with a continuous GP record from at least a year before to at least one day beyond their baseline assessment. We excluded records from the UKB extract of the Vision practice management system in England because the extraction process excluded participants who died prior to data extraction. We also excluded participants who withdrew from the study (Supplementary Fig. 1).
We ascertained the presence of depression and 69 long-term physical health conditions at baseline using data from the baseline visit and from linked GP, hospital, and cancer registry records based on previously published lists 33,34 (Supplementary Table 1). The UK National Health Service limits registration to one practice at a time, and GP records transfer between practices so should capture an individual’s entire medical history. However, available hospital and cancer registry records began at different times for England, Wales and Scotland. Therefore, we used all GP records up to baseline assessment date, and to be consistent across countries, we used hospital and cancer registry records within eight years before baseline assessment date. We used published codelists to identify diagnoses from GP records using Read V2 and CTV3 diagnosis codes, hospital records using ICD-10 diagnosis codes and OPCS-4 procedure codes, and cancer registry records using ICD-10 codes 34. We similarly ascertained depression during follow-up using information from GP, hospital and death records. Eligible participants with no history of depression prior to baseline were followed up to the earliest of depression diagnosis, death or the end of their available GP or hospital records.
Models and metrics
We explored the suitability of four methods (k-modes 35,36, k-medoids 23, Latent Class Analysis (LCA) 24, and agglomerative hierarchical clustering (AHC) 22 (Supplementary Appendix 1)) to cluster all participants based on binary features denoting the absence/presence of the 69 baseline physical conditions. Participants with no physical conditions at baseline were excluded from the clustering analysis. We additionally clustered separately for men (all 69 conditions) and women (67 conditions since erectile dysfunction and hyperplasia of the prostate are only found in men), as well as in the whole population, because of known sex differences in patterns of individual morbidities and multimorbidity 37,38.
To select the number of clusters for each method, we used various heuristics, including the elbow method on a scree plot 39 for both k-modes and k-medoids, the minimal Bayesian Information Criterion 40 for LCA, and Hamming distance 41 for AHC. To assess suitability and performance of these clustering methods, we used three performance metrics (Calinski and Harabasz score 42, Davies-Bouldin score 43, and Silhouette score 44 (Supplementary Appendix 1)), which are appropriate for unsupervised clustering. Since k-modes and LCA are sensitive to differences in initialization 24,45, we repeated them five times and compared with other models using the mean and standard deviation across the five experiments.
Thereafter, we used two metrics to analyze over-and under-representation of conditions in each cluster. We designed one metric, the adjusted relative frequency (ARF) to measure the magnitude of over-or under-represented conditions within a cluster, relative to prevalence in the whole cohort. For each condition, ARF is calculated as: An ARF of exactly one indicates that the condition occurs at the same relative frequency as it does in the entire cohort, and values greater or less than one indicate over-and under-representation, respectively. We used Fisher’s Exact Test (two-sided, α=0.05) to evaluate whether over-or under-representation of a condition in each cluster was statistically significant, using a Bonferroni correction to account for multiple testing 46–48. Finally, we visualised and compared statistically significant results on a bubble heatmap (Supplementary Appendix 2). To allow others to conduct similar cluster analyses, we made the code available as a software package (https://github.com/laurendelong21/clusterMed).
Survival analysis to predict depression diagnosis
Using participants without a record of depression at baseline, we applied Cox regression models 49 to evaluate time to depression diagnosis by condition cluster, accounting for death as a competing risk 50. Participants with no physical conditions at baseline were included as the reference group. We ran separate models for the whole cohort and for men and women separately, examining associations between cluster membership and subsequent depression. All models were adjusted for baseline age, ethnicity, country of residence and deprivation. The model for the whole cohort was additionally adjusted for sex. Baseline age was included in the models as a continuous variable; all other variables were categorical. Ethnicity was self-reported at baseline, and we categorized it into five groups (Black, Mixed, South Asian, White, and any other ethnic group 51). Country of residence (England, Wales or Scotland) and area-based deprivation, measured by the Townsend Deprivation Index 52, were derived from participants’ home addresses at baseline. We divided the Townsend Deprivation Index into deciles within the entire UK Biobank cohort. A small number of participants (368 women and 435 men) who were missing data on ethnicity, country or deprivation were excluded from the survival analyses.
Results
Performance metrics across various clustering methods
There were 140,956 participants (73,036 women and 67,920 men) with at least one physical condition at baseline who were included in the clustering analysis (Supplementary Fig. 1). Performance metrics for each of the four methods explored are reported in Table 1.
Models based on AHC consistently achieved the poorest Calinski and Harabasz, and Davies-Bouldin scores in all three cohorts. Models based on LCA had better metrics than AHC-based models, with particularly high Calinski and Harabasz scores, but the Davies-Bouldin scores were consistently worse in comparison to k-modes or k-medoids based models. In contrast, the best Davies-Bouldin scores were achieved by the k-modes and k-medoids based models. However, since the Davies-Bouldin score assesses similarity between the most similar clusters, scores may be optimistic when several clusters only contain a single (or very few) participant(s). This was the case for the k-medoids models for the whole and men-only cohorts. The presence of singleton clusters is also concurrent with a larger number of total clusters. Specifically, all three models based on k-modes discovered eight clusters, all models using AHC discovered ten clusters, and all models using LCA discovered five or six clusters. In contrast, the k- medoids models discovered 25 clusters for the whole population (17 only had one participant, six for women-only (no singletons), and 13 for men-only (seven singletons) (Table 1). Therefore, while k- medoids models had comparable Davies-Bouldin scores to k-modes models, the results were less informative and consistent. For each cohort, we therefore selected the best performing k-modes model with the highest Calinski and Harabasz score among the five independent runs.
Differential representation of physical conditions within k-modes clusters
Many of the significantly over-represented conditions within several clusters aligned with body systems (Fig. 1) and we therefore used clinical judgement to name the clusters according to the systems or conditions which were most prominent (Supplementary Tables 2-5). The four largest clusters in whole cohort are Mixed including cancer (27.9% of participants in the cohort), Healthy + Rhinitis (22.2%), Cardiovascular disease (CVD) + diabetes (15.5%), and Very extensive morbidity (12.5%). For women, the four largest clusters are Mixed including cancer (29.3%), CVD + diabetes (19.7%), Musculoskeletal (MSK) (16.4%), and Healthy + Rhinitis (15.9%). Finally, for men, the four largest clusters are CVD + diabetes (24.2%), Mixed including cancer (20.8%), MSK + others (19.1%), and Healthy + Rhinitis (17.2%) (Fig. 2, Table 2).
Of the remaining clusters, there were some similarities across all three cohorts (e.g. Respiratory clusters). Generally, clusters with more participants also tended to have fewer conditions per participant (Fig. 2). For example, the Mixed including cancer clusters had the lowest mean conditions per participant (whole: 1.77; women: 1.75; men: 1.62). Such clusters may serve as "miscellaneous" categories for participants with condition profiles that are not easily grouped and/or people with one dominant condition. However, there were also differences. For example, there were clusters which only appeared in the whole population (Migraine) and clusters which only appeared in the sex-stratified cohorts (Digestive and MSK clusters).
Subsequent incident depression per identified cluster
Analysis of time to incident depression diagnosis included 141,001 participants (73,036 women and 67,920 men), excluding 30,770 participants with a history of depression at baseline (20,592 women and 10,178 men). In addition to participants included in the clustering analysis, this analysis also included 30,551 participants with no physical conditions at baseline (16,238 women and 14,313 men) (Supplementary Fig. 1). During an average follow-up of 6.8 years, 5,904 (4.2%) participants, including 3,574 (4.9%) women and 2,330 (3.4%) men, had a new depression diagnosis. Generally, participants with physical conditions at baseline had a higher rate of subsequent depression than participants with no physical conditions at baseline (Table 2, Supplementary Fig. 3).
There were several consistencies across cohorts. The Very extensive morbidity clusters were the most strongly associated with depression in all three cohorts (whole: HR 2.42, 95% CI 2.17-2.69; women: HR 2.67, 95% CI 2.24-3.17; men: HR 2.65, 95% CI 2.22-3.18). Additionally, the Healthy + rhinitis (whole: HR 1.59, 95% CI 1.46-1.75; women: HR 1.48, 95% CI 1.30-1.67; men: HR 1.50, 95% CI 1.29-1.75) and Mixed including cancer (whole: HR 1.62, 95% CI 1.48-1.77; women: HR 1.63, 95% CI 1.46-1.82; men: HR 1.60, 95% CI 1.38-1.86) clusters were generally the most weakly associated. Finally, association with depression also appeared to increase with the number of conditions per participant (Fig. 3). However, there were some exceptions; for example, the whole cohort’s Macular degeneration + diabetes cluster had the fourth highest mean number of conditions (3.09), but was only weakly associated with depression (HR 1.29, 95% CI 0.85-1.98).
Discussion
Summary of findings
This study systematically explored clustering of physical health conditions using four methods appropriate for binary data (k-modes 35,36, k-medoids 23, Latent Class Analysis 24, and agglomerative hierarchical clustering (AHC) 22). K-modes performed best, and the clusters identified were reasonably interpretable and often aligned with known associations between conditions. People with any physical condition at baseline were generally more likely to develop depression than people without any physical condition. There was some variation in this association by cluster which may be at least partly driven by differences in the mean number of physical conditions in each cluster.
Comparison with other studies
Existing studies of morbidity clustering typically apply a single method. One study compared LCA to a Bayesian, network-based approach, but used age and admission type, rather than conditions alone, to drive cluster formation 53. Two other studies explored AHC and k-means in the same dataset, but chose k-means on the basis of AHC being too computationally intensive rather than based on performance 27,28. Additionally, despite the use of k-means 54 by several multimorbidity studies 27–30, it typically relies upon Euclidean distance as its similarity measure 31, which is unsuitable for binary data 18. Other multimorbidity studies have used k-means after a Multiple Correspondence Analysis 55,56, which represents categorical features as a low-dimensional Euclidean space 29,30. While this transforms the data features into an appropriate format for k-means, it also manipulates the data based on their pairwise co-occurrences, which may not be appropriate for every dataset.
This study finds that almost all physical morbidity clusters are associated with higher risk of subsequent depression than the group with no physical conditions at baseline. Although the strength of association varied by cluster, this seemed to be partly explained by the mean number of conditions in the cluster. This is consistent with a similar study 29 which found associations between severe mental illness and a higher number of physical conditions. Another similar study, which aimed to identify groups of physical conditions associated with incident depression within a Taiwanese cohort, also found that social factors played a role on the risk of subsequent depression diagnosis 21. Specifically, they found that amongst four Cardiometabolic, Arthritis-cataract, Multimorbidity, and Relatively healthy clusters, those within the Arthritis-cataract and Multimorbidity clusters had significantly higher risk of depression than healthy individuals. However, this association was attenuated for participants who engaged in social activities, including a job, volunteer experience, or community activities 21.
Strengths and limitations
Strengths of this study include the analysis of a large dataset which records morbidities in both baseline research data and linked routine data, as well as the inclusion of a wide set of morbidities recommended by a recent consensus study 57. Notably, this study is unique for its implementation and comparison between four clustering methods appropriate for binary data. A limitation is that the data are collected from volunteers who are generally more affluent than the UK average, and people from ethnic minorities are somewhat under-represented 58. Additionally, there is no standard way to evaluate the validity of identified clusters, although the observed clusters do include several known clinical associations. Consequently, this warrants further validation studies in other datasets to explore reproducibility of cluster solutions.
Implications for research
Many previous studies of morbidity clustering do not provide much information about which conditions are over or under-represented in clusters, which leaves readers relying solely on author-chosen cluster labels for interpretation 19,21. For example, it is common for other studies to identify a ‘cardiometabolic’ cluster 21 and the CVD + diabetes cluster in our study was amongst the three largest clusters in all three cohorts. However, it is not straightforward to compare clusters across studies because of considerable variation in the conditions included in analysis, and because many clustering studies do not provide detailed information about the nature of identified clusters. Key implications are that clustering studies should be more consistent in the choice of conditions to include (and, at a minimum, follow consensus recommendations 57). Additionally, they should report the nature of clusters to help understand them beyond their high-level labels (by, for example, visualizing the prevalence of individual conditions in each cluster alongside over/under representation). We believe that our Adjusted Relative Frequency (ARF) measure with visualization in a bubble heatmap demonstrates one way to do this. However, there is a need for multimorbidity researchers to develop improved and consistent cluster visualizations and explanations to facilitate interpretation and to enhance clinical utility.
Morbidity clustering studies also typically use one clustering method, but there is no single clustering method which is likely to be optimal for every dataset. We therefore believe that clustering studies should more systematically explore different methods and make explicit how they choose the best method for their datasets and purposes. To encourage similar systematic comparison of different cluster methods, we have provided access to our code (https://github.com/laurendelong21/clusterMed). Many studies also cluster the entire population which is likely not sensible given the very different incidence and prevalence of disease with age and, to a lesser extent, sex and ethnicity. In this analysis, study participants were mostly middle-aged (so we did not further stratify by age) and overwhelmingly white, but we found some differences in the clusters identified in the whole population versus women or men separately. Although whole population clustering may be appropriate in some circumstances, reporting of clusters stratified by age and sex (and ethnicity if the data permits) would be valuable to explore how clustering varies by demographic characteristics.
Finally, further research to better understand why physical multimorbidity is associated with subsequent depression is needed. The general trend between increased risk of subsequent depression and mean number of conditions suggests a social explanation: suffering more conditions may more strongly interrupt one’s life or sense of self. However, the relationship between depression and physical conditions is very likely bidirectional and longitudinal research which better examines how the two interact over a lifetime would be valuable.
Conclusions
Using the best performing of four different clustering methods, this study identified several multimorbidity clusters which align with known clinical associations. Association with depression varied between clusters, but this may be partly driven by differences in the number of conditions. More research is needed to better understand the mechanisms underlying such associations.
Data Availability
The UK Biobank data is not openly available to protect the rights of participants. Researchers can register for access here: https://www.ukbiobank.ac.uk/enable-your-research/register.
Code Availability
Corresponding code is available at https://github.com/laurendelong21/clusterMed.
Competing Interests
The authors declare no competing interests.
Author Contributions
All authors contributed to writing the manuscript. L.N.D. and K.F. conducted the analyses and made the figures. L.N.D. and P.G. wrote the software. K.F. and R.P. processed and prepared the data. J.D.F. and
B.G. designed and supervised the study.
Data Availability
The UK Biobank data is not openly available to protect the rights of participants. Researchers can register for access here: http://www.ukbiobank.ac.uk/enable-your-research/register.
Supporting information
Supplementary Appendix 1. Methodological Descriptions.
To identify a clustering method best suited for our data, we explored several combinations of metrics and methods which were, in theory, capable of handling the binary nature of the morbidity data. Specifically, we used the following metrics within this study:
Hamming distance 1: This is a dissimilarity metric denoting the number of mismatching categories between two objects. It is formally defined as 2: where:
cosine similarity 3: This is a measure which was originally used to indicate how similar two vector angles are to one another. For binary data, it can be understood as the number of true, or positive features are shared between two objects, divided by the product of the number of true objects from each object. Formally, it is defined as 4:
With these metrics, we explored the following four clustering methods:
k-modes 5,6: This uses the same methodology as k-means clustering 7, but the centroid of each cluster is defined on the number of matching categories between data points, computed via the Hamming distance 1. In other words, the centroid represents the mode of the cluster, rather than the mean 5,6. Centroid initialization was performed via the frequency-based Huang metric 5.
k-medoids 8: As above, this uses the same methodology as k-means clustering 7, but the centroid is an actual data point acting as the "median", computed via cosine similarity.
Latent Class Analysis (LCA) 9: This aims to find groups or subtypes of cases (latent classes) in multivariate categorical data. It gives probabilities of class membership, rather than concrete class assignments, which are unique, so the user can see the likelihood that a data point truly belongs to its assigned class 9.
agglomerative hierarchical clustering (AHC) 10: This is best understood as a "bottom-up" approach in which samples start out alone, then merge to form larger and larger clusters 11. We used a complete linkage (the maximum distance between points in two clusters 12), computed via Hamming distance 1.
Finally, we assessed cluster performance, including separation and overlap, with the following three performance metrics:
Calinski and Harabasz score 13: This is the ratio of between-cluster dispersion to within-cluster dispersion. A higher Calinski and Harabasz score indicates better performance.
Davies Bouldin score 14: This is a measure of cluster similarity to each cluster’s most similar cluster. A Davies Bouldin score closer to zero indicates better performance.
Silhouette score 15: This is a measure of cluster fit which accounts for the mean distance between points in each individual cluster as well as the mean distance to points in the closest neighboring cluster. A silhouette score closer to one indicates better performance. Hamming distance was utilized as the distance metric.
The best similarity or dissimilarity metrics were selected for k-modes, k-medoids, and AHC by testing each of them upon a random selection of participants (1,417). The k-modes method used with an alternative initialization technique, called the Cao metric 6, resulted in some clusters containing less than ten participants, while others contained hundreds. Such imbalance is uninformative for our purposes. Similarly, k-medoids with Hamming distance and Jaccard similarity 16, another similarity metric, resulted in several empty clusters. Finally, AHC with other metrics and linkage types resulted in poor separation between clusters, with high overlap between branches. These issues were not present with the specified metrics.
Supplementary Appendix 2. Bubble Heatmap.
The bubble heatmap places ARF values on a grid in which the y-axis contains conditions, the x-axis contains clusters, and data points are colored blue (under-representation) or red (over-representation) at each intersection. The magnitude of under-or over-representation is indicated by the size of the data point, or bubble. Points are not statistically significant, as determined by the Fisher’s Exact test (REF), were omitted. Therefore, conditions with no significant values are omitted entirely from the y- axis.
Notably, for visualisation purposes, the ARF values are adjusted so that values denoting under-representation (between zero and one) were mapped to a similar scale as those denoting over-representation (values greater than one). Specifically, we used the following function, in which x denotes the original ARF value:
Supplementary Table 2. ARF values and adjusted p-values per condition and cluster.
Statistically significant values (p < 0.05) are denoted in bold.
See corresponding excel file (S2Table.xlsx).
Acknowledgments
This work was co-funded by the Medical Research Council and the National Institute for Health Research (grant number MC/S028013), and the NIHR AIM-CISC programme (grant number NIHR202639). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. The study was conducted using the UK Biobank Resource under application number 57213.
LND is individually funded by a Global Informatics Scholarship from the School of Informatics at the University of Edinburgh. The School of Informatics had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors would like to thank the UK Biobank participants and the UK Biobank staff for their contributions to this study. The authors would like to thank the public members of our advisory board, Dr Paul Kelly and Pat Watson, for providing thoughtful feedback throughout our project. This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/).