Abstract
Background Quality Adjusted Life Years (QALYs) are often used in economic evaluations, yet utility weights for deriving them are rarely directly measured in mental health services.
Objectives We aimed to: (i) identify the best Transfer to Utility (TTU) algorithms and predictors for adolescent weighted Assessment of Quality of Life - six dimensions (AQoL-6D) health utility and (ii) assess the ability of TTU algorithms to predict longitudinal change.
Methods We recruited 1107 young people attending Australian primary mental health services, collecting data at two time points, three months apart. Five linear and three generalised linear models were explored to identify the best TTU algorithm. Random forest models were used to assess the predictive ability of six candidate measures of psychological distress, depression and anxiety, and linear / generalised linear mixed effect models were used to construct longitudinal predictive models for AQoL-6D change.
Results A depression measure (Patient Health Questionnaire-9) was the strongest independent predictor of health utility. Linear regression models with complementary log-log transformation of utility score were the best performing models. Between-person associations were slightly larger than within-person associations for most of the predictors.
Conclusions Adolescent AQoL-6D utility can be derived from a range of psychological distress, depression and anxiety measures. TTU algorithms estimated from cross-sectional data can approximate longitudinal change but may slightly bias QALY predictions.
Toolkits The TTU models produced by this study can be searched, retrieved and applied to new data to generate QALY predictions with the Youth Outcomes to Health Utility (youthu) R package - https://ready4-dev.github.io/youthu.
1 Introduction
To efficiently allocate scarce public resources between competing mental health programs, it is useful to have a common measure of benefit. Quality adjusted life years (QALYs) are generic indices of outcome that inform public health policy in many countries [1] and are frequently used in health economic evaluations, including in mental health. The “quality” in QALYs is often measured via the use of multi-attribute utility instruments (MAUIs), where domains of quality of life measured by a questionnaire are weighted using the preferences of people [2]. This approach produces a single health utility weight for each individual for each measured health state, anchored on a scale where 0 represents death and 1 represents perfect health. Health utility weights can be converted to QALYs by weighting the duration (the “years” part of QALYs) each individual spends in each health state.
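As a minimal sketch of this duration weighting, the example below computes QALYs over an interval between two utility measurements. It assumes utility changes linearly between the two assessments (an area-under-the-curve convention that is our assumption for illustration, not something specified above); the function name `qalys_between` and the example values are hypothetical.

```python
def qalys_between(utility_start, utility_end, years):
    """QALYs accrued over an interval, assuming utility changes linearly
    between the two measurements (trapezoidal area under the curve)."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return 0.5 * (utility_start + utility_end) * years


# Hypothetical individual: utility 0.60 at baseline and 0.70 three months
# (0.25 years) later.
example_qalys = qalys_between(0.60, 0.70, 0.25)

# A full year spent in perfect health (utility 1) yields exactly one QALY.
one_year_perfect_health = qalys_between(1.0, 1.0, 1.0)
```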
MAUIs are regularly used in research studies such as clinical trials and epidemiological surveys, but rarely feature in routine data collection by mental health services. In the absence of direct measurement, Transfer to Utility (TTU) analysis has been developed to map utility weights from standard health status measurements [3]. In mental health settings, TTU algorithms have been developed to map psychological distress (measured using Kessler Psychological Distress Scale – 10 items, K10) and depression and anxiety symptoms (measured using Depression, Anxiety, and Stress Scale – 21 items, DASS-21 [4]) to a range of health utility measures including the Assessment of Quality of Life – 8 dimensions (AQoL-8D [5]). Published mental health TTU algorithms have been developed for adult [5] or child [6] general populations; however, they have questionable appropriateness for predicting health utility in clinical mental health samples of young people. Other difficulties with currently available TTU algorithms include over-reliance on cross-sectional data (not capturing the longitudinal dimension of QALYs), and a limited range of predictors.
With a sample of help-seeking young people attending primary mental health care services, we aimed to: (i) identify the best TTU regression models to predict adolescent weighted AQoL-6D utility and evaluate the predictive ability of six candidate measures of psychological distress, depression and anxiety; and (ii) assess ability of the TTU algorithms to predict longitudinal (three-month) change.
2 Methods
2.1 Sample and setting
This study forms part of a research program to develop better outcome measures for young people seeking mental health support, and the study sample has previously been described [7]. Briefly, young people aged 12 to 25 years who presented for a first appointment for mental health or substance use related issues were recruited from three metropolitan and two regional Australian youth-focused primary mental health clinics (headspace centres) between September 2016 and April 2018. Sample characteristics are similar to previous descriptions of headspace clients, with slight differences in age (fewer aged 12-14, more aged 18-20), cultural background (more Culturally and Linguistically Diverse and fewer Aboriginal and Torres Strait Islander young people), sexuality (fewer heterosexual clients) and housing (more in unstable accommodation) [7].
2.2 Measures
We collected data on utility weights, six candidate predictors of utility weights including psychological distress, depression and anxiety measures as well as demographic, clinical and functional population information.
2.2.1 Utility weights
We assessed utility weights using the Assessment of Quality of Life – Six Dimension scale (AQoL-6D; [8]) MAUI. It was selected due to the relevance of its domains for a clinical mental health sample [9] and its acceptable participant time-burden. The AQoL-6D instrument contains 20 items across the six dimensions of independent living, social and family relationships, mental health, coping, pain and senses. Health utility scores were calculated using a published algorithm for adolescents (available at https://www.aqol.com.au/index.php/aqolinstruments?id=92), using Australian population preference weights.
2.2.2 Candidate predictors
Data from six measures of psychological distress (one measure), depression (two measures) and anxiety (three measures) symptoms were used as candidate predictors to construct TTU models. These measures were selected as they are widely used in clinical mental health services or clinically relevant to the profiles of young people seeking mental health care.
The Kessler Psychological Distress Scale (K6; [10]) was used to measure psychological distress over the last 30 days. It includes six items (nervousness, hopelessness, restlessness, sadness, effort, and worthlessness) of the 10 item version of this measure, K10. Individual items use a five-point frequency scale that spans from 0 (“none of the time”) to 4 (“all of the time”).
The Patient Health Questionnaire-9 (PHQ-9; [11]) and Behavioural Activation for Depression Scale (BADS; [12]) were used to measure degree of depressive symptomatology. PHQ-9 includes nine questions measuring the frequency of depressive thoughts (including self-harm/suicidal thoughts) as well as associated somatic symptoms (e.g., sleep disturbance, fatigue, anhedonia, appetite, psychomotor changes) in the past two weeks. PHQ-9 uses a four-point frequency scale ranging from 0 (“Not at all”) to 3 (“Nearly every day”). For the PHQ-9 a total score is derived (0-27) with higher scores depicting greater symptom severity. BADS measures a range of behaviours (activation, avoidance/rumination, work/school impairment as well as social impairment) reflecting severity of depression. BADS includes 25 questions on behaviours over the past week, scored on a seven-point scale ranging from 0 (“Not at all”) to 6 (“Completely”). A total score is derived for the BADS (0-150) as well as subscale scores, with higher scores indicating greater activation.
The Generalised Anxiety Disorder Scale (GAD-7; [13]), Screen for Child Anxiety Related Disorders (SCARED; [14]) and Overall Anxiety Severity and Impairment Scale (OASIS; [15]) were used to measure anxiety symptoms. GAD-7 measures symptoms such as nervousness, worrying and restlessness over the past two weeks using seven questions, with a four-point frequency scale ranging from 0 (“Not at all”) to 3 (“Nearly every day”). A total score is calculated with scores ranging from 0 to 21 and higher scores indicating more severe symptomatology. SCARED is an anxiety screening tool designed for children and adolescents which can be mapped directly onto specific Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR) anxiety disorders including generalised anxiety disorder, panic disorder, separation anxiety disorder and social phobia. It includes 41 questions on a three-point scale of 0 (“Not true or hardly ever true”), 1 (“Somewhat True or Sometimes True”) and 2 (“Very true or often true”) to measure symptoms over the last three months. A total score is derived with scores ranging from 0-82, with higher scores indicative of the presence of an anxiety disorder. The OASIS was developed as a brief questionnaire to measure severity of anxiety and impairment in clinical populations. The OASIS includes five questions about frequency and intensity of anxiety as well as related impairments such as avoidance, restricted activities and problems with social functioning over the past week. Total scores range from 0-20 with higher scores depicting more severe symptomatology.
2.2.3 Population characteristics
We collected self-reported measures of demographics (age, gender, sex at birth, education and employment status, languages spoken at home and country of birth). We also collected clinician or research interviewer assessed measures of mental health including primary diagnosis, clinical stage [16] and functioning (measured by the Social and Occupational Functioning Assessment Scale (SOFAS) [17]).
2.3 Procedures
Eligible participants were recruited by trained research assistants and written consent was obtained from the young person and a parent/guardian if the participant was aged <18 years.
Participants responded to the questionnaire via a tablet device and participants’ clinical characteristics were obtained from clinical records and research interview. At three months post-baseline, participants were contacted in person or by telephone to complete a three-month follow-up assessment.
2.4 Statistical analysis
Basic descriptive statistics were used to characterise the cohort in terms of baseline demographics and clinical variables. Pearson’s Product Moment Correlations (r) were used to determine the relationships between candidate predictors and the AQoL-6D utility score.
2.4.1 TTU regression models
As AQoL-6D utility score is normally left skewed and constrained between 0 and 1, ordinary least squares (OLS) models with different types of outcome transformations (such as log and logit) have been previously used in TTU regression [3]. Similarly, generalised linear models (GLMs) address this issue via modelling the distribution of the outcome variable and applying a link function between the outcome and linear combination of predictors [18].
We compared the predictive performance of a range of models predicting AQoL-6D utility scores using the candidate predictor that had the highest Pearson correlation coefficient with utility scores. The models compared include OLS regression with log, logit, log-log (f(y) = -log(-log(y))) and clog-log (f(y) = log(-log(1-y))) transformations; GLM using Gaussian distribution with log link; and GLM using Beta distribution with logit and clog-log links. Ten-fold cross-validation was used to compare model fit on training datasets and predictive ability on testing datasets, using three indicators: R2, root mean square error (RMSE) and mean absolute error (MAE) [19,20].
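The cross-validation procedure for one of these models can be sketched as follows. This is an illustrative sketch only, not the study code: it uses the standard complementary log-log definition f(y) = log(-log(1 - y)), a single-predictor least-squares fit, simulated data with PHQ-9-like scores, simple inversion of the transformation for back-transformation, and R2 computed as 1 - SSres/SStot. All function names, coefficients and data are hypothetical.

```python
import math
import random


def cloglog(y):
    """Complementary log-log transformation, defined for 0 < y < 1."""
    return math.log(-math.log(1.0 - y))


def inv_cloglog(z):
    """Back-transform from the clog-log scale to the utility scale."""
    return 1.0 - math.exp(-math.exp(z))


def fit_ols(x, z):
    """Least-squares fit of z = a + b * x; returns (a, b)."""
    mean_x, mean_z = sum(x) / len(x), sum(z) / len(z)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxz / sxx
    return mean_z - slope * mean_x, slope


def metrics(observed, predicted):
    """R2 (1 - SSres/SStot), RMSE and MAE on the utility scale."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    return {"r2": 1.0 - ss_res / ss_tot, "rmse": math.sqrt(ss_res / n), "mae": mae}


def cross_validate(x, y, k=10, seed=1):
    """k-fold cross-validation of OLS on clog-log transformed utility."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    results = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        a, b = fit_ols([x[i] for i in train_idx],
                       [cloglog(y[i]) for i in train_idx])
        predicted = [inv_cloglog(a + b * x[i]) for i in test_idx]
        results.append(metrics([y[i] for i in test_idx], predicted))
    # Average each test-set indicator across the k folds.
    return {name: sum(r[name] for r in results) / k for name in results[0]}


# Simulated (hypothetical) data: scores on a 0-27 range, with utility
# generated on the clog-log scale plus Gaussian noise.
rng = random.Random(42)
scores = [rng.randint(0, 27) for _ in range(500)]
utility = [inv_cloglog(1.0 - 0.08 * s + rng.gauss(0, 0.3)) for s in scores]
cv_results = cross_validate(scores, utility)
```

Only the test-set indicators are computed here; the training-set indicators described above would be obtained by applying `metrics` to the training folds in the same loop.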
To evaluate whether candidate predictors could independently predict utility scores, we established multi-variate prediction models using baseline data with the candidate predictor and a range of other risk factors including participants’ age, sex at birth, clinical stage, cultural and linguistic diversity, education and employment status, primary diagnosis, region of residence (whether metropolitan - based on location of attending service) and sexual orientation. Functioning (as measured by SOFAS) was also included in each model to evaluate whether it can jointly predict utility with clinical symptom measurements.
2.4.2 Candidate predictor comparison
Two steps were used to compare the usefulness of the candidate predictors. First, we used a random forest model including all six candidate predictors. Anxiety and depression measurements are highly collinear, making it difficult to compare these candidate predictors using one regression model. Random forest models provide flexible methods for comparing correlated predictors’ relative ‘importance’ (loss of accuracy from random permutation of the predictor) for the overall prediction model [21]. Second, the predictive performance of the candidate predictors using the selected TTU regression model was compared using 10-fold cross-validation. This procedure helped us to directly evaluate the independent predictive ability of different candidate predictors.
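The idea of permutation ‘importance’ can be illustrated with a short sketch. Note the substitution: rather than a random forest, the example uses a fixed toy prediction function as a stand-in for a fitted model, so that only the permutation step itself is shown. The data, coefficients and function names are hypothetical.

```python
import random


def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def permutation_importance(predict, rows, outcome, column, rng):
    """Increase in mean squared error after shuffling one predictor column."""
    baseline_error = mean_squared_error(outcome, [predict(r) for r in rows])
    shuffled_values = [r[column] for r in rows]
    rng.shuffle(shuffled_values)
    # Rebuild each row with the shuffled value in the chosen column only.
    permuted_rows = [
        r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled_values)
    ]
    permuted_error = mean_squared_error(outcome, [predict(r) for r in permuted_rows])
    return permuted_error - baseline_error


rng = random.Random(7)
# Hypothetical data: column 0 drives utility, column 1 is unrelated noise.
rows = [[rng.randint(0, 27), rng.randint(0, 21)] for _ in range(300)]
utility = [0.9 - 0.02 * r[0] + rng.gauss(0, 0.05) for r in rows]


def toy_model(row):
    """Stand-in for a fitted model; it only uses column 0."""
    return 0.9 - 0.02 * row[0]


importance = {
    "informative": permutation_importance(toy_model, rows, utility, 0, rng),
    "noise": permutation_importance(toy_model, rows, utility, 1, rng),
}
```

Shuffling a predictor the model relies on degrades accuracy, whereas shuffling one it ignores leaves the error unchanged; ranking predictors by this loss is what allows correlated measures to be compared within a single model.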
2.4.3 Methods to evaluate the ability of measures to predict longitudinal change in health utility
After identifying the best TTU regression model(s), we established longitudinal models to evaluate the ability to predict change. This was achieved using generalised linear mixed-effect models (GLMMs) including both the baseline and follow-up data. The model is specified in Equation 1:
$$g(U_{i,j}) = \beta_0 + b_i + \beta_{baseline} S_{i,baseline} + \beta_{change} \Delta S_{i,j} + \epsilon_{i,j} \qquad (1)$$
where $g()$ is the link function of the model; $U_{i,j}$ is the AQoL-6D utility score of individual $i$ at observation $j$; $S_{i,baseline}$ is the baseline distress/depression/anxiety score for individual $i$; and $\Delta S_{i,j}$ is the score change from baseline for individual $i$ at observation $j$. We used $\beta_0$ to represent the fixed intercept, $b_i$ to represent the random intercept for individual $i$ (controlling for clustering at the individual level) and $\epsilon_{i,j}$ to represent the random error. Hence for baseline observations $\Delta S_{i,j} = 0$, and at follow-up $\Delta S_{i,j} = S_{i,follow-up} - S_{i,baseline}$. With this parameterisation, $\beta_{baseline}$ can be interpreted as the between-person association and $\beta_{change}$ as the within-person association. When $\beta_{baseline} = \beta_{change} = \beta$, Equation 1 simplifies to:
$$g(U_{i,j}) = \beta_0 + b_i + \beta S_{i,j} + \epsilon_{i,j} \qquad (2)$$
for both baseline and follow-up observations, where $S_{i,j} = S_{i,baseline} + \Delta S_{i,j}$ is the score of individual $i$ at observation $j$. The discrepancy between $\beta_{baseline}$ and $\beta_{change}$ can be interpreted as the bias incurred when within-person longitudinal score changes are predicted using cross-sectional score differences between individuals.
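The parameterisation described above can be sketched in code: each participant contributes one row per observation, carrying the baseline score and the change from baseline. This is an illustrative sketch only; the record layout, function names and coefficient values are hypothetical, and the linear predictor omits the residual error term.

```python
def to_long_format(participants):
    """Turn one record per participant into one row per observation.

    Each row carries the baseline score and the change from baseline,
    which is 0 at baseline by construction.
    """
    rows = []
    for p in participants:
        rows.append({"id": p["id"], "time": "baseline",
                     "s_baseline": p["baseline"], "delta_s": 0})
        if p.get("follow_up") is not None:  # participant completed follow-up
            rows.append({"id": p["id"], "time": "follow_up",
                         "s_baseline": p["baseline"],
                         "delta_s": p["follow_up"] - p["baseline"]})
    return rows


def linear_predictor(row, beta_0, beta_baseline, beta_change, b_i=0.0):
    """Right-hand side of the model on the link scale, excluding residual error."""
    return (beta_0 + b_i + beta_baseline * row["s_baseline"]
            + beta_change * row["delta_s"])


# Hypothetical scores for two participants (the second missed follow-up).
participants = [
    {"id": 1, "baseline": 15, "follow_up": 9},
    {"id": 2, "baseline": 20, "follow_up": None},
]
long_rows = to_long_format(participants)

# With equal coefficients the predictor depends only on the current score
# (15 - 6 = 9 for participant 1 at follow-up), matching the simplified model.
equal_coefficients = linear_predictor(long_rows[1], 1.0, -0.05, -0.05)
current_score_only = 1.0 + (-0.05) * 9
```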
Bayesian linear mixed models were used to avoid common convergence problems in frequentist tools [22]. Linear mixed-effect models (LMMs) can be fitted in the same framework with a Gaussian distribution and identity link function. Clustering at the individual level is controlled by including random intercepts. Model fit was evaluated using Bayesian R2 [23].
2.4.4 Secondary analyses
We repeated the previous steps to develop additional TTUs - a set of models that used SOFAS as an independent predictor (Secondary Analysis A) and a set of models that combined anxiety and depression predictors (Secondary Analysis B).
2.4.5 Software
We undertook all our analyses using R 4.0.2 [24]. We used a wide range of third-party code libraries in the analysis and reporting (see Supplementary Information, Table A.5). We wrote our analysis and reporting algorithms as R packages so that they can be used by others as tools for predicting QALYs, replicating this study and developing TTUs with different utility measures and predictors. Where it is not feasible to publicly release study data, synthetic replication datasets can be useful [25]. We created such a dataset and included it in one of our R packages.
3 Results
3.1 Cohort characteristics
Participant characteristics at baseline and follow-up are displayed in Table 1. This study included the 1068 of the 1107 participants who had complete AQoL-6D data. This cohort predominantly comprised individuals with anxiety/depression (76.7%) at early (prior to first episode of a serious mental disorder) clinical stages (91.7%). Participant ages ranged from 12 to 25, with a mean age of 18.13 (SD = 3.26).
Table 1: Participant characteristics
There were 643 participants (60.2%) who completed AQoL-6D questions at the follow-up survey three months after baseline assessment.
3.2 AQoL-6D and candidate predictors
Distributions of the AQoL-6D total utility score and sub-domain scores are displayed in Figure 1. The mean utility score was 0.59 (SD = 0.24) at baseline and 0.68 (SD = 0.24) at follow-up. Distributions of the candidate predictors (BADS, GAD-7, K6, OASIS, PHQ-9 and SCARED) are summarised in Table 2. PHQ-9 had the highest correlation with utility score at both baseline and follow-up, followed by OASIS and BADS; SCARED had the lowest correlation coefficients with utility score at both time points, although all correlation coefficients can be characterised as strong.
Table 2: Candidate predictors distribution parameters and correlations with AQoL-6D utility
Figure 1: Distribution of AQoL-6D domains
3.3 TTU regression model performance
The 10-fold cross-validated model fitting indices from TTU models using PHQ-9 are reported in Table A.1 in the Supplementary Material. Both training and testing R2, RMSE and MAE were comparable between GLM model types. The best OLS model was found to be either no transformation, log transformation or clog-log transformation. Model diagnostics (such as heteroscedasticity and residual normality) suggested better model fit for the clog-log transformed model, as the distribution of clog-log transformed utility is closest to a normal distribution among all transformation methods. Another benefit of the clog-log model is that the predicted utility score is constrained to an upper bound of 1, thus preventing out-of-range predictions. Therefore, both the GLM with Gaussian distribution and log link and OLS with clog-log transformation were selected for further evaluation. The predictive ability of each candidate predictor using baseline data was also compared using 10-fold cross-validation.
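The upper-bound property can be checked with a short sketch that compares the two back-transformations. It relies on two standard facts: the inverse of a log link is the exponential function, and the inverse of the complementary log-log transformation is 1 - exp(-exp(z)). The linear predictor values are arbitrary illustrative numbers, not estimates from this study.

```python
import math


def inv_log_link(eta):
    """Back-transform for a log link; unbounded above."""
    return math.exp(eta)


def inv_cloglog(eta):
    """Back-transform for the complementary log-log transformation."""
    return 1.0 - math.exp(-math.exp(eta))


# Arbitrary values of the linear predictor on the model scale.
linear_predictors = [-3.0, -1.0, 0.0, 0.5, 2.0]

log_link_predictions = [inv_log_link(eta) for eta in linear_predictors]
cloglog_predictions = [inv_cloglog(eta) for eta in linear_predictors]

# Count predictions that fall above the maximum possible utility of 1.
log_link_out_of_range = sum(p > 1 for p in log_link_predictions)
cloglog_out_of_range = sum(p > 1 for p in cloglog_predictions)
```

Any positive linear predictor under the log link back-transforms to a value above 1, whereas the clog-log back-transformation stays within the unit interval for every input.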
As shown in Table A.2, PHQ-9 had the highest predictive ability followed by OASIS, BADS, GAD-7 and K6. SCARED had the least predictive capability. This is consistent with the random forest model in which PHQ-9 was found to be the most ‘important’ predictor (see Figure A.1). The confounding effects of other participant characteristics were also evaluated when using the candidate predictors to predict utility score. Using the baseline data, SOFAS was found to independently predict utility scores in models for all six candidate predictors (p<0.005). No other confounding factor was identified for the PHQ-9 prediction model; sex at birth was found to be a confounder for the K6 model (p<0.01). A few other confounders, including primary diagnosis, clinical staging and age, were identified as weakly associated with utility in TTU models using anxiety and depression measurements other than PHQ-9. Considering that many of these factors are unlikely to change over three months, they were not evaluated in the mixed effect models.
3.4 Longitudinal TTU regression models
Regression coefficients of the baseline score and score changes (from baseline to follow-up) estimated in individual GLMM and LMM models are summarised in Table 3. Bayesian R2 from each model is reported. Modelled residual standard deviations (SDs) are also provided to support simulation studies which need to capture individual level variation. In GLMM and LMM models, the prediction models using OASIS and PHQ-9 respectively had the highest R2 (0.68 and 0.76) and lowest estimated residual SD. R2 values were above 0.7 for all LMM models and above 0.6 for all GLMM models except for the K6 model. Variance of the random intercept was comparable with the residual variance.
Table 3: Estimated coefficients from longitudinal TTU models for candidate predictors
The coefficients of score change from baseline were generally estimated to be lower than the coefficients of baseline score (except for SCARED). The mean ratio between the two coefficients (βchange/βbaseline) is 0.82 for K6, between 0.8 and 0.85 for depression measurements and between 0.9 and 1.09 for anxiety measurements.
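To illustrate what such a ratio implies, consider a cross-sectional model that applies βbaseline to a within-person score change ΔS. As a sketch on the model (link or transformed) scale only, before back-transformation to utility, the predicted change relative to that implied by the within-person coefficient is the inverse of the ratio. For the K6 ratio of 0.82:

```latex
\[
\frac{\beta_{baseline}\,\Delta S}{\beta_{change}\,\Delta S}
= \left(\frac{\beta_{change}}{\beta_{baseline}}\right)^{-1}
= \frac{1}{0.82} \approx 1.22
\]
```

That is, the predicted change is roughly 22% larger on the model scale. The corresponding difference on the utility scale depends on the back-transformation and on the individual's starting score.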
Distributions of observed and predicted utility scores and their association from GLMM (Gaussian distribution and log link) and LMM (complementary log-log transformation) using PHQ-9 are plotted in Figure 2. Compared with GLMM, the predicted utility scores from the LMM model converge better to the observed distribution and provide better estimates at the tail of the distribution. When the observed utility scores were low, the predicted utility scores were too high in the GLMM model; see Figure 2 (B). The observed and predicted distributions of utility scores for other anxiety and depression measurements were similar from LMM models. However, GLMM models had low coverage of utility scores below 0.3 and also made predictions out of range (over 1).
Figure 2: Comparison of observed and predicted AQoL-6D utility score from longitudinal TTU of PHQ-9. (A) Density plots of observed and predicted utility scores (GLMM with Gaussian distribution and log link). (B) Scatter plots of observed and predicted utility scores by timepoint (GLMM with Gaussian distribution and log link). (C) Density plots of observed and predicted utility scores (LMM with clog-log transformation). (D) Scatter plots of observed and predicted utility scores by timepoint (LMM with clog-log transformation).
We also evaluated models with SOFAS at baseline and SOFAS change from baseline added to psychological distress, depression and anxiety predictors (see Tables A.3 and A.4). SOFAS scores were generally found to be associated with utility scores when controlling for anxiety and depression symptom measurements in longitudinal models.
The secondary analysis where SOFAS is the sole predictor resulted in models with slightly lower R2 than all primary analysis models. Adding the PHQ-9 depression measure to each anxiety measure predictor did not notably improve the performance of these models.
Detailed summaries of all models from the primary and secondary analyses are available in the online data repository (see “Availability of data and materials”).
3.5 Toolkits for predicting QALYs and modelling additional TTUs
We created an online results data-repository and three R packages to facilitate easy access to and application of study outputs and replication of study methods. See “Availability of data and materials” for details of where these resources (and supporting documentation) can be accessed.
4 Discussion
MAUIs are largely absent in routine data collection in clinical mental health services. This gap means that it can be difficult for researchers, service planners and service commissioners to derive much economic insight from the often-rich outcome data that is collected in administrative and treatment evaluation datasets. Existing TTU algorithms may not appropriately predict longitudinal change in utility weights especially in help-seeking young people. Our study addresses this important gap and is the first to evaluate longitudinal mapping ability between affective symptom measurements and health utility in a cohort of help seeking young people.
Although there is encouraging evidence about the quality, effectiveness and cost-effectiveness of youth mental health service innovations worldwide [26][27], the public health and economic returns from systemic reforms to support better mental health in young people still need to be better understood [28]. Our study contributes to this goal by developing tools that can extract additional economic insights from existing mental health datasets by facilitating prediction of QALYs with our TTU algorithms and supporting the development of additional TTU algorithms by other researchers.
By helping to translate measures commonly collected in youth mental health services to QALYs, our TTU algorithms enable greater use of cost-utility analyses (CUAs). Unlike alternative economic evaluation types (e.g., Cost Consequence Analysis and Cost-Effectiveness Analysis using measures other than health utility) CUAs have commonly understood willingness to pay benchmarks for outcomes and facilitate comparison of the value for money claims of interventions from different illness groups. In practical terms, CUAs can help a decision-maker assess the competing economic claims of an intervention for depression compared to an intervention in anxiety or determine whether it may be efficient to fund expanded access to specified mental health services by redirecting parts of the general health budget.
As many youth mental health services routinely collect data on at least one of our six candidate predictors and the measure of functioning (SOFAS) included in our models, the TTU algorithms we developed in this study may have widespread applicability. Importantly, our TTUs were developed in a clinical sample of 12-25 year olds, using adolescent AQoL-6D weights. We were able to independently predict adolescent AQoL-6D from each of the six candidate measures we assessed, with PHQ-9 having the best predictive performance. Predictive performance was improved when adding SOFAS as an additional predictor or confounder to each model; SOFAS also performed well as an independent predictor. These results may be useful for service system planners in helping to prioritise which measures should be included in routine data collection. Although direct measurement of health utility with measures such as the ReQoL [29] may be feasible in some mental health services, relying on clinical measures that can also map to health utility may be an attractive alternative.
A key feature of QALYs is their longitudinal dimension - health utilities are weighted and aggregated based on the time spent in varying health states. Our results suggest that psychological distress, depression and anxiety measurements explain the variations of health utility and that cross-sectional variations can be used to approximate the longitudinal change in this cohort. However, a finding of our study is that, for psychological distress and depression measures at least, TTU algorithms developed from cross-sectional data may slightly over-estimate these changes, introducing bias into QALY predictions (overestimating QALYs for populations whose health utility improves over time, underestimating QALYs for those with deteriorating mental health).
Key strengths of our study include the novelty of our clinical youth mental health study sample, the use of clinically relevant and frequently collected outcome measures as predictors, the appropriateness and range of statistical methods deployed, the comparison of within-person and between-person differences in health utility weight predictions and highly replicable, publicly disseminated study algorithms. We acknowledge limitations that our data pertained to a single country, and we explored only one MAUI-derived utility weight. We did not examine some potential predictors that may be more common in some mental health services (for example we explored K6, as opposed to the expanded, and commonly used measure, the K10).
However, using utility weight input data derived from the same country as that to which an analysis pertains may be relatively unimportant [30], particularly when the MAUI is well suited to the relevant health condition (as is the case with AQoL and mental health [9]). Furthermore, our R packages should help make it relatively straightforward for others to replicate our study algorithm in different samples (non-Australian, non-clinical and/or non-youth populations) and generalise our methods to developing TTU algorithms that use different predictors (other clinical, functioning and demographic measures) and other utility measures (e.g., EQ-5D). Clinical trial datasets, which now usually collect MAUIs, could provide rich opportunities for applying our algorithm to develop and test new TTU algorithms.
By distributing study outputs as freely available open science resources we hope to make it easier to access and appropriately and consistently apply study findings. Open science resources also provide a valuable opportunity for other researchers to contribute refinements and extensions so that the usefulness of our study algorithm improves with time.
5 Conclusions
We have found that it is possible to predict both within-person and between-person differences in adolescent AQoL-6D utility weights from measures routinely collected in youth mental health services. TTU algorithms developed from cross-sectional data can approximate longitudinal changes in health utility, but may slightly over-estimate these changes. The TTU algorithms we have developed can help inform resource allocation decisions relating to the mental health of young people. Our toolkits also provide a basis for future research that extends our work with additional TTU algorithms.
Availability of data and materials
Detailed results in the form of catalogues of the TTU models produced by this study and other supporting information are available in the results repository https://doi.org/10.7910/DVN/DKDIB0. Tools for finding and using the TTU models appropriate for use with new prediction datasets are available as part of the youthu R package (https://ready4-dev.github.io/youthu). The youthvars R package (https://ready4-dev.github.io/youthvars/) provides a number of tools helpful for replicating this study (including a synthetic dataset) while TTU (https://ready4-dev.github.io/TTU/) has tools for both replicating the study and generalising our algorithms to develop TTU algorithms with other utility measures and predictors.
Ethics approval
The study was reviewed and granted approval by the University of Melbourne’s Human Research Ethics Committee, and the local Human Ethics and Advisory Group (1645367.1).
Funding
This study was funded by the National Health and Medical Research Council (NHMRC, APP1076940), Orygen and headspace.
Conflict of Interest
None declared.
A Appendix
A.1 Additional tables
Table A.1: 10-fold cross-validated model fitting index for different OLS or GLM models using PHQ-9 total scores as predictor with the baseline data
Table A.2: 10-fold cross-validated model fitting index for different candidate predictors estimated using GLM with Gaussian distribution and log link with the baseline data
Table A.3: Estimated coefficients from longitudinal TTU models based on individual candidate predictors and SOFAS score using LMM (with clog-log transformation)
Table A.4: Estimated coefficients from longitudinal TTU models based on individual candidate predictors and SOFAS score using GLM (Gaussian distribution with log link)
Table A.5: R packages used in data analysis and reporting
A.2 Additional figures
Figure A.1: Variable importance estimated using random forest
Footnotes
Addresses inconsistencies between written summary and tables caused by error in program to render manuscript.