Abstract
Background: Research to understand the complex aetiology of depressive and anxiety disorders often requires large sample sizes, but this comes at a cost. Large-scale studies are typically unable to utilise “gold standard” phenotyping methods, instead relying on remote, self-report measures to ascertain phenotypes.
Aims: To assess the comparability of two commonly used phenotyping methods for depression and anxiety disorders.
Method: Participants from the Genetic Links to Anxiety and Depression (GLAD) Study (N = 37,419) completed an online questionnaire including detailed symptom reports. They received a lifetime algorithm-based diagnosis based on DSM-5 criteria for major depressive disorder (MDD), generalised anxiety disorder (GAD), specific phobia, social anxiety disorder, panic disorder, and agoraphobia. Any anxiety disorder included participants with at least one anxiety disorder. Participants also responded to single-item questions asking whether they had ever been diagnosed with these disorders by health professionals.
Results: Agreement for algorithm-based and single-item diagnoses was high for MDD and any anxiety disorder but low for the individual anxiety disorders. For GAD, many participants with a single-item diagnosis did not receive an algorithm-based diagnosis. In contrast, algorithm-based diagnoses of the other anxiety disorders were more common than the single-item diagnoses.
Conclusions: The two phenotyping methods were comparable for MDD and any anxiety disorder cases. However, frequencies of specific anxiety disorders varied depending on the method. Single-item diagnoses classified most participants as having GAD whereas algorithm-based diagnoses were more evenly distributed across the anxiety disorders. Future investigations of specific anxiety disorders should use algorithm-based or other robust phenotyping methods.
Introduction
Depression and anxiety disorders are common and debilitating, impacting approximately 30% of the population during their lifetime (1,2), and accounting for 10% of years lived with disability (3). This highlights the importance of understanding disorder-related risks and outcomes. A vital step in researching or treating these conditions is identifying participants with or without the disorder of interest. The “gold standard” for phenotyping in psychiatric research is a structured or semi-structured diagnostic interview conducted in person (or over the phone) by a trained interviewer, such as the Composite International Diagnostic Interview (CIDI) (4) or Structured Clinical Interview for DSM-5 (SCID) (5). However, conducting in-person interviews is time-consuming and costly. Due to the heterogeneous and complex aetiology of anxiety and depression, studies often require extremely large samples. This renders in-person interviews impractical, and large-scale studies increasingly use online, self-report questionnaires to ascertain depression and anxiety disorder diagnostic status in participants.
There are two common methods to ascertain a diagnosis when using online questionnaires. Algorithm-based diagnoses involve a screening questionnaire which asks participants to self-report specific symptoms. The questionnaire responses are run through an algorithm based upon diagnostic criteria, such as those of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) (6), to assess whether the participant qualifies for a diagnosis. This approach has variously been referred to as strictly-defined, detailed, or symptom-based phenotyping (7–9). Single-item diagnoses take a contrasting approach and utilise a single question asking participants whether they have ever received a clinical diagnosis of a psychiatric disorder from a health professional. They are also known as minimal, broad, or light-touch phenotyping (8,10). Both algorithm-based and single-item diagnostic methods are in widespread use in depression and anxiety research; however, it is unclear how they compare to one another. In this study, we compared algorithm-based and single-item lifetime diagnoses for major depressive disorder (MDD) and the five core anxiety disorders (generalised anxiety disorder (GAD), specific phobia, social anxiety disorder, panic disorder, and agoraphobia). Our aim was to assess agreement between these two phenotyping methods to determine to what extent they can be used interchangeably.
Methods
Sample
The Genetic Links to Anxiety and Depression (GLAD) Study (https://gladstudy.org.uk) is an online research platform to recruit individuals with a lifetime experience of depression and/or anxiety for future research. The design and implementation of this study are described elsewhere (11). Recruitment is ongoing and this paper includes data from all participants who had completed the survey as of May 19th, 2020 (N = 37,419). The average age of these participants was 38.1 years; 79.6% were female, the majority were white (94.5%), and a large proportion had a university degree (56.8%). Participants responded to an online, self-report questionnaire that included two methods for ascertaining likely depression and anxiety disorder diagnoses: algorithm-based and single-item.
Algorithm-based diagnoses
Algorithm-based diagnoses for MDD and GAD were evaluated using an adapted version of the short form Composite International Diagnostic Interview (CIDI-SF) (12) as used in UK Biobank (13). Similarly, items derived from the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) criteria assessed specific phobia, social anxiety disorder, panic disorder, and agoraphobia (14). Algorithms were developed to categorise participants as having a lifetime algorithm-based diagnosis for a disorder if their responses corresponded closely to DSM-5 criteria (see Appendix 1 in Supplementary Materials).
Single-item diagnoses
Single-item diagnoses were self-reported in response to the question: “Have you ever been diagnosed with one or more of the following mental health problems by a professional, even if you don’t have it currently?” Participants were prompted to select all diagnoses that applied. Participants were categorised as having a single-item diagnosis if they selected the most comparable option to the relevant diagnosis (e.g., “Depression” for MDD). These single-item diagnoses reflect self-reports of a previous medically-provided diagnosis and were not validated against electronic health records (EHR). Phrasing for each of these items can be found in Appendix 2 in Supplementary Materials. We included the single-item of “panic attacks” as well as “panic disorder”, and separately compared both to algorithm-based panic disorder.
“Any anxiety” diagnosis
It is common in research to combine the anxiety disorder subtypes into a single category, on the argument that the overlap between risk factors and outcomes is comparable (e.g., Purves et al. (15)). We were interested in assessing agreement of algorithm-based and single-item diagnoses of “any anxiety” as well as that for the individual anxiety disorders. Algorithm-based “any anxiety disorder” was defined as participants with an algorithm-based diagnosis for at least one of the individual anxiety disorders (i.e., GAD, specific phobia, social anxiety disorder, panic disorder, or agoraphobia). Single-item diagnosis of “any anxiety disorder” included participants who self-reported receiving at least one anxiety disorder diagnosis from a health professional.
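The “any anxiety” definition amounts to a logical OR over the five individual anxiety disorders. The following is a minimal Python sketch of that rule, offered as an illustration only; it is not the authors' R code. The function name, the dictionary keys, and the handling of missing values (returning an unknown status when no disorder is positive and at least one is missing) are assumptions made here and are not taken from the paper.

```python
from typing import Mapping, Optional

# The five anxiety disorders named in the text; key names are hypothetical.
ANXIETY_DISORDERS = ("gad", "specific_phobia", "social_anxiety",
                     "panic_disorder", "agoraphobia")


def any_anxiety(diagnoses: Mapping[str, Optional[bool]]) -> Optional[bool]:
    """Return True if at least one anxiety disorder is positive.

    Assumption: if none is positive and any value is missing (None),
    the combined status is treated as unknown (None).
    """
    values = [diagnoses.get(name) for name in ANXIETY_DISORDERS]
    if any(v is True for v in values):
        return True
    if any(v is None for v in values):
        return None
    return False
```

The same function could be applied once to the algorithm-based indicators and once to the single-item indicators, giving the two “any anxiety” variables that are then compared.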
Analysis
We calculated the number of participants with zero, one, and two or more algorithm-based and single-item diagnoses. We also assessed the frequency of algorithm-based and single-item diagnoses for each disorder as percentages of the whole sample, excluding participants with missing data on one of the measures (e.g., a participant with single-item GAD but missing data for algorithm-based GAD was excluded from the GAD frequencies). Agreement and disagreement levels between these two phenotyping methods were assessed by calculating Cohen’s kappa, sensitivity, and specificity. Sensitivity is the proportion of individuals with a disorder that the measure correctly classifies as having a diagnosis (proportion of true positives). In contrast, specificity is the proportion of individuals without a disorder that are correctly classified as not having a diagnosis (proportion of true negatives). Since we lacked a ‘gold standard’ reference in this sample, sensitivity and specificity analyses were conducted in both directions. All data cleaning and analyses were conducted in R version 3.5.3 (16).
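As an illustration of the statistics described above, the following Python sketch computes overall agreement, sensitivity, specificity, and Cohen's kappa from two binary measures, dropping participants with missing data on either one, and runs the comparison in both directions. It is a sketch under stated assumptions, not the authors' R scripts: the function name `agreement_stats` and the toy data are invented for this example.

```python
from typing import Optional, Sequence


def agreement_stats(reference: Sequence[Optional[bool]],
                    test: Sequence[Optional[bool]]) -> dict:
    """2x2 agreement statistics for `test` against `reference`.

    Pairs with a missing value (None) on either measure are dropped,
    mirroring the exclusion rule described in the text.
    Note: raises ZeroDivisionError if the reference has no positives
    or no negatives after exclusions.
    """
    pairs = [(r, t) for r, t in zip(reference, test)
             if r is not None and t is not None]
    n = len(pairs)
    tp = sum(1 for r, t in pairs if r and t)
    tn = sum(1 for r, t in pairs if not r and not t)
    fp = sum(1 for r, t in pairs if not r and t)
    fn = sum(1 for r, t in pairs if r and not t)

    observed = (tp + tn) / n
    # Chance agreement from the marginal proportions of each measure.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return {
        "n": n,
        "agreement": observed,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (observed - expected) / (1 - expected),
    }


# Toy data (invented): 10 participants, one with a missing algorithm result.
algorithm = [True, True, True, True, False, False, True, False, None, True]
single_item = [True, True, False, True, True, False, True, False, True, True]

# Both directions, since neither measure is a gold standard.
single_vs_algorithm = agreement_stats(reference=algorithm, test=single_item)
algorithm_vs_single = agreement_stats(reference=single_item, test=algorithm)
```

Overall agreement and kappa are symmetric in the two measures, so only sensitivity and specificity can change when the reference is swapped.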
Code availability
R scripts for the diagnostic algorithms and analyses included in this paper are available at https://github.com/mollyrdavies/GLAD-Diagnostic-algorithms.
Data availability
The data that support the findings of this study are available on request from the corresponding author, TCE. The data are not publicly available due to restrictions outlined in the study protocol and specified to participants during the consent process.
Results
Frequencies
Single-item diagnoses were more frequent than algorithm-based diagnoses. As shown in Table 1, 35,399 (94.6%) participants reported a diagnosis of a major depressive or anxiety disorder on the single-item method (such a high proportion is expected since GLAD participants identified themselves as having had an anxiety and/or depressive diagnosis at some point in their lives), whereas 33,787 (89.8%) participants screened positive for at least one of the algorithm-based diagnoses. A higher proportion of participants (73.9%) reported two or more single-item diagnoses compared to two or more algorithm-based diagnoses (62.3%).
Figure 1 displays the frequencies of algorithm-based and single-item diagnoses in the sample for each of the disorders. MDD had the highest frequency, which was consistent across phenotyping methods (88.4% algorithm-based, 88.7% single-item). The frequencies of the anxiety disorders varied widely depending on the measure. The majority of participants had a single-item diagnosis of GAD (78.6%), but the percentage with an algorithm-based GAD diagnosis (50.2%) was approximately two-thirds of that, indicating a large discrepancy between the two methods. The remaining anxiety disorders had higher frequencies of algorithm-based than single-item diagnoses. For instance, the percentages of participants with algorithm-based specific phobia (18.5%), panic disorder (21.6%), and agoraphobia (20.1%) were more than double those of the respective single-item diagnoses. However, the proportion of algorithm-based panic disorder (21.6%) was only around half the frequency of single-item panic attacks (40.0%).
Agreement
We examined the agreement between algorithm-based and single-item diagnoses. Figure 2 displays the agreement and disagreement for each disorder. The results in Figure 2 were also examined post-hoc by sex, but differences were minimal (see Appendix 3 in Supplementary Materials). Sensitivity, specificity, and Cohen’s kappa are presented in Table 2.
MDD had the highest overall agreement (84.6%) between algorithm-based and single-item diagnoses, whereas GAD had the lowest (58.2%). However, Cohen’s kappa values for all diagnoses were low (0.09-0.32), meaning that the reliability between these measures for all disorders is minimal at best (17).
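High raw agreement can coexist with a low kappa when both measures classify most participants as cases, because much of the agreement is then expected by chance. The Python sketch below illustrates this using only the MDD figures quoted in the text (88.4% algorithm-based, 88.7% single-item, 84.6% overall agreement). It is an approximation for intuition, not a reproduction of Table 2: the quoted percentages may rest on slightly different denominators, and the function name is an assumption introduced here.

```python
def kappa_from_marginals(p_a: float, p_b: float, observed: float) -> float:
    """Cohen's kappa for two binary ratings, given each rating's positive
    proportion and the observed overall agreement (all as fractions)."""
    # Agreement expected by chance if the two ratings were independent.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


# MDD figures quoted in the text.
expected_mdd = 0.884 * 0.887 + (1 - 0.884) * (1 - 0.887)
kappa_mdd = kappa_from_marginals(0.884, 0.887, 0.846)
print(f"chance agreement: {expected_mdd:.3f}, kappa: {kappa_mdd:.2f}")
# prints "chance agreement: 0.797, kappa: 0.24"
```

With roughly 80% agreement expected by chance alone, an observed agreement of 84.6% leaves little room above chance, which is consistent with the low kappa range reported.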
Sensitivity (proportion of true positives) was high, but specificity (proportion of true negatives) was low, for single-item MDD (0.91; 0.33), any anxiety (0.91; 0.29), and GAD (0.87; 0.28) relative to the respective algorithm-based measure. This indicates that these single-item diagnoses had high proportions of both true and false positives when compared to the algorithm-based measure.
Notably, the sensitivity and specificity of algorithm-based MDD (0.91; 0.33) were the same as those found for the single-item diagnosis, meaning that the proportions of true positives and true negatives between these measures are comparable for this disorder, regardless of the direction of comparison. Sensitivity and specificity values indicated that algorithm-based any anxiety (0.81; 0.48) had high proportions of true and false positives for single-item any anxiety.
In contrast to the findings for MDD, GAD and any anxiety, sensitivity of single-item diagnoses for specific phobia, social anxiety disorder, panic disorder, and agoraphobia was low (0.13-0.43) while specificity was high (0.86-0.98). For these anxiety disorders, single-item diagnoses had low proportions of true positives, high proportions of false negatives, and high proportions of true negatives for the corresponding algorithm-based diagnoses. The sensitivity of algorithm-based diagnoses for single-item GAD, specific phobia, social anxiety disorder, panic disorder, and agoraphobia was low to moderate (0.36-0.65) while specificity was moderate (0.69-0.83). This demonstrates that the algorithm-based anxiety subtypes predicted true positives of single-item diagnoses at approximately random chance (50%) but were moderately better at classifying true negatives.
The single-item measure of panic attacks had moderate sensitivity (0.62) and specificity (0.66) for algorithm-based panic disorder, indicating that classification of true positives and true negatives was slightly above random chance. Single-item panic attacks had a higher proportion of true and false positives than single-item panic disorder (0.14; 0.93) for algorithm-based panic disorder.
Discussion
Overview
In this study, we examined the agreement and disagreement between lifetime algorithm-based and single-item diagnoses of MDD, any anxiety, and the five core anxiety disorders (GAD, specific phobia, social anxiety disorder, panic disorder, and agoraphobia). Analyses were conducted in participants who had self-defined as having lifetime experience of a depressive and/or anxiety disorder. We also assessed how single-item panic attacks compared to algorithm-based panic disorder, to determine whether agreement was better or worse than single-item panic disorder. Single-item diagnosis refers to the self-report of a diagnosis from a clinician, whereas algorithm-based diagnosis is based on participant responses to symptom questions that are then assessed against DSM-5 criteria for the disorder. Since the anxiety subtypes are sometimes grouped together in research (e.g., (15)), we included the “any anxiety” category to compare agreement for anxiety disorders as a group as well as individually.
Our results showed high agreement between algorithm-based and single-item diagnoses for MDD (84.6%). The lowest agreement between the two measures was for GAD (58.2%). Agreement for any anxiety (76.7%) and the other anxiety subtypes (specific phobia, social anxiety disorder, panic disorder, and agoraphobia; 70.3-82.3%) was higher than for GAD. Results from the sensitivity and specificity analyses demonstrated that single-item MDD, any anxiety, and GAD tended to over-diagnose compared to the respective algorithm-based diagnosis. Interestingly, algorithm-based MDD and any anxiety also had high proportions of false positives (low specificity) for the single-item measures. This suggests that single-item and algorithm-based measures for MDD and any anxiety disagree substantially about which participants do not meet diagnostic criteria.
In contrast, our results suggested that single-item specific phobia, social anxiety disorder, panic disorder, and agoraphobia tended to under-diagnose when compared to the respective algorithm-based measure. Many participants with algorithm-based diagnoses of these anxiety subtypes did not report a single-item diagnosis of the same disorder. The majority of participants reported a single-item diagnosis of GAD rather than one of the other anxiety subtypes.
As expected, single-item panic attacks had a high proportion of false positives when compared to algorithm-based panic disorder. Panic attacks are a symptom that can manifest in isolation (18) and are not specific to panic disorder. However, the sensitivity of single-item panic attacks was higher than single-item panic disorder for the algorithm-based measure, indicating that this broader diagnosis captured a higher proportion of participants with algorithm-based panic disorder.
Implications
Our findings demonstrated that algorithm-based and single-item diagnoses for MDD and any anxiety are reasonably comparable and have particularly high agreement on participants with a diagnosis of MDD. These findings suggest that single-item MDD and any anxiety may be comparable to algorithm-based diagnoses for identifying cases but differ in the classification of those without a diagnosis. This is relevant to the debate over broad phenotyping of MDD, on whose value and utility opinions in the field differ. Some studies reported that participants ascertained using single-item measures of diagnosis or even treatment-seeking have high genetic overlap with algorithm-based or clinically-ascertained MDD samples (19,20), suggesting comparability between the measures. Other researchers have argued that broad depression phenotyping shows the same genetic overlap with neuroticism and therefore is not specific to MDD (8). Algorithm-based MDD has also been found to have significantly higher heritability than single-item MDD, suggesting that utilising the single-item measure could decrease the power to detect genetic effects despite the increase in sample size (8,21). These reduced heritability estimates could be partially explained by the low specificity of single-item for algorithm-based MDD and any anxiety, as misclassification dilutes the power of case-control analyses to detect differences between the samples (22,23). Combining multiple broad phenotyping measures (e.g., single-item diagnoses, single-item help-seeking questions, and self-reported antidepressant usage) has been shown to reduce misclassification and increase heritability of MDD cases to equal or exceed heritability estimates of algorithm-based MDD in the UK Biobank (21).
However, our results indicate algorithm-based and single-item diagnoses for the anxiety subtypes (GAD, specific phobia, social anxiety disorder, panic disorder, and agoraphobia) differ substantially in classifying positive diagnoses. Single-item methods categorised the majority of participants as having GAD, whereas algorithm-based measures show more even distribution across the subtypes. The lower percentage of single-item diagnoses of the anxiety disorders (aside from GAD) could be due to a lack of treatment-seeking or recognition. Many individuals with symptoms do not seek treatment for mental health or related problems (24,25) and those that do more commonly discuss their problems with a general practitioner (GP) rather than a mental health professional (24). However, research has shown that there is an under-recognition of anxiety disorders, particularly by GPs (26–29). GPs have limited amounts of time and resources and lack specialised training (30) to conduct comprehensive assessments of anxiety symptoms. It is therefore possible that GPs encountering distressed patients may identify symptoms as “anxiety” without specifying a disorder. In the GLAD Study, the phrasing of the single-item GAD question encapsulates general nerves, worry, or anxiety to account for this, but may be over-estimating the number of participants given a specific GAD diagnosis as a result.
These findings have important implications for research studies investigating disorder-specific risk factors or outcomes for anxiety subtypes. Although some factors are largely shared between major depressive and anxiety disorders (e.g., genetic factors), others show more specificity (e.g., environment (31,32), treatment approaches (33)). As such, genetic research studies focussed on expanding sample sizes may find that single-item measures are sufficient, since many of the genetic influences are shared between major depressive and anxiety disorders (32,34). However, single-item or broad phenotyping for anxiety disorders tends to categorise the majority of participants as having GAD. Therefore, in order to understand disorder-specific risk factors or investigate treatment approaches for the anxiety subtypes, an algorithm-based or more stringent assessment (e.g., SCID interview) would be required.
Limitations
The GLAD Study has been successful in recruiting a large number of participants to complete detailed phenotyping measures, which has enabled us to complete this thorough comparison of these two types of measures. However, as with any study, there are limitations. Eligibility criteria for the study included having either an algorithm-based or single-item diagnosis for depression, anxiety, or other related psychiatric disorder (e.g., bipolar disorder, obsessive-compulsive disorder). By design, we therefore have a low representation of participants without MDD or any anxiety (see Figure 1 for frequencies). Specificity (proportion of true negatives) was low between the measures, suggesting that single-item and algorithm-based methods may differ in terms of who they categorise as not having a disorder. However, this sample may not be equipped to accurately estimate specificity due to the small number of participants without a diagnosis. Another study conducted in the UK Biobank, a cohort of older adults recruited from the general population, compared algorithm-based and single-item MDD and found much lower agreement between the measures (7). This difference in method agreement between the two studies may be due to the higher proportion of participants in the UK Biobank sample without a diagnosis. Notably, both the UK Biobank and the GLAD Study samples are disproportionately white and highly educated compared to the UK population. The GLAD Study sample is also disproportionately female. Exploration of measurement agreement in more representative samples, both in terms of population demographics and prevalence of psychopathology, would establish whether the high agreement we found between these measures is generalisable.
The algorithm-based and single-item diagnoses have not been compared to a ‘gold standard’ clinical interview. There has been minimal and conflicting evidence for the validity of the self-report CIDI-SF for MDD, which was utilised here to determine algorithm-based MDD. Some studies show comparable overlap between the self-report CIDI-SF with diagnostic interviews (35,36) while others do not (37). Other studies comparing single-item measures to clinical interviews have found moderate agreement of single-item MDD (38,39) but poor agreement for single-item anxiety disorders (24). As a result, we cannot make any conclusions about which diagnosis is more accurate from the analyses conducted here.
Further research is therefore required to validate these measures against ‘gold standard’ clinical interviews. Validation of these measures is key to ensuring that research findings are relevant to clinical practice. Nonetheless, it is worth noting that some researchers have argued that a ‘gold standard’ diagnosis does not exist. Even structured and semi-structured interviews may result in different classifications of diagnosis and estimates of population prevalence (40). Other validation methods for these measures are worth exploring, such as comparing the genetic overlap or by comparing against clinical outcome measures such as functional impairment or treatment response.
At this point we could not assess whether participants’ self-report of a clinical diagnosis matched their clinical data nor which health professional provided the diagnosis (e.g., general practitioner or psychiatrist). Other studies which have utilised single-item diagnoses in the context of genetics have similarly done so without medical record validation (15,19,20).
Furthermore, since individuals with depressive and anxiety disorders often do not present in clinic or go undiagnosed (24,25,41), reliance on health records alone is not a substitute for asking the participant. However, all GLAD Study participants have consented to providing medical record access, so this comparison could be conducted in our sample in the future.
Conclusion
Large-scale research projects that lack the resources to conduct ‘gold standard’ clinical interviewing commonly utilise algorithm-based and single-item phenotyping methods. We compared these two measures and found good comparability between algorithm-based and single-item MDD and “any anxiety” disorder for categorisation of participants with a diagnosis. In contrast, there was poor agreement between these two types of ratings for participants not having the relevant disorder. Of note, ascertainment of participants with diagnoses for the individual anxiety disorders differed substantially depending on which phenotyping measure was applied. Our results suggest that single-item diagnoses may be sufficient for discovery of shared genetic effects, but investigation of disorder-specific factors or outcomes would require an algorithm-based or other strictly-defined measure. In designing future studies, including and combining multiple methods of ascertaining diagnostic status, such as single-item, algorithm-based, and EHR data, may yield more robust phenotypes and increase power for analyses (21).
Authorship contribution statement
G.B., T.C.E., J.R.B., C.R.H., M.H., I.R.J., N.K., A.M.M., D.J.S., D.V., and J.T.R.W. designed the GLAD Study.
M.R.D., B.N.A, C.A., S.C.B.C, C.H., G.K., I.M., M.M.K., Y.L., D.M., H.C.R., and K.N.T. carried out the data collection.
M.R.D. and T.C.E. conceived of the presented idea.
K.A.S.D. and K.A.G. advised on analyses to be included in the paper and assisted with interpretation.
Y.L., C.H, K.N.T., and M.R.D. cleaned and prepared the data. M.R.D. analysed the data, assisted by A.J.P., M.S., and A.t.K.
J.E.J.B., G.K., D.V., and R.Z. provided vital clinical input on the diagnostic algorithms and interpretation of results.
M.R.D. and T.C.E. wrote the paper.
G.B. and T.C.E. jointly supervised this work.
All authors contributed to interpretation of results. All authors provided critical feedback on manuscript drafts and approved the final version.
Funding
This work was supported by the National Institute for Health Research (NIHR) BioResource [RG94028, RG85445], NIHR Biomedical Research Centre [IS-BRC-1215-20018], HSC R&D Division, Public Health Agency [COM/5516/18], MRC Mental Health Data Pathfinder Award (MC_PC_17,217), and the National Centre for Mental Health funding through Health and Care Research Wales. Prof Eley and Dr Breen are part-funded by a program grant from the UK Medical Research Council (MR/M021475/1). Dr Buckman was supported by a Clinical Research Fellowship from the Wellcome Trust (201292/Z/16/Z). Dr Goldsmith receives funding from NIHR, MRC, NIH, and the Juvenile Diabetes Research Foundation (JDRF). Dr Krebs is funded by a Clinical Research Training Fellowship from the Medical Research Council (MR/N001400/1).
Declaration of competing interest
Prof Breen has received honoraria, research or conference grants and consulting fees from Illumina, Otsuka, and COMPASS Pathfinder Ltd. Prof Hotopf is principal investigator of the RADAR-CNS consortium, an IMI public private partnership, and as such receives research funding from Janssen, UCB, Biogen, Lundbeck and MSD. Prof McIntosh has received research support from Eli Lilly, Janssen, and the Sackler Foundation, and has also received speaker fees from Illumina and Janssen. Prof Walters has received grant funding from Takeda for work unrelated to the GLAD Study. Dr Zahn is a private psychiatrist service provider and co-investigator on a Livanova-funded observational study. He has received honoraria for talks at medical symposia sponsored by Lundbeck as well as Janssen. He collaborates with EMOTRA, EMIS PLC and Alloc Modulo. The remaining authors have nothing to disclose.
Acknowledgements
We thank the GLAD Study and NIHR BioResource volunteers for their participation, and gratefully acknowledge the NIHR BioResource centres, NHS Trusts and staff for their contribution. We thank the National Institute for Health Research, NHS Blood and Transplant, and Health Data Research UK as part of the Digital Innovation Hub Programme. This study presents independent research funded by the NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King’s College London. Further information can be found at http://brc.slam.nhs.uk/about/core-facilities/bioresource. The views expressed are those of the authors and not necessarily those of the NHS, NIHR, HSC R&D Division, Department of Health and Social Care.
Footnotes
One author (consortium) removed.
Abbreviations
- CIDI: Composite International Diagnostic Interview
- CIDI-SF: Composite International Diagnostic Interview - short form
- SCID: Structured Clinical Interview for DSM-5
- MDD: major depressive disorder
- GAD: generalised anxiety disorder
- DSM-5: Diagnostic and Statistical Manual of Mental Disorders, 5th edition
- GLAD: Genetic Links to Anxiety and Depression
- EHR: electronic health records
- GP: general practitioner