Abstract
Background Avoiding “causal” language with observational study designs is common publication practice, often justified as being a more cautious approach to interpretation.
Objectives We aimed to i) estimate the degree to which causality was implied both by the language linking exposures to outcomes and by action recommendations in the high-profile health literature, ii) examine disconnects between language and recommendations, iii) identify which linking phrases were most common, and iv) generate estimates of the degree to which these phrases imply causality.
Methods We identified 18 of the most prominent general medical/public health/epidemiology journals, and searched and screened for articles published from 2010 to 2019 that investigated exposure/outcome pairs until we reached 65 non-RCT articles per journal (n=1,170). Two independent reviewers and an arbitrating reviewer rated the degree to which they believed causality had been implied by the language in abstracts based on written guidance. Reviewers then rated causal implications of linking words in isolation. For comparison, additional review was performed for full texts and for a secondary sample of RCTs.
Results Reviewers rated the causal implication of the sentence and phrase linking the exposure and outcome as None (i.e., makes no causal implication) in 13.8%, Weak in 34.2%, Moderate in 33.2%, and Strong in 18.7% of abstracts. Reviewers identified an action recommendation in 34.2% of abstracts. Of these action recommendations, reviewers rated the causal implications as None in 5.3%, Weak in 19.0%, Moderate in 42.8% and Strong in 33.0% of cases. The implied causality of action recommendations was often higher than the implied causality of linking sentences (44.5%) or commensurate (40.3%), with 15.3% being weaker. The most common linking word root identified in abstracts was “associate” (n=535/1,170; 45.7%) (e.g. “association,” “associated,” etc). There were only 16 (1.4%) abstracts using “cause” in the linking or modifying phrases. Reviewer ratings for causal implications of word roots were highly heterogeneous, including those commonly considered non-causal.
Discussion We found substantial disconnects between causal implications used to link an exposure to an outcome and the action implications made. This undercuts the common assumptions that certain words are non-causal and that policing them eliminates causal implications. We recommend that instead of policing words, editors, researchers, and communicators should increase efforts at making research questions, as well as the potential of studies to answer them, more transparent.
Introduction
Health sciences research often investigates the relationship between a particular exposure and an outcome. Causal effects between these variables are often implicitly of interest, including in studies based on non-random assignment of the exposure. Most researchers are aware that inferring causality may be fraught with difficulty, and that cautious interpretation may be warranted. However, this “caution” often manifests itself as avoiding causal language, rather than cautious examination of methodological strength of inference and uncertainty. Some author guidelines (e.g., Journal of the American Medical Association1) explicitly prohibit the use of causal language in any study that is not a randomized controlled trial (RCT), often justified by the inaccurate, but common, belief that causal inference is only possible with RCTs.2,3 Health scientists and editors often employ euphemisms or language workarounds.4,5 For example, researchers may reserve use of causal language for only some parts of the manuscript6 or use language that can pass as either causal or non-causal. Alternatively, non-causal language may be used throughout the manuscript, but practical recommendations may still be offered that suggest or require a causal interpretation.7 It is not entirely clear what “counts” as causal language, with no clear standards and few attempts6,8–12 to define and categorize what constitutes causal language.
The use of ambiguous language leads to potential disconnects between the authors’ intentions, methods, conclusions, and perceptions of the work by research consumers and decision-makers.4,5,13 It may also indirectly erode research quality by enabling researchers to make ambiguously causal implications without being accountable to the methodological rigor required for causal inference. Otherwise non-causal language may morph into causal language in outlets for medical practitioners,7,10 press releases,14–16 and media reports.17,18 Ambiguous language may also imply greater support for any practical recommendations that require causal interpretation.19 While some loss of nuance may be attributed to press officers, journalists, and news recipients, too-strong language often starts from the study publications themselves.17 Most importantly, choice of language impacts research consumers’ and decision-makers’ perceptions,13 which in turn impacts health decisions.
Despite widespread discussions about causal language use,4,5,20 systematic evidence of its usage in practice is limited. In a review of 60 observational studies that were published in The BMJ, a fifth were judged to have inconsistencies in their use of causal language.6 Prevalence and use of causal language has been examined in studies concerning the overall medical literature,6,17,21 obesity,11 and orthopedics,22 noting that in the latter all uses of causal language in non-RCTs were assumed to be “misuse.” To date, there have been no large-scale systematic assessments of language used to link exposures and outcomes in the medical and epidemiological literature; existing efforts6,8–12 heavily focus on binary assessments of the language used (causal vs. non-causal).
This study systematically examined the linking language used in studies with a primary exposure and outcome in the high-profile medical and epidemiological literature. Our objectives were to (i) identify the linking words and phrases used to describe relationships between exposures and outcomes, (ii) generate estimates of the strength of causality stated or implied by the linking phrases and sentences using a guided subjective assessment process, (iii) examine the prevalence of action recommendations that would require causal inference to have been made, and (iv) examine disconnects between causal implications in linking sentences and action implications.
Methods
Our target sample consisted of studies that primarily quantified the relationship between a primary exposure and an outcome in humans and were published in high-profile general health, medicine or epidemiology journals between 2010 and 2019. Years 2020-2021 were not included due to disproportionate focus on coronavirus disease 2019 (COVID-19). The study was pre-registered on the Open Science Framework (OSF): https://osf.io/jtdaz/. Changes made to the protocol after preregistration are documented and explained in Appendix 1.
Search
Our search was structured in two steps: a preliminary search for appropriate journals and a secondary search for published papers within these journals.
Journal inclusion/exclusion criteria
The “top” journals in health, medicine, and epidemiology were determined by journal ranking from journals listed under Journal Citation Reports (JCR)23 categories for medicine and public health and SciMago’s category for Medicine. The top 200 journals from the SciMago Journal rank (SJR)24 and JCR’s impact factor rating for medical journals, and the top 200 highest impact factor rating journals for Public Health as extracted on May 26, 2020 were screened according to the following inclusion criteria: (1) primarily serves articles that are peer-reviewed, about health-specific topics, reporting primary data (e.g., the journal cannot be one which primarily serves reviews, meta-analyses, and other secondary data), primarily concerning human-level observations (e.g., not animal models or microbiology); (2) must be a general health/medicine/epidemiology journal (i.e., journals which are focused on a narrow speciality and/or disease area of medicine were excluded); (3) the journal must have been founded in 2010 or earlier.
Among the journals meeting these criteria, lists of the 15 highest ranked journals by (1) impact factor, (2) h-index, and (3) SJR score were combined into a single list without duplicates. An additional decision was made during screening on June 24, 2021 to drop journals that had screening acceptance rates of <10% and/or did not have sufficient numbers of remaining unscreened articles meeting our search criteria to meet journal quotas (See Appendix 1).
Search terms
Once the journal list was acquired, we performed a PubMed search to obtain all articles published in these journals from 2010 to 2019 (details in Appendix 2), with Medical Subject Headings (MeSH) terms to eliminate article types not meeting inclusion criteria. The search was performed in R25 using the easyPubMed package26.
Articles were stratified by journal and whether they had the “Randomized Controlled Trial” MeSH tag. Identified articles were sorted in journal/article type stratified random order for screening. Disease areas were obtained for each article using the 2020 MeSH tag hierarchy27 for disease area headings.
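The stratified random ordering described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the study; the record fields (`pmid`, `journal`, `is_rct`), the example records, and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_random_order(articles, seed=0):
    """Return one randomly ordered screening queue per (journal, is_rct) stratum."""
    rng = random.Random(seed)  # fixed seed so the ordering is reproducible
    strata = defaultdict(list)
    for article in articles:
        strata[(article["journal"], article["is_rct"])].append(article)
    for queue in strata.values():
        rng.shuffle(queue)  # random order within each stratum
    return dict(strata)


# Hypothetical records standing in for PubMed search results.
articles = [
    {"pmid": f"{journal}-{i}", "journal": journal, "is_rct": i % 4 == 0}
    for journal in ("Journal A", "Journal B")
    for i in range(8)
]
queues = stratified_random_order(articles, seed=2021)
```

Screening would then proceed down each queue until the quota for that journal and article type is met.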
Screening
Study inclusion/exclusion criteria
Study inclusion criteria were that the study was primarily concerned with the quantitative association of a primary exposure/outcome pair, as assessed by reviewers as below:
● Observations must be at the level of individual humans or aggregate groups of humans
● The primary research question must be to examine the causal and/or non-causal association between one primary exposure concept and one primary outcome concept
○ One “primary” exposure/outcome can include multiple measures of the same or similar broad exposure and/or outcome concept.
▪ Articles can include many exposures/outcomes, but must focus in particular on one exposure/outcome pair as their primary association of interest (e.g., in the title, in the study aims)
▪ Articles that are about more than one primary concept (e.g., searching for what risk factors are associated with the outcome) were excluded.
● The primary research question must be examined quantitatively using primary data (i.e., not a review or meta-analysis)
Studies investigating more than one exposure/outcome set were excluded because (1) it would not be possible to assess a primary exposure/outcome pair per study; (2) study objectives and designs could not easily be compared with other papers; and (3) it would impose additional strain on the management of the data and review.
Procedures
Articles were screened continuously for each journal until journal quotas were met with the addition of a small buffer used for training purposes and for replacement of articles rejected later during review. The journal quotas were 65 non-RCT articles and 6 RCT articles per journal, totalling 1,278 articles (1,170 non-RCTs and 108 RCTs). This sample size is based on informal explorations of sample datasets to yield a reasonable variety of language among the journal dataset and constrained by review capacity. We did not perform a formal sample size calculation because: 1) this descriptive study does not involve substantial hypothesis testing, 2) the variance in the language to be analysed in this study is unknown and is one of the key objectives of this study, and 3) the larger the sample size, the more in-depth we can explore less frequently used language, so we aimed to fully exhaust the available review capacity.
Articles were randomly assigned to three of 18 screening reviewers, with two primary reviewers and one arbitrating reviewer. During screening, the arbitrating reviewer made the inclusion/exclusion decision only in cases where the two primary reviewers disagreed.
Main review
Reviewer recruitment and selection
Reviewers were recruited through a combination of personal solicitations and Twitter-based networks. After initial expression of interest, reviewers were selected based on relevant graduate school education, expertise in relevant areas (e.g., epidemiology, causal inference, medicine, econometrics, meta-science, etc.), availability, and to maximize the diversity of fields, life experiences, backgrounds, and kinds of contributions to the group. All reviewers who completed assigned main reviews are coauthors.
Reviewer roles and training
All reviewers received one hour of instruction training and an additional set of training reviews to complete before the primary review. During the training process and the main review, reviewers were encouraged to engage in an active discussion on Slack to clarify guidelines, discuss issues, and generate community standards for review areas that may be more ambiguous. Reviewers were instructed to avoid referring to specifics of a particular study and to instead keep the discussion in general terms at all times to balance eliciting individual subjective opinions with group guidance. By design, reviewers may have developed improved clarity and different understandings of the guidance and how to give responses over time through discussion, and were allowed to make changes.
Each article was first reviewed by three unique randomly selected reviewers: two independent primary reviewers and an arbitrating reviewer. The arbitrating reviewer was given the submitted data from the primary reviewers. Rather than simply resolving conflicts, the arbitrating reviewer’s task was to generate what they believed to be the best and most accurate review of the article given the information available from the primary reviewers, their own reading, and the ongoing community discussions. Arbitrating reviewers were free to decide in favor of one reviewer over another, consolidate and combine reviewer responses, or overturn both primary reviewers as they believed the situation dictated. The main output of the review process is the arbitrator’s review, which underlies subsequent analyses.
Review framework and tool
The review framework and tool were designed to elicit well-guided, replicable, subjective assessments of the key questions for our study. The framing and definitions of words used (e.g., what “causal” language means in this context) are provided in Appendix 3.
Reviewers had the option to recuse themselves from reviewing any article for any reason (e.g., conflicts of interest, connections to authors, etc.); the article was then reassigned to another reviewer. Reviewers could also request that an administrator reevaluate the inclusion of a study. If the administrator determined that the article did not meet inclusion criteria, it was replaced with one from the buffer of accepted screened articles.
Reviewers first identified the primary outcome and exposure, preferably from the title of the study. Reviewers were asked to identify and copy and paste the primary linking sentence, which generally was a sentence in the conclusions section of the abstract or full text containing the primary exposure, outcome, and the linking word/phrase. A linking word/phrase is defined as a word or phrase that describes the nature of the connection between some defined exposure and some defined outcome as identified by the study analysis. This can describe the type of relationship (e.g., “associated with”) and/or differences in levels (e.g., “had higher”) that may or may not be causal in nature. Then, reviewers were asked to identify modifying phrases, or any words/phrases that modify the nature of the relationship in the linking phrase. This includes signals of direction, strength, doubt, negation, and statistical properties of the relationship (e.g., “may be”, “positively”, “statistically significant”).
Reviewers assessed the degree to which the linking sentence implied that the analysis identified a causal relationship between the exposure and outcome on a four-point scale (“linking sentence causal strength”), as shown in Table 1.
Next, reviewers were asked to identify any sentences that contained action recommendations (how a consumer of the research in question might utilize the results and conclusions of the research). This may include recommending that some actor(s) consider changes (or no changes) in some set of procedures and actions. General calls for additional research were not considered action recommendations. After identifying this sentence (if applicable), reviewers were asked to consider the extent that this recommendation would require that a causal relationship had been identified, as shown in Table 1.
Notably, in this framing “no causal implication” does not imply “no or null effects.” Reviewers were instructed to consider causal implications conceptually separately from the size (or lack thereof) of associations and correlations. Strong causal implications may be made even if the effect size measured was null, so long as the language implied that the nature of what was being estimated was causal.
All articles received a review of the titles and abstracts. In addition, one-third of the articles underwent full text assessment. This extended review 1) also included the abstract review questions for the discussion section and for any pop-out sections (i.e., sections that do not appear as part of the main text or abstract, but summarize and highlight key aspects of the study), and 2) included additional questions to help indicate potential areas of causal intent,28 as described in more detail in the review tool provided in the supplementary data. Reviewers also extracted whether there was any theoretical discussion about causal relationships between the exposure and outcome in the introduction, the number of covariates controlled or adjusted for, whether confounding was explicitly mentioned by name, whether a formal causal model was used, and whether explicit causal disclaimer statements were made (e.g., “causation cannot be inferred from observational studies, but…”).
Root linking words/phrases language strength
After arbitrator reviews were completed, we compiled and curated a list of words from the linking words/phrases in the arbitrator reviews, and manually stemmed them to obtain their root words. Reviewers then rated the causal implications of those root words that were found more than once in our sample. This was to mimic language decision processes that base their causal language assessment on selecting words that are or are not causal, and to establish our own systematic assessments of word ratings. For context, reviewers were presented with up to four randomly selected linking words/phrases that contained the root word and had been submitted by arbitrating reviewers (e.g., the root word “associate” had four phrases, including phrases like “associated with” or “association”).
Analysis
The statistical analysis was largely descriptive (e.g., describing the distributions of key extracted variables). Except for comparisons between RCTs and non-RCTs, all statistical analysis was performed on the arbitrated dataset of the non-RCTs only.
Comparisons between two ordinal categorical variables (e.g., strength ratings for causal implications of linking sentence vs. action implications) were estimated by Spearman’s correlation coefficients. Associations between strength ratings and key binary variables (e.g., study type, journals, topic areas, etc.) were estimated with ordinal logistic regression.
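For readers unfamiliar with rank correlation on tied ordinal ratings, the following is a minimal sketch of Spearman's coefficient computed as the Pearson correlation of average ranks. It is written in plain Python for illustration and is not the pspearman-based R code used in the study; the 0 to 3 coding of None/Weak/Moderate/Strong in the usage note is an assumption.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average rank for this group of ties
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)
```

With ratings coded 0 (None) to 3 (Strong), `spearman_rho(linking_codes, action_codes)` would give the coefficient for paired linking-sentence and action-recommendation ratings; the function is undefined when either input has no variation.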
All measures of statistical uncertainty were clustered by journal and calculated using a block bootstrapping procedure unless otherwise specified, where 95% confidence intervals (CIs) were obtained through percentiles of the bootstrapped estimate distribution. In the case that the journals themselves were covariates, the clustered sandwich estimator was used instead. For root word rating proportions, there were no journal clusters, and as such the Wilson estimator was used. No weights were applied (i.e., journals and articles respectively contribute equally to our main results).
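The journal-clustered bootstrap with percentile intervals can be sketched as below. This is an illustrative Python version under simplifying assumptions, not the study's R implementation: whole journals are resampled with replacement, the statistic is recomputed on the pooled articles, and the interval is read from simple order statistics. The function name, the number of replicates, and the example data are hypothetical.

```python
import random


def cluster_bootstrap_ci(records, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for `statistic`, resampling whole clusters with replacement.

    `records` is a list of (cluster_id, value) pairs, e.g. (journal, rating).
    """
    rng = random.Random(seed)
    clusters = {}
    for cluster_id, value in records:
        clusters.setdefault(cluster_id, []).append(value)
    ids = sorted(clusters)
    estimates = []
    for _ in range(n_boot):
        resampled = []
        for cluster_id in rng.choices(ids, k=len(ids)):  # draw journals, not articles
            resampled.extend(clusters[cluster_id])
        estimates.append(statistic(resampled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * (n_boot - 1))]
    upper = estimates[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


def proportion(values):
    return sum(values) / len(values)


# Hypothetical data: (journal, 1 if the abstract has an action recommendation).
records = [("J1", 1), ("J1", 0), ("J1", 0), ("J2", 1), ("J2", 1),
           ("J2", 0), ("J3", 0), ("J3", 0), ("J3", 1)]
low, high = cluster_bootstrap_ci(records, proportion, n_boot=1000, seed=1)
```

Resampling at the journal level keeps articles from the same journal together, so within-journal similarity is reflected in the width of the interval.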
Heterogeneity between reviewers was evaluated using Krippendorff’s alpha. For the purpose of this review, disagreement between reviewers is a key result (i.e., heterogeneity between subjective opinions), rather than error.
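As an illustration of the agreement statistic, the sketch below computes Krippendorff's alpha from a coincidence matrix. The ordinal distance metric is an assumption made for this example, since the text does not state which metric was used, and the code is a plain Python sketch rather than the study's R code.

```python
from itertools import permutations


def krippendorff_alpha_ordinal(units, categories):
    """Krippendorff's alpha with the ordinal metric (assumed for illustration).

    `units` holds one list of ratings per article; `categories` is ordered
    from lowest to highest. Undefined if only one category is ever used.
    """
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    coincidence = [[0.0] * k for _ in range(k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a single rating carries no agreement information
        for a, b in permutations(range(m), 2):
            coincidence[index[ratings[a]]][index[ratings[b]]] += 1.0 / (m - 1)
    totals = [sum(row) for row in coincidence]
    n = sum(totals)

    def delta_sq(c, d):
        # Ordinal distance: based on how many ratings lie between two categories.
        lo, hi = min(c, d), max(c, d)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    pairs = [(c, d) for c in range(k) for d in range(k)]
    observed = sum(coincidence[c][d] * delta_sq(c, d) for c, d in pairs)
    expected = sum(totals[c] * totals[d] * delta_sq(c, d) for c, d in pairs)
    return 1.0 - (n - 1) * observed / expected


SCALE = ["None", "Weak", "Moderate", "Strong"]
# Hypothetical ratings from two primary reviewers on four articles.
example = [["None", "Weak"], ["Weak", "Weak"], ["Moderate", "Strong"], ["Strong", "Strong"]]
alpha = krippendorff_alpha_ordinal(example, SCALE)
```

Values near 1 indicate close agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.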
All data management and analyses were conducted using R v4.0.5.25 Spearman correlation coefficients were determined using the pspearman package.29 Ordinal logistic regression was performed using the MASS package.30
Data and code availability
All data and code are publicly available through our OSF repository: https://osf.io/jtdaz, except for files containing personal identifying information and/or personal API keys.
Patient and Public Involvement statement
No patients or participants were involved with this research. All data were obtained from academic literature sources.
Ethics approval
This research is not human subjects research, and as such no ethical approval was required. This research complies with the Declaration of Helsinki.
Results
Search and screening
Figure 1 displays the flow diagram for journal and article selections. Eighteen journals were identified meeting our search criteria: American Journal of Epidemiology, American Journal of Medicine, American Journal of Preventive Medicine, American Journal of Public Health, Annals of Internal Medicine, BioMed Central Medicine, British Medical Journal, Canadian Medical Association Journal, European Journal of Epidemiology, International Journal of Epidemiology, Journal of Internal Medicine, Journal of the American Medical Association, Journal of the American Medical Association Internal Medicine, The Lancet, Mayo Clinic Proceedings, New England Journal of Medicine, PLOS Medicine, and Social Science and Medicine.
After searching PubMed for articles published in these journals from 2010 to 2019, we screened articles until 65 non-RCTs and 6 RCTs were accepted from each of these 18 journals, except for one journal (European Journal of Epidemiology), where only 3 RCTs were identified and included. This yielded 1,170 non-RCTs and 105 RCTs, totalling 1,275 studies reviewed. There were 10 recusals recorded during the main review. The three most common disease areas (as proxied by MeSH headings) in our sample were “Pathological Conditions, Signs and Symptoms” (n=377), “Cardiovascular Diseases” (n=324), and “Nutritional and Metabolic Diseases” (n=198). See Appendix 4 for full terms.
Linking words and phrases
After the arbitrator reviews were completed, root words were obtained through stemming the linking phrases to identify and rate the root linking words themselves.
As shown in Figure 2, by far the most common root linking word identified in abstracts was “associate” (n=535/1,170; 45.7%, 95% CI 40.0, 51.9%), followed by “increase” (n=71/1,170; 6.1%, 95% CI 4.7, 7.8%). The same root word was identified in both the abstract and discussion in 48.2% of cases (95% CI 43.7, 53.6%). We found only 9 (0.8%, 95% CI 0.4, 1.3%) studies where the primary root linking word was “cause.” There were 16 (1.4%, 95% CI 0.6, 2.3%) articles that used the word “cause” when additionally including any instance of the word in either the linking or modifying phrases.
Causal implication(s) strengths
Summary data
Reviewers rated the abstract linking sentence as having no causal implication in 13.8% (95% CI 11.9, 15.9%), weak in 34.2% (95% CI 31.4, 36.7%), moderate in 33.2% (95% CI 29.8, 36.7%), and strong in 18.7% (95% CI 15.1, 22.6%) of cases as shown in Figure 3. Proportions of language used were very similar between the abstract, full-text discussion, and pop-out sections, driven largely by very similar linking sentences in these sections.
Reviewers identified an action recommendation in 34.2% (95% CI 29.0, 39.6%) of abstracts. Of these action recommendations, 5.3% (95% CI 3.5, 7.2%) were rated as having a causal implication of None, 19.0% (95% CI 15.2, 23.0%) Weak, 42.8% (95% CI 39.0, 46.4%) Moderate, and 33.0% (95% CI 29.0, 37.1%) Strong.
By comparison, the full-text review observed a prevalence of action recommendations in the discussion sections of 60.3% (95% CI 52.7, 67.5%), about twice that in abstracts. Pooling all action implications recorded and comparing the rated implication strength in abstracts with that in discussion and pop-out sections, we found negligible, if any, differences in the overall strength of action implications. The log odds of discussion sections having higher ratings than abstracts was -0.00026 (95% CI -0.00024, 0.00013).
No clear pattern was observed in the ratings over time, as shown in Appendix 5.
Comparison of linking sentence strength vs. action implication strength
Figure 4 shows the distributions of causal implications in the linking sentences compared with the action recommendations among the 400 (34.2%) studies with an action recommendation in the abstract. Panel A shows the overall distribution of studies, where 15.3% (95% CI 11.7, 19.2%) of studies with action recommendations had action recommendations that were weaker than the linking sentence language, 40.3% (95% CI 35.1, 45.8%) commensurate, and 44.5% (95% CI 39.9, 48.4%) stronger. The Spearman correlation coefficient between the strength of causal implication in the linking sentence and action recommendations was 0.349 (95% CI 0.256, 0.435), indicating that strength of causal implications was weakly positively correlated between the linking sentences and action recommendations among those abstracts that made action recommendations, as shown in panel B. Panel B shows the distribution of action recommendations at each level of linking causal strength. While stronger causal action recommendations are less likely to occur when linking sentences are weaker, studies with weaker linking sentences often make strong causal action implications. Among the 65.8% of studies with no action recommendation in the abstract, 14.5% (95% CI 11.6, 17.6%) were rated as “None” for linking sentence causal strength, 34.0% (95% CI 30.3, 37.5%) Weak, 33.1% (95% CI 29.2, 37.3%) Moderate, and 18.3% (95% CI 14.5, 22.5%) Strong. The linking sentence ratings overall do not appear to be substantially different between abstracts with action recommendations and those without (log odds of having a higher rank: 0.087; 95% CI -0.162, 0.320).
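The weaker/commensurate/stronger comparison summarized in Panel A amounts to comparing two ordinal ratings per abstract. A minimal Python sketch is given below; the numeric coding of the scale and the example pairs are hypothetical and do not come from the study data.

```python
from collections import Counter

# Assumed numeric coding of the four-point scale.
LEVELS = {"None": 0, "Weak": 1, "Moderate": 2, "Strong": 3}


def compare_strengths(pairs):
    """Share of abstracts whose action recommendation is weaker than,
    commensurate with, or stronger than the linking sentence.

    `pairs` is a list of (linking_rating, action_rating) tuples.
    """
    counts = Counter()
    for linking, action in pairs:
        diff = LEVELS[action] - LEVELS[linking]
        if diff > 0:
            counts["stronger"] += 1
        elif diff < 0:
            counts["weaker"] += 1
        else:
            counts["commensurate"] += 1
    total = sum(counts.values())
    return {label: counts[label] / total
            for label in ("weaker", "commensurate", "stronger")}


# Hypothetical ratings for four abstracts with an action recommendation.
example_pairs = [("Weak", "Strong"), ("Moderate", "Moderate"),
                 ("Strong", "Weak"), ("None", "Moderate")]
shares = compare_strengths(example_pairs)
```

Only abstracts with an action recommendation enter this comparison, which is why the denominator is the subset of abstracts with a recommendation rather than the full sample.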
Words and phrases
As shown in Figure 5, ratings among reviewers (n=47) for causal implication of root words were highly heterogeneous, with the only word to reach near consensus on causal implications being “cause” itself. Reviewers rated words such as “correlate” and “associate” lower on the causal implication rankings, but with substantial variation in strength of implication ratings. Words such as “impact”, “effect”, “affect”, and “prevent” were rated as having very strong causal implications overall. Notably, many of these identified words were used in a variety of ways that could shift their meanings. For example, the root word “lower” could be used as “people with X had lower Y” indicating difference in levels, or “X lowered Y” potentially indicating a more causal relationship.
The root word “associate” was rated as having at least some (i.e. Weak, Moderate, or Strong) causal implication in 26/47 cases (55.3%, 95% CI 41.2, 68.6%). For comparison, 78.6% (95% CI 75.7, 81.2%) of linking sentences containing “associate” or variations in the linking phrase were rated as having at least some causal strength.
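The interval reported for the 26 of 47 reviewers can be reproduced with the standard Wilson score formula, which the Analysis section names as the estimator for root word rating proportions. The following is a minimal Python sketch of that formula, not the study's R code.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(26, 47)
print(f"{100 * low:.1f}, {100 * high:.1f}")  # prints "41.2, 68.6"
```

Unlike the simple normal approximation, the Wilson interval is not centered on the observed proportion, which makes it better behaved with a small number of raters.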
Modifying words and phrases
Modifying phrases were identified in the abstracts of 72.1% of studies (95% CI 69.0, 75.6%). In 11.2% (95% CI 8.6, 14.2%) of studies, a modifying phrase included variations on “statistical” and/or “significant.” Phrases expressing caution (e.g., “may be,” “could,” “potentially”) or strength (e.g., “strongly,” “substantially”) were both fairly common in the modifying phrases extracted. However, given the wide variety of phrases extracted and the lack of a pre-established framework for doing so, no formal categorization of modifying phrases was performed or quantified. The frequencies of modifying words and phrases identified three or more times are shown in Appendix 6.
Differences in strength across key strata
Non-RCTs vs. RCTs
For the RCTs, reviewers rated the causal implication of the abstract linking sentence as being None for 9.5% (95% CI 4.8, 15.2%), Weak for 6.7% (95% CI 2.8, 11.4%), Moderate for 27.6% (95% CI 19.0, 36.4%), and Strong for 56.2% (95% CI 46.4, 65.8%). This is overall much stronger than for the non-RCTs, with a log-odds of RCTs having a higher ordinal linking sentence causal strength rating of 1.63 (95% CI 1.26, 2.04).
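To make the log-odds scale concrete, the sketch below computes the raw log odds ratio of being rated at or above each cut-point of the scale, using the percentages reported for RCT and non-RCT abstracts. This is only descriptive arithmetic under stated assumptions: the reported estimate comes from an ordinal logistic regression that pools across cut-points, so the values here are not expected to reproduce it or its confidence interval.

```python
from math import log


def cumulative_log_odds_ratios(group_a, group_b):
    """Log odds ratio (a vs. b) of being at or above each cut-point.

    Each argument lists category shares from the lowest to the highest category.
    """
    def log_odds_at_or_above(shares, cut):
        upper = sum(shares[cut:])   # share rated at or above the cut-point
        lower = sum(shares[:cut])   # share rated below the cut-point
        return log(upper / lower)

    return [
        log_odds_at_or_above(group_a, cut) - log_odds_at_or_above(group_b, cut)
        for cut in range(1, len(group_a))
    ]


# Reported abstract linking-sentence ratings (%): None, Weak, Moderate, Strong.
rct = [9.5, 6.7, 27.6, 56.2]
non_rct = [13.8, 34.2, 33.2, 18.7]
ratios = cumulative_log_odds_ratios(rct, non_rct)  # one value per cut-point
```

A positive value at a cut-point means RCT abstracts were more likely than non-RCT abstracts to be rated at or above that level.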
Overall, 75.2% (n=79/105; 95% CI 66.7, 82.9%) of RCTs in our sample had no action recommendation. Of the 26 that did, 0 were rated as having a causal implication of None, 6.7% (95% CI 2.8, 11.4%) Weak, 27.6% (95% CI 19.4, 36.6%) Moderate, and 56.0% Strong (95% CI 45.7, 64.8%). The log odds of RCTs having a higher ordinal action recommendation strength was -0.398 (95% CI -0.916, 0.009), noting that this is underpowered due to insufficient RCTs with action recommendations to make reasonable inference about differences.
The most common linking word identified in RCT abstracts was “associate” (n=16/105), followed by “reduce” (n=14/105), and “increase” (n=11/105).
Journals and journal policies
As shown in Appendix 7, journals appeared to have very similar rated strengths of causal linking language and action recommendations. Three journals have publicly posted policies regarding causal language. The Journal of the American Medical Association (JAMA) and JAMA Internal Medicine explicitly restrict the use of “causal” language to RCTs, while the American Journal of Epidemiology (AJE) discourages only the use of the word “effect,” giving guidance as to when it should be used. JAMA had the lowest rank of linking language causal strength. Comparing these three journals to the other 15 journals, the log odds of having a higher rank of linking language causal strength was -0.627 (95% CI -0.771, -0.483) for JAMA, -0.083 (95% CI -0.235, 0.069) for JAMA Internal Medicine, and -0.080 (95% CI -0.229, 0.069) for AJE. The difference in the causal language strength in named journals of epidemiology compared to other journals appears to be small, with the log odds of having a higher linking language strength being -0.140 (95% CI -0.447, 0.166).
The only notable difference in action recommendations appears to be in the proportion of articles that report any action recommendation at all, as shown in Appendix 8 and Appendix 9. We find that the log odds of having any action recommendation is -0.624 (95% CI -0.885, -0.363) for JAMA, -0.090 (95% CI -0.351, 0.170) for JAMA Internal Medicine, and -0.806 (95% CI -1.067, -0.546) for AJE. Articles from epidemiology journals together have log odds of having an action recommendation in the abstract of -0.516 (95% CI -0.870, -0.163) compared to those from the other 15 journals.
Indications of potential causal interest
Most studies in our sample provided some indication of potential causal interest, as shown in Figure 6. Only 3.8% (95% CI 2.0, 6.0%) of studies presented formal causal models, but most provided some discussion of the theoretical nature of the causal relationship between exposure and outcome (80.0%; 95% CI 75.2, 85.4%). Among those that did discuss theory, 58.7% (95% CI 51.4, 64.8%) moderately or strongly indicated a theoretical causal relationship between the two. Overall, 24.6% (95% CI 20.9, 28.0%) of studies had a disclaimer statement explicitly discussing causality (e.g., “observational studies cannot establish causality, but…”), and 68.7% (95% CI 63.3, 73.7%) mentioned “confounding” explicitly (i.e., using the word “confound” or variations of it). Finally, the vast majority of studies in our sample controlled or adjusted for several variables, with 35.1% (95% CI 30.5, 39.9%) having 10 or more control variables.
Inter-rater comparisons
The Krippendorff’s alpha comparing primary independent reviewers’ ratings for linking language strength in the abstract was 0.29. Both primary reviewers agreed in 35.1% of cases, were one category different for 41.2%, two categories different in 19.9%, and three categories different in 3.8% of cases. Agreement increases to 0.41 when including the primary and arbitrating reviewers.
For the action recommendations, noting that in the large majority of cases these were rated as being “N/A” for missing, Krippendorff’s alpha was 0.70, where primary reviewers agreed exactly in 67.6% of cases, differed by one in 14.4% of cases, by two in 8.6%, by three in 5.3%, and by four in 4.1%. Similarly, agreement improved to 0.76 when including the arbitrating reviewers.
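The "categories apart" breakdown reported above can be tabulated directly from paired ordinal ratings. The following is a minimal sketch of that tabulation only; it does not compute Krippendorff's alpha. The function name and the example ratings are hypothetical and are not study data, and ratings are assumed to be coded as integers (e.g., 0 = None through 3 = Strong).

```python
from collections import Counter

def category_distance_distribution(ratings_a, ratings_b):
    """Share of items on which two reviewers differ by 0, 1, 2, ... categories."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both reviewers must rate the same items.")
    if not ratings_a:
        raise ValueError("At least one rated item is required.")
    n = len(ratings_a)
    # Absolute difference between the two ordinal ratings for each item.
    counts = Counter(abs(a - b) for a, b in zip(ratings_a, ratings_b))
    return {distance: counts[distance] / n for distance in sorted(counts)}

# Hypothetical ratings for eight abstracts (illustration only).
reviewer_1 = [0, 1, 2, 3, 1, 2, 0, 3]
reviewer_2 = [0, 2, 2, 1, 1, 3, 3, 3]

distribution = category_distance_distribution(reviewer_1, reviewer_2)
for distance, share in distribution.items():
    print(f"{distance} categories apart: {share:.1%}")
```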
Discussion
Our systematic evaluation of the high-profile medical and epidemiological non-RCT literature examining the quantitative relationship between a primary exposure and outcome found that 1) by far the most common word used linking exposures and outcomes was “associate,” 2) reviewers rated over half of linking language in abstracts as having moderately or strongly implied causality, 3) while only about a third of articles issued action recommendations, reviewers rated the vast majority of these as moderately or strongly implying that causality had been inferred, and 4) the rated causal implications of action recommendations tended to be stronger than those of linking sentences. We further found indirect evidence of interest in causal inference, even when not stated explicitly, such as discussion of theoretical causal mechanisms, confounding, and causal disclaimers. Overall, we found a substantial disconnect between the causal implications used in technical linking language and research implications.
Our results suggest that much of the high-profile observational health literature we reviewed is practicing a form of Schrödinger’s causal inference,31 where the studies are in a superposition of not using “causal” words but implying causation in many other respects. While the relative paucity of explicit action recommendations might be seen as appropriate caution, it also leaves readers free, or even encourages them, to read between the lines. When there are no useful and obvious alternative non-causal interpretations, readers may still infer causality. Notably, the RCTs in our sample used similar linking words as non-RCTs. Our word ratings suggest the degree of causal interpretation for common linking words has been impacted by the unavailability of explicitly causal language, such that the meaning of traditionally non-causal words has broadened to include potentially stronger causal interpretations.32 In effect, the rhetorical “just say association” standard has likely resulted in a scenario where many researchers may not fully believe that even the word “association” just means association.
At this time, we do not know the degree to which journal editors, reviewers, authors, or academic community standards contribute to the implicit and explicit rules of causal language. While there are relatively few explicit and public rules governing language at journals, journals may employ formal internal guidelines and unspoken informal norms.
Our measures of causal implication are based on subjective assessments, which is critical to evaluating human interpreted language. Reviewers substantially differed regarding the causal implications of many linking words, even in the presence of extensive guidance, processes, and training for how to assess causal implication in language. Different interpretations may arise from different backgrounds, experiences, and other factors affecting personal interpretations. Outside of this study, we would expect that a more general set of consumers of health research (clinicians, policy-makers, and others) would interpret these words differently, whether by virtue of differing frameworks for assessing language, personal interpretations, or community standards. Notably, heterogeneity in ratings also appears to come from context, such as modifying phrases or other more subtle clues. This may help explain why, for example, we found differences in ratings between “associate” alone in the root word rating exercise compared to in-context ratings of sentences with “associate” in the linking phrase. Aspects of the rating and interpretation process are also likely to be particularly challenging; for example, in reviewer discussions many reported difficulty evaluating the concept of causal implication strength in cases of null findings. Research consumers and decision-makers may have entirely different interpretations and frameworks, consciously or otherwise.
This study was designed with replicability in mind. The review process was designed to balance independent subjective assessments from skilled researchers and practitioners with explicit guidance and discussion among reviewers. Our assessment process is applicable to any number of areas of systematic evidence review and evaluation, which is often limited to shallow “objective” measures. Beyond pre-registration, nearly all parts of this project were fully open and disseminated to the public to view and comment, including documents, data, and code, resulting in a very large number of contributors, comments, and suggestions throughout the process.
Results may not be directly generalizable to other settings, alternative samples, and reviewers. Because our inclusion criteria excluded studies that were examining several potential factors or exposures and their relationships with outcome(s), our sample was likely to exclude many articles searching for “risk factors,” “correlates,” and similar terms that are commonly found in the health literature. Our journal selection also included only the most prominent general medical, public health, and epidemiology journals, and may not be representative of different fields, subfields, journals and policies. We did not examine the strength of evidence, nor did we examine any information that would indicate the appropriateness of claims.
The practice of avoiding causal language linking exposures and outcomes appears to add little if any clarity. Common standards for which words and language are “causal” or when “causal” words are appropriate do not appear to match interpretation. While being careful about what we claim is critical for medical science, being “careful” is often implemented by stripping out causal language in conclusions, and therefore any hint of what question is being answered. Knowing that the association between X and Y is 42 tells us little if we do not know what question that association attempts to answer.33 Further, these practices may weaken methodological accountability, as studies that only indirectly imply causality can be shielded from critique on the grounds of lack of causal inference rigor.4 Rather than policing which words we use to describe relationships between exposures and outcomes, we recommend improved training for researchers, research consumers, and reviewers to better identify and assess causal inference designs and assumptions, and for authors and editors to focus on being clearer about what questions we are asking,34,35 what decisions we are trying to inform, and the degree to which we are and are not able to achieve those goals.
Data Availability
All data and code are publicly available through our OSF repository: https://osf.io/jtdaz, except for files containing personal identifying information and/or personal API keys.
Author roles
Protocol design: NAH, SW, MPF, JMR, OAA, PWGT, EJM, EAS
Study administration: NAH, SW
Data management: NAH
Statistical analysis: NAH
Graphical design: NAH
Study design (piloting): SW, SPi, ER, CL, ALO, RB, SD, MDLRT, TSA, DJD, MS, TM, SPe
Data analysis (screening): NAH, SPi, STL, SJH, AES, PS, AB, MSKD, SD, TRE, DRM, TMA, GMK, AA, JAC, MJK, COP
Data analysis (main review): NAH, SW, MPF, JMR, OAA, PWGT, EAS, SPi, STL, ER, SJH, AES, CL, PS, AB, MSKD, ALO, RB, SD, MDLRT, TRE, DRM, TMA, DJD, GMK, AA, JAC, MJK, MS, COP, TM, AC, JS, AS, TSA, SET, JD, EA, RAH, SKS, SS, NJ, SPe, CA, PK, AERA, NUO, IS
Manuscript writing: NAH, SW, JMR
Manuscript editing: NAH, SW, MPF, JMR, OAA, EJM, PWGT, EAS, SPi, STL, ER, SJH, AES, CL, PS, AB, MSKD, ALO, RB, SD, MDLRT, TRE, DRM, TMA, DJD, GMK, AA, JAC, MJK, MS, COP, TM, AC, JS, AS, TSA, SET, JD, EA, RAH, SKS, SS, NJ, SPe, CA, PK, AERA, NUO, IS
NAH serves as the primary guarantor of all aspects of the study and takes full responsibility for the work.
Competing interests
The authors declare no competing interests.
Transparency statement
The lead author, Noah A. Haber, affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned (and, if relevant, registered) have been explained.
Funding
No funding was granted specifically for the support of this study, and no funders had any role in the collection, analysis, and interpretation of data; in the writing of the report; and in the decision to submit the article for publication.
The researchers were independent from funders, and all authors, external and internal, had full access to all of the data (including statistical reports and tables) in the study and can take responsibility for the integrity of the data and the accuracy of the data analysis.
The Meta-Research Innovation Center at Stanford University is supported by Arnold Ventures LLC (Houston, Texas), formerly the Laura and John Arnold Foundation.
Sophie Pilleron was funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 842817.
Saman Khalatbari-Soltani is supported by the Australian Research Council Centre of Excellence in Population Ageing Research (Project number CE170100005).
Ian Schmid is supported by National Institute of Mental Health grant T32MH122357.
Elizabeth Stuart’s time was supported by National Institute of Mental Health grant R01MH115487 and the Bloomberg American Health Initiative.
Ashley O’Donoghue is funded by a philanthropic gift from Google.org outside of the submitted work.
Onyebuchi A. Arah is supported by National Institute of Biomedical Imaging and Bioengineering grant R01EB027650, National Center for Advancing Translational Sciences UCLA Clinical Translational Science Institute grant UL1TR001881, and a philanthropic gift from the Karen Toffler Charity Trust.
Acknowledgements
This work was supported by many people who made contributions to this work. Turki Althunian contributed to the screening process. Jess Rohmann contributed to the piloting process. This work was additionally supported by comments and contributions from Alyssa Bilinksi, Pascal Goldsetzer, Caroline Blaine, Otto Kalliokoski, Eero Raittio, Tanya Colyer, Tim Watkins, Alexander Breskin, Arindam Basu, Jessica L. Rohmann, Luke A McGuinness, Todd Johnson, Mario Malički, Sebastian Skejø, Scott Graham, Michael Chaiton-Murray, John Edlund, Katelyn Smalley, Danielle Newby, Anita Williams, Cord Phelps, Colleen Derkatch, Alexander Wolthon, Pallavi Rohella, Damien Croteau-Chonka, Steven Goodman, and John Ioannidis.
All errors are the sole responsibility of the authors.
Appendices
Appendix 1: Changes from pre-registered protocol
Major changes:
The primary measure of linking language causal implication strength was changed from indirect measurement through the root words to direct reviewer ratings of the sentences themselves.
In the original protocol, the primary method of generating causal implication for the linking language was through the root word rating system, where those ratings would then be applied back to the studies from which they came. No question was asked regarding the causal implications of the linking sentence in context.
During piloting, we added the question to the review tool which had reviewers directly assess the causal implications of the linking sentences as a whole in order to better rate and review the language in context.
During the primary/independent review, but before the arbitrator review phase, we changed our primary linking language measure from the root word exercise to the direct ratings of the sentences themselves.
This decision was made for three reasons:
This greatly simplified the estimation of the main results, negating the need to back-apply causal language from the root word ratings.
The full sentence context would be a more direct and contextually sensitive assessment of causal language than the root word exercise.
During an interim data quality check of the reviewers’ extracted linking phrases, we found that the extracted data were much more heterogeneous than initially anticipated, lending some doubt whether the original strategy was viable and interpretable.
Journals with very low rates of screening acceptance were retroactively excluded from the list of journals
This decision was made partway through the screening process itself.
Because the protocol specified that we would have the same number of articles accepted from each of the journals, during the screening process we found that screeners would have to work vastly more to meet journal quotas among the journals which had very low rates of screening acceptance.
On June 24, journals which had screening acceptance rates of below 10% or journals in which there were not enough unscreened articles remaining to meet quotas were excluded, and quotas were increased to compensate among the remaining journals.
This decision was made for two primary reasons:
Keeping these journals would have created an infeasible amount of screening required to complete the screening process.
Journals with such low rates of screening acceptance were likely less relevant to meet our stated objectives and journal inclusion criteria.
Minor changes:
We selected 18 journals, rather than the initial expected 20 from the protocol.
The expected number of journals in the protocol (20) was stated in error. We chose to follow the process, rather than aim for a specific number of journals. This initially yielded 24 journals, 6 of which were later removed due to low screening acceptance rates (see above).
The sample size target changed to 1,170 non-RCTs (65 per journal) and 90 RCTs (6 per journal)
The sample size in the protocol was initially stated to be 1,525 accepted articles, with 61 non-RCTs per journal and 6 RCTs.
This was reduced due to lower than expected screening acceptance rates in order to ensure that screening logistics were feasible and that schedules would be met.
The data extraction form received a large number of minor tweaks to the language, phrasing, and guidance.
These changes were made as part of the protocol-specified piloting process.
The root word extraction process was performed on the linking phrases collected from the arbitrator reviews, rather than the primary reviews.
This ensured a cleaner dataset of linking words and phrases from which to extract root linking words
Root words were only included in the root word linking exercise if there were two or more instances of them from the arbitrator reviews, and a light curating process was performed afterwards.
This was performed to clean up the highly heterogeneous extracted linking words and phrases.
The root word rating exercise was changed to being performed after the arbitrator reviews.
In the original protocol, the root word exercise occurred during the arbitrator reviews.
This change was made in order to accommodate extracting the root words from the arbitrator-extracted linking phrases.
The reviewers were assigned to review all of the words in the root word list
The original protocol specified that the reviewers would only review 20 randomly selected root words
This was performed in order to maximize the power of our sample.
Spearman’s correlation coefficients were added to directly examine the correlation of rankings between ordinal categories
This was not originally specified in the protocol due to oversight, and was added later.
The population weighted tertiary analysis was removed
This was omitted due to lack of clear value of targeting an alternative “population” of studies, to simplify the breadth of analyses, and due to lack of space
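The root word inclusion rule described in the minor changes above (two or more instances among the arbitrator-extracted linking phrases) can be sketched as a simple frequency filter. This is a minimal illustration, not the study's code: the function name, the lowercasing and whitespace trimming, and the example words are all assumptions, and the subsequent light curating step is not represented.

```python
from collections import Counter

def eligible_root_words(extracted_roots, min_instances=2):
    """Return root words appearing at least `min_instances` times.

    Assumption: roots are compared after trimming whitespace and lowercasing.
    """
    normalized = (root.strip().lower() for root in extracted_roots)
    counts = Counter(root for root in normalized if root)
    return sorted(root for root, n in counts.items() if n >= min_instances)

# Hypothetical extracted root words (illustration only, not study data).
example_roots = ["associate", "Associate", "increase", "cause", "increase ", "predict"]

kept = eligible_root_words(example_roots)
print(kept)
```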
Appendix 2: Search terms
Our search was performed and pulled from PubMed to extract title, abstract, MeSH keywords, and citation data, using the following terms:
((<year>[PDAT]) AND (<journal ISSN>[Journal])
AND
(Humans[mesh] AND “Journal Article”[PT] AND English [la] AND hasabstract))
NOT
((“Meta-Analysis”[Publication Type] OR “Review”[Publication Type] OR “Case Reports”[Publication Type] OR “Editorial”[Publication Type] OR “Letter”[Publication Type]))
Where <year> is each year from 2010 to 2019, and <journal ISSN> is the ISSN of the journal in question. The above search was performed for every year/journal combination and the results were combined.
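The year/journal expansion described above can be sketched as follows. This is a minimal illustration that only assembles the query strings and does not contact PubMed. The function names are hypothetical, the ISSNs are placeholders rather than the journals in the study, and straight quotes are used in place of the typographic quotes shown in the search terms.

```python
from itertools import product

# Filters applied to every query, following the search terms listed above.
BASE_FILTERS = '(Humans[mesh] AND "Journal Article"[PT] AND English[la] AND hasabstract)'
EXCLUDED_TYPES = ["Meta-Analysis", "Review", "Case Reports", "Editorial", "Letter"]

def build_query(year, issn):
    """Assemble the search string for one year/journal combination."""
    excluded = " OR ".join(f'"{t}"[Publication Type]' for t in EXCLUDED_TYPES)
    return f"(({year}[PDAT]) AND ({issn}[Journal]) AND {BASE_FILTERS}) NOT ({excluded})"

def build_all_queries(issns, years=range(2010, 2020)):
    """One query per year/journal pair, keyed by (year, issn)."""
    return {(year, issn): build_query(year, issn) for year, issn in product(years, issns)}

# Placeholder ISSNs (hypothetical; not the journals used in the study).
placeholder_issns = ["0000-0000", "1111-1111"]

queries = build_all_queries(placeholder_issns)
print(len(queries))
print(queries[(2010, "0000-0000")])
```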
Appendix 3: Definitions and frameworks
Exposure
For this project, “Exposure” refers to the independent variable of interest (in a regression sense) or the primary or antecedent variable being investigated for a possible (non-)causal link to the study outcome, or resulting or end-point variable. It may be labelled by terms such as treatment, factor, risk factor, protective factor, determinant, intervention, correlate, predictor, agent, cause, causative agent, or other terms.
Outcome
For this project, “Outcome” refers to the dependent or effect variable of interest that is being investigated for a possible link to the exposure (surrogate measures or clinical events). It is typically a post-exposure variable i.e. assumed or known to be preceded by the exposure. It is sometimes called the study endpoint variable, consequence, result, or other terms.
Linking word/phrase
A linking word/phrase describes the nature of the connection between some defined exposure and some defined outcome, generally used in a sentence containing both exposure and outcome. This can describe the type of relationship (e.g. “associated with”) and/or differences in levels (e.g. “had higher”) that may or may not be causal in nature. For our purposes, the phrase may contain 1-3 words, where one of the words is a preposition to link the exposure and outcome. Some examples may include constructions such as “associated with,” “effect of,” “increased,” “was higher than,” “correlated with,” “caused,” “harms,” “predicts,” “risk factor for,” “determined,” “impacts,” “decreased,” “linked to,” etc.
Modifying word/phrase
A modifying word/phrase is a word or phrase that modifies the linking word/phrase describing the nature of the relationship between the exposure and outcome. This includes adding signals of direction, strength, doubt, negation, and statistical properties to the relationship. This may include phrases like “may be,” “positively,” “strongly”, “potentially”, “is likely to,” “does/is not,” “statistically significant,” etc.
Causal language
Causal language implies that one entity influences (or does not influence) another. We define language as being causal if that language implies that movement (or lack thereof) in the outcome was either 1) impelled by the exposure of interest (i.e. a change in the exposure drives or does not drive a change in the outcome, e.g., increase, decrease, improve, change), or 2) implies attribution of the outcome to the exposure (i.e. assigns the responsibility for the change or lack of change in the outcome to the exposure, e.g. “due to,” “since,” “attributable to”).
Action recommendation
This is a description of how a consumer of the research in question might utilize the results and conclusions of the research. This may include recommending that some actor consider changes (or no changes) in some set of procedures and actions. Action recommendations concern what to do with the research. For our purposes, we do not count calls for additional research as action recommendations.
Causal implication of recommendations
Recommendations may often imply a causal interpretation of a finding. For example, authors may suggest that it could be beneficial to change the amount of an exposure, which rests on the assumption that the exposure has a causal effect on the outcome. As a variation, it may also be suggested that an exposure need not be changed, which rests on the assumption that the absence of a causal effect has been established.
Appendix 4: MeSH disease areas
Appendix 5: Causal strength over time
Chart is generated through LOESS smoothing the proportions in each category over time.
Appendix 6: Modifying phrases
Appendix 7: Causal strength of linking sentences in abstract, by journal
Appendix 8: Causal strength action recommendations in abstract, by journal
Appendix 9: Causal strength action recommendations in abstract including NAs, by journal