Abstract
Background Decisions about health care, such as the effectiveness of new treatments for disease, are regularly made based on evidence from published work. However, poor reporting of statistical methods and results is endemic across health research and risks ineffective or harmful treatments being used in clinical practice. Statistical modelling choices often greatly influence the results. Authors do not always provide enough information to evaluate and repeat their methods, making interpreting results difficult. Our research is designed to understand current reporting practices and inform efforts to educate researchers.
Methods Reporting practices for linear regression were assessed in 95 randomly sampled published papers in the health field from PLOS ONE in 2019, which were randomly allocated to statisticians for post-publication review. The prevalence of reporting practices is described using frequencies, percentages, and Wilson 95% confidence intervals.
Results While 92% of authors reported p-values and 81% reported regression coefficients, only 58% of papers reported a measure of uncertainty, such as confidence intervals or standard errors. Sixty-nine percent of authors did not discuss the scientific importance of estimates, and only 23% directly interpreted the size of coefficients.
Conclusion Our results indicate that statistical methods and results were often poorly reported without sufficient detail to reproduce them. To improve statistical quality and direct health funding to effective treatments, we recommend that statisticians be involved in the research cycle, from study design to post-peer review. The research environment is an ecosystem, and future interventions addressing poor statistical quality should consider the interactions between the individuals, organisations and policy environments. Practical recommendations include journals producing templates with standardised reporting and using interactive checklists to improve reporting practices. Investments in research maintenance and quality control are required to assess and implement these recommendations to improve the quality of health research.
Introduction
Health systems are generally complex and often comprise a network of interrelated variables. Statistical methods can help untangle and understand these relationships, allowing the quantification and estimation of the effects of diseases and treatments. When researchers analyse their data, they face many decisions, including which variables to explore, what statistical test to perform, and whether data should be excluded or transformed [1]. It has been increasingly recognised that the transparency of the decisions made through this process plays an essential role in interpreting results [2].
Evidence suggests that poor statistical quality amongst researchers is endemic, with an estimated 85% of medical research avoidably wasted through poor study design, analysis, reporting quality and the low frequency of publication of non-significant results [3]. This shocking figure can be attributed to several sources: an estimated 50% of health research is not published [4], and when studies are reported, they are often poorly designed, inappropriately analysed, and selectively reported, with benefits often exaggerated [5]. While other contributing factors are discussed, such as the publish or perish culture within universities [6] and questionable research practices [7], it is widely acknowledged that lack of statistical training contributes to all aspects of poor reporting [8].
At the centre of the research waste problem is the quality of statistical reporting and the rising importance of p-values. The widespread misuse and misunderstanding of p-values have been reported for decades [9, 10], with many researchers mindlessly applying significance rules without understanding the size or importance of the studied effect [11]. King et al. [12] suggest that problems with the selection and interpretation of statistical methods are driven by researchers’ reliance on statistical rules of thumb and justification of traditional methods that are popular in the field, even if they are inappropriate. Stark and Saltelli [13] suggest many researchers are guilty of “cargo cult” thinking and go through the process of fitting models, calculating p-values and invoking statistical terms with little understanding of the methods involved.
Reporting guidelines have been created to help address poor reporting and increase transparency and reproducibility in health research. While many research guidelines exist, very few provide detailed advice on reporting and interpreting data analysis [14]. Examples of statistical guidelines for authors include the Statistical Analyses and Methods in the Published Literature (SAMPL) [15], Strengthening Analytical Thinking for Observational Studies (STRATOS) [16] and Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) [17]. The SAMPL was created by Lang et al. [15] and includes reporting guidelines for common statistical methods, including linear regression. Many authors have recommended the SAMPL guidelines [14, 2], but only a small number of studies actually cite SAMPL for either individual use [18, 19] or for reporting of quality in reviews [20, 21], suggesting they are not widely used, highlighting the need to promote statistical guidelines to increase awareness and use among health professionals [22].
There have been robust conversations within the statistical community on how to improve the quality of statistical reporting, much of which focuses on p-values and their interpretation, with calls to either remove p-values entirely and focus on confidence and prediction intervals or use alternative methods such as likelihood ratios or Bayes Factors [10]. While there are many commentaries on improving statistical quality, only a few studies directly assess authors’ current practice and statistical understanding [23]. This study aims to better understand where statistical reporting can be improved and to inform efforts to educate researchers. We selected regression analysis because it is a widely used method that can provide valuable insights when correctly applied. In a related paper, we examined common misconceptions in the assumptions of linear regression [24]. Building on the previous paper, this research highlights the most common issues health researchers face when interpreting regression analyses and makes recommendations for improving practice.
Materials and Methods
This cross-sectional study was designed to understand the prevalence of statistical reporting behaviours for authors using linear regression, including understanding modelling choices and how often statistics were reported, such as coefficients, confidence intervals, and p-values.
Research question
How do authors using linear regression report their model and results?
Sample size
The primary aim of this study was not hypothesis testing but to gain a descriptive understanding of current statistical reporting practices in published manuscripts with a focus on regression assumptions. The study was powered with a 5% margin of error to detect a sample proportion of 0.05 (5%) using a two-sided 95% confidence interval, calculated using exact Clopper–Pearson confidence intervals, using PASS [25]. This sample size was also deemed adequate to understand the prevalence of general statistical reporting behaviours. We estimated 40 statisticians were required to rate the 100 papers, with each paper reviewed twice (40 statisticians × 5 papers = 200 reviews), and five papers per reviewer were thought to be reasonable from our experience and initial feedback.
Study ethics, statistician recruitment, and consent
This study was granted Negligible-Low Risk Ethics from the Queensland University of Technology (QUT) Human Research Ethics Committee with approval number 2000000458. Statisticians were recruited through professional societies in Australia and internationally, such as the Statistical Society of Australia, universities, and other relevant organisations using targeted emails, LinkedIn, and Twitter. The inclusion criterion was previous or current employment as a statistician, data analyst, or data scientist. Study information, including the study protocol, participant information sheet, and the study questions, was available on GitHub and emailed to participants [26], who were also sent online links to the five PLOS ONE papers to be reviewed. Each statistician provided informed written consent by filling out and returning the consent form, which asked if participants would like to be acknowledged in the paper. Study recruitment started in September 2020 and finished in June 2021, with the last participant completing their reviews in September 2021. The median time to completion was four weeks; statisticians were recruited until all 200 reviews were complete. In total, 46 statisticians were recruited, of whom five withdrew due to changed circumstances, and one participant had difficulty completing the online form and was replaced.
Randomisation
One hundred research papers (excluding editorials and other non-research papers) were randomly selected from PLOS ONE in 2019. Papers were selected if they had “health” anywhere in the subject area and used the term ‘linear regression’ in the materials and methods section by using the “searchplos” function within the “rplos” package in R [27]. Papers that met the inclusion criteria were randomly ordered, and the first eligible 100 were selected. To capture the broader use of regression from the population of health researchers, we chose to focus on standard linear regression; papers were excluded if the regression included cluster or random effects or used alternative methods such as Bayesian, non-parametric, or where the linear regressions were not part of the paper’s primary analyses, e.g., related to pre-processing such as calibrating a reference sample.
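The selection step described above (randomly order the search results, then keep the first 100 that pass the eligibility check) can be sketched as follows. This is an illustration in Python rather than the R code used in the study; the paper identifiers, the seed, and the `toy_eligible` rule are hypothetical stand-ins, since eligibility in the study was judged by reading each paper.

```python
import random


def select_papers(candidate_ids, is_eligible, target=100, seed=2019):
    """Randomly order the candidates and keep the first `target` eligible ones."""
    rng = random.Random(seed)      # fixed seed so the random order can be reproduced
    shuffled = list(candidate_ids)
    rng.shuffle(shuffled)
    selected = []
    for paper_id in shuffled:
        if is_eligible(paper_id):  # stands in for the manual eligibility review
            selected.append(paper_id)
        if len(selected) == target:
            break
    return selected


# Hypothetical identifiers standing in for the 1005 search results.
candidates = [f"paper-{i:04d}" for i in range(1, 1006)]


def toy_eligible(paper_id):
    """Hypothetical rule: pretend every third paper fails the inclusion criteria."""
    return int(paper_id.split("-")[1]) % 3 != 0


selected = select_papers(candidates, toy_eligible)
```

Recording the seed alongside the search date is one way to let others reproduce the sampling frame and the order in which papers were screened.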
Calculating the prevalence of statistical reporting behaviours
Two volunteer statisticians rated each paper, and the primary author, LJ, also independently provided a third statistical rating. The study was initially designed for the prevalence to be calculated using the two ratings with the primary author adjudicating differences; however, due to the length and complexity of papers, it was decided by the authorship team to use all three ratings to improve the accuracy of results. The reliability of ratings from the two independent statisticians was calculated. Then, each set was compared to the final prevalence to assess the impact of the change to the protocol. Disagreements between the three ratings were documented by reading and commenting on the PDF of papers and recording each disagreement. Finally, the results were also cross-checked for consistency; for example, if a paper was identified as having only univariate models (i.e. single explanatory variable), it did not require checks for collinearity. The paper was then checked, and prevalence was updated accordingly.
Data analysis
The purpose of this study is a descriptive analysis of statistical reporting behaviours, which were described using frequencies, percentages and 95% Wilson confidence intervals to account for percentages close to zero. The reliability of statistical ratings was described using observed agreement and analysed with Gwet’s statistics [28]. The assumptions in Gwet’s analysis do not require testing but instead relate to the interpretation and generalisability of results. In this case, papers were randomly sampled and randomly allocated to statisticians; no weighting was applied as variables were either binary or nominal. The STROBE guideline for reporting cross-sectional studies was used [29]. R version 4.3.2 [30] was used for all statistical analyses.
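As an illustration of why the Wilson interval suits percentages close to zero, the sketch below implements the standard Wilson score formula in Python (the study itself used R). The function name is ours, and the first example uses the 13 of 64 multivariable papers that checked collinearity, a count reported in the results; the second shows that the interval stays within [0, 1] and has a non-zero upper limit even when no events are observed.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion (z = 1.959964 gives 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 13 of 64 multivariable papers checked collinearity (count taken from the results).
lower, upper = wilson_interval(13, 64)

# With zero events the lower limit is zero and the upper limit is still informative.
zero_lower, zero_upper = wilson_interval(0, 95)
```

A simple Wald interval would collapse to a single point at zero in the second case, which is the behaviour the Wilson interval avoids.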
Linear Regression
This section provides some technical background on linear regression, a method widely used in research, for readers who may be unfamiliar with it. It also provides context on what researchers need to report when applying this method in their studies.
Simple linear regression is a statistical method that can be used to understand the relationship between two continuous variables, for example, age and blood pressure. A linear relationship is assumed between the dependent variable (often denoted Y) and the model parameters associated with the explanatory variable. The explanatory variables are usually denoted by X, as shown in Fig 1 and described by Equation 1. This can be readily expanded to “multiple” regression, which allows for multiple independent variables (k explanatory variables and as many parameters) in the model (Equation 2). This enables us to estimate the model parameters for one variable while taking into account the effect that other explanatory variables can have on this relationship. It also allows the exploration of more complex relationships, such as interactions between explanatory variables. Linear regression can also be used to model categorical X variables. In fact, t-tests, ANOVA and linear regression are special cases of the General Linear Model (GLM), where X variables can be either continuous or categorical.
Equation (1) gives the mathematical form of linear regression. The index “i” is for each observation in the data, of which there are N in total. $\beta_1$ is the slope; in our example, it represents the average change in blood pressure with a one-unit change in age. The term $\beta_0$ is the Y-intercept, which in our example is the blood pressure value when age equals zero. Finally, $\epsilon_i$ is the “error” or “residual” term, which is the part of $Y_i$ that cannot be accounted for by the available information, i.e. by $\beta_0 + \beta_1 X_{1i}$ for each observation.
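The displayed equations are not reproduced in this text. Under the definitions given above, Equations 1 and 2 would take the following standard form; this is a reconstruction from the surrounding description rather than a copy of the published typesetting.

```latex
\begin{align}
Y_i &= \beta_0 + \beta_1 X_{1i} + \epsilon_i, \qquad i = 1, \dots, N \tag{1} \\
Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \epsilon_i, \qquad i = 1, \dots, N \tag{2}
\end{align}
```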
Regression coefficients and R2
In a simple linear regression model, the regression coefficient (b) represents the average change in the dependent variable (Y) for every unit increase in the explanatory variable [31]. A common problem when interpreting regression coefficients occurs when continuous variables are on a very large or small scale, and it becomes difficult to interpret clinically meaningful change; an easy way to improve interpretation is to scale the variable appropriately. For example, weight in grams can be divided by 1000 so that the coefficient is interpreted as the average change in Y with a one-unit change in kilograms. Regression coefficients can also be “standardised” so that variables measured on different scales can be compared, with coefficients interpreted in terms of standard deviations. A variable is standardised by subtracting its mean and dividing by its standard deviation, which gives it a mean of zero and a standard deviation of one without changing the shape of its distribution. In the univariate case (i.e. a single X), the covariance between standardised Y and standardised X, and therefore the standardised regression coefficient, equals the correlation coefficient of the two variables. The square of this correlation coefficient is the R-squared, which measures the proportion of variance in Y explained by X [31].
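The relationships in this paragraph can be checked numerically. The sketch below is a minimal illustration in plain Python with made-up data (the weights and outcome values are not from any study): rescaling grams to kilograms multiplies the slope by 1000, the slope between standardised variables equals the correlation coefficient, and in the single-predictor case R-squared equals the squared correlation.

```python
import math


def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Least-squares slope of y on a single explanatory variable x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standardise(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = mean(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]


# Made-up illustrative data: weight in grams and an arbitrary continuous outcome.
weight_g = [2500, 3100, 3400, 2900, 3800, 4100]
outcome = [10.1, 11.8, 12.9, 11.2, 13.7, 14.9]

slope_per_gram = ols_slope(weight_g, outcome)
weight_kg = [w / 1000 for w in weight_g]
slope_per_kg = ols_slope(weight_kg, outcome)   # 1000 times the per-gram slope

r = pearson_r(weight_g, outcome)
slope_standardised = ols_slope(standardise(weight_g), standardise(outcome))

# R-squared from the residual and total sums of squares of the fitted line.
intercept = mean(outcome) - slope_per_gram * mean(weight_g)
ss_res = sum((y - (intercept + slope_per_gram * x)) ** 2
             for x, y in zip(weight_g, outcome))
ss_tot = sum((y - mean(outcome)) ** 2 for y in outcome)
r_squared = 1 - ss_res / ss_tot
```

Rescaling changes only the units of the coefficient; the fitted values, p-value and R-squared are unaffected.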
While these relationships between correlation and regression exist, researchers may not appreciate that they become complicated when there is more than one variable in the model and are calculated and interpreted differently. For example, the standardised regression coefficients in a multiple linear regression represent the unique contribution of each independent variable for the prediction of the dependent variable after accounting for the effects of all other variables in the model [31]. R2, known as the coefficient of determination, in this case, represents the proportion of the variance in the dependent variable explained by all the explanatory variables. To account for variance explained by chance (i.e. spurious correlation) when multiple explanatory variables are in the model, an “adjusted” R2 is used. While R2 can be used as an effect size as, in general, a higher R2 value indicates a stronger relationship between the dependent and independent variables, it has limitations as it does not provide information on practical significance and considers all variables in the model, rather than specific variables of interest. Therefore, it is recommended that both regression coefficients and R2 are reported.
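The usual adjustment can be written as a one-line function. The sketch below uses the common formula, adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), with hypothetical values for R2, the sample size n and the number of explanatory variables k; it shows how the same R2 is penalised more heavily as explanatory variables are added.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept and `n_predictors` variables."""
    residual_df = n_obs - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r_squared) * (n_obs - 1) / residual_df


# Hypothetical values: the same R-squared of 0.30 from 50 observations.
adj_few = adjusted_r_squared(0.30, n_obs=50, n_predictors=2)
adj_many = adjusted_r_squared(0.30, n_obs=50, n_predictors=15)
```

With 15 explanatory variables and only 50 observations the adjusted value falls to about zero, which signals that the apparent fit could be produced by chance alone.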
P-values, confidence intervals, and scientific importance
P-values are the most frequently used measure of statistical evidence across all fields of science and research [32]. Despite the frequency of use, understanding p-values is elusive to most users, with widespread misuse being well-documented since the 1940s [33, 32]. When conducting a hypothesis test, a test’s significance level (alpha) is chosen to determine the acceptable type I error (falsely rejecting the null hypothesis). The p-value is the probability of obtaining a result at least as extreme as the observed result, assuming the null hypothesis is true. P-values were introduced in the 1920s by Ronald Fisher and were not meant to be a conclusive test but, instead, a way of determining whether a result deserved further investigation [34]. Unfortunately, in practice, researchers often use this continuous measure as a threshold, creating a dichotomy of results declared either statistically significant (p < 0.05) or not [35], regardless of practical importance. Despite much work in this area, errors in the logic of p-values remain prolific in the literature [36, 37]. To avoid over-interpretation of p-values, it is recommended that the smallest clinical improvement considered consequential to the patient [38] be identified before undertaking the study. In practice, this value may not be known for many exploratory studies, but such studies should still consider practical significance.
When translating research from the lab into clinical practice, researchers should be cautious about making important clinical changes based on the results of one study, as sample-to-sample variation will likely change the estimated effect [39]. When other scientists replicate the research, a range of coefficients, confidence intervals, and p-values is obtained. Usually, researchers do not have access to these replications and must make the best decisions based on the available information [39]. While the width of a confidence interval indicates how much the interval may bounce around when an experiment is replicated, p-values fluctuate widely and are less useful for understanding whether results will be replicated in future experiments [39]. Therefore, it is recommended that confidence intervals are reported with p-values.
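A small simulation can illustrate this point. The sketch below is a simplified illustration, not an analysis from the study: it repeats the same hypothetical experiment 20 times, assuming a true mean effect of 0.4, a known standard deviation of 1 and 30 observations per experiment (all assumed values), and uses a z-test so that only the Python standard library is needed. In this simplified setting every replication has the same interval width, while the p-values differ from one replication to the next.

```python
import math
import random
from statistics import NormalDist, mean


def replicate_once(rng, true_mean=0.4, sd=1.0, n=30):
    """Simulate one experiment; return the estimate, its 95% CI and the p-value."""
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    estimate = mean(sample)
    se = sd / math.sqrt(n)                  # standard deviation treated as known
    z = estimate / se                       # test of the null hypothesis mean = 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = 1.959964 * se
    return estimate, (estimate - half_width, estimate + half_width), p_value


rng = random.Random(1)                      # fixed seed for reproducibility
replications = [replicate_once(rng) for _ in range(20)]

p_values = [p for _, _, p in replications]
ci_widths = [hi - lo for _, (lo, hi), _ in replications]

for estimate, (lo, hi), p in replications:
    print(f"estimate {estimate:5.2f}  95% CI ({lo:5.2f}, {hi:5.2f})  p = {p:.4f}")
```

In real analyses the standard deviation is estimated, so interval widths also vary somewhat between replications, but the interval from a single study still conveys the plausible range of the effect in a way a single p-value does not.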
Data transformation
Data transformation is used across the health area when data are skewed or do not fit a normal distribution, which is the distribution assumed for the residuals of linear regression. Data transformation is one tool in the statistical toolbox, and while it is helpful in certain situations, it should be used cautiously. Logarithmic transformations have been used as a cure-all for assumption violations; for a detailed explanation of regression assumptions and outliers, see Jones et al. [24]. When one or both variables have been log-transformed, the interpretation of regression coefficients changes from a unit change to a percent change. Means and 95% confidence intervals of groups can also be back transformed (geometric mean) [40]. When the data fits the underlying transformed distribution, and the residuals of the linear model are normally distributed, the interpretation of results may be improved. However, when there is a lack of fit of the transformed variable, transformations can cause more problems than they fix, as they tend to reduce the variance [41]. Once transformed, interpretation becomes more complex and may distort relationships between variables, and researchers should consider using alternative statistical methods that are more appropriate for their data [41]; for example, a gamma distribution can be used for heavily skewed continuous data.
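The change in interpretation after a log transformation can be made concrete. The sketch below assumes the dependent variable was transformed with the natural logarithm and uses a hypothetical coefficient of 0.05 and made-up skewed values; it converts the coefficient to a percent change and back-transforms a mean on the log scale to a geometric mean.

```python
import math


def percent_change_from_log_outcome(coefficient):
    """Percent change in Y per one-unit increase in X when ln(Y) is the outcome."""
    return (math.exp(coefficient) - 1) * 100


def geometric_mean(values):
    """Back-transformed mean of the natural logs of positive values."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


# Hypothetical coefficient from a model of ln(Y) on X.
percent_change = percent_change_from_log_outcome(0.05)

# Made-up right-skewed data: the geometric mean sits well below the arithmetic mean.
skewed = [1.0, 2.0, 4.0, 8.0, 64.0]
geo_mean = geometric_mean(skewed)
arith_mean = sum(skewed) / len(skewed)
```

For small coefficients the percent change is close to 100 times the coefficient, but the two diverge as the coefficient grows, so the exponentiated form is the safer one to report.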
Modelling
Broadly speaking, there are two approaches to model building: statistical and epidemiological. Statistical approaches include algorithmic methods such as stepwise modelling or regularisation methods that reduce the risk of overfitting [42]. Epidemiological approaches include choosing the final model based on previous literature or known disease pathways, regardless of statistical significance. These approaches may result in different final models, p-values, and parameter estimates [2].
The choice of model for data should be based on the study design and the research question. For example, in a Randomised Controlled Trial (RCT), where participants are successfully randomised into two groups, the only systematic difference between the groups is the study intervention. In practice, while RCTs can be analysed with a simple t-test, there are often adjustments for stratification and other pre-specified variables, all of which should be detailed in the study protocol. In comparison, observational studies are often complex and may have differing purposes depending on the research question. Relationships in observational studies are more difficult to directly measure due to confounding variables, which may distort relationships [43]. If not adequately accounted for, confounding variables may hide the true association between dependent and independent variables, leading to biased estimates and inflation of the variance [44], which will affect subsequent interpretation.
Model selection methods use different approaches to identify the best subset of variables that predict the dependent variable [45]. The most common statistical modelling approach is stepwise selection, which includes backward selection, forward selection, and a combination of the two [31]. These approaches iteratively fit models by adding or removing variables based on predefined criteria [45]. However, these methods have been criticised for producing overfitted models that describe the sample well but are less generalisable to the target population [45]. Regularisation methods such as Lasso (Least Absolute Shrinkage and Selection Operator) and Ridge regression are another approach. These methods penalise model complexity by introducing a penalty parameter that shrinks the coefficients of variables with minor contributions towards zero [46]. These methods can deal with highly correlated independent variables, with Lasso allowing model selection by shrinking some model parameters to exactly zero [46]. All modelling choices should match the study’s objective and be pre-planned in a study protocol to allow transparency and avoid p-hacking [47, 16].
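The difference between the two penalties is easiest to see in a special case. The sketch below assumes orthonormal explanatory variables, with the Lasso objective written as half the residual sum of squares plus λ times the sum of absolute coefficients and the Ridge objective as half the residual sum of squares plus λ/2 times the sum of squared coefficients. Under these assumptions, and only these, each penalised coefficient has a closed form in terms of the ordinary least squares estimate. The coefficients and λ are hypothetical, and real software fits the general case iteratively.

```python
def ridge_shrink(b_ols, lam):
    """Ridge: proportional shrinkage, never exactly zero for a non-zero estimate."""
    return b_ols / (1 + lam)


def lasso_shrink(b_ols, lam):
    """Lasso: soft-thresholding, exactly zero once |b_ols| is at most lam."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if b_ols > 0 else -magnitude


# Hypothetical least-squares coefficients and penalty.
ols_coefficients = [2.0, -0.8, 0.3, -0.1]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
lasso = [lasso_shrink(b, lam) for b in ols_coefficients]
```

Here the two smallest coefficients are removed by Lasso but only reduced by Ridge, which is why Lasso can act as a variable selection method while Ridge cannot.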
Multicollinearity
Regression models should be assessed for multicollinearity when multiple independent variables are included. Multicollinearity occurs when high correlations exist between two or more independent variables [31] that explain the same variance in the dependent variable, making it challenging to separate the importance of individual variables. It can lead to unstable coefficients and increased type II errors (incorrectly concluding that the variable is not statistically significant), with the standard errors and confidence intervals becoming inflated [48]. Diagnosis of multicollinearity includes consideration of the Variance Inflation Factor (VIF) and pairwise correlations and examining changes in the standard error of models [49]. VIF measures how much the variance of the estimated parameter is increased due to collinearity, with a rule of thumb that values of 10 or above indicate a problem [31]. However, Zuur et al. [49] suggest that values as low as three can also indicate problematic collinearity. Treatment for multicollinearity may include using alternative methods such as regularisation or dropping one of the variables found to be highly correlated [48]. Deciding which variables should be removed from the model can be done in several ways, including dropping variables with the highest VIF or, preferably, using clinical understanding to keep the most important predictors in the model.
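The VIF for a variable is 1/(1 − R2), where R2 comes from regressing that variable on the other explanatory variables. With only two explanatory variables this R2 is the squared correlation between them, which allows a short illustration in plain Python. The height and weight values below are made up, and a model with more than two explanatory variables would need the full auxiliary regression for each variable.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both variables when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Made-up, strongly correlated explanatory variables.
height_cm = [150, 160, 165, 170, 180, 185]
weight_kg = [52, 58, 66, 68, 80, 83]
vif_correlated = vif_two_predictors(height_cm, weight_kg)

# Made-up uncorrelated explanatory variables give the minimum VIF of 1.
vif_uncorrelated = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
```

In this made-up example the VIF is far above both rules of thumb, so the two coefficients could not be interpreted separately with any confidence.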
Results and discussion
In 2019, 16,318 papers were published in PLOS ONE; of these, 1,005 (6%) mentioned linear regression in the methods section with health in the subject area. Papers were randomly selected and reviewed against the inclusion criteria until we had 100 papers that used linear regression. While reviewing a paper, the statisticians could exclude it by indicating that no linear regression results were reported; this option was provided to reduce the risk of excluding papers with poor reporting. Ten papers were identified as potentially having no linear regression results; the author team reviewed these papers and excluded five. Therefore, 95 papers were included in the final analysis (Figure 2). For papers with at least one result presented but rated by one statistician not to contain linear regression results, the missing review was replaced by the primary author’s results for the reliability analysis.
The majority of the included studies were observational (n=80, 84%), with 15 experimental studies. Human participants were primarily used in 77% (73) of studies, 13% (12) used animals, 3% (3) used other studies, and the remaining seven studies had either a combination of human, animal, or plants, or were conducted in a lab.
Over half (55%) of statisticians indicated their highest statistical or mathematical education was a PhD, with 28% having a Master’s level qualification. Ten percent had honours or bachelor’s degrees. One statistician had a diploma, while two others had no formal statistical education. Statisticians were experienced, with 25% having 5-9 years of experience, 30% with 10-19 years, and 23% having 20+ years of experience. For further information on statistician demographics, see Jones et al. [24].
Prevalence of statistical reporting behaviours
Most authors (92%) used p-values to describe their results, with 17 authors mostly reporting p-values categorically. In comparison, 81% of authors reported regression coefficients, with less than half reporting either confidence intervals (46%) or standard errors (20%). The total number of observations was only clear in 47% of papers (Table 1). Thirteen of the 30 studies that transformed their data did not provide any reasoning. Several authors demonstrated poor reporting practices, interchangeably using correlation coefficients (r), the coefficient of determination (R2), and regression coefficients, often without interpretation. Nineteen percent of researchers were unclear about their modelling choice, and 14% used a recognised modelling approach, of which less than half provided sufficient detail to be reproduced (Table 2).
Agreement of statistical raters
There was high agreement between the two statistical ratings for reporting behaviours, including coefficients, confidence intervals, test statistics, degrees of freedom and measures of uncertainty (Table 3; for full reporting, see S1 Table). However, lower but moderate agreement (Gwet 0.40 to 0.59) was observed for questions requiring interpretation by raters, including the size and direction of parameter estimates, p-values, collinearity, and transformation. Relatively poor agreement was observed for some outcomes, including the importance of parameters, whether variables were scaled appropriately, and any process for selecting the variables included in the final model. Reasons for disagreement included differences in whether raters considered what was described in both the text and the tables. Raters were sometimes split between unclear and not required. Some disagreements between ratings were also due to the authors’ methods sections, which were often unclear. In general, the two statistical raters had higher agreement with the final prevalence score (which took into account the third rating conducted by the primary author) than they did with each other, indicating that while there was variability, the overall prevalence was reflective of raters.
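For readers unfamiliar with Gwet’s agreement coefficient, the sketch below computes the unweighted first-order coefficient (AC1) for two raters in plain Python. We assume AC1 is the relevant member of Gwet’s family of statistics because the ratings were binary or nominal and unweighted; the ten yes/no ratings are made up for illustration and are not study data.

```python
def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's AC1 for two raters rating the same items on nominal categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    chance = 0.0
    for category in categories:
        # Average share of ratings in this category across the two raters.
        share = (ratings_a.count(category) + ratings_b.count(category)) / (2 * n)
        chance += share * (1 - share)
    chance /= len(categories) - 1
    return (observed - chance) / (1 - chance)


# Made-up ratings of ten papers: did the paper report confidence intervals?
rater_a = ["yes"] * 8 + ["no", "no"]
rater_b = ["yes"] * 7 + ["no", "no", "yes"]

ac1 = gwet_ac1(rater_a, rater_b, categories=["yes", "no"])
perfect = gwet_ac1(rater_a, rater_a, categories=["yes", "no"])
```

In this example the observed agreement is 0.80 and the chance agreement is 0.32, so AC1 is (0.80 − 0.32)/(1 − 0.32), about 0.71.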
Comparison of results to other research
Most authors of the papers included in the current study (81%) reported regression coefficients; however, less than half reported confidence intervals (46%) or standard errors (20%), with 58% of papers reporting some measure of uncertainty. Eighteen percent of authors mostly reported p-values categorically, ignoring the widely discussed guidelines on p-values from the American Statistical Association [10]. Poor quality of reporting in published work can be seen across all health research fields [32, 50]. A study by Strasak et al. [51] compared two prominent medical journals, Nature Medicine and The New England Journal of Medicine, in 2004 for common statistical errors. In a subset of 53 papers across the two journals, 45% of The New England Journal of Medicine papers did not report confidence intervals for the main effect, and 19% did not report exact p-values. For Nature Medicine papers, 95% did not report confidence intervals, and 86% did not report exact p-values.
In our PLOS ONE sample, 23% of studies were univariate, which is a higher use of multivariable modelling than in previously reviewed health literature but in line with the increasing complexity of modelling over time [51]. Real et al. [52] examined the use of multiple regression models in observational studies in Spanish scientific journals between 1970 and 2013. They found only 6.1% of the articles included the term multivariable analysis, with the frequency increasing from 0.1% in 1980 to 12.3% in 2013. Although many model selection methods exist, some are more robust than others. Sun et al. [53] outline the danger of using univariate analysis with a statistical significance threshold (p < 0.05) as a screening tool for inclusion in multivariable models. The authors provide examples of simulated and real data where this method ignores confounding and inappropriately excludes important variables, leading to incorrect conclusions about associations. In a review of oncology studies, Mallet et al. [54] found that out of 43 studies, 21 (49%) used univariate analysis as a pre-screening test to select variables for multivariable models. The current research shows a lower use of the significance threshold as a pre-screen, with 3 of the 64 papers that used multivariable modelling reporting this approach. While this sounds like a positive result, it is difficult to interpret because 20% of studies were unclear when describing their modelling process, and it was often unclear whether models were univariate or multivariable. Statistical sections were often generic and difficult to follow, with poor reasoning; the authors of one paper described their selected model as the overfitted model, not understanding that fitting a model with many variables to a small sample is poor practice. Variable inclusion was based on literature or author knowledge in 36 (38%) papers, generally with very little or no explanation. Thirteen (14%) studies used a recognised modelling strategy; all of these used stepwise methods, with six describing any statistical significance criteria.
Poor reporting of statistical sections is common in health, with White et al. [55] finding that many papers’ content resembled “boilerplate text” cut and pasted from already published work, often with little resemblance to the analyses conducted. Collyer et al. [56] conducted a qualitative study to understand researchers’ understanding of linear regression, finding that researchers described the interpretation of regression coefficients as iterative and nuanced rather than as complete or authoritative statements, and that it sometimes depended on prior understandings. However, in our study, it was challenging to judge authors’ understanding as most results were not interpreted, with only 11 (12%) author teams properly linking the size of the effect back to the dependent variable, and another 18 (19%) doing so generically.
Authors rarely check for multicollinearity. Vatcheva et al. [57] searched the epidemiological literature in PubMed from January 2004 to December 2013 and found that only 1 in 100 regression papers mentioned collinearity or multicollinearity. The authors report that when variables are strongly collinear, the usual interpretation of a regression coefficient as the change in Y for a one-unit increase in X while holding the other predictors constant becomes practically impossible. They concluded that although diagnosing multicollinearity does not solve the problem, it is important to understand its impact on findings so that greater care can be taken when interpreting the regression coefficients [57]. Norstrom et al. [58] reviewed 41 papers from public health and found that only one article tested for collinearity. Fernandez–Nino and Hernandaz–Montes [59] found that 15% (17/111) of articles in Biomedica reported assessing collinearity. Our results confirm that checking and reporting of collinearity remains poor, with only 13 of the 64 (20%) multivariable papers checking for collinearity.
System-wide reform of statistical practices
While blaming individual researchers for poor statistical quality is tempting, our results, which align with previous research [60], indicate systemic issues in understanding and reporting the broader statistical theory and tolerance of bad behaviour [61]. This research aims not to name and shame individual researchers for their reporting practices but to understand the magnitude of the problem and help guide the culture change required to improve reporting. The broader purpose of this research is not about the individual researchers but rather the practical implication of arriving at the wrong conclusions when bad statistical practices are used. It is about patients and the impact that potentially ineffective treatments might have. Leek et al. [62] suggest that if we think of poor research practices as a disease, we should see the review process as medication with the research quality crisis seen from a primary prevention perspective. As in health prevention, editorial review (medication) is the last step and should not be relied on to fix the problems [62]. Instead, greater investment in prevention strategies such as increasing awareness of issues, increased training and access to qualified methodologists is required to encourage healthy research practices.
Parallels can be observed between poor statistical practices and addictive behaviours, including obesity, where there is a clear relationship with junk food availability in communities [63] (seen as obesogenic environments). In this parallel, some journals can be seen as selling junk (quick publication without adequate review), with institutions rewarding quantity over quality (calorie-rich food deficient in nutritional value). In the same way that telling someone to lose weight won’t solve the obesity crisis, just telling people to do better research or to stop abusing p-values will not solve the statistical quality crisis. Like an addictive drug, the reward for poor quality research can create a feedback loop: the more publications required to achieve promotion or funding success, the more shortcuts are taken.
There are complex causes for the poor reporting quality observed historically [3] and in the current study. Many opinion pieces have been written on the topic, and primary research has targeted particular items of statistical reporting, such as p-values and confidence intervals, but this work has had limited success in improving the interpretation of results [64, 32]. There are no easy solutions, and we recommend a system-wide approach to reform statistical practices. When designing future interventions to tackle poor statistical quality, meta-researchers should incorporate the knowledge from health about behaviour change [65] and about the complex relationships between individuals, interpersonal support systems, the community, the organisation and the policy environment [66]. The social-ecological model proposed by Bronfenbrenner [66] can be adapted to approach system-wide reform of research practices (Table 4).
Without the connections and cooperation of the different levels identified in the social-ecological model [66], real reform improving research quality is unlikely to succeed. Barnett and Byrne [67] explain a bystander effect currently occurring in research quality, with everyone watching and waiting for someone else to act while the research systems further decline, with publishers expecting institutions to prevent and educate against poor practice, institutions expecting their staff to protect their reputations and for journals to improve the peer review, and funders willing to fund new research but not quality control. They recommend diverting 1% of publishers’ profits and scientific funding to quality control [67]. Currently, no time or money is built into the system for research maintenance or quality control. As seen in preventative health [68], broad structural change is likely to occur only with investment and policy implementation.
Checklists and automated reviews
Our findings suggest that while most authors report regression coefficients, they often do not provide any measure of uncertainty around their result, and it can be challenging to identify the specific statistical method used. Journals can improve the quality of statistical reporting by implementing policies that standardise the presentation of statistical results. This could involve including all statistics in tables, whether in the main body of the paper or the supplementary materials, with clear identification of the statistical tests used.
Many journals require reporting guidelines, including statistical guidelines such as SAMPL [15]. This could be an opportunity for researchers to seek advice and improve statistical methodology. However, the current checklist approach of just providing page numbers instead of details has been criticised, with Blanco et al. [69] questioning whether checklists submitted by authors reflect the information presented in articles. They randomly selected 12 randomised controlled trials from three journals and found that only one article fully adhered to CONSORT guidelines. They concluded that journals needed to take action to ensure transparent reporting, including checking of the items by editors or trained editorial assistants. PLOS ONE recommends that authors use SAMPL, which provides guiding principles for reporting statistical methods and results and specific instructions for reporting linear regression; however, our results show that the guidelines are not widely followed. PLOS ONE also recommends the use of STROBE [29] for observational studies, but our results showed poor reporting, with papers often lacking detail on whether the study was descriptive, associational, or predictive and lacking a clear statement of what variables were selected and why. These results support Pouwels et al. [70], who concluded that authors should be required to submit the checklist with text excerpted from the manuscript instead of just referring to page numbers.
When journals introduce new policies, it’s important to monitor their value. These policies should not just increase the author and reviewer burden without improving quality. There’s a risk that researchers might provide normative responses to checklists rather than focussing on improving overall research quality [71]. This was evident in interventions promoting the better use of confidence intervals, where the impact on interpretation quality was minimal [64, 72]. To reduce this burden on reviewers, it’s recommended that journals provide templates of papers with expected results and standard reporting. Reviewers can be provided with interactive checklists, similar to the one used in this research. For example, if reviewers indicate that confidence intervals were not reported, an automated feedback system can educate authors on table formatting and interpretation of results.
Some readers may ask, can statistical reviews be automated? While this is still a developing field, there have been previous attempts [73], including text mining and statcheck, an algorithm designed to scan papers to detect inconsistencies in calculated test statistics, degrees of freedom and their associated p-values [50]. Roughly half of the papers reviewed had at least one p-value that did not correspond with their associated test statistic and degrees of freedom. Statcheck can only process data with specific APA formatting (e.g. t(36) = 0.099, P = 0.921) and was found to only process 61% of all statistical tests [74]. The current study found results were often incomplete and inconsistently formatted, reflecting differing reporting practices across health fields, a current barrier to automation. Therefore, while automated tools are helpful, they can only aid the reviewer, such as helping screen the paper for checklists. Reviewers can then use this information to improve the interpretation of results [73].
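The kind of consistency check described above can be sketched in a few lines. The example below is a simplified illustration and not statcheck's actual implementation: it handles only two-sided t-tests written in the APA-like form quoted in the text, the function names are hypothetical, and the p-value is recomputed by numerically integrating the t density so that only the Python standard library is needed. A reported p-value is treated as consistent if it matches the recomputed value once rounding of the reported test statistic is allowed for.

```python
import math
import re

# Matches strings such as "t(36) = 0.099, P = 0.921".
APA_T_TEST = re.compile(
    r"t\((?P<df>\d+(?:\.\d+)?)\)\s*=\s*(?P<t>-?\d*\.?\d+),"
    r"\s*[Pp]\s*=\s*(?P<p>\d*\.?\d+)"
)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value for a t statistic via Simpson's rule on the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def decimals(number_text):
    """Number of decimal places in a reported number."""
    return len(number_text.split(".")[1]) if "." in number_text else 0


def check_t_result(text):
    """Recompute p from a reported t-test; return None if nothing is parsed."""
    match = APA_T_TEST.search(text)
    if match is None:
        return None
    df = float(match["df"])
    t_abs = abs(float(match["t"]))
    p_reported = float(match["p"])
    # The reported t is rounded, so the true value lies within half a unit
    # of its last decimal place; this gives a range of plausible p-values.
    half_unit = 0.5 * 10 ** -decimals(match["t"])
    p_places = decimals(match["p"])
    p_high = round(two_sided_p(max(t_abs - half_unit, 0.0), df), p_places)
    p_low = round(two_sided_p(t_abs + half_unit, df), p_places)
    return {
        "recomputed_p": two_sided_p(t_abs, df),
        "consistent": p_low - 1e-12 <= p_reported <= p_high + 1e-12,
    }


print(check_t_result("t(36) = 0.099, P = 0.921"))
```

Text that does not follow this exact pattern returns `None`, which mirrors the barrier to automation noted above: inconsistently formatted results cannot be checked without first being parsed.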
Involvement of statisticians
The volume and statistical complexity of most medical research have increased drastically in the last couple of decades, with the use of advanced methods such as survival analysis and multivariable linear regression now commonplace [23]. Our study confirmed this, with most studies using multiple statistical methods and multivariable analysis. Unfortunately, serious flaws in published work are also commonplace, with Altman et al. [23] reporting that many of these problems are caused by statistical analyses performed by health professionals with an inadequate understanding of statistical methods. Studies in quality improvement consistently recommend that authors should involve biostatisticians in projects early, enabling well-designed studies with robust interpretation of results [60]. However, many projects either completely lack involvement from a biostatistician, or they are involved too late to improve study findings effectively. This has been a long-recognised phenomenon across all research fields, with Fisher [75] famously saying, “To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of”. This essentially highlights that no analytical methods can rescue the result once a study has been undertaken with poor design.
Our study highlighted that statistical sections were often generic, emphasising values rather than practical importance, with only 23% of authors directly interpreting the size of regression coefficients, indicating statistical input may have improved reporting. The use of statistical expertise was examined by Altman et al. [60], who surveyed the authors of all original research articles submitted to the BMJ and Annals of Internal Medicine over five months in 2001. Authors were asked if studies had statistical or epidemiology input, the stage at which this occurred and the reasons if none was used. Of the 704 authors who responded, 39% (273) of papers had input from a statistician. For the papers with statistical input, 30% of authors identified their first major contribution was at the analysis stage. Of these papers, a third of biostatisticians were not acknowledged for their work. Altman et al. [60] found that articles without methodological support were more likely to be desk-rejected (71%) than articles with statistical input (57%). A more recent survey by Sebo et al. [76] randomly selected 781 articles published in 2016 in high-impact medicine and primary care journals and found that when a statistician is involved as a co-author, the time to publication is reduced. Mullner et al. [77] reviewed 537 papers from medicine and found that when statisticians are involved in studies, inadequate reporting of adjustment for confounders drops from 56% to 27%. While we did not measure the involvement of statisticians in our research, poor reporting was commonplace; statistical reviewers had difficulty identifying if there was a process for selecting variables or even whether there were linear regression results in the paper. Therefore, it is recommended that statisticians be involved early in medical research and be appropriately recognised for their contribution.
While much of the burden of implementing checklists to improve statistical quality falls on journals, we recommend institutions such as universities take responsibility for their paper submissions, stopping the poor research before it gets to the journal. To achieve this goal, more funding from institutions for central statistical support is required. A report by the NHMRC (Australia’s major funder of health and medical research) on research quality identified study statistics and analysis as a critical core competency for high quality research, and noted statisticians as advisors to/members of ethics committees as an example of institutional support that will foster high quality research [78]. There have also been broader calls for statisticians to be involved in all medical research, to improve study design and interpretation of results [79].
Post-publication peer review, as conducted by statisticians in the current study, allows for transparent and continuous research evaluation, identifying flaws or errors [80]. In an environment where digital technology is the norm, researchers can be given real-time feedback about statistical methods through journal websites, pre-prints and changes made through version control. There is an opportunity to change practice by encouraging researchers to take ownership of their errors, where publications are regarded as the beginning of the journey, and published work is viewed as dynamic ‘living documents’ that can be changed and updated as errors are identified [81]. For this to occur, both researchers and institutions need to invest in quality over volume, with negative perceptions about paper corrections overcome.
Limitations
PLOS ONE is a large cross-discipline journal, but may not be representative of all health and biomedical journals. The focus of this study was linear regression; this was purposeful, as there will be different misconceptions driving understanding in comparison to ANOVA. We focused on the interpretation of continuous variables; however, we recommend that future questionnaires be seen in a general linear model framework. Initially, the questionnaire contained 55 items and included an interpretation of categorical independent variables, but this was removed due to length concerns. However, breaking up the interpretation of coefficients into continuous and categorical, as well as adding whether post hoc tests were used, would not compromise length but would improve interpretability.
Conclusions
Linear regression is one of the most frequently used statistical methods, so researchers should be able to interpret its output. Unfortunately, our research shows that the average researcher tends to over-rely on p-values and significance rather than the contextual importance and robustness of conclusions drawn. This systematic failure in statistical reporting highlights the need for investment in research training and quality control; this should be across the board, from ethics to submission of research to post-peer review, allowing statisticians and other methodological experts to be involved through the entire project cycle. The research environment is an ecosystem, and future meta-research should consider how the different levels of the system interact, understand the behaviour of individuals, their support systems, the community, organisations, and policy environment, as well as adapt and use established knowledge about behaviour change. Journal policies can achieve improvements in basic reporting, and this study recommends introducing interactive checklists for authors and reviewers so that, when poor reporting occurs, automated feedback can educate authors on how tables should be formatted and results interpreted. Journals could also produce template papers and standardise reporting for commonly used statistical tests. To increase the transparency of reporting, all statistical tests should be put into tables, whether in the main body of the paper or the supplementary materials; it should be clear from tables what test was used and if models are univariate or multivariable.
Finally, post-peer review needs to be encouraged, where correcting errors and clarifying research is rewarded rather than punished, and research papers are regarded as living documents with version control. For this to occur, there needs to be a cultural change and investment in how academics and institutions think about academic output, with research maintenance built into roles and a shift away from volume to quality of research.
Supporting Information
S1 Table: Full reporting of agreement and reliability for statistical raters.
Data Availability
The raw data and a reproducible R Quarto file used to produce this paper, including all tables and figures, have been stored in a GitHub repository and can be accessed at https://github.com/Lee-V-Jones/Linear_Regression_Reporting_Practices
Funding
There was no cost associated with this research except for attending conferences. These costs were covered by the primary author’s PhD allocation from the health faculty, Queensland University of Technology, and scholarships. The Statistical Society of Australia (SSA) and the Association for Interdisciplinary Meta-research & Open Science (AIMOS) supported the primary author with travel grants to attend their respective conferences. These scholarships did not influence the results of the study.
Competing Interests
The authors declare there are no competing interests.
Author Contributions
Conceptualization: Lee Jones, Adrian Barnett, Dimitrios Vagenas.
Data Curation: Lee Jones.
Formal Analysis: Lee Jones.
Funding Acquisition: Lee Jones.
Investigation: Lee Jones.
Methodology: Lee Jones, Adrian Barnett, Dimitrios Vagenas.
Project Administration: Lee Jones, Dimitrios Vagenas.
Resources: Dimitrios Vagenas.
Software: Lee Jones.
Supervision: Adrian Barnett, Dimitrios Vagenas.
Validation: Adrian Barnett, Dimitrios Vagenas.
Visualization: Lee Jones.
Writing – Original Draft Preparation: Lee Jones.
Writing – Review & Editing: Lee Jones, Adrian Barnett, Dimitrios Vagenas.
Acknowledgements
We acknowledge all the statisticians (named and not named) who kindly gave up their time to contribute to this publication by reviewing papers, including: Ingrid Aulike, Peter Baker, Brigid Betz-Stablein, Enrique Bustamante, Taya Collyer, Susanna Cramb, Alanah Cronin, Laura Delaney, Zoe Dettrick, Eralda Gjika Dhamo, Des FitzGerald, Peter Geelan-Small, Edward Gosden, Alison Griffin, Jenine Harris, Cameron Hurst, Kyle James, Helen Johnson, Jessica Kasza, Karen Lamb, Stacey Llewellyn, James Martin, Miranda Mortlock, Satomi Okano, Alan Rigby, Michael Steele, Megan Steele, Jacqueline Thompson, Simon Turner, Michael Waller, Kevin Wang, Jace Warren, Natasha Weaver, Lachlan Webb, and Janet Williams.