ABSTRACT
Background Pressure injuries (PIs) pose a substantial healthcare burden and incur significant costs worldwide. Several risk prediction models to allow timely implementation of preventive measures and potentially reduce healthcare system burden are available and in use. The ability of risk prediction tools to correctly identify those at high risk of PI (prognostic accuracy) and to have a clinically significant impact on patient management and outcomes (effectiveness) is not clear. We aimed to evaluate the prognostic accuracy and clinical effectiveness of risk prediction tools for PI, and to identify gaps in the literature.
Methods and Findings The umbrella review was conducted according to Cochrane guidance. MEDLINE, Embase, CINAHL, EPISTEMONIKOS, Google Scholar and reference lists were searched to identify relevant systematic reviews. Risk of bias was assessed using adapted AMSTAR-2 criteria. Results were described narratively.
We identified 16 reviews that assessed prognostic accuracy and 10 that assessed clinical effectiveness of risk prediction tools for PI. The 16 reviews of prognostic accuracy evaluated 63 tools (39 scales and 24 machine learning models), with the Braden, Norton, Waterlow and Cubbin-Jackson scales (and modifications thereof) the most frequently evaluated. Meta-analyses from a focused set of included reviews showed that the scales had sensitivities and specificities ranging from 53% to 97% and 46% to 84%, respectively. Only 2/16 reviews performed appropriate statistical synthesis and quality assessment. One review assessing machine learning based algorithms reported high prognostic accuracy estimates, but some of these were sourced from the same data within which the models were developed, leading to potentially overoptimistic results.
Two randomised trials assessing the effect of PI risk assessment tools on incidence of PIs were identified from the 10 systematic reviews of clinical effectiveness; both were included in a Cochrane review and assessed as high risk of bias. Both trials found no evidence of an effect on PI incidence.
Conclusions Our findings underscore the lack of high-quality evidence for the accuracy of risk prediction tools for PI. There is no reliable evidence to suggest that using existing risk prediction tools effectively reduces the incidence of PIs. Further research is needed on their clinical effectiveness, but only once promising prediction tools have been developed and appropriately validated.
INTRODUCTION
Pressure injuries (PI), also known as pressure ulcers or decubitus ulcers, have an estimated global prevalence of 12.8% among hospitalised adults,1 and place a significant burden on healthcare systems (estimated at $26.8 billion per year in the US alone2). PIs are most common in individuals with reduced mobility, limited sensation, poor circulation, or compromised skin integrity, and can affect those in community settings and long-term care as well as hospital settings. Effective prevention of PI requires multicomponent preventive strategies such as mattresses, overlays, and other support systems, nutritional supplementation, repositioning, dressings, creams, lotions, and cleansers.3 It is therefore important to correctly identify those most at risk of PI to allow timely and targeted implementation of preventive measures, to reduce harm and consequently burden to healthcare systems.4
Numerous clinical assessment scales (e.g. Braden5 6, Norton7 and Waterlow8) and statistical risk prediction models for assessing the risk of PI are available; however, many are limited by reliance on subjective clinical judgment and do not appear to meet basic standards for the development or reporting of risk prediction models.9 Nevertheless, many such tools are in routine clinical use. For example, in certain hospitals and long-term care settings in the US, healthcare professionals must conduct mandatory risk assessments for PI for all patients for the purposes of risk stratification and clinical triage.
Despite the apparent lack of sound methods for development and validation (including external validation) of available risk prediction tools, there is a considerable body of evidence evaluating their clinical utility, much of which has been synthesised in systematic reviews and meta-analyses.9 Clinical utility includes both prognostic accuracy and clinical effectiveness. Prognostic accuracy is estimated by applying a numeric threshold above (or below) which there is a greater risk of PI, with study results presented using accuracy metrics such as sensitivity, specificity or the area under the receiver operating characteristic (ROC) curve.10 Resulting accuracy is driven not only by the nominated threshold for defining participants as at low or high risk for PI but by other study factors including population and setting.11 Clinical effectiveness, or the ability of a tool to impact on health outcomes such as the incidence or severity of PI, is related both to the accuracy of the tool (or its ability to correctly identify those most likely to develop PI) and to the uptake and implementation of the tool in practice. Demonstrating a change in health outcomes as a result of use of a risk prediction tool is vital to encourage implementation.12
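As a minimal illustration of how a threshold turns a risk score into accuracy metrics, the Python sketch below computes sensitivity and specificity for a scale on which lower scores indicate higher risk (as with the Braden scale). The scores, outcomes and the function name `accuracy_at_threshold` are hypothetical and are not taken from any included study.

```python
def accuracy_at_threshold(scores, developed_pi, threshold):
    """Sensitivity and specificity when a score <= threshold flags high risk."""
    tp = fp = fn = tn = 0
    for score, event in zip(scores, developed_pi):
        flagged = score <= threshold
        if flagged and event:
            tp += 1
        elif flagged and not event:
            fp += 1
        elif not flagged and event:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical scores for ten patients and whether each later developed a PI.
scores = [9, 12, 14, 15, 16, 17, 18, 19, 21, 22]
developed_pi = [True, True, False, True, False, True, False, False, False, False]

for cutoff in (15, 18):
    sens, spec = accuracy_at_threshold(scores, developed_pi, cutoff)
    print(f"cut-off <= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this toy cohort, raising the cut-off flags more patients as high risk, so sensitivity rises while specificity falls. The same trade-off is why an accuracy estimate is only interpretable alongside the threshold that produced it.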
Using an umbrella review approach, we aimed to provide a comprehensive overview of available systematic reviews that consider the prognostic accuracy and clinical effectiveness of PI risk prediction tools.
METHODS
Protocol registration and reporting of findings
We followed Cochrane guidance for conducting umbrella reviews13, and ‘Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies’ (PRISMA-DTA) reporting guidelines14 (see Appendix 1). The protocol was registered on Open Science Framework (https://osf.io/tepyk).
Literature search
Electronic searches of MEDLINE, Embase via Ovid and CINAHL Plus EBSCO from inception to January 2023 were developed and conducted by an experienced information specialist (AC), employing well-established systematic review and prognostic search filters,15–17 combined with appropriate keywords related to PIs. Simplified supplementary searches in EPISTEMONIKOS and Google Scholar were also undertaken (see Appendix 2 for further details). Screening of search results and full texts was conducted independently and in duplicate by two reviewers (BH, JD, YT, KS), with disagreements resolved by a third reviewer.
Eligibility criteria for this umbrella review
Published English-language systematic reviews of risk prediction tools developed for adult patients at risk of PI in any setting were included. Clinical risk assessment scales and models developed using statistical or machine learning (ML) methods were eligible (models exclusively using pressure sensor data were not considered). Risk prediction tools could be applied by any healthcare professional using any threshold for classifying patients as high or low risk and using any PI classification system18–21 as a reference standard. For prognostic accuracy, we required accuracy metrics, such as sensitivity and specificity, to be presented but did not require full 2×2 classification tables to be reported. Reviews on diagnosing or staging suspected or existing PIs were excluded.
To be considered ‘systematic’, reviews were required to report a thorough search of at least two electronic databases and at least one other indication of systematic methods (e.g. explicit eligibility criteria, formal quality assessment of included studies, adequate data presentation for reproducibility of results, or review stages (e.g. search screening) conducted independently in duplicate).
Data extraction and quality assessment
Data extraction forms (Appendix 3) were informed by the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies) and Cochrane Prognosis group template.22 23 Data extraction items included review characteristics, number of studies and participants, study quality and results.
The methodological quality of included systematic reviews was assessed using AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews)24, adapted for systematic reviews of risk prediction models (Appendix 4). Our adapted AMSTAR-2 contains six critical items, and limitations in any of these items reduce the overall validity of a review.24 Quality assessment and data extraction were conducted by one reviewer and checked by a second (BH, JD, KS), with disagreements resolved by consensus.
Synthesis methods
Reviews about prognostic accuracy and clinical effectiveness of risk prediction tools were considered separately. Review methods and results were tabulated and a narrative synthesis provided. Prognostic accuracy results from reviews including a statistical synthesis were tabulated according to risk prediction tool.
Considerable overlap in risk prediction tools and included primary studies was noted between reviews. For risk prediction tools that were included in multiple meta-analyses, we focused our synthesis on the review(s) with the most recent search date or most comprehensive coverage (based on number of included studies) and the most robust estimate of prognostic accuracy (judged according to the appropriateness of the meta-analytic method used, e.g. use of recommended hierarchical approaches for test accuracy data25). The prognostic accuracy of risk prediction tools that were included in three or fewer reviews was reported only if an appropriate method of statistical synthesis13 was used.
For clinical effectiveness results, reviews with the most recent search date or most comprehensive overview of available studies and that at least partially met more of the AMSTAR-2 criteria24 were prioritised for narrative synthesis.
RESULTS
Characteristics of included reviews
A total of 110 records were selected for full-text assessment from 6302 unique records. We obtained the full text of 104 publications, of which 23 reviews met all eligibility criteria (Figure 1): 16 reported accuracy data24–39 and 10 reported clinical effectiveness data25 29 34 40–46 (three reported both accuracy and effectiveness data25 29 34). Table 1 and Figure 2 provide an overview of the characteristics, methods and methodological quality of all 23 reviews (see Appendix 5 for full details).
Reviews were published between 2006 and 2022. Approximately half (12/23, 52%) restricted inclusion to adult populations (Table 1), three included any age group, and nine (39%) did not report any age restrictions. Six reviews (6/23, 26%) specified only populations without PIs at baseline for inclusion. Acute care was the most common setting across both review questions, 5/16 (31%) and 4/10 (40%) for accuracy and effectiveness reviews, respectively. Quality assessment tools varied, with QUADAS-2 (n=7) or QUADAS (n=2) being most common for reviews of accuracy (9/16, 56%). One accuracy review30 reported use of both QUADAS-2 and PROBAST tools in their methods, but only reported QUADAS-2 results.
Reviews of accuracy predominantly focused on studies using any (5/16, 31%) or pre-specified (8/16, 50%) risk assessment tools or scales; one included only ML-based prediction models.30 A total of 63 risk prediction tools were reported across the reviews, including 24 ML models. The number of risk prediction tools included in a single review ranged from one34–39 to 28 tools.32 Only two reviews reported eligibility criteria related to the development or validation of the risk prediction tools. One33 (6%) excluded evaluation studies that used the same data that were used to develop the tool, and the other29 included only “validated risk assessment instruments”; however, this term was not further defined and the review included studies reporting the original development of risk prediction tools.
The majority (13/16, 81%) of accuracy reviews conducted a statistical synthesis of data; however, only two utilised currently recommended hierarchical approaches for the meta-analysis of test accuracy data.36 40 Seven conducted univariate meta-analysis of individual accuracy measures (e.g. sensitivity and specificity separately, or AUC31, RR32 or odds ratio33) and four did not clearly report the type of analysis approach used.
Of the 10 systematic reviews evaluating the clinical effectiveness of risk prediction tools, two only considered the reliability of risk assessment scales41 42 and eight considered effects on patient outcomes (one of which also considered tool reliability43). More than half of reviews (6, 60%) compared use of PI risk assessment scales to clinical judgement alone or ‘standard care’. The number of included studies ranged from one44 to 20 studies45 and the sample sizes of primary studies ranged from one (one subject and 110 raters, in an inter-rater reliability study46) to 3,027 patients. Reported outcomes included the incidence of PIs (7/10), preventative interventions prescribed (5/10) and interrater reliability (3/10) (reported in Appendix 5). One (Cochrane) review used the Cochrane RoB tool for quality assessment of included studies and three used JBI (n=2) or CASP (n=1) tools. Due to heterogeneity in study design, risk prediction tools and outcomes evaluated, none of the included reviews provided any form of statistical synthesis of study results.
Methodological quality of included reviews
The quality of included reviews was generally poor (Figure 2; Appendix 5). The AMSTAR-2 items that were most consistently met (yes or partial yes) were: comprehensiveness of the search (19/23, 83%), study selection independently in duplicate (15/23, 65%), and conflicts of interest reported (17/23, 74%).
Of the 16 accuracy reviews, four (25%)30 35 36 40 used an appropriate method of quality assessment of included studies (i.e. QUADAS or QUADAS-2 dependent on publication year) and presented judgements per study. Of the 10 effectiveness reviews, two (20%)29 47 used an appropriate method of quality assessment (the Cochrane tool for assessing risk of bias48 and criteria consistent with the AHRQ Methods Guide for Effectiveness and Comparative Effectiveness Reviews49, respectively) and provided judgements per study. Four reviews either reported quality assessment results per study (n=3)41 45 50 or were considered to use an appropriate quality assessment tool (n=1)33 (AMSTAR-2 criterion partially met).
Of the accuracy reviews that included a statistical synthesis, 31% (4/13)31 32 36 40 used an appropriate method of meta-analysis and investigated sources of heterogeneity. Two reviews36 40 used recommended hierarchical approaches to meta-analysis of test accuracy data (the bivariate model36 and hierarchical summary ROC (HSROC) model40) and one31 calculated summary AUC using random effects meta-analysis.32
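To make the distinction between univariate and hierarchical synthesis more concrete, the sketch below pools sensitivity alone across studies using a random-effects model on the logit scale. It is an illustration under stated assumptions: the DerSimonian-Laird estimator and the 0.5 continuity correction are choices made here, not methods confirmed for any included review, and the study counts are hypothetical. Because it treats sensitivity in isolation, it ignores the correlation between sensitivity and specificity that the bivariate and HSROC models account for.

```python
import math


def pool_logit_random_effects(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale."""
    # Per-study logit and its approximate variance (0.5 continuity correction).
    y, v = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, n - e + 0.5
        y.append(math.log(a / b))
        v.append(1 / a + 1 / b)

    # Fixed-effect weights and Cochran's Q.
    w = [1 / vi for vi in v]
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))

    # Between-study variance (tau^2), truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)

    # Random-effects weights, pooled logit and its standard error.
    w_re = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    # Back-transform the estimate and its 95% confidence limits to proportions.
    return (inv_logit(pooled), inv_logit(pooled - 1.96 * se),
            inv_logit(pooled + 1.96 * se), tau2)


# Hypothetical true positives and numbers of patients who developed a PI in four studies.
true_positives = [18, 40, 25, 30]
with_pi = [25, 50, 40, 35]

sens, lower, upper, tau2 = pool_logit_random_effects(true_positives, with_pi)
print(f"summary sensitivity {sens:.2f} (95% CI {lower:.2f}, {upper:.2f}), tau^2 {tau2:.3f}")
```

Running the same function separately on specificity data would give a second, independent summary estimate, which is the univariate pattern described above.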
Compared to the reviews of accuracy, reviews of effectiveness more commonly provided adequate descriptions of primary studies (8/10, 80% vs 4/16, 25%) (Figure 2). No other major differences across review questions were noted.
Results from reviews evaluating the prognostic accuracy of risk prediction models
Five of 16 accuracy reviews were prioritised for narrative synthesis (Tables 2-3) and are reported below according to risk prediction tool. Four of the five reviews did not include development study estimates within their meta-analyses, but this information could not be ascertained for the review30 of ML-based models. None of these reviews assessed the quality of development methods for the prediction tools considered in their statistical syntheses.
Braden, and modified Braden scales
The most recent and largest review36 of the Braden scale (60 studies, including 49,326 patients), which used hierarchical bivariate meta-analysis, reported an overall summary sensitivity of 0.78 (95% CI 0.74, 0.82; 15,241 patients) and specificity of 0.72 (95% CI 0.66, 0.78; 34,085 patients) across all reported thresholds (range ≤10 to ≤20). Summary sensitivities and specificities ranged from 0.79 (95% CI 0.76, 0.82) and 0.66 (95% CI 0.55, 0.75) at the lowest cut-offs for identification of high-risk patients (≤15 in 15 studies) to 0.82 (95% CI 0.73, 0.89) and 0.70 (95% CI 0.62, 0.77) using a cut-off of 18 (15 studies), respectively. Heterogeneity investigations suggested higher accuracy for predicting pressure injury risk in patients with a mean age of 60 years or less, in hospitalised patients (compared to long-term care facility residents) and in Caucasian populations (compared to Asian populations).36 The review noted a high risk of bias for the ‘index test’ section of the QUADAS-2 assessment in approximately a third of included studies, but failed to provide details or reasons for this assessment.
Two modified versions of the Braden scale51 52 were included by Park and colleagues.53 Summary sensitivities were 0.97 (95% CI 0.92, 0.99; 125 patients from four studies)51 and 0.89 (95% CI 0.71, 0.98; 27 patients from two studies)52, and summary specificities were 0.70 (95% CI 0.66, 0.73; 563 patients)51 and 0.71 (95% CI 0.67, 0.75; 599 patients).52 The review was rated critically low on the AMSTAR-2 assessment, with only 3/15 (20%) criteria fulfilled. Despite reporting the use of QUADAS-2 for their risk of bias assessment, QUADAS-2 results were not reported, except that none of the included studies were estimated to be at high risk.
Cubbin & Jackson scale
Zhang and colleagues40 included six studies evaluating the original Cubbin & Jackson scale54 (800 patients). Summary sensitivity and specificity were both reported as 0.84 (95% CIs 0.59, 0.95 and 0.66, 0.93, respectively),40 suggesting that this represents the point on the HSROC curve where sensitivity equals specificity, particularly as reported thresholds ranged from 24 to 34. The review authors concluded that although the accuracy of the Cubbin & Jackson scale was higher than the EVARUCI scale and the Braden scale, low quality of evidence and significant heterogeneity limit the strength of conclusions that can be drawn.
Norton scale
Park and colleagues53 pooled data from seven studies (2,899 participants) evaluating the Norton scale, across thresholds ranging from <14 to <16. They reported summary sensitivity of 0.75 (95% CI 0.70, 0.79) and specificity 0.57 (95% CI 0.55, 0.59). A further four reviews presented statistically synthesised results for the Norton scale (Appendix 5), including one review by Chou and colleagues29 which included nine studies (5,444 participants) but only reported median values for accuracy parameters.
Waterlow scale
Although Zhang and colleagues40 included the fewest participants (4 studies; 1,000 participants) of all six reviews that conducted a statistical synthesis of the accuracy of the Waterlow scale8, they provided the most recent review. It was rated highest on AMSTAR-2 criteria and appropriately used the HSROC model for meta-analysis across thresholds ranging from 12 to 25. Summary sensitivity was 0.63 (95% CI 0.48, 0.76) and summary specificity 0.46 (95% CI 0.22, 0.71) (Table 3). A second review53 reported summary sensitivity of 0.55 (95% CI 0.49, 0.62) and specificity 0.82 (95% CI 0.80, 0.85) (6 studies; 1,268 participants).
Machine learning algorithms
Qu and colleagues30 conducted separate meta-analyses of 25 studies according to ML algorithm type (Table 2). The review was rated critically low on AMSTAR-2, with only 6/15 (40%) criteria fulfilled, and reported using Bayesian DTA meta-analysis. The review did not restrict inclusion to external evaluations of the models, and the authors did not report which estimates were sourced from development data or external data. The summary AUC for the five algorithms ranged from 0.82 (95% CI 0.79, 0.85; 9 studies with 97,815 participants) for neural network-based models to 0.95 (95% CI 0.93, 0.97; 7 studies with 161,334 participants) for random forest models (Table 3). Random forest models also had the highest summary specificity, 0.96 (95% CI 0.80, 0.99), with sensitivity 0.72 (95% CI 0.26, 0.95). The highest summary sensitivity was observed for support vector machine models (0.81, 95% CI 0.69, 0.90) with summary specificity 0.81 (95% CI 0.59, 0.93) (9 studies, 152,068 participants). The remaining algorithms had summary sensitivities ranging from 0.66 (decision tree models) to 0.73 (neural network models) (Table 3). Two additional ML algorithms evaluated in the included studies (Bayesian networks and LOS (abbreviation not explained)) had too few studies to allow meta-analysis (Appendix 5).
Other scales
In addition to the risk prediction tools reported above, Zhang and colleagues40 reported on the EVARUCI scale55, presenting summary sensitivity and specificity of 0.84 (95% CI 0.79, 0.89) and 0.68 (95% CI 0.66, 0.70), respectively (3 studies; 3,063 participants). These results were pooled across thresholds of 11 and 11.5 (one threshold was not reported).
Beyond the results covered by our five prioritised reviews, three further modifications of the Braden scale were evaluated in statistical syntheses: Braden modified by Kwong60, the 4-factor model61 and ‘extended Braden’61, revealing variable performance with high uncertainty.32 29 42 Another two modified versions of the Norton scale (by Ek62, and by Bienstein63) were also included in one review’s meta-analyses32, but only risk ratios were reported. Three additional scales (revised “Jackson & Cubbin”64, EMINA65 and PSPS66) were evaluated in one statistical synthesis each.32 29 Full details can be found in Appendix 5 Table S4.
Appendix 5 Table S5 reports data for another 17 risk prediction tools, each associated with a single primary study (therefore not covered in detail in the text above), and another two tools, Sunderland67 and RAPS68, which are assessed in two primary studies each.
Results from reviews evaluating the clinical effectiveness of risk prediction models
Table 4 provides an overview of results from four29 45 47 50 69 of the 10 reviews reporting clinical effectiveness, including one Cochrane review47 which identified two randomised controlled trials (RCTs) of risk prediction tools and assessed risk of bias using the Cochrane RoB tool48. The remaining reviews used broader eligibility criteria for study inclusion and a range of different quality assessment tools, with some reviews reaching varying conclusions about the methodological quality of the same studies. Given the overlap in study inclusion between reviews, a summary of the included comparative studies is provided below.
One individually randomised trial (Webster and colleagues70) and one cluster randomised trial (Saleh and colleagues71) were considered to be at high risk of bias by the Cochrane review authors. The individually randomised trial70 was included in three additional reviews29 44 47 50, each of which considered the trial to be ‘good quality’29, ‘valid’44, or ‘high quality’50. The trial was conducted in 1,231 hospital inpatients and found no evidence of a difference in PI incidence between patients assessed with either the Waterlow scale or Ramstadius tool compared with clinical judgment alone (RR 1.10, 95% CI 0.68, 1.81 for Waterlow and RR 0.79, 95% CI 0.46, 1.35 for Ramstadius). The trial further showed no evidence of a difference in patient management or in PI severity when using a risk assessment tool compared to clinical judgement.
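For readers less familiar with how a risk ratio and its confidence interval are derived, the following sketch shows the standard calculation on the log scale. The counts are hypothetical, chosen only for illustration; they are not the trial's data, which are reported here as RRs only. The function name `risk_ratio` is likewise an assumption of this sketch.

```python
import math


def risk_ratio(events_tool, n_tool, events_control, n_control, z=1.96):
    """Risk ratio with a Wald confidence interval computed on the log scale."""
    rr = (events_tool / n_tool) / (events_control / n_control)
    se_log_rr = math.sqrt(
        1 / events_tool - 1 / n_tool + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30/400 PIs with a risk assessment tool vs 28/400 with clinical judgement.
rr, lower, upper = risk_ratio(30, 400, 28, 400)
print(f"RR {rr:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

An interval that spans 1, as in both comparisons reported from the trial, is compatible with no difference in PI incidence between groups.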
The cluster randomised trial71 was considered to be of poor methodological quality in two reviews.29 47 The trial included 521 patients at a military hospital and compared nurse training with mandatory use of the Braden scale, to nurse training and optional use of the Braden scale, to no training. No evidence of a difference in PI incidence was observed between groups: incidence rates were 22%, 22% and 15% (p=0.38), for the three groups respectively.
In both reviews by Lovegrove and colleagues,45 50 an uncontrolled comparison study72 was included. The study assessed the clinical effectiveness of the Maelor scale,73 and was rated as high quality within the most recent review.50 Preventive strategies and PI prevalence were compared across two sites, an Irish hospital that used the Maelor scale (121 patients) and a Norwegian hospital that used nurses’ clinical judgement (59 patients). A higher rate of preventive strategies, as well as a lower PI prevalence (12% vs. 54%), was reported for the Irish hospital. However, these results are likely to be highly confounded by inherent differences in population and setting.
A non-randomised study by Gunningberg and colleagues74 was included in two reviews,33 69 one of which is reported in Table 4, and was considered to be of relatively high quality. The study was conducted in 124 patients in emergency and orthopaedic units and compared the use of a pressure ulcer risk alarm sticker for patients with a modified Norton score of <21 (indicating high-risk patients) to standard care. No significant difference in the incidence of pressure ulcers between the Norton scale and standard care groups was observed.
A non-randomised study75 conducted in 233 hospice inpatients was included in three reviews,29 33 69 one of which is reported in Table 4.69 The study met six of eight quality criteria used by Health Quality Ontario.69 Use of a modified version of the Norton scale (Norton modified by Bale), in conjunction with standardised use of preventive interventions based on risk score, was found to be associated with lower risk of pressure ulcers when compared with nurses’ clinical judgment alone (RR 0.11, 95% CI 0.03, 0.46). The lack of randomisation limits the reliability of this result, and review authors report that the modified Norton scale had not been validated.
Finally, a ‘before-and-after’ study76 of 181 patients in various hospital settings was included in two reviews.33 69 Health Quality Ontario considered the study to meet all quality criteria.69 Use of the Norton scale with additional training for staff was associated with significant differences in the number of preventative interventions prescribed compared to standard care (18.96 vs. 10.75, respectively). Preventative interventions were also introduced earlier in the intervention group (on day 1, 61% vs. 50%, P<0.002 for Norton and usual care, respectively). However, no significant difference in the incidence of PIs was detected between the groups.
DISCUSSION
This umbrella review summarises data from a total of 23 systematic reviews of studies evaluating the clinical utility of a total of 63 PI risk prediction tools. Despite the large number of available reviews, quality assessment using an adaptation of AMSTAR-2 suggested that the majority were conducted to a relatively poor standard or did not meet reporting standards for systematic reviews.14 26 Of the 15 items included in AMSTAR-2, only two (for accuracy reviews) and four (for effectiveness reviews) criteria were more consistently met (more than 60% of reviews scoring ‘Yes’). All other criteria were fully met by fewer than half of reviews. The primary studies included in the reviews were particularly poorly described in the accuracy reviews, making it difficult to determine exactly what was evaluated and in whom. In particular, the source of the data was poorly reported. Only one review33 explicitly restricted inclusion to accuracy estimates from external validations, and only one review29 described whether estimates were sourced from training or external data. The extent to which we could reliably describe and comment on the content of the reviews is limited and high-quality evidence for the accuracy and clinical effectiveness of PI risk prediction models may be lacking.
Prognostic accuracy of risk prediction models
Of the 16 reviews focused on the predictive accuracy of included models, only two used appropriate methods for both quality assessment and statistical synthesis of accuracy data36 40, one of which36 evaluated only the Braden scale. Only one review33 pre-specified the exclusion of studies reporting tool development only, one review restricted to “validated risk assessment instruments” only29, and none of the reviews discussed the importance of appropriate validation of prediction models. Only two reviews conducted meta-analyses at different cut-offs for determination of high risk29 36; the remaining reviews combined data regardless of the threshold used. Combining data across different thresholds to estimate summary sensitivity and specificity is discouraged as it yields clinically uninterpretable and non-generalisable estimates, because the estimates do not relate to a particular threshold.25
Results of meta-analyses suggested that risk prediction scales have moderate sensitivities and somewhat lower specificities, typically in the range of around 70% to 85% for sensitivity and as low as 30% to 40% for specificity for some tools. Without a detailed review of the primary study publications for these models, it is not possible to assess which, if any, of these risk assessment scales might outperform the others. Few studies directly comparing the accuracy of different tools appear to be available.
For the ML-based models, one review30 meta-analysed accuracy data by algorithm type. The results of the meta-analyses are not informative for clinical practice but may be a useful way of identifying which ML algorithms may be more suited to the data. Results suggested that specificities for random forest or decision tree models could reach 90% or above with associated sensitivities in the range of 66% to 72%; however, relatively wide confidence intervals around these summary estimates reflect considerable variation in model performance. Moreover, some of these estimates came from internal validations within model development studies, and may not be transferable to other settings.77 Authors should make it clear where accuracy estimates are derived from to avoid overinterpretation of results.
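The kind of labelling asked for here can be sketched as computing the same metric separately for each data source. The example below calculates the AUC as the proportion of (event, non-event) pairs that a model ranks correctly, once on development data and once on external data. The predicted risks are contrived purely to show the calculation and the function name `auc` is hypothetical; the numbers do not come from any included study.

```python
def auc(predicted_risk, developed_pi):
    """AUC as the share of (event, non-event) pairs ranked correctly; ties count half."""
    events = [r for r, e in zip(predicted_risk, developed_pi) if e]
    non_events = [r for r, e in zip(predicted_risk, developed_pi) if not e]
    concordant = 0.0
    for risk_event in events:
        for risk_non_event in non_events:
            if risk_event > risk_non_event:
                concordant += 1.0
            elif risk_event == risk_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Contrived predicted risks from one hypothetical model, labelled by data source.
development = ([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [True, True, True, False, False, False])
external = ([0.8, 0.5, 0.3, 0.6, 0.3, 0.1], [True, True, True, False, False, False])

for source, (risks, outcomes) in (("development", development), ("external", external)):
    print(f"{source} data: AUC {auc(risks, outcomes):.2f}")
```

Reporting the two figures side by side, rather than a single unlabelled AUC, lets readers judge how far performance in the development sample carries over to new patients.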
Clinical effectiveness of risk prediction scales
Prediction models, like any test used for diagnostic or prognostic purposes, require evaluation in the care pathway to identify the extent to which their use can impact on health outcomes.78 Of the 10 reviews assessing clinical effectiveness of PI risk prediction tools, the only primary studies suggesting potential patient benefits from the use of risk prediction tools72 75 76 were non-randomised and are likely to be at high risk of bias. In contrast, two randomised trials70 71 (both considered at high risk of bias by the Cochrane review47) suggest that use of structured risk assessment tools does not reduce the incidence of PIs. Effectiveness outcomes largely depend on the availability and efficacy of preventative measures, and conclusions regarding the clinical effectiveness of these risk assessment tools cannot be confidently drawn from the limited evidence available. All reviews included studies that assessed the use of risk assessment scales developed by experts, and no evidence is available evaluating the clinical effectiveness of empirically derived prediction models or ML algorithms.
Other existing evidence
Moore and colleagues47 recently updated their review (published after our search was conducted79) and reported no new randomised trials that assessed the effect of risk assessment tools on PI incidence.
We have separately reviewed9 available evidence for the development and validation of risk prediction tools for PI occurrence. Almost half (52/116, 45%) of available tools were developed using ML methods (as defined by review authors), 40% (46/116) were based on clinical expertise or unclear methods, and only 18 (16%) were identified as having used statistical modelling methods. The reviews varied in methodological quality and reporting; however, the reporting of prediction model development in the original primary studies appears to be poor. For example, across all prediction tools identified, the internal validation approach was unclear and unidentifiable for 70% (81/116) of tools, and only one review identified and included external validation studies (n=7 studies).
ML-based models may have potential for identifying those at risk of PI, as suggested by one review30 included in this umbrella review. However, it is important to consider the lack of transparency in reporting of model development methods and model performance, and the concerning lack of model validation in populations outside of the original model development sample.9
Strengths and limitations
We have conducted the first umbrella review that summarises the prognostic accuracy and clinical effectiveness of prediction models for risk of PI. We followed Cochrane guidance13 and used a highly sensitive search strategy designed by an experienced information specialist. Although we excluded non-English publications due to time and resource constraints, where possible these publications were used to identify additional eligible risk prediction models. To some extent, our review is limited by the use of AMSTAR-2 for quality assessment of included reviews, as AMSTAR-2 was not designed for assessing systematic reviews of diagnostic or prognostic studies. Although we made some adaptations, many of the existing and amended criteria relate to the quality of reporting of the reviews as opposed to methodological quality. There is scope for further work to establish criteria for assessing systematic reviews of prediction models.
The primary limitation of our study lies in the limited detail available on risk prediction tools and their performance within the included systematic reviews. To ensure comprehensive model identification, we adopted a broad definition of ‘systematic,’ potentially influencing the depth of information provided in the reviews, and the reporting quality in many primary studies contributing to these reviews may be suboptimal. Notably, excluding ML-based models, over half of the existing risk prediction tools were published prior to 2000, before the publication of original versions of reporting guidelines for diagnostic accuracy studies80 and risk prediction models.81
CONCLUSIONS
In conclusion, this umbrella review comprehensively summarises the prognostic accuracy and clinical effectiveness of risk prediction tools for developing PIs. The included systematic reviews used poor methodology and reporting, limiting our ability to reliably describe and evaluate their content. ML-based models demonstrated potential, with high specificity reported for some models. However, wide confidence intervals highlight the variability in current evaluations, and external validation of ML tools may be lacking. The prognostic accuracy of clinical scales and statistically derived prediction models varies substantially, with wide ranges of sensitivities and specificities, motivating further model development with high quality data and appropriate statistical methods.
Regarding clinical effectiveness, whether the use of risk prediction tools reduces PI incidence remains unclear due to the overall uncertainty and potential biases in available studies. This underscores the need for further research in this critical area, once promising prediction tools have been developed and appropriately validated. In particular, the clinical impact of newer ML-based models currently remains largely unexplored. Despite these limitations, our umbrella review provides valuable insights into the current state of PI risk prediction tools, emphasising the need for robust research methods to be used in future evaluations.
Data Availability
All data produced in the present work are contained in the manuscript and supplementary file.
Author Contributions
Conceptualisation: Bethany Hillier, Katie Scandrett, April Coombe, Tina Hernandez-Boussard, Ewout Steyerberg, Yemisi Takwoingi, Vladica Velickovic, Jacqueline Dinnes
Data curation: Bethany Hillier, Katie Scandrett, April Coombe, Jacqueline Dinnes
Formal analysis: Bethany Hillier, Katie Scandrett, Jacqueline Dinnes
Funding acquisition: Yemisi Takwoingi, Vladica Velickovic, Jacqueline Dinnes
Investigation: Bethany Hillier, Katie Scandrett, April Coombe, Yemisi Takwoingi, Jacqueline Dinnes
Methodology: Bethany Hillier, Katie Scandrett, April Coombe, Tina Hernandez-Boussard, Ewout Steyerberg, Yemisi Takwoingi, Vladica Velickovic, Jacqueline Dinnes
Project administration: Bethany Hillier, Yemisi Takwoingi, Jacqueline Dinnes
Resources: Bethany Hillier, Katie Scandrett
Supervision: Yemisi Takwoingi, Jacqueline Dinnes
Writing – original draft: Bethany Hillier, Katie Scandrett, April Coombe, Jacqueline Dinnes
Writing – review & editing: Bethany Hillier, Katie Scandrett, April Coombe, Tina Hernandez-Boussard, Ewout Steyerberg, Yemisi Takwoingi, Vladica Velickovic, Jacqueline Dinnes
Funding
This work was commissioned and supported by Paul Hartmann AG (Heidenheim, Germany). The contract with the University of Birmingham was agreed on the legal understanding that the authors had the freedom to publish results regardless of the findings.
YT, JD, BH, KS and AC are funded by the National Institute for Health and Care Research (NIHR) Birmingham Biomedical Research Centre (BRC). This paper presents independent research supported by the NIHR Birmingham BRC at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.
Conflicting Interests
I have read the journal’s policy and the authors of this manuscript have the following competing interests: VV is an employee of Paul Hartmann AG; ES and THB received consultancy fees from Paul Hartmann AG. All other authors received no personal funding or personal compensation from Paul Hartmann AG and have declared that no competing interests exist.
Acknowledgements
We would like to thank Mrs. Rosie Boodell (University of Birmingham, UK) for her help in acquiring the publications necessary to complete this piece of work.