Development and assessment of a machine learning tool for predicting emergency admission in Scotland
====================================================================================================

* James Liley
* Gergo Bohner
* Samuel R. Emerson
* Bilal A. Mateen
* Katie Borland
* David Carr
* Scott Heald
* Samuel D. Oduro
* Jill Ireland
* Keith Moffat
* Rachel Porteous
* Stephen Riddell
* Simon Rogers
* Ioanna Thoma
* Nathan Cunningham
* Chris Holmes
* Katrina Payne
* Sebastian J. Vollmer
* Catalina A. Vallejos
* Louis J. M. Aslett

## Abstract

Emergency admissions (EA), where a patient requires urgent in-hospital care, are a major challenge for healthcare systems. The development of risk prediction models can partly alleviate this problem by supporting primary care interventions and public health planning. Here, we introduce SPARRAv4, a predictive score for EA risk that will be deployed nationwide in Scotland. SPARRAv4 was derived using supervised and unsupervised machine-learning methods applied to routinely collected electronic health records from approximately 4.8M Scottish residents (2013-18). We demonstrate improvements in discrimination and calibration with respect previous scores deployed in Scotland, as well as stability over a 3-year timeframe. Our analysis also provides insights about the epidemiology of EA risk in Scotland, by studying predictive performance across different population sub-groups and reasons for admission, as well as by quantifying the effect of individual input features. Finally, we discuss broader challenges including reproducibility and how to safely update risk prediction models that are already deployed at population level.

Keywords
*   Emergency admission
*   Primary care
*   Machine learning

## Introduction

Emergency admissions (EA), where a patient requires urgent in-hospital care, represent deteriorations in individual health and are a major challenge for healthcare systems. For example, approximately 395,000 Scottish residents (≈1 in 14) had at least one EA between 1 April 2021 and 31 March 2022 [Public Health Scotland, 2022]. In total, around 600,000 EAs were recorded for these individuals, nearly 54% of all hospital admissions in that period, and they resulted in longer hospital stays (6.8 days average) compared to planned elective admissions (3.6 days average). Modern health and social care policies aim to implement proactive strategies [Rural Access Action Team, 2005], often by appropriate primary care intervention [McDonagh et al., 2000, Sanderson and Dixon, 2000, Coast et al., 1996]. Machine learning (ML) can support such interventions by identifying individuals at risk of EA who may benefit from anticipatory care. If successful, such interventions can lead to better patient outcomes and reduced pressures on secondary care (Figure 1A).

![Figure 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F1.medium.gif)

[Figure 1:](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F1)

Figure 1: Data and model fitting overview.
(A) Illustration of how SPARRA can support primary care intervention with the goal of improving patient outcomes. (B) Distribution of the number of input EHR entries (prior to exclusions) according to age, sex and SIMD deciles (1: most deprived; 10: least deprived). (C) Flow chart summarising data and model fitting pipelines.

A range of risk prediction models have been developed in this context [Rahimian et al., 2018, Lyon et al., 2007, Wallace et al., 2014, Bottle et al., 2006, Billings et al., 2006, Hippisley-Cox and Coupland, 2013]. However, transferability across temporal and geographical settings is limited due to differing demographics and data availability [Wallace et al., 2014]. Development of models in the setting in which they will be used is thus preferable to reapplication of models trained in other settings. In Scotland, the Information Services Division of the National Services Scotland (now incorporated into Public Health Scotland; PHS) developed SPARRA (Scottish Patients At Risk of Re-admission and Admission) — an algorithm to predict the risk of EA in the next 12 months. SPARRA was derived using national electronic health records (EHR) databases and has been in use since 2006. The current version of the algorithm (SPARRAv3) [Health and Social Care Information Programme, 2011] was deployed in 2012/13 and is calculated monthly by PHS for almost the entire Scottish population. Individual-level SPARRA scores can be accessed by general practitioners (GPs), helping them to plan mitigation strategies for individuals with complex care needs. Collectively, SPARRA scores may be used to estimate future demand, supporting planning and resource allocation. SPARRA has also been used extensively in public health research [Leckcivilize et al., 2021, Highet et al., 2014, Bajaj et al., 2016, Canny et al., 2016, Manoukian et al., 2021, Wallace et al., 2016].

In this paper we update the SPARRA algorithm to version 4 (SPARRAv4) using contemporary supervised and unsupervised ML methods. This represents a large scale ML risk score, fitted and deployed at national level, and widely available in clinical settings. We develop SPARRAv4 using EHRs collected for around 4.8 million (after exclusions) Scottish residents between 2013 and 2018. Among other variables, this includes data about past hospital admissions, long term conditions (e.g. asthma) and prescriptions. We use crossvalidation to evaluate the validity of SPARRAv4 and its stability over time. This shows an improvement of performance with respect to SPARRAv3 in terms of discrimination and calibration, including a stratified analysis across different subpopulations. We also perform extensive analyses to determine what reasons for emergency admission are predictable, and use Shapley values [Lundberg and Lee, 2017] to quantify the effect of individual input factors. Finally, we discuss some of the practical challenges that arise when developing and deploying models of this kind, including issues associated to updating risk prediction scores that are already deployed at population level.

Reproducibility is critical to ensure reliable application of ML in clinical settings [Mc-Dermott et al., 2021]. To provide a transparent description of our pipeline, this manuscript conforms to the TRIPOD guidelines [Collins et al., 2015] (Supplementary Table S1). Moreover, all code is publicly available at [github.com/jamesliley/SPARRAv4](http://github.com/jamesliley/SPARRAv4). This includes non-disclosive outputs used to generate all the figures and tables presented in this article.

## Results

### Data overview

The input data prior to any exclusions combines multiple national EHR databases held by Public Health Scotland for 5.8 million Scottish residents between 1 May 2013 and 30 April 2018 (Figure 1B; Supplementary Table S2; Supplementary Figure S1A), some of whom died during the observation period. These comprised 468 million records, comprising interactions with the Scottish healthcare system and deaths. The distribution of the number of available records exhibits sex- and age-specific patterns, including a female-specific mode around childbearing age. Overall, marginally more records are available for individuals in the most deprived areas (as measured by deciles of the 2016 Scottish Index of Multiple Deprivation (SIMD); [Scottish Government, 2016]), particularly within accidents & emergency and mental health hospital records. Two additional tables (Supplementary Table S2) containing historic data about long term conditions (LTC, back to 1981) and mortality records were also used as input. Prediction features derived from all this raw input data are listed in Supplementary Table S3 and Supplementary Table S4.

We selected three time cutoffs for model fitting (1 May and 1 December 2016, and 1 May 2017) leading to 17.4 million individual-time pairs, hereafter referred to as samples (Figure 1C). After exclusions (see Methods), the data comprise 12.8 million samples corresponding to 4.8 million individuals. Overall, the study cohort is slightly older, has more females, and is moderately more deprived than the general population (Table 1). The prediction target was defined as a recorded EA to a Scottish hospital or death in the year following the time cutoff (Supplementary Note S1). In total, 1,142,169 EA or death events (9%) were observed across all samples. This includes 57,183 samples for which a death was recorded (without a prior EA within that year) and 1,084,986 samples for which an EA was recorded (amongst those, 107,827 deaths were observed after the EA). As expected, the proportion of deaths amongst the observed events increases with age (Supplementary Figure S1B). Moreover, patients with an EA or death event (in at least one time cutoff) are, on average, older and more deprived than those without an event (Table 1).

View this table:
[Table 1:](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T1)

Table 1: Demographic summary for the different cohorts:
the whole Scottish population (approximately 5.8 million), those present in the input databases at least one time point (17,488,596 samples comprising 5,829,532 unique individuals), our study cohort after exclusions (12,866,084 samples comprising 4,835,428 unique individuals) and our study cohort after stratifying by event status (EA or death: 1,142,169 samples comprising 667,566 unique individuals; no EA or death: 11,723,915 samples comprising 4,670,756 unique individuals). Summary statistics were calculated using sample-level data. The EA or death cohort includes individual-time pairs for which the individual at least one EA or died during the year after the time. LTC denotes long-term conditions (e.g. epilepsy). Data for the Scottish population is from the 2011 Census [Office for National Statistics et al., 2011].

### Overall predictive performance

In held out test data, SPARRAv4 was effective at predicting EA, and outperformed SPARRAv3 on the basis of AUROC and AUPRC (Figure 2A-B). SPARRAv4 was also better calibrated, particularly for individuals with observed risk ≈0.5, who are less likely to be a priori known by GPs (Figure 2C). Whilst SPARRAv3 and SPARRAv4 scores were highly correlated, large discrepancies were observed for some individuals (Supplementary Figure S2). In individuals for whom *v*3 and *v*4 disagreed (defined as |*v*3 −*v*4 |*>* 0.1), we found that *v*4 was better-calibrated than *v*3 (Figure 2D).

![Figure 2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F2.medium.gif)

[Figure 2:](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F2)

Figure 2: Comparison of overall predictive performance between SPARRAv3 and SPARRAv4.
(A) ROC. (B) PRC. Lower sub-panels show differences in sensitivity and precision, respectively. Confidence intervals are negligible. (C) Calibration curves. (D) Calibration curves for samples in which |*v*4 *-v*3| *>* 0.1. Lower sub-panels show the difference between curves and the *y* = *x* line (perfect calibration). Confidence envelopes are pointwise (that is, for each *x*-value, not the whole curve). Predicted/true value pairs are combined across cross-validation folds in all panels for simplicity. (E) Difference in the number of individuals who had an event amongst individuals designated highest-risk by v3 and v4. The repeating pattern is a rounding effect of v3. (F) Difference in the number of highest-risk individuals to target to avoid a given number of admissions.

While the increases in AUROC and AUPRC may be small, the improvement provided by SPARRAv4 in terms of absolute benefit to population is substantial. Therefore its deployment at national scale is highly consequential. In particular, amongst the 50,000 individuals judged to be at highest risk by SPARRAv3, around 4,000 fewer individuals were eventually admitted that were amongst the 50,000 individuals judged to be at highest risk by SPARRAv4 (Figure 2E). For another perspective, if we simply assume that 20% of admissions are avoidable [value taken from Blunt, 2013], that avoidable admissions are as predictable as non-avoidable admissions, and that we wish to pre-empt 3,000 avoidable admissions by targeted intervention on the highest risk patients (the second assumption is conservative, since avoidable admissions are often predictable due to other medical problems). Then, by using SPARRAv4, we would need to intervene on approximately 1,500 fewer patients than if we were to use SPARRAv3 in the same way, in order to achieve the target of avoiding 3,000 admissions (Figure 2F).

Since SPARRAv4 comprises an ensemble of models (Machine Learning prediction methods), we also explored a breakdown of performance across the constituent models (Supplementary Table S5; Supplementary Figure S3). The ensemble had slightly better performance than the best constituent models (XGB and RF). Note that some constituent models (ANN, GLM, NB) had ensemble coefficients which were regularised to be vanishingly small, so in practice predictions for those models need not be computed when calculating SPARRAv4.

### Stratified performance of SPARRAv3 and SPARRAv4

To examine differences in performance more closely, we explored the performance of SPARRAv3 and SPARRAv4 across different patient subcohorts defined by age, SIMD deciles and the four subcohorts defined as part of SPARRAv3 development. Generally, we observed that SPARRAv4 had better predictive performance across all subcohorts (Figure 3A).

![Figure 3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F3.medium.gif)

[Figure 3:](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F3)

Figure 3: Stratified performance of SPARRAv3 and SPARRAv4.
(A) Performance of SPARRAv3 and SPARRAv4 in subcohorts defined by age, SIMD and the original subcohorts defined during SPARRAv3 development (Methods). Top: AUROC (red: SPARRAv3; black: SPARRAv4). Vertical bars denote plus/minus 3 standard deviations. Middle: AU-ROC increase for SPARRAv4 with respect to SPARRAv3. For context, bottom sub-panels show the proportion of samples with an event within each group. (B) Distribution of SPARRAv4 scores (in log-scale) for different group of samples. Black points indicate the associated medians. Groups were defined according to whether an event was observed (red violin plots) or, for those with an EA, based on the diagnosis recorded during the admission. (C) Distribution of time-to-event after the first time cutoff amongst cohorts of individuals who had an EA in the year following the time cutoff and had a SPARRAv4 score above a given cutoff.

### Conditional performance of SPARRAv4 by admission type and imminence

When exploring the distribution of SPARRAv4 scores according to the diagnosis that was assigned to the patient during admission (Supplementary Table S6) — that is, after prediction — we found that certain medical classes of admission were predicted disproportionately well (Figure 3B), namely predicting mental/behavioural, respiratory and endocrine/metabolic related admissions. Similarly, all cause mortality was also associated with high SPARRAv4 scores. As one would expect, we were less able to predict external causes of admissions (e.g., S21: open wound of thorax [World Health Organization, 2004]). Obstetric and puerperium-related admissions were particularly challenging to predict by SPARRAv4. When further analysing SPARRAv4 scores, we also found that it tends to better predict imminent admissions: individuals with high risk scores were more likely to have an EA near the start of the 1 year outcome period (Figure 3C). We did not use an absolute threshold to determine who is at high risk. Instead, we ranked individuals according to their scores and looked at those in the top part of the ranking (i.e. with the highest risk scores).

### Deployment scenario stability and performance attenuation

We next addressed two crucial aspects pertaining to practical usage of SPARRAv4. Firstly, we assess the durability of performance for a model trained once and employed to generate predictions at future times, confirming it does not deteriorate. This is the way in which SPARRAv4 will be deployed by PHS, generating new predictions each month but without repeated model updating, akin to SPARRAv3’s monthly use without update from 2013–2023. Secondly, we demonstrate that it is none-the-less necessary to update predictions despite the absence of model updates, since evolving patient covariates lead to the performance attenuation of any point-in-time prediction.

We firstly used a *static model M* (Methods) to predict risk at future time-points (i.e. new predictions are generated as the features are updated). *M* performed essentially equally well over time, with no statistically significant decrease in performance (adjusted p-values *>* 0.05, or improved performance with time for all comparisons of AUROCs; Supplementary Figure S4A-C). With stability under the deployment scenario confirmed, we also explored the distribution of scores over time. In line with expectations, the quantiles of scores generated by the static model increased as the cohort grew older (Supplementary Figure S4D). The mean risk scores of individuals in the highest centiles of risk at *t* decreased over time (Supplementary Figure S4F), suggesting that very high risk scores tend to be transient. The bivariate densities of time-specific scores (Figure 4A; Supplementary Figure S4G) also show lower scores to be more stable than higher scores, and that subjects ‘jump’ to higher scores (upper left in Figure 4A) more than they drop to lower scores (bottom right).

![Figure 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F4.medium.gif)

[Figure 4:](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F4)

Figure 4: Performance of a static model and static predictions.
(A) Density (low to high: white-grey-red-yellow) of scores generated using the static model *M* to predict EA risk at *t*1 (2 May 2015) and *t*2 (1 Dec 2015). The density is normalised to uniform marginal on the Y axis, then the X axis; true marginal distributions of risk scores are shown alongside in grey. (B-D) Performance of static scores calculated at *t* using *M* to predict risk for later time points. (B) ROC curves. Lower panel show differences in sensitivity with respect to *t*1. (C) PRC curves. Lower panel show differences in precision with respect to *t*1. (D) Calibration curves. Lower panel show the difference between observed and expected EA frequency.

Finally, we examined the behaviour of *static scores* (computed at *t* using *M*) to predict future event risk. We observed that the static scores performed reasonably well even 2-3 years after *t*, although discrimination and calibration was gradually lost (Figure 4B-D). More generally, we observe that scores fitted and calculated at a fixed timepoint had successively lower AUROCs for predicting EA over future periods (Supplementary Figure S4G). This affirms the need for updated predictions in deployment, despite the static model.

### Feature importance

The features with the largest mean absolute Shapley value (excluding SPARRAv3 and the features derived from the topic model) were age, the number of days since the last EA, the number of previous A&E attendances, and the number of antibacterial prescriptions (Table 2). Most features had non-linear effects (see e.g. Supplementary Figure S5A-B). For example, the risk contribution from age was high in infancy, dropping rapidly from infancy through childhood, then remaining stable until around age 65, and rising rapidly thereafter (Figure 5A). We also found a non-linear importance of SIMD (Figure 5B) and number of previous emergency hospital admissions (Supplementary Figure S5C).

View this table:
[Table 2:](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T2)

Table 2: Top 20 most important variables by mean absolute Shapley value (percentage scale).
Importance can be interpreted as the average percent added or subtracted to risk score due to this factor.

![Figure 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F5.medium.gif)

[Figure 5:](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F5)

Figure 5: Analysis of Shapley values.
Distribution of Shapley values by (A) age and (B) SIMD deciles (1: most deprived; 10: least deprived). (C) Number of additional years of age needed to match the difference in Shapley values between SIMD deciles 1 and 10. (D) ‘Effective ages’ calculated to match EA rates: for an (age, SIMD decile) pair, the age at mean SIMD with the equivalent EA rate.

We further investigated the contribution of SIMD by comparing Shapley values between features. We computed the mean difference in contribution of SIMD to risk score between individuals in the most deprived and least deprived SIMD decile areas, and the additional years of age which would contribute an equivalent amount. This was generally around 10-40 additional years (Figure 5C). In terms of raw admission rates, disparity was further apparent: individuals aged 20 in lowest SIMD decile areas had similar admission rates to individuals aged 70 in the 3 highest SIMD decile areas (Figure 5D).

When exploring the added value (in terms of AUROC) of including the features derived using the topic model (Supplementary Table S4), we observed slightly better performance than the model without such features (*p*-value = 3 *×* 10−29; Supplementary Figure S5E-F). Analogously to Figure 2E, we also computed the additional number of samples correctly identified as having an event amongst the top scores by the two models. Although the absolute difference in AUROC was small, we found that the use of topic features increased the number of EAs detected in the top 500,000 scores by around 200.

### Deployment

SPARRAv4 was developed in a remote data safe haven (DSH) environment [Public Health Scotland, 2020] without access to internet or modern collaboration tools (e.g. git version control). Whilst our analysis code and a summary of model outputs (e.g. AUROC values) could be securely extracted from the DSH, this was not possible for the actual trained model due to potential leaks of sensitive patient information [Jefferson et al., 2022]. This introduced reproducibility challenges, since the model had to be retrained in a different secure environment before it is deployed by PHS. In particular, this re-development outside the DSH had two distinct phases. Firstly, the raw data transformations (to convert the original databases into a format that is suitable for ML algorithms) were reproduced from scratch from the same source data. This allowed for a very detailed level of testing on the (sometimes complex) transformations. Once the output of the transformations matched perfectly between the DSH and the external environment for all features, the topic and predictive models were re-trained. The training process could not be exactly matched due to differing compute environments, package versions and training/validation split. However, after training the external models were validated by comparing the performance (via AUC) and the calibration with the results obtained within the DSH.

Another practical issue that arises when developing and deploying a new version of SPARRA is due to potential *performative prediction* effects [Perdomo et al., 2020]. Since SPARRAv3 is already visible to GPs (who may intervene to reduce the risk of high-risk patients), v3 can alter observed risk in training data used for v4, with v3 becoming a *‘victim of its own success’* [Lenert et al., 2019, *Sperrin et al., 2019]. This is potentially hazardous: if some risk factor R* confers high v3 scores prompting GP intervention (e.g., enhanced follow-up), then in the training data for v4, *R* may no longer apparently confer increased risk. Should v4 replace v3, some individuals would therefore have their EA risk underestimated, potentially diverting important anticipatory care away from them. This highlights a critical problem in the theory of model updating [Liley et al., 2021](Supplementary Note S2; Supplementary Figure S6A-D). As a practical solution, during deployment, GPs could receive the maximum between v3 and v4 scores. This would avoid the potential hazard of risk underestimation, at the cost of mild loss of score calibration Supplementary Figure S6E.

## Discussion

We used routinely collected EHR from around 5.8 million Scottish residents to develop and evaluate SPARRAv4, a risk prediction score that quantifies 1-year EA risk based on age, deprivation (using SIMD as a geographic-based proxy) and a wide range of features derived from a patient’s past medical history. SPARRAv4 constitutes a real-world use of ML, derived from population-level data and embedded in clinical settings across Scotland.

We demonstrated meaningful gains in predictive performance over the previous version of SPARRA. This arises from the use of more flexible ML methods (e.g. to capture non-linear patterns between features and EA risk) and the incorporation of features derived by a topic model which extracts more granular information (with respect to the manually curated features used by SPARRAv3) from past diagnoses and prescriptions data. The latter can be thought of as a proxy for multi-morbidity patterns. Our analysis also provides insights into the epidemiology of EA risk, highlighting predictable patterns in terms of EA type (as defined by the recorded primary diagnosis) and the imminence of EA. Moreover, we studied the contribution of each feature, revealing a complex relationship between age, deprivation and EA risk. Note, however, that we cannot assign a causal interpretation for any reported associations. In particular, the link between SIMD and EA risk is complex; SIMD includes a ‘health’ constituent [Scottish Government, 2016], and individuals in more-deprived SIMD decile areas (1: most deprived; 10: least deprived) miss more primary care appointments [Ellis et al., 2017].

One important strength of SPARRAv4 is its nationwide coverage, using existing healthcare databases without the need for additional bespoke data collection. This, however, prevents the use of primary care data (beyond community prescribing) as it is not currently centrally collected in Scotland. Due to privacy considerations, we were also unable to access geographic location data, precluding the study of potential differences between e.g. rural and urban areas and the use of a geographically separated test set [Wallace et al., 2014]. Limited data availability also limits a straightforward comparison of predictive performance (e.g. in terms of AUROC) with respect to similar models developed in England [Billings et al., 2006, Rahimian et al., 2018] (this is also complicated because of different model choices, e.g. [Rahimian et al., 2018] modelled time-to-event data but we used a binary 1-year EA indicator). For example, we do not have information about marital and smoking status, blood test results and family histories; all of which were found to be predictive of EA risk by [Rahimian et al., 2018].

Beyond model development and evaluation, our work also highlights broader challenges that arise in this type of translational project using EHR. In particular, as SPARRAv4 has the potential to influence patient care, we have placed high emphasis on transparency and reproducibility while ensuring compliance with data governance constraints. Providing our code in a publicly available repository will also allow us to transparently document future changes to the model (e.g. if any unwanted behaviour is identified during the early stages of deployment). SPARRAv4 also constitutes a real-world example in which potential performative effects need to be taken into account when updating an already deployed risk prediction model.

It is critical to emphasise that SPARRAv4 will not replace clinical judgement, nor does it direct changes to patient management made solely based on the score. Indeed, any potential interventions must be decided jointly by medical professionals and patients, balancing the underlying risks and benefits. Moreover, lowering EA risk does not necessarily entail overall patient benefit as e.g. long-term oral corticosteroid use in mild asthmatics would reduce EA risk, but the corticosteroids themselves cause an unacceptable cost of long-term morbidity [NICE guidelines, 2017].

Optimal translation into clinical action is a vital research area and is essential for quantifying the benefit of such scores in clinical practice. Indeed, any benefit is dependent on widespread uptake and the existence of timely integrated health and social care interventions, and identification of EA risk is only the first step in this pathway. Therefore, we will continue to collaborate to achieve successful deployment of SPARRAv4 and will carefully consider the feedback from GPs to improve the model and the communication of its results further (e.g. via informative dashboards). As the COVID-19 pandemic resolves, it will also be important to assess potential effects of dataset shift [Subbaswamy and Saria, 2020] due to disproportionate mortality burden in older individuals and long-term consequences of COVID-19 infections. In an era where healthcare systems are under high stress, we hope that the availability of robust and reproducible risk prediction scores such as SPARRAv4 (and its future developments) will contribute to the design of proactive interventions that reduce pressures on healthcare systems and improve healthy life expectancy.

## Methods

### SPARRAv3

SPARRAv3 [Health and Social Care Information Programme, 2011], deployed in 2012, uses separate logistic regressions on four subcohorts of individuals: frail elderly conditions (FEC; individuals aged *>* 75); long-term conditions (LTC; individuals aged 16-75 with prior healthcare system contact), young emergency department (YED; individuals aged 16-55 who have had at least one A&E attendance in the previous year) and under-16 (U16; individuals aged *<* 16). If an individual belongs to more than one of these groups, the maximum of the associated scores is reported. SPARRAv3 was fitted once (at its inception in 2012) with regression coefficients remaining fixed thereafter. Most input features were manually dichotomised into two or more ranges for fitting and prediction. The prediction target for SPARRAv3 is EA within 12 months. People who died in the pre-prediction period, and who therefore do not have an outcome for use in the analysis, are excluded. PHS calculated SPARRAv3 scores and provided them as input for the analysis described herein. Any GP in Scotland can access SPARRA scores after attaining information governance approval.

### Exclusion criteria

The exclusion criteria were applied per sample (defined as individual-time pairs; Figure 1C). Samples were excluded if: (i) they were excluded from SPARRAv3 [these largely correspond to individuals with no healthcare interactions or that were not covered by the four SPARRAv3 subcohorts; Health and Social Care Information Programme, 2011], (ii) when the individual died prior to the prediction time cutoff, (iii) when the SIMD for the individual was unknown, or (iv) those associated to individuals whose Community Health Index [CHI; ISD Scotland Data Dictionary, 2023] changed during the study period. The CHI number is a unique identifier which is used in Scotland for health care purposes.

### Feature engineering

A typical entry in the source EHR tables (Supplementary Table S2) recorded a single interaction between a patient and NHS Scotland (e.g. hospitalisation), comprising a unique individual identifier (an anonymised version of the CHI number), the date on which the interaction began (admission), the date it ended (discharge), and further details (diagnoses made, procedures performed). For each sample, entries from up to three years before the time cutoff were considered when building input features, except long-term condition (LTC) records, which considered all data since recording began in January 1981. A full feature list is described in Supplementary Table S3. This includes SPARRAv3 [Health and Social Care Information Programme, 2011] features, e.g. age, sex, SIMD deciles and counts of previous admissions (e.g. A&E admissions, drug-and-alcohol-related admissions). Additional features encoding time-since-last-event (e.g. days since last outpatient attendance) were included following findings in [Rahimian et al., 2018]. From community prescribing data, we derived predictors encoding the number of prescriptions of various categories (e.g. respiratory), extending the set of predictors beyond a similar set used in SPARRAv3. Similarly to SPARRAv3, we also derived the total number of different prescription categories, the total number of filled prescription items, and the number of British National Formulary (BNF) sections from which a prescription was filled [Prasad, 1994]. From LTC records, we extracted the number of years since diagnosis of each LTC (e.g. asthma), the total number of LTCs recorded, and the number of LTCs resulting in hospital admissions.

We additionally used a Latent Dirichlet Allocation (LDA)-based topic model [Blei et al., 2003] to derive further information from prescriptions and LTC data. We jointly modelled prescriptions and LTCs using 30 topics, considering samples as ‘documents’ and LTCs/prescriptions as ‘words’. This enabled a substantial reduction in feature dimensionality, given the number of LTCs/prescription factor levels. Using the map from documents to topic probabilities, we used derived topic probabilities as additional features in SPARRAv4.

### Machine Learning prediction methods

For SPARRAv4, we had no prior belief that any ML model class would be best, so considered a range of binary prediction approaches (hereafter referred to as constituent models). The following models were fitted using the *h2o* [LeDell et al., 2019] R package (version 3.24.0.2): an artificial neural network (ANN), two random forests (RF) (depth 20 and 40), an elastic net generalised linear model (GLM) and a naive Bayes (NB) classifier. The *xgboost* [Chen et al., 2019] R package (version 1.6.0.1) was used to train three gradient-boosted trees (XGB) models (maximum tree depth 3, 4, and 8). Hyper-parameter choices are described in Supplementary Note S3. SPARRAv3 was used as an extra constituent model.

Rather than selecting a single constituent model, we used an ensemble approach. Similar to Van der Laan et al. [2007], we calculated an optimal linear combination (*L*1-penalised regression, using the R package glmnet, version 4.1.4) of the predictions generated by each constituent model. Ensemble weights were chosen to optimise the area under an ROC curve (AUROC). Finally, we monotonically transformed the derived predictor to improve calibration by inverting the empirical calibration function (Supplementary Note S4).

### Data imputation

As all non-primary care interactions with NHS Scotland are recorded in the input databases, there was no missingness for most features. For ‘time-since-interaction’ type features, samples for which there was no recorded interaction were coded as twice the maximum lookback time. There was minor non-random missingness in topic features (∼ 0.8%) due to individuals in the dataset with no diagnoses or filled prescriptions. We used mean-value imputation in the ANN and GLM models (deriving mean values from training data only), used missingness to inform tree splits (defaults in [LeDell et al., 2019]) in RF, used sample-wise imputation in XGB (as per [Chen et al., 2019]) and dropped during fitting (default in [LeDell et al., 2019]) in NB (omitted missing values for prediction).

Particular care was required for features encoding total lengths of hospital stays. In some cases, a discharge date was not recorded, which could lead to an erroneous assumption of a very long hospital stay (from admission until the time cutoff). To address this, we truncated apparently spuriously long stays at data-informed values (Supplementary Note S5).

### Cross-validation

We fitted and evaluated SPARRAv4 using three-fold cross-validation (CV). This was designed such that all elements of the model evaluated on a test set were agnostic to samples in that test set. Individuals were randomly partitioned into three data folds (F1, F2 and F3). At each CV iteration, F1 and F2 were combined and used as a training dataset, F3 was used as a test dataset. The training dataset (F1+F2) was used to fit the topic model and to train all constituent models (except SPARRAv3, whose training anyhow pre-dates the data used here). The ensemble weights and re-calibration transformation were learned using F1 + F2, i.e. without using the test set from the test set (Supplementary Note S4).

### Predictive performance

Our primary endpoint for model performance was AUROC. We also considered area-under-precision-recall curves (PRC) and calibration curves. We plotted calibration curves using a kernelised calibration estimator (Supplementary Note S6).

For simplicity, figures show ROC/PRC that were calculated by combining all samples from the three *test* CV folds (that is, all predictions and observed outcomes were merged to draw a single curve). Quoted AUROC/AUPRC values were calculated as an average across the three *test* CV folds to avert problems from between-fold differences in models [Forman and Scholz, 2010]. For ease of comparison, we also used mean-over-folds to compute quoted AUROCs and AUPRCs for SPARRAv3, although the latter was not fitted to our data.

### Deployment scenario stability and performance attenuation

Using the same analysis pipeline as for the development of SPARRAv4, we trained a static model *M* to an early time cutoff (*t*=1 May 2014), and using one year of data prior to *t* to derive predictors (the restricted lookback is the only deviation from the actual model pipeline, due to limited temporal span of the training data).

We studied the performance of *M* as a *static model* to repeatedly predict risk at future time points, which mirrors the way in which PHS will deploy the model. To do this, we assembled test features from data 1 year prior to *t*1=1 May 2015, *t*2=1 Dec 2015, *t*3=1 May 2016, *t*4=1 Dec 2016, and *t*5=1 May 2017, applying *M* to predict EA risk in the year following each time-point. In this analysis, the comparison of the distribution of scores over time only considered the cohort of patients who were alive and had valid scores at *t*1, …, *t*5.

To ensure a fair comparison when evaluating the performance of *static predictions* (computed at *t* using *M*) to predict future event risk (at *t*1, …, *t*5), we only considered a subsample of 1 million individuals with full data across all time-points, selected such that global admission rates matched those at *t*.

### Feature importance

We examined the contribution of feature to risk scores at an individual level by estimating Shapley values [Lundberg and Lee, 2017] for each feature. For simplicity, this calculation was done using 20,000 randomly-chosen samples in the first cross-validation fold (F1). We treated SPARRAv3 scores as fixed predictors rather than as functions of other predictors.

We also assessed the added value of inclusion of topic-model derived features, which summarise more granular information about the previous medical history of a patient with respect to those included in SPARRAv3. For this purpose, we refitted the model to F2+F3 with topic-derived features excluded from the predictor matrix. We compared the performance of these models using F1 as test data. We compared the performance of predictive models with and without the features derived from the topic model by comparing AUROC values using DeLong’s test [DeLong et al., 1988].

## Data Availability

Raw data for this project are patient-level NHS Scotland health records, and are confidential. Due to the confidential nature of the data used, all analysis took place on remote 'safe havens', without access to internet, software updates or unpublished software. Information Governance training was required for all researchers accessing the analysis environment. Moreover, to avoid the risk of accidental disclosure of sensitive information, an independent team carried out statistical disclosure control checks to all data exports, including the outputs presented in this manuscript. All analysis code and co-ordinates required to reproduce our Figures are available in github.com/jamesliley/SPARRAv4

[https://github.com/jamesliley/SPARRAv4](https://github.com/jamesliley/SPARRAv4) 

## Supporting information

**Supplementary Table S1** Checklist for TRIPOD guidelines [Collins et al., 2015].

**Supplementary Table S2** Input data sources.

**Supplementary Table S3** Definition of input features for SPARRAv4.

**Supplementary Table S4** Exploration of contributors to each topic.

**Supplementary Table S5** Overall discrimination performance for each model.

**Supplementary Table S6** Definition of different admission types.

**Supplementary Figure S1** Extended data overview.

**Supplementary Figure S2** Density plot comparing SPARRAv3 and SPARRAv4 scores.

**Supplementary Figure S3** Calibration curves for SPARRAv4 model constituents.

**Supplementary Figure S4** Performance of a static model and static predictions used to predict risk at future time points.

**Supplementary Figure S5** Feature importance

**Supplementary Figure S6** Model updating in the presence of performative effects.

**Supplementary Note S1** Choice of prediction target for SPARRAv4.

**Supplementary Note S2** Model updating in the presence of performative effects.

**Supplementary Note S3** Hyperparameter choice for ML prediction methods.

**Supplementary Note S4** Details of the cross-validation procedure.

**Supplementary Note S5** Imputation of lengths of stay when discharge date was missing.

**Supplementary Note S6** Assessment of calibration.

## Code and data sharing

Raw data for this project are patient-level EHR, and are confidential. Due to the confidential nature of the data, all analysis took place on a remote “data safe haven”, without access to internet, software updates or unpublished software. Information Governance training was required for all researchers accessing the analysis environment. Moreover, to avoid the risk of accidental disclosure of sensitive information, an independent team carried out statistical disclosure control checks on all data exports, including the outputs presented in this manuscript. All analysis code and co-ordinates required to reproduce our Figures are available in [github.com/jamesliley/SPARRAv4](http://github.com/jamesliley/SPARRAv4). This manuscript conforms to the TRIPOD guidelines [Collins et al., 2015] (Supplement Supplementary Table S1).

## Ethics statement

The project was covered under National Safe Haven Generic Ethical Approval (favourable ethical opinion from the East of Scotland NHS Research Ethics Service).

## Conflicts of interest

The authors declare no conflicts of interest.

## 1 SUPPLEMENTARY TABLES

View this table:
[Table S1.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T3)

Table S1. TRIPOD guidelines and pages where discussed [Moons et al., 2015]

View this table:
[Table S2.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T4)

Table S2. Input data sources.
The data comprises national EHR Scottish databases between 1 May 2013 and 30 April 2018. Each record corresponds to a single interaction with the health system. SMR - Scottish Morbidity Records. Information about specific EHR tables is available at Public Health Scotland [2023] (SMR datasets), Public Health Scotland [2020a] (A&E2), Public Health Scotland [2020c] (System Watch) and Public Health Scotland [2020b] (PIS). The LTC table was derived by PHS from historic SMR01 tables with an admission between 01 January 1981 and 30 April 2018. The Deaths table was provided to PHS by National Records of Scotland (NRS) and is up to date until TBC. The timescale for the pre-prediction period is 3 years for PIS, SMR00, SMR01, System Watch, A&E2, SMR01E and SMR04. All LTC records up to 01 January 1981 are used in the pre-prediction period. The earlier end of the pre-prediction period is because the most recent month will not be generally available when running the predictions.

View this table:
[Table S3.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T5)

Table S3. Definition of input features for SPARRAv4.
Variable names match the names used in our analysis code.

View this table:
[Table S4.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T6)

Table S4. Exploration of the inferred topics.
Details of derived topics for topic model used for prediction in F1 (fitted to F2+F3). A topic model assumes that each ‘document’ (individual) in a ‘corpus’ (population) is associated with various ‘topics’ (roughly, illness categories) where each topic corresponds to a distribution over ‘words’ (ICD10 codes and medication types). We would expect that the 30 topics fitted to each fold roughly represent the major clusters of disease types which occur amongst those individuals. This tables shows the ‘words’ with the highest probability of membership in each topic (*>* 1%, where probabilities over all words sum to 100%). In each topic, words are ordered by decreasing probability of topic membership. ICD10 codes are italicised; medication types are not. Topics are ordered by decreasing importance (mean absolute Shapley value). We manually assigned labels to some topics which appear to code for particular disease types.

View this table:
[Table S5.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T7)

Table S5. Overall discrimination performance for SPARRAv4 and its constituent models.
Areas under ROC curves and PR curves by fold for each constituent predictor and ensemble. Columns ‘Coef.’ indicate estimated coefficients (weights) in the final ensemble. All standard errors for AUROCs are *<* 5 *×* 10*−*4 and for AUPRCs are *<* 8 *×* 10*−*4

View this table:
[Table S6.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/T8)

Table S6. Definition of different admission types.

## 2 SUPPLEMENTARY FIGURES

![Figure S1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F6.medium.gif)

[Figure S1.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F6)

Figure S1. Extended data overview.
Distribution of the number of input EHR entries (prior to exclusions) according to age, sex and SIMD deciles (1: most deprived; 10: least deprived) stratified by the input database. All sub-panels are drawn to the same scale. “Other” includes geriatric long stay (SMR01E) and urgent care monitoring (System Watch). (B) Distribution of target events by age stratified by event type.

![Figure S2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F7.medium.gif)

[Figure S2.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F7)

Figure S2. Density plot comparing SPARRAv3 and SPARRAv4 scores.
The test datasets used within each CV iteration were combined in order to generate this plot (i.e. all samples are included once). Joint density (low to high: white-grey-red-yellow) of individual SPARRAv3 and SPARRAv4 scores. The density is normalised to uniform marginal on the Y axis, then the X axis; true marginal distributions of risk scores are shown alongside in grey.

![Figure S3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F8.medium.gif)

[Figure S3.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F8)

Figure S3. Calibration curves for SPARRAv4 model constituents.
Estimates obtained for the test set within each CV iteration are shown in different colours (legend shown in top left panel). Bottom sub-panels show departure from perfect calibration.

![Figure S4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F9.medium.gif)

[Figure S4.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F9)

Figure S4. Performance of a static model and static predictions used to predict risk at future time points.
(A-C) A model *M* fitted to an early time point (*t*: 2 May 2014) was evaluated at later time points. (A) ROC curves. (B) PR curves. (C) Calibration curves. (D) Centiles (light blue) and deciles (dark blue) of risk scores (calculated using *M*) over time, across all individuals with data available at all time points. (E) Average score over time for groups of individuals defined by risk centiles (light blue) and deciles (dark blue) at the first *t* (2 May 2015). (F) Density (low to high: white-grey-red-yellow) of scores generated using the static model *M* to predict EA risk at *t*1 (2 May 2015) and *t*3 (1 May 2016). The density is normalised to uniform marginal on the Y axis, then the X axis; true marginal distributions of risk scores are shown alongside in grey. Plots for different time points combinations are available at [https://github.com/jamesliley/SPARRAv4/tree/master/Analysis/time_attenuation/cohort/crosstime](https://github.com/jamesliley/SPARRAv4/tree/master/Analysis/time_attenuation/cohort/crosstime). (G) AUROCs for predictions calculated at each time point (based on *M*) for prediction in subsequent time points.

![Figure S5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F10.medium.gif)

[Figure S5.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F10)

Figure S5. Feature importance.
(A-B) Two examples of non-linear feature importance as measured by mean Shapley values (vertical lines show plus/minus one standard deviation). (C) Distribution of Shapley values for the number of previous elective and emergency admissions. (D-E) Comparison of predictive performance with and without topic-model derived features. (D) ROC. (E) PRC. (F) Calibration curve. For (D-E), bottom sub-panels show differences in sensitivity and specificity, respectively. In (F) bottom sub-panel shows difference with respect to perfect calibration.

![Figure S6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/28/2021.08.06.21261593/F11.medium.gif)

[Figure S6.](http://medrxiv.org/content/early/2023/09/28/2021.08.06.21261593/F11)

Figure S6. Model updating in the presence of performative effects.
(A-D) Causal structure for the training and deployment of SPARRAv3 and SPARRAv4. *X**i* represents covariates for a patient-time pair; *v*3(*f it*)*/v*4(*f it*) and *v*3(*X**i*)*/v*3(*X**i*) represent the fitting and deployment of v3 and v4 respectively. (A) Training setting for SPARRAv3. (B) Training setting for SPARRAv4. (C) Deployment setting if SPARRAv4 were to naively replace SPARRAv3. (D) Deployment setting in which SPARRAv4 is used as an adjuvant to SPARRAv3. (E) Comparison of predictive performance between SPARRAv3, SPARRAv4 and the maximum between both scores.

## 3 SUPPLEMENTARY NOTES

### 3.1 Choice of prediction target for SPARRAv4

The primary target for SPARRA is to predict whether an individual will experience an EA within 12 months from the prediction cutoff. A problem arises due to the deaths during the follow-up year for which the target may be unknown (e.g. if someone died within 6 months, without a prior EA). Broadly, there are four options for how to treat such individuals during model training and testing:

1.  Exclude them from the dataset

2.  Treat them according to whether they had an emergency admission before they died

3.  Treat them as no admission, or

4.  Treat them as an admission

It would also be possible to code death in follow-up differentially; for instance, coding in-hospital death as EA and in-community death as exclusions or non-EA. Our choice not to code all deaths identically is in the interests of non-maleficence. If an individual is at risk of imminent death in the community they will typically be admitted to hospital if it is possible to react in time, with a possible exemption if this is not in their best interests.

Option 1 would exclude the most critically ill individuals from the dataset and hence was discarded. Option 2 would effectively mean such individuals have a follow-up time less than a year, and would classify individuals who died without a hospital admission as having had a ‘desirable’ outcome. Option 3 would effectively classify death as a ‘desirable’ outcome, so we avoided it. The consequences from coding community deaths as non-EA would be severe, as it could mean that healthier individuals at risk of sudden death are either coded as non-EA or excluded from the dataset, potentially leading to inappropriately low scores being assigned to these individuals. This could draw treatment away from individuals in high need. Instead, option 4 allows the general description of the target as ‘a catastrophic breakdown in health’. In this case, our model would not be able to distinguish community deaths from emergency admissions: we may assign high ‘EA’ scores to the very old and terminally ill, when in fact these individuals may be treated in the community rather than admitted. The potential harm from this option is small. It could mean that such individuals are excessively treated rather than palliated, but since palliation over treatment is an active decision [Romo et al., 2017] and such individuals are generally known to be high-risk it is unlikely that the SPARRA score will adversely affect any decisions in this case. As the philosophy of the SPARRA score is to avert breakdowns in health, of which death can be considered an example, we decided to use a composite prediction target (EA or death within 12 months) which is consistent with option 4.

### 3.2 Model updating in the presence of performative effects

We aim to produce the SPARRA score to accurately estimate EA risk over a year under normal medical care. In other words, the score should represent the EA risk if GPs do not already have access to such a risk score. Because GPs see a SPARRA score (SPARRAv3) and may act on it, the observed risk may be lower than predicted - the score may become a ‘victim of its own success’ [Lenert et al., 2019, Sperrin et al., 2019] due to performative effects Perdomo et al. [2020]. Unfortunately, since the SPARRAv3 score is widely available to Scottish GPs, and may be freely acted on, we cannot assess the behaviour of the medical system in its absence. This is potentially hazardous [Liley et al., 2021].

Formally, at a given fixed time, for each individual, the value of ‘EA in the next 12 months’ is a Bernoulli random variable. The probability of the event for individual *i* is conditional on a set of covariates *X**i* derived from their EHR. We denote *v*3(*X**i*), *v*4(*X**i*) the derived SPARRAv3 and SPARRAv4 scores as functions of covariates, and assume a causal structure shown in Supplementary Figure S6 (for simplicity, we assume there are no unobserved confounders but the same argument applies in their presence). With no SPARRA-like predictive score in place, there is only one causal pathway *X**i* → *EA*. It is to this system (coloured red) that *v*3 was fitted. Here, *v*3(*X**i*) estimates the ‘native’ risk *Pr*(*EA*| *X**i*) (ignoring previous versions of the SPARRA score, which covered *<* 30% of the population). Although *v*3(*X**i*) is determined entirely by *X**i*, the act of distributing values of *v*3(*X**i*) to GPs opens a second causal pathway from *X**i* to *EA* (Supplementary Figure S6) driven by GP interventions made in response to *v*3(*X**i*) scores. It is to this system (coloured red) that SPARRAv4 is fitted. Hence, *v*4(*X**i*) is an estimator of *Pr*(*EA* | *X**i*, *v*3(*X**i*)), a ‘conditional’ risk after interventions driven by *v*3(*X**i*) have been implemented.

If SPARRAv4 naively replaced SPARRAv3 (Supplementary Figure S6), we would be using *v*4(*X**i*) to predict behaviour of a system different to that on which it was trained (Supplementary Figure S6). To amend this problem, we propose to use SPARRAv4 in *conjunction* with SPARRAv3 rather than to completely replace it (Supplementary Figure S6). Ideally, GPs would be given *v*3(*X**i*) and *v*4(*X**i*) simultaneously and asked to *firstly* observe and act on *v*3(*X**i*), *then* observe and act on *v*4(*X**i*), thereby only using *v*4(*X**i*) as per Supplementary Figure S6. This is impractical, so instead, we propose to distribute a single value (given by the maximum between *v*3(*X**i*) and *v*4(*X**i*)), avoiding the potential hazard of risk underestimation, at the cost of mild loss of score calibration (Supplementary Figure S6).

### 3.3 Hyperparameter choice for ML prediction methods

We used a range of constituent models. The h2o [LeDell et al., 2019] R package (version 3.24.0.2) was used to train ANN, RF, GLM and NB models. The xgboost [Chen et al., 2019] R package (version 1.6.0.1) was used to train the XGB models. Unless otherwise specified, hyperparameters were set as the software defaults. When tuned, hyperparameter values were chosen to optimise the default objective functions implemented for each method: log-loss or the ANN, RFs and GLM, likelihood for the NB model; and a logistic objective for the XGB trees. In all cases, hyperparameters were determined by randomly splitting the relevant dataset into a training and test set of 80% and 20% of the data respectively. Details for each method are provided below.

#### 3.3.1 SPARRAv3

SPARRAv3 scores were calculated by PHS using their existing algorithm Health and Social Care Information Programme [2011].

#### 3.3.2 Artificial neural network (ANN)

We used a training dropout rate of 20% to reduce generalisation error. We optimised over the number of layers (1 or 2) and the number of nodes in each layer (128 or 256).

#### 3.3.3 Random forest (RF)

We fitted two RF: one had maximum depth 20 and 500 trees, and the other had maximum depth 40 and 50 trees (both taking a similar time to fit).

#### 3.3.4 Gradient-boosted trees (XGB)

We fitted three boosted tree models with three maximum depths: 3, 4, and 8. For the deeper-tree model, we set a low step size shrinkage *η* = 0.075 and a positive minimum loss reduction *γ* = 5 in order to avoid overfitting. In the other two models, we used default values of *η* = 0.3, *γ* = 0.

#### 3.3.5 Naive Bayes (NB)

The only hyperparameter we tuned was a Laplace smoothing parameter, varying between 0 and 4.

#### 3.3.6 Penalised Generalised linear model (GLM)

We optimised *L*1 and *L*2 penalties (an elastic net), considering total penalty (*L*1 + *L*2) in 10−{1,2,3,4,5*}*, and a ratio *L*1*/L*2 in {0, 0.5, 1*}*.

### 3.4 Model re-calibration

We applied a monotonic transformation to optimise the calibration of the predictions generated by the ensemble. Given a predicted value ![Graphic][1]</img> (for ease of notation we do not explicitly include its dependency on the input features *X*) we defined a transformation *m*(·) to optimise calibration, essentially using isotonic regression. The latter was derived using the following procedure.

Fitst, we defined an empirical calibration function for an estimator ![Graphic][2]</img> of *Y*|*X* : ![Formula][3]</img> 

We then found *a, b* such that the mean and mode of ![Graphic][4]</img> were approximately correctly calibrated; that is, ![Graphic][5]</img> for *y ∈*{mean![Graphic][6]</img>, mode![Graphic][7]</img>}, and scaled *a, b* such that 0 ![Graphic][8]</img>. Across an evenly spaced grid *G* of 100 *y*-values we computed the function: ![Formula][9]</img>  using the cumulative maximum of *CAL*(·) to ensure *c*(·) is non-decreasing, and adding a linear term to ensure *c*(·) is increasing. We extended the domain of *c*(·) to [0, 1] using piecewise linear interpolation, and defined our calibrating transform *m*(·) as the inverse of *c*(·): ![Formula][10]</img> 

The transformation above was optimised using by further splitting the training set (F1+F2) within our 3-fold cross-validation (CV) procedure (we use F1, F2 and F3 to denote each fold). For each CV-fold, the following steps were performed:

1.  Train all constituent models using F1 (except SPARRAv3, for which PHS provided the scores).

2.  Each constituent model was then used to generate predictions for samples in F2.

3.  Given those predictions, ensemble weights were inferred via 10-fold CV within F2.

4.  Using the previously calculated predictions and ensemble weights, the parameters *a* and *b* were chosen to optimise calibration in F2.

5.  The optimal ensemble weights and calibration transformation parameters (*a* and *b*) were then used as fixed constants when training the model in the combined F1+F2 dataset.

Note that, due to computational constrains, the topic model was not retrained within the above procedure. Instead, a pre-trained topic model (using F1+F2 as a combined dataset) was used to generate features to be used in step 1.

### 3.5 Imputation of lengths of stay when discharge date was missing

Some of our predictors concerned lengths-of-stay; that is, total days spent in hospital in the pre-prediction period (elective\_bed\_days, emergency\_bed_days, and other_bed_days; see Table S3). In general, these were calculated by finding all stays listed for a given individual, subtracting the admission date from the discharge date for each stay, and summing the results across all stays. However, for some hospital stays, no discharge date was present in the source tables. In some cases, this was due to the individual still being in hospital at the time cutoff, but in others was evidently due to the discharge date simply not being recorded; we identified several individuals who were admitted with no discharge date who had evidence of community activity during the time they were supposedly in the hospital. To manage this, we used an imputation procedure for hospital stays in which the discharge date was not recorded. When we see an individual at a time cutoff *t* with admission date *d* and no discharge date, we have options of:

1.  Do not count this admission towards the total length of stays; that is, count the stay length as 0 days for that admission. This will under-estimate the total length of stay.

2.  Count time *t* −*d* towards the total length of stay. Effectively this imputes the discharge date using the time cutoff. This could lead to incorrect assumptions of very long hospital stays for individuals; indeed, since the pre-prediction period is three years, the mean assumed hospital stay length for such patients would be in excess of eighteen months. This is likely to over-estimate the total length of stay.

3.  Count some arbitrary time *t* towards the total length of stay. Depending on the value of *t*, the total length of stay may be under- or over-estimated.

All of these options coult potentially decrease the usefulness of these variables by artificially inflating (or deflating) the predicted EA risk. As a compromise, we decided to use ![Formula][11]</img>  as the length of stay for admissions with a missing discharge date. Effectively, this strategy uses *t* as a default *minimum length* for stays with missing discharge date.

To choose *t*, we use an empirical Bayes-optimal decision rule. Let *E* be the event that the discharge time for a given admission is not recorded. We model the time *t* −*d* as a (discrete) random variable *X* with a mixture distribution depending on *E*. We want to choose *t* so that *P*(*E*|*X* = *x*) *≥* 1*/*2 if and only if *x ≥ t*. We set ![Formula][12]</img>  that is; if the discharge time is recorded (in which case the individual is genuinely still in hospital at time *t*), we have some distribution of true lengths of stay, whereas if the discharge time is not recorded, the time *t* −*d* has an equal probability of being anywhere between one day and three years. ![Formula][13]</img> 

Given estimates of *q* and *f* (·), to find *t* we may set this expression to 1*/*2 and solve for *x*.

In order to estimate *q* and *f* with ![Graphic][14]</img> and ![Graphic][15]</img>, we consider the population P of admissions (not individuals) where the admission date is between May 2013 and May 2014. We then estimate

![Graphic][16]</img> = proportion of *P* with no recorded discharge date or discharged after May 2016

![Graphic][17]</img> = proportion of *P* with recorded discharge date before May 2016 with length of stay *x*

We use this population of admissions so as to avoid data leakage, since these are prior to the earliest time cutoff (May 2016) used in fitting the model. This is also our rationale for treating individuals who were discharged post-May 2016 the same as having no recorded discharge date: we cannot use this information without data leakage. However, we note that the number of individuals with genuine *>* 2 year hospital stays is very small.

Following this procedure, estimated values of *t* are 26, 19 and 6 for emergency\_bed\_days, elective\_bed\_days and other\_bed\_days, respectively.

### 3.6 Assessment of calibration

We use an estimator for calibration broadly based on the Nadaraya-Watson kernel estimator [Nadaraya, 1964, Watson, 1964]. We re-derive several properties (consistency, bias) to highlight their interpretation in our context.

We assume in general that, for IID predictor/outcome pairs (*X**i*,*Y**i*) ∼ (*X,Y*), *i ∈* 1..*n*, and an optimal predictor function *p**opt*, we have ![Formula][18]</img>  noting that this implies ![Formula][19]</img> 

We want to estimate *p**opt* (*X*).

Since we only observe *Y* = 1 or *Y* = 0, we must estimate *E*[*Y p*(*X*) = *z*] as some kind of average of *Y* about observed values *p*(*X*) close to *z*. A routine way to do this is to use ‘reliability diagrams’ [Bröcker and Smith, 2007] in which we bin values of *p*(*X*) and estimate *E*(*Y*|*p*(*X*)) in each bin.

Since for small bin sizes there may be few or no values of *p*(*X*) in some bins, we use a kernel estimate *ĉ**p*(*z*) of *c**p*(*z*) = *E*[*Y*|*p*(*X*) = *z*]: ![Formula][20]</img>  where *K**δ* : (0, 1)2 → ℝ + is some distance-measuring kernel with width *δ*. We avoid the simpler estimate of the *K**δ* -weighted mean of *Y**i*s for reasons shown below. We note the following:

Proposition 1.
*If p*(*X*) *has Lesbegue-integrable positive density on* (0, 1), *K*(*z, x*) *and c**p*(*x*) *are Lesbegue-integrable functions of x for fixed z >* 0, *and the kernel ‘narrows with δ’ so* ![Formula][21]</img> 

*then* ![Graphic][22]</img> *becomes a consistent estimator of c*(*z*) *as δ* → 0

*Proof*. From Slutsky’s lemma, the law of total expectation and the strong law of large numbers ![Formula][23]</img>  □

We note that ![Graphic][24]</img>is not generally consistent if *δ >* 0. However, the inconsistency is not severe: we note

Proposition 2.
*If, in addition to the above, K**δ* (*x, z*) = *K**δ* (*x*−*z*) *is a symmetric density with second moment δ and negligible moments of higher order, and the densities of p*(*X*) *and c**p*(*X*) *are twice differentiable at z, then ĉ**p*(*z*) → *c**p*(*z*) + *O*(*δ* 2)

*Proof*. We have ![Formula][25]</img>  noting the symmetry of *K**δ*. If we replace *c**p*[*p*(*X*)] with *p*(*X*), the expectation is *z f**p*(*X*)(*z*) + *O*(*δ* 2), and the result follows from the first part of 9. □

Remark 1.
*In the ideal case where c**p*(*z*) = *z (that is, our model is perfectly calibrated) estimator 8 is consistent even when δ >* 0, *whereas the apparently simpler asymptotically consistent (as δ* → 0*) estimator of a weighted sum of Y**i**’s:* ![Formula][26]</img> 

*is not*.

Finally, we note the following:

Proposition 3.
*Under the assumptions above, with fixed X**i*, *the bias of ĉ**p*(*z*) *is* ![Formula][27]</img> 

*where B*(*X**i*, *z*) = *p*(*X**i*)*c**p*(*z*) −*zc**p*(*p*(*X**i*)).

*Proof*. With fixed *X**i* ![Formula][28]</img>  □

Remark 2.
*This enables straightforward evaluation of bounds on bias given bounds on the form of c**p*. *The estimator ĉ**p* *is unbiased if c**p*(*x*) = *kx for some k, since B*(*X**i*, *z*) *≡* 0.

Remark 3.
*An alternative way to draw a kernelised calibration curve is to simply plot a parametric curve* ![Formula][29]</img> 

*which, for each t, is an only-slightly biased estimate of some point z, c**p*(*z*). *If a rectangular kernel is used, this is equivalent to binning values of p*(*X**i*) *[Bröcker and Smith, 2007]. However, this method does not generally give a curve across the entire range of p*(*X**i*).

It is straightforward to estimate ![Formula][30]</img>  where the approximation is exact if *c**p*(*z*) = *z*. Together with an estimate of maximum absolute bias *b**z* at *z*, this enables estimates of conservative confidence intervals on *ĉ**p*(*z*) at level 1 −*α* : ![Formula][31]</img> 

In all plots in this paper, we bounded bias under the assumption that there existed *k* such that | *c**p*(*z*) *kz*| *< z*2*/*10.

The calibration estimator derived here is demonstrated in an R script sparra_calibration.R available with the attached R code for this manuscript.

## Acknowledgements

The authors note that this project’s success was entirely contingent on close co-operation between the Alan Turing Institute and PHS. We thank all individuals involved in primary care in Scotland for the continued support of the SPARRA project and the Public Benefit and Privacy Panel for Health and Social Care (study number 1718-0370) for Information Governance approval on behalf of the Health Boards in NHS Scotland.

All author contributions were significant and essential to the completion of this work. Author contributions were as follows: Manuscript preparation: JL, SRE, BAM, SJV, CAV, LJMA, IT; Project initiation: SJV, CAV, LJMA, CH; Model design: JL, GB, SJV, CAV, LJMA; Code and scripts: JL, GB, LJMA; NC; IT; SDR; Code review and checking: SRE, IT; SDR; Setup of computational system: GB, LJMA; Data access management: DC, RP; EHR access: KB, DC, JI, RP, SO, SR; Public health input: KB, DC, SO, JI, RP, SR; Medical input: JL, BAM, KM; Core planning group: JL, GB, SRE, BAM, KB, DC, JI, KM, RP, SJV, CAV, LJMA; Logistical and legal oversight of project: SH, KP.

Computing for this project was performed in the Scottish National Safe Haven (NSH),which is commissioned by eDRIS, Public Health Scotland from EPCC, based at The University of Edinburgh. The authors would like to acknowledge the support of the eDRIS Team for their involvement in obtaining approvals, provisioning and linking data and the use of the secure analytical platform within the National Safe Haven.

We thank the Alan Turing Institute, PHS, the MRC Human Genetics Unit at the University of Edinburgh, Durham University, University of Warwick, Wellcome Trust, Health Data Research UK, and King’s College Hospital, London for their continuous support of the authors. JL, IT, CAV and LJMA were partially supported by Wave 1 of The UKRI Strategic Priorities Fund under the EPSRC Grant EP/T001569/1, particularly the “Health” theme within that grant and The Alan Turing Institute; JL, IT, BAM, CAV, LJMA and SJV were partially supported by Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England), the devolved administrations, and leading medical research charities; SJV, NC and GB were partially supported by the University of Warwick Impact Fund. SRE is funded by the EPSRC doctoral training partnership (DTP) at Durham University, grant reference EP/R513039/1; LJMA was partially supported by a Health Programme Fellowship at The Alan Turing Institute; CAV was supported by a Chancellor’s Fellowship provided by the University of Edinburgh.

For the purpose of open access, the author has applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising from this submission.

## Footnotes

*   This version of the manuscript has been updated in response to the discovery of minor errors in the raw data used to fit the models. Conclusions have not changed. The paper has also been edited for aesthetics and readability.

*   Received September 23, 2023.
*   Revision received September 28, 2023.
*   Accepted September 28, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.   N Bajaj,  S Jauhar, and  J Taylor. Scottish patients at risk of readmission and admissionmental health (SPARRA MH) case study of users and non-users of a national information source. Health Syst Policy Res, 3:3, 2016.
    
    
2.   John Billings,  Jennifer Dixon,  Tod Mijanovich, and  David Wennberg. Case finding for patients at risk of readmission to hospital: development of algorithm to identify high risk patients. BMJ, 333(7563):327, 2006.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjEyOiIzMzMvNzU2My8zMjciO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8wOS8yOC8yMDIxLjA4LjA2LjIxMjYxNTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

3.   David M Blei,  Andrew Y Ng, and  Michael I Jordan. Latent Dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.5555/944919.944937&link_type=DOI) 

4.   Ian Blunt. Focus on preventable admissions. London: Nuffield Trust, 2013.
    
    
5.   Alex Bottle,  Paul Aylin, and  Azeem Majeed. Identifying patients at high risk of emergency hospital admissions: a logistic regression analysis. Journal of the Royal Society of Medicine, 99(8):406–414, 2006.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1258/jrsm.99.8.406&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=16893941&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 

6.   Anne Canny,  Frances Robertson,  Peter Knight,  Adam Redpath, and  Miles D Witham. An evaluation of the psychometric properties of the indicator of relative need (IoRN) instrument. BMC geriatrics, 16(1):1–10, 2016.
    
    
7.   Tianqi Chen,  Tong He,  Michael Benesty,  Vadim Khotilovich,  Yuan Tang,  Hyunsu Cho,  Kailong Chen,  Rory Mitchell,  Ignacio Cano,  Tianyi Zhou,  Mu Li,  Junyuan Xie,  Min Lin,  Yifeng Geng, and  Yutian Li. xgboost: Extreme Gradient Boosting, 2019. URL [https://CRAN.R-project.org/package=xgboost](https://CRAN.R-project.org/package=xgboost). xR package version 0.90.0.2.
    
    
8.   Joanna Coast,  Abby Inglis, and  Stephen Frankel. Alternatives to hospital care: what are they and who should decide? BMJ, 312(7024):162–166, 1996.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MzoiYm1qIjtzOjU6InJlc2lkIjtzOjEyOiIzMTIvNzAyNC8xNjIiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8wOS8yOC8yMDIxLjA4LjA2LjIxMjYxNTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

9.   Gary S Collins,  Johannes B Reitsma,  Douglas G Altman, and  Karel GM Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (tripod): the tripod statement. Journal of British Surgery, 102(3):148–158, 2015.
    
    
10.  Elizabeth R DeLong,  David M DeLong, and  Daniel L Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, pages 837–845, 1988.
    
    
11.  David A Ellis,  Ross McQueenie,  Alex McConnachie,  Philip Wilson, and  Andrea E Williamson. Demographic and practice factors predicting repeated non-attendance in primary care: a national retrospective cohort analysis. The Lancet Public Health, 2(12): e551–e559, 2017.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S2468-2667(17)30217-7&link_type=DOI) 

12.  George Forman and  Martin Scholz. Apples-to-apples in cross-validation studies: pitfalls in classifier performance measurement. Acm Sigkdd Explorations Newsletter, 12(1):49–57, 2010.
    
    
13. Health and Social Care Information Programme. A report on the development of SPARRA version 3 (developing risk prediction to support preventative and anticipatory care in Scotland), 2011. [https://www.isdscotland.org/Health-Topics/Health-and-Social-Community-Care/SPARRA/2012-02-09-SPARRA-Version-3.pdf](https://www.isdscotland.org/Health-Topics/Health-and-Social-Community-Care/SPARRA/2012-02-09-SPARRA-Version-3.pdf), Accessed: 6-3-2020.
    
    
14.  Gill Highet,  Debbie Crawford,  Scott A Murray, and  Kirsty Boyd. Development and evaluation of the supportive and palliative care indicators tool (SPICT): a mixed-methods study. BMJ supportive & palliative care, 4(3):285–290, 2014.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6OToiYm1qc3BjYXJlIjtzOjU6InJlc2lkIjtzOjc6IjQvMy8yODUiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8wOS8yOC8yMDIxLjA4LjA2LjIxMjYxNTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

15. Julia Hippisley-Cox and Carol Coupland. Predicting risk of emergency admission to hospital using primary care data: derivation and validation of QAdmissions score. BMJ open, 3 (8):e003482, 2013.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoiYm1qb3BlbiI7czo1OiJyZXNpZCI7czoxMToiMy84L2UwMDM0ODIiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8wOS8yOC8yMDIxLjA4LjA2LjIxMjYxNTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

16. ISD Scotland Data Dictionary. CHI - Community Health Index, 2023. [https://www.ndc.scot.nhs.uk/Dictionary-A-Z/Definitions/index.asp?ID=128](https://www.ndc.scot.nhs.uk/Dictionary-A-Z/Definitions/index.asp?ID=128), Accessed: 17-3-2023.
    
    
17.  Emily Jefferson,  James Liley,  Maeve Malone,  Smarti Reel,  Alba Crespi-Boixader,  Xaroula Kerasidou,  Francesco Tava,  Andrew McCarthy,  Richard Preen,  Alberto Blanco-Justicia, et al. GRAIMATTER green paper: Recommendations for disclosure control of trained machine learning (ML) models from trusted research environments (TREs). arXiv preprint arXiv:2211.01656, 2022.
    
    
18.  Attakrit Leckcivilize,  Paul McNamee,  Christopher Cooper, and  Robby Steel. Impact of an anticipatory care planning intervention on unscheduled acute hospital care using difference-in-difference analysis. BMJ health & care informatics, 28(1), 2021.
    
    
19.  Erin LeDell,  Navdeep Gill,  Spencer Aiello,  Anqi Fu,  Arno Candel,  Cliff Click,  Tom Kraljevic,  Tomas Nykodym,  Patrick Aboyoun,  Michal Kurka, and  Michal Malohlava. h2o: R Interface for ‘H2O’, 2019. URL [https://CRAN.R-project.org/package=h2o](https://CRAN.R-project.org/package=h2o). R package version 3.26.0.2.
    
    
20.  Matthew C Lenert,  Michael E Matheny, and  Colin G Walsh. Prognostic models will be victims of their own success, unless…. Journal of the American Medical Informatics Association, 26(12):1645–1650, 2019.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/jamia/ocz145&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 

21.  James Liley,  Samuel R Emerson,  Bilal A Mateen,  Catalina A Vallejos,  Louis JM Aslett, and Sebastian J Vollmer. Model updating after interventions paradoxically introduces bias. AISTATS proceedings, 2021.
    
    
22.  Scott M Lundberg and  Su-In Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
    
    
23.  David Lyon,  Gillian A Lancaster,  Steve Taylor,  Chris Dowrick, and  Hannah Chellaswamy. Predicting the likelihood of emergency admission to hospital of older people: development and validation of the emergency admission risk likelihood index (EARLI). Family practice, 24(2):158–167, 2007.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/fampra/cml069&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17210987&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000246122000010&link_type=ISI) 

24.  S Manoukian,  S Stewart,  N Graves,  H Mason,  C Robertson,  S Kennedy,  J Pan,  L Haahr,  SJ Dancer,  B Cook, et al. Evaluating the post-discharge cost of healthcare-associated infection in NHS Scotland. Journal of Hospital Infection, 114:51–58, 2021.
    
    
25.  Matthew BA McDermott,  Shirly Wang,  Nikki Marinsek,  Rajesh Ranganath,  Luca Foschini, and  Marzyeh Ghassemi. Reproducibility in machine learning for health research: Still a ways to go. Science Translational Medicine, 13(586):eabb1655, 2021.
    
    [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiRlVMTCI7czoxMToiam91cm5hbENvZGUiO3M6MTE6InNjaXRyYW5zbWVkIjtzOjU6InJlc2lkIjtzOjE1OiIxMy81ODYvZWFiYjE2NTUiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8wOS8yOC8yMDIxLjA4LjA2LjIxMjYxNTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

26.  Marian S McDonagh,  David H Smith, and  Maria Goddard. Measuring appropriate use of acute beds: a systematic review of methods and results. Health policy, 53(3):157–184, 2000.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0168-8510(00)00092-0&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=10996065&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000089316400002&link_type=ISI) 

27. NICE guidelines. Asthma: diagnosis, monitoring and chronic asthma management. National Institute of Health and Care Excellence, November 2017.
    
    
28. Office for National Statistics, National Records of Scotland, and Northern Ireland Statistics and Research Agency. 2011 census aggregate data. UK data service (edition: June 2011), 2011.
    
    
29.  Juan Perdomo,  Tijana Zrnic Celestine Mendler-Dünner, and  Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020.
    
    
30.  Anne B Prasad. British National Formulary. Psychiatric Bulletin, 18(5):304–304, 1994.
    
    [FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6MzoiUERGIjtzOjExOiJqb3VybmFsQ29kZSI7czo5OiJwYnJjcHN5Y2giO3M6NToicmVzaWQiO3M6ODoiMTgvNS8zMDQiO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMy8wOS8yOC8yMDIxLjA4LjA2LjIxMjYxNTkzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

31. Public Health Scotland. eDRIS Products and Services, Public Health Scotland, 2020. URL [https://www.isdscotland.org/Products-and-Services/eDRIS/](https://www.isdscotland.org/Products-and-Services/eDRIS/).
    
    
32. Public Health Scotland. Acute hospital activity and NHS beds information for Scotland, 2022. URL [https://publichealthscotland.scot/media/15288/2022-09-27-annual-acuteactivity-report.pdf](https://publichealthscotland.scot/media/15288/2022-09-27-annual-acuteactivity-report.pdf).
    
    
33.  Fatemeh Rahimian,  Gholamreza Salimi-Khorshidi,  Amir H Payberah,  Jenny Tran,  Roberto Ayala Solares,  Francesca Raimondi,  Milad Nazarzadeh,  Dexter Canoy, and  Kazem Rahimi. Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records. PLoS medicine, 15(11):e1002695, 2018.
    
    
34. Rural Access Action Team. The national framework for service change in NHS Scotland. Scottish Executive, Edinburgh, 2005.
    
    
35.  Colin Sanderson and  Jennifer Dixon. Conditions for which onset or hospital admission is potentially preventable by timely and effective ambulatory care. Journal of health services research & policy, 5(4):222–230, 2000.
    
    
36. Scottish Government. Scottish index of multiple deprivation, 2016.
    
    
37.  Matthew Sperrin,  David Jenkins,  Glen P Martin, and  Niels Peek. Explicit causal reasoning is needed to prevent prognostic models being victims of their own success. Journal of the American Medical Informatics Association, 26(12):1675–1676, 2019.
    
    
38.  Adarsh Subbaswamy and  Suchi Saria. From development to deployment: dataset shift, causality, and shift-stable models in health ai. Biostatistics, 21(2):345–352, 2020.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/biostatistics/kxz041&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=31742354&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 

39.  Mark J Van der Laan,  Eric C Polley, and  Alan E Hubbard. Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007.
    
    
40.  Emma Wallace,  Ellen Stuart,  Niall Vaughan,  Kathleen Bennett,  Tom Fahey, and  Susan M Smith. Risk prediction models to predict emergency hospital admission in communitydwelling adults: a systematic review. Medical care, 52(8):751, 2014.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1097/MLR.0000000000000171&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25023919&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 

41.  Emma Wallace,  Susan M Smith,  Tom Fahey, and  Martin Roland. Reducing emergency admissions through community based interventions. BMJ, 352, 2016.
    
    
42. World Health Organization. International statistical classification of diseases and related health problems, volume 1. World Health Organization, 2004.
    
    
## REFERENCES

1.   Jochen Bröcker and  Leonard A Smith. Increasing the reliability of reliability diagrams. Weather and forecasting, 22(3):651–661, 2007.
    
    
2.   Tianqi Chen,  Tong He,  Michael Benesty,  Vadim Khotilovich,  Yuan Tang,  Hyunsu Cho,  Kailong Chen,  Rory Mitchell,  Ignacio Cano,  Tianyi Zhou,  Mu Li,  Junyuan Xie,  Min Lin,  Yifeng Geng, and  Yutian Li. xgboost: Extreme Gradient Boosting, 2019. URL [https://CRAN.R-project.org/package=xgboost](https://CRAN.R-project.org/package=xgboost). R package version 0.90.0.2.
    
    
3.  Health and Social Care Information Programme. A report on the development of SPARRA version 3 (developing risk prediction to support preventative and anticipatory care in Scotland), 2011. [https://www.isdscotland.org/Health-Topics/Health-and-Social-Community-Care/SPARRA/2012-02-09-SPARRA-Version-3.pdf](https://www.isdscotland.org/Health-Topics/Health-and-Social-Community-Care/SPARRA/2012-02-09-SPARRA-Version-3.pdf), Accessed: 6-3-2020.
    
    
4.   Erin LeDell,  Navdeep Gill,  Spencer Aiello,  Anqi Fu,  Arno Candel,  Cliff Click,  Tom Kraljevic,  Tomas Nykodym,  Patrick Aboyoun,  Michal Kurka, and  Michal Malohlava. h2o: R Interface for ‘H2O’, 2019. URL [https://CRAN.R-project.org/package=h2o](https://CRAN.R-project.org/package=h2o). R package version 3.26.0.2.
    
    
5.   Matthew C Lenert,  Michael E Matheny, and  Colin G Walsh. Prognostic models will be victims of their own success, unless. Journal of the American Medical Informatics Association, 26(12):1645–1650, 2019.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/jamia/ocz145&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 

6.   James Liley,  Samuel R Emerson,  Bilal A Mateen,  Catalina A Vallejos,  Louis JM Aslett, and  Sebastian J Vollmer. Model updating after interventions paradoxically introduces bias. AISTATS proceedings, 2021.
    
    
7.   Karel GM Moons,  Douglas G Altman,  Johannes B Reitsma,  John PA Ioannidis,  Petra Macaskill,  Ewout W Steyerberg,  Andrew J Vickers,  David F Ransohoff, and  Gary S Collins. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Annals of internal medicine, 162(1):W1–W73, 2015.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7326/M14-0698&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25560730&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 

8.   Elizbar A Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.
    
    
9.   Juan Perdomo,  Tijana Zrnic, Celestine Mendler- Dü nner, and  Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020.
    
    
10. Public Health Scotland. AE2 - Accident and emergency records, 2020a. [https://www.ndc.scot.nhs.uk/National-Datasets/data.asp?SubID=3](https://www.ndc.scot.nhs.uk/National-Datasets/data.asp?SubID=3), Accessed: 6-3-2020.
    
    
11. Public Health Scotland. PIS - Prescribing information systems, 2020b. [https://www.ndc.scot.nhs.uk/National-Datasets/data.asp?SubID=9](https://www.ndc.scot.nhs.uk/National-Datasets/data.asp?SubID=9), Accessed: 6-3-2020.
    
    
12. Public Health Scotland. System Watch: urgent care usage, 2020c. [https://publichealthscotland.scot/services/system-watch/#section](https://publichealthscotland.scot/services/system-watch/#section) 1-1, Accessed: 6-3-2023.
    
    
13. Public Health Scotland. SMR datasets - ISD Scotland Data Dictionary, 2023. [https://www.ndc.scot.nhs.uk/Data-Dictionary/SMR-Datasets/](https://www.ndc.scot.nhs.uk/Data-Dictionary/SMR-Datasets/), Accessed: 6-3-2023.
    
    
14.  Rafael D Romo,  Theresa A Allison,  Alexander K Smith, and  Margaret I Wallhagen. Sense of control in end-of-life decision-making. Journal of the American Geriatrics Society, 65(3):e70–e75, 2017.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F09%2F28%2F2021.08.06.21261593.atom) 

15.  Matthew Sperrin,  David Jenkins,  Glen P Martin, and  Niels Peek. Explicit causal reasoning is needed to prevent prognostic models being victims of their own success. Journal of the American Medical Informatics Association, 26(12):1675–1676, 2019.
    
    
16.  Geoffrey S Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.

 [1]: /embed/inline-graphic-1.gif
 [2]: /embed/inline-graphic-2.gif
 [3]: /embed/graphic-29.gif
 [4]: /embed/inline-graphic-3.gif
 [5]: /embed/inline-graphic-4.gif
 [6]: /embed/inline-graphic-5.gif
 [7]: /embed/inline-graphic-6.gif
 [8]: /embed/inline-graphic-7.gif
 [9]: /embed/graphic-30.gif
 [10]: /embed/graphic-31.gif
 [11]: /embed/graphic-32.gif
 [12]: /embed/graphic-33.gif
 [13]: /embed/graphic-34.gif
 [14]: /embed/inline-graphic-8.gif
 [15]: /embed/inline-graphic-9.gif
 [16]: /embed/inline-graphic-10.gif
 [17]: /embed/inline-graphic-11.gif
 [18]: /embed/graphic-35.gif
 [19]: /embed/graphic-36.gif
 [20]: /embed/graphic-37.gif
 [21]: /embed/graphic-38.gif
 [22]: /embed/inline-graphic-12.gif
 [23]: /embed/graphic-39.gif
 [24]: /embed/inline-graphic-13.gif
 [25]: /embed/graphic-40.gif
 [26]: /embed/graphic-41.gif
 [27]: /embed/graphic-42.gif
 [28]: /embed/graphic-43.gif
 [29]: /embed/graphic-44.gif
 [30]: /embed/graphic-45.gif
 [31]: /embed/graphic-46.gif