ABSTRACT
Short-term reattendances to emergency departments are a key quality of care indicator. Identifying patients at increased risk of early reattendance could help reduce the number of missed critical illnesses and could reduce avoidable utilization of emergency departments by enabling targeted post-discharge intervention. In this manuscript we present a retrospective, single-centre study in which we create and evaluate a machine-learnt classifier trained to identify patients at risk of reattendance within 72 hours of discharge from an emergency department. On a patient hold-out test set, our highest performing classifier obtained an AUROC of 0.749 and an average precision of 0.232, demonstrating that machine-learning algorithms can be used to classify patients, with moderate performance, into low and high-risk groups for reattendance. In parallel to our predictive model we train an explanation model, capable of explaining the predictions of the machine-learnt classifier at an attendance level. These explanations can be used to help understand the decisions a model is making, evaluate biases present in its decisions and help inform the design of bespoke interventional strategies.
Introduction
The use of emergency departments (EDs) has been growing steadily over the last decade1, 2, which in turn has contributed to increased overcrowding and extended waiting times. Since delays in care and overcrowding have been linked to increased rates of adverse outcomes3, 4, it is important to investigate the most efficient ways of using the available resources and, importantly, minimise and mitigate their unnecessary use. One way this can be achieved is by the minimisation of short-term reattendances, which describe a situation where a patient presents to an emergency department within 72 hours of having been discharged. The number of short-term reattendances can be minimised by both delivering the highest levels of patient care, thereby reducing the chance of missed critical illness and injury at the initial attendance, and by mitigating reattendances for reasons at least partially unrelated to the initial ED attendance.
Research has shown there are several factors indicative of short-term reattendance risk including social factors (e.g., living alone)5, depression6, initial diagnosis7, and historical emergency department usage8. Knowledge of these risk factors is important to clinical staff when planning discharge, but this is unlikely to be the optimal way of identifying those at risk of suffering from a significant illness following erroneous discharge or those in need of additional support in the community following discharge from an ED. Predictive models, available as a decision support tool at the point of discharge and able to reliably identify those at increased risk of short-term reattendance using known risk factors and attendance level information, may be able to significantly reduce the number of reattendances by appropriately quantifying and explaining a patient’s risk of reattendance to clinical staff. Ultimately this would allow appropriate intervention (e.g., further diagnostic tests), more informed discussions about a patient’s discharge plan, or support in the community for those recently discharged.
Machine-learnt models are a class of predictive models which are particularly well positioned to add value to emergency department processes. By making use of large amounts of clinical and administrative data, these models can provide estimates of a patient’s short-term reattendance risk9, 10 and explain the reason for the patient’s predicted risk. Explanation is particularly important, as this could help either inform the patient’s care trajectory or guide the post-discharge intervention plan. In this manuscript we discuss a machine-learnt model, utilizing historical (coded, inpatient) discharge summaries, alongside contemporary clinical data recorded during emergency department attendances such as observations and the results of standard triage processes, to identify patients at increased risk of short-term reattendance following an emergency department attendance. In addition to our predictive model, we construct an explanation model which enables us to evaluate the trends our model has learned and explain our model’s prediction at an attendance level.
Methods
Dataset curation
The dataset features a pseudonymized version of all attendances by adults to Southampton’s Emergency Department (University Hospitals Southampton Foundation Trust) occurring between the 1st of April 2019 and the 30th of April 2020. For our study cohort, we take only attendances which resulted in discharge directly from the ED, of which there were 54,015. The core dataset includes patients’ year of birth, results of any near-patient observations, and high-level information about the attendance included in the standard UK Emergency Care Data Set (e.g., outcome, arrival mode, duration of visit, and chief complaint). To provide the machine learning classifier with a view of patients’ medical history we make use of historical discharge summaries associated with the patient, both from the emergency department and from the patient’s electronic health record maintained by the University Hospitals Southampton Foundation Trust. For a given patient, from any discharge summary occurring prior to a given emergency department attendance, we make use of (ICD10) coded conditions (e.g., type 2 diabetes, current smoker) and create a binary indicator which indicates whether a patient has a given condition coded in their electronic health record prior to a given ED attendance. The electronic health records used by our models are available to review by clinicians and are used in regular practice. Our model does not have access to any free text fields in the electronic health record, whereas a clinician would. Previous studies have shown that (free text) clinical notes can be predictive of patient outcomes across the broader hospital network11, 12, but including these notes was beyond the scope of our study as they would limit the explainability of our algorithms. Examples of the most frequently observed conditions are presented in Table 1.
Reattendance identification
Patient reattendances are identified by using the patient pseudo identifier to calculate the time to their next ED attendance. All reattendances (with the exception of planned reattendances, Figure 1) are considered, even if the second attendance is for a different condition to the original attendance. This is then dichotomized (return within 72 hours from the point of discharge) to annotate the reattendance state of each attendance. This formulation allowed us to frame the predictive task as a binary classification problem.
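The labelling step described above can be illustrated with a minimal sketch. It is not the code used in this study; the record layout (the keys 'id', 'patient', 'arrival', 'discharge', and 'planned') and the example timestamps are hypothetical, chosen only to show the time-to-next-attendance calculation and the 72-hour dichotomization.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)

def label_reattendances(attendances):
    """Return {attendance id: True/False} for an unplanned return within 72 hours of discharge.

    Each attendance is a dict with the hypothetical keys 'id', 'patient',
    'arrival', 'discharge' (datetimes) and 'planned' (bool).
    """
    by_patient = defaultdict(list)
    for a in attendances:
        by_patient[a["patient"]].append(a)

    labels = {}
    for visits in by_patient.values():
        visits.sort(key=lambda a: a["arrival"])
        for i, current in enumerate(visits):
            # Next unplanned attendance by the same patient, if there is one.
            nxt = next((v for v in visits[i + 1:] if not v["planned"]), None)
            labels[current["id"]] = (
                nxt is not None and nxt["arrival"] - current["discharge"] <= WINDOW
            )
    return labels

def visit(id_, patient, arrival, discharge, planned=False):
    fmt = "%Y-%m-%d %H:%M"
    return {"id": id_, "patient": patient, "planned": planned,
            "arrival": datetime.strptime(arrival, fmt),
            "discharge": datetime.strptime(discharge, fmt)}

example = [
    visit(1, "A", "2019-05-01 10:00", "2019-05-01 14:00"),
    visit(2, "A", "2019-05-03 09:00", "2019-05-03 12:00"),  # 43 h after discharge
    visit(3, "B", "2019-05-01 08:00", "2019-05-01 11:00"),
    visit(4, "B", "2019-05-10 08:00", "2019-05-10 11:00"),  # 9 days later
    visit(5, "C", "2019-05-01 08:00", "2019-05-01 11:00"),
    visit(6, "C", "2019-05-02 08:00", "2019-05-02 09:00", planned=True),
]
labels = label_reattendances(example)
```

In this toy example only attendance 1 is labelled as a reattendance case: attendance 4 falls outside the 72-hour window and attendance 6 is a planned return, so it does not count against attendance 5.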
Predictive modelling
We separated our data into a training set and two independent test sets (Figure 1). The last 3 months of attendances (01/02/2020 to 30/04/2020, inclusive of the COVID-19 pandemic) were segregated as a temporal test set, excluding any visit which took place in either the first 30 days or the final 72 hours. These exclusions remove information leakage between the training and temporal test set and remove attendances where reattendance to the emergency department could not be calculated reliably (i.e., attendances which occur in the last 72 hours of the data extract). The remaining attendances (N=44,294) were randomly split at the patient level to create a patient-level hold-out test set containing attendances from 20 % of the remaining patients. The remaining attendances (N=35,447) were used as the training and validation set. The temporal test set, patient hold-out test set, and training set contained 4,464, 7,237, and 28,945 patients, respectively. The relation between patients in each dataset is displayed in Supplementary Figure 1, demonstrating patient exclusivity between the training set and the patient hold-out test set. Results of our predictive model evaluated on the temporal test set are discussed in the Supplementary Information.
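A patient-level split of the kind described above can be sketched as follows. This is an illustrative sketch only: the function name, the random seed, and the toy cohort are assumptions, and the study's own splitting code is not described in the text. The point is that patients, rather than attendances, are shuffled, so every attendance from a given patient lands on the same side of the split.

```python
import random

def patient_level_split(attendances, test_fraction=0.2, seed=0):
    """Split (patient_id, attendance_id) pairs so that no patient spans both sets."""
    patients = sorted({patient for patient, _ in attendances})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = round(len(patients) * test_fraction)
    held_out = set(patients[:n_test])
    train = [a for a in attendances if a[0] not in held_out]
    test = [a for a in attendances if a[0] in held_out]
    return train, test

# Toy cohort: 10 hypothetical patients with between one and three attendances each.
attendances = [(f"p{p}", f"a{p}_{k}") for p in range(10) for k in range(1 + p % 3)]
train, test = patient_level_split(attendances)
train_patients = {p for p, _ in train}
test_patients = {p for p, _ in test}
```

Because the split is made over patients, the fraction of attendances in the test set will generally differ slightly from 20 %, as it depends on how many attendances the held-out patients happen to have.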
As our machine-learnt classifier, we used a gradient boosted decision tree as implemented in the XGBoost framework16. Features used in modelling include: patient age (estimated from year of birth), number of emergency department attendances in the 30 days prior to the attendance, the chief complaint of the attendance (e.g., ‘abdominal pain’), the patient’s mode of arrival, the previously described medical condition indicators, the count of the number of medical conditions a patient has, vital signs (temperature, pulse and respiration rate, systolic blood pressure, and blood oxygen saturation levels), the Manchester Triage System score, triage pain score, (coded) discharge diagnosis, and the hour of day and day of the week the attendance occurred. A full data schema is presented in Supplementary Table 1. Medical conditions associated with the patient at a given attendance were included as a one-hot-encoded feature vector, the day of the week was encoded using ordinal encoding, and all other categorical variables were encoded using target encoding13. Model hyperparameters were optimized using five-fold cross-validation (CV) of the training set at the patient level (the set of attendances from a unique patient appear exclusively in the validation or training set for each fold) using Bayesian optimization utilizing the Tree Parzen Estimator algorithm as implemented in the hyperopt Python library14, 15.
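To make the target-encoding step concrete, the sketch below replaces each category with a smoothed mean of the binary outcome. It is a simplified illustration, not the encoder used in this study: the additive smoothing scheme, the `smoothing` parameter, and the toy data are all assumptions. To avoid leaking the outcome, an encoding of this kind should be fitted on the training folds only and then applied to the validation or test data.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the binary target.

    Categories with few observations are pulled towards the overall mean (the prior).
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoding = {
        c: (sums[c] + smoothing * prior) / (counts[c] + smoothing) for c in counts
    }
    return encoding, prior

def apply_target_encoding(categories, encoding, prior):
    """Encode categories, falling back to the prior for categories unseen during fitting."""
    return [encoding.get(c, prior) for c in categories]

# Hypothetical training data: chief complaint and 72-hour reattendance outcome.
complaints = ["abdominal pain"] * 4 + ["unwell adult"] * 4
reattended = [1, 1, 0, 0, 0, 0, 0, 1]
encoding, prior = fit_target_encoding(complaints, reattended, smoothing=4.0)
encoded = apply_target_encoding(["abdominal pain", "chest pain"], encoding, prior)
```

Here the overall rate is 0.375, so ‘abdominal pain’ (raw rate 0.5) is encoded as 0.4375, ‘unwell adult’ (raw rate 0.25) as 0.3125, and the unseen ‘chest pain’ falls back to 0.375.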
We evaluate our final model performance (the mean output of the five models trained during cross-validation) on the two hold-out test sets. Model performance is evaluated using the area under the receiver operating characteristic curve (AUROC) and the average precision, which summarises the precision-recall curve.
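The two metrics, and the averaging of the five cross-validation models, can be written out in a few lines. The sketch below is a plain-Python illustration with made-up predictions; in practice library routines (for example scikit-learn's `roc_auc_score` and `average_precision_score`) would normally be used, and nothing here should be read as the evaluation code of this study.

```python
def ensemble_mean(fold_predictions):
    """Average the predicted probabilities of several models, attendance by attendance."""
    return [sum(p) / len(p) for p in zip(*fold_predictions)]

def auroc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Sum of the precision at each distinct threshold, weighted by the gain in recall."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    n_pos = sum(y_true)
    tp = seen = 0
    ap = prev_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every attendance tied at this score before updating the curve.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            seen += 1
            i += 1
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / seen)
        prev_recall = recall
    return ap

# Toy example: two hypothetical fold models scoring four attendances.
y = [0, 0, 1, 1]
scores = ensemble_mean([[0.0, 0.5, 0.3, 0.9], [0.2, 0.3, 0.4, 0.7]])
roc = auroc(y, scores)
ap = average_precision(y, scores)
```

For this toy example three of the four positive-negative pairs are ranked correctly, giving an AUROC of 0.75, and the average precision is 0.5 × 1 + 0.5 × 2/3 ≈ 0.833.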
Model explainability
To explain the predictions of our model we make use of the TreeExplainer algorithm implemented in the SHAP Python library17–19. TreeExplainer calculates SHAP values (i.e., Shapley values), a concept from coalitional game theory which treats predictive variables as players in a game and distributes their contribution to the predicted probability. To calculate the SHAP value for a given feature, one trains a model for each possible feature set (with and without the given feature) and calculates the mean change in the predicted probability when the feature was added to a feature set for all possible sets of features. This mean change is the SHAP value and can be negative (adding the feature reduces the predicted reattendance risk) or positive (adding the feature increases the predicted reattendance risk). SHAP values are particularly powerful as they meet the four desirable theoretical conditions of an explanation algorithm and can provide instance (i.e., attendance) level explanations19. Practically, for each attendance we will have a scalar value for each variable used by the model which quantifies the contribution that variable had on the predicted reattendance risk for the given attendance, with SHAP values of larger magnitudes indicating that the relevant variable was of greater importance in determining the predicted reattendance risk.
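The calculation described above can be made concrete with a brute-force sketch of the classical Shapley value. In the exact definition, the mean over feature sets is a weighted one: the marginal contribution of a feature to a subset S of the other n − 1 features carries weight |S|! (n − |S| − 1)! / n!. The code below enumerates every subset, which is only feasible for a handful of features; TreeExplainer obtains the values for tree ensembles without this enumeration. The feature names and predicted risks are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for `value`, a function from a frozenset of features to a number."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a subset of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

# Hypothetical predicted reattendance risk for each set of known features.
risk = {
    frozenset(): 0.05,
    frozenset({"visit_count"}): 0.10,
    frozenset({"lives_alone"}): 0.08,
    frozenset({"visit_count", "lives_alone"}): 0.16,
}
phi = shapley_values(["visit_count", "lives_alone"], lambda s: risk[s])
```

In this toy case the visit count receives 0.065 and living alone 0.045; the two values sum to 0.11, the difference between the full prediction (0.16) and the baseline (0.05), which is what allows each attendance-level prediction to be decomposed into per-variable contributions.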
To investigate the different explanations across the whole dataset, we project the SHAP values for all attendances into a two-dimensional (‘explanation’) space using Uniform Manifold Approximation and Projection (UMAP)20. UMAP is a dimensionality reduction technique regularly used to visualise high-dimensional spaces in a low-dimensional embedding, such that global and local structure of the space can be explored21, 22. Attendances which are closer in proximity in this two-dimensional space share a more similar explanation for their predicted reattendance risk.
Ethics and data governance
This study was approved by the University of Southampton’s Ethics and Research governance committee (ERGO/FEPS/53164) and approval was obtained from the Health Research Authority (20/HRA/1102). Data was pseudonymized (and where appropriate linked) before being passed to the research team. The research team did not have access to the pseudonymisation key.
Results
To investigate the potential of individual variables and sets of variables for predicting 72-hour reattendance, we constructed a series of XGBoost models and evaluated their performance on the training set using five-fold cross-validation as described in the Methods section. The results of this experiment are displayed in Table 2.
Six variables (age, Manchester Triage System score, hour of day, vital signs, pain score, and arrival mode) were found to be weakly predictive of a patient’s 72-hour reattendance risk in isolation (AUROCs between 0.5 and 0.6, Table 2 models b-g). All other variables (Table 2 models h-n) were found to be moderately predictive (AUROC between 0.6 and 0.7) of 72-hour reattendance risk in isolation, with the exception of the day of the week the attendance occurred, which was not predictive of a patient’s reattendance risk (model a, Table 2).
Patients’ conditions were included in two representations. The count of the number of historical conditions (model m, Table 2) obtained a validation AUROC of 0.692, reflecting that those with a recorded medical history with the emergency department or the associated hospital are more likely to reattend (8.3 % (95 % CI: 7.8-8.7 %) reattendance rate) than those who do not (2.3 % (95 % CI: 2.1-2.5 %) reattendance rate). When we included the full one-hot encoded matrix denoting whether the patient had a history of a specific condition, our model (model n, Table 2) obtained a validation AUROC of 0.706, higher than when our model used just the number of historical conditions. This suggests that different (medical) conditions are associated with differing degrees of reattendance risk.
The model that used the number of times a patient attended the emergency department in the 30 days prior to their current attendance (Table 2, model l) exhibited a validation AUROC of 0.669, agreeing with other studies that a patient’s previous emergency department usage is an important consideration when considering their reattendance risk8. Three models (models i, j, and k, Table 2) make use of coded information describing the primary reason for the emergency department attendance, recorded at three distinct timepoints and by potentially different members of clinical and non-clinical staff. Making use of the chief complaint collected at either the point of registration or triage, respective validation AUROCs of 0.640 and 0.647 could be achieved. At the point of discharge, the recorded coded diagnosis obtained a validation AUROC of 0.642. This demonstrates that different diagnoses are associated with differing degrees of reattendance risk and indicates that a high-level, coded description of the patient’s chief complaint is moderately predictive of reattendance risk, regardless of when it is recorded during the attendance.
Finally, we investigated models which utilize multiple variables (models o and p in Table 2). Firstly, we trained a model using only the three variables which were most predictive as determined by our univariate feature importance investigation (Table 2). This model (model o in Table 2) used the condition indicators, the chief complaint recorded at triage, and the number of times the patient visited the ED in the previous 30 days. Ultimately, it obtained a validation AUROC of 0.741, demonstrating that using multiple variables is more predictive of reattendance than a single variable. We also trained and evaluated a model using all variables available at the time of discharge, which increased the validation AUROC to 0.761, but with a small decrease in the validation average precision.
Next, we applied our final model (model p, Table 2) to the patient-wise hold-out test set, the evaluation of which is presented in Figure 2. The AUROC and average precision were 0.749 and 0.232, respectively. While the model retained moderate performance, there was a reduction in performance between the validation and patient-wise hold-out test sets. This could arise because of a degree of overfitting of the model to the validation data (via the selection of hyperparameters which optimize model performance on the validation set) or because of a different set of patients in the hold-out test set. To demonstrate the evaluation of our model as a binary decision support tool, we display a confusion matrix for our classifier at a single configuration in Figure 2 c. The threshold for dichotomization of the predictions was chosen such that a recall of 0.15 was obtained.
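Choosing a dichotomization threshold for a target recall, and tabulating the resulting confusion matrix, can be sketched as below. The function names and toy scores are assumptions for illustration, and the toy example uses a target recall of 0.5 rather than 0.15 simply because it contains only four positive cases.

```python
def threshold_for_recall(y_true, scores, target_recall):
    """Highest score threshold at which recall is at least `target_recall`."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("no positive cases, so recall is undefined")
    tp = 0
    for score, y in sorted(zip(scores, y_true), reverse=True):
        tp += y
        if tp / n_pos >= target_recall:
            return score
    raise ValueError("target recall cannot be reached")

def confusion_matrix(y_true, scores, threshold):
    """Counts of true/false positives/negatives when alerting on score >= threshold."""
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for y, s in zip(y_true, scores):
        alert = s >= threshold
        key = ("tp" if y else "fp") if alert else ("fn" if y else "tn")
        cm[key] += 1
    return cm

# Toy example: ten hypothetical attendances, four of which were followed by a reattendance.
y = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
threshold = threshold_for_recall(y, scores, target_recall=0.5)
cm = confusion_matrix(y, scores, threshold)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
```

In the toy example the threshold is 0.7, which flags three attendances, two of them correctly (precision 2/3, recall 0.5). Lowering the target recall moves the threshold up the ranked list, trading missed reattendances for fewer false alarms.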
To investigate the trends our model has learned we made use of the TreeExplainer algorithm18; a demonstration of the global explanation of our reattendance model is presented in Figure 3. In Figure 3 a the SHAP values (quantifying, at an instance level, the impact a given variable has on the model’s prediction) for 10 variables are shown for each attendance. Examining this plot across a large number of variables gives a high-level understanding of the model: the model associates anyone with a recorded medical condition as being at increased risk of reattendance and learns that some medical conditions represent a greater reattendance risk than others. For example, living alone is generally associated with a higher reattendance risk than having a history of depression (the mean SHAP value is greater for those who live alone). In panels b and c of Figure 3, we plot the same information for two features (the hour of day the attendance occurred and the 30-day visit count, respectively) but in a 2D plane, which allows greater insight into the associations our model has learned between these variables and a patient’s reattendance risk. The model has learned that the patient’s risk of reattendance displays a periodic dependence on the hour of day the attendance occurred (Figure 3 b; vertical dispersion is the result of interactions with other variables in the dataset), and it has also learned an approximately linear dependence between a patient’s reattendance risk and the number of times they have attended the emergency department in the last 30 days (Figure 3 c). It is important to note that these insights do not necessarily reflect the actual risk factors for reattendance (since the model is an imperfect classifier) but only reflect the trends the model has learned to make its decisions.
We then project the explanations for all attendances into a lower-dimensional (2D) ‘explanation’ space using the UMAP algorithm (see Methods for details). This projection provides insight into the different high-level groups of explanations provided by our model. The two-dimensional embedding of the attendances in the patient hold-out test set into the explanation space is visualised in Figure 4; attendances close in this space share more similar explanations for their predicted reattendance risk. There are clear regions (colour) in the explanation space associated with increased reattendance risk. In Figure 4 b we colour the attendances by a high-level cluster assignment (obtained using the DBSCAN algorithm23), where hyperparameters were selected by visual inspection. Key descriptors of each cluster are displayed in Table 3 and visual inspection of this table provides high-level insight into the reasoning of the machine-learnt model. For example, for attendances assigned to cluster five, on average, the most important variable for determining a patient’s reattendance risk is their medical history.
Discussion
Our final 72-hour reattendance risk model achieved an AUROC of 0.749 and an average precision of 0.232 on a set of attendances independent of the training set. Qualitatively, our model can use a patient’s (local) medical history and attendance level information to predict their reattendance risk with moderate performance. In parallel, we trained an explanation model, which can explain the model’s predictions at an attendance level (Figure 3 and Supplementary Figure 3). These explanations were projected into a two-dimensional space (Figure 4). Such a visualisation can facilitate improved understanding of the model’s high-level reasoning and can be used as a tool to understand the different sub-groups at risk of reattendance, which could be used by the clinical care team to design interventions based on where a given attendance resides within the explanation space. Ultimately, this facilitates the deployment of the machine-learnt models in a more informed manner.
Our final model (model p in Table 2) makes use of all variables found to be predictive of short-term reattendance. This is not necessarily the optimal set of variables for a reattendance predictor, and the variables likely contain redundant information because of correlations between them. High correlation between variables is expected for clinical data. For example, it is likely that patient age, arrival mode, and vital signs all latently encode the patient’s frailty, which is known to be related to a patient’s reattendance risk24.
In our exploratory analysis, we found that the hour of day the attendance began correlates with the reattendance rate, with higher reattendance rates observed during the night (Supplementary Figure 2). By evaluating the observed SHAP values for the hour of day (Figure 3 b) we can observe that our model has learned a similar trend, associating attendance registration during the night with an increased (between zero and two percent increase) reattendance risk. This trend could have several different origins. Firstly, we have found (not shown) that the hour of day is correlated with the reason for attendance: either complaints associated with a higher risk of 72-hour reattendance are more likely to present during the night, or complaints associated with a lower risk of 72-hour reattendance are less likely to present during the night. Secondly, it is possible that staff fatigue and lower staffing levels contribute to the increased reattendance rate for attendances occurring during the night, although we have no way of testing this hypothesis in our dataset.
Our analysis (e.g., Figure 2 and model e in Table 2) shows that certain complaints are associated with a higher risk of 72-hour reattendance. For example, attendances whose chief complaint at registration is ‘Abdominal pain’ had a mean 72-hour reattendance rate of 6.8 % (95 % CI: 6.0-7.7 %), compared to the mean reattendance rate of 4.8 % (95 % CI: 4.6-5.0 %). Coding compliance was not evaluated in our dataset, which may affect this observation. For example, the most common chief complaint at registration was ‘Unwell Adult’ (with 20.5 % of attendances listing this as the chief complaint in the training set), which is utilized when the chief complaint is not clear at registration, when the patient presents with multiple complaints, or as a result of inappropriate coding.
In addition to identifying complaints associated with a heightened short-term reattendance risk, our model also makes use of ICD10 coded conditions (e.g., type 1 diabetes, lives alone) extracted from a patient’s electronic health record. These variables allow the model to identify medical conditions, comorbidities, and risks which are associated with increased reattendance rates and enable models to achieve moderate predictive performance (Table 2). Excluding the medical condition indicators, the most important feature is the 30-day visit count which, in part, reflects the disproportionate use of EDs by frequent users25. In the visualisation of the attendances in the patient hold-out test set in the two-dimensional explanation space (Figure 4), the most frequent attenders (30-day visit count of two or more) are clearly segregated (cluster 8 in Figure 4 b and Table 3). Visual inspection of the properties of each cluster in Figure 4 could be used to help guide the design of interventional strategies. For example, while attendances within cluster 8 (i.e., frequent attenders) may benefit from support in the community to mitigate their reattendance risk, this will not necessarily be appropriate for patients who have a heightened reattendance risk but are suffering from an acute injury associated with increased reattendance risk, such as a severe burn.
From a clinical perspective it is important to investigate the subset of reattendances which are also readmissions (i.e., reattendances to the emergency department which result in the patient being admitted to an inpatient ward). In these cases, there is increased risk that there was missed critical illness or injury at the initial attendance, and they are important to evaluate for clinical assurance purposes. Overall, 37.1 % of reattendances end in readmission, resulting in a 72-hour readmission rate of 2.0 %. Evaluating our model’s predictions with a target equal to whether the patient was readmitted within 72 hours, we find it has an AUROC of 0.805 and an average precision of 0.040 on the hold-out patient test set. The high AUROC means the model discriminates well between attendances which result in readmission and those that do not. The low average precision reflects that readmissions only make up a minority fraction of reattendances and the false positive rate increases as a result of the large class imbalance. Overall, these results demonstrate our classifier can identify the subset of reattendances which are also readmissions with a similar predictive performance as reattendances which do not result in admission, a particularly important result since these two different outcomes will likely merit different interventional strategies to reduce the risk of reattendance/readmission.
A limitation of our study, shared with other investigations of machine-learning use in EDs26, is that its primary data source is structured past medical histories, which are unavailable for many patients. This could lead to our model discriminating against people without a clinical history at the emergency department and associated hospital. An example of this bias can be observed for cluster two (Table 3), where the two most important variables for determining a patient’s reattendance risk are the absence of any visit to the emergency department in the last 30 days and the absence of any recorded medical conditions, which may not be desirable behaviour. We mitigate this through the use of visit-level information, and this bias could be further reduced by linking to community datasets (e.g., GP records) to get a view of a patient’s medical history. In a deployment scenario this bias could be minimized further by using the model as an alert tool, with results only being displayed for patients it predicts to be at high risk of reattendance and otherwise being entirely invisible to clinical staff, who would be free to carry out standard clinical practice in cases where the alarm is not raised.
Practically, since the model uses only information available to clinicians at the time of the emergency department visit, the model has a relatively low barrier to implementation. Despite this, it will be essential to perform prospective, randomized clinical trials of any implementation, investigating the efficacy of these predictive risk models, the associated interventions and, importantly, analysing how they impact decision making. Ultimately, deployment of a machine-learning model could eventually invalidate the model by changing the behaviours and descriptors of reattendances by altering the clinical decisions made. In the short term, a relatively low-risk implementation of a machine-learnt model trained to identify patients at risk of reattendance would be a low-recall and high-precision alert system (for example, the configuration presented in Figure 2 c). This would only raise alarms for the cases the model believes are at the highest risk of reattendance and suggest appropriate clinically-validated intervention or additional clinical review. On average, using the configuration displayed in Figure 2 c, this would have raised an alarm for only 1.6 % of attendances in which a decision to discharge was made (approximately 2 times per day) and would be expected to be correct approximately 50 % of the time, mitigating the risk of alarm fatigue. While such a model would only be of limited impact because of the model’s low recall, as model performance improves, this configuration could be re-evaluated and changed to increase the impact of the model.
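The alert-burden figures quoted above can be related to one another with simple arithmetic, sketched below. Two assumptions are made that are not stated in the text: that the cohort-wide count of 54,015 discharged attendances over the whole study period gives a representative daily volume, and that the overall 4.8 % reattendance rate approximates the prevalence in the test set. The identity used is that the fraction of attendances raising an alert equals recall × prevalence / precision.

```python
from datetime import date

def alert_rate(recall, prevalence, precision):
    """Fraction of attendances that raise an alert.

    True positives are recall * prevalence of all attendances, and
    alerts are true positives divided by precision.
    """
    return recall * prevalence / precision

# Daily volume, assuming attendances are spread evenly over the study period.
n_discharged = 54_015
study_days = (date(2020, 4, 30) - date(2019, 4, 1)).days + 1  # both end dates inclusive
attendances_per_day = n_discharged / study_days

# Alerts per day at the quoted 1.6 % alert rate.
alerts_per_day = 0.016 * attendances_per_day

# Alert rate implied by a recall of 0.15 and ~50 % precision at an assumed 4.8 % prevalence.
implied_rate = alert_rate(recall=0.15, prevalence=0.048, precision=0.5)
```

Under these assumptions there are roughly 136 discharged attendances per day and a little over two alerts per day, in line with the "approximately 2 times per day" quoted above. The implied alert rate of about 1.4 % is close to the quoted 1.6 %; the difference is within what would be expected from rounding of the precision and from the test-set prevalence differing from the cohort-wide rate.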
When considering deployment it is important to discuss the context in which these predictive models could be prospectively deployed. Our model was trained and retrospectively evaluated using data obtained local to the emergency department in Southampton, using data available to clinicians during standard clinical practice. This is clearly an advantage if the model were to be used at this location: the biases in the data and attendance characteristics will likely reflect what the model will encounter in production. Conversely, this does mean that the model will not necessarily generalize to different EDs without first training on their local data. This will be particularly prominent in EDs with a catchment zone with very different demographics to Southampton, which would have a differing disease prevalence and differing characteristics at presentation to the emergency department. Despite this, since our model contains variables either in the standard UK emergency care dataset or regularly available to EDs nationally, it would be possible to evaluate this model directly in other EDs with little alteration. External validation of our model using data from different EDs is essential before prospective deployment beyond the department at which the training data was sourced.
Conclusion
In conclusion, we have constructed and retrospectively evaluated a gradient boosted decision tree classifier capable of predicting the 72-hour reattendance risk for a patient at the point of discharge from an emergency department. The highest performing model achieved an AUROC of 0.749 and an average precision of 0.232 on a set of attendances independent of the training set. We investigated the variables most indicative of risk and showed these were patient level factors (medical history) rather than visit level variables such as recorded vital signs. We demonstrated how explainable machine learning can be used to investigate the decisions a model is making. We suggested an implementation of the algorithm in a low-recall high-precision configuration such that alarms are only raised if the model deems the patient to be at a (clinically defined) heightened risk of reattendance. External validation and prospective clinical trials of these models are essential, with careful consideration given to the planned intervention resulting from the model’s recommendation and the impact this would have on clinical decisions.
Data Availability
The data that support the findings of this study are available from UHS, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of UHS.
Author contributions statement
FPC performed the data analysis and modelling. DKB and ZDZ discussed and commented on the analysis with FPC. NW and FPC obtained governance and ethical approval. MA and FB created the data extract. FPC, MJB and NW managed the study at the UoS. MK managed the study at UHS. FPC and MK designed the study with assistance from TWVD. MK and TWVD provided clinical guidance and insight. FPC wrote the first draft of the manuscript with assistance from MK and TWVD. All authors frequently discussed the work, and commented on and contributed to subsequent drafts of the manuscript.
Additional information
Competing interests
The authors declare no competing interests.
Data availability
Due to patient privacy concerns the dataset used in this study is not publicly available. However, it will be made available upon reasonable request.
Acknowledgements
This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. We acknowledge support from the NIHR Wessex ARC.