Assessment of machine learning algorithms in national data to classify the risk of self-harm among young adults in hospital: a retrospective study

Anmol Arora; Louis Bojko; Santosh Kumar; Joseph Lillington; Sukhmeet Panesar; Bruno Petrungaro

doi:10.1101/2022.08.08.22278554

Summary

Background Self-harm is one of the most common presentations at accident and emergency departments in the UK and is a strong predictor of suicide risk. The UK Government has prioritised identifying risk factors and developing preventative strategies for self-harm. Machine learning offers a potential method to identify complex patterns with predictive value for the risk of self-harm.

Methods National data in the UK Mental Health Services Data Set were extracted for patients aged 18‒30 years who started a mental health hospital admission between Aug 1, 2020 and Aug 1, 2021, and had been discharged by Jan 1, 2022. Data were obtained on age group, gender, ethnicity, employment status, marital status, accommodation status, and source of admission to hospital. These variables were used to construct seven machine learning models, which were applied individually and as an ensemble to predict hospital stays associated with a risk of self-harm.
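
One common way to combine several classifiers into an ensemble is unweighted soft voting, in which the per-model risk scores for each admission are averaged. The summary does not state how the seven models were combined, so the following is only a minimal sketch under that assumption; the function name `soft_vote`, the three example models, the scores, and the 0.5 threshold are all hypothetical.

```python
from statistics import mean


def soft_vote(scores_by_model):
    """Average per-model risk scores for each admission (unweighted soft voting).

    Assumption: this is an illustrative combination rule, not necessarily
    the one used in the study.
    """
    # zip(*...) regroups the scores so that each tuple holds one
    # admission's scores across all models.
    return [mean(scores) for scores in zip(*scores_by_model)]


# Hypothetical risk scores from three models for four admissions.
model_scores = [
    [0.10, 0.80, 0.30, 0.60],
    [0.20, 0.70, 0.50, 0.40],
    [0.30, 0.90, 0.10, 0.80],
]

ensemble_scores = soft_vote(model_scores)

# Hypothetical decision threshold for flagging an admission as higher risk.
flagged = [score >= 0.5 for score in ensemble_scores]
```

Averaging scores rather than counting votes keeps each model's confidence in play, which matters when the outcome is rare and most models output low scores for most admissions.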

Outcomes The training dataset included 23 808 items (including 1081 episodes of self-harm) and the testing dataset included 5951 items (including 270 episodes of self-harm). The best performing algorithms were the random forest model (AUC-ROC 0.70, 95% CI 0.66-0.74) and the ensemble model (AUC-ROC 0.77, 95% CI 0.75-0.79).
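
For readers unfamiliar with the metric, AUC-ROC can be read as the probability that a randomly chosen self-harm episode receives a higher risk score than a randomly chosen episode without self-harm. The sketch below computes it that way and attaches a percentile bootstrap interval. The summary does not say how the reported confidence intervals were derived, so the bootstrap is an assumption, and the function names and toy data are hypothetical.

```python
import random


def auc_roc(labels, scores):
    """AUC-ROC as the share of positive/negative pairs ranked correctly.

    Ties count as half. The pairwise loop is quadratic, so this is only
    suitable for small illustrative inputs.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC-ROC (assumed method)."""
    rng = random.Random(seed)
    indices = range(len(labels))
    estimates = []
    while len(estimates) < n_resamples:
        sample = rng.choices(indices, k=len(labels))
        sampled_labels = [labels[i] for i in sample]
        # AUC is undefined when a resample contains only one class.
        if len(set(sampled_labels)) < 2:
            continue
        estimates.append(auc_roc(sampled_labels, [scores[i] for i in sample]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: 1 marks an episode with self-harm, 0 an episode without.
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.10, 0.20, 0.30, 0.40, 0.35, 0.60, 0.70, 0.80]

point_estimate = auc_roc(toy_labels, toy_scores)
ci_lower, ci_upper = bootstrap_ci(toy_labels, toy_scores)
```

In the toy data, 13 of the 15 positive/negative pairs are ranked correctly, so the point estimate is 13/15. With only eight observations the bootstrap interval is wide, which illustrates why the narrow intervals reported above depend on the size of the testing dataset.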

Interpretation Machine learning algorithms could predict hospital stays with a high risk of self-harm based on readily available data that are routinely collected by health providers and recorded in the Mental Health Services Data Set. The findings should be validated externally with other real-world data.

Funding This study was supported by the Midlands and Lancashire Commissioning Support Unit.

Evidence before this study Despite self-harm being repeatedly labelled as a national priority for psychiatric healthcare research, it remains challenging for clinicians to stratify the risk of self-harm in patients. National guidelines have highlighted deficiencies in care, and attention is turning to the use of large datasets to develop evidence-based risk stratification strategies. However, many of the tools developed so far rely upon elements of the patient’s clinical history, which requires well curated datasets at a population level and previous engagement with care services at an individual level. Reliance upon elements of a patient’s clinical history also risks biasing against patients with missing data or against hospitals where data are poorly recorded.

Added value of this study In this study, we use commissioning data that are routinely collected in the United Kingdom by healthcare providers with each hospital admission. Of the variables that were available for analysis, recursive feature elimination reduced our variable selection to only age group, source of hospital admission, gender, and employment status. Machine learning algorithms applied to a national dataset were able to predict the majority of hospital episodes in which patients self-harmed. Random forest and ensemble machine learning methods were the best-performing models. Sensitivity and specificity for predicting self-harm occurrence were 0.756 and 0.596, respectively, for the random forest model and 0.703 and 0.730 for the ensemble model. To our knowledge, this is the first study of its kind and represents an advance in the prediction of inpatient self-harm by limiting the amount of information required to make predictions to that which would be near-universally available at the point of admission nationally.
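
The sensitivity and specificity figures above can be compared on a single scale with Youden's J statistic (sensitivity + specificity - 1). The sketch below shows how both metrics are computed from binary predictions and then applies Youden's J to the reported values. Youden's J is not used in the study itself and is introduced here only for illustration; the function names and toy labels are hypothetical.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # self-harm, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # self-harm, missed
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # no self-harm, not flagged
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # no self-harm, flagged
    return tp / (tp + fn), tn / (tn + fp)


def youden_j(sensitivity, specificity):
    """Youden's J: 0 means no better than chance, 1 means perfect separation."""
    return sensitivity + specificity - 1


# Toy example: three episodes with self-harm, four without.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_predictions = [1, 1, 0, 0, 0, 1, 0]
toy_sensitivity, toy_specificity = sensitivity_specificity(toy_labels, toy_predictions)

# Sensitivity and specificity as reported in the text above.
reported = {
    "random forest": (0.756, 0.596),
    "ensemble": (0.703, 0.730),
}
j_by_model = {name: round(youden_j(*pair), 3) for name, pair in reported.items()}
```

On the reported figures, Youden's J works out to 0.352 for the random forest model and 0.433 for the ensemble model: the ensemble gives up a little sensitivity in exchange for a larger gain in specificity.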

Implications of all the available evidence There is a role for machine learning to be used to stratify the risk of self-harm when patients are admitted to mental health facilities, using only commissioning data that are easily accessible at the point of care. External validation of these findings is required: although the algorithms were tested on a large sample of national data, prospective studies are still needed to assess the real-world application of such machine learning models.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

Funding was received from the Midlands and Lancashire Commissioning Support Unit.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Permission was granted by NHS England Data Services for the secondary analysis of anonymised data in the MHSDS, accessed through the NCDR secure server, for the purposes of this project.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

Information about the Mental Health Services Data Set is available at https://digital.nhs.uk/data-and-information/data-collections-and-data-sets/data-sets/mental-health-services-data-set/. The dataset documentation is publicly accessible, but the data are only available through NHS England Data Services for approved uses.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.