Hierarchical machine learning predicts geographical origin of Salmonella within four minutes of sequencing

Sion C. Bayliss; Rebecca K. Locke; Claire Jenkins; Marie Anne Chattaway; Timothy J. Dallman; Lauren A. Cowley

doi:10.1101/2022.08.23.22279111

Abstract

Salmonella enterica serovar Enteritidis is one of the most frequent causes of salmonellosis globally and is commonly transmitted from animals to humans through the consumption of contaminated foodstuffs. Herein, we detail the development and application of a hierarchical machine learning model to rapidly identify and trace the geographical source of S. Enteritidis infections from whole genome sequencing data. A total of 2,313 S. Enteritidis genomes collected by the UKHSA between 2014 and 2019 were used to train a ‘local classifier per node’ hierarchical classifier to attribute isolates to 4 continents, 11 sub-regions and 38 countries (53 classes). The highest classification accuracy was achieved at the continental level, followed by the sub-regional and country levels (macro F1: 0.954, 0.718 and 0.661, respectively). A number of countries commonly visited by UK travellers were predicted with high accuracy (hF1: >0.9). Longitudinal analysis and validation with publicly accessible international samples indicated that predictions were robust to prospective external datasets. The hierarchical machine learning framework provides granular geographical source prediction directly from sequencing reads in <4 minutes per sample, facilitating rapid outbreak resolution and real-time genomic epidemiology.
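The hierarchical F1 (hF1) score reported above can be illustrated with a minimal sketch. It assumes the commonly used ancestor-set definition, in which each label is expanded to itself plus all of its ancestors before precision and recall are computed; the abstract does not state which formulation the authors used. The toy hierarchy, the `PARENT` mapping and the function names are hypothetical and are not taken from the HierarchicalML repository.

```python
# Hypothetical toy hierarchy (continent > sub-region > country).
# Each key maps a node to its parent; None marks a top-level continent.
PARENT = {
    "Europe": None,
    "Asia": None,
    "Southern Europe": "Europe",
    "Western Europe": "Europe",
    "South-eastern Asia": "Asia",
    "Spain": "Southern Europe",
    "Italy": "Southern Europe",
    "France": "Western Europe",
    "Thailand": "South-eastern Asia",
}


def path_to_root(label):
    """Return the label together with all of its ancestors as a set."""
    nodes = set()
    while label is not None:
        nodes.add(label)
        label = PARENT[label]
    return nodes


def hierarchical_f1(true_labels, predicted_labels):
    """Compute hF1 under the assumed ancestor-set definition.

    Precision and recall are pooled over all samples: the numerator counts
    nodes shared by the true and predicted paths, so a prediction in the
    correct sub-region but the wrong country still earns partial credit.
    """
    overlap = pred_total = true_total = 0
    for true, pred in zip(true_labels, predicted_labels):
        true_nodes = path_to_root(true)
        pred_nodes = path_to_root(pred)
        overlap += len(true_nodes & pred_nodes)
        pred_total += len(pred_nodes)
        true_total += len(true_nodes)
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / true_total
    return 2 * precision * recall / (precision + recall)


# One isolate from Spain is misattributed to Italy (same sub-region);
# the other two isolates are attributed correctly.
score = hierarchical_f1(
    ["Spain", "France", "Thailand"],
    ["Italy", "France", "Thailand"],
)
```

In this toy case 8 of the 9 nodes on the predicted paths match the true paths, so the misattribution within Southern Europe is penalised less than a prediction on the wrong continent would be. A flat per-class F1 would treat both errors identically.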

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This work was funded by an Academy of Medical Sciences Springboard grant (SBF005\1089). CJ, TD and MAC are affiliated to the National Institute for Health Research Health Protection Research Units (NIHR HPRU) in Gastrointestinal Infections and in Genomics and Enabling Data, at the University of Liverpool and the University of Warwick respectively, in partnership with the UK Health Security Agency (UKHSA). CJ and MAC are based at UKHSA. The views expressed are those of the author(s) and not necessarily those of the NIHR, the Department of Health and Social Care or the UK Health Security Agency.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The initial dataset consisted of 10,223 S. Enteritidis isolates collected and sequenced by UKHSA between 2014 and 2019 as part of its routine disease monitoring programme. Raw read data were downloaded from the Sequence Read Archive (BioProject: PRJNA248792) (Table S3).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

The final optimised hierarchical model and a pipeline for pre-processing raw read data into unitigs/patterns for input are available from https://github.com/SionBayliss/HierarchicalML, together with a short description and tutorial for ease of use. This end-to-end process, from FASTQ to prediction, is open access. Short-read sequencing data are available from the Sequence Read Archive under BioProject PRJNA248792.


The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.