RT Journal Article SR Electronic T1 Hierarchical machine learning predicts geographical origin of Salmonella within four minutes of sequencing JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2022.08.23.22279111 DO 10.1101/2022.08.23.22279111 A1 Bayliss, Sion C. A1 Locke, Rebecca K. A1 Jenkins, Claire A1 Chattaway, Marie Anne A1 Dallman, Timothy J. A1 Cowley, Lauren A. YR 2022 UL http://medrxiv.org/content/early/2022/08/25/2022.08.23.22279111.abstract AB Salmonella enterica serovar Enteritidis is one of the most frequent causes of salmonellosis globally and is commonly transmitted from animals to humans by the consumption of contaminated foodstuffs. Herein, we detail the development and application of a hierarchical machine learning model to rapidly identify and trace the geographical source of S. Enteritidis infections from whole genome sequencing data. 2,313 S. Enteritidis genomes collected by the UKHSA between 2014 and 2019 were used to train a ‘local classifier per node’ hierarchical classifier to attribute isolates to 4 continents, 11 sub-regions and 38 countries (53 classes). Highest classification accuracy was achieved at the continental level, followed by the sub-regional and country levels (macro F1: 0.954, 0.718 and 0.661, respectively). A number of countries commonly visited by UK travellers were predicted with high accuracy (hF1: >0.9). Longitudinal analysis and validation with publicly accessible international samples indicated that predictions were robust to prospective external datasets. The hierarchical machine learning framework provides granular geographical source prediction directly from sequencing reads in <4 minutes per sample, facilitating rapid outbreak resolution and real-time genomic epidemiology. Competing Interest Statement: The authors have declared no competing interest. Funding Statement: This work was funded by an Academy of Medical Sciences Springboard grant (SBF005\1089). 
CJ, TD and MAC are affiliated to the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Gastrointestinal Infections and Genomics and Enabling Data at University of Liverpool and University of Warwick respectively in partnership with the UK Health Security Agency (UKHSA). CJ and MAC are based at UKHSA. The views expressed are those of the author(s) and not necessarily those of the NIHR, the Department of Health and Social Care or the UK Health Security Agency. Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: The initial dataset consisted of 10,223 S. Enteritidis isolates collected and sequenced by UKHSA between 2014 and 2019 as a part of their routine disease monitoring programme. Raw read data was downloaded from the Short Read Archive (Bioproject: PRJNA248792) (Table S3). I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. 
I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. Data Availability: The final optimised hierarchical model as well as a pipeline for pre-processing raw read data to unitigs/patterns for input is available from https://github.com/SionBayliss/HierarchicalML with a short description and tutorial for ease of use. This end-to-end process, from FASTQ to prediction, is open access and available to users. Short read sequencing data is available from the Short Read Archive under Bioproject PRJNA248792.