PT - JOURNAL ARTICLE
AU - Banerjee, Amitava
AU - Chen, Suliang
AU - Dashtban, Muhammad
AU - Pasea, Laura
AU - Thygesen, Johan H
AU - Fatemifar, Ghazaleh
AU - Tyl, Benoit
AU - Dyszynski, Tomasz
AU - Asselbergs, Folkert W.
AU - Lund, Lars H.
AU - Lumbers, Tom
AU - Denaxas, Spiros
AU - Hemingway, Harry
TI - Identifying subtypes of heart failure with machine learning: external, prognostic and genetic validation in three electronic health record sources with 320,863 individuals
AID - 10.1101/2022.06.27.22276961
DP - 2022 Jan 01
TA - medRxiv
PG - 2022.06.27.22276961
4099 - http://medrxiv.org/content/early/2022/06/28/2022.06.27.22276961.short
4100 - http://medrxiv.org/content/early/2022/06/28/2022.06.27.22276961.full
AB - Background Reliable identification of heart failure (HF) subtypes might allow targeted management. Machine learning (ML) has been used to explore HF subtypes, but neither across large, independent, population-based datasets, nor across the full spectrum of causes and presentations, nor with clinical and non-clinical validation by different ML methods. Using our published framework, we identified and validated HF subtypes to address these gaps. Methods We analysed individuals ≥30 years with incident HF from two population-based electronic health records resources (1998-2018; Clinical Practice Research Datalink, CPRD: n=188,799 HF cases; The Health Improvement Network, THIN: n=124,263 HF cases). Pre- and post-HF factors (n=645) included demography, history, examination, blood laboratory values and medications. We identified subtypes using four unsupervised ML methods (K-means, hierarchical, K-Medoids and mixture model clustering) with 87 (from 645) factors in each dataset.
We evaluated subtypes for: (i) external validity (across independent datasets); (ii) prognostic validity (predictive accuracy for 1-year mortality); and (iii) uniquely, genetic validity (in UK Biobank; n=9573 cases): association with polygenic risk score (PRS) for 11 HF-related traits, and direct association with 12 reported HF single nucleotide polymorphisms (SNPs). Findings After identifying five clusters, we labelled HF subtypes: (1) Early-onset, (2) Late-onset, (3) AF-related, (4) Metabolic, and (5) Cardiometabolic. External validity: Subtypes were similar across datasets (c-statistic: 0.94, 0.80, 0.79, 0.83, 0.92 for the THIN model in CPRD and 0.79, 0.92, 0.90, 0.89, 0.92 for the CPRD model in THIN for subtypes 1-5, respectively). Prognostic validity: One-year all-cause mortality, risk of non-fatal cardiovascular diseases and all-cause hospitalisation (before and after HF diagnosis) differed across subtypes in CPRD and THIN data. Genetic validity: The AF-related subtype showed associations with PRS for related traits. Late-onset and Cardiometabolic subtypes were most comparable and strongly associated with PRS for Hypertension, Myocardial Infarction and Obesity (p-value < 9.09 × 10⁻⁴). We developed a prototype for clinical use, which could enable evaluation of effectiveness and cost-effectiveness. Interpretation Across four methods and three datasets, and including genetic data, in the largest HF study to date, ML algorithms identified five subtypes in individuals with incident HF. These subtypes may inform aetiologic research, clinical risk prediction and the design of HF trials. Funding European Union Innovative Medicines Initiative. Evidence before this study In a systematic review until December 2019, we showed that studies of machine learning in subtyping and risk prediction in cardiovascular diseases are limited by small population size, relatively few factors and poor generalisability of findings due to lack of external validation.
We further searched PubMed, medRxiv, bioRxiv and arXiv for relevant peer-reviewed articles and preprints, focusing on machine learning studies in heart failure. Studies remain focused on single diseases and limited risk factors, often use a single method of machine learning, rarely use subtyping and risk prediction together, and have not been externally validated across datasets. For heart failure, all subtype discovery studies have identified subtypes based on clustering, but so far with no application to clinical practice. Added value of this study Across two independent, population-based datasets, we used four machine learning methods for subtyping and risk prediction with 89 aetiologic factors as well as 556 further factors for heart failure. We identified and validated five subtypes in incident heart failure, which differentially predicted outcomes. In addition, we externally validated clinical cluster differences by exploring corresponding genetic differences in a large-scale genetic cohort. Our methods and results highlight the potential value of electronic health records and machine learning in understanding disease subtypes. Moreover, our approach to external, prognostic, and genetic validity provides a framework for validation of machine learning approaches for disease subtype discovery. Implications of all the available evidence Our analyses support coordinated use of large-scale, linked electronic health records to identify and validate disease subtypes with relevance for clinical risk prediction, patient selection for trials, and future genetic research. Competing Interest Statement AB is supported by research funding from the National Institute for Health Research (NIHR), British Medical Association, AstraZeneca, and UK Research and Innovation. BT and TD are employees of Bayer.
All other authors declare no competing interests. Funding Statement European Union Innovative Medicines Initiative. Author Declarations I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes. The details of the IRB/oversight body that provided approval or exemption for the research described are given below: Approvals were by: (i) MHRA Independent Scientific Advisory Committee [18_217R]: Section 251 (NHS Social Care Act 2006), (ii) Scientific Review Committee [17THIN038-A1] and (iii) UKB 15422: Patient informed consent was not required or provided. I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes. I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes. I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes. All data produced in the present work are contained in the manuscript.