Abstract
Clonal hematopoiesis (CH) is the clonal expansion of hematopoietic stem cells driven by somatic mutations affecting certain genes. Recently, CH has been linked to the development of a number of hematologic malignancies, cardiovascular diseases and other conditions. Although the most frequently mutated CH driver genes have been identified, a systematic landscape of the mutations capable of initiating this phenomenon is still lacking. Here, we train high-quality machine-learning models for 12 of the most recurrent CH driver genes to identify their driver mutations. These models outperform an experimental base-editing approach and expert-curated rules based on prior knowledge of the function of these genes. Moreover, their application to identify CH driver mutations across almost half a million donors of the UK Biobank reproduces known associations of CH driver mutations with age and with the prevalence of several diseases and conditions. We thus propose that these models support the accurate identification of CH across healthy individuals.
Significance
We developed and validated 12 gene-specific machine learning models to identify CH driver mutations, showing their advantage with respect to expert-curated rules. These models can support the identification and clinical interpretation of CH mutations in newly sequenced individuals.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
N.L.-B. acknowledges funding from the European Research Council (consolidator grant 682398). S. D. was supported by a Juan de la Cierva fellowship from the Spanish Ministerio de Ciencia e Innovacion (IJC2020-044728-I). J. E. R-Z. was supported by a Postdoctoral AECC 2023 fellowship from the Fundacion Cientifica Asociacion Espanola Contra el Cancer (AECC) (POSTD234814RAMI). This project was supported by the CHEMOHEALTH project, funded by the Spanish Ministry of Science (MCIN), AEI /10.13039/501100011033/ and by FEDER, the AtheroClonal project, also funded by the Spanish Ministry of Science (MCIN), AEI and the European Union NextGenerationEU/PRTR, and the MyoClonal project, funded by la Caixa, HR22-00732. It has also been supported by the project Discovering the molecular signatures of cancer PROMotion to INform prevENTion (PROMINENT), funded by Cancer Research UK (CGCATF-2021/100008), the National Cancer Institute (1OT2CA278668-01) and the Spanish Cancer Association, AECC. IRB Barcelona is a recipient of a Severo Ochoa Centre of Excellence Award from the Spanish Ministry of Economy and Competitiveness (MINECO; Government of Spain) and an Excellence Institutional grant from the Asociacion Espanola Contra el Cancer, and is supported by CERCA (Generalitat de Catalunya).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Blood somatic mutation data required to train boostDM-CH models is available through Hartwig Medical Foundation (HMF) and dbGaP, following the same procedure used to access the original datasets employed in the reverse calling approach. HMF blood somatic mutations are available as part of the data access request to HMF (https://www.hartwigmedicalfoundation.nl). TCGA blood somatic mutations are available through dbGaP (phs002867) to researchers who have obtained permission to access protected TCGA data. Panel-sequenced data from the IMPACT targeted cohort is available through cBioPortal (https://www.cbioportal.org/study/summary?id=msk_ch_2020g). Data in the UK Biobank and Japanese Biobank analyses is available upon access request to both entities (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access and https://biobankjp.org/en/info/offer.html, respectively).
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced in this study is available at www.intogen.org/ch/boostdm