Machine learning framework for predicting the presence of high-risk clonal haematopoiesis using complete blood count data: a population-based study of 431,531 UK Biobank participants

William G. Dunn; Isabella Withnell; Muxin Gu; Pedro Quiros; Sruthi Cheloor Kovilakam; Ludovica Marando; Sean Wen; Margarete A Fabre; Irina Mohorianu; Dragana Vuckovic; George S. Vassiliou

doi:10.1101/2024.09.30.24314606

Abstract

Background Clonal haematopoiesis (CH), the disproportionate expansion of a haematopoietic stem cell and its progeny, driven by somatic DNA mutations, is a common age-related phenomenon that engenders an increased risk of developing myeloid neoplasms (MN). At present, CH is identified by targeted sequencing of peripheral blood DNA, which is impractical to apply at population scale. The complete blood count (CBC) is an inexpensive, widely used clinical test. Here, we explore whether machine learning (ML) approaches applied to CBC data could predict individuals likely to harbour CH and prioritise them for DNA sequencing.

Methods The UK Biobank was filtered to identify 431,531 participants with paired CBC and whole exome sequencing (WES) data. Somatic mutations had previously been identified from blood WES using Mutect2, allowing individuals with CH driver mutations to be classified. Using 18 CBC indices/features and basic demographics (age and sex), we trained a range of tree-based ML classifiers to infer, as a binary output, the presence or absence of CH.
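The classification setup described above can be sketched as follows. This is a minimal illustration only, assuming scikit-learn and NumPy are available; the data are synthetic, and the label-generating rule, the train/test split, and the hyperparameters (`n_estimators`, `class_weight`) are hypothetical choices, not those used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_cbc = 5000, 18

# Synthetic stand-ins for 18 CBC indices plus age and sex (not UK Biobank data).
cbc = rng.normal(size=(n_samples, n_cbc))
age = rng.uniform(40, 70, size=n_samples)
sex = rng.integers(0, 2, size=n_samples)
X = np.column_stack([cbc, age, sex])  # 20 features in total

# Hypothetical rule generating a rare binary label (CH present / absent).
logit = -4.0 + 1.5 * cbc[:, 0] + 1.0 * cbc[:, 1] + 0.03 * (age - 55)
y = (rng.random(n_samples) < 1 / (1 + np.exp(-logit))).astype(int)

# Stratify so the rare positive class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" is an assumed way to handle class imbalance.
clf = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
clf.fit(X_train, y_train)

# Predicted probability of the positive class, scored by ROC AUC.
proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"AUC on synthetic held-out data: {auc:.2f}")
```

The AUC printed here reflects only the synthetic signal built into the toy labels and says nothing about performance on real CBC data.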

Findings Using Random Forest (RF) classifiers, we predicted the presence/absence of CH driven by mutations in one of five genes known to confer a high risk of incident MN (JAK2, CALR, SF3B1, SRSF2 and U2AF1). We subsequently developed a unified, optimised RF classifier for high-risk CH driven by any of these genes and assessed its performance (median AUC 0.85). However, the low prevalence of high-risk CH implies that our model cannot be generalised to population scale without compromising its sensitivity (20.1% using a stringent cutoff probability score).
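The trade-off between a stringent cutoff probability score and sensitivity can be illustrated with a toy example. The scores and labels below are invented for illustration and are not study data; the helper name `sensitivity_specificity` is likewise hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Return (sensitivity, specificity) when scores >= cutoff are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in pairs if y == 1 and s < cutoff)
    tn = sum(1 for s, y in pairs if y == 0 and s < cutoff)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Invented classifier scores: five true positives, then five true negatives.
scores = [0.9, 0.6, 0.4, 0.3, 0.2, 0.7, 0.35, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

lenient = sensitivity_specificity(scores, labels, cutoff=0.3)
stringent = sensitivity_specificity(scores, labels, cutoff=0.8)
print("lenient cutoff 0.3:", lenient)      # (0.8, 0.6)
print("stringent cutoff 0.8:", stringent)  # (0.2, 1.0)
```

Raising the cutoff removes false positives in this toy set but also discards most true positives, which is the pattern the sensitivity figure above points to.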

Interpretation We showcase a proof-of-concept that the presence of high-risk CH can be inferred from CBC perturbations using RF classifiers. The future integration of raw blood cell analyser data can help improve the performance of our model and facilitate its application at scale.

Funding Cancer Research UK.

Evidence before this study We searched PubMed for articles published in English between database inception and 5 June 2024, using the terms “clonal hematopoiesis” AND (“machine learning” OR “artificial intelligence”). We additionally searched for the terms “clonal hematopoiesis” AND “complete blood count”. We found 18 research articles: one article used ML approaches (XGBoost classifiers) to differentiate clonal haematopoiesis “driver” mutations from “passenger” mutations, but none linked machine learning frameworks to complete blood count data for predicting the presence of clonal haematopoiesis. Progression from clonal haematopoiesis to myeloid neoplasia is known to be associated with several blood count parameters; two recent publications developed clonal haematopoiesis risk stratification tools that incorporated blood count indices in their final risk prediction models (Gu et al. Nature Genetics, Weeks et al. NEJM Evidence). However, we found no study assessing whether blood count indices could be used to infer the presence of clonal haematopoiesis.

Added value of this study Here we show that CH driven by mutations in genes associated with a high risk of progression to myeloid neoplasia can be reliably differentiated using ML approaches applied to peripheral blood indices; however, low-risk forms of CH (driven by mutations in the DNMT3A or TET2 genes) cannot be reliably inferred from CBC indices. While optimising the model, we identified challenges in scaling up its applicability; we propose that the integration of single-cell resolution “raw” blood analyser data might overcome these issues. Previous efforts to enhance the scalability of CH screening focused on reducing DNA sequencing costs. Here, we provide a proof-of-concept that machine learning approaches applied to an extensively used clinical test, the CBC, can predict which individuals are more likely to harbour high-risk CH and should therefore be prioritised for genetic testing.

Implications of all the available evidence Our study proposes a model for predicting high-risk CH mutations by applying a Random Forest classifier to CBC indices; this represents an important step towards scalable screening to identify individuals at high risk of developing myeloid neoplasia in the future. This is an attractive approach, as it relies solely on a routine, inexpensive test. Despite good sensitivity, the low prevalence of high-risk CH leads to a low positive predictive value that precludes the use of the predictive model as a population-wide pre-screening tool. To overcome this, we propose the future integration of raw blood analyser data into models like ours to improve the performance and scalability of this approach.
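The dependence of positive predictive value (PPV) on prevalence follows from Bayes' rule and can be made concrete with a short calculation. The sensitivity, specificity, and prevalence values below are purely illustrative assumptions, not figures reported by the study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = TP / (TP + FP), with rates expressed per unit of population."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same hypothetical test (80% sensitivity, 90% specificity), two prevalences.
ppv_rare = positive_predictive_value(0.80, 0.90, prevalence=0.01)
ppv_common = positive_predictive_value(0.80, 0.90, prevalence=0.20)

print(f"PPV at 1% prevalence:  {ppv_rare:.3f}")    # about 0.075
print(f"PPV at 20% prevalence: {ppv_common:.3f}")  # about 0.667
```

With identical test characteristics, most positive calls are false positives when the condition is rare, which is why a low-prevalence target limits use as a population-wide pre-screening tool.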

Competing Interest Statement

G.S.V. is a consultant to STRM.BIO and holds a research grant from AstraZeneca for research unrelated to that presented here. S.W. is an employee of AstraZeneca. M.A.F. is an employee and stockholder of AstraZeneca. The other authors declare no competing interests.

Funding Statement

WGD is funded by a Clinical Research Fellowship from the Cancer Research UK Cambridge Centre (CTRQQR-2021\100012). GSV is supported by a Cancer Research UK Senior Cancer Fellowship (C22324/A23015), and work in his laboratory is also funded by the Leukemia Lymphoma Society, Blood Cancer UK, European Research Council, Cancer Research UK, Kay Kendall Leukemia Fund, AstraZeneca and Wellcome Trust.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study used data from the United Kingdom (UK) Biobank, accessed under approved application numbers 56844 and 69328. UK Biobank has approval from the North West Multi-centre Research Ethics Committee (MREC) as a Research Tissue Bank (RTB). This approval means that researchers do not require separate ethical clearance and can operate under the RTB approval.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data used in this study are publicly available from the UK Biobank (https://www.ukbiobank.ac.uk/). Researchers may apply for access to the UK Biobank data via the Access Management System (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access).