Abstract
Background Clonal haematopoiesis (CH), the disproportionate expansion of a haematopoietic stem cell and its progeny, driven by somatic DNA mutations, is a common age-related phenomenon that engenders an increased risk of developing myeloid neoplasms (MN). At present, CH is identified by targeted sequencing of peripheral blood DNA, which is impractical to apply at population scale. The complete blood count (CBC) is an inexpensive, widely used clinical test. Here, we explore whether machine learning (ML) approaches applied to CBC data could predict individuals likely to harbour CH and prioritise them for DNA sequencing.
Methods The UK Biobank was filtered to identify 431,531 participants with paired CBC and whole exome sequencing (WES) data. Somatic mutations were previously identified from blood WES using Mutect2 to classify individuals with CH driver mutations. Using 18 CBC indices/features and basic demographics (age and sex), we trained a range of tree-based ML classifiers to infer, as a binary output, the presence/absence of CH.
Findings Using Random Forest (RF) classifiers, we predicted the presence/absence of CH driven by mutations in one of five genes known to confer a high risk of incident MN (JAK2, CALR, SF3B1, SRSF2 and U2AF1). We subsequently developed a unified, optimised RF classifier for high-risk CH driven by any of these genes and assessed its performance (median AUC 0.85). However, the low prevalence of high-risk CH implies that our model cannot be generalised to population scale without compromising its sensitivity (20.1% using a stringent cut-off probability score).
Interpretation We showcase a proof-of-concept that the presence of high-risk CH can be inferred from CBC perturbations using RF classifiers. The future integration of raw blood cell analyser data can help improve the performance of our model and facilitate its application at scale.
Funding Cancer Research UK.
Evidence before this study We searched PubMed for articles published, in English, between database inception and 5th of June 2024, using the terms “clonal hematopoiesis” AND (“machine learning” OR “artificial intelligence”). We additionally searched for the terms “clonal hematopoesis” AND “complete blood count”. We found 18 research articles: one article used ML approaches (XGBoost classifiers) to differentiate clonal haematopoiesis “driver” mutations from “passenger” mutations, but none linked machine learning frameworks to complete blood count data for predicting the presence of clonal haematopoiesis. Progression from clonal haematopoiesis to myeloid neoplasia is known to be associated with several blood count parameters; two recent publications developed clonal haematopoiesis risk stratification tools that incorporated blood count indices in their final risk prediction models (Gu et al. Nature Genetics, Weeks et al. NEJM Evidence). However, we found no study assessing whether blood count indices could be used to infer the presence of clonal haematopoiesis.
Added value of this study Here we show that CH driven by mutations in genes associated with a high risk of progression to myeloid neoplasia can be reliably differentiated using ML approaches applied to peripheral blood indices; however, low-risk forms of CH (driven by mutations in the DNMT3A or TET2 genes) cannot be reliably inferred from CBC indices. While optimising the model we identified challenges in upscaling its applicability; we propose that the integration of single-cell resolution “raw” blood analyser data might overcome these issues. Previous efforts to enhance the scalability of CH screening focused on reducing DNA sequencing costs. Here, we provide a proof-of-concept that an extensively used clinical test, the CBC, can, using machine learning approaches, predict individuals more likely to harbour high-risk CH, who should be prioritised for genetic testing.
Implications of all the available evidence Our study proposes a model for predicting high-risk CH mutations by applying a Random Forest classifier to CBC indices; this represents an important step towards scalable screening for identifying individuals at high risk of developing myeloid neoplasia in the future. This is an attractive approach, as it relies solely on a routine, inexpensive test. Despite good sensitivity, the low prevalence of high-risk CH leads to a low positive predictive value that precludes the use of the predictive model as a population-wide pre-screening tool. To overcome this, we propose the future integration of raw blood analyser data into models like ours to improve the performance and scalability of this approach.
Introduction
Haematopoiesis, the formation of the cellular components of blood, occurs continuously throughout life. At steady state, haematopoiesis generates 4-5 × 10^11 cells per day1–4 and this vast output is maintained by a small pool of 50,000-200,000 multipotent haematopoietic stem cells (HSCs)5 through a cascade of differentiation and proliferation. Somatic mutations accumulate during life, and though most are inconsequential, some can enhance cellular fitness and are positively selected in physiologically normal tissues6–8. Clonal haematopoiesis (CH) is an age-related phenomenon that arises when an HSC acquires a somatic driver mutation (i.e. one that increases its fitness), leading to clonal expansion of the cell and its progeny9,10. Large population-based studies revealed that the most commonly mutated genes in CH are involved in epigenetic regulation (DNMT3A, TET2, ASXL1), signal transduction (JAK2, GNB1), DNA damage response and apoptosis (TP53, PPM1D), and splicing (SF3B1, SRSF2, U2AF1)9–14. The prevalence of CH increases with advancing age to affect at least 20% of those over 70 years, in whom the phenomenon is almost universally detectable when deep sequencing approaches are employed9–14.
A hallmark of CH is the associated increased risk of incident myeloid neoplasms (MN), a molecularly heterogeneous group of blood cancers that include acute myeloid leukaemia (AML), myelodysplastic syndromes (MDS) and myeloproliferative neoplasms (MPN). The overall rate of progression to MN is low (0.5-1% per annum)9, but the risk and nature of malignant progression vary according to the mutant driver gene, the size of the clone, and the selection pressures to which the clone is exposed15,16. Recent advances have facilitated the precise estimation of the risk of progression from CH to MN16,17, such that individuals at high risk can be identified and prioritised for clinical follow-up. CH may precede the development of MN by years9–11,15,16,18, and this provides a window during which high-risk clones could be intercepted and targeted to avert or delay the development of MN.
A key impediment to prospective myeloid cancer prevention programmes is the lack of a scalable test to identify CH. At present, CH is identified by Next Generation Sequencing (NGS) of blood DNA targeted to a panel of genes recurrently mutated in MN. However, NGS is not performed in routine clinical practice and is impractical and costly to perform at scale. An alternative approach is to leverage low-cost, scalable, routine clinical tests to identify individuals likely to harbour CH who can be prioritised for sequencing. The complete blood count (CBC) is an inexpensive, routine clinical test, and CBC indices such as the red cell distribution width (RDW) and mean cell volume (MCV) are associated with progression from CH to MN18. We therefore sought to explore whether machine learning (ML) models could predict individuals with CH based on CBC features by analysis of paired CBC and whole exome sequencing (WES) data from 431,531 United Kingdom Biobank (UKB) participants.
Methods
Study design and participants
We utilised data from the UKB (https://www.ukbiobank.ac.uk/), a population-based cohort of 502,536 volunteers recruited in the United Kingdom between 2006 and 2010 and aged between 37 and 73 years at recruitment19. Participants’ data were accessed under approved UKB application numbers 56844 and 69328.
To derive a dataset for use in our ML pipeline, we excluded UKB participants with any missing CBC variables and those without WES data. Since CH is defined by the presence of a leukaemia-associated somatic driver mutation in an individual without an apparent blood neoplasm, participants with a previous diagnosis of a haematological malignancy were excluded from the final dataset, as were those who developed an incident haematological malignancy within 30 days of recruitment to the UKB. After exclusions, 431,531 participants were retained for downstream analyses.
Variable selection
We extracted all CBC variables measured in the UKB (n=22), and augmented the feature set with the participants’ age and sex. Some CBC variables are closely related or derived from one another; to assess collinearity we computed pairwise Spearman’s rank correlation coefficients (rs) and, for each pair of variables with |rs| ≥ 0.9, excluded one of the two. This led us to exclude haematocrit, high light scatter reticulocyte count and the total white blood cell count, whilst retaining their highly correlated counterpart features (haemoglobin concentration, reticulocyte count and neutrophil count, respectively). Nucleated red blood cell count (NRBC) was also excluded as it exhibited near-zero variance (106 unique values, NRBC=0 in 98.9% of UKB participants).
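The collinearity screen described above can be illustrated with a minimal sketch. It is written in Python using only the standard library, whereas the study itself used R; the function names and the toy values are hypothetical and are not UKB data. The sketch only flags pairs with |rs| ≥ 0.9; which member of a pair to drop remains a manual choice, as in the text.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def collinear_pairs(table, threshold=0.9):
    """List feature pairs whose |r_s| meets or exceeds the threshold."""
    flagged = []
    for a, b in combinations(sorted(table), 2):
        r = spearman(table[a], table[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged


# Toy values invented for illustration only (not UKB data).
toy = {
    "haemoglobin": [13.1, 14.2, 15.0, 12.4, 16.1, 13.8],
    "haematocrit": [39.0, 42.5, 44.8, 37.1, 47.9, 41.2],
    "platelet_count": [250, 310, 190, 280, 220, 405],
}
flagged = collinear_pairs(toy)
print(flagged)
```

In this toy table only the haemoglobin/haematocrit pair is flagged, mirroring the kind of redundancy that led to haematocrit being excluded.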
Identification of clonal haematopoiesis from whole exome sequencing data
CH was identified from whole exome sequencing (WES) of blood DNA from 431,531 UKB participants as previously described16 (see Supplement). UKB participants were subsequently labelled as “any-driver-CH” or “no CH” based on the presence or absence of one or more driver mutations at a variant allele frequency (VAF) ≥ 2%. For input to gene-specific models of CH, we additionally labelled UKB participants by driver gene (e.g. “TET2-CH”, “SRSF2-CH”, etc. vs “no CH”). Individuals with ≥2 driver mutations were labelled according to the gene with the highest VAF.
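The labelling rule above can be sketched as follows. This is an illustrative Python function, not the authors' code; the name `label_participant` and the `(gene, vaf)` input format are assumptions, and the handling of exact VAF ties (first call wins) is not specified in the text.

```python
def label_participant(driver_calls, min_vaf=0.02):
    """Return a gene-specific CH label from a list of (gene, vaf) driver calls.

    Calls below min_vaf are ignored. When several drivers qualify, the gene
    carrying the highest VAF determines the label (ties: first call wins,
    which is an assumption of this sketch).
    """
    qualifying = [(gene, vaf) for gene, vaf in driver_calls if vaf >= min_vaf]
    if not qualifying:
        return "no CH"
    top_gene, _ = max(qualifying, key=lambda call: call[1])
    return f"{top_gene}-CH"


# Hypothetical participants, for illustration only.
print(label_participant([("TET2", 0.04), ("SRSF2", 0.12)]))
print(label_participant([("DNMT3A", 0.01)]))
print(label_participant([]))
```

Raising `min_vaf` to 0.10 would give the corresponding "large clone" labels under the same rule.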
Supervised machine learning model development
Having derived ground truth labels from WES data, ML models were subsequently built for “any-driver-CH” (variant allele frequency, VAF, ≥ 2% with a driver mutation in any CH gene), “large clone any-driver-CH” (as previous but VAF ≥ 10%), and each driver gene CH subtype.
To develop a binary classifier for predicting the presence/absence of CH, we trained and evaluated a selection of tree-based machine learning models: Decision Trees, Random Forests and Extreme Gradient Boosting (XGBoost) Trees. Tree-based approaches were preferred since the set of input features was heterogeneous (continuous and categorical); moreover, these models, augmented with statistical analyses, may also capture the interaction between features. Aside from the assessment of near-zero variance and collinearity, no further pre-processing was applied to the input dataset.
All 18 CBC parameters were used as features, in addition to basic demographic data (age at sampling and sex). Since the UKB CH dataset was imbalanced, with significantly more controls (no CH) than cases (CH), random down-sampling was performed to achieve a 1:1 ratio of cases:controls in the input data, to enhance model training and convergence; this down-sampling process was repeated ten times iteratively (Supplementary Figure 1). Subsequently, down-sampled datasets were partitioned in an 80:20 training:test ratio.
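A minimal sketch of the repeated down-sampling and 80:20 partition is given below, in Python with the standard library only. The function name, participant identifiers and seeds are hypothetical, and the split shown is a simple random split; whether the original pipeline stratified the partition by class is not stated here.

```python
import random


def downsample_and_split(cases, controls, seed, test_fraction=0.2):
    """Down-sample controls to a 1:1 ratio with cases, then split 80:20."""
    rng = random.Random(seed)
    sampled_controls = rng.sample(controls, len(cases))
    rows = [(pid, 1) for pid in cases] + [(pid, 0) for pid in sampled_controls]
    rng.shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    # First n_test shuffled rows form the test set, the rest the training set.
    return rows[n_test:], rows[:n_test]


# Hypothetical identifiers; ten iterations, each with a different control draw.
cases = [f"case_{i}" for i in range(50)]
controls = [f"ctrl_{i}" for i in range(5000)]
splits = [downsample_and_split(cases, controls, seed=s) for s in range(10)]
train, test = splits[0]
print(len(train), len(test))
```

Using a different seed per iteration draws a different subset of controls each time, which is what allows the stability of a model to be assessed across down-samples.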
All models were built using ten repeats of ten-fold cross-validation setups; a grid-search approach was used to tune the relevant hyperparameters (Supplementary Table 1). To avoid technical bias from the down-sampling step, a modified cross-validation was applied, training and evaluating each ML model ten times iteratively, each time using a different random down-sample of the majority (control) class, thereby quantifying the robustness and stability of each model to variation in the subset of control samples or train/test partition (Supplementary Figure 1). Model performance was assessed on the unseen test data, on receiver operating characteristic (ROC) curves and area under the curve (AUC), in addition to sensitivity and specificity.
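For readers less familiar with these metrics, the sketch below shows one way to compute AUC, sensitivity and specificity from predicted scores. It is an illustrative Python implementation with invented labels and scores; the 0.5 cut-off is an arbitrary default of this sketch, not a value taken from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as one half; this is equivalent to the Mann-Whitney U statistic
    divided by the number of case-control pairs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, cutoff=0.5):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


# Invented labels (1 = CH, 0 = no CH) and prediction scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(auc(labels, scores))
print(sensitivity_specificity(labels, scores))
```

AUC summarises ranking quality across all cut-offs, whereas sensitivity and specificity depend on the particular cut-off chosen.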
From the Random Forests models, we determined variable importance by computing the mean decrease in node impurity from splitting on each feature (measured by Gini index), averaged across all trees and across each of the ten repeats of model-building, using the importance() function from the randomForest package (v4.7.1) in R21. The consistency of top-ranked variables across drivers was visualised using quantitative Venn diagrams (upset plots, ComplexUpset package) on the top two variables. Feature selection was performed by ranking all n features by importance, in descending order, and iteratively excluding the least informative feature, to determine a minimum set of highly predictive features.
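To make the importance measure concrete, the sketch below computes the Gini impurity decrease for a single split of a single node. It is a simplified Python illustration with invented values; a Random Forest accumulates this quantity over every split that uses a feature and averages it across trees, which this sketch does not attempt to reproduce.

```python
def gini(labels):
    """Gini impurity of binary labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting one node at values <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - child


# Invented feature values; labels are 1 = CH, 0 = no CH.
informative = gini_decrease(
    [12.5, 13.0, 13.2, 14.8, 15.5, 16.1], [0, 0, 0, 1, 1, 1], threshold=14.0
)
uninformative = gini_decrease(
    [1, 2, 3, 4, 5, 6], [0, 1, 0, 1, 0, 1], threshold=3.5
)
print(informative, uninformative)
```

A feature whose splits separate cases from controls cleanly yields a large decrease, whereas a feature unrelated to the label yields a decrease close to zero.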
To assess the scalability of the final model in a “real-life” setting, i.e. with class imbalance (more controls (no CH) than cases (CH)), we added unseen controls to the test set to match the prevalence of CH cases in the test set to the prevalence of CH cases in the UKB cohort. We examined the trade-off of sensitivity (which is independent of prevalence), positive predictive value (which is dependent on prevalence) and the model prediction score, using this to determine the optimal cut-off score that minimises false positives whilst retaining adequate sensitivity.
All ML models were built in R v3.6.3 using the Caret v6.0.91 package20. A full list of packages used is available in Supplementary Methods. All code used to implement our ML framework is publicly available on GitHub: https://github.com/billydunn/chic.
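The trade-off between sensitivity and positive predictive value across cut-offs can be sketched as follows. This Python example uses invented scores for a small imbalanced test set (4 cases, 16 controls); the function name and cut-off values are hypothetical and chosen only to show the pattern.

```python
def sweep_cutoffs(labels, scores, cutoffs):
    """Return (cutoff, sensitivity, ppv) rows; PPV is None if nothing is flagged."""
    rows = []
    n_cases = sum(labels)
    for c in cutoffs:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= c)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= c)
        ppv = tp / (tp + fp) if (tp + fp) else None
        rows.append((c, tp / n_cases, ppv))
    return rows


# Invented prediction scores: 4 cases and 16 controls.
case_scores = [0.95, 0.90, 0.70, 0.40]
control_scores = [0.92, 0.75, 0.60, 0.55] + [0.10] * 12
labels = [1] * len(case_scores) + [0] * len(control_scores)
scores = case_scores + control_scores

rows = sweep_cutoffs(labels, scores, cutoffs=[0.5, 0.8, 0.93])
for cutoff, sens, ppv in rows:
    print(cutoff, sens, ppv)
```

In this toy example, raising the cut-off removes false positives faster than true positives, so PPV rises while sensitivity falls; the chosen cut-off is a compromise between the two.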
Results
After excluding those with missing CBC data (n=32,670), missing WES data (n=36,368), or a prevalent diagnosis of a haematological malignancy (n=1840), CH (VAF ≥2%) was identified in 20,860/431,531 (4.8%) UKB participants, of whom 7637/20,860 (36.6%) had large clone CH (VAF ≥10%; Figure 1, Table 1). Using this UKB dataset, we developed a range of tree-based models using our ML framework, which we henceforth refer to as CHIC (Clonal Haematopoiesis Inference from Counts).
We firstly examined whether CH could be predicted from CBC data in the UKB using models agnostic to underlying driver mutations (henceforth “any-driver-CH”). Using CHIC we generated binary classifiers (CH/no CH) of any-driver CH using tree-based models with 18 CBC variables augmented with age and sex as features. Classifiers of any-driver CH were better than random, but with limited performance across all model types (median AUC on unseen test set 0.62, 0.64 and 0.62 for Decision Tree (DT), Random Forest (RF) and XGBoost (XGB) models respectively) (Figure 2A).
CH is a molecularly heterogeneous entity, and we posited that the nature and strength of the CBC phenotype conferred by a somatic mutation may vary according to the specific driver gene. We trained driver gene-specific binary classifiers (with labels driver gene CH/no CH) using the same input variables as for the any-driver CH models. The most prevalent forms of CH, driven by mutations in DNMT3A and TET2, were not robustly detectable; this conclusion held for DNMT3A-R882 hotspot mutations, which are associated with a slightly higher risk of transformation to AML16 (median AUC 0.60, 0.62 and 0.64 for DNMT3A-R882, DNMT3A-non R882 and TET2 RF models respectively) (Figure 2B). By contrast, classifiers of CH driven by lower prevalence but higher risk driver mutations in the genes JAK2, CALR, SF3B1, SRSF2 and U2AF1 performed well (median AUC 0.94, 0.91, 0.84, 0.82, 0.84 respectively for RF models) (Figure 2B). Since Random Forests (RF) models generally exhibited the best performance across the driver genes (Figure 2B, Supplementary Table 2), we focused on further developing and exploring RF models.
CH is strongly associated with age, whilst some driver genes exhibit sex bias. To understand the influence of age and sex in the RF models, we trained each set of driver gene-specific RF models in three iterations: i) with age and sex as the only features, ii) with CBC indices as the features, whilst age- and sex-matching cases to controls (to capture the predictive performance of CBC alone), and iii) with age, sex and CBC indices as features, without age- and sex-matching of cases/controls (to capture the predictive performance of both basic demographics and CBC indices). The performance of models trained with only age and sex as features was generally poor (median AUC <0.75 in all cases, Figure 2C); an exception was the age/sex-only model of SRSF2-CH, in line with the sharp rise in prevalence of SRSF2-CH with advancing age and its strong association with male sex15.
Classifiers of CH driven by high-risk genes JAK2, CALR, SF3B1, SRSF2 and U2AF1 also performed best when using CBC indices as features and age/sex matching cases to controls in the training and test sets. The predictability of the presence of CH driven by mutations in splicing factor genes (SF3B1, SRSF2 and U2AF1) was augmented when age and sex were added as features and age/sex matching was omitted. Given the predictive power of age and sex, we added these features to CBC indices in subsequent models.
Since CH with mutations in any of JAK2, CALR, SF3B1, SRSF2 or U2AF1 was more predictable from CBC indices and more clinically relevant (associated with high risk of progression to MN), we next combined all predictors into a single binary classifier of “high-risk CH”, to predict the presence/absence of a mutation in any of these five genes (training on input data labelled as “high-risk CH” vs “no high-risk CH”). The resulting median AUC was 0.85 on the unseen test set (Figure 3A); the model also predicted the presence of large (VAF ≥10%) high-risk clones (median AUC on unseen test set 0.90, Supplementary Figure 2).
To further refine the classifier of high-risk CH with VAF ≥2%, we performed iterative feature selection, incrementally excluding the least discriminative feature, to obtain the minimal stable set of highly discriminative features; this demonstrated that our classifier of high-risk CH had undiminished performance using only six features: age at blood sampling, red cell distribution width (RDW), platelet count, platelet distribution width (PDW), platelet crit and mean corpuscular haemoglobin (MCH) (Figure 3B-C). We therefore chose this compact high-risk CH model to explore further, selecting the model that most closely approximated the median AUC across the ten models built using our iterative pipeline.
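The iterative feature exclusion can be sketched as a simple backward-elimination loop. The Python below is a hedged illustration: `score_fn` stands in for refitting the classifier and returning its median AUC, the `tolerance` stopping rule is an assumption of this sketch (the text states only that performance was undiminished), and the toy gains are invented.

```python
def backward_elimination(ranked_features, score_fn, tolerance=0.01):
    """Drop the least important feature while the score stays near baseline.

    ranked_features: names ordered from most to least important.
    score_fn: callable taking a list of features and returning a score
              such as median AUC (hypothetical stand-in for model refitting).
    tolerance: largest acceptable drop below the full-model score (assumed).
    """
    baseline = score_fn(ranked_features)
    kept = list(ranked_features)
    while len(kept) > 1:
        candidate = kept[:-1]  # remove the current least important feature
        if score_fn(candidate) < baseline - tolerance:
            break
        kept = candidate
    return kept


# Invented per-feature contributions, used only to drive the toy score.
toy_gain = {
    "age": 0.15,
    "rdw": 0.10,
    "platelet_count": 0.08,
    "basophil_count": 0.002,
    "monocyte_count": 0.001,
}


def toy_score(features):
    """Toy additive score standing in for a refitted model's AUC."""
    return 0.5 + sum(toy_gain[f] for f in features)


ranked = sorted(toy_gain, key=toy_gain.get, reverse=True)
compact = backward_elimination(ranked, toy_score)
print(compact)
```

In a real pipeline each call to the scoring function would retrain and evaluate the model, so the loop trades computation for a smaller, more interpretable feature set.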
Next, we assessed the optimal prediction score cut-off (threshold) for our compact high-risk CH model by examining the trade-off between sensitivity and positive predictive value (PPV) (Figure 3D). In our UKB cohort, high-risk CH was rare (795/431,531 UKB participants, prevalence 0.18%): since the PPV is strongly influenced by the prevalence of positive cases, this necessitated the use of a stringent prediction score cut-off to minimise the number of false positives. To achieve this, we chose a cut-off probability of 0.925, giving a PPV of 8.1% and sensitivity of 20.1% in our unseen test cohort (n=86,306), whilst maintaining a specificity and negative predictive value (NPV) of >99.5% (Table 2).
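The dependence of PPV on prevalence follows from Bayes' theorem and can be sketched as below. The prevalence (795/431,531) and sensitivity (20.1%) are taken from the text; the specificity of 0.9958 is an assumed illustrative value chosen to be consistent with the reported ">99.5%", not a figure reported in the study.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


prevalence = 795 / 431_531          # high-risk CH prevalence in the cohort
assumed_specificity = 0.9958        # assumption: consistent with ">99.5%"

at_cohort_prevalence = ppv(0.201, assumed_specificity, prevalence)
at_balanced_classes = ppv(0.201, assumed_specificity, 0.5)
print(at_cohort_prevalence, at_balanced_classes)
```

Under these assumptions the PPV is roughly 8% at cohort prevalence but above 90% in a 1:1 case:control setting, which illustrates why performance on balanced data does not carry over to population-scale use.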
A key limitation of the UKB is the low WES coverage, with the driver genes JAK2, SF3B1 and U2AF1 all having a median coverage of ≤ 31 reads16, rendering variant calling insensitive to smaller clones. As such, we examined outcomes for the 365 “false positive” cases identified by our high-risk CH classifier, and found that 38/365 (10.4%) developed MN at a median of 5.2 years from sampling. By contrast, only 317/85,782 (0.4%) of “true negatives” developed MN. Since CH is the shared precursor of the vast majority of MNs, these observations strongly suggest that the “false positive” individuals had CH below the limit of detection of WES.
To further explore this hypothesis, we searched for low VAF hotspot mutations amongst the 38 individuals who developed MN but were not found to have a hotspot mutation by standard variant calling. To do so, we used “pileup” to detect hotspot mutant reads that were filtered out by the stringent criteria of standard calling; this revealed that 13/38 of the apparent false positives who developed incident MN had detectable CH mutations by this method, including 11 with driver mutations in JAK2, a low coverage gene. This strongly suggests that we underestimated our model performance due to the constraints of WES.
Further examination of cases identified by CHIC revealed an enrichment in cases with thrombocytosis, suggestive of undiagnosed or unannotated MPN rather than CH (Supplementary Figure 3). Similarly, a few cases had cytopenias that would fall into the diagnostic criteria for CCUS (clonal cytopenia of undetermined significance) or MDS22. To overcome this, we constrained our training/test sets to individuals without cytopenias, thrombocytosis or erythrocytosis (see Supplementary Methods) and retrained our high-risk classifier. This led to only a minor reduction in performance (median AUC on unseen test set 0.80, Supplementary Figure 4); however, it exacerbated the trade-off between sensitivity and PPV, leading to sensitivity and PPV of only 11.3% and 2.0% respectively at our proposed cut-off probability of 0.875 (Supplementary Figure 4).
In addition to their use for prediction, CHIC ML models could also uncover novel associations between driver mutations and CBC indices. By evaluating variable importance across all the driver-gene-specific classifiers, and summarising the overlap between the top two features in each model (Figure 4A), we observed known or expected associations: age was highly predictive across models, JAK2-CH and CALR-CH shared platelet count and platelet crit as important features, whilst MCV was predictive of SF3B1-CH. Unexpected associations were also revealed: for example, the basophil count was discriminative for predicting the presence of GNB1-CH only, whilst eosinophil count was discriminative for the presence of IDH2-CH. Examining the distribution of each of these CBC variables in the UKB, we found that individuals with GNB1-CH had a significantly increased basophil count (p = 5.93 × 10^−11, Wilcoxon Rank Sum test), with a 4.5-fold increase in the prevalence of basophilia >0.1 × 10^9/L and 8.6-fold increase in the prevalence of basophilia >0.2 × 10^9/L, relative to participants without a GNB1 driver mutation (13.8% vs 3.2% and 5.0% vs 0.6% for basophilia >0.1 and >0.2 × 10^9/L, n = 178/431,353 for GNB1 mutant/wild-type respectively) (Figure 4B). Individuals with IDH2-CH had significantly lower eosinophil counts (p = 3.63 × 10^−10, Wilcoxon Rank Sum test) and a propensity to eosinopenia, with 12/92 (13.0%) participants with IDH2-CH having absolute eosinopenia (eosinophils = 0 × 10^9/L) and 45/92 (48.9%) having an eosinophil count <0.1 × 10^9/L; by contrast, individuals without IDH2 mutations had rates of eosinopenia of 2.9%/20.8% for absolute/<0.1 eosinopenia respectively (n = 70/431,461 for IDH2 mutant/wild-type respectively) (Figure 4C).
Only IDH2-CH demonstrated a significant association between eosinophil count and clone size (rs = -0.51, p = 2.67 × 10^−7); we observed no such association between GNB1-CH and basophil count (rs = 0.13, p = 0.09) (Supplementary Figure 5), though of note basophils are the rarest of the white blood cell subsets and as a result their counts are zero-biased, which may have confounded any putative association.
Discussion
We developed the CHIC framework and assessed an RF classifier that predicts the presence of high-risk CH from just five CBC variables and an individual’s age. This approach, named Clonal Haematopoiesis Inference from Counts (CHIC), can discriminate between individuals with and without mutations in five CH genes associated with a high risk of developing MN. Notably, CHIC retained an ability to discriminate high-risk CH cases from controls even amongst individuals without cytopenias, erythrocytosis or thrombocytosis, suggesting it may highlight individuals that may not otherwise come to medical attention. CHIC is an important first step towards developing a scalable screening test to identify individuals likely to harbour high-risk CH, who would then be prioritised for targeted NGS. This would not only vastly reduce the number needed to screen (NNS) per case of high-risk CH identified, but it would also justify the need to perform genetic testing. Even with its current limitations, the use of CHIC with a stringent cut-off probability on individuals without cytopenia or thrombo-/erythrocytosis would still markedly reduce the NNS from 727 to 40 individuals per case of high-risk CH (based on the prevalence of high-risk CH in an unselected population vs in those predicted as having high-risk CH by CHIC). The implementation of a scalable screening test would represent a significant milestone in myeloid cancer prevention, by addressing a key bottleneck in recruitment to interventional studies.
However, despite its promising metrics as a screening test, the performance of CHIC in an unselected population was limited by the rarity of high-risk CH, necessitating ceding sensitivity to achieve an acceptable PPV. Performance was further reduced for the restricted analysis of individuals without cytopenias or thrombo-/erythrocytosis. By re-training our model in this population, we found that CHIC was still able to discriminate individuals with and without high-risk CH, but the resultant small reduction in AUC (0.80 vs 0.85) exacerbated the difficulty in balancing sensitivity and PPV, precluding its use at population scale.
One approach for enhancing the performance of CHIC is to target its use on a population with a higher prevalence of high-risk CH. CHIC was trained within the age constraints of the UKB, but since the prevalence of high-risk mutations in splicing factors (SF3B1, SRSF2, U2AF1) rises sharply over the age of 70 years, we anticipate that application of CHIC in an older population would result in improved performance. Similarly, targeting CHIC to individuals with a polygenic15,23 or monogenic24 predisposition to CH is also likely to improve its performance/PPV.
An alternative approach would be to integrate higher-resolution CBC data into the CHIC classifier, to improve its ability to identify high-risk CH. Some of the most discriminative CBC indices for high-risk CH are derived summary statistics, e.g. RDW, PDW and MCH, calculated from single-cell measurements (i.e. RDW is a measure of variation in red cell volumes). The integration of the raw or otherwise summarised single-cell measurements has the potential to improve the prediction of high-risk CH, for example by revealing a fraction of cells with distinct indices arising from the CH clone or identifying other characteristic patterns of variation in these measurements; such raw (or “non-classical”) CBC traits have recently been exploited to explore genetic associations with blood cell morphology25.
Beyond MN prevention, CH is of wider public health relevance due to its association with non-haematological disorders, most notably atherosclerotic heart disease. Since JAK2-CH exhibits the strongest association with cardiovascular outcomes26, and was also the most amenable to prediction in our study, we anticipate that CHIC may also have utility in the primary prevention of cardiovascular disease by facilitating the identification of individuals with JAK2-CH. By retrofitting CH screening on to a routine blood test, we believe our CHIC approach presents an important step towards scalable, practical and inexpensive ML-based screening for high-risk CH and provides a proof-of-concept that individuals with high-risk CH can be differentiated from those without, based on CBC indices.
Data sharing
All data used in this study are publicly available from the UK Biobank (https://www.ukbiobank.ac.uk/). Researchers may apply for access to the UK Biobank data via the Access Management System (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access).
Code availability
Scripts used to query the UK Biobank dataset are available from: https://github.com/IsabellaWithnell/Predicting_CH. Scripts used to implement the machine learning framework described in the manuscript are available from: https://github.com/billydunn/chic.
Declaration of Interests
G.S.V. is a consultant to STRM.BIO and holds a research grant from AstraZeneca for research unrelated to that presented here. S.W. is an employee of AstraZeneca. M.A.F. is an employee and stockholder of AstraZeneca. The other authors declare no competing interests.