Abstract:
Background Acute myeloid leukemia (AML) accounts for a fifth of childhood leukemia. Although survival rates for AML have greatly improved over the past few decades, they vary depending on demographic and AML type factors.
Objectives To predict the five-year survival among pediatric AML patients using machine learning algorithms and deploy the best performing algorithm as an online survival prediction tool.
Materials and methods Pediatric patients (0 to 14 years) with microscopically confirmed AML were extracted from the Surveillance, Epidemiology, and End Results (SEER) database (2000-2011) and randomly split into training and test datasets (80/20 ratio). Four machine learning algorithms (logistic regression, support vector machine, gradient boosting, and K nearest neighbor) were trained on demographic and disease features to predict five-year survival. Performances of the algorithms were compared, and the best performing algorithm was deployed as an online prediction tool.
Results A total of 1,477 patients met our inclusion criteria. The gradient boosting algorithm was the best performer in terms of discrimination and predictive ability. It was deployed as the online survival prediction tool named OSPAM-C (https://ashis-das.shinyapps.io/ospam/).
Conclusions Our study provides a framework for the development and deployment of an online survival prediction tool for pediatric patients with AML. While external validation is needed, our survival prediction tool presents an opportunity to inform clinical decision-making for AML patients.
1. Introduction
Acute myeloid leukemia (AML) is a heterogeneous hematological cancer characterized by expansion of abnormally differentiated myeloid hematopoietic progenitor cells, and it accounts for a fifth of childhood leukemia [1,2]. The overall survival of children with AML has improved in recent decades owing to advances in therapy and is currently around 70% [3–5]. However, survival rates vary depending on demographic and AML type factors [6–9]. Therefore, it is essential to understand the prognostic factors for AML outcomes for effective planning of treatment and rehabilitation modalities. While a few studies have translated prognostic factors into predictive models for AML, they have focused on adult patients, and none has used machine learning specifically to predict pediatric patient survival [10,11].
Machine learning consists of a group of artificial intelligence techniques in which algorithms learn patterns in the data without being explicitly programmed for a specific application. After learning from one set of data (training data), a machine learning algorithm applies the resulting predictive model to unseen data (test data) [12]. Using data already available from hospitals and medical databases, machine learning has the potential to diagnose health conditions and to predict appropriate treatment methods and patient survival, which may help improve overall quality of life. There have been several applications of machine learning in healthcare, such as predicting diseases, health events and drug response, survival prediction, clustering of patients based on risk classification, and analyzing genetics data and medical imaging [13–17]. In the field of cancer research, a few studies have utilized machine learning for predicting cancer survival from hospital records and registries [18–23]. The Surveillance, Epidemiology, and End Results (SEER) database is the largest publicly available source of cancer statistics in the United States and it covers approximately 28% of the population [24]. Though several studies have applied machine learning to predict patient survival for various cancers from the SEER database, none has applied it to AML in pediatric patients [20–22].
Our study had two objectives: (1) to predict the five-year survival among pediatric (0 to 14 years) AML patients using machine learning algorithms, and (2) to deploy the best performing algorithm as a web application for future validation and clinical use.
2. Materials and methods
2.1 Patients
Patients for this study were selected from the Surveillance, Epidemiology, and End Results (SEER) database (1975-2016) [25]. The standard for case completeness for the SEER database is 98% and all patients were followed up for 10 years after routine treatment until death or loss to follow-up [26]. The database includes patient details from 1975 through 2016 and reports their demographic background, cancer characteristics, and survival. The available variables on AML were age, sex, race, marital status, AML histologic subtype, AML grade, SEER registry details (name, state and county), year of diagnosis, and survival in months.
Our inclusion criteria for this study were microscopically confirmed AML and an age of 14 years or younger. We excluded patients without microscopically confirmed AML, those with unknown survival time, and those diagnosed before 2000. To allow an adequate follow-up period after diagnosis, we restricted our sample to patients diagnosed between 2000 and 2011. A total of 76,382 patients across all age groups were diagnosed with AML between 1975 and 2016. After excluding patients who did not meet our inclusion criteria, 1,477 pediatric AML patients were included in our study.
2.2 Outcome variable
Our outcome variable was survival of five years or more among AML patients. In the SEER database, survival is a continuous variable with units in months. So, we created a binary variable where any patient with a survival of 60 months or more was coded “yes”, or otherwise “no”.
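As an illustration of this coding step, the following minimal Python sketch derives the binary outcome from survival in months. The function name and the sample values are hypothetical and are not taken from the SEER data.

```python
def code_five_year_survival(survival_months):
    """Return "yes" for survival of 60 months or more, otherwise "no"."""
    return "yes" if survival_months >= 60 else "no"


# Hypothetical survival times in months, for illustration only.
example_months = [3, 59, 60, 142]
labels = [code_five_year_survival(m) for m in example_months]
```

The threshold is inclusive, so a patient with exactly 60 months of survival is coded "yes".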
2.3 Predictors
We considered individual patient level demographic and disease variables as predictors. Demographic predictors were sex, age (years at diagnosis), and race. There were six races – “Hispanic”, “non-Hispanic American Indian/Alaska native”, “non-Hispanic Asian or Pacific Islander”, “non-Hispanic black”, “non-Hispanic white” and “non-Hispanic unknown”.
Disease variables that were available in the database were AML sub-type and grade. In our sample, there were 14 AML subtypes available according to the 3rd edition of the International Classification of Diseases for Oncology (ICD-O-3) [27]. The AML subtypes were the following: 9840/3 – acute erythroid leukemia; 9861/3 – AML, NOS; 9866/3 – acute promyelocytic leukemia (AML with t (15;17) (q22; q12)) PML/RARA; 9867/3 – acute myelomonocytic leukemia; 9871/3 – AML with inv (16)(p13.1q22) or t (16;16) (p13.1;q22), CBFB-MYH11; 9872/3 – AML with minimal differentiation; 9873/3 – AML without maturation; 9874/3 – AML with maturation; 9895/3 – AML with myelodysplasia-related changes; 9896/3 – AML, t (8;21)(q22;q22) RUNX1-RUNX1T1; 9897/3 – AML with t (9;11) (p22;q23), MLLT3-MLL; 9898/3 – AML with Down Syndrome; 9910/3 – acute megakaryoblastic leukemia; and 9920/3 – therapy related myeloid neoplasm. A vast majority of patients (93 percent) had unknown AML grade. Thus, we excluded this variable from our analysis.
2.4 Statistical Methods
2.4.1 Descriptive Analysis
We performed descriptive analyses for the predictors stratified by their classes. The correlation was tested among all predictors with Pearson’s correlation coefficient.
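For readers unfamiliar with the statistic, a minimal sketch of Pearson's correlation coefficient in plain Python is given here. It assumes that categorical predictors have already been converted to numeric codes; the function name and the example values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: age in years and sex coded as 1 (male) or 0 (female).
age = [1, 4, 7, 10, 13]
sex_male = [1, 0, 1, 0, 1]
r_age_sex = pearson_r(age, sex_male)
```

Values near zero, as in the range reported in the Results, suggest little linear association between predictors.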
2.4.2 Predictive Analysis
We employed machine learning to predict five-year survival in AML patients. We applied four supervised machine learning algorithms commonly used in cancer research (logistic regression, support vector machine, K nearest neighbors, and gradient boosting) to determine which algorithm provides the highest prediction accuracy. We ran the best-fitting model for each algorithm to derive the predictions. The best fit was derived through the optimization techniques described under each algorithm below.
2.4.2.1 Logistic Regression (LR)
Logistic regression is used for classification problems, i.e. binary or categorical output. The algorithm fits the best model to describe the relationship between the output and input (predictor) variables [28]. We used the grid search function to identify the best fit parameters, which were L2 regularization and a penalty strength of 1.
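A grid search of this kind can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn (the library is not named in this paper), synthetic data in place of the SEER variables, a hypothetical candidate grid, and it treats the penalty strength as scikit-learn's C parameter with the default L2 penalty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): four features, binary outcome.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical candidate values for the regularization parameter C.
param_grid = {"C": [0.01, 0.1, 1, 10]}

# LogisticRegression uses an L2 penalty by default.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

best_params = search.best_params_            # best value of C on this data
test_accuracy = search.score(X_test, y_test)  # accuracy on held-out data
```

The selected value depends on the data and the grid, so the result on synthetic data need not match the parameters reported above.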
2.4.2.2 Support Vector Machine (SVM)
In a support vector machine (SVM), the data are classified into two classes, based on the output variable, by a separating hyperplane [23]. The algorithm tries to maximize the distance between the hyperplane and the closest data points from each class. There are three critical parameters in SVM: the kernel (a function, such as linear, radial, sigmoid, or polynomial, that implicitly maps the data into a space where the classes can be separated), the penalty (a regularization term that controls the cost of misclassified points) and gamma (the kernel coefficient, which controls how far the influence of a single training observation reaches). Using grid search for optimization, the best parameters in our model for kernel, penalty and gamma were radial, 1 and 0.1 respectively.
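To make the role of gamma concrete, the sketch below computes a radial kernel in plain Python. It assumes the commonly used Gaussian form, K(x, z) = exp(-gamma * ||x - z||^2); the function name and the example points are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian radial kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two hypothetical observations at Euclidean distance 5 (squared distance 25).
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

similarity_small_gamma = rbf_kernel(point_a, point_b, gamma=0.1)
similarity_large_gamma = rbf_kernel(point_a, point_b, gamma=1.0)
```

With a larger gamma the similarity between distant points falls off faster, so each training observation influences a smaller neighborhood and the decision boundary can become more irregular.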
2.4.2.3 K Nearest Neighbors (KNN)
In the KNN algorithm, the class of a new observation is decided by the majority class among its neighbors [29]. There are three important parameters for KNN: number of nearest neighbors, distance metric and weights. The number of nearest neighbors is the number of closest training data points consulted when classifying a new observation. The distance metric measures the distance between the new observation and its neighbors. There are three possible distance metrics: Euclidean, Manhattan and Minkowski. The weight determines the contribution of each member of the neighborhood: members can be weighted equally (uniform weight), or nearer members can be given higher weights (distance weight). Using grid search for optimization, the best parameters in our model were 15 nearest neighbors, Manhattan metric and uniform weights.
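The voting rule can be illustrated with a minimal plain-Python sketch that uses the Manhattan metric and uniform weights. The function names and the toy data are hypothetical, and a small k is used only to keep the example readable.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k):
    """Predict the majority class among the k nearest training points."""
    # Rank training points by their distance to the query observation.
    order = sorted(range(len(train_X)), key=lambda i: manhattan(train_X[i], query))
    # Uniform weights: each of the k neighbors contributes exactly one vote.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups of observations.
train_X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

prediction_near_origin = knn_predict(train_X, train_y, (1, 1), k=3)
prediction_far_corner = knn_predict(train_X, train_y, (5, 4), k=3)
```

An odd k avoids tied votes in a two-class problem, which is one practical reason to prefer values such as 15.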
2.4.2.4 Gradient Boosting
Gradient boosting is an algorithm that combines successive shallow decision trees [30]. Decision trees recursively partition (split) the predictors. Each tree is fitted in sequence to improve on the errors of the previous trees, and the learning rate scales the contribution of each new tree. One must define the maximum depth of each decision tree (number of levels up to which splitting continues) and the minimum leaf sample to split (minimum number of observations required in a node for it to be considered for splitting). Eventually, predictions are based on a weighted combination of these trees. We used grid search to optimize model parameters. The best fit parameters were 80 decision trees, a maximum depth of three for each tree, minimum leaf samples of seven to split, three maximum features and a learning rate of 0.15.
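The reported parameters could be expressed roughly as follows. This is a sketch under assumptions rather than the study's code: it uses scikit-learn's GradientBoostingClassifier (the library is not named in this paper), synthetic stand-in data, and it maps the minimum leaf sample to split onto the min_samples_split argument, which is one possible reading of that parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): four features, binary outcome.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=80,       # number of successive decision trees
    max_depth=3,           # maximum depth of each tree
    min_samples_split=7,   # assumed mapping of the minimum samples to split
    max_features=3,        # features considered when searching for a split
    learning_rate=0.15,    # scales the contribution of each new tree
    random_state=0,
)
model.fit(X_train, y_train)

# Predicted probability of the positive class for each held-out observation.
positive_probability = model.predict_proba(X_test)[:, 1]
```

A predicted probability of this kind is what a prediction tool would display as a percentage.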
2.4.2.5 Evaluation of the performance of the algorithms
The data were split into training (80 percent) and test segments (20 percent) for all algorithms. First, the algorithms were trained on the training segment and then validated on the test segment to determine predictions. The data were 10-fold cross-validated, with the data split randomly into 80% training and 20% test observations ten times for all algorithms. The average of the cross-validations was taken as the final result. The models were evaluated with accuracy (proportion of all patients correctly predicted, i.e. survived patients as survived and non-survived patients as non-survived), precision (ratio of correctly predicted survived patients to the total predicted survived patients), recall (ratio of correctly predicted survived patients to all patients who actually survived), F1 score (harmonic mean of precision and recall), and area under the receiver operating characteristic curve (AUC) [35]. A receiver operating characteristic (ROC) curve plots the true positive rate (y-axis) against the false positive rate (x-axis) for each individual algorithm. AUC measures the area under the ROC curve; a value of 0.50 indicates discrimination no better than chance and 1.0 indicates perfect discrimination.
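The threshold-based metrics can be computed from the counts of true and false positives and negatives, as in the following plain-Python sketch. The function name and the example labels are hypothetical; "yes" denotes five-year survival.

```python
def classification_metrics(y_true, y_pred, positive="yes"):
    """Accuracy, precision, recall and F1 score for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical observed and predicted outcomes for eight patients.
y_true = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
y_pred = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
metrics = classification_metrics(y_true, y_pred)
```

In this toy example there are three true positives, one false positive, one false negative and three true negatives, so all four metrics equal 0.75.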
The statistical analyses were performed using Python programming language Version 3.7 (Python Software Foundation, Wilmington, DE, USA) and the deep neural network was implemented on the TensorFlow platform [36]. The web application was built using the Shiny package for R and deployed with Shiny server (R Foundation for Statistical Computing, Vienna, Austria).
3. Results
In this section, we present the profile of patients, performance of the algorithms and our online survival prediction tool.
3.1 Patient profile
The demographic profile of the patients is presented in Table 1. The mean age of the patients was 6.1 years with a standard deviation of 5. Slightly more than half were male (52.9%). Among the races, non-Hispanic whites were the largest group (43.4%), followed by Hispanics (31.8%) and non-Hispanic blacks (13.7%). Of all AML subtypes, AML not otherwise specified (NOS) was the largest group (39.2%). Close to 60% of the patients in our sample survived five or more years. The correlation coefficients between the predictors ranged from −0.14 to 0.02.
3.2 Performance of the algorithms
The performance metrics of the algorithms (logistic regression, support vector machine, K nearest neighbor, and gradient boosting) are shown in Table 2. The accuracy of gradient boosting was the highest (0.681), followed by KNN (0.635), SVM (0.618), and logistic regression (0.588). The F1 score (harmonic mean of precision and recall) was the highest for gradient boosting (0.692), followed by SVM (0.672), logistic regression (0.664), and KNN (0.663). The area under the receiver operating characteristic curve (AUC) ranged from 0.561 to 0.726, with the highest score for the gradient boosting algorithm. Considering all the performance metrics, gradient boosting was the best performer.
3.3 Online survival prediction tool – OSPAM-C
The best performing model, gradient boosting, was deployed as the online survival prediction tool named “Online Survival Prediction tool for Acute Myeloid Leukemia in children” (OSPAM-C; https://ashis-das.shinyapps.io/ospam/). As shown in Figure 1, the user interface has four boxes to select input features as drop-down menus. The features are age (fifteen options: 0 through 14 years), sex (two options: male and female), race (six options: Hispanic, non-Hispanic American Indian/Alaska native, non-Hispanic Asian or Pacific Islander, non-Hispanic Black, non-Hispanic white and unknown) and AML sub-type (seventeen options according to the 3rd edition of the ICD-O-3 and WHO 2008 definitions). A user has to select one option from each feature box and click the submit button to estimate the five-year survival probability as a percentage. For instance, the tool gives a five-year survival prediction of 57.4% for a 12-year-old female Hispanic patient with AML with maturation (9874/3).
4. Discussion
In this study, we utilized machine learning algorithms to predict five-year survival among pediatric AML patients. Among all our algorithms, gradient boosting performed the best and was deployed as an online survival prediction tool for pediatric AML named OSPAM-C.
Acute myeloid leukemia is one of the most common malignancies among children. While the overall survival has improved for children in recent times, it still has one of the worst survival probabilities among the leading pediatric cancers. AML is also a heterogeneous condition with several biological, clinical and genetic factors influencing treatment response and prognosis [37]. While a few studies have explored the predictors of AML survival among children by applying conventional analytic methods to the SEER database, none has applied machine learning yet [7,38,39].
There are a few predictive web applications that estimate survival for other cancers from the SEER database, such as chondrosarcoma, spinal chordoma, and glioblastoma [21,40,41]. However, we believe this is the first web-based survival prediction model for pediatric AML patients. Using the SEER database, Thio et al. and Karhade et al. applied machine learning algorithms to 1,554 chondrosarcoma and 265 spinal chordoma patients, respectively, to predict five-year survival [21,40]. They utilized decision tree, support vector machine, Bayes point machine and neural networks. In both studies, the Bayes point machine was the best performer and was deployed as the web application. Similarly, Senders et al. applied 15 machine learning and statistical algorithms: accelerated failure time (AFT), bagged decision trees, boosted decision trees, boosted decision trees survival, Cox proportional hazards regression (CPHR), extreme boosted decision trees, k-nearest neighbors, generalized linear models, lasso and elastic-net regularized generalized linear models, multilayer perceptron, naïve Bayes, random forests, random forest survival, recursive partitioning, and support vector machines [41]. The AFT algorithm was deployed as the online prediction tool. The C-statistics (AUC) of the best performing models were 0.868, 0.8 and 0.7 for chondrosarcoma, spinal chordoma, and glioblastoma predictions respectively, whereas the AUC of our best performing model was 0.726.
Our study has several potential limitations. First, as we used SEER data, there were certain missing clinical features such as treatment type, response to initial therapy, stage and extent of disease. Moreover, due to unavailability of meaningful responses, we had to drop the grade of AML. Second, the database does not collect information on key socio-demographic features such as geographic location, household education and economic status. Third, there was no information in the database on molecular biology, genomics, proteomics, or metabolomics factors. All these additional clinical and socio-demographic factors are known to influence survival in AML patients. Inclusion of these additional features may improve the accuracy and reliability of the model.
Our survival prediction tool is the first of its kind for pediatric AML. Although we used data from the largest cancer database in the US, the tool is yet to be validated. Therefore, we advise caution for clinicians and patients who intend to use this tool as a predictive guide for ascertaining survival for pediatric AML patients. Clinical experts must balance the predictions from this tool against their clinical experience, genomics and other relevant clinical information. We hope this tool will be further validated and possibly reoptimized using heterogeneous data from various cohorts in multiple practice settings. While external validation is needed, our survival prediction tool presents an opportunity to inform clinical decision-making for AML patients.
Data Availability
SEER Data https://seer.cancer.gov/
Authors’ contributions
Conceived and designed this study: Ashis Kumar Das, Shiba Mishra, Devi Kalyan Mishra, Saji Saraswathy Gopalan
Analyzed and explained the data: Ashis Kumar Das, Shiba Mishra, Devi Kalyan Mishra, Saji Saraswathy Gopalan
All authors contributed to the writing and approved the final manuscript.
Declaration of Competing Interest
The authors declare that there is no conflict of interest. The views expressed in the paper are those of the authors and do not reflect those of their affiliations. This particular work was conducted outside of the authors’ organizational affiliations.
Acknowledgements
We are grateful to the contributors of the Surveillance, Epidemiology, and End Results Program as well as to the National Cancer Institute for making this data publicly available.