Abstract
Electronic Health Records (EHRs) are increasingly used to develop machine learning models in predictive medicine. There has been limited research on utilizing machine learning methods to predict childhood obesity and on related disparities in classifier performance among vulnerable patient subpopulations. In this work, classification models are developed to recognize pediatric obesity using temporal condition patterns obtained from patient EHR data. We trained four machine learning algorithms (Logistic Regression, Random Forest, XGBoost, and Neural Networks) to classify cases and controls as obesity positive or negative, and optimized hyperparameter settings through a bootstrapping methodology. To assess the classifiers for bias, we studied model performance by population subgroup, then used permutation analysis to identify the most predictive features for each model and the demographic characteristics of patients with these features. Mean AUC-ROC values were consistent across classifiers, ranging from 0.72 to 0.80. Some evidence of bias was identified, although it took the form of the models performing better for minority subgroups (African Americans and patients enrolled in Medicaid). Permutation analysis revealed that patients from vulnerable population subgroups were over-represented among patients with the most predictive diagnostic patterns. We hypothesize that our models performed better on under-represented groups because the features more strongly associated with obesity were more commonly observed among minority patients. These findings highlight the complex ways that bias may arise in machine learning models. They can inform future research toward a thorough analytical approach for identifying and mitigating bias that arises from features and within EHR datasets, in support of more equitable models.
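To make the subgroup evaluation described above concrete, the following is a minimal sketch of computing AUC-ROC separately for each population subgroup. It is not the study's code: the function names, the toy labels and scores, and the group names "A" and "B" are all hypothetical, and AUC is computed here by direct pairwise comparison rather than with any particular library.

```python
from collections import defaultdict


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_subgroup(labels, scores, groups):
    """Split predictions by subgroup and compute AUC-ROC within each."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auc_roc(ys, ss) for g, (ys, ss) in buckets.items()}


# Hypothetical toy data: 1 = obesity positive, 0 = obesity negative.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall_auc = auc_roc(labels, scores)
subgroup_auc = auc_by_subgroup(labels, scores, groups)
```

Comparing each subgroup value with the overall value is one simple way to surface performance gaps of the kind reported here. In practice, small subgroups would also call for confidence intervals, which this sketch omits.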
Author Summary
Childhood obesity is a pressing health issue. Machine learning methods are useful tools to study and predict the condition. Electronic Health Record (EHR) data may be used in clinical research to develop solutions and improve outcomes for pressing health issues such as pediatric obesity. However, EHR data may contain biases that impact how machine learning models perform for marginalized patient subgroups. In this paper, we present a comprehensive framework of how bias may be present within EHR data and of external sources of bias in the model development process. Our pediatric obesity case study provides a detailed exploration of a real-world machine learning model to contextualize how concepts related to EHR data and machine learning model bias occur in an applied setting. We describe how we evaluated our models for bias and considered how these results reflect health disparity issues related to pediatric obesity. Our paper adds to the limited body of literature on the use of machine learning methods to study pediatric obesity and investigates the potential pitfalls of using a machine learning approach when studying socially significant health issues.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported by a grant from the Commonwealth Universal Research Enhancement (C.U.R.E.) program funded by the Pennsylvania Department of Health 2015 Formula award SAP #4100072543. This work was also supported by funding from The Children's Hospital of Philadelphia (CHOP) Drexel Research Fellowship Program: Informatics and Analytics Collaborative Research.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Ethics statement: Non-study personnel extracted all data from the EHR and removed protected health information (PHI) identifiers, except for dates, prior to transfer to the study database. Date information was removed from the analysis dataset used in this study. The CHOP Institutional Review Board approved this study and waived the requirement for consent.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Data cannot be shared publicly because of HIPAA requirements. Please contact pedsnet@chop.edu with questions regarding data availability.