Abstract
In this experiment, an R script was developed to select the best-performing machine learning (ML) classification algorithm for predicting IBS subtype, and to compare the performance of two datasets from the same clinical cohort: 1) Complete Blood Count (CBC) results, and 2) a 250-gene Nanostring expression panel run on RNA from the “Buffy Coat” fraction. This publicly available data was compiled from open-source repositories and previously published supplementary data. Column labels were reformatted according to “tidy-data” standards. NA values were imputed with the mean of the corresponding data column. Subject groups included Control (i.e., healthy), IBS-D (diarrhea-predominant), and IBS-C (constipation-predominant) subtypes. These groups had unequal numbers in the original study, so random re-sampling was used to equalize group sizes for downstream linear regression-based analyses. The data was randomly split into training and validation subsets, and 5 classification algorithms were tested. Random Forest was clearly the best-performing algorithm for both the CBC and gene expression panel data, generally with >95% predictive accuracy without additional tuning. The 250-gene RNA expression panel performed somewhat better than the CBC profile under a Random Forest model; however, the CBC profiles had only 13 predictor variables vs. the 250 of the RNA expression panel. Some artifacts may result from the duplication of IBS-D and IBS-C rows by the group-size balancing method, so larger and more comprehensive datasets will be obtained for a follow-up analysis. The R script and reformatted data are published as supplementary material here, and as a component of the ‘AnalyzeBloodworkv1.2’ GitHub repository.
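The preprocessing steps described in the abstract (column-mean imputation, group-size balancing by random re-sampling, and a random training/validation split) can be sketched as follows. This is a minimal illustration in Python, not the authors' R script: the function names, the toy values, the 80/20 split fraction, and the random seed are all assumptions made for the example, and the balancing is assumed to duplicate rows of the smaller groups until they match the largest one.

```python
import random
from collections import Counter, defaultdict


def impute_column_means(rows):
    """Replace missing entries (None) with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def oversample_to_balance(rows, labels, rng):
    """Duplicate randomly chosen rows of the smaller groups until every group matches the largest."""
    by_group = defaultdict(list)
    for row, label in zip(rows, labels):
        by_group[label].append(row)
    target = max(len(group) for group in by_group.values())
    out_rows, out_labels = [], []
    for label, group in by_group.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def train_validation_split(rows, labels, train_fraction, rng):
    """Shuffle row indices and cut them into training and validation subsets."""
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_fraction * len(idx)))
    train = [(rows[i], labels[i]) for i in idx[:cut]]
    valid = [(rows[i], labels[i]) for i in idx[cut:]]
    return train, valid


# Toy stand-in for a table with two numeric predictor columns (hypothetical values).
rng = random.Random(0)  # fixed seed so the example is repeatable
rows = [[4.5, 13.0], [5.0, None], [4.8, 14.0], [5.2, 15.0], [None, 12.0], [4.1, 11.0]]
labels = ["Control", "Control", "Control", "IBS-D", "IBS-C", "Control"]

imputed = impute_column_means(rows)
balanced_rows, balanced_labels = oversample_to_balance(imputed, labels, rng)
train, valid = train_validation_split(balanced_rows, balanced_labels, 0.8, rng)
group_sizes = Counter(balanced_labels)
```

In this sketch the balancing happens before the split, following the order in which the abstract lists the steps. A duplicated IBS-D or IBS-C row can therefore land in both the training and validation subsets, which is one way the duplication artifacts mentioned above could arise.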
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This project was partially funded by an NSF XSEDE (Extreme Science and Engineering Discovery Environment) Educational Allocation to Jeffrey Robinson: Bioinformatics Training for Applications in Translational and Molecular Biosciences. XSEDE is supported by National Science Foundation grant number ACI-1548562. This research was developed in the course of Robinson's faculty responsibilities at UMBC's Translational Life Science Technology BS program, and parts of the AnalyzeBloodwork GitHub repository have been used to train students in R coding and methods of statistical analysis in the courses BTEC330 (Software Applications) and BTEC495 (independent student research).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This is a re-analysis of previously published data; full links and references to the previously published, open-source data are included in the text.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
This analysis uses previously published, open-source data from preprint articles and the NCBI GEO database. All data sources are linked in the text.
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124549
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL25996
https://github.com/PhyloGrok/AnalyzeBloodwork
https://www.preprints.org/manuscript/201912.0180/v1
https://www.biorxiv.org/content/10.1101/608208v1.supplementary-material