PT - JOURNAL ARTICLE
AU - Moses, Cook
AU - Bessi, Qorri
AU - Amruth, Baskar
AU - Jalal, Ziauddin
AU - Luca, Pani
AU - Shashibushan, Yenkanchi
AU - Geraci, Joseph
TI - Small Patient Datasets Reveal Genetic Drivers of Non-Small Cell Lung Cancer Subtypes Using a Novel Machine Learning Approach
AID - 10.1101/2021.07.27.21261075
DP - 2022 Jan 01
TA - medRxiv
PG - 2021.07.27.21261075
4099 - http://medrxiv.org/content/early/2022/09/26/2021.07.27.21261075.short
4100 - http://medrxiv.org/content/early/2022/09/26/2021.07.27.21261075.full
AB - Background: Many small datasets of significant value in the medical space are being underutilized. Because of the heterogeneity of the complex disorders found in oncology, systems capable of discovering patient subpopulations while elucidating etiologies are of great value, as they can indicate leads for innovative drug discovery and development.
Materials and Methods: Here, we report on a machine intelligence-based study that combined two small non-small cell lung cancer (NSCLC) datasets: 58 samples of adenocarcinoma (ADC) and squamous cell carcinoma (SCC) (GSE10245) and 45 samples (GSE18842). Using a set of standard machine learning (ML) methods, which are described in this paper, we were able to uncover subpopulations of ADC and SCC while simultaneously identifying which genes, in combination, were significantly involved in defining those subpopulations. We also used a proprietary interactive hypothesis-generating method designed to work with machine learning methods, which provided an alternative way of pinpointing the most important combination of variables. The discovered gene expression variables were used to train ML models.
This allowed us to build models using standard methods and also to validate our in-house methods on heterogeneous patient populations, as are often found in oncology.
Results: Using these methods, we were able to uncover genes implicated by other methods and to accurately discover known subpopulations without being explicitly directed to, such as different levels of aggressiveness within the SCC and ADC subtypes. Furthermore, PIGX was a novel gene implicated in this study that warrants further investigation due to its role in breast cancer proliferation.
Conclusion: Here we demonstrate the ability to learn from small datasets and reveal well-established properties of NSCLC. This demonstrates the utility of machine learning techniques for revealing potential genes of interest, even from small datasets, and thus the driving factors behind subpopulations of patients.
Competing Interest Statement: J.G. is a major shareholder of NetraMark Corp, a technology company providing clinical trial support to pharmaceutical companies. L.P. has previously acted as a scientific consultant for AbbVie USA; Acadia USA; BCG Switzerland; Boehringer Ingelheim International GmbH; Compass Pathways; EDRA-Publishing, Italy; Ferrer Spain; Gedeon-Richter, Hungary; Inpeco SA, Switzerland; Johnson & Johnson USA; NeuroCog Trials USA; Novartis-Gene Therapies, Switzerland; Otsuka USA; Pfizer Global USA; PharmaMar Spain; Relmada Therapeutics USA; Takeda, USA; VeraSci, USA; Vifor Switzerland.
Funding Statement: Part of this research was funded by NetraMark Corp in the form of salary for Dr. Joseph Geraci and computational resources.
Author Declarations: I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below: This data is freely available through the Gene Expression Omnibus site and is anonymized.
Thus, for this project there was only an internal review, conducted by Dr. Joseph Geraci and Jalal Ziauddin, who are both employees at NetraMark Corp.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable. Yes
All data was procured from a publicly available database, the Gene Expression Omnibus. The dataset consisted of 58 samples of ADC and SCC (GSE10245) and 45 samples of human lung cancer and controls (GSE18842), for a total of 103 samples. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10245
ADC, adenocarcinoma; AUC, area under the curve; CNN, convolutional neural network; CT, computed tomography; EMT, epithelial-to-mesenchymal transition; LCC, large cell carcinoma; ML, machine learning; NSCLC, non-small cell lung cancer; PET, positron emission tomography; ROC, receiver operator curve; SCC, squamous cell carcinoma; SVM, support vector machine; TME, tumor microenvironment