Small Patient Datasets Reveal Genetic Drivers of Non-Small Cell Lung Cancer Subtypes Using a Novel Machine Learning Approach

Cook Moses; Qorri Bessi; Baskar Amruth; Ziauddin Jalal; Pani Luca; Yenkanchi Shashibushan; Joseph Geraci

doi:10.1101/2021.07.27.21261075

Abstract

Background Many small datasets of significant value in the medical space are underutilized. Because complex disorders in oncology are heterogeneous, systems capable of discovering patient subpopulations while elucidating etiologies are of great value, as they can indicate leads for innovative drug discovery and development.

Materials and Methods Here, we report on a machine intelligence-based study that combined two small non-small cell lung cancer (NSCLC) datasets: 58 samples of adenocarcinoma (ADC) and squamous cell carcinoma (SCC) (GSE10245) and 45 samples of human lung cancer and controls (GSE18842). Using a set of standard machine learning (ML) methods, which are described in this paper, we were able to uncover subpopulations of ADC and SCC while simultaneously extracting the genes that, in combination, were significantly involved in defining those subpopulations. We also used a proprietary interactive hypothesis-generating method designed to work with ML methods, which provided an alternative way of pinpointing the most important combinations of variables. The discovered gene expression variables were used to train ML models. This allowed us both to develop approaches based on standard methods and to validate our in-house methods on heterogeneous patient populations, as are often found in oncology.

Results Using these methods, we were able to recover genes implicated by other methods and accurately discover known subpopulations without being directed to look for them, such as different levels of aggressiveness within the SCC and ADC subtypes. Furthermore, PIGX was a novel gene implicated in this study that warrants further investigation because of its role in breast cancer proliferation.

Conclusion Here we demonstrate the ability to learn from small datasets and reveal well-established properties of NSCLC. This demonstrates the utility of machine learning techniques for revealing potential genes of interest, and thus the driving factors behind patient subpopulations, even from small datasets.

Competing Interest Statement

J.G. is a major shareholder of NetraMark Corp, a technology company providing clinical trial support to pharmaceutical companies. L.P. has previously acted as a scientific consultant for AbbVie USA; Acadia USA; BCG Switzerland; Boehringer Ingelheim International GmbH; Compass Pathways; EDRA-Publishing, Italy; Ferrer Spain; Gedeon-Richter, Hungary; Inpeco SA, Switzerland; Johnson & Johnson USA; NeuroCog Trials USA; Novartis-Gene Therapies, Switzerland; Otsuka USA; Pfizer Global USA; PharmaMar Spain; Relmada Therapeutics USA; Takeda, USA; VeraSci, USA; Vifor Switzerland.

Funding Statement

Part of this research was funded by NetraMark Corp in the form of salary for Dr. Joseph Geraci and computational resources.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

These data are freely available through the Gene Expression Omnibus site and are anonymized. Thus, for this project, only an internal review was conducted, by Dr. Joseph Geraci and Jalal Ziauddin, who are both employees of NetraMark Corp.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Footnotes

Figures have been revised for clarity. Introduction has been revised for brevity.

Data Availability

All data were obtained from the publicly available Gene Expression Omnibus database. The combined dataset consisted of 58 samples of ADC and SCC (GSE10245) and 45 samples of human lung cancer and controls (GSE18842), for a total of 103 samples.

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10245
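
As an illustration of how two small expression cohorts such as these might be pooled, the sketch below merges two cohorts on their shared gene identifiers after standardizing each gene within its own cohort. This is a minimal sketch under stated assumptions, not the preprocessing used in this study: per-cohort z-scoring is only one simple way to reduce between-dataset offsets, the function names `zscore_within_cohort` and `combine_cohorts` are hypothetical, and the values are toy numbers rather than GEO data.

```python
from statistics import mean, pstdev


def zscore_within_cohort(cohort):
    """Standardize each gene across the samples of a single cohort.

    `cohort` maps sample ID -> {gene ID: expression value}.
    Only genes measured in every sample of the cohort are kept.
    """
    genes = set.intersection(*(set(profile) for profile in cohort.values()))
    stats = {}
    for gene in genes:
        values = [profile[gene] for profile in cohort.values()]
        sd = pstdev(values)
        # A constant gene has zero spread; divide by 1.0 so it maps to 0.
        stats[gene] = (mean(values), sd if sd > 0 else 1.0)
    return {
        sample: {g: (profile[g] - stats[g][0]) / stats[g][1] for g in genes}
        for sample, profile in cohort.items()
    }


def combine_cohorts(*cohorts):
    """Pool cohorts, keeping only genes present in all of them.

    Assumes sample IDs are unique across cohorts.
    """
    scaled = [zscore_within_cohort(c) for c in cohorts]
    shared = set.intersection(*(set(next(iter(c.values()))) for c in scaled))
    combined = {}
    for cohort in scaled:
        for sample, profile in cohort.items():
            combined[sample] = {g: profile[g] for g in sorted(shared)}
    return combined


# Toy stand-ins for two cohorts measured on different scales (not real data).
cohort_a = {
    "a1": {"GENE1": 2.0, "GENE2": 5.0, "GENE3": 1.0},
    "a2": {"GENE1": 4.0, "GENE2": 7.0, "GENE3": 1.0},
}
cohort_b = {
    "b1": {"GENE1": 10.0, "GENE2": 20.0},
    "b2": {"GENE1": 14.0, "GENE2": 20.0},
    "b3": {"GENE1": 12.0, "GENE2": 20.0},
}

combined = combine_cohorts(cohort_a, cohort_b)
print(len(combined), "samples,", len(combined["a1"]), "shared genes")
```

Applied to the real cohorts, the pooled table would hold 58 + 45 = 103 samples, restricted to the genes measured in both datasets.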

Abbreviations

ADC: adenocarcinoma
AUC: area under the curve
CNN: convolutional neural network
CT: computed tomography
EMT: epithelial-to-mesenchymal transition
LCC: large cell carcinoma
ML: machine learning
NSCLC: non-small cell lung cancer
PET: positron emission tomography
ROC: receiver operating characteristic
SCC: squamous cell carcinoma
SVM: support vector machine
TME: tumor microenvironment

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.