Abstract
Background There are many small datasets of significant value in the medical space that are being underutilized. Due to the heterogeneity of complex disorders found in oncology, systems capable of discovering patient subpopulations while elucidating etiologies are of great value, as they can indicate leads for innovative drug discovery and development.
Materials and Methods Here, we report on a machine intelligence-based study that utilized a combination of two small non-small cell lung cancer (NSCLC) datasets consisting of 58 samples of adenocarcinoma (ADC) and squamous cell carcinoma (SCC) (GSE10245) and 45 samples (GSE18842). Utilizing a set of standard machine learning (ML) methods, which are described in this paper, we were able to uncover subpopulations of ADC and SCC while simultaneously identifying which genes, in combination, were significantly involved in defining the subpopulations. We also utilized a proprietary interactive hypothesis-generating method designed to work with machine learning methods, which provided us with an alternative way of pinpointing the most important combination of variables. The discovered gene expression variables were used to train ML models. This allowed us to build a workflow from standard methods and also to validate our in-house methods for heterogeneous patient populations, as are often found in oncology.
Results Using these methods, we were able to uncover genes implicated by other methods and accurately discover known subpopulations without being explicitly directed to do so, such as different levels of aggressiveness within the SCC and ADC subtypes. Furthermore, PIGX was a novel gene implicated in this study that warrants further study due to its role in breast cancer proliferation.
Conclusion Here we demonstrate the ability to learn from small datasets and reveal well-established properties of NSCLC. This demonstrates the utility of machine learning techniques for revealing potential genes of interest, even from small datasets, and thus the driving factors behind subpopulations of patients.
1. Introduction
The collection of transcriptomic data is expensive, resulting in datasets with a small number of samples (in the hundreds) but thousands of variables. As a result, several techniques that are making significant strides in the imaging space, such as deep neural networks, are not suitable for these datasets, as they require a large number of samples. Furthermore, the heterogeneity of the patient population and the complexity of diseases found in oncology require going beyond the labels. The development of techniques that can explain the driving variables behind patient subpopulations is tremendously valuable in identifying and developing novel therapeutic agents. This is particularly relevant for mapping out heterogeneous diseases such as lung cancer.
Lung cancer is the leading cause of cancer mortality worldwide, with non-small cell lung cancer (NSCLC) accounting for 85% of all lung cancers [1]. NSCLC can be divided into three histological subtypes with distinct phenotypes and prognoses: adenocarcinoma (ADC), squamous cell carcinoma (SCC) and large cell carcinoma (LCC) [2, 3]. The histological differences across these subtypes suggest that distinct molecular mechanisms underlie the observed phenotypic differences. Although differential gene expression across NSCLC subtypes has been of increasing interest, the therapeutic implications of how these pathways interact have only recently begun to be investigated [4]. The remarkable degree of genetic variability within each histological subtype only highlights the importance of molecular biology and genotyping for NSCLC [5, 6].
Fortunately, machine learning (ML) advancements have served as promising tools for stratifying NSCLC, predicting transcriptional mutations based on histological slides, and discriminating NSCLC subtypes through gene expression levels. The bulk of ML efforts have focused on image analysis for predicting the stage of NSCLC [7-10]. However, the growing body of evidence highlighting the molecular abnormalities that underlie the genomic subtypes of NSCLC can be used to train ML algorithms to identify novel biomarkers for NSCLC, moving towards precision medicine [11-13]. For instance, previous reports have identified that ADC is associated with increased expression of genes related to protein transport and cell junctions, while SCC is associated with increased expression of genes related to cell division and DNA replication [14]. An analysis of gene expression profiles between ADC and SCC using machine learning has been previously reported, identifying several genes that were differentially expressed in ADC and SCC, including CSTA, TP63, SERPINB13, CLCA2, BICD2, PERP, FAT2, BNC1, ATP11B, FAM83B, KRT5, PARD6G, and PKP1 [15].
Here, using a combination of ML tools designed to learn from patient datasets to analyze gene expression data derived from ADC and SCC NSCLC patients, we were able to identify novel driving genes that distinguish these two broad subtypes. ML with statistical modelling tailored for small datasets has shown promise in showcasing disease heterogeneity [16]. Because large datasets are critical for contemporary machine learning methods such as convolutional neural networks (CNNs), there is a need for alternative techniques when data banks are insufficient to train the model. In addition, significant features found within small datasets may become diluted by more obvious statistical features that are over-represented in large datasets. As such, ML methods must be used carefully and complemented by statistical methods that allow for the discovery of non-linear ways in which groups of genes may interact to drive disease heterogeneity. The methodology presented here is designed for small datasets and offers a novel way of hypothesizing genetic subpopulations that may result in pathogenesis. Our findings support genes previously reported to distinguish ADC and SCC subtypes. However, the novelty of this work lies in the machine’s ability to discover previously unknown subpopulations that are defined by several genes at a time. These findings shed light on the different mechanisms at play within these subtypes.
This article has been formatted according to the TRIPOD guidelines.
2. Materials and Methods
Datasets
The dataset consisted of 40 samples of ADC and 18 samples of SCC (GSE10245) [17] and 9 samples of ADC and 36 samples of SCC (GSE18842) [18], for a total of 103 samples. Only GSE10245 was used when analyzing differences in gene expression levels between sexes, as this information was omitted from GSE18842. Gene expression levels denote relative RMA-calculated signal intensity [19]. In the bar plots, bar heights represent the mean expression level and error bars represent the standard deviation of the pooled data for each probe ID.
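As a minimal sketch of the pooled summary described above, the snippet below computes the mean and sample standard deviation for one probe after pooling values across the two datasets, and checks the sample counts reported in the text. The intensity values and the helper name `pooled_summary` are invented for illustration and are not taken from GSE10245 or GSE18842.

```python
from statistics import mean, stdev

# Hypothetical RMA-scale signal intensities for one probe, keyed by (dataset, subtype).
# These numbers are invented for illustration; they are not study data.
probe_values = {
    ("GSE10245", "ADC"): [6.1, 5.8, 6.4],
    ("GSE10245", "SCC"): [10.2, 9.7, 10.9],
    ("GSE18842", "ADC"): [6.0, 6.3],
    ("GSE18842", "SCC"): [10.5, 9.9, 10.1],
}


def pooled_summary(values_by_group, subtype):
    """Pool one subtype's values across datasets; return (mean, sample SD, n)."""
    pooled = [
        v
        for (_, s), vals in values_by_group.items()
        if s == subtype
        for v in vals
    ]
    return mean(pooled), stdev(pooled), len(pooled)


# Sample counts as reported in the text (ADC and SCC per dataset).
sample_counts = {
    "GSE10245": {"ADC": 40, "SCC": 18},
    "GSE18842": {"ADC": 9, "SCC": 36},
}
total_samples = sum(sum(c.values()) for c in sample_counts.values())

for subtype in ("ADC", "SCC"):
    m, sd, n = pooled_summary(probe_values, subtype)
    print(f"{subtype}: mean={m:.2f}, SD={sd:.2f}, n={n}")
print("total samples:", total_samples)
```

The mean would set the bar height and the standard deviation the error bar for each subtype.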
We utilized publicly available data sets that upon inspection had excellent signal for separating out adenocarcinoma and squamous cell carcinoma. The data consists of gene expression and is very expensive to acquire. We decided to analyze these two data sets because we were interested in what we could accomplish with a smaller than ideal data set using machine learning. This paper is a report of our findings after using a set of techniques that are appropriate for small data in order to encourage others to explore smaller data sets as there may be hidden valuable information within them that could be extracted with the techniques we described.
Machine Intelligence
In this study, we used a methodology to organize the resulting models from several well-known machine learning methods to explore NSCLC genetic heterogeneity within a small dataset. This organizational technique was used to extract insights from models that could then be compared with statistical methods suitable for small data. The only proprietary method used for these results was a feature selection tool [20, 21], which helped us reduce the size of the data set to 16 dimensions. More specifically, we used this tool to create several new 16-dimensional data sets. After performing this feature reduction, we used the following process, based on standard methods, to create models and insights:
1. First, a simple variable reduction was performed via standard univariate reduction methods and ensemble trees (Random Forest) through cross-validation [22, 23]. The only dependent variable used was the ADC versus SCC label. All univariate statistical methods incorporated Bonferroni corrections.
2. At this point we exercised two options: a) we used the proprietary methods [20] to arrive at 16-variable data sets (in order to test this system), and b) we allowed step 1 above to be our sole variable selection method. For replication purposes, one may run step 1 alone.
3. Principal components were utilized as a linear unsupervised clustering method to reveal obvious subpopulation structures.
4. The loadings from the principal components were utilized to reduce the variables.
5. Using the t-SNE [24], HDBSCAN [25] and UMAP [26] algorithms, we were able to extract subpopulations.
6. We then collected the sample IDs from the clusters formed by these clustering models, systematically compared each group with the others, and applied statistical methods to determine differentially expressed gene candidates.
7. To determine the significance of a gene, a standard Student t-test was used when two subpopulations were compared, and an ANOVA was used when more than two subpopulations were compared. Again, these methods incorporated Bonferroni corrections. We then plotted the resulting clusters to illustrate our findings.
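The test selection described above (a Student t-test for two subpopulations, an ANOVA for more than two, with a Bonferroni correction) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it assumes SciPy is available, the function names `compare_subpopulations` and `bonferroni_significant` are hypothetical, the expression values are invented, and the correction is applied across the genes tested, which is one common convention that the text does not spell out.

```python
from scipy import stats


def compare_subpopulations(groups):
    """Return a p-value for one gene.

    Uses a Student t-test (equal variances) for two groups and a
    one-way ANOVA when more than two groups are compared.
    """
    if len(groups) < 2:
        raise ValueError("need at least two subpopulations")
    if len(groups) == 2:
        return stats.ttest_ind(groups[0], groups[1]).pvalue
    return stats.f_oneway(*groups).pvalue


def bonferroni_significant(pvalues, alpha=0.05):
    """Flag genes whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(pvalues)
    return {gene: p < threshold for gene, p in pvalues.items()}


# Hypothetical expression values per cluster (invented numbers, not study data).
expression = {
    # two clusters with clearly different expression
    "GENE_A": [[9.8, 10.1, 10.4, 9.9], [6.0, 6.2, 5.9, 6.3]],
    # two clusters with similar expression
    "GENE_B": [[7.0, 7.4, 6.8, 7.1], [7.2, 6.9, 7.3, 7.0]],
    # three clusters, so the ANOVA branch is used
    "GENE_C": [[5.0, 5.2, 5.1], [8.0, 8.1, 7.9], [11.0, 11.2, 10.9]],
}

pvalues = {gene: compare_subpopulations(g) for gene, g in expression.items()}
flags = bonferroni_significant(pvalues)

for gene in expression:
    print(gene, f"p={pvalues[gene]:.3g}", "significant" if flags[gene] else "not significant")
```

In practice the groups would be the expression values of each probe within the clusters identified earlier, so the number of tests (and therefore the Bonferroni threshold) depends on how many probes are compared.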
Clustering was performed via principal components, t-SNE, HDBSCAN and UMAP, and these were the basis of the maps found in this paper. Some proprietary algorithms were used to organize the resulting clustering models, in addition to the random forest models, such that we were able to explore the models interactively to derive a deeper understanding of the driving genes behind the sub-clusters [20]. The NetraAI system goes beyond these capabilities, but we did not utilize its other proprietary methods, in order to maintain academic standards. By limiting ourselves to the proprietary organization methods provided by NetraAI, we were able to identify subpopulations that we could then compare using statistical methods suitable for a dataset with so few samples, and to avoid the overfitting that often comes with applying machine learning methods to small datasets.
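The linear part of this workflow, principal components followed by variable reduction using the loadings, can be illustrated with a short sketch. It assumes NumPy is available and uses synthetic data in which two genes separate two groups; the helper names `pca_loadings` and `top_genes_by_loading` are hypothetical, and the non-linear methods (t-SNE, HDBSCAN, UMAP) and the proprietary organization step are not reproduced here.

```python
import numpy as np


def pca_loadings(X, n_components=2):
    """PCA via SVD of the column-centered matrix.

    Returns (scores, loadings, explained_variance_ratio) for the
    first n_components principal components.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    loadings = Vt[:n_components].T  # shape: (n_genes, n_components)
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]


def top_genes_by_loading(loadings, gene_names, component=0, k=3):
    """Rank genes by absolute loading on one component and keep the top k."""
    order = np.argsort(-np.abs(loadings[:, component]))[:k]
    return [gene_names[i] for i in order]


# Synthetic data: 40 samples x 10 genes, with two groups of 20 samples.
rng = np.random.default_rng(0)
n_per_group, n_genes = 20, 10
gene_names = [f"gene_{i}" for i in range(n_genes)]
X = rng.normal(0.0, 1.0, size=(2 * n_per_group, n_genes))
X[:n_per_group, :2] += 6.0  # the first two genes separate the two groups

scores, loadings, explained = pca_loadings(X)
top = top_genes_by_loading(loadings, gene_names, component=0, k=2)
print("top genes on PC1:", top)
print("explained variance ratio:", np.round(explained, 3))
```

Genes with the largest absolute loadings on the leading components are the ones retained; the scores can then be plotted or passed to a clustering method to look for subpopulation structure.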
3. Results
3.1 Machine learning identifies differentially expressed genes from a small NSCLC dataset
Using the ADC and SCC tumor gene expression data, our approach was able to generate a map distinguishing SCC (blue) and ADC subjects (red) (Figure 1). The genes that were found to have driven this distinction were DSC3, VSNL1, SLC6A10P, IRF6, DST, CLCA2, DSG3, LPCAT1 and PIGX. Previous studies have reported on differentially expressed genes in ADC and SCC. Here, we identified 17 genes that discriminate between SCC and ADC (Table 1). It is noteworthy that 16 of the 17 genes we identified have been previously reported to be differentially expressed in SCC and ADC, validating our methods. Interestingly, we found genes associated with gap junctions and tight junctions to be strong driving forces differentiating SCC and ADC. PIGX was the only gene identified that has not been previously associated with NSCLC. Although PIGX has been reported to promote breast cancer cell proliferation by suppressing EHD2 and ZIC1, its role in NSCLC warrants further investigation [27].
3.2 ADC and SCC are associated with distinct cellular adhesion molecules
Reports of SCC being characterized by the upregulation of desmosome and gap junction genes and ADC being characterized by the upregulation of tight junction genes suggest that NSCLC subtypes are associated with distinct sets of adhesion molecules [17]. Here, we found that SCC was associated with the cell adhesion marker DSC3, and ADC was associated with the tight junction marker CGN (Figure 2). We identified two probes corresponding to DSC3, 206032_at and 206033_s_at. There was a statistically significant association of both DSC3 probes with SCC (p < 0.0001) (Figure 2A). Interestingly, DSC3 expression tended to be higher in males; however, this difference was not statistically significant (p = 0.062 for 206032_at and p = 0.077 for 206033_s_at). In contrast, the two probes corresponding to CGN, 223232_s_at and 223233_s_at, were significantly associated with ADC (p < 0.0001) (Figure 2). The CGN probes were significantly associated with females (p = 0.014). The variability of adhesion molecule expression across sex warrants further investigation to elucidate the details of the correlation and advance towards sex-based precision medicine.
3.3 SLC6A10P may be a key driver of a more aggressive ADC subtype
Elevated expression of SLC6A10P was significantly associated with two subgroups of ADC (p<0.0001) (Figure 3), in line with previous reports [28, 29]. Interestingly, increased expression of the pseudogene SLC6A10P in ADC has been associated with increased metastatic risk and reported to be a significant predictor of poor clinical outcome [29]. Our ML methodology was able to reveal subpopulations of ADC subjects that are uniquely classified by SLC6A10P (p = 1.3×10⁻⁵). This demonstrates the potential power of machine intelligence to reveal etiologies within complex diseases, even when only a small number of samples are present. However, the methods must be used to reveal subpopulations that can then be compared using statistical methods suitable for comparing small groups.
3.4 IRF6 and CLCA2 drive unique subpopulations of SCC
Consistent with previous reports, we found two distinct subpopulations of SCC driven by IRF6 and CLCA2 (Figure 4) [28, 30]. IRF6 and CLCA2 expression levels were higher in SCC than in ADC (p<0.0001) (Figure 4B and 4C). The p-values for the CLCA2 and IRF6 probes between the two encircled SCC groups were 4.4×10⁻⁷, 5.8×10⁻³, 9.3×10⁻⁷ and 0.046 for the 206164_at, 206165_s_at, 206166_s_at and 1552477_a_at probes, respectively.
4. Discussion
This study highlights the genetic heterogeneity within NSCLC subtypes. Using a small dataset, we were able to identify a set of 17 genes that distinguish SCC and ADC (Table 1). Of these 17 genes, most have been previously reported to be associated either with NSCLC or with a specific subtype of NSCLC, validating our ML approach. These findings align with previous reports of SCC genes being associated with the organization and assembly of cell and gap junctions, glutathione conjugation and the redox stress response, ECM organization and collagen-related proteins, interferon and cytokine signaling, and HLA downregulation; and of ADC genes being associated with ECM organization proteins and complement, interferon and cytokine signaling, and collagen-related genes and proteins for ECM organization [31]. Another study identified epidermis development, cell division, and epithelial cell differentiation as the most common categories characterizing SCC, and cell adhesion enrichment, biological adhesion, and coagulation for ADC [32]. However, some of the genes we identified have not been previously associated with NSCLC or a specific subtype and represent areas that warrant greater investigation for the advancement of precision medicine in NSCLC. Below, the genes of interest found using our methodology are highlighted in the context of previous findings in NSCLC.
The first of the previously reported NSCLC-associated genes we identified, DSC3, plays a role in epidermal morphology and keratinocyte proliferation [18]. Several studies report that DSC3 distinguishes ADC and SCC, with higher expression in SCC [33-36]. Notably, there has been a report on the association between DSC3 and tumor suppressor activity in NSCLC mediated by inhibition of EGFR [37]. However, there remain contradictory associations between DSC3 and prognosis, with elevated levels associated with increased metastatic risk in melanoma and better prognosis in lung and colon cancer [35]. This suggests that the same molecule may have differential effects in the tumor microenvironment (TME), which makes it an interesting field of research to understand how DSC3 expression correlates with NSCLC subtypes depending on where they originate in the lung.
VSNL1 codes for the calcium-sensor protein VILIP1. Lower VSNL1 expression has been correlated with poor clinical outcomes in NSCLC patients [38]. VILIP1 has been reported to be decreased or undetectable in aggressive and invasive SCC, while less aggressive SCC displayed VILIP1 expression [38]. There is evidence linking decreased VILIP1 expression to increased cell motility and malignancy, suggesting that VSNL1 downregulation promotes SCC tumor invasiveness [39].
Although a direct role of IRF6 in lung cancer has not been identified, studies suggest that IRF6 is a crucial regulator of the cell cycle, promoting progression to the G0 state and allowing for uncontrolled cell proliferation [18]. Decreased IRF6 expression has been associated with poor prognosis of gastric cancer and increased invasiveness of breast cancer [40, 41].
Interestingly, SLC6A10P was the single gene that we found to drive two specific subtypes of ADC. SLC6A10P was previously found to be a marker for aggressive ADC [29] and was recently implicated in the Notch signaling pathway [42]. Our findings suggest that SLC6A10P warrants further investigation as a genetic biomarker in the context of the ADC patient subpopulation.
DST and DSC3 have been increasingly reported to be highly expressed in both ADC and SCC. Overexpression of these desmosomal genes is associated with increased CD8+ T-cell infiltration in ADC [35].
CLCA2 has been implicated as a negative regulator of cancer cell migration [43]. In the lung, CLCA2 has been reported to be highly expressed in SCC, suggesting that it may serve as a diagnostic marker to differentiate SCC from ADC. Female patients with CLCA2-negative SCC exhibited significantly poorer prognoses [30].
DSG3 has been reported to play a role in SCC and has been used as a sensitive and specific marker for SCC. It was also shown to be an effective discriminator between SCC and ADC [44, 45]. Higher DSG3 expression correlated with lower survival in SCC [46]. DSG3 and KRT5 have been reported to be downregulated in ADC [47].
LPCAT1 has recently been shown to be overexpressed in lung SCC and associated with decreased overall survival (OS) [48]. In lung ADC, LPCAT1 overexpression was associated with higher probabilities of metastasis and poor clinical outcomes [49].
There is evidence that supports the role of cell adhesion proteins in both ADC and SCC. However, GJB5 has been implicated specifically in SCC mechanisms and is associated with gap junctions [31]. It is therefore not surprising that GJB5 expression is higher in SCC (Figure 2) [17, 32]. GJB5 (gap junction protein beta 5), which encodes connexin 31.1 (Cx31.1), is involved in intercellular communication related to epidermal differentiation and environmental sensing.
Cx31.1 was found to be downregulated in NSCLC with expression levels inversely related to metastatic potential, suggesting it inhibits malignant properties of NSCLC cell lines. Cx31.1 is colocalized with LC3-II (autophagy marker light chain 3) and acts as a tumor suppressor as it plays a role in the regulation of cell proliferation, cell differentiation, tissue development and apoptosis [32].
TRIM29 has been shown to be upregulated in NSCLC, and may be a marker for tumour aggressiveness [50]. It has been further associated with poorer histological grade and clinical outcomes in SCC [51]. It has been suggested that this may be due to the inhibition of p53 via TRIM29 [52].
KRT17 overexpression has been associated with both subtypes of NSCLC, but was significantly correlated to more advanced tumour grade, lymph node metastatic potential, and overall survival in ADC [53].
BNC1 has been reported to be hypermethylated in NSCLC tissue [54]. Furthermore, decreased expression of BNC1 has been observed in other carcinomas [55]. Aberrant BNC1 and BNC2 expression contributes to tumor progression [56].
Reports of upregulation of desmosomes and gap junctions in SCC and tight junctions in ADC suggest that SCC and ADC are characterized by distinct sets of adhesion molecules [17]. Here, we found that ADC was identified by CGN and SCC by DSC3 (Figure 2). CGN (cingulin) is involved in the organization of tight junctions and is downregulated in SCC [17], consistent with reports that ADC is characterized by tight junctions while SCC is characterized by gap junctions.
In addition to the 17 identified genes differentially expressed in ADC and SCC, PTGFRN (prostaglandin F2 receptor negative regulator; CD315) was also found to be associated with ADC. PTGFRN has been reported to be associated with worse survival in glioblastoma, while inhibition has been associated with decreased proliferation and tumor growth [57, 58]. PTGFRN inhibits the binding of prostaglandin F2α to its receptor. Notably, there are reports that PTGFRN is associated with small cell lung cancer; however, the role remains unknown [59, 60].
IRF6 and CLCA2 have previously been implicated in lung SCC [28, 30]. CLCA2 in particular was highlighted to differentiate ADC and SCC. Furthermore, SCC expression was correlated with tumour grade upon histological characterization. In particular, CLCA2 negative samples were associated with poorly differentiated tumours [30].
Males have been reported to have a significantly poorer NSCLC prognosis compared to females, shifting efforts towards sex-based approaches to diagnosis, prognosis, and therapeutic interventions [61, 62]. Additionally, estrogens have been associated with increased risk of ADC in women despite equal expression of estrogen receptors α and β; however, their role remains unclear [63]. While there are several reports on sex-based differences in cancer mechanisms, including differences in metabolism, immunity, and angiogenesis, differences in CGN and DSC3 expression have not been previously reported to the best of our knowledge [64]. Gap junction proteins, also known as connexins, serve as channels that connect the interior of adjacent cells, facilitating intracellular homeostasis and coordination of activities via second messengers [65]. Desmosomes primarily provide mechanical strength via a structural network. In contrast, tight junctions form a barrier around the cell, regulating permeability of the paracellular space [66, 67]. These molecules play critical roles in epithelial-to-mesenchymal transition, a process involved in cancer metastasis. Though no sex-based differences have been reported, this represents a promising field of research, as there may be different druggable targets for males and females.
Finally, the phosphatidylinositol glycan anchor biosynthesis class gene, PIGX, was found to be a driver of ADC and SCC differentiation in several instances (Figure 1). Little is known about the role of PIGX in NSCLC. However, it has been noted that PIGX has a proliferative role when expressed in breast cancer cells [27]. In addition, the authors of that study found that higher PIGX expression was associated with shorter recurrence-free survival. This suggests that this gene may play a role in NSCLC that warrants further study.
5. Conclusions
The approach utilized here to derive these insights relied on the ability of certain machine learning methods to create hypotheses about subpopulations of patients, followed by statistical testing of the driving variables of these subgroups. In this way, we utilized machine learning to derive potential insights and then utilized statistical methods that are suitable for small data to evaluate differential expression. Creating robust predictive models with machine intelligence requires large data sets, but here we instead utilized the ability of some of these methods to create hypotheses, and then used methods appropriate for small data to test these hypotheses. This two-step approach allowed us to derive insights from these small datasets that have been previously validated and, finally, to derive a new potential role for the gene PIGX in NSCLC.
Limitations
This study highlights a methodology that targets genes of interest in a small, heterogeneous subpopulation of NSCLC. Although small populations are more prone to local drivers of genetic heterogeneity, they will not encompass all genes that may drive other subtypes of NSCLC in patients. Therefore, a limitation of this methodology resides in its inability to forecast more obvious patterns found in larger datasets.
Data Availability
All data was procured from a publicly available database from the Gene Expression Omnibus. The dataset consisted of 58 samples of ADC and SCC (GSE10245) and 45 samples of human lung cancer and controls (GSE18842) to obtain a total of 103 samples.
Footnote
The TRIPOD reporting statement has been completed.
The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
According to the TRIPOD Checklist: Prediction Model Development, the following Items can be found:
1. Page 1
2. Page 2
3a. Introduction. Page 4, paragraph 1
3b. Introduction. Page 4, paragraph 2
4a. Datasets: Page 5, paragraph 1
4b. Not applicable
5. Datasets: Page 5, paragraph 1
6. Machine Intelligence. Page 6.
7. Machine Intelligence. Page 6, 7.
8. Datasets. Page 6, paragraph 2
9. Complete case analysis
10 a. Machine Intelligence. Page 6, 7.
10 b. Machine Intelligence. Page 6, 7.
10 c. Machine Intelligence. Pages 6-8
11. Not applicable
13 a, b. Datasets: Page 5, paragraph 1
14 a. Datasets: Page 5, paragraph 1
14 b. Not applicable
15 a, b. Machine Intelligence. Pages 6-8
16. Results, Pages 8-16
18. Conclusion. Page 23
19b. Discussion. Page 16-21
20. Discussion. Page 22
21. Not applicable
22. Page 27
Conflicts of interest
J.G. is a major shareholder of NetraMark Corp, where NetraMark is a technology company providing clinical trial support to pharmaceutical companies.
L.P. has previously acted as a scientific consultant for AbbVie USA; Acadia USA; BCG Switzerland; Boehringer Ingelheim International GmbH; Compass Pathways; EDRA-Publishing, Italy; Ferrer Spain; Gedeon-Richter, Hungary; Inpeco SA, Switzerland; Johnson & Johnson USA; NeuroCog Trials USA; Novartis-Gene Therapies, Switzerland; Otsuka USA; Pfizer Global USA; PharmaMar Spain; Relmada Therapeutics USA; Takeda, USA; VeraSci, USA; Vifor Switzerland.
Ethical approval
Not applicable
Consent to participate
Not applicable
Availability of data and materials
Data was obtained from publicly available datasets GSE10245 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE10245 and GSE18842 https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE18842
Funding
Part of this research was funded by NetraMark Corp in the form of salary for Dr. Joseph Geraci, and computational resources.
Footnotes
Figures have been revised for clarity. Introduction has been revised for brevity.
Abbreviations
- ADC: adenocarcinoma
- AUC: area under the curve
- CNN: convolutional neural network
- CT: computed tomography
- EMT: epithelial-to-mesenchymal transition
- LCC: large cell carcinoma
- ML: machine learning
- NSCLC: non-small cell lung cancer
- PET: positron emission tomography
- ROC: receiver operator curve
- SCC: squamous cell carcinoma
- SVM: support vector machine
- TME: tumor microenvironment
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].