Machine learning predicts metastatic progression using novel differentially expressed lncRNAs as potential markers in pancreatic cancer
=======================================================================================================================================

* Hasan Alsharoh

## Abstract

Pancreatic cancer (PC) is associated with high overall mortality. Recent literature has focused on investigating long noncoding RNAs (lncRNAs) in several cancers, but studies on their functions in PC are lacking. To identify lncRNAs with significantly altered expression in PC, I collected data from The Cancer Genome Atlas (TCGA), extracted RNA-sequencing (RNA-seq) transcriptomic profiles of pancreatic carcinomas, and performed differential gene expression analysis. Out of 60,660 gene transcripts shared between 151 PC patients, I identified 38 lncRNAs that were significantly differentially expressed. To further investigate the functions of these genes, gene set enrichment analysis (GSEA) was performed on the population lncRNA panel. GSEA results revealed enrichment of several terms implicated in proliferation. To assess the contribution of these lncRNAs to metastatic progression, I used different machine learning (ML) algorithms, including logistic regression (LR), support vector machine (SVM), random forest classifier (RFC) and eXtreme Gradient Boosting Classifier (XGBC). Restricting the predictors to the significantly differentially expressed lncRNAs, tuning hyperparameters, and reducing bias through the synthetic minority oversampling technique improved the accuracy of the ML models. Of the four algorithms, SVM and RFC predicted metastatic progression with 76% accuracy. To the best of my knowledge, this is the first study to identify this lncRNA panel as differentiating between nonmetastatic and metastatic PC, with many novel lncRNAs previously unmapped to PC. The ML accuracy scores suggest an important involvement of the detected RNAs.
Based on these findings, I suggest further investigation of this gene panel *in vitro* and *in vivo*, as these lncRNAs could be targeted for improved outcomes in PC patients and could assist in the diagnosis of metastatic progression based on RNA-seq data of primary pancreatic tumors.

## 1. Introduction

Pancreatic cancer (PC) is one of the deadliest cancers, with an overall five-year survival between 7.2 and 10% according to the literature1,2. Evidence suggests that PC is often diagnosed in the late stages of tumorigenesis, likely contributing to its high mortality rate3. Recent literature has provided increasing evidence regarding the involvement of long noncoding RNAs (lncRNAs) in the development, invasiveness, angiogenic potential, chemotherapeutic resistance and metastatic capacity of PC4. LncRNAs are RNA molecules longer than an arbitrary cutoff of 200 nucleotides that have been shown not to code for proteins4,5. LncRNAs play complex roles in biological processes in various tissues, with possible implications in DNA repair, cellular proliferation, and human diseases, which has made them a common target of recent cancer research6. LncRNAs have further been used as biomarkers for overcoming chemoresistance, as well as for the diagnosis of several cancers, including PC7–10.

Emerging research has provided evidence for the use of lncRNAs in improving diagnostic accuracy, prognosis prediction, and treatment adjustment using various methods, including machine learning (ML) techniques8–10. The literature on ML algorithms has been growing rapidly, with calls for wider use of such algorithms in oncology to increase diagnostic accuracy or to further improve on the available algorithms11–13. In this study, I aimed to investigate potential lncRNAs involved in the metastatic progression of PC based on RNA-sequencing (RNA-seq) data.
To achieve this objective, I collected publicly available data from The Cancer Genome Atlas (TCGA) for 172 patients and filtered the data according to predefined inclusion and exclusion criteria, which resulted in 151 PC records. PC records were further categorized according to their TNM staging, and tumor data were separated into tumors with metastatic activity (TMAs) and tumors without metastatic activity (TWAs). Using bioinformatics techniques, I identified 125 differentially expressed genes (DEGs) among the 60,660 genes included in this study, many of which were novel. I further assessed the functions of this global gene panel using a multiparametric approach. Finally, I extracted lncRNA counts from the RNA-seq data of the PC population and characterized 38 novel lncRNAs that were significantly differentially expressed.

To further evaluate their involvement, I used four ML algorithms to distinguish between TMAs and TWAs: multivariate logistic regression (LR), support vector machine (SVM), random forest classifier (RFC), and eXtreme Gradient Boosting Classifier (XGBC). I used several techniques to reduce bias within the included sample, as described in the methodology. The ML algorithms were trained and evaluated by separating the dataset of the 38 DEGs into a training set and a testing set, which was then used to evaluate the performance of each model. Of all the ML algorithms, SVM and RFC were able to distinguish TMAs from TWAs with 76% accuracy using the 38-lncRNA data, suggesting important implications for this set of lncRNAs in PC. To the best of my knowledge, this is the first study to identify the involvement of this specific lncRNA panel in PC, and many of these novel lncRNAs have not been studied before.
The results of this research have important clinical implications, and the novelty of the lncRNAs calls for further comprehensive validation and *in vitro* and *in vivo* investigations. The accuracy shown by the ML models suggests that these novel lncRNAs could be used as biomarkers and further targeted for improved diagnosis and outcome in PC patients.

## 2. Methods

### 2.1. Data acquisition

The TCGA database was used for data collection and is available at [https://www.cancer.gov/tcga](https://www.cancer.gov/tcga). Exploration of TCGA-PAAD project data to acquire pancreatic RNA-seq data was performed on 25 October 2023. The file filters applied were a) Data Category: transcriptome profiling; b) Data Type: Gene Expression Quantification; c) Experimental strategy: RNA-Seq; d) Access: open. The case filters applied were a) primary site: pancreas; b) project: TCGA-PAAD; and c) disease type: ductal and lobular neoplasms, adenomas and adenocarcinomas.

Records were included if the RNA-seq dataset had a consistent structure and belonged to the PC tumor types defined by the filters above, regardless of age and gender. Primary tumors, regardless of metastatic stage, were also included. Exclusion criteria were defects in dataset structure, RNA-seq of tumor-adjacent tissues, and prior therapy for a previous malignancy. I also excluded records with annotations specifying that tumor data were incorrectly labeled in terms of whether the tumor was neoplastic.

Further categorization of the acquired data was performed using Excel sheets. For TNM subgroup analysis, tumors with staging data were categorized into tumors with metastatic activity, which included those classified as M1, or as MX/M0 with N1 or above, and tumors without metastatic activity, which included those classified as M0N0. Acquired data were also filtered to include only lncRNA gene expression quantification.
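The TNM-based grouping rule described above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the procedure actually used (the categorization was done in Excel sheets): the function name `classify_metastatic_activity` is hypothetical, and returning `None` for combinations the text does not cover (for example MX with N0 or NX) is an assumption.

```python
from typing import Optional


def classify_metastatic_activity(m_stage: str, n_stage: str) -> Optional[str]:
    """Return "TMA", "TWA", or None from pathological M and N categories.

    Rule paraphrased from the text: M1, or MX/M0 with N1 or above -> TMA;
    M0 with N0 -> TWA. Anything else is left unclassified (assumption).
    """
    m = m_stage.strip().upper()
    n = n_stage.strip().upper()

    # Distant metastasis always counts as metastatic activity.
    if m.startswith("M1"):
        return "TMA"

    # Node-positive disease (N1 or above, including sub-stages such as N1b).
    node_positive = n.startswith(("N1", "N2", "N3"))
    if m in ("M0", "MX") and node_positive:
        return "TMA"

    # No nodal and no distant spread.
    if m == "M0" and n.startswith("N0"):
        return "TWA"

    # Combinations not described in the text are not guessed.
    return None


if __name__ == "__main__":
    for m, n in [("M0", "N0"), ("MX", "N1b"), ("M1", "N0"), ("MX", "NX")]:
        print(m, n, classify_metastatic_activity(m, n))
```

Keeping the unclassifiable case explicit makes it easy to count how many records fall outside the two groups before any downstream analysis.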
This subgrouping was performed prior to DGEA to assess differentially expressed genes between TWAs and TMAs.

### 2.2. Data analysis

Bioinformatics analysis was conducted on the data after matching the subjects to the study’s inclusion and exclusion criteria. Python v3.11 (available at [https://www.python.org/](https://www.python.org/)) was used in an Anaconda JupyterLab environment14,15. The glob module was used to assemble the RNA-seq files of the study population into a single dataset and to import the data into Python16. Data manipulation was performed using the pandas library v1.5.317. Libraries such as numpy and scipy were also utilized for data processing18,19.

Differential gene expression analysis (DGEA) was performed using PyDESeq2, a Python implementation of the R package DESeq2 that has been suggested to be reliable and comparable to the R package20. The DEGs were matched to gene symbols and further visualized using the matplotlib21, seaborn22, and sanbomics23 packages. PyDESeq2 calculates the significance of genes using the Wald test, performs count normalization using the median-of-ratios method, as in DESeq2, and relies on the statsmodels library24,25. Count normalization has been shown to have higher accuracy than TPM (transcripts per million) and FPKM (fragments per kilobase of transcript per million fragments mapped)26. A further description of the package is available elsewhere20. Genes were considered significantly differentially expressed at an adjusted p value <0.05 and an absolute log2-fold change (log2FC) >0.5. A heatmap of the DEGs was also made with the matplotlib21 package. Pearson’s correlation coefficient was calculated and mapped for all gene transcript data.

### 2.3. Gene set and ontology enrichment analysis

Gene set enrichment analysis (GSEA) is a method of interpreting genome-wide expression profiles27.
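As a rough illustration of the statistic behind GSEA, the sketch below computes a running-sum enrichment score over a ranked gene list. It is a simplified, unweighted version written for this example (the function name `enrichment_score` and the toy gene names are hypothetical); weighted, pre-ranked implementations additionally scale each hit by the gene's ranking metric.

```python
from typing import Iterable, Sequence


def enrichment_score(ranked_genes: Sequence[str], gene_set: Iterable[str]) -> float:
    """Unweighted running-sum enrichment score for one gene set.

    Walk down the ranked list; step up by 1/N_hit at each gene in the set
    and down by 1/N_miss otherwise. The score is the running-sum value that
    deviates most from zero (positive: hits cluster near the top of the
    ranking, negative: hits cluster near the bottom).
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    running = 0.0
    extreme = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


if __name__ == "__main__":
    ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # toy ranking, top to bottom
    print(enrichment_score(ranked, {"g1", "g2"}))  # set members at the top
    print(enrichment_score(ranked, {"g5", "g6"}))  # set members at the bottom
```

The sign of the score is what distinguishes gene sets concentrated at one end of the ranking from those concentrated at the other.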
GSEA was performed using the GSEApy v1.0.6 package, a Python package for GSEA with a Rust implementation of the core algorithm, which uses RNA-seq count data to evaluate predefined gene sets in association with different phenotypes. I ranked expression data using the prerank function available in the package. The accuracy of this package has been demonstrated previously, and its usage is described extensively elsewhere28. Enrichment was performed for several gene collections from MSigDB, available at [https://www.gsea-msigdb.org/](https://www.gsea-msigdb.org/), and miRTarBase 201729. The gene sets and collections evaluated for enrichment were c2.cp.kegg.v2023.1.Hs.symbols, c3.mir.v2023.1.Hs.symbols, c3.tft.v2023.1.Hs.symbols, c4.cgn.v2023.1.Hs.symbols, c5.go.bp.v2023.1.Hs.symbols, c5.go.cc.v2023.1.Hs.symbols, c5.go.mf.v2023.1.Hs.symbols, c5.hpo.v2023.1.Hs.symbols, c6.all.v2023.1.Hs.symbols, h.all.v2023.1.Hs.symbols, and miRTarBase_2017.

Gene Ontology (GO) is a detailed resource with annotations of gene and gene product functions30,31. It describes gene functions by assigning genes to specific terms and detailing the relationships between those terms. GO term enrichment was performed through GSEApy, and the results were extracted through tools available in that package. The GO graph was made by extracting the enriched GO terms and inserting their source identifiers into AmiGO32. Enrichment was considered significant at a false discovery rate (FDR) <0.05. Visualization of GSEA results was performed using tools from GSEApy. Data collected from GSEA results included terms, FDR, enrichment and negative enrichment scores, as well as matched genes. The minimum matching size for gene sets when performing GSEA for the global gene panel was set to 150. However, for the lncRNA panel, the minimum matching size was set to 3, as there were few enriched gene sets.

### 2.4. Machine learning models

I employed multivariate LR, SVM, RFC, and XGBC to predict metastatic risk for the population based on the lncRNA gene count data from TCGA. DEGs were extracted from DGEA for use as the sole predictors of metastatic progression in the study population. Analysis of the models’ accuracy was performed using packages from the scipy, scikit-learn, and matplotlib libraries. To train the ML algorithms, data were split into a training set (70% of the data) and a testing set (30%). A random state number was set for all the implemented ML models to fix the seed of randomness during the analysis and maintain reproducibility. For binary classification, TNM stage IIa or below was designated “0” and considered the TWA class for the ML algorithms, while TNM stage IIb or above was designated “1” and considered the TMA class. The testing set was withheld from the ML algorithms during training so that predictive performance could be evaluated afterwards.

Furthermore, hyperparameter tuning was performed to improve the predictive accuracy of the models. This was done through the GridSearchCV and BayesianSearchCV modules. Fivefold cross-validation was set as a parameter, and regularization was applied using the L2 method. The inverse of the regularization strength (or penalty values) was set according to the optimal values found by the search modules specified above. To identify the best parameters, values were also tested over 50 iterations. Moreover, the synthetic minority oversampling technique (SMOTE) was used to artificially increase the number of TWA samples to reduce bias; SMOTE has proven to be a powerful tool for improving ML accuracy and addressing imbalanced samples33. These methods of standardization were applied to all ML algorithms used. The ML algorithms were provided by the scikit-learn and XGBoost libraries. All are supervised ML algorithms and are commonly used for tumor classification34,35.
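The labelling and 70/30 split described in this section can be sketched with the standard library alone. This is an illustrative sketch, not the study's code: the helper names `binary_label` and `train_test_indices` and the seed value 42 are assumptions, and rounding the test share up is a choice made here for the example. In practice a library routine such as scikit-learn's `train_test_split` with a fixed `random_state` serves the same purpose.

```python
import math
import random
from typing import List, Tuple

EARLY_STAGES = {"0", "I", "IA", "IB", "IIA"}   # designated "0" (TWA)
ADVANCED_STAGES = {"IIB", "III", "IV"}         # designated "1" (TMA)


def binary_label(stage: str) -> int:
    """Map a TNM stage string to 0 (stage IIa or below) or 1 (IIb or above)."""
    key = stage.strip().upper().replace("STAGE ", "")
    if key in EARLY_STAGES:
        return 0
    if key in ADVANCED_STAGES:
        return 1
    raise ValueError(f"unrecognized stage: {stage!r}")


def train_test_indices(
    n_samples: int, test_fraction: float = 0.30, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices with a fixed seed and split into train/test."""
    rng = random.Random(seed)            # fixed seed -> identical split each run
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)  # test share rounded up
    return sorted(indices[n_test:]), sorted(indices[:n_test])


if __name__ == "__main__":
    train_idx, test_idx = train_test_indices(151)
    print(len(train_idx), len(test_idx))
    print(binary_label("Stage IIA"), binary_label("Stage IIB"))
```

Fixing the seed is what makes the held-out set identical across runs, so accuracy figures from different models refer to the same test samples.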
Further, L2 regularization has been reported to improve the accuracy of ML algorithms36.

## 3. Results

### 3.1. Primary characteristics of the study population

Of the 179 retrieved records, 23 were excluded for the following annotations: a) “This case is a neuroendocrine tumor and should not have been included in the PAAD study” (n = 8); b) “Per the PAAD EPC, this tumor is a normal pancreas with atrophy” (n = 5); c) “Per the PAAD EPC, this tumor is an atrophic pancreas” (n = 3); d) “Per the PAAD EPC, this tumor is a noninvasive IPMN” (n = 1); e) “Per the PAAD EPC, this tumor is an acinar cell carcinoma” (n = 1); f) “Per the PAAD EPC, this tumor is a normal ampula of Vater” (n = 1); g) “The PAAD EPC states that this case likely did not arise in the pancreas (ampullary)” (n = 1); h) “Systemic treatment given to the prior/other malignancy” (n = 1); i) “Per the PAAD EPC, this tumor is an atrophic pancreas with a single focus of low-grade PanIN” (n = 2). Records annotated “Samples identified in the sample sheet with a sample type of "Solid Tissue Normal" (from normal tissue adjacent to malignancy)” were also excluded. According to the flow diagram in **Figure 1**, a total of 151 patient records were included. **Table 1** summarizes the characteristics of the cohort. Notably, 115 records were classified as TMAs, while 36 were classified as TWAs. Of the TMAs, 116 were diagnosed as TNM stage IIb, and 8 were diagnosed as stage III and IV. For the TWAs, 26 were at TNM stage IIa.

Figure 1. Flow diagram of the study. Created with Lucidchart, [www.lucidchart.com](http://www.lucidchart.com). TCGA: The Cancer Genome Atlas; PAAD: Pancreatic adenocarcinoma; TMA: Tumor with metastatic activity; TWA: Tumor without metastatic activity; DGEA: Differential gene expression analysis; GSEA: Gene set enrichment analysis; ML: Machine learning.

Table 1.
Population primary characteristics

The age range of the total patient sample was between 35 and 88 years old (mean = 64.66 ± 10.91). Ninety-four were males, and 78 were females. When reported, 143 had infiltrating duct carcinoma, and 16 had adenocarcinoma as the primary diagnosis. Eight had neuroendocrine tumors but were excluded. Seventeen pancreatic tumors had no specified location, 125 were pancreatic head lesions, 15 were pancreatic body lesions, and 13 were pancreatic tail lesions. The RNA-seq data included 60,660 gene expression profiles for each of the included patient and control samples. Transcriptomic profiling was performed for the same genes in all patient samples. Of the available transcripts, 16,901 were lncRNAs. After removing lncRNAs with 0 values among all patients, 15,879 lncRNAs remained. All details regarding the included samples are available in **Supplementary Material 1.**

### 3.2. DGEA and GSEA of all gene transcripts

A total of 60,660 gene transcripts were filtered following PyDESeq2 analysis, and unavailable values were dropped, resulting in 47,528 transcripts. DGEA revealed 125 differentially expressed genes, as shown in **Table 2**, and the top differentially expressed genes are shown in **Figure 2**. Notably, ADH7, SERPINB13, MIR205HG, NTS, and LINC01300 were the most downregulated genes, with log2FC values of -3.42295, -3.4189, -3.12513, -3.02808, and -2.72096, respectively. The most upregulated genes were PAX7, AC010789.1, TMPRSS15, DEFA6, and DEFA5, with log2FC values of 3.149596, 3.506053, 3.538356, 3.594891, and 4.800701, respectively.

Figure 2. Differentially expressed genes in PC. Absolute log2FC>0.5 and adjusted p value<0.05 were considered as the significance thresholds.

Figure 3. **A.** GOBP (GO biological process) term enrichment. Upregulated genes had a lower rank, and downregulated genes had a higher rank.
The enrichment score correlates with the number of genes from the gene panel enriching the gene set with significantly differentiated expression. The enrichment score reaches -0.5 because the genes enriching this term are concentrated among the higher-ranked genes, meaning that more of the genes enriching this term are downregulated in this study. **B.** miRTarBase_2017 term enrichment.

Table 2. Differentially expressed genes found in the global gene sample

GSEA was subsequently performed, with the libraries investigated listed in **Supplementary Materials 2**. Many gene sets were enriched, as many genes were included in the study’s gene panel. Notably, several GO terms were enriched, as well as some terms from miRTarBase 2017, as shown in **Figure 3 A and B**. FDR values were significant for the enriched terms (FDR<0.01). Upregulated genes had a lower rank, and downregulated genes had a higher rank. The enrichment score correlates with the number of genes from the gene panel enriching the gene set with significantly differentiated expression. Here, the gene set was more enriched with the upregulated genes from the gene panel.

### 3.3. lncRNA DGEA, correlations, and GSEA

Further subgroup analysis was performed for lncRNAs in PC, which returned 16,901 gene expression values; PyDESeq2 was again used to analyze DEGs. Dropping the 0-sum, duplicate, and unavailable values left 15,568 lncRNAs. Of the lncRNA panel, 38 lncRNAs were significantly differentially expressed (shown in **Figure 4**).

Figure 4. Differentially expressed lncRNAs. Absolute log2FC>0.5 and adjusted p value<0.05 were considered as the significance thresholds.

Interestingly, the most downregulated genes were LINC01300, DUSP5-DT, AL513128.3, MIR205HG, and AC132192.2, with log2FC values of -2.55682, -1.55378, -0.70877, -2.68894, and -0.68868, respectively.
The most upregulated genes were AC010789.1, LINC00486, ENSG00000261409 (referred to as RF00019), LINC01115, and AC133530.1, with log2FC values of 2.154221, 1.214608, 3.647081, 1.705921, and 2.388161, respectively. Results of DGEA on the lncRNAs are shown in **Table 3**.

Table 3. DGEA of lncRNAs in PC.

Moreover, because the number of DEGs was manageable, each lncRNA was correlated with the rest to visualize the relationships between them, and Pearson’s correlation coefficients for all the lncRNAs were extracted. The results are visualized in **Figure 5**. A table of all Pearson’s correlation coefficients can be found in **Supplementary Material 3**.

Figure 5. Hierarchical clustering heatmap of lncRNAs amongst the sample population. The color gradient in the legend refers to Pearson’s correlation coefficient. The dendrogram linkage is based on the correlation strength. Geneid: ENSEMBL ID. tw: TWAs; tm: TMAs.

GSEA and GO analyses were subsequently performed for all the lncRNA data. Due to the lack of studies on the genes of these transcripts, there was no significant enrichment in most databases. Notably, a few terms were enriched from the MSigDB c3.tft.v2023.1.Hs.symbols collection, which is focused on transcription factors. The results of the term enrichment for the top 10 terms in this collection are shown in **Figure 6**, and the results for insignificant term enrichment for other collections and databases can be found in **Supplementary Material 3.**

Figure 6. GSEA of lncRNA data. Terms are more significantly enriched with downregulated genes.

### 3.4. ML model prediction of PC metastatic potential according to lncRNA gene expression

Following the training and testing of each of the ML models, optimizations were performed to find the highest possible accuracy obtainable while reducing bias. Therefore, SMOTE was implemented in all the ML algorithms.
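The idea behind SMOTE can be sketched in a few lines: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. The function below is a simplified standard-library illustration written for this example (the name `smote_oversample`, its parameters, and the toy data are hypothetical), not the implementation used in the study; maintained implementations are available, for instance in the imbalanced-learn package. Oversampling is typically applied to the training data only, so that the test set contains real samples exclusively.

```python
import math
import random
from typing import List, Sequence


def smote_oversample(
    minority: Sequence[Sequence[float]], n_new: int, k: int = 5, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in smote_oversample(toy_minority, n_new=3, k=2):
        print(row)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region already occupied by that class.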
Reducing sample imbalances improved the predictive accuracy of the utilized algorithms. Following SMOTE implementation and thorough hyperparameter tuning, LR demonstrated an accuracy of 73.91% when distinguishing between TMAs and TWAs on the testing set, as well as an F1 score of 82.57% and a recall of 90.63%. However, the area under the curve (AUC) for LR was 0.63, which was relatively low. **Figure 7 A and B** show the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve for logistic regression following the implementation of SMOTE. **C** shows the weight of each lncRNA (feature) in the regression.

For the SVM model, SMOTE implementation and hyperparameter tuning also improved the predictive potential of the algorithm, which, on testing, returned an accuracy of 76.09%, with an F1 score of 84.51% and a recall of 93.75%. **Figure 8 A and B** show the ROC curve as well as the PR curve of the SVM model.

Figure 7. **A.** The LR model showed an AUC = 0.63, demonstrating relatively weak classification performance, despite the good accuracy of detecting PC cases at TNM stage IIb or above. **B.** LR model accuracy of predicting positive values in comparison to the true positive rate (recall). **C.** Weights of each of the differentially expressed lncRNAs allowing the LR model to differentiate between nonmetastatic tumors and metastatic tumors.

Figure 8. **A.** The SVM algorithm showed an AUC = 0.65, demonstrating modest accuracy of detecting PC cases at TNM stage IIb or above and distinguishing them from less metastatic stages. **B.** SVM model accuracy of predicting positive values in comparison to its recall capacity.

RFC was one of the most accurate models; after hyperparameter tuning, it returned an accuracy of 76.09% and an F1 score of 81.96%, with a recall of 78.13%.
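Accuracy, precision, recall, and F1 score are all derived from the same four confusion-matrix counts, which the short sketch below makes explicit. The function name `classification_metrics` is hypothetical, and the counts in the usage example are illustrative only: they are not reported in the text and were chosen so that accuracy and recall land near the figures quoted for the SVM model.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0   # also called true positive rate
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Illustrative counts only (assumption): 46 test samples, 32 of them positive.
    example = classification_metrics(tp=30, fp=9, fn=2, tn=5)
    for name, value in example.items():
        print(f"{name}: {value:.4f}")
```

With an imbalanced test set, a high recall can coexist with modest precision, which is why accuracy alone is a weak summary and the ROC AUC and PR curves are reported alongside it.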
Most importantly, the AUC for this model was 0.75, showing good performance in classifying the tumors. Overall, the gene panel consisting of 38 genes allowed the ML algorithms to discern advanced TNM stages from relatively early TNM stages in PC. **Figure 9 A and B** show the ROC and PR curves of the RFC model.

As for XGBC, the model showed 71.73% accuracy; this model was the most inconsistent in predicting tumor types following each randomization. **Figure 10 A and B** show the low AUC and its PR curve. Data regarding the evaluation of the ML algorithms are available in **Supplementary Material 4**.

Figure 9. **A.** The RFC ROC AUC was 0.75, demonstrating acceptable accuracy of detecting PC cases among the other ML algorithms when using the differentially expressed lncRNA counts data, regardless of hyperparameter tuning. **B.** The RFC PR curve showed good recall, albeit with low precision.

## 4. Discussion

Despite advances in diagnostics and therapeutics, PC remains a very challenging condition to treat, with consistently high mortality rates and limited available treatments37,38. Recently, research has focused on identifying prognostic markers for PC, and preclinical studies have identified several prognostic lncRNA signatures8,39–41. LncRNAs have been further suggested to have implications in diagnosis, drug resistance, and therapeutics in PC4. However, as most patients are diagnosed at advanced stages of disease, mutational burdens show complex relationships with lncRNA regulation4. Therefore, as the literature suggests, these relationships must be investigated to adjust treatment modalities. This becomes even more crucial in the later stages of PC.

This study aimed first to provide details regarding DEGs in PC and then to analyze differentially expressed lncRNAs and assess their diagnostic potential during the transition from stage IIa to stage IIb and above.
These 38 lncRNA transcripts were extracted by performing DGEA on the global RNA-seq gene panel among 151 patient samples. The diagnostic potential of the lncRNAs was assessed using supervised ML techniques to predict metastatic transition. I employed four ML techniques with established accuracy in prediction: LR42, SVM43, RFC43 and XGBC44.

DGEA of the global gene panel revealed 125 DEGs, many of which were previously uninvestigated. Of the downregulated DEGs, ADH7 was hypothesized to have implications when mutated in pancreatic injury45. NTS was also associated with PC46. However, SERPINB13 and MIR205HG were previously unexplored in PC but had been discussed in other cancers and were implicated in poor clinical outcomes47,48. No studies are available regarding LINC01300, which warrants further investigation. For the upregulated DEGs, PAX7 was previously reported to have some relationship with cancers, yet studies regarding this specific gene transcript are lacking49. For DEFA6 and DEFA5, a report suggested a link between them and clinical outcomes in colorectal cancer50. While there were no studies regarding AC010789.1 and TMPRSS15 in PC, some studies linked these genes to other cancers51,52.

GSEA for the global gene panel revealed several enriched pathways. For example, GO enrichment revealed that the gene panel significantly enriched pathways relevant to the regulation of aerobic respiration (GO:1903715), the electron transport chain (GO:0022900), and mitochondrial gene expression and translation into RNA transcripts (GO:0140053). Notably, of the miRTarBase enriched pathways, miR-30b-5p microRNA (miRNA) was previously linked to PC53,54. While miR-548x-3p has not been studied regarding its function in cancer, miR-144-3p was previously implicated in PC55,56. Additionally, miR-548j-3p had no studies documenting its relationship with cancer.
For miR-1468-3p, some studies have suggested it as a biomarker for non-small cell lung cancer and prostate cancer57,58.

Following the filtering of the global RNA-seq gene panel to lncRNAs exclusively, DGEA revealed 38 differentially expressed lncRNAs, many of which were novel. LINC01300 and MIR205HG, as previously described, in addition to DUSP5-DT and AL513128.3, had no studies in PC, and the latter two have not been studied at all. In contrast, one report regarding AC132192.2 indicated its relevance in prostate cancer59. For the upregulated lncRNAs, AC010789.1, as previously stated, had a report regarding its function in colorectal cancer52,60. LINC00486, RF00019, LINC01115, and AC133530.1 all lack validation studies in PC, but other reports indicate involvement in several diseases, including cancer61–64.

As these novel lncRNAs lack studies regarding their functions, GSEA of the selected MSigDB collections returned no significant enrichment except in one transcription factor collection. Notably, the most enriched pathway described genes containing one or more binding regions for a transcription factor that regulates cell fate and controls cell cycle progression from the mitotic phase to interphase, known as TOX high mobility group box family member 4 (TOX4)65,66. Interestingly, the lncRNAs enriching this pathway were primarily downregulated.

To further explore the significance of the identified 38 lncRNAs, ML algorithms were employed to predict the metastatic state of cancer (designated “0” for stages IIa or below and “1” for stages IIb and above). RFC outperformed the other algorithms, with an AUC of 0.75 and an accuracy of over 76%. While there is much to be understood regarding the functions of the identified lncRNA panel, the accuracy shown by RFC suggests that these lncRNAs are involved in PC. This finding warrants further *in vitro* and *in vivo* investigations.
For most of the lncRNAs in the identified panel, this was the first study to uncover their involvement in PC. The findings discussed here have several clinical implications. The results of this study suggest that the identified lncRNAs could be further utilized to assess the metastatic potential of PC, as well as aid in drug development, since these lncRNAs could potentially be used as drug targets. Since their involvement allowed the prediction and distinction between TNM stages, further investigation of their functions seems crucial.

Despite the significant findings, this study is not without limitations. First, DGEA was performed on a large volume of data, which likely increased noise. Second, the TWAs used as controls were low in number, as most samples had a stage IIb diagnosis, and SMOTE was necessary for the ML algorithms to reduce bias. Third, there was a lack of normal tissue control samples, which makes it difficult to provide more accurate assessments of the nature of these genes. Last, there might have been biases in the TCGA data from incorrect measurements or sequencing, potentially skewing the results of the RNA-seq data. All of these limitations indicate that the findings of this study should be further validated and interpreted with caution. Nevertheless, the existence of prior evidence for some of the identified lncRNAs supports the rigor of the methods used in this study. This further adds to the implications of the findings discussed here and the importance of future research to address these novel lncRNAs as potential markers of metastatic progression in PC.

## 5. Conclusion

The DGEA utilized in this study identified a set of 38 novel lncRNAs that could contribute to metastatic progression in PC. GSEA was unable to provide sufficient information to further describe the functions of these lncRNAs, due to the scarcity of available data relevant to the genes identified.
Since different ML algorithms were able to predict metastatic PC with acceptable accuracy and the RFC model predicted metastatic PC with 76% accuracy based on the 38-lncRNA DEG panel, it is likely that these genes participate in the metastatic progression of PC, warranting further investigation. The significance of this study lies in the identified novel lncRNA gene set. Metastatic PC lacks sufficient studies regarding the involvement of lncRNAs in tumor proliferation and progression, especially studies that use ML algorithms with proven accuracy. This is the first study of its kind to use this methodology to reveal the discussed gene set in PC to distinguish between early-stage and advanced PC. Regardless, more studies are needed to identify the role these genes play in PC metastasis and other cancers. Based on the findings of this study, I suggest further research into the role of these genes. *In vitro* and *in vivo* experiments must be conducted to further elucidate the functions these genes may take part in. The accuracy of the ML algorithms in determining PC metastatic potential suggests that these genes could be added to diagnostic methods if their clinical relevance is confirmed by future studies.

## 6. Data availability statement

All raw data acquired from TCGA, in addition to all analyses performed on said data and the source code used to perform the analyses mentioned in the methodology, are available at [https://github.com/hasanalsharoh/PanC](https://github.com/hasanalsharoh/PanC).

## Supporting information

Supplementary Material 1 [[supplements/297724_file03.xlsx]](pending:yes)

Supplementary Material 2 [[supplements/297724_file04.xlsx]](pending:yes)

Supplementary Material 3 [[supplements/297724_file05.xlsx]](pending:yes)

Supplementary Material 4 [[supplements/297724_file06.xlsx]](pending:yes)

* Received November 1, 2023.
* Revision received November 1, 2023.
* Accepted November 3, 2023.
* © 2023, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/)

## 7. References

1. Hu, J.X., Zhao, C.F., Chen, W.B., Liu, Q.C., Li, Q.W., Lin, Y.Y., and Gao, F. (2021). Pancreatic cancer: A review of epidemiology, trend, and risk factors. World J Gastroenterol 27, 4298–4321. doi:10.3748/wjg.v27.i27.4298.
2. Partyka, O., Pajewska, M., Kwaśniewska, D., Czerw, A., Deptała, A., Budzik, M., Cipora, E., Gąska, I., Gazdowicz, L., Mielnik, A., et al. (2023). Overview of Pancreatic Cancer Epidemiology in Europe and Recommendations for Screening in High-Risk Populations. Cancers 15. doi:10.3390/cancers15143634.
3. Andersson, R., Haglund, C., Seppänen, H., and Ansari, D. (2022). Pancreatic cancer – the past, the present, and the future. Scandinavian Journal of Gastroenterology 57, 1169–1177. doi:10.1080/00365521.2022.2067786.
4. Bin, W., Yuan, C., Qie, Y., and Dang, S. (2023). Long non-coding RNAs and pancreatic cancer: A multifaceted view. Biomedicine & Pharmacotherapy 167, 115601. doi:10.1016/j.biopha.2023.115601.
5. Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte, M., Zuk, O., Carey, B.W., Cassady, J.P., et al. (2009). Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223–227. doi:10.1038/nature07672.
6. Kore, H., Datta, K.K., Nagaraj, S.H., and Gowda, H. (2023). Protein-coding potential of non-canonical open reading frames in human transcriptome. Biochem Biophys Res Commun 684, 149040. doi:10.1016/j.bbrc.2023.09.068.
7. Aswathy, R., and Sumathi, S. (2023). Defining new biomarkers for overcoming therapeutical resistance in cervical cancer using lncRNA. Mol Biol Rep. doi:10.1007/s11033-023-08864-w.
8. Zhang, N., Yu, X., Sun, H., Zhao, Y., Wu, J., and Liu, G. (2023). A prognostic and immunotherapy effectiveness model for pancreatic adenocarcinoma based on cuproptosis-related lncRNAs signature. Medicine (Baltimore) 102, e35167. doi:10.1097/md.0000000000035167.
9. Wang, T., Ji, M., Liu, W., and Sun, J. (2023). Development and validation of a novel DNA damage repair-related long non-coding RNA signature in predicting prognosis, immunity, and drug sensitivity in uterine corpus endometrial carcinoma. Comput Struct Biotechnol J 21, 4944–4959. doi:10.1016/j.csbj.2023.10.025.
10. Zhao, Y., Song, Y., Zhang, Y., Ji, M., Hou, P., and Sui, F. (2023). Screening protective miRNAs and constructing novel lncRNAs/miRNAs/mRNAs networks and prognostic models for triple-negative breast cancer. Mol Cell Probes 72, 101940. doi:10.1016/j.mcp.2023.101940.
11. Collins, G.S., Whittle, R., Bullock, G.S., Logullo, P., Dhiman, P., de Beyer, J.A., Riley, R.D., and Schlussel, M.M. (2023). Open science practices need substantial improvement in prognostic model studies in oncology using machine learning. J Clin Epidemiol. doi:10.1016/j.jclinepi.2023.10.015.
12. Rasti, P., Wolf, C., Dorez, H., Sablong, R., Moussata, D., Samiei, S., and Rousseau, D. (2019). Machine Learning-Based Classification of the Health State of Mice Colon in Cancer Study from Confocal Laser Endomicroscopy. Sci Rep 9, 20010. doi:10.1038/s41598-019-56583-9.
13. Sharma, A.N., Shwe, S., and Mesinkovska, N.A. (2022). Current state of machine learning for non-melanoma skin cancer. Arch Dermatol Res 314, 325–327. doi:10.1007/s00403-021-02236-9.
14. Anaconda (2016). (Anaconda Software Distribution).
15. Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., et al. (2016). Jupyter Notebooks – a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas, (IOS Press), pp. 87–90. doi:10.3233/978-1-61499-649-1-87.
16. glob: Unix style pathname pattern expansion. (2023).
17. The pandas development team (2023). pandas-dev/pandas: Pandas.
18. Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., et al. (2020). Array programming with NumPy. Nature 585, 357–362. doi:10.1038/s41586-020-2649-2.
19. Virtanen, P., Gommers, R., Oliphant, T.E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17, 261–272. doi:10.1038/s41592-019-0686-2.
20. Boris, M., Maria, T., Vincent, C., and Mathieu, A. (2022). PyDESeq2: a python package for bulk RNA-seq differential expression analysis. bioRxiv, 2022.2012.2014.520412. doi:10.1101/2022.12.14.520412.
21. Hunter, J.D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9, 90–95. doi:10.1109/MCSE.2007.55.
22. Waskom, M.L. (2021). seaborn: statistical data visualization. Journal of Open Source Software 6, 3021. doi:10.21105/joss.03021.
23. Sanborn, M. (2023). sanbomics.
24. Seabold, S., and Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with python. In 61. (Austin, TX), pp. 10–25080.
25. Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology 15, 1–21.
26. Zhao, Y., Li, M.-C., Konaté, M.M., Chen, L., Das, B., Karlovich, C., Williams, P.M., Evrard, Y.A., Doroshow, J.H., and McShane, L.M. (2021). TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository. Journal of Translational Medicine 19, 269. doi:10.1186/s12967-021-02936-w.
27. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., and Mesirov, J.P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102, 15545–15550. doi:10.1073/pnas.0506580102.
28. Fang, Z., Liu, X., and Peltz, G. (2023). GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics 39, btac757. doi:10.1093/bioinformatics/btac757.
29. Chou, C.H., Shrestha, S., Yang, C.D., Chang, N.W., Lin, Y.L., Liao, K.W., Huang, W.C., Sun, T.H., Tu, S.J., Lee, W.H., et al. (2018). miRTarBase update 2018: a resource for experimentally validated microRNA-target interactions. Nucleic Acids Res 46, D296–D302. doi:10.1093/nar/gkx1067.
30. The Gene Ontology Consortium, Aleksander, S.A., Balhoff, J., Carbon, S., Cherry, J.M., Drabkin, H.J., Ebert, D., Feuermann, M., Gaudet, P., Harris, N.L., et al. (2023). The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031. doi:10.1093/genetics/iyad031.
31. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29. doi:10.1038/75556.
32. Carbon, S., Ireland, A., Mungall, C.J., Shu, S., Marshall, B., Lewis, S., the AmiGO Hub, and the Web Presence Working Group (2009). AmiGO: online access to ontology and annotation data. Bioinformatics 25, 288–289. doi:10.1093/bioinformatics/btn615.
33. Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W.P. (2002). SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Int. Res. 16, 321–357.
34. Garg, S., and Raghavan, B. (2023). Comparison of machine learning algorithms for the classification of spinal cord tumor. Irish Journal of Medical Science (1971 -). doi:10.1007/s11845-023-03487-3.
35. Bruno, V., Betti, M., D’Ambrosio, L., Massacci, A., Chiofalo, B., Pietropolli, A., Piaggio, G., Ciliberto, G., Nisticò, P., Pallocca, M., et al. (2023). Machine learning endometrial cancer risk prediction model: integrating guidelines of European Society for Medical Oncology with the tumor immune framework. Int J Gynecol Cancer. doi:10.1136/ijgc-2023-004671.
36. Gutman, R., Aronson, D., Caspi, O., and Shalit, U. (2023). What drives performance in machine learning models for predicting heart failure outcome? Eur Heart J Digit Health 4, 175–187. doi:10.1093/ehjdh/ztac054.
37. Wall, N.R., Fuller, R.N., Morcos, A., and De Leon, M. (2023). Pancreatic Cancer Health Disparity: Pharmacologic Anthropology. Cancers (Basel) 15. doi:10.3390/cancers15205070.
38. de Jesus, V.H.F., Mathias-Machado, M.C., de Farias, J.P.F., Aruquipa, M.P.S., Jácome, A.A., and Peixoto, R.D. (2023). Targeting KRAS in Pancreatic Ductal Adenocarcinoma: The Long Road to Cure. Cancers (Basel) 15. doi:10.3390/cancers15205015.
39. Sun, Y., Yao, L., Man, C., Gao, Z., He, R., and Fan, Y. (2023). Development and validation of cuproptosis-related lncRNAs associated with pancreatic cancer immune microenvironment based on single-cell. Front Immunol 14, 1220760. doi:10.3389/fimmu.2023.1220760.
40. Wang, H., Ding, Y., He, Y., Yu, Z., Zhou, Y., Gong, A., and Xu, M. (2023). LncRNA UCA1 promotes pancreatic cancer cell migration by regulating mitochondrial dynamics via the MAPK pathway. Arch Biochem Biophys 748, 109783. doi:10.1016/j.abb.2023.109783.
41. Zhang, R., Wang, X., Ying, X., Huang, Y., Zhai, S., Shi, M., Tang, X., Liu, J., Shi, Y., Li, F., et al. (2023). Hypoxia-induced long non-coding RNA LINC00460 promotes p53 mediated proliferation and metastasis of pancreatic cancer by regulating the miR-4689/UBE2V1 axis and sequestering USP10. Int J Med Sci 20, 1339–1357. doi:10.7150/ijms.87833.
42. Tsai, C.W., Chang, W.S., Yueh, T.C., Wang, Y.C., Chin, Y.T., Yang, M.D., Hung, Y.C., Mong, M.C., Yang, Y.C., Gu, J., and Bau, D.T. (2023). The Significant Impacts of Interleukin-8 Genotypes on the Risk of Colorectal Cancer in Taiwan. Cancers (Basel) 15. doi:10.3390/cancers15204921.
43. Earnest, A., Tesema, G.A., and Stirling, R.G. (2023). Machine Learning Techniques to Predict Timeliness of Care among Lung Cancer Patients. Healthcare (Basel) 11. doi:10.3390/healthcare11202756.
44. Padwal, M.K., Basu, S., and Basu, B. (2023). Application of Machine Learning in Predicting Hepatic Metastasis or Primary Site in Gastroenteropancreatic Neuroendocrine Tumors. Current Oncology 30, 9244–9261. doi:10.3390/curroncol30100668.
45. Chiang, C.P., Wu, C.W., Lee, S.P., Chung, C.C., Wang, C.W., Lee, S.L., Nieh, S., and Yin, S.J. (2009). Expression pattern, ethanol-metabolizing activities, and cellular localization of alcohol and aldehyde dehydrogenases in human pancreas: implications for pathogenesis of alcohol-induced pancreatic injury. Alcohol Clin Exp Res 33, 1059–1068. doi:10.1111/j.1530-0277.2009.00927.x.
46. Kanellopoulos, P., Nock, B.A., Krenning, E.P., and Maina, T. (2020). Optimizing the Profile of [(99m)Tc]Tc-NT(7-13) Tracers in Pancreatic Cancer Models by Means of Protease Inhibitors. Int J Mol Sci 21. doi:10.3390/ijms21217926.
47. de Koning, P.J., Bovenschen, N., Leusink, F.K., Broekhuizen, R., Quadir, R., van Gemert, J.T., Hordijk, G.J., Chang, W.S., van der Tweel, I., Tilanus, M.G., and Kummer, J.A. (2009). Downregulation of SERPINB13 expression in head and neck squamous cell carcinomas associates with poor clinical outcome. Int J Cancer 125, 1542–1550. doi:10.1002/ijc.24507.
48. Xu, Y., Yuan, C., Peng, J., Zhou, L., Lin, Y., Wang, Y., Zhang, J., Ma, J., Yin, W., and Lu, J. (2022). LncRNA MIR205HG expression predicts efficacy of neoadjuvant chemotherapy for patients with locally advanced breast cancer. Genes Dis 9, 837–840. doi:10.1016/j.gendis.2021.10.001.
49. He, W.A., Berardi, E., Cardillo, V.M., Acharyya, S., Aulino, P., Thomas-Ahner, J., Wang, J., Bloomston, M., Muscarella, P., Nau, P., et al. (2013). NF-κB-mediated Pax7 dysregulation in the muscle microenvironment promotes cancer cachexia. J Clin Invest 123, 4821–4835. doi:10.1172/jci68523.
50. Zhao, X., Lu, M., Liu, Z., Zhang, M., Yuan, H., Dan, Z., Wang, D., Ma, B., Yang, Y., Yang, F., et al. (2022). Comprehensive analysis of alfa defensin expression and prognosis in human colorectal cancer. Front Oncol 12, 974654. doi:10.3389/fonc.2022.974654.
51. Sun, N.K., Huang, S.L., Lu, H.P., Chang, T.C., and Chao, C.C. (2015). Integrative transcriptomics-based identification of cryptic drivers of taxol-resistance genes in ovarian carcinoma cells: Analysis of the androgen receptor. Oncotarget 6, 27065–27082. doi:10.18632/oncotarget.4824.
52. Duan, W., Kong, X., Li, J., Li, P., Zhao, Y., Liu, T., Binang, H.B., Wang, Y., Du, L., and Wang, C. (2020). LncRNA AC010789.1 Promotes Colorectal Cancer Progression by Targeting MicroRNA-432-3p/ZEB1 Axis and the Wnt/β-Catenin Signaling Pathway. Front Cell Dev Biol 8, 565355. doi:10.3389/fcell.2020.565355.
53. Liu, Y., Xu, G., and Li, L. (2021). LncRNA GATA3-AS1-miR-30b-5p-Tex10 axis modulates tumorigenesis in pancreatic cancer. Oncol Rep 45. doi:10.3892/or.2021.8010.
54. Chen, K., Wang, Q., Liu, X., Wang, F., Yang, Y., and Tian, X. (2022). Hypoxic pancreatic cancer derived exosomal miR-30b-5p promotes tumor angiogenesis by inhibiting GJA1 expression. Int J Biol Sci 18, 1220–1237. doi:10.7150/ijbs.67675.
55. Liu, S., Luan, J., and Ding, Y. (2018). miR-144-3p Targets FosB Proto-oncogene, AP-1 Transcription Factor Subunit (FOSB) to Suppress Proliferation, Migration, and Invasion of PANC-1 Pancreatic Cancer Cells. Oncol Res 26, 683–690. doi:10.3727/096504017x14982585511252.
56. Yang, J., Cong, X., Ren, M., Sun, H., Liu, T., Chen, G., Wang, Q., Li, Z., Yu, S., and Yang, Q. (2019). Circular RNA hsa\_circRNA\_0007334 is Predicted to Promote MMP7 and COL1A1 Expression by Functioning as a miRNA Sponge in Pancreatic Ductal Adenocarcinoma. J Oncol 2019, 7630894. doi:10.1155/2019/7630894.
57. Janpipatkul, K., Trachu, N., Watcharenwong, P., Panvongsa, W., Worakitchanon, W., Metheetrairut, C., Oranratnachai, S., Reungwetwattana, T., and Chairoungdua, A. (2021). Exosomal microRNAs as potential biomarkers for osimertinib resistance of non-small cell lung cancer patients. Cancer Biomark 31, 281–294. doi:10.3233/cbm-203075.
58. Daniel, R., Wu, Q., Williams, V., Clark, G., Guruli, G., and Zehner, Z. (2017). A Panel of MicroRNAs as Diagnostic Biomarkers for the Identification of Prostate Cancer. Int J Mol Sci 18. doi:10.3390/ijms18061281.
59. Wang, K., Zhong, W., Long, Z., Guo, Y., Zhong, C., Yang, T., Wang, S., Lai, H., Lu, J., Zheng, P., and Mao, X. (2021). 5-Methylcytosine RNA Methyltransferases-Related Long Non-coding RNA to Develop and Validate Biochemical Recurrence Signature in Prostate Cancer. Front Mol Biosci 8, 775304. doi:10.3389/fmolb.2021.775304.
60. Li, R., Gao, X., Sun, H., Sun, L., and Hu, X. (2022). Expression characteristics of long non-coding RNA in colon adenocarcinoma and its potential value for judging the survival and prognosis of patients: bioinformatics analysis based on The Cancer Genome Atlas database. J Gastrointest Oncol 13, 1178–1187. doi:10.21037/jgo-22-384.
61. Zeng, X., Wang, Y., Liu, B., Rao, X., Cao, C., Peng, F., Zhi, W., Wu, P., Peng, T., Wei, Y., et al. (2023). Multi-omics data reveals novel impacts of human papillomavirus integration on the epigenomic and transcriptomic signatures of cervical tumorigenesis. J Med Virol 95, e28789. doi:10.1002/jmv.28789.
62. Wang, W.F., Zhong, H.J., Cheng, S., Fu, D., Zhao, Y., Cai, H.M., Xiong, J., and Zhao, W.L. (2023). A nuclear NKRF interacting long noncoding RNA controls EBV eradication and suppresses tumor progression in natural killer/T-cell lymphoma. Biochim Biophys Acta Mol Basis Dis 1869, 166722. doi:10.1016/j.bbadis.2023.166722.
63. Bi, X.-a., Li, L., Xu, R., and Xing, Z. (2021). Pathogenic Factors Identification of Brain Imaging and Gene in Late Mild Cognitive Impairment. Interdisciplinary Sciences: Computational Life Sciences 13, 511520. doi:10.1007/s12539-021-00449-0.
64. Gusev, F.E., Reshetov, D.A., Mitchell, A.C., Andreeva, T.V., Dincer, A., Grigorenko, A.P., Fedonin, G., Halene, T., Aliseychik, M., Filippova, E., et al. (2019). Chromatin profiling of cortical neurons identifies individual epigenetic signatures in schizophrenia. Transl Psychiatry 9, 256. doi:10.1038/s41398-019-0596-1.
65. Yevshin, I., Sharipov, R., Kolmykov, S., Kondrakhin, Y., and Kolpakov, F. (2019). GTRD: a database on gene transcription regulation-2019 update. Nucleic Acids Res 47, D100–D105. doi:10.1093/nar/gky1128.
66. The UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research 51, D523–D531. doi:10.1093/nar/gkac1052.