Machine learning predicts metastatic progression using novel differentially expressed lncRNAs as potential markers in pancreatic cancer ======================================================================================================================================= * Hasan Alsharoh ## **Abstract** Pancreatic cancer (PC) is associated with high mortality overall. Recent literature has focused on investigating long noncoding RNAs (lncRNAs) in several cancers, but studies on their functions in PC are lacking. The purpose of this study was to identify novel lncRNAs and utilize machine learning to techniques to predict metastatic cases of PC using the identified lncRNAs. To identify significantly altered expression of lncRNA in PC, data was collected from The Cancer Genome Atlas (TCGA) and RNA-sequencing (RNA-seq) transcriptomic profiles of pancreatic carcinomas were extracted for differential gene expression analysis. To assess the contribution of these lncRNAs to metastatic progression, different ML algorithms were used, including logistic regression (LR), support vector machine (SVM), random forest classifier (RFC) and eXtreme Gradient Boosting Classifier (XGBC). To improve the predictive accuracy of these models, hyperparameter tuning was performed, in addition to reducing bias through the synthetic minority oversampling technique. Out of 60,660 gene transcripts shared between 151 PC patients, 38 lncRNAs that were significantly differentially expressed were identified. To further investigate the functions of the novel lncRNAs, gene set enrichment analysis (GSEA) was performed on the population lncRNA panel. GSEA results revealed enrichment of several terms implicated in proliferation. Moreover, using the 4 ML algorithms to predict metastatic progression returned 76% accuracy for both SVM and RFC, explicitly based on the novel lncRNA panel. To the best of my knowledge, this is the first study of its kind to identify this lncRNA panel to differentiate between non-metastatic PC and metastatic PC, with many novel lncRNAs previously unmapped to PC. The ML accuracy score reveals important involvement of the detected RNAs. Based on these findings, I suggest further investigations of this lncRNA panel *in vitro* and *in vivo*, as they could be targeted for improved outcomes in PC patients, as well as assist in the diagnosis of metastatic progression based on RNA-seq data of primary pancreatic tumors. Keywords * Pancreatic cancer * lncRNAs * machine learning * metastatic progression ## 1. Introduction Pancreatic cancer (PC) is one of the deadliest cancers, with an overall five-year survival between 7.2 and 10% according to the literature 1,2. Evidence suggests that PC is often diagnosed until the late stages of tumorigenesis, likely contributing to its high mortality rate 3. Recent literature has provided increasing evidence regarding the involvement of long noncoding RNAs (lncRNAs) in the development, invasiveness, angiogenic potential, chemotherapeutic resistance and metastatic capacity of PC 4. LncRNAs are RNA molecules characterized by having an arbitrary lower cutoff of 200 nucleotides that have been shown not to code for proteins post-transcriptionally 4,5. LncRNAs have been shown to play complex roles in biological processes in various tissues, with possible implications in DNA repair, cellular proliferation, and human diseases, which made them a common target for recent literature to investigate in cancer 6. lncRNAs have further been used as biomarkers for overcoming chemoresistance, as well as for the diagnosis of several cancers, including PC 7–10. Emerging research has been able to provide evidence regarding the use of lncRNAs for improved diagnostic accuracy, prognosis prediction, and treatment adjustment using various methods, including machine learning (ML) techniques 8–10. Literature regarding the utilization of ML algorithms has been rapidly rising, with literature urging more rapid use of such algorithms in oncology to increase diagnostic accuracy or to further improve on the available algorithms 11–13. The aim of this study was to investigate potential lncRNAs involved in the metastatic progression of PC based on RNA-sequencing (RNA-seq) data. To achieve this objective, publicly available data from the cancer genome atlas (TCGA) for 172 patients was collected, and the data was filtered according to predefined inclusion and exclusion criteria, which resulted in 151 PC records. PC records were further categorized according to their TNM staging, and tumor data were separated into tumors with metastatic activity (TMAs) and tumors without metastatic activity (TWAs). Using bioinformatics analytic techniques, I identified 125 differentially expressed transcripts (DETs) among 60,660 transcripts involved in this study, many of which were novel. Further, the functions of this global transcript panel (including protein-coding transcripts, and lncRNAs) was assesed using a multiparametric approach. Finally, lncRNA transcriptomic data was extracted from the RNA-seq dataset from the PC population, further characterizing 38 lncRNAs that were significantly differentially expressed, with most falling into the 125 DETs. To evaluate the lncRNA involvement, 4 ML algorithms were used to predict and distinguish between TMAs and TWAs. These algorithms included multivariate logistic regression (LR), support vector machine (SVM), random forest classifier (RFC), and eXtreme Gradient Boosting Classifier (XGBC). Several techniques were used to further reduce the bias within the included sample as described in the methodology. Training and evaluation of the ML algorithms was performed by separating the dataset from the 38 DETs into a training set and a testing set to eventually evaluate the performance of each of the models. Out of all the ML algorithms, SVM and RFC were able to predict TMAs and TWAs with 76% accuracy using the 38 lncRNA data, suggesting important implications for the specified set of lncRNAs in PC. To the best of my knowledge, this is the first study to identify the involvement of this specific lncRNA panel in PC, with many novel lncRNAs lacking any studies performed on which. The results of this research could potentially have important clinical implications, as the novelty of the identified lncRNAs requires further comprehensive validation and *in vitro* and *in vivo* investigations. The accuracy shown by the ML model suggests that these novel lncRNAs could be used as biomarkers and further targeted for improved diagnosis and outcome in PC patients. ## 2. Materials and methods ### 2.1. Data acquisition TCGA database was used for data collection and is available at [https://www.cancer.gov/tcga](https://www.cancer.gov/tcga). Exploration of TCGA-PAAD project data to acquire pancreatic RNA-seq data was performed on 25/10/2023. File filters applied included a) Data Category: transcriptome profiling; b) Data Type: Gene Expression Quantification; c) Experimental strategy: RNA-Seq; d) Access: open. The case filters applied included the following: a) primary site: pancreas; b) project: TCGA-PAAD; and c) disease type: ductal and lobular neoplasms, adenomas and adenocarcinomas. The inclusion criteria were that for each RNA-seq dataset to be of similar structure, for the predefined PC tumors mentioned in the filters, regardless of age and gender. Primary tumors, regardless of metastatic stage, were also included. Exclusion criteria included defects in dataset structure, RNA-seq for tumor adjacent tissues, or those that had undergone prior therapy to a potential previous malignancy. Records with annotations specifying that tumor data were incorrectly labeled in terms of whether the tumor was neoplastic, were also excluded. Further categorization was performed for the acquired data using Excel sheets. For TNM subgroup analysis, tumors with staging data were categorized into tumors with metastatic activity, which included those classified as M1, MX/M0 and N1 or above, and tumors without metastatic activity, which included those classified as M0N0. Acquired data were also filtered to include only lncRNA gene expression quantification. This subgrouping was performed prior to DGEA to assess DETs between TWAs and TMAs. ### 2.2. Data analysis Bioinformatics analysis was conducted on the data following matching the subjects to the study’s inclusion and exclusion criteria. Python v3.11 (available at [https://www.python.org/](https://www.python.org/)) was used in an Anaconda jupyter lab environment 14,15. To restructure the dataset up for the study population RNA-seq datasets and to import the data into Python, the glob module was used 16. Data structure manipulation and organization was performed using pandas library v1.5.3 17. Libraries such as numpy and scipy were also utilized for data processing 18,19. Differential gene expression analysis (DGEA) was performed using PyDESeq2, an R package implemented in Python that has been suggested to be reliable and comparable to the R package20. The DETs were matched to gene symbols and further visualized using the matplotlib21, seaborn22, and sanbomics23 packages. PyDESeq2 calculates the significance of transcripts using the Wald test, performs count normalization using the trimmed mean of M values (TMM), similar to DESeq2, and relies on the statsmodels library 24,25. Using count normalization has been shown to have higher accuracy than TPM (transcripts per million) and FPKM (fragments per kilobase of transcript per million fragments mapped) 26. A more comprehensive description of the package is available elsewhere 27. Significant differentiation after adjustment of p values was considered at p<0.05 and an absolute log2-fold change (log2FC) of >0.5. A heatmap of the DEGs was made through the matplotlib 21 package as well. Pearson’s correlation coefficient was calculated and mapped for all gene transcript data. ### 2.3. Gene set and ontology enrichment analysis Gene set enrichment analysis (GSEA) is a method of interpreting gene-wide expression profiles28. GSEA was performed using the GSEApy v1.0.6 package, a Rust implementation of GSEA in python, used for performing computation of RNA-seq count data to evaluate predefined gene sets in association with different phenotypes. Gene expression data was ranked using the prerank function available in the package. The accuracy of this package has been previously proven, and the method to use it is described extensively elsewhere29. Enrichment was performed for several gene collections from MSigDB available at ([https://www.gsea-msigdb.org/](https://www.gsea-msigdb.org/)) and miRTarBase 201730. Gene sets and collections that were evaluated for enrichment were c2.cp.kegg.v2023.1.Hs.symbols, c3.mir.v2023.1.Hs.symbols, c3.tft.v2023.1.Hs.symbols, c4.cgn.v2023.1.Hs.symbols, c5.go.bp.v2023.1.Hs.symbols, c5.go.cc.v2023.1.Hs.symbols, c5.go.mf.v2023.1.Hs.symbols, c5.hpo.v2023.1.Hs.symbols, c6.all.v2023.1.Hs.symbols, h.all.v2023.1.Hs.symbols, and miRTarBase_2017. Gene Ontology (GO) is a detailed resource with annotations of gene and gene product functions 31,32. It provides the potential to describe gene functions by assigning them to specific terms in which the transcripts’ genes are linked, detailing their relationships with each other. GO term enrichment was performed through GSEApy, and the results were extracted through tools available in said package. The false discovery rate (FDR) was considered significant when FDR<0.05. Visualization of GSEA results was performed using tools from GSEApy. Data collected from GSEA results included terms, FDR, enrichment and negative enrichment scores, as well as matched genes. The minimum matching size for gene sets when performing GSEA for the global RNA-seq panel was set to 150. However, for the lncRNA panel, the minimum matching size was set to 3, as there were few enriched gene sets. ### 2.4. ML models Multivariate LR, SVM, RFC, and XGBC were employed to predict metastatic risk for the population based on the lncRNA count data from TCGA. DETs were extracted from DGEA for use as sole predictors of metastatic progression in the study population. Analysis of the models’ accuracy was performed using packages from the scipy, scikit-learn, and matplotlib libraries. To train the ML algorithms, data were categorized into a training set (70% of the data) and a testing set (30%). A random state number was set for all the implemented ML models to dictate a specific seed of randomness during the analysis to maintain reproducibility. For binary classification, TNM stage of IIa or below was designated “0” and considered the TWM for the ML algorithms, while TNM stage IIb or above was designated “1” and considered the TMA. The testing sets were hidden from the ML algorithms to evaluate the predictive capacity performance following model training. Furthermore, hyperparameter tuning was performed to improve the predictive accuracy of the model. This was done through the GridSearchCV and BayesianSearchCV modules. Fivefold cross-validation was set as a parameter, and data regularization was done through L2 method, all of which have been shown to reduce bias and lower classification errors, also reducing sensitivity to outliers33. The inverse of the regularization strength (or penalty values) was set according to the optimal values found by the search modules specified above. To identify the best parameters, values were also tested over 50 iterations. Moreover, the synthetic minority oversampling technique (SMOTE) was performed to artificially increase TWM population numbers to reduce bias, which has proven to be a powerful tool in improving ML accuracy and addressing imbalanced samples 34. These methods of standardization were performed for all ML algorithms used. ML algorithms used were also provided by the scikit-learn and XGBoost libraries. All of the algorithms consist of supervised machine learning algorithms, and are commonly used for classifications of tumors35,36. Further, L2 regularization has been considered to improve the accuracy of the ML algorithms 37. To assess the performance of the ML algorithms, several evaluations were performed for each model. Accuracy is a very commonly used ML evaluation metric, here representing the ratio between the correctly determined TMAs and TWAs. Recall represents the sensitivity, describing the rate of correctly classified TMAs. Precision describes the ratio between correctly identified TMAs and all samples designated “TMA”. The F1 score metric represents the mean of precision and recall. A thorough description of the evaluation metrics used here is beyond the scope of this article, and has been explained comprehensively by Hicks et, al. elsewhere38. Area under the curve (AUC) was also used for evaluating the measures, and ML models having an AUC between 0.6 and 0.75 are considered to show possibly helpful discrimination (classification capacity), while above 0.75 indicate a clearly helpful classification capacity, as described elsewhere39. ## 3. Results ### 3.1. Primary characteristics of the study population Of the 179 retrieved records, 23 were excluded for the following annotations: a) “This case is a neuroendocrine tumor and should not have been included in the PAAD study” (n = 8); b) “Per the PAAD EPC, this tumor is a normal pancreas with atrophy” (n = 5); c) “Per the PAAD EPC, this tumor is an atrophic pancreas” (n = 3); d) “Per the PAAD EPC, this tumor is a noninvasive IPMN” (n = 1); e) “Per the PAAD EPC, this tumor is an acinar cell carcinoma” (n = 1); f) “Per the PAAD EPC, this tumor is a normal ampula of Vater” (n = 1); g) “The PAAD EPC states that this case likely did not arise in the pancreas (ampullary)” (n = 1); h) “Systemic treatment given to the prior/other malignancy” (n = 1); i) “Per the PAAD EPC, this tumor is an atrophic pancreas with a single focus of low-grade PanIN” (n = 2); “Samples identified in the sample sheet with a sample type of “Solid Tissue Normal” (from normal tissue adjacent to malignancy)” ( According to the flow diagram found in Figure 1. A total of 151 patient records were included. **Table 1** summarizes the characteristics of the cohort. Notably, 115 records were classified as TMAs, while 36 were classified as TWAs. Of the TMAs, 116 were diagnosed as TNM stage IIb, and 8 were diagnosed as stage III and IV. For the TWAs, 26 were at TNM stage IIa. ![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F1.medium.gif) [Figure 1.](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F1) Figure 1. Flow diagram of the study. Created with Lucidchart, [www.lucidchart.com](http://www.lucidchart.com). TCGA: The Cancer Genome Atlas; PAAD: Pancreatic adenocarcinoma; TMA: Tumor with metastatic activity; TWA: Tumor without metastatic activity; DGEA: Differential gene expression analysis; GSEA: Gene set enrichment analysis; ML: Machine learning. View this table: [Table 1.](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/T1) Table 1. Population primary characteristics The age range of the total patient sample was between 35 and 88 years old (mean = 64.66 ± 10.91). Ninety-four were males, and 78 were females. When reported, 143 had infiltrating duct carcinoma, and 16 had adenocarcinoma as the primary diagnosis. Eight had neuroendocrine tumors but were excluded. Seventeen pancreatic tumors had no specified location, 125 were pancreatic head lesions, 15 were pancreatic body lesions, and 13 were pancreatic tail lesions. The RNA-seq data included 60,660 transcript expression profiles for each of the included patient and control samples. Transcriptomic profiling was performed for the same genes in all patient samples. Of the available transcripts, 16,901 were lncRNAs. After removing lncRNAs with 0 values among all patients, 15,879 lncRNAs remained. All details regarding the included samples are available in **Supplementary Material 1.** ### 3.2. DGEA and GSEA of all transcripts A total of 60,660 gene transcripts were filtered following PyDESeq2 analysis, and unavailable values were dropped, resulting in 47,528 transcripts. DGEA revealed 125 differentially expressed transcripts, as shown in **Table 2**, and the top DETs are shown in Figure 2. Notably, ADH7, SERPINB13, MIR205HG, NTS, and LINC01300 were the most downregulated DETs, with log2FC values of -3.42295, -3.4189, -3.12513, – 3.02808, and -2.72096, respectively. The most upregulated DETs were PAX7, AC010789.1, TMPRSS15, DEFA6, and DEFA5 and had log2FC values of 3.149596, 3.506053, 3.538356, 3.594891, and 4.800701, respectively. ![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F2.medium.gif) [Figure 2.](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F2) Figure 2. Differentially expressed transcripts in PC. Absolute log2FC>0.5 and adjusted p value<0.05 were considered as the significance thresholds. View this table: [Table 2.](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/T2) Table 2. Differentially expressed protein-coding and long non-coding transcripts found in the global RNA-seq population GSEA was subsequently performed, with libraries investigated available in **Supplementary Materials 2**. There were many gene sets enriched with the transcripts, as many transcripts were included in the study’s RNA-seq panel. Notably, several GO terms were enriched, as well as some terms from miRTarBase 2017, as shown in Figure 3 **A and B**. FDR values were significant for the enriched terms (FDR<0.01). ![Figure 3](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F3.medium.gif) [Figure 3](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F3) Figure 3 **A.** GOBP (GO biological process) term enrichment. Upregulated genes had a lower rank, and downregulated genes had a higher rank. The enrichment score correlates with the number of genes from the RNA-seq panel enriching the gene set with significantly differentiated expression. More transcripts enriching this term are downregulated in this study due to the enrichment score reaching – 0.5 since these genes have a higher density of higher ranked genes. **B.** miRTarBase_2017 term enrichment. Upregulated genes had a lower rank, and downregulated genes had a higher rank. The enrichment score correlates with the number of genes from the gene panel enriching the gene set with significantly differentiated expression. Here, the gene set was more enriched with the upregulated genes from the RNA-seq panel. ### 3.3. lncRNA DGEA, correlations, and GSEA Further subgroup analysis was performed for lncRNAs in PC, which returned 16,901 expression values, for which PyDeseq2 was also used to analyze DETs. Dropping the 0-sum, duplicate, and unavailable values retrieved 15,568 lncRNAs. Of the lncRNA panel, 38 lncRNAs were significantly differentially expressed (shown in Figure 4). ![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F4.medium.gif) [Figure 4.](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F4) Figure 4. Differentially expressed lncRNA. Absolute log2FC>0.5 and adjusted p value<0.05 were considered as the significance thresholds. Interestingly, the most downregulated lncRNAs were LINC01300, DUSP5-DT, AL513128.3, MIR205HG, and AC132192.2, with Log2FC values of -2.55682, -1.55378, -0.70877, -2.68894, and -0.68868, respectively. The most upregulated DET lncRNAs were AC010789.1, LINC00486, ENSG00000261409 (referred to as RF00019), LINC01115, and AC133530.1, with log2FC values of 2.154221, 1.214608, 3.647081, 1.705921, and 2.388161, respectively. Results of DGEA on the lncRNAs are shown in **Table 3**. View this table: [Table 3.](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/T3) Table 3. DGEA of lncRNAs in PC. Moreover, since the number of DETs was feasible, to further visualize the relationship between these lncRNAs, each transcript’s natural logarithm of 1 plus (normalized count) data was correlated to their respective PC cases, and Pearson’s correlation coefficients for all the lncRNAs were extracted. The results are visualized in Figure 5. A table of all Pearson’s correlation coefficients can be found in **Supplementary Material 3**. ![Figure 5](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F5.medium.gif) [Figure 5](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F5) Figure 5 **A.** Hierarchical clustering heatmap of identified lncRNAs and their corresponding expression per PC case. The color gradient in the legend refers to the natural logarithm of 1 plus (normalized count) values. **B.** Hierarchical clustering heatmap of lncRNAs amongst the sample population. The color gradient in the legend refers to Pearson’s correlation coefficient. The dendrogram linkage is based on the correlation strength. Geneid: ENSEMBL ID for the gene encoding the respective lncRNA transcript. tw: TWAs; tm: TMAs. GSEA and GO analyses were subsequently performed for all the lncRNA data. Due to the lack of studies on the genes of these transcripts, there was no significant enrichment in most databases. Notably, a few terms were enriched from the MSigDB c3.tft.v2023.1.Hs.symbols collection, which is focused on transcription factors. The results of the term enrichment for the top 10 terms in this collection are shown in Figure 6, and the results for insignificant term enrichment for other collections and databases can be found in **Supplementary Material 3.** ![Figure 6.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F6.medium.gif) [Figure 6.](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F6) Figure 6. GSEA of lncRNAs in MSigDB transcription factors gene set. Terms are more significantly enriched with downregulated genes. ### 3.4. ML model prediction of PC metastatic potential according to lncRNA expression Following the training and testing of each of the ML models, optimizations were performed to find the highest possible accuracy obtainable while reducing bias. Therefore, SMOTE was implemented in all the ML algorithms. Reducing sample imbalances improved the predictive accuracy of the utilized algorithms. Following SMOTE implementation and thorough hyperparameter tuning, LR demonstrated an accuracy score of 73.91% when distinguishing between TMAs and TWAs when tested, as well as an F1 score of 82.57% and a recall of 90.63%. Regardless, the AUC for LR was 0.63. Figure 7 **A and B** show the receiver operating characteristic (ROC) curve and for logistic regression following the implementation of SMOTE and the precision-recall (PR) curve. As the LR model was the only allowing the determination of prediction coefficients, assessment of which lncRNA had the highest weight in predictions was performed. The most notable lncRNAs with positive correlation coefficients (>0.50) include: LINC02575, LINC01115, LINC02428, PURPL, and AL035425.3. While the most notable ones with negative correlations (<-0.50) indicating a lower likelihood for metastatic PC include: AC207130.1, and AL358777.3. Figure 7 **C** shows the weight of each lncRNA (feature) in assisting the regression model in classifying test cases into TMAs and TWAs. ![Figure 7](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F7.medium.gif) [Figure 7](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F7) Figure 7 **A.** The LR model showed an AUC = 0.63, demonstrating relatively helpful classification performance, with good accuracy of detecting PC cases at TNM stage IIb or above. **B.** LR model accuracy of predicting positive values in comparison to the true positive rate (recall). **C.** Weights of each of the differentially expressed lncRNAs allowing the LR model to differentiate between non-metastatic tumors and metastatic tumors. For the SVM model, SMOTE implementation, and hyperameter tuning also improved the predictive potential of the algorithm, which, on testing, returned an accuracy of 76.09%, with a true positive rate of 84.51% and a recall of 93.75%. **Figure A and B** show the ROC curve as well as the PR curve of the SVM model. ![Figure 8](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F8.medium.gif) [Figure 8](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F8) Figure 8 **A.** The SVM algorithm showed an AUC = 0.65, demonstrating modest performance when accurately detecting PC cases at TNM stage IIb or above and distinguishing them from less metastatic stages. **B.** SVM model accuracy of predicting positive values in comparison to its recall capacity. RFC was one of the most accurate models; after hyperparameter tuning, it returned an accuracy of 76.09% and an F1 score of 81.96%, with a recall of 78.13%. Most importantly, the AUC for this model was 0.75, showing good performance in classifying the tumors. Regardless, the lncRNA panel consisting of 38 differentially expressed lncRNAs allowed the ML algorithms to discern advanced TNM stages from relatively early TNM stages in PC. Figure 9 **A and B** also show the RFC model accuracy and PR curve. ![Figure 9](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F9.medium.gif) [Figure 9](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F9) Figure 9 **A.** The RFC ROC AUC was 0.75, demonstrating clearly significant classification accuracy of detecting PC cases among the other ML algorithms when using the differentially expressed lncRNA counts data. **B.** The RFC PR curve showed good recall, and acceptable precision. As for XGBC, the model showed 71.73% accuracy; This specific model had the most inconsistency in predicting tumor types following each randomization. Figure 10 **A and B** show the low AUC and its PR curve. Data regarding the evaluation of the ML algorithms are available in **Supplementary Material 4**. ![Figure 10](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/11/13/2023.11.01.23297724/F10.medium.gif) [Figure 10](http://medrxiv.org/content/early/2023/11/13/2023.11.01.23297724/F10) Figure 10 **A.** XGBC showed the lowest AUC of 0.58. While the accuracy for detecting metastatic PC cases was high, the false positive rate was also high. **B.** Poorest reliability amongst the ML algorithms in the XGBC model. ## 4. Discussion Despite advances in diagnostics and therapeutics, PC remains a very challenging condition to treat, with consistently high mortality rates and limited available treatments40,41. Recently, research has focused on identifying prognostic markers for PC, and preclinical studies have identified several prognostic lncRNA signatures8,42–44. LncRNAs have been further suggested to have implications in diagnosis, drug resistance, and therapeutics in PC4. However, as most patients are often diagnosed at advanced stages of disease, mutational burdens show complex relationships with lncRNA regulation4. Therefore, as the literature suggests, these relationships must be investigated to adjust treatment modalities. This becomes even more crucial in the latter stages of PC. This study aimed to provide details regarding DETs in PC first and then to further analyze differentially expressed lncRNA and assess the diagnostic potential of these lncRNAs during the transition from stage IIa and stage IIb and above. These lncRNAs were extracted after performing DGEA to extract 38 gene transcripts from the global RNA-seq panel among 151 patient samples. The diagnostic potential of lncRNAs was assessed using supervised ML techniques to predict metastatic transition. Four ML techniques with established accuracy in prediction were used in this research: LR45, SVM46, RFC46 and XBGC47. DGEA of the global RNA-seq panel revealed 125 DETs, many of which were previously uninvestigated. Of the downregulated DETs, ADH7 was hypothesized to have implications when mutated in pancreatic injury48. NTS was also associated with PC49. However, SERPINB13 and MIR205HG were previously unexplored in PC but had been discussed in other cancers and were implicated in poor clinical outcomes50,51. No studies are available regarding LINC01300, which warrants further investigation. For the upregulated DETs, PAX7 was previously reported to have some relationship with cancers, yet studies regarding this specific gene transcript are lacking 52. For DEFA6 and DEFA5, a report suggested a link between them and clinical outcomes in colorectal cancer53. While there were no studies regarding AC010789.1 and TMPRSS15 in PC, some studies linked the potential implications of these transcripts with other cancers54,55. GSEA for the global RNA-seq panel revealed several enriched pathways in many gene sets. For example, GO enrichment revealed that the RNA-seq panel significantly enriched pathways relevant in the regulation of aerobic respiration (GO:1903715), electron transport carrier chain (GO:0022900), and mitochondrial gene expression and translation into RNA transcripts (GO:0140053). Notably, of the miRTarBase enriched pathways, mir-30b-5p microRNA (miRNA) was previously linked to PC56,57. While miR-548x-3p has not been studied regarding its function in cancer, miR-144-3p was previously implicated in PC58,59. Additionally, mir-548j-3p had no studies documenting its relationship with cancer. For miR-1468-3p, some studies have suggested it as a biomarker for non-small cell lung cancer and prostate cancer60,61. Following the filtering of the global RNA-seq panel to lncRNAs exclusively, DGEA revealed 38 differentially expressed lncRNAs, many of which were novel. LINC01300 and MIR205HG, as previously described, in addition to DUSP5-DT and AL513128.3, had no studies in PC, with the latter two completely lacking any studies on which. In contrast, one report regarding AC132192.2 indicated its relevance in prostate cancer62. For the upregulated lncRNAs, AC010789.1, as previously stated, had a report regarding its function in colorectal cancer55,63. LINC00486, RF00019, LINC01115, and AC133530.1 all lack validation studies in PC, but other reports indicate involvement in several diseases, including cancer64–67. As these novel lncRNAs lack studies regarding their functions, GSEA of the selected MSigDB collections returned no significant enrichment but in one transcription factors collection. Notably, the most enriched pathway described genes containing one or more binding regions for a transcription factor that regulates cell fate and controls cell cycle progression from the mitotic phase to interphase, known as TOX high mobility group box family member 4 (TOX4)68,69. Interestingly, lncRNAs enriching this term were primarily downregulated. To further explore the significance of the identified 38 lncRNAs, ML algorithms were employed to predict the metastatic state of cancer (designated “0” for stages IIa or below and “1” for stages IIb and above). The LR model suggested that a few lncRNAs may have more significance in metastatic progression, most notably: LINC02575, LINC01115, LINC02428, PURPL, AL035425.3, AC207130.1, and AL358777.3. All of which lacking studies in PC. Nonetheless, LINC02575 has been found to be implicated in proliferation of laryngeal squamous cell carcinomas 70; PURPL was indicated to have an involvement in ovarian and gastric cancers 71–73; and AL035425.3 was suggested to be implicated in the prognosis of triple negative breast cancer74. Of all the ML algorithms, RFC showed superior accuracy to the other algorithms, showing an AUC of 0.75 and an accuracy of over 76%. RFC models have been previously shown to have superior performance to several other ML algorithms39. While there is much to be understood regarding the functions of the identified lncRNA panel, the accuracy shown by RFC reveals important aspects about the involvement of these lncRNAs in PC. These finding warrants further *in vitro* and *in vivo* investigations of the identified lncRNA panel. For most of the identified lncRNA panels, this was the first study to uncover their involvement in PC. Regardless, there are many clinical implications for the findings discussed here. The results of this study suggest that the identified lncRNAs could be further utilized to assess the metastatic potential of PC, as well as aid in drug development, since these lncRNAs can be used as drug targets. Since their involvement allowed the prediction and distinction between TNM stages, further investigation of their functions seems crucial. Despite the significant findings, this study is not without limitations. First, DEGA was performed for a large number of data, which likely raised data noise. Second, TWAs used as controls were low in number, as most samples had a stage IIb diagnosis, and SMOTE was necessary to utilize for the ML algorithms to reduce bias. Third, there was a lack of normal tissue control samples, which makes it difficult to provide more accurate assessments of the nature of these lncRNAs. Last, there might have been biases in the TCGA data from incorrect measurements or sequencing, potentially skewing the results of the RNA-seq data. All of these findings indicate that the findings of this study should be further validated and interpreted with caution. Regardless, the presence of some evidence regarding some of the identified novel lncRNAs in other cancers suggests their potential involvement in PC proliferation and metastasis. This further adds to the implications of the findings discussed here and the importance of future research to address these novel lncRNAs as potential markers of metastatic progression in PC. ## 5. Conclusion DGEA utilized in this study identified a set of 38 novel lncRNAs that could contribute to metastatic progression in PC. GSEA was unable to provide sufficient information to further describe the functions of these lncRNA, due to the scarcity of available data relevant to the transcripts identified. Since different ML algorithms were able to predict metastatic PC with acceptable accuracy and the RFC model predicted PC with 76% accuracy based on the 38 lncRNA panel, it is likely that these DETs participate in the metastatic progression of PC, warranting further investigation. The significance and importance of this study is represented by the identified novel lncRNA set. Metastatic PC lacks sufficient studies regarding the involvement of lncRNAs in tumor proliferation and progression, especially those that use ML algorithms with proven accuracy. This is the first study of its kind to use this methodology to reveal the discussed lncRNA panel in PC to distinguish between early-stage and advanced PC. Regardless, more studies are needed to identify the role these genes play in PC metastasis and other cancers. Based on the findings of this study, I suggest further research to take place into the roles of these RNAs in metastatic PC. *In vitro* and *in vivo* experiments must be conducted to further elucidate the functions these lncRNAs may take part in. The accuracy of the ML algorithms when classifying metastatic PC reveals that these lncRNAs could have important potentials in improving the diagnostic accuracy for metastatic PC when implemented with other techniques, and should be evaluated for therapeutic potentials. ## Supporting information Supplementary Material 1 [[supplements/297724_file02.xlsx]](pending:yes) Supplementary Material 2 [[supplements/297724_file03.xlsx]](pending:yes) Supplementary Material 3 [[supplements/297724_file04.xlsx]](pending:yes) Supplementary Material 4 [[supplements/297724_file05.xlsx]](pending:yes) ## Data Availability All raw data acquired from TCGA, in addition to all analyses performed on said data and source code utilized to perform the analyses mentioned in the methodology are available at the link https://github.com/hasanalsharoh/PanC. [https://github.com/hasanalsharoh/PanC](https://github.com/hasanalsharoh/PanC) ## Footnotes * Figures 2 and 4 show a significance threshold of 0.75, instead of 0.5 as written in the description. Figure 5 description is for figure 5B, this issue happened as I was unsure of which of the figures to include due to size constraints. Nonetheless, in this revision, I include both figures (A and B) for clarity. Further revisions were done for calculations and everything else, results remain unchanged and conclusions unaltered thereof. Manuscript text remains unaltered following my revisions. * Received November 1, 2023. * Revision received November 13, 2023. * Accepted November 13, 2023. * © 2023, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/) ## References 1. 1.Hu JX, Zhao CF, Chen WB, et al. Pancreatic cancer: A review of epidemiology, trend, and risk factors. World J Gastroenterol. Jul 21 2021;27(27):4298–4321. doi:10.3748/wjg.v27.i27.4298 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3748/wjg.v27.i27.4298&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 2. 2.Partyka O, Pajewska M, Kwaśniewska D, et al. Overview of Pancreatic Cancer Epidemiology in Europe and Recommendations for Screening in High-Risk Populations. Cancers. 2023;15(14). doi:10.3390/cancers15143634 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/cancers15143634&link_type=DOI) 3. 3.Andersson R, Haglund C, Seppänen H, Ansari D. Pancreatic cancer – the past, the present, and the future. Scandinavian Journal of Gastroenterology. 2022/10/03 2022;57(10):1169–1177. doi:10.1080/00365521.2022.2067786 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/00365521.2022.2067786&link_type=DOI) 4. 4.Bin W, Yuan C, Qie Y, Dang S. Long non-coding RNAs and pancreatic cancer: A multifaceted view. Biomedicine & Pharmacotherapy. 2023/11/01/ 2023;167:115601. doi:10.1016/j.biopha.2023.115601 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.biopha.2023.115601&link_type=DOI) 5. 5.Guttman M, Amit I, Garber M, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009/03/01 2009;458(7235):223–227. doi:10.1038/nature07672 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/nature07672&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19182780&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000264059700048&link_type=ISI) 6. 6.Kore H, Datta KK, Nagaraj SH, Gowda H. Protein-coding potential of non-canonical open reading frames in human transcriptome. Biochem Biophys Res Commun. Oct 13 2023;684:149040. doi:10.1016/j.bbrc.2023.09.068 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.bbrc.2023.09.068&link_type=DOI) 7. 7.Aswathy R, Sumathi S. Defining new biomarkers for overcoming therapeutical resistance in cervical cancer using lncRNA. Mol Biol Rep. Oct 25 2023;doi:10.1007/s11033-023-08864-w [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s11033-023-08864-w&link_type=DOI) 8. 8.Zhang N, Yu X, Sun H, Zhao Y, Wu J, Liu G. A prognostic and immunotherapy effectiveness model for pancreatic adenocarcinoma based on cuproptosis-related lncRNAs signature. Medicine (Baltimore). Oct 20 2023;102(42):e35167. doi:10.1097/md.0000000000035167 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1097/md.0000000000035167&link_type=DOI) 9. 9.Wang T, Ji M, Liu W, Sun J. Development and validation of a novel DNA damage repair-related long non-coding RNA signature in predicting prognosis, immunity, and drug sensitivity in uterine corpus endometrial carcinoma. Comput Struct Biotechnol J. 2023;21:4944–4959. doi:10.1016/j.csbj.2023.10.025 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.csbj.2023.10.025&link_type=DOI) 10. 10.Zhao Y, Song Y, Zhang Y, Ji M, Hou P, Sui F. Screening protective miRNAs and constructing novel lncRNAs/miRNAs/mRNAs networks and prognostic models for triple-negative breast cancer. Mol Cell Probes. Oct 24 2023;72:101940. doi:10.1016/j.mcp.2023.101940 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.mcp.2023.101940&link_type=DOI) 11. 11.Collins GS, Whittle R, Bullock GS, et al. OPEN SCIENCE PRACTICES NEED SUBSTANTIAL IMPROVEMENT IN PROGNOSTIC MODEL STUDIES IN ONCOLOGY USING MACHINE LEARNING. J Clin Epidemiol. Oct 27 2023;doi:10.1016/j.jclinepi.2023.10.015 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jclinepi.2023.10.015&link_type=DOI) 12. 12.Rasti P, Wolf C, Dorez H, et al. Machine Learning-Based Classification of the Health State of Mice Colon in Cancer Study from Confocal Laser Endomicroscopy. Sci Rep. Dec 27 2019;9(1):20010. doi:10.1038/s41598-019-56583-9 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41598-019-56583-9&link_type=DOI) 13. 13.Sharma AN, Shwe S, Mesinkovska NA. Current state of machine learning for non-melanoma skin cancer. Arch Dermatol Res. May 2022;314(4):325–327. doi:10.1007/s00403-021-02236-9 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s00403-021-02236-9&link_type=DOI) 14. 14.14. Anaconda. Version Vers. 2-2.4.0. Anaconda Software Distribution; 2016. [https://anaconda.org](https://anaconda.org) 15. 15.Kluyver T, Ragan-Kelley B, Pérez F, et al. Jupyter Notebooks – a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press; 2016:87–90. 16. 16.glob — Unix style pathname pattern expansion. 2023. 17. 17.pandas-dev/pandas: Pandas. Version 1.5.3. 2023. 18. 18.Harris CR, Millman KJ, van der Walt SJ, et al. Array programming with NumPy. Nature. 2020/09/01 2020;585(7825):357–362. doi:10.1038/s41586-020-2649-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41586-020-2649-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32939066&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 19. 19.Virtanen P, Gommers R, Oliphant TE, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods. 2020/03/01 2020;17(3):261–272. doi:10.1038/s41592-019-0686-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41592-019-0686-2&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32015543&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 20. 20.Muzellec B, Teleńczuk M, Cabeli V, Andreux M. PyDESeq2: a python package for bulk RNA-seq differential expression analysis. Bioinformatics. 2023;39(9):btad547. doi:10.1093/bioinformatics/btad547 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/btad547&link_type=DOI) 21. 21.Hunter JD. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering. 2007;9(3):90–95. doi:10.1109/MCSE.2007.55 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1109/MCSE.2007.55&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=WOS:00024566&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 22. 22.22. Waskom ML. seaborn: statistical data visualization. Journal of Open Source Software. 2021;6(60):3021. doi:10.21105/joss.03021 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.21105/joss.03021&link_type=DOI) 23. 23.sanbomics. 2023. [https://pypi.org/project/sanbomics](https://pypi.org/project/sanbomics) 24. 24.Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with python. Austin, TX; 2010:10–25080. 25. 25.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology. 2014;15(12):1–21. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/gb-2014-15-1-r1&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24393533&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 26. 26.Zhao Y, Li M-C, Konaté MM, et al. TPM, FPKM, or Normalized Counts? A Comparative Study of Quantification Measures for the Analysis of RNA-seq Data from the NCI Patient-Derived Models Repository. Journal of Translational Medicine. 2021/06/22 2021;19(1):269. doi:10.1186/s12967-021-02936-w [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12967-021-02936-w&link_type=DOI) 27. 27.Boris M, Maria T, Vincent C, Mathieu A. PyDESeq2: a python package for bulk RNA-seq differential expression analysis. bioRxiv. 2022:2022.12.14.520412. doi:10.1101/2022.12.14.520412 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoiYmlvcnhpdiI7czo1OiJyZXNpZCI7czoxOToiMjAyMi4xMi4xNC41MjA0MTJ2MSI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzExLzEzLzIwMjMuMTEuMDEuMjMyOTc3MjQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 28. 28.Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005/10/25 2005;102(43):15545–15550. doi:10.1073/pnas.0506580102 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoicG5hcyI7czo1OiJyZXNpZCI7czoxMjoiMTAyLzQzLzE1NTQ1IjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTEvMTMvMjAyMy4xMS4wMS4yMzI5NzcyNC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=) 29. 29.Fang Z, Liu X, Peltz G. GSEApy: a comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics. 2023;39(1):btac757. doi:10.1093/bioinformatics/btac757 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/btac757&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36426870&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 30. 30.Chou CH, Shrestha S, Yang CD, et al. miRTarBase update 2018: a resource for experimentally validated microRNA-target interactions. Nucleic Acids Res. Jan 4 2018;46(D1):D296–d302. doi:10.1093/nar/gkx1067 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkx1067&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29126174&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 31. 31.31. The Gene Ontology C, Aleksander SA, Balhoff J, et al. The Gene Ontology knowledgebase in 2023. Genetics. 2023;224(1):iyad031. doi:10.1093/genetics/iyad031 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/genetics/iyad031&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36866529&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 32. 32.Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nature Genetics. 2000/05/01 2000;25(1):25–29. doi:10.1038/75556 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/75556&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=10802651&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000086884000011&link_type=ISI) 33. 33.Fardin P, Barla A, Mosci S, Rosasco L, Verri A, Varesio L. The l1-l2 regularization framework unmasks the hypoxia signature hidden in the transcriptome of a set of heterogeneous neuroblastoma cell lines. BMC Genomics. Oct 15 2009;10:474. doi:10.1186/1471-2164-10-474 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/1471-2164-10-474&link_type=DOI) 34. 34.Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority over-Sampling Technique. J Artif Int Res. jun 2002;16(1):321–357, numpages = 37. 35. 35.Garg S, Raghavan B. Comparison of machine learning algorithms for the classification of spinal cord tumor. Irish Journal of Medical Science (1971 -). 2023/08/19 2023;doi:10.1007/s11845-023-03487-3 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s11845-023-03487-3&link_type=DOI) 36. 36.Bruno V, Betti M, D’Ambrosio L, et al. Machine learning endometrial cancer risk prediction model: integrating guidelines of European Society for Medical Oncology with the tumor immune framework. Int J Gynecol Cancer. Oct 24 2023;doi:10.1136/ijgc-2023-004671 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NDoiaWpnYyI7czo1OiJyZXNpZCI7czoxMDoiMzMvMTEvMTcwOCI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIzLzExLzEzLzIwMjMuMTEuMDEuMjMyOTc3MjQuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 37. 37.Gutman R, Aronson D, Caspi O, Shalit U. What drives performance in machine learning models for predicting heart failure outcome? Eur Heart J Digit Health. May 2023;4(3):175–187. doi:10.1093/ehjdh/ztac054 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ehjdh/ztac054&link_type=DOI) 38. 38.Hicks SA, Strümke I, Thambawita V, et al. On evaluation metrics for medical applications of artificial intelligence. Sci Rep. Apr 8 2022;12(1):5979. doi:10.1038/s41598-022-09954-8 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41598-022-09954-8&link_type=DOI) 39. 39.Tan KR, Seng JJB, Kwan YH, et al. Evaluation of Machine Learning Methods Developed for Prediction of Diabetes Complications: A Systematic Review. J Diabetes Sci Technol. Mar 2023;17(2):474–489. doi:10.1177/19322968211056917 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1177/19322968211056917&link_type=DOI) 40. 40.Wall NR, Fuller RN, Morcos A, De Leon M. Pancreatic Cancer Health Disparity: Pharmacologic Anthropology. Cancers (Basel). Oct 20 2023;15(20)doi:10.3390/cancers15205070 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/cancers15205070&link_type=DOI) 41. 41.de Jesus VHF, Mathias-Machado MC, de Farias JPF, Aruquipa MPS, Jácome AA, Peixoto RD. Targeting KRAS in Pancreatic Ductal Adenocarcinoma: The Long Road to Cure. Cancers (Basel). Oct 17 2023;15(20)doi:10.3390/cancers15205015 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/cancers15205015&link_type=DOI) 42. 42.Sun Y, Yao L, Man C, Gao Z, He R, Fan Y. Development and validation of cuproptosis-related lncRNAs associated with pancreatic cancer immune microenvironment based on single-cell. Front Immunol. 2023;14:1220760. doi:10.3389/fimmu.2023.1220760 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3389/fimmu.2023.1220760&link_type=DOI) 43. 43.Wang H, Ding Y, He Y, et al. LncRNA UCA1 promotes pancreatic cancer cell migration by regulating mitochondrial dynamics via the MAPK pathway. Arch Biochem Biophys. Oct 8 2023;748:109783. doi:10.1016/j.abb.2023.109783 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.abb.2023.109783&link_type=DOI) 44. 44.Zhang R, Wang X, Ying X, et al. Hypoxia-induced long non-coding RNA LINC00460 promotes p53 mediated proliferation and metastasis of pancreatic cancer by regulating the miR-4689/UBE2V1 axis and sequestering USP10. Int J Med Sci. 2023;20(10):1339–1357. doi:10.7150/ijms.87833 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7150/ijms.87833&link_type=DOI) 45. 45.Tsai CW, Chang WS, Yueh TC, et al. The Significant Impacts of Interleukin-8 Genotypes on the Risk of Colorectal Cancer in Taiwan. Cancers (Basel). Oct 10 2023;15(20)doi:10.3390/cancers15204921 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/cancers15204921&link_type=DOI) 46. 46.Earnest A, Tesema GA, Stirling RG. Machine Learning Techniques to Predict Timeliness of Care among Lung Cancer Patients. Healthcare (Basel). Oct 18 2023;11(20)doi:10.3390/healthcare11202756 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/healthcare11202756&link_type=DOI) 47. 47.Padwal MK, Basu S, Basu B. Application of Machine Learning in Predicting Hepatic Metastasis or Primary Site in Gastroenteropancreatic Neuroendocrine Tumors. Current Oncology. 2023;30(10):9244–9261. doi:10.3390/curroncol30100668 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/curroncol30100668&link_type=DOI) 48. 48.Chiang CP, Wu CW, Lee SP, et al. Expression pattern, ethanol-metabolizing activities, and cellular localization of alcohol and aldehyde dehydrogenases in human pancreas: implications for pathogenesis of alcohol-induced pancreatic injury. Alcohol Clin Exp Res. Jun 2009;33(6):1059–68. doi:10.1111/j.1530-0277.2009.00927.x [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1530-0277.2009.00927.x&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19382905&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 49. 49.Kanellopoulos P, Nock BA, Krenning EP, Maina T. Optimizing the Profile of [(99m)Tc]Tc-NT(7-13) Tracers in Pancreatic Cancer Models by Means of Protease Inhibitors. Int J Mol Sci. Oct 26 2020;21(21)doi:10.3390/ijms21217926 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/ijms21217926&link_type=DOI) 50. 50.de Koning PJ, Bovenschen N, Leusink FK, et al. Downregulation of SERPINB13 expression in head and neck squamous cell carcinomas associates with poor clinical outcome. Int J Cancer. Oct 1 2009;125(7):1542–50. doi:10.1002/ijc.24507 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/ijc.24507&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19569240&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 51. 51.Xu Y, Yuan C, Peng J, et al. LncRNA MIR205HG expression predicts efficacy of neoadjuvant chemotherapy for patients with locally advanced breast cancer. Genes Dis. Jul 2022;9(4):837–840. doi:10.1016/j.gendis.2021.10.001 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.gendis.2021.10.001&link_type=DOI) 52. 52.He WA, Berardi E, Cardillo VM, et al. NF-κB-mediated Pax7 dysregulation in the muscle microenvironment promotes cancer cachexia. J Clin Invest. Nov 2013;123(11):4821–35. doi:10.1172/jci68523 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1172/JCI68523&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24084740&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000326611900029&link_type=ISI) 53. 53.Zhao X, Lu M, Liu Z, et al. Comprehensive analysis of alfa defensin expression and prognosis in human colorectal cancer. Front Oncol. 2022;12:974654. doi:10.3389/fonc.2022.974654 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3389/fonc.2022.974654&link_type=DOI) 54. 54.Sun NK, Huang SL, Lu HP, Chang TC, Chao CC. Integrative transcriptomics-based identification of cryptic drivers of taxol-resistance genes in ovarian carcinoma cells: Analysis of the androgen receptor. Oncotarget. Sep 29 2015;6(29):27065–82. doi:10.18632/oncotarget.4824 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.18632/oncotarget.4824&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=26318424&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 55. 55.Duan W, Kong X, Li J, et al. LncRNA AC010789.1 Promotes Colorectal Cancer Progression by Targeting MicroRNA-432-3p/ZEB1 Axis and the Wnt/β-Catenin Signaling Pathway. Front Cell Dev Biol. 2020;8:565355. doi:10.3389/fcell.2020.565355 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3389/fcell.2020.565355&link_type=DOI) 56. 56.Liu Y, Xu G, Li L. LncRNA GATA3-;AS1-miR-30b-5p-Tex10 axis modulates tumorigenesis in pancreatic cancer. Oncol Rep. May 2021;45(5)doi:10.3892/or.2021.8010 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3892/or.2021.8010&link_type=DOI) 57. 57.Chen K, Wang Q, Liu X, Wang F, Yang Y, Tian X. Hypoxic pancreatic cancer derived exosomal miR-30b-5p promotes tumor angiogenesis by inhibiting GJA1 expression. Int J Biol Sci. 2022;18(3):1220–1237. doi:10.7150/ijbs.67675 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7150/ijbs.67675&link_type=DOI) 58. 58.Liu S, Luan J, Ding Y. miR-144-3p Targets FosB Proto-oncogene, AP-1 Transcription Factor Subunit (FOSB) to Suppress Proliferation, Migration, and Invasion of PANC-1 Pancreatic Cancer Cells. Oncol Res. Jun 11 2018;26(5):683–690. doi:10.3727/096504017x14982585511252 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3727/096504017X14982585511252&link_type=DOI) 59. 59.Yang J, Cong X, Ren M, et al. Circular RNA hsa\_circRNA\_0007334 is Predicted to Promote MMP7 and COL1A1 Expression by Functioning as a miRNA Sponge in Pancreatic Ductal Adenocarcinoma. J Oncol. 2019;2019:7630894. doi:10.1155/2019/7630894 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1155/2019/7630894&link_type=DOI) 60. 60.Janpipatkul K, Trachu N, Watcharenwong P, et al. Exosomal microRNAs as potential biomarkers for osimertinib resistance of non-small cell lung cancer patients. Cancer Biomark. 2021;31(3):281–294. doi:10.3233/cbm-203075 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3233/cbm-203075&link_type=DOI) 61. 61.Daniel R, Wu Q, Williams V, Clark G, Guruli G, Zehner Z. A Panel of MicroRNAs as Diagnostic Biomarkers for the Identification of Prostate Cancer. Int J Mol Sci. Jun 16 2017;18(6)doi:10.3390/ijms18061281 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3390/ijms18061281&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28621736&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 62. 62.Wang K, Zhong W, Long Z, et al. 5-Methylcytosine RNA Methyltransferases-Related Long Non-coding RNA to Develop and Validate Biochemical Recurrence Signature in Prostate Cancer. Front Mol Biosci. 2021;8:775304. doi:10.3389/fmolb.2021.775304 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.3389/fmolb.2021.775304&link_type=DOI) 63. 63.Li R, Gao X, Sun H, Sun L, Hu X. Expression characteristics of long non-coding RNA in colon adenocarcinoma and its potential value for judging the survival and prognosis of patients: bioinformatics analysis based on The Cancer Genome Atlas database. J Gastrointest Oncol. Jun 2022;13(3):1178–1187. doi:10.21037/jgo-22-384 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.21037/jgo-22-384&link_type=DOI) 64. 64.Zeng X, Wang Y, Liu B, et al. Multi-omics data reveals novel impacts of human papillomavirus integration on the epigenomic and transcriptomic signatures of cervical tumorigenesis. J Med Virol. May 2023;95(5):e28789. doi:10.1002/jmv.28789 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1002/jmv.28789&link_type=DOI) 65. 65.Wang WF, Zhong HJ, Cheng S, et al. A nuclear NKRF interacting long noncoding RNA controls EBV eradication and suppresses tumor progression in natural killer/T-cell lymphoma. Biochim Biophys Acta Mol Basis Dis. Aug 2023;1869(6):166722. doi:10.1016/j.bbadis.2023.166722 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.bbadis.2023.166722&link_type=DOI) 66. 66.Bi X-a, Li L, Xu R, Xing Z. Pathogenic Factors Identification of Brain Imaging and Gene in Late Mild Cognitive Impairment. Interdisciplinary Sciences: Computational Life Sciences. 2021/09/01 2021;13(3):511–520. doi:10.1007/s12539-021-00449-0 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1007/s12539-021-00449-0&link_type=DOI) 67. 67.Gusev FE, Reshetov DA, Mitchell AC, et al. Chromatin profiling of cortical neurons identifies individual epigenetic signatures in schizophrenia. Transl Psychiatry. Oct 17 2019;9(1):256. doi:10.1038/s41398-019-0596-1 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/s41398-019-0596-1&link_type=DOI) 68. 68.Yevshin I, Sharipov R, Kolmykov S, Kondrakhin Y, Kolpakov F. GTRD: a database on gene transcription regulation-2019 update. Nucleic Acids Res. Jan 8 2019;47(D1):D100–d105. doi:10.1093/nar/gky1128 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gky1128&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=30445619&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 69. 69.The UniProt C. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research. 2023;51(D1):D523–D531. doi:10.1093/nar/gkac1052 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/nar/gkac1052&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=36408920&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2023%2F11%2F13%2F2023.11.01.23297724.atom) 70. 70.Shi Y, Yang D, Qin Y. Identifying prognostic lncRNAs based on a ceRNA regulatory network in laryngeal squamous cell carcinoma. BMC Cancer. Jun 15 2021;21(1):705. doi:10.1186/s12885-021-08422-2 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/s12885-021-08422-2&link_type=DOI) 71. 71.Zhang R, He T, Shi H, et al. Disregulations of PURPL and MiR-338-3p Could Serve As Prognosis Biomarkers for Epithelial Ovarian Cancer. J Cancer. 2021;12(18):5674–5680. doi:10.7150/jca.61327 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7150/jca.61327&link_type=DOI) 72. 72.Zhang R, Guo X, Zhao L, He T, Feng W, Ren S. Abnormal expressions of PURPL, miR-363-3p and ADAM10 predicted poor prognosis for patients with ovarian serous cystadenocarcinoma. J Cancer. 2023;14(15):2908–2918. doi:10.7150/jca.87405 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.7150/jca.87405&link_type=DOI) 73. 73.Cheng Z, Hong J, Tang N, Liu F, Gu S, Feng Z. Long non-coding RNA p53 upregulated regulator of p53 levels (PURPL) promotes the development of gastric cancer. Bioengineered. Jan 2022;13(1):1359–1376. doi:10.1080/21655979.2021.2017588 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1080/21655979.2021.2017588&link_type=DOI) 74. 74.Han YH, Wang Y, Lee SJ, et al. Identification of Hub Genes and Upstream Regulatory Factors Based on Cell Adhesion in Triple-negative Breast Cancer by Integrated Bioinformatical Analysis. Anticancer Res. Jul 2023;43(7):2951–2964. doi:10.21873/anticanres.16466 [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTA6ImFudGljYW5yZXMiO3M6NToicmVzaWQiO3M6OToiNDMvNy8yOTUxIjtzOjQ6ImF0b20iO3M6NTA6Ii9tZWRyeGl2L2Vhcmx5LzIwMjMvMTEvMTMvMjAyMy4xMS4wMS4yMzI5NzcyNC5hdG9tIjt9czo4OiJmcmFnbWVudCI7czowOiIiO30=)