Abstract
Pancreatic cancer (PC) is associated with high overall mortality. Recent literature has investigated long noncoding RNAs (lncRNAs) in several cancers, but studies of their functions in PC are lacking. The purpose of this study was to identify novel lncRNAs and to use machine learning (ML) techniques to predict metastatic cases of PC from the identified lncRNAs. To identify lncRNAs with significantly altered expression in PC, RNA-sequencing (RNA-seq) transcriptomic profiles of pancreatic carcinomas were obtained from The Cancer Genome Atlas (TCGA), and differential gene expression analysis was performed. To assess the contribution of these lncRNAs to metastatic progression, four ML algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RFC) and eXtreme Gradient Boosting Classifier (XGBC). To improve the predictive accuracy of these models, hyperparameter tuning was performed, and the synthetic minority oversampling technique (SMOTE) was applied to reduce bias. Of 60,660 gene transcripts shared among 151 PC patients, 38 lncRNAs were identified as significantly differentially expressed. To further investigate the functions of the novel lncRNAs, gene set enrichment analysis (GSEA) was performed on the population lncRNA panel. GSEA results revealed enrichment of several terms implicated in proliferation. Moreover, when the four ML algorithms were used to predict metastatic progression based solely on the novel lncRNA panel, both SVM and RFC achieved 76% accuracy. To the best of my knowledge, this is the first study to identify this lncRNA panel for differentiating non-metastatic from metastatic PC; many of these lncRNAs have not previously been mapped to PC. The ML accuracy scores indicate an important involvement of the detected RNAs.
Based on these findings, I suggest further investigations of this lncRNA panel in vitro and in vivo, as they could be targeted for improved outcomes in PC patients, as well as assist in the diagnosis of metastatic progression based on RNA-seq data of primary pancreatic tumors.
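The abstract mentions the synthetic minority oversampling technique without describing it. The core idea can be illustrated with a minimal sketch in plain Python: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. This is not the code used in the study; the function name `smote`, its parameters, and the toy two-feature data are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE: interpolate between minority samples and their neighbors.

    `minority` is a list of feature vectors from the under-represented class.
    Returns `n_synthetic` new feature vectors. Hypothetical helper, not the study's code.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to it.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbors at random.
        neighbor = minority[rng.choice(others[:k])]
        # Place the new sample somewhere on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Toy data (made up): two features per sample, 12 majority vs 4 minority samples.
non_metastatic = [[0.1 * i, 1.0 + 0.05 * i] for i in range(12)]
metastatic = [[2.0, 0.2], [2.2, 0.1], [1.9, 0.4], [2.4, 0.3]]

# Generate enough synthetic minority samples to balance the two classes.
new_points = smote(metastatic, n_synthetic=len(non_metastatic) - len(metastatic), k=3)
balanced_metastatic = metastatic + new_points
print(len(non_metastatic), len(balanced_metastatic))
```

Oversampling of this kind is typically applied to the training split only, so that synthetic samples do not leak into the data used to report accuracy. The abstract does not state at which stage of the pipeline it was applied.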
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study did not receive any funding.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
All data used in this study were acquired from The Cancer Genome Atlas (TCGA), available from https://portal.gdc.cancer.gov/projects/TCGA-PAAD. The search algorithm used to retrieve the data can be provided on request. Only open-access data were selected; controlled-access data were excluded.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
References were updated, and incorrect labeling of RNA transcripts was fixed. The hierarchical clustering diagram had incorrectly used normalized count data instead of natural-logarithm values. Further clarifications were made regarding the ML model assessments, and the text was edited to present the results more clearly.
Data Availability
All raw data acquired from TCGA, all analyses performed on those data, and the source code used to perform the analyses described in the methodology are available at https://github.com/hasanalsharoh/PanC.