Abstract
Background Gastric cancer is one of the most common cancers worldwide, and its genomic links are under active study. In this paper, we use Genome Wide Association Studies (GWAS) data to identify, through statistical tests, the Single Nucleotide Polymorphisms (SNPs) most strongly associated with the occurrence of gastric cancer, and we leverage them to build a predictive model using machine learning algorithms. Polygenic risk scoring (PRS) is a straightforward predictive model for assigning genetic risk to individual outcomes (cancer or healthy).
Method Genome Wide Association Studies (GWAS) data for gastric cancer were subjected to statistical tests. The Chi-square test was used for feature selection by determining the degree of association between each probe (SNP) and the target (cancer or control). Probes that were not statistically significant were eliminated, and only the significant ones were carried forward. Naïve Bayes and CatBoost machine learning algorithms were used to build classification models to predict (score) gastric cancer.
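The per-SNP association test described above can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes genotypes are coded as 0/1/2 minor-allele counts and labels as 0 (control) or 1 (cancer), neither of which is stated in the abstract, and the function name `chi_square_snp` is hypothetical. It uses only the standard library, so the p-value is computed in closed form for 1 or 2 degrees of freedom, which covers a genotype-by-status table.

```python
import math
from collections import Counter


def chi_square_snp(genotypes, labels):
    """Pearson chi-square test of independence between one SNP and case status.

    genotypes: sequence of 0/1/2 values (assumed coding, not from the paper)
    labels:    sequence of 0 (control) / 1 (cancer)
    Returns (statistic, p_value).
    """
    counts = Counter(zip(genotypes, labels))
    rows = sorted({g for g, _ in counts})   # genotype categories actually observed
    cols = sorted({y for _, y in counts})   # outcome categories actually observed
    n = len(genotypes)

    row_total = {g: sum(counts[(g, y)] for y in cols) for g in rows}
    col_total = {y: sum(counts[(g, y)] for g in rows) for y in cols}

    # Sum of (observed - expected)^2 / expected over all cells of the table.
    stat = 0.0
    for g in rows:
        for y in cols:
            expected = row_total[g] * col_total[y] / n
            stat += (counts[(g, y)] - expected) ** 2 / expected

    df = (len(rows) - 1) * (len(cols) - 1)
    if df == 0:
        return 0.0, 1.0                      # no variation, nothing to test
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2))   # chi-square survival function, 1 df
    elif df == 2:
        p = math.exp(-stat / 2)              # chi-square survival function, 2 df
    else:
        raise ValueError("this sketch only handles tables with 1 or 2 degrees of freedom")
    return stat, p
```

In practice a library routine such as SciPy's `chi2_contingency` would likely be used instead; the hand-rolled version is shown only to make the computation explicit.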
Results Features were selected by performing a Chi-square test on each of the 319,283 SNPs in the data. The SNPs were then ordered by the negative log of the p-value, and the top 5, 100 and 1000 features were used as inputs to the Naïve Bayes and CatBoost classification models. The Naïve Bayes classifier gave an accuracy in the range of 0.60 to 0.76 across the different feature sets. The CatBoost algorithm proved better suited to this application, giving an accuracy above 0.90 for all feature subsets.
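The ranking step can be illustrated with a short sketch. The function name `top_k_snps` and the SNP identifiers in the usage example are hypothetical, and the abstract does not state the base of the logarithm; base 10 is assumed here, which does not affect the ordering.

```python
import math


def top_k_snps(p_values, k):
    """Rank SNPs by -log10(p-value) and return the k highest-scoring IDs.

    p_values: dict mapping SNP ID -> chi-square p-value
    """
    floor = 1e-300  # guard against log10(0) for p-values that underflow
    scores = {snp: -math.log10(max(p, floor)) for snp, p in p_values.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical p-values for three made-up SNP IDs.
example = {"snp_a": 0.5, "snp_b": 1e-8, "snp_c": 0.01}
selected = top_k_snps(example, 2)
```

Calling the function with k set to 5, 100 and 1000 would yield feature subsets of the sizes reported above.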
Conclusions This paper aims to create a highly accurate classification model to predict the occurrence of gastric cancer from GWAS genome data. The CatBoost model with an input space of 100 SNPs yielded the best results, with an accuracy of 0.93, and can be considered a polygenic risk scoring model for scoring new patients for gastric cancer.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
No external funding was received. This research was done as a capstone project.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The data are in the public domain and are hosted on the Gene Expression Omnibus website.
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Footnotes
Saed.sayad{at}rutgers.edu
Data Availability
The data are in the public domain and are available on the Gene Expression Omnibus website: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE58356