RT Journal Article SR Electronic T1 Gene Sequence to 2D Vector Transformation for Virus Classification JF medRxiv FD Cold Spring Harbor Laboratory Press SP 2024.03.12.24304158 DO 10.1101/2024.03.12.24304158 A1 Ignacio Sanchez-Gendriz A1 Karolayne S. Azevedo A1 Luísa C. de Souza A1 Matheus G. S. Dalmolin A1 Marcelo A. C. Fernandes YR 2024 UL http://medrxiv.org/content/early/2024/03/13/2024.03.12.24304158.abstract AB Background DNA sequences harbor vital information regarding various organisms and viruses. The ability to analyze extensive DNA sequences using methods amenable to conventional computer hardware has proven invaluable, especially in timely response to global pandemics such as COVID-19.Objectives This study introduces a new representation that encodes DNA sequences in unit vector transitions in a 2D space, extracted from the 2019 repository Novel Coronavirus Resource (2019nCoVR). The main objective is to elucidate the potential of this method to facilitate virus classification using minimal hardware resources. It also aims to demonstrate the feasibility of the technique through dimensionality reduction and the application of machine learning models.Methods DNA sequences were transformed into two-nucleotide base transitions (referred to as ‘transitions’). Each transition was represented as a corresponding unit vector in 2D space. This coding scheme allowed DNA sequences to be efficiently represented as dynamic transitions. After applying a moving average and resampling, these transitions underwent dimensionality reduction processes such as Principal Component Analysis (PCA). After subsequent processing and dimensionality reduction, conventional machine learning approaches were applied, obtaining as output a multiple classification among six species of viruses belonging to the coronaviridae family, including SARS-CoV-2.Results and Discussions The implemented method effectively facilitated a careful representation of the sequences, allowing visual differentiation between six types of viruses from the Coronaviridae family through direct plotting. The results obtained by this technique reveal values accuracy, sensitivity, specificity and F1-score equal to or greater than 99%, applied in a stratified cross-validation, used to evaluate the model. The results found produced performance comparable, if not superior, to the computationally intensive methods discussed in the state of the art.Conclusions The proposed coding method appears as a computationally efficient and promising addition to contemporary DNA sequence coding techniques. Its merits lie in its simplicity, visual interpretability and ease of implementation, making it a potential resource in complementing existing strategies in the field.Competing Interest StatementThe authors have declared no competing interest.Funding StatementThis study did not receive any fundingAuthor DeclarationsI confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.YesThe details of the IRB/oversight body that provided approval or exemption for the research described are given below:The study uses DNA sequences extracted from the 2019 Novel Coronavirus Resource (2019nCoVR), which is an open repository accessible at https://bigd.big.ac.cn/ncov/.I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.YesI understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).YesI have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.YesAll data produced in the present study are available upon reasonable request to the authors