Abstract
Understanding COVID-19 severity, and why it differs so markedly among patients, remains a concern for the scientific community. The major contribution of this study is a voting ensemble host genetic severity predictor (HGSP) model, which we developed by combining several state-of-the-art machine learning algorithms (decision tree-based models: Random Forest and XGBoost classifiers). These models were trained on a genetic Whole Exome Sequencing (WES) dataset and clinical covariates (age and gender), using a 5-fold stratified cross-validation strategy to randomly split the dataset and mitigate model instability. We validated the HGSP model on the 18 features (16 candidate genetic variants and 2 covariates) identified in a prior study. We provided post-hoc model explanations through ExplainerDashboard, an open-source Python library, allowing deeper insight into the prediction results. We applied the Enrichr and Open Targets Genetics interactive bioinformatics tools to link the genetic variants to plausible biological insights and domain interpretations, such as pathways, ontologies, and diseases/drugs. Through unsupervised clustering of the SHAP feature importance values, we visualized the complex genetic mechanisms. Our findings show that while age and gender are the main influences on COVID-19 severity, a specific group of patients experiences severity due to complex genetic interactions.
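The two computational ideas named in the abstract, stratified 5-fold splitting and voting across base classifiers, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the helper names `stratified_k_fold` and `soft_vote`, the toy labels, and the example probabilities are all hypothetical, and the use of soft voting (averaging probabilities) is an assumption, since the abstract does not say whether hard or soft voting was used.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Return k lists of sample indices with roughly equal class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def soft_vote(probabilities):
    """Average the base models' severe-class probabilities; threshold at 0.5."""
    mean_prob = sum(probabilities) / len(probabilities)
    return mean_prob, int(mean_prob >= 0.5)


# Toy labels (hypothetical): 1 = severe, 0 = not severe.
labels = [1] * 10 + [0] * 20
folds = stratified_k_fold(labels, k=5, seed=42)

# Hypothetical severe-class probabilities from two base models for one patient
# (standing in for a Random Forest and an XGBoost classifier).
mean_prob, predicted = soft_vote([0.8, 0.4])
```

Each fold in this toy example keeps the 1:2 ratio of severe to non-severe cases found in the full label list, which is the property that stratification is meant to preserve. In practice, scikit-learn's `StratifiedKFold` and `VotingClassifier` provide equivalent functionality.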
Competing Interest Statement
The authors have declared no competing interest.
Clinical Protocols
https://github.com/raimondilab/COVID-19-severity-host-genetic-predictor-model-explanation
Funding Statement
This work was supported by Intesa San Paolo through the 2020 charity fund dedicated to project N. B/2020/0119, "Identificazione delle basi genetiche determinanti la variabilita clinica della risposta a COVID-19 nella popolazione italiana", and by the EU project H2020-SC1-FA-DTS-2018-2020, entitled "International consortium for integrative genomics prediction" (INTERVENE), Grant Agreement No. 101016775.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The dataset used for this study was part of the GEN-COVID Multicenter Study, https://sites.google.com/dbm.unisi.it/gen-COVID, an Italian multicenter study aimed at identifying the host genetic bases of COVID-19. Specimens were provided by the COVID-19 Biobank of Siena, which is part of the Genetic Biobank of Siena, a member of BBMRI-IT, of the Telethon Network of Genetic Biobanks (project no. GTB18001), of EuroBioBank, and of RD-Connect. Further information on the cleansed dataset and code is available on our GitHub group page at: https://github.com/raimondilab/COVID-19-severity-host-genetic-predictor-model-explanation
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Footnotes
Some additional clarifications were made in the abstract, and the figure numbering was updated (7 instead of 1 in the last section).
Data Availability
The cleansed dataset and code are available on our GitHub group page at: https://github.com/raimondilab/COVID-19-severity-host-genetic-predictor-model-explanation