Abstract
We will develop a novel approach to drug repurposing using Natural Language Processing (NLP) and Literature-Based Discovery (LBD) techniques. The protocol describes a simplified, accessible drug repurposing pipeline that uses Word2Vec embeddings trained on PubMed abstracts to identify existing medications that could potentially be repurposed. We present this approach in the context of antipsychotics, but it could be repeated for any available medication.
The research is structured in three stages:
1. Identification of candidate medications using a Word2Vec algorithm trained on scientific literature.
2. Empirical testing of the identified candidates using a large hospital dataset to explore protective effects against disease onset.
3. Validation of the findings using a second, independent dataset to assess generalizability.
This method addresses limitations in current machine learning-based drug repurposing approaches, including lack of external validation and limited accessibility. By leveraging Word2Vec’s ability to capture semantic relationships between words, the study aims to uncover hidden connections in medical literature that may lead to novel therapeutic discoveries.
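The idea of using embedding similarity to surface candidates can be illustrated with a minimal sketch. It assumes word vectors have already been obtained (for example, from a Word2Vec model trained on PubMed abstracts) and ranks candidate drug terms by cosine similarity to a target term. The function names, the terms `disease_x`, `drug_a`, `drug_b`, `drug_c`, and the three-dimensional vectors are hypothetical toy values, not output from the protocol's model. Scoring by cosine similarity is an assumption about how candidates might be ranked, not a description of the authors' exact method.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(embeddings, target, candidates):
    """Return (term, similarity) pairs for candidates, most similar to the target first."""
    target_vec = embeddings[target]
    scored = [
        (name, cosine_similarity(embeddings[name], target_vec))
        for name in candidates
        if name in embeddings  # skip terms the model never saw
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors; real Word2Vec embeddings typically have 100+ dimensions.
toy_embeddings = {
    "disease_x": [0.9, 0.1, 0.0],
    "drug_a": [0.8, 0.2, 0.1],
    "drug_b": [0.0, 1.0, 0.0],
    "drug_c": [0.5, 0.5, 0.5],
}

ranking = rank_candidates(toy_embeddings, "disease_x", ["drug_a", "drug_b", "drug_c"])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this sketch, a higher score only indicates that two terms occur in similar contexts in the training text. Any candidate ranked this way would still need the empirical testing and external validation described in the stages above.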
The protocol emphasizes transparency and reproducibility, utilizing publicly available electronic health record (EHR) databases for validation. This approach allows for tangible results even for researchers with limited machine learning expertise, bridging the gap between the biomedical and information systems communities.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
No specific funding was obtained for the delivery of this study. Maximin Lange is supported through a studentship by the London Interdisciplinary Social Science Doctoral Training Partnership (LISS DTP). Ben Carter is supported through the NIHR Maudsley Biomedical Research Centre at the South London and Maudsley NHS Foundation Trust in partnership with King's College London.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Research Ethics Committee (REC) and other regulatory review:
MIMIC-IV data were collected as part of routine clinical care and have been deidentified and transformed. The dataset is available to researchers who have completed training in human research and signed a data use agreement. It was approved for research by the institutional review boards of the Massachusetts Institute of Technology and Beth Israel Deaconess Medical Center (BIDMC), which granted a waiver of informed consent and approved the sharing of the research resource. For the present retrospective study, two authors (ML and MM) signed the PhysioNet Credentialed Health Data Use Agreement 1.5.0 for MIMIC-IV, on May 8, 2024 and July 15, 2024 respectively.
BRATECA was deidentified according to the Health Insurance Portability and Accountability Act (HIPAA) standards using structured data cleansing and date shifting. The data were collected as part of a research project developed with several hospitals in Brazil. All data sharing was approved by each hospital. Ethical approval to use the hospitals' datasets in this research was granted by the Brazilian National Research Ethics Committee under the number 46652521.9.0000.5530. For the present retrospective study, data access was granted to ML on PhysioNet by the authors of the BRATECA dataset on June 7, 2024. MM applied for access on July 15, 2024 but had not yet been granted access at the time of writing of this protocol.
South London and Maudsley NHS Trust established CRIS in 2008 to allow searching and retrieval of comprehensive yet de-identified clinical information for research purposes, with permission for secondary data analysis approved by the Oxfordshire Research Ethics Committee C (reference 08/H0606/71*5). The present retrospective study was approved by the CRIS oversight committee on June 5, 2024, as Project 24-029. ML, MM and EG are approved for data access.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
MIMIC-IV and BRATECA are available on PhysioNet after credentialing and training. Anyone can apply to use CRIS provided they meet the governance requirements set out in the CRIS Security Model.