Abstract
Biomedical knowledge, derived from biological experiments and clinical practice, is growing rapidly and has become an important asset for biomedical research. Knowledge graphs (KGs) provide an efficient and effective way to organize and retrieve this large and growing volume of knowledge. A biomedical KG (BKG) typically stores and represents knowledge by constructing a semantic network that describes entities and the relationships between them. Previous efforts have constructed and curated BKGs by integrating various biomedical data resources. Although the resulting BKGs have made significant progress in advancing biological and medical research, a large gap remains between them and a BKG that is sufficiently comprehensive and fine-grained. To this end, in the present study, we collected and integrated data from diverse well-curated biomedical knowledge bases and BKGs to curate a more comprehensive one, named the Cornell Biomedical Knowledge Hub (CBKH). To facilitate its use in accelerating biomedical research, we deployed CBKH using the graph database Neo4j. This is a continuing effort, and we are adding more content to CBKH to support the varied and complex needs of biomedical data analysis. Please contact us with ideas and suggestions.
Introduction
Biomedicine is a discipline with a large body of highly specialized knowledge accumulated from biological experiments and clinical practice. This knowledge is usually buried in massive volumes of biomedical literature and textbooks, which makes effective knowledge organization and efficient knowledge retrieval challenging. The knowledge graph is a recently emerged concept that aims to address this challenge. A knowledge graph (KG) stores and represents knowledge by constructing a semantic network that describes entities and the relationships between them. The basic elements of a biomedical knowledge graph are a set of biomedical entities and a set of typed semantic relationships among those entities. In biomedicine, typical entities include diseases, drugs, and genes, and typical relationships include treats (drug-treats-disease), binds (drug-binds-target protein), and interacts (drug-drug interaction). A large-scale biomedical KG (BKG) makes efficient knowledge retrieval and inference possible.
Typically, a BKG is constructed and curated by integrating publicly available biomedical knowledge bases with knowledge extracted from biomedical literature. For example, Hetionet [1], released in 2017, is a well-curated BKG constructed by integrating 29 publicly available data resources, such as DrugBank [2], GWAS Catalog [3], DISEASES [4], and DisGeNET [5]. Similar to Hetionet, the Drug Repurposing Knowledge Graph (DRKG) [6] was built by integrating data from six existing databases, with a specific focus on drug repurposing for COVID-19. It contains about 100K entities of 13 types and over 5 million relationships of 107 types. PreMedKB [7] includes information on diseases, genes, variants, and drugs by integrating existing resources. The Clinical Knowledge Graph (CKG) [8] was constructed by integrating relevant existing biomedical databases with text extracted from scientific literature, and contains over 16 million nodes and over 220 million relationships. Compared to other BKGs, CKG includes entities that represent biological information at a finer granularity, such as metabolite, modified protein, molecular function, transcript, genetic variant, food, and clinical variable. In addition, some BKGs were built with a focus on specific diseases or conditions. For example, COVID-KG [9] extracted COVID-19-specific information from biomedical literature and constructed a knowledge graph containing diseases, chemicals, and genes, along with their relationships. KGHC [10] is a knowledge graph focused on hepatocellular carcinoma. It extracted information from literature and internet content, as well as structured triples from SemMedDB [11].
Although these efforts have achieved significant progress, none is comprehensive enough to incorporate all biomedical knowledge. For example, Parkinson’s disease (PD) is associated with genetic mutations such as G2019S in LRRK2, but most existing BKGs, such as Hetionet, store only gene-level information. PD is also associated with brain lesions detected by MRI, but such information is not incorporated in current BKGs. In addition, entities at a finer granularity, such as molecules, which have been shown to be important in biomedical research, are not included in most existing BKGs like Hetionet. A more comprehensive BKG is therefore still needed. To this end, in the present study, we collected and integrated data from multiple well-curated biomedical knowledge bases and BKGs to curate a more comprehensive one, named the Cornell Biomedical Knowledge Hub (CBKH). We deployed CBKH using Neo4j (https://neo4j.com). If you are interested in accessing CBKH, please contact us.
Materials and Methods
Our ultimate goal was to build a biomedical knowledge graph that incorporates as much biomedical knowledge as possible. To this end, we collected and integrated 15 publicly available data sources. Details of these data resources are listed in Table 1.
Raw data processing and information extraction
Given the data resources, the first step was to pre-process their raw files and extract knowledge from them, including entity information and relationship information. The databases release their raw data files in various formats, such as comma-separated values (CSV), tab-separated values (TSV), plain text (TXT), Excel spreadsheets, Hypertext Markup Language (HTML), Resource Description Framework (RDF), and Web Ontology Language (OWL). For each database, we therefore parsed the raw files and extracted structured data, i.e., a descriptive file for each type of biomedical entity and a file for each type of relationship. This procedure varies by database and even by file within the same database.
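As an illustration of this extraction step, the following minimal sketch splits one tab-separated raw table into an entity table and a relationship table. The column names (`drug_id`, `drug_name`, `gene_symbol`, `action`), the sample rows, and the `extract` helper are hypothetical and stand in for whatever a given database actually provides; this is not the code used to build CBKH.

```python
import csv
import io

# Hypothetical raw export; real files differ by database and by file.
RAW_TSV = (
    "drug_id\tdrug_name\tgene_symbol\taction\n"
    "DB0001\tDrugA\tGENE1\ttarget\n"
    "DB0001\tDrugA\tGENE2\ttarget\n"
    "DB0002\tDrugB\tGENE1\tinhibitor\n"
)


def extract(raw_text, delimiter="\t"):
    """Split one raw table into an entity table and a relationship table."""
    drugs = {}      # drug_id -> drug_name (entity descriptions)
    relations = []  # (head_id, relation_type, tail_id) triples
    for row in csv.DictReader(io.StringIO(raw_text), delimiter=delimiter):
        drugs[row["drug_id"]] = row["drug_name"]
        relations.append((row["drug_id"], row["action"], row["gene_symbol"]))
    return drugs, relations


drugs, relations = extract(RAW_TSV)
```

Formats such as RDF, OWL, or HTML would need their own parsers, but the output shape (one table per entity type, one per relationship type) stays the same.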
Term normalization
For normalization of the entity terms, we used a greedy strategy. Specifically, we first chose one database to initialize the vocabulary for each type of entity. Next, using multiple identifiers as the linkage pool for entity normalization, we integrated entities from the remaining databases one by one to enrich the entity vocabulary.
For gene entities, we used the HGNC gene repository [12] as the initialization vocabulary, as it sets a standard nomenclature for human genes. The linkage pool for normalization included HGNC ID, HGNC symbol, and NCBI ID.
For drug entities, we initialized our vocabulary using DrugBank [2], as it provides an up-to-date list of approved drugs and investigational drugs under clinical trials. The linkage pool for drug entity normalization included MeSH term, MeSH term ID, Unified Medical Language System (UMLS) Concept Unique Identifier (CUI), and the drug name in UMLS.
For disease entities, we used the Disease Ontology [13] to initialize the vocabulary, as it is a structured database of diseases based on etiological classification. The linkage pool for disease entity normalization included MeSH term, MeSH term ID, UMLS CUI, and the disease name in UMLS.
For anatomy entities, we used Uberon [14] to initialize the vocabulary, as it is a cross-species anatomical ontology based on traditional anatomical classification. The linkage pool for anatomy entity normalization included MeSH term, MeSH term ID, UMLS CUI, and the anatomy name in UMLS.
For molecule entities, we used ChEMBL [15] to initialize the vocabulary, as it is a manually curated database of molecules with drug properties. The linkage pool for molecule entity normalization consisted of the International Chemical Identifier (InChI).
For symptom entities, we collected the entities from Hetionet and described them using the MeSH term and MeSH term ID. We used UMLS CUI as the linkage for symptom entity normalization.
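The greedy strategy described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the CBKH implementation: the `normalize` function, the record layout (a dict mapping identifier type to value), and the rule of merging into the first matching entry are all hypothetical, and the gene identifiers are made-up placeholders.

```python
def normalize(seed, other_sources, link_keys):
    """Greedily merge entity records into a vocabulary seeded from one source.

    A record is merged into the first vocabulary entry that shares any
    identifier listed in link_keys; otherwise it becomes a new entry.
    """
    vocab = [dict(rec) for rec in seed]
    index = {}  # (identifier type, value) -> position in vocab
    for pos, rec in enumerate(vocab):
        for key in link_keys:
            if rec.get(key):
                index[(key, rec[key])] = pos

    for source in other_sources:  # integrate the sources one by one
        for rec in source:
            hits = [index[(k, rec[k])] for k in link_keys
                    if rec.get(k) and (k, rec[k]) in index]
            if hits:
                pos = hits[0]  # greedy: take the first match
                for k, v in rec.items():
                    vocab[pos].setdefault(k, v)  # keep existing values
            else:
                pos = len(vocab)
                vocab.append(dict(rec))
            for k in link_keys:  # newly learned identifiers become linkable
                if vocab[pos].get(k):
                    index.setdefault((k, vocab[pos][k]), pos)
    return vocab


# Placeholder gene records; identifiers are invented for illustration.
seed = [{"hgnc_id": "HGNC:0001", "symbol": "GENE1"}]
source_1 = [{"symbol": "GENE1", "ncbi_id": "101"},
            {"symbol": "GENE2", "ncbi_id": "102"}]
source_2 = [{"ncbi_id": "101", "name": "example gene"}]

vocab = normalize(seed, [source_1, source_2],
                  link_keys=["hgnc_id", "symbol", "ncbi_id"])
```

In this toy run, the record from `source_2` carries only an NCBI ID, yet it still links to the seed entry because that ID was learned from `source_1`. This shows why the order in which sources are integrated can matter under a greedy scheme.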
CBKH deployment
To enhance the usability of CBKH in accelerating biomedical research, we deployed it using a graph database, Neo4j (https://neo4j.com), which provides an easy-to-use interface for querying and browsing knowledge in the KG. With Cypher statements on the Neo4j platform, knowledge in CBKH can be retrieved efficiently and flexibly.
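To give a sense of what such retrieval might look like, the sketch below builds a parameterized Cypher query and shows how it could be sent through the official Neo4j Python driver. The node labels (`Drug`, `Gene`), the property names (`name`, `symbol`), the relationship direction, and the helper functions are assumptions made for illustration; the actual CBKH schema may differ.

```python
def build_target_query(drug_name, limit=25):
    """Return a parameterized Cypher query and its parameters."""
    # Labels, properties, and direction are assumed, not taken from CBKH.
    query = (
        "MATCH (d:Drug {name: $drug_name})-[r:Target_DrugBank]->(g:Gene) "
        "RETURN g.symbol AS gene "
        "LIMIT $limit"
    )
    return query, {"drug_name": drug_name, "limit": limit}


def run_query(uri, user, password, query, params):
    """Execute the query with the Neo4j Python driver (needs a live server)."""
    # Imported lazily so the sketch loads even without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            return [record.data() for record in session.run(query, params)]


query, params = build_target_query("DrugA")
```

Passing values as query parameters rather than formatting them into the query string avoids quoting problems and lets the database reuse the query plan.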
Results
CBKH integrates data from 15 publicly available biomedical databases. The current version of CBKH (Figure 1 and Table 2) contains a total of 2,231,297 entities of 6 types. Specifically, CBKH includes 22,963 anatomy entities, 18,503 disease entities, 36,436 drug entities, 87,942 gene entities, 2,065,015 molecule entities, and 438 symptom entities. For the relationships in CBKH (Table 3), there are 91 relation types across 8 kinds of entity pairs: Anatomy-Gene, Drug-Disease, Drug-Drug, Drug-Gene, Disease-Disease, Disease-Gene, Disease-Symptom, and Gene-Gene. In total, CBKH contains 48,678,651 relations. More specifically, there are 3 relation types for the Anatomy-Gene pair, such as ‘Express’ and ‘Absent’; 11 relation types for the Drug-Disease pair, such as ‘Treat’ and ‘Effect’; 2 relation types for the Drug-Drug pair, ‘Interaction’ and ‘Resemble’; 25 relation types for the Drug-Gene pair, such as ‘Target’, ‘Upregulates’, and ‘Downregulates’; 2 relation types for the Disease-Disease pair, ‘is_a’ and ‘Resemble’; 16 relation types for the Disease-Gene pair, such as ‘Association’; the single ‘Presents’ relation type for the Disease-Symptom pair; and 31 relation types for the Gene-Gene pair, such as ‘Covaries’ and ‘Interacts’. Because some resources were generated by text mining methods, they express relations as phrases (Text-semantic relation). Examples include the relation ‘role in disease pathogenesis’ for the Drug-Disease pair and the relation ‘enhances expression/production’ for the Drug-Gene pair. Because the CBKH relations were derived by integrating multiple resources, relationships of different types that connect the same two entities may overlap. For example, there are 16,961 ‘Target_DrugBank’ relationships and 11,801 ‘Binds_Hetionet’ relationships in Drug-Gene. Of these, 4,745 overlap; that is, the same drug-gene pair is connected by both a ‘Target’ and a ‘Binds’ relationship.
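The overlap between two relation types can be computed as a set intersection over (drug, gene) pairs. The sketch below uses made-up pairs purely to illustrate the calculation; it does not reproduce the CBKH counts.

```python
# Each relation type is represented as a set of (drug, gene) pairs.
# The pairs below are invented placeholders.
target_drugbank = {("DrugA", "GENE1"), ("DrugA", "GENE2"), ("DrugB", "GENE1")}
binds_hetionet = {("DrugA", "GENE1"), ("DrugB", "GENE3")}

# Pairs connected by both relation types.
overlap = target_drugbank & binds_hetionet

# Pairs that appear under only one of the two relation types.
only_target = target_drugbank - binds_hetionet
only_binds = binds_hetionet - target_drugbank
```

This comparison is only meaningful after term normalization, since both resources must refer to the same drug and gene by the same identifier.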
Future work
KG quality control
Constructing and curating a BKG involves sophisticated raw data extraction and pre-processing, data annotation, and terminology normalization, each of which may introduce quality issues. In general, quality issues in KGs fall into two categories: incorrectness and incompleteness.
Incorrectness
Incorrectness refers to incorrect facts in the KG, e.g., a relation connecting two entities that exists in the BKG but is inconsistent with real-world evidence. A common strategy for assessing this is manual annotation of small sampled subsets. Such a procedure is time-consuming and costly if one wants to evaluate enough triplets to meet statistical criteria. To reduce this cost, Gao et al. [16], for example, proposed an iterative framework for KG accuracy evaluation. Specifically, inspired by the properties of the annotation cost function observed in practice, the authors developed a cluster sampling strategy based on unequal probability theory. Their framework reduced annotation cost by 60% and can be easily extended to evolving KGs. In addition, the use of well-designed biomedical vocabularies such as the Unified Medical Language System (UMLS) will improve entity term normalization and hence reduce the risk of errors caused by ambiguous biomedical entities. Moreover, learning from the KG structure to refine the KG is another potential way to address this issue, and early efforts such as Zhao et al. [17] have focused on this direction.
Incompleteness
Incompleteness mainly refers to the absence of biologically or clinically meaningful triplets from the KG. To address incompleteness in the biomedical KG, we integrated multiple data resources, biomedical databases, and biomedical KGs to construct and curate a more comprehensive one. However, there is no guarantee that the included resources, even combined, are comprehensive enough to cover all biomedical knowledge. In addition, the large body of available biomedical literature and medical data (e.g., electronic health records, EHRs) is a rich source of biomedical knowledge. In this context, previous studies have focused on deriving knowledge from biomedical literature [18-21] and EHR data [22, 23], and the derived knowledge could be a good complement to biomedical KGs. Moreover, computational methods such as KG embedding models (e.g., TransE and TransH) and graph neural networks (GNNs; e.g., R-GCN) have been used for KG completion [24], predicting missing relations within a KG from its structural properties.
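To illustrate how an embedding model can propose missing relations, the sketch below scores candidate triples in the TransE style, where a triple (head, relation, tail) is plausible when the head embedding plus the relation embedding lies close to the tail embedding. The two-dimensional vectors are hand-set placeholders rather than trained embeddings, and the entity names and `rank_tails` helper are hypothetical.

```python
import math


def transe_score(h, r, t):
    """TransE plausibility: negative L2 distance between h + r and t."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Hand-set toy embeddings; a real model would learn these from the KG.
entity = {
    "DrugA": [0.0, 0.0],
    "DiseaseX": [1.0, 0.0],
    "DiseaseY": [0.0, 3.0],
}
relation = {"Treat": [1.0, 0.0]}


def rank_tails(head, rel, candidates):
    """Order candidate tail entities from most to least plausible."""
    scores = {c: transe_score(entity[head], relation[rel], entity[c])
              for c in candidates}
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_tails("DrugA", "Treat", ["DiseaseX", "DiseaseY"])
```

With these placeholder vectors, `DiseaseX` ranks first because `DrugA + Treat` coincides with its embedding. In practice, high-ranking triples that are absent from the KG would be treated as candidates for review rather than as established facts.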
Focus on specific diseases or health conditions
Similar to most existing BKGs, such as Hetionet and CKG, CBKH focuses on general biomedical knowledge. However, precision medicine for specific human diseases or health conditions requires very fine-grained knowledge focused on them. In this context, COVID-KG [9] included biomedical knowledge with a specific focus on COVID-19, and KGHC [10] is a knowledge graph focused on hepatocellular carcinoma. Following this idea, to adapt our KG to problems in specific complex diseases and health conditions such as Alzheimer’s disease, Parkinson’s disease, and mental illness, we will collect fine-grained data, such as genotype-phenotype associations and brain region atrophy-phenotype associations, and incorporate them to enrich our BKG for use with these diseases.
Keeping KG up to date
Thanks to advances in high-throughput techniques, biomedical data are being produced continuously. Meanwhile, a rapidly increasing amount of biomedical literature is being published. Because most existing studies gather knowledge from experimental data and biomedical literature manually, keeping a KG current requires increasing human effort. In this context, we would highlight the use of computational methods, such as Natural Language Processing (NLP) techniques, which can automatically and efficiently extract knowledge from raw data files such as biomedical literature and clinical trial documentation. In the future, we may incorporate such techniques to assist in KG maintenance.
Data Availability
The data is currently available from the corresponding author on reasonable request.
Acknowledgement
The work is supported by NSF 1750326 and 2027970.