Abstract
There are thousands of distinct disease entities and concepts, each of which is known by different and sometimes contradictory names. The lack of a unified system for managing these entities poses a major challenge for both machines and humans that need to harmonize information to better predict causes and treatments for disease. The Mondo Disease Ontology is an open, community-driven ontology that integrates key medical and biomedical terminologies, supporting disease data integration to improve diagnosis, treatment, and translational research. Mondo records the sources of all data and is continually updated, making it suitable for research and clinical applications that require up-to-date disease knowledge.
Introduction
In the past decade, there have been major advances in computational approaches to disease diagnosis and care management. However, the reference data on which these tools depend are not only heterogeneous and disaggregated, but also growing and ever changing. Standard terminologies such as the Human Phenotype Ontology [1], the Online Mendelian Inheritance in Man [2], and Orphanet [3] have helped standardize medical terminology. However, reconciling the many terminologies used to name diseases, and their meanings, remains difficult, which in turn hampers knowledge and data integration. It is critical to develop an unambiguous resource for disease name reconciliation so that evidence can be accurately gathered on individuals with these diseases and can inform their diagnosis, care, and treatment. Such a resource would also allow related gene, variant, and infectious agent resources to be interoperable and to contribute to the ongoing building of medical knowledge bases. Dozens of terminological disease resources used for research and clinical applications exist [4,5], including resources for Mendelian diseases, common diseases, rare diseases, cancer, and infectious diseases, with others being more comprehensive and broad [2,3,6–8]. However, scope and classification are just the beginning of the ways these resources differ: additional differences include disease naming conventions, synonym encoding, cross-references, and more. As a result, each terminology has different strengths and weaknesses. These resources partially overlap, often significantly [9,10]. The correspondence (mapping) among individual concepts is often established through text matching, which can be misleading; for example, the terms ‘muscular pseudohypertrophy-hypothyroidism syndrome’ and ‘B-cell immunodeficiency-limb anomaly-urogenital malformation syndrome’ both have the exact synonym ‘Hoffman syndrome’, but they are entirely different diseases.
Human-declared mappings are represented as “cross-references”, but the relationship between the two terms can be non-exact, incorrect, out of date, or otherwise not clearly defined. For example, the concept DOID:8923 ‘skin melanoma’ cross-references both OMIM:608035 ‘melanoma, cutaneous malignant, susceptibility to, 4’ and OMIM:612263 ‘melanoma, cutaneous malignant, susceptibility to, 7’ [11], which are two different types of susceptibility rather than neoplasms. Further, some disciplines of medicine, such as pharmacogenetics, are not well covered by terminologies. Therefore, the resulting integration across disease resources is often incomplete, inconsistent, and unreliable for diagnostics or research.
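The ‘Hoffman syndrome’ collision described above can be reproduced with a minimal sketch of name-based matching. The identifiers (SRC_A:0001, SRC_B:0002), the record layout, and the function names are hypothetical and chosen only for illustration; this is not the procedure used by any of the cited resources.

```python
from collections import defaultdict

# Toy records: (placeholder identifier, primary label, exact synonyms).
# The identifiers are invented for this sketch and are not real ontology IDs.
TERMS = [
    ("SRC_A:0001",
     "muscular pseudohypertrophy-hypothyroidism syndrome",
     ["Hoffman syndrome"]),
    ("SRC_B:0002",
     "B-cell immunodeficiency-limb anomaly-urogenital malformation syndrome",
     ["Hoffman syndrome"]),
]


def normalize(name):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(name.lower().split())


def match_by_name(terms):
    """Group term identifiers that share any normalized label or synonym."""
    index = defaultdict(set)
    for term_id, label, synonyms in terms:
        for name in [label, *synonyms]:
            index[normalize(name)].add(term_id)
    # Keep only names claimed by more than one term: the proposed "matches".
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


collisions = match_by_name(TERMS)
print(collisions)
```

The matcher proposes that the two records are the same concept solely because they share a synonym, which is exactly the kind of false equivalence that requires context or curation to reject.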
The figure of 7,500 rare diseases [12] is often quoted; however, in our systematic analysis across resources, we identified over 10,500 unique rare diseases [13]. Much of this heterogeneity results from a lack of consensus, both philosophically and practically, about how to classify diseases. Should diseases be classified based upon the anatomical structures they affect? Based on the doctor who first described them, such as ‘Batten disease’? Or based upon their pathogenic mechanism (e.g. infectious, deficiency; hereditary, physiological)? What if two variants in the same gene give rise to different suites of phenotypic features; are those the same disease? The ClinGen “Lumping and splitting” group (https://clinicalgenome.org/working-groups/lumping-and-splitting/) has undertaken the development of curation rules to help inform such decisions, and Orphanet has set standard procedures as well [14], but the community still lacks a comprehensive, multiple classification of diseases that takes into account many other features, such as treatment, onset, and environmental factors, to name a few. Furthermore, standard clinical enterprise terminologies such as SNOMED-CT or ICD-11 are not released often enough to keep up to date with constantly changing disease knowledge, and do not include codes for rarer diseases. Combined with slow code adoption and miscoding, this makes it very challenging to identify patients with a given disease in Electronic Health Record systems. In addition, numerous clinical and research systems, such as laboratory variant pipelines and repositories such as ClinVar, require up-to-date disease information. At the time the report is written or the data submitted, the disease entities should have identifiers that can be reconciled across sources or over time as knowledge changes.
Fundamentally, a mechanism is needed to computationally harmonize disease classifications in order to best take advantage of our collective disease knowledge and heterogeneous data assets. This requires a modern, granular, and interoperable approach to support improved coding that can take into account the community-developed, dynamic knowledge about diseases. Here we introduce the Mondo resource, which provides a sustainable and fully-provenanced approach to integrating disease concepts from numerous sources across disease categories with the goal of better supporting precision medicine, diagnostics, and mechanistic disease research [15].
Terminology systems (or coding systems) often come in the form of taxonomies (simple classifications) or ontologies (conceptual domain models). The utilization of an ontology for biomedical knowledge representation enables data integration and navigation of large amounts of heterogeneous data. Additionally, an ontology encodes hierarchical and other relationships and definitions, which supports modern computational methods. Mondo includes multiple parentage, meaning concepts can be classified in multiple ways, which allows for more sophisticated querying and analytics (Figure 1). For example, adult Refsum disease is a type of both ‘neurometabolic disease’ and ‘phytanoyl-CoA hydroxylase deficiency’. The new computationally friendly ICD-11 incorporates a pragmatic mechanism for post-coordinating terms and concepts to accommodate the granular detail of complex clinical content [16,17].
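Multiple parentage can be illustrated with a small directed acyclic graph. The sketch below assumes a toy hierarchy: the two parents of adult Refsum disease come from the example above, whereas the root node ‘disease’ and the dictionary layout are assumptions made only for illustration and do not reflect the actual structure of Mondo.

```python
# child -> set of direct parents; a term may have more than one parent.
PARENTS = {
    "adult Refsum disease": {
        "neurometabolic disease",
        "phytanoyl-CoA hydroxylase deficiency",
    },
    "neurometabolic disease": {"disease"},                # assumed root link
    "phytanoyl-CoA hydroxylase deficiency": {"disease"},  # assumed root link
    "disease": set(),
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def descendants_of(target, parents):
    """Return every term that has `target` among its ancestors."""
    return {term for term in parents if target in ancestors(term, parents)}


print(sorted(ancestors("adult Refsum disease", PARENTS)))
print(sorted(descendants_of("neurometabolic disease", PARENTS)))
```

Because the term sits under both parents, a query for either branch retrieves it, which a single-parent taxonomy cannot offer.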
Mondo currently harmonizes knowledge from 17 disease resources, collectively representing approximately 90,000 source concepts, and merges them into 22,157 distinct disease concepts. These resources were selected based on their scope, strengths, and usage (https://mondo.monarchinitiative.org/pages/sources/).
The HUGO Gene Nomenclature Committee (HGNC) standardizes human gene naming, but there is no comparable global standard for reconciling the heterogeneity of disease naming systems or ensuring their semantic interoperability. Disease names vary not only by language and region, but also over time due to social norms and improved understanding of underlying pathogenic mechanisms; moreover, different stakeholders who speak the same language may still prefer different names for different reasons. As a consequence, it is vital to have reliable disease identifiers that durably refer to the same concept over time, accommodating both changes and preferences. Mondo functions as a broker for disease nomenclature; the disease names and respective identifiers are a handle (i.e. stable reference), whereby synonyms, related knowledge, and definitions can evolve over time with full provenance. Mondo supports multiple synonyms and synonym types, and it annotates which labels are preferred by which groups. In Mondo, synonyms are classified as exact, broad, narrow, and related [18]. Mondo aims to accommodate all community requests and prioritizes community and medical expert recommendations for naming. More details about disease naming in Mondo are available at https://mondo.monarchinitiative.org/pages/disease-naming/.
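The idea of an identifier acting as a stable handle, with typed synonyms and group-specific preferred names attached to it, can be sketched as a small data structure. The four scopes are those listed above; the class names, fields, the identifier EX:0000001, the source tag, the group names, and the placeholder broad synonym are all hypothetical and do not describe the format in which Mondo is actually distributed.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    EXACT = "exact"
    BROAD = "broad"
    NARROW = "narrow"
    RELATED = "related"


@dataclass(frozen=True)
class Synonym:
    text: str
    scope: Scope
    source: str                            # provenance of the synonym
    preferred_by: frozenset = frozenset()  # groups that prefer this name


@dataclass
class DiseaseConcept:
    identifier: str                        # stable handle for the concept
    label: str
    synonyms: list = field(default_factory=list)

    def names(self, scope):
        """Return the synonym strings recorded with the given scope."""
        return [s.text for s in self.synonyms if s.scope is scope]

    def preferred_name(self, group):
        """Return the name a group prefers, falling back to the label."""
        for s in self.synonyms:
            if group in s.preferred_by:
                return s.text
        return self.label


# Hypothetical record; only the label and 'Hoffman syndrome' come from the text.
concept = DiseaseConcept(
    identifier="EX:0000001",
    label="muscular pseudohypertrophy-hypothyroidism syndrome",
    synonyms=[
        Synonym("Hoffman syndrome", Scope.EXACT, "EX_SOURCE:1",
                frozenset({"group_a"})),
        Synonym("placeholder broader name", Scope.BROAD, "EX_SOURCE:2"),
    ],
)

print(concept.names(Scope.EXACT))
print(concept.preferred_name("group_a"))
print(concept.preferred_name("group_b"))
```

Names and preferences can change on the record while the identifier stays fixed, so data annotated with the identifier remain interpretable as nomenclature evolves.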
Articulating similarities between concepts across a set of ontologies or terminologies is challenging and often unreliable due to the prevailing use of purely automated approaches (such as text matching). Such methods lack context and can match concepts incorrectly; moreover, a lack of declared rules, provenance, and versioning for these mappings makes them difficult to use for computational purposes. Mondo contains precise semantic mappings between source ontologies and terminologies, such as between OMIM, ICD10-CM, Orphanet, the National Cancer Institute Thesaurus (NCIt), and many others [19]. To generate an initial set of mappings between concepts, a computational strategy was first applied that predicts equivalency based on a variety of features, such as labels, synonyms, cross-references (including existing semantics such as those provisioned by Orphanet), graph structure, and priors that indicate classification features specific to each source [20]. The output of this computational equivalency assessment was reviewed by dedicated curation and technical teams and by the Mondo user community. Introduction of new concepts and subsequent refinements to the hierarchy and mappings are then carried out as needed.
Mondo leverages a wealth of expert knowledge and authoritative terminologies to create a resource that is optimized for computational use in diagnostic, clinical, and research applications. Released on a monthly cycle, Mondo is iteratively developed to meet the evolving needs and input of a diverse, global community of contributors. There are currently more than 100 clinical and domain expert contributors from over 25 institutions who help evolve the resource, including ClinGen [21], OMIM [22], GARD [23], Orphanet [24], and others. Mondo also has a rich community of users that have implemented Mondo in a variety of settings. These include incorporation into standards, such as Global Alliance for Genomics and Health (GA4GH) [25] and ISO standards such as Phenopackets [26], and into the HL7 Terminology Authority [27]; use in tools and data management programs; and use in databases such as ClinGen [21], MedGen [28], Gabriella Miller Kids First [29], Pharos [30], and many others. Full lists of users (https://mondo.monarchinitiative.org/pages/users/) and contributors (https://mondo.monarchinitiative.org/pages/contributors/) are available.
Mondo is more than just a source of robust and reproducible mappings between disease terminologies: n-of-1 diseases, rare diseases, environmentally influenced diseases, and complex genetic diseases that may not be documented in other sources can also be added to Mondo and partitioned out for different uses. By integrating fully provenanced knowledge from the many existing and ever-evolving disease resources, acquired through years of work by researchers, clinicians, terminologists, and scientists from around the world, Mondo aims to make unified, comprehensive disease knowledge readily accessible to the scientific community and to grow its value through logical connections across resources.
Data Availability
All data produced are available online at https://github.com/monarch-initiative/mondo.
Acknowledgements
Mondo is generously supported by the NIH National Human Genome Research Institute Phenomics First Resource, NIH-NHGRI # 1 RM1 HG010860–01, a Center of Excellence in Genomic Science; and an NIH Office of the Director Grant #5R24OD011883 for the Monarch Initiative. Thank you to Damien Goutte-Gattat for assistance in mining GitHub for our list of contributors.