Abstract
A common concern in clinical research is improving the infrastructure to facilitate the reuse of clinical data and to deal with interoperability issues. The FAIR (Findable, Accessible, Interoperable and Reusable) Data Principles enable the reuse of data by providing descriptive metadata that explains what the data represents and where the data can be found. In addition to aiding scholars, the FAIR guidelines enhance the machine-readability of data, making it easier for algorithms to find and use the data. Data is therefore more likely to be interpreted accurately, which helps to maximize the value obtained from research work. FAIR-ification is done by embedding knowledge in the data. This can be achieved by annotating the data with terminologies and concepts from ontologies expressed in the Web Ontology Language (OWL). By attaching a terminological value, we add semantics to a specific data element, increasing its interoperability and reuse. However, FAIR-ification of data can be a complicated and time-consuming process. Our main objective is to disentangle the process of making data FAIR by using both domain and technical expertise. We apply this process in a workflow that involves the FAIR-ification of four independent public HNSCC datasets from The Cancer Imaging Archive (TCIA). This approach converts the data from the four datasets into Linked Data using RDF triples and then annotates these datasets using standardized terminologies. The annotations link the four datasets together through their semantics, so that a single query can retrieve the intended information from all of them.
Introduction
The world is entering an era of secondary use of real-world data. With the rapid growth of healthcare data available across the world [1], there is enormous potential for the “secondary use or reuse” of this data, as it leads to more time-effective and cost-effective research. This data is collected during clinical trials or as healthcare records, and its secondary use is generally for purposes other than the original one [2]. Potential use cases include enhancing the delivery of health care for individuals and broadening the existing knowledge about diseases and their respective treatments through increased research on and analysis of such data [3]. Reuse of clinical data has been prevalent for decades and has proven effective in detecting early stages of a disease and in predicting a patient’s length of stay or overall survival [4]. Despite the massive amount of clinical data generated every year, the secondary use of this data remains very limited.
Moreover, patients with rare diseases are often at a disadvantage because of the rarity of their disease. Rare diseases yield less data, with very limited clinical research and management [5]. According to [6], the five-year relative survival rate of patients with rare cancers was 48.5%, whereas that of patients with common cancers was 63.4%. In addition, the volume of publicly “donated data” (e.g., in TCIA or TCGA) is very small relative to the total amount of clinical data generated in CRFs. A “big dataset” in rare cancers consists of data from a few hundred patients, which is a different scale from social media posts, Twitter text or natural photographs, which number in the many millions. Hence, data quality and semantics play a larger role in healthcare. Furthermore, this small amount of data contains personal or idiosyncratic coding and schemas, and the way the data is structured already assumes a very high degree of implicit knowledge. How, then, do we use and share this data in a privacy-preserving way, without doctors and investigators losing control of their own data? How do we make this a community effort among different domains of expert knowledge (i.e., clinical versus machine learning versus data science)?
The FAIR (Findable, Accessible, Interoperable, and Reusable) Data Principles were designed to support the reusability of scholarly data. FAIR provides an infrastructure that maximizes the usefulness of research data by providing descriptive metadata about the data. This helps to structure the information in the data so that it can be interpreted by algorithms without direct human supervision [7]. Note that FAIR does not necessarily imply openness of data; the key difference between FAIR and Open data lies in the term “access”. Under FAIR, data may be accessible only to a defined set of people, for example a group of researchers and clinicians working together [8]. In contrast, according to the Open Data Handbook, “Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike” [9]. The converse also holds: Open Data is not always FAIR.
FAIR data deals with the challenge of enabling syntactic and semantic interoperability for comprehensive and reusable processing of big data. Syntactic interoperability refers to standardized communication and data exchange between two or more systems (which may use different interfaces and programming languages). Semantic interoperability is achieved when the exchanged data is also understood by the systems involved, which makes it a more challenging yet more desirable goal [10]. Supporting both types of interoperability is one of the main requirements for efficiently extracting valuable information from large amounts of multi-center clinical data [11].
This is achieved using the concept of Linked Data. Linked Data is a set of guidelines for creating machine-readable and interlinked data on the Web. This collection of Linked Data is what makes up the Semantic Web [12]. To construct Linked Data, data is often represented in a standard Linked Data format called RDF (Resource Description Framework). RDF is a standard for defining relationships between data objects and interconnecting them. Every resource is identified by a URI (Uniform Resource Identifier), which makes the data ‘linkable’. RDF data is represented in triples, each consisting of a Subject, a Predicate and an Object [13].
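As a minimal illustration of the triple model described above, the following Python sketch represents one statement as a subject, predicate and object and serializes it in N-Triples syntax. The `http://example.org/` URIs, the `Triple` type and the helper `to_ntriples` are placeholders chosen for illustration; they do not come from the datasets or from the Triplifier.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject and predicate are URIs, object is a URI or literal."""
    subject: str
    predicate: str
    obj: str
    obj_is_literal: bool = False


def to_ntriples(triple: Triple) -> str:
    """Serialize a triple as a single N-Triples line."""
    if triple.obj_is_literal:
        # Escape backslashes and double quotes inside plain literals.
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


# Hypothetical statement: patient 1 has the gender value "female".
example = Triple(
    subject="http://example.org/patient/1",
    predicate="http://example.org/hasGender",
    obj="female",
    obj_is_literal=True,
)
print(to_ntriples(example))
```

Because the subject and predicate are URIs, a second dataset that reuses the same predicate URI becomes directly comparable with the first, which is the sense in which the data is ‘linkable’.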
RDF data can be stored in an RDF graph database (also called a semantic graph database or triple store), which acts as a query endpoint for accessing the data efficiently. Accessing data from these endpoints requires technologies such as SPARQL (SPARQL Protocol and RDF Query Language), a W3C-standardized semantic query language for extracting valuable information from a set of triples [14].
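To make the idea of a query endpoint concrete, the sketch below builds an HTTP request that follows the SPARQL 1.1 Protocol (a GET request with a `query` parameter, asking for JSON results). The endpoint URL and the repository id `hn1` are assumptions for illustration, as is the helper name `build_sparql_request`; the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

# Assumed endpoint: GraphDB commonly exposes repositories under
# http://localhost:7200/repositories/<id>; "hn1" is a made-up repository id.
ENDPOINT = "http://localhost:7200/repositories/hn1"

# A generic query that returns any five triples from the repository.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# To run the query against a live endpoint, one could then call:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```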
Methods
Description of the datasets
The data used in this submission is from the following data collections available on The Cancer Imaging Archive (TCIA).
RADIOMICS-HN1 from Maastro in Maastricht. This data collection has clinical data and Computed Tomography (CT) from 137 patients with head and neck squamous cell carcinoma (HNSCC), treated with radiotherapy [16].
HNSCC from MD Anderson Cancer Center in Houston, which contains a cohort of clinical meta-data and contrast-enhanced computed tomography (CECT) scans for 495 oropharyngeal cancer (OPC) patients [17].
OPC Radiomics from Princess Margaret Cancer Centre in Toronto. The collection includes data from 606 non-metastatic p16-positive oropharyngeal cancer (OPC) patients, treated with radiotherapy or chemo-radiotherapy between 2005 and 2010 [18].
HEAD-NECK-PET-CT from Montreal, which includes FDG-PET/CT and radiotherapy planning CT imaging data of 298 patients with head-and-neck cancer (H&N). Pre-treatment FDG-PET/CT scans for the patients were taken between April 2006 and November 2014 [19].
FAIR data dashboard (demo module)
Step 1
The first step in the demo module is to convert the data in the four datasets into RDF triples. This conversion is done using our in-house software, the Triplifier [15]. The Triplifier accepts a relational database as input and produces two files. One is the OWL file, which describes the structure of the data (a data-specific ontology). The other is the Output file, which holds the actual instance information, that is, the data from the dataset converted into RDF triples (as shown in fig 3).
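The general idea of turning a relational row into flat triples can be sketched as follows. This is not the Triplifier's implementation or its actual output schema: the function `row_to_triples`, the URI layout, the table name and the column names are all assumptions made for illustration. Only the `http://data.local/` prefix is taken from the graph name used in this workflow.

```python
from typing import Dict, List, Tuple

# Prefix taken from the graph name in the text; the URI layout below is assumed.
BASE = "http://data.local/"


def row_to_triples(table: str, key: str, row: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn one relational row into flat (subject, predicate, literal) triples."""
    subject = f"{BASE}{table}/{row[key]}"
    triples = []
    for column, value in row.items():
        if column == key:
            continue  # the key is already encoded in the subject URI
        predicate = f"{BASE}{table}.{column}"
        triples.append((subject, predicate, value))
    return triples


# Made-up row; column names and values are illustrative only.
row = {"id": "p001", "gender": "F", "t_stage": "T2"}
for s, p, o in row_to_triples("patients", "id", row):
    print(f'<{s}> <{p}> "{o}" .')
```

Triples produced this way mirror the table layout one-to-one, which is why a later annotation step is needed to attach shared meaning to the column-derived predicates.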
To obtain the triples, the user can clone the code repository linked to this submission: https://gitlab.com/UM-CDS/projects/flyover_project/-/tree/MedRxiv_branch. From the command line, the user navigates to the project folder and runs the command ‘docker-compose up -d’, which builds four Docker images: Postgres, pgAdmin, Triplifier and GraphDB. The Docker files execute the following process:
It builds a Postgres and a pgAdmin container.
SQL insert queries then run against Postgres and create four databases (which can be checked in pgAdmin), one for each of the four datasets.
Four Triplifier projects start in parallel, one for each of the four clinical “projects”.
Each Triplifier project exits upon successful completion.
GraphDB (our local graph database) now has four RDF repositories, each with two graphs:
– <http://data.local/> with the RDF triples.
– <http://ontology.local/> with a data specific ontology.
Step 2
Upon successful completion of the Triplifier, all four GraphDB repositories should contain the RDF triples pertaining to each clinical project. Next, the second Docker Compose file can be built by navigating to the annotations folder inside the project https://gitlab.com/UM-CDS/projects/flyover_project/-/tree/MedRxiv_branch and running the command ‘docker-compose up -d’, which builds the following two images:
An annotations script that adds a new graph <http://annotation.local/> to all four repositories in GraphDB. This graph contains triples with new semantic mappings for the flat RDF triples from the Triplifier.
A dashboard, running on port 8050 of the user's local system, in which the user can visualize the data from all four annotated datasets (now linked together via their semantics). This dashboard uses SPARQL queries to retrieve the triples from GraphDB.
Construction of Annotations
The RDF triples from the Triplifier (fig 4) have a flat schema and do not carry any semantics. This is where ontologies come into play. Ontologies help to make the data both syntactically and semantically usable. An ontology can be viewed as a pre-defined data dictionary that also contains information about how the entities in a field are related to each other. For our data, we use widely adopted ontologies such as the Radiation Oncology Ontology (ROO) and the National Cancer Institute Thesaurus (NCIT). This process requires manual work and collaboration between clinicians, language experts and researchers, as it is crucial to understand the clinical data (which could initially be in any language). The annotations include new meaningful predicates and class equivalences from pre-defined ontologies, which add semantics to the flat RDF triples from the Triplifier. This process is repeated for all four datasets (see figs 5-7).
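One way such class equivalences could be written into an annotation graph is with a SPARQL 1.1 `INSERT DATA` update, sketched below. This is an assumption about the mechanism, not a description of the annotation script in the repository. The graph name `http://annotation.local/` and the `http://ontology.local/` prefix come from this workflow, but every class URI in `MAPPINGS` is a placeholder: they are neither real Triplifier output nor real ROO or NCIT codes.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"
ANNOTATION_GRAPH = "http://annotation.local/"  # graph name taken from the text

# Hypothetical mapping: data-specific class URI -> standard ontology class URI.
# Both sides are placeholders, not real Triplifier output or ROO/NCIT codes.
MAPPINGS = {
    "http://ontology.local/patients.gender": "http://example.org/ontology/BiologicalSex",
    "http://ontology.local/patients.t_stage": "http://example.org/ontology/TStage",
}


def build_annotation_update(mappings: dict, graph: str = ANNOTATION_GRAPH) -> str:
    """Build a SPARQL 1.1 INSERT DATA update that adds class equivalences."""
    lines = [
        f"    <{local}> <{OWL_EQUIVALENT_CLASS}> <{standard}> ."
        for local, standard in sorted(mappings.items())
    ]
    body = "\n".join(lines)
    return f"INSERT DATA {{\n  GRAPH <{graph}> {{\n{body}\n  }}\n}}"


print(build_annotation_update(MAPPINGS))
```

Keeping the mappings in a separate named graph leaves the flat triples untouched, so an annotation can be corrected or replaced without re-running the conversion.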
SPARQL endpoint
The annotations, with the newly added semantics and ontological tags for the RDF triples, are uploaded to the SPARQL endpoint (GraphDB) in their respective repositories, which already contain the Output file and the OWL file. Now that meaningful annotations have been added for all four datasets and the datasets are linked through their semantics, one common federated SPARQL query should be able to retrieve the required results from all of them.
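A federated query of this kind can be sketched with the `SERVICE` keyword from SPARQL 1.1 Federated Query, using one block per repository combined with `UNION`. The endpoint URLs, repository ids, the predicate in `PATTERN` and the helper `build_federated_query` are all assumptions for illustration; a real query would use the ROO and NCIT terms introduced by the annotations.

```python
from typing import List

# Assumed local repository endpoints; the ids are made up for illustration.
ENDPOINTS = [
    "http://localhost:7200/repositories/hn1",
    "http://localhost:7200/repositories/hnscc",
    "http://localhost:7200/repositories/opc",
    "http://localhost:7200/repositories/head_neck_pet_ct",
]

# Placeholder pattern; a real query would use terms from the annotation graph.
PATTERN = "?patient <http://example.org/ontology/hasBiologicalSex> ?sex ."


def build_federated_query(endpoints: List[str], pattern: str) -> str:
    """Combine one SERVICE block per endpoint with UNION (SPARQL 1.1 Federated Query)."""
    blocks = [f"  {{ SERVICE <{url}> {{ {pattern} }} }}" for url in endpoints]
    union = "\n  UNION\n".join(blocks)
    return f"SELECT ?patient ?sex WHERE {{\n{union}\n}}"


print(build_federated_query(ENDPOINTS, PATTERN))
```

The same graph pattern can be sent to every repository only because the annotations map each dataset's local terms onto shared ontology terms.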
Interactive Dashboard
As a potential use case of this project, we built an interactive visualization dashboard (as shown in fig 8) that runs locally for the demo module. It uses the newly created Linked Data from all four datasets and lets the user view the statistics of the datasets, either together or individually. The dashboard is driven entirely by the results of federated SPARQL queries, which are parsed by a Python program that presents the output to clinicians as interactive graphs and charts.
Note, however, that this dashboard is for demonstration purposes only. It has been created locally by centralizing the data from the four repositories, and it is not a privacy-preserving distributed dashboard.
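The parsing step can be illustrated with the standard SPARQL 1.1 Query Results JSON format, in which each row is a binding that maps a variable name to an object with a `value` field. The sketch below counts the values of one variable, the kind of summary a chart might display. The response text, the variable names and the helper `count_values` are made up for illustration and are not taken from the dashboard code.

```python
import json
from collections import Counter

# Made-up response in the W3C SPARQL 1.1 Query Results JSON format.
RESPONSE = """
{
  "head": {"vars": ["patient", "sex"]},
  "results": {"bindings": [
    {"patient": {"type": "uri", "value": "http://example.org/patient/1"},
     "sex": {"type": "literal", "value": "female"}},
    {"patient": {"type": "uri", "value": "http://example.org/patient/2"},
     "sex": {"type": "literal", "value": "male"}},
    {"patient": {"type": "uri", "value": "http://example.org/patient/3"},
     "sex": {"type": "literal", "value": "male"}}
  ]}
}
"""


def count_values(response_text: str, variable: str) -> Counter:
    """Count how often each value of `variable` occurs in the result bindings."""
    bindings = json.loads(response_text)["results"]["bindings"]
    # Unbound variables are simply absent from a binding, so skip those rows.
    return Counter(b[variable]["value"] for b in bindings if variable in b)


counts = count_values(RESPONSE, "sex")
print(counts)
```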
Example SPARQL Queries
The first example we consider is a researcher or user trying to get the ID values, gender, AJCC stage and tumour location of patients from the FAIRified data. The following query would be able to get these values from the datasets after they have been semantically mapped to their ontologies. The user can copy and run the query in the SPARQL endpoint (GraphDB) running on their local host.
The second example we consider is a researcher or user trying to get the T stage and M stage values for individual patients from the annotated data. The following query would be able to get these values from the datasets after they have been semantically mapped to their ontologies.
Discussion and Conclusion
By designing the above process to create FAIR data from the datasets, we achieve both syntactic and semantic interoperability in two manageable steps: the first by creating standard RDF triples using the Triplifier, and the second by creating an annotations graph that adds semantics to those triples.
The task requires cooperation from the user, who must answer a few questions about the data and provide insights from the metadata for the FAIR-ification. However, it does not demand a high degree of programming knowledge or tooling and technical skills from them, which makes FAIR data readily available at the local stations.
Data Availability
Data is already publicly available.
https://wiki.cancerimagingarchive.net/display/Public/HNSCC
https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=33948764
https://wiki.cancerimagingarchive.net/display/Public/Head-Neck-Radiomics-HN1
https://wiki.cancerimagingarchive.net/display/Public/Head-Neck-PET-CT