Abstract
Background Fast Healthcare Interoperability Resources (FHIR) is a server specification and data model that allows for EHR systems to represent clinical metadata using a consistent API. There is a critical mass of EHR and clinical trial data stored in FHIR based systems. Research analysts can take advantage of existing FHIR tooling for de-identification, pseudonymization, and anonymization. More recently the BiodataCatalyst consortium has proposed the Portable Format for Bioinformatics (PFB) which is a carrier format for describing raw data and the data model in which it is structured, based on an efficient binary format (AVRO). PFB allows an entire cohort of metadata to be loaded into a research data system. Here, we describe an open source utility that will scan FHIR based systems and create PFB based archives.
Results pfb_fhir scans data from FHIR based clinical data systems and converts the data into a self contained PFB file. This utility identifies types, customizations (extensions), and element connections. It then converts all of these components into a graph model compatible for storage in the PFB specification. The structure of the original FHIR system is faithfully reproduced using the PFB schema description system. All records from the system are downloaded, converted and stored as vertices in a graph described by the PFB file. This system has been tested against a number of different FHIR installations, including ones hosted by dbGAP, The Kids First Data Resource and AnVIL.
Conclusions pfb_fhir helps to unlock the potential of EHR and clinical trial data. pfb_fhir allows researchers to easily scan and store FHIR resources and create self contained PFB archives, called FHIR in PFB. These archive files can easily be moved to new data systems, allowing the clinical data to be connected to more complex genomic analysis and data science platforms. The FHIR in PFB archives generated by pfb_fhir have been loaded into data platforms including the Broad’s Terra system, Gen3 based data system, custom graph query engines and Jupyter notebooks. This flexibility will enable genomics investigators to do more integrated genotype to phenotype association analysis using whichever tools suit their line of research.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
1U24HG010263-01
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
All data produced are available online at https://github.com/bmeg/iceberg-schema-tools
Abbreviations
- ANSI
- American National Standards Institute
- API
- Application Programming Interface
- AVRO
- A data serialization framework developed within Apache’s Hadoop project
- DRS
- GA4GH’s Data Repository Service
- EHR
- Electronic Health Record
- FHIR
- Fast Healthcare Interoperability Resources
- GA4GH
- Global Alliance for Genomic Health
- HL7
- Health Level Seven
- NCPI
- NIH Cloud Platform Interoperability
- NIH
- National Institutes of Health
- PFB
- Portable Format for Bioinformatics
- URI
- Uniform Resource Identifier
- UUID
- Universally Unique Identifier
- dbGaP
- Database of Genotypes and Phenotypes