A unified data infrastructure to support large-scale rare disease research

Lennart F. Johansson; Steve Laurie; Dylan Spalding; Spencer Gibson; David Ruvolo; Coline Thomas; Davide Piscia; Fernanda de Andrade; Gerieke Been; Marieke Bijlsma; Han Brunner; Sandi Cimerman; Farid Yavari Dizjikan; Kornelia Ellwanger; Marcos Fernandez; Mallory Freeberg; Gert-Jan van de Geijn; Roan Kanninga; Vatsalya Maddi; Mehdi Mehtarizadeh; Pieter Neerincx; Stephan Ossowski; Ana Rath; Dieuwke Roelofs-Prins; Marloes Stok-Benjamins; K. Joeri van der Velde; Colin Veal; Gerben van der Vries; Marc Wadsley; Gregory Warren; Birte Zurek; Thomas Keane; Holm Graessner; Solve-RD consortium; Sergi Beltran; Morris A. Swertz; Anthony J. Brookes

doi:10.1101/2023.12.20.23299950

Abstract

The Solve-RD project brings together clinicians, scientists, and patient representatives from 51 institutes spanning 15 countries to collaborate on genetically diagnosing (“solving”) rare diseases (RDs). The project aims to significantly increase the diagnostic success rate by co-analysing data from thousands of RD cases, including phenotypes, pedigrees, exome/genome sequencing and multi-omics data. Here we report on the data infrastructure devised and created to support this co-analysis. This infrastructure enables users to store, find, connect, and analyse data and metadata in a collaborative manner. Pseudonymised phenotypic and raw experimental data are submitted to the RD-Connect Genome-Phenome Analysis Platform and processed through standardised pipelines. The resulting files and newly produced omics data are sent to the European Genome-phenome Archive, which adds unique file identifiers and provides long-term storage and controlled access services. MOLGENIS “RD3” and Café Variome “Discovery Nexus” connect data and metadata and offer discovery services, and secure cloud-based “Sandboxes” support multi-party data analysis. This proven infrastructure design provides a blueprint for other projects that need to analyse large amounts of heterogeneous data.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

The Solve-RD project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 779257. The RD-Connect Genome-Phenome Analysis Platform received funding from EU projects RD-Connect, Solve-RD and EJP-RD (Grant Numbers FP7 305444, H2020 779257, H2020 825575), Instituto de Salud Carlos III (Grant Numbers PT13/0001/0044, PT17/0009/0019; Instituto Nacional de Bioinformatica, INB) and ELIXIR Implementation Studies. The UMCG VRE and RD3 received funding from the EU projects Solve-RD, EJP-RD and CINECA (H2020 779257, H2020 825575, H2020 825775, respectively) and NWO VIDI grant number 917.164.455.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Ethics committee of the Eberhard Karl University of Tubingen gave ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

Data will be deposited at the EGA; accession numbers will be provided. Pseudonymised phenotypic information for all individuals and their genetic variants are accessible through the RD-Connect GPAP (https://platform.rd-connect.eu/) upon validated registration. All raw and processed data files will be made available at the EGA (Solve-RD study EGAS00001003851) upon approval by the data access committee. The Ethics committee of the Eberhard Karl University of Tubingen gave ethical approval for this work.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.