ABSTRACT
Background A major obstacle faced by rare disease families is obtaining a genetic diagnosis. The average “diagnostic odyssey” lasts over five years, and causal variants are identified in under 50%. The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing (GS) for diagnosis and gene discovery. Families are consented for sharing of sequence and phenotype data with researchers, allowing development of a Critical Assessment of Genome Interpretation (CAGI) community challenge, placing variant prioritization models head-to-head in a real-life clinical diagnostic setting.
Methods Predictors were provided a dataset of phenotype terms and variant calls from GS of 175 RGP individuals (65 families), including 35 solved training set families, with causal variants specified, and 30 test set families (14 solved, 16 unsolved). The challenge tasked teams with identifying the causal variants in as many test set families as possible. Ranked variant predictions were submitted with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics, a weighted score based on rank position of true positive causal variants and maximum F-measure, based on precision and recall of causal variants across EPCR thresholds.
Results Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performing teams recalled the causal variants in up to 13 of 14 solved families by prioritizing high quality variant calls that were rare, predicted deleterious, segregating correctly, and consistent with reported phenotype. In unsolved families, newly discovered diagnostic variants were returned to two families following confirmatory RNA sequencing, and two prioritized novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant, in an unsolved proband with phenotype overlap with asparagine synthetase deficiency.
Conclusions By objective assessment of variant predictions, we provide insights into current state-of-the-art algorithms and platforms for genome sequencing analysis for rare disease diagnosis and explore areas for future optimization. Identification of diagnostic variants in unsolved families promotes synergy between researchers with clinical and computational expertise as a means of advancing the field of clinical genome interpretation.
Competing Interest Statement
Authors S.Z., I.L., E.R., P.M., and R.B., own shares of enGenome srl. Authors F.D.P. and G.N. are employees of enGenome srl. Authors T.J., R.S., S.G.V., N.S., A.R., U.S., N.T., are employees of TCS Ltd. Authors P.J.C., C.K., K.N., and P.S. are employees of Invitae Ltd. H.L.R. receives support from Illumina and Microsoft for rare disease gene discovery and diagnosis. A.O’D-L. is a member of the scientific advisory board for Congenica Inc and the Simons Foundation SPARK for Autism study and co-chairs the clinical advisory board for CAGI. S.E.B receives support at UC Berkeley from a research agreement from TCS. All other authors report no competing interests.
Funding Statement
S.L.S. is supported by a fellowship from the Manton Center for Orphan Disease Research at Boston Children’s Hospital. G.L. was supported by Fonds de recherche en sante du Quebec. V.S.G. was supported by the Mass General Brigham Training Program in Precision and Genomic Medicine (NHGRI T32 HG10464). Data and diagnoses were provided by Broad Institute of MIT and Harvard Center for Mendelian Genomics with funding to H.L.R. and A.O’D-L., by the National Human Genome Research Institute (NHGRI) grants UM1HG008900, U01HG011755, and R01HG009141 and by the Chan Zuckerberg Initiative through an advised fund of the Silicon Valley Community Foundation grant 2020-224274. This study was also supported by the NHGRI CAGI grant U24 HG007346 (to S.E.B. and P.R.), National Institute of Child Health and Human Development grant 1R01HD103805‐01, and National Institute of General Medical Sciences R35GM124952, along with funding from King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) grants URF/1/4355-01-01, URF/1/4675-01-01, FCC/1/1976-34-01.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
The Rare Genomes Project study is approved by the Mass General Brigham Institutional Review Board (IRB) protocol 2016P001422. Written informed consent for the publication of clinical details was obtained from the participants or legal guardians.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
LIST OF ABBREVIATIONS
- ACMG/AMP
- American College of Medical Genetics and Genomics and the Association for Molecular Pathology
- AD
- autosomal dominant
- AFR
- African/African American
- AMR
- Admixed American
- AR
- autosomal recessive
- ASJ
- Ashkenazi Jewish
- CAGI
- Critical Assessment of Genome Interpretation
- CSF
- cerebrospinal fluid
- DM
- disease mutation
- EPCR
- estimated probability of causal relationship
- F-max
- maximum F-measure
- HPO
- Human Phenotype Ontology
- IGV
- Integrative Genome Viewer
- indel
- small insertion/deletion
- LP
- likely pathogenic
- NFE
- Non-Finnish European
- P
- pathogenic
- PHS
- Pitt-Hopkins syndrome
- RGP
- Rare Genomes Project
- SAS
- South Asian
- SE
- standard error
- SNV
- single nucleotide variant
- SV
- structural variant
- VCF
- variant call file
- VEP
- Variant Effect Predictor
- VUS
- variant of uncertain significance
- XLR
- X-linked recessive