Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation

Jonas A. Gustafson; Sophia B. Gibson; Nikhita Damaraju; Miranda PG Zalusky; Kendra Hoekzema; David Twesigomwe; Lei Yang; Anthony A. Snead; Phillip A. Richmond; Wouter De Coster; Nathan D. Olson; Andrea Guarracino; Qiuhui Li; Angela L. Miller; Joy Goffena; Zachery Anderson; Sophie HR Storz; Sydney A. Ward; Maisha Sinha; Claudia Gonzaga-Jauregui; Wayne E. Clarke; Anna O. Basile; André Corvelo; Catherine Reeves; Adrienne Helland; Rajeeva Lochan Musunuri; Mahler Revsine; Karynne E. Patterson; Cate R. Paschal; Christina Zakarian; Sara Goodwin; Tanner D. Jensen; Esther Robb; The 1000 Genomes ONT Sequencing Consortium; University of Washington Center for Rare Disease Research (UW-CRDR); Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium; W. Richard McCombie; Fritz J. Sedlazeck; Justin M. Zook; Stephen B. Montgomery; Erik Garrison; Mikhail Kolmogorov; Michael C. Schatz; Richard N. McLaughlin; Harriet Dashnow; Michael C. Zody; Matt Loose; Miten Jain; Evan E. Eichler; Danny E. Miller

doi:10.1101/2024.03.05.24303792

ABSTRACT

Fewer than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so that we can improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
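For readers unfamiliar with the read N50 statistic quoted in the abstract: it is the read length L such that reads of length L or longer together contain at least half of all sequenced bases. The following is a minimal sketch of that definition only, not the consortium's pipeline; the function name `read_n50` and the toy read lengths are assumptions made for illustration.

```python
def read_n50(lengths):
    """Return the N50 of a collection of read lengths (in bases).

    Hypothetical helper for illustration; not taken from the study's code.
    """
    if not lengths:
        raise ValueError("need at least one read length")
    total = sum(lengths)
    running = 0
    # Walk reads from longest to shortest until the running sum
    # covers at least half of all sequenced bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy read lengths in bases; illustrative values, not data from the study.
toy_lengths = [80_000, 60_000, 54_000, 30_000, 10_000, 5_000]
print(read_n50(toy_lengths))
```

Because N50 weights reads by the bases they contribute, a handful of very long reads can raise it well above the median read length, which is why it is commonly reported alongside depth of coverage for long-read datasets.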

Competing Interest Statement

WDC, ML, FS, and DEM have received research support and/or consumables from ONT. WDC, JG, FS, and DEM have received travel funding to speak on behalf of ONT. DEM is on a scientific advisory board at ONT. FS has received research support from Illumina, Genentech, and PacBio. SBM is an advisor to BioMarin, MyOme, and Tenaya Therapeutics. EEE is a scientific advisory board (SAB) member of Variant Bio, Inc. DEM holds stock options in MyOme.

Funding Statement

SBG is supported by NIH grant 5T32HG000035-29; WDC is a recipient of a postdoctoral fellowship from FWO [12ASR24N]; EG and AG are supported by NIH grants R01HG013017 and U01DA057530 and NSF grant 2118744; SG is supported by NIH grant 5R50CA243890; TDJ is supported by NIH grant T32HG000044; MK is supported by Intramural NIH funding; SBM, TDJ, and ER are supported by NIH grant U01HG011762; MCS is supported by NIH grants U24HG010263, R03CA272952, and U01CA253481 and the Lustgarten Foundation grant 90101412; FJS is supported by NIH grants 1U01HG011758-01, 1UG3NS132105-01, and U01AG058589; AAS is supported by an NSF postdoctoral research fellowship in biology [NSF 22-623]; RNM and LY are supported by NIH grants 5R35GM142733-03 and 5R21AI174130-02; EEE is supported by NIH grant HG010169 and is an investigator of the Howard Hughes Medical Institute; DEM is supported by the NIH Director's Early Independence Award DP5OD033357. The GREGoR Consortium is funded by the National Human Genome Research Institute of the National Institutes of Health, through the following grants: U01HG011758, U01HG011755, U01HG011745, U01HG011762, U01HG011744, and U24HG011746.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study uses only publicly available cell lines from the 1000 Genomes Project available at Coriell and data available at public sources such as at https://www.internationalgenome.org/data/.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

Data for all samples sequenced as part of the 1000 Genomes Project ONT Sequencing Consortium are publicly available at https://s3.amazonaws.com/1000g-ont/index.html. Data from the 100 samples reported here, as well as summary analysis data, are available at https://s3.amazonaws.com/1000g-ont/index.html?prefix=FIRST_100_FREEZE/. Data and code related to pangenome analyses are available at https://github.com/AndreaGuarracino/1000G-ONT-F100-PGGB.

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.