Constructing germline research cohorts from the discarded reads of clinical tumor sequences

Alexander Gusev; Stefan Groha; Kodi Taraszka; Yevgeniy R. Semenov; Noah Zaitlen

doi:10.1101/2021.04.09.21255197

ABSTRACT

Background Hundreds of thousands of cancer patients have had targeted (panel) tumor sequencing to identify clinically meaningful mutations. In addition to improving patient outcomes, this activity has led to significant discoveries in basic and translational domains. However, the targeted nature of clinical tumor sequencing has a limited scope, especially for germline genetics. In this work, we assess the utility of discarded, off-target reads from tumor-only panel sequencing for recovery of genome-wide germline genotypes through imputation.

Methods We develop a framework for inference of germline variants from tumor panel sequencing, including imputation, quality control, inference of genetic ancestry, germline polygenic risk scores, and HLA alleles. We benchmark our framework on 833 individuals with tumor sequencing and matched germline SNP array data. We then apply our approach to a prospectively collected panel sequencing cohort of 25,889 tumors.

Results We demonstrate high to moderate accuracy of each inferred feature relative to direct germline SNP array genotyping: individual common variants were imputed with a mean accuracy (correlation) of 0.86; genetic ancestry was inferred with a correlation of >0.98; polygenic risk scores were inferred with a correlation of >0.90; and individual HLA alleles were inferred with correlation of >0.89. We demonstrate a minimal influence on accuracy of somatic copy number alterations and other tumor features. We showcase the feasibility and utility of our framework by analyzing 25,889 tumors and identifying relationships between genetic ancestry, polygenic risk, and tumor characteristics that could not be studied with conventional data.

Conclusions We conclude that targeted tumor sequencing can be leveraged to build rich germline research cohorts from existing data, and make our analysis pipeline publicly available to facilitate this effort.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

N.Z. and K.T. were supported by NIH grants K25HL121295, U01HG009080, R01HG006399, R01CA227237, R01ES029929, R01HG011345, the DoD grant W81XWH-16-2-0018, and the Chan Zuckerberg Science Initiative. A.G. and S.G. were supported by R01CA227237, R01CA244569, and the Doris Duke Charitable Foundation. A.G. was supported by the Louis B. Mayer Foundation and the Claudia Adams Barr Foundation.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

PROFILE samples were selected and sequenced from patients who were consented under institutional review board (IRB) approved protocol 11-104 and 17-000 from the Dana-Farber/Partners Cancer Care Office for the Protection of Research Subjects. Written informed consent was obtained from participants prior to inclusion in this study. Secondary analyses of previously collected data were performed with approval from the Dana-Farber IRB (DFCI IRB protocol 19-033 and 19-025; waiver of HIPAA authorization approved for both protocols).

All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.

Yes

Data Availability

The raw sequencing data are not publicly available because the research participant consent, privacy policy, and terms of service do not include authorization to share identifiable data. The full analysis workflow is available at: https://github.com/gusevlab/panel-imp A containerized version of the imputation pipeline is available at: https://hub.docker.com/r/stefangroha/stitch_gcs

https://github.com/gusevlab/panel-imp

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.