Abstract
We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
Acknowledgements
We are grateful to all participants of all the studies that have contributed data to the HRC. J.M. acknowledges support from the ERC (grant 617306). W.K. acknowledges support from the Wellcome Trust (grant WT097307). S. McCarthy and R.D. acknowledge support from Wellcome Trust grant WT090851. A full list of acknowledgments for the cohorts is given in the Supplementary Note.
Author information
Contributions
The HRC was initially conceived in discussions between J.M., G.A., R.D., M.I.M. and M.B. Analysis and methods development were carried out by S. McCarthy, S.D., W.K., O.D., A.R.W., P.D. and H.M.K. Supervision of the research was provided by J.M., G.A. and R.D. The Michigan Imputation Server was developed by C.F., L. Forer, S.S. and G.A. The Sanger Imputation Service was developed by P.D., S. McCarthy and R.D. The Oxford Statistics Phasing Server was developed by W.K., K. Sharp and J.M. All other authors contributed data sets to the project or provided advice.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 The effect of site filtering on Ts/Tv ratio per sample
The top figure shows the per-sample transition-transversion ratio (Ts/Tv) for chromosome 20 after running the GLPhase genotype calling method on the full MAC5 site list. In the bottom figure, GLPhase was run after the site filtering described in the text.
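As a minimal sketch of the quantity plotted here, the transition-transversion ratio of a set of biallelic SNPs can be computed as below. The function name `ts_tv_ratio` and the input layout (a list of reference/alternate allele pairs) are illustrative assumptions, not part of the HRC pipeline; a per-sample value would presumably be obtained by passing only the sites at which that sample carries a non-reference allele.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
# substitutions; every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = {"A", "C", "G", "T"}


def ts_tv_ratio(variants):
    """Return Ts/Tv for an iterable of (ref, alt) single-base allele pairs.

    Hypothetical helper for illustration. Pairs that are not single-base
    substitutions (indels, ref == alt, ambiguous bases) are skipped.
    """
    ts = 0
    tv = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    # With no transversions the ratio is undefined; report infinity.
    return ts / tv if tv else float("inf")
```

A low per-sample ratio is typically read as a sign of false-positive calls, since errors tend to behave like random substitutions, for which transversions are twice as likely as transitions.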
Supplementary Figure 2 Data summaries before and after site filtering
Figure a shows the number of sites in the unfiltered and filtered MAC5 site lists (chromosome 20) stratified by non-reference allele frequency. The allele frequency here is calculated from the genotypes made after running the GLPhase genotype calling method on the full MAC5 site list. Figure b shows the corresponding transition-transversion ratio (Ts/Tv) of these sites.
Supplementary Figure 3 Performance of imputation using different reference panels
The x-axis shows the non-reference allele frequency of the SNP being imputed on a log scale. The y-axis shows imputation accuracy measured by aggregate r² when imputing SNP genotypes into 10 CEU samples. These results are based on using genotypes from sites on the Illumina Core Exome SNP array.
Supplementary Figure 4 Performance of imputation using different reference panels
The x-axis shows the non-reference allele frequency of the SNP being imputed on a log scale. The y-axis shows imputation accuracy measured by aggregate r² when imputing SNP genotypes into 10 CEU samples. These results are based on using genotypes from sites on the Illumina OMNI 5M SNP array.
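The accuracy measure in Supplementary Figures 3 and 4 can be illustrated with a short sketch. It assumes that aggregate r² means the squared Pearson correlation between true genotypes and imputed dosages after pooling all SNPs that fall in the same allele-frequency bin; that reading, the function names and the input layout are assumptions made for illustration rather than the authors' code.

```python
from collections import defaultdict


def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either vector is constant
    return (sxy * sxy) / (sxx * syy)


def aggregate_r2(snps, bin_edges):
    """Pool SNPs by allele-frequency bin and compute one r2 per bin.

    snps: iterable of (freq, true_genotypes, imputed_dosages), where the two
          sequences hold one value per sample (hypothetical layout).
    bin_edges: ascending frequency boundaries; bin i is [edge_i, edge_i+1).
    """
    pooled = defaultdict(lambda: ([], []))
    for freq, truth, dosage in snps:
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= freq < bin_edges[i + 1]:
                pooled[i][0].extend(truth)
                pooled[i][1].extend(dosage)
                break
    return {
        (bin_edges[i], bin_edges[i + 1]): pearson_r2(truth, dosage)
        for i, (truth, dosage) in sorted(pooled.items())
    }
```

Pooling before correlating, rather than averaging per-SNP r² values, keeps the measure defined for rare SNPs whose true genotypes are constant across a small validation sample such as the 10 CEU individuals used here.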
Supplementary Figure 5 Site stratification by calling and filtering status across cohorts
On the x-axis we show the number of studies in which a variant was called (out of 20), and on the y-axis the number of times it was filtered out by the cohort-specific internal QC pipelines. The color shows the percentage of variants in each cell (red means more than 10% of variants lie in that cell; blue means less than 0.1%). The number at the top right of each cell is the Ts/Tv ratio for all sites in that cell. Cells higher in the plot were filtered out relatively often and usually represent poor-quality variants, as the low Ts/Tv ratio also indicates. All variants above the red line were filtered out; this excludes every cell that was filtered independently by more than 4 studies or has a Ts/Tv ratio below 1.7.
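One reading of the exclusion rule in this caption can be sketched as follows. The cell representation (a mapping from a called-count and filtered-count pair to that cell's Ts/Tv ratio) and the name `cells_to_keep` are hypothetical; only the two thresholds, 4 studies and 1.7, come from the caption.

```python
def cells_to_keep(cell_ts_tv, max_filtered=4, min_ts_tv=1.7):
    """Return the cells that survive the caption's exclusion rule.

    cell_ts_tv maps (n_called, n_filtered) -> Ts/Tv ratio of all sites in
    that cell (hypothetical layout). A cell is excluded when it was filtered
    by more than max_filtered studies or when its Ts/Tv is below min_ts_tv.
    """
    kept = set()
    for (n_called, n_filtered), ts_tv in cell_ts_tv.items():
        if n_filtered > max_filtered or ts_tv < min_ts_tv:
            continue
        kept.add((n_called, n_filtered))
    return kept
```

For example, a cell called in 10 studies and filtered in 4 with a Ts/Tv of exactly 1.7 is kept under this reading, because both comparisons are strict.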
Supplementary Figure 6 Comparison of methods for genotype calling as sample size increases
The figure shows a log-log plot of run time versus sample size for four different methods of genotype calling from GL data. For each sample size, 5 random chunks of 1,024 sites from chromosome 20 were used. Each dot represents the run time of a single data set. Lines are drawn between successive means of the run times at each sample size.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–6, Supplementary Tables 1–8 and Supplementary Note. (PDF 1898 kb)
About this article
Cite this article
the Haplotype Reference Consortium. A reference panel of 64,976 haplotypes for genotype imputation. Nat Genet 48, 1279–1283 (2016). https://doi.org/10.1038/ng.3643
DOI: https://doi.org/10.1038/ng.3643