The expected polygenic risk score (ePRS) framework: an equitable metric for quantifying polygenetic risk via modeling of ancestral makeup

Yu-Jyun Huang; Nuzulul Kurniansyah; Matthew O Goodman; Brian W Spitzer; Jiongming Wang; Adrienne Stilp; Cecelia Laurie; Paul S de Vries; Han Chen; Yuan-I Min; Mario Sims; Gina M Peloso; Xiuqing Guo; Joshua C Bis; Jennifer A Brody; Laura M Raffield; Jennifer A Smith; Wei Zhao; Jerome I Rotter; Stephen S Rich; Susan Redline; Myriam Fornage; Robert Kaplan; Nora Franceschini; Daniel Levy; Alanna C Morrison; Eric Boerwinkle; Nicholas L Smith; Charles Kooperberg; Bruce M Psaty; Sebastian Zöllner; the Trans-Omics in Precision Medicine Consortium; Tamar Sofer

doi:10.1101/2024.03.05.24303738

Abstract

Polygenic risk scores (PRSs) depend on genetic ancestry due to differences in allele frequencies between ancestral populations. This leads to implementation challenges in diverse populations. We propose a framework to calibrate PRS based on ancestral makeup. We define a metric called “expected PRS” (ePRS), the expected value of a PRS based on one’s global or local admixture patterns. We further define the “residual PRS” (rPRS), measuring the deviation of the PRS from the ePRS. Simulation studies confirm that it suffices to adjust for ePRS to obtain nearly unbiased estimates of the PRS-outcome association without further adjusting for PCs. Using the TOPMed dataset, the estimated effect size of the rPRS adjusting for the ePRS is similar to the estimated effect of the PRS adjusting for genetic PCs. The ePRS framework can protect from population stratification in association analysis and provide an equitable strategy to quantify genetic risk across diverse populations.

Introduction

Polygenic risk scores (PRS) combine information from multiple genetic variants, summarizing disease risk due to genetics into a single score. They are typically constructed based on summary statistics from genome-wide association studies (GWAS). PRSs are now being incorporated into the clinic (1,2). Potential applications of PRS in healthcare settings include disease risk screening, risk prediction, and identification of target populations that may benefit from early interventions; all are the ultimate goals of precision and preventive medicine (3–5). Using PRS in multi-ancestry populations has received increasing attention due to the recognition that genomics research has been “failing on diversity” (6), with most genetic studies being carried out in populations of European ancestry. Consequently, PRS performance is lower in populations that are not of European ancestry or that have some European ancestry admixed with other continental ancestries (7). Aside from efforts to increase the diversity of GWAS participants, recent PRS research has proposed to improve upon PRS in individuals of diverse ancestries by combining information from summary statistics from multiple GWASs, each from a different genetic ancestry, while borrowing information across (8,9). However, there is still a gap in PRS performance between individuals of European ancestry and of other diverse populations (10,11).

To alleviate population stratification bias in association studies, adjusting for principal components (PCs) constructed using genome-wide genetic data is a standard procedure. Recently, the “ancestry-adjusted PRS” method was utilized to analyze diverse populations (4,12,13), where PCs are regressed out of the PRS via a linear model, and then the PRS-outcome association is estimated with the residuals from the first stage regression model. Khera et al. (4) showed that the ancestry-adjusted PRS had similar distribution across diverse populations, which indicated the removal of ancestry-related genetic effects. However, different studies may adjust for different number of PCs, and the meaning of PCs are inequivalent across different datasets. Furthermore, PCs can only capture global genetic structure and not local ancestry pattern. Hence, it still needs to be determined whether the adjustment of PCs in PRS-outcome association analyses in admixed population is sufficient to account for unknown ancestry-related confounding due to the complex admixture pattern (14–16).

Motivated by the challenges of using PCs in the translation of PRS for clinical use in diverse populations, we aim to propose a new method in which the ancestry-related factors that cause the differences in the distribution of PRS between populations can be captured by a single variable, and the interpretation is identical across datasets. The intuitive difference in PRS distribution between populations is that they have different baseline genetic characteristics, i.e., the mean of the PRS distribution, which is caused by the difference in ancestry composition and ancestry-specific allele frequency. Therefore, if the baseline genetic value of each individual can be computed, the corresponding PRS can be calibrated to obtain a distribution that is homogeneous across populations. Because the calibrated PRS is now independent of genetic ancestry, utilizing a single threshold to classify individuals into high- or low-risk groups across diverse populations is feasible, which achieves the goal of equitably quantifying genetic risk.

Based on the above rationale, this work introduces a novel framework that uses individual’s genetic ancestry composition in PRS construction. We propose a metric called expected PRS (ePRS), in which the expected value of a PRS is calculated based on either their global ancestry proportions or their local ancestry pattern. Therefore, the ePRS can be interpreted as the baseline trait-specific genetic parameter based on an individual’s ancestral composition.

Coupled with ePRS, we introduce another two metrics: the residual PRS (rPRS), which measures the deviation of the PRS from its expected mean according to admixture patterns (i.e., from the ePRS), and the quantile PRS (qPRS), defining the distribution-based percentile of the PRS accounting for both the ePRS and its variability. Therefore, the rPRS and qPRS can distinguish genetic risk between individuals regardless of their genetic ancestry patterns. We performed simulation studies to examine whether our proposed method can provide a nearly unbiased estimation of the PRS-outcome effect by adjusting for ePRS in the association model. Other standard adjustment procedures are considered as comparators. We also applied our method to analyze data from the Trans-Omics for Precision Medicine (TOPMed) study and compare the PRS-outcome effect estimation with the standard PC-adjusted approach.

Results

PRS-outcome association model

The underlying PRS-outcome association model is provided by:

In equation (1), Y is the trait of interest. The PRS of individual i,, is calculated by the aggregation of SNP allele effect sizes ω_j, j = 1, 2, …, p and the individual allele counts g_ij, j = 1, 2,. ., p. The parameter of interest is β₁, representing the effect size of the PRS-outcome association; X is a vector of covariates and δ is the vector of corresponding effect sizes; and U_G is an unknown ancestry-related genetic factor, which may confound the PRS-outcome relationship (population stratification bias), and γ is its effect. The function g() is known as the link function, relating the outcome distribution to the covariates.

Overview of the ePRS framework

We define the ePRS as the expectation of individual’s polygenic risk score ePRS_i=E (PRS_i). The ePRS provides the baseline genetic characteristic based on one’s ancestry composition. We define the residual PRS (rPRS) as the difference between PRS and ePRS, i.e., rPRS_i= PRS_i− ePRS_i. The rPRS can be used to compare genetic risk between individuals with different genetic ancestry composition. The working association model in the ePRS framework is illustrated in equation (2)

One can see that both the rPRS and the ePRS are used as covariates in the model. The estimated effect of the rPRS is equivalent to the estimated effect of the PRS in a standard association model. Importantly, the ePRS is used to control for stratification bias, instead of the standard adjustment for genetic PCs. This is illustrated in Figure 1a, where one can see that the ePRS blocks the “back door path” from the unknown genetic factors (U_G from equation (1)) that are related to ancestry by having ancestry-related differences in allele frequencies.

Figure 1: Schematic overview of the ePRS framework

Panel a: Directed acyclic graph demonstrating that the ePRS blocks a pathway between the PRS and unknown confounding genetic measures U_G. Panel b: The rPRS is computed by subtracting the value of the ePRS from that of the PRS. The ePRS is determined according to an individual’s ancestral makeup. Panel c: The distribution of BMI and its PRS, ePRS, and rPRS in the TOPMed dataset. The distributions stratified based on harmonized self-reported race/ethnicity. Abbreviations: BMI: body mass index; PRS: polygenic risk scores; ePRS: expected polygenic risk scores; rPRS: residual polygenic risk scores; TOPMed: Trans-Omics for Precision Medicine.

To focus on the phenomena of interest (population stratification bias), in this work we assume that the SNP effect size ω_j, are fixed and known values, and this work does not discuss uncertainty in their estimation nor potential differences in effects across genetic ancestries. To model the allele count of each variant and each individual, we assume that g_ij follows a mixture distribution based on individual i’s ancestry makeup: with K being the number of genetic ancestries is the ancestry-specific allele frequencies of allele j in ancestry. is the Binomial distribution Bin represents the proportion of the entire genome of individual i inherited from ancestral population, k i.e., the global ancestry proportion. We assume that π_ik are known. Note that if π_ik = 1 for a specific k ∈ {1, … K}, equation (3) reduces to a standard allele count model without admixture. We can use global ancestry proportions to calculate the “global” ePRS (gePRS) as: With rearrangement of terms in equation (4), for individual i the global ePRS can be expressed as a weighted combination of ancestry-specific ePRSs, with weights being their global ancestry proportions. If the p variants are independent, the variance of the PRS conditional on the global ancestry proportions is:

If the ancestry makeup is available in each locus (local ancestry), we can also derive local ePRS (lePRS). Assuming that each chromosomal copy’s local ancestry is known and fixed, we now use the binomial distribution with count of 1 for chromosomal copy m=1,2, i.e.. Exploiting the same strategy of deriving global ePRS, the local ePRS can be written as

And the variance of PRS based on local ancestry is provided in equation (7):

More details about the ePRS framework and derivations can be found in the Methods section.

The mean and variance of a PRS (based on either global or local ancestry patterns) can next be used to construct a third metric: the quantile PRS (qPRS). We assume that given the mean and variance, the distribution of the PRS for individual i follows the Normal distribution with mean E(PRS_i) and variance Var(PRS_i), i.e.,

The qPRS then can be computed as where the notation Φ is the cumulative Normal distribution function. Thus, the qPRS is a person-specific percentile of PRS value conditional on an individual’s ancestral makeup. The schematic illustration of the ePRS framework as well as example distributions of PRS, ePRS and rPRS for body mass index (BMI) in the TOPMed datasets are shown in Figure 1.

Results from simulation studies

Overview of the simulation settings

We conducted extensive simulation studies to examine the estimation performance of PRS-outcome association (β₁ in equation (1)) when using the ePRS framework in comparison with other possible association models. We simulated a few ancestry makeup patterns and used them to generate a PRS and an unobserved confounder U_G. Figure 2 provides an overview of the procedure for generating global and local ancestries as well as the PRS, the unknown genetic confounder U_G, and the outcome. To guide the definition of the simulated PRS, we used summary statistics from a GWAS of systolic blood pressure (SBP) of the UK Biobank and ICBP consortium (17). We used the top 100 SNPs after clumping using PLINK and extracted their estimated effect sizes from the GWAS as the true variant weights. The ancestry-specific frequencies of these SNPs were taken to be the estimated ancestry-specific frequencies from the global ancestry-specific allele frequency estimation in admixed populations (GAFA) procedure applied over the TOPMed dataset (18). Two types of PRSs were considered in our simulations: homogenous weighting PRS, where the variant weights are set to be the same across ancestries, and heterogenous weighting PRS, where some of the weights are ancestry-specific (see Methods). We simulated the unknown confounder U_G in a few forms (Supplementary Table 1), and it was generally sampled to be ancestry-related using a similar strategy to the PRS. Further, a few forms of U_G were a weighted sum of alleles, just like a PRS. Thus, similar to the simulation of the main PRS, we used summary statistics from GWAS to guide the selection of SNPs to inform of ancestry-specific allele frequencies and weights. SNPs were sampled from either the UKBB-ICBP SBP GWAS (conf-PRS1) or from the SBP GWAS of the Million Veteran Program (MVP), in which case, top SNPs that were highly common (MAF > 0.4) in African ancestry and relatively rare (MAF < 0.1) in European ancestry were used (conf-PRS2). We also used a single variant or a combination of two variants as genetic confounders, again enriching for African frequency. The purpose of simulating alleles that are common in one particular ancestry was to create a strong genetic ancestry confounding effect. For comparison, we also set the “unobserved” confounder to be a known (i.e., observed) genetic principal component (conf-pc*), essentially another weighted sum of SNPs. Notably, in all simulations other than when using a conf-pc* as a confounder (for benchmarking purposes), the association model is always different than the data generating model. The PRS value considered in each association model (compare with data generating model) is the homogenous weighting PRS, which forms a model misspecification scenario when the underlying model is generated by the heterogeneous weighting PRS. We summarized the settings and the rationale for creating different simulation scenarios in Supplementary Note 1 and Supplementary Table 1. Comprehensive details about the simulation studies are provided in the Methods section.

Figure 2: Visual illustration of the data generating mechanism in simulations

To generate data in simulations we used the 5 steps illustrated above. In step 1 we sampled global proportions of ancestry for each individual assuming 3-way admixture. Step 2 sampled the local ancestry at each SNP, on each chromosomal copy, according to the proportions of ancestry. Step 3 used the local ancestry at each SNP position and copy to sample an allele from a Bernoulli distribution (with the probability was set as the ancestry-specific allele frequency). Step 4 summed alleles with weights to obtain a PRS and genetic confounder (U_G). Step 5 simulated the outcome from a regression model that includes an intercept, the PRS, the unobserved confounder, and normally distributed errors. Abbreviations: PRS: polygenic risk score; SNP: single nucleotide polymorphism.

Simulation results: estimation performance of the PRS-outcome association in the ePRS framework

Selected results from simulation studies are shown in Figures 3 to 5, demonstrating that: (a) the estimated effect size of rPRS when adjusting for either global or local ePRS in the model is unbiased for β₁ (Figure 3); (b) if the unknown genetic confounders share similar local ancestry intervals with the variants that make up the PRS of interest (conf-PRS1 setting), adjusted lePRS in the association model results in more efficient estimation than gePRS (Figures 3 and 4); (c) setting strong genetic ancestry confounding (conf-PRS2) causes biased estimates and dramatically increases the MSE of the estimated β₁ in models that do not adjust for ePRSs (none and conf-pc*) but has nearly no increase in bias when applying the ePRS framework (Figures 3 and 4); (d) in the setting where the underlying simulated model is homogenous weighting PRS with conf-pc* confounding, the ePRS framework performed comparably to the benchmark model (conf-pc*) and outperformed other competing approaches (Figures 3 and 4); (e) compared with the approach that adjusts for PCs in the model, in which the PCs were computed based on the genetic data that generated the PRS of interest, the ePRS framework has smaller variance of β₁ estimates (Figure 3) and smaller MSE (Figure 4); (f) estimation using the ePRS framework is more robust to PRS model misspecification (heterogeneous weighting PRS setting, Figures 3 and 4); (g) the efficiency in the estimation of β₁ decreases (i.e., increased MSE) as the effect size γ of the genetic confounder, but the increase in MSE is less severe when using the ePRS framework (Figure 5). Full results from these simulations are provided in Supplementary Note 1 and Supplementary Figures 1 to 3.

Figure 3: PRS effect size estimates in simulation studies

The figure provides box plots of the distribution of estimated effect sizes of the PRS across simulation settings. Results on the top row correspond to data generating model with homogenous weighting PRS, and results in the bottom row correspond to heterogeneous weighting PRS. Columns correspond to four forms of confounding factors defined in the data generating model: conf-PRS1, conf-PRS2, conf-pc*, and no confounding (from the left to the right). Each panel provides boxplots visualizing distributions of obtained in 7 association analysis models, with the y-axis representing values. The true value, β₁ = 1.5, is highlighted as a red dotted line. Association analyses that estimate (standard) PRS effect while adjusting to covariates include PC10, PC20 (adjusting for top 10 and 20 genetic PCs, respectively), none (no covariate adjustment), conf-pc* (adjustment for a known confounder, for benchmarking), and gaProp (adjustment for global proportions of ancestry). Association analyses applying the ePRS framework include gePRS and lePRS (estimation of rRPS effect adjusting for ePRS, based on global and local models, respectively). Distributions are provided from 1000 simulation repetitions.

Abbreviations: conf: confounding; ePRS: expected PRS; gePRS: global ancestry expected PRS; lePRS: local ancestry expected PRS; PC: principal component; PRS: polygenic risk score; gaProp: global ancestry proportion.

Figure 4: Mean squared error of PRS effect size estimates in simulation studies

The figure provides bar plots visualizing the MSE of estimated effect sizes of the PRS across simulation settings. Results on the top row correspond to data generating model with homogenous weighting PRS, and results in the bottom row correspond to heterogeneous weighting PRS. Columns correspond to four forms of confounding factors defined in the data generating model: conf-PRS1, conf-PRS2, conf-pc*, and no confounding (from the left to the right). Each panel provides bars with heights corresponding to the MSE of estimated in 7 association analysis models. Association analyses that estimate (standard) PRS effect while adjusting to covariates include PC10, PC20 (adjusting for top 10 and 20 genetic PCs, respectively), none (no covariate adjustment), conf-pc* (adjustment for a known confounder, for benchmarking), and gaProp (adjustment for global proportions of ancestry). Association analyses applying the ePRS framework include gePRS and lePRS (estimation of rRPS effect adjusting for ePRS, based on global and local models, respectively). MSEs were computed over 1000 simulation repetitions. Intervals around the estimated MSE correspond to the MSE +/-one estimated standard error. Abbreviations: MSE: mean square error; conf: confounding; ePRS: expected PRS; gePRS: global ancestry expected PRS; lePRS: local ancestry expected PRS; PC: principal component; PRS: polygenic risk score; gaProp: global ancestry proportion.

Figure 5: Estimation performance of the PRS effect size across increasing strength of the unknown genetic ancestry-related confounding factor

The figure provides the estimated MSE of the estimated PRS effect size across simulation settings and analysis methods. Results on the top row correspond to data generating model with homogenous weighting PRS, and results in the bottom row correspond to heterogeneous weighting PRS. Columns correspond to four forms of genetic ancestry-related confounding factors defined in the data generating model: conf-PRS1, conf-PRS2, conf-pc*, and no confounding (from the left to the right). Each panel provides MSE (y-axis) obtained across 1000 simulation repetitions with association analyses using 7 combinations of PRS and adjustment approaches, and across 3 simulated effect sizes (γ) of the genetic ancestry-related confounder. In this simulation, we fixed β_’ = 1. in the data generating model across all settings. Association analyses that estimate (standard) PRS effect while adjusting to covariates include PC10, PC20 (adjusting for top 10 and 20 genetic PCs, respectively), none (no covariate adjustment), pc* (adjustment for a known confounder, for benchmarking), and gaProp (adjustment for global proportions of ancestry). Association analyses applying the ePRS framework include gePRS and lePRS (estimation of rRPS effect adjusting for ePRS, based on global and local models, respectively). MSEs were computed over 1000 simulation repetitions. Intervals around the estimated MSE correspond to the MSE +/-one estimated standard error.

Abbreviations: MSE: mean square error; conf: confounding; ePRS: expected PRS; gePRS: global ancestry expected PRS; lePRS: local ancestry expected PRS; MSA: mean squared error; PC: principal component; PRS: polygenic risk score; gaProp: global ancestry proportion.

Supplementary Note 2 provides comprehensive results from secondary simulations. Specifically, we conducted additional simulations: (a) sensitivity analyses to examine the performance of the ePRS approach under the assumption that random error exits in the global or local ancestry inference (Supplementary Figure 4); (b) simulation studies demonstrating that the estimated effect of the ePRS is related to the strength of the unknown genetic confounding effect (Supplementary Figures 5 and 6); and finally (c) simulations demonstrating the use of quantile PRS (qPRS) in binary trait risk classification analyses (Supplementary Figure 7). Each of the global and local qPRS approaches may outperform the other in some settings.

PRS association analyses in the Trans-Omics for Precision Medicine (TOPMed) dataset

The TOPMed dataset provides aggregated whole genome sequencing (WGS) data from individuals from multiple parent studies representing genetically (and environmentally) diverse populations. We applied the ePRS framework to estimate the associations of a few PRSs with their outcomes in the TOPMed dataset. We considered five continuous phenotypes: body mass index (BMI), diastolic blood pressure (DBP), systolic blood pressure (SBP), high-density lipoprotein (HDL), low-density lipoprotein (LDL), and two binary outcomes: venous thromboembolism (VTE) and obstructive sleep apnea (OSA). Across traits, up to 49,626 individuals were included in a given analysis, with sample sizes and parent studies of participants varying across traits. Supplementary Table 2 characterizes TOPMed participants in the continuous trait analyses, Supplementary Table 3 provides summary statistics of these continuous traits in the sample. Supplementary Tables 4 and 5 characterize TOPMed participants in the VTE and OSA analyses, respectively. Supplementary Table 6 provides the number of individuals participating in each of the analyses (combined and broken down by harmonized self-reported race/ethnicity).

Characteristics of PRS, ePRS, and rPRS across TOPMed individuals and traits

Figure 6 illustrates the impact of genetic ancestry on BMI and LDL PRSs. Panel a shows that the distribution of conventional PRS differ across self-reported race/ethnicity groups, where African American individuals have higher BMI PRS values and lower LDL PRS values, compared to other groups. This is driven by their ancestral makeup, as demonstrated by the ePRS distributions. In contrast, the rPRS distributions are similar across self-reported race/ethnicity groups and are centered around zero. Panel b provides another view of the ePRS and PRS relationship. For example, it demonstrates that the highly admixed groups, such as African American and Hispanic/Latino individuals, have large variation in ePRS values, while genomes of most Asian and European individuals have little admixture. Panel c demonstrates that while African American individuals are enriched in high BMI and low LDL PRS values, this pattern disappears when using the rPRS and the qPRS. Supplementary Figures 8-14 provide similar visualizations for other traits (SBP, DBP, HDL, OSA, and VTE).

Figure 6: Patterns of BMI and LDL ePRS-related measures in TOPMed participants

Panel a: BMI and LDL PRS, global ePRS, and global rPRS distributions in TOPMed participants. Panel b: The relationship between BMI and LDL PRSs and global ePRSs. Each point represents an individual’s global ePRS (y-axis) and PRS (x-axis) values. Panel c: Scatterplots visualizing, on the x-axis, percentiles of PRS, global rPRS, and global qPRS, against mean values of the corresponding phenotypes (y-axis), averaged across individuals with the corresponding PRS, rPRS, or qPRS percentile. For instance, for each phenotype and PRS measure, individuals were binned into 100 strata defined by PRS measure percentiles. The color of a given point corresponds to the proportion of African American individuals among individuals in the relevant percentile stratum. Darker color reflects a higher proportion of African American individuals. Global ePRSs were constructed based on each individual’s global ancestry proportion and the ancestry-specific allele frequency estimated using GAFA. Abbreviations: TOPMed: Trans-Omics for Precision Medicine; BMI: body mass index; LDL: low-density lipoprotein; PRS: polygenic risk score; ePRS: expected PRS; rPRS: residual PRS; qPRS: quantile PRS.

Estimation of PRS-outcome associations

We apply the ePRS framework to estimate PRS-outcome associations and compare the results to those from association models using conventional PRS and adjusting for genetic PCs or the estimated global ancestry proportions. When using the ePRS framework, we estimated the association of the rPRS while adjusting for the corresponding ePRS (global or local), computed with ancestry-specific allele frequencies calculated by GAFA.

Figure 7 shows the estimated effect sizes of the PRSs and of the rPRSs with their 95% confidence intervals. Results are provided from multi-population analysis, and stratified by self-reported race/ethnicity groups. The results from the multi-population analyses show that all four models had similar estimates, matching what we observed in simulation studies. In some cases, local ePRS analysis resulted in slightly higher estimated effect sizes (this is more apparent for DBP). The estimates obtained via multi-population analyses are close to those from the European American population, as it usually dominates other populations in sample size contribution. The estimates corresponding to the Asian American group have large standard errors due to small sample sizes. Generally, effect estimates often differ across race/ethnic groups. For example, the estimated effect sizes in the Asian American group are substantially lower than other groups for LDL and HDL traits. The estimated effect size for the African American group is higher than others for OSA (BMI adjusted) and HDL; however, the estimated effect sizes are lower than other three populations for most of the other traits. All association analysis results, including the estimated effect sizes, standard error, sample size used for each trait, are summarized in Supplementary Data 1.

Figure 7: The estimated PRS-outcome effect sizes for TOPMed studies traits.

Estimated PRS-outcome effect sizes (y-axis) and their corresponding 95% confidence intervals in the TOPMed dataset. Conventional PRS model estimated the PRS effect size by either adjusting for genetic PCs or for global ancestry proportions. For the ePRS model, we show the estimated rPRS effect size either adjusting for global or local ePRS. Estimated association are provided, for each trait, for the combined dataset (Multi), and stratified by self-reported race/ethnicity. For continuous traits (BMI, SBP, DBP, HDL, and LDL), effect sizes are in the original trait scale (kg/m², mmHg, mmol/L). OSA and VTE are binary traits, and their estimated effect sizes are in the log odds ratio scale. For OSA, we provide results for two OSAs: OSA_bmi_adj and OSA_bmi_unadj, based on GWAS that did and did not adjust for BMI, respectively. Due to the limited sample size, we did not perform an Asian-specific analysis of VTE. More details of the analytic approaches for all the analyses can be found in the Methods section. Abbreviations: BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; HDL: high-density lipoprotein; LDL: low-density lipoprotein; OSA: obstructive sleep apnea; VTE: venous thromboembolism; Multi: Multi-ethnic; EA: European American; AA: African American; HA: Hispanic/Latino; AsA: Asian American; PRS: polygenic risk score; PC: principal component; gaProp: global ancestry proportion; gePRS: global ancestry expected PRS; lePRS: local ancestry expected PRS.

In secondary analysis we developed genome-wide PRSs using LDPred2 (19), and constructed PRS, ePRS, and rPRS for each trait. The estimated PRS-outcome associations across all traits are similar between the ePRS framework and conventional association models (adjusting for genetic PCs or global ancestry proportions), with the same conclusion as the primary analyses. Details of secondary data analysis are summarized in Supplementary Note 4, Supplementary Figure 15, and Supplementary Data 2.

Discussion

We proposed an individual-level metric, the ePRS, calculated according to one’s ancestral makeup, to provide an equitable way to quantify genetic risk across diverse populations. By “equitable” we mean that an individual of a specific genetic ancestry will not be automatically annotated has “high” or “low” risk due to their ancestry, and that the same PRS would be useful independently of population descriptors such as genetic ancestry, or race and ethnicity (which are noisy correlates of genetic ancestry and have no biological meaning). By its definition, the ePRS can be considered as the baseline genetic characteristic of an individual according to their specific ancestry composition. The ePRS forms a basis for standardizing PRS across diverse and admixed individuals. We showed that adjustment for the ePRS accounts for population stratification in PRS-outcome association studies: we conducted several simulation studies and confirmed that it is sufficient to adjust for ePRS to obtain nearly unbiased estimates of the PRS-outcome associations, while adjustment for PCs (with or without the ePRS) does not improve parameter estimates. The homogeneous distribution of rPRS across different populations further demonstrates the usefulness of calibrating a PRS by the ePRS. We applied this framework to study PRS associations with 7 phenotypes in the TOPMed dataset. Across phenotypes, the estimated effect size of the rPRS adjusting for the ePRS is similar to the estimated effect of the PRS adjusting for genetic PCs or global ancestry proportion.

PRS distributions are affected by allele frequencies. As frequencies vary between groups defined by genetic ancestry patterns, PRS distributions may differ between groups, limiting the interpretation of an observed PRS value of an individual who is compared to other individuals with different genetic makeup. One of the novelties of our ePRS framework is the conceptualization of personalized PRS distribution based on one’s genetic ancestry makeup.

Thus, observed PRS values are contextualized based on an individual’s ancestry composition, which determines the potential attained PRS values. Consequently, “high” or “low” PRS values are determined in relation to this potential. Individuals can be classified into at-risk groups regardless of ancestral makeup, preventing a situation where all individuals with high proportion of one genetic ancestry all appear to be at a high (or low) risk compared to individuals with high proportion of a different genetic ancestry.

Our proposed method has a few notable strengths. First, ePRS accounts for population stratification in a similar fashion to genetic PCs, however, because the ePRS is constructed such that it is specific to the PRS of interest, it has more intuitive interpretation. Further, the ePRS represents the same quantity across datasets, in contrast to PCs which are typically constructed independently in each dataset (though some approaches have been proposed for unifying PCs across datasets (e.g., (20)). While these authors also showed that PCs are used for prediction models when aligned across datasets, this alignment procedure requires joint quality control across multiple datasets, which is difficult, and critically, the ePRS framework is not inconsistent with this approach, because PCs can be incorporated to prediction models that include the ePRS. Second, the ePRS framework obviate the potential use of race and ethnicity in clinical use of PRS. While race/ethnic classification can be seen as proxies of genetic ancestry (e.g., because self-reported White individuals usually have large proportions of European genetic ancestries, etc.), two individuals self-identifying with the same race/ethnicity group may have different mixes of genetic ancestry proportions (21,22). Therefore, the distributions of PRS may arbitrarily differ across groups, potentially leading to wrong interpretations when stratifying individuals into risk groups. In contrast, genetic ancestry is defined or estimated by admixture pattern, which is a more accurate characteristic than the ambiguous definition of self-reported race ethnicity (23). How to explicitly use race/ethnicity self-reported information in biomedical or genetics research still debated, given that it is often unclear what such variables measure (22). For that reason, we expect our ePRS framework to be widely adopted and refined in order to limit potential use of self-reported race/ethnicity in clinical use of PRS while still providing accurate, well defined, measures of genetic risk.

A major consideration and limitation of the proposed ePRS framework is the computational burden and estimation accuracy of the global/local ancestry as well as the ancestry-specific allele frequency. We conducted simulation studies to examine the sensitivity of the ePRS to accuracy of both global and local ancestry inference, and showed the robustness of both global and local ePRS. Still, the performance of the ePRS framework must be influenced by the accuracy of estimated global/local ancestry pattern. In our simulation studies, although local ePRS utilizes more information than global ePRS, and, if accurate, should theoretically perform better than global ePRS, there were settings in which global ePRS outperformed the local ePRSs. Many algorithms have been proposed to conduct local ancestry inference in the past few years, including LAMP (24) and RFMix (25)(26). Among these approaches, RFMix, the approach used to estimate global/local ancestry patterns in the TOPMed dataset, showed high accuracy in estimating ancestry (27,28). We expect a more accurate and efficient algorithm to be proposed to conduct local ancestry inference and, ultimately, incorporate it into the ePRS framework to increase the estimation performance.

The ePRS is proposed to calibrate the PRS to obtain the rPRS and qPRS, which do not depend on ancestry. Another approach that have been used to “remove” the effect of ancestry on PRS is the “ancestry-adjusted PRS” method, where PCs are regressed out of the PRS via a linear model, and then the PRS-outcome association is estimated with the residuals from this regression used in lieu of the PRS (4,12,13). Khera et al. (4) showed that the ancestry-adjusted PRS had similar distribution across diverse populations, which is the same as what we found in the distribution of rPRS. However, the conventional problems of utilizing PCs, including the number of PCs to use and their generalization across datasets, are still unsolved. Moreover, local ancestry patterns specific to the PRS are not captured by standard PCs. Even though the homogenous distribution of ancestry-adjusted PRS highlights the success of adjusting PCs to account for population stratification, we think that a PRS-specific adjustment is useful. Yet, the ePRS is currently limited, compared to the PCs-adjustment method, in that it is harder to use given the computational complexity and the choices that have to be made regarding ancestry inference. Global ePRS is very easy to compute, it only requires the computation of ancestry-specific ePRS, followed by weighting according to individual’s ancestry proportion. Still, this requires choosing the level of ancestry to use (e.g., continental ancestry? more refined ancestry levels?), and inference of that ancestry, while PC-adjustment does not. Notably, PC adjustment does not account for dependency of PRS variance on admixture patterns, as it only removes mean effects. While the rPRS does not account for the variance of the PRS distribution according to genetic makeup, the qPRS does. However, currently we did not account for linkage disequilibrium between variants in the computation of the variance, and therefore only used PRSs that rely on a limited number of highly significant, independent, trait loci. It is an important extension of the ePRS framework to account for LD when computing individual PRS distribution. This will allow for computing qPRS for a PRS that is constructed based on hundreds of thousands or millions of variants. The challenge is mostly computational.

In addition to clarifying the meaning of rPRS when using the ePRS, we demonstrated in our simulation studies that the estimated effect size of the ePRS is also meaningful, to some extent (Supplementary Note 2). We showed that the effect size of the ePRS can reflect the strength of unknown genetic factors based on ancestry composition. In contrast, the estimated effect sizes of PCs usually are not interpreted.

In this work, we use an individual-specific PRS distribution, assuming the variant effect sizes are fixed, admixture patterns are known, and the source of randomness is the variant frequencies. Other publications studied uncertainty in PRS quantifications by incorporating uncertainty in estimating variant weights. For example, Ding et al. (29) used a Bayesian framework to account for the potential distribution of variant weights. An exciting potential extension of this work is incorporating genetic ancestry information to simultaneously update the ancestry-specific effect size and the admixed pattern of genetics composition. It is a promising future direction for providing comprehensive PRS analysis results.

In summary, the proposed ePRS approach provides a strategy to differentiate individual genetic risk from differences in PRS distributions due to ancestral makeup. The ePRS improves over PCs in its interpretation and transferability across datasets while protecting from population stratification bias in association analysis.

Methods

Computing expected PRS (ePRS)

The polygenic risk score (PRS) of an individual i, i = 1, 2, …, n is calculated by the weighted sum of alleles, with weights being the alleles’ estimated effect sizes. This can be written as . We use the clumping and thresholding method, e.g., as implemented in the PRSice software (30), in order to have a set of independent genetic variants, to enable efficient computation of the PRS variance, conditional on ancestry, as later described. Here we treat the weighting parameter ω_j as a fixed and known value. We define the expected PRS (ePRS) for an individual i as where the E(.) is the expectation, over the random sample of alleles. By the definition of PRS and the calculation of expectation, we can write the ePRS in the following form,

In equation (11), we assume that g_{i j} is a random variable and the estimated effect size ω_j is fixed value. When focusing on a homogenous population, one can assume the genetic variant follows a binomial distribution g_{i j} ∼Bin(2, f_j), where f_j is the allele frequency of the allele g_j. Therefore, the expected value of the variant j can be computed as E(g_{i j})= 2 ×f_i and the resulting ePRS is for individual i. In addition, by assuming that the p variants are independent, we can apply the Binomial model to compute the variance of the PRS of individuals i as

We define the residual PRS (rPRS) for individual i as the difference between PRS_i and the corresponding ePRS_i, which is

The rPRS is an index indicating the additive deviation of the PRS value from the underlying ePRS value. In a homogenous population, each individual has the same ePRS value, while they likely have different rPRS values.

Consider the “mixed population” case, in which we assume the genome for each individual is inherited from one ancestry. Assume that in the studied population there are K ancestries a_i,∈ {a₁,a₂, …, a_k}, and the allele frequency for each genetic variant j in ancestry k is . The ePRS for each individual conditioned on their ancestry can be computed as

By equation (14), the ePRS for two individuals from the same ancestry is identical. Extending the idea of modeling mixed populations to admixed populations, in which the genome consists of mosaic segments inherited from different genetic ancestries, the Binomial distribution needs to be revised. In the next section, we illustrate how to calculate ePRS for admixed populations according to global and local ancestry patterns.

Construction of ePRS using global ancestry proportions and local ancestries

Global ancestry ePRS

We use global ancestry proportion to calculate global ancestry ePRS (gePRS) and later extend the idea to local ancestries ePRS (lePRS) using local ancestry inference. First, we assume that for each person, the proportions of their entire genomes inherited from each ancestry is known, which are the global ancestry proportion (example is shown in Figure 1 b). We further assume that the various ancestries are uniformly distributed across the genome. Let π_ik represent the proportion of the entire genome inherited from ancestry for the participant i, where k = 1, 2, … K. To model the genetic pattern of an admixed population, we assume that a genetic variant j follows a mixture model where is the Binomial distribution Bin with an ancestry-specific allele frequency. . Using the mixture model (15), we can now compute the gePRS for each individual as

From equation (16), the gePRS can be interpreted as a weighted combination of ancestry-specific ePRSs, with weights being the global ancestry proportion. We can extend equation (12) to calculate the variance of a PRS based on equation (15). Assuming that the p variants are independent, the variance of the PRS conditional on global ancestry proportions is:

Local ePRS

Next, we show how to use local ancestries to construct ePRS, which we call lePRS. Suppose we know the ancestry of each locus and each chromosomal copy. Alleles for each variant are counted by the sum of two chromosomal copies, g_ij = g_ij1 g_ij2. Assume, where m= 1, 2 denotes the two copies. The lePRS can be computed as:

Distinct from gePRS, we assume that each copy’s local ancestry is known and fixed. Hence, by conditioning on a_ijm. =a_k, the last term of equation (18) can be simplified as Pr (a_ijm. =a_k) = 1 and Pr (a_ijm. =a_k ;t ≠k) = 0. Therefore, the lePRS can be expressed as followed

We can also compute the variance of the PRS based on local ancestry information but add additional assumptions. Here, assuming that the p variants are independent, and that the two chromosomal copies are also independent, the variance of the PRS can be derived as:

The technical considerations of computing ePRS and the related metrics are summarized in Supplementary Note 3.

Quantile PRS

Based on the ePRS and the PRS variance, we can construct a third metric: the quantile PRS (qPRS). We assume that given the gePRS or lePRS and the corresponding variance of PRS, the distribution of the PRS for individual i follows the Normal distribution with mean E(PRS_i) and variance Var(PRS_i), i.e.,

The qPRS is computed as the percentile of the PRS value conditional on an individual’s ancestral makeup, which can be written as

where Φ is the cumulative Normal distribution function.

Simulations

Generating global ancestry proportions, local ancestries, and allele counts

Without loss of generality, we assume that the genome of each individual in the simulation study is inherited from three ancestries {a₁,a₂, a₂}. The model could be readily extended to a higher number of ancestries. Three ancestries may represent Hispanic/Latino admixed individuals in the U.S., where, for example, a₁ represents European, a₂ represents African, and a₃ represents American ancestry. We assume that a₁ has the highest proportion across the entire genome, and generated the global ancestry proportions by sampling the three proportions of ancestry sequentially from uniform distributions, as follows: where π_ak,k = 1,2,3 denotes the global ancestry proportion across the genome from ancestries a_k and Unif (a,b) is the uniform distribution with range (a,b) After generating π_i for each observation, we next generate local ancestry a_ijm.. Assuming that each SNP is independent, and the two chromosomal copies are also independent, we generate the local ancestry for each variant and each chromosome separately based on Thus, we assume that a_ijm. follows multinomial distribution, where α_ijm. ∼ multinomial . For each simulated variant, allele counts are then generated with the local ancestry guiding its ancestry-specific allele frequency.

Generating PRS

To select the parameters for the simulations, i.e., ancestry-specific allele frequencies and SNP effect sizes, we used summary statistics from a GWAS of systolic blood pressure (SBP), from UK biobank and the ICBP consortium (17). We preprocessed the summary statistics using PLINK with the standard parameters setting (genome-wide significance threshold (5□10⁻⁸), clumping parameter R²=0.1, and the distance was set to 1000 kb), and took 100 SNPs and their estimated effect sizes. Their ancestry-specific frequencies were taken to be the estimated frequencies from the GAFA procedure (18) applied over the TOPMed dataset (31). For each person and each SNP, we first sampled the local ancestry as described above, followed by sampling of allele count using the ancestry-specific allele frequency corresponding to that SNP and that local ancestry. We finally calculated the PRS as the weighted sum of SNP alleles with weights being the GWAS-estimated effect sizes. The flow chart to generate simulation data is illustrated in Figure 2.

We considered two PRS models. In the primary model, the PRS was assumed to be homogenous across genetic ancestries, i.e., the SNP effects are assumed the same regardless of ancestry, which is the homogeneous weighting PRS. The second model allowed for some heterogeneity by assuming that the SNP effects are sometimes different by ancestry. When simulating heterogeneous weighting PRS settings, we selected the top 10 most frequent SNPs in each a₂ and a₃ ancestries and set different effect sizes for these selected SNPs. Specifically, we set the effect sizes as 1.5 for the selected SNPs in a₂ (African) ancestry and 2 for a₃ (Native American) ancestry.

Generating the genetic ancestry confounder U_G

We generated an ancestry-dependent genetic variable potentially confounding the PRS-outcome association. Details are provided in Supplementary Table 1. Briefly, we considered a few types of unobserved confounder, namely conf-PRS1, conf-PRS2, with computation generally following the same procedure used to generate PRS (i.e., sampling of local ancestry followed by sampling of alleles). Each confounder type was used in separate simulations. For benchmarking, we also used in simulations a variable representing a genetic PC (conf-pc*), which is observed, as a confounder.

For the first genetic confounder (conf-PRS1), we assume the generated genetic variants shared the same local ancestry interval as the variants that computed the main PRS of interest. Based on the local ancestry interval, we then used ancestry-specific allele frequencies of 100 randomly selected variants from the UKBB-ICBP SBP GWAS to generate genetic data, which is the same procedure for generating observed PRS. The only difference is that the weighting parameters of computing conf-PRS1 are generated from a standard normal distribution.

A second type of genetic confounder (conf-PRS2) is generated based on a selection of SNPs from the MVP SBP GWAS. These SNP were selected to be enriched in frequency (MAF > 0.4) in the African ancestry population while having MAF<0.1 in the European ancestral population, as estimated by GAFA. Notably, the distribution of Conf-PRS2 is quite different from that of the observed PRS, leading to strong unknown genetic confounding effect in the simulation settings.

The third continuous confounder, conf-pc*, intends to mimic a scenario where the true confounding is based on only one genetic principal component score. To guide the construction of this variable we randomly select 100 SNPs from MVP GWAS and generate the weighting parameters from standard Normal distribution. To emphasize, conf-pc* is not calculated via a principal component analysis procedure.

Generating outcomes via a data generating model

We generate a continuous outcome from a linear model summing the effect of the PRS of interest and the genetic confounder through the following model:

Where the genetic confounder is standardized in each simulated dataset so that effect sizes are comparable across simulation settings. We set the PRS effect β₁ to 1.5 and varied the effect of the genetic confounder to be γ = 0.5, 1, 1.5. The intercept was set to β₀ = 1 across all simulations. Finally, the errors were sampled from a Normal distribution with ε_i & N(0,1). We sampled N = 10,000 observations in each simulation repetition and repeated each simulation setting 1000 times.

Estimation of PRS effect in simulations via a working association model

We compare the estimation performance of β₁ via seven working association model, M1-M7. In all models we evaluate the estimated as an estimator of β₁, i.e., as Also, all association models use homogeneous weighting PRS.

In M1, we estimate PRS association without adjusting for other covariates (we named it as “none” in our results and figures). The proposed ePRS-based approaches are provided in M2 and M3, which adjust for global and local ePRS with the corresponding rPRS. M4-M7 models again estimate the effect of the PRS, and further adjust for different ancestry-related measures. M4 uses conf-pc*, which is a benchmark when the true confounding factor is simulated as confpc*. M5 adjusts for the global ancestry proportion (“gaProp” in our results and Figures). M6 and M7 use 10 and 20 PCs, respectively, where the PCs were calculated based on the genetic dataset used for computing the PRS.

The TOPMed dataset

The TOPMed project aggregates individual-level data from multiple parent studies. Supplementary Note 5 describes TOPMed parent studies contributing to the analyses.

Whole-genome sequencing

We used genetic data from whole genome sequencing via the TOPMed program (32) freeze 8 released. Information about genome sequencing, allele calling, and quality control in TOPMed is publicly available in https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-methods-freeze-8. The TOPMed Data Coordinating Center constructed a kinship matrix estimating recent genetic relatedness, the corresponding sparse kinship matrix, where values were set to zero when the genetic relationship was estimated to be more distant than 4^thdegree relatedness, as well as providing genetic principal components (PCs), using the PC-Relate algorithm (33). In the TOPMed data analysis, we adjusted for 11 genetic PCs in the standard PRS model (except the VTE analysis, in which we adjusted seven genetic PCs according to the previously published paper (34)).

Genetic ancestry inference in TOPMed

Ancestry inference was performed by the TOPMed Informatics Research Center (IRC). First, local ancestry was inferred using RFMix (25), with default parameter settings except the following option: --node-size=5. Then, global ancestry was computed as for each participant as a weighted average of the ancestries in inferred local ancestry intervals. The reference panel used was the Human Genome Diversity Panel (HGDP) downloaded from the Stanford HGDP website http://hagsc.org/hgdp/files.html. Genomic coordinates were lifted over from genome build 37 to build 38. The 53 HGDP populations were merged into 7 super-populations: Sub-Saharan Africa, Central and South Asia, East Asia, Europe, Native America, Oceania, Middle East. Local ancestry inference was performed in two versions. First, for samples available in TOPMed freeze 6, RFMix V1 was used, and local ancestry was inferred for the autosomes only. Later, for samples participating only in freeze 8 (but not in freeze 6), and for the X-chromosome, local ancestry inference was performed using RFMix V2. Global ancestry proportions of an individual were defined as the proportions of the inferred ancestries (accounting for interval lengths) of each of the ancestries, genome-wide. Because the levels of Oceania ancestry were very low in the sample, we did not use it, and instead rescaled, for each person, the other ancestries so that they some to 1. So, if based on the RFMix global ancestry proportions we had is the estimated proportion of an ancestry for an individual, we summed the non-oceania proportions for each individual to get . The scaled proportions were set to etc.

Computation of ancestry-specific allele frequencies

Ancestry-specific allele frequencies were computed for all common variants as we previously described (31), using the GAFA algorithm (18) that uses global genetic ancestries to deconvolute ancestry-specific allele frequencies. We also computed ancestry-specific allele frequencies using the LAFA algorithm, i.e. using local genetic ancestry patterns at each variant location, however the results are similar, so we chose to move forward with the GAFA-estimated frequencies.

Harmonization of self-reported race/ethnicity in TOPMed

Most of the participating studies recruited individuals from pre-defined race/ethnicity categories, which were not always the same. For example, CARDIA uses the descriptor “Black” while JHS uses “African American”. Here, we used the race/ethnicity harmonized categorized prepared by the Data Coordinating Center of TOPMed as part of a phenotype harmonization effort (35), and as descriptors we use White (non-Hispanic White or European American), Black (non-Hispanic Black or African American), Asian (Asian or Chinese American), and Hispanic/Latino. We did not use genetic ancestry to categorize race/ethnic categories, as they are not biological and reflect sociopolitical and demographic patterns, though the distribution of genetic ancestry patterns generally differs by groups defined by race and ethnicity

Phenotype harmonization

We used 5 continuous phenotypes, SBP, DBP, HDL, LDL, and BMI, all harmonized by the TOPMed DCC (35). To account for medication use, SBP and DBP values were increased in antihypertensive medication users by 15 and 10mmHg, respectively. For LDL, the value was adjusted via dividing by 0.7 if the individual took lipid-lowering medication.

Two binary variables used were VTE and OSA. VTE was also harmonized by the TOPMed DCC, as reported in the Seyerle et al. 2023 (34). We harmonized OSA as follows. COPDGene control participants (i.e., individuals without COPD) self-reported doctor-diagnosed OSA. Other participating studies used home sleep apnea testing device to measure the Apnea Hypopnea Index (AHI; measured in ARIC, CARDIA, CHS, and FHS as part of the sleep heart health study (36), and in MESA (37) and JHS (38)), or the Respiratory Event Index (REI; measured in HCHS/SOL (39)). For these cases we categorized OSA based on AHI/REI ≥15 (i.e., moderat or more severe OSA). Lower AHI/REI values were categorized as no OSA.

Selecting SNPs and effect sizes for PRS for the studied traits

We used summary statistics from published GWAS to develop PRS (and ePRS) for each considered trait. These are reported in Supplementary Table 7, which includes information about accessing these summary statistics, the sample sizes of each GWAS population, and the number of variants used to compute PRSs. Briefly, for most traits we used summary statistics from MVP GWAS (40–43), which are multi-population GWAS. For BMI, we used the GWAS summary statistics from the Genetic Investigation of Anthropometric Traits (GIANT) consortium meta-analyzed with a UK Biobank (UKBB) BMI GWAS (44). The GIANT + UKBB study is mostly based on European ancestry. For each trait and its PRS, in primary analysis we focused on independent SNPs to allow for calculating the variance of PRS. Thus, we performed clumping using plink (45). The p-value thresholding parameter is set as the genome-wide significance threshold (5□10⁻⁸). The clumping parameter was set to R²=0.1, and the distance was set to 1000 kb. In secondary analysis we also developed PRSs and corresponding ePRSs based on the same GWAS summary statistics using LDPred2 (19). These are secondary analyses because we currently cannot compute the PRS variance when using highly correlated SNPs in the PRS. Information about these PRSs and the association analyses is provided in Supplementary Note 4.

PRS-outcome association analyses

PRS associations with continuous traits were estimated via linear mixed model, and with binary traits via logistic mixed models. All models used sparse kinship matrix to model a random effect to account for relatedness. Standard PRS model were adjusted for (a) 11 genetic PCs or (b) global ancestry proportions. EPRS models estimated the global or local rPRS association while adjusting for the corresponding ePRS. Other covariates were often trait-specific: BMI PRS associations were adjusted for sex, age, and squared age term, while SBP, DBP, HDL, and LDL associations were further adjusted for BMI. Logistic mixed model analysis of VTE used a similar approach to that in the published VTE GWAS (34): we adjusted sex, age, and lnratio (log of the ratio between the size of the case stratum and the matched control; to address the differences of case-control matched ratio). Because the model did not converge, we did not include “sample set” as a covariate. OSA analysis adjusted for sex, age, BMI, and the square of the BMI term as fixed effect covariates. All analyses were conducted via the R package GENESIS (version 2.32.0; (46)) with the fitNullModel function.

Assessment of qPRS as an equitable metric for genetic disease risk

We assessed the use of qPRSs for ranking genetic liability quantified by PRS, instead of using the conventional PRSs which are highly affected by ancestral makeup. To this end, we visualized patterns of trait values and qPRS percentiles. For each trait, we stratified individuals to 100 bins defined by values of PRS, rPRS, and qPRS (each separately), and computed the mean value of the trait or proportion of the binary outcome, and visualized these as scatterplots. To demonstrate the impact of ancestral makeup on the PRS metric ranking, we colored each point in the scatterplot by the proportion of African American individuals in the strata (or bin). We chose this population group because the sample is sufficiently large.

Data Availability

TOPMed freeze 8 WGS data and harmonized BP and lipid phenotypes are available by application to dbGaP according to the study specific accessions: Amish: phs000956, ARIC: phs001211, CARDIA: phs001612, CFS: phs000954, CHS: phs001368, COPDGene: phs000951, FHS: phs000974, GENOA: phs001345, HCHS/SOL: phs001395, HVH: phs000993, JHS: phs000964, Mayo VTE: phs001402, MESA: phs001211, WHI: phs001237. Summary statistics from MVP GWAS are available from dbGaP by application to study accession phs001672. Summary statistics from GIANT + UKBB GWAS were downloaded from https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files. Data needed to construct the reported PRSs in this study include variants, alleles, and weights for each of the PRS are deposited on fighsare, and will be deposited on the PGS catalog. A dataset with ancestry-specific allele frequencies computed using GAFA on the TOPMed dataset for Europe, Africa, Middle East, East Asia, South Asia, and America ancestries for HapMap3 variants, which are recommended for use by the LDPred2 software, are available on the figshare repository: https://figshare.com/articles/dataset/ePRS_project_summary_statistics_and_ancestry-specific_allele_frequency/25336294

https://figshare.com/articles/dataset/ePRS_project_summary_statistics_and_ancestry-specific_allele_frequency/25336294

https://github.com/Gene-Huang/Expected_PRS

Data availability

TOPMed freeze 8 WGS data and harmonized BP and lipid phenotypes are available by application to dbGaP according to the study specific accessions: Amish: “phs000956”, ARIC: “phs001211”, CARDIA: “phs001612”, CFS: “phs000954”, CHS: “phs001368”, COPDGene: “phs000951”, FHS: “phs000974”, GENOA: “phs001345”, HCHS/SOL: “phs001395”, HVH: “phs000993”, JHS: “phs000964”, Mayo VTE: “phs001402”, MESA: “phs001211”, WHI: “phs001237”. Summary statistics from MVP GWAS are available from dbGaP by application to study accession “phs001672”. Summary statistics from GIANT + UKBB GWAS were downloaded from https://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files. Data needed to construct the reported PRSs in this study include variants, alleles, and weights for each of the PRS are deposited on fighsare, and will be deposited on the PGS catalog. A dataset with ancestry-specific allele frequencies computed using GAFA on the TOPMed dataset for Europe, Africa, Middle East, East Asia, South Asia, and America ancestries for HapMap3 variants, which are recommended for use by the LDPred2 software, are available on the figshare repository: https://figshare.com/articles/dataset/ePRS_project_summary_statistics_and_ancestry-specific_allele_frequency/25336294

Code availability

R codes used for simulation studies and for constructing ePRSs are available on the GitHub repository: https://github.com/Gene-Huang/Expected_PRS

Author contributions

Y-JH and TS conceptualized the method. Y-JH, NK, MOG, and TS brainstormed the method development and application. Y-JH performed simulation studies. Y-JH, NK, and BWS performed data analysis related to PRS development, phenotype harmonization, and association analyses. JW and SZ performed local and global ancestry inference in TOPMed. AS and CL performed phenotype harmonization on behalf of TOPMed Data Coordinating Center. Y-JH and TS drafted the manuscript. PSdV, HC, Y-IM, MS, GMP, LL, XG, JCB, JAB, LMR, JAS, WZ, JIR, SSR, SR, MF, RK, NF, DL, ACM, EB, NLS, CK, and BMP contributed to design and data curation in TOPMed studies they represent. NK, MOG, BWS, JW, AS, CL, PSdV, HC, Y-IM, MS, GMP, LL, XG, JCB, JAB, LMR, JAS, WZ, JIR, SSR, SR, MF, RK, NF, DL, ACM, EB, NLS, CK, BMP, and SZ critically reviewed the manuscript.

Competing interests

BP serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson i Johnson. LMR and SSR are consultants to the TOPMed Administrative Coordinating Center (through Westat®).

Acknowledgements

This work is supported by National Heart, Lung, and Blood Institute (NHLBI) grant R01HL161012 and National Aging Institute grant R01AG080598 grant to TS, and National Human Genome Research Institute (NHGRI) grant R56HG013163. SZ is supported by NHGRI grant R01HG011031. NLS is supported by NHLBI grants R01HL139553 and R01HL154385. GMP is support by R01HL142711 from NHLBI. This analysis was approved by the Beth Israel Deaconess Medical Center Committee on Clinical Investigations, protocol #2023P000279, and by the Mass General Brigham IRB, protocol #2021P001928.

References

1.↵
Choi SW, Mak TS-H, O’Reilly PF. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020 Sep;15(9):2759–72.
OpenUrl PubMed
2.↵
Ma Y, Zhou X. Genetic prediction of complex traits with polygenic scores: a statistical review. Trends Genet. 2021 Nov;37(11):995–1011.
OpenUrl CrossRef
3.↵
Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018 Sep;50(9):1219–24.
OpenUrl CrossRef PubMed
4.↵
Khera AV, Chaffin M, Zekavat SM, Collins RL, Roselli C, Natarajan P, et al. Whole-Genome Sequencing to Characterize Monogenic and Polygenic Contributions in Patients Hospitalized With Early-Onset Myocardial Infarction. Circulation. 2019 Mar 26;139(13):1593–602.
OpenUrl CrossRef PubMed
5.↵
Konuma T, Okada Y. Statistical genetics and polygenic risk score for precision medicine. Inflamm Regen. 2021 Jun 17;41(1):18.
OpenUrl
6.↵
Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016 Oct 13;538(7624):161–4.
OpenUrl CrossRef PubMed
7.↵
Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat Commun. 2019 Jul 25;10(1):3328.
OpenUrl PubMed
8.↵
Ruan Y, Lin Y-F, Feng Y-CA, Chen C-Y, Lam M, Guo Z, et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet. 2022 May 5;54(5):573–80.
OpenUrl
9.↵
Zhao Z, Fritsche LG, Smith JA, Mukherjee B, Lee S. The construction of cross-population polygenic risk scores using transfer learning. Am J Hum Genet. 2022 Nov 3;109(11):1998–2008.
OpenUrl
10.↵
Wang Y, Tsuo K, Kanai M, Neale BM, Martin AR. Challenges and opportunities for developing more generalizable polygenic risk scores. Annu Rev Biomed Data Sci. 2022 Aug 10;5:293–320.
OpenUrl
11.↵
Majara L, Kalungi A, Koen N, Tsuo K, Wang Y, Gupta R, et al. Low and differential polygenic score generalizability among African populations due largely to genetic diversity. HGG Adv. 2023 Apr 13;4(2):100184.
OpenUrl
12.↵
Wang M, Menon R, Mishra S, Patel AP, Chaffin M, Tanneeru D, et al. Validation of a Genome-Wide Polygenic Score for Coronary Artery Disease in South Asians. J Am Coll Cardiol. 2020 Aug 11;76(6):703–14.
OpenUrl FREE Full Text
13.↵
Ge T, Irvin MR, Patki A, Srinivasasainagendra V, Lin Y-F, Tiwari HK, et al. Development and validation of a trans-ancestry polygenic risk score for type 2 diabetes in diverse populations. Genome Med. 2022 Jun 29;14(1):70.
OpenUrl
14.↵
Ma J, Amos CI. Principal components analysis of population admixture. PLoS ONE. 2012 Jul 9;7(7):e40115.
OpenUrl CrossRef PubMed
15.↵
Gaspar HA, Breen G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC Bioinformatics. 2019 Mar 7;20(1):116.
OpenUrl CrossRef
16.↵
Elhaik E. Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Sci Rep. 2022 Aug 29;12(1):14683.
OpenUrl
17.↵
Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, et al. Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet. 2018 Oct;50(10):1412–25.
OpenUrl CrossRef PubMed
18.↵
Granot-Hershkovitz E, Sun Q, Argos M, Zhou H, Lin X, Browning SR, et al. AFA: Ancestry- specific allele frequency estimation in admixed populations: The Hispanic Community Health Study/Study of Latinos. HGG Adv. 2022 Apr 14;3(2):100096.
OpenUrl
19.↵
Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: better, faster, stronger. Bioinformatics. 2021 Apr 1;36(22–23):5424–31.
OpenUrl
20.↵
Naret O, Kutalik Z, Hodel F, Xu ZM, Marques-Vidal P, Fellay J. Improving polygenic prediction with genetically inferred ancestry. HGG Adv. 2022 Jul 14;3(3):100109.
OpenUrl
21.↵
Bonham VL, Green ED, Pérez-Stable EJ. Examining how race, ethnicity, and ancestry data are used in biomedical research. JAMA. 2018 Oct 16;320(15):1533–4.
OpenUrl CrossRef PubMed
22.↵
Khan AT, Gogarten SM, McHugh CP, Stilp AM, Sofer T, Bowers ML, et al. Recommendations on the use and reporting of race, ethnicity, and ancestry in genetic research: Experiences from the NHLBI TOPMed program. Cell Genomics. 2022 Aug 10;2(8).
23.↵
Borrell LN, Elhawary JR, Fuentes-Afflick E, Witonsky J, Bhakta N, Wu AHB, et al. Race and Genetic Ancestry in Medicine - A Time for Reckoning with Racism. N Engl J Med. 2021 Feb 4;384(5):474–80.
OpenUrl CrossRef PubMed
24.↵
Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, et al. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 2012 May 15;28(10):1359–67.
OpenUrl CrossRef PubMed Web of Science
25.↵
Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 2013 Aug 8;93(2):278–88.
OpenUrl CrossRef PubMed
26.↵
Salter-Townshend M, Myers S. Fine-Scale Inference of Ancestry Segments Without Prior Knowledge of Admixing Groups. Genetics. 2019 Jul;212(3):869–89.
OpenUrl Abstract/FREE Full Text
27.↵
Schubert R, Andaleon A, Wheeler HE. Comparing local ancestry inference models in populations of two- and three-way admixture. PeerJ. 2020 Oct 2;8:e10090.
OpenUrl
28.↵
Uren C, Hoal EG, Möller M. Putting RFMix and ADMIXTURE to the test in a complex admixed population. BMC Genet. 2020 Apr 7;21(1):40.
OpenUrl
29.↵
Ding Y, Hou K, Xu Z, Pimplaskar A, Petter E, Boulier K, et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature. 2023 Jun;618(7966):774–81.
OpenUrl
30.↵
Choi SW, O’Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience. 2019 Jul 1;8(7).
31.↵
Kurniansyah N, Goodman MO, Khan AT, Wang J, Feofanova E, Bis JC, et al. Evaluating the use of blood pressure polygenic risk scores across race/ethnic background groups. Nat Commun. 2023 Jun 2;14(1):3202.
OpenUrl
32.↵
Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590(7845):290–9.
OpenUrl CrossRef PubMed
33.↵
Conomos MP, Reiner AP, Weir BS, Thornton TA. Model-free Estimation of Recent Genetic Relatedness. Am J Hum Genet. 2016 Jan 7;98(1):127–48.
OpenUrl CrossRef PubMed
34.↵
Seyerle AA, Laurie CA, Coombes BJ, Jain D, Conomos MP, Brody J, et al. Whole Genome Analysis of Venous Thromboembolism: the Trans-Omics for Precision Medicine Program. Circ Genom Precis Med. 2023 Apr;16(2):e003532.
OpenUrl
35.↵
Stilp AM, Emery LS, Broome JG, Buth EJ, Khan AT, Laurie CA, et al. A System for Phenotype Harmonization in the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program. Am J Epidemiol. 2021 Apr 16;
36.↵
Quan SF, Howard BV, Iber C, Kiley JP, Nieto FJ, O’Connor GT, et al. The Sleep Heart Health Study: design, rationale, and methods. Sleep. 1997 Dec;20(12):1077–85.
OpenUrl CrossRef PubMed Web of Science
37.↵
Chen X, Wang R, Zee P, Lutsey PL, Javaheri S, Alcántara C, et al. Racial/Ethnic Differences in Sleep Disturbances: The Multi-Ethnic Study of Atherosclerosis (MESA). Sleep. 2015 Jun 1;38(6):877–88.
OpenUrl CrossRef PubMed
38.↵
Johnson DA, Guo N, Rueschman M, Wang R, Wilson JG, Redline S. Prevalence and correlates of obstructive sleep apnea among African Americans: the Jackson Heart Sleep Study. Sleep. 2018 Oct 1;41(10).
39.↵
Redline S, Sotres-Alvarez D, Loredo J, Hall M, Patel SR, Ramos A, et al. Sleep-disordered breathing in Hispanic/Latino individuals of diverse backgrounds. The Hispanic Community Health Study/Study of Latinos. Am J Respir Crit Care Med. 2014 Feb 1;189(3):335–44.
OpenUrl
40.↵
Giri A, Hellwege JN, Keaton JM, Park J, Qiu C, Warren HR, et al. Trans-ethnic association study of blood pressure determinants in over 750,000 individuals. Nat Genet. 2019 Jan;51(1):51–62.
OpenUrl CrossRef PubMed
41.
Klarin D, Damrauer SM, Cho K, Sun YV, Teslovich TM, Honerlaw J, et al. Genetics of blood lipids among ∼300,000 multi-ethnic participants of the Million Veteran Program. Nat Genet. 2018 Nov;50(11):1514–23.
OpenUrl CrossRef PubMed
42.
Klarin D, Busenkell E, Judy R, Lynch J, Levin M, Haessler J, et al. Genome-wide association analysis of venous thromboembolism identifies new risk loci and genetic overlap with arterial vascular disease. Nat Genet. 2019 Nov 1;51(11):1574–9.
OpenUrl
43.↵
Sofer T, Kurniansyah N, Murray M, Ho Y-L, Abner E, Esko T, et al. Genome-wide association study of obstructive sleep apnoea in the Million Veteran Program uncovers genetic heterogeneity by sex. EBioMedicine. 2023 Apr;90:104536.
OpenUrl
44.↵
Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum Mol Genet. 2018 Oct 15;27(20):3641–9.
OpenUrl CrossRef PubMed
45.↵
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015 Feb 25;4:7.
OpenUrl CrossRef PubMed
46.↵
Gogarten SM, Sofer T, Chen H, Yu C, Brody JA, Thornton TA, et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics. 2019 Dec 15;35(24):5346–8.
OpenUrl CrossRef PubMed

View the discussion thread.

Posted March 06, 2024.

Download PDF

Supplementary Material

Data/Code

Citation Tools

Subject Area

Genetic and Genomic Medicine

Subject Areas

All Articles

Addiction Medicine (380)
Allergy and Immunology (695)
Anesthesia (186)
Cardiovascular Medicine (2809)
Dentistry and Oral Medicine (324)
Dermatology (241)
Emergency Medicine (424)
Endocrinology (including Diabetes Mellitus and Metabolic Disease) (999)
Epidemiology (12499)
Forensic Medicine (10)
Gastroenterology (796)
Genetic and Genomic Medicine (4364)
Geriatric Medicine (398)
Health Economics (711)
Health Informatics (2814)
Health Policy (1042)
Health Systems and Quality Improvement (1038)
Hematology (372)
HIV/AIDS (888)
Infectious Diseases (except HIV/AIDS) (13922)
Intensive Care and Critical Care Medicine (825)
Medical Education (411)
Medical Ethics (113)
Nephrology (460)
Neurology (4132)
Nursing (219)
Nutrition (613)
Obstetrics and Gynecology (773)
Occupational and Environmental Health (720)
Oncology (2178)
Ophthalmology (616)
Orthopedics (253)
Otolaryngology (316)
Pain Medicine (260)
Palliative Medicine (80)
Pathology (482)
Pediatrics (1166)
Pharmacology and Therapeutics (486)
Primary Care Research (480)
Psychiatry and Clinical Psychology (3619)
Public and Global Health (6721)
Radiology and Imaging (1475)
Rehabilitation Medicine and Physical Therapy (858)
Respiratory Medicine (893)
Rheumatology (427)
Sexual and Reproductive Health (431)
Sports Medicine (359)
Surgery (468)
Toxicology (57)
Transplantation (197)
Urology (173)

[1] 1.↵
Choi SW, Mak TS-H, O’Reilly PF. Tutorial: a guide to performing polygenic risk score analyses. Nat Protoc. 2020 Sep;15(9):2759–72.
OpenUrl PubMed

[2] 2.↵
Ma Y, Zhou X. Genetic prediction of complex traits with polygenic scores: a statistical review. Trends Genet. 2021 Nov;37(11):995–1011.
OpenUrl CrossRef

[3] 3.↵
Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat Genet. 2018 Sep;50(9):1219–24.
OpenUrl CrossRef PubMed

[4] 4.↵
Khera AV, Chaffin M, Zekavat SM, Collins RL, Roselli C, Natarajan P, et al. Whole-Genome Sequencing to Characterize Monogenic and Polygenic Contributions in Patients Hospitalized With Early-Onset Myocardial Infarction. Circulation. 2019 Mar 26;139(13):1593–602.
OpenUrl CrossRef PubMed

[5] 5.↵
Konuma T, Okada Y. Statistical genetics and polygenic risk score for precision medicine. Inflamm Regen. 2021 Jun 17;41(1):18.
OpenUrl

[6] 6.↵
Popejoy AB, Fullerton SM. Genomics is failing on diversity. Nature. 2016 Oct 13;538(7624):161–4.
OpenUrl CrossRef PubMed

[7] 7.↵
Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat Commun. 2019 Jul 25;10(1):3328.
OpenUrl PubMed

[8] 8.↵
Ruan Y, Lin Y-F, Feng Y-CA, Chen C-Y, Lam M, Guo Z, et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet. 2022 May 5;54(5):573–80.
OpenUrl

[9] 9.↵
Zhao Z, Fritsche LG, Smith JA, Mukherjee B, Lee S. The construction of cross-population polygenic risk scores using transfer learning. Am J Hum Genet. 2022 Nov 3;109(11):1998–2008.
OpenUrl

[10] 10.↵
Wang Y, Tsuo K, Kanai M, Neale BM, Martin AR. Challenges and opportunities for developing more generalizable polygenic risk scores. Annu Rev Biomed Data Sci. 2022 Aug 10;5:293–320.
OpenUrl

[11] 11.↵
Majara L, Kalungi A, Koen N, Tsuo K, Wang Y, Gupta R, et al. Low and differential polygenic score generalizability among African populations due largely to genetic diversity. HGG Adv. 2023 Apr 13;4(2):100184.
OpenUrl

[12] 12.↵
Wang M, Menon R, Mishra S, Patel AP, Chaffin M, Tanneeru D, et al. Validation of a Genome-Wide Polygenic Score for Coronary Artery Disease in South Asians. J Am Coll Cardiol. 2020 Aug 11;76(6):703–14.
OpenUrl FREE Full Text

[13] 13.↵
Ge T, Irvin MR, Patki A, Srinivasasainagendra V, Lin Y-F, Tiwari HK, et al. Development and validation of a trans-ancestry polygenic risk score for type 2 diabetes in diverse populations. Genome Med. 2022 Jun 29;14(1):70.
OpenUrl

[14] 14.↵
Ma J, Amos CI. Principal components analysis of population admixture. PLoS ONE. 2012 Jul 9;7(7):e40115.
OpenUrl CrossRef PubMed

[15] 15.↵
Gaspar HA, Breen G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC Bioinformatics. 2019 Mar 7;20(1):116.
OpenUrl CrossRef

[16] 16.↵
Elhaik E. Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated. Sci Rep. 2022 Aug 29;12(1):14683.
OpenUrl

[17] 17.↵
Evangelou E, Warren HR, Mosen-Ansorena D, Mifsud B, Pazoki R, Gao H, et al. Genetic analysis of over 1 million people identifies 535 new loci associated with blood pressure traits. Nat Genet. 2018 Oct;50(10):1412–25.
OpenUrl CrossRef PubMed

[18] 18.↵
Granot-Hershkovitz E, Sun Q, Argos M, Zhou H, Lin X, Browning SR, et al. AFA: Ancestry- specific allele frequency estimation in admixed populations: The Hispanic Community Health Study/Study of Latinos. HGG Adv. 2022 Apr 14;3(2):100096.
OpenUrl

[19] 19.↵
Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: better, faster, stronger. Bioinformatics. 2021 Apr 1;36(22–23):5424–31.
OpenUrl

[20] 20.↵
Naret O, Kutalik Z, Hodel F, Xu ZM, Marques-Vidal P, Fellay J. Improving polygenic prediction with genetically inferred ancestry. HGG Adv. 2022 Jul 14;3(3):100109.
OpenUrl

[21] 21.↵
Bonham VL, Green ED, Pérez-Stable EJ. Examining how race, ethnicity, and ancestry data are used in biomedical research. JAMA. 2018 Oct 16;320(15):1533–4.
OpenUrl CrossRef PubMed

[22] 22.↵
Khan AT, Gogarten SM, McHugh CP, Stilp AM, Sofer T, Bowers ML, et al. Recommendations on the use and reporting of race, ethnicity, and ancestry in genetic research: Experiences from the NHLBI TOPMed program. Cell Genomics. 2022 Aug 10;2(8).

[23] 23.↵
Borrell LN, Elhawary JR, Fuentes-Afflick E, Witonsky J, Bhakta N, Wu AHB, et al. Race and Genetic Ancestry in Medicine - A Time for Reckoning with Racism. N Engl J Med. 2021 Feb 4;384(5):474–80.
OpenUrl CrossRef PubMed

[24] 24.↵
Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, et al. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 2012 May 15;28(10):1359–67.
OpenUrl CrossRef PubMed Web of Science

[25] 25.↵
Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. Am J Hum Genet. 2013 Aug 8;93(2):278–88.
OpenUrl CrossRef PubMed

[26] 26.↵
Salter-Townshend M, Myers S. Fine-Scale Inference of Ancestry Segments Without Prior Knowledge of Admixing Groups. Genetics. 2019 Jul;212(3):869–89.
OpenUrl Abstract/FREE Full Text

[27] 27.↵
Schubert R, Andaleon A, Wheeler HE. Comparing local ancestry inference models in populations of two- and three-way admixture. PeerJ. 2020 Oct 2;8:e10090.
OpenUrl

[28] 28.↵
Uren C, Hoal EG, Möller M. Putting RFMix and ADMIXTURE to the test in a complex admixed population. BMC Genet. 2020 Apr 7;21(1):40.
OpenUrl

[29] 29.↵
Ding Y, Hou K, Xu Z, Pimplaskar A, Petter E, Boulier K, et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature. 2023 Jun;618(7966):774–81.
OpenUrl

[30] 30.↵
Choi SW, O’Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience. 2019 Jul 1;8(7).

[31] 31.↵
Kurniansyah N, Goodman MO, Khan AT, Wang J, Feofanova E, Bis JC, et al. Evaluating the use of blood pressure polygenic risk scores across race/ethnic background groups. Nat Commun. 2023 Jun 2;14(1):3202.
OpenUrl

[32] 32.↵
Taliun D, Harris DN, Kessler MD, Carlson J, Szpiech ZA, Torres R, et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature. 2021;590(7845):290–9.
OpenUrl CrossRef PubMed

[33] 33.↵
Conomos MP, Reiner AP, Weir BS, Thornton TA. Model-free Estimation of Recent Genetic Relatedness. Am J Hum Genet. 2016 Jan 7;98(1):127–48.
OpenUrl CrossRef PubMed

[34] 34.↵
Seyerle AA, Laurie CA, Coombes BJ, Jain D, Conomos MP, Brody J, et al. Whole Genome Analysis of Venous Thromboembolism: the Trans-Omics for Precision Medicine Program. Circ Genom Precis Med. 2023 Apr;16(2):e003532.
OpenUrl

[35] 35.↵
Stilp AM, Emery LS, Broome JG, Buth EJ, Khan AT, Laurie CA, et al. A System for Phenotype Harmonization in the NHLBI Trans-Omics for Precision Medicine (TOPMed) Program. Am J Epidemiol. 2021 Apr 16;

[36] 36.↵
Quan SF, Howard BV, Iber C, Kiley JP, Nieto FJ, O’Connor GT, et al. The Sleep Heart Health Study: design, rationale, and methods. Sleep. 1997 Dec;20(12):1077–85.
OpenUrl CrossRef PubMed Web of Science

[37] 37.↵
Chen X, Wang R, Zee P, Lutsey PL, Javaheri S, Alcántara C, et al. Racial/Ethnic Differences in Sleep Disturbances: The Multi-Ethnic Study of Atherosclerosis (MESA). Sleep. 2015 Jun 1;38(6):877–88.
OpenUrl CrossRef PubMed

[38] 38.↵
Johnson DA, Guo N, Rueschman M, Wang R, Wilson JG, Redline S. Prevalence and correlates of obstructive sleep apnea among African Americans: the Jackson Heart Sleep Study. Sleep. 2018 Oct 1;41(10).

[39] 39.↵
Redline S, Sotres-Alvarez D, Loredo J, Hall M, Patel SR, Ramos A, et al. Sleep-disordered breathing in Hispanic/Latino individuals of diverse backgrounds. The Hispanic Community Health Study/Study of Latinos. Am J Respir Crit Care Med. 2014 Feb 1;189(3):335–44.
OpenUrl

[40] 40.↵
Giri A, Hellwege JN, Keaton JM, Park J, Qiu C, Warren HR, et al. Trans-ethnic association study of blood pressure determinants in over 750,000 individuals. Nat Genet. 2019 Jan;51(1):51–62.
OpenUrl CrossRef PubMed

[41] 41.
Klarin D, Damrauer SM, Cho K, Sun YV, Teslovich TM, Honerlaw J, et al. Genetics of blood lipids among ∼300,000 multi-ethnic participants of the Million Veteran Program. Nat Genet. 2018 Nov;50(11):1514–23.
OpenUrl CrossRef PubMed

[42] 42.
Klarin D, Busenkell E, Judy R, Lynch J, Levin M, Haessler J, et al. Genome-wide association analysis of venous thromboembolism identifies new risk loci and genetic overlap with arterial vascular disease. Nat Genet. 2019 Nov 1;51(11):1574–9.
OpenUrl

[43] 43.↵
Sofer T, Kurniansyah N, Murray M, Ho Y-L, Abner E, Esko T, et al. Genome-wide association study of obstructive sleep apnoea in the Million Veteran Program uncovers genetic heterogeneity by sex. EBioMedicine. 2023 Apr;90:104536.
OpenUrl

[44] 44.↵
Yengo L, Sidorenko J, Kemper KE, Zheng Z, Wood AR, Weedon MN, et al. Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry. Hum Mol Genet. 2018 Oct 15;27(20):3641–9.
OpenUrl CrossRef PubMed

[45] 45.↵
Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015 Feb 25;4:7.
OpenUrl CrossRef PubMed

[46] 46.↵
Gogarten SM, Sofer T, Chen H, Yu C, Brody JA, Thornton TA, et al. Genetic association testing using the GENESIS R/Bioconductor package. Bioinformatics. 2019 Dec 15;35(24):5346–8.
OpenUrl CrossRef PubMed