Abstract
Polycystic ovary syndrome (PCOS) and its underlying features remain poorly understood. In this genetic and proteomic study, we expand the number of genetic loci from 19 to 29, and identify 31 associated plasma proteins. Many risk-increasing loci were associated with later age at menopause, underscoring the reproductive longevity related to a larger functional ovarian reserve. Hormonal regulation in the aetiology of this condition, through metabolic and reproductive features, was emphasised. The proteomic analysis highlighted perturbations of metabolically-related biology that are typical in women with PCOS. A PCOS polygenic risk score was associated with adverse cardio-metabolic outcomes, with differing contributions of testosterone and BMI in women and men. Finally, while oligo- and anovulatory infertility are characteristic features of PCOS, we observed no impact of PCOS susceptibility on childlessness. We suggest that PCOS susceptibility confers balanced pleiotropic influences on fertility in women, and life-long adverse metabolic consequences in both sexes.
Introduction
Polycystic ovary syndrome (PCOS) is the most common reproductive endocrinopathy in women1, with varied impacts across the lifespan. The diagnostic criteria include two of the following three features: hyperandrogenism, oligo-anovulation, and polycystic ovarian morphology (PCOM)2. PCOS is the most common cause of anovulatory infertility and is frequently associated with insulin resistance, which confers an increased risk of adverse metabolic outcomes such as type 2 diabetes (T2D)3,4. Previous large-scale genetic studies demonstrated that PCOS is a complex polygenic disorder that encompasses interactions between brain, metabolic and gonadal function5. High BMI and raised fasting insulin levels were identified as causal risk factors for PCOS, supporting the link to metabolic disease. Association at the FSHB genetic locus has highlighted the pituitary as a driver for PCOS5,6. Later age at menopause risk was also identified as causal, suggesting links to DNA damage and repair pathways in PCOS aetiology7. However, the small number of susceptibility loci identified thus far has limited exploration of the hereditary component8. There are no adequately powered prospective studies of women with PCOS beyond their reproductive years. Therefore, their long-term health outcomes remain unknown. In addition, our understanding of the genetic risk factors for PCOS on other health outcomes in women and men is not complete.
To address these limitations, we conducted a GWAS meta-analysis, which doubled the number of women with PCOS compared to previous GWAS6 (Supplementary Table 1), including data from 21,570 cases and 523,971 controls from 13 studies. We assessed the identified signals for association with a range of phenotypes encompassing three mechanisms thought to be related to PCOS: metabolic pathways, hormonal regulation including through the hypothalamic-pituitary-gonadal axis, and the oocyte/follicle complement. We also conducted a complementary proteomic-based analysis to further identify the biology related to this condition.
In addition, previous studies used genetic instruments to explore the causal links from a range of phenotypes to PCOS, but few have considered the downstream impacts of PCOS9. Many of these are likely to be a feature of “common soil” effects, a situation in which several conditions stem from the same source; in this case the adverse metabolic or hormonal background. However, there may be conditions in which PCOS has a specific, additional, adverse effect. In particular, while previous work has highlighted that variants associated with male-pattern balding increase the risk of PCOS5, whether PCOS risk variants impact disease risk in men has only been addressed to a limited extent10. Here the impact of PCOS on cardio-metabolic disease, disorders of reproductive organs (both cancer and non-cancer), and mental health are examined in both men and women.
Results
Genome-wide discovery for signals for PCOS
We identified 29 independent loci associated with PCOS (P<5×10-8) in the all-ancestries meta-analysis, of which 13 had not previously been reported5,6,11–13 (Figure 1, Supplementary Table 2). Novel GWAS signals for PCOS include a variant at FTO (rs8047587), confirming earlier findings from a candidate gene study14. The link to FTO strengthens the reported effect of increasing BMI on risk of PCOS6,15. Other novel PCOS signals have relevance to reproductive hormone pathways, such as AMH (rs732310), INHBB (rs6712151), and SHBG (rs1641518). Alongside the known European signal at FSHB (rs11031005) we report a signal at FSHR (rs13004711), replicating the association seen in Han Chinese for the first time in a majority European-ancestry GWAS11. Three of the 29 loci (INHBB, NEIL2, DENND1A) had evidence of secondary signals (within 500kb of the lead signal) (Supplementary Table 3).
We also performed a BMI-adjusted model in a subset of the cohorts (Supplementary Table 4; Supplementary Figure 1). In this adjusted analysis only the association at the FTO locus was substantially attenuated (P=0.019 after adjustment for BMI) (Supplementary Figure 2). To explore whether our GWAS findings were affected by differences in the diagnostic criteria used to assign a PCOS diagnosis, we stratified studies based on the method of case identification. There was no difference in the effects of the 29 PCOS signals by case definition (Supplementary Figure 3). Furthermore, effect sizes for each variant were comparable across individual studies (Supplementary Figure 4).
We assessed the transferability of 10 PCOS signals previously reported in East Asian women11 (Supplementary Table 5). Of these signals all but three (variants near C9orf3, INSR and SUMO1P1) had statistically significant associations in our cohort (P<0.005). The SUMO1P1 minor allele was more common and had a relatively smaller effect size (OR=1.03) with borderline evidence for association (P=0.0058) in our data. The variant at C9orf3 had a frequency in our data of approximately half that in the Chinese data, whereas the variant in INSR, coding for the insulin receptor, was more common in our data. Although rare deleterious variants in INSR cause a severe PCOS phenocopy16, previous studies have not demonstrated a genome-wide PCOS association at the INSR locus in women of European ancestry17. The relative frequency of the INSR variant suggests that there might be a gene-environment interaction that explains the different findings between these two studies or that the Han Chinese variant tags a haplotype that is not present in Europeans.
Fine-mapping and genes of interest
Fine-mapping of the identified PCOS loci was performed using a LD reference based on European ancestry women in the MyCode/DiscovEHR and, separately, using the BioVU study (Supplementary Tables 6 and 7). At seven of the loci, both analyses resolved to the same sentinel variant. At LLPH, rs117568227 was the only variant in the 95% credible set (PP>0.99) in both studies, and at YAP1, rs7925543 was the only variant (PP=0.64 in MyCode/DiscovEHR and 0.79 in BioVU). Another 10 PCOS loci resolved to 20 or fewer variants across the two analyses.
We used two complementary approaches to link potential PCOS risk genes, which we call “consensus genes”. First, we performed a literature review of all genes within 500kb of the 29 signals to identify genes with a reported link to one of four processes - i) reproductive function, ii) steroid metabolism and sex-hormone levels, iii) pathways related to metabolic syndrome, and iv) DNA damage repair, selected due to reported links between PCOS and age at menopause for which DNA damage repair is the dominant process7 (Figure 2). Second, we used the GWAS-to-Gene bioinformatic approach that leverages data on eQTLs, pQTLs, predicted deleterious variants, and variant based scoring methods to rank genes based on their causal likelihood (Supplementary Table 8)18. Findings from these two approaches were then harmonised for each PCOS signal (Supplementary Table 9).
In sixteen cases both approaches prioritised the same gene. In others, there was a strong rationale for prioritizing the literature-based gene instead of the bioinformatics-identified gene, such as SHBG over ATP1B2 at rs1641518, particularly because the variant was also genome-wide significantly associated with circulating SHBG levels (Figure 2). The rs732310 variant was assigned to the AMH gene, based on the known functions of AMH in inhibiting recruitment of ovarian follicles from the primordial follicle pool, inhibiting FSH-sensitivity of growing follicles and regulating GnRH-dependent LH pulsatility19,20. Although rs732310 shows no association with AMH levels (Figure 2), the GWAS data used for the AMH analysis was derived from normo-ovulatory women21, and may not reflect variations in AMH levels in women with PCOS, in whom the expression pattern of AMH differs from normo-ovulatory women22.
Relationship of the identified loci with other phenotypes
Several of the 29 PCOS variants had previously been associated in GWASs for age at menopause (14 variants), age at menarche (six), female testosterone levels (seven), BMI (eight), and male-pattern baldness (two) (see Supplementary Table 10 for a complete list). We annotated our signals for the 29 PCOS loci with publicly available GWAS results, focusing in particular on the relationships between these variants and a range of metabolic, reproductive and hormonal phenotypes, given the suspected relationships to PCOS. Notably, all 29 PCOS signals showed at least nominal significance with one or more metabolic, reproductive and/or hormonal trait(s), providing evidence to support the multifactorial aetiology and comorbidities (Figure 2, Supplementary Tables 11and 12).
In validation of the substantial overlap between signals for PCOS and age at menopause (ANM), eight signals showed evidence of colocalization (PP≥0.75). PCOS signals at FSHB, DENND1A, TOX3, RAD50, and MAF/MAFTRR were associated with ANM (Supplementary Table 13); conversely the reported ANM signals at FSHB, DENND1A, CASC22, BMP4, PPARG, and MAF/MAFTRR were associated with PCOS (Supplementary Table 14). At all eight colocalised signals, the PCOS risk-increasing allele conferred later ANM. Of these loci, the FSHB signal has been well described to impact other reproductive phenotypes including age at menarche23, and dizygous twinning24. Other shared loci are also related to breast cancer pathways - TOX3, CASC22 and RAD50. Interestingly, the shared PPARG locus suggests an effect of metabolic pathways independent of those related to BMI.
A number of the PCOS-associated loci also showed strong effects on hormone levels, in particular on SHBG levels (including rs1641518 near SHBG). Approximately 30% (8/29) of the PCOS risk increasing alleles had nominal association with lower SHBG levels. There were 2 loci in which PCOS risk alleles were associated with lower SHBG and higher total testosterone, suggesting a relationship driven by higher androgens, whereas 3 other loci, including SHBG and FTO had lower SHBG and lower total testosterone, suggesting an SHBG mediated effect (Figure 2). Three of the signals had evidence of colocalization with SHBG levels including the CYP3 complex, FTO and FSHB (Supplementary Table 15); notably this did not include the signal at SHBG25. The signal at FTO is likely due to the established links between an increase in BMI and decrease in SHBG levels26. The CYP3 complex metabolizes estradiol and testosterone27, with mouse knockouts showing substantially increased free testosterone levels28 and thus the CYP3 risk allele may result in decreased SHBG. The FSHB locus is associated with an increased LH, which stimulates androstenedione, and therefore testosterone29, which would lower SHBG.
There was additional evidence for the role of PCOS-associated loci impacting regulation of gonadotropins and the functional ovarian reserve. The PCOS risk-increasing allele at the FSHB locus was associated with lower levels of FSH and INHBB was nominally associated with lower FSH. Although the signal at FSHR was not associated with FSH levels in 2,913 Europeans, a variant found in Han Chinese women, which shows some evidence of LD (R2=0.3 in Europeans) with the variant reported here, was associated with lower FSH levels 30,31 (Figure 2, Supplementary Table 11). FSHR has also been recently linked to rates of twinning and FSH levels using gene-based approaches24. At twelve loci, PCOS risk alleles were also nominally associated with higher AMH levels (P<0.05), with 6 additional loci nominally significant. Nine of these twelve loci overlap with the loci related to age at natural menopause and are all associated with higher AMH levels and later age at menopause. AMH is measured clinically to indicate functional ovarian reserve, and its concentrations are strongly related to the age at which menopause occurs. Loci that are associated with both elevated AMH levels and later age at menopause imply their involvement in the establishment and preservation of the functional ovarian reserve as a fundamental element of PCOS.
Protein-based analysis
A total of 31 plasma proteins levels were phenotypically associated with ‘ovarian dysfunction’ in women defined as International Classification of Disease (ICD) 10 category E28 which includes PCOS (all P<3.4×10-5, Figure 3, Supplementary Table 16). These included recognized metabolic disease-associated proteins such as PCSK9, LDLR, FURIN, FABP1 and FABP4. Other associated proteins metabolise hydroxysteroids, retinol and lipids, including ALDH1A1 and ADH4, which may regulate the metabolic response to a high fat diet32,33. Other proteins are potential contributors to diabetes and metabolic disease, GGT134, or their complications, SORD35. Finally, there were enzymes important for biosynthesis of progesterone or testosterone, GSTA1 and GSTA336, and fertilisation and implantation, PAEP37. We used STRING to perform a combined pathway-based analysis using proteins drawn from either the protein-based approach or the GWAS associated loci (Supplementary Figure 5, Supplementary Tables 17 and 18)38. In general, the two discovery methods highlighted different biology, with only one pathway, benzaldehyde dehydrogenase activity, driven by both proteomic and GWAS findings (which included FANCC, a known DNA damage repair gene). The results did include evidence of enrichment or pathways representing androgen binding, and ovarian follicle development neuregulin receptor activity, the PCSK9-LDLR complex consistent with the lipid abnormalities of PCOS39,40.
In order to build causal pathways, we identified plasma proteins whose levels were associated with each of our PCOS GWAS signal variants, resulting in a total of 299 proteins (P<3.4×10-5, Supplementary Table 19). Of note, the PCOS signals at ERBB3, ERBB4 and ZBTB16 were associated with plasma levels of their encoded proteins, providing further support for these consensus genes. The PCOS signal at SHBG was associated with levels of TNSF12 and TNSF13, encoded by genes in the same region, and related to apoptosis and regulation of steroidogenesis 41. One PCOS signal at RAD50/IRF1 accounted for 199 protein associations. This signal lies in a gene rich region on chromosome 2 that contains a large number of immune-related genes - IRF1, IL4, IL5 and IL13 and likely has a widespread impact on the plasma proteome. Another PCOS signal, overlapping the established obesity signal at NEIL2/GATA4, was associated with 29 plasma proteins. Other associated proteins included leptin, which is higher in obesity; PPY, a regulator of food intake; and several fatty acid binding proteins, which regulate fatty acid uptake in adipose cells.
Having identified these 299 proteins, we then assessed if any showed evidence of association with a diagnosis of ICD E28 (ovarian dysfunction), based on a threshold of P<1.7×10-4 (0.05/299). This two-stage approach resulted in nine variant-protein pairs. Three proteins were associated with the PCOS signal at MAF (CDHR2, IGFBP2, CPM), one with ERBB3 (NCAM2) and five with FTO (ADM, FABP1, FABP4, LEP and SSC4D) (Supplementary Table 20). The FTO-locus related proteins are likely to be wholly explained the adiposity effect at this locus; this either as a result of the casual BMI association, or via a common soil effect. This is further highlighted by the fact that, of these proteins, FABP4 and LEP have recently been shown to be linked to aging specifically in adipose tissue42, which would suggest that changes in the levels are due to a primarily BMI effect. To understand the direction between individual plasma proteins and PCOS, we compared the variance in each trait explained by its respective PCOS signal (variance in BMI was also compared for those proteins associated with FTO) (Figure 4). The approach is based on the assumption that if a variant was associated with greater explained variance in protein levels compared to PCOS, it was unlikely that protein was affected as a result of PCOS. All five FTO associated proteins, NCAM2 associated with ERBB3, and IGFBP2 associated with MAF, appeared to be upstream (i.e. more likely to be determinants, or common soil) of PCOS. The two other proteins associated with MAF, CDHR2 and CPM, may have their levels altered as a consequence of PCOS.
Inferring causal impacts of PCOS on other co-morbidities
The impact of PCOS on fertility, metabolic disease and mental health is well known, but few studies have used a genetic approach to uncover additional comorbidities43,44. Previous genetic causal modelling, using Mendelian Randomisation (MR) approaches, have shown that aspects of metabolic syndrome traits are risk factors for PCOS6. However, those MR studies did not determine whether PCOS had an effect on broader health status. Therefore, we calculated a polygenic risk score (PRS) for PCOS comprising ∼1.1 million genetic variants to explore the likely causal effect of PCOS on a number of other outcomes.
To identify phenotypes that may share genetic influences with PCOS, we performed polygenic risk score (PRS) based analyses in the UK Biobank study (which was independent of the PCOS GWAS discovery dataset). We used the PRS-CS software to calculate a polygenic risk score for PCOS, which is a Bayesian regression framework that weights the effect size of each variant by the strength of its association (p-value) in the GWAS meta-analysis45. To facilitate interpretation, the PRS was standardised; effect sizes (beta or OR) in Table 1 are reported per 1 SD increase in the PCOS PRS. To validate the score, we confirmed that the PCOS PRS was strongly associated with PCOS in women in the UK Biobank (P=9×10-27) and that the odds of PCOS increased across increasing quintiles of PRS (Supplementary Figure 6). As expected, there was also a strong association between a higher PCOS PRS and increased BMI in both women and men (Table 1).
Since increased BMI is a risk factor for both PCOS and many of the expected metabolic comorbidities, we assessed the BMI-independent causal relationship between PCOS and metabolic outcomes in two additional analyses: 1) we included measured BMI as a covariate and 2) we generated and tested a second BMI-adjusted PRS for PCOS. The association of the BMI-adjusted PRS with baseline BMI in the UK Biobank was substantially attenuated in women and men (Table 1 and Supplementary Table 21). The PRS analyses were replicated using an additional polygenic risk score tool LDpred2, which provided consistent results (Supplementary Table 21).
Both in women and in men, a higher PCOS PRS was associated with increased risk of coronary artery disease (CAD), T2D and obesity (Table 1). It was also associated with a number of cardiometabolic risk factors: higher waist-to-hip ratio adjusted for BMI (WHRadjBMI), higher HbA1c, higher triglycerides and lower HDL cholesterol. Regarding hormonal/reproductive risk factors, higher PRS was associated with lower SHBG levels and higher free androgen index (FAI) in both sexes (Table 1). The sex-stratified BMI-adjusted model analyses provided support for all these observations.
We then tested for the presence of sex differences in the effects of the PRS on i) cardiometabolic outcomes where different effects of sex were observed for BMI, WHRadjBMI, CAD, HbA1c and HDL cholesterol; and ii) hormonal/reproductive outcomes where differential sex-related effects were observed for SHBG, total testosterone and FAI (Table 1). There has been increasing interest in the links between reproductive diseases and mental health46,47; we found only a nominal association between the PCOS PRS and depression in women after BMI adjustment (Table 1).
We used Mendelian Randomisation to provide additional evidence to determine whether highlighted associations were causal (Supplementary Tables 22 and 23). While MR might provide more robust causal inference there are a number of caveats: the available outcome data is often in men and women combined, and with only 29 variants this analysis is likely very under-powered. Most of the associations with non-reproductive phenotypes were not significant; including when BMI was the outcome, likely demonstrating that the method is underpowered. Interestingly, the MR analysis showed associations with two non-cardiometabolic outcomes - depression and asthma, with no attenuation when controlling for BMI.
Pleiotropic effects of PCOS on reproductive outcome
Consistent with the overlap between signals for PCOS and ANM, PCOS susceptibility was associated with later ANM in both PRS (Table 1) and MR analyses (Supplementary Table 23). Conversely, we also observed an apparent effect of susceptibility to later ANM on higher PCOS risk (Supplementary Table 23), indicating a bidirectional relationship. Given the importance of DNA repair as a mechanism that regulates ANM, we tested to determine whether the same pathways also contributed to PCOS. We performed MR analyses for PCOS with reported variants for ANM stratified by their likely functions on DNA repair and genes unrelated to DNA repair7. Both strata of ANM variants indicated an effect of later ANM on higher risk of PCOS (Supplementary Table 23). However, the MR estimated effect of ANM was larger using non-DNA repair ANM variants than when using DNA repair variants (P-value for the difference between estimates = 1.5×10-6, Supplementary Figure 7), which suggests a role for sex hormone-related pathways common to PCOS and ANM. The role of hormone levels as common causal factor is highlighted by the shared signal at FSHB associated with lower FSH levels, higher PCOS risk and later age at menopause.
The DNA repair-related ANM variants that are associated likely act via various DNA-repair pathways, which have differing effects on PCOS. For example, ANM variants at BRCA1 and BRCA2, genes involved in homologous recombination repair of double strand DNA breaks, are robustly associated with an earlier ANM but not with PCOS (P=0.94 and P=0.11 respectively). On the other hand, CHEK2, known to maintain DNA integrity through checkpoint control48, is associated with a later ANM, greater risk of PCOS and higher serum AMH levels in PCOS.
The PCOS PRS showed unexpected nominal associations with lower risk of childlessness, although these were not confirmed in the MR analysis. We therefore examined the impact of PCOS susceptibility on eight infertility phenotypes in a hospital-based cohort, the Copenhagen Hospital Biobank (CHB)49 (multiple testing threshold P<0.006; = P<0.05/8; Supplementary Table 24). A higher PCOS PRS was associated with increased risk of infertility in women (OR=1.03, P=0.02 after age adjustment; OR=1.04, P=0.00011 after age and BMI adjustment). The stronger association after adjustment for BMI suggests that in addition to affecting fertility via BMI, PCOS also affects fertility through other mechanisms, such as hyperandrogenism, that are independent of BMI.
Interestingly, the PCOS PRS was associated with an increased number of oocytes aspirated during IVF treatment (Beta=0.025, P=1×10-4 after age adjustment; beta=0.027, P=2×10-5 after age and BMI adjustment). In separate data from 892 women, two of the 29 loci, at ZBTB16 and SHBG, were associated with larger ovarian volume; which is both a marker of ovarian reserve linked to fertility, and a symptomatic presentation of PCOS. These findings suggest a greater available oocyte pool in PCOS.
There was also support for the hypothesis that some genetic PCOS susceptibility might exhibit a balancing pleiotropy effect on reproductive success. We assessed the links from PCOS to age at first birth50, age at last birth50, childlessness51 and number of children51 using publicly available GWAS datasets (Supplementary Table 23). There were no apparent associations with childlessness or number of children. The latter result was confirmed in the CHB data, in which there was no association between the PCOS PRS and the completed family size (P=0.07) or rates of pregnancy (P=0.25; Supplementary Table 24). However, there was a nominally significant association with later age at last birth when data were controlled for age and BMI (P=0.01). We further assessed the association between PCOS and age at last live birth in the UK Biobank in 1,003 women with PCOS and 205,849 controls. PCOS was associated with later age at live birth (Beta=0.46 years, P=0.04 after age adjustment, Beta=0.81 years, P=0.0003 after age and BMI adjustment). These age at last birth results could be explained by a longer reproductive window or by a shifting of the window, and the lack of association with childlessness would indicate that some compensatory mechanisms exist.
Discussion
This study expands the number of PCOS genome-wide significant loci from 19 to 29. The new PCOS locus at FTO, well-established for obesity, highlights the link between PCOS, metabolic syndrome and obesity. Other signals at SHBG, FSHR (associated for the first time in a European GWAS) and the CYP3 complex highlight hormonal regulation in the aetiology of PCOS. Alongside these results, we also present proteins that are associated with ovarian dysfunction. The protein associations confirm some top candidate genes at GWAS loci (ERBB3, ERBB4 and ZBTB16).
The identified genetic loci underscore the sex hormonal origins of PCOS. Annotation of these signals to consensus genes has identified signals at FSHB52 and FSHR11, highlighting the role of pituitary gonadotrophs (LH and FSH) in ovarian follicle stimulation. Other consensus genes, SHBG, INHBB, AMH and TEX41 - this last robustly associated with AMH levels – also point to hormones related to ovarian folliculogenesis, and feedback on the hypothalamic-pituitary-adrenal axis. Inhibin B is secreted by granulosa cells of the small to large antral follicles and inhibits FSH release, ensuring the growth of one dominant follicle53,54. In support, the INHBB variant was nominally, inversely associated with FSH levels. Circulating AMH levels reflect the number of small antral follicles and AMH reduces FSH sensitivity of growing follicles20. Furthermore, AMH inhibits aromatase activity at the level of a growing follicle, and increases LH-dependent gonadotropin-releasing hormone (GnRH) pulsatility at the hypothalamus55. Finally, SHBG is a binding protein for androgens, thereby regulating free and bioavailable androgen levels56. In men there is a corresponding increase in free androgen index which explains the connection between PCOS variants and male pattern balding6. In summary, PCOS risk is affected by a number of classical and well-established sex hormonal pathways.
A significant proportion of the identified PCOS loci overlap with those associated with age at menopause. In all cases, PCOS risk alleles confer later age at menopause. There are two possible mechanisms which could explain, perhaps in tandem, the links between these two phenotypes. First, several overlapping variants are related to genes linked to DNA repair mechanisms such as MSH6, CHEK2 and RAD50. The partitioned Mendelian Randomisation analysis suggested that both DNA damage repair and non-DNA damage repair (thought to be predominantly hormonal pathways) were causal for PCOS, but that the latter had a stronger influence. While age at menopause is thought to be impacted by a range of pathways linked to DNA repair, those signals shared with PCOS might be relate to more specific mechanisms, such as CHEK2, where the effect is to have DNA-damaged oocytes persist for longer 7. Thus, there may be follicles with DNA-damaged oocytes that remain in the ovary because the DNA checkpoint removal mechanism failed. In PCOS, this may reduce oocyte atresia, leading to continuous AMH expression and thereby stronger inhibition of primordial follicle recruitment (associated with later ANM) and reduced FSH-sensitivity (contributing to the polycystic ovary morphology seen in PCOS)19,20. In addition, PCOS was not associated with BRCA1 or BRCA2 variants, which appear to influence earlier age at menopause based on less functional DNA-repair mechanisms and potential loss of damaged oocytes7.
Secondly, changes in hormonal levels may increase follicle numbers as demonstrated for increased androgen levels57 and activin, the product of two INHB subunits58. It is also possible that hormone levels reduce depletion of the primordial follicle pool causing a later end to the reproductive window, as has been demonstrated for the variants causing lower FSH levels59. Moreover, the variant in FSHB was also linked to less follicle selection across the reproductive lifespan potentially leading to a greater follicle pool consistent with PCOS. In addition, increased serum AMH levels, seen in PCOS patients, reduce the rate of primordial follicle recruitment and may thereby slow follicle pool depletion, leading to later menopause19,20. In addition, observational studies suggest that women with PCOS or PCOS symptoms have children as often as asymptomatic women and have reproductive success when followed over a sufficiently long term60–63. In both our PCOS PRS and epidemiological analysis of women with PCOS in the UK Biobank, there was a suggestion of an impact of a later age of last birth in both the PRS genetic analyses and epidemiological analysis and this effect was also seen in the CHB data. This is consistent with a previous finding that a longer or shifted window of reproduction was required to achieve the same cumulative family size in women with PCOS63. The main cause of infertility in PCOS is irregular ovulation, and therefore, the relative infertility at younger ages may be balanced by improved ovulations at later ages64,65.
In summary, the new loci contain a number of risk genes expected to increase the follicle complement in PCOS66,67. This finding supports the Rotterdam diagnostic criteria for PCOS, which highlight polycystic ovarian morphology (number of growing small antral follicles), hyperandrogenism (hormone regulation) and related irregular ovulation and menses as primary etiologic features of PCOS2.
The score-based analyses stressed the link to metabolic diseases, with a number of strong associations between PCOS and clinical endpoints, mirroring observed associations68, and previous studies10,43. Unsurprisingly, a higher PCOS PRS was associated with higher BMI and increased risk of obesity in both women and men; thus, undoubtedly, much of the effect on cardio-metabolic diseases seen in the BMI unadjusted models is via the “common soil” impact of BMI. However, many of the associations remained significant in women (though substantially attenuated), and not in men after controlling for BMI. These female-specific effects, that were not BMI-related, suggest a shared causal factor between PCOS and metabolic disease. A plausible mechanism explaining this is the higher androgen levels in PCOS, which have been shown to be a risk factor for CAD25,69,70. Although testosterone and other androgens decrease to the same level as in controls after menopause, the continued lower SHBG in women with PCOS after menopause and the lasting impact of androgens during reproductive age on physiology may result in long term increased CAD risk71,72.
In addition to this, the proteins associated with reproductive dysfunction stress the links between reproductive phenotypes and the metabolic syndrome. Associations were seen with classical adiposity and metabolic proteins including leptin and furin. Other associated proteins are vital to cholesterol metabolism such as PCSK9 and the LDL receptor, which are important for both cardiovascular risk and steroidogenesis. There were also proteins that contribute to the metabolic response to a high fat diet. It is important to consider that most of the women in whom the protein-based analysis was done were assessed after the end of their reproductive window. The data again implicate the lower SHBG and higher free androgen levels in PCOS after menopause71, and potentially the sustained effects of hyperandrogenism even after the reproductive years. Thus, these results and the score-based analyses together suggest that there is an ongoing, adverse pattern of cardiometabolic health in women with a genetic risk for PCOS.
Conclusions
Here we identify genetic regions and proteins associated with PCOS. In general, the genomic loci appear to primarily implicate hormonal pathways as the causal factors for PCOS; while the proteins stress the common causal factors between PCOS and metabolic disease, particularly pathways related to increased BMI. Our findings highlight important links between PCOS and type 2 diabetes and coronary artery disease, via mechanisms that are related to and also independent of adiposity. We also expand our understanding of the factors affecting the ovarian follicle complement on the condition, including both hormonal influences and specific DNA repair mechanisms, and their role in PCOS. We also demonstrate some evidence of balanced pleiotropy conferred by PCOS genetic susceptibility that maintains the high prevalence of PCOS in the population.
Methods
Data Collection and Quality Control
Summary results of genome-wide association analysis (GWAS) using a case-control setting were provided by the studies contributing to the meta-analysis. At the study level, the analyses were adjusted for age, principal components and body mass index (BMI, only for BMI-adjusted analyses). Central quality control (QC) was performed by two independent analysts using the EasyQC pipeline73. Variant exclusion filters used included: (1) Minor allele frequency (MAF) <1%, (2) imputation quality (R2) <0.3 or info <0.4 for MACH and IMPUTE2, respectively73.
Meta-analysis
A fixed-effect, inverse-weighted-variance meta-analysis approach was used with the collected summary statistics from the individual studies. Either GWAMA74 or METAL75 were employed as meta-analysis tools. We performed meta-analyses for all ancestries combined and only European ancestry combined. These meta-analyses were carried out using two models; age-adjusted, and age and BMI-adjusted, given the association between obesity and PCOS5. Variants present in at least three strata were reported and used in further analyses.
These meta-analysis results were then combined with the previously published genome-wide meta-analysis summary statistics6 to increase the statistical power and discover further associations with PCOS status. We called this analysis the 2-strata meta-analysis. As previous research had found no substantial heterogeneity in variant discovery as a function of different diagnostic criteria6, studies with any method of case ascertainment were combined. Variants present in all strata were reported and used in the follow-up analyses. Identified variants were annotated and investigated further with regards to their biological function using FUMA76. Forest plots for comparing the effect sizes across the strata in the meta-analysis were made using the ggplot2 package in R.
Furthermore, we compared the effect sizes across different phenotype definitions used; PCOS definitions based on i) electronic health records (EHRs), ii) clinical diagnosis and iii) self-reports were included in this comparison. Additional meta-analysis was performed to statistically test for heterogeneity across these three PCOS definitions. In addition to visually inspecting forest plots for the meta-analysis, Cochrane’s Q P-value and I2 were used for assessing heterogeneity.
The summary statistics from the age-adjusted meta-analyses were further combined with the previously published summary statistics from 23andMe, Inc. in order to increase the statistical power6. We called this analysis the 3-strata meta-analysis. The resulting sample size was 545,541 (21,570 cases and 523,971 controls).
We assessed the lead GWAS variants (p<5x10-8) by examining their relationship with 20 related metabolic, hormonal and reproductive phenotypes with available GWAS results data. Except for LH and FSH all other traits were publicly available (Supplementary Table 10). The heatmap was drawn using the “pheatmap” library in R 3.6.1.
Fine-mapping
To identify a credible set of variants containing the most likely causal variant underlying our association signals, we conducted fine-mapping using the shotgun stochastic search method as performed in FINEMAP77. We used summary statistics from our two-stage summary GWAS meta-analysis results without the data from 23andMe, Inc., and considered all variants within 1 MB +/- from our tag variants. We used two different contributing studies as LD references to perform fine-mapping. First, we used unrelated (up to 2nd degree), European ancestry women from the MyCode Community Health Initiative Study (DiscovEHR) (N=47,061) with genetic data imputed to the 1000 Genomes Phase III global reference panel. European ancestry was inferred using genetic data as described elsewhere78. Second, we used an unrelated dataset of European ancestry females (N=36,890) in the EHR-linked biobank at Vanderbilt University Medical Center (BioVU). Genetic data were imputed to the Haplotype Reference Consortium and European ancestry was defined by principal components79. We assumed a single causal variant for all loci, and for four loci with evidence of a secondary signal, we also performed fine-mapping assuming two causal variants.
Functional Mapping and Annotation of GWAS
Functional mapping and annotation of GWAS was performed with FUMA, and further annotation of the association results with PhenoScanner (date accessed 25 March 2022)80,81. FUMA analyses were performed using the summary statistics for: i) the top 29 PCOS-associated variants in the 3-strata meta-analysis; and ii) the genome-wide 2-strata meta-analysis. Unless specified otherwise, the default settings were used in the FUMA analyses for both SNP2GENE and GENE2FUNC76.
Proteomic Analysis
Proteomic analysis using logistic regression for the association of normalised protein levels with disease was run in the ∼22,000 women with data from the Olink panel of plasma proteins, aged 56.5±8.1 yrs82. Here the outcome was the first occurrence data, and to maximise sample size we used a diagnosis of any of ICD10 category E28, the supra-group that includes PCOS. Proteins were considered significantly associated if they passed a Bonferroni corrected p-value threshold of 3.4×10-5. The generation of the protein data is described elsewhere83.
Separately, we performed a protein Phewas for each of the variants that we identified in the GWAS meta-analysis using the total sample of ∼44,000 (men and women) with Olink data to identify proteins linked to our PCOS signals. Again, we used a Bonferroni corrected p-value threshold of 3.4×10-5. Once this panel of proteins had been identified we then considered if these were also associated with the diagnosis of ICD10 E28. Finally, once we had established our variant-protein pairs, we attempted to establish the position in the causal pathways by considering the relative R2, with those variants that had a higher R2 with PCOS suggesting that the protein was downstream of PCOS, and vice-versa.
GWAS-to-Gene pipeline
The GWAS-to-Genes pipeline leverages a range of annotation methods to highlight likely causal genes at each of the identified signals, as described elsewhere18. Briefly, tissue enrichment for GWAS associations was performed using LD score regression to identify key tissues for annotations with tissue specific datasets. Then a gene score is generated from the following panel of annotations: a) The closest gene to the signal with these scored 1.5 points. b) eQTL colocalisation from both SMR-HEIDI and coloc were scored 1.5 points, or 1 if only from one of these. An additional point was given to genes with eQTLs at secondary signals. c) pQTL colocalisation scored the same as for eQTLs. d) Coding variants, with variants of deleterious or damaging predicted consequence in LD with GWAS PCOS signals were scored 1.0 point, or only 0.5 points if the coding variants were predicted to be benign or tolerated. e) Genes targeted by enhancers which overlapped with or were correlated with GWAS PCOS signals were scored 1.0 point. f) PoPs prioritised genes at each locus were scored 1.5 points84.
Gene-set Enrichment Analysis
To perform gene-set enrichment analysis that leveraged information across both the proteomics and the implicated genes we used GProfiler selecting either the consensus gene or the associated protein from the proteomics analysis. Clustering of the pathways was done using an index of dissimilarity based on the shared genes across the enriched intersections of each pathway18.
Polygenic Risk Score Analyses in the UK Biobank
We used the PRS-CS software to calculate a polygenic risk score (PRS) for PCOS, which is a Bayesian regression framework that applies continuous shrinkage parameters to estimate posterior effect sizes45. This work was performed in the UK Biobank, a population-based cohort of ∼500,000 individuals in the United Kingdom85, which was independent of the discovery GWAS sample. The tuning or global shrinkage parameter phi=1×10-4 that optimised the association of the PRS for PCOS in the UK Biobank as previously reported was used10. Using this method, our PCOS polygenic risk score included 1,119,009 genetic variants. In the same UK Biobank sample, we replicated these analyses using another PRS tool, LDpred-2, which employs Bayesian shrinkage model86.
To identify women with PCOS in the UK Biobank study, data from self-report, primary-care clinical events, and/or ICD 9 and ICD 10, as previously reported10 was used. We binned women with and without a diagnosis of PCOS by their quintile of PRS and used logistic regression to determine the odds of PCOS for each quintile using the lowest quintile as a reference. Women with a PCOS PRS in the highest quintile had an increased odds of PCOS (OR 2.41, 95% CI 1.96-2.98; P=2×10-16). Thus, our PCOS PRS is able to represent the genetic risk for PCOS in women in the UK Biobank.
Ascertainment of cardiometabolic and androgenic phenotypes have been previously reported10. All other phenotypes, including measures of fertility and longevity, asthma, and mental health disorders were based on a composite of self-reported measures, diagnosis codes from hospitalisation records, and age at diagnosis (Supplementary Information). We used linear and logistic regression to analyse the associations between continuous and dichotomous phenotypes and the PCOS PRS, respectively. We adjusted all analyses for age, age squared, genotyping array, the UK Biobank assessment centre, and the first ten genetic principal components; for asthma and psychological outcomes, we additionally controlled for the Townsend Deprivation Index, and for asthma, we further controlled for smoking status. Adjustment for BMI was performed in two ways, first a measured BMI was included in the model as a covariate; second we constructed a score based on the GWAS meta-analysis where the genetic associations were adjusted for BMI.
Polygenic Risk Score Analysis in Copenhagen Hospital Biobank based on the Danish Registries
Polygenic Risk Scores (PRS) for PCOS were calculated using LDPred287. These genome-wide scores were calculated using the meta-analysis excluding data from 23andMe Inc. Autosomal genotype data from 138,669 individuals in the Copenhagen Hospital Biobank (CHB) were filtered to only include variants present in a set of 1,054,330 reference variants recommended by LDPred2 developers. Missing genotype information was imputed to be the reference allele for the affected locus. GWAS summary statistics were pre-processed with MungeSumStats.
The completed family size was determined by counting the number of live births from the Medical Birth Registry (MBR)88. This study was initiated in 1973, and data is considered complete. Only women born in the years 1957-1973 were included in this analysis, thus the youngest would be 45 years old, and 61 years old when data collection ended (31st December 2018). Data were treated as count data, and we tested to determine whether there was equi-, under-, or over-dispersion using the AER R package (v1.2.10). We found significant underdispersion (dispersion=0.60, p<2.2×10-16). Consequently, data were analysed using a Conway-Maxwell Poisson distribution.
From the MBR, we also identified the age at first birth and last birth (expressed in days). Data were analysed using a linear regression, and model fit was inspected from residuals. There were no signs of deviation from a Gaussian error.
The Danish IVF registry was initiated in 1994, and contains all treatments and procedures related to medically assisted reproduction. Reporting is mandatory for both private and public clinics. Furthermore, there is information on any treatment related to the procedure, and treatment duration. Female infertility was defined using the 628 (ICD8) and N97 (ICD10; excluding N97.4) in the National Patient Registry (public hospitals only89) and “female cause” in the IVF Register. Male infertility was defined using the 606 (ICD8) and N46 (ICD10) in the National Patient Registry and “male cause” (excluding male infertility due to sterilization) in the IVF Register. For number of oocytes, we extracted all aspirations performed in the period from 18th January 1994 - 31st December 2018. The mandatory reported data were changed in 2005, and thus, we analysed the two time periods (1994-2005 and 2006-2018) separately and meta-analysed. We additionally extracted information on treatment (Klomifen, HMG-FSH, GNRH-A, Oestrogen, Progesterone, HCG) and the number of treatment days prior to the aspiration. Both time periods were over dispersed and were analysed using a negative binomial distribution. To take into account multiple aspirations for a single woman, we included an individual random term.
Lastly, we investigated the number of cycles before a woman got pregnant or ceased treatment. Each woman was only included until the first pregnancy. All transfer or insemination attempts were summarised, and analysed using a Conway-Maxwell Poisson distribution, as data were under dispersed. Furthermore, a term for zero-inflated was also tested as only 81% of the population achieved a pregnancy. A model that included a zero-inflation term was found to fit significantly better (p<2.2×10-16, likelihood-ratio test). This was also substantiated by lower AIC and BIC scores.
All models were fit using glmmTMB, and no rate models, except the number of cycles until pregnant, had indications of zero-inflation.
GWAS Catalog Accessions for calculated bioavailable testosterone, total testosterone and SHBG: GCST90012102, GCST90012106 and GCST90012112.
Mendelian Randomisation Analysis
Mendelian Randomisation analysis was performed using two sample inverse weighted methods (IVW)90. In addition, the intercept from the MR-EGGER91 was calculated to provide a test of directional pleiotropy and the I2 metric to assess general heterogeneity of the variants. Data for the outcomes was taken from the most recent genome-wide study for each of the outcomes (thus in most cases this data were not sex-specific). To correct for any impact of the role of BMI on analyses multivariate IVW92 was implemented. The betas for these variants to BMI was taken from the most recent GIANT consortium study which combined GWAS meta-analysis data with that from UK Biobank93. For the analysis of association between menopause variants and PCOS split by evidence for a DNA damage effect, the variants were classified based on the proximity to a known DNA damage repair gene as per Ruth et al.7.
Data Availability
Data produced will either be available online or upon reasonable request to the authors after peer-reviewed publication of the manuscript.
Acknowledgements
This work was conducted using UK Biobank, application numbers 9905, and 31823. We would like to thank the research participants and employees of 23andMe Inc. for making this work possible. Tugce Karaderi is supported by the Novo Nordisk Foundation Data Science Investigator grant (NNF20OC0062294). Further Acknowledgements can be found in the Supplement.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.↵
- 11.↵
- 12.
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.
- 62.
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵