Abstract
Context Polycystic ovary syndrome (PCOS) is one of the leading causes of infertility, yet current diagnostic criteria fail to identify patients whose symptoms fall outside their strict definitions. As a result, PCOS is underdiagnosed and its etiology is poorly understood.
Objective We aim to characterize the phenotypic spectrum of PCOS clinical features within and across racial and ethnic groups.
Methods We developed a strictly defined PCOS algorithm (PCOSregex-strict) using International Classification of Diseases, 9th and 10th edition (ICD9/10) codes and regular expressions mined from clinical notes in electronic health record (EHR) data. We then systematically relaxed the inclusion criteria to evaluate the change in epidemiological and genetic associations, resulting in three subsequent algorithms (PCOScoded-broad, PCOScoded-strict, PCOSregex-broad). We evaluated the performance of each phenotyping approach and characterized prominent clinical features observed in racially and ethnically diverse PCOS patients.
Results The best-performing algorithm was PCOScoded-strict, with a positive predictive value (PPV) of 98%. Individuals classified as cases by this algorithm had significantly higher body mass index (BMI), insulin levels, free testosterone values, and genetic risk scores for PCOS than controls. Median BMI was higher in African American women with PCOS than in White and Hispanic women with PCOS.
Conclusions PCOS symptoms are observed across a severity spectrum that parallels genetic burden. Racial and ethnic group differences exist in PCOS symptomology and metabolic health across different phenotyping strategies.
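As a rough illustration of the phenotyping approach summarized in the Methods, the sketch below flags a record from billing codes plus a note-level regular expression and then computes PPV against chart-review labels. It is a minimal sketch under stated assumptions, not the study's algorithm: the code set, the keyword pattern, and the function names flag_case and ppv are hypothetical, and the actual inclusion criteria are those given in the supplementary materials.

```python
import re

# Hypothetical inputs for illustration only; these are NOT the study's criteria.
# 256.4 (ICD-9-CM) and E28.2 (ICD-10-CM) are the polycystic ovary / PCOS codes.
PCOS_CODES = {"256.4", "E28.2"}
PCOS_PATTERN = re.compile(r"\bpcos\b|polycystic ovar", re.IGNORECASE)


def flag_case(codes, notes, require_note_match=True):
    """Flag a record that has a PCOS code and, optionally, a matching note."""
    has_code = bool(PCOS_CODES & set(codes))
    has_note = any(PCOS_PATTERN.search(note) for note in notes)
    # Relaxing the note requirement gives a broader, code-only rule.
    return has_code and (has_note or not require_note_match)


def ppv(flagged_ids, confirmed_ids):
    """Positive predictive value: share of flagged records confirmed on review."""
    flagged = set(flagged_ids)
    if not flagged:
        raise ValueError("no flagged records")
    return len(flagged & set(confirmed_ids)) / len(flagged)
```

Setting require_note_match=False in this sketch mimics relaxing an inclusion criterion, which typically trades a lower PPV for a larger case set.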
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
KVA and JC are supported by NIH training grants 5T32GM007628-42 and 5T32DK007061. LKD is supported by U54MD010722. This research was done in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University, Nashville, TN. Datasets for this project were obtained using the Synthetic Derivative at Vanderbilt University Medical Center, which is supported by multiple institutional, private, and federal grant sources. These include the NIH-funded Shared Instrumentation Grant S10RR025141 and CTSA grants UL1TR002243, UL1TR000445, and UL1RR024975.
Author Declarations
All relevant ethical guidelines have been followed; any necessary IRB and/or ethics committee approvals have been obtained and details of the IRB/oversight body are included in the manuscript.
Yes
All necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Due to data-sharing restrictions related to privacy concerns in the EHR, the datasets generated from our hospital population will not be publicly available. However, all criteria for automated phenotyping are available in the supplementary materials.