Abstract
Endometriosis is a condition characterized by the implantation of endometrial tissue at extrauterine sites, mostly within the pelvic peritoneum. Endometriosis is under-diagnosed, and its prevalence is estimated at 5–10% of all women of reproductive age. The goal of this study is to develop a predictive model for endometriosis based on the UK-Biobank (UKBB). We partitioned the participants into women diagnosed with endometriosis (5,924; ICD-10: N80) and a control group (142,576). We included over 1,000 variables from the UKBB covering personal information about female health, lifestyle, self-reported data, genetic variants, and medical history prior to endometriosis diagnosis. We applied machine learning algorithms to train an endometriosis prediction model. The best prediction was achieved with the CatBoost gradient boosting algorithm applied to the data-combined model, with an area under the ROC curve (roc-AUC) of 0.78. We discovered that, prior to being diagnosed with endometriosis, women had significantly more ICD-10 diagnoses than the average unaffected woman. Informative features, ranked by SHAP values, included irritable bowel syndrome (IBS) and the length of the menstrual cycle. We conclude that the rich population-based retrospective data from the UKBB is valuable for developing predictive models despite the limitations of missing data and noisy medical input. The informative features of the model may support the clinical diagnosis of endometriosis.
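For readers unfamiliar with the roc-AUC metric quoted above, the following is a minimal illustrative sketch of how it can be computed from predicted risk scores. It is not the authors' pipeline: the function name `roc_auc` and the toy labels and scores are hypothetical, and the rank-based (Mann-Whitney) formulation is one standard way to compute the metric. The value can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen control.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based roc-AUC: P(score of a case > score of a control), ties count 0.5.

    labels: 1 for cases (e.g. diagnosed), 0 for controls (hypothetical toy encoding).
    scores: predicted risk, higher means more likely to be a case.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks; tied scores share the average of their ranks.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")

    # Mann-Whitney U statistic for the cases, normalised by the number of pairs.
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy data (not from the UKBB): two controls and two cases.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

On this scale, 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.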
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This study was supported by ISF grant number 2753/20 (to M.L.) and by The Louise and Alan Edwards Foundation, Clinical Research Fellowship Grant 2021 (to T.S.).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines and uploaded the relevant EQUATOR Network research reporting checklist(s) and other pertinent material as supplementary files, if applicable.
Yes
Data Availability
Model based on the UK-Biobank
Abbreviations
- AI: Artificial intelligence
- AUC: Area under the ROC Curve
- DL: Deep Learning
- EHR: Electronic Health Records
- OT: OpenTargets
- ROC: Receiver Operating Characteristic Curve
- IBS: Irritable bowel syndrome
- UKBB: UK-Biobank
- PRS: Polygenic risk score
- T2D: Type 2 diabetes
- BMI: Body mass index