Abstract
Objective The majority of breech fetuses are delivered by Cesarean birth as few physicians are trained in vaginal breech birth. An external cephalic version (ECV) can prevent Cesarean delivery and the associated morbidity in these patients. Current guidelines recommend all patients with breech presentation be offered an ECV attempt. Not all attempts are successful, and an attempt does carry some risks so shared decision-making is necessary. To aid in patient counseling, over a dozen prediction models to predict ECV success have been proposed in the last few years. However, very few models have been externally validated, and thus none have been adopted into clinical practice. This study aims to use data from a United States hospital to provide further data on ECV prediction models.
Study Design This study retrospectively gathered data from Carle Foundation Hospital and used it to test six models previously proposed to predict ECV success. These models were Dahl 2021, Bilgory 2023, López Pérez 2020, Kok 2011, Burgos 2010, and Tasnim 2012 (GNK-PIMS score).
Results 125 patients undergoing 132 ECV attempts were included. 69 attempts were successful (52.2%). Dahl 2021 had the greatest predictive value (AUC 0.779), while Tasnim 2012 performed the worst (AUC 0.626). The remaining models had similar predictive values as each other (AUC 0.68-0.71). Bootstrapping confirmed that all models except Tasnim 2012 had confidence intervals not including 0.5. The bootstrapped 95% AUC confidence interval for Dahl 2021 was 0.71-0.84. In terms of calibration, Dahl 2021 was well calibrated with predicted probabilities matching observed probabilities. Bilgory 2023 and López Pérez were poorly calibrated.
Conclusion Multiple prediction tools have now been externally validated for ECV success. Dahl 2021 is the most promising prediction tool.
Key Points
Prediction models can be powerful tools for patient counseling
The odds of ECV success can estimated based on patient factors and clinical findings
Of the 6 tested models, only Dahl 2021 appears to have good predictive value and calibration
Introduction
Breech presentation complicates approximately 3-4% of term pregnancies, and transverse lie complicates an additional 0.1%-0.2%1–3. The three main options for malpresentation presentation include vaginal breech birth (VBB), Cesarean delivery, or external cephalic version (ECV) followed by cephalic vaginal delivery 1. Vaginal breech births are extremely rare in the United States and only offered in a limited number of centers. External cephalic version is a physical maneuver where physicians press on the mother’s abdomen in the third trimester to try to rotate to a cephalic (“head down”) position. If successful, the ECV allows for a cephalic vaginal delivery and foregoes the need for a Cesarean delivery or VBB. As such, ECVs are cost effective and noninvasive procedures that decrease Cesarean rates and avoid the associated morbidity for patients2. Current American and British guidelines recommend that patients who are near term with breech presentation be offered an ECV if there are no contraindications2,4,5.
Wide variation in ECV success rates have been reported, although most estimates are usually around 50-60%2,5–7. Furthermore, not every patient has the same likelihood of procedure success. Some risk factors for ECV failure include low amniotic fluid volume, nulliparity, obesity, descent of the breech into the pelvis, and later gestational age8–11. Regardless of whether the procedure is successful, the procedure can rarely result in complications, the most severe include placental abruption (0.18%), umbilical cord prolapse (0.18%), fetomaternal hemorrhage (0.9%), rupture of membranes (0.22%), and fetal death (0.19%) 7. As such, it is recommended that an ECV only be performed in an institution where an emergency Cesarean delivery can be performed and that shared decision making be employed to determine whether to proceed with an ECV vs a scheduled Cesarean birth.. Given the overall risks of the procedure, there has been a significant move in the field to publish models which predict an individual patient’s success, to aide patients and practitioners in deciding for or against this intervention.
Many statistical models have been published in the last decade which predict the success of an ECV depending on a multitude of variables such as: parity, placental location, amniotic fluid volume, fetal sex, and descent of the presenting part12. We recently published a systematic review of the 25 models that have been published so far, 14 of which in the last 5 years. The vast majority of models had no external validation and thus were not being used in clinical practice. Because of various statistical biases, it is common for a prediction model to be overfit and appear to perform well in the original study, but be poorly predictive in reality. Thus, it is critical that rigorous external validation be performed before these models are used clinically as acting on incorrect predictions is worse than have no prediction at all. Therefore, in our systematic review, we determined that rather than new models to be developed, there is a need for validation of the existing models to determine their suitability for clinical use13. This study aims to fill that gap in the current literature and uses data from a hospital in the United States to externally validate six different existing prediction models for ECV success.
Methods
Study design
We performed a retrospective cohort study at Carle Foundation Hospital in Urbana, Illinois, USA. Approval from the institutional review board (IRB) was obtained prior to data collection and the study was deemed exempt. Inclusion criteria were pregnant individuals in whom one or more ECVs were attempted or performed. ECVs in patients with a prior Cesarean section (TOLAC) and in non-breech (e.g., transverse and oblique lie) presentations were included. Twin pregnancies were excluded. Patients were excluded if the ECV or the delivery did not happen at our institution or if the patient did not have any formal ultrasounds available in our EMR.
Data was retrieved from maternal medical records including demographic characteristics, pregnancy information, ultrasound information, the success of the ECV attempt, and delivery information. All charts were manually reviewed by the authors to confirm ECV success and final method of delivery. ECV success was defined as cephalic presentation when the physician left the room after the procedure. If the fetus returned to non-cephalic presentation after, then a second attempt was considered a separate ECV altogether. Fetal presentation was obtained from the patient’s most recent formal ultrasound before the ECV; if the patient’s last ultrasound indicated cephalic presentation, then presentation information was obtained from physician notes at the time of the ECV. REDCap was used to store the data until it was deidentified.
Demographic data was collected and tabulated. Pre-gravid BMI (BMI prior to pregnancy or at first prenatal visit) was able to be collected for almost all patients, but BMI before ECV and BMI at delivery was only available for a subset of patients. Many of the models included BMI, but only some specified which BMI was used. The appropriate BMI was used when possible; otherwise, the last available BMI was used. Some patients had other missing data (most commonly for fetal presentation); thus, these patients were only used for evaluating prediction models that did not require that variable.
All models except Dahl 202114 used amniotic fluid volume; four models used amniotic fluid index (AFI) while Burgos 201015,16 used qualitative terms (e.g. “scarce”, “normal”, or “abundant”). AFI was collected from patient records. For calculation of the Burgos 2010 model, qualitative criteria were converted to oligohydramnios (AFI<5), normal, and polyhydramnios (AFI≥24) 17,18. Five patients did not have an AFI reported, but did have a single deepest pocket (SDP); this was approximated to an AFI value by multiplying by 3.
Some models required estimated fetal weight (EFW). The last growth ultrasound was used to obtain the EFW percentile. Then, using the World Health Organization Fetal Growth Charts, the EFW at the time of ECV was calculated using the EFW percentile, gestational age at ECV, and the fetal sex19,20 (see https://github.com/jcarvalho45/whoFetalGrowth for code and dataset).
Statistical analysis
Six prediction models were chosen from the prior systematic review: Dahl 202114, Bilgory 202321,22, López Pérez 202012, Kok 201123, Burgos 201015,16, and Tasnim 2012 (GNK-PIMS score)24. These models were chosen because they did not require any additional measurements (e.g., forebag size, cervical exam before ECV) beyond what is routinely already performed at our institution. The score or predicted probability was calculated for each model. The receiver operating characteristic (ROC) curves were calculated for each model by determining the sensitivity and specificity values at each threshold. The area under the curve (AUC) was calculated and used as a predictor of discriminative power, and this was compared between the models.
Because the ROC does not naturally have a method to determine confidence intervals, bootstrapping statistical techniques were used to simulate 1000 new datasets from our original dataset by sampling with replacement. This was done for each model and the simulations were used to create confidence intervals for the AUC as well as the ROC curves themselves.
Finally, calibration plots were calculated to give a measure of how well the calculated probabilities compared to the observed outcomes25. This was performed by calculating the odds of success at each decile of predicted probability (e.g., in cases with a predicted probability of 70-79%, what percent of the time was the ECV actually successful?). Quality of calibration was assessed visually.
Calculations were performed with Python version 3.10 using numpy, pandas, and sklearn (scikit-learn) libraries and in Jupyter notebooks hosted by Google colab. Figures were created using plotly 5.15. The code is hosted on GitHub and will be published online after publication of this study. The underlying data was deidentified and cleaned for publication with this manuscript for reproduction or for future analyses.
Results
Population demographics
We ultimately obtained data from 125 patients who underwent a total of 132 ECV attempts, of which 69 were successful (52.2%) as seen in Table 1 and Table 2. 53 of the patients ultimately had a vaginal delivery including 50 of those with successful ECVs and 3 of those with failed ECVs who spontaneously verted afterwards. No patients had vaginal breech births. 107 patients were breech, 15 patients were transverse, and 3 were oblique before the first ECV attempt. Our sample had a largely white population, which is consistent with the demographics of our geographical region. No patients had oligohydramnios and six patients had polyhydramnios (AFI ≥24), of which five had mild polyhydramnios (AFI<30) and five had successful ECVs (all except the patient with moderate polyhydramnios). 1 patient did not have any recorded BMI values. 116 patients did not have a recorded EFW percentile. Only 40 patients had information of breech type (e.g. frank, complete, incomplete). The underlying data from this study is given in Supplement 1.
Available models – discrimination ability
The following 6 models were studied: Dahl 202114, Bilgory 202321,22, López Pérez 202012, Kok 201123, Burgos 201015,16, and Tasnim 2012 (GNK-PIMS score)24. Table 3 outlines the varying characteristics of these models including the AUC. In our population, all models had at least some predictive value (AUC >0.5), with Dahl 2021 having the greatest AUC at 0.78 (Figure 1). Bilgory 202321,22, López Pérez 202012, Kok 201123, and Burgos 201015,16 all had similar predictive values with AUC 0.68-0.71. Tasnim 201224 had the lowest predictive value in our study with an AUC of 0.63. When bootstrapping was performed to simulate confidence intervals for the AUC, Tasnim 201224 was not statistically significant with 0.5 being inside the confidence interval (Figure 2). However, the confidence interval was wide for Tasnim 2012 because the majority of our data did not have fetal presentation (e.g., frank vs complete vs incomplete breech) recorded and thus was not included in the analysis of Tasnim 2012. The confidence intervals for the AUC overlapped between the majority of the models so confirmation of which model had the best discriminative ability was not able to be done. A bootstrapped ROC curve is shown for Dahl 2021 in Figure 3 and a table of bootstrapped AUC value for all the models is shown in Supplement 2.
Available models – calibration
Dahl 202114 was relatively well calibrated in our population (Figure 4). Bilgory 202321, 2212 had poor calibration without a specific tendency to over or under calibrate. López Pérez 2020 was consistently under-calibrated. That is, it would consistently underestimate the probability of success in our population by a relatively fixed amount. The remaining three models return score values instead of predicted probabilities. Thus, calibration curves cannot be properly assessed. A probability of success by score value is shown for Kok 2011, Burgos 2010, and Tasnim 2012 in Figure 5. The probability of success poorly correlated with score for Kok 2011 and Tasnim 2012, but correlated well for Burgos 2010.
Discussion
The prediction of ECV success has the potential to be an incredibly powerful tool in counseling patients as they weigh the risks and benefits of the procedure. Prediction models have the opportunity to give specific, personalized information to patients. However, they have the ability to cause harm if they are inaccurate or poorly calibrated. In reality, confidently providing a patient an inaccurate odds of success is worse than providing no prediction at all.
Unfortunately, there has not been a clinical consensus on the optimal tool despite dozens of models having been proposed. Our prior systematic review identified lack of external validation studies as a major gap in the literature, prohibiting clinical use of published models.
In this article, we set out to validate the 6 models that could be used in our population as they used variables which are already collected as part of routine care at our institution. These models are Dahl 202114, Bilgory 202321,22, López Pérez 202012, Kok 201123, Burgos 201015,16, and Tasnim 2012 (GNK-PIMS score)24. Our data demonstrates that all of these models have at least some predictive value, with the exception of Tasnim 2012 (which was not statistically significant by bootstrapping, likely because of the small number of patients with sufficient recorded data to calculate this model). Of all the models, Dahl 2021 had the greatest AUC, although the bootstrapped confidence interval still overlapped with the other models. Of the three models that output a specific probability, only Dahl 2021 appeared overall well calibrated. Based off the available evidence from this study, Dahl 2021 appears to overall be the best model for clinical use.
This is the first external validation study of Dahl 2021. While it is generally expected that external validation studies will show worse predictive values than the original study because of some degree of unavoidable statistical bias, the opposite was true here. Dahl 2021’s AUC in our study was 0.779 (confidence interval of 0.712-0.844), which is higher than the original study, which showed an AUC of 0.667 (confidence interval of 0.634-0.701). Overall, the results are very promising for Dahl 2021.
There are some limitations with our study. Due to the retrospective nature of our study, we only evaluated models that used variables routinely measured in our institution. Our institution does not routinely measure uterine tone26,27, cervical exam/fetal station28–30 and forebag size31, which are components of other models. One of these models may in fact have greater predictive value than the ones we evaluated in our study. On the other hand, these models would also be less practical to implement in clinical practice because they would require measuring additional variables. Additionally, these variables would not be known until close to ECV; thus, it would be hard to counsel patients in advance using one of these models. Since many models required cervical exam or fetal station, we suggest future research study be a prospective trial where cervical exam is performed.
Declarations
Ethics approval and consent to participate
No new unpublished data from humans or other animals was included in this study.
Consent for publication
Not applicable.
Funding
The authors declare that no funds, grants, or other support were received specifically for the preparation of this manuscript. The authors have no conflicts of interest to declare.
Data Availability
All data produced in the present work are contained in the manuscript and supplement.
Supplement
Supplement 1 Deidentified patient data to reproduce the analyses. For yes/no questions, 0: no, 1: yes. For baby gender, 1: female, 2: male. For ethnicity, 11: Hispanic, 12: not Hispanic, 13: unknown/patient refused. For language, 1: English, 2: French, 3: Spanish, 15: Portuguese. For patient race, 1: White, 2: Black, 3: none of the above/all other races, 4: Hispanic, 5: Asian, 6: American Indian or Alaskan Native, 7: Pacific Islander or Native Hawaiian, 8: Patient refused. EFW at ECV and at delivery were calculated using the percentile from the most recent ultrasound as described in the methods. ECV success for the first, second, third, and fourth ECV attempts is given by the variables ecv_successful_1, ecv_successful_2, etc, respectively. Maternal age at delivery (age_delivery) is given in years. Gestational ages are given in days. EFWs are given in grams. Patient weights and total weight gain (twg) is given in pounds, while BMI is given in kg/m2
Acknowledgments
We thank artificial intelligence engineer Benjamin M. Pikus for his help in deciding how to evaluate prediction model performance.
Footnotes
Other author contacts: PKP Email: peggy.palsgaard{at}wmed.edu, PS Email: pshankarappa{at}lifespan.org, VJ Email: valerie.jennings{at}carle.com