Data Preparation of the nuMoM2b Dataset

Anton Goretsky; Anastasia Dmitrienko; Irene Tang; Nicolae Lari; Owen Kunhardt; Raiyan Rashid Khan; Cassandra Marcussen; Adam Catto; Daniel Mallia; Alisa Leshchenko; Adam (Yun Chao) Lin; Anita Raja; Ansaf Salleb-Aouissi; Itsik Pe’er; Ronald Wapner; Cynthia Gyamfi-Bannerman

doi:10.1101/2021.08.24.21262142

Abstract

In 2010, the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) started the Nulliparous Pregnancy Outcomes Study: Monitoring Mothers-to-be (nuMoM2b), a prospective cohort study of a racially/ethnically/geographically diverse population of nulliparous women with singleton gestation. The nuMoM2b is a very large dataset, consisting of data for 10,038 patients with over 4,600 features per patient, spread out over 80 files. In this report, we share our experience preparing and working with this dataset. We present our data preprocessing of the nuMoM2b dataset, which aimed to get a deeper understanding of the data, extract the most relevant features, make the fewest assumptions when filling in unknown values, and reduce the dimensionality of the data. We hope this report is useful to researchers interested in building machine learning and statistical models from the nuMoM2b dataset.

1 Introduction to nuMoM2b and Data Processing

The primary goal of the nuMoM2b study [1] was to determine the maternal characteristics, both clinical and genetic factors, physiological response to pregnancy and environmental factors that could be used to derive models that accurately predict adverse pregnancy outcomes (APOs). Our team has extensive prior experience working on medical data for preterm birth (PTB) prediction [2, 4, 5].

As originally organized, the dataset is not immediately conducive to analysis by those unfamiliar with the medical background, nor to quick placement into machine learning models. With roughly one feature for every two patients, the ratio of features to instances would make model overfitting very likely. Various medical categories exist within an individual file, data relevant to features exist throughout many different files, and dependencies and redundancies exist across the whole dataset. As such, nuMoM2b required extensive review and processing of features, their dependencies and relations in order to reduce the complexity of the dataset, and shape it in a form amenable to a variety of machine learning (ML) algorithms and exploratory data analysis (EDA).

The intended audience for this document is researchers interested in building machine learning and statistical models from the nuMoM2b dataset. Our aim is to share our experience preparing and working with this dataset. The intent is not to share the processed dataset nor the scripts that are specific to research aims.

Our specific research goal is to build machine learning models for the prediction and prevention of preterm birth in nulliparous women using the nuMoM2b dataset. This project is funded by the NIH/NLM (Project # 1R01LM013327-01). The data review and processing work described in this document was conducted through the direct collaboration of the Computer Science departments at Hunter College and Columbia University, along with maternal and fetal medicine experts at the Columbia University Medical Center. The following are the goals of this collaborative effort:

Significant reduction of the feature space for better management of the data and a reduction of the risk of overfitting.
A reformatting of the dataset, to allow for easy configuration of the data, conducive to exploration and machine learning modeling.
Extensive literature review of existing risk factors related to data categories and causal pathways of PTB.
Exploratory data analysis of both the reduced and the unmodified feature space.

As a result of the effort, the dataset was reduced to 364 features at the most general level of complexity, and 465 features at a higher level of detail. Extensive filtering and imputation rules were created to accomplish this goal, along with a system for both human readability and easy script interpretation. An extensive literature review was performed documenting odds ratios. EDA was performed on the dataset, comparing calculated odds ratios to the literature review, and discovering and correcting data inconsistencies. Finally, we summarize our thoughts on data preprocessing and nuMoM2b in Section 7.
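As a rough illustration of the odds ratio comparison mentioned above, the following sketch computes an odds ratio and an approximate 95% confidence interval from a 2x2 table of exposure against outcome. It uses the standard log-based (Woolf) interval; the function name and the counts are hypothetical and are not taken from nuMoM2b or from our scripts.

```python
import math


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Return (odds ratio, lower, upper) with an approximate 95% interval.

    Uses the log-based (Woolf) interval, which requires all four cells > 0.
    """
    a, b, c, d = exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this simple estimate")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.96  # normal quantile for a two-sided 95% interval
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Made-up counts, for illustration only (not nuMoM2b data).
estimate, lower, upper = odds_ratio(30, 70, 10, 90)
```

An odds ratio computed this way for a candidate risk factor can then be placed next to the value reported in the literature to flag disagreements worth investigating.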

We summarize and visualize the PTB statistics in the dataset in Figures 1 and 2. See [3] for more details about the dataset.

Figure 1: Preliminary statistics in the nuMoM2b data

Figure 2: Distribution of gestational age and weight in the nuMoM2b data

2 Preprocessing

As part of the preprocessing stage, the information present in the provided codebooks on basic variable statistics, distribution, and type was transferred over via a script to a workable spreadsheet. Then to support our analysis, the following steps were taken:

“Data type” was corrected to a more accurate label. Many variables were labeled “num” for numeric even though they actually represented categorical data. For example, Country of Birth (V1AF08) was labeled numeric. However, country cannot be treated as a numeric feature in modeling. Numeric, according to the codebook, simply meant the values used in labeling were numerals. This problem was repeated across the entire dataset. As such, each variable was looked at manually and categorized properly for typical modeling and analysis. Many variables labeled “num” were changed to “categorical”.
A “Missingness” metric was calculated using the variable statistics provided in the codebook to help with data analysis.
A new “Temporal” label was added in order to label variables that appear at multiple unique time points throughout the study. nuMoM2b data was recorded at several distinct time points, and some variables are updated over time / questions are asked more than once. Temporal had values of True or False.
A new “Temporal Detail” label was added in order to label at which time point a variable is relevant. Not all data collected are relevant only to the moment of collection in the study. For example, many demographic related questions were asked at Visit 2. However, those are static features and significant at all points in the study. This concept will be explained further under the section titled “Timeline”.
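The annotation steps above can be sketched as follows. This is a minimal illustration, not our actual script: the `CodebookEntry` structure, the `TYPE_OVERRIDES` table, and the counts are assumptions made for the example, and only the variable name V1AF08 comes from the text. The “Temporal Detail” label is assigned manually and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class CodebookEntry:
    """One codebook row (hypothetical structure for illustration)."""
    name: str
    codebook_type: str   # type as printed in the codebook, e.g. "num"
    n_missing: int       # number of patients with no recorded value
    n_total: int         # number of patients in the file
    time_points: tuple   # time points at which the variable was recorded


# Manual corrections decided by inspection (hypothetical table).
TYPE_OVERRIDES = {"V1AF08": "categorical"}  # Country of Birth is coded with numerals


def annotate(entry):
    """Build the spreadsheet row: corrected type, missingness, temporal flag."""
    return {
        "name": entry.name,
        "data_type": TYPE_OVERRIDES.get(entry.name, entry.codebook_type),
        "missingness": entry.n_missing / entry.n_total,
        # Temporal is True when the variable appears at more than one time point.
        "temporal": len(set(entry.time_points)) > 1,
    }


# The counts below are invented for the example.
row = annotate(CodebookEntry("V1AF08", "num", 120, 10038, (1,)))
```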

Accurate timing of actions, such as the administration of medications, and the creation of these “temporal” labels are important for this team’s goal of modeling sequential treatment decision making.

3 Organization into Groups

After the preprocessing stage, further organization was desired to better understand and manage the dataset. In order to better organize the data, Filtering Groups were created. The goal of these groups was to break down the dataset into more understandable portions for those unfamiliar with the detailed medical content, and for simpler processing of data in bulk. The following were the filtering groups created for the nuMoM2b dataset.

Treatment - Variables relating to intervention in cases of potential or immediate PTB risk, such as progesterone administration, steroids, and last minute medical administrations at delivery.
Psychological - Variables relating to the psychological state of the patient, through multiple scales.
Physiological - Variables relating to instantaneous measurements of physiology, such as temperature, symptoms of flu-like illnesses, blood pressure, etc.
Medical History - Variables relating to long term medical history and conditions, but also various tests performed on both the mother and the fetus or newborn found in the study.
Demographics - Variables relating to the patient’s demographic factors, such as race, income, education, etc.
Ultrasound - Variables that were recorded from the research and clinical ultrasounds the patient went through and marked as such in the dataset.
Outcomes - A metadata file containing various variables that are useful mainly as classifier labels, or are features that were collected post-delivery about the mother or newborn.
Activity - Variables relating to the physical activity of the patient.
Toxicology - Variables relating to the medications taken just before and during pregnancy, and / or their relations to particular reasons / conditions.
Family History - Variables relating to family medical conditions and history.
Food Frequency Analysis - Variables relating to food, diet and vitamins, in the three months prior to pregnancy.
Sleep Substudy - Select variables from the two sleep substudies included in nuMoM2b.

4 Processing

Now that the dataset was more simply organized, the process of feature reduction began. Filtering groups were divided among teams, who worked in consultation with the Columbia University Medical Center OB/GYN collaborators, to determine which features to keep as is, which features to summarize into scales, scores, or other aggregate forms, and which features to remove for redundancy or other reasons. As a result of this effort, a system was created to organize the efforts in a form both readable by people, and interpretable by scripts, to allow for easy updates. This system consists of the creation of Filtering and Imputation Rules, and Layering. Filtering rules serve to keep, drop, or summarize features, while imputation rules serve to impute missing data as required by many modeling algorithms. Layering served to organize the data into different levels of abstraction from the most general to the most specific level of information.

4.1 Layers

Layers were decided upon given the high feature complexity of nuMoM2b, even after much feature reduction. As many questions are structured around a format of a general question followed by several sub-questions, it was reasoned that the general question should in most cases be representative of the data points that follow. For example, V1AD06 asks, Have you had any ‘flu-like illnesses’, ‘really bad colds’, fever, a rash, or any muscle or joint aches since you became pregnant? This question is then followed by questions about which symptoms are actually present in this ‘flu-like illness’. V1AD06 covers all of them and is, as such, a more general question. It would thus be selected into a more general layer, while symptom-specific questions would be reserved for the detail-oriented layers, or dropped. If dropped, they may be brought back if desired, or if some significance is found in the most general feature. Internally, we decided upon 3 layers.

Layer 0 would consist of known risk factors for PTB, along with variables shown to have high odds ratios in our EDA.
Layer 1 would be the most general layer, consisting of L0 and all general questions that cover as much information as possible.
Layer 2 would bring back detail that may have been lost, or not included given the generalization and simplicity of L0 and L1.

We will not go into variable-level detail in each layer, but we believe this concept can serve as an organizational method for large complex datasets.
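A minimal sketch of how such a layering might be represented is shown below. It assumes that layers are cumulative (Layer 2 containing everything in Layer 1, which in turn contains Layer 0); the dictionary and the feature names other than V1AD06 are hypothetical placeholders.

```python
# Hypothetical assignment of features to the lowest layer in which they appear.
FEATURE_LAYER = {
    "prior_risk_factor": 0,   # placeholder for a known PTB risk factor
    "V1AD06": 1,              # general flu-like illness question
    "flu_symptom_fever": 2,   # placeholder for a symptom-specific follow-up
}


def features_for_layer(layer, assignment=FEATURE_LAYER):
    """Return the features visible at `layer`, assuming cumulative layers."""
    return sorted(name for name, level in assignment.items() if level <= layer)


layer_1_features = features_for_layer(1)
```

Storing only the lowest layer per feature keeps the assignment in one place, so moving a feature between layers is a one-line change.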

4.2 Filtering and Imputation Rules

As part of the data cleaning process, data first passes through a general filtering script, with the rules shown in Table 1. It then passes through an imputation script, with the rules shown in Table 2. Throughout the processing of this dataset, we strived to hold to a set of generally applicable rules for imputation. Below is a sample of the imputation rules used for nuMoM2b.

Table 1: List of Filter Rules and their Descriptions

Table 2: List of Imputation Rules and their Descriptions

If a numeric-like feature is applicable to a vast majority of patients, and the missingness is relatively low, we attempted to impute with a value such as MEAN or MODE. If multiple measurements were made of the same information, the mean took into account all of them (excluding those marked as incomplete or inaccurate).
If a feature serves as a general precursor to a list of follow-up questions, such as “Have you had any flu-like illnesses, ‘really bad cold’, fever, a rash…” followed by questions regarding symptoms, and data is present in the follow-up questions but missing for the general question, we impute the general question to whatever value represents True. Otherwise we often impute to an unknown or not applicable determiner such as 999, especially if an assumption would be misleading in the understanding of treatment. Often, these imputation rules looked at follow-up questions that were not included in the current layer or were excluded from modeling, due to the general question covering the topic. For the follow-up questions themselves, imputation may have been left at unknown or imputed to a value depending on the missingness and applicability.
Negative one (−1) was often used to represent unknown or inapplicable for numeric features.
When there is a parent feature and there are several related child features that are too detailed, DROP the child features.
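The rules above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than our imputation script: the function names are hypothetical, missing values are represented as `None`, and the code `1` for True is an assumption (the actual coding depends on the variable). The sentinels −1 and 999 are the ones named in the text.

```python
from statistics import mean, mode

UNKNOWN_NUMERIC = -1        # unknown / inapplicable for numeric features
UNKNOWN_CATEGORICAL = 999   # unknown / not applicable determiner


def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace missing entries with the most common observed value."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_parent(parent, children, true_value=1):
    """Impute a general question from its follow-up questions.

    If the parent is missing but any follow-up holds data, the parent is
    set to the value representing True; otherwise it is marked unknown.
    """
    if parent is not None:
        return parent
    if any(child is not None for child in children):
        return true_value
    return UNKNOWN_CATEGORICAL
```

For example, `impute_parent(None, [None, 1])` yields the True code, while `impute_parent(None, [None, None])` yields 999.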

5 Timeline

This timeline is meant to serve as a high level representation of the way we processed and organized data for internal analysis and modeling. This timeline does not cover all variables and is thus not an exhaustive representation of nuMoM2b. It serves rather as a quick glance into the data currently in use, accurate at time of publication, but subject to change. Each internal substudy may use a different collection of variables for their goals.

This timeline shows the seven unique time points that nuMoM2b represents. Six of those time points are directly sequential, namely Before Pregnancy, Visit 1, Visit 2, Visit 3, Visit 4 (Delivery Visit), and Post Delivery. There also exists a Constant time point, which groups data that may have been collected at different points in the study, but applies at all times to the patient, such as race and pregnancy history. In an ideal full-term pregnancy, the patient would pass through all these time points, and have data recorded under the Constant point. However, if a birth was preterm or still, or a patient missed a visit, their timeline could skip some of Visits 1, 2 or 3, and move straight to Visit 4, which represents data collected at delivery. This is represented by the dashed arrow in the timeline. Variables in the timeline are organized by the Filtering Groups described earlier, and are abstractly summarized and simplified in the tables below. On the bottom right we see two boxed sections. These represent events that occurred in between the time points in this timeline. Antepartum evaluations may have occurred between visits, and enrollment screening occurred before Visit 1. To reiterate, this timeline is not representative of all data available or used, but should rather serve as a quick guide.

Table 3: Number of patients available at each visit. Present numbers are patients who filled out the main maternal interview form. Patients designated as Other / Missing may still have information available at that visit, especially under Chart Abstractions. Delivery outcomes are available for almost every patient except those withdrawn. Withdrawn patient counts were approximated using the interval at form A05 recording (official withdrawal). Other / Missing represents those who did not complete the main maternal interview form (V#A).

Figure 3: nuMoM2b Processed Data Timeline

6 Filtering Group Breakdown

The following is a breakdown of the filtering groups our team decided upon at the time of writing. These groups are subject to change in name and organization, and simply serve to understand and organize the data in a simpler and more manageable fashion. Each group below contains a description of itself, the total number of features marked as belonging to the group, the number of features used in layer 1 and layer 2, the files which comprise the group, and a general description of the features that were dropped from our modeling. Following each description is a table of the features, both compiled and used as-is. Features used as-is are at the top of the table. Compiled features (those constructed from other variables in the data) follow, and are surrounded by horizontal lines and bolded. The features listed below a compiled variable are used to construct it, using the rule listed next to the compiled feature. The column NAME represents the original or compiled variable name. FILE represents the file in the data from which the original feature comes. RULE represents which filtering rule was used to compile a group of features. TEMPORAL represents the point in the timeline at which the feature is relevant. TEMPORAL ranges from -1 to 5, where 1 to 4 represent Visits 1 to 4 (delivery), -1 represents “applies at all times”, 0 represents before pregnancy, and 5 represents post-pregnancy. IMPUTE represents which rule was used for imputation. MISSING represents the missingness value for each original feature. DESCRIPTION is a shortened description from the original data set of each variable.
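One way the TEMPORAL code could be used is to restrict a model to features that are relevant no later than a chosen time point. The sketch below assumes that use; the function and the feature names are hypothetical, while the codes (-1 for constant, 0 for before pregnancy, 1 to 4 for the visits, 5 for post-pregnancy) follow the description above.

```python
CONSTANT = -1          # applies at all times
BEFORE_PREGNANCY = 0
POST_PREGNANCY = 5


def available_at(feature_temporal, time_point):
    """Return names of features that are constant or not later than `time_point`."""
    return {
        name
        for name, code in feature_temporal.items()
        if code == CONSTANT or code <= time_point
    }


# Hypothetical feature names mapped to their TEMPORAL codes.
example = {
    "race": CONSTANT,
    "diet_score": BEFORE_PREGNANCY,
    "visit1_bp": 1,
    "visit3_bp": 3,
    "newborn_weight": POST_PREGNANCY,
}

usable_at_visit_2 = available_at(example, 2)
```

A filter of this kind keeps later measurements, and post-delivery data in particular, out of a model intended to make predictions at an earlier visit.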

6.1 Family History

The family history filtering compiled all of the questions regarding diagnoses of family members related to diabetes, blood clotting disorders, pregnancy complications, heart disease, and hypertension. The 9 pregnancy-related conditions were grouped into 3 categories:

Spontaneous

Early or preterm rupture of the membranes
Spontaneous preterm delivery (less than 37 weeks)

Indicated

Delivery of a child more than 3 weeks before the expected due date
Preeclampsia, eclampsia, toxemia or pregnancy-induced hypertension

Fetal Conditions

Delivery of a child weighing less than 5 lb 8 oz (or 2500 grams)
Stillbirth
Delivery of an infant with a birth defect
Other pregnancy complication
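The grouping above can be sketched as a lookup from condition to category, with a category flagged when any of its conditions is reported. The condition keys are hypothetical shorthand for the items in the list as printed, and the "any condition reported" compilation is an assumption made for illustration.

```python
# Hypothetical shorthand keys for the conditions listed above.
CATEGORY_OF = {
    "early_or_preterm_rupture_of_membranes": "spontaneous",
    "spontaneous_preterm_delivery": "spontaneous",
    "delivery_3_weeks_before_due_date": "indicated",
    "preeclampsia_or_pregnancy_hypertension": "indicated",
    "low_birth_weight": "fetal_conditions",
    "stillbirth": "fetal_conditions",
    "birth_defect": "fetal_conditions",
    "other_pregnancy_complication": "fetal_conditions",
}


def compile_family_history(reported_conditions):
    """Flag each category if any condition belonging to it was reported."""
    flags = {category: False for category in set(CATEGORY_OF.values())}
    for condition in reported_conditions:
        flags[CATEGORY_OF[condition]] = True
    return flags


flags = compile_family_history(["stillbirth", "spontaneous_preterm_delivery"])
```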

Medical History Filtering

6.9 Outcomes

The outcomes filtering concerns post-pregnancy analysis of the newborn (all kinds of delivery), and variables that are highly predictive of preterm birth, such as major fetal conditions. Some of these variables may be used as class labels, such as pOUTCOME or GAwksEND, and others may provide privileged information for certain models. The vast majority of this data is highly prone to class leakage and should not be used in modeling, but is used for data analysis. The table provides a sample of features that are useful in data analysis or as class labels.

Total # Features: 1165

Relevant Files: CBA, CBB, CBC, CMA, CMC, CPA, pregnancy outcomes, S02, U02, U2A, U3A, V4A
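To illustrate how outcome variables can be kept out of the model inputs, the sketch below derives a preterm label and removes outcome columns from the feature set. It assumes that GAwksEND holds gestational age in weeks at the end of pregnancy and uses the conventional threshold of less than 37 weeks; the function name, the record layout, and the example values are hypothetical.

```python
PRETERM_THRESHOLD_WEEKS = 37               # preterm: less than 37 weeks
OUTCOME_COLUMNS = {"pOUTCOME", "GAwksEND"}  # never passed to the model as inputs


def split_features_and_label(record):
    """Return (features, is_preterm) with outcome columns removed from features.

    Assumes record["GAwksEND"] is gestational age in weeks at delivery.
    """
    is_preterm = record["GAwksEND"] < PRETERM_THRESHOLD_WEEKS
    features = {k: v for k, v in record.items() if k not in OUTCOME_COLUMNS}
    return features, is_preterm


# Invented values, for illustration only.
features, is_preterm = split_features_and_label(
    {"V1AD06": 1, "pOUTCOME": 2, "GAwksEND": 35.5}
)
```

In practice the set of excluded columns would cover the whole Outcomes group, not only the two label candidates shown here.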


6.10 Treatment

The treatment filtering contains features that pertain to drug administration directly related to delivery or the prevention of preterm birth. This includes administration of steroids for fetal lung maturation, tocolytics, progesterone, and various classes of drugs used specifically for delivery. These features are highly predictive of PTB, and are thus only used for analysis or sequential treatment decision making. Shown is a sample of these features.

Total # Features: 149

Layer 1 # Features: N/A (27 Available)

Layer 2 # Features: N/A (34 Available)

Relevant Files: CMA

Dropped Features

For this group, there was much metadata that was used to place the administration within a particular time point, and then later dropped. Medication codes were used instead of names. Details on which particular drug within a class was used were dropped in L1, as the intended and possible effect were known and similar. Anticonvulsant details were returned in L2 as they can have dramatic differences.


6.11 Food Frequency Analysis

The food frequency analysis file takes into account the food and nutrients consumed by patients in the 3 months prior to conception. We were most interested in the calculated nutrient intake as opposed to less interpretable data like grams of food consumed or food pyramid group quantities. In addition to nutrients, we also attempted to capture the overall energy intake by using glycemic load and calorie intake as proxies.

Total Features: 737

Layer 1 Features: 38

Layer 2 Features: 38

Relevant Files: food frequency analysis (ancillary)

Dropped Features

For this group, we dropped the features that contained redundant information about food intake that was already captured by the vitamin amounts. For instance, quantities of food items such as “glasses of milk” would be reflected in the overall vitamin and calorie consumption.

Table 10: Food Frequency Filtering

6.12 Sleep Substudy

The sleep substudy filtering captures sleep quantity and quality through the average hours slept per night and sleep apnea diagnoses.

Total Features: 6

Layer 1 Features: 4

Layer 2 Features: 6

Relevant Files: V1L, V3L

Dropped Features

Due to sleep being a separate substudy, only a few features were selected to be included. The rest were dropped as of writing.

Table 11: Sleep Substudy Filtering

6.13 Genetic data

The genotyped cohort comprises 9,757 nulliparous women from the nuMoM2b study who had adequate samples and agreed to be genotyped. DNA extractions from whole blood, which had been frozen at −80°, were carried out on a QIAsymphony instrument at the Center for Bioinformatics and Genomics (Indiana University). Genotyping was done at the Van Andel Institute (Grand Rapids, MI, USA) using the Infinium Multi-Ethnic Global D2 BeadChip (Illumina, Miami, USA). We imposed standard filters for quality control of loci at this stage (cluster separation < 0.3, AA R Mean < 0.2, AB R Mean < 0.2, BB R Mean < 0.2, 10% GC < 0.3) using GenomeStudio v2.4 (Illumina). Genotype calls (in .GCT format) for the 1,748,280 loci that passed initial quality control were made with Beeline autoconvert (Illumina).

7 Lessons Learned and Discussion

The main goals of the data preprocessing of the nuMoM2b dataset were to get a deeper understanding of the data, extract the most relevant features, and reduce the dimensionality of the data. Additionally, we wanted to make the fewest assumptions when filling in unknown values, and to understand the intricacies behind the dependencies present in the data.

The original ratio of features to patients of around 4,600 to 10,000 is too high to create reliable models because of the high possibility of over-fitting in a high dimensional space. Therefore, it was crucial for the interdisciplinary team to work together to leverage medical expertise, computing and statistics skills in order to gain a good grasp of the wealth of information in the nuMoM2b dataset. The data preprocessing team worked to make sure that any medical assumptions, such as grouping related conditions together or deciding to drop select details, were approved by the medical experts.
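The effect of the reduction on this ratio can be checked with simple arithmetic. The sketch below uses the counts reported in this document (10,038 patients, roughly 4,600 original features, and 364 and 465 features after reduction); the variable names are only for illustration.

```python
PATIENTS = 10_038

# Feature counts reported in this document (the original count is approximate).
feature_counts = {
    "original": 4_600,
    "general level": 364,
    "detailed level": 465,
}

# Patients available per feature at each level of reduction.
patients_per_feature = {
    label: PATIENTS / count for label, count in feature_counts.items()
}

for label, ratio in patients_per_feature.items():
    print(f"{label}: {ratio:.1f} patients per feature")
```

The original data offer only about two patients per feature, whereas the reduced sets offer more than twenty.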

It was equally crucial to understand the protocol that doctors follow in the administration of interventions. Debates about definitions, classification of conditions, and treatments exist and were taken into consideration. A notable example is the debate around the usage of progesterone as an intervention for preterm birth. Another is simply the definitions of spontaneous and indicated preterm birth and how the events that fall under those categories may have changed over time.

With regards to the data itself, the most significant challenges were understanding the dependencies between different features and their medical relevance to preterm birth. The data includes both patient interviews and abstractions from the patients’ charts. Sometimes the questions overlapped, other times they were parallel, or had features that combined information from both. Each source has a different level of relevance in building the models.

Standardization of data formatting was another obstacle. Much of the original labeling and information provided in the codebooks was not sufficient to directly begin the modeling process. For instance, labels such as variable type had to be manually added. It was also important to determine when specific data were relevant and when they were collected. For example, for data that were not collected exactly within the strict time designations for Visits 1 through 4, the associated dates were computed relative to the estimated date for the start of pregnancy. Manual inspection was also required when coded values of responses did not match the information shown in the data collection forms. Codings such as Don’t Know, or Refused response did not exist in the data even though they were mentioned in the codebooks.
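Placing an event on the timeline relative to the estimated start of pregnancy can be sketched as follows. The day-offset computation reflects the approach described above; the mapping of offsets to time points, the function names, and the window boundaries in the example are hypothetical and are not the study's visit definitions.

```python
from datetime import date


def gestational_day(event_date, estimated_start):
    """Days elapsed between the estimated start of pregnancy and the event."""
    return (event_date - estimated_start).days


def assign_time_point(day, windows):
    """Return the first time point whose inclusive (first, last) window contains `day`."""
    for name, (first_day, last_day) in windows.items():
        if first_day <= day <= last_day:
            return name
    return None  # outside every window; needs manual review


# Hypothetical window boundaries in days, for illustration only.
example_windows = {"visit_1": (42, 97), "visit_2": (112, 153)}

day = gestational_day(date(2021, 3, 1), date(2021, 1, 1))
time_point = assign_time_point(day, example_windows)
```

Returning `None` for events outside every window makes gaps explicit instead of silently attaching the event to the nearest visit.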

Throughout the preprocessing, we aimed to make the filtering and imputation steps as systematic as possible, while organizing the data into medically homogeneous groups. The discussion here is merely a glimpse of the intended goals and actions taken during the preprocessing. As our team delves into the PTB-related research goals, we anticipate that more preprocessing will be required on specific data.

8 Institutional review board statement and funding sources

Human subjects approval for this study, titled “SCH: Prediction of Preterm Birth in Nulliparous Women”, was obtained following review by Columbia University Human Subjects Institutional Review Board under number IRB-AAAR9413, and the City University of New York CUNY HRPP/IRB review number 2019-0855. Human subjects training requirements were completed by all authors of this study.

This work is supported by NIH/NLM (www.nlm.nih.gov) grant R01LM013327.

References

[1] http://www.nichd.nih.gov/research/supported/Pages/nuMoM2b.aspx.
[2] Clinical Informatics Group. Data pre-processing for the preterm prediction study MFMU dataset. http://www.cs.columbia.edu/~ansaf/cing/CCLS-13-04.pdf, 2013.
[3] David Haas et al. A description of the methods of the nulliparous pregnancy outcomes study: monitoring mothers-to-be (numom2b). American Journal of Obstetrics and Gynecology, 212(4):539.e1–539.e24, 2015.
[4] Ilia Vovsha, Ashwath Rajan, Ansaf Salleb-Aouissi, Anita Raja, Axinia Radeva, Hatim Diab, Ashish Tomar, and Ronald Wapner. Predicting preterm birth is not elusive: Machine learning paves the way to individual wellness, 2014.
[5] Ilia Vovsha, Ansaf Salleb-Aouissi, Anita Raja, Thomas Koch, Alex Rybchuk, Axinia Radeva, Ashwath Rajan, Yiwen Huang, Hatim Diab, Ashish Tomar, and Ronald Wapner. Using kernel methods and model selection for prediction of preterm birth. In Finale Doshi-Velez, Jim Fackler, David Kale, Byron Wallace, and Jenna Wiens, editors, Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 55–72, Children’s Hospital LA, Los Angeles, CA, USA, 18–19 Aug 2016. PMLR.

Posted August 26, 2021.


