A Predictive Model to Identify Complicated Clostridiodes difficile Infection
============================================================================

* Jeffrey A. Berinstein
* Calen A. Steiner
* Samara Rifkin
* D. Alexander Perry
* Dejan Micic
* Daniel Shirley
* Peter D.R. Higgins
* Vincent B. Young
* Allen Lee
* Krishna Rao

## Abstract

**Background** *Clostridioides difficile* infection (CDI) is a leading cause of healthcare-associated infections and may result in organ dysfunction, colectomy, and death. We recently showed that published risk scores to predict severe complications from CDI demonstrate poor performance upon external validation. We hypothesized that building and validating a model using geographically and temporally distinct cohorts would more accurately identify patients at risk for complicated CDI.

**Methods** We conducted a multi-center retrospective cohort study of adult subjects diagnosed with CDI in the US. After randomly partitioning the data into training/validation set, we developed and compared three machine learning algorithms (Lasso regression, random forest, stacked ensemble models) with 10-fold cross-validation that used structured EHR data collected within 48 hours of CDI diagnosis to predict disease-related complications from CDI (intensive care unit admission, colectomy, or death attributable to CDI within 30 days of diagnosis). Model performance was assessed using area under the receiver operating curve (AUC).

**Results** A total of 3,762 patients with CDI were included of which 218 (5.8%) had complications. Lasso regression, random forest, and stacked ensemble models all performed well with AUC ranging between 0.89-0.9. Variables of importance were similar across models, including albumin, bicarbonate, change in creatinine, systolic blood pressure, non-CDI-related ICU admission, and concomitant non-CDI antibiotics. Sensitivity analyses indicated that model performance was robust even when varying derivation cohort inclusion and CDI testing approach.

**Conclusion** Using a large heterogeneous population of patients, we have developed and validated a prediction model based on structured EHR data that accurately estimates risk for complications from CDI.

**Key Points** Machine learning models using structured electronic health records can be leveraged to accurately predict risk of severe complications related to *Clostridiodes difficile* infection, including intensive care unit admission, colectomy, and/or death.

Keywords
*   Clostridiodes difficile
*   machine learning
*   predictive model
*   severe disease

## Introduction

*Clostridiodes* (*Clostridium) difficile* infection (CDI) is the leading cause of healthcare-associated infection in United States (U.S.) hospitals, accounting for nearly half a million infections per year [1,2]. Furthermore, up to 8% of patients with CDI may develop complicated disease, which is associated with worse outcomes, including organ dysfunction, severe sepsis, colectomy, and death [3]. In addition to the substantial patient morbidity and mortality, CDI has been attributed to $4.8 billion in acute healthcare costs in the United States, with even more costs associated with non-acute-care settings [4].

While multiple effective medical therapies have been developed, such as vancomycin, fidaxomicin, monoclonal antibodies, and fecal microbiota transplant [5], it remains unclear which patients are highest risk for a complicated disease course and therefore should receive more aggressive upfront therapy to improve outcomes. Current risk scores developed to identify patients at high risk for complicated CDI, have limited generalizability due to being developed in small, single-center cohorts or having not undergone rigorous external or prospective validation [5–14]. Recently, our group demonstrated that published CDI severity scoring systems performed poorly when tested on a large, multicenter cohort within the US [15]. Thus, an accurate prediction model that can be easily applied early after CDI diagnosis is needed to identify patients at risk for complicated CDI and to facilitate effective therapy that minimizes the risk for adverse CDI outcomes. In this study, we aimed to determine whether using structured electronic health record data from several geographically distinct centers in the US would provide a more generalizable predictive model for complicated CDI.

## Methods

### Patient Cohorts

We conducted a multi-center retrospective longitudinal cohort study at four geographically and temporally distinct cohorts, including the University of Michigan (2010-2012, 2015-2016), University of Wisconsin (2014-2015), and University of Chicago (2013-2015), as previously described by Perry *et al*. [15]. Adult subjects ≥18 years who were diagnosed with CDI were included in our analysis. CDI was diagnosed by presence of diarrhea (≥3 unformed stools in a 24-hour period) and positive real-time PCR for the *tcdB* gene (Simplexa *C. difficile* Universal Direct, Diasorin Molecular LLC, Cypress, CA) at University of Wisconsin and University of Chicago. Meanwhile, at both University of Michigan cohorts, CDI was diagnosed by presence of diarrhea (≥3 unformed stools in a 24-hour period) and a positive stool test for toxigenic *C. difficile* (positive testing for both the glutamate dehydrogenase [GDH] antigen and TcdA/TcdB by enzyme immunoassay [C. Diff Quik Chek Complete, Alere, Waltham, MA], or real-time polymerase chain reaction for the *tcdB* gene performed when GDH/toxin results were discordant). Patients with positive laboratory testing for CDI were identified using electronic health record (EHR).

### Primary Outcome

We defined CDI as complicated if it led to any of three adverse outcomes within 30 days of CDI diagnosis: admission to intensive care unit (ICU), colectomy, or death attributable to CDI as determined by the study team physicians [13,17]. Clinical and demographic variables including comorbidities, medications, vitals, laboratory results, and study results (such as radiographic imaging) were collected for each patient’s admission through automated query of the EHR. The institutional review boards at the University of Michigan, University of Chicago, and University of Wisconsin gave ethical approval for this work.

### Predictor Variables

A total of 32 predictor variables were evaluated, including demographic, biometric, biochemical, and co-morbid conditions (see Supplementary Table 1). Variable inclusion was based on literature review of clinically relevant factors. Non-CDI-related ICU admission and non-CDI concurrent antibiotics were included as predictor variables but only if ICU admission and/or antibiotic use were unrelated to CDI.

### Data Pre-Processing

We used R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria) for cleaning and wrangling data. Data were randomly split with 75% of the data used for model training and the remaining 25% data held out for testing and validation. The data were stratified by the proportion of severe CDI events so that the distribution of the outcome was maintained in both the training and test set. Missing data were imputed using random forest-based imputation strategies (i.e., using the R package missForest) [18], which has previously been shown to have the lowest imputation error for both continuous and categorical variables [19]. Numeric variables were centered and scaled while categorical variables were recoded into dummy variables.

### Model Development and Testing

We developed three separate machine learning classification models using the *Tidymodels* framework [20], including L1-regularized logistic regression (least absolute shrinkage and selection operator or Lasso) using the R package *glmnet* [21], random forest with the R package *ranger* [22], and extreme gradient boosted trees with the R package *XGBoost* [23]. The best performing algorithms were combined in an ensemble model using stacking with the R package *stacks* [24]. Machine learning algorithms were first applied to training data to parameterize and fit the model. Ten-fold cross-validation was utilized to tune model hyperparameters. To evaluate the prediction accuracy of machine learning models, area under the receiver operating characteristic (ROC) curves (AUC) were calculated for each model using independent test data.

### Variable Importance

Variable importance was determined by permutation-feature importance analysis using the R package *vip* [25]. Permutation-feature importance measures the increase in prediction error when the variable is permuted and the relationship between variable and outcome is broken, thus the drop in the model score is indicative of how much the model depends on the feature [26]. To further improve interpretability of our machine learning models, we also performed locally interpretable model-agnostic explanations using the R package *breakDown*, which decomposes model predictions into parts that can be attributed to particular variables [27].

### Sensitivity Analyses

To determine whether our models were generalizable across sites and time, we performed a sensitivity analysis where we trained the machine learning algorithms on three cohorts and validated model performance on the fourth cohort. We repeated this process three times so that models were validated on each individual cohort.

Furthermore, as the definition of CDI varied by individual centers, we performed an additional sensitivity analysis to determine whether diagnosis of CDI by PCR only versus two-step mechanism influenced our model predictions. Sites that diagnosed CDI by PCR only (i.e. University of Wisconsin and University of Chicago) were analyzed separately from sites that utilized a two-step diagnostic approach (i.e. University of Michigan 2010 and 2016). We randomly split the data, training the model on 75% of the data and tested and validated our models on the remaining held-out 25% of the data. Model performance was evaluated as described above.

## Results

### Patient Characteristics

A total of 3,762 patients testing positive for *C. difficile* were collected from four cohorts at three sites from 01/01/2010 to 12/31/2015 (**Table 1)**. Overall, the mean age was 56.49 ± 19.82 years, 53.5% female, 68.6% Caucasian, 25.7% Black (with the highest proportion in the University of Chicago cohort), and 5.7% reported another race or not specified. A total of 218 patients (5.8%) met the primary endpoint, including 65 (4.5%) at the University of Chicago cohort; 90 (7.9%) at the University of Michigan 2010-2012 cohort; 28 subjects (4.4%) at the University of Michigan 2010-2012 cohort; and 35 (6.5%) at the University of Wisconsin.

View this table:
[Table 1:](http://medrxiv.org/content/early/2022/05/22/2022.05.18.22275113/T1)

Table 1: Patient Characteristics

### Model Training and Performance

Data were randomly split into a training set of 2,763 cases and an independent validation set of 921 cases. Splitting of the data set was stratified by the proportion of severe CDI to preserve the distribution of complicated CDI in both the training and validation set. We found that lasso regression, random forest, and stacked ensemble models all performed well with AUC scores ranging from 0.88-0.89 (**Figure 1**) when tested on an independent test dataset. However, XGBoost models showed poor performance (AUC 0.50) and were not carried forward in subsequent analyses (data not shown). The random forest model performed marginally better than the lasso regression and stacked ensemble model, but only with a magnitude of 0.01, which is unlikely to be clinically significant. Model calibration plots for the three models are illustrated in **Supplemental Figure 1**.

![Figure 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/05/22/2022.05.18.22275113/F1.medium.gif)

[Figure 1.](http://medrxiv.org/content/early/2022/05/22/2022.05.18.22275113/F1)

Figure 1. 
Receiver operator characteristic (ROC) curves for Lasso, Random Forest, and Stacked Ensemble models. After randomly splitting the data into training/validation sets, Lasso regression (*red line*), Random Forest (*blue line*), and Stacked Ensemble (*green line*) models were trained. All models demonstrated excellent performance when tested on an independent validation set (Lasso regression: AUC = 0.88 [95% CI 0.84-0.93]; Random Forest: AUC = 0.89 [95% CI 0.84-0.94]; Stacked Ensemble: AUC = 0.88 [95% CI 0.83-0.94]).

#### Sensitivity Analyses

To test the generalizability of our multi-cohort approach as well as to assess for temporal trends and variations in management practices and participant composition among the various cohorts, we performed a sensitivity analysis by deriving models using only three of the four cohorts in the dataset and holding out the fourth cohort as the independent test set. We found that even models trained on only three cohorts were able to predict severe CDI in the fourth cohort left out of the training set with high accuracy. Model performance was robust in this sensitivity analysis (AUC ranging from 0.84-0.92) although there was a drop in performance when data from the University of Chicago was held out and used as the test set (AUC 0.75-0.76) **(Figure 2**).

![Figure 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/05/22/2022.05.18.22275113/F2.medium.gif)

[Figure 2.](http://medrxiv.org/content/early/2022/05/22/2022.05.18.22275113/F2)

Figure 2. 
Models are robust despite geographical, demographic, and temporal variability. To determine whether models were generalizable across centers and across time, machine learning algorithms were derived and trained using data from three cohorts and then validated model performance on the fourth cohort (labeled as Test in the figure). This process was repeated three times so that models were validated on each individual cohort. The Lasso regression (*red line*), Random Forest (*blue line*), and Stacked ensemble (*green line*) models demonstrated good performance when tested on (**A**) the University of Michigan 2010-2012 cohort (Lasso regression: AUC = 0.84 [95% CI 0.79-0.89); Random Forest: AUC = 0.87 [95% CI 0.82-0.91]; Stacked Ensemble: AUC = 0.86 [95% CI 0.82-0.91]); (**B**) University of Wisconsin cohort (Lasso regression: AUC = 0.92 [95% CI 0.86-0.98]; Random Forest: AUC = 0.92 [95% CI 0.86-0.98]; Stacked Ensemble: AUC = 0.92 [95% CI 0.86-0.98]); and (**D**) University of Michigan 2015-2016 cohort (Lasso regression: AUC = 0.91 [95% CI 0.87-0.96]; Random Forest: AUC = 0.89 [95% CI 0.82-0.95]; Stacked Ensemble: AUC = 0.90 [95% CI 0.84-0.96]). Performance of the models dropped but still showed adequate performance when tested on (**C**) the University of Chicago cohort (Lasso regression: AUC = 0.76 [95% CI 0.70-0.82]; Random Forest: AUC = 0.75 [95% CI 0.67-0.80]; Stacked Ensemble: AUC = 0.75 [95% CI 0.68-0.82]).

In addition, to determine whether the method for CDI diagnosis may have impacted model performance, we performed separate sensitivity analyses using data from sites that used PCR alone (i.e. University of Chicago and University of Wisconsin) vs. two-step testing (i.e. University of Michigan 2010 and 2016) to train and validate models. For sites that used two-step testing, our models retained excellent performance (AUC 0.89 – 0.91), while model performance was lower when using sites that used PCR testing alone as the hold-out-test set (AUC 0.79 – 0.84) (**Supplemental Figure 2**).

### Predictive Features for Complicated CDI

We performed a permutation-based variable importance analysis to determine the variable importance of each predictor of complicated CDI (**Figure 3**). Interestingly, the variables of importance were similar across models, but varied in their relative contribution to each model. The top predictors shared across all models included albumin, bicarbonate, change in creatinine, systolic blood pressure, non-CDI-related ICU admission within 30 days, and concomitant non-CDI antibiotics within 30 days of CDI diagnosis. Model performance was similar despite differences in relative contributions by each variable.

![Figure 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/05/22/2022.05.18.22275113/F3.medium.gif)

[Figure 3.](http://medrxiv.org/content/early/2022/05/22/2022.05.18.22275113/F3)

Figure 3. 
Variable Importance Plot for Lasso Regression and Random Forest. Twenty of the most important variables by permutation-based variable importance analysis are illustrated for (**A**) Lasso Regression and (**B**) Random Forest. Variables positively associated with complicated *Clostridiodes difficile* infection (CDI) are shown in *blue* while variables negatively associated with complicated CDI are shown in *orange*. Please note, determining the directionality of association for variables using permutation-based variable importance analysis with nonlinear based machine learning techniques, such as random forest, is not possible.

To enhance the interpretability of our machine learning models, we also performed breakdown plots to determine how each variable contributes to a final prediction. For patients with low predicted probability of complicated CDI (**Figure 4A-B**), absence of factors, such as non-CDI-related ICU admission, concurrent non-CDI antibiotics, low CO2 levels, low systolic blood pressure, and peak WBC count were associated with decreased risk for complicated CDI. In contrast, for patients with high predicted probability for complicated CDI (**Figure 4C-D**), non-CDI-related ICU admission, high WBC count, low systolic blood pressure, low albumin level, and concurrent non-CDI antibiotics were associated with increased risk for complicated CDI.

![Figure 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/05/22/2022.05.18.22275113/F4.medium.gif)

[Figure 4.](http://medrxiv.org/content/early/2022/05/22/2022.05.18.22275113/F4)

Figure 4. 
Breakdown plot for patients with low and high predicted probability for complicated *Clostridiodes difficile* infection (CDI). Breakdown plot is shown for (**A**) Lasso and (**B**) random forest model in the same patient (patient #1) with low predicted probability for CDI. Similarly, breakdown plot is depicted for (**C**) Lasso and (**D**) random forest model in a patient with high predicted probability for complicated CDI (patient #277). Intercept represents the mean model-specific predicted probability for complicated CDI while each subsequent variable increases or decreases predicted probability and results in the overall predicted probability (labeled prediction).

## Discussion

In this multi-site cohort study, machine learning models based on structured electronic health record data accurately predicted disease-related complications from CDI. All three machine learning methods, including lasso regression, random forest, and a stacked ensemble method demonstrated excellent performance in predicting severe complications from CDI (i.e., AUC of 88-89%). Importantly, we intentionally developed models without site-specific variables, and our results suggest that this model is generalizable across centers and time, which is critical when considering the heterogeneity in patient population and practice patterns across the United States. Furthermore, our results were generally agnostic to specific algorithms and performed equally well when using both linear and nonlinear machine learning approaches. Finally, although there was some model-specific variability, the predictors most important in discriminating severe complications attributable to CDI were similar between models.

While several CDI severity and complication scoring systems have been developed previously, they generally were developed from single centers and were not externally validated [5–14]. Our group recently published the largest external validation of published CDI severity scoring systems and found that all models yielded AUC scores below 70% suggesting that current models are not generalizable and cannot reliably predict severe complications from CDI [15].

There are multiple notable strengths to our analysis which make our approach generalizable. First, our models were developed using data from three geographically distinct sites, which were composed of a heterogenous population. When we validated our model on each site as an independent hold-out data set, our models continue to demonstrate good performance. Secondly, our models were robust despite differences in temporal trends and practice patterns across sites. For example, CDI was diagnosed by positive PCR test alone at the University of Chicago and the University of Wisconsin, while a two-step algorithm was required for positive CDI diagnosis at the University of Michigan. Furthermore, the University of Michigan started to use Vancomycin as first-line treatment for CDI starting in 2013 while the other centers did not switch from metronidazole to vancomycin as first-line treatment until mid-2016. Despite these differences in practice patterns and temporal trends, our sensitivity analyses demonstrated that our model retained good performance when deriving models from the three earlier cohorts and validating it on the latter University of Michigan cohort as well as deriving and validating models on the two sites that used PCR testing alone for CDI diagnosis compared to sites that used a two-step algorithm.

While our model showed strong performance overall, we did see a noticeable drop in performance on sensitivity analyses when the model was validated on the cohort from the University of Chicago. We speculate that this may be related to the diversity of the patients comprising the University of Chicago cohort that was not captured when this site was not included in deriving the model. Black individuals represented a larger proportion of the cohort at the University of Chicago compared with other sites, while Black race was associated with a higher likelihood of severe CDI outcomes. Thus, important information related to racial diversity may have been lost when model derivation was performed using only the University of Michigan and University of Wisconsin cohorts which were comprised of a small proportion of Black patients. We believe this also strengthens the rationale for building and validating our primary model using four geographically and temporally heterogeneous cohorts, which was critical in developing a more generalizable model for predicting complicated CDI.

We achieved similar classification performance when using lasso regression, random forest or combining these algorithms using a stacked ensemble method. While random forest models showed a numerically higher performance compared to the other models, the incremental benefit was likely negligible. Machine learning algorithms, such as random forest and stacked ensemble methods, do not produce coefficients and are inherently less interpretable compared with lasso regression [26]. Thus, transforming model results into a risk score is more complex. Given the ease of interpretation as well as reduced computational costs associated with regression-based approaches, our lasso regression model may be favored for future applications.

In general, variables that carried the greatest importance were consistent across models and across sites. Most variables identified as important predictors of complicated CDI by our model are also consistent with our clinical and biological understanding of CDI pathogenesis and progression, such as peak creatinine, peak white blood cells, low albumin levels, low blood pressure, low hemoglobin, advanced age, concurrent antibiotic use for treating non-CDI infections, low bicarbonate, and ICU admissions. Importantly, all variables were collected within 48 hours of CDI diagnosis, which increases the clinical utility of our model, by allowing for early calculation of patients at risk for developing complicated CDI. Leukocytosis (white blood cell count > 15,000 cells/mL) and acute rise in serum creatinine > 1.5 mg/dL are well established markers for severe CDI and were initially incorporated into guidelines by Society for Healthcare Epidemiology of America and the Infectious Disease Society of America in 2018 to determine which patients at risk for severe disease and should be offered more aggressive upfront therapy (i.e., vancomycin over metronidazole) [28]. In addition, older age [29,30], hypoalbuminemia [31], low hemoglobin [32], concurrent antibiotic use [31], and ICU admission [14,31] have also been previously identified as predictors of poor outcomes from CDI. Although ICU admission was part of our composite endpoint, we were careful to include only patients who were admitted to the ICU prior to CDI diagnosis as our predictor variable.

Our study has several notable strengths. To our knowledge, this is the largest study combining data from four distinct cohorts composed of both temporally and geographically distinct patients to create a predictive model of complicated CDI. This significantly improved the generalizability of the model. Previous models predicting severe or complicated CDI were mostly derived from smaller single centers. By using rigorous and well-validated predictive modeling techniques and by comparing several different predictive modeling approaches, we were able to develop a highly accurate model for predicting complicated CDI using readily available structured EHR data. In addition, the use of permutation-based variable importance analysis allowed us to identify the importance of each predictor in our models. By improving the interpretability of our model, we anticipate this will enhance clinical utility, provider buy-in, and uptake of use when implemented.

These results should be interpreted in the context of several limitations. First, our analysis is retrospective. Prospective studies are necessary for understanding how such models perform in real time. Second, our models were derived from four cohorts from major academic medical centers in the midwestern United States. Model performance will have to be evaluated outside of this setting to confirm generalizability as patient populations, clinical protocols, and risk factors may vary across institutions. Third, given that our model merely identified associations, additional prospective investigation is required to establish the direction of the true underlying relationships (e.g., through RCTs). Moreover, despite the removal of variables that could serve as proxies for our composite outcome, some model features may not be true risk factors but rather markers for the beginning of complicated CDI itself. In the future, this may be elucidated by tracking the evolution of a patient’s risk over multiple days of their hospitalization. Fourth, we cannot exclude the possibility that patients may experience the outcome at another hospital, and thus we may have potentially underestimated the extent of complications from CDI. In addition, although CDI testing was recommended only for symptomatic patients during our study period and this was further validated by chart review, some positive CDI tests might reflect asymptomatic carriers. Lastly, our model employed a large number of variables which may affect clinician-perceived useability; however, this would be mitigated by embedding a decision tool which utilizing our prediction models directly within the EHR.

In summary, this multi-center cohort study comprised of a large heterogeneous population demonstrates that machine learning algorithms based on structured EHR data can accurately estimate patients’ risk for developing complications from CDI. Our approach leverages variables that can be readily extracted from the EHR early in the course of a patient’s CDI. This approach has many potential applications for guiding clinician decision making in the management of CDI. Future studies may determine whether prospective deployment of this model may aid clinicians to tailor patient therapy in real-time and allow for early use of more aggressive therapies to minimize risk of complications.

## Supporting information

Supplemental Figure 1 [[supplements/275113_file08.pdf]](pending:yes)

Supplemental Figure 2 [[supplements/275113_file09.pdf]](pending:yes)

Supplemental Table 1 [[supplements/275113_file10.docx]](pending:yes)

## Data Availability

All data produced in the present study are available upon reasonable request to the authors

## Funding

This study was supported by the National Institutes of Health grants HS027431 (to KR) and DK124567 (to AAL).

## Data Availability

All data produced in the present study are available upon reasonable request to the authors

## Conflict of Interest

The authors have declared that no conflict of interest exists.

## Figure Legends

**Supplemental Figure 1**. Calibration plot for Lasso, Random Forest, and Stacked Ensemble Models. The calibration plots fitted with a locally estimated scatterplot smoothing (loess) curve for the Lasso regression (*red line*), Random Forest (*blue line*), and Stacked Ensemble (*green line*) models show overall good agreement between observed and predicted outcomes. However, the Lasso regression and Random Forest models tend to overestimate risk for complicated CDI at low probabilities while the Stacked Ensemble model underestimates the risk for complicated CDI at low probabilities.

**Supplemental Figure 2**. Models retain good performance despite differences in definitions for *Clostridiodes difficile* infection (CDI) across centers. As the definition for CDI employed at the University of Chicago and the University of Wisconsin differed from that used at the University of Michigan, we performed a sensitivity analysis to see if our model was robust to these differences. (**A**) Using data only from the University of Chicago and University of Wisconsin cohorts which were randomly split 75/25 into training/validation sets, machine learning algorithms were trained and then validated on the independent test set. Lasso models (*red line*), random forest (*blue line*), and stacked ensemble (*green line*) showed good performance when validated on the independent test set. (**B**) This process was repeated but using data only from the University of Michigan 2010 and 2016 cohorts. From University of Chicago and University of Wisconsin and then validated model performance on both University of Michigan cohorts (labeled as Michigan 2010 and Michigan 2016).

## Footnotes

*   **Disclosures:** PDRH received consulting fees from AbbVie, Amgen, Genentech, JBR Pharma and Lycera. JAB received consulting fees from Buhlmann Diagnostic Corp. KR is supported in part from an investigator-initiated grant from Merck & Co, Inc.; he has consulted for Bio-K+ International, Inc., Roche Molecular Systems, Inc., Seres Therapeutics, Inc. and Summit Therapeutics, Inc. All other authors report no disclosures.

## Abbreviations

AUC
:   area under the receiver operator characteristic curve
CDI
:   *Clostridiodes difficile* infection
EHR
:   electronic health records
ICU
:   intensive care unit

*   Received May 18, 2022.
*   Revision received May 18, 2022.
*   Accepted May 22, 2022.


*   © 2022, Posted by Cold Spring Harbor Laboratory

The copyright holder for this pre-print is the author. All rights reserved. The material may not be redistributed, re-used or adapted without the author's permission.

## References

1.  1.Hall AJ, Curns AT, McDonald LC, Parashar UD, Lopman BA. The roles of Clostridium difficile and norovirus among gastroenteritis-associated deaths in the United States, 1999-2007. Clin Infect Dis 2012; 55:216–223.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/cid/cis386&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22491338&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

2.  2.Lessa FC, Mu Y, Bamberg WM, et al. Burden of Clostridium difficile Infection in the United States. New England Journal of Medicine 2015; 372:825–834.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1056/NEJMoa1408913&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25714160&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

3.  3.Cheng Y-W, Fischer M. Treatment of Severe and Fulminnant Clostridioides difficile Infection. Curr Treat Options Gastroenterol 2019; 17:524–533.
    
    
4.  4.Dubberke ER, Olsen MA. Burden of Clostridium difficile on the healthcare system. Clin Infect Dis 2012; 55 Suppl 2:S88–92.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/cid/cis335&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22752870&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

5.  5.Bagdasarian N, Rao K, Malani PN. Diagnosis and treatment of Clostridium difficile in adults: a systematic review. JAMA 2015; 313:398–408.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.2014.17103&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25626036&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

6.  6.Belmares J, Gerding DN, Parada JP, Miskevics S, Weaver F, Johnson S. Outcome of metronidazole therapy for Clostridium difficile disease and correlation with a scoring system. J Infect 2007; 55:495–501.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jinf.2007.09.015&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17983659&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000251509600004&link_type=ISI) 

7.  7.Bhangu S, Bhangu A, Nightingale P, Michael A. Mortality and risk stratification in patients with Clostridium difficile-associated diarrhoea. Colorectal Dis 2010; 12:241–246.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1463-1318.2009.01832.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19508548&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000274670400015&link_type=ISI) 

8.  8.Butt E, Foster JAH, Keedwell E, et al. Derivation and validation of a simple, accurate and robust prediction rule for risk of mortality in patients with Clostridium difficile infection. BMC Infect Dis 2013; 13:316.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1186/1471-2334-13-316&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23849267&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

9.  9.Bloomfield MG, Carmichael AJ, Gkrania-Klotsas E. Mortality in Clostridium difficile infection: a prospective analysis of risk predictors. Eur J Gastroenterol Hepatol 2013; 25:700–705.
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=23442414&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

10. 10.Drew RJ, Boyle B. RUWA scoring system: a novel predictive tool for the identification of patients at high risk for complications from Clostridium difficile infection. J Hosp Infect 2009; 71:93–94; author reply 94-95.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.jhin.2008.09.020&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19041159&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000262551700014&link_type=ISI) 

11. 11.Gujja D, Friedenberg FK. Predictors of serious complications due to Clostridium difficile infection. Aliment Pharmacol Ther 2009; 29:635–642.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1365-2036.2008.03914.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=19077106&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

12. 12.Hensgens MPM, Dekkers OM, Goorhuis A, LeCessie S, Kuijper EJ. Predicting a complicated course of Clostridium difficile infection at the bedside. Clinical Microbiology and Infection 2014; 20:0301–0308.
    
    
13. 13.Na X, Martin AJ, Sethi S, et al. A Multi-Center Prospective Derivation and Validation of a Clinical Prediction Tool for Severe Clostridium difficile Infection. PLoS One 2015; 10:e0123405.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1371/journal.pone.0123405&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25906284&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

14. 14.Toro DH, Amaral-Mojica KM, Rocha-Rodriguez R, Gutierrez-Nuñez J. An Innovative Severity Score Index for Clostridium difficile Infection: A Prospective Study. Infectious Diseases in Clinical Practice 2011; 19:336–339.
    
    
15. 15.Perry DA, Shirley D, Micic D, et al. External Validation and Comparison of Clostridioides difficile Severity Scoring Systems. Clinical Infectious Diseases 2021; :ciab737.
    
    
16. 16.Li BY, Oh J, Young VB, Rao K, Wiens J. Using Machine Learning and the Electronic Health Record to Predict Complicated Clostridium difficile Infection. Open Forum Infect Dis 2019; 6:ofz186.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ofid/ofz186&link_type=DOI) 

17. 17.Rao K, Micic D, Natarajan M, et al. Clostridium difficile Ribotype 027: Relationship to Age, Detectability of Toxins A or B in Stool With Rapid Testing, Severe Infection, and Mortality. Clin Infect Dis 2015; 61:233–241.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/cid/civ254&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=25828993&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

18. 18.Stekhoven DJ, Buhlmann P. MissForest--non-parametric missing value imputation for mixed-type data. Bioinformatics 2012; 28:112–118.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/bioinformatics/btr597&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22039212&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000298383800017&link_type=ISI) 

19. 19.Waljee AK, Mukherjee A, Singal AG, et al. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 2013; 3:e002847.
    
    [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6NzoiYm1qb3BlbiI7czo1OiJyZXNpZCI7czoxMToiMy84L2UwMDI4NDciO3M6NDoiYXRvbSI7czo1MDoiL21lZHJ4aXYvZWFybHkvMjAyMi8wNS8yMi8yMDIyLjA1LjE4LjIyMjc1MTEzLmF0b20iO31zOjg6ImZyYWdtZW50IjtzOjA6IiI7fQ==) 

20. 20.Kuhn M, Wickham H. Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. 2020; Available at: [https://www.tidymodels.org](https://www.tidymodels.org).
    
    
21. 21.Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 2010; 33:1–22.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1467-9868.2005.00503.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=20808728&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000275203200001&link_type=ISI) 

22. 22.Wright MN, Ziegler A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 2017; 77:1–17.
    
    
23. 23.Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016; :785–94.
    
    
24. 24.Couch S, Kuhn M. stacks: Tidy Model Stacking. 2022. Available at: [https://stacks.tidymodels.org/](https://stacks.tidymodels.org/).
    
    
25. 25.Greenwell B M, Boehmke B C. Variable Importance Plots—An Introduction to the vip Package. The R Journal 2020; 12:343.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.32614/RJ-2020-013&link_type=DOI) 

26. 26.Molnar C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed. 2022. Available at: christophm.github.io/interpretable-ml-book/.
    
    
27. 27.Staniak M, Biecek P. Explanations of Model Predictions with live and breakDown Packages. The R Journal 2018; 10:395–409.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.32614/RJ-32018-32017&link_type=DOI) 

28. 28.McDonald LC, Gerding DN, Johnson S, et al. Clinical Practice Guidelines for Clostridium difficile Infection in Adults and Children: 2017 Update by the Infectious Diseases Society of America (IDSA) and Society for Healthcare Epidemiology of America (SHEA). Clin Infect Dis 2018; 66:e1–e48.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/cid/cix1085&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=29462280&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

29. 29.Chiang H-Y, Huang H-C, Chung C-W, et al. Risk prediction for 30-day mortality among patients with Clostridium difficile infections: a retrospective cohort study. Antimicrobial Resistance & Infection Control 2019; 8:175.
    
    
30. 30.Ananthakrishnan AN, Guzman-Perez R, Gainer V, Murphy S. Predictors of severe outcomes associated with Clostridium difficile infection in patients with Inflammatory Bowel Disease. Aliment Pharmacol Ther 2012; 35:789–795.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/j.1365-2036.2012.05022.x&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22360370&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F05%2F22%2F2022.05.18.22275113.atom) 

31. 31.Tay HL, Chow A, Ng TM, Lye DC. Risk factors and treatment outcomes of severe Clostridioides difficile infection in Singapore. Sci Rep 2019; 9:13440.
    
    
32. 32.Lee E, Song K-H, Bae JY, et al. Risk factors for poor outcome in community-onset Clostridium difficile infection. Antimicrobial Resistance & Infection Control 2018; 7:75.