Comparison of machine-learning and logistic regression models to predict 30-day unplanned readmission: a development and validation study
=========================================================================================================================================

* Masao Iwagami
* Ryota Inokuchi
* Eiryo Kawakami
* Tomohide Yamada
* Atsushi Goto
* Toshiki Kuno
* Yohei Hashimoto
* Nobuaki Michihata
* Tadahiro Goto
* Tomohiro Shinozaki
* Yu Sun
* Yuta Taniguchi
* Jun Komiyama
* Kazuaki Uda
* Toshikazu Abe
* Nanako Tamiya

## Abstract

We compared the predictive performance of gradient-boosted decision tree (GBDT), random forest (RF), deep neural network (DNN), and logistic regression (LR) with the least absolute shrinkage and selection operator (LASSO) for 30-day unplanned readmission, according to the number of predictor variables and presence/absence of blood-test results. We used electronic health records of patients discharged alive from 38 hospitals in 2015–2017 for derivation (n=339,513) and in 2018 for validation (n=118,074), including basic characteristics (age, sex, admission diagnosis category, number of hospitalizations in the past year, discharge location), diagnosis, surgery, procedure, and drug codes, and blood-test results. We created six patterns of datasets having different numbers of binary variables (that ≥5% or ≥1% of patients or ≥10 patients had) with and without blood-test results. For the dataset with the smallest number of variables (102), the c-statistic was highest for GBDT (0.740), followed by RF (0.734), LR-LASSO (0.720), and DNN (0.664). For the dataset with the largest number of variables (1543), the c-statistic was highest for GBDT (0.764), followed by LR-LASSO (0.755), RF (0.751), and DNN (0.720). We found that GBDT generally outperformed LR-LASSO, but the difference became smaller when the number of variables was increased and blood-test results were used.

Keywords
*   risk prediction
*   machine learning
*   artificial intelligence
*   readmission
*   electronic health records

## Introduction

Unplanned hospital readmission is a common issue in public health because it imposes burdens on medical budgets, healthcare staff, and patients [1]. Some unplanned hospital readmission is preventable [2]. Therefore, it is important to identify patients with a high risk of unplanned readmission for intervention. For achieving this, a clinical prediction model that can accurately estimate the probabilities of clinical outcomes for individual patients is essential [3].

Clinical prediction models have been developed for 30- or 28-day unplanned hospital readmission; most are logistic regression (LR) models based on a small number of clinical variables [4, 5]. However, the prediction ability of these models is often suboptimal, with a c-statistic or area under the receiver operating characteristic curve (AUROC) of <0.70. It has been suggested that the prediction ability could be improved by increasing the number and variety of clinical variables and by developing more accurate statistical or mathematical models with modern computing techniques [4, 5].

Recently, machine-learning models, such as the gradient-boosted decision tree (GBDT), random forest (RF), and deep neural network (DNN), have been applied to claims data and electronic health records (EHRs) to build prediction models for readmission [6-8]. However, according to a recent systematic review, there was no statistically significant difference in the predictive performance between “traditional regression models” (defined in the paper as LR or Cox regression models, regardless of feature selection techniques) and machine-learning models, with average c-statistic values of 0.71 for 26 studies and 0.74 for 15 studies, respectively, corresponding to a difference in c-statistic of 0.03 (95% confidence interval [CI]: –0.01 to 0.07) [8]. A major limitation of that review, however, is that only a few of the included studies directly compared “traditional regression models” and commonly used machine-learning models within the same study [9-13]. Moreover, the studies included in the systematic review tended to use a small number of predictor variables. The predictive performance of machine-learning models may depend on the number of predictor variables, as well as the inclusion of continuous variables potentially having nonlinear relationships with a study outcome, such as blood-test results. Some machine-learning models may outperform an LR model if the number of predictor variables is increased and blood-test results are used for prediction.

Thus, using the EHRs obtained from 38 Japanese hospitals, we aimed to develop, validate, and compare GBDT, RF, DNN, and LR with the least absolute shrinkage and selection operator (LASSO) for predicting 30-day unplanned readmission, by intentionally creating six patterns of datasets having different numbers of predictor variables (which ≥5% or ≥1% of patients or ≥10 patients had) with and without blood-test results. Notably, among several feature-selection techniques for LR, such as stepwise variable selection and the LASSO, we decided to use the LASSO because it is generally less likely to cause overfitting than other techniques [14] and also because LR-LASSO outperformed LR with the stepwise variable selection in a study predicting 30-day all-cause non-elective readmission [11]. Our study hypothesis was that some machine-learning models (GBDT, RF, or DNN) outperform LR-LASSO more prominently if the number of predictor variables is increased and blood-test results are used. The findings of this study could help hospitals apply clinical prediction models to their EHRs to identify patients at high risk for readmission and intervene efficiently.

## Results

In the Medical Data Vision (MDV) database, we identified 635,509 discharges of 410,941 patients, with at least one blood test during hospitalization, who were admitted after January 1, 2015 and discharged before December 31, 2018, from 38 hospitals (**Figure 1**). After excluding admissions associated with childbirth, in-hospital deaths, transfers to other hospitals, and cases with missing data, 457,587 discharges were eligible for analysis: 339,513 discharges (mean age 62.0 ± 24.6 years, 54.3% men) from January 1, 2015 to December 31, 2017 in the derivation dataset and 118,074 discharges (mean age 63.4 ± 24.1 years, 54.1% men) from January 1 to December 31, 2018 in the validation dataset.

![Figure 1.](https://www.medrxiv.org/content/medrxiv/early/2023/05/11/2023.05.06.23289569/F1.medium.gif)

Figure 1. Flowchart
DPC, Diagnosis Procedure Combination; MDV, Medical Data Vision.

The incidence of 30-day unplanned readmission was 6.8% (23,108/339,513) and 6.4% (7,507/118,074) for the derivation and validation datasets, respectively. **Table 1** presents the basic characteristics of the patients overall and by outcome status. Patients with 30-day unplanned readmission tended to be older men and were more likely to have been hospitalized in the past year, admitted for diseases of the respiratory or digestive system, and discharged to nursing homes than those without 30-day unplanned readmission.

[Table 1.](http://medrxiv.org/content/early/2023/05/11/2023.05.06.23289569/T1)

Table 1. Baseline characteristics of patients

Among the 2102 International Classification of Diseases, 10th revision (ICD-10) codes, 914 surgery codes, 122 procedure codes, and 468 Anatomical Therapeutic Chemical (ATC) codes, we excluded variables that <10 patients had, in both the derivation and validation datasets, to avoid the “perfect separation” (or “complete separation”) problem [15]. Consequently, 823 ICD-10 codes (**Supplementary Table S1**), 259 surgery codes (**Supplementary Table S2**), 65 procedure codes (**Supplementary Table S3**), and 381 ATC codes (**Supplementary Table S4**) were regarded as candidate binary predictor variables for the following analyses. The distribution of the 10 types of blood-test results, as the last measurement during hospitalization, is presented in **Supplementary Table S5**.

Then, we intentionally created six patterns of datasets with different numbers of the candidate binary predictor variables (that ≥5% or ≥1% of patients or ≥10 patients had, in both the derivation and validation datasets) with and without blood-test results, in addition to the basic characteristics presented in **Table 1**. The number of candidate predictor variables was 102 for pattern 1 (including binary variables that ≥5% had, without blood-test results), 112 for pattern 2 (including binary variables that ≥5% had, with blood-test results), 296 for pattern 3 (including binary variables that ≥1% had, without blood-test results), 306 for pattern 4 (including binary variables that ≥1% had, with blood-test results), 1533 for pattern 5 (including binary variables that ≥10 patients had, without blood-test results), and 1543 for pattern 6 (including binary variables that ≥10 patients had, with blood-test results).

For each pattern, using the derivation dataset, each model was developed with optimized hyperparameters (**Supplementary Table S6**). By applying these models to the validation dataset, the discrimination ability of each model, indicated by the c-statistic or AUROC, was evaluated, as shown in **Figure 2** and **Supplementary Table S7**. The GBDT outperformed LR-LASSO for all patterns (DeLong’s test P-value <0.001). The discrimination ability of DNN was the worst for all patterns. In more detail, for pattern 1 (including binary variables that ≥5% had, without blood-test results), which had the smallest number of variables, the point estimate of the c-statistic was highest for GBDT (0.740), followed by RF (0.734), LR-LASSO (0.720), and DNN (0.664). For pattern 6 (including binary variables that ≥10 patients had, with blood-test results), which had the largest number of variables, the point estimate of the c-statistic was highest for GBDT (0.764), followed by LR-LASSO (0.755), RF (0.751), and DNN (0.720). The difference between GBDT and LR-LASSO was therefore small for pattern 6 (0.009, compared with 0.020 for pattern 1).

![Figure 2.](https://www.medrxiv.org/content/medrxiv/early/2023/05/11/2023.05.06.23289569/F2.medium.gif)

Figure 2. C-statistic of each model for six patterns of datasets having different numbers of the candidate predictor variables with and without blood-test results
AUROC, area under the receiver operating characteristic curve; CI, confidence interval; GBDT, gradient-boosted decision tree; RF, random forest; DNN, deep neural network; LR-LASSO, logistic regression with the least absolute shrinkage and selection operator.

Pattern 1: Basic characteristics and binary predictor variables that ≥5% of patients had, without blood-test results (102 variables)

Pattern 2: Basic characteristics and binary predictor variables that ≥5% of patients had, with blood-test results (112 variables)

Pattern 3: Basic characteristics and binary predictor variables that ≥1% of patients had, without blood-test results (296 variables)

Pattern 4: Basic characteristics and binary predictor variables that ≥1% of patients had, with blood-test results (306 variables)

Pattern 5: Basic characteristics and binary predictor variables that ≥10 patients had, without blood-test results (1533 variables)

Pattern 6: Basic characteristics and binary predictor variables that ≥10 patients had, with blood-test results (1543 variables)

For pattern 6 (including binary variables that ≥10 patients had, with blood-test results), the ROC curves are shown in **Figure 3**, and the calibration plots are shown in **Figure 4**. As graphically shown, the calibration of every model was similarly good for patients at low risk for readmission. However, GBDT and LR-LASSO tended to overestimate, whereas RF and DNN tended to underestimate the probability of readmission for patients at higher risk for readmission.

![Figure 3.](https://www.medrxiv.org/content/medrxiv/early/2023/05/11/2023.05.06.23289569/F3.medium.gif)

Figure 3. Receiver operating characteristic curve of each model for the dataset with the largest number of variables including blood-test results (1543 variables)
AUROC, area under the receiver operating characteristic curve; CI, confidence interval; GBDT, gradient-boosted decision tree; RF, random forest; DNN, deep neural network; LR-LASSO, logistic regression with the least absolute shrinkage and selection operator.

![Figure 4.](https://www.medrxiv.org/content/medrxiv/early/2023/05/11/2023.05.06.23289569/F4.medium.gif)

Figure 4. Calibration plot of each model for the dataset with the largest number of variables including blood-test results (1543 variables)
GBDT, gradient-boosted decision tree; RF, random forest; DNN, deep neural network; LR-LASSO, logistic regression with the least absolute shrinkage and selection operator.
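The comparison behind a calibration plot can be sketched numerically. The following is a minimal illustration, not the procedure used for Figure 4: it assumes a simple quantile-binned approach, and the function name `calibration_table` and the toy numbers are hypothetical. A bin whose mean predicted probability exceeds the observed readmission proportion indicates overestimation; the reverse indicates underestimation.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean predicted probability, observed proportion) per quantile bin.

    Assumption: bins are formed by sorting on predicted probability and
    cutting into n_bins groups of (nearly) equal size.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    table = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        idx = order[start:stop]
        if not idx:
            continue  # skip empty bins when there are fewer rows than bins
        mean_pred = sum(probs[i] for i in idx) / len(idx)
        observed = sum(outcomes[i] for i in idx) / len(idx)
        table.append((mean_pred, observed))
    return table


# Toy data (hypothetical): eight discharges, two bins.
toy_probs = [0.1, 0.1, 0.2, 0.2, 0.6, 0.6, 0.8, 0.8]
toy_outcomes = [0, 0, 0, 1, 0, 1, 0, 1]
toy_table = calibration_table(toy_probs, toy_outcomes, n_bins=2)
for mean_pred, observed in toy_table:
    direction = "overestimates" if mean_pred > observed else "underestimates or matches"
    print(f"predicted {mean_pred:.2f} vs observed {observed:.2f}: model {direction}")
```

In this toy example the higher-risk bin has a mean predicted probability above the observed proportion, which is the pattern described above for GBDT and LR-LASSO among patients at higher risk.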

For pattern 6 (including binary variables that ≥10 patients had, with blood-test results), **Supplementary Table S8** presents all regression coefficients in LR-LASSO, whereas **Figure 5** shows the variable importance of the top 10 important predictors for GBDT, suggesting that age, blood-test results, and number of hospitalizations in the past year were important predictors.

![Figure 5.](https://www.medrxiv.org/content/medrxiv/early/2023/05/11/2023.05.06.23289569/F5.medium.gif)

Figure 5. Variable importance of the top 10 important predictors in gradient-boosted decision tree for the dataset with the largest number of variables including blood-test results (1543 variables)
ICD-10, International Classification of Diseases, 10th revision; BUN, blood urea nitrogen; WBC, white blood cell.

## Discussion

Using the EHRs of 38 Japanese hospitals, we systematically compared the prediction performance for 30-day unplanned readmission among commonly used machine-learning models (GBDT, RF, and DNN) and an LR-LASSO model, by intentionally creating six patterns of datasets having different numbers of predictor variables (that ≥5% or ≥1% of patients or ≥10 patients had) with and without blood-test results. The discrimination ability mostly improved if the number of predictor variables was increased or blood-test results were added. For the latter patterns of datasets, the c-statistics of the models except for DNN were approximately 0.75 or higher, suggesting that the performance of these models was better than or comparable to that of previously reported models [4, 5, 8]. For each pattern of dataset, the c-statistic of GBDT was higher than that of LR-LASSO. However, against our hypothesis (that some machine-learning models outperform LR-LASSO more prominently if the number of predictor variables is increased and blood-test results are used), the difference in c-statistics between GBDT and LR-LASSO became rather smaller in the last pattern of dataset, which had the largest number of binary variables and blood-test results.

To our knowledge, this was the first systematic comparison of the predictive performance of commonly used machine-learning models and an LR model for unplanned readmission, using datasets characterized by different numbers of predictor variables and the presence/absence of blood-test results. A recent systematic review [8] included several studies comparing the performance of different prediction models for readmission with the same dataset; however, the target population was narrow, e.g., patients with lupus [12] and heart failure [13]. Our study is partly similar to that of Jamei et al. [9], who used the EHRs of >300,000 patients in Northern California, USA to compare the predictive performance of RF with 100 features; NN with 100, 500, and 1667 features; and LR with 100, 500, and 1667 features. With 100 features, RF and NN outperformed LR, with AUROCs of 0.77, 0.76, and 0.72, respectively. With 1667 features, NN performed far better than LR, with AUROCs of 0.78 and 0.66, respectively. We speculate that the reduction in the predictive performance of LR according to an increase in the number of features was probably due to overfitting, as it appears that the authors did not conduct any regularization or penalization via means such as LASSO, which we used in the present study. In contrast to their study, in the present study, the predictive performance of DNN was worse than that of the other models. This is probably because our dataset, which included mostly binary variables and only a small number of continuous variables (such as age and 10 blood-test results), was simpler than the dataset of Jamei et al., which had a larger number of continuous variables, such as vital signs and 26 blood-test results [9].

To date, comparisons of machine-learning models and “traditional regression models” (defined as LR or Cox regression models, regardless of feature selection techniques [8]) have been performed for various outcomes. Christodoulou et al. conducted a systematic review of 71 studies on clinical prediction models, regardless of outcomes, and concluded that there was no evidence of machine-learning methods outperforming LR [16]. However, the systematic review was limited in that it included only studies with <100 predictor variables; the authors identified seven studies with >100 predictor variables, all of which were excluded owing to “high risk of bias” in their classification [16]. Therefore, it remained unclear whether machine-learning methods can outperform LR if the dataset includes a larger number of predictor variables [16]. Our study answers this question: we intentionally created six patterns of datasets having different numbers of predictor variables (102, 112, 296, 306, 1533, or 1543) with and without blood-test results, and in the dataset with the largest number of variables, the difference in c-statistics between GBDT and LR was rather small. Therefore, the number of predictor variables does not seem to affect the results and conclusions very much when comparing the predictive performance of the machine-learning and LR models. Instead, the types or characteristics of predictors may matter. Notably, in the present study, the vast majority of the predictor variables were binary, and the number of continuous variables was small. An increased number of continuous variables, particularly those having nonlinear relationships with a study outcome, and the use of more complex information (such as images and sounds) may be needed to demonstrate the superiority of machine-learning methods over “traditional regression models” in clinical risk prediction. Further studies are needed to clarify this aspect.

Regarding clinical implications, according to the results of the present study, for hospitals applying clinical prediction models to their EHRs to identify patients at high risk for readmission, the GBDT may be recommended. However, the benefit of GBDT over LR-LASSO was small in the present study, suggesting that “traditional regression models” such as an LR model may also be acceptable, considering that they are more user-friendly and easier to interpret. In GBDT, the variable importance was high for age, blood-test results, and number of hospitalizations in the previous year. Therefore, clinicians should be conscious of these variables when their patients are to be discharged. Patients at high risk for readmission may need to have their discharge postponed or receive intensive post-discharge follow-up as an outpatient [2].

This study had several limitations. First, we used data from a relatively large number of hospitals, but it is unclear whether these hospitals are representative of Japan. Although we believe that the selection of these hospitals is unlikely to be associated with the performance of the machine-learning models, our findings should be validated using data from different hospitals in Japan as well as hospitals in other countries. Second, we were unable to differentiate between patients admitted to different hospitals, because the hospital identifier was omitted to preserve privacy and data safety in the MDV database. Therefore, we could not account for any clustering effect by hospitals in our analysis. Third, the incidence of 30-day unplanned readmission was slightly lower in 2018 in the validation dataset (6.4%) than that in 2015-2017 in the derivation dataset (6.8%), possibly suggesting a decreasing trend in hospital readmission in Japan. This trend might have harmed the applicability of the developed models to the validation dataset, resulting in underestimation of the predictive ability of the models. Fourth, although the MDV database contained important information on recorded diagnoses and treatments during hospitalization, as well as blood-test results, it lacks other clinical variables that may be useful for the prediction of hospital readmission, such as vital signs and socioeconomic features [8]. Finally, we used a hospital-based database and could only predict readmission to the hospital that the patient was initially admitted to, rather than readmission to any hospital. In emergencies, patients may be sent to another hospital for readmission. Thus, a study utilizing a population-based database should be conducted to assess the performance of machine-learning methods in predicting readmission to any hospital.

## Conclusions

Using the EHRs of 38 hospitals in the MDV database, we compared the predictive performance of machine-learning and LR models for six patterns of datasets having different numbers of predictor variables with and without blood-test results. For every pattern, GBDT outperformed LR-LASSO. However, the difference in c-statistic between GBDT and LR-LASSO was small, especially in the dataset with the largest number of variables (1543 variables including blood-test results). Thus, instead of the number of predictor variables, future studies should focus on the types or characteristics of predictors in EHRs for demonstrating the superior performance of machine-learning methods to “traditional regression models”.

## Methods

### Design and setting

We conducted a retrospective cohort study using a hospital-based database in Japan.

### Data source

Diagnosis Procedure Combination (DPC) data in Japan are administrative claims data and discharge summaries, which are collected within the DPC system, a case-mix patient classification and lump-sum payment system for inpatients in acute care hospitals [17]. The MDV database, which was built by Medical Data Vision Co., Ltd. (Tokyo, Japan), consisted of >350 acute care hospitals, approximately 20% of the DPC hospitals in Japan. It consisted of hospitals that used the business support system of MDV Co., Ltd. and provided consent for secondary data use for research purposes. The age distribution of inpatients in the MDV database is similar to that of all DPC hospitals, whereas the hospital volume (i.e., number of beds) tends to be larger [18]. During the study period between January 1, 2015 and January 31, 2019, laboratory test values were available from 38 hospitals that agreed to provide the data. Thus, in this study, we used DPC data (administrative claims data and discharge summaries) and blood-test results from 38 hospitals. We did not conduct any sample size calculation, but planned to use all the available data.

The DPC data included basic patient characteristics such as age and sex; recorded diagnoses based on ICD-10 codes [19], including the admission diagnosis (indicating the reason for hospital admission), main diagnosis, most resource-consuming diagnosis, second-most resource-consuming diagnosis, comorbidities present on admission (up to four diagnoses), and complications arising after admission (up to four diagnoses), recorded by the responsible physician at the time of discharge; Japanese original surgery codes [20], if a patient received surgery during hospitalization; Japanese original procedure codes [21], such as mechanical ventilation and dialysis; and original Japanese prescription codes (including both oral and intravenous drugs), which are linked to the ATC codes [22]. In addition, the admission status (planned or unplanned) and discharge status (dead, discharged to nursing facility or home, transferred to other hospitals) are available.

The study was approved by the Ethics Committee of the University of Tsukuba (approval no. 1414) in accordance with the Declaration of Helsinki. Because the claims data were anonymized before the researchers received them, individual participants’ consent was waived according to the ethical guidelines for medical and health research involving human subjects [23]. To ensure privacy and data safety, the hospital identifier was omitted by MDV Co., Ltd. in the data that we received. Thus, we could not identify the names and characteristics of the 38 hospitals and did not know which of the 38 hospitals the patients were admitted to.

### Study participants

We identified the discharges of patients who were admitted to the 38 hospitals, with at least one blood test during hospitalization, from January 1, 2015 to December 31, 2018. We excluded: i) cases with an admission diagnosis category suggesting childbirth (ICD-10 O00-Q99), ii) in-hospital deaths, iii) transfers to other hospitals, and iv) cases with missing values for discharge location or for one or more of 10 basic blood tests (white cell count, hemoglobin, platelet count, sodium, potassium, chloride, creatinine, blood urea nitrogen, aspartate transaminase, and alanine transaminase). We excluded cases with missing values for one or more of the 10 basic blood tests because (i) missingness is not likely to be at random, and (ii) even if missingness is at random, strategies to impute missing data differ across machine-learning models or are not established [24], whereas the objective of the present study was to compare the predictive performance of different models on the same complete dataset. If the same patient was hospitalized multiple times during the study period, we treated each admission as an independent admission, and we used the number of hospitalizations in the past year as a predictor variable to reduce dependence between them.

The data were split into derivation data for patients discharged from January 1, 2015 to December 31, 2017, which were used to develop the models, and validation data for patients discharged from January 1 to December 31, 2018, which were used to assess the discrimination ability and calibration of the models.
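The temporal split can be sketched as follows. This is an illustrative sketch only: the record layout (a dict with a `discharge_date` field) and the function name `temporal_split` are assumptions, and the actual data cleaning was performed in Stata.

```python
from datetime import date

# Date windows taken from the text above.
DERIVATION_START = date(2015, 1, 1)
DERIVATION_END = date(2017, 12, 31)
VALIDATION_START = date(2018, 1, 1)
VALIDATION_END = date(2018, 12, 31)


def temporal_split(discharges):
    """Split discharges into derivation and validation sets by discharge date.

    Assumption: each discharge is a dict with a "discharge_date" (datetime.date).
    """
    derivation, validation = [], []
    for row in discharges:
        d = row["discharge_date"]
        if DERIVATION_START <= d <= DERIVATION_END:
            derivation.append(row)
        elif VALIDATION_START <= d <= VALIDATION_END:
            validation.append(row)
        # Discharges outside both windows are not used.
    return derivation, validation


# Toy records (hypothetical).
toy_discharges = [
    {"id": 1, "discharge_date": date(2015, 3, 10)},
    {"id": 2, "discharge_date": date(2017, 12, 31)},
    {"id": 3, "discharge_date": date(2018, 1, 1)},
    {"id": 4, "discharge_date": date(2019, 1, 5)},
]
toy_derivation, toy_validation = temporal_split(toy_discharges)
```

Splitting by calendar time rather than at random means that the validation data come from a later period than the data used to fit the models.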

### Outcome

The outcome of interest was 30-day unplanned readmission to the hospital from which the patient was initially discharged.
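One way to derive such an outcome flag from admission records is sketched below. The details are assumptions rather than the study's exact rule: the field names are hypothetical, "within 30 days" is taken as an unplanned admission dated 0 to 30 days after the index discharge, and all records are assumed to come from the same hospital.

```python
from datetime import date


def flag_readmissions(stays, window_days=30):
    """Return one boolean per stay: True if the same patient has a later
    unplanned admission within window_days of this stay's discharge.

    Assumed fields per stay: "patient_id", "admit", "discharge" (dates),
    and "planned" (bool, admission status).
    """
    flags = []
    for stay in stays:
        readmitted = any(
            other is not stay
            and other["patient_id"] == stay["patient_id"]
            and not other["planned"]
            and 0 <= (other["admit"] - stay["discharge"]).days <= window_days
            for other in stays
        )
        flags.append(readmitted)
    return flags


# Toy stays (hypothetical).
toy_stays = [
    {"patient_id": "p1", "admit": date(2016, 1, 1), "discharge": date(2016, 1, 10), "planned": False},
    {"patient_id": "p1", "admit": date(2016, 1, 25), "discharge": date(2016, 1, 30), "planned": False},
    {"patient_id": "p1", "admit": date(2016, 3, 15), "discharge": date(2016, 3, 20), "planned": False},
    {"patient_id": "p2", "admit": date(2016, 2, 1), "discharge": date(2016, 2, 5), "planned": False},
    {"patient_id": "p2", "admit": date(2016, 2, 10), "discharge": date(2016, 2, 12), "planned": True},
]
toy_flags = flag_readmissions(toy_stays)
```

In the toy data, only the first stay of patient p1 is flagged: the next unplanned admission occurs 15 days after discharge, whereas the third admission is 45 days after the second discharge and the readmission of p2 is planned.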

### Predictor variables and datasets

The predictor variables used in this study consisted of basic characteristics, recorded diagnoses, inpatient treatments, and blood-test results that were the last measurement before discharge. The basic characteristics included age (as a continuous variable), sex, admission diagnosis category (according to the ICD-10 codes, from A to T), number of hospitalizations in the past year, and discharge location (home or nursing home). For recorded diagnoses, information on ICD-10 codes recorded anywhere in the DPC code position (i.e., admission, main, most or second-most resource-consuming, comorbidities present on admission, or complications arising after admission) was transformed into the presence or absence of each 3-digit ICD-10 code from A00 to Z00. In total, 2102 types of 3-digit ICD-10 codes in the WHO ICD-10 Version 2010 were used in this study [19]. Similarly, for inpatient treatments, information on the surgery, procedure, and prescribed drugs recorded in the DPC system was transformed into the presence or absence of 914 surgery codes from K000 to K939 [20], 122 procedure codes from J000 to J129 [21], and 468 ATC codes from A01A0 to V07A0 [22].

Among the 2102 ICD-10 codes, 914 surgery codes, 122 procedure codes, and 468 ATC codes, we excluded variables that <10 patients had, in both the derivation and validation datasets, to avoid the “perfect separation” (or “complete separation”) problem [15]. Then, using these candidate binary predictor variables, we created six patterns of datasets having different numbers of the variables (that ≥5% or ≥1% of patients or ≥10 patients had, in both the derivation and validation datasets) with and without blood-test results, in addition to the basic characteristics (i.e., age, sex, admission diagnosis category, number of hospitalizations in the past year, and discharge location).
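The prevalence-based selection of binary variables can be illustrated with a short sketch. It assumes that each dataset is a list of dicts mapping variable names to 0/1 and that a variable is kept only if it meets the threshold in both the derivation and validation data; the function names and toy codes are hypothetical.

```python
def count_present(rows, name):
    """Number of rows in which the binary variable equals 1."""
    return sum(row[name] for row in rows)


def keep_variables(derivation, validation, names, min_fraction=None, min_count=None):
    """Return the variables that meet the threshold in BOTH datasets.

    min_fraction: minimum share of patients with the code (e.g. 0.05 or 0.01).
    min_count: minimum absolute number of patients with the code (e.g. 10).
    """
    kept = []
    for name in names:
        ok = True
        for rows in (derivation, validation):
            n_present = count_present(rows, name)
            if min_fraction is not None and n_present < min_fraction * len(rows):
                ok = False
            if min_count is not None and n_present < min_count:
                ok = False
        if ok:
            kept.append(name)
    return kept


# Toy data (hypothetical): 20 derivation rows and 10 validation rows.
toy_names = ["I10", "K35", "N18"]
toy_deriv = [
    {"I10": int(i < 10), "K35": int(i == 0), "N18": int(i < 3)} for i in range(20)
]
toy_valid = [
    {"I10": int(i < 5), "K35": 0, "N18": int(i < 1)} for i in range(10)
]
kept_5pct = keep_variables(toy_deriv, toy_valid, toy_names, min_fraction=0.05)
kept_min2 = keep_variables(toy_deriv, toy_valid, toy_names, min_count=2)
```

Loosening the threshold admits rarer codes, which is how the number of candidate variables grows across the six patterns.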

The continuous variables, including age and blood-test results, were standardized to a mean of 0 and a standard deviation of 1. The categorical variables, including the number of hospitalizations in the past year and admission diagnosis category (16 values from A to T), were transformed into a combination of binary variables using dummy values.
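These two transformations can be sketched as follows. Several details are assumptions, since the text does not specify them: the mean and standard deviation are estimated on the derivation data and reused for the validation data, the sample standard deviation is used, and dummy coding drops the first level as the reference. The function names are hypothetical.

```python
from statistics import mean, stdev


def standardize(values, center, scale):
    """Rescale values to (value - center) / scale."""
    return [(v - center) / scale for v in values]


def dummy_code(values, levels, prefix):
    """Dummy-code a categorical variable; levels[0] is the reference level
    (assumption) and therefore gets no indicator column."""
    return [
        {f"{prefix}_{level}": int(v == level) for level in levels[1:]}
        for v in values
    ]


# Toy ages (hypothetical): parameters come from the derivation data only.
deriv_age = [40, 60, 80]
valid_age = [70]
age_mean, age_sd = mean(deriv_age), stdev(deriv_age)
deriv_age_z = standardize(deriv_age, age_mean, age_sd)
valid_age_z = standardize(valid_age, age_mean, age_sd)

# Toy admission diagnosis categories (ICD-10 chapter letters).
toy_categories = dummy_code(["A", "C", "I"], levels=["A", "C", "I"], prefix="adm")
```

Reusing the derivation-data parameters for the validation data keeps information from the validation period out of model development.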

### Machine-learning and LR models

Details regarding the supervised machine-learning models are presented elsewhere [15, 25]. In brief, RF comprises ensembles of decision trees constructed from bootstrapped training samples, for which random samples corresponding to specific numbers of predictors are selected to initiate tree induction. The GBDT is another ensemble method, in which new tree models for predicting the errors and residuals of previous models are constructed in sequence. These new models are combined, and a gradient descent algorithm is used to minimize the loss function. A DNN comprises multiple processing layers and models outcomes via intermediate hidden units, each comprising a linear combination of predictors that is transformed by a nonlinear function.
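As a compact restatement of this description, the GBDT update and a single DNN hidden unit can be written as below. The notation is ours rather than taken from the cited references: $F_m$ is the boosted ensemble after $m$ trees (on the log-odds scale), $h_m$ is the $m$-th tree, $\nu$ is a learning rate, $L$ is the loss function, and $g$ is a nonlinear activation function.

$$
F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad h_m \text{ fitted to } r_{im} = -\left.\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right|_{F = F_{m-1}}
$$

$$
z_k = g\Big(b_k + \sum_{j} w_{kj}\, x_j\Big)
$$

For a binary outcome with the log-loss, the pseudo-residual reduces to $r_{im} = y_i - \sigma\big(F_{m-1}(x_i)\big)$, where $\sigma$ is the logistic function, i.e., the difference between the observed outcome and the current predicted probability.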

There are several feature-selection techniques for LR, such as stepwise variable selection and the LASSO [3]. In this study, we used the LASSO because it is generally less likely to cause overfitting than other techniques [14]. Additionally, in a previous study, LR with the LASSO outperformed stepwise variable selection in predicting 30-day all-cause non-elective readmission [11]. We did not include any interaction terms between the features in our LR-LASSO.
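For reference, a standard way to write the LR-LASSO estimator is given below. The notation is an assumption for illustration ($n$ discharges, $p$ predictors, $\pi_i$ the predicted probability for discharge $i$, $\lambda \ge 0$ the penalty strength), and software implementations may scale the terms differently.

$$
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \Big] + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad \pi_i = \frac{1}{1 + \exp\!\big(-(\beta_0 + x_i^{\top} \beta)\big)}
$$

The $\ell_1$ penalty shrinks coefficients toward zero and sets some of them exactly to zero, which is why the LASSO acts as a feature-selection technique; larger values of $\lambda$ retain fewer predictors.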

### Statistical analysis

First, we described the distribution or proportions of all the predictor variables according to outcome status in the derivation and validation data. Next, for each pattern of datasets from 1 to 6, we used derivation data to establish the machine-learning and LR models. The hyperparameters for each model were determined and optimized via 10-fold cross-validation within the derivation data (i.e., training data) [26], using automated machine learning in the *h2o* package of R. Then, using the validation data (i.e., test data), we evaluated the performance of each model with the c-statistic or AUROC, as well as calibration plots. We conducted DeLong’s tests to compare the c-statistics of machine-learning models and LR-LASSO as a reference.
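The c-statistic and DeLong's test for two models evaluated on the same patients can be sketched in a few lines. This is an illustrative pure-Python sketch with hypothetical function names and toy data, not the code used in the study. It uses the direct pairwise formulation, whose cost grows with the number of positive-negative pairs, so it is impractical for a dataset of this size without a rank-based variant.

```python
import math


def _psi(pos_score, neg_score):
    """1 if the positive case is ranked higher, 0.5 for a tie, else 0."""
    if pos_score > neg_score:
        return 1.0
    if pos_score == neg_score:
        return 0.5
    return 0.0


def _placements(scores, labels):
    """Per-case components of the AUC for positives (v10) and negatives (v01)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    v10 = [sum(_psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, q) for p in pos) / len(pos) for q in neg]
    return v10, v01


def _cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def c_statistic(scores, labels):
    """Probability that a random positive case scores above a random negative."""
    v10, _ = _placements(scores, labels)
    return sum(v10) / len(v10)


def delong_test(scores_a, scores_b, labels):
    """Two-sided DeLong test for two correlated AUCs on the same cases."""
    v10_a, v01_a = _placements(scores_a, labels)
    v10_b, v01_b = _placements(scores_b, labels)
    auc_a = sum(v10_a) / len(v10_a)
    auc_b = sum(v10_b) / len(v10_b)
    var = (
        (_cov(v10_a, v10_a) + _cov(v10_b, v10_b) - 2 * _cov(v10_a, v10_b)) / len(v10_a)
        + (_cov(v01_a, v01_a) + _cov(v01_b, v01_b) - 2 * _cov(v01_a, v01_b)) / len(v01_a)
    )
    if var <= 0:
        raise ValueError("variance of the AUC difference is zero")
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value


# Toy data (hypothetical): 3 readmitted and 4 non-readmitted discharges.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores_a = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_scores_b = [0.6, 0.5, 0.2, 0.7, 0.4, 0.3, 0.1]
toy_auc_a, toy_auc_b, toy_z, toy_p = delong_test(toy_scores_a, toy_scores_b, toy_labels)
```

Because both models are scored on the same validation discharges, the test accounts for the correlation between the two c-statistics through the covariance terms instead of treating them as independent.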

By focusing on GBDT, which showed the best performance among the studied models, we identified the 10 variables with the highest importance to examine the contribution of each predictor to the model with the best discriminatory abilities [27].

Data cleaning was conducted using Stata version 16 (StataCorp LLC, Texas, USA), and statistical analysis was performed using R version 4.1.2 (R Foundation for Statistical Computing, Vienna, Austria) with the *h2o* and *rms* packages.

We followed the checklist of the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) Statement [28].

## Supporting information

Supplementary Table S1 [supplements/289569_file08.docx]

Supplementary Table S2 [supplements/289569_file09.docx]

Supplementary Table S3 [supplements/289569_file10.docx]

Supplementary Table S4 [supplements/289569_file11.docx]

Supplementary Table S5 [supplements/289569_file12.docx]

Supplementary Table S6 [supplements/289569_file13.docx]

Supplementary Table S7 [supplements/289569_file14.docx]

Supplementary Table S8 [supplements/289569_file15.docx]

TRIPOD checklist [supplements/289569_file16.docx]

## Data Availability

We obtained data from Medical Data Vision Co., Ltd. (MDV) and are not allowed to share these data with other parties. Researchers who meet the criteria for access can acquire de-identified participant data from MDV (https://en.mdv.co.jp).

## Author contributions

M.I. planned the study, obtained funding and data, and wrote the main manuscript text. M.I. and R.I. analyzed the data and prepared all the tables and figures. N.M., Y.T., J.K., and K.U. assisted with data cleaning. E.K., T.Y., A.G., T.K., Y.H., T.G., T.S., Y.S., and T.A. contributed substantially to the analysis using machine-learning models and interpretation of the results. N.T. supervised the study as the laboratory head. All authors reviewed the manuscript and provided substantial feedback.

## Declaration of interests

We have no conflicts of interest to declare.

## List of supplementary materials

Supplementary Table S1. Details and frequency of diagnosis codes (International Classification of Diseases, 10th revision)

Supplementary Table S2. Details and frequency of surgery codes (Japanese original codes)

Supplementary Table S3. Details and frequency of procedure codes (Japanese original codes)

Supplementary Table S4. Details and frequency of drug codes (Anatomical Therapeutic Chemical codes)

Supplementary Table S5. Details and distribution of blood-test results

Supplementary Table S6. Hyperparameters in each model for the dataset with the largest number of variables including blood-test results (1543 variables)

Supplementary Table S7. Discrimination ability (c-statistic) of each model for validation

Supplementary Table S8. Regression coefficients in logistic regression with the least absolute shrinkage and selection operator for the dataset with the largest number of variables including blood-test results (1543 variables)

## Acknowledgements

We express our gratitude to the personnel of Medical Data Vision Co., Ltd., particularly Masaki Nakamura, Shogo Atsuzawa, and Masayoshi Suzuki, for their contributions to the preparation of the data. We thank Rina Yamauchi at the School of Public Health, University of Tokyo Graduate School of Medicine for her assistance in creating Supplementary Tables S1 to S4. We also thank Editage ([www.editage.com](http://www.editage.com)) for the English language editing.

## Footnotes

*   **Full names and contact information of authors**: Masao Iwagami, iwagami-tky{at}umin.ac.jp, Ryota Inokuchi, inokuchi.ryota.ge{at}u.tsukuba.ac.jp, Eiryo Kawakami, eiryo.kawakami{at}chiba-u.jp, Tomohide Yamada, bqx07367{at}yahoo.co.jp, Atsushi Goto, agoto{at}yokohama-cu.ac.jp, Toshiki Kuno, kuno-toshiki{at}hotmail.co.jp, Yohei Hashimoto, yohashimoto1223{at}gmail.com, Nobuaki Michihata, michihata{at}m.u-tokyo.ac.jp, Tadahiro Goto, tag695{at}mail.harvard.edu, Tomohiro Shinozaki, shinozaki{at}rs.tus.ac.jp, Yu Sun, sunyu{at}md.tsukuba.ac.jp, Yuta Taniguchi, taniguchi.yuta.ma{at}alumni.tsukuba.ac.jp, Jun Komiyama, jun.komi33{at}gmail.com, Kazuaki Uda, uda.kazuaki.gn{at}u.tsukuba.ac.jp, Toshikazu Abe, abetoshikazu{at}me.com, Nanako Tamiya, ntamiya{at}md.tsukuba.ac.jp

*   **Funding:** This study was supported by a Japan Society for the Promotion of Science (JSPS) KAKENHI Grant (No. 19K19430) from the Japanese Ministry of Education, Culture, Sports, Science, and Technology. The funder had no role in study design, data collection, data analysis, data interpretation, or writing.

*   Received May 6, 2023.
*   Revision received May 6, 2023.
*   Accepted May 11, 2023.


*   © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NonCommercial-NoDerivs 4.0 International), CC BY-NC-ND 4.0, as described at [http://creativecommons.org/licenses/by-nc-nd/4.0/](http://creativecommons.org/licenses/by-nc-nd/4.0/)

## References

1.  Jencks, S. F., Williams, M. V. & Coleman, E. A. Rehospitalizations among patients in the Medicare fee-for-service program. N. Engl. J. Med. 360, 1418–1428 (2009).

2.  Leppin, A. L. et al. Preventing 30-day hospital readmissions: a systematic review and meta-analysis of randomized trials. JAMA Intern. Med. 174, 1095–1107 (2014).

3.  Iwagami, M. & Matsui, H. Introduction to Clinical Prediction Models. Ann. Clin. Epidemiol. 4, 72–80 (2022).

4.  Kansagara, D. et al. Risk prediction models for hospital readmission: a systematic review. JAMA 306, 1688–1698 (2011).

5.  Zhou, H., Della, P. R., Roberts, P., Goh, L. & Dhaliwal, S. S. Utility of models to predict 28-day or 30-day unplanned hospital readmissions: an updated systematic review. BMJ Open 6, e011060 (2016).

6.  Artetxe, A., Beristain, A. & Graña, M. Predictive models for hospital readmission risk: A systematic review of methods. Comput. Methods Programs Biomed. 164, 49–64 (2018).

7.  Huang, Y., Talwar, A., Chatterjee, S. & Aparasu, R. R. Application of machine learning in predicting hospital readmissions: a scoping review of the literature. BMC Med. Res. Methodol. 21, 96 (2021).

8.  Mahmoudi, E. et al. Use of electronic medical records in development and validation of risk prediction models of hospital readmission: systematic review. BMJ 369, m958 (2020).

9.  Jamei, M., Nisnevich, A., Wetchler, E., Sudat, S. & Liu, E. Predicting all-cause risk of 30-day hospital readmission using artificial neural networks. PLoS One 12, e0181173 (2017).

10. Wang, H. et al. Predicting Hospital Readmission via Cost-Sensitive Deep Learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 15, 1968–1978 (2018).

11. Tong, L., Erdmann, C., Daldalian, M., Li, J. & Esposito, T. Comparison of predictive modeling approaches for 30-day all-cause non-elective readmission risk. BMC Med. Res. Methodol. 16, 26 (2016).

12. Reddy, B. K. & Delen, D. Predicting hospital readmission for lupus patients: An RNN-LSTM-based deep-learning methodology. Comput. Biol. Med. 101, 199–209 (2018).

13. Golas, S. B. et al. A machine learning model to predict the risk of 30-day readmissions in patients with heart failure: a retrospective analysis of electronic medical records data. BMC Med. Inform. Decis. Mak. 18, 44 (2018).

14. Pavlou, M. et al. How to develop a more accurate risk prediction model when there are few events. BMJ 351, h3868 (2015).

15. Luo, W. et al. Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View. J. Med. Internet Res. 18, e323 (2016).

16. Christodoulou, E. et al. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 110, 12–22 (2019).

17. Hayashida, K., Murakami, G., Matsuda, S. & Fushimi, K. History and Profile of Diagnosis Procedure Combination (DPC): Development of a Real Data Collection System for Acute Inpatient Care in Japan. J. Epidemiol. 31, 1–11 (2021).

18. Taniguchi, Y. et al. Comparison of patient characteristics and in-hospital mortality between patients with COVID-19 in 2020 and those with influenza in 2017-2020: a multicenter, retrospective cohort study in Japan. Lancet Reg. Health West. Pac. 20, 100365 (2022).

19. World Health Organization. ICD-10 Version:2010. [https://icd.who.int/browse10/2010/en#/](https://icd.who.int/browse10/2010/en#/) (Accessed March 31, 2023).

20. ©mplat, Inc. Shirobon Net. Chapter 2 Special Medical Fee Part 10 Surgery (Japanese only). [https://shirobon.net/medicalfee/latest/ika/r04\_ika/r04i\_ch2/r04i2\_pa10/](https://shirobon.net/medicalfee/latest/ika/r04_ika/r04i_ch2/r04i2_pa10/) (Accessed March 31, 2023).

21. ©mplat, Inc. Shirobon Net. Chapter 2 Special medical fees Part 9 Procedure (Japanese only). [https://shirobon.net/medicalfee/latest/ika/r04\_ika/r04i\_ch2/r04i2\_pa9/](https://shirobon.net/medicalfee/latest/ika/r04_ika/r04i_ch2/r04i2_pa9/) (Accessed March 31, 2023).

22. European Pharmaceutical Market Research Association (EPHMRA). Anatomical Classification. [https://www.ephmra.org/anatomical-classification](https://www.ephmra.org/anatomical-classification) (Accessed March 31, 2023).

23. Ministry of Education, Culture, Sports, Science and Technology, Ministry of Health, Labour and Welfare. Ethical guidelines for medical and health research involving human subjects. [https://www.lifescience.mext.go.jp/files/pdf/n2181\_01.pdf](https://www.lifescience.mext.go.jp/files/pdf/n2181_01.pdf) (Accessed March 31, 2023).

24. Nijman, S. et al. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J. Clin. Epidemiol. 142, 218–229 (2022).

25. Ono, T. & Goto, T. Introduction to supervised machine learning in clinical epidemiology. Ann. Clin. Epidemiol. 4, 63–71 (2022).

26. Steyerberg, E. W. Validation in prediction research: the waste by data splitting. J. Clin. Epidemiol. 103, 131–133 (2018).

27. Grömping, U. Variable Importance Assessment in Regression: Linear Regression Versus Random Forest. Am. Stat. 63, 308–319 (2009).

28. Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. Transparent reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): the TRIPOD statement. J. Clin. Epidemiol. 68, 134–143 (2015).