Abstract
Objective Patient information can be retrieved more efficiently in electronic medical record (EMR) systems by using machine learning models that predict which information a physician will seek in a clinical context. However, information-seeking behavior varies across EMR users. To explicitly account for this variability, we derived hierarchical models and compared their performance to non-hierarchical models in identifying relevant patient information in intensive care unit (ICU) cases.
Materials and Methods Critical care physicians reviewed ICU patient cases and selected data items relevant for presenting at morning rounds. Using patient EMR data as predictors, we derived hierarchical logistic regression (HLR) and standard logistic regression (LR) models to predict their relevance.
Results In 73 pairs of HLR and LR models, the HLR models achieved an area under the ROC curve of 0.81, 95% CI [0.80, 0.82], which was statistically significantly higher than that of LR models (0.75, 95% CI [0.74-0.76]). Further, the HLR models achieved statistically significantly lower expected calibration error (0.07, 95% CI [0.06-0.08]) than LR models (0.16, 95% CI [0.14-0.17]).
Discussion The physician reviewers demonstrated variability in selecting relevant data. Our results show that HLR models perform significantly better than LR models with respect to both discrimination and calibration. This is likely due to explicitly modeling physician-related variability.
Conclusion Hierarchical models can yield better performance when there is physician-related variability as in the case of identifying relevant information in the EMR.
1 Introduction
A key source of frustration with electronic medical record (EMR) systems stems from the inability to retrieve relevant patient information efficiently [1, 2, 3, 4]. Current EMR systems do not possess sophisticated search capability nor do they prioritize patient information relative to the clinical task at hand [5, 6]. The inability to identify relevant patient information can lead to poor care and medical errors [7, 8, 9]. Further, in complex clinical environments, such as the intensive care unit (ICU), large quantities of data per patient accumulate rapidly [10], which can exacerbate information retrieval challenges. EMR systems that prioritize the display of relevant patient information are therefore needed to minimize the time and effort that physicians spend in identifying relevant information.
Various solutions have been proposed for effective prioritization and display of patient information in EMR systems [11, 12, 13, 14], most of which are based on rules that have been developed to customize and organize the display of patient information. In contrast to rule-based approaches, we developed and evaluated a data-driven approach called the learning EMR (LEMR) system in a prior study [15, 16]. The LEMR system tracks physician information-seeking behavior and uses it to learn machine learning models that predict which information is relevant in a given clinical context. Those predictions are used to highlight the relevant data in the EMR system to draw a physician’s attention.
However, information-seeking behavior has been shown to vary across individual physicians as well as across EMR system user types such as physicians, nurses, and pharmacists [1, 5]. In this study, we use hierarchical models to explicitly model this variability because such models have been shown to be useful when the data are collected from subjects with different behaviors [17]. In particular, we compare the performance of hierarchical logistic regression models and standard logistic regression models in predicting relevant patient information in a LEMR system.
The remainder of this paper is organized as follows. In the Background section, we review the LEMR system, briefly describe hierarchical models, and describe prior work on physician-related variability. In the Methods section, we describe the data collection and preparation, the experimental details, and the evaluation measures. We present the results of the experiments in the Results section, and close with Discussion and Conclusion sections.
2 Background
In this section, we provide brief descriptions of the LEMR system, hierarchical models, and past studies that have examined physician-related variability.
2.1 The LEMR system
The LEMR system uses a data-driven approach to prioritize patient information that is relevant in the context of a clinical task [15, 16]. The system uses machine learning to automatically identify and highlight relevant patient information for a specified task, for example, the task of summarizing a patient’s clinical status at morning rounds in the ICU. In ICU morning rounds, the clinical team reviews pertinent information and the status of each patient; for each patient, one team member reviews information in the EMR system and orally presents a summary of the patient’s clinical status to the team. Reviewing and identifying relevant patient information, called pre-rounding, is time-consuming and laborious. The goal of the LEMR system is to use machine learning to automatically identify and highlight the relevant information required for a given clinical task such as pre-rounding. The predictive models of the LEMR system are derived using the information-seeking behavior of physicians when they search for relevant information in the EMR in the context of the clinical task. In particular, eleven critical care physicians reviewed the EMRs of ICU patients and marked the information that was relevant to pre-rounding, and predictive models were developed from this data.
2.2 Hierarchical models
Hierarchical models, also known as multilevel models, are useful in modeling hierarchically structured data because they can capture variability at different levels of the hierarchy [17]. For example, consider predicting the mortality rate in a hospital with several units, such as critical care, general medical care, and emergency care. The data has a two-level hierarchical structure with the hospital at the first level and the units at the second level of the hierarchy. The overall mortality rate at the hospital level is obtained by combining the unit-level mortality rates in some fashion. A hierarchical model explicitly estimates the variability of the mortality rates across the units and uses those estimates to derive the hospital level mortality rate, which can result in a better estimate of the overall mortality rate compared to using non-hierarchical models.
In a similar fashion, the information-seeking data used to develop the LEMR models has a two-level hierarchical structure, where the top level corresponds to data that denote the entire population of physician reviewers and the bottom level corresponds to data that denote individual physicians. For specific patient information such as serum creatinine, its relevance is expected to differ across physician reviewers. A hierarchical model of the LEMR data explicitly captures this variability that is likely to be useful in deriving more accurate predictive models.
2.3 Physician-related variability
Physician-related variability in healthcare outcomes has been of interest for decades, going back to the 1970s with studies reporting the effects of geographic location on clinical outcomes such as mortality and length of stay [18]. In particular, variation in individual physician characteristics and practice styles has been recognized as a source of variability in clinical outcomes after adjusting for the health status of patients and the quality of healthcare services [5, 19, 20, 21, 22, 23, 24, 25]. For example, variability in cesarean section rates has been attributed to physician practice style after controlling for patient characteristics and risk factors, status of the medical facility, and physician years of experience [25]. A study concluded that variability across individual physicians may impact the quality of preference-sensitive critical care delivery [20]. A recent study analyzed physician search patterns in the EMR and uncovered considerable variation in information-seeking behavior [5]. In general, hierarchical modeling has been applied in various clinical settings to account for physician-related variability where the data has a hierarchical structure and can be grouped by a variety of factors such as country, state, or hospital site [26, 27, 28, 29, 30, 31, 32].
3 Method
In this section, we first describe the dataset and the data preparation steps. Then we describe the experimental methods including the development and evaluation of predictive models.
3.1 Dataset
One-hundred seventy-eight ICU patient cases with a diagnosis of either acute kidney failure (AKF; ICD-9 584.9 or 584.5; 93 cases) or acute respiratory failure (ARF; ICD-9 518.81; 85 cases) were selected randomly from patients who were admitted between June 2010 and May 2012 to an ICU at the University of Pittsburgh Medical Center. Eleven critical care medicine physicians reviewed the patient cases in the LEMR system and for each patient indicated which patient information was relevant to the task of pre-rounding in the ICU.
The dataset consists of two sets of variables including the predictor variables (or predictors) and target variables (or targets) that we now describe in detail. Predictor variables include demographics, admitting diagnosis, vital signs, ventilator settings, input and output measurements, laboratory test results, and medication administration data. A few variables such as demographics and admitting diagnosis are static, that is, their values do not change during the ICU stay, while the remaining variables, which constitute the majority of the predictors, are temporal and have multiple values during the ICU stay. For example, age (in years) is a static predictor variable while blood urea nitrogen (BUN) is a temporal predictor variable as it is usually measured multiple times during an ICU stay.
Target variables include any data in the EMR, such as vital signs, ventilator settings, input and output measurements, laboratory test results, and medication administration data that a physician may annotate as relevant for the task of pre-rounding. A target variable can take either relevant or not relevant values. As an example, for a patient with AKF, BUN = relevant denotes that BUN was measured for the patient and was sought, found, and annotated by a physician as relevant. If BUN was measured for the patient but was not sought by a physician, then the target is denoted as BUN = not relevant. A target variable may be missing too; for example, when BUN is not measured for the npatient, it would not be available for a physician to seek and find. We developed a predictive model for each target variable such as BUN that predicts whether it is relevant in a particular patient. To develop a BUN model, we used all predictor variables described in the previous paragraph and used only data in which the BUN target was not missing.
The difference between predictor and target variables is in the values they take; i.e., a target variable takes values of either relevant or not relevant, whereas a predictor variable’s values are the measured values that are recorded in the EMR. For example, when BUN is a predictor, it takes numeric values in milligrams per deciliter (mg/dL) unit1, whereas as a target variable, it takes a value of either relevant or not relevant. Consequently, a model for predicting whether or not BUN is relevant may contain numeric values for BUN as a predictor variable.
3.2 Data preparation
We transformed the dataset into a representation that is amenable to the application of machine learning methods. In particular, for each temporal predictor variable we generated between 4 to 36 features (feature expansion in Figure 1).
The number of features for a temporal predictor was based on (1) the data domain of the predictor variable (e.g., medication administration or laboratory result) and (2) the type of the predictor variable (e.g., nominal or continuous). For example, for each medication variable we generated four features including an indicator of whether the drug is currently prescribed, the time elapsed between first administration and the current time, the time elapsed between the most recent administration and the current time, and the dose at the most recent administration. For each laboratory test result, vital sign, and ventilator setting, we generated up to 36 features including an indicator of whether the event or measurement ever occurred, the value of the most recent measurement, the highest value, the lowest value, the slope between the two most recent values, and 30 other features. More details on the feature expansion are given in [33].
The dataset consisted of 178 patient cases and 1,864 raw predictor variables. Feature expansion resulted in a total of 30,770 features. Since the dimensionality of the data was high, we reduced the number of features (feature reduction in Figure 1) by removing those features where the values were missing in every patient case, had the same value for every case (i.e., had zero variance), or the values were duplicates of another variable. Feature reduction resulted in a total of 6,935 features.
We selected as target variables 73 EMR data items that had been annotated as relevant (positive) in 9 or more patient cases. Table A1 in Appendix C contains the list of target variables along with the number of cases in which each target variable was relevant, as well as the number of cases where the target variable was available for selection (i.e., the value was not missing).
3.3 Experimental methods
3.3.1 Predictive models
An HLR model is a generalization of a standard logistic regression model in which the data is clustered into groups and the model intercept and coefficients can vary by group [17]. Figure 2 shows the structure of a 2-level HLR model in which the LEMR data is clustered into groups of patient cases reviewed by each physician. Parameters at the lower level represent the physician-level models for the 11 physician reviewers, and parameters at the upper level represent the model for the entire population of physician reviewers (i.e., population-level model). For a more detailed description of HLR models see Appendix A.
We developed HLR predictive models for each of the selected 73 targets. Each predictive model of a target variable is formulated as a binary classification problem where the model learns to identify cases in which the target variable is relevant. To investigate the utility of HLR over non-hierarchical models, we used LR as baseline models in which the physician identifier was included as an indicator variable. We implemented the HLR models using the brms package [34] in R, which uses No-U-Turn Sampler (NUTS) (as an extension of the Hamiltonian Monte Carlo algorithm) to estimate the posterior distribution of model parameters. In our experiments, we set the NUTS sampler to use 4 Markov chains; each chain included 400 iterations of sampling where the first 200 were used to calibrate the sampler. A total of 4 × 200 = 800 posterior samples for each HLR model parameter were obtained. LR models were implemented using the glmnet package in R [35].
3.3.2 Cross validation
Each model was trained and evaluated independently in a stratified 10-fold cross validation setting. At each iteration of the cross validation, the patient cases were randomly split into a training set (9 folds) and a test set (1 fold), while preserving the original distribution of the target variable. Hyperparameter tuning and data preprocessing such as imputing missing values and feature selection were performed during cross validation. More details are described in Appendix B.
3.3.3 Performance measures
We measured the predictive performance of each model with the area under the receiver operating characteristic (ROC) curve (AUROC), area under the precision-recall curve (AUPRC), and expected calibration error (ECE) [36]. AUROC is a measure of model discrimination and varies from 0.5 and 1, where 0.5 denotes an uninformative model and 1 represents perfect discrimination. AUPRC summarizes the precision-recall curve where precision (or positive predictive value) and recall (or sensitivity) values at different thresholds are plotted as a curve. The AUPRC varies from 0 to 1 is and is commonly used in binary classification problems when the data is imbalanced (i.e., when cases with one label are more prevalent than cases with the other label).
ECE is a measure of model calibration. In a perfectly calibrated model, outcomes with predicted probability correspond to a fraction of positive cases in the data. ECE is derived from the probability calibration curve [37] where the sorted predicted probabilities are partitioned into bins; in each bin i, calibration error is defined as the absolute difference between the mean of predicted probabilities (pi) and the fraction of positive outcomes (oi). ECE is the weighted average of the calibration errors over all bins: where wi denotes the fraction of cases that fall into bin i. Lower ECE denotes a better calibrated model.
4 Results
We report the variability across the physician reviewers and then report the results of the predictive performance of LR and HLR models from three perspectives: overall, per-target, and per-physician. Table 1 summarizes the physician characteristics and the number of patients that each physician reviewed within the two diagnostic groups, AKF and ARF.
4.1 Variability in information-seeking behavior
We define a descriptive statistic called average relevance proportion (ARP) to measure the information-seeking behavior of each physician reviewer. An ARP value for a physician is defined as the average proportion of EMR data items that the physician sought as relevant. We calculated the ARP values over the 73 EMR data items that were used as target variables. Figure 3 shows the physician ARP values separately for each of the diagnostic groups. Each circle denotes the ARP value for the corresponding physician on the x-axis and each error bar represents a 95% confidence interval (CI) for an ARP value. In the ARF diagnosis group, the ARP CIs for physicians 1, 7, and 8 do not overlap with those of the other physicians, which indicates a potential variability in information-seeking behavior between these physicians and the rest. Similar variability is observed in the AKF group, where the ARP CIs of physicians 1, 3, 7, and 8 differ from those of the other physicians.
4.2 Overall performance
The overall performance of each model family (LR and HLR) was calculated by concatenating the predictions for all 73 target variables into a single vector and using that vector to compute the performance metrics. Table 2 reports the AUROC, AUPRC, and ECE for the LR and HLR models across all 73 target variables. For AUROC values, the 95% CI and p-value were calculated using Delong’s method [38, 39]. The 95% CI for AUPRC values was derived using the logit intervals method [40] and the p-value was calculated using the Wald z-test. For ECE values, we set k = 100 in Equation 1 and obtained a vector of 100 calibration errors to compute 95% CIs and a t-test p-value. Figure 4a shows the overall ROC and calibration curves for LR and HLR models. Note that for the calibration curves, we set the number of bins to k = 10 for better visibility.
4.3 Per-target performance
For per-target performance, we computed the predictive performance for each target variable, which resulted in vectors of AUROC, AUPRC, and ECE values each with a length of 73, for each model family (LR and HLR). Distributions of per-target performance measures are shown as histograms in Figure 4b for each model family. Histograms of the two model families are overlaid for better comparison. Additional details are provided in Table A1 in Appendix C where AUROC, AUPRC, and ECE values are reported for each target variable.
4.4 Per-physician performance
For per-physician performance, we computed the predictive performance for each physician, which resulted in 11 AUROC, AUPRC, and ECE values for each model family (LR and HLR). Figure 4c presents the per-physician bar plots of the performance measure values; the bars for HLR and LR models are displayed side by side for better comparison. Per-physician calibration curves are presented in Figure A1 in Appendix C.
5 Discussion
Our results show that HLR models perform better than LR models when predicting which information a physician will seek in a future patient case. Moreover, the ECE results show that HLR models are generally better calibrated than LR models. In general, the more calibrated the probabilities are that are output by a predictive model, the higher the expected utility of the decisions that will be made using that model; in the case of the LEMR system, those decisions involve which information is worthwhile to highlight in the EMR of a given patient.
Although most physician reviewers had similar years of ICU experience, we observed a considerable degree of variability in information-seeking behavior across physicians in terms of ARP values. Because the study patients were selected to have a similar level of complexity, patient cases are unlikely to be the source of this variability. Controlling for physicians’ years of experience in LR models was not as effective in improving predictive performance as estimating individual physician variability using the HLR models. This shows the advantage of HLR models over standard models in the presence of unexplained variability.
The per-physician performance measures in Figure 4c show that HLR models learn physician-specific models that perform better in terms of both discrimination and calibration. Although HLR models fit a separate model for each physician, the inherent regularization in these models prevents overfitting. In particular, as population and physician-specific parameters are estimated at the same time, a pooling effect occurs that prevents a physician-specific model from overfitting when the sample size is small.
Furthermore, HLR models allow for a detailed investigation at the physician level because each physician model has its own set of parameters. Figure 5 demonstrates a few instances of the detailed information that can be obtained from an HLR model. Each panel in Figure 5 represents the distributions of a model parameter in an HLR model for each physician and for all physicians as a whole. Investigating the physician-specific parameters can lead to a better understanding of factors that influence a physician’s information-seeking behavior.
6 Limitations
One limitation of this study was the relatively modest amount of annotated data. Having experts review and annotate data is an expensive and time-consuming task in many domains, especially in medicine. It takes many hours for a physician to review and annotate a small number of patient cases in the EMR, which makes it challenging to collect large amounts of annotated data in the LEMR system. Due to this limitation, the number of positive samples for most target variables was modest. As a result, we derived models for only 73 target variables out of 865 available target variables. Nevertheless, this restriction can be addressed by using scalable data collection methods. For example, a scalable solution based on eye-tracking technology has been proposed to automatically identify information that physicians seek in the EMR [41].
Despite the advantage of HLR models in terms of performance, they have two major drawbacks. First, HLR models by default assign equal weights to physicians with different level of experience and as a result, these models can be biased toward less experienced physicians if they have reviewed the majority of cases. One solution is to move the bias toward the experienced physicians by increasing the proportion of cases reviewed by them; however, estimating the optimal proportion requires further studies with more data. The second drawback of HLR models is the added complexity due to the additional per-level parameters. This complexity creates new challenges in parameter estimation and interpretation. Compared to LR models, training HLR models requires more computing power and there are more hyperparameters to tune, including the choice of prior distributions.
7 Conclusion
Displaying large quantities of patient information in EMR systems with little prioritization can adversely influence the decision-making process of physicians and compromise the safety of patients. A data-driven solution was recently proposed as a learning EMR (LEMR) system that uses machine learning to identify and prioritize relevant data in the EMR for physicians. The current study improves the performance of LR models by using HLR models.
We trained 2-level HLR models that simultaneously learn physician-specific models at one level and a population model at another level. We evaluated the discrimination and calibration performance of HLR models in identifying relevant data items in the EMR. Our results show that HLR models perform significantly better than LR models. Moreover, we demonstrated that HLR models provide details about the physician-specific models that can be used to investigate physicians’ information-seeking behaviors in the EMR system.
8 Funding
The research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R01 LM012095, and a Provost Fellowship in Intelligent Systems at the University of Pittsburgh (awarded to M.T.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Data Availability
The dataset is not publicly available.
10 Conflict of interest statement
None declared.
Appendix
A Hierarchical logistic regression (HLR) models
Hierarchical logistic regression (HLR) models are a generalization of standard logistic regression (LR) models in which distinct logistic regression models are fit at different levels of hierarchically structured data [17]. In the context of the LEMR system, we define an HLR model as follows (a boldface character represents a matrix or vector of parameters or values). Let D = {X, y} be a LEMR data set where X denotes a set of N patient cases in the EMR, each with K predictor variables including demographics, medication administrations, laboratory test results, and vital signs. Let y represent an EMR data item for N patient cases with binary values of relevant and not relevant. The data is reviewed by J physicians, each reviewing nj cases such that . We formulate an HLR model as follows: where β0 and β are population-level intercept and coefficients, and φ0 and φ denote physician-level intercept and coefficients. Z corresponds to the physician-level design matrix, which is a sparse expansion of X with N rows and J K columns, splitting X into J segments. Below is a simple example of X and Z matrices with N = 5 patients reviewed by J = 2 physicians (colored in blue and red) and K = 3 predictor variables:
In Equation 3, βk and φk denote the coefficients of the kth predictor variable (k = 0 denotes the intercept) for the population and physician j’s models, and are assumed a priori to be centered at 0 with standard deviations σβ and σφ, respectively. We define the prior for σφ as a Cauchy distribution with scale parameter τφ, and restrict it to positive values (hence the half Cauchy). σβ and τφ are the model hyperparameters and are tuned in the model training phase. Markov chain Monte Carlo (MCMC) methods are used to estimate the parameters’ posterior distributions. MCMC methods draw samples sequentially from the posterior distribution and improve the draws at each step to better approximate the distribution.
B Cross validation details
Each model was trained and evaluated independently in a stratified 10-fold cross validation setting. The following preprocessing steps were applied to the training and test sets at each iteration:
Imputation. Missing values in predictor variables were imputed by the median or mode for continuous and discrete variables, respectively. The imputed value was derived from the training set and applied to both the training and test sets.
Feature selection. We used supervised univariate feature selection to reduce the number of features before deriving models. In particular, we used analysis of variance (ANOVA) and Fisher’s exact tests (significance level = 0.01) for continuous and binary predictor variables, respectively. We limited the predictors to a maximum of 100 variables. The feature selection was performed on the training set and then applied to both the training and test sets.
Feature standardization. Continuous predictor variables were rescaled to be centered at zero and have a unit standard deviation. We calculated the mean and standard deviation statistics from the training set and used them to standardize both training and test sets.
Each model was trained on the training set and evaluated on the test set. Model hyperparameters were tuned in an inner stratified 3-fold cross validation of the training set. In particular, for HLR models, we selected the best values of the hyperparameters (i.e., σβ and τφ in Equations 3 and 4 above) from the set of values { 0.01, 0.1, 1, 5, 10}. For LR models, LASSO regularization was used and the optimal regularization parameter (λ) was chosen from an automatically generated sequence of 100 values as described in [35].
9 Acknowledgments
This research was supported in part by the University of Pittsburgh Center for Research Computing (CRC) through the resources provided. We specifically acknowledge the assistance of the research faculty consultants at CRC. The study was approved by the University of Pittsburgh IRB under protocol PRO14020588.
Footnotes
↵1 Note that BUN may have been measured multiple times for a patient case and therefore, take several numeric values. We summarize these values as a fixed-length vector as described in the Data preparation section.