Confidence-based laboratory test reduction recommendation algorithm =================================================================== * Tongtong Huang * Linda T. Li * Elmer V. Bernstam * Xiaoqian Jiang ## Abstract Unnecessary laboratory tests present health risks and increase healthcare costs. We propose a new deep learning model to identify unnecessary hemoglobin (Hgb) tests for patients admitted to the hospital. Machine learning models might generate less reliable results due to noisy inputs containing low-quality information. We estimate prediction confidence to measure reliability of predicted results. Using a “select and predict” design philosophy, we aim to maximize prediction performance by selectively considering samples with high prediction confidence for recommendations. We use a conservative definition of unnecessary laboratory tests, which we define as stable and below the lower normal bound (LBNR). Our model accommodates irregularly sampled observational data to make full use of variable correlations (i.e., with other laboratory test values) and temporal dependencies (i.e., previous observations) in order to select candidates for training and prediction. Using data collected from a teaching hospital in Houston, our model achieves Hgb prediction performance with a normality AUC at 95.89% and a Hgb stability AUC at 95.94%, while recommending a reduction of 9.91% of Hgb tests that were deemed unnecessary. ## 1. Introduction Laboratory over-utilization is common, especially in the United States[1,2]. Unnecessary blood draws waste resources and may harm patients by contributing to iatrogenic anemia[1]. Ideally, we want to minimize laboratory utilization while obtaining necessary information[1,3]. Recently, researchers have used machine learning to identify unnecessary laboratory tests. Most approaches leverage time series prediction methods to take advantage of previous patient information, such as autoregressive models[4], mixed-effect models[5], and traditional recurrent neural networks (RNN)[6,7]. Previous studies that identify unnecessary laboratory tests fall into two categories: *information gain*, and observability learning. *Information gain* methods measure whether certain laboratory tests yield informative values. These approaches require an exact definition of unnecessary laboratory tests. Roy et al.[4] identify laboratory tests in the normal range as low yield laboratory tests, but they ignore events where the laboratory value changes from the normal to abnormal range; however, transitions such as these are likely to be clinically relevant events. Cismondi et al.[8] dichotomize laboratory tests into “information gain” or “no information gain” categories based on both normality and dropping levels. Aikens et al.[9] consider laboratory stability using both absolute value and standard deviation changes. Their models recommend eliminating laboratory tests that have little information gain. However, these algorithms lack confidence estimates to measure the reliability of predictions. Metrics that summarize the confidence of a prediction are important adjuncts for clinicians using the algorithm. *Observability learning* approaches estimate the need for a lab to be checked in actual practice. A missing observation means that the laboratory test was not checked by the physician. Yu et al.[10,11] used a two-layer long-short term memory (LSTM) network with multiple fully-connected layers to estimate the observability of the next laboratory test. The min-max loss function increases prediction accuracy as the likelihood of observability (i.e., the necessity of conducting the next laboratory test) increases. Their first approach[10] has a narrow definition of the ground truth, which aims to predict the change rate for laboratory values. The second approach[11] extends the previous model by using multitask learning mechanisms to include predictions for abnormality (i.e., laboratory values beyond the normal range) and transition (i.e., laboratory values change from the normal to the abnormal range or vice versa). However, such recommendation algorithms are highly dependent on actual physician practices, which might be not optimal for laboratory test reduction algorithms. Relying on past practices might not be applicable to different or complicated clinical situations. Observability learning models, which work to predict the need for a test, are likely to recommend eliminating laboratory tests with a low prediction confidence. We believe that a better strategy may be to focus on eliminating laboratory tests that can be confidently predicted. Our work is designed to reduce unnecessary hemoglobin (Hgb) tests based on the following assumptions. Sequential Hgb levels within the normal range imply that the patient is stable[12,13]. We further define that the Hgb test is “stable” when its immediate value does not change from normal to abnormal. We explore approaches that enable our algorithm to make safe and confident predictions with features embedded in the algorithm. First, we introduce an *outcome-level* safety assurance, which uses a conservative (‘safe’) definition of unnecessary Hgb laboratory tests. We define ‘safe’ Hgb tests as ones that are predicted to be stable and remain normal. If the predicted normality is consistent with predicted stability, we will recommend eliminating the next Hgb test. Second, we will estimate the confidence of predicting Hgb test values as a *sample-level* safety assurance. We estimated the confidence for predicting Hgb at a specific time based on all available time-series data by introducing a selection predictor in the neural network architecture[14]. In training, this means the model will ignore some samples, whose inclusion would decrease the performance of the model. In testing, this means the selection predictor will not make recommendations if the selection confidence is below a certain threshold. Our main contribution is to integrate the confidence-based selection process for candidate samples into the training and testing phase, rather than estimating prediction confidence in a post-hoc manner. ## 2. Results ### 2.1 Dataset Description Our data included 75,335 distinct inpatient encounters from a large urban hospital system in the southern United States. Because the model requires learning previous observations of laboratory data, we excluded 8,528 encounters with only one Hgb test. We also eliminated 4,328 encounters with systolic blood pressures less than 90 mmHg to focus on patients who were hemodynamically stable, leaving 62,479 unique encounters. This included 804 pediatric encounters (age < 18). The demographics are summarized in **S4 Table**. We divided the entire cohort into 80% training set and 20% test set. To estimate future Hgb values, we also included 11 other common laboratory tests: * Electrolytes: Na (sodium), K (potassium), Cl (chloride), HCO3 (serum bicarbonate), Ca (total calcium), Mg (magnesium), PO4 (phosphate) * Renal function: BUN (Blood Urea Nitrogen), Cr (creatinine) * Complete Blood Count (CBC): Plt (platelet count), WBC (white blood count) We included five relevant vital signs (i.e., peripheral pulse rate, diastolic blood pressure, systolic blood pressure, respiratory rate, and SpO2 percent), individual patient demographics (i.e., gender, race, and age), hours from the last observation, Hgb value changes, and missing value indicators. ### 2.2 Prediction Tasks Considering patient data at {0,1,2,…, *t*}, where *t* denotes the timestep number, our confidence-based selection model has four predictive tasks: selection probability *p**t*, Hgb normality *y**t*, Hgb stability *z**t*, and Hgb value *v**t*. Specifically, predicting Hgb normality *y**t* and Hgb stability *z**t* are responsible for *outcome-level* safety assurance, and predicting selection probability *p**t* is responsible for *sample-level* safety assurance. Selection probability *p**t* measures the confidence of predicting the next Hgb. Prediction confidence represents predictability, that is, reliability of Hgb predictions estimated by the model. Based on Hgb tests that are confidently predicted, our model can make more accurate predictions of normality and stability. The model selected Hgb candidates with high prediction confidences, whose *p**t* is greater than a selection threshold *τ*. For the Hgb normality *y**t* task, if the predicted value was above the LBNR, the Hgb test was considered normal (i.e., *y**t* = 1). Otherwise, the Hgb test was cosnsidered abnormal (i.e., *y**t* = 0). We assume that Hgb results exceeding the upper bound of the normal range (UBNR) are relatively uncommon (e.g., polycythemia) or irrelevant (e.g., indicate dehydration or chronic hypoxia that are usually monitored using other modalities and were excluded from our analysis). Most clinical cases focus on dropping Hgb (e.g., bleeding), and only 0.37% of Hgb results in our samples were above the UBNR. In our population-driven model, instead of using a uniform LBNR over the entire population, we defined normal Hgb based on age and gender (see **Table 3**). For the Hgb stability *z**t* task, a predicted value was considered stable (i.e., *z**t* = 1) if it does not change from normal to abnormal. For this task, we only considered a decreasing drop from normal to abnormal. Agreement between normality and stability predictions reinforces the confidence of the overall prediction. The Hgb value *v**t* was an auxiliary task to improve primary prediction tasks – Hgb normality and Hgb stability. We expected that Hgb value predictions with confident normality and stability predictions would also closely approximate the actual values. Given an encounter’s input features *X*, we defined that the ground truth of unnecessary Hgb tests satisfy the following condition {*X*| *y**t* = 1} ∩ {*X*| *p**t* = 1} ∩ {*X*|*p**t* > *τ*}. That is, among selected candidates, if the next Hgb value was predicted to be normal and stable, the model would recommend a ‘safe’ reduction. ### 2.3 Training and Evaluation Design In the training stage, we conducted random mask corruption to transform some observations into zeros in order to simulate the impact of recommended lab test reduction (see **Methods**). In the test stage, we either converted omitted laboratory values to zeros (in reduction evaluation) or kept original laboratory values (in no-reduction evaluation). * *Training protocol:* In the training stage, input features *X* were fed into the network model at a fixed length *T*. The prediction at every timestep *t* depends on the observations from all previous timesteps. We trained separate models at the target coverage rate *c* = {0.75,0.8,0.85,0.9, 0.95,1.0}, which reflects the expected proportion of laboratory candidates that are selected for possible reduction. * *Reduction evaluation (practical setting):* In the test stage, we simulated dynamic laboratory reduction during the evaluation process. Starting at the initial timestep *t* = 0, the model was fed initial inputs *x***. For the following timesteps *t* > 0, we conducted stepwise reduction estimation. If the model estimated the next Hgb to be normal and stable, which yielded a recommendation to omit that test, the next Hgb input would be set at a zero value. The reduction evaluation process iterated until the last timestep. * *No-reduction evaluation (idealized setting):* Like the training stage, the second evaluation protocol performed a fixed evaluation process. At every timestep *t*, the model obtained all input features from the previous timesteps. In this process, we always used full observation to make future predictions, and no lab test was ever reduced in the prediction process. Model robustness can be estimated from the gap between *reduction* and *no-reduction* evaluations. ### 2.4 Label Prediction Performance We report the results of several experiments that validate the effectiveness of our confidence-based model in predicting normality and stability labels. **Fig 1** represents the comparison results under reduction evaluation and no-reduction evaluation metrics. Our numerical results are represented in **Table 1** and **Table 2**. ![Fig 1.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/06/21/2022.06.20.22276546/F1.medium.gif) [Fig 1.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/F1) Fig 1. Model performance at multiple selection coverages under reduction and no-reduction evaluation. The selection threshold *τ* is 0.5. The coverage rate refers to the expected proportion of Hgb samples to constrict the model. View this table: [Table 1.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T1) Table 1. Model performance by selection coverages and reduction rate. The model coverage rate refers to the actual proportion of Hgb samples considered by the model; the reduction rate refers to the proportion of Hgb samples recommended for reduction among the entire Hgb test dataset; Prev indicates prevalence, which denotes the proportion of normal/stable Hgbs in selective candidates; AUC indicates area under the ROC curve; Acc indicates accuracy; Prec indicates precision; AUPRC indicates the precision-recall curve. View this table: [Table 2.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T2) Table 2. Model performance under selection coverages under no-reduction evaluation. View this table: [Table 3.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T3) Table 3. Hgb normal range stratification table. LBNR means the lower bound of the normal range. UBNR means the upper bound of the normal range. Our experiments only considered Hgb LBNR to identify normality and stability. We set the default selection threshold *τ* = 0.5. The model selects a subset of Hgb samples with high prediction confidences and use *p**t* > 0.5 to make classifications, guided by the target coverage rate *c* (see **Method**). The selected samples are considered high-confidence candidates. A lower coverage rate means the model considers more high-confidence candidates for training and testing. But there is a tradeoff, as being too strict on prediction confidence would reduce selection size and hurt the model’s generalizability. “Model coverage rate” is the actual proportion of Hgb samples considered by the model. One can see that “model coverage rates” were close to “target coverage rates”, and they had less than 2% of differences in all settings, showing that the model is enforcing the target coverage rate. Because the data distribution was skewed (i.e., approximately 19.5% of labs were normal and only 9.2% transitioned from normal to abnormal), we used AUC and AUPRC to model performance measures. **Fig 1** and **Table 1** show that all the selective models achieved normality AUCs at over 90%, and stability AUCs at over 94%, even in the extreme coverage case at *c* = 1. Note that even though we set *c* = 1, our model did not think all the Hgb predictions are necessarily good candidates for reduction (indeed, it considered 99% of Hgb samples were eligible for reduction recommendation). As a good tradeoff between performance and reduction, we observed that the model reduced 9.91% Hgb tests at a coverage rate of 0.85. Most AUCs and AUPRCs decreased when the coverage rate increased. Note that stability AUCs did not drop at a steady rate because the proportion of stable samples was much larger than unstable ones, resulting in the mean probability bias favoring stable lab tests[15]. Overall, reduction evaluations achieved comparable performances over no-reduction evaluations. The results show that our model is robust in the practical setting when some input laboratory values are missing due to previous reductions. We defined prevalence as the proportion of positive samples (i.e., normal and stable Hgbs) in the selected Hgb candidates. As the coverage rate increased, the normality prevalence increased linearly, and the stability prevalence decreased linearly, suggesting that the higher-confidence model tends to select more samples from the dominating class of labels (i.e., abnormal Hgbs and stable Hgbs) to obtain higher accuracy while recognizing minor labels (i.e., normal Hgbs and unstable Hgbs). ### 2.5 Value Prediction Performance We evaluated the consistency between predicted values and predicted normality. The test was conducted under a reduction evaluation. Ideally, when the model estimates a Hgb sample to be normal, we expect its predicted value should also lie above the LBNR. **Fig 2** (see numerical details in **S6 Table**) shows that the model predicted values that were consistent with normality predictions. Among Hgbs that were predicted to be normal, our model predicted >90% of corresponding values above the LBNR in a tolerable error of 3% at coverage rates <85%, and in a tolerable error of 5% at almost all the coverage rates. These results imply that our multitask learning framework is able to leverage the auxiliary task (value prediction) to support the primary prediction tasks (normality and stability prediction). Another observation is that the coverage rate has a large effect on recognizing normalities of predicted values. When the coverage rate is higher, the accuracy of normality prediction shows more variability at error rates 0-10%, suggesting that the low-confidence model has more value variability. ![Fig 2.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/06/21/2022.06.20.22276546/F2.medium.gif) [Fig 2.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/F2) Fig 2. Consistency between predicted values and predicted normality. The objective was to measure the consistency between predicted values and predicted normalities. The “normality accuracy of predicted values” was defined as the percentage of predicted values with *f**V* (*x**t*) > = *b*, where *b* is the value of the LBNR, on Hgb samples with *f**Y*(*x**t*) = 1. We considered a tolerable boundary to be m% lower than the LBNR for predicted normal Hgbs. ### 2.6 Selection Performance The selection threshold *τ* was set at 0.5 during training. In the test phase, we evaluated the model using the default threshold *τ* = 0.5. In this section, we measured the influence of different selection thresholds *τ* ∈ [0.05,0.95] (with intervals at 0.05) on classification accuracy. The experiment was evaluated under laboratory reductions. We set the target coverage rate to 0.85. Our model accounted for 83.80% of Hgb candidates (**Fig 3**), using the default selection probability *τ* = 0.5. ![Fig 3.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/06/21/2022.06.20.22276546/F3.medium.gif) [Fig 3.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/F3) Fig 3. Model performances using different selection thresholds for reduction evaluation. The range of our selection threshold is *τ* ∈ [0.05,0.95] with intervals at 0.05. As shown in **Fig 3**, although performance curves fluctuate at nearby selection thresholds, the trend remained increasing when the threshold value was increased. With a larger threshold, the model selected fewer normal Hgbs and more stable Hgbs, but obtained higher confidence in classifications. When the threshold *τ* is above 0.5, we achieved normality predictions over 95.8% AUC and 80.0% AUPRC, while stability predictions achieved more than 95.9% AUC and 99.8% AUPRC. ## 3. Discussion We introduced a deep learning model with a selective framework to address the laboratory reduction problem. We conducted a case study on Hgb reduction. Based on the definition that unnecessary laboratory tests are the ones predicted to be normal and stable, we showed that our selective model achieved good predictive performance. Our major contribution is to offer safe recommendations for omitting unnecessary Hgb samples by jointly considering the confidence and prediction accuracy during training and testing. The idea was to select a proportion of Hgb candidates with high prediction confidence in estimating normality and stability. Our model automatically identified an appropriate balance between stability, normality, and prediction confidences, which achieved a performance of 95.89% normality AUC and 95.94% stability AUC with a potential to eliminate 9.91% of Hgb tests. In addition, when future Hgb tests were predicted as normal, our model predicted >90% of the corresponding predicted Hgb values within a tolerable error range of 3% at a 90% selection coverage, demonstrating its robustness. We also made some technical contributions in order to handle irregular time sequences in laboratory reduction by introducing a feed-forward attention function to capture the importance of every timestep. The ablation study results (see **S3 Table**) confirmed that both selective mechanism and attention-based LSTM layers contributed to the improvement of model performances. Not all predictable laboratory tests are unnecessary in complicated clinical situations. To ensure usability, our model discovered unnecessary Hgb tests among predicted high-confidence candidates. The predicted low-confidence Hgb samples are not recommended for elimination, for which our model would acknowledge ‘I don’t know’ and let clinicians decide. Our model also offers a tunable parameter to control the confidence level for sample inclusion (i.e., expected selection coverage). Clinicians can choose a confidence level based on the context to receive recommendations for clinical decision making. Nevertheless, our work still has limitations. First, the selection mechanism is computationally expensive because it needs to accommodate multiple objectives. Second, the optimal expected coverage rate that controls the model’s confidence might vary by different situations, and it is not directly interpretable as confidence intervals that are familiar to clinicians. Third, our study was internally validated for patients admitted to a single health institution. External validation studies are needed. Fourth, our model recommends reduction for an individual laboratory test. In clinical practice, Hgb tests are commonly ordered as a part of a complete blood count panel, which includes platelet count and white blood count. Our model currently does not account for bundled test reduction strategies because the laboratory panel information is missing in the dataset. Finally, the ground truth of unnecessary laboratory tests might not be perfect. Our model relies on standard of laboratory normal ranges, whereas some abnormal results may be predictable and stable (e.g., a clinically stable patient with stable anemia) and thus could be omitted. We will address these limitations in future research. ## 4. Methods The workflow of the confidence-based candidate selection is demonstrated in **Fig 4**. ![Fig 4.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/06/21/2022.06.20.22276546/F4.medium.gif) [Fig 4.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/F4) Fig 4. Block diagram of candidate selection. (1) Before processing the network model, we imputed zeros for missing features. (2) In the training stage, we inserted a random zero mask for existing laboratory values. The network model predicted selection probabilities for individual laboratory tests. Thus, the model ignored some samples, whose inclusions were considered to decrease performance. Each model was trained under one target coverage rate that constrained the actual proportion of selected laboratory tests. Intuitively, the lower coverage rate means that selections are more strict. (3) In the test stage, we chose a model at an acceptable coverage rate. A threshold *τ* was used to determine whether individual laboratory tests were selected. The selected tests were considered high-confidence candidates. The model recommended canceling pending laboratory tests if predicted values satisfied two joint conditions: a. High-confidence; b. Unnecessary (i.e., predicted to be stable and remain normal). ### 4.1 Data Preprocessing For each encounter, we organized laboratory test results into a consecutive sequence in temporal order. The laboratory tests conducted in the same hour were aggregated into a laboratory draw. If the same laboratory test appears more than one time at the encounter’s visit hour, we averaged the result values. All laboratory draws for each encounter start with the draw where the first Hgb result was recorded, and end with the draw where the last Hgb result was recorded. Finally, we capped the total length of laboratory draws to 30 timestamps (to make the parameter space tractable) based on the histogram of encounter visit times (i.e., the number of timestamped laboratory records for patients, see **S2 Fig**). To incorporate patient conditions, vital signs were also integrated with the laboratory draw by the event time. We calculated average values when more than one vital sign was reported during the same hour. To better understand each patient, our model includes patient demographics, involving gender, race, and age information. The normal range varies depending on gender and age (**Table 3**). We encoded gender and race information as categorical variables. We also categorized patient ages into four groups based on their different normal ranges and genders. ### 4.2 Notation **Table 4** explains the symbols used in our model. For each encounter, we had a multivariate time series data of length *T*. We gave symbols to represent important concepts: feature measurements *X*, Hgb normality labels *y*, Hgb stability labels *Z*, Hgb values *V*, and selection probability *P. X* represented a group of input features, including laboratory values, vital signs, Hgb value changes, demographics, time differences, and observation indicators. The observation indicator was denoted as *o*. It was used for handling missing values of the dataset. Hgb normality labels *Y*, Hgb stability labels *z*, and Hgb values *V* served as gold standards to measure our model’s prediction performance. To measure the confidence of predicting the next Hgb test, our model also predicted the selection probability *P*. As observations were not necessarily made at regular intervals, hence, the timestep *t* simply indexed the sequence of observations. View this table: [Table 4.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T4) Table 4. Nomenclature table. When an encounter had observed data (i.e., a Hgb test drawn and resulted) at a timestep *t*, we denoted the observation indicator *o**t* = 1; Otherwise, we denoted *o**t* = 0. The normality *y**t*, stability *z**t*, values *v**t*, and selection probabilities *p**t* represented the outcome realized at timestep *t*. We used a sigmoid function to output predictions for normality *Y*, stability *Z*, and selection probability *P* in a range of [0,1], and restricted predictions for *V* to be positive. To make a cutoff for classifications, we used a threshold for normality, stability, and selection probability. The default thresholds were 0.5. In the test stage, the selection threshold is adjustable, and we set values in the range of *τ* ∈ [0.05.0.95]. ### 4.3 Feature Processing This section introduces our methods to handle missing data and random mask corruption. We incorporated relational positional time embeddings [16] to replace absolute one-dimensional time differences. Since time embeddings have little impact on improving model performance in our experiment, we discuss details of this approach in **S1 Text**. #### 4.3.1 Missing Data Handling Our model considered 12 common laboratory tests as features, but some tests were not conducted in the same time window. The reason is that no patient has all labs drawn at the same time. We treated these unmeasured laboratory tests as missing values, which are denoted as *v**t* = 0. Previous work [17] shows that deep learning models, such as long-short term memory (LSTM) networks, can handle missing data by integrating an additional indicator *o**t* =0 for these missing values *v**t* = 0. Thus, our model used two vectors (*o**t*, *v**t*) instead of only one vector of observed lab test values so that the model knew which observations were missing. This so-called zero imputation strategy can handle missing data implicitly by considering feature correlations. #### 4.3.2 Random Mask Corruption In the training stage, the random mask corruption was designed to simulate the impact of lab test reduction on future predictions. Assuming a lab test is reduced at time *t*, and therefore, its observed value cannot be used for predicting labs at time *t* + 1. Following an earlier work[11], we randomly corrupted 10% of observed inputs (*o**t* = 1), that is, making their value *v**t* =0 at future predictions to simulate the impact of lab test reductions. We still considered the prediction errors of these corrupted Hgb values from time *t* − 1 to *t*, but did not use them to make future predictions. At the testing stage, we did not introduce any corruption and simply changed *v**t* =0 for lab tests that were recommended for reduction with respect to future predictions. ### 4.4 Development of Confidence-based Deep Learning Approach #### 4.4.1 Feed-forward Attention LSTM A time-aware attention mechanism [18] was used to extract essential features from the input sequence over time. First, the LSTM layer generated a sequence of hidden vectors *h*. Second, for each timestep *t*, the learnable function *a* computed the hidden vector *h**t*, then produced an encoded embedding *e**t*. We computed a probability weight *a**t* using a softmax function over the entire time sequence. Finally, the context vector *c* was computed as a weighted sum of the hidden vectors *h**t*. ![Formula][1] ![Formula][2] ![Formula][3] As a result, the attention mechanism enabled the model to distribute weights over the entire time period, which smoothly adjusted the impacts of long-time memories for irregular time sequences. #### 4.4.2 Model Architecture We presented the confidence-based deep learning work in combination with long-short term memory (LSTM) and the selective neural network under multitask learning strategies. The idea is to select a proportion of Hgb candidates with high prediction confidence so that the selected Hgb candidates are more likely to make correct classifications of normality and stability. The architecture of our model is shown in **Fig 5**. The input features were first fed into the LSTM network module. Inspired by a multitask learning architecture[19], we used a common feed-forward LSTM layer to capture the shared information for downstream networks, and assigned two separate attention-based LSTMs. The shared LSTM layer accepted all features (i.e., laboratory values, vital signs, patient demographics, Hgb value changes, time differences, and observation indicators) as inputs, and optimized hidden states at every timestep. Demographic features were copied and placed in every timestamp. The model used two attention-based LSTM layers with shared parameters to capture task information. In the attention-based LSTM layer, input embeddings were augmented by concatenating the hidden features of the shared LSTM layer and original data *X*. One attention-based LSTM layer was followed by the stability predictor. We used Hgb values to measure the stability of future Hgb tests, and this first layer learned a subset of features, including Hgb value changes, patient demographics, time differences, and observation indicators. The other attention-based LSTM layer was followed by the normality, value, and selection predictor. These three prediction tasks require complex features and may influence each other[10]. This second layer learned entire feature vectors, including laboratory values, vital signs, patient demographics, Hgb value changes, time differences, and observation indicators. ![Fig 5.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/06/21/2022.06.20.22276546/F5.medium.gif) [Fig 5.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/F5) Fig 5. Model Architecture Framework. In the LSTM network module, the shared LSTM layer received all input features, and outputs hidden features that contained general information derived from original data. The attention-based LSTM layer augmented input embeddings by concatenating hidden features and duplicated original features. One attention-based layer learned a subset of features for the following stability predictor. The other attention-based layer learned entire feature vectors to obtain complicated information for the following normality, value, and selection predictor. In the selective network module, we had four 2-layer MLP predictors to make task-specific predictions for Hgb stability, Hgb normality, Hgb value, and selection probability in parallel. Stability and normality predictors were treated as primary predictions that focus on selected Hgb samples. The value predictor served as the auxiliary prediction that covers all Hgb samples, including the non-selected ones. The downstream selective network module used multilayer perceptrons (MLPs) with ReLU activation, each followed by a final task-specific prediction. Each MLP predictor consisted of two fully-connected layers. The selection MLP predictor estimated the likelihood of choosing the Hgb test as a high-confidence sample[20]. More details of confidence-based selection will be introduced in the next section. The normality and stability MLP predictors were responsible for primary predictions for selected Hgb candidates. The value MLP predictor made the auxiliary prediction that covered all Hgb samples, including Hgb tests that were not selected by the model. The reason for introducing the auxiliary prediction was to avoid overfitting (i.e., the model chooses non-representative samples to only benefit normality and stability prediction). In addition, Hgb value optimization used the entire dataset to evaluate the loss in order to ensure the generalizability of normality and stability predictors. #### 4.4.3 Loss Function with Reject Optimization We considered the problem of selective prediction in the laboratory reduction network and leveraged the integrated reject mechanism developed in SelectiveNet[14]. It is a deep learning architecture that is optimized for selecting samples that maximally benefit model predictions. Based on our assumption that the dataset contains a proportion of Hgb outliers, such a specialized rejection model only considers a proportion of samples and filters out low-confidence ones. The approach proposes a loss function that enforces the coverage constraint using a variant of the Interior Point Method (IPM)[21]. The selection head outputs a single probabilistic value *p**t* using a sigmoid activation. At a certain timestep *t*, the selective network module achieves LSTM hidden features, then predicts Hgb normality *f**Y*(*x**t*) and stability *f**z*(*x**t*) if and only if the selection probability *p**t* exceeds a user-defined threshold *τ* (defined in **Table 4**); otherwise, the model rejects the prediction tasks of normality and stability for *x**t*. Given the selection loss *L**f* in Equation (4), the performance of the selective algorithm is measured by the likelihood of normality *r*(*f**Y*), the likelihood of stability *r*(*f**z*), and a quadratic penalty function, *ψ*(*a*). ![Formula][4] ![Formula][5] where *c* is the customized target coverage (i.e., expected proportion of samples to be considered eligible for reductions, for which the model will predict), *λ* is a penalty parameter to control the weight of the regularization. The likelihood of normality *r*(*f**Y*) and the likelihood of stability *r*(*f**z*) are determined by the normality loss *L*(*f**Y*(*x**t*), *y**t*) and stability loss *L*(*f**z*(*x**t*), *z**t*), respectively. ![Formula][6] ![Formula][7] where ![Formula][8] In Equation (9) and (10), the normality loss *L*(*f**z*(*x**t*), *z**t*) and stability loss *L*(*f**z*(*x**t*), *z**t*) are both calculated using the binary cross-entropy, where *σ* denotes the sigmoid function that converts predictions into probabilistic values. ![Formula][9] ![Formula][10] Additionally, we handled Hgb value predictions *f**V* (*x**t*) as the auxiliary task using a standard mean squared error (MSE) loss function. Auxiliary predictions were used to connect the selection loss *L**f* by accounting for all samples. Otherwise, normality loss and stability loss only consider optimizing predictions of selected samples, which might cause the overfitting issue. Thus, the overall loss function *L* is a combination of the selection loss *L**f* and the auxiliary loss *L**aux* as follows: ![Formula][11] where *α* is the selection weight which lies in [0,1], and ![Formula][12] ## Data Availability These clinical data contain potentially identifying and sensitive patient information, which cannot be made available due to the privacy protection policy of the hospital. ## Data Availability These clinical data contain potentially identifying and sensitive patient information, which cannot be made available due to the privacy protection policy of the hospital. ## Ethnic Statement Our research study was approved by the Committee for the Protection of Human Subjects (the UTHSC-H IRB) using the protocol HSC-MS-21-0452 [22]. There was a waiver of informed consent. ## Code Availability Custom code used to analyze the model is available upon reasonable request to the corresponding author. ## Financial Disclosure XJ is CPRIT Scholar in Cancer Research (RR180012), and he was supported in part by Christopher Sarofim Family Professorship, UT Stars award, UTHealth startup, the National Institute of Health (NIH) under award number R01AG066749 and U01TR002062. LL is supported by a KL2 Mentored Career Development Award through the University of Texas Health Science Center at Houston. EVB is supported in part by National Center for Advancing Translational Sciences (NCATS) UL1TR000371; the Cancer Prevention and Research Institute of Texas (CPRIT), under award RP170668 and the Reynolds and Reynolds Professorship in Clinical Informatics. ## Author Contributions TH led the writing and conducted experiments. LL, EVB, XJ contributed to the generation of the idea and contributed to the writing of the paper. All authors made multiple rounds of edits and carefully reviewed the manuscript. ## Competing Interests Statement We do not have a competing interest to disclose. ## Supplementary Information ### S1 Text. Supplementary Notes #### Data analysis Statistical analyses were conducted in Python (version 3.7). The network model is processed via PyTorch libraries using CUDA. The network was trained using the Adam optimizer, with a 64 minibatch size, an L2 regularization weight of 0.000001, and a learning rate of 0.0001 over 100 epochs. The network hyperparameters producing the lowest total loss in the test set were chosen as the final network architecture. #### Relative Positional Time Embedding The laboratory test data are represented as irregular time series, which have wide and heterogeneous time gaps between consecutive observations. The number of hour differences from the last laboratory draw contains positional information that indicates the relative relationships between lab test values in consequent blood draws. The time differences were concatenated with other input features (i.e., laboratory values, vital signs, patient demographics, Hgb value changes, and observation indicators) in the timestamp. Previous work [11] handled irregular time-series predictions by assigning the absolute time difference to each timestep. We assigned a unique temporal encoding to cover all possible hour gaps (relative differences by hours), which can be generalized to both short and long time gaps. Following the transformer model[16], we used sinusoidal functions to incorporate relative positions. ![Formula][13] ![Formula][14] where Δ*t* is the relative time differences of laboratory tests, *d* represents the embedding dimension, and *s:N*–> *Rd* is the sinusoidal function that produces the output vector for each dimension *i*. This kind of embedding allows us to represent relative positions, in which *S*(*t* + *k*) is a linear transformation of *S*(*t*) for any fixed offset *k*. #### Ablation Study Evaluation To better understand the network behavior, we performed ablation experiments in which some components of the network are removed. All ablation experiments were conducted on the reduction evaluation protocol. We compared our full model with several variants with reduced structure from the full model. The model without time embeddings refers to a model considering time differences as a one-dimensional vector. The model without attention refers to a model without attention in LSTM layers. The model without confidence selection makes predictions of normality, stability, and values for all Hgb samples. The vanilla two-layer LSTM together with four two-layer MLPs whose network structure refers to Yu et al.[11]. The baseline models considering the selection mechanism (our model, without time embedding, and without attention) are set as the coverage rate at 0.85, and their corresponding model coverage rates are very close. As shown in **S3 Table**, our full model achieves the best normality prediction performances and comparable stability prediction outcomes. Although the relative positional time embedding method works well on transformer models[16], removing time embeddings does not make a significant influence on predicting labels in our case. In our model, attention weights have improved normality AUC by 3.21% and AUPRC by 7.06%, and confidence selections have improved normality AUC by 4.98% and AUPRC by 16.13%. Compared to the vanilla two-layer LSTM network, our model includes a combination of attention weights and confidence selection, which archives performance improvements. ## Supplementary Figure ![S2 Fig.](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2022/06/21/2022.06.20.22276546/F6.medium.gif) [S2 Fig.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/F6) S2 Fig. Histogram of encounter visit times. Encounter visit times refer to the number of laboratory draws for each encounter. The histogram results show that it follows a long tail distribution. Most encounters have no more than 30 laboratory test records. ## Supplementary Tables View this table: [S3 Table.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T5) S3 Table. Comparison with baseline models at an optimal coverage (*c* = o. 85). Bold numbers denote the best performances over the evaluation metric, and underlined numbers denote the second-best performance over the evaluation metric. View this table: [S4 Table.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T6) S4 Table. Patient cohorts summarized. This table concludes input features of 3 patient demographic information (age, gender, and race), 5 vital signs, and 12 common laboratory tests. View this table: [S5 Table.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T7) S5 Table. The number of Hgb coverages and reduction rates under reduction evaluation. This table provides numerical details of hemoglobin test reduction regarding the test dataset. The total number of encounters is 12,525. The total number of Hgb samples in the entire dataset is 57,039 starting from *t* = 0, and 44,586 starting from *t* = 1. The dominator of the coverage rate and the reduction rate is the total number of Hgbs starting from *t* = 1, which ignores initial tests at *t* = 0. View this table: [S6 Table.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T8) S6 Table. Value prediction consistency with normality under reduction evaluations. This table provides numerical details on value prediction to explain **Fig 2** in the article. View this table: [S7 Table.](http://medrxiv.org/content/early/2022/06/21/2022.06.20.22276546/T9) S7 Table. Model performances using different selection thresholds under reduction evaluation. This evaluation table provides numerical details of selection prediction to explain **Fig 3** in the article. * Received June 20, 2022. * Revision received June 20, 2022. * Accepted June 21, 2022. * © 2022, Posted by Cold Spring Harbor Laboratory This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at [http://creativecommons.org/licenses/by/4.0/](http://creativecommons.org/licenses/by/4.0/) ## REFERENCES 1. 1.Brownlee S, Chalkidou K, Doust J, Elshaug AG, Glasziou P, Heath I, et al. Evidence for overuse of medical services around the world. Lancet. 2017;390: 156–168. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0140-6736(16)32585-5&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=28077234&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F06%2F21%2F2022.06.20.22276546.atom) 2. 2.Chassin MR, Galvin RW, and the National Roundtable on Health Care Quality, and the National Roundtable on Health Care Quality. The Urgent Need to Improve Health Care Quality: Institute of Medicine National Roundtable on Health Care Quality. JAMA. 1998;280: 1000–1005. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1001/jama.280.11.1000&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=9749483&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F06%2F21%2F2022.06.20.22276546.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000075891400034&link_type=ISI) 3. 3.Statland BE, Winkel P. Utilization review and management of laboratory testing in the ambulatory setting. Med Clin North Am. 1987;71: 719–732. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=3295422&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F06%2F21%2F2022.06.20.22276546.atom) [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=A1987H829800012&link_type=ISI) 4. 4.Roy SK, Hom J, Mackey L, Shah N, Chen JH. Predicting Low Information Laboratory Diagnostic Tests. AMIA Jt Summits Transl Sci Proc. 2018;2017: 217–226. 5. 5.Nasserinejad K, de Kort W, Baart M, Komárek A, van Rosmalen J, Lesaffre E. Predicting hemoglobin levels in whole blood donors using transition models and mixed effects models. BMC Med Res Methodol. 2013;13: 62. 6. 6.Lobo B, Abdel-Rahman E, Brown D, Dunn L, Bowman B. A recurrent neural network approach to predicting hemoglobin trajectories in patients with End-Stage Renal Disease. Artif Intell Med. 2020;104: 101823. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.artmed.2020.101823&link_type=DOI) 7. 7.Pellicer-Valero OJ, Cattinelli I, Neri L, Mari F, Martín-Guerrero JD, Barbieri C. Enhanced prediction of hemoglobin concentration in a very large cohort of hemodialysis patients by means of deep recurrent neural networks. Artif Intell Med. 2020;107: 101898. 8. 8.Cismondi F, Celi LA, Fialho AS, Vieira SM, Reti SR, Sousa JMC, et al. Reducing unnecessary lab testing in the ICU with artificial intelligence. Int J Med Inform. 2013;82: 345. [PubMed](http://medrxiv.org/lookup/external-ref?access_num=http://www.n&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F06%2F21%2F2022.06.20.22276546.atom) 9. 9.Aikens RC, Balasubramanian S, Chen JH. A Machine Learning Approach to Predicting the Stability of Inpatient Lab Test Results. AMIA Summits Transl Sci Proc. 2019;2019: 515. 10. 10.Predict or draw blood: An integrated method to reduce lab tests. J Biomed Inform. 2020;104: 103394. 11. 11.A deep learning solution to recommend laboratory reduction strategies in ICU. Int J Med Inform. 2020;144: 104282. 12. 12.Association between hemoglobin variability, serum ferritin levels, and adverse events/mortality in maintenance hemodialysis patients. Kidney Int. 2014;86: 845–854. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1038/ki.2014.114&link_type=DOI) [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24759150&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2022%2F06%2F21%2F2022.06.20.22276546.atom) 13. 13.Sabatine MS, Morrow DA, Giugliano RP, Burton PBJ, Murphy SA, McCabe CH, et al. Association of hemoglobin levels with clinical outcomes in acute coronary syndromes. Circulation. 2005;111: 2042–2049. [Abstract/FREE Full Text](http://medrxiv.org/lookup/ijlink/YTozOntzOjQ6InBhdGgiO3M6MTQ6Ii9sb29rdXAvaWpsaW5rIjtzOjU6InF1ZXJ5IjthOjQ6e3M6ODoibGlua1R5cGUiO3M6NDoiQUJTVCI7czoxMToiam91cm5hbENvZGUiO3M6MTQ6ImNpcmN1bGF0aW9uYWhhIjtzOjU6InJlc2lkIjtzOjExOiIxMTEvMTYvMjA0MiI7czo0OiJhdG9tIjtzOjUwOiIvbWVkcnhpdi9lYXJseS8yMDIyLzA2LzIxLzIwMjIuMDYuMjAuMjIyNzY1NDYuYXRvbSI7fXM6ODoiZnJhZ21lbnQiO3M6MDoiIjt9) 14. 14.Geifman Y, El-Yaniv R. SelectiveNet: A Deep Neural Network with an Integrated Reject Option. International Conference on Machine Learning. PMLR; 2019. pp. 2151–2159. 15. 15.Cramer JS. Predictive performance of the binary logit model in unbalanced samples. J Royal Statistical Soc D. 1999;48: 85–94. 16. 16.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. 2017. Available: [http://arxiv.org/abs/1706.03762](http://arxiv.org/abs/1706.03762) 17. 17.Lipton ZC, Kale DC, Wetzel R. Modeling Missing Data in Clinical Time Series with RNNs. 2016. Available: [http://arxiv.org/abs/1606.04130](http://arxiv.org/abs/1606.04130) 18. 18.Raffel C, Ellis DPW. Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems. 2015. Available: [http://arxiv.org/abs/1512.08756](http://arxiv.org/abs/1512.08756) 19. 19.Liu P, Qiu X, Huang X. Recurrent Neural Network for Text Classification with Multi-Task Learning. 2016. Available: [http://arxiv.org/abs/1605.05101](http://arxiv.org/abs/1605.05101) 20. 20.Geifman Y, El-Yaniv R. SelectiveNet: A Deep Neural Network with an Integrated Reject Option. 2019. Available: [http://arxiv.org/abs/1901.09192](http://arxiv.org/abs/1901.09192) 21. 21.Potra FA, Wright SJ. Interior-point methods. J Comput Appl Math. 2000;124: 281–302. [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/S0377-0427(00)00433-7&link_type=DOI) 22. 22.Guerrero SC, Sridhar S, Edmonds C, Solis CF, Zhang J, McPherson DD, et al. Access to Routinely Collected Clinical Data for Research: A Process Implemented at an Academic Medical Center. Clin Transl Sci. 2019;12. doi:10.1111/cts.12614 [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1111/cts.12614&link_type=DOI) [1]: /embed/graphic-9.gif [2]: /embed/graphic-10.gif [3]: /embed/graphic-11.gif [4]: /embed/graphic-13.gif [5]: /embed/graphic-14.gif [6]: /embed/graphic-15.gif [7]: /embed/graphic-16.gif [8]: /embed/graphic-17.gif [9]: /embed/graphic-18.gif [10]: /embed/graphic-19.gif [11]: /embed/graphic-20.gif [12]: /embed/graphic-21.gif [13]: /embed/graphic-22.gif [14]: /embed/graphic-23.gif