Abstract
Laboratory data in electronic health records (EHRs) is an effective source of information to characterize patient populations, inform accurate diagnostics and treatment decisions, and fuel research studies. However, despite their value, laboratory values are underutilized due to high levels of missingness. Existing imputation methods fall short, as they do not fully leverage patient clinical histories and are commonly not scalable to the large number of tests available in real-world data (RWD). To address these shortcomings, we present Laboratory Imputation Framework using EHRs (LIFE), a deep learning framework based on multi-head attention that is trained to impute any laboratory test value at any point in time in the patient’s journey using their complete EHRs. This architecture (1) eliminates the need to train a different model for each laboratory test by jointly modeling all laboratory data of interest; and (2) better clinically contextualizes the predictions by leveraging additional EHR variables, such as diagnosis, medications, and discrete laboratory results. We validate our framework using a large-scale, real-world dataset encompassing over 1 million oncology patients. Our results demonstrate that LIFE obtains superior or equivalent results compared to state-of-the-art baselines in 23 out of 25 evaluated laboratory tests and better enhances a downstream adverse event detection task in 7 out of 9 cases, showcasing its potential in efficiently estimating missing laboratory values and, consequently, in transforming the utilization of RWD in healthcare.
Introduction
Electronic health records (EHRs) create novel opportunities to leverage clinical information across large populations using machine learning and other data-driven approaches.1 EHRs contain a mix of structured and unstructured data elements, such as demographics, clinical diagnoses, laboratory results, medication prescriptions, and free-text clinical notes. Among these, laboratory data is particularly informative since it captures an objective physiological attribute of patients, rather than being used primarily for billing purposes such as International Classification of Diseases (ICD) diagnosis codes.2
However, there are many challenges associated with using laboratory results in real-world data (RWD) analysis. Foremost amongst these challenges is data missingness. Laboratory values can be missing from a patient’s EHR for various reasons, including patients not needing specific tests, irregular check-ups or follow-ups, data recording and integration issues, and more, often resulting in biased statistics and suboptimal clinical model performances, specifically in groups with traditionally poor access to healthcare.3–5
The task of imputing missing laboratory values is particularly difficult in EHRs, which commonly contain irregular time intervals between consecutive records. The literature presents a number of approaches, generally using either simple rules-based methods, statistical modeling, or machine learning. Some common statistical methods in use are expectation maximization, regression, and multiple imputations by chained equations (MICE).6 Machine learning techniques include tree-based models, such as Multi-directional Approach for Missing Value Estimation in Multivariate Time Series Data (MD-MTS),7 Gaussian processes such as 3D-MICE,8 and deep learning architectures, such as Bidirectional Recurrent Imputation for Time Series (BRITS),9 Multi-directional Recurrent Neural Networks (M-RNN),10 and Gated Recurrent Unit with Decay.11 A recent benchmark suggested that MD-MTS may have the best performance on healthcare data.12
All these methods have limitations. First, they have been trained and evaluated on clean, homogenous, regularly sampled ICU data from a single institution, such as the Medical Information Mart for Intensive Care (MIMIC) datasets,13,14 and have not been tested on the heterogeneous, irregularly sampled, outpatient and inpatient RWD from multiple institutions, which is more common in the healthcare industry. Second, none of them take advantage of the entire feature space available in the EHRs. MICE is designed for cross-sectional data and is not well suited to longitudinal EHRs because it does not consider patient’s past and future laboratory values.15 Methods like BRITS, MD-MTS, and M-RNN consider past and future values of the laboratory test of interest, but ignore other EHR variables, such as demographics, ICD codes, and discrete laboratory results (e.g., EGFR status), which can be useful to clinically contextualize the imputed values. Last, these methods were designed to either infer one laboratory value at the time or, because of architectural constraints, a small subset of values (up to 25-50 different tests). However, there are about 31,000 different laboratory tests in the Logical Observation Identifiers Names and Codes (LOINC) catalog and building one model for each of them is impractical.
Recent advances in deep learning16 and self-supervised learning17 are enabling new ways to leverage EHRs for personalized healthcare,18–20 and provide an opportunity to address the challenges discussed above. The transformer architecture and its building block, the multi-head self-attention module, specifically, can contextualize each data point more efficiently than other machine learning architectures21, showing improved performances when modeling EHRs.22 This work fits into this landscape by presenting a scalable architecture based on self-supervised learning to process heterogeneous and large cohorts of clinical data for imputing laboratory data into patient clinical histories.
In particular, this paper proposes Laboratory Imputation Framework using EHRs (LIFE), a novel and scalable architecture that can impute any laboratory value at any point in time from EHRs (Figure 1). LIFE receives as input a patient’s EHR data and a query describing the laboratory test of interest. The model leverages multi-head attention and time decay layers and a training objective inspired by masked language modeling (Figure 2). The framework uses the entire feature space of the EHRs, including discrete laboratory values and codes. We demonstrate that LIFE performs better than several baselines including MICE, BRITS, and MD-MTS using real-world multi-institute EHRs of about 1.1 million de-identified oncology patients in the Tempus data warehouse. We then show that this improvement in imputation performance translates to better performance on a downstream task, namely detection of adverse events. Finally, using attention weights, we show that LIFE is interpretable and the feature importances it selects are consistent with medical knowledge.
This architecture enables laboratory data imputation at scale by eliminating the need for training a different model for each laboratory test and by including the use of additional EHR variables to better clinically contextualize the predictions. By imputing a large amount of laboratory data for patients at different points in time, LIFE aims to reduce the sparseness of real-world clinical data and provide a more complete and context-aware characterization of patient clinical histories. This will consequently enhance systems using EHRs for, among others, clinical decision support, care gaps detection, phenotyping, and clinical trial matching.23–26
Results
We first assessed the capability of LIFE to impute laboratory data at scale. Specifically, we investigated the model’s ability to accurately infer randomly deleted laboratory values within patients’ clinical histories. To achieve this, we evaluated its performance on a hold-out test set of 110,953 patients across 344 laboratory tests, each of which with at least 10,000 occurrences in the training data (refer to the “Dataset” subsection in “Methods” for more details). Figure 3 and Supplementary Table 1 present results in terms of mean absolute error (MAE) for all tests. Overall, LIFE obtained strong performances, with an average MAE of 0.15. These results showcase LIFE’s ability to make effective predictions at scale. It is important to notice that these results were obtained using a single model that was jointly trained via masking to predict values for the 344 laboratory tests.
Table 1 compares the performances of LIFE and several baselines from the literature on 25 common laboratory tests. Differently from LIFE, which by design can scale to any number of laboratory tests, baselines were only trained on these 25 common laboratory values. This corresponds to the approximate number of tests used to train and validate these models in the literature (see “Baselines” and “Experimental Design” subsections in “Methods” for more details). As it can be seen, LIFE outperformed all baselines with an average MAE of 0.14, while the second-best model, BRITS, had an average MAE of 0.16. LIFE also obtained the best performances in 23 out of the 25 laboratory tests and, notably, outperformed rules-based baselines by a substantial 19%.
We notice that performances of LIFE remained consistent irrespective of the timing of prediction within a patient’s history (Supplementary Figure 1) and clinical status, i.e., tumor’s origin and stage (Supplementary Table 2). However, we observed a substantial variation in performance related to the magnitude of the laboratory value (Supplementary Figure 2). Specifically, LIFE exhibited better performances when predicting either very low or very high values for laboratory tests. This is actually to be expected: when the model is uncertain of its prediction, it is more likely to revert to the mean value of the test result. This hypothesis is supported by Supplementary Figure 3, which shows that predictions are biased towards central or median value for all models.
We also observe that LIFE’s continuous predictions can effectively categorize imputed values as normal or abnormal by assessing the extremeness of the predicted values. As illustrated in Supplementary Figure 4, LIFE outperformed all baselines in identifying abnormal laboratory values, achieving an area under the receiver operating characteristic curve (AUC-ROC) of 0.79 and an area under the precision - recall curve (AUC-PR) of 0.45.
We then evaluated the benefits of using LIFE’s imputed values in the downstream task of detecting adverse events from laboratory data. Specifically, for each patient, we imputed their laboratory data, which were then used to detect the presence of an adverse event at that time. As can be seen in Table 2, LIFE obtained the best AUC-PR for 7 out of 9 adverse events. On average, LIFE achieved an AUC-PR of 0.72, with the second best model for this use case, Nearest Lab Imputation, achieving 0.68 (6% improvement).
Lastly, we conducted an analysis of the attention weights assigned by LIFE to clinical features in the input space. Specifically, we looked at the attention weights assigned to six patients from the test set for different laboratory tests. For each patient, Figure 4 shows the 20 observations with the largest attention weights, accompanied by the corresponding laboratory test. Overall, we found that the attention weights assigned by the model were highly consistent with known medical knowledge, not limited to other laboratory data, which supports the validity of our model’s predictions. For example, as it can be seen, in the first patient who was queried for body weight, the model paid attention to past and future values of body weight, body surface area, and body mass index. In the case of hemoglobin, the model attended to hematocrit, which is closely related to hemoglobin, and mean corpuscular hemoglobin, which is often used to evaluate for anemia. Similarly, for leukocytes, the model attended to closely related laboratory tests in the white blood cell panel, such as the monocyte count. For albumin, the model attended to past and future albumin levels, as well as to total protein levels. Finally, in the case of prostate-specific antigen (PSA), the model attended to past and future PSA values as well as to diagnosis of secondary bone metastasis, which suggest a case of more severe prostate cancer. Supplementary Figure 5 shows the average attention weight across the patient population for an additional set of laboratory tests, further confirming the alignment with medical knowledge.
Discussion
This paper introduces LIFE, a novel deep learning framework for laboratory data imputation in patient clinical history of real-world EHRs. LIFE leverages time decay and multi-head attention to consume all available data in patients’ clinical histories and impute missing laboratory test values at scale. Evaluation results show that LIFE was able to impute values for all the 344 laboratory tests included in the experiments with high accuracy and outperformed the state of the art. The improvement of LIFE with respect to rule-based models (19%) is particularly significant, given that these baselines are the most prevalent model type capable of scaling to impute a large amount of laboratory data and are often utilized in RWD analysis.
This work addresses an important gap in the literature, as most existing imputation models for laboratory data were not designed for RWD, which is unique in terms of its large scale, heterogeneity, and noise. Overcoming these challenges, LIFE provides a mechanism to predict missing laboratory test values by simply requiring a laboratory test, a unit of measure, and a point in time.
Because LIFE enables laboratory data imputation at scale, it opens new possibilities for downstream applications that rely on EHRs. For example, by imputing missing values at multiple points in time throughout a patients’ history, LIFE could facilitate and improve tools for cohort selection, clinical trial matching, outcome measurement, risk stratification, clinical decision support, and causal inference.5 One of the most promising applications is clinical trial matching, where laboratory values are used in inclusion and exclusion criteria. Patients are often not matched to a trial because of missing data. LIFE promises to reduce this problem by increasing the amount of longitudinal data available for matching algorithms to identify patients.
To demonstrate the benefits in downstream tasks, we showed how using imputations from LIFE improved performance on detecting adverse events. This is a common application in personalized healthcare and it is crucial for enhancing clinical practice. By analyzing data from EHRs, healthcare providers can anticipate potential adverse events early. This enables timely interventions, personalized treatments, and better resource allocation, ultimately leading to improved patient outcomes.27–29 The inclusion of additional laboratory values enhances prediction accuracy and empowers clinicians to make informed decisions, resulting in a safer and more efficient healthcare environment.
The main component of LIFE is multi-head attention. Transformers and attention mechanisms have already been used with EHRs, commonly to train foundation models from structured EHRs and clinical notes to improve predictive modeling tasks.22,30,31 These models were all trained to predict masked discrete tokens and did not include laboratory values. To the best of our knowledge, LIFE is the first model for EHRs based on transformers, which (1) jointly consumes both discrete concepts and continuous values from laboratory data and (2) predicts continuous values.
Due to the attention architecture, LIFE imputations are readily interpretable. Experiments showed that the model primarily leveraged three categories of clinical features: (1) past and future values of the specified laboratory test; (2) closely related laboratory tests; and (3) other related clinical observations. Not only do these features align medically, but they also showcase the wide range of information that LIFE can employ to make predictions, confirming the initial hypothesis that incorporating supplementary EHR concepts could enhance contextualization and refinement of the imputed values.
While there is a large literature of existing techniques for dealing with missing laboratory values, they all have significant limitations when applied to RWD. For example, there are thousands of different laboratory tests that may need to be imputed in a large multi-institute dataset of EHRs, but existing machine learning methods have only been used to impute 10 to 50 laboratory tests at once. Moreover, existing methods do not utilize all of a patient’s EHR. Our results demonstrate that using the entire EHRs translates to better overall performance on both imputation and downstream detection, while also affording scalability that previous methods lack.
Existing imputation models also miss an opportunity to tackle different but closely related problems in EHRs, such as standardization and outliers. Healthcare data firms frequently combine data from multiple institutions and EHR systems. Standardizing laboratory codes, units of measure, and values across these different data sources is a herculean and error-prone process that is often passed on to the data consumer. Outlier values are particularly difficult to address at scale because it is difficult to distinguish between true physiologic outliers that are secondary to disease and outliers from false data entry. For example, a white blood cell count of 300 thousand/microliter could be a result of false data entry, but it could also be a real value from a patient with leukemia. While a substantial body of research has sought to address these problems separately, solutions largely require considerable time and domain expertise. LIFE potentially improves both issues. By specifying proper queries and units of measure, we can ensure that all imputed laboratory tests follow the same standards, and we can convert existing data not following standards as well. By imputing values when there is an outlier, we can verify if that outlier is physiological (in that case it would likely be replicated by LIFE) or a false data point (LIFE would predict a completely different value).
This work comes with some limitations. Some laboratory data may be missing not at random, which means that the probability that a value is missing is related to the value itself.32 As an example, sometimes laboratory tests are ordered in panels, and all the tests included in the panel are performed. Consequently, when we want to impute values for a test in a panel, chances are that the whole panel would be missing. However, when validating, we mask individual laboratory values rather than entire panels. This gives the imputation models more information than they might otherwise have in real-world scenarios. This limitation is a widespread concern when evaluating laboratory data imputation and, notably, it has yet to be addressed in the existing literature.7–9 However, the performance of LIFE, evaluated both absolutely and in comparison to baselines, in laboratory tests such as HbA1C and PSA, typically not included in panel orders, indicates its effectiveness even for standalone tests.
A challenge for all methods for EHR imputation is that there are no standard benchmark real-world datasets to validate against. The MIMIC datasets are not suitable for our use case because they are too homogenous and do not exhibit the pathologies common to most RWD. LIFE was designed to handle and organize larger datasets aggregated from multiple institutions. For this reason, we validated the model using the Tempus data warehouse. We recognize that the subset of the database selected here, while serving as a notable illustration of RWD, consists solely of oncology patients. However, given the diverse spectrum of oncology patients included, encompassing cured, in remission, terminal, and early-stage cases, alongside the dataset’s size and the comprehensive nature of available EHRs, including all comorbidities, we believe that the described performance is sufficiently robust and generalizable. It is noteworthy that LIFE performs well across cancer indications and stages (Supplementary Table 2), adding a promising indication that it would generalize to non-oncology patient populations.
It is also crucial to underscore that imputation, while a valuable tool for handling missing data, is inherently a limited technique and should not be viewed as a substitute for actual, measured laboratory values. Utilizing imputed values in clinical practice would necessitate rigorous validation for each specific laboratory test within the context of the clinical setting.
Future works will attempt to address the aforementioned limitations, including validating LIFE on a general EHR dataset. Another direction is to provide error estimates along with the predicted laboratory values, for example using ideas from conformal prediction33. This would allow users to assess how confident the model is in its predictions, in addition to interpretability analysis, and how much uncertainty there is in the data. Another possibility is to extend LIFE to handle discrete concepts in addition to continuous values. This would enable the user to impute discrete laboratory results (e.g., EGFR positive or negative) and discrete observations as well (e.g., presence of a disease, medications, biomarkers). Rather than multiple systems working with different patient representations derived for different tasks, joint modeling would be capable of handling all features from EHRs at the same time. Building upon LIFE, such future work could contribute to the next generation of clinical systems that use a single-foundation model from EHRs to include millions of patients and features to effectively support the healthcare industry and clinicians.
Methods
This section presents an overview of the LIFE architecture and a description of the evaluation design (see Figure 1).
Dataset
We used de-identified EHRs from the Tempus data warehouse for both model training and testing. This database contains de-identified, structured EHRs of oncology patients gathered from over 350 direct data connections across approximately 2,000 healthcare institutions in the United States that order Tempus products and services. We included in the study only patients with at least one continuous laboratory value in their clinical history. This led to a dataset of 1,109,530 patients, spanning the years 2000 to 2023, comprising 658,780 females, 449,212 males, and 1,538 not declared, with a mean age of 69.1 years (standard deviation = 15.8) as of 2023.
For each patient, we aggregated general demographic details (i.e., age, sex, and race) and clinical descriptors. We included ICD-10 diagnosis codes, medications normalized to RxNorm, Current Procedural Terminology version 4 procedure codes, and vital signs and laboratory data normalized to LOINC. The vocabulary was composed of 29,454 medical concepts, leading to a dataset with an average of 727.4 records per patient.
The dataset contained 707 million laboratory values, and 4,788 distinct laboratory tests. In order to obtain meaningful predictions and not bias training because of data scarcity, we considered only the tests with at least 10,000 occurrences. This led to 344 distinct laboratory tests that were used as imputation targets in the evaluation.
The data was split into an 85% train set, 5% tuning set and 10% test set. The test set was used to generate all performance metrics reported in “Results”, whereas the tuning set was used for early stopping and hyperparameter tuning.
In addition to EHRs, the Tempus data warehouse also includes a number of variables such as cancer stage, biomarker status, and adverse events, which were manually curated by a team of clinical abstractors. Specifically, the dataset included about 273,302 patients with curated variables. While these variables were not utilized during the training of LIFE and baselines, nor in the assessment of the imputed values, they served as gold standard labels for the downstream task evaluation experiment and subtyping analyses.
Data Processing
LIFE takes as data inputs patient EHRs and a query. EHRs include all the data used as features by the model (Figure 2A). The query specifies the details of the laboratory test to predict, including unit of measure and point in time.
Every patient record in the data warehouse is aggregated as a longitudinal sequence of medical observations from EHRs, which represent different measurements or events recorded in the patient’s medical history. Each observation consists of four fields: a code, a value, a unit of measure, and a date. The observation code is a high-level description of the observation, such as the ICD-10 C34 “Lung Cancer” code or the LOINC 2857-1 “PSA” code. The observation value is the result associated with an observation. This can be a categorical value (e.g., a positive or negative result for EGFR status) or a numerical value (e.g., a glucose level). The observation unit of measure is the unit of the continuous value. Lastly, the observation date is the point in time when the observation was recorded. Some of these fields might be unavailable depending on the observation type. For example, an ICD-10 diagnosis code observation is only composed of code and date fields, without having any associated value.
We first embed codes, units of measure, and categorical values into an embedding space which is learned during training. Observation dates are converted to the number of days from the first observation in the patient’s history. Continuous values are mapped to a uniform distribution on [0, 1] using the quantile transform,34 and then zero-centered by subtracting 0.5. For instance, a height value of 6’2” was first mapped to 0.97 as this height corresponds to the 97th percentile in the population, and then to 0.47 in a zero-center scale. This is done to standardize all values on the same scale, minimize the impact of outliers, and facilitate the comparison of model performance across laboratory results.
The embeddings and the continuous value are then concatenated to form a single vector for each observation. We lastly combine all the observations together into a N x E matrix, where N is the number of observations and E is the embedding size of each observation.
The query has the same structure as an observation from the EHRs: a code, a value, a unit of measure, and a date. In this case, since the value is the prediction target, it is always missing or masked. The same preprocessing of EHR observations is applied to also map the query into the same embedding space and to contextualize it in the longitudinal patient histories.
LIFE Architecture
LIFE is a method based on self-supervised learning to process EHRs and impute missing laboratory values into a patient’s clinical history. The architecture is composed of three main blocks: a time decay module, a multi-head attention layer, and the Hadamard module (Figure 2 B-D).
The time decay module assigns different weights to each observation based on how close its date is to the query date. This architecture has been shown to outperform RNNs or CNNs when used to capture temporal information in structured EHRs.35 For each observation, a weight is determined as: where w is the resulting weight, Dq is the date in the query expressed in days, Do is the date of the observation in days, and t1/2 is the half-life coefficient, which is the time required to decrease the relevance of the observation to one-half. We used different values for the half-life parameter to model different temporal windows. To this aim, we fed the inputs into a number of identical models but with different half-lives and concatenated their results together in the final module.
The multi-head attention layer computes similarity scores between each observation and the query, and uses them to generate patient embeddings that capture the most important features for prediction.21 For example, if the query is “Body Weight”, the model might pay more attention to observations related to body mass index than heart rate. Considering the context of the standard transformer architecture, keys correspond to the embedded EHR observations, while values are derived by multiplying each EHR observation with its respective time decay weight. This approach enables the architecture to amalgamate temporal and medical relevance within a single attention mechanism. The output of this layer is a batch-normalized one-dimensional embedding vector that represents the patient’s state with respect to the query.
The Hadamard module combines the output of the attention module with the query and produces a single scalar value as the prediction. This model performs a component-wise multiplication between the patient representation vector and the query vector to create a new embedding that combines the features of both inputs. It then concatenates the output of the different time decay modules and feeds the resulting vector into layers of dense neural networks. The last layer outputs a scalar value between 0 and 1, which is the quantile prediction of the query.
LIFE is trained in a self-supervised fashion by predicting a masked deleted laboratory value in a patient’s EHR (Figure 2A). This mechanism is inspired by the Bidirectional Encoder Representations from Transformers architecture,36 in which a transformer model is trained to interpret text by predicting a deleted word. The main difference is that LIFE does not predict a missing laboratory code but only its value since only the value is masked. Prediction is then compared to the original laboratory value via mean squared error (MSE) loss and the weights are updated via backpropagation until convergence.
Implementation Details
All model hyperparameters were empirically tuned using the tuning set to minimize the architecture loss, while balancing training efficiency and computation time. We tested a number of configurations (e.g., data embedding space of size {128, 256, 512, 1024}; learning rate equal to {0.1, 0.01, 0.001, 0.0001}; attention embedding space of size {256, 512, 1024, 2048}; number of decay modules D = {1, 3, 5} and half-lives {7, 30, 60, 180, 365}). For brevity, we report only the final setting used in the results described in the evaluation. All modules were implemented in Python 3.9.16, using scikit-learn 1.2.1, pytorch 1.12.0, and pytorch-lightning 1.9.0 as machine learning libraries.
We used a 256-dimensional space to embed codes, units of measure, and categorical results. The multi-head attention layer was composed of 1024-dimensional embeddings and 8 heads. Other hyperparameters were the default set by pytorch. We used three time-decay modules with half-lives of 7, 30, and 365 days. These were chosen to simulate different clinical scenarios of inpatient visits, frequent outpatient visits, and regular outpatient visits, respectively. The Hadamard module was composed of two multilayer perceptron (MLP) networks. The first layer had 3,072 hidden units and an ReLU activation function; the second and final MLP had 1,024 hidden units and a Sigmoid activation function to output a continuous value between 0 and 1.
The model was trained using Adam optimizer, with a learning rate of 0.001 and a batch size of 128 per GPU with Distributed Data Parallelism37 (DDP) on 8 GPUs (effective batch size of 1024). To accommodate for the large batch size, we used accumulated gradients, which update the model parameters after accumulating gradients over several mini batches.38 All models were trained for 30 epochs with early stopping if the MSE of the tuning set did not improve on two successive epochs.
Baselines
We compared LIFE with two rule-based methods (Interpolation and Nearest Lab) and three statistical or machine learning methods (MICE, MD-MTS, and BRITS). Unless otherwise specified, we implemented each method as reported in the literature or using the default configuration provided in the corresponding scikit-learn and pytorch packages.
Interpolation imputes the average value of the same laboratory test before and after the prediction date. If one of these values was not available, we simply imputed the other value. If both were missing, we imputed the distribution average for that laboratory test in the training set.
Nearest Lab imputes the laboratory value which is close in time to the prediction date. Similarly as above, if no values were available in the patient history, we imputed the distribution average in the training set.
MICE is a statistical method that runs multiple regression models with each missing value being modeled conditional upon other values in the data.6 MICE only uses cross-sectional features at query date for imputation. If a laboratory value was not present on the query date, it was imputed using Nearest Lab. Since there is no exact implementation of MICE in Python, we used the closely related IterativeImputer method from scikit-learn.
MD-MTS is a LightGBM-based model that performed best on a recent laboratory value imputation competition using the MIMIC III dataset.7,12 For each laboratory test, MD-MTS builds a separate LightGBM model using a number of hand-crafted features, specifically: (1) cross-sectional laboratory data available on the same visit; (2) same laboratory values from the past and future three visits around the query date; (3) temporal information such as the number of days since the start of the patient’s history in the EHRs; and (4) summary statistics such as the minimum, mean, and maximum of the laboratory test of interest. LightGBM is a gradient boosting framework that uses tree-based learning algorithms for classification, ranking and other general machine learning tasks.
BRITS is a deep learning architecture, which directly learns missing values in a bi-directional recurrent dynamical system.9 The imputed values are treated as variables of a RNN graph and, similarly to LIFE, are updated during the backpropagation loop. We trained BRITS in parallel using DDP on 8 GPUs with a batch size of 32 per GPU and Adam optimizer with the default learning rate. Acknowledging that LIFE benefited from larger batch sizes owing to the large amount of noise in real-world EHRs, we attempted to do this with BRITS as well, but this did not improve results.
Experimental Design
We conducted a number of experiments to assess LIFE’s effectiveness in imputing clinically meaningful laboratory data.
First, we trained the model on all 344 laboratory tests included in the dataset. For each patient in the test set (n=110,953), we selected from their clinical history one random laboratory value from these 344 options and deleted it. We then used LIFE to predict the missing values. We finally grouped patients with the same deleted laboratory test together and measured average MAE per laboratory test.
We then compared LIFE’s performance to all baselines on a subset of laboratory tests. Due to the baselines’ less scalable design, we focused on 25 common laboratory tests, which is approximately the number of tests baselines were evaluated in the literature. For each laboratory test, we randomly selected 10,000 patients who had at least one observation of that test in their EHRs. For each test patient, we deleted one observation of the laboratory test of interest and used it as the prediction target. We then compared each models’ ability to predict this value given the rest of the patient’s EHRs and the query for the laboratory test. We chose a sample size of 10,000 patients per laboratory test based on a power calculation to ensure the statistical significance of our results using two criteria: (1) statistically significant was defined as p < 0.01 calculated via a two-sample t-test and (2) MAE effect size would be 0.01. For some laboratory tests, there were not enough patients available in the test set. In this case, we used the maximum available number of patients.
We also conducted a series of supplementary analyses to assess the performance of LIFE when applied to specific clinical scenarios. Specifically: (1) we evaluated how effectively the model performs at different points in the patient’s timeline, shedding light on its potential to predict future laboratory values; (2) we analyzed how the magnitude of laboratory values impacted LIFE’s performance; (3) we evaluated whether LIFE could categorize predicted values as “abnormal” by measuring the distance between predicted continuous values and laboratory test median and (4) we delved into the robustness of the model across various levels of cancer severity, to verify performances in early-stage and benign diseases, which would indicate that LIFE may generalize well when applied to non-oncology patients. These analyses are described further in the appendix.
We then measured the performances of imputed laboratory values on the downstream task of detecting adverse events from laboratory data. To do this, we selected nine adverse events that are frequent in oncology patients using curated variables available in the Tempus data warehouse. For each adverse event, we identified a cohort of patients from the test set with that adverse event and the same number of random patients without it. Afterward, we performed sub-sampling on all cohorts to match the size of the smallest cohort, resulting in each cohort consisting of 6,932 patients. We then used LIFE and all baselines to impute values for the same 25 laboratory tests at the time of the corresponding adverse event. For each adverse event, we then used a 80/20 split to train and evaluate a logistic regression classifier which takes as input the feature vector composed of the 25 imputed laboratory values and output if the patients had that adverse event or not. We evaluated averaged performances of all models and across all patients in terms of AUC-PR.
Lastly, we generated plots to visualize attention weights produced by LIFE’s multi-head attention layer, in order to interpret the imputation predictions. We randomly selected six patients with different laboratory tests evaluated during our first experiment. Each patient’s timeline was represented as a band with the laboratory of interest labeled at the top. The observations within these timelines were color-coded based on the intensity of their attention weights (average across all heads), allowing for an easy assessment of the model’s focus. To further enhance the interpretability of the plot, we specifically labeled only the 20 observations with the highest attention weights along the x-axis. The less relevant observations were collapsed in the interest of conserving space and improving the overall readability of the plot. To illustrate that LIFE’s interpretability is scalable we also conducted a supplementary analysis in which we calculated the average attention paid to different features across the entire patient population for an additional subset of tests.
Author Contributions
S.P.H. and R.M. conceived and designed the work. S.P.H. conducted the research and the experimental evaluation, provided clinical validation of the results and drafted the manuscript. R.M. created the dataset, supervised and supported the research, and substantially edited the manuscript. C.C. advised on methodological choices, contributed to the experimental evaluation and critically revised the manuscript. D.M.V. critically revised the manuscript. E.T.M. and M.C.S. supported the research and revised the manuscript. All the authors gave final approval of the completed manuscript version and are accountable for all aspects of the work.
Competing Interests
All authors are employees and shareholders of Tempus Labs, Inc.
Data Availability
Deidentified data used in the research was collected in a real-world health care setting and is subject to controlled access for privacy and proprietary reasons. When possible, derived data supporting the findings of this study have been made available within the paper and its figures and tables.
Code Availability
Requests for source code are subject to review by Tempus and can be directed to the lead author.