Abstract
The unexpected effects of medications has led to more than 14 million drug adverse events reported to the Food and Drug Administration (FDA) over the past 10 years in the United States alone, with a little over 1.3 million of them linked to death, and represents a medical and financial burden on our healthcare. Laboratory tests have the potential to capture inter-individual variability in drug responses, but a significant portion of the patient population has unique treatment pathways that impedes forecasting and optimal decision making.
Generative Adversarial Networks (GANs) are flexible implicit generative models that have demonstrated their ability to capture complex correlations in field like computer vision and natural language. Their latent representation capacity is an opportunity for drug effect simulation on laboratory test trajectories. In this paper, we developed and evaluated conditional GANs on glucose laboratory tests in patients exposed to drug combinations and showed a proof of concept for these models in the simulation of unseen drug combinations. By using conditional Wasserstein GANs (WGANs) to simulate drug effects in laboratory tests, we hope to pave the way for novel clinical decision support (CDM) systems and enable the development of better predictive models for rare cohorts of patients.
1 Introduction
Drug effects can be unpredictable.
Each novel therapeutics submitted for approval to the Food and Drug Administration (FDA) needs to be safe and effective for its intended use. However, FDA approval does not guarantee safety and effectiveness for all patients. In fact, the response rates of patients to most major drugs fall in the 50 to 75% range.[57] This is due to the variability in treatment response among patients, known as inter-patient variability, caused by factors such as the environment, genetics, polypharmacy or comorbidities [58, 17, 68]. The consequences for drug safety are more concerning: out of 222 novel therapeutics approved by the FDA between 2001 and 2010, 71 (32%) were flagged with post-market safety events, including 61 incremental boxed warnings for 43 of these therapeutics [18]. Between 2008 and 2017, the FDA approved 321 novel drugs. Over the same period of time, the FDA Adverse Event Reporting System (FAERS) recorded more than 10 million AE reports, among which 5.8 million were serious adverse drug reactions (SADRs), and 1.1 million were AEs related to death. AEs burden our health system causing 2 million hospital stays each year and lengthening visits by 1.7 to 4.6 days[2]. The economic, social, and health burden of these events make pharmacovigilance an essential and pressing public health concern.
The solution is to pick the right treatment for the right patient using all the resources available. While clinical trials try to ascertain that a drug is safe and effective for its intended use before its marketing, pharmacovigilance centers monitor AE reports and aim at ensuring that a drug’s safety information is up to date. However, neither of these processes are error proof. On one hand, clinical trials have focused on designing drugs for the average patient[60] even at a time when there are increasing calls for precision medicine to enable the ”right drug at the right dose to the right patient”[14]. On the other hand, spontaneous reporting systems are known to suffer from biases such as under-reporting which is especially troublesome for rare events and drug-drug interactions (DDIs)[39]. Therefore it is only by using post-marketing observational data that we can uncover off-label uses, treatment patterns, patient specific variability in responses, and rare ADRs [28, 51, 52, 53, 54, 55].
Unfortunately, a large number of patients have unique treatment trajectories that makes their outcomes hard to predict. In a large-scale EHR study, Hirpcsak et al. [29] showed that many patients have unique treatment pathways. They analyzed 11 EHR data sources that had adopted the Observational Health Data Sciences and Informatics (OHDSI) common data model (CDM), in four different countries and including 250 million patient records. By enumerating 3-year treatment pathways, they found that 10% of patients with type 2 diabetes, 11% of patients with depression, and 24% of patients with hypertension could not compare their treatment pathway with anyone else in this quarter billion population of diverse individuals.
Therefore, there is a need to go beyond patient matching and classic supervised machine learning models that suffer from unbalanced training classes or the sample size of population with rare events. We need methods to interpolate sets of conditional information unseen – or very rare – in the dataset of interest, to support decision making and biomedical predictive models.
With the development of novel deep learning methods such as generative adversarial networks (GANs), there is an opportunity to learn how to augment existing clinical datasets with realistic synthetic data and increase predictive performances. Moreover, GANs have the potential to simulate effects of individual covariates such as drug exposures by leveraging the properties of implicit generative models. Instead of modeling every covariate and confounding variable, conditional GANs can learn how to match auxiliary clinical information to conditionally learned distributions from which they can be stochastically sampled. Although the amount of information these models can retain naturally cannot exceed the information available in the input data, conditional models appear to disentangle information from other sample classes, and other combinations of auxiliary information to infer conditional distributions unseen at training time. In this paper, our contributions are two-fold:
we developed and evaluated deep implicit generative models to learn distributions of laboratory test time series using two versions of the WGAN: the WGAN with gradient penalty (WGAN-GP) [26], and the WGAN with Lipschitz penalty (WGAN-LP) [44];
we studied the use of conditional generative adversarial networks (GANs) to model laboratory test time series and demonstrate how these models can be used for the simulation of drug effects.
We provided two applications as proof of concepts of these proposed models. The first is to illustrate the latent representation power of these conditional GAN by showing that we can infer laboratory test time series associated with drug exposure combinations unseen at training. The second is its direction consequence: the data augmentation of rare events to improve their predictability. We conducted the targeted augmentation of the 10 rarest drug combinations occurring during glucose lab test trajectories and we were able to improve forecasting on seven of them by adding synthetic samples to real samples during training of predictive models. These are evidence that these methods should be explored further.
2 Methods
2.1 Data selection
All the data come from Columbia University Irving Medical Center/New York Presbyterian Hospital (CUIMC/NYPH) transformed for the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) v5, including inpatient and outpatient records. CUIMC/NYPH is an academic medical center with over 1000 inpatient beds serving both adult and pediatric populations. The laboratories receive on average over 10,000 samples a day. Over 550 different assays are performed on-site in several laboratories, including Core (Hematology and Chemistry), Microbiology, Molecular Diagnosis, Immunogenetics, Cytogenetics, and several Specialty Laboratories and Satellite Laboratories. Over 15 million assays are performed annually in-house. Due to the complexity of the cases treated at our hospital, over 200,000 assays and panels are sent to over 60 different outside reference laboratories every year.[33]
As of 2019, we worked with a structured database of 6.38 million patients that counts:
78.95 million drug orders, representing 40.76 million single ingredient exposures, for 1.41 million patients,
140.30 million diagnosis codes for 5.40 million patients,
64.38 million procedure codes for 3.58 million patients,
810.68 million measurements (i.e., laboratory tests and vitals) for 2.29 million patients.
We considered inpatient visits as a unit of analysis, with about 38.49 million visit occurrences recorded in our research database. We mapped each visit to the measurements performed throughout its duration. Each laboratory test time series (LTTS) was therefore a time series of value from a given laboratory test or vital, for a given patient, during a unique visit. Limiting these time series to a visit versus considering time series for the whole patient medical history enabled us to have more reasonable time intervals and account for clinical events that are more relevant to the time series at hand. The general statistics of unique time series length available during these visits are displayed in table 1. That table has 3 main take aways: (1) vitals (i.e., respiratory rate, heart rate, blood pressure…) are the most abundant measurements available but do not represent measurements with the most time series. (2) the most abundant time series are for laboratory tests that belong to the routine blood panels; (3) extreme values show that we are dealing with data that can present errors when entered in the EHR or exported for research purpose, and quality control is required.
Summary statistics of the top-25 most recorded measurements at CUIMC/NYP; names have been edited for length; min-max ranges display the presence of outliers and errors in the source database.
We will now focus on the data specific to each modality: regularly sampled laboratory tests for the regular WGAN as a baseline, and irregularly sampled laboratory tests with drug exposures for the conditional WGAN.
2.1.1 Regularly sampled time series
We restricted the sample set of laboratory test time series (LTTS) to the ones that have a regular sampling for the experiments with a non-conditional GAN. When multiple measurements were available the same day, we averaged them. We explored the amount of data available in function of the wanted length of time series (i.e., 5, 10 or 15 values) and the regular or irregular nature of sampling (Figure 1). We selected a time windows of 10 days which provided a good balance of number of features available for modeling and amount of samples. We observed that the longer the time series, the smaller the sample size, in particular for the most abundance measurements. When a given visit has multiple time series candidates available, we took the earliest for consistency.
Laboratory test time series above 10,000 samples, for 3 different lengths, and 2 types of sampling.
The regularly sampled laboratory test time series restricted to a 10-day length are characterized in table 2. We computed the average dispersion (i.e., the ratio of the standard deviation by the mean of each time series) for all laboratory tests to get a sense of the relative variance of each measurement type. We selected glucose lab (LOINC 2345-7) as the main measurement to be modeled throughout this paper, for its high dispersion, satisfactory number of samples available, and link to various physiological processes.
Summary statistics for top-20 laboratory tests when filtering for contiguous time series of length days=10. The laboratory test selected for modeling is highlighted in bold.
Before training models on these time series, we proceeded to a quality control step for all laboratory test to remove values outside of the 1-99 percentile range and remove extreme outliers or spurious values that could arise at the various steps of data collection and mapping. A time series with such values would be removed from the dataset.
We also computed the distribution of measurements for glucose lab in figure 2, a boxenplot to visualize the mean and percentiles in figure 3, along with a density heatmap of time series represented by their standard deviation and mean in figure 4.
Distribution of glucose lab measurements post-quality control for regular time series.
Boxenplot of glucose lab measurements post-quality control for regular time series to visualize percentiles and potential outliers.
Density heatmap of regular glucose lab measurements time series represented by their standard deviations and mean.
2.1.2 Irregularly sampled time series
For the conditional model, we lifted the constrain on regularity of the time series, to have a more realistic model. We used a forecasting task described in the Supplementary Material to evaluate if time information was relevant. Since time intervals were hurting the prediction of irregular time series, compared to models on regular time series with no time information, we focused on drug exposures as the sole auxiliary data type.
We selected time series of 10 values that can be spaced by more than one day, producing a set of irregularly sampled time series but with an constant number of measurements. In table 3 we can observe that relaxing the time interval constraints yields higher sample sizes.
Summary statistics for top-20 irregularly sampled laboratory tests when filtering for total length=10. The top-5 laboratory tests with the highest dispersion index are highlighted.
After quality control, we summarized the among of time series left to train the models, along with their associated demographics in table 4. Figures 5,6 and 7 represent the distribution of measurements, boxenplot, and density heatmaps of the glucose lab irregular time series.
Post quality control filtering demographics for regularly and irregularly sampled blood glucose time series (days=10)
Distribution of glucose lab measurements post-quality control for irregular time series.
Boxenplot of glucose lab measurements post-quality control for irregular time series to visualize percentiles and potential outliers.
Density heatmap of regular glucose lab measurements time series represented by their standard deviations and mean.
2.1.3 Drug exposure information
The drug exposure information were used in the conditional WGAN as auxiliary information, to be able to later simulate time series based on real or hand-picked drug exposure conditional vectors. For each time series, we collected the drugs at the ingredient level (i.e., referenced by RxNorm) that had a drug era overlap with the measurements.
Drug eras were defined at the ingredient level using the definition of the OMOP CDM: they are extrapolated from drug exposures with a persistence window of 30 days, meaning that prescriptions with a gap lesser or equal to 30 days belong to the same drug era (Figure 9).
Time intervals heatmap of regular glucose lab measurements time series represented by their standard deviations and mean.
Construction of drug eras in the OMOP CDM
The Anatomical Therapeutic Chemical (ATC) classification system [1] is a hierarchical terminology controlled by the World Health Organization Collaborating Centre for Drug Statistics Methodology (WHOCC) to group drugs (Table 5). We mapped these RxNorm codes to their ATC counterparts. This RxNorm to ATC mapping is one-to-many, but it presents the advantage of enabling hierarchical grouping.
First level of the Anatomical Therapeutic Chemical (ATC) classification system
ATC categories represent different granularity levels: therapeutic subgroups (second level), therapeutic/pharmacological subgroup (third level), and chemical/therapeutic/pharmacological subgroup (fourth level), a granularity that resembles the most drug classes. We focused on the third (ATC-3), fourth (ATC-4) and fifth level (ATC-5, ingredient level similar to RxNorm).
For each of the four representations, we performed a two-sample Kolmogorov-Smirnov test to assess how different the distributions of the means of the laboratory test time series (LTTS) were between the exposed and non-exposed groups. We ranked them by p-value, adjusted for multiple hypothesis testing since we re-used samples between the different tests, and KS statistics. The drug exposure vector was then built using the 5 or 10 most significant drugs for each of the four representations: ATC-3 (Table 6), ATC-4 (Table 7), ATC-5 (Table 8) and RxNorm (Table 9).
ATC level 3 abstraction of drug exposures with the most significantly different glucose lab (2345-7) time series (TS) ranked by KS statistics.
ATC level 4 abstraction of drug exposures with the most significantly different glucose lab (2345-7) time series (TS) ranked by KS statistics.
ATC level 5 abstraction of drug exposures with the most significantly different glucose lab (2345-7) time series (TS) ranked by KS statistics.
RxNorm mapping of drug exposures with the most significantly di.erent glucose lab (2345-7) time series (TS) ranked by KS statistics.
In addition, we characterized the drug exposure relationship with the laboratory test time series by representing the density heatmaps of time series exposed to the top-10 drugs according to the KS test, for each drug representation (Figures 10, 11, 12 and 13).
Density heatmap of irregular glucose lab time series exposed to the top-10 ATC-3 concepts ranked by KS statistics.
Density heatmap of irregular glucose lab time series exposed to the top-10 ATC-4 concepts ranked by KS statistics.
Density heatmap of irregular glucose lab time series exposed to the top-10 ATC-5 concepts ranked by KS statistics.
Density heatmap of irregular glucose time series exposed to the top-10 RxNorm concepts ranked by KS statistics.
It is important to note that these heatmaps show that some drug concepts are never occurring by themselves and always in combination with another top-10 drug concept. Moreover, the center of gravity of these distributions is usually higher in mean and standard deviation than the time series exposed to none of the top-10 drug concepts.
2.2 Deep learning models
2.2.1 Deep learning experimental pipeline
Deep learning models require a more extensive hyper-parameter tuning. While their flexibility and capacity make them extremely powerful universal non-linear approximators, it comes at the cost of having to explore very large hyper-parameter spaces. We designed an experimental pipeline that balances breadth of hyper-parameters exploration, and depth of model exploration.
We split our dataset with 90/10 for training and testing. These two sub dataset were identical for the generative models and the forecasting models described in the Supplementary Material. For each deep learning model, we first ran a large amount of experiments with a limited amount of epochs (i.e., 50 epochs or less), determined model by model to be sufficient to observe a plateau in the training and validation loss. for these general tuning experiments, the loss was computed on the training set and validation set, and the testing set is held out until the very end of the process.
Following the general tuning, we selected the 10 models that have the lowest validation loss averaged over the second half of the epochs. For the generative models, we used the Fréchet distance to compare hyperparameter sets, estimate 10 times between randomly generated samples and the real training dataset. These models were then re-run for 100 epochs, 10 times each with different random seeds to compute more robust estimates of the performances of these models: this is the Jne tuning step. Based on these estimates, we selected the model (i.e., hyper parameters set) and the epoch that had the best validation loss averaged over the 10 separate runs. Generative models went through a double evaluation described in the Evaluation metrics section.
Concerning the forecasting models, during the testing step we ran the selected model for the number of epochs determined with Jne tuning on the whole training set and computed MSE and MAE on the test set with the final model. This step was repeated 10 times to account for the stochastic nature of neural network training and produce estimates of the MSE and MAE.
2.2.2 WGAN and conditional WGAN: model specifications
The Wasserstein GAN algorithm implemented followed the recommendations of the original paper on WGAN with gradient penalty [26], including the use of layer normalization in the critic instead of batch normalization. The difference between gradient and Lipschitz penalty [44] is simply that the latter takes the maximum of zero or the gradient penalty, ensuring in effect that the gradient penalty is always positive or null.
The conditional architecture was designed by transposing the conditional GAN paper [41] to the WGAN framework: the auxiliary information were concatenated with the random latent vector at the input of the generator, to generate the time series. It was then concatenated again to the produced synthetic time series, and to the real time series at the input of the critic, so that the critic also computes the estimated Wasserstein-1 distance taking into account the auxiliary information – particularly important for the generator’s training. Finally, we used RMSProp for the optimization of the WGANs, following the recommendations of Gulrajani et al, and Petzka et al. [26, 44]. The objective functions can be found in Supplementary Materials.
2.3 Evaluation metrics
Evaluating implicit models is a hard task. In computer vision for instance, where human inspection can be used as a sanity check, a dozen of different evaluation metrics have been proposed to compare real and synthetic data and quantify how close the stochastically generated samples are from the training data.[9] With EHR data, we do not have the luxury of visual inspection or highly engineered computer vision networks, and expert evaluation by physician is both time-consuming and costly, and arguably not precise enough to catch subtle differences in very large datasets. More importantly, while computer vision benefits from standard datasets such as MNIST [36], CIFAR-10[34], CelebA[69], or ImageNet [15], biomedical data sciences do not have standard medical datasets, mainly due to privacy and regulations, although MIMIC-III is increasingly regarded as such.[30]
Regardless, in spite of the absence of consensus regarding standard datasets, we must be able to design quantitative evaluation metrics for implicit generative models of EHR data in a single-institution setting to begin with.
2.3.1 Intrinsic evaluations
In computer vision, a wide variety of evaluation methods have been proposed to evaluate the data fidelity of synthetic images generated by GANs. Among them, an approach has been gaining momentum: inception distances. Inception distance metrics rely on the Inception Network, a heavily engineered convolutional neural network (CNN) designed to perform well on image labeling for ImageNet [59]. The intuition behind inception distances is that the weights of the penultimate layer of a deep neural network able to successfully classify images must pick up features that are high level enough to mimic the way the human visual cortex would work. Therefore, Heusel et al.[27] proposed the Féchet Inception Distance (FID) by using the distribution of weights of real and synthetic images flowed through a trained Inception network.
Let pw(.) be the probability of observing real data, and p(.) the probability of of generating model data.
The equality p(.) = pw(.) holds except for a non-measurable set if an only if:
for a basis f (.) spanning the function space in which p(.) and pw(.) live. this function f (.) is replaced by the penultimate layer of an inception network trained on ImageNet and the resulting distributions of weights are approximated by multidimensional Gaussian to get the first two moments: mean and covariance. Therefore, they used the Fréchet distance [21] or Wassertein-2 distance [62] to define the Fréchet Inception Distance (FID) d(.,.) between the Gaussian (m, C) obtained from p(.) and the Gaussian (mw, Cw) obtained from pw(.) given by:
This quantitative evaluation metric for implicit generative models has shown good performance, and it has been translated to biomedical sciences for de novo drug design by Preuer et al. [45] who designed an inception network called ChemblNet, a network trained to predict bioactivities of about 1,300 assays from ChEMBL[8]. We used this Fréchet distance to evaluate the distance between the real and synthetic data generated by the GAN models we have trained, and called it FID by analogy with the FID in computer vision, although no network played the role of an inception network providing abstracted representations.
2.3.2 Extrinsic evaluations
In addition of the intrinsic metrics described in the previous paragraph, we used a train on synthetic test on real (TSTR) approach, a method proposed by Esteban et al. [19], to compare the real and synthetic datasets in a supervised learning task and evaluate how well the synthetic data generated can retain the information needed for the forecasting task, and how well it generalizes.
Other task-based predictions have been described by Razavian et al.[46, 48] and Che et al.[10] using laboratory test time series for medical outcome prediction, and could be used as alternative extrinsic evaluations of the synthetic data, compared to the real data. However, these tasks are all classification tasks, while the forecasting task we describe in the Supplementary Material is a regression, providing a finer grained evaluation of the models and their impacts.
2.4 Applications of conditional WGANs: drug effect simulation in laboratory test time series
The two main directions of these experiments are: 1/ the simulation of lab test time series by manually selecting conditional drug exposure vectors, and investigating the interpolation power of these model to infer the behavior of time series with conditional information not seen during training; 2/ the data augmentation task, where real datasets are augmented with synthetic data to improve locally forecasting performances using the regressors described in the Supplementary Material.
2.4.1 Inference power evaluation
In order to evaluate how well the conditional WGANs trained in the previous Aim 2 can infer samples from conditional class unseen at training, we designed the following experiments: we removed from the training dataset glucose time series exposed to a unique drug, trained the GAN from scratch on this new dataset, and then generated samples with single drug exposures to compare them to real single-drug time series, and single-drug time series from the Aim 2 WGAN model exposed to them at training. This experiment relies on the interpolation power of the latent space in GANs. The two hypotheses tested in this evaluation are that conditional GAN can infer the behavior of samples from other classes seen at training (i.e., even if the GAN has never seen drug A alone, it has seen drug A in combination with other drug exposures), and that the latent space can continuously interpolate these inferred samples. It means that we could for a given drug of exposure generate all the states between its two binary states, 0 and 1, and get continuously closer to samples exposed to the drug.
In order to evaluate the simulation of these time series, we used the Frechet Inception Distance (FID) introduced in the previous section as a distance metric.
2.4.2 Data Augmentation by conditional generation
After analyzing the simulation properties of conditional WGANs, we investigated their usage for data augmentation. If these models can infer unseen classes, or unseen combinations of classes, they could inflate specific sub-groups of samples in training sets to improve the performances of the model at testing. In this second experiment, we first augmented time series with unique drug exposures and evaluated the performances of the MLP forecasting model from Aim 1 with these new training datasets.
Then, we targeted specific drug combinations that have low frequency in the training set, and on which forecasting models yield high error on the testing set. These are the most interesting candidates for data augmentation, as they could benefit from a high count in the training set to decrease their high testing errors.
3 Results
3.1 Regular LTTS generation with WGAN-GP
3.1.1 Hyperparameter tuning
The WGAN hyperparameter tuning was extensive, with more than 7,000 hyperparameter combinations evaluated (Table 10).
Hyper-parameters for WGAN general tuning on regular glucose lab time series.
We took the best performing hyperparameter combinations in terms of FID, and tested them, adding gradient penalty (GP) to the Lipschitz penalty (LP) (Table 11) used for the general tuning above.
Hyper-parameters for WGAN fine tuning on regular glucose lab time series.
3.1.2 Impact of sampling on FID
For each ratio of synthetic data, we generated 10 distinct datasets to get an estimate of the FID and how much it varies. The FID values and their sub-components (i.e, difference of the means and difference of the covariance matrices) did not change significantly depending on how much synthetic data was generated to compute the distance. However, the standard deviation tended to decrease as the amount of synthetic data increases, as expected (Figure 14).
Evolution of the FID for different amounts of sampled synthetic data, expressed in ratio of real data, for WGANs on regular glucose lab time series.
3.1.3 FID evolution during training
The FID dropped within the first 30 epochs or so, and then either stabilized, or showed signs of overfitting of the WGAN with an upward trend toward epoch 80 (Figure 15). For reference, the FID between the real training and the testing set was 15.22 (mean diff.: 4.85, cov. diff.: 10.37) while the FID of the top 10 GAN model spanned from 26.71 to 46.26.
Evolution of the FID during training of WGANs on regular glucose lab time series every 5 epochs.
3.1.4 TSTR with an MLP forecasting model
We evaluated the 10 best WGAN models according to their FID with the TSTR extrinsic evaluation (Table 12). We can see that the difference in the covariance matrices was more correlated to the TSTR metrics than the FID itself. The general trend is a relatively low error on forecasting synthetic samples from the same generator, a higher error on the training data that the WGAN was trained on, and the highest error on the testing set that the WGAN was never exposed to. While the MSE on the testing set was above 1,100, it is not extremely far from the error from the models trained on real regular time series (see Supplementary Material). The GAN model that performed the best at the TSTR task got an average MSE of 1120.84 (±8.553), and a lowest MSE at 1103.84. Interestingly enough, it was not the model with the lowest FID (39.88).
Train on Synthetic Test on Real (TSTR) evaluation of the top-10 WGAN models selected by FID.
3.1.5 Density evolution through training of the best regular WGAN
By sampling every 5 epochs a synthetic dataset of the same size as the training dataset, we can visualize the coverage of the synthetic data compared to the real training data it was fitted on (Figure 16). We also computed the weighted ratio of synthetic coverage. On every tile of the density heatmap, We computed the difference of real and synthetic data, and counted hits (correct coverage), wrong (coverage of a tile not represented in the real data), and miss (absence of coverage of a tile represented in the real data) (Figure 17). This barplot summarizes quantitatively the heatmaps by epoch. We can observe that coverage increased fast and then plateaued around 30 epochs, to increase again past 70 epochs. The wrong ”modes”, synthetic time series that have unrealistic summary statistic, decreased within the first 30 epochs, and coverage improved after 60 epochs. When observing the kernel density estimation plot, we see that the mean distribution is approximated first, and then the covariance is adjusted in order to cover the real distribution better during that second stage of the training. The best coverage reached was 88.22% and was reached at epoch 95.
Evolution of the synthetic density of time series for the best WGAN trained on regular glucose lab time series, every 5 epochs.
Weighted ratio of density categories covered by the best synthetic density every 5 epochs.
3.2 Irregular LTTS with auxiliary drug exposure generation with conditional WGAN-GP
We then proceeded to train the conditional WGAN on the irregular laboratory test time series using auxiliary drug exposures as conditional information. While generating realistic synthetic glucose time series is a good first step, the generation is only random due to the nature of GAN as implicit generative models. We do not have any control over the mode of the generated samples, the properties, or the sub-group to which these time series belong. Conditional GANs on the other hand allow for known auxiliary information to be used during the training, at the input of the generator and the critic. Once trained, the conditional GAN generator takes as an input a random vector concatenated to an auxiliary information vector. While the former is useful for diversity in the stochastic generation process, the latter enables the user to select the conditional information of the time series generated and therefore direct this generation.
3.2.1 Hyperparameter tuning
Given the major similarities between the WGAN and the conditional WGAN, we only rerun the 10 best hyperparameter sets from the previous study. The model selection followed the same process, at the difference that we had to run these experiments for each drug terminology and conditional vector length (i.e., eight modalities), similarily to the forecasting experiments described in the Supplementary Material: ATC-3, ATC-4, ATC-5 and RxNorm with 5 or 10 drug concepts. Because RxNorm with 5 drugs yielded the most realistic conditional WGAN, the figures illustrating the following analyses are for that dataset only. The figures associated to the other drug terminologies and conditional vector lengths can be found in the Appendix (Figures ??,??,??,??,??,??,??).
3.2.2 Impact of sampling on FID
The same phenomenon than in the previous section was observed: the FID was stable through sampling rate, and only the standard deviation of the FID decreased as the sample size of the generated data increased (Figure 18). The real training set and the testing set had an FID of 11.04 (mean diff.: 1.06, cov. diff.: 9.98) while the FID with the synthetic data ranged from 15.67 to 43.26 across the various drug representation tested.
Evolution of the FID for different amounts of sampled synthetic data, expressed in ratio of real data, for conditional WGANs on irregular glucose lab time series with auxiliary RxNorm 5 drug exposures.
3.2.3 FID evolution during training
Similarly to the WGAN analysis, we observe that the covariance difference is the main driver of the FID, and usually decreases almost linearly, while the mean error is much lower and therefore fluctuates more with less impact on the total distance (Figure 19).
Evolution of the FID during training of WGANs on irregular glucose lab time series with RxNorm-5 auxiliary drug information, every 5 epochs.
3.2.4 TSTR with a a conditional MLP forecasting model: evaluating different drug representations
Table 13 presents the best models in the TSTR task for each drug representation. The MSE on the testing set spanned from 1088.36 for ATC-5 with 5 drugs, down to 1056.32 for RxNorm with 5 drugs, effectively performing as well as a linear regression trained on the real dataset with RxNorm concepts for the top-5 drugs of exposure. ATC-5 with 5 drugs had the worst TSTR score with its best MSE at 1088.36. The FID of these conditional GANs spanned from 17.46 for RxNorm with 10 concepts, to 34.03 for ATC-3 with 10 concepts. For reference, the best non-conditional model in terms of FID scored at 26.71. The synthetic samples from conditional models were therefore overall more realistic, at the exception of ATC-3 with 10 concepts, and ATC-5 with 5 concepts. This might explain their high MSE in the TSTR task. While the best MSE with non-conditional models was 1103.84, all the TSTR MSEs of the conditional models were between 1056.32 and 1088.36.
Train on Synthetic Test on Real (TSTR) evaluation of the best WGAN models by drug representation.
3.2.5 Density evolution through training of the best conditional WGAN
The best conditional GAN with 5 RxNorm concepts as auxiliary information presented an analogous behavior during training as observed in the non-conditional GAN: The learned distribution first adjusted its mean, and then its variance and covariance (Figure 20). The analysis of the coverage, wrong modes, and missed modes (Figure 21) showed a peak coverage of 92.07% at epoch 100.
Evolution of the synthetic density of time series for the best conditional WGAN trained on irregular glucose lab time series with RxNorm-5 auxiliary drug information, every 5 epochs.
Weighted ratio of density categories covered by the best conditional WGAN trained on irregular glucose lab time series with RxNorm-5 auxiliary drug information, every 5 epochs.
3.3 Inferring glucose time series with drug exposures unseen at training
The original dataset contained 11,746 glucose time series exposed to a single drug at a time. These were removed from the training set that was left with 64,857 samples. the conditional WGAN was trained with the hyperparameters that yielded the best FIDs in Aim 2, and the best model out of 10 runs was selected for the simulation experiment (FID: 22.31). Let’s call this conditional WGAN the inference model.
For each of the 10 drugs, we generated 10,000 synthetic samples with a conditional vector incrementally increasing between 0.0 and 1.0 by steps of 0.1 for that drug of interested, and computed at every step the FID between these synthetic samples and the real samples only exposed to that drug. In order to get FID ranges of reference, we also computed the FID between synthetic samples and real single-drug samples using the conditional WGAN from Aim 2, that was exposed at training to samples with single drug of exposure (i.e., the exposed model). The FID between synthetic and real samples at different interpolation values in the latent spaces are displayed in figures 22 to 31.
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to insulin detemir
We observed two phenomenons: (1) sample sets generated along that interpolated conditional exposure vector presented a continuously decreasing FID with the real data, showing the interpolation power of the latent space in generative adversarial networks; (2) for all drugs, the model never exposed to single drug exposures got an FID close to the model that was trained with single-drug exposure samples. Insulin detemir exposed time series were inferred by the inference model with an FID comparable to the model that was exposed to it. There was one outlying drug: isopropanol, where our inference model samples were more realistic than the sample of the exposed GAN that was trained on these single exposure time series. Table 14 summarizes the FID between synthetic and real samples for the model exposed at training, and the model not exposed at training.
FID between real and synthetic data for exposure to a unique drug at a time: inference conditional WGAN never exposed to these single-exposure samples, and exposed conditional WGAN that was trained with samples exposed to these unique drug exposures.
For visualization purposes, we represented these single-exposure samples for the inference model, the expose model, and the real data density represented by variance and mean of the time series (Figures 32, 33).
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to sitagliptin
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to insulin glargine
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to insulin lispro
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to metformin
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to isopropanol
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to glipizide
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to glimepiride
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to glibenclamide
FID between synthetic and real glucose time series, for different values in the conditional vector between 0, non-exposed, and 1, exposed to insluin (human)
Density heatmap of irregular glucose lab time series exposed to the top-10 RxNorm, synthetic vs. real with equal sampling.
Density heatmap of irregular glucose lab time series exposed to the top-10 RxNorm, synthetic vs. real with fixed synthetic sampling at 10% of the training set size.
3.4 Targeted data augmentation with conditional WGANs
Now that we demonstrated that conditional WGAN can generate time series with conditional information that was not seen at training, we built some confidence about the ability of these models to augment real datasets. Indeed, the ideal target for data augmentation are sub-groups of samples that have low sample counts in the training set and high errors when the model is applied on the testing set. We showed that conditional WGAN can simulate glucose time series even when their belong to subgroups not seen at training, which should extend to time series with very few samples at training.
As a first experiment, we augmented the single drug exposure time series. Qyality control of the input time series seemed to be very important in the data augmentation part when adding synthetic data to real data. By applying the same criteria applied to the real data to the synthetic data, forecasting performances dramatically improved between no quality control 15, and identical quality control as in the forecasting experiments of Aim 1 with real data 16. Only drugs 1 (insulin detemir), 4 (insulin lispro) and 10 (insulin (human)) showed improvements when the training set was augmented with simulated time series exposed to these drugs only.
We grouped irregular glucose time series by drug exposures (RxNorm, 10 concepts), and represented these groups by their count in the training set and the MSE on the testing set based on the MLP forecasting model from Aim 1 (Figure 34).
Drug exposures represented by their count in gluocose time series in the training set, and their MSE in the testing set using the best performing MLP forecasting model.
In spite of the low number of samples for these combinations, augmenting these types of glucose time series in the training set led to improved MSE in 7 out of 10 cases (Table 17).
MSE by sub-group exposed to only one of the top-10 RxNorm concepts to evaluated the effects of drug-specific data augmentation of 10% of the total dataset size with a conditional WGAN. No quality control on the synthetic time series.
MSE by sub-group exposed to only one of the top-10 RxNorm concepts to evaluated the effects of drug-specific data augmentation of 10% of the total dataset size with a conditional WGAN. Same control on the synthetic time series as with the real data.
MSE before and after targeted augmentation of the 10 drug combinations with the highest MSE.
4 Discussion
In this study, we demonstrated that we can build generative adversarial networks able to generate synthetic laboratory test time series that look like real data. we compared two different models: a Wasserstein GAN with gradient penalty (WGAN-GP) to model regular glucose lab time series, and a conditional WGAN-GP that uses the drug exposure representation introduced in the previous aim to learn better models, but also to control the drug exposure of the synthetic generated time series.
We used two different quantitative methods to evaluate these models: an intrinsic metric called the Fréchet Inception Distance (FID) that consists of measuring the different of the mean and covariance between the synthetic data and the real data; and an extrinsic evaluation that uses the forecasting models developed in the previous aim to evaluate how well models trained on synthetic data only perform on never seen before testing sets.
We first studied the properties of the FID by comparing how the amount of generated data impact the metric when it’s computed with a fixed size training set the GAN was trained on. This experiment showed that the generated data are very stable in spite of the stochasticity of the implicit generative process. We then computed the FID and it’s two components (i.e., the mean and the covariance differences) across epochs to see how they evolve. It seems that most of the time the covariance difference is the main driver of the FID, and that the FID could be a good tool to visualize overfitting of a WGAN model where the FID starts going back up.
We also added visual inspection of the data using the density heatmaps we have been generating throughout the experiments, to help understand how the density learned by the WGAN evolves throughout epochs. It is important to note that all the GAN models developed in this aim had, in spite of their similar mean and standard deviation densities, longer tails. These models generated more extreme values, including some negative measurements (i.e., unrealistic). It exposes one of the main limitation of the Frechet distance that approximates Gaussians and is less sensitive to these longer tails than kernel-based metrics such as maximum mean discrepancy (MMD). Similarly to the forecasting model, these GANs could also benefit from clinically relevant intrinsic evaluation metrics, for instance comparing how real and synthetic data stand in terms of normal ranges, or involving physicians in a qualitative expert evaluation.
The TSTR evaluation demonstrated that the covariance of the synthetic data has an enormous weight on the generalizability of models trained on them. It is also important to note that the performances obtained with synthetic data from conditional models are on-par with the performances of models trained on real data, with an increased MSE that stays within performances obtained with classic machine learning regression models.
Finally, while ATC-3 was a good drug representation for regression, RxNorm appears to be the best conditional information to generate more realistic and generalizable synthetic glucose lab time series. More importantly conditional GANs outperformed the non-conditional model in spite of having to learn a more complex representation of the laboratory test time series.
Only one type of deep generative models, namely generative adversarial networks (GANs) was used. They are not the only types of deep generative models that could have been used. While GANs are implicit models, Variational Autoencoders (VAEs), a type of prescribed model, represent an alternative. Further investigation could be done using VAE, and hybrid models called VAE-GAN [40]. Moreover, supervised models using LSTM have been the standard for sequence generation[24], where text can be generated token by token by sampling from a distribution conditioned on the previous token and a hidden representation of all the tokens already generated. However there are known issues where the models have to sample from conditional distributions never seen at training and their solutions lead to models with no cost function specifically designed to encourage synthetic data fidelity.[35, 20] The GAN models of this aim present the advantage of being self-supervised with a cost function, the Wasserstein distance, designed to explicitly compute an estimate of the distance between real and synthetic samples.
Other auxiliary data could be used as conditional information, such as demographics (i.e., sex, age, race/ethnicity) as they are known to have a direct impact of the laboratory tests distribution and dynamics. Such as conditional WGAN generating laboratory test time series based on demographics, drug exposures, and other clinical covariates could be a module in a larger analysis where these covariates are handpicked by the user for specific generation, or generated themselves stochastically using discrete GANs. With an increasing number of clinical variables, and therefore an increased sparsity, methods to compress patient representations could be used to improve the learning with large auxiliary datasets.
After having evaluated our generative models, we then showed the potential applications for conditional WGAN that can generate irregular glucose time series based on a conditional vector of drug exposures. The two avenues investigated were the inference ability of these generative models, directly tied to arithmetic properties in the latent space of the generator network, and data augmentation in supervised tasks.
In the first experiment, one of the main limitation is the conditional WGAN used, along with the drugs of exposure. This model was selected based on its overall performances at generating synthetic samples close to the real data, and using the Frechet Inception distance. The FID is a good metric to get a sense of the distance between gaussians approximated on two datasets, but not fine grained enough for more thorough comparisons. This is where the maximum mean discrepancy (MMD) metric, a kernel based distance, could be useful. Moreover, there are only 10 drugs in the auxiliary information vector while this population was exposed to hundreds of them. Adding drugs to the auxiliary vector, along with other clinical covariates as discussed in the previous chapter would yield to a better conditional WGAN and better inference. However, most of the inferred time series sub-groups were very close to the data generated by the WGAN that was exposed to them at training.
In the data augmentation task, there was an obvious limitation with the sample size of the time series groups in the testing set. A lot of them had only one sample, resulting in a very noisy and sensitive MSE. Another difficulty comes from the fact that the task is a regression task. Every data augmentation study that has used GANs to improve supervised learning results was applied to classification task. We do think that classification tasks are more robust to data augmentation than regression, due to the more discrete process of selecting a decision threshold, versus a continuous non-linear relationship between inputs and output in regressions. It would also be beneficial to compare the GAN-powered data augmentation with more classic data augmentation methods as a baseline of comparison.
Another limitation of this study was that only one laboratory test, blood glucose, was investigated. However it opens the way for subsequent studies to systematize the methods presented here and refine them.
5 Conclusion
The model evaluation tier of this paper is the proof of concept that we can generate laboratory test time series from visits with generative adversarial networks. These synthetic time series are close to the real ones, and yield on-par results when used to train forecasting models then tested on real data never seen by the GAN. The conditional WGAN demonstrated two properties: conditional GANs generate higher quality samples than non-conditional GANs, and they enable the targeted generation of synthetic laboratory test time series exposed to drugs defined at the input of the generator.
With regards to the applications, we demonstrated useful properties of conditional WGAN to simulate drug exposures on laboratory test time series. We showed that these generative models can infer samples behavior based on latent space arithmetic when there is enough useful conditional information available. We also showed that these simulated data can be used in specific tasks where sub-groups of samples are under-powered, causing high errors in these sub-groups. While these studies are only proof of concepts, they show promising applications that could directly impact how researchers work with medical data, and provide the bases for next generation clinical decision making tools that would be able to simulate population specific laboratory test time series.
Data Availability
The study was conducted with electronic health records (EHR) data from Columbia University Irving Medical Center (CUIMC) and is not publicly available for HIPAA compliance.
Funding
A.Y. was supported by the NIGMS grant R01GM107145. N.P.T. was supported by the NIGMS grant R01GM107145 and the NCATS grant OT3TR002027.
Acknowledgments
References
- [1].↵
- [2].↵
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].↵
- [9].↵
- [10].↵
- [11].
- [12].
- [13].
- [14].↵
- [15].↵
- [16].
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].
- [23].
- [24].↵
- [25].
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].
- [32].
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].
- [38].
- [39].↵
- [40].↵
- [41].↵
- [42].
- [43].
- [44].↵
- [45].↵
- [46].↵
- [47].
- [48].↵
- [49].
- [50].
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].
- [62].↵
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].↵
- [69].↵
- [70].