Generative AI Mitigates Representation Bias Using Synthetic Health Data
=======================================================================

* Nicolo Micheletti
* Raffaele Marchesi
* Nicholas I-Hsien Kuo
* Sebastiano Barbieri
* Giuseppe Jurman
* Venet Osmani

## Abstract

Representation bias in health data can lead to unfair decisions, compromising the generalisability of research findings and impeding under-represented subpopulations from benefiting from clinical discoveries. Several approaches have been developed to mitigate representation bias, ranging from simple resampling methods, such as SMOTE, to recent approaches based on generative adversarial networks (GAN). However, generating high-dimensional time-series synthetic health data remains challenging for both resampling and GAN-based approaches. In this work, we propose a novel CA-GAN architecture able to synthesise authentic, high-dimensional time series data. CA-GAN outperforms state-of-the-art methods in qualitative and quantitative evaluation while avoiding mode collapse, a significant GAN failure. We evaluate CA-GAN’s generalisability in mitigating representation bias for Black patients in two diverse, clinically relevant datasets: acute hypotension and sepsis. Finally, we show that CA-GAN generates authentic data of the minority class while faithfully maintaining the original distribution of both datasets.

Keywords

* generative adversarial networks
* synthetic data
* resampling
* oversampling
* bias
* underrepresentation
* health data

## 1 Introduction

Clinical practice is poised to benefit from developments in machine learning as data-driven digital health technologies transform health care [1]. Digital health can catalyse the World Health Organisation’s (WHO) vision of promoting equitable, affordable, and universal access to health and care [2].
However, as machine learning methods increasingly weave themselves into the societal fabric, critical issues related to fairness and algorithmic bias in decision-making are coming to light. Algorithmic bias can originate from diverse sources, including socio-economic factors, where income disparities between ethno-racial groups are reflected in algorithms deciding which patients need care [3]. Bias can also originate from the underrepresentation of particular demographics (such as ethnicity, gender, and age) in the datasets used to develop machine learning models, known as health data poverty [4]. Health data poverty impedes underrepresented subpopulations from benefiting from clinical discoveries, compromising the generalisability of research findings and leading to representation bias that can compound health disparities. The machine learning community has developed several approaches to correct representation bias, with data resampling being the most widely used. Oversampling generates representative synthetic data from the underrepresented subpopulation (minority class), resulting in similar or equal representation. Synthetic Minority Over-sampling TEchnique (SMOTE) [5] is a representative example of this method, where synthetic samples lie between a randomly selected data sample and its randomly selected neighbour (using k-nearest neighbour algorithm [6]). SMOTE and related methods [7–9] are popular approaches due to their simplicity and computational efficiency. However, SMOTE, when used with high-dimensional time-series data, may decrease data variability and introduce correlation between samples [10–12]. In response, alternative approaches based on Generative Adversarial Networks (GAN) are gaining ground [13–17]. GANs have shown incredible results in generating realistic images [18], text [19], and speech [20] in addition to improving privacy [21]. 
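The SMOTE interpolation step described above can be illustrated with a minimal sketch. It is not the reference SMOTE implementation from [5]: the function name `smote_sample`, the toy points, and the choice of `k` are assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority samples by Euclidean distance to the base sample.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Interpolate: the synthetic point lies on the segment joining the two real points.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


# Toy minority class (hypothetical values): the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority, k=2, rng=random.Random(0))
print(synthetic)
```

Because every synthetic point lies on a segment between two existing minority samples, a sampler of this kind cannot produce points outside the region already spanned by the real data.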
However, while GANs address some of the issues of SMOTE-based approaches, the generation of high-dimensional time-series data remains a significant research challenge [22–24].

To address this challenge, we propose a new generative architecture called Conditional Augmentation GAN (CA-GAN). Our CA-GAN extends Wasserstein GAN with Gradient Penalty [25, 26], presented in the Health Gym study [27] (referred to in this paper as WGAN-GP*). However, our work has a different objective. Instead of generating new synthetic datasets, we condition our GAN to augment the minority class only, while maintaining correlations between the variables and over time, and capturing the distribution of the overall dataset, including the majority class.

We compare the performance of our CA-GAN with WGAN-GP* and SMOTE in augmenting data of patients of an underrepresented ethnicity (Black patients in our case), using two critical care datasets of acute hypotension (n=3343) and sepsis (n=4192), comprising 7535 patients in total. Our data includes both categorical and continuous variables with different distributions and is derived from the well-studied MIMIC-III critical care database [28]. Notably, our architecture is highly flexible and disease agnostic.

A summary of the contributions of our work is as follows: (1) We propose a new CA-GAN architecture to mitigate representation bias, addressing the shortcomings of the traditional and state-of-the-art approaches in high-dimensional time-series synthetic data generation. (2) Our multi-metric evaluation using qualitative and quantitative methods against a state-of-the-art architecture demonstrates superior performance of CA-GAN while avoiding a significant GAN failure, namely mode collapse. (3) We also evaluate our CA-GAN against SMOTE, a naive but effective and popular resampling method, demonstrating the superior performance of generative models in the generalisation and synthesis of authentic data.
(4) We show that CA-GAN can synthesise realistic data that can augment the real data when used in a downstream predictive task. (5) We show the generalisability of CA-GAN in mitigating representation bias for two different but clinically relevant conditions, namely acute hypotension and sepsis.

## 2 Results

To evaluate the performance of our architecture, we compare the synthetic data generated by our CA-GAN with the data generated by state-of-the-art WGAN-GP* and SMOTE using a multi-metric evaluation. Considering the significant challenges in evaluating generative models in general [29], and high-dimensional time-series data in particular [23], we adopted a holistic approach to evaluating our work based on both qualitative and quantitative methods. We present the results of the data generated by the three methods to balance the original datasets, compared with the real patients of the minority class, namely Black ethnicity.

### 2.1 Qualitative evaluation

To gain initial insights into the obtained results, we conduct a qualitative evaluation employing visual representation techniques. We use t-distributed Stochastic Neighbor Embedding (t-SNE) [30] to plot both real and synthetic datasets in a two-dimensional latent space while preserving the local neighbourhood relationships between data points. Additionally, Principal Component Analysis (PCA) is employed to project the real and synthetic data onto a two-dimensional space. Lastly, we leverage Uniform Manifold Approximation and Projection (UMAP) [31], which offers better preservation of the global structure of the dataset when compared to t-SNE. The parameters of t-SNE and UMAP are the same for all three methods, as shown in Appendix C. The results of our representation are illustrated in the top panels of Figure 1 (acute hypotension) and Figure 2 (sepsis).
It can be observed that the synthetic data generated by CA-GAN exhibits significant overlap with the real data, indicating the model’s ability to capture the underlying structure accurately. In contrast, WGAN-GP* and SMOTE exhibit different characteristics. SMOTE’s interpolation of real data points fills the gaps between the closest points without expanding into the embedded space, potentially compromising data authenticity. On the other hand, CA-GAN distributes data points uniformly throughout the space while adhering to the constraints of the real distribution. This suggests that our model demonstrates improved generalisation in the data space. In the case of the sepsis dataset, SMOTE displays similar behaviour to acute hypotension, interpolating real data points, whereas our CA-GAN achieves even better results by showcasing a substantial overlap between real and synthetic data without the interpolation effect. While WGAN-GP* shows some improvement, its distribution still exhibits visible areas where the distributions do not overlap. ![Fig. 1:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F1.medium.gif) [Fig. 1:](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F1) Fig. 1: Two-dimensional representations of the acute hypotension dataset. Top panels: t-SNE two-dimensional representation of real data (red) and synthetic data (blue) for the three methods SMOTE, WGAN-GP*, and CA-GAN. Middle panels: PCA two-dimensional representation of real and synthetic data, where CA-GAN provides the best coverage of real data distribution, followed by WGAN-GP* and SMOTE. It can be seen that CA-GAN provides better overall coverage of real data distribution, while WGAN-GP* shows evidence of mode collapse. Bottom panels: UMAP two-dimensional representation, with the models having similar latent distributions. ![Fig. 
2:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F2.medium.gif) [Fig. 2:](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F2) Fig. 2: Two-dimensional representations of the sepsis dataset. Top panels: t-SNE two-dimensional representation of real data (red) and synthetic data (blue) for the three methods SMOTE, WGAN-GP*, and CA-GAN. It can be seen that the three methods provide different coverage of the latent distribution, with CA-GAN being the most homogeneous. Middle panels: PCA two-dimensional representation of real and synthetic data, where CA-GAN provides the broadest coverage of the latent distribution, followed by SMOTE, while WGAN-GP* tends to converge. Bottom panels: UMAP two-dimensional representation, where the three methods have the most similar distribution. Subsequently, PCA-based analysis shows that CA-GAN is able to generate data points that cover the entire variance of the real data, while SMOTE and WGAN-GP* tend to converge on the mean, flattening their variance as shown in the middle panels of Figure 1 for acute hypotension, and Figure 2 for sepsis. We also present the UMAP latent representation of the data, which like PCA, preserves the global structure in the bottom panels of Figures 1 and 2 with similar observations. State of the art, WGAN-GP* appears to suffer from mode collapse, a significant limitation of GANs [32]. Mode collapse occurs when the generator produces a limited variety of samples despite being trained on a diverse dataset. The generator cannot fully capture the complexity of the target distribution, limiting the quantity of generated samples and resulting in repetitive output. 
This is because the generator can get stuck in a local minimum where a few outputs are repeatedly generated, even though the training data contains more modes that can be learned. Mode collapse therefore presents a significant challenge in generating high-quality, diverse samples; our CA-GAN model overcomes this limitation.

Finally, we show the distribution of individual variables of synthetic data overlaid on the distribution of the real data to compare the three methods for the acute hypotension dataset in Figure 3 and sepsis in Appendix A. The distribution of synthetic data generated by our CA-GAN exhibits the closest match to that of the real data. This close alignment is particularly evident in variables related to blood pressure, including MAP, diastolic, and systolic measurements. However, certain variables pose challenges for all three methods, such as urine and ALT/AST. These variables have highly skewed distributions with long tails, making them difficult to transform effectively using power or logarithmic transformations. In contrast, our method and WGAN-GP* effectively capture the distribution of categorical variables. Conversely, SMOTE encounters difficulties with several variables, including both the numeric variable of urine and the categorical variable of the Glasgow Coma Score (GCS). These observations are also reflected in the quantitative evaluation in Section 2.2.

The variables in the sepsis dataset are not only more than twice as many as those in acute hypotension but also have more complex distributions. Variables such as SGOT, SGPT, total bilirubin, maximum dose of vasopressors, and others have extremely long tails. The three methods struggle to generate these kinds of distributions and show a tendency to converge to the median value. In contrast, for categorical variables and normally distributed numerical variables, the behaviour is similar to that observed for acute hypotension.

![Fig.
3:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F3.medium.gif) [Fig. 3:](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F3) Fig. 3: Distribution plots of each variable, overlaying real and synthetic data for the acute hypotension dataset. The distribution of variables related to blood pressure (MAP, diastolic, and systolic) is captured well by our method in comparison to WGAN-GP* and SMOTE. CA-GAN also performs better for categorical variables, while all three methods struggle with variables with long-tailed distributions.

### 2.2 Quantitative evaluation

We used Kullback-Leibler (KL) divergence [33] to measure the similarity between the discrete density function of the real data and that of the synthetic data. For each variable *v* of the dataset, we calculate:

![Formula][1]

where *Q*<sub>*v*</sub> is the true distribution of the variable and *P*<sub>*v*</sub> is the generated distribution. The smaller the divergence, the more similar the distributions; zero indicates identical distributions. The left half of Tables 1a and 1b shows the results of the KL divergence for each variable. Our CA-GAN method has the lowest median across all variables for acute hypotension and sepsis data compared to WGAN-GP* and SMOTE. This holds even though SMOTE is specifically designed to maintain the distribution of the original variables.

[Table 1](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/T1): KL-Divergence and Maximum Mean Discrepancy between the distribution of real and synthetic data for each variable of the datasets.

In addition, we used Maximum Mean Discrepancy (MMD) [34] to calculate the distance between the distributions based on kernel embeddings, that is, the distance of the distributions represented as elements of a reproducing kernel Hilbert space (RKHS). We used a Radial Basis Function (RBF) kernel:

![Formula][2]

with *σ* = 1.
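The two metrics above can be sketched for a single numeric variable as follows. This is an illustrative sketch, not the evaluation code used in this study. Several choices are assumptions: the KL direction is taken as KL(P‖Q) with P the generated and Q the real distribution, densities are estimated with equal-width histograms (10 bins) and a small smoothing constant, MMD is computed with the biased squared estimator, and the RBF kernel is parameterised as exp(−(x − y)² / (2σ²)). All function names and sample values are hypothetical.

```python
import math


def kl_divergence(real, synthetic, bins=10, eps=1e-9):
    """KL(P || Q) between histogram densities, P = synthetic and Q = real (assumed direction)."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    if hi == lo:
        return 0.0  # both samples are the same constant
    width = (hi - lo) / bins

    def density(values):
        counts = [0] * bins
        for v in values:
            # Clamp so the maximum value falls into the last bin.
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or a division by zero.
        return [(c + eps) / (len(values) + eps * bins) for c in counts]

    q = density(real)
    p = density(synthetic)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def rbf(x, y, sigma=1.0):
    """RBF kernel for scalars, assumed form exp(-(x - y)^2 / (2 sigma^2))."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))


def mmd_squared(real, synthetic, sigma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        return sum(rbf(x, y, sigma) for x in a for y in b) / (len(a) * len(b))

    return (mean_kernel(real, real)
            + mean_kernel(synthetic, synthetic)
            - 2 * mean_kernel(real, synthetic))


# Hypothetical values for one variable.
real = [0.1, 0.4, 0.5, 0.9, 1.3]
shifted = [v + 2.0 for v in real]
print(kl_divergence(real, real), kl_divergence(real, shifted))
print(mmd_squared(real, real), mmd_squared(real, shifted))
```

Both quantities are zero when the two samples coincide and grow as the synthetic sample drifts away from the real one, which is why lower values in Table 1 indicate a closer match.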
The right half of Tables 1a and 1b shows the MMD results for SMOTE, WGAN-GP* and our CA-GAN. Again, our model has the best median performance across all the variables for acute hypotension data, while for sepsis data, SMOTE shows a difference in performance by 0.00028. In summary, CA-GAN performs best in the acute hypotension dataset by a wide margin while showing comparable performance with SMOTE in the sepsis dataset. ### 2.3 Variable correlations We used the Kendall rank correlation coefficient *τ* [35] to investigate whether synthetic data maintained original correlations between variables found in the real data of acute hypotension and sepsis datasets. This choice is motivated by the fact that the *τ* coefficient does not assume a normal distribution, which is the case for some of our variables, of the sepsis dataset in particular (as shown in Figure 3 and Appendix A). Figure 4 shows the results of Kendall’s rank correlation coefficients. For the acute hypotension dataset (Figure 4a), CA-GAN captures the original variable correlations, as does SMOTE, with the former having the closest results on categorical variables, while the latter on numerical ones. WGAN-GP* shows the worst performance, accentuating correlations that do not exist in real data. Similar patterns are also obtained for the variables of patients with sepsis in Figure 4b. ![Fig. 4:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F4.medium.gif) [Fig. 4:](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F4) Fig. 4: Kendall’s rank correlation coefficients for the real data and the data generated with CA-GAN, WGAN-GP*, and SMOTE. ### 2.4 Synthetic data authenticity When generating synthetic data, the output must be a realistic representation of the original data. Still, we also need to verify that the model has not merely learned to copy the real data. 
GANs are prone to overfitting by memorising the real data [36]; therefore, we use Euclidean Distance (*L*2 Norm) to evaluate the originality of our model’s output. Our analysis shows that the smallest distance between a synthetic and a real sample is 52.6 for acute hypotension and 44.2 for sepsis, indicating that the generated synthetic data are not a mere copy of the real data. This result, coupled with the visual representation of CA-GAN (shown in Figure 1 and 2), illustrates the ability of our model to generate authentic data. SMOTE, on the other hand, which by design interpolates the original data points, is unable to explore the underlying multidimensional space. Therefore its generated data samples are much closer to the real ones, with a minimum Euclidean distance of 0.0023 for acute hypotension and 0.033 for sepsis. ### 2.5 Downstream regression task Finally, we also sought to evaluate the ability of CA-GAN to maintain the temporal properties of time series data. As our objective is to augment the minority class to mitigate representation bias, we wanted to verify that the datasets augmented with synthetic data generated by our model can maintain or improve the predictive performance of the original data on a downstream task. Initially, we trained only a Bidirectional Long Short-Term Memory (BiLSTM) with real data as the baseline. Later we trained the BiLSTM with the synthetic and augmented datasets separately. For this task, the BiLSTM is trained on the first 20 hours for hypotension and 10 hours for sepsis of the patients’ values to predict the next hour, using a sliding window approach. To ensure the fairness of our result, 15% of the time series data points of Black patients were set apart as a test set and were used to evaluate the performance in a regression task. Table 2a and Table 2b show the mean relative errors between the BiLSTM prediction and the actual acute hypotension and sepsis observations, respectively. 
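The sliding-window setup and the relative error can be sketched as follows. This is a simplified illustration on a single univariate series; the window length, the values, the naive "repeat the last value" predictor standing in for the BiLSTM, and the exact form of the relative error (absolute error divided by the absolute true value) are all assumptions rather than details taken from this study.

```python
def sliding_windows(series, window):
    """Return (inputs, target) pairs: `window` consecutive steps and the step that follows them."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


def mean_relative_error(predictions, actuals, eps=1e-8):
    """Mean of |prediction - actual| / |actual|; eps guards against division by zero."""
    return sum(abs(p - a) / (abs(a) + eps)
               for p, a in zip(predictions, actuals)) / len(actuals)


# Hypothetical hourly values of one variable for one patient.
series = [70.0, 72.0, 71.0, 69.0, 68.0, 70.0]
pairs = sliding_windows(series, window=3)

# Stand-in predictor: repeat the last observed value of each window.
predictions = [inputs[-1] for inputs, _ in pairs]
targets = [target for _, target in pairs]
print(mean_relative_error(predictions, targets))
```

In the experiment itself the predictor is a BiLSTM trained on real, synthetic, or augmented data, and the error is computed on the held-out test set of Black patients.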
In the first column, we show the results achieved using only the real data to make the predictions; in the second column, the results using only the synthetic data; and finally, in the third column, the results achieved by predicting with the augmented dataset, that is, with both the real and synthetic data together. Overall, adding the synthetic data reduces the predictive error. This indicates that the temporal characteristics of the data generated by our CA-GAN model are close enough to those of the real data to maintain the original predictive performance. Thus the augmented dataset could be used in a downstream task, mitigating the representation bias.

[Table 2](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/T2): Mean Absolute Prediction Errors of a BiLSTM trained on real, synthetic, and augmented data for a downstream prediction task.

It should be noted that relative errors in fluid bolus, urine, and vasopressors are exceptionally high compared to the other variables, since predicting these variables is generally challenging. This difficulty stems in part from how they are collected and recorded rather than from an issue inherent to the synthetic data.

## 3 Discussion

As machine intelligence scales upwards in clinical decision-making, the risk of perpetuating existing health inequities increases significantly. This is because biased decision-making can continuously feed back into the data used to train the models, creating a vicious circle that further ingrains discrimination towards underrepresented groups. Representation bias, in particular, frequently occurs in health data, leading to decisions that may not be in the best interests of all patients, favouring specific sub-populations while treating unfavourably underrepresented sub-populations, such as those with standard set characteristics including ethnicity, gender, and disability.
To address these issues, representation must be improved before algorithmic decision-making becomes integral to clinical practice. While unequal representation is a multifaceted challenge involving socio-economic, cultural, systemic, and data-related factors, our work represents a step towards addressing one significant facet of this challenge: existing representation bias in health data.

We have shown that our work can mitigate representation bias when evaluated against state-of-the-art architectures, as well as traditional approaches such as SMOTE. SMOTE has notable advantages over other data generation techniques as it requires no training and can generate data instantly, making it an efficient approach. SMOTE demonstrates better performance than other methods when dealing with complex numerical distributions. It can create non-normal distributions even if it tends to overestimate the median in long-tail distributions. This is in contrast to GANs, which struggle with these types of distributions. Moreover, SMOTE maintains the correlations between the variables, as the generated data is similar to the original due to the use of interpolation. However, the generation of authentic data remains a significant challenge for SMOTE.

Through qualitative and quantitative evaluation, we have shown that CA-GAN can generate authentic data samples with high distribution coverage, avoiding mode collapse while ensuring that the generated data are not copies of the real data. We have also shown that augmenting the dataset with the synthetic data generated by CA-GAN leads to lower relative errors in the prediction task. This indicates that our model can generalise well from the original data. A notable advantage of our approach is that it uses the overall dataset, and not only the minority class, as is the case with WGAN-GP* and SMOTE.
This means that CA-GAN can be applied to smaller datasets and to those with highly imbalanced classes, as is the case with rare diseases.

Furthermore, we evaluated our method on two datasets with diverse characteristics and found that our CA-GAN performed better on the acute hypotension dataset. This may be because some of the numerical variables in the sepsis dataset have long-tailed distributions, presenting a modelling challenge for all the methods. Additionally, the sepsis dataset contained fewer data points per patient than acute hypotension (15 versus 48 observations). A shorter sequence, coupled with the higher number of variables in the sepsis dataset (44 versus 20), may have made it difficult for the BiLSTM modules to learn the underlying structure of the original data effectively. We also note that the lower number of patients in the acute hypotension dataset does not impact the generative performance of our method. This ability to work with fewer data points (patients) is encouraging, given our overall goal of augmenting representation.

In a comprehensive comparison with state-of-the-art methods and computationally inexpensive approaches, the CA-GAN architecture showed superior performance, although our work has some limitations. CA-GAN may require additional optimisation to further increase performance on datasets containing variables with long-tailed distributions and fewer data points per patient. Furthermore, additional analysis will be necessary to evaluate the generalisation capability of our architecture beyond clinical data.

## 4 Methods

We begin by formally formulating the problem we are addressing. Then we discuss the data sources we used to train our models and compare and contrast Generative Adversarial Networks (GANs) and Conditional Generative Adversarial Networks (CGANs). We also provide an in-depth analysis of the baseline model for this work, WGAN-GP*.
Finally, we present the architecture of our proposed Conditional Augmentation GAN (CA-GAN) and discuss its advantages over other methods.

### 4.1 Problem Formulation

Let *A* be a vector space of features and let *a* ∈ *A* represent a feature vector. Let *L* = {0, 1} be a binary distribution modifier, and let *l* be a binary mask extracted from *L*. We consider a data set ![Graphic][3] with *l* = 0, where individual samples are indexed by *n* ∈ {1, …, *N*}, and we also consider a data set ![Graphic][4] with *l* = 1, where individual samples are indexed by *m* ∈ {*N* + 1, …, *N* + *M*}, and *N* > *M*. We define the training data set *D* as *D* = *D*<sub>0</sub> ∪ *D*<sub>1</sub>. Our goal is to learn a density function ![Graphic][5] that approximates the true distribution *d* {*A*} of *D*. We also define ![Graphic][6] *A* as ![Graphic][7] with *l* = 1 applied. To balance the number of samples in *D*, we draw random variables *X* from ![Graphic][8] and add them to *D*<sub>1</sub> until *N* = *M*. Thus, we balance out *D*.

### 4.2 Data sources and patient population

In this work, we want to generate tabular datasets with a complex structure: longitudinal, multivariate, with different data types. We chose two datasets extracted from the MIMIC-III database. The inclusion and exclusion criteria are described in the literature [37, 38]. In addition, these two datasets have already been used in data generation studies [27]. We decided to test the methods for the oversampling of only one minority class, thus discarding for the time being patients who did not belong to the Caucasian or Black ethnic groups.

The acute hypotension dataset comprises 3343 patients admitted to critical care; the patients were either of Black (395) or Caucasian (2948) ethnicity. Each patient is represented by 48 data points, corresponding to the first 48 hours after the admission, in addition to nine numeric, four categorical, and seven binary variables (20 in total). Details of the dataset are presented in Table B1.
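As a worked example of the balancing step in Section 4.1, the sketch below tops up the minority class with synthetic draws until it matches the majority class, using the patient counts of the acute hypotension dataset. The function `balance`, the placeholder records, and the `draw_synthetic` callable are hypothetical stand-ins; in practice each draw would come from the trained generator.

```python
def balance(majority, minority, draw_synthetic):
    """Return the minority set topped up with synthetic draws until it matches the majority size."""
    augmented = list(minority)
    while len(augmented) < len(majority):
        augmented.append(draw_synthetic())
    return augmented


# Patient counts from the acute hypotension dataset; the records themselves are placeholders.
majority = [("real", i) for i in range(2948)]  # Caucasian patients
minority = [("real", i) for i in range(395)]   # Black patients

# Hypothetical generator call; a trained model would return a synthetic patient here.
augmented = balance(majority, minority, draw_synthetic=lambda: ("synthetic", None))

n_synthetic = sum(1 for kind, _ in augmented if kind == "synthetic")
print(len(augmented), n_synthetic)  # 2948 2553
```

Under these counts, 2948 − 395 = 2553 synthetic patients are needed to reach equal representation.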
The sepsis dataset comprises 4192 patients admitted to critical care of either Black (461) or Caucasian (3731) ethnicity. Each patient is represented by 15 data points, corresponding to observations taken every 4 hours from admission, in addition to 35 numeric, six categorical, and three binary variables (44 in total). Details of the dataset are presented in Table B2.

### 4.3 GAN vs CGAN

The Generative Adversarial Network (GAN) [39] entails two components, a generator and a discriminator. The generator *G* is fed a noise vector *z* taken from a latent distribution *p*<sub>*z*</sub> and outputs a sample of synthetic data. The discriminator *D* inputs either fake samples created by the generator or real samples *x* taken from the true data distribution *p*<sub>*data*</sub>. Hence, the GAN can be represented by the following minimax loss function:

![Formula][9]

The goal of the discriminator is to maximise the probability of discerning fake from real data, whilst the purpose of the generator is to make samples realistic enough to fool the discriminator, i.e. to minimise ![Graphic][10]. As a result of the reciprocal competition, both the generator and discriminator improve during training.

The limitations of vanilla GAN models become evident when working with highly imbalanced datasets, where there might not be sufficient samples to train the models to generate minority-class samples. A modified version of GAN, the Conditional GAN [40], solves this problem using labels *y* in both the generator and discriminator. The additional information *y* divides the generation and the discrimination into different classes. Hence, the model can now be trained on the whole dataset to generate only minority-class samples. Thus, the loss function is modified as follows:

![Formula][11]

GAN and CGAN, overall, share the same significant weaknesses during training, namely mode collapse and vanishing gradients [32].
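For reference, the objectives introduced in the original GAN [39] and Conditional GAN [40] papers are commonly written as follows. This is the standard textbook form; the notation in this paper's own equations may differ slightly.

```latex
% Standard GAN objective [39]
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]

% Conditional GAN objective [40]: both networks also receive the label y
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x \mid y)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right]
```

The only difference between the two is the conditioning on *y*, which is what allows a single model trained on the whole dataset to be asked for minority-class samples at generation time.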
In addition, as GANs were initially designed to generate images, they have been shown unsuitable for generating time-series [41] and discrete data samples [42].

### 4.4 WGAN-GP*

The WGAN-GP* introduced by Kuo et al. [27] solved many of the limitations of vanilla GANs. The model was a modified version of a WGAN-GP [25, 26]; thus, it applied the Earth Mover distance (EM) [43] to the distributions, which had been shown to solve both vanishing gradient and mode collapse [44]. In addition, the model applied Gradient Penalty during training, which helped to enforce the Lipschitz constraint on the discriminator efficiently. In contrast with vanilla WGAN-GP, WGAN-GP* employed soft embeddings [45, 46], which allowed the model to use inputs as numeric vectors for both binary and categorical variables, and a Bidirectional LSTM layer [47, 48], which allowed for the generation of samples in time-series. While *L*<sub>*D*</sub> was kept the same, *L*<sub>*G*</sub> was modified by Kuo et al. [27] by introducing alignment loss, which helped the model to capture correlation among variables over time better. Hence, the loss functions of WGAN-GP* are the following:

![Formula][12]

![Formula][13]

To calculate alignment loss we computed Pearson’s r correlation [49] for every unique pair of variables *X*<sup>(*i*)</sup> and *X*<sup>(*j*)</sup>. We then applied the *L*<sub>1</sub> loss to the differences in the correlations between *r*<sub>*syn*</sub> and *r*<sub>*real*</sub>, with *λ*<sub>*corr*</sub> representing a constant acting as a strength regulator of the loss.

In their follow-up papers, Kuo et al. noted that their simulated data based on their proposed WGAN-GP* lacked diversity. In [50], the authors found that WGAN-GP* continued to suffer from mode collapse like the vanilla GAN. Similar to our own CA-GAN, the authors extended the WGAN-GP setup with a conditional element where they externally stored features of the real data during training and replayed them to the generator sub-network at test time.
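The alignment loss described earlier in this section can be sketched in plain Python as follows. It is an illustration of the idea only, not the training code of Kuo et al. [27] or of this study: it works on flat lists rather than batched tensors, sums the absolute differences (a mean would differ only by a constant factor that can be absorbed into the weight), and uses hypothetical names and values throughout.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's r between two equally long lists (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def alignment_loss(real, synthetic, lambda_corr=1.0):
    """L1 distance between real and synthetic correlations over every unique pair of variables.

    `real` and `synthetic` map variable names to lists of values.
    """
    total = 0.0
    for i, j in combinations(sorted(real), 2):
        r_real = pearson(real[i], real[j])
        r_syn = pearson(synthetic[i], synthetic[j])
        total += abs(r_syn - r_real)
    return lambda_corr * total


# Hypothetical variables: in the real data they rise together, in the synthetic data they do not.
real = {"map": [60.0, 65.0, 70.0], "systolic": [100.0, 110.0, 120.0]}
synthetic = {"map": [60.0, 65.0, 70.0], "systolic": [120.0, 110.0, 100.0]}
print(alignment_loss(real, real), alignment_loss(real, synthetic))
```

A generator that reproduces the real correlations incurs no penalty, while one that inverts or invents correlations is penalised in proportion to the discrepancy.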
In [51] the same panel of researchers also experimented with diffusion models [52] and found that diffusion models better represent binary and categorical variables. Nonetheless, they demonstrated that GAN-based models encoded less bias (in the means and variances) of the numeric variable distributions. ### 4.5 CA-GAN We built our CA-GAN by conditioning the generator and the discriminator on static labels *y*. Hence, the updated loss functions used by our model are as follows: ![Formula][14] ![Formula][15] Where *y* can be any categorical label. During training, the label *y* was used to differentiate the minority from the majority class, and during generation, they were used to create fake samples of the minority class. Compared to WGAN-GP* we also increased the number of BiLSTMs from 1 to 3 both in the generator and the discriminator, as stacked BiLSTMs have been shown to capture complex time-series better [53]. In addition, we decreased the learning rate and batch size during training. An overview of the CA-GAN architecture is shown in Figure 5. ![Fig. 5:](http://medrxiv.org/https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F5.medium.gif) [Fig. 5:](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F5) Fig. 5: Proposed architecture of our CA-GAN. ## Data Availability The data underlying this article are freely available in the MIMIC-III repository [https://mimic.physionet.org/](https://mimic.physionet.org/) ## Declarations ### 4.6 Funding None ## 4.7 Competing interests None ## 4.8 Ethics approval The data in MIMIC-III was previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the database for research. 
### 4.9 Availability of data and materials

The data underlying this article are freely available in the MIMIC-III repository ([https://mimic.physionet.org/](https://mimic.physionet.org/)).

### 4.10 Code availability

The source code will be made available upon publication at [https://github.com/nic-olo/CA-GAN](https://github.com/nic-olo/CA-GAN).

## Appendix A Distribution Plots for Sepsis

![Fig. A1](https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F6.medium.gif)

[Fig. A1](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F6): Overlaid distribution plots of real data and CA-GAN synthetic data for each variable in the sepsis dataset.

![Fig. A2](https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F7.medium.gif)

[Fig. A2](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F7): Overlaid distribution plots of real data and WGAN-GP* synthetic data for each variable in the sepsis dataset.

![Fig. A3](https://www.medrxiv.org/content/medrxiv/early/2023/09/27/2023.09.26.23296163/F8.medium.gif)

[Fig. A3](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/F8): Overlaid distribution plots of real data and SMOTE synthetic data for each variable in the sepsis dataset.

## Appendix B Datasets

[Table B1](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/T3): Variables in the acute hypotension dataset. For each variable, the data type, the unit in which it is expressed, and the distribution statistics are presented.

[Table B2](http://medrxiv.org/content/early/2023/09/27/2023.09.26.23296163/T4): Variables in the sepsis dataset. For each variable, the data type, the unit in which it is expressed, and the distribution statistics are presented.
## Appendix C Parameters

In this study, we used the t-SNE and UMAP algorithms to perform dimensionality reduction on our datasets. The following parameters were used for each algorithm:

* t-SNE
  * Library: scikit-learn version 1.2.2
  * Parameters: n_components = 2, n_iter = 400, perplexity = 40
* UMAP
  * Library: umap-learn version 0.5.3
  * Parameters: spread = 1, min_dist = 0.4

* Received September 26, 2023.
* Revision received September 26, 2023.
* Accepted September 27, 2023.
* © 2023, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at [http://creativecommons.org/licenses/by-nd/4.0/](http://creativecommons.org/licenses/by-nd/4.0/)

## References

1. Topol, E.J.: High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25(1), 44–56 (2019). doi:10.1038/s41591-018-0300-7
2. Global Strategy on Digital Health 2020-2025. World Health Organization, Genève, Switzerland (2021)
3. Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S.: Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464), 447–453 (2019). doi:10.1126/science.aax2342
4. Ibrahim, H., Liu, X., Zariffa, N., Morris, A.D., Denniston, A.K.: Health data poverty: an assailable barrier to equitable digital health care. The Lancet Digital Health 3(4), 260–265 (2021). doi:10.1016/S2589-7500(20)30317-4
5. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16, 321–357 (2002). doi:10.1613/jair.953
6. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27 (1967). doi:10.1109/TIT.1967.1053964
7. Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: A new oversampling method in imbalanced data sets learning. In: Lecture Notes in Computer Science, pp. 878–887. Springer, Berlin, Heidelberg (2005). doi:10.1007/11538059_91
8. Gosain, A., Sardana, S.: Farthest SMOTE: A modified SMOTE approach. In: Advances in Intelligent Systems and Computing, pp. 309–320. Springer, Singapore (2019). doi:10.1007/978-981-10-8055-5_28
9. Nguyen, H.M., Cooper, E.W., Kamei, K.: Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradig. 3(1), 4 (2011). doi:10.1504/IJKESDP.2011.039875
10. Blagus, R., Lusa, L.: SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 14(1), 106 (2013). doi:10.1186/1471-2105-14-106
11. Fernandez, A., Garcia, S., Herrera, F., Chawla, N.V.: SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary. Journal of Artificial Intelligence Research 61, 863–905 (2018). doi:10.1613/jair.1.11192
12. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21(9), 1263–1284 (2009). doi:10.1109/TKDE.2008.239
13. Lu, C., Reddy, C.K., Wang, P., Nie, D., Ning, Y.: Multi-Label Clinical Time-Series Generation via Conditional GAN. arXiv (2022). doi:10.48550/ARXIV.2204.04797. [https://arxiv.org/abs/2204.04797](https://arxiv.org/abs/2204.04797)
14. Engelmann, J., Lessmann, S.: Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning. Expert Systems with Applications 174, 114582 (2021). doi:10.1016/j.eswa.2021.114582
15. Zheng, M., Li, T., Zhu, R., Tang, Y., Tang, M., Lin, L., Ma, Z.: Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification. Information Sciences 512, 1009–1023 (2020). doi:10.1016/j.ins.2019.10.014
16. Seibold, M., Hoch, A., Farshad, M., Navab, N., Fürnstahl, P.: Conditional Generative Data Augmentation for Clinical Audio Datasets. arXiv (2022). doi:10.48550/ARXIV.2203.11570. [https://arxiv.org/abs/2203.11570](https://arxiv.org/abs/2203.11570)
17. Gao, X., Deng, F., Yue, X.: Data augmentation in fault diagnosis based on the Wasserstein generative adversarial network with gradient penalty. Neurocomputing 396, 487–494 (2020). doi:10.1016/j.neucom.2018.10.109
18. Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4396–4405 (2019). doi:10.1109/CVPR.2019.00453
19. de Rosa, G.H., Papa, J.P.: A survey on text generation using generative adversarial networks. Pattern Recognit. 119, 108098 (2021). doi:10.1016/j.patcog.2021.108098
20. Kong, J., Kim, J., Bae, J.: HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems 33, 17022–17033 (2020)
21. Savage, N.: Synthetic data could be better than real data. Nature (2023). doi:10.1038/d41586-023-01445-8
22. Brophy, E., Wang, Z., She, Q., Ward, T.: Generative adversarial networks in time series: A systematic literature review. ACM Comput. Surv. 55(10) (2023). doi:10.1145/3559540
23. Alaa, A., Van Breugel, B., Saveliev, E.S., van der Schaar, M.: How faithful is your synthetic data? Sample-level metrics for evaluating and auditing generative models. In: International Conference on Machine Learning, pp. 290–306 (2022). PMLR
24. Ghosheh, G., Li, J., Zhu, T.: A review of Generative Adversarial Networks for Electronic Health Records: applications, evaluation measures and data sources. arXiv (2022). doi:10.48550/ARXIV.2203.07018. [https://arxiv.org/abs/2203.07018](https://arxiv.org/abs/2203.07018)
25. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223 (2017). PMLR
26. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems 30 (2017)
27. Kuo, N.I.-H., Polizzotto, M.N., Finfer, S., Garcia, F., Sönnerborg, A., Zazzi, M., Böhm, M., Kaiser, R., Jorm, L., Barbieri, S.: The health gym: synthetic health-related datasets for the development of reinforcement learning algorithms. Scientific Data 9(1) (2022). doi:10.1038/s41597-022-01784-7
28. Johnson, A.E.W., Pollard, T.J., Shen, L., Lehman, L.-w.H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L.A., Mark, R.G.: MIMIC-III, a freely accessible critical care database. Scientific Data 3(1) (2016). doi:10.1038/sdata.2016.35
29. Theis, L., van den Oord, A., Bethge, M.: A note on the evaluation of generative models. In: International Conference on Learning Representations (2016). [http://arxiv.org/abs/1511.01844](http://arxiv.org/abs/1511.01844)
30. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(86), 2579–2605 (2008)
31. McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (2020)
32. Goodfellow, I.J.: NIPS 2016 tutorial: Generative adversarial networks. CoRR abs/1701.00160 (2017). [https://arxiv.org/abs/1701.00160](https://arxiv.org/abs/1701.00160)
33. Kullback, S., Leibler, R.A.: On information and sufficiency. The Annals of Mathematical Statistics 22(1), 79–86 (1951). Accessed 2022-09-14
34. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. Journal of Machine Learning Research 13(25), 723–773 (2012)
35. Kendall, M.G.: The treatment of ties in ranking problems. Biometrika 33(3), 239–251 (1945). doi:10.1093/biomet/33.3.239
36. Yazici, Y., Foo, C.-S., Winkler, S., Yap, K.-H., Chandrasekhar, V.: Empirical analysis of overfitting and mode drop in GAN training. In: 2020 IEEE International Conference on Image Processing (ICIP), pp. 1651–1655 (2020). doi:10.1109/ICIP40778.2020.9191083
37. Gottesman, O., Futoma, J., Liu, Y., Parbhoo, S., Celi, L., Brunskill, E., Doshi-Velez, F.: Interpretable off-policy evaluation in reinforcement learning by highlighting influential transitions. In: International Conference on Machine Learning, pp. 3658–3667 (2020). PMLR
38. Komorowski, M., Celi, L.A., Badawi, O., Gordon, A.C., Faisal, A.A.: The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine 24(11), 1716–1720 (2018)
39. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Networks. arXiv (2014). doi:10.48550/ARXIV.1406.2661. [https://arxiv.org/abs/1406.2661](https://arxiv.org/abs/1406.2661)
40. Mirza, M., Osindero, S.: Conditional Generative Adversarial Nets (2014)
41. Yoon, J., Jarrett, D., Van der Schaar, M.: Time-series generative adversarial networks. Advances in Neural Information Processing Systems 32 (2019)
42. Yu, L., Zhang, W., Wang, J., Yu, Y.: SeqGAN: Sequence generative adversarial nets with policy gradient. CoRR abs/1609.05473 (2016). [https://arxiv.org/abs/1609.05473](https://arxiv.org/abs/1609.05473)
43. Levina, E., Bickel, P.: The earth mover's distance is the Mallows distance: some insights from statistics. In: Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2, pp. 251–256 (2001). doi:10.1109/ICCV.2001.937632
44. Arjovsky, M., Bottou, L.: Towards Principled Methods for Training Generative Adversarial Networks. arXiv (2017). doi:10.48550/ARXIV.1701.04862. [https://arxiv.org/abs/1701.04862](https://arxiv.org/abs/1701.04862)
45. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998). doi:10.1080/01638539809545028
46. Mottini, A., Lheritier, A., Acuna-Agost, R.: Airline passenger name record generation using generative adversarial networks. CoRR abs/1807.06657 (2018). [https://arxiv.org/abs/1807.06657](https://arxiv.org/abs/1807.06657)
47. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). doi:10.1162/neco.1997.9.8.1735
48. Graves, A., Fernández, S., Schmidhuber, J.: Bidirectional LSTM networks for improved phoneme classification and recognition. In: Proceedings of the 15th International Conference on Artificial Neural Networks: Formal Models and Their Applications - Volume Part II. ICANN'05, pp. 799–804. Springer, Berlin, Heidelberg (2005)
49. Mukaka, M.: Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal: the journal of Medical Association of Malawi 24, 69–71 (2012)
50. Kuo, N.I., Jorm, L., Barbieri, S., et al.: Generating synthetic clinical data that capture class imbalanced distributions with generative adversarial networks: Example using antiretroviral therapy for HIV. arXiv preprint arXiv:2208.08655 (2022)
51. Kuo, N.I., Jorm, L., Barbieri, S., et al.: Synthetic health-related longitudinal data with mixed-type variables generated using diffusion models. arXiv preprint arXiv:2303.12281 (2023)
52. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning, pp. 2256–2265 (2015). PMLR
53. Althelaya, K.A., El-Alfy, E.-S.M., Mohammed, S.: Evaluation of bidirectional LSTM for short- and long-term stock market prediction. In: 2018 9th International Conference on Information and Communication Systems (ICICS), pp. 151–156 (2018). doi:10.1109/IACS.2018.8355458

[1]: /embed/graphic-4.gif
[2]: /embed/graphic-6.gif
[3]: /embed/inline-graphic-1.gif
[4]: /embed/inline-graphic-2.gif
[5]: /embed/inline-graphic-3.gif
[6]: /embed/inline-graphic-4.gif
[7]: /embed/inline-graphic-5.gif
[8]: /embed/inline-graphic-6.gif
[9]: /embed/graphic-9.gif
[10]: /embed/inline-graphic-7.gif
[11]: /embed/graphic-10.gif
[12]: /embed/graphic-11.gif
[13]: /embed/graphic-12.gif
[14]: /embed/graphic-13.gif
[15]: /embed/graphic-14.gif