Abstract
Representation bias in health data can lead to unfair decisions, compromising the generalisability of research findings. As a consequence, underrepresented subpopulations, such as those from specific ethnic backgrounds or genders, do not benefit equally from clinical discoveries. Several approaches have been developed to mitigate representation bias, ranging from simple resampling methods, such as SMOTE, to recent approaches based on generative adversarial networks (GAN). However, generating high-dimensional time-series synthetic health data remains a significant challenge. In this work we propose a novel CA-GAN architecture that synthesises authentic, high-dimensional time series data. CA-GAN outperforms state-of-the-art methods in a qualitative and a quantitative evaluation while avoiding mode collapse, a serious GAN failure. We evaluate our CA-GAN’s generalisability in mitigating representation bias and improving model fairness for Black patients, as well as female patients. We perform evaluation using two diverse, real-world clinical datasets, comprising 7535 patients with hypotension and sepsis. Finally, we show that CA-GAN generates authentic data of the minority class while faithfully maintaining the original distribution of data, resulting in improving performance in a downstream predictive task.
1 Introduction
Clinical practice is poised to benefit from developments in machine learning as data-driven digital health technologies transform health care [1]. Digital health can catalyse the World Health Organisation’s (WHO) vision of promoting equitable, affordable, and universal access to health and care [2]. However, as machine learning methods increasingly weave themselves into the societal fabric, critical issues related to fairness and algorithmic bias in decision-making are coming to light. Algorithmic bias can originate from diverse sources, including socio-economic factors, where income disparities between ethno-racial groups are reflected in algorithms deciding which patients need care [3]. Bias can also originate from the underrepresentation of particular demographics, such as ethnicity, gender, and age in the datasets used to develop machine learning models, known as health data poverty [4]. Health data poverty impedes underrepresented subpopulations from benefiting from clinical discoveries, compromising the generalisability of research findings and leading to representation bias that can compound health disparities.
The machine learning community has developed several approaches to mitigate representation bias, with data resampling being the most widely used. Oversampling generates representative synthetic data from the underrepresented subpopulation (minority class), resulting in a similar or equal representation. Synthetic Minority Over-sampling TEchnique (SMOTE) [5] is a representative example of this method, where synthetic samples lie between a randomly selected data sample and its randomly selected neighbour (using k-nearest neighbour algorithm [6]). SMOTE and related methods [7–9] are popular approaches due to their simplicity and computational efficiency.
However, SMOTE, when used with high-dimensional time-series data, may decrease data variability and introduce correlation between samples [10–12]. In response, alternative approaches based on Generative Adversarial Networks (GAN) are gaining ground [13–17]. GANs have shown incredible results in generating realistic images [18], text [19], and speech [20] in addition to improving privacy [21]. However, while GANs address some of the issues of SMOTE-based approaches, the generation of high-dimensional time-series data remains a significant research challenge [22–24].
To address this challenge, we propose a new generative architecture called Conditional Augmentation GAN (CA-GAN). Our CA-GAN extends Wasser-stein GAN with Gradient Penalty [25, 26], presented in the Health Gym study [27] (referred to in this paper as WGAN-GP*). However, our work has a different objective. Instead of generating new synthetic datasets, we condition our GAN to augment the minority class only, while maintaining correlations between the variables and correlations over time, in contrast to the recent work [28]. As a result, CA-GAN captures the distribution of the overall dataset, including the majority class. We compare the performance of our CA-GAN with WGAN-GP* and SMOTE in generating synthetic data of patients of an under-represented ethnicity (Black patients in our case) as well as gender (female). We use two critical care datasets comprising acute hypotension (n=3343) and sepsis (n=4192), resulting in 7535 patients overall.
Our datasets include both categorical and continuous variables with diverse distributions and are derived from the well-studied MIMIC-III critical care database [29]. While these datasets were chosen as a proof of concept, our architecture is disease agnostic.
Our work makes the following contributions: (1) We propose a new CA-GAN architecture that addresses the shortcomings of traditional and the state-of-the-art methods in generating high-dimensional, time-series, synthetic data, using two real-world datasets. (2) Our multi-metric evaluation using qualitative and quantitative methods demonstrate superior performance of CA-GAN with respect to the state of the art architecture, while avoiding mode-collapse, a significant GAN failure. (3) We evaluate our CA-GAN against SMOTE, a naive but effective and popular resampling method, however limited in the synthesis of authentic data. (4) We show the impact of synthetic data augmentation in improving model performance for the minority (underrepresented) class, resulting in a fairer model between Black and White ethnicities. (5) We also show that CA-GAN can synthesise realistic data of diverse ethnicities and genders to augment the real data, improving the performance in a downstream predictive task.
2 Results
We primarily focus on multi-metric evaluation of synthetic data generated by our CA-GAN architecture in comparison to the data generated by state-of-the-art WGAN-GP* architecture and the popular SMOTE approach. We provide a separate analysis on the impact of synthetic data generated by our architecture to mitigate representation bias and improve model fairness in Section 2.5. Considering significant challenges in evaluating generative models in general, [30], and high-dimensional time-series data in particular [23], we adopted a holistic approach to evaluating our work based on both qualitative and quantitative methods. We present the results of the data generated comparing the performance of the three methods in augmenting the underrepresented (minority) class, namely Black ethnicity (shown in this section) and female gender (shown in Appendix G).
2.1 Qualitative evaluation
To gain initial insights into the obtained results, we conduct a qualitative evaluation employing visual representation methods that show the coverage of synthetic data with respect to the real data. We use Principal Component Analysis (PCA) to project the real and synthetic data onto a two-dimensional space. We also use t-distributed Stochastic Neighbor Embedding (t-SNE) [31] to plot both real and synthetic datasets in a two-dimensional latent space while preserving the local neighbourhood relationships between data points. To compare the performance between the methods and ensure a consistent visualisation of the real data, we have computed a common t-SNE embedding.
In Appendix H we present the results of Uniform Manifold Approximation and Projection (UMAP) [32], which offers better preservation of the global structure of the dataset when compared to t-SNE. The parameters of t-SNE and UMAP are the same for all three methods as shown in Appendix C.
The results are illustrated in Figure 1 (acute hypotension) and Figure 2 (sepsis). Synthetic data generated by CA-GAN exhibits significant overlap with the real data, indicating our model’s ability to accurately capture the underlying structure of real data. This is especially evident in PCA, where the representations reveal that the synthetic data generated by CA-GAN provides the best overall coverage of the real data distribution. Further evidence is provided from the marginal distributions. For the acute hypotension dataset (Figure 1), both WGAN-GP* and SMOTE show evidence of mode collapse (evident also in the t-SNE plots), where synthetic data is generated from a limited space. Similarly, for the sepsis dataset (Figure 2), CA-GAN covers more of the real data distribution compared to SMOTE, while WGAN-GP* again tends towards mode collapse.
Figures 1 and Figure 2 in the bottom panels show the t-SNE representations of the real and synthetic data. For the acute hypotension dataset, CA-GAN more uniformly covers the distribution of real data, while SMOTE does not cover a significant part of it (top right in the panel). This is also evident from the marginal distributions. WGAN-GP* coverage is almost completely separated from the real data. For the sepsis dataset, t-SNE shows that SMOTE follows an interpolation pattern failing to expand into the latent space. In contrast, CA-GAN successfully expands the distribution into the latent space, generating authentic data points, while remaining within the clusters identified by t-SNE. Data generated by WGAN-GP* fall outside of the real data.
Figures 1 and 2 provide evidence that state of the art WGAN-GP* appears to suffer from mode collapse, a significant limitation of GANs [33]. Mode collapse occurs when the generator produces a limited variety of samples despite being trained on a diverse dataset. The generator cannot fully capture the complexity of the target distribution, limiting the quantity of generated samples and resulting in repetitive output. This is because the generator can get stuck in a local minimum where a few outputs are repeatedly generated, even though the training data contains more modes that can be learned. This presents a significant challenge in generating high-quality, authentic samples, while our CA-GAN model overcomes this limitation.
The evidence that CA-GAN captures accurately the underlying structure of real data is further reinforced based on joint distribution of variables, which we show in Appendix D. In Appendix H we also present the UMAP latent representation of the data, which preserves the global structure in Figures H6 (acute hypotension) and H7 (sepsis).
Finally, we show the distribution of individual variables of synthetic data overlaid on the distribution of the real data. We use this to compare the performance of our method with state of the art WGAN-GP* as well as SMOTE as the baseline method, using acute hypotension dataset in Figure 3 and sepsis in Appendix A. Joint distributions are shown in Appendix D). The distribution of synthetic data generated by our CA-GAN exhibits the closest match to that of the real data. This close alignment is particularly evident in variables related to blood pressure, including MAP, diastolic, and systolic measurements. However, certain variables, such as urine and ALT/AST, pose challenges for all three methods. These variables have highly skewed, non-Gaussian distributions with long tails, making them difficult to transform effectively using power or logarithmic transformations. In contrast, our CA-GAN and WP-GAN* effectively capture the distribution of categorical variables.
Conversely, SMOTE encounters difficulties with several variables, including both the numeric variable of urine and the categorical variable of the Glasgow Coma Score (GCS). These observations are also reflected in the quantitative evaluation in Section 2.2. The variables in the sepsis dataset are not only more than twice as many as those in acute hypotension but also have more complex distributions. Variables such as SGOT, SGPT, total bilirubin, maximum dose of vasopressors, and others have extremely long tails. The three methods struggle to generate these kinds of distributions and show a tendency to converge to the median value. In contrast, the behaviour is similar to acute hypotension for categorical and numerical variables normally distributed.
2.2 Quantitative evaluation
We used Kullback-Leibler (KL) divergence [34] to measure the similarity between the discrete density function of the real data and that of the synthetic data. For each variable v of the dataset, we calculate: where Qv is the true distribution of the variable and Pv is the generated distribution. The smaller the divergence, the more similar the distributions; zero indicates identical distributions. The left half of Tables 1a and 1b show the results of the KL divergence for each variable. Our CA-GAN method has the lowest median across all variables for acute hypotension and sepsis data compared to WGAN-GP* and SMOTE. This is despite the fact that SMOTE is specifically designed to maintain the distribution of the original variables.
In addition, we used Maximum Mean Discrepancy (MMD) [35] to calculate the distance between the distributions based on kernel embeddings, that is, the distance of the distributions represented as elements of a reproducing kernel Hilbert space (RKHS). We used a Radial Basis Function (RBF) Kernel: with σ = 1. The right half of Tables 1a and 1b shows the MMD results for SMOTE, WGAN-GP* and our CA-GAN. Again, our model has the best median performance across all the variables for acute hypotension data, while for sepsis data, SMOTE shows a difference in performance by 0.00028. In summary, CA-GAN performs best in the acute hypotension dataset by a wide margin while showing comparable performance with SMOTE in the sepsis dataset.
2.3 Variable correlations
We used the Kendall rank correlation coefficient τ [36] to investigate whether synthetic data maintained original correlations between variables found in the real data of acute hypotension and sepsis datasets. This choice is motivated by the fact that the τ coefficient does not assume a normal distribution, which is the case for some of our variables, of the sepsis dataset in particular (as shown in Figure 3 and Appendix A). Figure 4 shows the results of Kendall’s rank correlation coefficients. For the acute hypotension dataset (Figure 4a), CA-GAN captures the original variable correlations, as does SMOTE, with the former having the closest results on categorical variables, while the latter on numerical ones. WGAN-GP* shows the worst performance, accentuating correlations that do not exist in real data. Similar patterns are also obtained for the variables of patients with sepsis in Figure 4b.
2.4 Synthetic data authenticity
When generating synthetic data, the output must be a realistic representation of the original data. Still, we also need to verify that the model has not merely learned to copy the real data. GANs are prone to overfitting by memorising the real data [37]; therefore, we use Euclidean Distance (L2 Norm) to evaluate the originality of our model’s output. Our analysis shows that the smallest distance between a synthetic and a real sample is 52.6 for acute hypotension and 44.2 for sepsis, indicating that the generated synthetic data are not a mere copy of the real data. This result, coupled with the visual representation of CA-GAN (shown in Figure 1 and 2), illustrates the ability of our model to generate authentic data. SMOTE, which by design interpolates the original data points, is unable to explore the underlying multidimensional space. Therefore its generated data samples are much closer to the real ones, with a minimum Euclidean distance of 0.0023 for acute hypotension and 0.033 for sepsis. A summary of distance metrics, including median, mean, standard deviation, maximum and minimum is shown in Appendix E, Tables E3a and E3b.
2.5 Improving model fairness
Having evaluated the quality of synthetic data generated by our CA-GAN, we now focus on evaluating the potential of synthetic data in improving model fairness. Machine learning literature has highlighted many examples where datasets with class imbalance lead to unfair decisions for the minority class [38]. This is especially important when those decisions significantly affect the health of patients and the society at large, as is the case with our study. We measure fairness by comparing model performance across subgroups, based on the approach described in [39]. In this respect we have carried out an analysis to understand the performance of a predictive model within Black patients and White patients separately and the impact of synthetic data. For this purpose we chose the task of predicting lactate within each ethnic subgroup, based on our previous work [40]. Lactate is essential in guiding clinical decision making for patients with sepsis and ultimately affects patients’ survival [41, 42]. Therefore, any differences between Black and White patients in lactate prediction performance would result in potentially unfair treatment decisions due to data representation bias. Based on our work in [40] we trained an LSTM classifier to predict whether the outcome of the last lactate value in the time series of the patients was above a critical threshold, using as input the previous observations of the clinical variables in our dataset. The model was validated on real data only, using a stratified 3-fold cross validation with 10 repetitions. Initially we used only the real (original) sepsis dataset containing around 11% of Black patients. The predictive performance of an LSTM model within the Black patient cohort was AUC of 0.65. This is in contrast to AUC of 0.70 within the White patient cohort. For comparison the overall performance of the model when including both ethnicities was 0.70. Then we augmented the original Black patient cohort with synthetic data (conditioned on Black patients only) generated by our CA-GAN method, such that the representation between Black and White patients was equal. As a consequence, the predictive performance within Black patient cohort increased to AUC of 0.69, while maintaining the AUC of 0.70 for the White patients cohort and the full dataset with both ethnicities. Synthetic data augmentation resulted in a fairer model between ethnicities, with a statistically significant difference between non-augmented and augmented datasets.
2.6 Downstream regression task
Finally, we also sought to evaluate the ability of CA-GAN to maintain the temporal properties of time series data, considering ethnicity (shown here) and gender (shown in Appendix G). Since our objective is to augment the minority class to mitigate representation bias, we wanted to verify that the datasets augmented with synthetic data generated by our model can maintain or improve the predictive performance of the original data on a downstream task. Initially, we trained only a Bidirectional Long Short-Term Memory (BiLSTM) with real data as the baseline. Later, we trained the BiLSTM with synthetic and augmented datasets separately, considering ethnicity and gender diversity. They were used to evaluate the performance in a regression task. To address the inherent randomness in the models, we used CA-GAN to create five different synthetic datasets. For each of these datasets, we trained five BiLSTM models, resulting in a total of 25 BiLSTM models (5 datasets × 5 models per dataset). A complete description of how we performed these experiments is provided in Appendix F.
Table 2a and Table 2b show the mean absolute errors between the BiLSTM prediction and the actual acute hypotension and sepsis observations, respectively. In the first column, we show the results achieved using only the real data to make the predictions; in the second column, the results using only the synthetic data; and finally, in the third column, the results achieved by predicting with the augmented dataset, that is, with both the real and synthetic data together, considering both ethnic and gender diversity in the augmentation process. Overall, adding the synthetic data reduces the predictive error. This indicates that the temporal characteristics of the data generated by our CA-GAN model are close enough to those of the real data to maintain the original predictive performance. Thus, the augmented dataset could be used in a downstream task, mitigating the representation bias. Similar findings are also shown for the gender-augmented dataset in Appendix G.
It should be noted that the errors in Input Fluids Total and Total Urine Output are exceptionally high compared to the other variables. This is because predicting these variables is generally challenging, stemming partly from how they are collected and recorded rather than an issue inherent to synthetic data generation.
3 Discussion
As machine intelligence scales upwards in clinical decision-making, the risk of perpetuating existing health inequities increases significantly. This is because biased decision-making can continuously feed back the data used to train the models, creating a vicious circle that further ingrains discrimination towards underrepresented groups. Representation bias, in particular, frequently occurs in health data, leading to decisions that may not be in the best interests of all patients, favouring specific subpopulations while treating underrepresented sub-populations, such as those with standard set characteristics including ethnicity, gender, and disability unfavourably.
To address these issues, representation must be improved before algorithmic decision-making becomes integral to clinical practice. While unequal representation is a multifaceted challenge involving diverse factors such as socio-economic, cultural, systemic, and data, our work represents a step towards addressing one significant facet of this challenge: mitigating existing representation bias in health data.
We have shown that our work can generate high-quality synthetic data when evaluated against state-of-the-art architectures and traditional approaches such as SMOTE. SMOTE has notable advantages over other data generation techniques as it requires no training and can work with smaller datasets. It can mirror non-normal distributions even if it tends to overestimate the median in long-tail distributions. This is in contrast to GANs, which struggle with these types of distributions. However, generating authentic data remains a significant challenge for SMOTE, especially important when considering confidentiality of data and patients’ privacy.
Through qualitative and quantitative evaluation, we have shown that CA-GAN can generate authentic data samples with high distribution coverage, avoiding mode collapse failure, while ensuring that the generated data are not copies of the real data. We have also shown that augmenting the dataset with the synthetic data generated by CA-GAN leads to lower errors in the downstream regression task. This indicates that our model can generalise well from the original data.
A notable advantage of our approach is that it uses the overall dataset, and not only the minority class, as is the case with WGAN-GP* and SMOTE. This means that CA-GAN can be applied in smaller datasets and those with highly imbalanced classes, such as rare diseases.
Furthermore, we evaluated our method on two datasets with diverse characteristics and found that our CA-GAN performed better on the acute hypotension dataset. This may be because some of the numerical variables in the sepsis dataset have long-tailed distributions, presenting a modelling challenge for all the methods. Similar challenges are observed for variables with non-normal distributions. Additionally, the sepsis dataset contained fewer data points per patient than acute hypotension (15 versus 48 observations). A shorter input sequence may have created difficulty for BiLSTM modules to learn the underlying structure of the original data effectively, coupled with a higher number of variables (twice as many) in the sepsis dataset. We also note that the lower number of patients in the acute hypotension dataset does not impact the generative performance of our method. This ability to work with fewer data points (patients) is encouraging, given the overall objective of our goal of augmenting representation.
The task of generalising to unseen data categories is particularly challenging due to the inherent unknowns these categories represent. While a definitive strategy is yet to be developed, we believe that integrating conditional generation with metric learning, as seen in prototypical networks [43], could provide a dual advantage. This integration could not only facilitate the generation of novel data points but also offer a quantifiable framework to assess their similarity or dissimilarity to known data categories. Such an approach could extend the capabilities of CA-GAN beyond data augmentation, potentially improving the interpretability and applicability of the synthetic data generated.
In terms of evaluation metrics, our current approach primarily focuses on statistical properties of the generated data. However, to ascertain the practical utility and accuracy of the synthetic data produced by CA-GAN, we recognise the importance of domain-specific validations, including performance in subpopulations [44, 45]. Inspired by the collaborative efforts outlined in [46], we are committed to exploring partnerships with experts in relevant fields such as healthcare and clinical practice to guide the development of evaluation metrics. Their input will ensure that our synthetic datasets can meet the rigorous demands of real-world applications and contribute meaningfully to the domains they are intended to serve.
Our architecture can provide a solid basis to generate privacy preserving synthetic data and mitigate barriers to access clinical data. This is because, we have ensured that the synthetic data generated by our CA-GAN are not a mere copy of the real data on one hand, while on the other, the synthetic data reflect the distributions of the real data. However, privacy preserving aspects will require additional analysis, such as reconstruction attacks, which are beyond the scope of the current work.
While CA-GAN architecture showed superior performance with respect to state of the art method as well as computationally inexpensive approaches, some limitations are present. Namely, CA-GAN may require additional optimisations to further increase performance on datasets with variables with non Gaussian distributions and those with long-tailed distributions. Furthermore, additional analysis will be required to evaluate the generalisation capability of our architecture with datasets of different characteristics. In this respect we aim to refine CA-GAN while exploring alternative architectures (such as Diffusion Models) in addressing some of these limitations. One approach might be using convolutional neural networks (CNNs) or Temporal Convolutional Networks (TCNs) as a promising direction to potentially improve the efficiency of our model. The work of Bai et al. [47] provides an empirical foundation for this approach, indicating that such network structures can rival the performance of recurrent networks for sequence modelling tasks. Additionally, prior research [48] suggests that simplifying the internal mechanisms of these recurrent units can lead to improvements in both performance and computational efficiency. These insights provide a strong impetus for our future work, where we aim to refine the CA-GAN model to harness the benefits of these alternative architectures without compromising its ability to perform conditional generation.
Finally, we are aware that the use of synthetic data may generate several ethical and policy implications including the fact that synthetic data cannot fully address the historical biases and discriminatory practices which are often reflected in the data [49]. While our work can mitigate existing representation biases, we must ensure that this does not come at the risk of disincentivising participation of underrepresented groups or perpetuating other types of data biases [50, 51]. Finally, while we showed the utility of synthetic data, we also note that the findings should always be confirmed using real data.
4 Methods
We begin by formally formulating the problem we are addressing. Then we discuss the data sources we used to train our models and compare and contrast Generative Adversarial Networks (GANs) and Conditional Generative Adversarial Networks (CGANs). We also provide an in-depth analysis of the baseline model for this work, WGAN-GP*. Finally, we present the architecture of our proposed Conditional Augmentation GAN (CA-GAN) and discuss its advantages over other methods.
4.1 Problem Formulation
Let A be a vector space of features and let a ∈ A represent a feature vector. Let L = {0, 1} be a binary distribution modifier, and let l a binary mask extracted from L. We consider a data set with l = 0, where individual samples are indexed by n ∈ {1, …, N } and we also consider a data set with l = 1, where individual samples are indexed by m ∈ {N + 1, …, N + M }, and N > M. We define the training data set D as D = D0 ∪ D1.
Our goal is to learn a density function that approximates the true distribution d{A} of D. We also define as with l = 1 applied.
To balance the number of samples in D, we draw random variables X from and add them to D1 until N = M. Thus, we balance out D.
4.2 Data sources, variables and patient population
Our analysis uses two datasets extracted from the MIMIC-III database. The detailed data pre-processing steps are outlined in our previous publication [27], in Section 1.2 and Section 1.3 of the supplementary material. We chose these two datasets as they have already been used in the study describing WGAN-GP* [27] making the comparison with our method fairer. We decided to test the methods for the oversampling of only one minority class, thus including only patients that belonged to the White (coded as Caucasian) or Black ethnic groups. We used a similar approach for gender, shown in Appendix G.
The acute hypotension dataset comprises 3343 patients admitted to critical care; the patients were either of Black (395) or White (2948) ethnicity. Each patient is represented by 48 data points, corresponding to the first 48 hours after the admission, and 20 variables, namely nine numeric, four categorical, and seven binary variables. Details of this dataset are presented in Appendix B, Table B1.
The Sepsis dataset comprises 4192 patients admitted to critical care of either Black (461) or White (3731) ethnicity. Each patient is represented by 15 data points, corresponding to observations taken every four hours from admission, and 44 variables, namely 35 numeric, six categorical, and three binary variables. Details of this dataset are presented in Appendix B, Table B2.
4.3 GAN vs CGAN
The Generative Adversarial Network (GAN) [52] entails two components: a generator and a discriminator. The generator G is fed a noise vector z taken from a latent distribution pz and outputs a sample of synthetic data. The discriminator D inputs either fake samples created by the generator or real samples x taken from the true data distribution pdata. Hence, the GAN can be represented by the following minimax loss function:
The goal of the discriminator is to maximise the probability of discerning fake from real data, whilst the purpose of the generator is to make samples realistic enough to fool the discriminator, i.e., to minimise .As a result of the reciprocal competition, both the generator and discriminator improve during training.
The limitations of vanilla GAN models become evident when working with highly imbalanced datasets, where there might not be sufficient samples to train the models to generate minority-class samples. A modified version of GAN, the Conditional GAN [53], solves this problem using labels y in both the generator and discriminator. The additional information y divides the generation and the discrimination in different classes. Hence, the model can now be trained on the whole dataset to generate only minority-class samples. Thus, the loss function is modified as follows:
GAN and CGAN, overall, share the same significant weaknesses during training, namely mode collapse and vanishing gradient [33]. In addition, as GANs were initially designed to generate images, they have been shown unsuitable for generating time-series [54] and discrete data samples [55].
4.4 WGAN-GP*
The WGAN-GP* introduced by Kuo et al. [27] solved many of the limitations of vanilla GANs. The model was a modified version of a WGAN-GP [25, 26]; thus, it applied the Earth Mover distance (EM) [56] to the distributions, which had been shown to solve both vanishing gradient and mode collapse [57]. In addition, the model applied the Gradient Penalty during training, which helped to enforce the Lipschitz constraint on the discriminator efficiently. In contrast with vanilla WGAN-GP, WGAN-GP* employed soft embeddings [58, 59], which allowed the model to use inputs as numeric vectors for both binary and categorical variables, and a Bidirectional LSTM layer [60, 61], which allowed for the generation of samples in time-series. While LD was kept the same, LG was modified by Kuo et al. [27] by introducing alignment loss, which helped the model to capture correlation among variables over time better. Hence, the loss functions of WGAN-GP* are the following:
To calculate alignment loss, we computed Pearson’s r correlation [62] for every unique pair of variables X(i) and X(j). We then applied the L1 loss to the differences in the correlations between rsyn and rreal, with λcorr representing a constant acting as a strength regulator of the loss.
In their follow-up papers, Kuo et al. noted that their simulated data based on their proposed WGAN-GP* lacked diversity. In [63], the authors found that WGAN-GP* continued to suffer from mode collapse like the vanilla GAN. Similar to our own CA-GAN, the authors extended the WGAN-GP setup with a conditional element where they externally stored features of the real data during training and replayed them to the generator sub-network at test time. In [64], the same panel of researchers also experimented with diffusion models [65] and found that diffusion models better represent binary and categorical variables. Nonetheless, they demonstrated that GAN-based models encoded less bias (in the means and variances) of the numeric variable distributions.
4.5 CA-GAN
We built our CA-GAN by conditioning the generator and the discriminator on static labels y. Hence, the updated loss functions used by our model are as follows:
Where y can be any categorical label. During training, the label y was used to differentiate the minority from the majority class, and during generation, they were used to create fake samples of the minority class.
Compared to WGAN-GP* we also increased the number of BiLSTMs from 1 to 3 both in the generator and the discriminator, as stacked BiLSTMs have been shown to capture complex time-series better [66]. In addition, we decreased the learning rate and batch size during training. An overview of the CA-GAN architecture is shown in Figure 5.
Data Availability
The data underlying this article are freely available in the MIMIC-III repository
Declarations
4.6 Funding
None
4.7 Competing interests
None
4.8 Ethics approval
The data in MIMIC-III was previously de-identified, and the institutional review boards of the Massachusetts Institute of Technology (No. 0403000206) and Beth Israel Deaconess Medical Center (2001-P-001699/14) both approved the use of the database for research.
4.9 Availability of data and materials
The data underlying this article are publicly available in the MIMIC-III repository (https://mimic.physionet.org/).
4. 10 Code availability
The source code will be made available upon publication at the github repository: https://github.com/nic-olo/CA-GAN
Appendix A Distribution Plots for Sepsis
Appendix B Datasets
B.1 Data preprocessing
Detailed description of the data preprocessing steps are available from our previous publication [27] in Section 1 of the supplementary material 1. However, for completeness we highlight some of the main approaches. For the acute hypotension dataset we included adult patients (18 or over) in the MIMIC-III dataset with at least 24 hours of data, aggregating 48 hours of clinical variables from patients with seven or more mean arterial pressure (MAP) values of 65 mmHg or less, indicating acute hypotension. Missing values were replaced with the last available data, while an indicator variable was used to denote whether a value was measured or not. For the sepsis dataset we included adult patients only who had any suspicious infections based on history of administering antibiotics. We included all the variables from at least 44 hours before the suspected infection and up to 28 hours after. The missing data was imputed using nearest neighbour method. More detailed information is available in https://static-content.springer.com/esm/art%3A10.1038%2Fs41597-022-01784-7/MediaObjects/41597_2022_1784_MOESM4_ESM.pdf
Appendix C UMAP and t-SNE parameters
In this study, we used t-SNE and UMAP algorithms to perform dimensionality reduction on our datasets and highlight the differences between the results of the three methods under analysis. The following parameters were used for each algorithm:
t-SNE:
Library: scikit-learn version 1.2.2
Parameters for sepsis: n_components = 2, n_iter = 500,
learning_rate = 100, perplexity = 50
Parameters for acute hypotension: n_components = 2, n_iter = 100,
learning_rate = 1000, perplexity = 30
UMAP:
Library: umap-learn version 0.5.3
Parameters: n_neighbors = 5, spread = 5, min_dist = 0.5
Appendix D Joint distributions of variables
We have carried out an analysis of a set of variables that are clinically known to follow a joint distribution, namely systolic, diastolic and mean arterial blood pressure. This was to investigate whether CA-GAN can capture joint distributions of variables and whether synthetic data are clinically meaningful, where we know that systolic blood pressure is always higher than diastolic blood pressure. Using a scatter plot in Figure D4 we show that the joint distribution of real data is similar to that of synthetic data.
Following from this, we have also implemented several sanity checks based on clinical knowledge to ensure that the generated synthetic data is clinically meaningful. In this respect, we have performed checks to investigate whether systolic BP values are always lower than diastolic. From our analysis, in the real sepsis dataset, 99.94% of values of these variables were correct; that is, systolic values were always lower than the diastolic values. In the synthetic sepsis dataset, this figure was 99.93%, with only 0.01% difference between the real and the synthetic dataset. On the other hand for the hypotension dataset there were 99.86% correct values of systolic and diastolic variables versus 99.84% in the synthetic dataset, representing a 0.02% difference. This analysis suggests that our architecture captures the structure of the real data quite well.
Appendix E Summary of distance metrics
Appendix F Description of Downstream Regression Task
For this task, the BiLSTM is trained on the first 20 hours for hypotension and 10 hours for sepsis of the patients’ values to predict the next hour, using a sliding window approach. To ensure the fairness of our result, 15% of the time series data points of Black patients and a proportional representation of different genders were set apart as a test set. To account for the stochasticity of the models, we generated five synthetic datasets with CA-GAN. Then, we trained 5 BiLSTM models on each dataset for a total of 25 BiLSTM models (5x(5 per dataset)). The predictions of the trained models were then compared with the test dataset, and the resulting error was averaged across the five models and, subsequently, the five datasets.
Appendix G Downstream regression task on gender-conditioned data
We have devised an additional dataset of patient subpopulations with gender as the underrepresented category. Specifically, we have created two datasets: one with the original data (Real) and the other (the Real surrogate) from which we have removed 80% of the data from female patients. We have replaced the removed portion of the dataset with the synthetic data generated from our architecture. Finally, we used a downstream regression task in the same manner as we used it with the ethnicity data to evaluate the performance of our approach.
Our results show that the performance of both algorithms is comparable, namely using the original dataset, we obtain a Mean Absolute Error (MAE) of 4.784 (±0.393), while we obtain an MAE of 3.193 (±0.052) using the synthetic dataset, indicating that the CA-GAN generates faithful synthetic data. What’s even more interesting is that combining synthetic data with the real data further reduced the MAE to 2.870 (±0.220), which is lower than that of the real data alone (MAE of 3.294 (±0.241)).
Footnotes
Introduced the concept of fairness, updated figures
↵1 https://static-content.springer.com/esm/art%3A10.1038%2Fs41597-022-01784-7/MediaObjects/41597_2022_1784_MOESM4_ESM.pdf
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].
- [9].↵
- [10].↵
- [11].
- [12].↵
- [13].↵
- [14].
- [15].
- [16].
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵