Modeling Tabular data using Conditional GAN

07/01/2019 · by Lei Xu, et al. · Universidad Rey Juan Carlos · University of Cambridge · MIT

Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes, whereas discrete columns are sometimes imbalanced, making modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design TGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. TGAN outperforms the Bayesian methods on most of the real datasets, whereas the other deep learning methods do not.


1 Introduction

Recent developments in deep generative models have led to a wealth of possibilities. Using images and text, these models can learn the underlying probability distributions and generate high-quality samples. Over the past two years, the promise of such models has encouraged the development of generative adversarial networks (GANs) (Goodfellow et al., 2014) for tabular data generation. GANs offer greater flexibility in modeling distributions than their statistical counterparts. This proliferation of new GANs raises the question: "Can these new GANs offer better generative models than their statistical counterparts?" To answer this question and evaluate these GANs, we used a group of real datasets to set up a benchmarking system and implemented three of the most recent techniques.[1] For comparison purposes, we created two baseline methods using Bayesian networks. After testing these models using both simulated and real datasets, we found that modeling tabular data poses unique challenges for GANs, causing them to fall short of the baseline methods on a number of metrics, including the machine learning efficacy of the synthetically generated data. These challenges include the need to simultaneously model discrete and continuous columns, the multi-modality of information within each column, and the severe imbalance of categorical columns (we describe these challenges in detail in Section 3).

[1] We call our system SDGym, as it evaluates generative modeling capability in terms of the model's ability to generate realistic synthetic data.

To address these challenges, in this paper we propose TGAN, a method that introduces several new techniques, including augmenting the training procedure with reversible data transforms, making architectural changes to the neural networks, and addressing data imbalance by employing a novel conditional GAN (described in detail in Section 4). When applied to the same datasets with the new benchmarking suite, TGAN performs significantly better than both the Bayesian network baselines and the other new GANs tested, as shown in Table 1. The contributions of this paper are as follows:

(1) Conditional GANs for synthetic data generation. We propose TGAN as a synthetic tabular data generator to address several of the issues mentioned above. TGAN outperforms all methods to date and surpasses Bayesian networks on at least 87.5% of our datasets. To further challenge TGAN, we adapted a variational autoencoder (VAE) (Kingma and Welling, 2013) for mixed-type tabular data generation; we call this TVAE. VAEs directly use data to build the generator; even with this advantage, we show that our proposed TGAN achieves competitive performance across many datasets and outperforms TVAE on 3 of them.

(2) A benchmarking system for synthetic data generation algorithms.[2]

[2] Our benchmark can be found at https://github.com/DAI-Lab/SDGym.

We designed a comprehensive benchmark framework using several tabular datasets and different evaluation metrics as well as implementations of several baselines and state-of-the-art methods. Our system is open source and can be extended with other methods and additional datasets. At the time of this writing, the benchmark has 5 deep learning methods, 2 Bayesian network methods, and 15 datasets.

2 Related Work

Method                             # outperform CLBN            # outperform PrivBN
                                   (Chow and Liu, 1968)         (Zhang et al., 2017)
MedGAN (Choi et al., 2017)         1                            1
VeeGAN (Srivastava et al., 2017)   0                            2
TableGAN (Park et al., 2018)       3                            3
TGAN                               7                            8
Table 1: The number of wins of a particular method compared with the corresponding Bayesian network baseline, measured on an appropriate metric over the real datasets.

During the past decade, synthetic data has been generated by treating each column in a table as a random variable, modeling a joint multivariate probability distribution, and then sampling from that distribution. For example, a set of discrete variables may have been modeled using decision trees (Reiter, 2005) and Bayesian networks (Aviñó et al., 2018; Zhang et al., 2017). Spatial data could be modeled with a spatial decomposition tree (Cormode et al., 2012; Zhang et al., 2016). A set of non-linearly correlated continuous variables could be modeled using copulas (Patki et al., 2016; Sun et al., 2018). These models are restricted by the type of distributions and by computational issues, severely limiting the synthetic data's fidelity.

The development of generative models using VAEs and, subsequently, GANs and their numerous extensions (Arjovsky et al., 2017; Gulrajani et al., 2017; Zhu et al., 2017; Yu et al., 2017), has been very appealing due to the performance and flexibility offered in representing data. GANs are also used in generating tabular data, especially healthcare records; for example, (Yahi et al., 2017) uses GANs to generate continuous time-series medical records and (Camino et al., 2018) proposes the generation of discrete tabular data using GANs. medGAN (Choi et al., 2017) combines an auto-encoder and a GAN to generate heterogeneous non-time-series continuous and/or binary data. ehrGAN (Che et al., 2017) generates augmented medical records. tableGAN (Park et al., 2018) tries to solve the problem of generating synthetic data using a convolutional neural network which optimizes the quality of the label column; thus, the generated data can be used to train classifiers. PATE-GAN (Jordon et al., 2019) generates differentially private synthetic data.

3 Challenges with GANs in the Tabular Data Generation Task

The task of synthetic data generation requires training a data synthesizer G learned from a table T and then using G to generate a synthetic table T_syn. A table T contains N_c continuous columns {C_1, ..., C_{N_c}} and N_d discrete columns {D_1, ..., D_{N_d}}, where each column is considered to be a random variable. These random variables follow an unknown joint distribution P(C_{1:N_c}, D_{1:N_d}). One row r_j is one observation from the joint distribution. T is partitioned into a training set T_train and a test set T_test. After training G on T_train, T_syn is constructed by independently sampling rows from G. We evaluate the efficacy of a generator along 2 axes.

  • Likelihood fitness: Columns in T_syn follow the same joint distribution as T_train.

  • Machine learning efficacy: When we train a classifier or a regressor to predict one column using the other columns as features, a classifier or regressor learned from T_syn achieves performance on T_test similar to that of a model learned on T_train.

Problems MedGAN TableGAN PATE-GAN TGAN
C1
C2 x x x
C3 x x
C4 x x x
C5 x x
Table 2: A summary showing whether existing methods and our TGAN explicitly address challenges [C1 - C5]. (Some existing methods are only able to model continuous and binary columns.)

Several unique properties of tabular data challenge the design of a GAN model. In this section we highlight these challenges in increasing order of the complexity of solution required to solve them. In Table 2, we note which subset of these is addressed by the existing methods.

C1. Mixed data types. Real-world tabular data consists of mixed types (e.g., continuous, ordinal, categorical). To simultaneously generate a mix of discrete and continuous columns, GANs must be modified to apply both softmax and tanh on the output.

C2. Non-Gaussian distributions. In images, pixel values follow a Gaussian-like distribution, which can be normalized to [-1, 1] using a min-max transform. A tanh function is usually employed in the last layer of a network to output a value in this range. Continuous variables in tabular data are usually non-Gaussian and have distributions with long tails; thus most of the normalized values will not be centered around zero. The gradient of tanh is flat in the region where most values are located, a phenomenon known as gradient saturation, which prevents the model from learning via gradients.
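As a toy illustration (a hypothetical numpy example, not an experiment from the paper), min-max normalizing a long-tailed column pushes most values against one end of [-1, 1], exactly where the tanh gradient factor 1 - y^2 vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)

# A long-tailed "income-like" column: most values are small, a few are huge.
col = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)

# Min-max normalize to [-1, 1], the range a tanh output layer can produce.
normed = 2 * (col - col.min()) / (col.max() - col.min()) - 1

# Most normalized values are squeezed against the lower end of the range.
print("share of values below -0.99:", np.mean(normed < -0.99))

# For a tanh output y, the gradient w.r.t. its pre-activation is 1 - y^2.
# Near y = -1 this factor is almost zero, so learning signals vanish there.
print("gradient factor at y = -0.99:", 1 - 0.99 ** 2)
```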

C3. Multimodal distributions. Continuous columns in tabular data usually have multiple modes. We observe that continuous columns in our 8 real-world datasets have multiple modes. Srivastava et al. (2017) showed that vanilla GAN couldn’t model all modes on a simple 2D dataset; thus it also wouldn’t be able to model the multimodal distribution of continuous columns. To solve this problem and C2, we employ mode-specific pre-processing techniques as described in Section 4.1 and use PacGAN (Lin et al., 2018) to overcome mode collapse.

C4. Learning from sparse one-hot-encoded vectors. To enable learning from non-ordinal categorical columns, a categorical column is converted into a one-hot vector. When generating synthetic samples, a generative model is trained to output a probability distribution over all categories using softmax. This is problematic in GANs because a trivial discriminator can distinguish real from fake data simply by checking the distribution's sparseness instead of considering the overall realness of a row. TGAN avoids such pathologies by applying gumbel-softmax (Jang et al., 2016) to generate a sparse yet differentiable distribution over all categories.
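A minimal PyTorch illustration of this point (the temperature value here is an illustrative assumption): F.gumbel_softmax draws a nearly one-hot yet differentiable sample from category logits, so generated discrete outputs look as sparse as real one-hot columns while still passing gradients.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)  # scores for 3 categories

# Plain softmax: dense probabilities, easy for a discriminator to tell apart
# from the exactly-sparse one-hot vectors of real rows.
print(F.softmax(logits, dim=-1))

# Gumbel-softmax with a small temperature: a sharp, nearly one-hot sample
# that is still differentiable with respect to the logits.
sample = F.gumbel_softmax(logits, tau=0.2, hard=False, dim=-1)
sample.sum().backward()          # gradients flow back to the logits
print(sample, logits.grad)
```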

C5. Highly imbalanced categorical columns. In real-world datasets, most categorical columns have a highly imbalanced distribution. In our datasets we noticed that a large fraction of the categorical columns are highly imbalanced, with the major category appearing in most of the rows, resulting in severe mode collapse. Missing a minor category causes only tiny changes to the data distribution, but imbalanced data leads to insufficient training opportunities for minor classes; the critic network cannot detect such an issue unless mode-collapse-preventing mechanisms such as PacGAN are used. These mechanisms can keep a GAN from generating only the most salient category, but synthetic data for minor categories are still expected to be of lower quality, necessitating resampling.

4 TGAN Model

In this section, we explain our preprocessing method and introduce our TGAN model.

Notations: Besides common operations like tanh, ReLU, and softmax, batch normalization (Ioffe and Szegedy, 2015), denoted BN, and dropout (Srivastava et al., 2014), denoted drop, we define

  • PMF_θ(x): the categorical PMF of x with parameters θ

  • x1 ⊕ x2: the concatenation of vectors x1 and x2

  • gumbel_τ(x): apply Gumbel softmax (Jang et al., 2016) with temperature τ on a vector x

  • leaky_γ(x): apply a leaky ReLU activation on x with leaky ratio γ

  • FC_{u→v}(x): linearly project a u-dimensional vector x to a v-dimensional vector by means of a fully connected layer with linear activation.

For readability, when we define our model we replace the parameters τ, γ, u, and v of gumbel, leaky, and FC with the actual settings used in the experiments. Additionally, we use P(·) to stress that a given function is a probability distribution.

4.1 Reversible Data Transformations

Figure 1: Reversible data transformation of a row with two continuous and two discrete columns. In this example we have assumed that the first continuous column is modeled with a mixture of three Gaussian components and the second with two. Each continuous value is represented by the normalized value within its selected mode together with a one-hot mode-indicator vector, and each discrete value is represented by a one-hot vector; the transformed row is the concatenation of these parts.

In order to deal with mixed data types, each column is processed independently, according to whether its values are continuous or discrete. Figure 1 summarizes the transformation. A discrete column is simply transformed into its one-hot representation. For continuous columns, we use a mode-specific normalization, which is able to deal with non-Gaussian and multimodal distributions.

The mode-specific normalization consists of four steps (a short code sketch follows the list). Let c_{i,j} be a continuous value corresponding to the i-th continuous column C_i and the j-th row in the tabular data.

  1. Begin by estimating the number of modes m_i in the distribution of C_i. To do so, we use a variational Gaussian mixture model (VGM) (Bishop, 2006) that produces the probabilistic model P(C_i), a Gaussian mixture of m_i components with means η_1, ..., η_{m_i}, standard deviations φ_1, ..., φ_{m_i}, and weights π_1, ..., π_{m_i} respectively. The VGM model is trained to maximize the likelihood on the training data.

  2. Compute the probability of c_{i,j} having been sampled from each of the m_i modes, i.e. the PMF ρ_k ∝ π_k N(c_{i,j}; η_k, φ_k) for k = 1, ..., m_i.

  3. Sample a mode k* from ρ and convert it into a one-hot vector β_{i,j}.

  4. Normalize the value as α_{i,j} = (c_{i,j} - η_{k*}) / (4 φ_{k*}). Then clip α_{i,j} to [-1, 1], i.e. keep the area of a Gaussian distribution that covers the vast majority of samples. Finally, employ α_{i,j} and β_{i,j} to represent c_{i,j}.
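A minimal sketch of these four steps using scikit-learn's BayesianGaussianMixture as the VGM (the maximum number of components and the toy column below are assumptions for illustration, not settings reported in this text):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fit_vgm(column, max_modes=10):
    """Step 1: fit a variational GMM to one continuous column."""
    vgm = BayesianGaussianMixture(
        n_components=max_modes,
        weight_concentration_prior_type="dirichlet_process",
        random_state=0,
    )
    vgm.fit(column.reshape(-1, 1))
    return vgm

def transform_value(vgm, c):
    """Steps 2-4: represent a scalar c as (alpha, one-hot beta)."""
    means = vgm.means_.ravel()
    stds = np.sqrt(vgm.covariances_).ravel()

    # Step 2: probability of c under each mode (responsibilities).
    rho = vgm.predict_proba(np.array([[c]])).ravel()

    # Step 3: sample a mode index and one-hot encode it.
    k = np.random.choice(len(rho), p=rho)
    beta = np.eye(len(rho))[k]

    # Step 4: normalize within the selected mode and clip to [-1, 1].
    alpha = np.clip((c - means[k]) / (4 * stds[k]), -1.0, 1.0)
    return alpha, beta

# Example usage on a synthetic bimodal column.
col = np.concatenate([np.random.normal(-4, 1, 500), np.random.normal(5, 0.5, 500)])
vgm = fit_vgm(col)
print(transform_value(vgm, col[0]))
```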

4.2 Conditional Tabular GAN

Traditionally, a GAN is fed with a vector sampled from a standard multivariate normal distribution (MVN), and by means of the Generator and Discriminator or Critic (Arjovsky et al., 2017; Gulrajani et al., 2017) neural networks one eventually obtains a deterministic transformation that maps the standard MVN into the distribution of the data. This method of training a generator does not account for the imbalance in the categorical columns. If the training data are randomly sampled during training, the rows that fall into minor categories will not be sufficiently represented, so the generator may not be trained correctly. This problem is reminiscent of the "class imbalance" problem in discriminative modeling; the challenge, however, is exacerbated because there is not a single column to balance and the real data distribution should be kept intact. If the training data are resampled, the generator learns the resampled distribution, which differs from the real data distribution.

Specifically, the goal is to resample efficiently in a way that all the categories from discrete attributes are sampled evenly (but not necessarily uniformly) during the training process, and to recover the (non-resampled) real data distribution at test time. A way to attain this is to enforce that the generator matches a given category. Let k* be the value from the i*-th discrete column D_{i*} that has to be matched by the generated samples; then the generator can be interpreted as the conditional distribution of rows given that particular value at that particular column, i.e. P_G(row | D_{i*} = k*). For this reason, in this paper we name it the Conditional generator, and a GAN built upon it is referred to as a Conditional GAN. Moreover, in this paper we construct our TGAN as a Conditional GAN, with two main modules: the conditional generator and the critic.

Integrating a conditional generator into the architecture of a GAN requires dealing with the following issues: 1) it is necessary to devise a representation for the condition as well as to prepare an input for it, 2) it is necessary for the generated rows to preserve the condition as it is given, and 3) it is necessary for the conditional generator to learn the real data conditional distribution, i.e. P_G(row | D_{i*} = k*) = P(row | D_{i*} = k*), so that the real distribution can be reconstructed as

    P(row) = Σ_{k ∈ D_{i*}} P_G(row | D_{i*} = k) P(D_{i*} = k).

We present a solution that consists of three key elements, namely: the conditional vector, the generator loss, and the training-by-sampling method.

Figure 2: TGAN structure.

Conditional vector. We introduce the vector cond as the way of indicating the condition (D_{i*} = k*). Recall that, after the reversible data transformation, all the discrete columns end up as one-hot vectors, such that the i-th discrete column becomes the one-hot vector d_i with components d_i^(k), for k = 1, ..., |D_i|. Let m_i, with components m_i^(k) for k = 1, ..., |D_i|, be the i-th mask vector associated to the i-th one-hot vector d_i. Hence, the condition can be expressed in terms of these mask vectors by setting m_i^(k) = 1 if i = i* and k = k*, and 0 otherwise.

Then, define the vector cond as cond = m_1 ⊕ ... ⊕ m_{N_d}. For instance, for two discrete columns D_1 = {1, 2, 3} and D_2 = {1, 2}, the condition (D_2 = 1) is expressed by the mask vectors m_1 = [0, 0, 0] and m_2 = [1, 0]; so cond = [0, 0, 0, 1, 0].
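A small illustrative helper (the function name and the example category sizes are hypothetical) showing how such a cond vector could be assembled:

```python
import numpy as np

def make_cond_vector(category_sizes, star_col, star_val):
    """Build cond = m_1 ⊕ ... ⊕ m_{N_d}: all-zero masks except a single one
    at position star_val of column star_col."""
    masks = [np.zeros(size) for size in category_sizes]
    masks[star_col][star_val] = 1.0
    return np.concatenate(masks)

# Two discrete columns with 3 and 2 categories; condition: column 2 takes its first value.
print(make_cond_vector([3, 2], star_col=1, star_val=0))  # -> [0. 0. 0. 1. 0.]
```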

Generator loss. During training, the conditional generator is free to produce any set of one-hot discrete vectors d̂_1, ..., d̂_{N_d}. In particular, given the condition (D_{i*} = k*) in the form of the cond vector, nothing in the feed-forward pass prevents it from producing a d̂_{i*} that violates the condition. The mechanism proposed to enforce the conditional generator to produce d̂_{i*} = m_{i*} is to penalize its loss by adding the cross-entropy between m_{i*} and d̂_{i*}, averaged over all the instances of the batch. Thus, as training advances, the generator learns to make an exact copy of the given m_{i*} into d̂_{i*}.
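For illustration, one way this penalty could be written in PyTorch, assuming the generator returns per-column category probabilities (variable names are hypothetical):

```python
import torch

def condition_loss(d_hat_star, m_star):
    """Cross-entropy between the mask m_{i*} (a one-hot target) and the generated
    probabilities d̂_{i*} for the conditioned column, averaged over the batch."""
    eps = 1e-8
    ce = -(m_star * torch.log(d_hat_star + eps)).sum(dim=1)
    return ce.mean()

# Batch of 2: the target category is index 1 out of 3.
m_star = torch.tensor([[0., 1., 0.], [0., 1., 0.]])
d_hat_star = torch.tensor([[0.2, 0.7, 0.1], [0.8, 0.1, 0.1]])
print(condition_loss(d_hat_star, m_star))  # larger when the condition is violated
```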

Training-by-sampling. The output produced by the conditional generator must be assessed by the critic, which estimates the distance between the learned conditional distribution P_G(row | cond) and the conditional distribution on real data P(row | cond). The sampling of real training data and the construction of the cond vector should comply with this goal in order to help the critic estimate the distance. There are two naive possibilities: either we randomly select an instance (row) from the table and then read the condition attribute from it, or we randomly select an attribute (column) and a value from its range and then select a row by filtering the table on that value. Clearly, the first is not appropriate for our goal, because we cannot ensure that all the values of discrete attributes are sampled evenly during the training process. On the other hand, if we consider all the discrete columns equally likely, randomly select one, and then consider all the values in its range equally likely, a row from a very low frequency category may be excessively oversampled; so once again this is not an appropriate choice. Thus, for our purposes, we propose the following steps (a code sketch follows the list):

  1. Create N_d zero-filled mask vectors m_i, for i = 1, ..., N_d, so that the i-th mask vector corresponds to the i-th discrete column and each of its components is associated to one category of that column.

  2. Randomly select a discrete column out of all the N_d discrete columns, with equal probability. Let i* be the index of the selected column, as illustrated in Figure 2.

  3. Construct a PMF across the range of values of the selected column D_{i*}, such that the probability mass of each value is the logarithm of its frequency in that column.

  4. Let k* be a value randomly selected according to the PMF above. For instance, in Figure 2, the range of the selected column has two values and the first one was selected, so k* = 1.

  5. Set the k*-th component of the i*-th mask to one, i.e. m_{i*}^(k*) = 1.

  6. Calculate the vector cond = m_1 ⊕ ... ⊕ m_{N_d}. Figure 2 shows the resulting masks and the cond vector for the example.
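A compact sketch of this sampling procedure (the helper name, the +1 smoothing inside the logarithm, and the toy columns are made up for illustration):

```python
import numpy as np

def sample_condition(discrete_columns, rng):
    """discrete_columns: list of 1-D integer arrays, one per discrete column,
    holding the category index of every training row."""
    sizes = [int(col.max()) + 1 for col in discrete_columns]
    masks = [np.zeros(s) for s in sizes]                        # step 1

    i_star = rng.integers(len(discrete_columns))                # step 2
    freq = np.bincount(discrete_columns[i_star], minlength=sizes[i_star])
    log_freq = np.log(freq + 1)                                 # step 3 (+1 avoids log(0))
    pmf = log_freq / log_freq.sum()

    k_star = rng.choice(sizes[i_star], p=pmf)                   # step 4
    masks[i_star][k_star] = 1.0                                 # step 5
    cond = np.concatenate(masks)                                # step 6

    # Pick a real row that satisfies the condition, for the critic's real input.
    row_idx = rng.choice(np.flatnonzero(discrete_columns[i_star] == k_star))
    return cond, i_star, k_star, row_idx

rng = np.random.default_rng(0)
cols = [rng.integers(0, 3, size=1000), rng.integers(0, 2, size=1000)]
print(sample_condition(cols, rng))
```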

We use the PacGAN framework (Lin et al., 2018), taking several samples from the training data in each pac. The training algorithm under this framework is described in Algorithm 1. It begins by creating as many condition vectors cond_j, and drawing as many samples z_j from the standard MVN, as the batch size (lines 1-3). Both are fed forward into the conditional generator to produce a batch of fake rows (line 4). The input to the critic is twofold: on one hand, real rows sampled from the training tabular data according to the cond vectors; on the other hand, the output of the conditional generator. Both are grouped into pacs as detailed in lines 7 and 8 before being fed into the critic to obtain its loss (line 9). In lines 10-12 we follow (Gulrajani et al., 2017) to compute the gradient penalty for the critic. To update the parameters of the critic we use a gradient descent step with the Adam optimizer (line 13). In order to update the parameters of the conditional generator, it is first necessary to repeat the feed-forward steps in the conditional generator (lines 1-7) and in the critic, which yields the loss of the conditional generator; in this step the critic is not updated. Then, we use a gradient descent step similar to the one used for the critic's parameters.

Finally, the conditional generator is implemented as a fully connected network that takes z ⊕ cond as input, uses batch normalization and ReLU activations in its hidden layers, and emits each normalized continuous value with tanh and each mode-indicator and discrete one-hot vector with gumbel-softmax. The critic takes a pac of generated or real rows together with their cond vectors as input and is a fully connected network with leaky ReLU activations and dropout that outputs a single score per pac.
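A PyTorch-style sketch consistent with that description (layer widths, network depth, dropout rate, and the gumbel temperature are assumptions made for illustration, not settings reported in this text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim, cond_dim, out_dims, hidden=256):
        # out_dims: sizes of the output chunks (1 for each alpha, then the
        # one-hot widths of each beta and each discrete column).
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(z_dim + cond_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in out_dims)

    def forward(self, z, cond):
        h = self.body(torch.cat([z, cond], dim=1))
        outs = []
        for head in self.heads:
            if head.out_features == 1:          # scalar alpha -> tanh
                outs.append(torch.tanh(head(h)))
            else:                               # mode indicator / discrete column -> gumbel-softmax
                outs.append(F.gumbel_softmax(head(h), tau=0.2))
        return torch.cat(outs, dim=1)

class Critic(nn.Module):
    def __init__(self, row_dim, cond_dim, pac=10, hidden=256):
        super().__init__()
        self.pac = pac
        self.net = nn.Sequential(
            nn.Linear((row_dim + cond_dim) * pac, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
            nn.Linear(hidden, 1),
        )

    def forward(self, rows, cond):
        # The batch size must be divisible by the pac size.
        x = torch.cat([rows, cond], dim=1)
        return self.net(x.view(-1, x.size(1) * self.pac))  # one score per pac
```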

Generate synthetic data for different purposes. During testing, the user has to provide the conditional TGAN with both a random MVN vector z (as for any other GAN) and a cond vector properly constructed according to the discrete columns and their ranges of values. Users can construct cond to generate rows with a specific value in a discrete column, for example to generate only rows in which a chosen column takes a particular category. In our experiments, the conditioned column is sampled uniformly and its value follows the marginal distribution of that column, so that the generated data are expected to reveal the real data distribution.

Input: Training data T_train; conditional generator parameters Φ_G; critic parameters Φ_C; batch size m; pac size pac.
Result: Conditional generator parameters Φ_G and critic parameters Φ_C updated.
 1: Create masks via training-by-sampling, one set per instance j = 1, ..., m
 2: Create condition vectors cond_j from the masks, for j = 1, ..., m
 3: Sample z_j from the standard MVN, for j = 1, ..., m
 4: Generate fake rows r̂_j ← Generator(z_j, cond_j), for j = 1, ..., m
 5: Sample row indices from T_train that satisfy each condition cond_j
 6: Get the corresponding real rows r_j, for j = 1, ..., m
 7: Group the condition vectors into pacs
 8: Group the fake rows and the real rows into pacs
 9: Compute the critic loss on the real and fake pacs
10: Sample random interpolation coefficients
11: Interpolate between real and fake pacs
12: Compute the gradient penalty (Gulrajani et al., 2017)
13: Update Φ_C with a gradient descent step (Adam optimizer)
14: Regenerate fake data following lines 1 to 7 and compute the conditional generator loss (the critic score plus the condition cross-entropy penalty)
15: Update Φ_G with a gradient descent step (Adam optimizer)
Algorithm 1: Train TGAN for one step.
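For concreteness, a simplified PyTorch version of one such training step (a sketch under the same assumptions as the architecture sketch above; the gradient-penalty weight is an assumed value, and the condition cross-entropy term is left as a comment):

```python
import torch

def train_step(generator, critic, opt_g, opt_c, real_rows, cond, z, pac=10, gp_weight=10.0):
    """One TGAN step: update the critic with WGAN-GP, then the generator."""
    # --- Critic update (Algorithm 1, lines 1-13) ---
    fake_rows = generator(z, cond).detach()
    c_real = critic(real_rows, cond)
    c_fake = critic(fake_rows, cond)

    # Gradient penalty on interpolations between real and fake rows, per pac.
    eps = torch.rand(real_rows.size(0) // pac, 1).repeat_interleave(pac, dim=0)
    inter = (eps * real_rows + (1 - eps) * fake_rows).requires_grad_(True)
    grads = torch.autograd.grad(critic(inter, cond).sum(), inter, create_graph=True)[0]
    gp = ((grads.view(grads.size(0) // pac, -1).norm(2, dim=1) - 1) ** 2).mean()

    loss_c = c_fake.mean() - c_real.mean() + gp_weight * gp
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # --- Generator update (Algorithm 1, lines 14-15) ---
    fake_rows = generator(z, cond)
    loss_g = -critic(fake_rows, cond).mean()  # plus the condition cross-entropy penalty
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_c.item(), loss_g.item()
```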

5 Benchmarking synthetic data generation algorithms

There are multiple deep learning methods for modeling tabular data. We notice that all methods and their corresponding papers neither employed the same datasets nor were evaluated under similar metrics. This fact made comparison challenging and did not allow for identifying each method’s weaknesses and strengths vis-a-vis the intrinsic challenges presented when modeling tabular data. To address this, we developed a comprehensive benchmarking suite.

5.1 Baselines and datasets

Our baselines consist of Bayesian networks (CLBN (Chow and Liu, 1968), PrivBN (Zhang et al., 2017)), and implementations of current deep learning approaches for synthetic data generation (MedGAN (Choi et al., 2017), VeeGAN (Srivastava et al., 2017), TableGAN (Park et al., 2018)). This library, along with its easy-to-use APIs, is described in the supplementary material. More datasets and methods can be easily added.

To sharpen the comparison and motivate further development, we also added a VAE baseline, called TVAE. TVAE uses the same preprocessing as TGAN. The structure and loss function of the VAE have been adapted to model tabular data (details can be found in the supplementary material).

Simulated data: We handcrafted a simulated data oracle S to represent a known joint distribution, then sampled T_train and T_test from S. This oracle is either a Gaussian mixture model or a Bayesian network. We followed (Srivastava et al., 2017) to generate the Grid and Ring Gaussian mixture oracles. We add a random offset to each mode in Grid and call it GridR. We pick 4 well-known Bayesian networks - alarm, child, asia, and insurance[3] - and construct Bayesian network oracles.

[3] The structure of these Bayesian networks can be found at http://www.bnlearn.com/bnrepository/.

Real datasets: We picked commonly used machine learning feature-and-label tables: adult, census, covertype, intrusion, and news from the UCI machine learning repository (Dua and Graff, 2017), and credit from Kaggle. We also binarized the MNIST (LeCun and Cortes, 2010) dataset and converted each sample to a 784-dimensional vector plus one label column to mimic high-dimensional binary data; we call this MNIST28. We resized the images to 12 x 12 and used the same process to generate MNIST12.

5.2 Evaluation metrics

Given that the evaluation of generative models is not a straightforward process and that different metrics yield substantially diverse results (Theis et al., 2016), our benchmark evaluates multiple metrics on multiple datasets. Simulated data have a known probability distribution and are used to evaluate likelihood fitness, whereas real datasets come from real machine learning tasks and can be used to evaluate machine learning efficacy. Figure 3 illustrates the evaluation framework.

Likelihood fitness metric: On simulated data, we take advantage of the simulated data oracle S to compute the likelihood fitness metrics. First, we compute the likelihood L_syn of T_syn under S. Second, we retrain the simulated data generator S' using T_syn; S' has the same structure but different parameters than S. If S is a Gaussian mixture model, we use the same number of Gaussian components and retrain the mean and covariance of each component. If S is a Bayesian network, we keep the same graphical structure and learn a new conditional distribution on each edge. We then compute the likelihood L_test of T_test on S'. This second metric overcomes the issue in L_syn: it can detect mode collapse. But it introduces the prior knowledge of the structure of S, which is not necessarily encoded in T_syn.
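A simplified sketch of these two likelihood metrics for a Gaussian-mixture oracle (a toy setup with made-up data; the noisy copy of T_train stands in for an arbitrary synthesizer):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Oracle S: a known 2-component Gaussian mixture; sample train and test tables.
S = GaussianMixture(n_components=2).fit(
    np.vstack([rng.normal(-3, 1, (500, 2)), rng.normal(3, 1, (500, 2))]))
T_train, _ = S.sample(2000)
T_test, _ = S.sample(2000)

# Any synthesizer trained on T_train would go here; as a stand-in we just
# add noise to the training rows.
T_syn = T_train + rng.normal(0, 0.5, T_train.shape)

# L_syn: average log-likelihood of the synthetic rows under the true oracle S.
L_syn = S.score(T_syn)

# L_test: refit an oracle S' with the same structure on the synthetic data,
# then score the held-out test rows under S'.
S_prime = GaussianMixture(n_components=2).fit(T_syn)
L_test = S_prime.score(T_test)

print(L_syn, L_test)
```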

Machine learning efficacy: For a real dataset, we cannot compute the likelihood fitness; instead we evaluate the performance of using synthetic data as training data for machine learning. We train prediction models on T_syn and test them on T_test. We evaluate the performance of classification tasks using accuracy and F1, and evaluate the regression task using R^2. For each dataset, we select classifiers or regressors that achieve reasonable performance on that data. (Models and hyperparameters can be found in the supplementary material as well as in our benchmark framework.) Since we are not trying to pick the best classification or regression model, we take the average performance of the selected prediction models as the metric for machine learning efficacy.
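A sketch of this evaluation loop with scikit-learn (the model list and hyperparameters here are illustrative placeholders, not the exact models listed in Table 6):

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def ml_efficacy(T_syn, T_test, label_col):
    """Train each classifier on synthetic rows, evaluate F1 on real test rows,
    and report the average across models."""
    X_syn, y_syn = np.delete(T_syn, label_col, axis=1), T_syn[:, label_col]
    X_test, y_test = np.delete(T_test, label_col, axis=1), T_test[:, label_col]
    models = [
        AdaBoostClassifier(n_estimators=50),
        DecisionTreeClassifier(max_depth=20),
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=300),
    ]
    scores = []
    for model in models:
        model.fit(X_syn, y_syn)
        scores.append(f1_score(y_test, model.predict(X_test), average="macro"))
    return float(np.mean(scores))
```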

Figure 3: Evaluation framework on simulated data (left) and real data (right).

6 Experiments and results

We evaluate CLBN, PrivBN, MedGAN, VeeGAN, TableGAN, TGAN, and TVAE using our benchmark framework. All models are trained with the same batch size and for the same number of epochs; each epoch contains N / batch_size steps, where N is the number of rows in the training set. For TGAN, we use the hyperparameters described in Section 4. Hyperparameters for TVAE can be found in the supplementary material. We posit that for any dataset, across any metric except L_syn, the best performance is achieved by T_train itself. Thus we present the Identity method, which simply outputs T_train.

method        grid              gridr             ring
              L_syn    L_test   L_syn    L_test   L_syn    L_test
Identity      -3.06    -3.06    -3.06    -3.07    -1.70    -1.70
CLBN(2)       -3.68    -8.62    -3.76    -11.60   -1.75    -1.70
PrivBN(4)     -4.33    -21.67   -3.98    -13.88   -1.82    -1.71
MedGAN(7)     -10.04   -62.93   -9.45    -72.00   -2.32    -45.16
VEEGAN(6)     -9.81    -4.79    -12.51   -4.94    -7.85    -2.92
TableGAN(5)   -8.70    -4.99    -9.64    -4.70    -6.38    -2.66
TVAE(1)       -2.86    -11.26   -3.41    -3.20    -1.68    -1.79
TGAN(3)       -5.63    -3.69    -8.11    -4.31    -3.43    -2.19

method        asia              alarm             child             insurance
              L_syn    L_test   L_syn    L_test   L_syn    L_test   L_syn    L_test
Identity      -2.23    -2.24    -10.3    -10.3    -12.0    -12.0    -12.8    -12.9
CLBN(3)       -2.44    -2.27    -12.4    -11.2    -12.6    -12.3    -15.2    -13.9
PrivBN(1)     -2.28    -2.24    -11.9    -10.9    -12.3    -12.2    -14.7    -13.6
MedGAN(5)     -2.81    -2.59    -10.9    -14.2    -14.2    -15.4    -16.4    -16.4
VEEGAN(7)     -8.11    -4.63    -17.7    -14.9    -17.6    -17.8    -18.2    -18.1
TableGAN(6)   -3.64    -2.77    -12.7    -11.5    -15.0    -13.3    -16.0    -14.3
TVAE(2)       -2.31    -2.27    -11.2    -10.7    -12.3    -12.3    -14.7    -14.2
TGAN(4)       -2.56    -2.31    -14.2    -12.6    -13.4    -12.7    -16.5    -14.8

method        adult    census   credit   cover.   intru.   mnist12  mnist28  news
              F1       F1       F1       Macro    Macro    Acc      Acc      R2
Identity      0.669    0.494    0.720    0.652    0.862    0.886    0.916    0.14
CLBN(3)       0.334    0.310    0.409    0.319    0.384    0.741    0.176    -6.28
PrivBN(4)     0.414    0.121    0.185    0.270    0.384    0.117    0.081    -4.49
MedGAN(6)     0.375    0.000    0.000    0.093    0.299    0.091    0.104    -8.80
VEEGAN(6)     0.235    0.094    0.000    0.082    0.261    0.194    0.136    -6.5e6
TableGAN(5)   0.492    0.358    0.182    0.000    0.000    0.100    0.000    -3.09
TVAE(1)       0.626    0.377    0.098    0.433    0.511    0.793    0.794    -0.20
TGAN(1)       0.601    0.391    0.672    0.324    0.528    0.394    0.371    -0.43
Table 3: Benchmark results over three sets of experiments, namely Gaussian mixture simulated data, Bayesian network simulated data, and real data. The number in brackets is the rank of each method (lower is better). It is computed as follows: for each set of experiments, (1) rank the algorithms on every metric in the set; (2) take the average of all ranks of each algorithm to get one score per algorithm; (3) rank these scores again.

Experimental results are shown in Table 3. In the continuous data case, CLBN and PrivBN suffer because continuous data are discretized. MedGAN, VeeGAN, and TableGAN all suffer from mode collapse. With mode-specific normalization, our model performs well on 2D continuous datasets.
On datasets generated from Bayesian networks, CLBN and PrivBN have a natural advantage. Our TGAN achieves slightly better performance than MedGAN and TableGAN. Surprisingly, TableGAN works well on discrete datasets, despite treating discrete columns as continuous values. Our reasoning is that in our simulated data most columns have fewer than 4 categories, so the conversion does not cause serious problems. On real datasets, TVAE and TGAN outperform CLBN and PrivBN, whereas the other GAN models cannot achieve results as good as the Bayesian networks. With respect to large-scale real datasets, learning a high-quality Bayesian network is difficult, and there is a significant performance gap between real data and synthetic data generated by a learned Bayesian network.
TVAE outperforms TGAN in several cases, but GANs do have several favorable attributes, and this result does not indicate that we should always use VAEs rather than GANs to model tables. The generator in a GAN does not have access to real data during the entire training process; thus, TGAN can be made differentially private more easily than TVAE.

7 Conclusion

In this paper we attempt to find a flexible and robust model to learn the distribution of columns with complicated distributions. We observe that none of the existing deep generative models can outperform Bayesian networks that discretize continuous values and learn greedily. We identify several properties that make this task unique and propose our TGAN model. Empirically, we show that our model can learn better distributions than Bayesian networks. As future work, we would like to derive a theoretical justification of why GANs can work on a distribution with both discrete and continuous data.

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017.
  • Aviñó et al. [2018] Laura Aviñó, Matteo Ruffini, and Ricard Gavaldà. Generating synthetic but plausible healthcare record datasets. In KDD workshop on Machine Learning for Medicine and Healthcare, 2018.
  • Bishop [2006] Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
  • Camino et al. [2018] Ramiro Camino, Christian Hammerschmidt, and Radu State. Generating multi-categorical samples with generative adversarial networks. In ICML workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
  • Che et al. [2017] Zhengping Che, Yu Cheng, Shuangfei Zhai, Zhaonan Sun, and Yan Liu. Boosting deep learning risk prediction with generative adversarial networks for electronic health records. In International Conference on Data Mining. IEEE, 2017.
  • Choi et al. [2017] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, and Jimeng Sun. Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare Conference. PMLR, 2017.
  • Chow and Liu [1968] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees. IEEE transactions on Information Theory, 14(3):462–467, 1968.
  • Cormode et al. [2012] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differentially private spatial decompositions. In International Conference on Data Engineering. IEEE, 2012.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Goodfellow et al. [2014] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, 2017.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on International Conference on Machine Learning, 2015.
  • Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2016.
  • Jordon et al. [2019] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. Pate-gan: Generating synthetic data with differential privacy guarantees. In International Conference on Learning Representations, 2019.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2013.
  • LeCun and Cortes [2010] Yann LeCun and Corinna Cortes. MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.
  • Lin et al. [2018] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. Pacgan: The power of two samples in generative adversarial networks. In Advances in Neural Information Processing Systems, 2018.
  • Park et al. [2018] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and Youngmin Kim. Data synthesis based on generative adversarial networks. In International Conference on Very Large Data Bases, 2018.
  • Patki et al. [2016] Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. The synthetic data vault. In International Conference on Data Science and Advanced Analytics. IEEE, 2016.
  • Reiter [2005] Jerome P Reiter. Using cart to generate partially synthetic public use microdata. Journal of Official Statistics, 21(3):441, 2005.
  • Srivastava et al. [2017] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. Veegan: Reducing mode collapse in gans using implicit variational learning. In Advances in Neural Information Processing Systems, 2017.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Sun et al. [2018] Yi Sun, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Learning vine copula models for synthetic data generation. In AAAI Conference on Artificial Intelligence, 2018.
  • Theis et al. [2016] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In International Conference on Learning Representations, 2016.
  • Yahi et al. [2017] Alexandre Yahi, Rami Vanguri, Noémie Elhadad, and Nicholas P Tatonetti. Generative adversarial networks for electronic health records: A framework for exploring and evaluating methods for predicting drug-induced laboratory test trajectories. In NIPS workshop on machine learning for health care, 2017.
  • Yu et al. [2017] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI Conference on Artificial Intelligence, 2017.
  • Zhang et al. [2016] Jun Zhang, Xiaokui Xiao, and Xing Xie. Privtree: A differentially private algorithm for hierarchical decompositions. In International Conference on Management of Data. ACM, 2016.
  • Zhang et al. [2017] Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. Privbayes: Private data release via bayesian networks. ACM Transactions on Database Systems, 42(4):25, 2017.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In International Conference on Computer Vision, pages 2223-2232. IEEE, 2017.

8 Details about Benchmark

The statistical information of the simulated and real datasets is given in Table 5. The raw data for the 8 real datasets are available online.

For each dataset, we select a few classifiers or regressors that give reasonable performance on that dataset, as shown in Table 6.

dataset model accuracy f1 macro_f1 micro_f1 r2
adult Adaboost (estimator=50) 86.07% 68.03%
Decision Tree (depth=20) 79.84% 65.77%
Logistic Regression 79.53% 66.06%
MLP (50) 85.06% 67.57%
census Adaboost (estimator=50) 95.22% 50.75%
Decision Tree (depth=30) 90.57% 44.97%
MLP (100) 94.30% 52.43%
covtype Decision Tree (depth=30) 82.25% 73.62% 82.25%
MLP (100) 70.06% 56.78% 70.06%
credit Adaboost (estimator=50) 99.93% 76.00%
Decision Tree (depth=30) 99.89% 66.67%
MLP (100) 99.92% 73.31%
intrusion Decision Tree (depth=30) 99.91% 85.82% 99.91%
MLP (100) 99.93% 86.65% 99.93%
mnist12 Decision Tree (depth=30) 84.10% 83.88% 84.10%
Logistic Regression 87.29% 87.11% 87.29%
MLP (100) 94.40% 94.34% 94.40%
mnist28 Decision Tree (depth=30) 86.08% 85.89% 86.08%
Logistic Regression 91.42% 91.29% 91.42%
MLP (100) 97.28% 97.26% 97.28%
news Linear Regression 0.1390
MLP (100) 0.1492
Table 6: Classifiers and regressors selected for each real dataset and corresponding performance.

8.1 Data Format

We converted all the datasets into a float array in the interest of consistency. The array has the same number of rows and columns as the original table. It keeps the exact values as the original table for continuous columns. For discrete columns, each category is converted to an integer index. The array stores the index for each category. A separate metafile is created for each dataset, storing the name of the column, the range of a continuous column, and the index to category mapping for a discrete column.
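A small illustrative sketch of this format (column names, categories, and the metafile keys are made up for the example):

```python
import json
import numpy as np

# Original table: one continuous column ("age") and one discrete column ("sex").
rows = [(23.5, "F"), (41.0, "M"), (35.2, "F")]

categories = ["F", "M"]                      # index-to-category mapping
data = np.array([[age, categories.index(sex)] for age, sex in rows], dtype=float)

meta = {
    "columns": [
        {"name": "age", "type": "continuous", "min": 23.5, "max": 41.0},
        {"name": "sex", "type": "discrete", "i2s": categories},
    ]
}

np.save("data.npy", data)                    # float array: raw values + category indices
with open("meta.json", "w") as f:            # separate metafile with ranges and mappings
    json.dump(meta, f)
```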

8.2 Current available methods

We provide several baseline methods in our framework. Some of the methods were not designed to generate tabular data, so we make small changes to adapt them to all the datasets in the benchmark. The results of the experiments can be reproduced using the default hyperparameters.

CLBN uses the Chow-Liu algorithm [Chow and Liu, 1968] to create a tree-structured Bayesian network. For continuous columns, we evenly discretize them into 15 bins. We use the implementation in the pomegranate package (https://pomegranate.readthedocs.io/en/latest/index.html).

PrivBN uses a heuristic method to construct a differentially private Bayesian network [Zhang et al., 2017]. We wrap the authors' C++ implementation (https://sourceforge.net/projects/privbayes/) into our benchmark framework. For continuous columns, we evenly discretize them into 15 bins. We set the privacy budget to a fairly large value so that the method models the data accurately instead of adding too much noise.

MedGAN [Choi et al., 2017] is a GAN-based synthetic data generator. The authors released their implementation (https://github.com/mp2893/medgan), but it only supports continuous or binary data; it does not support multi-category discrete data or a mix of data types. We modify the autoencoder to support such data. For simplicity, we assume continuous columns are min-max normalized to [0, 1], and discrete columns use a one-hot representation. The model contains four components:

  • An encoder that encodes a row into a dense vector.

  • A decoder that decodes a row from a dense vector.

  • A generator that projects 128-dimensional Gaussian noise to a row.

  • A discriminator that takes a row together with averages computed over the minibatch as features and predicts whether the row is real or fake.

The encoder and decoder are pretrained by minimizing the autoencoder reconstruction loss. After pretraining, the autoencoder is fixed for the rest of the training period.

The generator and discriminator are then trained adversarially. The Adam optimizer is used; the learning rate is 1e-3 and the l2 weight decay is 1e-3.

VeeGAN [Srivastava et al., 2017] adds a reconstructor network to the GAN to mitigate mode collapse. It is shown to be useful on the grid dataset, so we adapt this method to the other datasets. The authors released their implementation at https://github.com/akashgit/VEEGAN. For simplicity, we assume continuous columns are min-max normalized to [-1, 1].[4] The model contains 3 components:

[4] We observe that normalizing continuous columns to [-1, 1] and using tanh gives better performance than [0, 1] and sigmoid in VeeGAN.

  • A generator that takes a standard Gaussian noise vector and projects it to a row of data.

  • A discriminator that takes the data together with the hidden vector and tries to decide whether the pair is real or fake.

  • A reconstructor that reconstructs the hidden vector from the data.

The discriminator, generator, and reconstructor are optimized jointly following the VeeGAN objective.

TableGAN [Park et al., 2018] is a data synthesizer using convolutional neural networks. It considers all columns as continuous values: discrete columns are treated as integer indices, and all columns are min-max normalized. We denote a row by the resulting (N_c + N_d)-dimensional vector of all column values, which is then padded and wrapped into a square matrix.[5] The structure of the model is described in terms of this matrix. To describe the model, we use three operations:

[5] To accommodate the larger datasets in our benchmark, the matrix size is selected automatically from a small set of candidate sizes.

  • an operation that replaces the entry representing the label column in the matrix with a given value;

  • an operation that extracts the value of the label column from the matrix;

  • an operation that computes the expected standard deviation.

The model contains three components:

  • a generator that uses deconvolutions to project 100-dimensional standard Gaussian noise to a matrix;

  • a discriminator that distinguishes real matrices from generated ones;

  • a classifier that predicts the label column from all other columns.

The discriminator is trained with the usual adversarial loss. The generator is trained by minimizing the sum of an adversarial loss, an information loss, and a classification loss.[6] The classifier is trained with a supervised loss on the label column. If the dataset is not a binary classification task, the classifier is disabled and the classification loss term is set to zero.

[6] In Park et al. [2018], the L2 norm is used for some of these loss terms, while the L1 norm is used in their implementation.

8.3 TVAE Model

The VAE simultaneously trains a generative model p_θ(x | z) and an inference model q_φ(z | x) by minimizing the evidence lower-bound (ELBO) loss [Kingma and Welling, 2013]

    L(θ, φ; x) = -E_{q_φ(z|x)}[ log p_θ(x | z) ] + KL( q_φ(z | x) || p(z) ),      (1)

where p(z) is the prior over the latent code. Usually p(z) is a standard multivariate Gaussian distribution N(0, I). Moreover, p_θ(x | z) and q_φ(z | x) are parameterized using neural networks and optimized using gradient descent.

When using a VAE to model rows in tabular data, each row is preprocessed as in Section 4.1, and this affects the design of the decoder network, which must be structured so that the preprocessed rows can be modeled accurately and trained effectively. In our design, the neural network outputs a joint distribution over all output variables, corresponding to the α, β, and d components of the transformed row. We assume each α follows a Gaussian distribution with its own mean and variance, and all β and d follow categorical PMFs. Here is our design.
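A minimal sketch of one possible realization of such a decoder and its reconstruction loss under the assumptions just stated (layer widths and head layout are illustrative choices, not the authors' reported design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TabularDecoder(nn.Module):
    """Maps a latent code z to parameters of the output distributions:
    a Gaussian (mean + log-variance) for each alpha, and categorical logits
    for each beta / discrete column."""
    def __init__(self, z_dim, n_continuous, onehot_sizes, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.alpha_mu = nn.Linear(hidden, n_continuous)
        self.alpha_logvar = nn.Linear(hidden, n_continuous)
        self.cat_heads = nn.ModuleList(nn.Linear(hidden, k) for k in onehot_sizes)

    def forward(self, z):
        h = self.body(z)
        return self.alpha_mu(h), self.alpha_logvar(h), [head(h) for head in self.cat_heads]

def recon_loss(mu, logvar, cat_logits, alpha_true, cat_targets):
    """Negative log-likelihood of a preprocessed row: Gaussian NLL for the alphas,
    cross-entropy for each one-hot part (betas and discrete columns)."""
    gauss = 0.5 * (logvar + (alpha_true - mu) ** 2 / logvar.exp()).sum(dim=1)
    cats = sum(F.cross_entropy(logits, target, reduction="none")
               for logits, target in zip(cat_logits, cat_targets))
    return (gauss + cats).mean()
```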