Learning Vine Copula Models For Synthetic Data Generation

12/04/2018
by   Yi Sun, et al.

A vine copula model is a flexible high-dimensional dependence model which uses only bivariate building blocks. However, the number of possible configurations of a vine copula grows exponentially as the number of variables increases, making model selection a major challenge in development. In this work, we formulate a vine structure learning problem with both vector and reinforcement learning representation. We use neural network to find the embeddings for the best possible vine model and generate a structure. Throughout experiments on synthetic and real-world datasets, we show that our proposed approach fits the data better in terms of log-likelihood. Moreover, we demonstrate that the model is able to generate high-quality samples in a variety of applications, making it a good candidate for synthetic data generation.

1 Introduction

The machine learning (ML) community is increasingly interested in generative modeling. Broadly, generative modeling consists of modeling either the joint distribution of data and classes, for supervised learning, or the joint distribution of data alone, for unsupervised learning. In tasks involving classification, a generative model is useful for augmenting small labeled datasets, which are especially problematic when developing deep learning applications for novel fields [Wang and Perez2017].

In unsupervised learning (clustering), a generative model supports the development of model-driven algorithms, where each cluster is represented by a model. These algorithms scale better than data-driven algorithms, and are typically faster and more reliable. Additionally, they are crucial tools in tasks that do not easily fit the supervised/unsupervised paradigm, including, for example, survival analysis - e.g. predicting when a chronically ill person will return to the hospital, or how long a project will last on Kickstarter [Vinzamuri, Li, and Reddy2014].

Synthetic data generation - i.e. sampling new instances from the joint distribution - can also be carried out by a generative model. Synthetic data has found multiple uses within machine learning. On one hand, it is useful for testing the scalability and robustness of new algorithms; on the other, it can be shared openly while preserving the privacy and confidentiality of the actual data [Li et al.2014].

The central underlying problem for generative modeling is to construct a joint probability distribution function, usually high-dimensional and comprising both continuous and discrete random variables. This is often accomplished by using probabilistic graphical models (PGM) such as Bayesian networks (BN) and conditional random fields (CRF), to mention only two of many possible approaches [Jordan1999, Koller and Friedman2009]. In PGM, the joint probability distribution obtained is simplified by assumptions on the dependence between variables, which is represented in the form of a graph. Thus the major task for PGM is learning such a graph; this problem is often understood as a structure learning task that can be solved constructively, adding one node at a time while attempting to maximize the likelihood or some information criterion. In PGM, continuous variables are quantized before a structure is learned or parameters are identified, and, due to this quantization, this solution loses information and scales poorly.

Copula functions are joint probability distributions in which any univariate continuous probability distribution can be plugged in as a marginal. Thus, the copula captures the joint behaviour of the variables and models the dependence structure, whereas each marginal models the individual behaviour of its corresponding variable. In other words, the choice of the copula and the marginals directly yields the construction of the joint probability distribution [Nelsen2006]. However, in practice there are many bivariate copula families but only a few multivariate ones, with the Gaussian copula and the T-copula being the most prominent. For this reason, these two families have been used extensively, leading to models that most of the time outperform the multivariate normal (MVN). However, these models still assume a dependence structure that may only loosely capture the interaction between a subset of variables. The strongest and most well-known case against the abuse of the Gaussian copula can be found in [MacKenzie and Spears2014]. The key problem pinpointed in that work is that financial quantities are seldom jointly linear, and even when they are, the measure of such association, i.e. the correlation, is not stable across time. Therefore, the Gaussian bivariate copula, whose parameter is the correlation between its covariates, is a bad choice. Other copula families depend on non-linear degrees of association such as Kendall's τ, but it is equally unwise to model the joint behavior of more than two covariates with any of them in the hope that a single scalar will capture all the pairwise dependencies.
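
To make the τ-based alternative concrete, the sketch below estimates Kendall's τ from data and converts it to a Clayton copula parameter via the standard moment-matching relation θ = 2τ/(1 − τ); the simulated data is purely illustrative and not taken from the paper.

import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
x = rng.gamma(2.0, size=1000)
y = np.exp(x) + rng.normal(scale=0.5, size=1000)  # monotone but non-linear association

tau, _ = kendalltau(x, y)
theta = 2 * tau / (1 - tau)   # Clayton moment matching: tau = theta / (theta + 2)
print(f"Kendall's tau = {tau:.3f}, Clayton theta = {theta:.3f}")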

Vine copulas provide a powerful alternative for modeling dependence in high-dimensional distributions [Kurowicka and Joe2011]. To explicate, consider the same case presented in [MacKenzie and Spears2014] under three distinct generative models: first the multivariate normal distribution (MVN), next the Gaussian copula, and finally a vine model. For the sake of simplicity, let us also assume that there are three covariates involved, x_i for i = 1, 2, 3, with probability density functions (PDF) f_i(x_i). Hence, the MVN correlation matrix is 3×3. Since the marginals of the MVN are also normal whereas the actual marginals might be very different, samples from the MVN are very likely to represent the actual data poorly. In this case, one would then proceed by finding the marginal distributions F_i that best match the actual marginals. Together with the Gaussian copula, a more accurate model is thus constructed and sampled. If the covariates are not jointly linear, samples will still differ from the actual data; for example, when x_1 and x_2 occur jointly in similar rank positions, but not necessarily in linear correlation, Kendall's rank correlation τ is a much better parameter, and this pair of covariates would be better modeled by a copula parameterized by τ, such as the Clayton, the Frank or the Gumbel family, to mention only the most popular ones. This copula would be the first block of the vine model. Following the example, we next have to attach the third covariate to either the left end or the right end of the tree; the decision is made according to some metric such as likelihood, information criteria or goodness of fit, with another copula selected by the same procedure. This completes the first tree for three covariates. For the second tree, the copulas (edges) of the previous tree become nodes that are to be linked. Since we only have two, they can only form one possible bond, and the construction ends because we cannot iterate once more.

The dependence structure in vine copulas is constructed using bivariate copulas as building blocks; thus, two or more variables can exhibit completely different behavior from the rest. A vine is represented as a sequence of trees, organized in levels, so vines can be considered PGMs. Another distinctive feature of vine copulas is that the joint probability density factorization resulting from the entire graph can be derived by the chain rule. In other words, theoretically there are no assumptions of independence between pairs of variables; in fact, the building block in such a case is the independence copula. However, in practice usually only the top levels of a vine are constructed. Therefore, learning a vine has the cost of learning the graph, tree by tree, plus the cost of selecting the function that bonds each pair of nodes.

The problem is more challenging when the vine is pruned from one level down to the last. Yet this effort is rewarded with a superior and more realistic model.

This paper presents the following contributions. First, we formulate the model selection problem for regular vines as a reinforcement learning (RL) problem and relax the tree-by-tree modeling assumption in which each tree level is selected sequentially. Moreover, we use long short-term memory (LSTM) networks in order to learn from vine configurations explored both recently and much earlier in training. Second, as far as we are aware, this work is the first to use regular vine copula models for generative modeling and synthetic data generation. Finally, a novel and practical technique for evaluating model accuracy is a side result of synthetic data generation: we propose that a model can be admitted if it generates data that yields performance similar to the actual data across a number of ML techniques, e.g. decision trees, SVMs and neural networks.

The rest of the paper is organized as follows. In section 2, we present the literature related to the topics of this paper. In section 3, we introduce the definition and construction of a regular vine copula model. Section 4 describes our proposed learning approach for constructing a regular vine model. In section 5, we apply the algorithm to several synthetic and real datasets and evaluate the model in terms of fitness and the quality of the generated samples.

2 Motivation and Related work

The rise of deep learning (DL) in the current decade has brought forth new machine learning techniques such as convolutional neural networks (CNN), long short-term memory networks (LSTM) and generative adversarial networks (GAN) [Goodfellow, Bengio, and Courville2016]. These techniques outrank the state of the art in problems from many fields, but require large datasets for training, which can be a significant problem given that collecting data is often expensive or time consuming. Even when data has already been collected, it often cannot be released due to privacy or confidentiality issues. Synthetic data generation, currently a well-researched topic in machine learning, provides a promising solution to these problems [Alzantot, Chakraborty, and Srivastava2017, Libes, Lechevalier, and Jain2017, Soltana, Sabetzadeh, and Briand2017, Sagduyu, Grushin, and Shi2018]. Generative models - that is, high-dimensional multivariate probability distribution functions - are a natural way to generate data. More recently, the rise of GANs and their variations has provided a way of generating realistic synthetic images [Goodfellow, Bengio, and Courville2016, Ratner et al.2017].

Copula functions, and PGM involving copulas and vines, have gained momentum in ML since the early proposals of copula-based regression models [Kolev and Paiva2009] and copula Bayesian networks [Elidan2010]. Gaussian process vine copulas were introduced in [Lopez-Paz, Hernandez-Lobato, and Ghahramani2013]. Copula Discriminant Analysis (CODA) is a high-dimensional classification method based on the Gaussian copula proposed in [Han, Zhao, and Liu2013]. The multi-task copula was introduced in [Zhou and Tao2014] for multi-task learning. A copula approach has been applied to the joint modeling of longitudinal measurements and survival times in AIDS studies [Ganjali and Baghfalaki2015]. A vine copula classifier has performed competitively compared to the four best classification methods presented at the Mind Reading Challenge Competition 2011 [Carrera, Santana, and Lozano2016].

In this paper we obtain such models by means of a copula-based PGM known as a vine. Vines were first introduced in [Bedford and Cooke2001] as a probabilistic construction of multivariate distributions based on bivariate copulas as simple building blocks. These constructions are organized graphically as a sequence of nested undirected trees. Compared to black-box deep learning models, a vine copula has better interpretability, since it uses a graph-like structure to represent correlations between variables. However, learning a vine model is generally a hard problem. In general, there exist n!/2 · 2^((n−2)(n−3)/2) different n-dimensional regular vines on n variables, and B^(n(n−1)/2) different combinations of bivariate copula families, where B is the size of the set of candidate bivariate families [Morales-Napoles2010].
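
As a rough illustration of how quickly this search space explodes, the following sketch evaluates the counts quoted above; the helper names are ours, not the paper's.

from math import comb, factorial

def num_regular_vines(n: int) -> int:
    # Morales-Napoles (2010): n!/2 * 2^C(n-2, 2) labeled regular vines on n variables
    return factorial(n) // 2 * 2 ** comb(n - 2, 2)

def num_models(n: int, n_families: int) -> int:
    # each of the n(n-1)/2 edges independently picks one of the candidate families
    return num_regular_vines(n) * n_families ** (n * (n - 1) // 2)

for n in (3, 5, 10):
    print(n, num_regular_vines(n), num_models(n, 5))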

To reduce the complexity of model selection, [Dißmann et al.2013] proposed a tree-by-tree approach that selects each tree sequentially, using a greedy maximum-spanning-tree algorithm whose edge weights are chosen to reflect large dependencies. Although this method works well in lower-dimensional problems, the greedy approach does not ensure optimal solutions for high-dimensional data. Gruber and Czado proposed a Bayesian approach to estimate the regular vine structure along with the pair copula families from an arbitrary set of candidate families [Gruber and Czado2015, Gruber and Czado2018]. However, the sequential, tree-by-tree Bayesian approach is computationally intensive and cannot be used for more than 20 dimensions. A novel approach to high-dimensional copulas has recently been proposed by Müller and Czado [Müller and Czado2018].

In this paper, we reformulate model selection as a sequential decision-making process and cast it as an RL problem, which we solve with policy learning [Sutton and Barto1998]. Additionally, when constructing the vine, a decision made in the first tree can limit the choices available when constructing subsequent trees. Therefore, we cannot assume the Markovian property, i.e. that the next state depends only on the current state and current decisions. This non-Markovian character suggests the use of Long Short-Term Memory (LSTM) networks [Hochreiter and Schmidhuber1997]. LSTM in conjunction with model-free RL was presented in [Bakker2001] as a solution to non-Markovian RL tasks with long-term dependencies.

3 Regular Vine Copula

Figure 1: (a-h) All the different layouts of every possible tree in a 5-dim vine. (a-c) correspond to the 1st level, (d-e) correspond to the 2nd level, and (f), (g) and (h) are the unique possible layouts for trees in the 3rd, 4th and 5th levels, respectively.
Figure 2: An example of a 5-dim vine constructed with layouts {b,e,f,g,h} from Figure 1, and a valid choice of edges in it. For the sake of clarity, only one edge is shown with the detail of its copula family and parameter.

In this section we summarize the essential facts about the meaning of the vine graphical model and how it is transformed into a factorization of the copula density function that models the dependence structure. A deeper explanation about vines can be found in [Aas et al.2009].

According to Sklar’s theorem, the copula density function is what is needed to complete the joint probability density of continuous covariates when they are not independent. In other words, if the individual behaviour of covariate x_i is given by the marginal probability density function f_i(x_i), for i = 1, …, n, then the copula brings the dependence structure into the joint probability distribution. Such a structure is also independent of every covariate distribution, so it is usually represented as a function of a different set of variables v_1, …, v_n, referred to as transformed covariates. Formally expressed, we have:

f(x_1, …, x_n) = c(v_1, …, v_n) · ∏_{i=1}^{n} f_i(x_i),

where v_i = F_i(x_i) and F_i is the marginal cumulative distribution function of x_i.

There are many families of parametric bivariate copula functions but only a few parametric n-variate ones. Among the latter, the Gaussian and T copulas are quite flexible because they depend on the correlation matrix of the transformed covariates.
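
As a minimal illustration of this construction, the sketch below assembles a bivariate joint density from a Gaussian copula and two arbitrary marginals; the exponential and gamma choices are illustrative assumptions, not taken from the paper.

import numpy as np
from scipy.stats import norm, expon, gamma, multivariate_normal

def gaussian_copula_density(u1, u2, rho):
    # c(u1, u2) = phi2(z1, z2; rho) / (phi(z1) * phi(z2)), with z_i = Phi^{-1}(u_i)
    z1, z2 = norm.ppf(u1), norm.ppf(u2)
    joint = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    return joint.pdf([z1, z2]) / (norm.pdf(z1) * norm.pdf(z2))

def joint_density(x1, x2, rho=0.6):
    # f(x1, x2) = c(F1(x1), F2(x2)) * f1(x1) * f2(x2), per Sklar's theorem
    u1, u2 = expon.cdf(x1), gamma.cdf(x2, a=2.0)
    return gaussian_copula_density(u1, u2, rho) * expon.pdf(x1) * gamma.pdf(x2, a=2.0)

print(joint_density(1.0, 2.0))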

A better approach is to use bivariate copulas as building blocks for multivariate copulas. This solution was first presented in [Joe1996]. Bedford and Cooke [Bedford and Cooke2001] later developed a more general factorization of multivariate densities and introduced regular vines.

Definition 3.1

(R-vine on n variables) A regular vine on n variables consists of n−1 connected trees T_1, …, T_{n−1} that satisfy the following:

  1. T_1 consists of the node set N_1 = {1, …, n}, where each variable is represented by exactly one node, and the edge set E_1, with each edge representing a copula that links two variables.

  2. For i = 2, …, n−1, the tree T_i consists of the node set N_i = E_{i−1} and the edge set E_i.

  3. Each tree T_i has exactly n − i edges, for i = 1, …, n−1. Two nodes in tree T_1 can always form an edge in E_1. Two nodes in tree T_i, with i ≥ 2, can form an edge in E_i only if their corresponding edges in tree T_{i−1} share a common node.
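
The third condition, often called the proximity condition, is easy to check mechanically. A minimal sketch, with edges encoded as frozensets of the previous tree's node labels (the encoding is ours, not the paper's):

def can_join(edge_a: frozenset, edge_b: frozenset) -> bool:
    # two nodes of T_i (i >= 2) may be joined only if, as edges of T_{i-1},
    # they share a common node
    return len(edge_a & edge_b) > 0

# In T_2 of a 4-variable vine, nodes are edges of T_1, e.g. for the path
# 1-2-3-4 (a D-vine first tree):
t1_edges = [frozenset({1, 2}), frozenset({2, 3}), frozenset({3, 4})]
print(can_join(t1_edges[0], t1_edges[1]))  # True: the edges share node 2
print(can_join(t1_edges[0], t1_edges[2]))  # False: disjoint, not allowed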

The regular vine copula has density function defined as:

c(v_1, …, v_n) = ∏_{i=1}^{n−1} ∏_{e ∈ E_i} c_{j(e),k(e)|D(e)}(v_{j(e)|D(e)}, v_{k(e)|D(e)}; θ_e),

where j(e) and k(e) are the conditioned variables of edge e and D(e) is its conditioning set; the c_e are the bivariate copula families selected for the vine and the θ_e are the corresponding parameters.

For the sake of clarity, let us consider 5 variables x_1, …, x_5, and assume that they have already been transformed via their corresponding marginals into v_1, …, v_5, so v_i = F_i(x_i). According to Definition 3.1, tree T_1 will have 5 nodes and 4 edges. Therefore, its layout will necessarily be one of those displayed in Figure 1(a-c). Each of these layouts leads to many possible and different trees T_1, depending on how the variables are arranged in the layout. For this example, let us assume that the arrangement happens to be the one shown in tree T_1 of Figure 2, in which one dependence (edge) is explicitly shown to be modeled with a Clayton copula and its parameter θ. The layout of the next tree, T_2, may or may not be one of Figure 1(d-e); that depends on whether it satisfies the third requirement. In this example, the layout in Figure 1(a) forces the layout of T_2 to be exclusively the one in Figure 1(d), resulting in the so-called D-vine. On the other hand, the layout in Figure 1(b) allows both Figure 1(d) and Figure 1(e) as layouts for T_2. When arranging the variables in the layout, it is good practice to write the actual variables, not the edges of the preceding tree. Then, all variables shared by both nodes of an edge become conditioning variables, and the remaining ones are conditioned. Eventually, the factorization of the copula density is the product of all the nodes from T_2 on. Thus, the copula density of the vine shown in Figure 2 is the product of its ten pair-copula densities, where each factor c_{i,j|D} denotes the bivariate copula density of v_i and v_j conditioned on the set D.

4 Methodology

Model selection for vines is essentially a combinatorial problem over a huge search space, which is hard to solve with heuristic approaches. Dißmann et al. proposed a tree-by-tree approach [Dißmann et al.2013] that selects a maximum spanning tree at each level according to pairwise dependence. The problem with this locally greedy approach is that there is no guarantee it will select a structure close to the optimal one.

In this section, we describe in detail our learning approach for regular vine copula construction. In order to feed a vine configuration to a neural-network-based meta-learning approach, we need to represent it in a compact way. Complexity arises because the construction of each tree level depends on the previous level's tree. The generated vine also needs to satisfy the following desirable properties:

  • Tree Property Each level in the vine must be a tree with no cycles.

  • Dependence Structure The layout of each level's tree depends on the previous level's tree.

  • Sparsity In general, a sparse model, in which most edges are independence copulas, is preferred.

Next we compare two different representations for a regular vine model: Vector representation and RL representation.

Vector Representation

An intuitive way to embed a vine is to flatten all edges of the vine into a vector. Since the number of edges in the vine is fixed, the network can output a vector of edge indices. The proposed representation is depicted in Figure 3.

Let the set of edges in tree T_i be E_i. The log-likelihood of a generated configuration V on a transformed data point v can be computed as:

log L(v; V) = Σ_{i=1}^{n−1} Σ_{e ∈ E_i} log c_{j(e),k(e)|D(e)}(v_{j(e)|D(e)}, v_{k(e)|D(e)}; θ_e).

If the generated configuration contains a cycle at level i, a penalty is subtracted from the objective function. The penalty decreases with the level, reflecting that, since later trees depend on earlier trees, a violation of the tree property at an early level should incur a larger penalty. Let the number of cycles at level i be c_i; the penalty for violating the tree property is then the weighted sum

p_T = Σ_{i=1}^{n−1} w_i · c_i,

with weights w_i decreasing in i.

In practice, most variables are not correlated, and vines tend to be sparse. To avoid over-fitting, we favor vine graphs with fewer edges by adding a penalty term p_S, defined as the number of non-independence edges in the configuration:

p_S = Σ_{i=1}^{n−1} |{e ∈ E_i : c_e is not the independence copula}|.

At each training iteration, we maximize the following objective function:

J(V) = log L(v; V) − α · p_T − β · p_S,

where α and β are hyper-parameters that can be tuned.
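
A hedged sketch of this objective follows; the union-find cycle count and the 1/(level + 1) discount schedule are illustrative assumptions, since the exact penalty weights are not spelled out here.

def count_cycles(edges, n_nodes):
    # union-find: every edge whose endpoints are already connected closes a cycle
    parent = list(range(n_nodes))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    cycles = 0
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            cycles += 1
        else:
            parent[ra] = rb
    return cycles

def objective(loglik, edges_per_level, n_nodes, n_nonindep_edges, alpha=1.0, beta=0.1):
    # earlier levels get a larger penalty weight, as the text requires
    p_tree = sum(count_cycles(e, n_nodes) / (lvl + 1)
                 for lvl, e in enumerate(edges_per_level))
    return loglik - alpha * p_tree - beta * n_nonindep_edges

print(objective(-120.0, [[(0, 1), (1, 2), (0, 2)]], n_nodes=3, n_nonindep_edges=3))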

Figure 3: An example of the vector representation of a vine. The model takes a random initial configuration vector and searches for the best vector maximizing the objective function using a fully connected neural network (FCNN). The first element of the output vector ("0") means that the first node in T_1 is not connected; the second element ("1") means that the second node in T_1 is connected to node 1, and so on. The layout on the right shows how the output vector is assembled into a regular vine.

Reinforcement Learning Representation

One problem with the vector representation is that the vine configurations generated are not guaranteed to satisfy the tree property at each training step. On the other hand, the construction of the vine model can also be seen as a sequential decision-making process: at each step, the set of nodes is partitioned into two sets, V_in and V_out, where V_in denotes the set of nodes that have already been added to the vine and V_out the set of nodes that are not yet in the vine. When building tree T_i, at each step a node in V_out is selected and linked to a node in V_in, which is equivalent to adding a new edge to T_i. Since a pair of nodes already in the vine will never be selected at the same time, no cycles can form, and therefore the tree property is maintained throughout the construction.

After we obtain the set of edges for tree T_i, we repeat the process for the next tree level. The decision process can be defined as a fixed-length sequence of actions choosing edges (j, k), where j ∈ V_in and k ∈ V_out. For an untruncated vine with n variables, the total number of edges adds up to n(n−1)/2. Motivated by recent developments in RL, we reformulate the problem in its syntax.
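
The following small sketch illustrates why this construction maintains the tree property: every action joins a node inside the vine to one outside it, so each step adds exactly one new vertex and one new edge (names such as valid_actions are ours):

def valid_actions(v_in: set, v_out: set):
    # every admissible action pairs an in-vine node with an out-of-vine node
    return [(j, k) for j in v_in for k in v_out]

v_in, v_out = {1}, {2, 3, 4}
edges = []
while v_out:
    j, k = valid_actions(v_in, v_out)[0]   # a learned policy would sample here
    edges.append((j, k))
    v_in.add(k)
    v_out.remove(k)
print(edges)  # a spanning tree on {1, 2, 3, 4}: 3 edges, no cycles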

States

Let e_t be the t-th edge added to the vine model, which consists of a pair of indices (j, k). At step t, the state can be represented by the current set of edges in the vine, E_t = {e_1, …, e_t}, together with the partitioned vertex sets V_in and V_out. Formally, s_t = (E_t, V_in, V_out) is the state representation of the current vine at step t.

Actions

The set of possible actions consists of all pairs of nodes that can be added to the current tree. Formally, A_t = {(j, k) : j ∈ V_in, k ∈ V_out}.

Rewards

The log-likelihood of fitting a data point v to the vine at step t decomposes over the set of trees built so far and the set of independent nodes in V_out that have not yet been added to the tree:

ℓ_t(v) = Σ_{e ∈ E_t} log c_e(v_{j(e)|D(e)}, v_{k(e)|D(e)}; θ_e),

where the nodes still in V_out contribute zero, since the independence copula has density 1.

Then the incremental reward at state s_t can be defined as:

r_t = ℓ_t(v) − ℓ_{t−1}(v) = log c_{e_t}(v_{j(e_t)|D(e_t)}, v_{k(e_t)|D(e_t)}; θ_{e_t}),

where e_t is the edge newly added to the vine.

As before, we add a penalty term p_S to avoid over-fitting. Hence, the total reward is defined as:

R = Σ_t r_t − β · p_S.

Policy Network

Let p_data be the true underlying distribution, and p_θ the distribution induced by the network. Our goal is to learn a sequence of edges (pairs of indices) of length T = n(n−1)/2 such that the following objective function is maximized:

J(θ) = E_{v ∼ p_data} E_{a_{1:T} ∼ p_θ} [R].

Let π_θ be the stochastic policy defined implicitly by the network and parameterized by θ. The policy gradient of the above objective function can be computed as:

∇_θ J(θ) = E_{v ∼ p_data} E_{a_{1:T} ∼ π_θ} [ R · Σ_{t=1}^{T} ∇_θ log π_θ(a_t | s_t) ].

At each step, the policy network outputs a distribution over all possible actions, from which an action (edge) is sampled. Following standard practice, the expectation is approximated by sampling m data points and several action sequences per data point. Additionally, we use discounted rewards and subtract a baseline term to reduce the variance of the gradients [Greensmith, Bartlett, and Baxter2004].

procedure Training
     for number of training iterations do
          Sample m examples {x^(1), …, x^(m)} from the real distribution
          Transform the examples into {v^(1), …, v^(m)} via the marginals
          E ← ∅; initialize V_in with a random node and V_out with the rest
          while total num of edges < n(n−1)/2 do
               sample action (j, k) ∼ π_θ(a | s)
               Find the copula family and parameter θ_e for edge (j, k)
               V_in ← V_in ∪ {k}, V_out ← V_out ∖ {k}
               E ← E ∪ {(j, k)}
               Calculate step reward r_t
          update the model by descending its
           stochastic gradient −∇_θ J(θ)
Algorithm 1 Vine Learning
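
A condensed PyTorch sketch of Algorithm 1 is given below for a single tree level. The architecture details (an LSTMCell over membership masks, a moving-average baseline) and the placeholder edge_loglik reward are our assumptions; in the paper the reward comes from the fitted pair copula.

import torch
import torch.nn as nn

class EdgePolicy(nn.Module):
    def __init__(self, n_nodes, hidden=64):
        super().__init__()
        self.lstm = nn.LSTMCell(2 * n_nodes, hidden)
        self.head = nn.Linear(hidden, n_nodes * n_nodes)  # a score for every (j, k) pair
        self.n = n_nodes

    def forward(self, v_in_mask, v_out_mask, state):
        x = torch.cat([v_in_mask, v_out_mask]).unsqueeze(0)
        h, c = self.lstm(x, state)
        scores = self.head(h).view(self.n, self.n)
        # mask invalid actions: j must already be in the vine, k outside it
        valid = v_in_mask.unsqueeze(1) * v_out_mask.unsqueeze(0)
        scores = scores.masked_fill(valid == 0, float('-inf'))
        return torch.distributions.Categorical(logits=scores.flatten()), (h, c)

def edge_loglik(j, k):
    return torch.randn(()).item()  # placeholder reward; the paper fits a pair copula here

n = 5
policy = EdgePolicy(n)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
baseline = 0.0
for it in range(100):
    v_in = torch.zeros(n); v_in[0] = 1.0   # start the tree from node 0
    v_out = 1.0 - v_in
    state, logps, rewards = None, [], []
    for _ in range(n - 1):                 # first tree only, for brevity
        dist, state = policy(v_in, v_out, state)
        a = dist.sample()
        j, k = a.item() // n, a.item() % n
        logps.append(dist.log_prob(a))
        rewards.append(edge_loglik(j, k))
        v_in[k], v_out[k] = 1.0, 0.0
    R = sum(rewards)
    baseline = 0.9 * baseline + 0.1 * R    # moving-average baseline reduces variance
    loss = -(R - baseline) * torch.stack(logps).sum()  # REINFORCE
    opt.zero_grad(); loss.backward(); opt.step()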

When constructing the vine, a decision made in the first tree can affect the nodes in deeper trees. Therefore, we cannot guarantee the Markovian property, i.e. that the next state depends only on the current state and current decisions. A natural choice is to adopt an LSTM as a solution to this non-Markovian reinforcement learning task with long-term dependencies. Once the configuration is determined, we can find the copula family and parameter for each edge following the method described in the next section.

Figure 4: An example of a training step for a 4-dim vine with the RL formulation. The set of nodes is partitioned into V_in and V_out. Node 1 in V_in and node 2 in V_out are sampled by the policy network, and the edge (1, 2) is added to the vine.

Pair Copula Selection

For each edge, we estimate the log-likelihood of the empirical density, as well as the fits to the left tail and to the right tail. We then combine these three measurements in a hard-voting fashion to select a bivariate copula family, following the three-way check method of [Veeramachaneni, Cuesta-Infante, and O’Reilly2015]. The parameter is estimated by maximum likelihood after the bivariate copula family is fitted. This provides an approximate reward for adding an edge, which guides the search through the space.
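
A hedged sketch of such a hard vote is shown below; the tail cutoffs (0.2 and 0.8) and the duck-typed copula objects are illustrative stand-ins for the criteria of the three-way check, whose exact thresholds are not given here.

import numpy as np

def select_family(u, v, candidates):
    """candidates: dict mapping family name -> fitted copula with .logpdf(u, v)."""
    def total_ll(c):  # empirical-density fit over all observations
        return np.sum(c.logpdf(u, v))
    def left_ll(c):   # fit restricted to the joint lower tail
        m = (u < 0.2) & (v < 0.2)
        return np.sum(c.logpdf(u[m], v[m])) if m.any() else -np.inf
    def right_ll(c):  # fit restricted to the joint upper tail
        m = (u > 0.8) & (v > 0.8)
        return np.sum(c.logpdf(u[m], v[m])) if m.any() else -np.inf

    votes = {name: 0 for name in candidates}
    for score in (total_ll, left_ll, right_ll):
        winner = max(candidates, key=lambda name: score(candidates[name]))
        votes[winner] += 1          # one vote per criterion, majority wins
    return max(votes, key=votes.get)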

Sampling From A Vine

Figure 5: D-Vine of 4 variables.

After learning a vine model from the data, we can sample synthetic data from it. Kurowicka and Cooke first proposed an algorithm to sample an arbitrary element from a regular vine, which they call the Edging Up Sampling Algorithm [Kurowicka and Cooke2007]. The sampling procedure requires a complete vine model on nodes v_1, …, v_n and their corresponding marginal distributions F_1, …, F_n. Consider the case where we have a D-vine model with 4 nodes, shown in Figure 5. We start by sampling u_1, …, u_4 independently from the uniform distribution over [0, 1], and randomly pick a node to start with, say v_1.

Then the first variable can be sampled as:

v_1 = u_1.     (1)

After we have v_1, we randomly pick a node connected to v_1. Suppose we pick v_2; recall that the conditional distribution can be written as:

f(v_2 | v_1) = c_{1,2}(v_1, v_2) · f(v_2)     (2)
F(v_2 | v_1) = ∂C_{1,2}(v_1, v_2) / ∂v_1     (3)
             = h(v_2 | v_1; θ_{1,2}).     (4)

Thus, v_2 can be sampled by:

v_2 = h^{−1}(u_2 | v_1; θ_{1,2}),     (5)

where h^{−1} can be obtained from the copula C_{1,2} in T_1 by plugging in the sampled value of v_1.

Similarly, we pick a node that shares an edge with v_2, say v_3. Then v_3 can be sampled as:

v_3 = F^{−1}(u_3 | v_1, v_2),  where  F(v_3 | v_1, v_2) = h( h(v_3 | v_2; θ_{2,3}) | h(v_1 | v_2; θ_{1,2}); θ_{1,3|2} ).     (6)

Finally, v_4 can be sampled as:

v_4 = F^{−1}(u_4 | v_1, v_2, v_3).     (7)
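
For pair copulas with closed-form h-functions, the conditional-inverse steps above are direct to implement. The sketch below uses the Gaussian pair copula, for which h and its inverse are known in closed form; the choice of family is ours, for illustration.

import numpy as np
from scipy.stats import norm

def h(v, w, rho):
    # h(v | w; rho) = Phi((Phi^-1(v) - rho * Phi^-1(w)) / sqrt(1 - rho^2))
    return norm.cdf((norm.ppf(v) - rho * norm.ppf(w)) / np.sqrt(1 - rho**2))

def h_inv(u, w, rho):
    # closed-form inverse of the Gaussian h-function in its first argument
    return norm.cdf(rho * norm.ppf(w) + np.sqrt(1 - rho**2) * norm.ppf(u))

rng = np.random.default_rng(1)
u1, u2 = rng.uniform(size=2)
v1 = u1                       # equation (1)
v2 = h_inv(u2, v1, rho=0.7)   # equation (5) with a Gaussian pair copula
print(v1, v2)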

For a more general vine graph, we can use a modified Breadth First Search to traverse the tree and keep track of nodes that have already been sampled. The general procedure is described in Algorithm 2.

procedure Sampling
     visited ← empty list
     explore ← empty queue
     start ← a randomly chosen node
     explore.enqueue(start)
     while explore not empty do
          cur ← explore.dequeue()
          u ← sample from Uniform[0, 1]
          v_cur ← F^{−1}(u | variables already in visited)
          for s ∈ neighbor(cur) do
               if s ∈ visited then
                    continue
               else
                    explore.enqueue(s)
          visited.left_append(cur)
Algorithm 2 Sampling from a regular vine

5 Experiments

Baselines and Experiment Setup

We compare our models with the tree-by-tree greedy approach [Dißmann et al.2013], which selects a maximum spanning tree at each level according to pairwise dependence, as well as with the Bayesian approach [Gruber and Czado2015]. To test our approach, we first used constructed examples: we build vines manually, sample data from them, and then use our learning approach to recreate the vine. We use two of the constructed examples defined in [Min and Czado2010] and compare the results with both the greedy approach and the Bayesian approach. Second, we used three real data sets and compared the approaches. The network used for creating vines in the vector representation is a fully connected feed-forward neural network with two hidden layers; each layer uses ReLU as its activation function, and the output layer is normalized by a softmax. The network for the reinforcement learning representation is an LSTM. The algorithm is trained over 50 epochs in the experiments.

Constructed Examples

To illustrate a scenario where the greedy approach might fail, we independently generate data sets of size 500 from a pre-specified regular vine. The fitting ability of each model is measured by the ratio of the log-likelihood of the estimated model to the log-likelihood of the true underlying model.

  • Dense Vine A 6-dim vine where all bivariate copulas are non-independence copulas, with pre-specified bivariate families and parameters.

  • Sparse Vine A 6-dim vine where all trees consist of independence bivariate copulas except the first tree. In other words, the vine is truncated after the first level.

Method    Dense relative loglik (%)    Sparse relative loglik (%)    Dense T1 correct
Dißmann   76.6                         101.3                         No
Gruber    81.0                         100.6                         No
Vector    80.3                         99.8                          No
RL        84.2                         100.2                         Yes
Table 1: Comparison of relative log-likelihood on the constructed examples. All results reported here are based on 100 independent trials.

In the dense vine example, the RL vine is the only model able to recover the correct first tree, and it also achieves the highest relative log-likelihood. In the sparse vine example, all four models achieve similar results, among which Dißmann obtains the highest likelihood. As argued in [Min and Czado2010], the higher likelihood of Dißmann comes at the expense of over-fitting. The two examples demonstrate that our approach improves model fit under different scenarios. Moreover, for a 6-dimensional data set of size 500, the RL algorithm finishes in approximately 15 minutes on a single GPU.

Real Data

In this section, we apply the vine model to real data sets for the task of synthetic data generation. The three data sets picked comprise a binary classification problem, a multi-class classification problem and a regression problem. The batch size used is 64 for the breast cancer dataset and 128 for the other two. The three data sets used in the experiments are:

  • Wisconsin Breast Cancer

    Describes 30 variables computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, plus a binary variable indicating whether the mass is benign or malignant. This dataset includes 569 instances.

  • Wine Quality This dataset includes 11 physicochemical variables and a quality score between 0 and 10 for the red (1599 instances) and white (4898 instances) variants of the Portuguese "Vinho Verde" wine.

  • Crime The communities and crime dataset includes 100 variables related to crime, ranging from socio-economic data to law enforcement data, and an attribute to be predicted (per capita violent crime). This dataset has 1994 instances.

To evaluate the quality of the synthetic data, we first evaluate its log-likelihood. As shown in Table 2, synthetic data generated from RL Vine achieves the highest log-likelihood per instance on all three datasets.

Method       Breast Cancer   Wine    Crime
Dißmann      0.79            0.033   0.26
Vector Rep   0.84            0.037   0.27
RL Vine      0.91            0.045   0.31
Table 2: Log-likelihood per instance, with vines truncated after the third tree.

Besides log-likelihood, the quality of the generated synthetic data is also evaluated from a more practical perspective: a high-quality synthetic dataset should enable people to draw conclusions and make inferences as if they were working with the real data set. We first use both Dißmann's algorithm and our proposed algorithms to learn copula vine models from a real data set. We then generate synthetic data from each model, train models for the target variables on the synthetic training set as well as on the real training set, and use the real testing set to compute the corresponding evaluation metric (F1 score for classification and MSE for regression), as sketched below. For ease of computation, the learned vines are truncated after the third level, which means all pair copulas are assumed to be independent beyond the third level. All results reported are based on 10-fold cross-validation over different splits of the training and testing sets.
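
A minimal sketch of this evaluation protocol, using a decision tree as the end classifier; the scikit-learn calls are standard, but the helper efficacy is our own illustrative name.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def efficacy(X_real, y_real, X_syn, y_syn, seed=0):
    # hold out a real test set; train once on real data, once on synthetic data
    X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, random_state=seed)
    scores = {}
    for name, (X, y) in {"real": (X_tr, y_tr), "synthetic": (X_syn, y_syn)}.items():
        clf = DecisionTreeClassifier(random_state=seed).fit(X, y)
        scores[name] = f1_score(y_te, clf.predict(X_te))
    return scores  # close scores indicate high-quality synthetic data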

As shown in Table 3 and Table 4, synthetic data from RL Vine achieves the highest F1 scores among the synthetic data generators, and the results are comparable to real data. For the regression dataset, Table 5 shows that its synthetic data obtains the lowest mean squared error (MSE) among the models. These results demonstrate that our proposed model improves overall model selection and is able to generate reliable synthetic data.

             Decision Tree   SVM   5-layer MLP
Real Data
Dißmann
Vector Rep
RL Vine
Table 3: F1 scores of different end classifiers on the breast cancer dataset.

             Decision Tree   SVM   5-layer MLP
Real Data
Dißmann
Vector Rep
RL Vine
Table 4: F1 scores of different end classifiers on the wine quality dataset (averaged over 11 classes).

             Decision Tree   SVM   5-layer MLP
Real Data
Dißmann
Vector Rep
RL Vine
Table 5: MSE of different end models on the crime dataset.

6 Conclusion

In this paper we presented a meta-learning approach to creating vine models for high-dimensional data. Vine models allow the creation of flexible structures from bivariate building blocks. However, to learn the best possible model, one has to identify the best possible structure, which requires identifying the connections between the variables and selecting among multiple bivariate copulas for each pair in the structure. We formulated the problem as a sequential decision-making problem akin to reinforcement learning, and used long short-term memory networks to simultaneously learn the structure and select the bivariate building blocks. We compared our results to state-of-the-art approaches and found that we achieve significantly better performance across multiple data sets. We also showed that our approach can generate higher-quality synthetic data that could directly replace real data when training a machine learning model.

Acknowledgment

Dr. Cuesta-Infante is funded by the Spanish Government Research Project TIN-2015-69542-C2-1-R (MINECO/FEDER) and the Banco de Santander grant for the Computer Vision and Image Processing Excellence Research Group (CVIP). Dr. Kalyan Veeramachaneni and Yi Sun acknowledge the generous funding provided by Accenture and the National Science Foundation under the grant “CIF21 DIBBs: Building a Scalable Infrastructure for Data-Driven Discovery and Innovation in Education”, Award #1443068.

References

  • [Aas et al.2009] Aas, K.; Czado, C.; Frigessi, A.; and Bakken, H. 2009. Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics 44(2):182 – 198.
  • [Alzantot, Chakraborty, and Srivastava2017] Alzantot, M.; Chakraborty, S.; and Srivastava, M. 2017. Sensegen: A deep learning architecture for synthetic sensor data generation. In 2017 IEEE International Conference on Pervasive Computing and Communications Workshops, 188–193.
  • [Bakker2001] Bakker, B. 2001. Reinforcement Learning with Long Short-term Memory. In Proc. of the 14th Int. Conf. on Neural Information Processing Systems, NIPS’01.
  • [Bedford and Cooke2001] Bedford, T., and Cooke, R. M. 2001. Probability density decomposition for conditionally dependent random variables modeled by vines. Annals of Mathematics and Artificial Intelligence 32:245–268.
  • [Carrera, Santana, and Lozano2016] Carrera, D.; Santana, R.; and Lozano, J. A. 2016. Vine copula classifiers for the mind reading problem. Progress in Artificial Intelligence 5(4):289–305.
  • [Dißmann et al.2013] Dißmann, J.; Brechmann, E.; Czado, C.; and Kurowicka, D. 2013. Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis 59:52 – 69.
  • [Greensmith, Bartlett, and Baxter2004] Greensmith, E.; Bartlett, P. L.; and Baxter, J. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5:1471–1530.
  • [Elidan2010] Elidan, G. 2010. Copula bayesian networks. In Lafferty, J.; Williams, C. K. I.; Shawe-Taylor, J.; Zemel, R.; and Culotta, A., eds., Advances in Neural Information Processing Systems 23. 559–567.
  • [Ganjali and Baghfalaki2015] Ganjali, M., and Baghfalaki, T. 2015. A copula approach to joint modeling of longitudinal measurements and survival times using Monte Carlo expectation-maximization with application to AIDS studies. Journal of Biopharmaceutical Statistics 25(5):1077–1099.
  • [Goodfellow, Bengio, and Courville2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. The MIT Press.
  • [Gruber and Czado2015] Gruber, L., and Czado, C. 2015. Sequential bayesian model selection of regular vine copulas. Bayesian Analysis 10(4):937–963.
  • [Gruber and Czado2018] Gruber, L. F., and Czado, C. 2018. Bayesian model selection of regular vine copulas. Bayesian Analysis 13(4):1111–1135.
  • [Han, Zhao, and Liu2013] Han, F.; Zhao, T.; and Liu, H. 2013. Coda: high dimensional copula discriminant analysis. J. Mach. Learn. Res. 14(1):629–671.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
  • [Joe1996] Joe, H. 1996. Families of m-variate distributions with given margins and m(m−1)/2 bivariate dependence parameters. In Distributions with Fixed Marginals and Related Topics, IMS Lecture Notes - Monograph Series 28:120–141.
  • [Jordan1999] Jordan, M. I., ed. 1999. Learning in Graphical Models. Cambridge, MA, USA: MIT Press.
  • [Kolev and Paiva2009] Kolev, N., and Paiva, D. 2009. Copula-based regression models: A survey. Journal of statistical planning and inference 139(11):3847–3856.
  • [Koller and Friedman2009] Koller, D., and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press.
  • [Kurowicka and Cooke2007] Kurowicka, D., and Cooke, R. 2007. Sampling algorithms for generating joint uniform distributions using the vine-copula method. Computational Statistics & Data Analysis 51(6):2889–2906.
  • [Kurowicka and Joe2011] Kurowicka, D., and Joe, H. 2011. Dependence Modeling: Vine Copula Handbook. World Scientific Publishing Company.
  • [Li et al.2014] Li, H.; Xiong, L.; Zhang, L.; and Jiang, X. 2014. Dpsynthesizer: differentially private data synthesizer for privacy preserving data sharing. Proceedings of the VLDB Endowment 7(13):1677–1680.
  • [Libes, Lechevalier, and Jain2017] Libes, D.; Lechevalier, D.; and Jain, S. 2017. Issues in synthetic data generation for advanced manufacturing. In 2017 IEEE International Conference on Big Data (Big Data), 1746–1754.
  • [Lopez-Paz, Hernandez-Lobato, and Ghahramani2013] Lopez-Paz, D.; Hernandez-Lobato, J.; and Ghahramani, Z. 2013. Gaussian process vine copulas for multivariate dependence. In JMLR W&CP 28(2): Proceedings of The 30th International Conference on Machine Learning, 10–18. JMLR.
  • [MacKenzie and Spears2014] MacKenzie, D., and Spears, T. 2014. ‘The formula that killed Wall Street’: The Gaussian copula and modelling practices in investment banking. Social Studies of Science 44(3):393–417.
  • [Min and Czado2010] Min, A., and Czado, C. 2010. Bayesian inference for multivariate copulas using pair-copula constructions. Journal of Financial Econometrics 8(4):511–546.
  • [Morales-Napoles2010] Morales-Napoles, O. 2010. Counting vines. In Dependence Modeling: Vine Copula Handbook. World Scientific. 189–218.
  • [Müller and Czado2018] Müller, D., and Czado, C. 2018. Selection of sparse vine copulas in high dimensions with the lasso. Statistics and Computing (available online).
  • [Nelsen2006] Nelsen, R. B. 2006. An introduction to copulas. Springer Series in Statistics, 2nd. edition.
  • [Ratner et al.2017] Ratner, A. J.; Ehrenberg, H. R.; Hussain, Z.; Dunnmon, J.; and Ré, C. 2017. Learning to Compose Domain-Specific Transformations for Data Augmentation. In Proc. of Neural Information Processing Systems (NIPS) 2017.
  • [Sagduyu, Grushin, and Shi2018] Sagduyu, Y. E.; Grushin, A.; and Shi, Y. 2018. Synthetic social media data generation. IEEE Transactions on Computational Social Systems 1–16.
  • [Soltana, Sabetzadeh, and Briand2017] Soltana, G.; Sabetzadeh, M.; and Briand, L. C. 2017. Synthetic data generation for statistical testing. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), 872–882.
  • [Sutton and Barto1998] Sutton, R. S., and Barto, A. G. 1998. Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press.
  • [Veeramachaneni, Cuesta-Infante, and O’Reilly2015] Veeramachaneni, K.; Cuesta-Infante, A.; and O’Reilly, U.-M. 2015. Copula Graphical Models for Wind Resource Estimation. In Proc. of the 24th Int. Joint Conference on Artificial Intelligence, IJCAI’15, 2646–2654.
  • [Vinzamuri, Li, and Reddy2014] Vinzamuri, B.; Li, Y.; and Reddy, C. K. 2014. Active learning based survival regression for censored data. In Proc. of the 23rd ACM Int. Conference on Information and Knowledge Management, CIKM’14, 241–250.
  • [Wang and Perez2017] Wang, J., and Perez, L. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint.
  • [Zhou and Tao2014] Zhou, T., and Tao, D. 2014. Multi-task copula by sparse graph regression. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, 771–780. New York, NY, USA: ACM.