Implicit Generative Copulas

09/29/2021 ∙ by Tim Janke, et al. ∙ Technische Universität Darmstadt

Copulas are a powerful tool for modeling multivariate distributions as they allow the univariate marginal distributions and the joint dependency structure to be estimated separately. However, known parametric copulas offer limited flexibility, especially in high dimensions, while commonly used non-parametric methods suffer from the curse of dimensionality. A popular remedy is to construct a tree-based hierarchy of conditional bivariate copulas. In this paper, we propose a flexible, yet conceptually simple alternative based on implicit generative neural networks. The key challenge is to ensure marginal uniformity of the estimated copula distribution. We achieve this by learning a multivariate latent distribution with unspecified marginals but the desired dependency structure. By applying the probability integral transform, we can then obtain samples from the high-dimensional copula distribution without relying on parametric assumptions or the need to find a suitable tree structure. Experiments on synthetic and real data from finance, physics, and image generation demonstrate the performance of this approach.




1 Introduction

Many approaches are available to reliably model univariate distributions of real-valued variables. One can use various parametric distributions or employ flexible, non-parametric methods, e.g., by using empirical distribution functions, kernel density estimation (KDE), or quantile regression. Modeling distributions in higher dimensions is a much harder task. Parametric approaches using, e.g., Gaussian distributions, are common but inflexible. Many non-parametric approaches become infeasible in higher dimensions due to the curse of dimensionality. Recent successes in modeling multivariate distributions have been achieved with deep neural network-based likelihood-free implicit generative models [Mohamed2016], such as Generative Adversarial Networks (GANs) and Generative Moment Matching Networks (GMMNs) [Dziugaite2015; Li2015].

Copulas are a tool to decouple the modeling of the univariate marginal distributions from modeling the high-dimensional joint dependency structure [Joe2014]. This makes it possible to obtain high-quality marginal models with the mentioned univariate techniques. Moreover, the (conditional) marginals can be tailored to the problem at hand and many fields have developed sophisticated domain-specific univariate models for this task, e.g., for probabilistic forecasts in finance [Bollerslev1986], weather [Raftery2005], or energy [Hong2016]. These models can then be combined with a suitable copula structure to enable the simulation of multivariate quantities, e.g., [Patton2012; Moller2013; Tastu2015]. In the machine learning literature copulas have been used to increase the flexibility of Bayesian networks, to model multi-agent coordination in reinforcement learning, or for image generation via the Vine Copula Autoencoder [Tagasovska2019]. Another popular application of copulas is synthetic tabular data generation [Patki2016; Meyer2021]. Note that in many of these applications, the main goal is to sample from a multivariate distribution and the copula is used as a building block for a generative model.

High-dimensional copula distributions are commonly modeled via parametric structures such as the Gaussian or the Student-t copula [demarta2005t]. [Ling2020] propose to enhance the subclass of Archimedean copulas by learning the generator functions via deep neural networks. However, the family of Archimedean copulas is of limited use in higher dimensions as it makes restrictive symmetry assumptions. A more flexible, semi-parametric state-of-the-art approach is to use vine copulas, which are built from trees of bivariate pair-copulas [Aas2009]. These pair-copulas can again be parametric, such as the Gumbel or Clayton copula, or non-parametric, e.g., by using bivariate kernel density estimation [Nagler2017].

Directly applying implicit generative modeling to the task of estimating copula distributions is not straightforward since copula distributions are required to have uniform marginals. Such an approach has been proposed by [Letizia2020; Hofert2021], but without ensuring the uniformity property, at least not for finite sample sizes when the model is not expected to exactly fit the true copula distribution. Ensuring the marginal uniformity of the learned copula distribution is important since deviations from uniformity will result in unwanted alterations of the marginal distributions in the data space.

In this paper we show how to design and train implicit generative copula (IGC) models to match the dependency structure of given data. A key challenge is to ensure the marginal uniformity of the estimated distribution. We achieve this by first learning a latent distribution with unspecified marginals but the same dependency structure as the training data. We then obtain the desired copula model by applying the probability integral transform component-wise to the latent distribution. The IGC model and the data distribution are matched in copula space using the energy distance [Szekely2004]. During training, the probability integral transform is approximated through a differentiable softrank layer, which allows the use of gradient-based methods.

Our contributions

  • We propose the first universal, non-parametric model for estimating high-dimensional copula distributions with guaranteed uniformity of the marginal distributions.

  • We show how flexible likelihood-free implicit generative models based on deep neural networks can be trained for this task through the use of a differentiable softrank layer.

  • Compared with the state-of-the-art semi-parametric vine copula approach, we demonstrate similar or improved performance for different tasks.

We present the proposed copula model in Section 2. Model training is described in Section 3. Section 4 shows experimental results for synthetic and real data from finance, physics, and image generation. We conclude in Section 5.

2 Proposed IGC Model

We introduce our IGC model following the schematic shown in Figure 1. Figure 2 provides an exemplary application for a bivariate two-component Gaussian mixture data distribution.

Copula Basics

The vector-valued continuous random variable $X = (X_1, \dots, X_D)^\top$ represents the data source of our task. We denote its distribution by $P_X$ and the cumulative distribution function (cdf) of $X$ by $F_X$. Moreover, let $F_{X_d}$ be the cdf of the marginal distribution of $X_d$, the $d$-th component of $X$, for $d = 1, \dots, D$.

Each sample vector $x$ can be mapped to a vector $u$ by defining $u_d$ as the value of $x_d$ with respect to the corresponding marginal cdf of $X_d$, i.e., $u_d = F_{X_d}(x_d)$ for all $d$. This operation is called the probability integral transform (PIT). In the copula literature, values obtained by the PIT are called pseudo-observations [Joe2014]. The random variable $U$ follows a so-called copula distribution $P_U$ by construction, i.e., the distribution is defined on the unit space $[0,1]^D$ and its marginals are uniform. The joint cdf of $U$ is typically called the copula, which we denote by $C$ here. Sklar's theorem states that such a copula function exists for all random variables $X$ and it holds that $F_X(x) = C(F_{X_1}(x_1), \dots, F_{X_D}(x_D))$, i.e., any multivariate distribution can be expressed in terms of its marginals and its copula [sklar1959fonctions].
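As a small numerical illustration of the PIT, the following sketch uses a hypothetical two-dimensional data source (not one of the data sets used in the experiments) whose marginal cdfs are known in closed form. The choice of marginals, the correlation value, and all variable names are assumptions made for this example only.

```python
import math
import numpy as np


def std_normal_cdf(t):
    """Standard normal cdf, applied element-wise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(t) / math.sqrt(2.0)))


rng = np.random.default_rng(0)
rho = 0.8  # hypothetical correlation of the underlying Gaussian
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=20_000)

# Data with non-uniform, non-Gaussian marginals but Gaussian dependence:
# X_1 = exp(Z_1) is log-normal, X_2 = Z_2^3 is a cubed Gaussian.
x = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])

# PIT with the exact marginal cdfs:
# F_{X_1}(x) = Phi(log x) and F_{X_2}(x) = Phi(cbrt(x)).
u = np.column_stack([
    std_normal_cdf(np.log(x[:, 0])),
    std_normal_cdf(np.cbrt(x[:, 1])),
])

print("marginal means:", u.mean(axis=0))        # close to 1/2 for uniform marginals
print("marginal variances:", u.var(axis=0))     # close to 1/12 for uniform marginals
print("correlation in unit space:", np.corrcoef(u.T)[0, 1])
```

The pseudo-observations have (approximately) uniform marginals although the marginals of the data are strongly skewed, while the dependence between the two components is preserved in unit space.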

Figure 1: Illustration of the implicit generative copula (IGC) concept: We use a generative neural network $g_\theta$ with learnable parameters $\theta$ to map samples from a multivariate standard Normal distribution to samples of the latent distribution $P_Y$. The marginal cdfs of this distribution are then used to generate samples of the copula distribution $P_V$. The model is trained to match the copula distribution $P_U$ of the real data using the energy distance.
Figure 2: Contour plots and marginal densities of all data and model distributions mentioned in Figure 1 for the example of a two-dimensional mixture of Gaussians data distribution $P_X$. Contour plots are derived via KDE from a set of 10000 samples.

IGC Model

In this work, we aim at defining a flexible, non-parametric model for copula distributions $P_U$ in high dimensions. Such a model can either be used to (approximately) represent the copula distribution of the unit space vectors $u$ or the distribution of the original data vectors $x$, in which case the vectors $u$ have to be transformed component-wise with the inverse cdfs of the components $X_d$, i.e., $x_d = F_{X_d}^{-1}(u_d)$.

A flexible model class for learning high-dimensional probability distributions is the class of implicit generative models [Mohamed2016]. Here, latent random variables with a simple and known distribution are transformed via a parameterizable mapping, typically a deep neural network, to a complex distribution that approximates the distribution of the training data. A straightforward application of this framework to model copula distributions, as done in [Hofert2021] and [Letizia2020], is difficult. The output of the generative model can be guaranteed to lie in $[0,1]^D$ by applying an appropriate output layer, e.g., using a sigmoid function. However, guaranteeing the desired uniformity of the marginal distributions is not straightforward. We first experimented with additional training losses that penalize deviations from marginal uniformity. However, a simpler approach without any hyper-parameters for weighting additional cost terms and with a guarantee of marginal uniformity is the following two-step procedure.

In the first step, we model the distribution of the vector-valued latent random variable $Y \in \mathbb{R}^D$. To this end, we start with random variables $Z \in \mathbb{R}^K$ from a simple base distribution $P_Z$ that we choose as zero mean, unit variance Gaussian here. The tuneable map $g_\theta: \mathbb{R}^K \to \mathbb{R}^D$ with parameters $\theta$ is then used to transform the noise samples $z$ to the latent samples $y$ as

$$ y = g_\theta(z). \tag{1} $$

We denote the resulting probability distribution of $Y$ by $P_Y$ and the marginal cdfs by $F_{Y_d}$.

In a second step, we transform the latent variables $y$ into unit space vectors $v \in [0,1]^D$. To this end, we define component-wise for $d = 1, \dots, D$

$$ v_d = F_{Y_d}(y_d). \tag{2} $$

This transformation step is analogous to going from $x$ to $u$. It guarantees both that the sample values from the model lie in the unit interval $[0,1]$ and that they are uniformly distributed, regardless of the distribution $P_Y$. We refer to the distribution of $V$ as the model copula distribution and denote it by $P_V$.


Note that the distributions $P_X$ and $P_Y$ might be very different. More precisely, the mapping (1) can result in arbitrary marginal distributions but an optimal model of the distribution $P_Y$ would have

$$ F_Y(y) = C\big(F_{Y_1}(y_1), \dots, F_{Y_D}(y_D)\big). \tag{3} $$

Given a sufficiently rich function class for $g_\theta$, any distribution of $X$ can be modeled with small error [lu2020universal]. This statement naturally extends to universal approximation properties for the dependency structure of $X$, i.e., the unit space random vectors $U$. The approximation of the data copula distribution $P_U$ with our model copula distribution $P_V$ will thus be arbitrarily close if the generator function $g_\theta$ is flexible enough and if the optimal $\theta$ is selected.

3 Training Procedure

Problem statement

We aim at empirically estimating an IGC model from data, i.e., we want to determine the optimal parameter vector $\theta$ given a set of $M$ training samples $x_1, \dots, x_M$ from $P_X$. The underlying target is to optimally match the copula model $P_V$ to the (typically unknown) copula distribution $P_U$ from which the samples were generated.

Parameter estimation

We first transform the given data samples $x_i$ into unit space vectors $u_i$ in $[0,1]^D$, before starting the actual training procedure. Since we typically do not know the exact marginal cdfs of the components $X_d$, we estimate them from the given data. To this end, we use the empirical cdfs separately for each component and define for $d = 1, \dots, D$ and $i = 1, \dots, M$,

$$ u_{i,d} = \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}(x_{j,d} \le x_{i,d}). \tag{4} $$

Here, $\mathbb{1}(A)$ is the indicator function of event $A$, i.e., it is $1$ iff $A$ is true. Note that this is an unbiased and consistent estimator for the true cdf [vanderVaart2000].
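The empirical-cdf transform of the training data can be written compactly with array broadcasting. The function name and the tiny data matrix below are hypothetical and serve only to make the counting explicit; the normalization by the sample size follows the description of the estimator as an (unbiased) empirical cdf.

```python
import numpy as np


def pseudo_observations(x):
    """Empirical-cdf PIT: u[i, d] = (1/M) * #{j : x[j, d] <= x[i, d]}."""
    x = np.asarray(x, dtype=float)
    # leq[i, j, d] is True iff x[j, d] <= x[i, d]; components are treated separately.
    leq = x[None, :, :] <= x[:, None, :]
    return leq.mean(axis=1)


# Three hypothetical observations of a two-dimensional variable.
x = np.array([[3.2, 10.0],
              [1.5, 30.0],
              [2.7, 20.0]])
u = pseudo_observations(x)
print(u)
# First column: 3.2 is the largest value (3 of 3 values are <= 3.2), 1.5 the smallest.
# Second column: 10.0 is the smallest value, 30.0 the largest.
```

For large data sets a sort-based rank computation is cheaper than the all-pairs comparison used here, which is chosen only because it mirrors the sum over indicators directly.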

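We then aim at matching the copula distribution $P_V$ to the unit space vector samples $u_i$. We base the inference of our likelihood-free model on the energy distance [Szekely2004] between distributions $P_U$ and $P_V$,

$$ D_E(P_U, P_V) = 2\,\mathbb{E}\|U - V\| - \mathbb{E}\|U - U'\| - \mathbb{E}\|V - V'\|. \tag{5} $$

Here, $U, U'$ and $V, V'$ are independent copies of a random vector with distribution $P_U$ and $P_V$, respectively. It holds that $D_E(P_U, P_V) = 0$ if and only if $P_U = P_V$ [Szekely2004].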

In practice, we use a sample-based approximation of the energy distance. We draw $N$ samples $v_1, \dots, v_N$ from $P_V$ and minimize the loss function

$$ L(\theta) = \frac{2}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \|u_i - v_j\| - \frac{1}{N^2} \sum_{j=1}^{N} \sum_{k=1}^{N} \|v_j - v_k\|. \tag{6} $$

Since $\mathbb{E}\|U - U'\|$ is independent of $\theta$, the second term of (5) does not need to be included. Note that the energy distance is an instance of the more general maximum mean discrepancy (MMD) measure [Gretton2012; Sejdinovic2013]. We also experimented with MMD losses based on the Gaussian kernel but found the energy distance to be more robust as it does not require tuning a kernel bandwidth.
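A sample-based version of the energy distance and of the training loss without the parameter-independent term might look as follows. This is a minimal sketch: the function names are hypothetical, plain averages over all pairs (including identical pairs) are an assumption about the normalization, and the test distributions are chosen only for illustration.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between all rows of a (n, D) and all rows of b (m, D)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def energy_loss(u, v):
    """Training loss: energy distance without the term that only depends on the data."""
    return 2.0 * pairwise_dist(u, v).mean() - pairwise_dist(v, v).mean()


def energy_distance(u, v):
    """Full sample-based energy distance, including the data-only term."""
    return energy_loss(u, v) - pairwise_dist(u, u).mean()


rng = np.random.default_rng(2)
n = 500
u = rng.uniform(size=(n, 2))        # data copula: independent components
v_good = rng.uniform(size=(n, 2))   # model sample with the same copula
t = rng.uniform(size=n)
v_bad = np.column_stack([t, t])     # model sample with perfectly dependent components

print("same copula:     ", energy_distance(u, v_good))
print("different copula:", energy_distance(u, v_bad))
```

Both model samples have uniform marginals, so the difference between the two values is driven by the dependency structure alone, which is what the loss is meant to capture in copula space.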

To generate samples from our copula model $P_V$, we first draw $N$ samples $y_j = g_\theta(z_j)$, $j = 1, \dots, N$, from the implicit generative model. To obtain vectors in unit space, we then again require the cdfs of the marginal distributions of the components of $Y$, $F_{Y_d}$. However, these functions are not known during model training and will likely change after each gradient step. Hence, we again use the empirical distribution functions to define for $d = 1, \dots, D$ and $j = 1, \dots, N$,

$$ v_{j,d} = \frac{1}{N} \sum_{k=1}^{N} \mathbb{1}(y_{k,d} \le y_{j,d}). \tag{7} $$

This provides us with an unbiased estimate of the current cdf during training. However, the indicator operation in (7) is not differentiable and hence will not allow the gradients to flow through this operation. Therefore, during training, we replace (7) with a softrank layer based on a scaled sigmoid function

$$ v_{j,d} = \frac{1}{N} \sum_{k=1}^{N} \sigma\big(\alpha\,(y_{j,d} - y_{k,d})\big), \qquad \sigma(t) = \frac{1}{1 + e^{-t}}, \tag{8} $$

where $\alpha$ is a scaling constant [Qin2010]. For sufficiently large $\alpha$, (8) provides a close approximation to the empirical marginal cdfs at the current training step. A larger $N$ will result in a finer approximation of the true marginal cdfs but comes with an increased cost for computing the ranks, which requires $O(N^2)$ operations for all samples.
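The relation between the hard empirical cdf and its sigmoid-based surrogate can be checked numerically. The sketch below is written in NumPy rather than as a neural network layer, the function names and the values of the scaling constant are assumptions, and the sigmoid is evaluated through the identity sigmoid(t) = (1 + tanh(t/2)) / 2 to avoid overflow for large arguments.

```python
import numpy as np


def hard_rank_cdf(y):
    """Empirical marginal cdf values based on indicator functions (not differentiable)."""
    return (y[None, :, :] <= y[:, None, :]).mean(axis=1)


def soft_rank_cdf(y, alpha):
    """Smooth surrogate: indicators replaced by a sigmoid scaled with alpha."""
    diff = y[:, None, :] - y[None, :, :]   # diff[j, k, d] = y[j, d] - y[k, d]
    return (0.5 * (1.0 + np.tanh(0.5 * alpha * diff))).mean(axis=1)


rng = np.random.default_rng(3)
y = rng.standard_normal(size=(200, 2))

hard = hard_rank_cdf(y)
err_small_alpha = np.abs(soft_rank_cdf(y, alpha=1.0) - hard).max()
err_large_alpha = np.abs(soft_rank_cdf(y, alpha=1e4) - hard).max()

print("max deviation, alpha = 1:    ", err_small_alpha)
print("max deviation, alpha = 10000:", err_large_alpha)
```

With a large scaling constant the surrogate differs from the hard version mainly through the self-comparison term, where the sigmoid contributes 1/2 instead of 1. Both functions build an array of all pairwise differences, which makes the quadratic cost in the number of samples visible.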

Sampling from the trained model

Once we have completed training and thus have fixed $\theta$, we have also fixed the component-wise cdfs $F_{Y_d}$. Now, the operation to estimate $F_{Y_d}$ does not need to be differentiable anymore and we can choose any available method to estimate the univariate marginal distributions based on samples from $P_Y$. We choose to simply draw a very large set of samples from $P_Y$ and then use the empirical marginal cdfs, i.e., for each sampled value $y_{j,d}$ we store the corresponding value $v_{j,d}$ according to (4). Given a new realization $y$ of $Y$, we obtain its approximate cdf value $v_d \approx F_{Y_d}(y_d)$ by interpolation of the stored values. Other more compact approximations like univariate kernel density estimation would also be feasible.

In sum, we obtain a sample from $P_V$ at test time by first sampling a realization $z$ from $P_Z$, transforming it via $y = g_\theta(z)$, and then applying the component-wise PIT approximation to obtain $v$. Given estimates for the component-wise marginal cdfs $F_{X_d}$ of the data, we can derive a sample in data space via $x_d = F_{X_d}^{-1}(v_d)$.
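The test-time sampling procedure can be sketched as follows. Everything concrete here is an assumption for illustration: the `generator` function is a fixed stand-in for a trained network, the training data are synthetic, the class name is hypothetical, and the inverse marginal cdfs of the data are approximated by empirical quantiles.

```python
import numpy as np


class MarginalCdfTable:
    """Stores sorted reference samples per component and interpolates cdf values."""

    def __init__(self, y_ref):
        self.sorted = np.sort(np.asarray(y_ref, dtype=float), axis=0)
        n = self.sorted.shape[0]
        self.levels = np.arange(1, n + 1) / n  # empirical cdf values at the sorted points

    def cdf(self, y):
        y = np.atleast_2d(y)
        cols = [np.interp(y[:, d], self.sorted[:, d], self.levels)
                for d in range(y.shape[1])]
        return np.column_stack(cols)


def generator(z):
    """Hypothetical stand-in for a trained generator network."""
    return np.column_stack([z[:, 0], np.tanh(z[:, 0]) + 0.3 * z[:, 1] ** 3])


rng = np.random.default_rng(4)

# After training: draw a large reference sample once and store the marginal cdf tables.
table = MarginalCdfTable(generator(rng.standard_normal((100_000, 2))))

# Test time: noise -> latent sample -> unit space via interpolated marginal cdfs.
z_new = rng.standard_normal((2_000, 2))
v_new = table.cdf(generator(z_new))

# Optional: map to data space with the inverse empirical cdfs of (synthetic) training data.
x_train = rng.gamma(shape=2.0, size=(1_000, 2))
x_new = np.column_stack([np.quantile(x_train[:, d], v_new[:, d]) for d in range(2)])

print("unit space means:", v_new.mean(axis=0))
print("data space range:", x_new.min(axis=0), x_new.max(axis=0))
```

Storing sorted samples makes each lookup a cheap interpolation; values outside the range of the reference sample are clamped to the smallest and largest stored cdf level.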

4 Experiments

In the following we empirically demonstrate the capabilities of IGC models in a series of experiments on synthetic and real data with increasing complexity. Additional experiments can be found in Appendix B.


For all experiments except the image generation task, we use a fully connected neural network with two layers, 100 units per layer, ReLU activation functions, and train for 500 epochs. For the image generation experiment we use a three layer, fully connected neural network with 200 neurons in each layer, and train for 100 epochs. In all cases we train with a batch size of

and generate samples from the model per batch. The number of noise distributions is set as and . We use the Adam optimizer [Kingma2014] with default parameters.

All experiments besides the training of the autoencoder models were carried out on a desktop PC with an Intel Core i7-7700 3.60 GHz CPU and 8 GB RAM. For the training of the autoencoders we used Google Colab [Colab]. Training times for all experiments are in the range of a few minutes except for the image generation task. Details are provided in Appendix B. We use TensorFlow with the Keras API for the neural networks [Tensorflow2015; Keras]. For copula modeling, we use pyvinecopulib [pyvinecopulib] in Python and the R package kdecopula [kdecopula]. Our code is available from


We evaluate the models by comparing the distance between the learned and the true copula. More specifically, we use the integrated squared error (ISE) in unit space, i.e.,

$$ \mathrm{ISE} = \int_{[0,1]^D} \big( C(u) - \hat{C}(u) \big)^2 \, du, \tag{9} $$

where $C$ is the joint cdf of $U$, the true copula, and $\hat{C}$ the joint cdf of the learned model. Since analytical integration is not possible, we approximate the integral with an empirical sum and use the data vectors $u_i$ for this purpose.

The copula function $C$ is not known for real data. Instead, we use the empirical cdf of the unit space data vectors in (9), i.e., for $u \in [0,1]^D$ we set

$$ C(u) \approx \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}(u_{i,1} \le u_1, \dots, u_{i,D} \le u_D). \tag{10} $$

Due to the curse of dimensionality this approximation of the joint cdf will have a coarse structure in high dimensions. This is why we only use the values at the data vectors in (9).

For parametric copula models the distribution function is analytic. For the non-parametric models, including IGC, we again resort to approximation. We draw samples from the copula distribution model and use their empirical cdf in unit space in (9). We found that this procedure provided reliable and stable estimates for the underlying copula in dimensions .
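The sample-based ISE evaluation described above can be sketched as follows. The function names are hypothetical, and averaging the squared differences over the data vectors (rather than summing them) is an assumption about the normalization of the empirical sum.

```python
import numpy as np


def empirical_joint_cdf(samples, points):
    """Fraction of samples that are <= a point in every component, for each point."""
    leq = (samples[None, :, :] <= points[:, None, :]).all(axis=2)  # (n_points, n_samples)
    return leq.mean(axis=1)


def ise_at_data(u_data, v_model):
    """Mean squared difference of the two empirical joint cdfs, evaluated at the data vectors."""
    c_data = empirical_joint_cdf(u_data, u_data)
    c_model = empirical_joint_cdf(v_model, u_data)
    return np.mean((c_data - c_model) ** 2)


rng = np.random.default_rng(6)
n = 1_000
u_data = rng.uniform(size=(n, 2))        # hypothetical data with independent components
v_indep = rng.uniform(size=(n, 2))       # samples from a model with the same copula
t = rng.uniform(size=n)
v_comonotone = np.column_stack([t, t])   # samples from a model with perfect dependence

print("ISE, matching model:   ", ise_at_data(u_data, v_indep))
print("ISE, mismatching model:", ise_at_data(u_data, v_comonotone))
```

The all-pairs comparison needs memory proportional to the product of the two sample sizes, so for larger evaluations the points would have to be processed in chunks.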

4.1 Synthetic data

Learning bivariate parametric copulas

We first show that IGC models are able to emulate bivariate parametric copulas. We compare our IGC approach to two non-parametric copula density estimation techniques, kernel density estimation with beta kernels (BETA) [Charpentier2007] and the transformation local likelihood estimator (TLL) [Geenens2017]. These are state-of-the-art non-parametric estimators for pair-copulas [kdecopula]. The compact support of the beta distribution on $[0,1]$ is convenient for copula density estimation. The idea of the TLL approach is to first transform the data using the inverse normal cdf such that the data is supported on $\mathbb{R}^2$. Then the density is estimated via local regression using linear (TLL1) or cubic polynomials (TLL2) on a fixed grid of points. Finally the estimated density is transformed back to unit space. Additionally we report the performance of a parametric model. In this approach a copula family is selected from a set of copulas based on the Bayesian information criterion (BIC). This set comprises the Independence, Gaussian, Student-t, Clayton, Gumbel, Frank, Joe, BB1, and BB7 copula. Parameters are then estimated via maximum likelihood.

We run experiments for data sets generated from a Student-t, a Gumbel, and a Clayton copula as well as the copula resulting from a two-component Gaussian mixture distribution as shown in Figure 2. For each one of these, we sampled a random parameter set and then generated 1000 training samples from the resulting model. We repeated each experiment 25 times. Appendix B contains the details on the considered parameter ranges and the exact sampling procedure.

Figure 2 shows the data and the model with intermediate steps for one instance of the Gaussian mixture case. The fitted copula matches the data distribution in unit space very well. Figure 3 presents aggregated ISE results for the different setups. The IGC model shows comparable performance scores to TLL1 and TLL2 for the Student-t, Clayton, and Gumbel copulas, and better performance than the BETA approach. For the more complex Gaussian mixture test case, IGC shows superior performance and a much lower variance than all baseline methods. Notably, the parametric approach with BIC-based model selection results in large variance in accuracy for the Clayton and Gumbel data, most likely because the wrong copula family is selected in some cases.

Figure 3: Box plot for the integrated squared error (lower is better) for 25 runs of fitting bivariate copulas to training data from a Student-t, Gumbel, Clayton, and Gaussian mixture copula.

Learning multivariate vine copulas

We now turn to the problem of estimating copulas in more than two dimensions. To this end, we conduct a similar experiment as before using 5-dimensional vine copulas as data generating distribution. For each of 25 repetitions, we first sample a random tree structure using pyvinecopulib's RVineStructure.Simulate() method. Then, we randomly assign parametric pair-copulas with random parameters to each edge in the tree. The considered families of bivariate copulas are Independence, Gaussian, Student-t, Clayton, Gumbel, Frank, Joe, BB1, and BB7. Further details are available in Appendix B. We generate 5000 samples from the resulting vine copula model and use these to train our IGC model as well as the benchmark models. As benchmarks, we use a vine copula model with TLL2 pair-copulas only and a vine copula model which can select all parametric copula families named above as well as the TLL2 estimator. Additionally, we estimate a parametric Gaussian copula. The parameters are estimated via maximum likelihood and the copula families are selected using the BIC. The selection of the vine structure is based on the Dissmann algorithm [Dissmann2013]. We evaluate all models with the ISE at 10000 data points sampled from the true model.

The results of the simulation study are presented in Figure 4. The IGC model has the lowest mean ISE (0.0697) followed by the TLL2-Vine model (0.2093), while the latter has a lower median ISE (0.0339 vs. 0.045). This is the result of some large outliers of the ISE for the vine models that do not occur for the IGC model. These are most likely caused by a poorly selected vine structure. Interestingly, the results imply that the data is on average better modeled by the IGC model than by the vine approach, although the vine model class contains the true data generating model. However, recovering the true model does not seem to be an easy task.

Figure 4: Box plot of the ISE results for 25 runs of fitting data from a 5D vine copula

4.2 Real data

Exchange rates

Copulas are a popular tool in financial risk management as they can be used to estimate the distribution of the multivariate returns over different assets [Patton2012] under the assumption of a stationary copula. We consider a data set of size that contains 15 years of daily exchange rates between the US-Dollar and the Canadian Dollar, the Euro, the British Pound, the Swiss Franc, and the Japanese Yen. The data was obtained from the R package qrm_data. We preprocess the data to filter out the effects of temporal dependencies. To this end, we fit an AR(1)-GARCH(1,1) [Bollerslev1986] process with Student-t innovations to the time series of the daily returns. We then obtain the standardized residuals from the AR-GARCH models and transform these observations to the unit space using the empirical cdfs. Figure 5(a) shows a kernel density estimate for two selected dimensions of the resulting data set in unit space, namely the US-Dollar/Euro and US-Dollar/Pound exchange rates. While being non-trivial and multi-modal, the distribution is strongly concentrated along the diagonal.

We then estimate a Gaussian copula, a vine copula, a vine copula with only TLL2 pair-copulas, a GMMN as proposed by [Hofert2021], and our IGC model for this data. The GMMN model has the same loss function, architecture, and hyper-parameters as the IGC model but uses a sigmoid activation in the final layer. To ensure a test data set of appropriate size, we use a 5-fold cross-validation scheme where 20% of the data is used for training and 80% for testing. For the IGC and the GMMN model we additionally use five different random initializations per fold for the neural network weights, i.e., we report means and standard deviations from 25 values for the IGC and the GMMN model and 5 values for the other methods. The results are presented in Table 1. The IGC model clearly outperforms the Gaussian baseline and also the GMMN model. However, the vine copula models show lower average ISEs. This might be due to the symmetric nature of the data which can be approximated well by Student-t and TLL pair-copulas.

exchange rates magic
IGC (ours)
Table 1: ISE mean and standard deviation for exchange rate and magic data from 5 fold CV and 5 random NN weight initializations per fold (lower is better).

MAGIC Gamma Telescopes data

In order to test our approach on a more complex dependency structure, we consider the MAGIC (Major Atmospheric Gamma-ray Imaging Cherenkov) Telescopes data set available from the UCI repository. This data was also used for benchmarking non-parametric copula estimation techniques in [Nagler2017]. We only consider the observations classified as gamma and the 5 variables fLength, Width, fConc, fM3Long, fM3Trans. The size of the data set is . We use the empirical cdfs to transform the observations to the unit space. Figure 5(b) showcases the dependency structure of the data for two selected data dimensions. The observed structures are highly asymmetric. We again fit the data with a Gaussian copula, a vine copula with all available pair-copulas as above, a vine copula with TLL2 pair-copulas, and a GMMN model with a sigmoid output layer [Hofert2021] as benchmarks and use the same 5-fold CV strategy as before, i.e., in each fold we use 20% as training data, 80% as test data, and report the average ISE. The results are presented in Table 1. The IGC model achieves the lowest average ISE. Again the GMMN model shows a substantially larger mean ISE with a much larger standard deviation. As only TLL2 pair-copulas were selected by the BIC criterion, the scores for both vine models are the same. The Gaussian copula is clearly not suited for such complex data as the average ISE is 5 times larger than for the other approaches.

Figure 5: Exemplary bivariate KDE estimates of the copula densities from the real data sets: (a) Exchange rates EUR/USD-GBP/USD, (b) MAGIC fLength-fM3Trans, (c) and (d): FashionMNIST Autoencoder latent space dimensions 1-2 and 2-3

Copula Autoencoders

[Tagasovska2019] introduced the Vine Copula Autoencoder, a generative model which uses vine copulas for ex-post density estimation of the latent space of a trained autoencoder. In a first step, an autoencoder is trained on a data set to learn a low dimensional representation of the data. After training, the encoder part of the network is used to map the training data to the autoencoder’s latent space representation. Next, the uni-variate marginal distributions are estimated, e.g., by using the empirical cdfs or kernel density estimation, and the compressed data is mapped to unit space. After fitting a copula model to the observations in unit space, one can sample from the fitted copula model, apply the inverse marginal cdfs, and map the simulated data back to the image space using the decoder network of the autoencoder in order to generate new data samples.

         image     latent    copula
Indep    0.01912   0.00722   0.00373
Gauss    0.00619   0.00209   0.00087
Vine     0.00674   0.00131   0.00079
GMMN     0.00392   0.00341   0.00073
IGC      0.00426   0.00114   0.00069
VAE      0.01316
Table 2: MMD scores for the FashionMNIST test set in image space, latent data space, and latent unit (copula) space (lower is better)

In the following, we present results for the FashionMNIST [FashionMNIST] data set. We train a convolutional autoencoder on the entire training data of 60000 samples with a latent space of dimension 25. Details on the architecture are found in Appendix B. After training the autoencoder, we estimate the empirical cdfs of the compressed data using the training data. See Figures 5(c) and 5(d) for exemplary visualizations of the resulting bivariate data densities in unit space. Notably, these distributions are more complex than the ones from the previous experiments. We fit this data with a Gaussian copula, a vine copula with TLL2 pair-copulas, a GMMN with sigmoid output layer [Hofert2021], and an IGC model. We also test an independence copula, i.e., we assume independence over the latent space. Additionally we report results for a standard variational autoencoder (VAE) [Kingma2013] with the same architecture.

Exemplary samples from all models and the test set are presented in Figure 6. Sampling with the independence copula, i.e., assuming no dependency structure, leads to images with many artifacts. The Gaussian copula produces better images, but still produces some artifacts. Samples from the vine copula or the IGC model are comparable in quality. They show smaller details and few implausible artifacts. The images generated by the VAE are blurry and show much less variation in pixel intensity than the test data.

In 25 dimensions it is not feasible anymore to compute and compare empirical cdfs as required for ISE scoring. We thus resort to the MMD [Gretton2012] for the numerical evaluation since it is commonly used as a simple and robust measure for evaluating image generation models [Xu2018]. We generate 10000 images from each model and compare those to the 10000 test set images by computing the MMD with a Gaussian kernel. We use the bandwidths , , and

for the latent unit space, the latent data space, and the image space, respectively. The bandwidths were selected based on the median heuristic proposed in


The results are found in Table 2. We provide p-values for these results using the test proposed by [Bounliphone2016] in Appendix B. The IGC and GMMN models perform better than all other models for the latent unit space distribution as well as the image space. Interestingly the GMMN model shows a relatively large MMD value for the latent data space. This could be caused by deviations from marginal uniformity of the learned copula distribution which alter the marginal distributions in the data space.
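The MMD evaluation with a Gaussian kernel can be sketched as below. This is an illustrative implementation, not the evaluation code behind Table 2: the biased (V-statistic) estimator of the squared MMD, the particular variant of the median heuristic (median pairwise distance over the pooled sample), and the synthetic 25-dimensional samples are all assumptions.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between all rows of a and all rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)


def median_bandwidth(a, b):
    """Median heuristic: median pairwise distance over the pooled sample."""
    pooled = np.vstack([a, b])
    d2 = sq_dists(pooled, pooled)
    return np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)]))


def mmd2_gaussian(a, b, bandwidth):
    """Biased (V-statistic) estimate of the squared MMD with a Gaussian kernel."""
    def kernel(p, q):
        return np.exp(-sq_dists(p, q) / (2.0 * bandwidth ** 2))
    return kernel(a, a).mean() + kernel(b, b).mean() - 2.0 * kernel(a, b).mean()


rng = np.random.default_rng(7)
test_set = rng.standard_normal((200, 25))           # stand-in for test data
good_model = rng.standard_normal((200, 25))         # same distribution as the test data
bad_model = rng.standard_normal((200, 25)) + 1.0    # shifted distribution

bw = median_bandwidth(test_set, good_model)
print("bandwidth:", bw)
print("MMD^2, matching model:   ", mmd2_gaussian(test_set, good_model, bw))
print("MMD^2, mismatching model:", mmd2_gaussian(test_set, bad_model, bw))
```

The same functions apply unchanged to samples in the latent unit space, the latent data space, or flattened images; only the bandwidth has to be chosen separately for each space.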

(a) independence copula
(b) Gaussian copula
(c) vine copula
(d) IGC (ours)
(e) VAE
(f) test data
Figure 6: Generated image samples from different copula models on top of a pre-trained autoencoder as well as from a standard variational autoencoder. The latent space has 25 dimensions.

5 Conclusion

We have proposed the first fully non-parametric model framework for copulas in higher dimensions that guarantees uniformity of the marginal distributions. The proposed IGC approach, which is based on an implicit generative step and a differentiable ranking transformation, is structurally simple. Yet, if the generator class is sufficiently complex, any copula dependency structure can be modeled. The model can be readily implemented with standard deep learning frameworks. For various data sets we have shown a modeling performance on par with or better than other state-of-the-art approaches, in particular the vine copula approach.

IGC models should be further investigated in various ways. First, it would be straightforward to condition the model on external factors, either by including such values into the input of the generator network or by reparameterization of the noise distributions. This would make it possible to describe context-dependent changes of the dependency structure. Second, many more applications for copulas should be examined, e.g., synthetic tabular data generation [NEURIPS2019_254ed7d2]. Third, the ranking operation could probably be sped up from $O(N^2)$ to $O(N \log N)$ using ideas from [Blondel2020]. Finally, the use of adversarial training schemes could also be investigated.

This work has been performed in the context of the LOEWE center emergenCITY.


Appendix A Algorithm

The algorithm for training an IGC model is described in Algorithm 1.

Input : data $\{x_i\}_{i=1}^{M}$, initial model parameters $\theta$, initial learning rate $\eta$, number of batches $B$, number of random noise samples $N$, number of epochs $E$
Output : Model parameters $\theta$
Partition data into $B$ mini batches
For each batch $b$, generate a set of $N$ random noise samples $\{z_j\}_{j=1}^{N}$
for $e = 1, \dots, E$ do
        for $b = 1, \dots, B$ do
               for $j = 1, \dots, N$ do
                      $y_j \leftarrow g_\theta(z_j)$
               end for
               Map model output to unit space via the softrank layer (8)
               Compute gradient of loss (6) over batch
               Update learning rate (e.g. using ADAM)
               Update model parameters $\theta$
        end for
end for
Algorithm 1 Training algorithm
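The structure of the training loop can be illustrated with a deliberately small, self-contained example. It deviates from the setup in the paper in several labeled ways, all of which are assumptions of this sketch: the generator is a two-dimensional linear map instead of a neural network, gradients are estimated by finite differences instead of backpropagation, plain gradient steps replace Adam, and the training data, batch sizes, learning rate, and scaling constant are hypothetical.

```python
import numpy as np


def soft_rank_cdf(y, alpha=50.0):
    """Differentiable surrogate for the empirical marginal cdfs (scaled sigmoid)."""
    diff = y[:, None, :] - y[None, :, :]
    return (0.5 * (1.0 + np.tanh(0.5 * alpha * diff))).mean(axis=1)


def pairwise_dist(a, b):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.sqrt(sq + 1e-12)  # small offset keeps the square root smooth at zero


def batch_loss(theta, z, u):
    """Energy loss between a data batch u and model samples generated from noise z."""
    y = z @ theta.reshape(2, 2)  # toy linear generator, an assumption of this sketch
    v = soft_rank_cdf(y)
    return 2.0 * pairwise_dist(u, v).mean() - pairwise_dist(v, v).mean()


def numerical_grad(f, theta, eps=1e-4):
    """Central finite differences, standing in for backpropagation."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad


rng = np.random.default_rng(5)

# Hypothetical training data in unit space with strong positive dependence.
raw = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=256)
u_data = (raw.argsort(axis=0).argsort(axis=0) + 1) / raw.shape[0]

n_batches, n_noise, n_epochs, lr = 2, 128, 15, 0.5
batches = np.array_split(rng.permutation(u_data), n_batches)           # partition the data
noise = [rng.standard_normal((n_noise, 2)) for _ in range(n_batches)]  # noise per batch

theta = np.eye(2).ravel()              # initial parameters: independent components
z_eval = rng.standard_normal((256, 2))
loss_before = batch_loss(theta, z_eval, u_data)

for epoch in range(n_epochs):
    for u_b, z_b in zip(batches, noise):
        grad = numerical_grad(lambda t: batch_loss(t, z_b, u_b), theta)
        theta = theta - lr * grad      # plain gradient step instead of Adam

loss_after = batch_loss(theta, z_eval, u_data)
print(f"loss before training: {loss_before:.4f}, after training: {loss_after:.4f}")
print("learned linear map:\n", theta.reshape(2, 2))
```

In a deep learning framework the finite-difference step would be replaced by automatic differentiation through the generator and the softrank layer, which is what makes the approach practical for networks with many parameters.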

Appendix B Experiments

B.1 Additional experiments on toy data sets

We provide additional experiments on three toy data sets commonly used in the literature on deep generative models, the "Swiss Roll", the "Grid of Gaussians", and the "Ring of Gaussians". We test four copula based models: a Gaussian copula, a TLL2 copula, a GMMN with sigmoid output layer [Hofert2021], and our IGC model. For these models we use a linear interpolation of the ECDF of the training set as models for the marginals of the data distribution. We additionally report results for two implicit generative models that directly model the data distribution, a GMMN [Li2015; Dziugaite2015] and a GAN [Goodfellow2014]. Like the IGC model, the GMMN based models are trained by minimizing the energy distance. All neural network models use two layers, 100 neurons per layer, and are trained for 500 epochs. IGC and GMMN models are trained using the Adam optimizer with standard values. For the GAN we use a lower learning rate of and a lower momentum as the model did not converge with the standard settings. We evaluate the models using the average negative log-likelihood of a test set based on kernel density estimates. We repeat each experiment 10 times with different training and test sets of size 5000 and random initializations for the neural networks.

Table 3 presents the results. The GAN achieves the best result for the Swiss Roll data. However, the scores for the GAN show a much higher standard deviation than those of all other methods. The IGC model has the second lowest score overall and the lowest score of all copula-based methods. For the Ring of Gaussians, the TLL2 copula shows the lowest NLL, closely followed by the IGC and the GMMN copula. Here the GAN shows the worst performance of all models. The Grid of Gaussians is a trivial test case for the copula-based models, as it is sufficient to sample from the marginal distributions with an independence copula. Both the GMMN and the GAN show substantially worse NLL values and fail to approximate the true data distribution, as can be seen from the bottom row of Figure 7.
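The remark about the Grid of Gaussians can be illustrated with a short sketch: because the grid is a product of two identical one-dimensional mixtures, drawing each coordinate independently from a linearly interpolated ECDF quantile function already reproduces it. The 5 x 5 grid, the component standard deviation of 0.1, and the plotting positions `k / (n + 1)` are assumptions made for this example only.

```python
import numpy as np

def fit_marginal_quantile(x):
    # Quantile function obtained by linear interpolation of the ECDF of x.
    xs = np.sort(x)
    probs = np.arange(1, len(xs) + 1) / (len(xs) + 1.0)
    return lambda u: np.interp(u, probs, xs)

rng = np.random.default_rng(0)
centers = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # hypothetical grid positions
n = 5000
cell = rng.integers(0, 25, size=n)                 # pick one of the 25 grid cells
data = np.column_stack([centers[cell // 5], centers[cell % 5]])
data = data + 0.1 * rng.normal(size=(n, 2))

# Independence copula: independent uniforms, pushed through the marginal quantiles.
quantiles = [fit_marginal_quantile(data[:, d]) for d in range(2)]
u = rng.uniform(size=(n, 2))
samples = np.column_stack([quantiles[d](u[:, d]) for d in range(2)])
```

No dependence model is fitted at any point, which is why all copula-based methods reach essentially the same score on this data set.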

Figure 7: True and exemplary learned densities for the Swiss Roll (top), Ring of Gaussians (middle), and Grid of Gaussians (bottom). Panels: (a) True, (b) Gauss copula, (c) TLL2 copula, (d) GMMN copula, (e) IGC, (f) GMMN, (g) GAN.
                  Swiss Roll    Ring          Grid
Gaussian copula   6.10 (0.01)   5.72 (0.02)   4.47 (0.01)
TLL2 copula       5.61 (0.01)   5.11 (0.01)   4.47 (0.01)
GMMN copula       5.45 (0.09)   5.24 (0.01)   4.48 (0.01)
IGC               5.26 (0.08)   5.19 (0.03)   4.47 (0.01)
GMMN              5.76 (0.09)   5.40 (0.05)   6.27 (0.05)
GAN               4.82 (0.40)   7.14 (0.52)   5.95 (0.26)
Table 3: Means and standard deviations (in parentheses) of the average negative log-likelihood (lower is better) from 10 repetitions with random parameter initializations and training and test sets.

B.2 Learning bivariate copulas

For the Student-t, Gumbel, and Clayton copulas we sample the parameters and rotations uniformly from the ranges given in Table 4.

Samples from the Gaussian mixture copula are generated by sampling times from one of the two components with equal probability and then transforming all samples to the unit space via the component-wise PIT. Samples for the component are drawn from the two-dimensional Gaussian distribution , where is sampled from , is sampled from , is sampled from , and .
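The sampling procedure for the Gaussian mixture copula can be sketched as below. The component means and covariance matrices are fixed, hypothetical values rather than draws from the ranges used in the experiments, and the PIT is assumed to use the exact marginal CDF of the mixture in each dimension.

```python
import numpy as np
from math import erf, sqrt

# Standard normal CDF, vectorized over arrays.
_phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))

def sample_gm_copula(n, means, covs, rng):
    # Choose one of the two components with equal probability for each sample.
    comp = rng.integers(0, 2, size=n)
    x = np.empty((n, 2))
    for c in (0, 1):
        idx = comp == c
        x[idx] = rng.multivariate_normal(means[c], covs[c], size=int(idx.sum()))
    # Component-wise PIT: apply the mixture's marginal CDF in each dimension.
    u = np.empty_like(x)
    for d in range(2):
        u[:, d] = 0.5 * sum(
            _phi((x[:, d] - means[c][d]) / np.sqrt(covs[c][d, d])) for c in (0, 1)
        )
    return u

rng = np.random.default_rng(0)
means = [np.array([-1.5, -1.5]), np.array([1.5, 1.5])]   # hypothetical values
covs = [np.array([[1.0, 0.8], [0.8, 1.0]]),
        np.array([[1.0, -0.6], [-0.6, 1.0]])]            # hypothetical values
u = sample_gm_copula(5000, means, covs, rng)
```

Each column of `u` is approximately uniform on the unit interval, while the dependence between the columns is inherited from the mixture.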

B.3 Learning multivariate vine copulas

The following table presents the parameter ranges of the pair-copulas for the vine copula experiments. We used uniform sampling for selecting the pair-copula family as well as the parameters and rotations.

Independence - - -
Gaussian -
Clayton -
Gumbel -
Frank -
Joe -
Table 4: Parameter ranges for pair-copulas for learning vine copulas

B.4 Training times

Table 5 shows the average training times for the IGC model and a vine copula model with TLL2 pair-copulas for the different data sets. Note that the IGC model for the FashionMNIST experiment is a three-layer neural network with 200 neurons per layer that is trained for 100 epochs, while in all other cases a two-layer network with 100 neurons per layer is trained for 500 epochs. Timings are for a desktop PC with an Intel Core i7-7700 3.60 GHz CPU and 8 GB RAM.

                             IGC     Vine copula (TLL2)   Dimension   Training samples
Learning bivariate copulas   49s     <1s                  2           1000
Learning vine copulas        131s    5s                   5           5000
Exchange rates               33s     4s                   5           1169
Magic                        61s     4s                   5           2466
FashionMNIST AE              1472s   1743s                25          60000
Table 5: Average training times for IGC and vine copula (TLL2)

B.5 Autoencoder architecture

We used the following architecture for the autoencoder and the VAE:

  • Encoder:

  • Decoder:

denotes a convolutional layer with filters, a kernel of height and width , and a stride with height and width . denotes a deconvolutional layer. denotes a fully connected layer with neurons. and denote the sigmoid and ReLU activation functions and denotes batch normalization layers. We use a padding of 2 to achieve an image resolution of pixels and normalize the inputs to the range . The models are trained with the binary cross entropy as reconstruction loss for 100 epochs using the Adam optimizer with default parameters.
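The preprocessing and the reconstruction loss can be sketched in a few lines. The sketch assumes that the padding of 2 is applied on every side of the 28 x 28 FashionMNIST images (which would give 28 + 2 * 2 = 32 pixels per side), that zero padding is used, and that inputs are scaled to the unit interval; the function names are hypothetical.

```python
import numpy as np

def preprocess(img_uint8, pad=2):
    # Scale pixel values to [0, 1] (assumed range) and zero-pad every side.
    x = img_uint8.astype(np.float64) / 255.0
    return np.pad(x, pad_width=pad, mode="constant", constant_values=0.0)

def binary_cross_entropy(x, x_hat, eps=1e-7):
    # Mean binary cross entropy between the input x and the reconstruction x_hat.
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    return -(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat)).mean()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28)).astype(np.uint8)   # stand-in image
x = preprocess(image)
loss_perfect = binary_cross_entropy(x, x)                      # reconstruction equals input
loss_flat = binary_cross_entropy(x, np.full_like(x, 0.5))      # uninformative reconstruction
```

Scaling the inputs to the unit interval is what makes the binary cross entropy a sensible reconstruction loss together with a sigmoid output layer.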

B.6 Significance of FashionMNIST results

In the following tables we present the p-values from the test proposed in [Bounliphone2016] for the experiments on the FashionMNIST data. Values close to one (zero) in a row indicate significantly better (worse) performance than the model in the column.

        IGC      GMMN     Vine     Gauss    Indep
IGC     -        0.5561   0.9871   0.9990   1.0000
GMMN    0.4439   -        0.8645   0.9689   1.0000
Vine    0.0129   0.1355   -        0.9510   1.0000
Gauss   0.0010   0.0311   0.0490   -        1.0000
Indep   0.0000   0.0000   0.0000   0.0000   -
Table 6: p-values for copula space samples
        IGC      GMMN     Vine     Gauss    Indep
IGC     -        1.0000   0.9812   1.0000   1.0000
GMMN    0.0000   -        0.0000   0.0000   1.0000
Vine    0.0188   1.0000   -        1.0000   1.0000
Gauss   0.0000   1.0000   0.0000   -        1.0000
Indep   0.0000   0.0000   0.0000   0.0000   -
Table 7: p-values for latent space samples
        IGC      GMMN     Vine     Gauss    Indep    VAE
IGC     -        0.1131   1.0000   1.0000   1.0000   1.0000
GMMN    0.8869   -        1.0000   1.0000   1.0000   1.0000
Vine    0.0000   0.0000   -        0.1152   1.0000   1.0000
Gauss   0.0000   0.0000   0.8848   -        1.0000   1.0000
Indep   0.0000   0.0000   0.0000   0.0000   -        0.0000
VAE     0.0000   0.0000   0.0000   0.0000   1.0000   -
Table 8: p-values for image space samples