Copula-Based Normalizing Flows

Mike Laszkiewicz, et al.
Ruhr University Bochum

Normalizing flows, which learn a distribution by transforming the data to samples from a Gaussian base distribution, have proven to be powerful density approximators. But their expressive power is limited by this choice of base distribution. We therefore propose to generalize the base distribution to a more elaborate copula distribution in order to capture the properties of the target distribution more accurately. In a first empirical analysis, we demonstrate that this replacement can dramatically improve vanilla normalizing flows in terms of flexibility, stability, and effectiveness for heavy-tailed data. Our results suggest that the improvements are related to an increased local Lipschitz-stability of the learned flow.






1 Introduction

Normalizing Flows (NFs) are a recently developed class of density estimators, which aim to transform the distribution of interest into some tractable base distribution. Using the change of variables formula, this allows for exact likelihood computation, in contrast to other deep generative models such as Variational Autoencoders or Generative Adversarial Networks (goodfellow2014generative). Impressive estimation results, especially in the field of natural image generation, have led to great popularity of these deep generative models. Motivated by this success, much effort has been put into the development of new parametric classes of NFs to make them even more performant (dinh2014nice; dinh2016density; chen2019residual; papamakarios2017masked; grathwohl2018ffjord). However, our theoretical understanding has not developed at the same speed, which, we claim, slows down further progress in the development of powerful NF architectures.

Fortunately, very recent works have addressed theoretical limitations of these methods: One important limitation is the expressive power of NFs. Because they are based on the change of variables formula, the learned transformations are required to be diffeomorphisms. As a consequence, a NF with bounded Lipschitz constant is unable to map one distribution to a lighter-tailed distribution (wiese2019copula; pmlr-v119-jaini20a). Therefore, since vanilla NFs are implemented using an isotropic Gaussian base distribution, they are unable to learn heavy-tailed distributions, which are known to appear in natural image data (zhu-longtails; vanhorn2017devil; range_loss). This is in conflict with observations recently made by pmlr-v130-behrmann21a: Even though NFs are typically designed to obey invertibility, this property is often violated in practice. This is due to numerical inaccuracies, which are promoted by a large Bi-Lipschitz constant. Bounding the Bi-Lipschitz constant, however, conflicts with the previously mentioned theoretical requirements needed to avoid a limited expressiveness of the NF.

These findings emphasize the high importance of choosing an appropriate base distribution for NFs. We therefore propose to generalize the isotropic Gaussian to a much broader class of distributions—the class of copula distributions. Copulae are a well-known concept in statistical theory and are used to model complex distributions in finance, actuarial science, and extreme-value theory (genest; joe2014dependence; elementsofcopula). Broadly speaking, a copula is a function that couples marginal distributions into a multivariate joint distribution. Hence, copulae allow for flexible multivariate modeling with marginals stemming from a huge range of suitable classes. This allows, for example, formulating NF base distributions that combine heavy-tailed marginals—as proposed by wiese2019copula; pmlr-v119-jaini20a; alexanderson2020robust—with light-tailed marginals. This paper presents a novel framework for choosing the base distribution of NFs, building on the well-studied theory of copulae. A first empirical investigation demonstrates the benefits brought by this approach. Our experimental analysis on toy data reveals that using even the simplest copula model—the Independence Copula—we are able to outperform the vanilla NF approach, which uses an isotropic Gaussian base distribution. The resulting NF converges faster, is more robust, and achieves an overall better test performance. In addition, we show that the learned transformation has a better-behaved functional form in the sense of a more stable local Lipschitz continuity.

2 Background

In this section, we briefly review some background on NFs (Section 2.1), followed by an introduction to copula theory (Section 2.2).

2.1 Normalizing Flows

Density estimation via NFs revolves around learning a diffeomorphic transformation $T$ that maps some unknown target distribution with density $p_X$ to a known and tractable base distribution with density $p_Z$. At the cornerstone of NFs is the change of variables formula

$$p_X(x) = p_Z(T(x)) \, \lvert \det J_T(x) \rvert, \tag{1}$$

which relates the evaluation of the estimated density of $X$ to the evaluation of the base density $p_Z$, of $T$, and of the Jacobian determinant $\det J_T$. By composing simple diffeomorphic building blocks $T = T_K \circ \dots \circ T_1$, we are able to obtain expressive transformations, while preserving diffeomorphy and computational tractability of the building blocks. Due to the tractable PDF in (1), we are able to train the model via maximum likelihood estimation (MLE),

$$\max_{T} \; \mathbb{E}_{x \sim \hat{p}_X} \big[ \log p_Z(T(x)) + \log \lvert \det J_T(x) \rvert \big], \tag{2}$$

where $\hat{p}_X$ is the PDF of the empirical distribution of $X$. A comprehensive overview of NFs, including the exact parameterizations of certain flow models, computational aspects, and more, can be found in kobyzev2020normalizing and papamakarios2019normalizing.
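To make the change of variables formula concrete, the following sketch evaluates the exact log-density of a Gaussian through a single hand-built affine transformation. The map, its parameters, and the function name are our own illustration and not an architecture from the paper:

```python
import numpy as np
from scipy import stats

# Change of variables for a hand-built diffeomorphism T(x) = (x - mu) / sigma,
# which maps N(mu, sigma^2) data to a standard normal base distribution.
# Then log p_X(x) = log p_Z(T(x)) + log |det J_T(x)|, with |det J_T| = 1/sigma.

mu, sigma = 2.0, 3.0

def flow_log_prob(x):
    z = (x - mu) / sigma                # forward pass T(x)
    log_base = stats.norm.logpdf(z)     # log p_Z(T(x))
    log_det = -np.log(sigma)            # log |det J_T(x)| for this affine map
    return log_base + log_det

# Sanity check: this must agree with the exact N(mu, sigma^2) log-density.
x = np.array([-1.0, 0.0, 2.5, 10.0])
exact = stats.norm.logpdf(x, loc=mu, scale=sigma)
assert np.allclose(flow_log_prob(x), exact)
```

Real NF architectures compose many such blocks with learned parameters, but the log-likelihood computation follows exactly this pattern.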

2.2 Copulae

A completely different approach to density estimation, which has mostly been left unrelated to NFs, is the idea of copulae.

Definition 2.1 (Copula).

A copula is a multivariate distribution with CDF $C$ that has standard uniform marginals, i.e. the marginals $U_1, \dots, U_d$ of $C$ satisfy $U_i \sim \mathcal{U}[0,1]$ for $i = 1, \dots, d$.

The fundamental idea behind copula theory is that we can associate every distribution with a uniquely defined copula . Vice versa, given marginal distributions, each copula defines a multivariate distribution with the given marginals. Formally, this is known as Sklar’s Theorem (sklar; sklarproof).

Theorem 2.2 (Sklar’s Theorem).

Taken from elementsofcopula.

  1. For any $d$-dimensional CDF $F$ with marginal CDFs $F_1, \dots, F_d$, there exists a copula $C$ such that

$$F(x_1, \dots, x_d) = C(F_1(x_1), \dots, F_d(x_d)) \tag{3}$$

    for all $(x_1, \dots, x_d) \in \mathbb{R}^d$. The copula is uniquely defined on $\operatorname{Ran} F_1 \times \dots \times \operatorname{Ran} F_d$, where $\operatorname{Ran} F_i$ is the image of $F_i$. For all $(u_1, \dots, u_d)$ in this set it is given by

$$C(u_1, \dots, u_d) = F\big(F_1^{-1}(u_1), \dots, F_d^{-1}(u_d)\big), \tag{4}$$

    where $F_1^{-1}, \dots, F_d^{-1}$ are the right-inverses of $F_1, \dots, F_d$.

  2. Conversely, given any $d$-dimensional copula $C$ and marginal CDFs $F_1, \dots, F_d$, a function $F$ as defined in (3) is a $d$-dimensional CDF with marginals $F_1, \dots, F_d$.

Part 1 of Sklar’s Theorem finds much application in statistical dependency analysis (joe2014dependence). In contrast to classical dependency measures, such as Pearson correlation, copulae are a more flexible tool that allow the decoupling of the marginals and the dependency structure. Part 2 of Sklar’s Theorem is of relevance for statistical modeling, and more precisely, to define multivariate distributions. Given marginal distributions, which are typically much easier to estimate than the full joint distribution, and a copula we can “couple” the marginals and the dependency structure to a multivariate joint distribution. This perspective finds various applications in the context of finance and related disciplines that need to take heavy tails and tail dependencies into account, see genest for an overview. In Section A of the Appendix we give some illustrative examples and further details about properties of copula distributions.
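Part 1 of Sklar's Theorem can be illustrated numerically: pushing each coordinate of a correlated Gaussian sample through its marginal CDF yields a sample from the associated copula (here the Gaussian Copula), with uniform marginals but an unchanged dependency structure. The correlation value and sample size below are illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Push each coordinate of a bivariate Gaussian through its own marginal CDF.
# The resulting pair (U1, U2) has standard uniform marginals; its joint law
# is the Gaussian Copula with the same correlation parameter.
rho = 0.7
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)
u = stats.norm.cdf(x)  # componentwise F_i(X_i)

# Marginals of the copula sample are (approximately) U[0, 1] ...
assert abs(u[:, 0].mean() - 0.5) < 0.01 and abs(u[:, 1].mean() - 0.5) < 0.01
# ... but the dependency structure of the original distribution survives.
assert np.corrcoef(u[:, 0], u[:, 1])[0, 1] > 0.5
```

This decoupling of marginals and dependency structure is exactly what the proposed base distributions exploit.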

By differentiating (3), we obtain the PDF of $F$ as

$$f(x_1, \dots, x_d) = c\big(F_1(x_1), \dots, F_d(x_d)\big) \prod_{i=1}^{d} f_i(x_i), \tag{5}$$

where $c$ is the PDF of the copula $C$ and $f_1, \dots, f_d$ are the PDFs of $F_1, \dots, F_d$, respectively.
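The factorization in (5) can be checked on a concrete case: the density of a correlated bivariate standard normal equals the Gaussian Copula density evaluated at the marginal CDF values, times the two marginal densities. The closed-form copula density below is the standard bivariate Gaussian Copula density; the correlation and test point are arbitrary choices of ours:

```python
import numpy as np
from scipy import stats

# Verify the PDF factorization f(x) = c(F1(x1), F2(x2)) * f1(x1) * f2(x2)
# for a bivariate standard normal with correlation rho.
rho = 0.6

def gaussian_copula_pdf(u1, u2):
    # Standard closed form of the bivariate Gaussian Copula density.
    z1, z2 = stats.norm.ppf(u1), stats.norm.ppf(u2)
    quad = (rho**2 * (z1**2 + z2**2) - 2.0 * rho * z1 * z2) / (2.0 * (1.0 - rho**2))
    return np.exp(-quad) / np.sqrt(1.0 - rho**2)

x1, x2 = 0.3, -1.2
joint = stats.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]]
).pdf([x1, x2])
factored = (
    gaussian_copula_pdf(stats.norm.cdf(x1), stats.norm.cdf(x2))
    * stats.norm.pdf(x1)
    * stats.norm.pdf(x2)
)
assert np.isclose(joint, factored)
```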

3 NFs With Copula-Base Distributions

In this paper, we propose to employ copulae to model a flexible, yet appropriate base distribution, with the goal of obtaining a NF that overcomes the limitations of NFs discussed in Section 1. We expect to gain powerful and robust PDF approximators by combining different marginals and properties of theoretically sound copulae (see for instance Chapter 8 in joe2014dependence) with NFs, which allow for the estimation of complex densities.

3.1 A General Framework

We propose to replace the isotropic Gaussian base distribution in the vanilla NF framework by a more flexible copula distribution. Importantly, we want to learn a base distribution that is able to represent the tail behavior of the distribution of $X$. For training a NF with a copula base distribution, we build on the fact that we can write the PDF of the latent variables as in (5). This requires two estimation steps: First, we need to estimate the marginal distributions $F_1, \dots, F_d$, from which we can calculate the marginal densities $f_1, \dots, f_d$. Second, we need to estimate the copula density $c$. A popular approach for estimating the density in (5) is the method of inference functions for margins (IFM) (Joe_Xu_1996), which first estimates the marginals using MLE, and then employs these marginals to estimate the copula using MLE.

It is important to note that, in contrast to standard applications of copula theory, we do not aim at estimating the full data-generating distribution based on (5). Instead, following the investigations by pmlr-v119-jaini20a, our goal is to capture the tail behavior of the target distribution. Hence, we propose to learn surrogate marginals that are able to represent the tailedness of the true marginals. By combining these surrogate marginals with some simple copula structure, such as the Gaussian Copula or the Independence Copula (see (6) below), we are able to create a joint distribution that represents the marginal tail behavior of the target.

The proposed adjustment can be applied to any existing NF architecture as long as (5) remains tractable. However, as the main goal of the base distribution is not to fully estimate the target but to represent its tail behavior, we can restrict ourselves to tractable parametric marginal distributions and copulae.
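The two-step structure above can be sketched as a small base-distribution class whose log-density follows (5): the copula log-density at the marginal CDF values plus the sum of marginal log-densities. With the Independence Copula the copula term is identically zero, so only the (here: heavy-tailed) marginals remain. Class and method names are illustrative, not from the paper's code:

```python
import numpy as np
from scipy import stats

# Sketch of a copula base distribution following (5). For the Independence
# Copula, log c(F_1(z_1), ..., F_d(z_d)) = 0, so the log-density reduces to
# the sum of the marginal log-densities.
class IndependenceCopulaBase:
    def __init__(self, marginals):
        self.marginals = marginals  # list of frozen scipy distributions

    def log_prob(self, z):
        z = np.atleast_2d(z)
        # copula term is zero for the Independence Copula
        return sum(m.logpdf(z[:, i]) for i, m in enumerate(self.marginals))

# Heavy-tailed base: one Student-t marginal, one Laplace marginal.
base = IndependenceCopulaBase([stats.t(df=2), stats.laplace()])
z = np.array([[0.0, 0.0], [5.0, -5.0]])
lp = base.log_prob(z)
assert lp.shape == (2,) and np.all(np.isfinite(lp))
```

In a full NF, this object would replace the standard-normal base and its `log_prob` would enter the MLE objective (2) unchanged.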

3.2 Experimental Analysis

In this section, we investigate the benefits of the proposed approach by analyzing a toy problem. In the following experiments, we employ the framework proposed in Section 3.1 using the simplest copula, the Independence Copula, i.e. we consider a base distribution with PDF

$$p_Z(z_1, \dots, z_d) = \prod_{i=1}^{d} p_i(z_i). \tag{6}$$

Note that by plugging Gaussian marginals into (6), we would obtain the vanilla NF. We consider a training set generated from a 2-dimensional heavy-tailed distribution (all computational details can be found in Section C of the Appendix), which has standardized t-distributed marginals with a fixed number of degrees of freedom. The corresponding copula is a Gumbel Copula with parameter $\theta$, i.e.

$$C_\theta(u_1, u_2) = \exp\!\Big( -\big[ (-\ln u_1)^\theta + (-\ln u_2)^\theta \big]^{1/\theta} \Big).$$
As a proof of concept, we compare the estimation of this heavy-tailed distribution using a NF (a 3-layer MAF (papamakarios2017masked); further computational details can be found in Section C of the Appendix) with an isotropic Gaussian base distribution against the same NF with 3 different heavy-tailed base distributions constructed via (6). We consider the following heavy-tailed marginals:

  1. a Laplace marginal combined with a light-tailed marginal. We call this case heavierTails because one marginal is heavy-tailed;

  2. two t-distributed marginals. We call this case correctFamily since both marginals stem from the same parametric class as the exact marginals;

  3. the exact t-distributed marginals of the target. We call this case exactMarginals.

Samples from the target distribution and from the different base distributions are visualized in Figures 6 and 7 in the Appendix.

Training and test loss

In Figure 1, we plot the average training and test performance over all trials. It is apparent that training with a base distribution that has the correct type of tail behavior is beneficial. First of all, we observe a significant gap between the test performance of the vanilla NF and the NFs with a heavy-tailed base distribution. Notice that in Figure 1 we excluded all runs with a final test loss above the cutoff, which happened in 17 of the normal runs and not once in the other cases. Furthermore, we clearly observe a much faster convergence and a more stable training procedure. The fluctuations and instabilities of the vanilla NF are due to tail samples that have a massive effect on the likelihood in (2); this effect can be reduced by choosing base distributions with slower decaying tails (alexanderson2020robust).

(a) Training loss
(b) Test loss
Figure 1: Mean training and test loss over all trials for NFs with different base distributions: normal (blue), heavierTails (orange), correctFamily (green), and exactMarginals (red). The shaded area depicts the confidence interval, which was computed using bootstrapping. We excluded normal runs that achieved a final loss above the cutoff.

Learning the tails

To illustrate the ability to model the tails, we compared the estimated empirical quantile functions. We did so for both marginal distributions (Figure 2) as well as in a further comparison in the Appendix (Figure 9). In line with the findings by pmlr-v119-jaini20a, we notice that the vanilla NF is not capable of modeling the quantiles of the target distribution. More precisely, we observe that the corresponding quantile function is steeper around its center and has shorter tails. This means that the distribution learned by the NF does not account for the heavy tails by directly modeling them, but instead covers samples from the tails of the data distribution by being more widespread. In contrast, the base distributions that take the tailedness of the target into account achieve a much better fit to the quantiles; see Figure 8 in the Appendix for further results.
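The qualitative gap between Gaussian and heavy-tailed quantiles is easy to reproduce with empirical quantile functions; the distributions, sample size, and quantile level below are our illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A heavy-tailed sample has far more extreme upper quantiles than a Gaussian
# sample of the same size -- exactly the regime a Gaussian-based fit misses.
n = 100_000
normal_sample = rng.standard_normal(n)
heavy_sample = stats.t(df=2).rvs(size=n, random_state=rng)

q = 0.999
q_normal = np.quantile(normal_sample, q)
q_heavy = np.quantile(heavy_sample, q)
assert q_heavy > 2 * q_normal  # the t(2) tail quantile dwarfs the Gaussian one
```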

Figure 2: Estimated marginal quantiles in the case of normal (orange, dashed) and exactMarginals (green, dotted). The corresponding negative log-likelihoods are and , respectively.

Invertibility and numerical stability

As investigated by pmlr-v130-behrmann21a, the Bi-Lipschitz constant plays a fundamental role in the practical invertibility and numerical stability of NFs. To understand the learned transformation $T$ in terms of its Lipschitz continuity, we propose to study the Lipschitz surface of $T$. Note that if $T$ is differentiable and $L$-Lipschitz, we can follow the derivation by pmlr-v130-behrmann21a (equations (5) and (6) therein) to approximate

$$L \approx \sup_{x} \sup_{\|\delta\| \le \varepsilon} \frac{\|T(x + \delta) - T(x)\|}{\|\delta\|},$$

where $\varepsilon$ is some small constant. This motivates considering an estimate of the inner supremum for fixed $x$ (see Section C in the Appendix for details) as a local surrogate for the Lipschitz continuity of $T$. Plotting these quantities for both $T$ and $T^{-1}$, we obtain the Lipschitz surfaces depicted in Figure 3. We notice that the vanilla NF has many fluctuations in the local Lipschitz continuity, while the proposed copula method leads to a well-behaved transformation. The inverse transformation of the vanilla NF has exploding local Lipschitz constants, while—again—the proposed method results in a stable inverse transformation.

(a) normal ()
(b) exactMarginals ()
(c) normal ()
(d) exactMarginals ()
Figure 3: Examples of the Lipschitz surfaces of $T$ and $T^{-1}$ on a log-scale. The corresponding negative log-likelihood is shown in brackets.

4 Discussion

In this work, we paved the way toward a general extension of NF architectures using copulae. Synthetic experiments revealed that the modeling performance of NFs can be improved substantially by replacing the vanilla Gaussian base distribution with a base distribution that reflects basic properties of the data distribution more accurately. Of course, we have only scratched the surface of the underlying potential of the proposed approach: While we concentrate on the tail behavior of the marginals in this work, the general idea can potentially also be applied to incorporate other types of inductive bias, such as multimodality, by choosing multimodal marginals, or symmetries and tail dependencies, by selecting appropriate marginals and copulae.

Our experiments suggest that it is sufficient to have only a broad estimate of the marginals. As mentioned in Section 3.1, one could also employ IFM to learn these before training the NF. However, the question of the best technique for choosing or estimating appropriate marginal distributions and copulae still requires further investigation. Nevertheless, we think that this flexibility brings additional improvement over the methods proposed by pmlr-v119-jaini20a and alexanderson2020robust. Of course, our empirical study has so far yielded only preliminary results, which we plan to verify on real-world data and for different models in the future.

We believe that our analysis of the base functions can help to popularize NFs in a wide spectrum of domains. One such application might be financial risk analysis, where it is essential to model tail dependencies.


This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC-2092 CASA – 390781972. We also thank the anonymous reviewers for their careful reading and their useful suggestions, which will help us extend this work to a full paper.


Appendix A Examples of Copula Distributions

There is a wealth of different copula distributions, ranging from parametric to semi-parametric and completely non-parametric copula models. In this section, we give some insights into the construction of one class of copula distributions and illustrate some basic properties.

Example A.1 (Construction using Sklar’s Theorem).

There is a generic way to construct a copula according to Definition 2.1. Consider any multivariate continuous and invertible CDF $F$ with marginal CDFs $F_1, \dots, F_d$. Then, following (4) in Sklar’s Theorem 2.2, we know that

$$C(u_1, \dots, u_d) = F\big(F_1^{-1}(u_1), \dots, F_d^{-1}(u_d)\big) \tag{7}$$

defines a valid copula. Setting $F$ to be the multivariate Gaussian distribution with correlation matrix $\Sigma$, we obtain the Gaussian Copula. If $\Sigma$ is the identity matrix, we obtain the Independence Copula, which is simply the product of independent uniform distributions. Both copulae are visualized in Figure 4. Following the construction in (7) and replacing the CDF, we can obtain other copulae, such as the t-Copula. Figure 5 visualizes one example each of a distribution based on the Independence Copula and on the Gaussian Copula.

Example A.2 (Copulae that induce Tail-Dependency).

A crucial property of models in financial risk analysis is tail dependency. Roughly speaking, a tail dependency forces marginals to be dependent in a tail event. To give an example, consider two quantities that are essentially independent most of the time, such as our trust in a bank and the amount of debt it holds: usually, our trust in the bank is not essentially determined by the amount of debt it has. In a marginal tail event, however, our trust drops rapidly. Mathematically, the upper tail dependency, i.e. the dependency in upper-tail events, of a random vector $(X_1, X_2)$ with marginal CDFs $F_1, F_2$ is given by

$$\lambda_U = \lim_{q \to 1^-} P\big(X_2 > F_2^{-1}(q) \mid X_1 > F_1^{-1}(q)\big).$$

Similarly, we can define the lower tail dependency $\lambda_L$. One copula that accounts for upper tail dependency is the Gumbel Copula, which is given by

$$C_\theta(u_1, u_2) = \exp\!\Big( -\big[ (-\ln u_1)^\theta + (-\ln u_2)^\theta \big]^{1/\theta} \Big)$$

for $\theta \ge 1$. Figure 4 shows a visualization of the Gumbel Copula. One can show that $\lambda_U = 2 - 2^{1/\theta}$ (joe2014dependence). Figure 5 visualizes the Gumbel Copula distribution with Gaussian and Gamma marginals. We can observe a pronounced dependency—indicated by the peak pointing to the upper right—in the upper-tail events, which is mathematically described by $\lambda_U > 0$.

(a) The Independence Copula
(b) The Gaussian Copula with correlation
(c) The Gumbel Copula with parameter
Figure 4: Some popular Copulae.
(a) The product distribution of a Gaussian and a Gamma marginal
(b) The Gaussian Copula distribution with marginals from (a)
(c) The Gumbel Copula distribution with marginals from (a)
Figure 5: Some examples of distributions constructed via the copula approach.
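The upper tail dependency of the Gumbel Copula can also be checked numerically, using the standard identity $\lambda_U = 2 - \lim_{q \to 1^-} (1 - C(q, q))/(1 - q)$ together with the closed-form value $2 - 2^{1/\theta}$; the choices of $\theta$ and of the evaluation point $q$ below are ours:

```python
import numpy as np

# Bivariate Gumbel Copula CDF and a numerical check of its upper tail
# dependency lambda_U = 2 - 2**(1/theta), approximated at q close to 1 via
# lambda_U = 2 - (1 - C(q, q)) / (1 - q).
def gumbel_cdf(u1, u2, theta):
    return np.exp(-(((-np.log(u1)) ** theta + (-np.log(u2)) ** theta) ** (1.0 / theta)))

theta = 2.0
q = 1.0 - 1e-6
lam_numeric = 2.0 - (1.0 - gumbel_cdf(q, q, theta)) / (1.0 - q)
lam_exact = 2.0 - 2.0 ** (1.0 / theta)  # = 2 - sqrt(2) for theta = 2
assert abs(lam_numeric - lam_exact) < 1e-3
```

Note that on the diagonal, $C_\theta(q, q) = q^{2^{1/\theta}}$, which is what makes the finite-$q$ approximation converge so quickly.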

One could think of many more properties that might be incorporated using our copula approach, such as multi-modality and symmetry. However, it remains to be researched whether properties such as tail-dependency are preserved by the NF. Nonetheless, we think that fixing specific known properties in the base distribution facilitates training by acting as a type of regularization towards these given properties.

Appendix B Supplementary Experiments

In the following, we give some further empirical results that underpin our findings from the main paper.

Figure 6 shows the investigated base distributions and Figure 7 visualizes the target distribution.

(a) normal
(b) heavierTails
(c) correctFamily
(d) exactMarginals
Figure 6: Samples from the 4 different base distributions that we used in our experiments.
(a) Gumbel Copula with t-distributed marginals.
(b) The distribution from 6(a) zoomed in.
Figure 7: Samples from the target distribution.

Figure 8 and Figure 9 supplement the findings about the learned quantiles: Even the other heavy-tailed base distributions—the cases heavierTails and correctFamily—are able to successfully approximate the quantiles. This observation suggests that a broad surrogate base distribution that captures the true tail behavior is sufficient.

Figure 8: Estimated marginal CDFs in the case of heavierTails (orange, dashed) and correctFamily (green, dotted). The corresponding negative log-likelihoods are and , respectively.
(a) normal ()
(b) heavierTails ()
(c) correctFamily ()
(d) exactMarginals ()
Figure 9: Estimated quantiles of using the different base distributions. The corresponding negative log-likelihood is shown in brackets.

To address the invertibility and numerical stability of the learned transformation, we investigate the local Lipschitz constant as derived in Section 3.2 of the main text. Figure 10 shows the supplementary Lipschitz surfaces for the cases heavierTails and correctFamily, which indicate a more regular transformation than in the normal case. Still, in the case correctFamily we find large local Lipschitz constants, which, however, are structured in accordance with the base distribution. In the vanilla NF this irregularity is less structured and, since the irregularities occur in high-probability areas of the base distribution, more relevant.

(a) heavierTails ()
(b) correctFamily ()
(c) heavierTails ()
(d) correctFamily ()
Figure 10: Examples of the Lipschitz surfaces of $T$ and $T^{-1}$ on a log-scale. The corresponding negative log-likelihood is shown in brackets.

These results support our claim that choosing an appropriately tailed base distribution can help in learning a numerically robust transformation.

Appendix C Computational Details

In all of our experiments, we employed Masked Autoregressive Flows (papamakarios2017masked) with 3 layers. Each layer contains a reverse permutation, followed by an autoregressive transformation with 4 hidden features. The code is based on the nflows package (nflows).

We trained the NFs on a fixed-size training sample. Optimization was carried out using the Adam optimizer with the PyTorch default settings and a fixed batch size. Test losses are evaluated on a held-out test sample.

The reported training and test losses (Figure 1) have been averaged over the repeated runs, and the depicted confidence intervals correspond to a fixed confidence level.
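For readers who want to reproduce the training setup in spirit, the following self-contained sketch runs MLE training with Adam on a toy learnable affine flow instead of a MAF, so it needs no flow library; the learning rate, iteration count, and data-generating parameters are our illustrative choices, not the paper's settings:

```python
import torch

torch.manual_seed(0)

# Toy learnable flow z = (x - mu) / exp(log_sigma), trained by minimizing the
# negative log-likelihood from the change of variables formula. This mirrors
# the MLE objective used for the MAF, in one dimension.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)  # illustrative learning rate

x = 2.0 + 0.5 * torch.randn(1000)  # training sample from N(2, 0.25)
base = torch.distributions.Normal(0.0, 1.0)

for _ in range(1000):
    opt.zero_grad()
    z = (x - mu) / torch.exp(log_sigma)
    # NLL = -(log p_Z(T(x)) + log |det J_T(x)|), with log |det J_T| = -log_sigma
    nll = -(base.log_prob(z) - log_sigma).mean()
    nll.backward()
    opt.step()

# The flow should recover the data location parameter.
assert abs(mu.item() - 2.0) < 0.2
```

Swapping the affine map for a MAF and the standard-normal base for a copula base distribution changes only the `z` computation and the `base.log_prob` term.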

To estimate the Lipschitz surface, we rely on an estimation of

$$\operatorname{Lip}_\varepsilon T(x) = \sup_{\|\delta\| \le \varepsilon} \frac{\|T(x + \delta) - T(x)\|}{\|\delta\|} \tag{9}$$

for some small, fixed constant $\varepsilon$. Further, we approximate (9) by

$$\widehat{\operatorname{Lip}}_\varepsilon T(x) = \max_{i = 1, \dots, n} \frac{\|T(x + \delta_i) - T(x)\|}{\|\delta_i\|},$$

where $\delta_1, \dots, \delta_n$ are i.i.d. samples from the $\varepsilon$-ball; see pmlr-v130-behrmann21a for further details.
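A possible implementation of this estimator is sketched below; the sampling scheme on the $\varepsilon$-sphere, the constants, and the linear test map are our choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte-Carlo estimate of the local Lipschitz constant of a map T at x:
# max over random perturbations with ||delta|| = eps of
# ||T(x + delta) - T(x)|| / ||delta||.
def local_lipschitz(T, x, eps=0.01, n_samples=100):
    deltas = rng.standard_normal((n_samples, x.size))
    deltas *= eps / np.linalg.norm(deltas, axis=1, keepdims=True)  # eps-sphere
    num = np.linalg.norm(T(x + deltas) - T(x), axis=1)
    return np.max(num / eps)

# Check on a linear map, whose true Lipschitz constant is its largest
# singular value (here: 3).
A = np.array([[3.0, 0.0], [0.0, 1.0]])
T = lambda z: z @ A.T
est = local_lipschitz(T, np.zeros(2), n_samples=2000)
assert est <= 3.0 + 1e-9 and est > 2.9
```

For a differentiable map, the random-direction maximum is always a lower bound on the true operator norm of the Jacobian, which is why many samples are needed for a tight estimate.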

All code is provided with the submission and can further be accessed online.