1 Introduction
Normalizing Flows
(NFs) are a recently developed class of density estimators, which aim to transform the distribution of interest
to some tractable base distribution. Using the change of variables formula, this allows for exact likelihood computation, which is in contrast to other deep generative models such as Variational Autoencoders
(kingma2013auto) or Generative Adversarial Networks
(goodfellow2014generative). Impressive estimation results, especially in the field of natural image generation, have led to great popularity of these deep generative models. Motivated by this success, much effort has been put into the development of new parametric classes of NFs to make them even more performant (dinh2014nice; dinh2016density; chen2019residual; papamakarios2017masked; grathwohl2018ffjord). However, our theoretical understanding has not developed at the same speed, which, so we claim, slows down further progress in the development of powerful NF architectures. Fortunately, very recent works have addressed theoretical limitations of these methods. One important limitation is the expressive power of NFs: because they are based on the change of variables formula, the learned transformations are required to be diffeomorphisms. As a consequence, a NF with bounded Lipschitz constant is unable to map one distribution to a lighter-tailed distribution (wiese2019copula; pmlr-v119-jaini20a). Therefore, since vanilla NFs are implemented using an isotropic Gaussian base distribution, they are unable to learn heavy-tailed distributions, which are known to appear in natural image data (zhu-longtails; vanhorn2017devil; range_loss). This is in conflict with observations recently made by pmlr-v130-behrmann21a: even though NFs are designed to be invertible, this property is often violated in practice due to numerical inaccuracies, which are promoted by a large Bi-Lipschitz constant. Bounding the Bi-Lipschitz constant, however, conflicts with the previously mentioned theoretical requirements needed to avoid limiting the expressiveness of the NF.
These findings emphasize the importance of choosing an appropriate base distribution for NFs. We therefore propose to generalize the isotropic Gaussian to a much broader class of distributions—the class of copula distributions. Copulae are a well-known concept in statistical theory and are used to model complex distributions in finance, actuarial science, and extreme-value theory (genest; joe2014dependence; elementsofcopula). Broadly speaking, a copula is a function that couples marginal distributions into a multivariate joint distribution. Hence, copulae allow for flexible multivariate modeling with marginals stemming from a huge range of suitable classes. This allows, for example, for NF base distributions that combine heavy-tailed marginals—as proposed by wiese2019copula; pmlr-v119-jaini20a; alexanderson2020robust—with light-tailed marginals. This paper presents a novel framework for choosing the base distribution of NFs, building on the well-studied theory of copulae. A first empirical investigation demonstrates the benefits of this approach: our experimental analysis on toy data reveals that even with the simplest copula model—the Independence Copula—we are able to outperform the vanilla NF approach, which uses an isotropic Gaussian base distribution. The resulting NF converges faster, is more robust, and achieves an overall better test performance. In addition, we show that the learned transformation has a better-behaved functional form in the sense of a more stable local Lipschitz continuity.

2 Background
In this section, we quickly review some background knowledge about NFs (Section 2.1), followed by an introduction to copula theory (Section 2.2).
2.1 Normalizing Flows
Density estimation via NFs revolves around learning a diffeomorphic transformation $T$ that maps some unknown target distribution $p_X$ to a known and tractable base distribution $p_Z$. At the cornerstone of NFs is the change of variables formula
\[ p_X(x) = p_Z\bigl(T(x)\bigr)\,\bigl|\det J_T(x)\bigr| \tag{1} \]
which relates the evaluation of the estimated density $p_X$ to the evaluation of the base density $p_Z$, of $T$, and of the Jacobian determinant $\det J_T$. By composing simple diffeomorphic building blocks $T = T_K \circ \dots \circ T_1$, we are able to obtain expressive transformations, while preserving diffeomorphy and computational tractability of the building blocks. Due to the tractable PDF in (1), we are able to train the model via maximum likelihood estimation (MLE)
\[ \max_{T} \; \mathbb{E}_{x \sim \hat{p}}\bigl[\log p_X(x)\bigr], \tag{2} \]
where $\hat{p}$ is the PDF of the empirical distribution of the training data. A comprehensive overview of NFs, including the exact parameterizations of certain flow models, computational aspects, and more, can be found in kobyzev2020normalizing and papamakarios2019normalizing.
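To make the change of variables formula concrete, the following sketch (our illustration, not part of the original paper; it assumes NumPy and SciPy) checks (1) for the simplest possible flow, an affine map with a standard normal base distribution:

```python
import numpy as np
from scipy.stats import norm

# Flow: z = T(x) = (x - mu) / sigma maps N(mu, sigma^2) to the standard normal base.
mu, sigma = 2.0, 1.5

def log_prob(x):
    z = (x - mu) / sigma             # forward pass through the flow
    log_det = -np.log(sigma)         # log |dT/dx| of the affine map
    return norm.logpdf(z) + log_det  # change of variables formula (1)

# The flow's density must agree with the known N(mu, sigma^2) density.
x = np.linspace(-3.0, 7.0, 101)
assert np.allclose(log_prob(x), norm.logpdf(x, loc=mu, scale=sigma))
```

In a real NF, `log_prob` is exactly the quantity maximized in the MLE objective (2), with the affine map replaced by a learned composition of diffeomorphic blocks.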
2.2 Copulae
A completely different approach to density estimation, which has mostly been left unrelated to NFs, is the idea of copulae.
Definition 2.1 (Copula).
A copula is a multivariate distribution with CDF $C$ that has standard uniform marginals, i.e., the marginals $U_1, \dots, U_d$ of $C$ satisfy $U_i \sim \mathcal{U}[0,1]$.
The fundamental idea behind copula theory is that we can associate every distribution $F$ with a uniquely defined copula $C$. Vice versa, given marginal distributions, each copula defines a multivariate distribution with the given marginals. Formally, this is known as Sklar’s Theorem (sklar; sklarproof).
Theorem 2.2 (Sklar’s Theorem).
Taken from elementsofcopula.
1. For any $d$-dimensional CDF $F$ with marginal CDFs $F_1, \dots, F_d$, there exists a copula $C$ such that
\[ F(x_1, \dots, x_d) = C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr) \tag{3} \]
for all $(x_1, \dots, x_d) \in \mathbb{R}^d$. The copula is uniquely defined on $\operatorname{Ran} F_1 \times \dots \times \operatorname{Ran} F_d$, where $\operatorname{Ran} F_i$ is the image of $F_i$. There it is given by
\[ C(u_1, \dots, u_d) = F\bigl(F_1^{-1}(u_1), \dots, F_d^{-1}(u_d)\bigr), \tag{4} \]
where $F_i^{-1}$ are the right-inverses of $F_i$.
2. Conversely, given any $d$-dimensional copula $C$ and marginal CDFs $F_1, \dots, F_d$, the function $F$ as defined in (3) is a $d$-dimensional CDF with marginals $F_1, \dots, F_d$.
Part 1 of Sklar’s Theorem finds much application in statistical dependency analysis (joe2014dependence). In contrast to classical dependency measures, such as Pearson correlation, copulae are a more flexible tool that allows decoupling the marginals from the dependency structure. Part 2 of Sklar’s Theorem is of relevance for statistical modeling, and more precisely, for defining multivariate distributions: given marginal distributions, which are typically much easier to estimate than the full joint distribution, and a copula, we can “couple” the marginals and the dependency structure into a multivariate joint distribution. This perspective finds various applications in the context of finance and related disciplines that need to take heavy tails and tail dependencies into account; see genest for an overview. In Section A of the Appendix we give some illustrative examples and further details about properties of copula distributions.
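The coupling described in Part 2 can be sketched in a few lines. The following example is our own illustration (it assumes SciPy; the choice of a Gaussian copula with exponential and t marginals is hypothetical): it samples from a Gaussian copula and attaches two entirely different marginals to the same dependence structure.

```python
import numpy as np
from scipy.stats import norm, expon, t

rng = np.random.default_rng(0)
n, rho = 100_000, 0.7

# Step 1: sample from a Gaussian copula by pushing correlated normals
# through their own CDF (probability integral transform).
cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=n)
u = norm.cdf(z)               # copula sample: uniform marginals, Gaussian dependence

# Step 2 (Sklar, part 2): couple arbitrary marginals to this dependence structure.
x1 = expon.ppf(u[:, 0])       # light-tailed marginal
x2 = t.ppf(u[:, 1], df=3)     # heavy-tailed marginal

# The marginals are exactly what we prescribed, yet x1 and x2 stay dependent.
assert abs(x1.mean() - 1.0) < 0.05           # Exp(1) has mean 1
assert abs(np.corrcoef(x1, x2)[0, 1]) > 0.3  # dependence survives the transform
```

The same two-step pattern, with the roles reversed, underlies the copula base distributions proposed in Section 3.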
3 NFs With Copula Base Distributions
In this paper, we propose to employ copulae to model a flexible, yet appropriate base distribution, with the goal of obtaining a NF that overcomes the limitations of NFs discussed in Section 1. We expect to gain powerful and robust PDF approximators by combining different marginals and properties of theoretically sound copulae (see for instance Chapter 8 in joe2014dependence) with NFs, which allow for the estimation of complex densities.
3.1 A General Framework
We propose to replace the isotropic Gaussian base distribution in the vanilla NF framework by a more flexible copula distribution. Importantly, we want to learn a base distribution that is able to represent the tail behavior of the target distribution. For training a NF with a copula base distribution, we build on the fact that we can write the PDF of the latent variable $z = T(x)$ as
\[ p_Z(z) = c\bigl(F_1(z_1), \dots, F_d(z_d)\bigr) \prod_{i=1}^{d} p_i(z_i), \tag{5} \]
where $F_i$ and $p_i$ denote the marginal CDFs and PDFs, respectively, and $c$ is the copula density. This requires two estimation steps: First, we need to estimate the marginal distributions $F_i$, which can further be used to calculate the marginal densities $p_i$. Secondly, we need to estimate the copula density $c$. A popular approach for estimating the density in (5) is the method of inference functions for margins (IFM) (Joe_Xu_1996), which first estimates the marginals using MLE and then, given these marginals, estimates the copula using MLE.
It is important to note that, in contrast to standard applications of copula theory, we do not aim at estimating the full data generating distribution based on (5). Instead, following the investigations by pmlr-v119-jaini20a, our goal is to capture the tail behavior of the target distribution. Hence, we propose to learn surrogate marginals that are able to represent the tailedness of the true marginals. By combining these surrogate marginals with some simple copula structure, such as the Gaussian Copula or the Independence Copula (see (6) below), we are able to create a joint distribution that represents the marginal tail behavior of the target.
The proposed adjustment can be applied to any existing NF architecture as long as (5) remains tractable. However, as the main goal of the base distribution is not to fully estimate the target but to represent its tail behavior, we can restrict ourselves to tractable parametric marginal distributions and copulae.
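As an illustration of such a tractable base density, the following sketch (our own, assuming SciPy; the marginal choices `t(df=3)` and `laplace()` are hypothetical examples) evaluates the Independence Copula case of (5), where the copula density is constant one, and compares it to a standard normal base on a tail sample:

```python
import numpy as np
from scipy.stats import t, laplace

# Hypothetical heavy-tailed parametric marginals for the base distribution.
marginals = [t(df=3), laplace()]

def base_log_prob(z):
    """log p_Z(z) = sum_i log p_i(z_i); the copula term vanishes for independence."""
    return sum(m.logpdf(z[:, i]) for i, m in enumerate(marginals))

z = np.array([[0.0, 0.0], [5.0, -5.0]])  # one central and one tail sample
lp = base_log_prob(z)

# A tail sample keeps a much higher log-density than under a standard normal
# base, which is what stabilizes the MLE objective (2) during training.
gauss_lp = -0.5 * (z ** 2).sum(axis=1) - np.log(2 * np.pi)
assert lp[1] > gauss_lp[1]
```

Because all quantities are simple parametric log-densities, plugging such a `base_log_prob` into an existing NF training loop keeps (5) fully tractable.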
3.2 Experimental Analysis
In this section, we investigate the benefits of the proposed approach on a toy problem. In the following experiments, we employ the framework proposed in Section 3.1 using the simplest copula, the Independence Copula; i.e., we consider a base distribution with PDF
\[ p_Z(z) = \prod_{i=1}^{d} p_i(z_i). \tag{6} \]
Note that by plugging Gaussian marginals into (6), we would obtain the vanilla NF. We consider a training set generated from a 2-dimensional heavy-tailed distribution (all computational details can be found in Section C of the Appendix), which has standardized t-distributed marginals. The corresponding copula is a Gumbel Copula, as defined in (8) in the Appendix.
As a proof of concept, we compare the estimation of this heavy-tailed distribution using a NF (a 3-layered MAF (papamakarios2017masked); further computational details can be found in Section C of the Appendix) with an isotropic Gaussian base distribution against the same NF with 3 different heavy-tailed base distributions constructed via (6). We consider the following heavy-tailed marginals:
- a Laplace marginal combined with a light-tailed marginal. We call this case heavierTails because one marginal is heavy-tailed;
- two t-distributed marginals whose parameters need not match those of the target. We call this case correctFamily since both marginals stem from the same parametric class as the exact marginals;
- the exact standardized t-distributed marginals of the target. We call this case exactMarginals.
Samples from the target distribution and from the different base distributions are visualized in Figures 6 and 7 in the Appendix.
Training and test loss
In Figure 1, we plot the average training and test performance over multiple trials. It is apparent that training using a base distribution with the correct type of tail behavior is beneficial. First of all, we observe a significant gap between the test performance of the vanilla NF and the NFs with a heavy-tailed base distribution. Notice that in Figure 1 we excluded all runs whose final test loss exceeded a fixed threshold, which happened in 17 of the normal runs and not once in the other cases. Furthermore, we clearly observe a much faster convergence and a more stable training procedure. The fluctuations and instabilities of the vanilla NF are due to tail samples that have a massive effect on the likelihood in (2); this effect can be reduced by choosing base distributions with slower decaying tails (alexanderson2020robust).
Learning the tails
To illustrate the ability to model the tails, we compare the estimated empirical quantile functions. We do so for both marginal distributions (Figure 2) as well as for a further summary statistic (Figure 9 in the Appendix). In line with the findings by pmlr-v119-jaini20a, we notice that the vanilla NF is not capable of modeling the quantiles of the target distribution. More precisely, we observe that the corresponding quantile function is steeper around its center and has shorter tails. This means that the distribution learned by the NF does not account for the heavy tails by directly modeling them, but instead covers samples from the tails of the data distribution by being more widespread. In contrast, the base distributions that take the tailedness of the target into account achieve a much better fit to the quantiles; see Figure 8 in the Appendix for further results.
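The qualitative effect described above—a variance-matched Gaussian is more widespread near the center yet still too short in the tails—can be reproduced with a small numerical check. This is our illustration (assuming SciPy; the t marginal with `df=3` is a stand-in for the heavy-tailed target, not the exact experimental setup):

```python
import numpy as np
from scipy.stats import t, norm

rng = np.random.default_rng(1)
data = t(df=3).rvs(100_000, random_state=rng)  # heavy-tailed target marginal

# A Gaussian matched to the target's variance must inflate its scale
# to cover tail samples ...
sigma = np.sqrt(3 / (3 - 2))  # Var of a t-distribution with nu df is nu/(nu-2)
qs = [0.75, 0.999]
emp = np.quantile(data, qs)    # empirical quantiles of the heavy-tailed data
gauss = norm.ppf(qs, scale=sigma)

# ... so it is wider around the center but still misses the extreme quantiles.
assert gauss[0] > t(df=3).ppf(0.75)  # more widespread near the median
assert gauss[1] < emp[1]             # shorter tails than the data
```

This mirrors the quantile-function comparison in Figure 2: no single Gaussian scale can match both the bulk and the tails of a heavy-tailed marginal.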
Invertibility and numerical stability
As investigated by pmlr-v130-behrmann21a, the Bi-Lipschitz constant plays a fundamental role in the practical invertibility and numerical stability of NFs. To understand the learned transformation in terms of its Lipschitz continuity, we propose to study the Lipschitz surface of $T$. Note that if $T$ is differentiable and $L$-Lipschitz, we can follow the derivation by pmlr-v130-behrmann21a (equations (5) and (6) therein) to approximate
\[ \mathrm{Lip}_{\mathrm{loc}}(x) \approx \max_{\|u\| = 1} \frac{\bigl\|T(x + \varepsilon u) - T(x)\bigr\|}{\varepsilon} \le L, \]
where $\varepsilon > 0$ is some small constant. This motivates considering an estimate (see Section C in the Appendix for details) of $\mathrm{Lip}_{\mathrm{loc}}(x)$ as a local surrogate for the Lipschitz continuity of $T$. Plotting these quantities for both $T$ and $T^{-1}$, we obtain the Lipschitz surfaces depicted in Figure 3. We notice that the vanilla NF has many fluctuations in its local Lipschitz continuity, while the proposed copula method leads to a well-behaved transformation. The inverse transformation of the vanilla NF has exploding local Lipschitz constants, while—again—the proposed method results in a stable inverse transformation.
4 Discussion
In this work, we paved the way toward a general extension of NF architectures using copulae. Synthetic experiments revealed that the modeling performance of NFs can be improved substantially by replacing the vanilla Gaussian base distribution with a base distribution that reflects basic properties of the data distribution more accurately. Of course, we have only scratched the surface of the underlying potential of the proposed approach: While we concentrate on the tail behavior of the marginals in this work, the general idea can potentially also be applied to incorporate other types of inductive bias, such as multimodality, by choosing multimodal marginals, or symmetries and tail dependencies, by selecting appropriate marginals and copulae.
Our experiments suggest that it is sufficient to have only a broad estimate of the marginals. As mentioned in Section 3.1, one could also employ IFM to learn these before training the NF. However, the question of the best technique for choosing or estimating appropriate marginal distributions and copulae still requires further investigation. Nevertheless, we think that this flexibility brings additional improvement over the methods proposed by pmlr-v119-jaini20a and alexanderson2020robust. Of course, our empirical study has so far yielded only preliminary results, which we plan to verify on real-world data and for different models in the future.
We believe that our analysis of the base distributions can help to popularize NFs in a wide spectrum of domains. One such application might be financial risk analysis, where it is essential to model tail dependencies.
Acknowledgement
This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC-2092 CASA – 390781972. We also thank the anonymous reviewers for their careful reading and their useful suggestions, which will help us extend this work to a full paper.
References
Appendix A Examples of Copula Distributions
There is a wealth of different copula distributions, ranging from parametric distributions to semi-parametric and completely non-parametric copula models. In this section, we give some insights into the construction of one class of copula distributions and illustrate some basic properties.
Example A.1 (Construction using Sklar’s Theorem).
There is a generic way to construct a copula according to Definition 2.1. Consider any multivariate continuous and invertible CDF $F$ with marginal CDFs $F_1, \dots, F_d$. Then, following (4) in Sklar’s Theorem 2.2, we know that
\[ C(u_1, \dots, u_d) = F\bigl(F_1^{-1}(u_1), \dots, F_d^{-1}(u_d)\bigr) \tag{7} \]
defines a valid copula.
Setting $F$ to be the multivariate Gaussian distribution with correlation matrix $R$, we obtain the Gaussian Copula. If $R$ is the identity matrix, we obtain the Independence Copula, which is simply the product of independent uniform distributions. Both copulae are visualized below.
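A minimal sketch of this construction (our illustration, assuming SciPy) evaluates the Gaussian Copula via (7) and verifies that the identity correlation matrix indeed yields the Independence Copula:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def gaussian_copula_cdf(u, corr):
    """C(u) = F(F_1^{-1}(u_1), ..., F_d^{-1}(u_d)) as in (7), with F Gaussian."""
    z = norm.ppf(u)  # right-inverse of the standard normal marginal CDFs
    return multivariate_normal(mean=np.zeros(len(u)), cov=corr).cdf(z)

u = np.array([0.3, 0.8])
# With identity correlation, the Gaussian Copula degenerates to the
# Independence Copula C(u) = u_1 * u_2.
c_indep = gaussian_copula_cdf(u, np.eye(2))
assert abs(c_indep - 0.3 * 0.8) < 1e-4
```

For non-identity `corr`, the same function produces the full Gaussian Copula family, parameterized solely by the correlation matrix.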
Example A.2 (Copulae that induce Tail-Dependency).
A crucial property of models in financial risk analysis is tail dependency. Roughly speaking, a tail dependency forces marginals to be dependent in tail events. As an example, consider two quantities that are essentially independent under normal circumstances, such as the amount of debt of a bank and our trust in that bank. Usually, our trust in the bank is not essentially determined by the amount of debt it has. In a marginal tail event, however, such as the debt becoming extreme, our trust drops rapidly.
Mathematically, the upper tail dependency, i.e., the dependency in upper-tail events, for a random vector $(X_1, X_2)$ with marginal CDFs $F_1, F_2$ is defined as
\[ \lambda_U = \lim_{q \to 1^{-}} P\bigl(X_2 > F_2^{-1}(q) \mid X_1 > F_1^{-1}(q)\bigr). \]
Similarly, we can define the lower tail dependency $\lambda_L$. One copula that accounts for upper tail dependency is the Gumbel Copula, which is given by
\[ C(u_1, u_2) = \exp\Bigl(-\bigl((-\log u_1)^{\theta} + (-\log u_2)^{\theta}\bigr)^{1/\theta}\Bigr) \tag{8} \]
for $\theta \geq 1$. Figure 4 shows a visualization of the Gumbel copula. One can show that $\lambda_U = 2 - 2^{1/\theta}$ (joe2014dependence). Figure 5 visualizes the Gumbel copula distribution with Gaussian and Gamma marginals. We can observe a clear dependency—indicated by the peak pointing to the upper right—in the upper-tail events, which is mathematically described by $\lambda_U > 0$.
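The tail-dependence coefficient of the Gumbel Copula can be checked numerically from (8) alone, using the standard identity $\lambda_U = \lim_{q \to 1^{-}} (1 - 2q + C(q, q))/(1 - q)$ for copulae. The following sketch is our illustration:

```python
import numpy as np

def gumbel_cdf(u1, u2, theta):
    """Gumbel Copula (8): C(u1,u2) = exp(-((-log u1)^theta + (-log u2)^theta)^(1/theta))."""
    s = (-np.log(u1)) ** theta + (-np.log(u2)) ** theta
    return np.exp(-s ** (1.0 / theta))

theta = 2.0
# Approximate the limit defining the upper tail dependence,
# lambda_U = lim_{q -> 1} (1 - 2q + C(q, q)) / (1 - q),
# which for the Gumbel Copula equals 2 - 2^(1/theta).
q = 1.0 - 1e-6
lam_numeric = (1.0 - 2.0 * q + gumbel_cdf(q, q, theta)) / (1.0 - q)
assert abs(lam_numeric - (2.0 - 2.0 ** (1.0 / theta))) < 1e-4
```

For $\theta = 2$ this gives $\lambda_U = 2 - \sqrt{2} \approx 0.586$, a substantial probability of joint extremes despite the copula being exchangeable and smooth in the bulk.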
One could think of many more properties that might be incorporated using our copula approach, such as multimodality and symmetry. However, it remains to be researched whether properties such as tail dependency are preserved by the NF. Nonetheless, we think that fixing specific known properties in the base distribution facilitates training by acting as a type of regularization toward these properties.
Appendix B Supplementary Experiments
In the following, we give some further empirical results that underpin our findings from the main paper.
Figure 8 and Figure 9 supplement the findings about the learned quantiles: even the other heavy-tailed base distributions—the cases heavierTails and correctFamily—are able to successfully approximate the quantiles. This observation suggests that it is sufficient to use a broad surrogate base distribution that captures the true tail behavior.
To address the invertibility and numerical stability of the learned transformation, we investigate the local Lipschitz constant as derived in Section 3.2 in the main text. Figure 10 shows the supplementary Lipschitz surfaces of the cases heavierTails and correctFamily, which indicate a more regular transformation than in the normal case. Still, in the case correctFamily we find large local Lipschitz constants, which, however, are structured in accordance with the base distribution. In the vanilla NF this irregularity is less structured and, since the irregularities occur in high-probability areas of the base distribution, more relevant.
These results support our claim that choosing an appropriately tailed base distribution can help in learning a numerically robust transformation.
Appendix C Computational Details
In all of our experiments, we employed Masked Autoregressive Flows (papamakarios2017masked) with 3 layers. Each layer contains a reverse permutation, followed by an autoregressive transformation with 4 hidden features. The code is based on the nflows package (nflows).
We trained the NFs on a finite training sample. Optimization was carried out using the Adam optimizer with the PyTorch default settings and minibatches. Test losses are evaluated on a held-out test sample.
The reported training and test losses (Figure 1) have been averaged over all runs, and the depicted bands are the corresponding confidence intervals.
To estimate the Lipschitz surface, we rely on an estimate of
\[ \mathrm{Lip}_{\mathrm{loc}}(x) = \max_{\|u\| = 1} \frac{\bigl\|T(x + \varepsilon u) - T(x)\bigr\|}{\varepsilon} \tag{9} \]
for some small constant $\varepsilon > 0$; we chose a fixed small value of $\varepsilon$. Further, we approximate (9) by
\[ \widehat{\mathrm{Lip}}_{\mathrm{loc}}(x) = \max_{j = 1, \dots, m} \frac{\bigl\|T(x + \varepsilon u_j) - T(x)\bigr\|}{\varepsilon}, \]
where $u_1, \dots, u_m$ are i.i.d. samples from the uniform distribution on the unit sphere; see pmlr-v130-behrmann21a for further details.
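A possible implementation of this estimator (our sketch, not the original code; function and parameter names are hypothetical) is the following:

```python
import numpy as np

def local_lipschitz(T, x, eps=1e-3, n_dirs=64, seed=0):
    """Estimate the local Lipschitz constant of T at x as in (9):
    the maximum of ||T(x + eps*u_j) - T(x)|| / eps over random unit directions u_j."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_dirs, x.size))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # i.i.d. directions on the unit sphere
    diffs = np.array([T(x + eps * d) - T(x) for d in u])
    return np.linalg.norm(diffs, axis=1).max() / eps

# Sanity check on a linear map, whose Lipschitz constant is its largest singular value.
A = np.diag([3.0, 0.5])
est = local_lipschitz(lambda x: A @ x, np.zeros(2))
assert abs(est - 3.0) < 0.1
```

Evaluating `local_lipschitz` on a grid of points, for both the learned transformation and its inverse, yields Lipschitz surfaces of the kind shown in Figures 3 and 10.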
All code is provided with the submission and can further be accessed through https://github.com/MikeLasz/Copula-Based-Normalizing-Flows.