Increasing triangular maps are a recent construct in probability theory that can transform any source density to any target probability density. The Knothe-Rosenblatt transformation [30; 18], [36, Ch. 1] gives a heuristic construction of an increasing triangular map for transporting densities that is unique (up to null sets). These transformations provide a unified framework to study popular neural density estimation methods like normalizing flows [33; 32; 29] and autoregressive models [26; 14; 17; 35; 19], which provide a tractable method for evaluating a probability density. Indeed, these methods are becoming increasingly attractive for the task of multivariate density estimation in unsupervised machine learning.
This work is devoted to studying the properties of triangular flows that learn increasing triangular transformations when the target density is a heavy-tailed distribution. Heavy-tailed analysis studies phenomena governed by large movements and encompasses both statistical inference and probabilistic modelling. Indeed, heavy-tailed analysis is extensively used in diverse applications: in financial risk modelling, where financial returns and risk-management calculations require heavy-tailed analysis [5; 10; 21]; in data networks, where heavy-tailed distributions are observed for file sizes, transmission rates, transmission durations and network traffic [24; 8; 20]; and in modelling insurance claim sizes and frequencies in order to set premiums efficiently and quantify the risk to the company [5; 10].
Specifically, we study triangular flows to represent multivariate heavy-tailed elliptical distributions, which are often used for modelling financial data and in the theory of portfolio optimization. Indeed, the basis of modern portfolio optimization relies on the Gaussian distribution hypothesis [23; 34; 31]. However, as demonstrated by multiple studies [9; 12; 16], the Gaussian distribution hypothesis cannot be justified for financial modelling, and elliptical distributions are the suggested alternative, particularly because they retain certain desirable practical properties of the normal distribution.
We begin our exposition in §3, where we show that in one dimension the density quantile functions of the source and the target probability densities precisely characterize the slope of the (unique) increasing transformation. Subsequently, we give an exact characterisation of the degree of heavy-tailedness of a distribution based on the asymptotic properties of the density quantile function. This allows us to characterize the properties of an increasing transformation required to push a source density to a target density with different tail behaviour. Finally, we make precise the connection between the asymptotics of the density quantile function and the existence of higher-order moments of a distribution. We use this to give a precise rate (which accounts for the relative heaviness of the source and target densities) at which an increasing transformation must grow to capture the tail behaviour of the target density.
In §4, we extend these results to higher dimensions. We define multivariate heavy-tailed distributions as distributions whose marginals are heavy-tailed in all directions and show that any increasing triangular map from a light-tailed distribution to a heavy-tailed distribution must have all diagonal entries of the Jacobian matrix (and hence all eigenvalues and the determinant) unbounded. We discuss the implications of our findings for neural density estimation in §5. We highlight the trade-off between choosing an appropriate source density and the “complexity” of the transformation required to learn a target density. We provide all the proofs in §A.
We summarize our main contributions as follows: 1) we show that density quantiles precisely capture the properties of a push-forward transformation; 2) we relate the properties of density quantiles to the existence of functional moments and to tail properties, allowing us to provide asymptotic rates for transformations required to capture heavy-tailed behaviour; 3) we reveal properties of density quantiles for certain classes of distributions, in both one dimension and higher dimensions, that might be of independent interest; 4) we precisely study the properties of increasing maps required to capture heavy-tailed behaviour; 5) we reveal the trade-off between choosing a “complex” source density and an “expressive” transformation for representing target densities, and its implications for flow-based models.
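Consider two probability density functions $p$ and $q$ (with respect to the Lebesgue measure) over the source domain $\mathsf{Z} \subseteq \mathbb{R}^d$ and the target domain $\mathsf{X} \subseteq \mathbb{R}^d$, respectively. There always exists a deterministic transformation $T : \mathsf{Z} \to \mathsf{X}$ (cf. [36, Ch. 1]) such that for all measurable sets $B \subseteq \mathsf{X}$,
$$\int_{B} q(x)\, dx = \int_{T^{-1}(B)} p(z)\, dz.$$
Specifically, by using the change of variables formula, i.e. $q(x) = p(z) / |\nabla_z T(z)|$ with $x = T(z)$, a diffeomorphic function $T$ can push forward a base random variable $z \sim p$ to a target random variable $x \sim q$ such that $q$ is the push forward of $p$, i.e. $q = T_{\#} p$, where $|\nabla_z T(z)|$ is the absolute value of the determinant of the Jacobian of $T$.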
Fortuitously, it is always possible to construct such a transformation: we call a mapping $T : \mathbb{R}^d \to \mathbb{R}^d$ triangular if its $j$-th component $T_j$ only depends on the first $j$ variables $z_1, \ldots, z_j$. The name “triangular” comes from the fact that the Jacobian of $T$ is a triangular matrix function. We call $T$ increasing if for all $j \in \{1, \ldots, d\}$, $T_j$ is an increasing function of $z_j$.
Theorem 1 ().
For any two densities $p$ and $q$ over $\mathbb{R}^d$, there exists a unique (up to null sets of $p$) increasing triangular map $T : \mathbb{R}^d \to \mathbb{R}^d$ so that $q = T_{\#} p$.
Before proceeding further, let us first give an example of a construction of an increasing triangular transformation to help better understand Theorem 1. This example will subsequently form the basis of our theoretical exposition in the paper.
Example 1 (Increasing Rearrangement).
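Let $p$ and $q$ be univariate probability densities with distribution functions $F_p$ and $F_q$, respectively. One can define the increasing map $T = F_q^{-1} \circ F_p$ such that $q = T_{\#} p$, where $F_q^{-1}$ is the quantile function of $q$:
$$F_q^{-1}(u) = \inf\{t : F_q(t) \geq u\}.$$
Indeed, if $Z \sim p$, one has that $F_p(Z) \sim \mathrm{Uniform}(0, 1)$. Also, if $U \sim \mathrm{Uniform}(0, 1)$, then $F_q^{-1}(U) \sim q$. Theorem 1 is a rigorous iteration of this univariate argument by repeatedly conditioning (a construction popularly known as the Knothe-Rosenblatt transformation [30; 18]). Note that the increasing property is essential for claiming the uniqueness of $T$. (Footnote 1: for instance, if $p$ is symmetric about the origin, then both $z \mapsto z$ and $z \mapsto -z$ push $p$ to itself.)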
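The increasing rearrangement of Example 1 can be sketched numerically. The snippet below is only an illustration under assumed choices that are not in the text: a standard normal source, an Exp(1) target, and the hypothetical names `target_quantile` and `increasing_map`. It composes the source cdf with the target quantile function and pushes samples through the result.

```python
import math
import random
from statistics import NormalDist

# Assumed source density: standard normal (illustrative choice).
source = NormalDist(mu=0.0, sigma=1.0)


def target_quantile(u: float) -> float:
    """Quantile function of the assumed Exp(1) target: -log(1 - u)."""
    return -math.log1p(-u)


def increasing_map(z: float) -> float:
    """Increasing rearrangement: source cdf first, then target quantile."""
    return target_quantile(source.cdf(z))


# Push source samples through the map; the result should look like Exp(1).
random.seed(0)
samples = [increasing_map(random.gauss(0.0, 1.0)) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # Exp(1) has mean 1
```

The map is increasing because it is a composition of two increasing functions, which is the property the uniqueness claim relies on.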
Thus, triangular mappings constitute an appealing function class to learn a target density. Indeed, many recent generative models in unsupervised machine learning are precisely special cases of this approach. In this paper, we characterize the properties of such increasing triangular mappings required to learn a heavy-tailed target density from a source density.
3 Properties of Univariate Transformations
The increasing rearrangement is the unique increasing transformation between two densities (cf. Example 1). Conveniently, we can analyze the slope of this transformation analytically. For a probability density $p$ over a domain $\mathsf{Z} \subseteq \mathbb{R}$, let
$$F_p(z) = \int_{-\infty}^{z} p(t)\, dt$$
denote the cumulative distribution function of $p$, $Q_p = F_p^{-1}$ be the quantile function given by $Q_p(u) = \inf\{z : F_p(z) \geq u\}$, and $fQ_p$ be the density quantile function with functional form $fQ_p(u) = p(Q_p(u))$. It is further given by the reciprocal of the derivative of the quantile function, i.e. $fQ_p(u) = 1 / Q_p'(u)$. The slope of $T$ such that $q = T_{\#} p$, where $p, q$ are two densities, is given by the ratio of the density quantile functions of the source and the target distribution, respectively, i.e.
$$T'(z) = \frac{fQ_p(u)}{fQ_q(u)}, \qquad u = F_p(z).$$
Let $p$ and $q$ be two densities and $T$ be an increasing map such that $q = T_{\#} p$. If the density quantile of $p$ shrinks to 0 at a rate slower than the density quantile of $q$, then the slope $T'$ is asymptotically unbounded.
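The statement that the slope of the increasing map equals the ratio of the source and target density quantile functions can be checked numerically. The following is a minimal sketch, assuming a standard normal source and a standard Cauchy target (both chosen here for illustration, as are all function names); it compares the ratio formula with a central finite difference of the map itself.

```python
import math
from statistics import NormalDist

p = NormalDist()  # assumed source: standard normal


def cauchy_pdf(x: float) -> float:
    return 1.0 / (math.pi * (1.0 + x * x))


def cauchy_quantile(u: float) -> float:
    return math.tan(math.pi * (u - 0.5))


def T(z: float) -> float:
    """Increasing map: target quantile applied to the source cdf."""
    return cauchy_quantile(p.cdf(z))


def slope_from_density_quantiles(z: float) -> float:
    """Ratio of density quantiles evaluated at u = F_p(z)."""
    u = p.cdf(z)
    fq_source = p.pdf(z)  # p(Q_p(u)) with Q_p(u) = z
    fq_target = cauchy_pdf(cauchy_quantile(u))
    return fq_source / fq_target


def slope_finite_difference(z: float, h: float = 1e-5) -> float:
    return (T(z + h) - T(z - h)) / (2.0 * h)


# The slope grows along the right tail: the Cauchy density quantile
# shrinks to zero faster than the normal one.
slopes = [slope_from_density_quantiles(z) for z in (0.0, 1.0, 2.0, 3.0)]
```

With these choices the slope keeps increasing as `z` moves into the right tail, which is the behaviour the result above describes for a heavier-tailed target.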
Clearly, the density quantile functions precisely characterize the slope of an increasing transformation. Moreover, we can further characterise the asymptotic properties of an increasing transformation using the asymptotics of the density quantiles of distributions, following [27; 1], who proved that the limiting behaviour of any density quantile function as $u \to 1^-$ (corresponding to right tails) is:
$$fQ(u) \sim (1 - u)^{\alpha},$$
where $g(u) \sim h(u)$ implies that $\lim_{u \to 1^-} g(u) / h(u)$ is a finite constant.
Let and . Then, such that is given by:
where is the error function. Furthermore, and and hence, . Similarly, for :
and . Therefore, .
Additionally, we can also define the limiting behaviour of the quantile function as as:
The parameter is called the tail-exponent and defines the (right) tail-area of a distribution. Indeed, if for two distributions with tail exponents and , if , the corresponding distribution has heavier tails relative to the other. The tail exponent allows us to define distributions based on their degree of heaviness as follows:
Following , if
the distributions are short-tailed, e.g. Uniform distribution. Here, we further show that a distribution has support bounded from above if and only if the right density quantile function has tail-exponent.
Let be a density with as . Then, where i.e. has a support bounded from above.
corresponds to a family of distributions for which all higher order moments exist. However, these distributions are relatively heavier tailed than short-tailed distributions and were termed as medium tailed distributions in 
, e.g. normal and exponential distribution. Additionally, for, a more refined description of the asymptotic behaviour of quantile function can be given in terms of the shape parameter :
determines the degree of heaviness in medium tailed distributions; the smaller the value of , the heavier the tails of the distribution e.g. exponential distribution has , and normal distribution has . Based on this, we can define
Therefore, we have and . Finally, heavy tailed distributions have e.g. Cauchy and . We next give a precise characterization of asymptotic properties of a diffeomorphic transformation from one distribution to the other with varying tail behaviour in the following corollary of Theorem 2:
Let be a source distribution, be a target distribution and be an increasing transformation such that . Then,
if , the slope of converges asymptotically to 0
if , the slope of converges asymptotically to a finite constant
if , the slope of asymptotically diverges to infinity
if , the slope of diverges to infinity asymptotically
if , the slope of converges to a finite constant
if , the slope of converges to zero asymptotically
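To make the classification by tail exponent concrete, the sketch below estimates the exponent numerically. It assumes the convention of [27] that the density quantile function behaves like $(1 - u)^{\alpha}$ as $u \to 1^-$. The closed forms for the uniform, exponential and Cauchy density quantile functions are derived here for illustration, and the function names are hypothetical.

```python
import math


def fq_uniform(u: float) -> float:
    """Uniform(0, 1): the density is 1 on the whole support."""
    return 1.0


def fq_exponential(u: float) -> float:
    """Exp(1): Q(u) = -log(1 - u) and p(x) = exp(-x), so fQ(u) = 1 - u."""
    return 1.0 - u


def fq_cauchy(u: float) -> float:
    """Standard Cauchy: Q(u) = tan(pi (u - 1/2)), so fQ(u) = sin(pi u)^2 / pi."""
    return math.sin(math.pi * u) ** 2 / math.pi


def tail_exponent(fq, u1: float = 1 - 1e-4, u2: float = 1 - 1e-6) -> float:
    """Slope of log fQ(u) against log(1 - u) between two points near u = 1."""
    num = math.log(fq(u1)) - math.log(fq(u2))
    den = math.log(1.0 - u1) - math.log(1.0 - u2)
    return num / den


exponents = {
    "uniform": tail_exponent(fq_uniform),
    "exponential": tail_exponent(fq_exponential),
    "cauchy": tail_exponent(fq_cauchy),
}
```

Under this convention the estimates order the three distributions from short-tailed (uniform) through medium-tailed (exponential) to heavy-tailed (Cauchy).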
Let us give another example to underscore the importance of using density quantiles to define tail-behaviour and the increasing push-forward transformations.
Example 3 (Pushing uniform to normal).
Let be uniform over and be normal distributed. The unique increasing transformation
where is the error function, which was Taylor expanded in the last equality. The coefficients and . We observe that the derivative of is an infinite sum of squares of polynomials. Both uniform and normal distributions are considered “light-tailed” (all their higher moments exist and are finite). However, an increasing transformation from uniform to normal distribution has unbounded slope. Density quantile functions help us to reveal this precisely: and i.e. Normal distribution is “relatively” heavier tailed than uniform distribution explaining the asymptotic divergence of this transformation. Indeed, the density quantiles help to provide a more granular definition of heavy-tailedness based on the tail-exponent and shape exponent .
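The unbounded slope in Example 3 can be observed numerically. The sketch below assumes the uniform source lives on $(0, 1)$, so that its cdf is the identity and its density quantile function is constant; the map is then simply the normal quantile function. Function names are illustrative.

```python
from statistics import NormalDist

normal = NormalDist()  # standard normal target


def T(u: float) -> float:
    """Increasing map from Uniform(0, 1) to N(0, 1): the normal quantile."""
    return normal.inv_cdf(u)


def slope(u: float) -> float:
    """Ratio of density quantiles: 1 (uniform) over the normal density at T(u)."""
    return 1.0 / normal.pdf(T(u))


# Slopes evaluated closer and closer to the right end of the support.
slopes = [slope(u) for u in (0.5, 0.9, 0.99, 0.999, 0.9999)]
```

The values in `slopes` increase without an apparent bound as `u` approaches 1, even though both distributions have finite moments of all orders.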
Given a random variable $X$ with quantile function $Q$, the expected value of a function $g(X)$ can be written in terms of the quantile function as $\mathbb{E}[g(X)] = \int_0^1 g(Q(u))\, du$. This allows us to draw a precise connection between the degree of heavy-tailedness of a distribution, as given by the density quantile function (and the tail exponent), and the existence of its higher-order moments.
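The quantile representation of an expectation can be evaluated with a simple quadrature rule. The sketch below assumes an Exp(1) distribution (an illustrative choice with known moments) and uses a midpoint rule on $(0, 1)$; the helper name `expectation_via_quantile` is hypothetical.

```python
import math


def exp_quantile(u: float) -> float:
    """Quantile function of Exp(1)."""
    return -math.log1p(-u)


def expectation_via_quantile(g, quantile, n: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of g(Q(u)) over (0, 1)."""
    h = 1.0 / n
    return h * sum(g(quantile((i + 0.5) * h)) for i in range(n))


first_moment = expectation_via_quantile(lambda x: x, exp_quantile)
second_moment = expectation_via_quantile(lambda x: x * x, exp_quantile)
```

For Exp(1) the first two moments are 1 and 2, so the two approximations should land close to those values. For a distribution whose moment does not exist, the same sum keeps growing as `n` increases because the integrand is not integrable near $u = 1$.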
Let be a distribution with as . Then, exists and is finite for some iff .
If is a distribution with as and as (footnote 2: this condition takes the left tail into account as well; note that it is not necessary for both tails to have the same behaviour, and our analysis extends to such cases easily), then exists and is finite iff .
Based on these observations, we can equivalently define heavy-tailed distributions as follows:
A distribution with compact support i.e. where and is said to be heavy tailed if for all , exists and is finite, but for , is infinite or does not exist.
A distribution with tail exponent is said to be heavy tailed if for all , exists and is finite, but for , is infinite or does not exist.
Definition 3 (heavy tailed distributions).
A distribution with tail-exponent is heavy tailed with degree with if for all , exists and is finite, but for all , is infinite or does not exist.
These definitions allow us to finally give the rate an increasing transformation must emulate to exactly represent tail-properties of a target density given some source density. Concretely,
Let be a heavy distribution, be a heavy distribution and be a diffeomorphism such that . Then for small , .
4 Properties of Multivariate Transformations
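We recall that there exists a unique bijective increasing triangular map $T$ that transforms a high-dimensional joint source density $p$ to a target density $q$. The $j$-th component of $T$ is given by $T_j = F^{-1}_{q, j | <j} \circ F_{p, j | <j}$, where $F_{\cdot, j | <j}$ is the cdf of the conditional distribution of the $j$-th variable given the first $j - 1$ variables. Analogous to our results in §3, we shall characterise the properties of $T$ by studying the properties of $T_j$ required to push $p$ to $q$ with varying tail properties. Evidently, for a triangular transformation $T$, the determinant of the Jacobian, i.e. $|\nabla_z T(z)|$, is just the product of the diagonal entries, where each diagonal entry $\partial T_j / \partial z_j$ is given by the ratio of the corresponding conditional density quantile functions of the source and the target.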
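The claim that the Jacobian determinant of a triangular map is the product of its diagonal entries can be illustrated with a two-dimensional toy map. The map below is an arbitrary increasing triangular map chosen for illustration (it is not derived from any particular pair of densities), and the Jacobian is approximated by central finite differences.

```python
import math


def triangular_map(z1: float, z2: float):
    """Toy increasing triangular map (illustrative, not from the text)."""
    x1 = math.sinh(z1)                          # depends on z1 only
    x2 = z1 ** 2 + math.exp(z1) * z2 ** 3 + z2  # increasing in z2 for every z1
    return x1, x2


def jacobian(f, z, h: float = 1e-6):
    """Central finite-difference Jacobian; J[i][j] = d f_i / d z_j."""
    d = len(z)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        zp, zm = list(z), list(z)
        zp[j] += h
        zm[j] -= h
        fp, fm = f(*zp), f(*zm)
        for i in range(d):
            J[i][j] = (fp[i] - fm[i]) / (2.0 * h)
    return J


J = jacobian(triangular_map, (0.3, -1.2))
det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
diag_product = J[0][0] * J[1][1]
```

Because the first component ignores `z2`, the entry `J[0][1]` vanishes and the determinant reduces to the product of the two diagonal entries, which are the only quantities the tail analysis needs to control.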
Hence, by being able to characterize the properties of the conditional density quantiles, we shall be able to characterize the properties of . However, we first define the notion of tail-behaviour in multivariate distributions: A multivariate distribution is heavy-tailed if the marginal distributions in every direction on the (high-dimensional) sphere are heavy-tailed i.e. a distribution is said to be heavy-tailed if for all vectors where and , . This definition automatically implies that the univariate random variable is heavy tailed. In particular, we will consider the class of elliptical distributions since they admit the same tail-behaviour in every direction.
Definition 4 (Elliptical distribution, ).
A random vector is said to be elliptically distributed denoted by with if and only if there exists a , a matrix with maximal rank and, a non-negative random variable , such that , where the random -vector is independent of and is uniformly distributed over the unit sphere , and is the cumulative distribution function of the variate .
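The stochastic representation in Definition 4 (location plus a radius times a matrix times a uniform direction) suggests a direct sampler. The sketch below is one way to write it, with hypothetical helper names; the two radius laws are assumptions chosen for illustration, giving a Gaussian and a multivariate Student-t vector respectively when the matrix is square.

```python
import math
import random


def uniform_on_sphere(d: int, rng: random.Random):
    """Uniform direction on the unit sphere: a normalised Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def sample_elliptical(mu, A, sample_radius, rng: random.Random):
    """Draw mu + R * A * U with R independent of the direction U."""
    k = len(A[0])
    u = uniform_on_sphere(k, rng)
    r = sample_radius(rng)
    return [mu[i] + r * sum(A[i][j] * u[j] for j in range(k))
            for i in range(len(mu))]


def gaussian_radius(d: int):
    """Light-tailed radius: square root of a chi-square with d degrees of freedom."""
    return lambda rng: math.sqrt(rng.gammavariate(d / 2.0, 2.0))


def student_t_radius(d: int, nu: float):
    """Heavy-tailed radius: chi_d divided by sqrt(chi-square_nu / nu)."""
    def draw(rng: random.Random) -> float:
        chi2_d = rng.gammavariate(d / 2.0, 2.0)
        chi2_nu = rng.gammavariate(nu / 2.0, 2.0)
        return math.sqrt(chi2_d / (chi2_nu / nu))
    return draw
```

Only the radius law changes between the two cases, which mirrors the point that the tail behaviour of an elliptical vector is inherited from its generating variate.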
For ease in developing our results, we consider only full rank elliptical distributions i.e. . The spherical random vector produces elliptically contoured density surfaces due to the transformation . The density function of an elliptical distribution as defined above (footnote 3: the density function is defined if the distribution is absolutely continuous, which happens if the generating variate is absolutely continuous) is given by:
where the function is related to , the density of , by the equation: , here is the area of a unit sphere. Thus, the tail properties of a random variable with an elliptical distribution i.e. is determined by the generating random variable . Indeed, is heavy-tailed in all directions if the univariate generating random variable is heavy-tailed. Define
Intuitively, is the -th order moment of when is integer-valued. We can now generalize Definition 3 to the multivariate case: the distribution is -heavy iff is finite for all iff is -heavy. Similarly, from Definition 2 one has that is -heavy iff is -heavy.
Elliptical distributions have certain convenient properties: an affinely transformed elliptical random vector is elliptical. Let and . Consider the transformed vector where . Then, . In particular, if is a permutation matrix then is also elliptically distributed and belongs to the same location-scale family as . Additionally, the marginal and conditional distributions of an elliptical distribution are also elliptical.
Lemma 1 (Marginal distributions of an elliptical distribution are elliptical, ).
Let where and partition such that . Let and be the corresponding partitions of and respectively. Then, .
Let where and is p.s.d with and where . Further, let and partition such that . Let and be the corresponding partitions of and respectively. If the conditional random vector exists then
where where .
In our next result we analyze the tail-properties of the conditional distributions of heavy-tailed elliptical distribution.
Under the same assumptions as in Lemma 2, if is -heavy, then the conditional distribution of is -heavy.
We now state the main result of this section: an increasing triangular map that transforms a light-tailed elliptical distribution to a heavy-tailed elliptical distribution has a Jacobian whose diagonal entries are all unbounded.
Let and be two random variables with elliptical distributions with densities and respectively where is heavier tailed than . If is an increasing triangular map such that , then all diagonal entries of are unbounded. Moreover, the determinant of the Jacobian of is also unbounded.
Let be a random variable with density function that is light-tailed and be a target random variable with density function that is heavy-tailed. If pushes forward to i.e. such that , then there exists an index such that is unbounded.
We next give a general result for any transformation.
We provide the proof in Appendix A. ∎
5 Triangular Flows and Approximation
Neural density estimation methods like autoregressive models [25; 2; 19; 35] and normalizing flows [29; 33; 32] provide a tractable way to evaluate the exact density and are increasingly being used for the purpose of multivariate density estimation in machine learning [17; 6; 7; 26; 35; 14]. Invariably, these methods aim to learn a bijective, invertible and increasing transformation from a simple, known source density to a desired target density such that the inverse and the Jacobian are easy to compute.
As discussed in , most autoregressive models and normalizing flows at their core implement exactly a triangular map i.e. they learn a transformation such that .  considered the affine map .  alternatively replaced the affine form of 
with a univariate neural network and proposed to use the primitive of a univariate sum-of-squares of polynomials as the approximation of an increasing function.  and  proposed efficient implementations of these methods based on affine maps using binary masks that compute all the parameters of the transformation in a single pass of the network. Interestingly, all these methods compose several triangular maps in the hope that this composition of functions is “complex” enough to approximate any generic triangular map.
Here, we argue that there are two ways to learn a target density. First, as we discussed in §3 and §4, we can choose an appropriate base density such that the resulting triangular transformation from the base to the target can be represented using simpler triangular transformations that are Lipschitz continuous. Second, we can choose a base density from a simple class of distributions (say, Gaussian with identity covariance) and learn a “complex” triangular transformation via a composition of several triangular transformations. However, we note that composing several triangular maps is essentially tantamount to converting the source density to a more complex base density such that the final map in the composition transforms this base density to the target density. We propose an alternative that allows for simpler transformations to the target density by considering a more flexible class of source densities than the Gaussian distribution. One way would be to parametrize the source density as an elliptical distribution whose generating variate follows a Student-t distribution, with the degrees of freedom treated as a parameter to be learned along with the parameters of the transformation. It is evident from our exposition in §4 that such a model would only require a Lipschitz continuous triangular map when learning heavy-tailed distributions.
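As a rough illustration of a source density with a tunable tail, the sketch below writes the log-density of a univariate Student-t distribution as a function of its degrees of freedom. This is a simplification assumed here (the text proposes an elliptical source in general dimension), and the names `student_t_logpdf` and `normal_logpdf` are hypothetical. In a flow, the degrees of freedom would be an extra trainable parameter in the log-likelihood.

```python
import math


def student_t_logpdf(x: float, nu: float) -> float:
    """Log-density of a standard Student-t with nu > 0 degrees of freedom."""
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1.0) / 2.0 * math.log1p(x * x / nu))


def normal_logpdf(x: float) -> float:
    """Log-density of the standard normal, the usual light-tailed source."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


# Small nu puts far more mass in the tail than the Gaussian source,
# while a very large nu essentially recovers the Gaussian.
tail_gap = student_t_logpdf(10.0, 2.0) - normal_logpdf(10.0)
near_gaussian_gap = student_t_logpdf(2.0, 1e6) - normal_logpdf(2.0)
```

Letting the degrees of freedom vary interpolates between a heavy-tailed and an essentially Gaussian source, so the transformation itself does not have to create the tail.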
A related question is how well one can approximate a distribution with another distribution , where is heavy tailed, is light tailed, but is not flexible enough to push tails, so that is heavier than . One can consider several similarity metrics for this task. Let us start with the Wasserstein distance, the most natural for flow theory. We wish to find a lower bound on the approximation error for , where is the set of all measures on with marginals and on the first and the second factors. Here we have two situations. First, assume that does not have the -th moment. Then, because is lighter than , . And because does not have the -th moment, . The only possibility to have a finite distance in this case is exactly if , and the distance is zero. Alternatively, assume that has the -th moment. Then is finite. The measure is Radon (as a finite measure on a second-countable space). Because the set of finite-supported Radon measures is dense in the metric space of Radon measures with -distance , one can approximate arbitrarily well with a finite-supported Radon measure . Hence, varying one can find arbitrarily close to .
One can carry out a similar analysis with the -divergence. The existence of the integral depends on the tail behaviour of both distributions, among other properties (footnote 4: for example, topological properties; for the KL-divergence, must be contained in ). However, if the integral exists and is finite, one writes it as an integral over a compact set plus an integral over the tails, and makes the latter as small as desired by simply increasing . Hence, heaviness determines the possibility of approximation. When the target distribution has very heavy tails, the approximation reduces to a representation problem, and one needs a flexible enough transformation in order to make as heavy as .
We studied the properties of triangular flows for capturing heavy-tailed distributions. We showed that density quantile functions play a central role in characterising the properties of increasing push-forward maps. Subsequently, we proved that for a triangular flow all eigenvalues of the Jacobian are unbounded when pushing a light-tailed distribution to a heavy-tailed distribution. We revealed properties of quantile and density quantile functions and related them to both the existence of functional moments and the heavy-tailedness of a distribution, which may be of independent interest. As a by-product of our analysis, we demonstrated the trade-off between the complexity of the source distribution and the expressivity of the transformation in capturing target densities in generative models. This work opens up multiple future directions. An interesting line of research would be to conduct systematic experiments that test our results, for example by considering flexible source distributions with parameters that can be trained along with the model. Another direction would be to analyze general flows that are non-triangular. Further, applying these insights to real-world problems in finance, insurance and networks might also be interesting.
-  DF Andrews et al. A general method for the approximation of tail areas. The Annals of Statistics, 1(2):367–372, 1973.
-  Yoshua Bengio and Samy Bengio. Modeling High-Dimensional Discrete Data with Multi-Layer Neural Networks. In NeurIPS, 1999.
-  Vladimir Igorevich Bogachev, Aleksandr Viktorovich Kolesnikov, and Kirill Vladimirovich Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309–335, 2005.
-  Stamatis Cambanis, Steel Huang, and Gordon Simons. On the theory of elliptically contoured distributions. Journal of Multivariate Analysis, 11(3):368–385, 1981.
-  Stuart Coles, Joanna Bawa, Lesley Trenner, and Pat Dorazio. An introduction to statistical modeling of extreme values, volume 208. Springer, 2001.
-  Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. In ICLR workshop, 2015.
-  Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In ICLR, 2017.
-  Diane E Duffy, Allen A McIntosh, Mark Rosenstein, and Walter Willinger. Analyzing telecommunications traffic data from working common channel signaling subnetworks. Computing Science and Statistics, pages 156–156, 1993.
-  Ernst Eberlein, Ulrich Keller, et al. Hyperbolic distributions in finance. Bernoulli, 1(3):281–299, 1995.
-  Paul Embrechts, Claudia Klüppelberg, and Thomas Mikosch. Modelling extremal events: for insurance and finance, volume 33. Springer Science & Business Media, 2013.
-  Gabriel Frahm. Generalized elliptical distributions: theory and applications. PhD thesis, Universität zu Köln, 2004.
-  Gabriel Frahm, Markus Junker, and Alexander Szimayer. Elliptical copulas: applicability and limitations. Statistics & Probability Letters, 63(3):275–286, 2003.
-  Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In ICML, pages 881–889, 2015.
-  Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural Autoregressive Flows. In ICML, 2018.
-  Priyank Jaini, Kira A Selby, and Yaoliang Yu. Sum-of-Squares Polynomial Flow. International Conference of Machine Learning (ICML), 2019.
-  Markus Junker and Angelika May. Measurement of aggregate risk with copulas. The Econometrics Journal, 8(3):428–454, 2005.
-  Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NeurIPS, pages 4743–4751, 2016.
-  Herbert Knothe et al. Contributions to the theory of convex bodies. The Michigan Mathematical Journal, 4(1):39–52, 1957.
-  Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AIStats, pages 29–37, 2011.
-  W Leland. Statistical analysis of high time resolution ethernet lan traffic measurements. Proc. 25-th Interface, 1993.
-  Yannick Malevergne and Didier Sornette. Extreme financial risks: From dependence to risk management. Springer Science & Business Media, 2006.
-  R Mardare, P Panangaden, and G Plotkin. Free complete Wasserstein algebras. Logical Methods in Computer Science, 14(3), 2018.
-  Harry Markowitz. Portfolio selection. The journal of finance, 7(1):77–91, 1952.
-  Krishanu Maulik, Sidney Resnick, and Holger Rootzén. Asymptotic independence and a network traffic model. Journal of Applied Probability, 39(4):671–699, 2002.
-  Radford M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.
-  George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In NeurIPS, pages 2338–2347, 2017.
-  Emanuel Parzen. Nonparametric Statistical Data Modeling. Journal of the American statistical association, 74(365):105–121, 1979.
-  Sidney I Resnick. Heavy-tail phenomena: probabilistic and statistical modeling. Springer Science & Business Media, 2007.
-  Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015.
-  Murray Rosenblatt. Remarks on a multivariate transformation. The annals of mathematical statistics, 23(3):470–472, 1952.
-  William F Sharpe. A simplified model for portfolio analysis. Management science, 9(2):277–293, 1963.
-  Esteban G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
-  Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
-  James Tobin. Liquidity preference as behavior towards risk. The review of economic studies, 25(2):65–86, 1958.
-  Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
-  Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2008.
Appendix A Proofs
A similar argument proves the reverse direction. ∎
The first integral is finite because the integrand is non-singular. For the second integrand, we can use the asymptotic behaviour of the quantile function by choosing very close to . Subsequently, the integral exists and converges if and only if . ∎
converges for , because is -heavy. Because is a univariate diffeomorphism, it is a strictly monotone function. Without loss of generality, let us consider to be a positive increasing function and investigate the right asymptotics. Consider the function for large positive . Assume there is a sequence , such that and the sequence does not converge to zero. In other words, there exists , such that for any there exists , such that . Let us work with this infinite sub-sequence . Because is an increasing function, we can bound its integral from below by its left Riemann sum with respect to the sequence of points :
Since is heavy, the series on the right-hand side diverges as a left Riemann sum of a divergent integral. But this contradicts the convergence of the integral on the left-hand side. Hence, our assumption was wrong and for all sequences we have: . Hence, which leads to the desired result that . ∎
The density function of the conditional is proportional to , where and is the same function as for the distribution of (see ). Then, because it is a -dimensional elliptical distribution, it is -heavy iff for all . It is given that is -heavy, which is equivalent to . Because , one gets that , hence is -heavy. ∎
We need to show that
Thus, all we need to show is that the generating variate of the conditional distribution for the target is heavier than the generating variate of the conditional distribution of the source. From §3, we know that the tail exponent in the asymptotics of the density quantile function characterizes the degree of heaviness. Furthermore, we also know that the asymptotic behaviour of the density quantile function is directly related to the asymptotic behaviour of the density function: if is a density function, the cdf is given by , the quantile function is therefore , and the density quantile function is the reciprocal of the derivative of the quantile function, i.e. . Hence, we need to ensure that, asymptotically, the density of is heavier than the density of . Using the result on the cdf of a conditional distribution as given by Eq. (15) in  we have that asymptotically
where is the dimension of the partition that is being conditioned upon. Since, is heavier tailed than , we have that is heavier tailed than for all the conditional distributions. ∎
We will prove this using contradiction; assume that . Assume for simplicity that . Therefore, we have
Since, is heavy tailed, such that
Partition into sets , i.e. such that if , and , then there exists at least one index such that . Subsequently, we can rewrite the integral above as