A Universal Approximation Theorem of Deep Neural Networks for Expressing Distributions

04/19/2020 ∙ by Yulong Lu, et al. ∙ Duke University

This paper studies the universal approximation property of deep neural networks for representing probability distributions. Given a target distribution π and a source distribution p_z, both defined on R^d, we prove under some assumptions that there exists a deep neural network g: R^d → R with ReLU activation such that the push-forward measure (∇g)_# p_z of p_z under the map ∇g is arbitrarily close to the target measure π. The closeness is measured by three classes of integral probability metrics between probability distributions: the 1-Wasserstein distance, the maximum mean discrepancy (MMD) and the kernelized Stein discrepancy (KSD). We prove upper bounds for the size (width and depth) of the deep neural network in terms of the dimension d and the approximation error ε with respect to the three discrepancies. In particular, the size of the neural network can grow exponentially in d when the 1-Wasserstein distance is used as the discrepancy, whereas for both MMD and KSD the size of the neural network depends on d at most polynomially. Our proof relies on convergence estimates of empirical measures under the aforementioned discrepancies and on semi-discrete optimal transport.




1 Introduction

In recent years, deep learning has achieved unprecedented success in numerous machine learning problems [30, 51]. The success of deep learning is largely attributed to the use of deep neural networks (DNNs) for representing and learning the unknown structures in machine learning tasks, which are usually modeled by some unknown function mappings or unknown probability distributions. The effectiveness of neural networks in approximating functions has been justified rigorously over the last three decades. Specifically, a series of early works [12, 18, 25, 6] on universal approximation theorems show that a continuous function defined on a bounded domain can be approximated by a sufficiently large shallow (two-layer) neural network. In particular, the result by [6] quantifies the approximation error of shallow neural networks in terms of the decay property of the Fourier transform of the function of interest. Recently, the expressive power of DNNs for approximating functions has received increasing attention, starting from the works by [36] and [57]; see also [58, 45, 47, 42, 48, 14, 41] for more recent developments. The theoretical benefits of using deep neural networks over shallow neural networks have been demonstrated in a sequence of depth separation results; see e.g. [16, 52, 53, 13].

Compared to a vast number of theoretical results on neural networks for approximating functions, the use of neural networks for expressing distributions is far less understood on the theoretical side. The idea of using neural networks for modeling distributions underpins an important class of unsupervised learning techniques called generative models, where the goal is to approximate or learn complex probability distributions from training samples drawn from the distributions. Typical generative models include Variational Autoencoders [29], Normalizing Flows [46] and Generative Adversarial Networks (GANs) [19], just to name a few. In these generative models, the probability distribution of interest can be very complex or computationally intractable, and is usually modelled by transforming a simple distribution using some map parametrized by a (deep) neural network. In particular, a GAN consists of a game between a generator and a discriminator, both represented by deep neural networks: the generator attempts to generate fake samples whose distribution is indistinguishable from the real distribution, and it generates samples by mapping samples from a simple input distribution (e.g. Gaussian) via a deep neural network; the discriminator attempts to learn how to tell the fake apart from the real. Despite the great empirical success of GANs in various applications, their theoretical analysis is far from complete. Existing theoretical works on GANs are mainly focused on the trade-off between the generator and the discriminator (see e.g. [40, 2, 3, 37, 5]). The key message from these works is that the discriminator family needs to be chosen appropriately according to the generator family in order to obtain a good generalization error.

Our contributions. In this work, we focus on an even more fundamental question on GANs and other generative models which is not yet fully addressed: how well can DNNs express probability distributions? Specifically, we aim to answer the following questions:

(1) Given a fixed source distribution and a target distribution, can one construct a DNN such that the push-forward of the source distribution under the DNN gets close to the target?

(2) If the answer to (1) is yes, how complex is the DNN, i.e. what depth and width are needed to achieve a given approximation accuracy?

We answer these questions in this paper by making the following contributions:

  • Given a fairly general source distribution and a target distribution defined on $\mathbb{R}^d$ which satisfies certain integrability assumptions, we show that there is a ReLU DNN with $d$ inputs and one output such that the push-forward of the source distribution via the gradient of the output function defined by the DNN is arbitrarily close to the target. We measure the closeness between probability distributions by three integral probability metrics (IPMs): the 1-Wasserstein metric, the maximum mean discrepancy and the kernelized Stein discrepancy.

  • Given a desired approximation error $\varepsilon$, we prove complexity upper bounds for the depth and width of the DNN needed to attain the given approximation error with respect to the three IPMs mentioned above; our complexity upper bounds are given with explicit dependence on the dimension $d$ of the target distribution and the approximation error $\varepsilon$.

  • The DNN constructed in the paper is explicit: the output function of the DNN is the maximum of finitely many (multivariate) affine functions, with the affine parameters determined explicitly in terms of the source measure and target measure.

Related work. As far as the authors are aware, the only prior work considering the expressiveness of neural networks for probability distributions is [31]. There the authors considered a class of probability distributions that are given as push-forwards of a base distribution by a class of Barron functions, and showed that those distributions can be approximated in Wasserstein distances by push-forwards of neural networks, essentially relying on the ability of neural networks to approximate functions in the Barron class. It is however not clear which probability distributions are given by the push-forward of a base distribution by Barron functions. In this work, we aim to provide more explicit and direct criteria on the target distributions.

The rest of the paper is organized as follows. In Section 2 we introduce some notation to be used throughout the paper. We describe the problem and state the main result in Section 3. Section 4 and Section 5 are devoted to the two ingredients for proving the main result: convergence of empirical measures in IPMs, and building neural-network-based maps between the source measure and empirical measures via semi-discrete optimal transport, respectively. Proofs of lemmas and intermediate results are provided in the appendices.

2 Notations

Let us introduce several definitions and notations to be used throughout the paper. We start with the definition of a fully connected and feed-forward neural network.

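Definition 2.1.

A (fully connected and feed-forward) neural network of $L$ hidden layers takes an input vector $x \in \mathbb{R}^{N_0}$, outputs a vector $y \in \mathbb{R}^{N_{L+1}}$ and has $L$ hidden layers of sizes $N_1, N_2, \dots, N_L$. The neural network is parametrized by the weight matrices $W^\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and bias vectors $b^\ell \in \mathbb{R}^{N_\ell}$ with $\ell = 1, 2, \dots, L+1$. The output $y$ is defined from the input $x$ iteratively according to the following:

$$x^0 = x, \qquad x^\ell = \sigma\big(W^\ell x^{\ell-1} + b^\ell\big), \quad 1 \le \ell \le L, \qquad y = W^{L+1} x^L + b^{L+1}.$$

Here $\sigma$ is a (nonlinear) activation function which acts on a vector $x$ component-wise, i.e. $[\sigma(x)]_i = \sigma(x_i)$. When $N_1 = N_2 = \dots = N_L = N$, we say the network has width $N$ and depth $L$. The neural network is said to be a deep neural network (DNN) if $L \ge 2$. The function defined by the deep neural network is denoted by $\mathrm{DNN}(\{W^\ell, b^\ell\}_{\ell=1}^{L+1})$.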

Popular choices of activation functions $\sigma$ include the rectified linear unit (ReLU) function $\sigma(x) = \max(x, 0)$ and the sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$.

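Given a matrix $A$, let us denote its $n$-fold direct sum by $\oplus^n A = \mathrm{diag}(A, \dots, A)$, the block-diagonal matrix with $n$ copies of $A$ on the diagonal.

Given two probability measures $\mu$ and $\nu$ on $\mathbb{R}^d$, a transport map $T$ between $\mu$ and $\nu$ is a measurable map $T: \mathbb{R}^d \to \mathbb{R}^d$ such that $\nu = T_\# \mu$, where $T_\# \mu$ denotes the push-forward of $\mu$ under the map $T$, i.e., for any measurable $A \subset \mathbb{R}^d$, $\nu(A) = \mu(T^{-1}(A))$. We denote by $\Gamma(\mu, \nu)$ the set of transport plans between $\mu$ and $\nu$, which consists of all coupling measures $\gamma$ of $\mu$ and $\nu$, i.e., $\gamma(A \times \mathbb{R}^d) = \mu(A)$ and $\gamma(\mathbb{R}^d \times B) = \nu(B)$ for any measurable $A, B \subset \mathbb{R}^d$. We may use $C, C_1, C_2$ to denote generic constants which do not depend on any quantities of interest (e.g. the dimension $d$).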

3 Problem description and main result

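Let $\pi$ be the target probability distribution defined on $\mathbb{R}^d$ which one would like to learn or generate samples from. In the framework of GANs, one is interested in representing the distribution $\pi$ implicitly by a generative neural network. Specifically, let $\mathcal{G}$ be a subset of generators (transformations) $g: \mathbb{R}^d \to \mathbb{R}^d$, which are defined by neural networks. The concrete form of $\mathcal{G}$ is to be specified later. Let $p_z$ be a source distribution (e.g. standard normal). The push-forward of $p_z$ under the transformation $g \in \mathcal{G}$ is denoted by $g_\# p_z$. In a GAN problem, one aims to find $g \in \mathcal{G}$ such that $g_\# p_z \approx \pi$. In mathematical language, GANs can be formulated as the following minimization problem:

$$\inf_{g \in \mathcal{G}} D(g_\# p_z, \pi), \tag{3.1}$$

where $D(p, q)$ is some discrepancy measure between probability measures $p$ and $q$, which typically takes the form of an integral probability metric (IPM) or adversarial loss defined by

$$D(p, q) = d_{\mathcal{F}}(p, q) := \sup_{f \in \mathcal{F}} \Big| \mathbb{E}_{X \sim p} f(X) - \mathbb{E}_{Y \sim q} f(Y) \Big|,$$

where $\mathcal{F}$ is a certain class of test (or witness) functions. As a consequence, GANs can be formulated as the minimax problem

$$\inf_{g \in \mathcal{G}} \sup_{f \in \mathcal{F}} \Big| \mathbb{E}_{X \sim p_z} f(g(X)) - \mathbb{E}_{Y \sim \pi} f(Y) \Big|.$$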

The present paper aims to answer the following fundamental questions on GANs:

(1) Is there a neural-network-based generator $g$ such that $g_\# p_z \approx \pi$?

(2) How can one quantify the complexity (e.g. depth and width) of the neural network?

As we shall see below, the answers to the questions above depend on the IPM used to measure the discrepancy between distributions. In this paper, we are interested in three IPMs which are commonly used in GANs: the 1-Wasserstein distance [55, 1], the maximum mean discrepancy [21, 15, 35] and the kernelized Stein discrepancy [38, 11, 27].

Wasserstein Distance: When the witness class $\mathcal{F}$ is chosen as the class of 1-Lipschitz functions, i.e. $\mathcal{F} = \{f: \mathbb{R}^d \to \mathbb{R} : \mathrm{Lip}(f) \le 1\}$, the resulting IPM becomes the 1-Wasserstein distance (also known as the Kantorovich-Rubinstein distance):

$$W_1(p, q) = \inf_{\gamma} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y| \, \gamma(dx, dy),$$

where the infimum is taken over all couplings $\gamma$ of $p$ and $q$.

The Wasserstein-GAN proposed by [1] leverages the Wasserstein distance as the objective function to improve the stability of training of the original GAN based on the Jensen-Shannon divergence. Nevertheless, it has been shown that Wasserstein-GAN still suffers from the mode collapse issue [23] and does not generalize with any polynomial number of training samples [2].

Maximum Mean Discrepancy (MMD): When $\mathcal{F}$ is the unit ball of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ with kernel $k$, i.e. $\mathcal{F} = \{f \in \mathcal{H}_k : \|f\|_{\mathcal{H}_k} \le 1\}$, the resulting IPM coincides with the maximum mean discrepancy (MMD) [21]:

$$\mathrm{MMD}(p, q) = \sup_{\|f\|_{\mathcal{H}_k} \le 1} \Big| \mathbb{E}_{X \sim p} f(X) - \mathbb{E}_{Y \sim q} f(Y) \Big|.$$

GANs based on minimizing MMD as the loss function were first proposed in [15, 35]. Since MMD is a weaker metric than the 1-Wasserstein distance, MMD-GANs also suffer from the mode collapse issue, but empirical results (see e.g. [7]) suggest that they require smaller discriminative networks and hence enable faster training than Wasserstein-GANs.
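For two empirical measures, the squared MMD can be computed in closed form from kernel evaluations via the standard identity $\mathrm{MMD}^2(p, q) = \mathbb{E}\,k(X, X') + \mathbb{E}\,k(Y, Y') - 2\,\mathbb{E}\,k(X, Y)$. The following minimal sketch evaluates this identity for two finite samples; the Gaussian kernel, the bandwidth parameter and the function names are illustrative assumptions rather than choices made in the paper.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]


def gaussian_kernel(x: Point, y: Point, bandwidth: float = 1.0) -> float:
    """Gaussian kernel exp(-|x - y|^2 / (2 h^2)); the bandwidth h is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: Sequence[Point], ys: Sequence[Point],
                kernel: Callable[[Point, Point], float] = gaussian_kernel) -> float:
    """Squared MMD between the empirical measures of xs and ys (V-statistic form)."""

    def mean_kernel(us: Sequence[Point], vs: Sequence[Point]) -> float:
        # Average of k(u, v) over all pairs, including the diagonal.
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

For two Dirac masses at distance one, `mmd_squared([(0.0,)], [(1.0,)])` evaluates to $2 - 2e^{-1/2}$, and the value vanishes when both samples coincide.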

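Kernelized Stein Discrepancy (KSD): If the witness class is chosen to be

$$\mathcal{F} = \big\{ \mathcal{T}_\pi f : f \in \mathcal{H}_k^d, \ \|f\|_{\mathcal{H}_k^d} \le 1 \big\},$$

where $\mathcal{T}_\pi$ is the Stein operator defined by

$$\mathcal{T}_\pi f := \nabla \log \pi \cdot f + \nabla \cdot f, \tag{3.3}$$

the associated IPM becomes the kernelized Stein discrepancy (KSD) [38, 11]:

$$\mathrm{KSD}(p, \pi) = \sup_{f \in \mathcal{H}_k^d, \ \|f\|_{\mathcal{H}_k^d} \le 1} \mathbb{E}_{X \sim p} \big[ \mathcal{T}_\pi f(X) \big].$$

The KSD has gained great popularity in machine learning and statistics since the quantity is easy to compute and does not depend on the normalization constant of $\pi$, which makes it suitable for statistical computation, such as hypothesis testing [20] and statistical sampling [39, 10]. The recent paper [27] adopts the GAN formulation (3.1) with KSD as the training loss to construct a new sampling algorithm called Stein Neural Sampler.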

3.1 Main result

Throughout the paper, we consider the following assumptions on the reproducing kernel $k$:

Assumption K1.

The kernel $k$ is integrally strictly positive definite: for all finite non-zero signed Borel measures $\mu$ defined on $\mathbb{R}^d$,

$$\int_{\mathbb{R}^d} \int_{\mathbb{R}^d} k(x, y) \, d\mu(x) \, d\mu(y) > 0.$$

Assumption K2.

There exists a constant such that

Assumption K3.

The kernel function is twice differentiable and there exists a constant such that


According to [51, Theorem 7], Assumption K1 is necessary and sufficient for the kernel being characteristic, i.e., $\mathrm{MMD}(p, q) = 0$ implies $p = q$, which guarantees that MMD is a metric. In addition, thanks to [38, Proposition 3.3], KSD is a valid discrepancy measure under Assumption K1, namely $\mathrm{KSD}(p, \pi) \ge 0$ and $\mathrm{KSD}(p, \pi) = 0$ if and only if $p = \pi$.

Assumption K2 will be used to get an error bound for $\mathrm{MMD}(P_n, \pi)$, where $P_n$ denotes the empirical measure of $n$ i.i.d. samples from $\pi$; see Proposition 4.2. Assumption K3 will be crucial for bounding $\mathrm{KSD}(P_n, \pi)$; see Proposition 4.3. Many commonly used kernel functions fulfill all three assumptions K1-K3, including for example the Gaussian kernel and the inverse multiquadric (IMQ) kernel $k(x, y) = (c + |x - y|^2)^{\beta}$ with $c > 0$ and $\beta < 0$. Unfortunately, Matérn kernels (see e.g. [43]) only satisfy Assumptions K1-K2, but not Assumption K3, since the second order derivatives of $k$ are singular on the diagonal so that the second estimate of (3.5) is violated.

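In order to bound $\mathrm{KSD}(P_n, \pi)$, we need to assume further that the target measure $\pi$ satisfies the following regularity and integrability assumptions. We will use the shorthand notation $s_\pi(x) = \nabla \log \pi(x)$.

Assumption 1 ($L$-Lipschitz).

Assume that $s_\pi$ is globally Lipschitz in $\mathbb{R}^d$, i.e. there exists a constant $\tilde{L} > 0$ such that $|s_\pi(x) - s_\pi(y)| \le \tilde{L} |x - y|$ for all $x, y \in \mathbb{R}^d$. As a result, there exists $L > 0$ such that

$$|s_\pi(x)| \le L (1 + |x|) \quad \text{for all } x \in \mathbb{R}^d.$$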

Assumption 2 (sub-Gaussian).

The probability measure is sub-Gaussian, i.e. there exist and such that

Assume further that for some .

Our main result is the following universal approximation theorem for expressing probability distributions.

Theorem 3.1 (Main theorem).

Let $\pi$ and $p_z$ be the target and the source distributions respectively, both defined on $\mathbb{R}^d$. Assume that $p_z$ is absolutely continuous with respect to the Lebesgue measure. Then under certain assumptions on $\pi$ and the kernel $k$ to be specified below, it holds that for any given approximation error $\varepsilon > 0$, there exists a positive integer $n$, and a fully connected and feed-forward deep neural network $u$ of depth and width , with $d$ inputs and a single output and with ReLU activation such that

$$D\big( (\nabla u)_\# p_z, \pi \big) \le \varepsilon.$$

The complexity parameter $n$ depends on the choice of the metric $D$, specifically,

1. Consider . If satisfies that , it holds that

where the constant depends only on .

2. Consider with kernel . If satisfies Assumption K2, then

with a constant depending only on the constant in (3.4).

3. Consider with kernel . If satisfies Assumption K3 with constant and if satisfies Assumption 1 and Assumption 2 with parameters , then

where the constant depends only on , but not on .

Theorem 3.1 states that a given probability measure $\pi$ (with certain integrability assumptions) can be approximated arbitrarily well by pushing forward a source distribution with the gradient of a potential which can be parameterized by a finite DNN. Moreover, when the discrepancy between probability measures is measured by $W_1$, the width of the DNN needed to achieve an approximation error $\varepsilon$ can grow exponentially in the dimension $d$, indicating the curse of dimensionality when using $W_1$ in GANs. Interestingly, this result is consistent with the fact that Wasserstein-GANs do not generalize with only a polynomial number of training samples; see e.g. [2]. On the other hand, if the discrepancy is measured by MMD or KSD, the width depends on $d$ at most polynomially, which breaks the curse of dimensionality.

Proof strategy. Our proof of Theorem 3.1 relies on two ingredients: one first approximates the target measure $\pi$ by an empirical measure $P_n$, and then builds a neural-network-based map which pushes forward a given source distribution $p_z$ to the empirical distribution $P_n$. In more detail, the two essential ingredients are the following.

  1. Approximation of the target measure by empirical measures. It is well known that a probability measure $\pi$ (with mild integrability assumptions) can be approximated by the empirical measure $P_n$ of random samples drawn i.i.d. from $\pi$, with respect to various metrics, such as Wasserstein distances [17, 32, 56] and MMD [50]. As a by-product of this paper, we also obtain a high-probability approximation error bound for $\mathrm{KSD}(P_n, \pi)$ under the assumption that the target measure $\pi$ is sub-Gaussian. Theorem 4.1 summarizes the convergence of empirical measures under the three IPMs.

  2. Pushing forward the source distribution to the empirical distribution via a neural-network-based optimal transport map. Based on the theory of (semi-discrete) optimal transport, one can construct an optimal transport map of the gradient form $T = \nabla \bar{\varphi}$ which pushes forward the source distribution $p_z$ to the empirical distribution $P_n$. Moreover, the potential function $\bar{\varphi}$ has an explicit structure: it is the maximum of finitely many affine functions. It is this explicit structure that enables one to represent the function $\bar{\varphi}$ with a finite deep neural network. See Theorem 5.1 for the precise statement.

Theorem 3.1 then follows immediately by combining Theorem 4.1 and Theorem 5.1, as Theorem 4.1 guarantees the existence of an empirical measure $P_n$ approximating $\pi$ and Theorem 5.1 provides a push-forward from $p_z$ to the empirical measure. The error bounds in Theorem 4.1 translate directly to the complexity bounds in Theorem 3.1.

It is interesting to remark that our strategy of proving Theorem 3.1 shares the same spirit as the one used to prove universal approximation theorems of DNNs for functions [57, 36]. Indeed, both the universal approximation theorems in those works and ours are proved by approximating the target function or distribution with a suitable dense subset (or sieves) on the space of functions or distributions which can be parametrized by deep neural networks. Specifically, in [57, 36] where the goal is to approximate continuous functions on a compact set, the dense sieves are polynomials which can be further approximated by the output functions of DNNs, whereas in our case we use empirical measures as the sieves for approximating distributions, and we show that empirical measures are exactly expressible by transporting a source distribution with neural-network-based transport maps.

We also remark that the push-forward map between probability measures constructed in Theorem 3.1 is the gradient of a potential function given by a neural network, i.e., the neural network is used to parametrize the potential function, instead of the map itself, which is perhaps more commonly used in practice. The specific form of the map in our result arises since we build it from the optimal transport map (with quadratic cost), which always leads to a gradient-form transport map according to Brenier’s theorem (see Theorem D.1). As the potential function is continuous, while the transport map itself can be discontinuous, it is more natural to use neural networks to parametrize the potential function. The idea of using neural networks to parametrize potentials has also been used recently in [33, 24] to improve the training of Wasserstein-GANs. On the other hand, if one insists on using a neural network to parametrize the map, one can further approximate the gradient map by a neural network with multiple outputs; we will not delve further into this direction in the current work.

4 Convergence of empirical measures in various IPMs

In this section, we consider the approximation of a given target measure $\pi$ by empirical measures. More specifically, let $\{X_i\}_{i=1}^n$ be an i.i.d. sequence of random samples from the distribution $\pi$ and let $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ be the empirical measure associated to the samples $\{X_i\}_{i=1}^n$. Our goal is to derive quantitative error estimates of $D(P_n, \pi)$ with respect to the three IPMs described in the last section.

We first state an upper bound on $W_1(P_n, \pi)$ in the average sense in the next proposition.

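Proposition 4.1 (Convergence in 1-Wasserstein distance).

Consider the IPM $D = W_1$, i.e. with $\mathcal{F} = \{f : \mathrm{Lip}(f) \le 1\}$. Assume that $\pi$ has finite third moment, i.e. $M_3 := \mathbb{E}_{X \sim \pi} |X|^3 < \infty$. Then there exists a constant $C$ depending on $M_3$ such that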

The convergence rates of $W_1(P_n, \pi)$ as stated in Proposition 4.1 are well known in the statistics literature. The statement in Proposition 4.1 is a combination of results from [8] and [32]; see Appendix A for a short proof. We remark that the prefactor constant $C$ in the estimate above can be made explicit. In fact, one can easily obtain from the moment bound in Proposition C.1 that if $\pi$ is sub-Gaussian with parameters and , then the constant $C$ can be chosen as with some constant depending only on and . Moreover, one can also obtain a high probability bound for $W_1(P_n, \pi)$ if $\pi$ is sub-exponential (see e.g., [32, Corollary 5.2]). Here we content ourselves with the expectation result as it comes with weaker assumptions and also suffices for our purpose of showing the existence of an empirical measure with the desired approximation rate.
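In dimension one, the 1-Wasserstein distance between two empirical measures with the same number of atoms is attained by matching sorted samples, which gives a quick way to observe empirical convergence numerically. The sketch below is an illustration only: the function name is hypothetical, and it compares two samples rather than a sample against the true measure $\pi$ as in Proposition 4.1.

```python
from typing import Sequence


def w1_empirical_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """1-Wasserstein distance between two 1D empirical measures of equal size.

    With n atoms of mass 1/n on each side, the monotone (sorted) matching
    is optimal, so W_1 is the mean absolute gap between order statistics.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

For example, shifting every atom of a sample by one unit yields a distance of exactly one, while permuting the atoms leaves the distance at zero.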

Moving on to approximation in MMD, the following proposition gives a high-probability non-asymptotic error bound of $\mathrm{MMD}(P_n, \pi)$.

Proposition 4.2 (Convergence in MMD).

Consider the IPM with . Assume that the kernel satisfies Assumption K2 with constant . Then for every , with probability at least ,

Proposition 4.2 can be viewed as a special case of [50, Theorem 3.3] where the kernel class is a singleton. Since its proof is short, we provide the proof in Appendix B for completeness.

In the next proposition, we consider the convergence estimate of empirical measures $P_n$ to $\pi$ in KSD. To the best of our knowledge, this is the first estimate on empirical measures under the KSD in the literature. This result can be useful to obtain quantitative error bounds for the new GAN/sampler called Stein Neural Sampler [27]. The proof relies on a Bernstein type inequality for the distribution of von Mises’ statistics; the details are deferred to Appendix C.

Proposition 4.3 (Convergence in KSD).

Consider the IPM with where is the Stein operator defined in (3.3). Suppose that the kernel satisfies Assumption K3 with constant . Suppose also that satisfies Assumption 1 and Assumption 2. Then for any there exists a constant such that with probability at least ,


The constant can be computed explicitly as

where are some positive absolute constants which are independent of any quantities of interest and

Remark 4.1.

Proposition 4.3 provides a non-asymptotic high probability error bound for the convergence of the empirical measure $P_n$ to $\pi$ in KSD. Our result implies in particular that $\mathrm{KSD}(P_n, \pi) \to 0$ with the asymptotic rate $O(n^{-1/2})$. We also remark that the rate is optimal and is consistent with the asymptotic CLT result for the corresponding U-statistics of $\mathrm{KSD}^2(P_n, \pi)$ (see [38, Theorem 4.1 (2)]).

The theorem below is the main result of this section, which summarizes the propositions above.

Theorem 4.1.

Let $\pi$ be a probability measure on $\mathbb{R}^d$ and let $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ be the empirical measure associated to the i.i.d. samples $\{X_i\}_{i=1}^n$ drawn from $\pi$. Then we have the following:

  1. If satisfies , then there exists a realization of empirical measure such that

    where the constant depends only on .

  2. If satisfies Assumption K2 with constant , then there exists a realization of empirical measure such that

    where the constant depends only on .

  3. If satisfies Assumption 1 and 2 and satisfies Assumption K3 with constant , then there exists a realization of empirical measure such that

    where the constant depends only on .

5 Constructing neural-network-based maps from a source distribution to empirical measures via semi-discrete optimal transport

In this section, we aim to build a neural-network-based map which push-forwards a given source distribution to discrete probability measures, including in particular the empirical measures. The main result of this section is the following theorem.

Theorem 5.1.

Let $\mu$ be a probability measure on $\mathbb{R}^d$ with finite second moment and with Lebesgue density $\rho$. Let $\nu = \sum_{j=1}^n \nu_j \delta_{y_j}$ for some $\{y_j\}_{j=1}^n \subset \mathbb{R}^d$ and $\nu_j \ge 0$ with $\sum_{j=1}^n \nu_j = 1$. Then there exists a transport map of the form $T = \nabla u$ such that $T_\# \mu = \nu$, where $u$ is a fully connected deep neural network of depth and width , and with ReLU activation function and parameters such that .

As shown below, the transport map in Theorem 5.1 is chosen as the optimal transport map from the continuous distribution $\mu$ to the discrete distribution $\nu$, which turns out to be the gradient of a piece-wise linear function, which in turn can be expressed by neural networks. We remark that the weights and biases of the constructed neural network can also be characterized explicitly in terms of $\mu$ and $\nu$ (see the proof of Proposition 5.1). Since semi-discrete optimal transport plays an essential role in the proof of Theorem 5.1, we first recall the set-up and some key results on optimal transport in both general and semi-discrete settings.

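Optimal transport with quadratic cost. Let $\mu$ and $\nu$ be two probability measures on $\mathbb{R}^d$ with finite second moments. Let $c(x, y) = \frac{1}{2} |x - y|^2$ be the quadratic cost. Then Monge’s [44] optimal transportation problem is to transport the probability mass between $\mu$ and $\nu$ while minimizing the quadratic cost, i.e.

$$\inf_{T :\, T_\# \mu = \nu} \int_{\mathbb{R}^d} c(x, T(x)) \, \mu(dx).$$

A map attaining the infimum above is called an optimal transport map. In general an optimal transport map may not exist, since Monge’s formulation prevents splitting the mass so that the set of transport maps may be empty. On the other hand, Kantorovich [28] relaxed the problem by minimizing the transportation cost over transport plans instead of transport maps:

$$\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} c(x, y) \, \gamma(dx, dy). \tag{5.2}$$

A coupling achieving the infimum above is called an optimal coupling. Noting that problem (5.2) above is a linear program, Kantorovich proposed a dual formulation for (5.2):

$$\sup_{(\varphi, \psi) \in \Phi_c} \int_{\mathbb{R}^d} \varphi \, d\mu + \int_{\mathbb{R}^d} \psi \, d\nu,$$

where $\Phi_c$ is the set of measurable functions $(\varphi, \psi) \in L^1(\mu) \times L^1(\nu)$ satisfying $\varphi(x) + \psi(y) \le c(x, y)$. We also define the $c$-transformation of a function $\varphi$ by

$$\varphi^c(y) = \inf_{x \in \mathbb{R}^d} \big[ c(x, y) - \varphi(x) \big].$$

Similarly, one can define $\psi^c$ associated to $\psi$. Kantorovich’s duality theorem (see e.g. [55, Theorem 5.10]) states that

$$\min_{\gamma \in \Gamma(\mu, \nu)} \int c \, d\gamma = \max_{\varphi \in L^1(\mu)} \Big[ \int \varphi \, d\mu + \int \varphi^c \, d\nu \Big] = \max_{\psi \in L^1(\nu)} \Big[ \int \psi^c \, d\mu + \int \psi \, d\nu \Big].$$

Moreover, if the source measure $\mu$ is absolutely continuous with respect to the Lebesgue measure, then the optimal transport map defined in Monge’s problem is given by a gradient field, which is usually referred to as Brenier’s map and can be characterized explicitly in terms of the solution of the dual Kantorovich problem. A precise statement is included in Theorem D.1 in Appendix D.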

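Semi-discrete optimal transport. Let us now consider the optimal transport problem in the semi-discrete setting: the source measure $\mu$ is continuous and the target measure $\nu$ is discrete. Specifically, assume that $\mu$ is absolutely continuous with respect to the Lebesgue measure, i.e. $\mu(dx) = \rho(x) \, dx$ for some probability density $\rho$, and $\nu$ is discrete, i.e. $\nu = \sum_{j=1}^n \nu_j \delta_{y_j}$ for some $\{y_j\}_{j=1}^n \subset \mathbb{R}^d$ and $\nu_j \ge 0$ with $\sum_{j=1}^n \nu_j = 1$. In the semi-discrete setting, Monge’s problem becomes

$$\inf_{T} \int_{\mathbb{R}^d} c(x, T(x)) \, \rho(x) \, dx \quad \text{subject to} \quad \int_{T^{-1}(y_j)} \rho(x) \, dx = \nu_j, \quad j = 1, \dots, n. \tag{5.4}$$

In this case the action of the transport map is clear: it assigns each point $x$ to one of the points $y_j$. Moreover, by taking advantage of the discreteness of the measure $\nu$, one sees that the dual Kantorovich problem in the semi-discrete case becomes maximizing the following functional

$$F(\psi) = \int_{\mathbb{R}^d} \inf_{j} \big[ c(x, y_j) - \psi_j \big] \rho(x) \, dx + \sum_{j=1}^n \psi_j \nu_j. \tag{5.5}$$

Similar to the continuum setting, the optimal transport map of Monge’s problem (5.4) can be characterized by the maximizer of $F$. To see this, let us introduce the important concept of a power diagram (or Laguerre diagram). Given a finite set of points $\{y_j\}_{j=1}^n \subset \mathbb{R}^d$ and the scalars $\psi = \{\psi_j\}_{j=1}^n$, the power diagrams associated to the scalars $\psi$ and the points $\{y_j\}_{j=1}^n$ are the sets

$$P_j := \big\{ x \in \mathbb{R}^d : c(x, y_j) - \psi_j \le c(x, y_k) - \psi_k \ \text{ for all } k \ne j \big\}. \tag{5.6}$$

Notice that the set $P_j$ contains all points $x$ for which the point $y_j$ minimizes $c(x, y_k) - \psi_k$ over $k$. Power diagrams were first introduced in [4] as a generalization of Voronoi diagrams, which correspond to power diagrams with $\psi_j = 0$ in (5.6); see [49] for a review of power diagrams and their applications in computational geometry.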
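Membership in a power cell amounts to an argmin over the weighted costs, which the following sketch makes concrete. It assumes the convention that the cell of $y_j$ collects the points $x$ minimizing $\frac{1}{2}|x - y_j|^2 - \psi_j$; the function name and the tie-breaking rule (lowest index wins) are hypothetical choices for illustration.

```python
from typing import Sequence

Point = Sequence[float]


def power_cell_index(x: Point, points: Sequence[Point], psi: Sequence[float]) -> int:
    """Index j of the power cell containing x.

    Assumed convention: x belongs to the cell of y_j when y_j minimizes
    0.5 * |x - y_j|^2 - psi_j. Ties are broken by the lowest index.
    """

    def weighted_cost(j: int) -> float:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, points[j]))
        return 0.5 * sq_dist - psi[j]

    return min(range(len(points)), key=weighted_cost)
```

With all weights equal to zero the function returns the nearest point, i.e. the Voronoi cell; increasing one weight enlarges the corresponding cell at the expense of its neighbours.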

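By grouping the points according to the power diagrams $P_j$, we have from (5.5) that

$$F(\psi) = \sum_{j=1}^n \Big[ \int_{P_j} \big( c(x, y_j) - \psi_j \big) \rho(x) \, dx + \psi_j \nu_j \Big]. \tag{5.7}$$

The following theorem characterizes the optimal transport map of Monge’s problem (5.4) in terms of the power diagrams $P_j$ associated to the points $\{y_j\}_{j=1}^n$ and the maximizer $\psi$ of $F$.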

Theorem 5.2.

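Let $\mu$ be a probability measure on $\mathbb{R}^d$ with finite second moment and with Lebesgue density $\rho$. Let $\nu = \sum_{j=1}^n \nu_j \delta_{y_j}$. Let $\psi = (\psi_1, \dots, \psi_n)$ be a maximizer of $F$ defined in (5.7). Denote by $\{P_j\}_{j=1}^n$ the power diagrams associated to $\{y_j\}_{j=1}^n$ and $\psi$. Then the optimal transport map $T$ solving the semi-discrete Monge’s problem (5.4) is given by

$$T(x) = \nabla \bar{\varphi}(x),$$

where $\bar{\varphi}(x) = \max_{j} \{ x \cdot y_j + m_j \}$ for some $m_j \in \mathbb{R}$. Specifically, $T(x) = y_j$ if $x \in P_j$.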

Theorem 5.2 shows that the optimal transport map in the semi-discrete case is achieved by the gradient of a particular piece-wise affine function which is the maximum of finitely many affine functions. A similar result was proved by [22] for the case where the source measure is defined on a compact convex domain. We provide a proof of Theorem 5.2, which deals with measures on the whole space $\mathbb{R}^d$, in Appendix D.2.

The next proposition shows that the piece-wise linear function $\bar{\varphi}$ defined in Theorem 5.2 can be expressed exactly by a deep neural network.

Proposition 5.1.

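Let $\bar{\varphi}(x) = \max_{j = 1, \dots, n} \{ x \cdot y_j + m_j \}$ with $y_j \in \mathbb{R}^d$ and $m_j \in \mathbb{R}$. Then there exists a fully connected deep neural network of depth and width , and with ReLU activation function and parameters such that .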

The proof of Proposition 5.1 can be found in Appendix D.3. Theorem 5.1 is a direct consequence of Theorem 5.2 and Proposition 5.1.
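One way to see why a maximum of finitely many affine functions is expressible with ReLU units is the identity $\max(a, b) = b + \mathrm{ReLU}(a - b)$, applied pairwise in rounds. The sketch below illustrates this idea only; the pairing scheme, the round counter and all function names are assumptions and are not claimed to match the construction in Appendix D.3.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def relu(t: float) -> float:
    return max(t, 0.0)


def relu_max(a: float, b: float) -> float:
    """max(a, b) written with one ReLU; in a pure ReLU layer the linear
    term b can itself be carried as relu(b) - relu(-b)."""
    return b + relu(a - b)


def max_affine_via_relu(x: Point, slopes: Sequence[Point],
                        offsets: Sequence[float]) -> Tuple[float, int]:
    """Evaluate max_j (x . y_j + m_j) by pairwise ReLU maxima.

    Returns the value and the number of pairing rounds used.
    """
    values: List[float] = [sum(xi * yi for xi, yi in zip(x, y)) + m
                           for y, m in zip(slopes, offsets)]
    rounds = 0
    while len(values) > 1:
        paired = [relu_max(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])  # unpaired value passes through
        values = paired
        rounds += 1
    return values[0], rounds


def gradient_of_max_affine(x: Point, slopes: Sequence[Point],
                           offsets: Sequence[float]) -> Point:
    """Gradient of the max-affine function at x: the slope y_j of an active piece."""
    j = max(range(len(slopes)),
            key=lambda i: sum(xi * yi for xi, yi in zip(x, slopes[i])) + offsets[i])
    return slopes[j]
```

Away from the boundaries between pieces, `gradient_of_max_affine` returns the single point $y_j$ whose affine piece is active, which mirrors the statement that the gradient map sends each point of a power cell to the corresponding atom.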

6 Conclusion

In this paper, we establish that certain general classes of target distributions can be expressed arbitrarily well with respect to three types of IPMs by transporting a source distribution with maps which can be parametrized by DNNs. We provide upper bounds for the depths and widths of DNNs needed to achieve a given approximation error; the upper bounds are established with explicit dependence on the dimension of the underlying distributions and the approximation error.

Appendix A Proof of Proposition 4.1


The proof follows from previous results in [8] and [32]. In fact, in the one-dimensional case, according to [8, Theorem 3.2], we know that if $\pi$ satisfies that



where $F$ is the cumulative distribution function of $\pi$, then for every $n \ge 1$,


The condition (A.1) is fulfilled if has finite third moment since

In the case that , it follows from [32, Theorem 3.1] that if , then there exists a constant independent of such that


Appendix B Proof of Proposition 4.2


Thanks to [50, Proposition 3.1], one has that

Let us define Then by definition satisfies that for any ,

where we have used that by assumption. It follows from the above and McDiarmid’s inequality that for every , with probability ,

In addition, we have by the standard symmetrization argument that

where are i.i.d. Rademacher variables and represents the conditional expectation w.r.t. given . To bound the right hand side above, we can apply McDiarmid’s inequality again to obtain that with probability at least ,

where we have used Jensen’s inequality for expectation in the second inequality and the independence of and the definition of in the last inequality. Combining the estimates above yields that with probability at least ,

Appendix C Proof of Proposition 4.3

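Thanks to [38, Theorem 3.6], $\mathrm{KSD}^2(P_n, \pi)$ is evaluated explicitly as

$$\mathrm{KSD}^2(P_n, \pi) = \frac{1}{n^2} \sum_{i, j = 1}^n u_\pi(X_i, X_j),$$

where $u_\pi$ is a new kernel defined by

$$u_\pi(x, y) = s_\pi(x)^T k(x, y) s_\pi(y) + s_\pi(x)^T \nabla_y k(x, y) + \nabla_x k(x, y)^T s_\pi(y) + \mathrm{Tr}\big( \nabla_x \nabla_y k(x, y) \big)$$

with $s_\pi(x) = \nabla \log \pi(x)$. Moreover, according to [38, Proposition 3.3], if $k$ satisfies Assumption K1, then $\mathrm{KSD}^2(P_n, \pi)$ is non-negative.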
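The double-sum formula can be evaluated directly once the score function and the kernel derivatives are available. The sketch below does this in one dimension, assuming a standard normal target (score $s_\pi(x) = -x$) and a Gaussian kernel with bandwidth $h$; these choices and all function names are illustrative assumptions, not the setting of Proposition 4.3.

```python
import math
from typing import Callable, Sequence


def standard_normal_score(x: float) -> float:
    """Score d/dx log pi(x) of the standard normal density (assumed target)."""
    return -x


def stein_kernel_1d(x: float, y: float, score: Callable[[float], float],
                    h: float = 1.0) -> float:
    """u_pi(x, y) in 1D for the Gaussian kernel k = exp(-(x - y)^2 / (2 h^2))."""
    diff = x - y
    k = math.exp(-diff * diff / (2.0 * h * h))
    dk_dx = -diff / (h * h) * k                       # derivative in x
    dk_dy = diff / (h * h) * k                        # derivative in y
    d2k_dxdy = (1.0 / (h * h) - diff * diff / h ** 4) * k  # mixed derivative
    return (score(x) * score(y) * k + score(x) * dk_dy
            + score(y) * dk_dx + d2k_dxdy)


def ksd_squared(samples: Sequence[float], score: Callable[[float], float],
                h: float = 1.0) -> float:
    """V-statistic (1/n^2) * sum_{i,j} u_pi(X_i, X_j), diagonal terms included."""
    n = len(samples)
    return sum(stein_kernel_1d(x, y, score, h)
               for x in samples for y in samples) / (n * n)
```

Because only the score of the target enters, the normalization constant of $\pi$ never has to be computed; a sample that is shifted away from the target produces a visibly larger value than a sample drawn from it.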

Our proof of Proposition 4.3 relies on the fact that $\mathrm{KSD}^2(P_n, \pi)$ can be viewed as a von Mises’ statistic ($V$-statistic) and on an important Bernstein type inequality due to [9] for the distribution of $V$-statistics, which gives a concentration bound of $\mathrm{KSD}^2(P_n, \pi)$ around its mean (which is zero). We recall this inequality in the theorem below, which is a restatement of [9, Theorem 1] for second order degenerate $V$-statistics.

C.1 Bernstein type inequality for von Mises’ statistics


be a sequence of i.i.d. random variables on

. For a kernel , we call


a von Mises’ statistic of order with kernel . We say that the kernel is degenerate if the following holds: