1 Introduction
Generative adversarial networks (GANs) (Goodfellow et al., 2014)
are an important approach to implicitly learning and sampling from high-dimensional complex distributions. GANs have been shown to achieve impressive performance in many machine learning tasks
(Radford et al., 2016; Reed et al., 2016; Zhu et al., 2017; Karras et al., 2018, 2019; Brock et al., 2019). Several recent studies have generalized GANs to bidirectional generative learning, which learns an encoder mapping the data distribution to the reference distribution simultaneously with a generator mapping in the reverse direction. These studies include the adversarial autoencoder (AAE)
(Makhzani et al., 2015), bidirectional GAN (BiGAN) (Donahue et al., 2016), adversarially learned inference (ALI) (Dumoulin et al., 2016), and bidirectional generative modeling using adversarial gradient estimation (AGES) (Shen et al., 2020). A common feature of these methods is that they generalize the basic adversarial training framework of the original GAN from unidirectional to bidirectional. Dumoulin et al. (2016) showed that BiGANs make use of the joint distribution of data and latent representations, which can better capture the information in the data than vanilla GANs. Compared with unidirectional GANs, the joint distribution matching in the training of bidirectional GANs alleviates mode dropping and encourages cycle consistency (Shen et al., 2020).

Several elegant and stimulating papers have analyzed the theoretical properties of unidirectional GANs. Arora et al. (2017) considered the generalization error of GANs under the neural net distance. Zhang et al. (2018) improved the generalization error bound of Arora et al. (2017). Liang (2020) studied the minimax optimal rates for learning distributions from empirical samples under a Sobolev evaluation class and density class, where the rate is determined by the regularity parameters of the Sobolev density and evaluation classes. Bai et al. (2019) analyzed the estimation error of GANs under the Wasserstein distance for a special class of distributions implemented by a generator, with the discriminator designed to guarantee zero bias. Chen et al. (2020) studied the convergence properties of GANs when both the evaluation class and the target density class are Hölder classes, and derived a bound depending on the dimension of the data distribution and the regularity parameters of the Hölder density and evaluation classes. While impressive progress has been made on the theoretical understanding of GANs, there are still some drawbacks in the existing results. For example,

The reference distribution and the target data distribution are assumed to have the same dimension, which is not the actual setting for GAN training.

The reference and the target data distributions are assumed to be supported on bounded sets.

The prefactors in the convergence rates may depend on the dimension of the data distribution exponentially.
In practice, GANs are usually trained using a reference distribution with a lower dimension than that of the target data distribution. Indeed, an important strength of GANs is that they can model low-dimensional latent structures by using a low-dimensional reference distribution. The bounded support assumption excludes some commonly used Gaussian distributions as the reference. Therefore, strictly speaking, the existing convergence analysis results do not apply to what is done in practice. In addition, there has been no theoretical analysis of bidirectional GANs in the literature.
1.1 Contributions
We derive nearly sharp non-asymptotic bounds for the GAN estimation error under the Dudley distance between the reference joint distribution and the data joint distribution. To the best of our knowledge, this is the first result providing theoretical guarantees for the bidirectional GAN estimation error rate. We do not assume that the reference and the target data distributions have the same dimension or that these distributions have bounded support. Also, our results are applicable to the Wasserstein distance if the target data distribution is assumed to have bounded support.
The main novel aspects of our work are as follows.

We allow the dimension of the reference distribution to differ from the dimension of the target distribution; in particular, it can be much lower than that of the target distribution.

We allow unbounded support for the reference distribution and the target distribution under mild conditions on the tail probabilities of the target distribution.

We explicitly establish that the prefactors in the error bounds depend on the square root of the dimension of the target distribution. This is a significant improvement over the exponential dependence on the dimension in the existing works.
Moreover, we develop a novel decomposition of the integral probability metric for the error analysis of bidirectional GANs. We also show that the pushforward distribution of an empirical distribution based on neural networks can perfectly approximate another arbitrary empirical distribution as long as the number of discrete points is the same.
Notation We use
to denote the ReLU activation function in neural networks, which is
. We use to denote the identity map. Without further indication, represents the norm. For any function , let . We use the notation and to express the order of a function in slightly different ways, where omits a universal constant independent of while omits a constant depending on . We use to denote the ball in with center and radius . Let be the pushforward distribution of by a function in the sense that for any measurable set . We use to denote expectation with respect to the empirical distribution.

2 Bidirectional generative learning
We describe the setup of the bidirectional GAN estimation problem and present the assumptions we need in our analysis.
2.1 Bidirectional GAN estimators
Let be the target data distribution supported on for . Let be a reference distribution that is easy to sample from. We first consider the case when is supported on , and then extend it to , where can be different from . Usually, in practical machine learning tasks such as image generation. The goal is to learn functions and such that , where and are the pushforward distribution of under and the pushforward distribution of under , respectively. We call the joint latent distribution or joint reference distribution, and the joint data distribution or joint target distribution. At the population level, the bidirectional GAN solves the minimax problem:
where are referred to as the generator class, the encoder class, and the discriminator class, respectively. Suppose we have two independent random samples and . At the sample level, the bidirectional GAN solves the empirical version of the above minimax problem:
(2.1) 
where and are two classes of neural networks approximating the generator class and the encoder class respectively, and is a class of neural networks approximating the discriminator class .
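To make the empirical minimax objective in (2.1) concrete, the following is a minimal NumPy sketch that evaluates the inner objective for fixed maps: the discriminator sees generated pairs (G(z), z) and encoded pairs (x, E(x)), and the objective is the difference of the two empirical means. The maps `G`, `E`, and `f` below are hypothetical stand-ins (simple closed-form functions, not trained networks), used only to illustrate the quantity being optimized.

```python
import numpy as np

def empirical_bigan_objective(f, G, E, x, z):
    """Evaluate the empirical bidirectional GAN objective for fixed
    discriminator f, generator G, and encoder E.

    f maps a (data, latent) pair to a scalar; G maps latent vectors to
    data space; E maps data to latent space.  x and z are i.i.d.
    samples from the target and reference distributions, respectively.
    """
    # E_hat[ f(G(z), z) ] - E_hat[ f(x, E(x)) ]: the discriminator
    # maximizes this quantity while (G, E) minimize it.
    term_gen = np.mean([f(G(zi), zi) for zi in z])
    term_enc = np.mean([f(xi, E(xi)) for xi in x])
    return term_gen - term_enc

# Toy illustration with hypothetical maps (d = 2, latent dimension 1):
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))                    # "data" samples in R^2
z = rng.normal(size=(100, 1))                    # latent samples in R^1
G = lambda zi: np.concatenate([zi, zi])          # R^1 -> R^2
E = lambda xi: xi[:1]                            # R^2 -> R^1
f = lambda xi, zi: np.tanh(xi.sum() + zi.sum())  # a bounded test function
val = empirical_bigan_objective(f, G, E, x, z)
```

Since `f` is bounded by 1 in absolute value, the objective always lies in [-2, 2]; at a perfect joint match of the two pushforward distributions, no bounded discriminator can make it large.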
2.2 Assumptions
We assume the target and the reference satisfy the following assumptions.
Assumption 1 (Subexponential tail).
For a large , the target distribution on and the reference distribution on
satisfy the first moment tail condition for some
.

Assumption 2 (Absolute continuity).
Both the target distribution on and the reference distribution on are absolutely continuous with respect to the Lebesgue measure
Assumption 1 is a technical condition for dealing with the case when and are supported on and instead of on compact subsets. For distributions with bounded support, this assumption is automatically satisfied. Here the factor ensures that the tails of and are subexponential, and it is easily satisfied if the distributions are subgaussian. For the reference distribution, Assumptions 1 and 2 are easily satisfied by specifying as a common distribution with an easy-to-sample density, such as a Gaussian or uniform distribution, as is usually done in applications of GANs. For the target distribution, Assumptions 1 and 2 specify the type of distributions that are learnable by the bidirectional GAN with our theoretical guarantees. Note that Assumption 1 is also necessary in our proof for bounding the generator and encoder approximation error, in the sense that the results will not hold if we replace the factor with 1. Assumption 2 is also necessary for Theorem 4.3 on mapping between empirical samples, which is essential in bounding the generator and encoder approximation error.
2.3 Generator, encoder and discriminator classes
Let be the discriminator class consisting of the feedforward ReLU neural networks with width and depth . Similarly, let be the generator class consisting of the feedforward ReLU neural networks with width and depth , and the encoder class consisting of the feedforward ReLU neural networks with width and depth .
The functions have the following form:
where are the weight matrices with number of rows and columns no larger than the width ,
are the bias vectors with compatible dimensions, and
is the ReLU activation function . Similarly, the functions and have the following form: where and are the weight matrices with numbers of rows and columns no larger than and , respectively, and and
are the bias vectors with compatible dimensions.
We impose the following conditions on , , and .
Condition 1.
For any and , we have
Condition 1 on can be easily satisfied by adding an additional clipping layer after the original output layer, with ,
(2.2) 
We truncate the output of to an increasing interval so that it includes the whole support of the evaluation function class. Condition 1 on can be satisfied in the same manner. This condition is technically necessary in our proof (see the appendix).
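The clipping layer can be realized within the ReLU network class itself. As a hedged illustration (the exact construction in (2.2) may differ), the map y ↦ σ(y + M) − σ(y − M) − M is a two-unit ReLU layer that truncates each output coordinate to [−M, M]:

```python
import numpy as np

def relu(v):
    """The ReLU activation sigma(v) = max(v, 0)."""
    return np.maximum(v, 0.0)

def clip_layer(y, M):
    """Truncate y to [-M, M] using only ReLU units and affine shifts.

    sigma(y + M) - sigma(y - M) - M equals y on [-M, M],
    equals M for y > M, and equals -M for y < -M.
    """
    return relu(y + M) - relu(y - M) - M
```

Appending this layer after the original output layer therefore enforces the boundedness in Condition 1 without leaving the class of feedforward ReLU networks.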
3 Nonasymptotic error bounds
We characterize the bidirectional GAN solutions based on minimizing the integral probability metric (IPM, Müller (1997)) between two distributions and with respect to a symmetric evaluation function class , defined by
(3.1) 
By specifying the evaluation function class differently, we can obtain many commonly used metrics (Liu et al., 2017). Here we focus on the following two.
We consider the estimation error under the Dudley metric . Note that when and have bounded support, the Dudley metric is equivalent to the 1-Wasserstein metric . Therefore, under the bounded support condition for and , all our convergence results also hold under the Wasserstein distance . Even if the supports of and are unbounded, we can still apply the result of Lu and Lu (2020) to avoid empirical process theory and obtain a stochastic error bound under the Wasserstein distance . However, the result of Lu and Lu (2020) requires subgaussianity to obtain the prefactor. To make our results more general, we use empirical process theory to obtain an explicit prefactor. Also, the discriminator approximation error would be unbounded if we considered the Wasserstein distance . Hence, we can only consider the Dudley metric for the unbounded support case.
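The relationship between the two metrics can be seen concretely in one dimension, where the 1-Wasserstein distance between two empirical measures with equally many atoms has a closed form: the mean absolute difference of the sorted samples. This is an illustrative sketch, not part of the paper's proofs; the bounded-Lipschitz (Dudley) distance is always dominated by both the Wasserstein distance and twice the uniform bound on the evaluation class.

```python
import numpy as np

def w1_empirical_1d(x, y):
    """1-Wasserstein distance between two 1-D empirical measures with
    the same number of atoms: mean absolute difference of order statistics."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(1)
x = rng.normal(size=500)            # samples from N(0, 1)
y = rng.normal(loc=0.3, size=500)   # samples from N(0.3, 1)
w1 = w1_empirical_1d(x, y)
# The Dudley distance with uniform bound B satisfies
# d_BL <= min(W1, 2B), so on bounded supports the two metrics
# are equivalent up to constants.
```

With the means differing by 0.3, the computed `w1` is close to 0.3, up to sampling noise of order n^{-1/2}.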
The bidirectional GAN solution in (2.1) also minimizes the distance between and under
However, even if two distributions are close with respect to , there is no automatic guarantee that they are still close under other metrics, such as the Dudley or the Wasserstein distance (Arora et al., 2017). Therefore, it is natural to ask the question:

How close are the two bidirectional GAN estimators and under some other stronger metrics?
We consider the IPM with the uniformly bounded 1-Lipschitz function class on as the evaluation class, which is defined, for some finite , as
(3.2) 
In Theorem 3.1, we consider the bounded support case where ; in Theorem 3.2, we extend the result to the unbounded support case; in Theorem 3.3, we extend the result to the case where the dimension of the reference distribution is arbitrary.
We first present a result when is supported on a compact subset and is supported on for a finite .
Theorem 3.1.
Suppose that the target is supported on and the reference is supported on for a finite , and Assumption 2 holds. Let the outputs of and be within and for and , respectively. By specifying the three network structures as , , and for some constants and properly choosing parameters, we have
where is a constant independent of and .
The prefactor in the error bound depends on linearly. This differs from the existing works, where the dependence of the prefactor on the dimension is either not clearly described or is exponential. In high-dimensional settings with large , this makes a substantial difference in the quality of the error bounds. These remarks apply to all the results stated below.
The next theorem deals with the case of unbounded support.
Theorem 3.2.
Note that two methods are used in bounding the stochastic errors (see the appendix), which leads to two different bounds: one with an explicit prefactor, at the cost of an additional factor; the other with an implicit prefactor but a better factor. Hence, there is a trade-off between the explicitness of the prefactor and the order of .
Our next result generalizes the previous theorems to the case when the reference distribution is supported on for .
Assumption 3.
The target distribution on is absolutely continuous with respect to the Lebesgue measure on and the reference distribution on is absolutely continuous with respect to the Lebesgue measure on , and .
With the above assumption, we have the following theorem providing theoretical guarantees for the validity of any dimensional reference .
Theorem 3.3.
4 Approximation and stochastic errors
In this section we present a novel inequality for decomposing the total error into approximation and stochastic errors and establish bounds on these errors.
4.1 Decomposition of the estimation error
Define the approximation error of a function class to another function class by
We decompose the Dudley distance between the latent joint distribution and the data joint distribution into four different error terms,

the approximation error of the discriminator class to :

the approximation error of the generator and encoder classes:

the stochastic error for the latent joint distribution :

the stochastic error for the latent joint distribution :
Lemma 4.1.
The novel decomposition (4.1) is fundamental to our error analysis. Based on (4.1), we bound each error term on the right side of (4.1) and balance the bounds to obtain an overall bound for the bidirectional GAN estimation.
To prove Lemma 4.1, we introduce the following useful inequality, which states that, for any two probability distributions, the difference between the IPMs with two distinct evaluation classes does not exceed two times the approximation error between the two evaluation classes: for any probability distributions
and and symmetric function classes and ,(4.2) 
It is easy to check that if we replace by , (4.2) still holds.
Proof of Lemma 4.1.
∎
Note that we cannot directly apply the symmetrization technique (see the appendix) to and since and are correlated with and . However, this problem can be solved by replacing the samples in the empirical terms in and with ghost samples independent of , and replacing and with and obtained from the ghost samples, respectively. That is, we replace and with and in and , respectively. Then we can proceed with the same proof of Lemma 4.1 and apply the symmetrization technique to and , since and have the same distribution. To simplify the notation, we will simply use and to denote and here, respectively.
4.2 Approximation errors
We now discuss the errors due to the discriminator approximation and the generator and encoder approximation.
4.2.1 The discriminator approximation error
The discriminator approximation error describes how well the discriminator neural network class approximates functions in the Lipschitz class . Lemma 4.2 below can be applied to obtain the neural network approximation error for Lipschitz functions. It leads to a quantitative, non-asymptotic approximation rate in terms of the width and depth of the neural networks when bounding .
Lemma 4.2 (Shen et al. (2021)).
Let be a Lipschitz continuous function defined on . For arbitrary , there exists a function implemented by a ReLU feedforward neural network with width and depth such that
By Lemma 4.2 and our choice of the architecture of the discriminator class in the theorems, we have . Lemma 4.2 also indicates how to choose the architecture of the discriminator network based on how small we want the approximation error to be. By setting , is dominated by the stochastic terms and .
4.2.2 The generator and encoder approximation error
The generator and encoder approximation error describes how powerful the generator and encoder classes are in pushing the empirical distributions and to each other. A natural question is

Can we find some generator and encoder neural network functions such that ?
Most of the current literature on the error analysis of GANs applies optimal transport theory (Villani, 2008) to bound an error term similar to ; see, for example, Chen et al. (2020). However, the existence of the optimal transport map is not guaranteed in general. Therefore, the existing analyses of GANs can only handle the scenario where the reference and the target data distributions are assumed to have the same dimension. This equal-dimensionality assumption is not satisfied in the actual training of GANs or bidirectional GANs in many applications. Here, instead of using optimal transport theory, we establish the approximation results in Theorem 4.3, which enable us to forgo the equal-dimensionality assumption.
Theorem 4.3.
Suppose that supported on and supported on are both absolutely continuous w.r.t. the Lebesgue measures, and and are i.i.d. samples from and , respectively for . Then there exist generator and encoder neural network functions and such that and are inverse bijections of each other between and up to a permutation. Moreover, such neural network functions and can be obtained by properly specifying and for some constant .
Proof.
By the absolute continuity of and , all the and are distinct almost surely. We can reorder from the smallest to the largest, so that . Let be any point between and for . We define the continuous piecewise linear function by
By Yang et al. (2021, Lemma 3.1), if . Taking , a simple calculation shows for some constant . The neural network function can be constructed in the same way, using the fact that the first coordinates of are distinct almost surely. ∎
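The construction in the proof can be mimicked numerically in one dimension: a continuous piecewise-linear interpolant through the sorted reference atoms pushes the empirical reference distribution exactly onto the empirical target distribution, provided the atoms are distinct (which holds almost surely under absolute continuity). This is only an illustrative sketch of the idea; such an interpolant with n knots is representable by a ReLU network whose size grows with n, as quantified in Theorem 4.3.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
z = np.sort(rng.normal(size=n))   # atoms of the empirical reference (sorted)
x = rng.normal(size=n)            # atoms of the empirical target (1-D here)

# Continuous piecewise-linear g with g(z_i) = x_i, as in the proof:
# between consecutive knots z_i < z_{i+1}, g interpolates linearly.
g = lambda t: np.interp(t, z, x)

# Pushing the reference atoms forward through g recovers the target
# atoms exactly, so the pushforward empirical distribution matches
# the target empirical distribution perfectly.
pushed = g(z)
```

The same idea, applied coordinate-wise to the first coordinate of the data, yields the encoder direction in the proof.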
When the number of point masses of the empirical distributions is relatively moderate compared with the size of the neural networks, we can approximate an empirical distribution arbitrarily well by any empirical distribution with the same number of point masses pushed forward through the neural networks.
Theorem 4.3 provides an effective way to specify the architectures of the generator and encoder classes. According to this theorem, we can take , which gives rise to . More importantly, Theorem 4.3 can be applied to bound as follows.
We simply reordered and as in the proof. Therefore, this error term can be perfectly eliminated.
4.3 Stochastic errors
The stochastic error () quantifies how close the empirical distribution and the true latent joint distribution (data joint distribution) are, with the Lipschitz class as the evaluation class under the IPM. We apply the refined Dudley inequality (Schreuder, 2020), stated in Lemma C.1, to bound and .
Lemma 4.4 (Refined Dudley Inequality).
For a symmetric function class with , we have
The original Dudley inequality (Dudley, 1967; Van der Vaart and Wellner, 1996) has the drawback that if the covering number increases too fast as goes to , then the upper bound is infinite and hence vacuous. The refined Dudley inequality circumvents this problem by integrating only from , as shown in Lemma C.1, which also indicates how scales with the covering number .
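For concreteness, one commonly stated form of the refined chaining bound is the following; this is a generic version, hedged as an assumption, since the exact constants and norms in Lemma C.1 may differ:

```latex
\mathbb{E}\sup_{f\in\mathcal{F}}
\Bigl|\frac{1}{n}\sum_{i=1}^{n} f(X_i) - \mathbb{E}f(X)\Bigr|
\;\lesssim\;
\inf_{\delta>0}\Bigl(\delta
+ \frac{1}{\sqrt{n}}\int_{\delta}^{B}
\sqrt{\log \mathcal{N}\bigl(\epsilon,\mathcal{F},\|\cdot\|_{\infty}\bigr)}\,
d\epsilon\Bigr),
```

where $B$ is the uniform bound on $\mathcal{F}$. Truncating the entropy integral at $\delta > 0$ keeps the bound finite even when $\log \mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_\infty)$ diverges faster than $\epsilon^{-2}$ as $\epsilon \to 0$, at the price of the additive $\delta$ term, which is then balanced against the integral.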
By calculating the covering number of and utilizing the refined Dudley inequality, we can obtain the upper bound
(4.3) 
5 Related work
Recently, several impressive works have studied the challenging problem of the convergence properties of unidirectional GANs. Arora et al. (2017) noted that the training of GANs may not have good generalization properties, in the sense that even when training appears successful, the trained distribution may be far from the target distribution in standard metrics. On the other hand, Bai et al. (2019) showed that GANs can learn distributions in Wasserstein distance with polynomial sample complexity. Liang (2020) studied the rates of convergence of a class of GANs, including Wasserstein, Sobolev and MMD GANs, and also established the nonparametric minimax optimal rate under the Sobolev IPM. The results of Bai et al. (2019) and Liang (2020) require invertible generator networks, meaning that all the weight matrices need to be full-rank and the activation function needs to be the invertible leaky ReLU activation. Chen et al. (2020) established an upper bound for the estimation error rate under Hölder evaluation and target density classes, where is the Hölder class with regularity and the density of the target is assumed to belong to . They assumed that the reference distribution has the same dimension as the target distribution and applied optimal transport theory to control the generator approximation error. However, how the prefactor in the error bounds depends on the dimension in the existing results (Liang, 2020; Chen et al., 2020) is either not clearly described or is exponential. In high-dimensional settings with large , this makes a substantial difference in the quality of the error bounds.
Singh et al. (2019) studied minimax convergence rates of nonparametric density estimation under a class of adversarial losses and investigated how the choice of loss and the assumed smoothness of the underlying density together determine the minimax rate; they also discussed connections to learning generative models in a minimax statistical sense. Uppal et al. (2019) generalized the idea of the Sobolev IPM to the Besov IPM, where both the target density and the evaluation classes are Besov classes. They also showed how their results imply bounds on the statistical error of a GAN.
These results provide important insights in the understanding of GANs. However, as we mentioned earlier, some of the assumptions made in these results, including equal dimension between the reference and target distributions and bounded support of the distributions, are not satisfied in the training of GANs in practice. Our results avoid these assumptions. Moreover, the prefactors in our error bounds are clearly described as being dependent on the square root of the dimension . Finally, the aforementioned results only dealt with unidirectional GANs. Our work is the first to address the convergence properties of bidirectional GANs.
6 Conclusion
This paper derives error bounds for bidirectional GANs under the Dudley distance between the latent joint distribution and the data joint distribution. The results are established without two crucial conditions that are commonly assumed in the existing literature: equal dimensionality between the reference and the target distributions and bounded support for these distributions. Additionally, this work contributes to neural network approximation theory by constructing neural network functions such that the pushforward distribution of an empirical distribution can perfectly approximate another arbitrary empirical distribution of a different dimension, as long as their numbers of point masses are equal. A novel decomposition of the integral probability metric is also developed for the error analysis of bidirectional GANs, which may be useful in other generative learning problems.
A limitation of our results, as well as all the existing results on the convergence properties of GANs, is that they suffer from the curse of dimensionality, which cannot be circumvented by imposing stronger smoothness assumptions. In many applications, high-dimensional complex data, such as images, texts and natural languages, tend to be supported on approximate lower-dimensional manifolds. It is desirable to take such structures into account in the theoretical analysis. An important extension of the present results is to show that bidirectional GANs can circumvent the curse of dimensionality if the target distribution is supported on an approximate lower-dimensional manifold. This appears to be a technically challenging problem and will be pursued in our future work.
Acknowledgements
The authors wish to thank the three anonymous reviewers for their insightful comments and constructive suggestions that helped improve the paper significantly.
The work of J. Huang is partially supported by the U.S. NSF grant DMS-1916199. The work of Y. Jiao is supported in part by the National Science Foundation of China under Grant 11871474 and by the research fund of KLATASDS-MOE. The work of Y. Wang is supported in part by the Hong Kong Research Grant Council grants 16308518 and 16317416 and HK Innovation Technology Fund ITS/044/18FX, as well as the Guangdong-Hong Kong-Macao Joint Laboratory for Data-Driven Fluid Mechanics and Engineering Applications.
References
 Wasserstein generative adversarial networks. In ICML, 2017.
 Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning, pp. 224–232, 2017.
 Approximability of discriminators implies diversity in GANs. In International Conference on Learning Representations, 2019.
 Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096, 2019.
 Statistical guarantees of generative adversarial networks for distribution estimation. arXiv:2002.03938, 2020.
 Adversarial feature learning. arXiv:1605.09782, 2016.
 The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis 1(3), pp. 290–330, 1967.
 Real Analysis and Probability. CRC Press.
 Adversarially learned inference. arXiv:1606.00704, 2016.
 Generative adversarial networks. arXiv:1406.2661, 2014.
 Efficient regression in metric spaces via approximate Lipschitz extension. In International Workshop on Similarity-Based Pattern Recognition, pp. 43–58.
 Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196, 2018.
 A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2019.
 How well generative adversarial networks learn distributions. arXiv:1811.03179, 2020.
 Approximation and convergence properties of generative adversarial learning. arXiv:1705.08991, 2017.
 A universal approximation theorem of deep neural networks for expressing distributions. arXiv:2004.08867, 2020.
 Adversarial autoencoders. arXiv:1511.05644, 2015.
 Foundations of Machine Learning. MIT Press.
 Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pp. 429–443, 1997.
 Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2016.
 Generative adversarial text to image synthesis. In ICML, 2016.
 Bounding the expectation of the supremum of empirical processes indexed by Hölder classes. arXiv:2003.13530, 2020.
 Bidirectional generative modeling using adversarial gradient estimation. arXiv:2002.09161, 2020.
 Deep network approximation characterized by number of neurons. arXiv:1906.05497.
 Nonparametric density estimation with adversarial losses. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 10246–10257.
 Note on refined Dudley integral covering number bound. Unpublished note, http://ttic.uchicago.edu/karthik/dudley.pdf.
 Nonparametric density estimation & convergence rates for GANs under Besov IPM losses. arXiv:1902.03511, 2019.
 Weak convergence. In Weak Convergence and Empirical Processes, Springer, 1996.
 Optimal Transport: Old and New. Vol. 338, Springer Science & Business Media, 2008.
 On the capacity of deep generative networks for approximating distributions. arXiv:2101.12353, 2021.
 On the discrimination-generalization tradeoff in GANs. In International Conference on Learning Representations, 2018.
 Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
Appendix A Notations and Preliminaries
We use to denote the ReLU activation function in neural networks, which is . Without further indication, represents the norm. For any function , let . We use notation and to express the order of function slightly differently, where omits the universal constant not relying on while omits the constant related to . We use to denote ball in with center at and radius . Let be the pushforward distribution of by function in the sense that for any measurable set .
The covering number of some class w.r.t. norm is the minimum number of  radius balls needed to cover , which we denote as . We denote as the covering number of w.r.t. , which is defined as where are the empirical samples. We denote as the covering number of w.r.t. , which is defined as . It is easy to check that
Appendix B Restriction on the domain of uniformly bounded Lipschitz function class
So far, most related works assume that the target distribution is supported on a compact set; see, for example, Chen et al. (2020) and Liang (2020). To remove the compact support assumption, we need Assumption 1, i.e., the tails of the target and the reference are subexponential. Define . In this section, we show that proving Theorem 3.2 is equivalent to establishing the same convergence rate with the domain-restricted function class as the evaluation class.
Under Assumption 1 and by the Markov inequality, we have
(B.1) 
The Dudley distance between latent joint distribution and data joint distribution is defined as
(B.2) 
The first term above can be decomposed as
(B.3) 
For any and fixed point such that , due to the Lipschitzness of , the second term above satisfies