The generative adversarial networks (GANs) proposed in Goodfellow et al. (2014)
utilize two neural networks competing with each other to generate new samples from the same distribution as the training data. They have been successful in many applications, including producing photorealistic images, improving astronomical images, and modding video games (Radford et al., 2015; Reed et al., 2016; Salimans et al., 2016; Ledig et al., 2017; Schawinski et al., 2017; Brock et al., 2018; Volz et al., 2018).
From the viewpoint of statistics, GANs have stood out as an important unsupervised method for learning target data distributions. Different from explicit distribution estimations, e.g., density estimation, GANs implicitly learn the data distribution and act as samplers to generate new fake samples mimicking the data distribution (see Figure 1).
To estimate a data distribution $\mu$, GANs solve the following minimax optimization problem
$$\min_{g \in \mathcal{G}} \max_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{z \sim \rho}[f(g(z))], \qquad (1)$$
where $\mathcal{G}$ denotes a class of generators, $\mathcal{F}$ denotes a symmetric class (if $f \in \mathcal{F}$, then $-f \in \mathcal{F}$) of discriminators, and $z$ follows some easy-to-sample distribution $\rho$, e.g., a uniform or Gaussian distribution. The estimator of $\mu$ is given by the pushforward distribution of $\rho$ under the generator $g$, denoted by $g_\sharp \rho$.
The inner maximization problem of (1) is an Integral Probability Metric (IPM, Müller (1997)), which quantifies the discrepancy between two distributions $\mu$ and $\nu$ with respect to the symmetric function class $\mathcal{F}$:
$$d_{\mathcal{F}}(\mu, \nu) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{x \sim \nu}[f(x)].$$
Accordingly, GANs essentially minimize an IPM between the generated distribution and the data distribution. IPMs unify many standard discrepancy metrics. For example, when $\mathcal{F}$ is taken to be the class of all $1$-Lipschitz functions, $d_{\mathcal{F}}$ is the Wasserstein distance; when $\mathcal{F}$ is taken to be the class of all indicator functions, $d_{\mathcal{F}}$ is the total variation distance; when $\mathcal{F}$ is taken to be a class of discriminator networks, $d_{\mathcal{F}}$ is the so-called “neural net distance” (Arora et al., 2017).
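As a concrete illustration of how an IPM compares two distributions, the following NumPy sketch evaluates the empirical IPM over a small, finite symmetric function class; the class and the distributions here are our own illustrative choices, not the ones analyzed in this paper:

```python
import numpy as np

def empirical_ipm(x, y, fs):
    """Empirical IPM: sup over f in fs of mean f(x) - mean f(y).

    fs is assumed symmetric (closed under negation), so the supremum
    is nonnegative and vanishes only when all test means agree."""
    return max(np.mean(f(x)) - np.mean(f(y)) for f in fs)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)   # samples from the data distribution
y = rng.normal(0.5, 1.0, size=5000)   # samples from a generated distribution

# A small symmetric class of bounded, 1-Lipschitz test functions.
base = [np.sin, np.tanh, lambda t: np.clip(t, -1.0, 1.0)]
fs = base + [lambda t, f=f: -f(t) for f in base]

print(empirical_ipm(x, y, fs))  # strictly positive: the distributions differ
```

A richer function class (e.g., a trained discriminator network) yields a stronger metric; this finite class only lower-bounds the corresponding Lipschitz IPM.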
GANs parameterize the generator and discriminator classes $\mathcal{G}$ and $\mathcal{F}$ by deep neural networks (ReLU activation is considered in this paper), denoted by $\mathcal{G}_{\rm NN}$ and $\mathcal{F}_{\rm NN}$, which consist of functions given by a feedforward ReLU network of the following form
$$f(x) = W_L \cdot \mathrm{ReLU}\big(W_{L-1} \cdots \mathrm{ReLU}(W_1 x + b_1) \cdots + b_{L-1}\big) + b_L,$$
where the $W_i$'s and $b_i$'s are weight matrices and intercepts, respectively, and $\mathrm{ReLU}(a) = \max\{a, 0\}$, applied entrywise, denotes the rectified linear unit (Nair and Hinton, 2010; Glorot et al., 2011; Maas et al., 2013). These networks can ease the notorious vanishing gradient issue during training, which commonly arises with sigmoid or hyperbolic tangent activations (Glorot et al., 2011; Goodfellow et al., 2016).
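For concreteness, a forward pass of such a feedforward ReLU network can be sketched in a few lines of NumPy; the layer sizes and random weights below are arbitrary illustrative choices:

```python
import numpy as np

def relu(a):
    # Rectified linear unit, applied entrywise.
    return np.maximum(a, 0.0)

def forward(x, weights, biases):
    """Feedforward ReLU network: ReLU after every layer except the last."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
dims = [3, 8, 8, 1]  # input dimension 3, two hidden layers, scalar output
weights = [rng.normal(size=(m, n)) for n, m in zip(dims[:-1], dims[1:])]
biases = [rng.normal(size=m) for m in dims[1:]]

out = forward(rng.normal(size=3), weights, biases)
print(out.shape)  # (1,)
```

A generator network has the same structure with a vector-valued output; a discriminator has a scalar output as above.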
Given $n$ i.i.d. samples $\{x_i\}_{i=1}^n$ from the data distribution $\mu$, GANs solve the empirical counterpart of (1),
$$\min_{\theta} \max_{\omega} \; \frac{1}{n} \sum_{i=1}^n f_\omega(x_i) - \mathbb{E}_{z \sim \rho}[f_\omega(g_\theta(z))], \qquad (3)$$
where $\theta$ and $\omega$ are the parameters in the generator and discriminator networks, respectively. The empirical estimator of $\mu$ given by GANs is the pushforward distribution of $\rho$ under $g_{\hat\theta}$, denoted by $(g_{\hat\theta})_\sharp \rho$.
In contrast to the prevalence of GANs in applications, only a limited number of works study their theoretical properties (Arora et al., 2017; Bai et al., 2018; Liang, 2018; Singh et al., 2018; Thanh-Tung et al., 2019). Here we focus on the following fundamental questions from a theoretical point of view:
What types of distributions can be learned by GANs?
If the distribution can be learned, what is the statistical rate of convergence?
This paper shows that, if the generator and discriminator network architectures are properly chosen, GANs can effectively learn distributions with Hölder densities supported on proper domains. Specifically, we consider a data distribution $\mu$ supported on a compact domain $\mathcal{X} \subset \mathbb{R}^d$, with $d$ being the data dimension. We assume $\mu$ has a density $p_\mu$ lower bounded away from $0$ on $\mathcal{X}$, and the density belongs to the Hölder class $\mathcal{H}^\alpha(\mathcal{X})$.
In order to learn $\mu$, we choose proper generator and discriminator network architectures: we specify the width and depth of the networks, the total number of neurons, and the total number of weight parameters (details are provided in Section 2). Roughly speaking, the generator is chosen to be flexible enough to approximate the data distribution, and the discriminator is powerful enough to distinguish the generated distribution from the data distribution.
Let $(g_{\hat\theta}, f_{\hat\omega})$ be the optimal solution to (3); then $(g_{\hat\theta})_\sharp \rho$ is the generated data distribution, serving as an estimator of $\mu$. Our main results can be summarized as, for any $\beta > 0$,
$$\mathbb{E}\Big[d_{\mathcal{H}^\beta}\big((g_{\hat\theta})_\sharp \rho, \mu\big)\Big] = \tilde{O}\big(n^{-\beta/d} \vee n^{-1/2}\big),$$
where the expectation is taken over the randomness of the samples, and $\tilde{O}$ hides logarithmic factors, constants, and polynomials in $\beta$ and $d$.
To the best of our knowledge, this is the first statistical theory of GANs for Hölder densities. It shows that the Hölder IPM between the generated distribution and the data distribution converges at a rate depending on the Hölder index $\beta$ and the dimension $d$. When $\beta = 1$, our theory implies that GANs can estimate any distribution with a Hölder density under the Wasserstein distance. This is different from the generalization bound in Arora et al. (2017) under the weaker neural net distance.
In our analysis, we decompose the distribution estimation error into a statistical error and an approximation error by a new oracle inequality. A key step is to properly choose the generator network architecture to control the approximation error. Specifically, the generator architecture allows an accurate approximation to a data transformation $T$ such that $T_\sharp \rho = \mu$. The existence of such a transformation is guaranteed by optimal transport theory (Villani, 2008), and holds universally for all data distributions with Hölder densities.
In comparison with existing works (Bai et al., 2018; Liang, 2018; Singh et al., 2018), our theory holds under minimal assumptions on the data distributions and does not require invertible generator networks (in which all the weight matrices must be full-rank and the activation function must be the invertible leaky ReLU). See Section 4 for a detailed comparison.
Notations: Given a real number $\alpha$, we denote by $\lfloor \alpha \rfloor$ the largest integer strictly smaller than $\alpha$ (in particular, if $\alpha$ is an integer, $\lfloor \alpha \rfloor = \alpha - 1$). Given a vector $v$, we denote its $\ell_2$ norm by $\|v\|_2$, its $\ell_\infty$ norm by $\|v\|_\infty$, and its number of nonzero entries by $\|v\|_0$. Given a matrix $W$, we denote its number of nonzero entries by $\|W\|_0$. Given a function $f$, we denote its $L^\infty$ norm by $\|f\|_\infty = \sup_x |f(x)|$. For a multivariate transformation $T$ and a given distribution $\rho$ on a subset of $\mathbb{R}^d$, we denote the pushforward distribution by $T_\sharp \rho$, i.e., for any measurable set $A$, $T_\sharp \rho(A) = \rho(T^{-1}(A))$.
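The pushforward notation corresponds directly to how GANs sample: draw $z \sim \rho$ and output $T(z)$. A minimal NumPy example, using inverse transform sampling as the (hand-picked, illustrative) transformation $T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Easy-to-sample distribution rho: uniform on [0, 1].
z = rng.uniform(0.0, 1.0, size=100_000)

# A transformation T; here the inverse CDF of Exp(1), so that
# T_# rho is the Exp(1) distribution (inverse transform sampling).
def T(z):
    return -np.log(1.0 - z)

x = T(z)  # samples from the pushforward distribution T_# rho
print(x.mean())  # close to 1, the mean of Exp(1)
```

A trained generator plays the role of $T$, except that it is learned from data rather than written down in closed form.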
2 Statistical Theory
We consider a data distribution $\mu$ supported on a subset $\mathcal{X} \subseteq \mathbb{R}^d$. We assume the distribution has a density function $p_\mu$. Suppose we can easily generate samples from some easy-to-sample distribution $\rho$ supported on $\mathcal{Z} \subseteq \mathbb{R}^d$, such as the uniform distribution.
Before we proceed, we make the following assumptions. The domains $\mathcal{X}$ and $\mathcal{Z}$ are compact, i.e., there exists a constant $B > 0$ such that for any $x \in \mathcal{X}$ or $x \in \mathcal{Z}$, we have $\|x\|_\infty \leq B$.
The density function $p_\mu$ belongs to the Hölder class with index $\alpha > 0$ in the interior of $\mathcal{X}$, i.e., there exists a constant $C > 0$ such that
for any multi-index $s$ with $|s| \leq \lfloor \alpha \rfloor$ and any $x$ in the interior of $\mathcal{X}$, $|\partial^s p_\mu(x)| \leq C$;
for any $x, y$ in the interior of $\mathcal{X}$ and any multi-index $s$ with $|s| = \lfloor \alpha \rfloor$,
$$|\partial^s p_\mu(x) - \partial^s p_\mu(y)| \leq C \|x - y\|_2^{\alpha - \lfloor \alpha \rfloor},$$
where $\partial^s$ denotes the $s$-th order partial derivative. Meanwhile, $p_\mu$ is lower bounded on $\mathcal{X}$, i.e., $p_\mu(x) \geq \tau$ whenever $x \in \mathcal{X}$, for some constant $\tau > 0$.
The easy-to-sample distribution $\rho$ has an infinitely smooth density $p_\rho$.
Hölder densities have been widely studied in density estimation (Wasserman, 2006; Tsybakov, 2008). The lower bound on $p_\mu$ is a technical assumption common in the literature (Moser, 1965; Caffarelli, 1996). The last assumption is always satisfied, since $\rho$ is often taken as the uniform distribution.
We consider the following two sampling scenarios:
Scenario 1. The support is convex.
Scenario 2. The support is open and its boundary satisfies some smoothness condition.
The condition in either scenario guarantees the existence of a Hölder transformation $T$ such that $T_\sharp \rho = \mu$ (see Section 3). In Scenario 1, one can simply take $\rho$ as the uniform distribution on a convex set, e.g., $\mathcal{Z} = [0,1]^d$. In Scenario 2, the support $\mathcal{X}$ needs to be known as a priori information, since we need to take $\rho$ supported on $\mathcal{X}$.
We choose the generator network architecture as
$$\mathcal{G}_{\rm NN}(R, \kappa, L, p, K) = \Big\{ g \text{ of the feedforward ReLU form above} : \|g\|_\infty \leq R, \ \max_i \|W_i\|_{\infty,\infty} \vee \|b_i\|_\infty \leq \kappa, \ \text{depth } L, \ \text{width } p, \ \textstyle\sum_i \|W_i\|_0 + \|b_i\|_0 \leq K \Big\}, \qquad (4)$$
and the discriminator network architecture as $\mathcal{F}_{\rm NN}(R, \kappa, L, p, K)$, defined analogously with scalar outputs,
where $\|\cdot\|_0$ denotes the number of nonzero entries in a vector or a matrix, and $\|W\|_{\infty,\infty} = \max_{j,k} |W_{jk}|$ for a matrix $W$.
We show that under either Scenario 1 or 2, the generator can universally approximate the data distribution.
(a) For any data distribution $\mu$ and easy-to-sample distribution $\rho$ satisfying the assumptions in Section 2, under either Scenario 1 or 2, there exists a transformation $T$ such that $T_\sharp \rho = \mu$.
(b) Let $\mathcal{X}$ and $\mathcal{Z}$ be fixed under either Scenario 1 or 2. Given any $\epsilon > 0$, there exists a generator network architecture, with depth, width, and number of nonzero weight parameters depending on $\epsilon$,
such that for any data distribution $\mu$ and easy-to-sample distribution $\rho$ satisfying the assumptions in Section 2, if the weight parameters of this network are properly chosen, then it yields a transformation $g_\theta$ satisfying $\|g_\theta - T\|_\infty \leq \epsilon$, where $T$ is the transformation in part (a).
We next state our statistical estimation error in terms of the Hölder IPM between $(g_{\hat\theta})_\sharp \rho$ and $\mu$, where $(g_{\hat\theta}, f_{\hat\omega})$ is the optimal solution of GANs in (3). Suppose the assumptions in Section 2 hold, and choose network parameters as in (4) for the generator architecture and analogously for the discriminator architecture. Then we have
Theorem 2 demonstrates that GANs can effectively learn data distributions, with a convergence rate depending on the smoothness $\beta$ of the function class in the IPM and on the dimension $d$. Here are some remarks:
Both networks have uniformly bounded outputs. Such a requirement can be achieved by adding an additional clipping layer at the end of the network, in order to truncate the output to the range $[-B, B]$.
This is the first statistical guarantee for GANs estimating data distributions with Hölder densities. Existing works require restrictive assumptions on the data distributions (e.g., that the density can be implemented by an invertible neural network).
In the case that only finitely many samples from the easy-to-sample distribution $\rho$ can be obtained, GANs solve the following alternative minimax problem
$$\min_{\theta} \max_{\omega} \; \frac{1}{n} \sum_{i=1}^n f_\omega(x_i) - \frac{1}{m} \sum_{j=1}^m f_\omega(g_\theta(z_j)), \qquad (5)$$
where the $z_j$'s are i.i.d. samples from $\rho$. We slightly abuse notation to denote $(g_{\hat\theta}, f_{\hat\omega})$ as the optimal solution to (5). We show in the following corollary that GANs retain similar statistical guarantees for distribution estimation with finite generated samples. Suppose the assumptions in Section 2 hold. We choose
for the generator network, and the same architecture as in Theorem 2 for the discriminator network. Then we have
Here $\tilde{O}$ hides logarithmic factors in $n$ and $m$. As it is often cheap to obtain a large number of samples from $\rho$, the statistical convergence rate is dominated by the term depending on $n$ whenever $m$ is sufficiently large.
3 Proof of Distribution Estimation Theory
3.1 Proof of Theorem 2 (Distribution Approximation)
We begin with distribution transformations and function approximation theories using ReLU networks.
Transformations between Distributions. Let $\mathcal{Z}$ and $\mathcal{X}$ be subsets of $\mathbb{R}^d$. Given two probability spaces $(\mathcal{Z}, \rho)$ and $(\mathcal{X}, \mu)$, we aim to find a transformation $T: \mathcal{Z} \to \mathcal{X}$ such that $T_\sharp \rho = \mu$. In general, such a $T$ may not exist, nor be unique. Here we assume $\rho$ and $\mu$ have Hölder densities $p_\rho$ and $p_\mu$, respectively. The Monge map and Moser's coupling ensure the existence of a Hölder transformation $T$.
Monge Map. The Monge map arises in optimal transport theory, which has found wide applications in machine learning (Ganin and Lempitsky, 2014; Courty et al., 2016). We assume that the support $\mathcal{X}$ is convex and that the densities $p_\rho$ and $p_\mu$ belong to the Hölder space with index $\alpha$; moreover, $p_\rho$ and $p_\mu$ are bounded below by some positive constant.
The Monge map is the solution to the following optimization problem
$$\min_{T:\, T_\sharp \rho = \mu} \; \mathbb{E}_{z \sim \rho}\big[c(z, T(z))\big], \qquad (6)$$
where $c(\cdot, \cdot)$ is a cost function. (6) is known as the Monge problem. When $\mathcal{X}$ is convex and the cost function is quadratic, the solution to (6) satisfies the Monge–Ampère equation (Monge, 1784). Caffarelli (1992b, a, 1996) and Urbas (1988, 1997) proved the regularity of the Monge map independently, using different sophisticated tools. Their main result is summarized in the following lemma. [Caffarelli (1992b)] In Scenario 1, suppose the assumptions in Section 2 hold. Then there exists a transformation $T$ such that $T_\sharp \rho = \mu$. Moreover, this transformation belongs to the Hölder class $\mathcal{H}^{\alpha + 1}$.
Moser’s Coupling. Moser’s coupling extends to nonconvex supports; it was first proposed in Moser (1965) to transform densities supported on the same compact and smooth manifold without boundary. Later, Greene and Shiohama (1979) established results for noncompact manifolds. Moser himself also extended the results to the case where the supports are open sets with boundary (Dacorogna and Moser, 1990). We summarize the main result. [Theorem 1 in Dacorogna and Moser (1990)] In Scenario 2, suppose the assumptions in Section 2 hold, and assume $\partial \mathcal{X}$ (the boundary of $\mathcal{X}$) is sufficiently smooth. Then there exists a transformation $T$ such that $T_\sharp \rho = \mu$. Moreover, this transformation belongs to the Hölder class $\mathcal{H}^{\alpha + 1}$.
Such a transformation can be explicitly constructed. Specifically, let $u$ solve the Poisson equation $\Delta u = p_\rho - p_\mu$, where $\Delta$ is the Laplacian. We construct the time-dependent vector field $v_t = \nabla u / \big((1 - t) p_\rho + t p_\mu\big)$, and define $T$ as the time-one flow along $v_t$. Note that $v_t$ is well defined, in that $p_\rho$ and $p_\mu$ are bounded below. Using the conservation of mass formula, one checks that $T_\sharp \rho = \mu$ (Chapter 1, Villani (2008)).
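As a consistency check of Moser's construction, write $p_\rho$ and $p_\mu$ for the two densities and adopt the sign convention $\Delta u = p_\rho - p_\mu$. Then the linear interpolation $p_t = (1-t)\,p_\rho + t\,p_\mu$ together with the field $v_t = \nabla u / p_t$ satisfies the continuity equation, which is exactly the conservation of mass formula:

```latex
\begin{aligned}
\partial_t p_t + \nabla \cdot (p_t v_t)
  &= (p_\mu - p_\rho) + \nabla \cdot \Big( p_t \cdot \frac{\nabla u}{p_t} \Big) \\
  &= (p_\mu - p_\rho) + \Delta u \\
  &= (p_\mu - p_\rho) + (p_\rho - p_\mu) = 0 .
\end{aligned}
```

Hence the flow of $v_t$ transports $p_0 = p_\rho$ to $p_1 = p_\mu$, and its time-one map $T$ satisfies $T_\sharp \rho = \mu$.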
Function Approximation by ReLU Networks. The representation abilities of neural networks are studied from the perspective of universal approximation theories (Cybenko, 1989; Hornik, 1991; Chui and Li, 1992; Barron, 1993; Mhaskar, 1996). Recently, Yarotsky (2017) established a universal approximation theory for ReLU networks, where the network attains the optimal size and is capable of approximating any Hölder/Sobolev function. The main result is summarized in the following lemma. Given any $\epsilon > 0$, there exists a ReLU network architecture such that, for any $f$ in the Hölder class $\mathcal{H}^\beta([0,1]^d)$, if the weight parameters are properly chosen, the network yields a function $\hat{f}$ for the approximation of $f$ with $\|\hat{f} - f\|_\infty \leq \epsilon$. Such a network has
1) no more than $c \big(\log \frac{1}{\epsilon} + 1\big)$ layers, and
2) at most $c' \epsilon^{-d/\beta} \big(\log \frac{1}{\epsilon} + 1\big)$ neurons and weight parameters,
where the constants $c$ and $c'$ depend on the dimension $d$, the smoothness index $\beta$, and the Hölder norm bound.
The preceding lemma is a direct result of Theorem 1 in Yarotsky (2017), which is originally proved for Sobolev functions. A proof for Hölder functions can be found in Chen et al. (2019a). The high-level idea consists of two steps: 1) approximate the target function using a weighted sum of local Taylor polynomials; 2) implement each Taylor polynomial using a ReLU network. The second step can be realized, since polynomials can be implemented using only multiplication and addition operations, and ReLU networks can efficiently approximate the multiplication operation (Proposition 3 in Yarotsky (2017)). We also remark that all the weight parameters in the network constructed in this lemma are bounded by a constant.
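To illustrate the key primitive behind this construction, the following sketch (with our own function names) builds Yarotsky's ReLU approximation of $x^2$ on $[0,1]$: the hat function $g(x) = 2\,\mathrm{ReLU}(x) - 4\,\mathrm{ReLU}(x - 1/2) + 2\,\mathrm{ReLU}(x - 1)$ is composed $m$ times, and the resulting piecewise linear function matches $x^2$ up to error $2^{-2m-2}$:

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def hat(x):
    # Hat function g on [0, 1], built only from ReLUs.
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_approx(x, m):
    """Yarotsky's ReLU approximation of x**2 on [0, 1].

    f_m(x) = x - sum_{s=1}^m g^{(s)}(x) / 4**s, where g^{(s)} is the
    s-fold composition of the hat function; sup error <= 2**(-2m - 2)."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for s in range(1, m + 1):
        g = hat(g)
        out = out - g / 4.0**s
    return out

x = np.linspace(0.0, 1.0, 1001)
for m in (2, 4, 6):
    err = np.max(np.abs(square_approx(x, m) - x**2))
    print(m, err, 2.0**(-2 * m - 2))  # empirical error vs theoretical bound
```

Once squaring is available, multiplication follows from $xy = \big((x+y)^2 - x^2 - y^2\big)/2$, and hence so do polynomials.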
Theorem 2 is obtained by combining the lemmas above. In Scenario 1, we can take $\mathcal{Z} = [0,1]^d$ for simplicity, and then apply the approximation lemma. More generally, if $\mathcal{Z} \subseteq [-B, B]^d$, we define a scaling function $\phi(z) = (z + B\mathbf{1})/(2B)$ for any $z \in \mathcal{Z}$, where $\mathbf{1}$ denotes a vector of $1$'s. For any data transformation $T$, we rewrite it as $T = (T \circ \phi^{-1}) \circ \phi$, so that it suffices to approximate $T \circ \phi^{-1}$ supported on $[0,1]^d$. When $\mathcal{Z}$ is a subset of $[0,1]^d$ with positive measure, especially in Scenario 2, we can apply the same proof technique to extend the approximation theory to Hölder functions defined on $\mathcal{Z}$.
The Monge map or Moser's coupling yields a data transformation $T$ with $T_\sharp \rho = \mu$. We invoke the ReLU approximation lemma to construct the generator network architecture. Denote $T = [T_1, \dots, T_d]^\top$, where each coordinate mapping $T_i$ belongs to $\mathcal{H}^{\alpha+1}$. We then approximate each coordinate mapping $T_i$: for a given error $\epsilon$, $T_i$ can be approximated by a ReLU network with $O\big(\log \frac{1}{\epsilon}\big)$ layers and $O\big(\epsilon^{-d/(\alpha+1)} \log \frac{1}{\epsilon}\big)$ neurons and weight parameters. Finally, $T$ can be approximated by $d$ such ReLU networks in parallel.
3.2 Proof of Theorem 2 (Statistical Estimation)
We first show a new oracle inequality, which decomposes the distribution estimation error into a generator approximation error, a discriminator approximation error, and a statistical error. Let $\mathcal{H}^\beta$ be the Hölder function class defined on $[-B, B]^d$ with Hölder parameter $\beta$, and let $\hat\mu_n$ denote the empirical data distribution. Then
$$d_{\mathcal{H}^\beta}\big((g_{\hat\theta})_\sharp \rho, \mu\big) \;\leq\; \underbrace{\inf_{g_\theta \in \mathcal{G}_{\rm NN}} d_{\mathcal{H}^\beta}\big((g_\theta)_\sharp \rho, \mu\big)}_{\mathcal{E}_1} \;+\; 4 \underbrace{\sup_{f \in \mathcal{H}^\beta} \inf_{f_\omega \in \mathcal{F}_{\rm NN}} \|f - f_\omega\|_\infty}_{\mathcal{E}_2} \;+\; 2 \underbrace{d_{\mathcal{H}^\beta}\big(\hat\mu_n, \mu\big)}_{\mathcal{E}_3}.$$
Proof Sketch.
The proof utilizes the triangle inequality. The first step introduces the empirical data distribution $\hat\mu_n$ as an intermediate term:
$$d_{\mathcal{H}^\beta}\big((g_{\hat\theta})_\sharp \rho, \mu\big) \leq d_{\mathcal{H}^\beta}\big((g_{\hat\theta})_\sharp \rho, \hat\mu_n\big) + d_{\mathcal{H}^\beta}\big(\hat\mu_n, \mu\big).$$
We replace the first term on the right-hand side by the training loss of GANs:
$$d_{\mathcal{H}^\beta}\big((g_{\hat\theta})_\sharp \rho, \hat\mu_n\big) \leq d_{\mathcal{F}_{\rm NN}}\big((g_{\hat\theta})_\sharp \rho, \hat\mu_n\big) + 2 \mathcal{E}_2.$$
Note that $\mathcal{E}_2 = \sup_{f \in \mathcal{H}^\beta} \inf_{f_\omega \in \mathcal{F}_{\rm NN}} \|f - f_\omega\|_\infty$ reflects the approximation error of the discriminator.
To finish the proof, we apply the triangle inequality on $d_{\mathcal{F}_{\rm NN}}\big((g_{\hat\theta})_\sharp \rho, \hat\mu_n\big)$: by the optimality of $\hat\theta$, for any $\theta$,
$$d_{\mathcal{F}_{\rm NN}}\big((g_{\hat\theta})_\sharp \rho, \hat\mu_n\big) \leq d_{\mathcal{F}_{\rm NN}}\big((g_\theta)_\sharp \rho, \hat\mu_n\big).$$
The last step is to break the coupling between the discriminator and generator classes by invoking the auxiliary function class $\mathcal{H}^\beta$:
$$d_{\mathcal{F}_{\rm NN}}\big((g_\theta)_\sharp \rho, \hat\mu_n\big) \leq d_{\mathcal{H}^\beta}\big((g_\theta)_\sharp \rho, \hat\mu_n\big) + 2 \mathcal{E}_2,$$
where the last inequality follows from the definition of $\mathcal{E}_2$. The oracle inequality is obtained by combining all the previous ingredients. See details in Appendix A. ∎
We next bound each error term separately. $\mathcal{E}_1$ and $\mathcal{E}_2$ can be controlled by proper choices of the generator and discriminator architectures. $\mathcal{E}_3$ can be controlled using empirical process theories (Van Der Vaart and Wellner, 1996; Györfi et al., 2006).
Bounding the Generator Approximation Error $\mathcal{E}_1$. We answer this question: given $\epsilon_1 > 0$, how can we properly choose the generator architecture $\mathcal{G}_{\rm NN}$ to guarantee $\mathcal{E}_1 \leq \epsilon_1$? Later, we will pick $\epsilon_1$ based on the sample size $n$ and the Hölder indexes $\alpha$ and $\beta$. Let $\mathcal{X}$ and $\mathcal{Z}$ be fixed under either Scenario 1 or 2. Given any $\epsilon_1 > 0$, there exists a ReLU network architecture with parameters given by (4) such that, for any data distribution $\mu$ and easy-to-sample distribution $\rho$ satisfying the assumptions in Section 2, if the weight parameters of this network are properly chosen, then it yields a transformation $g_\theta$ satisfying $d_{\mathcal{H}^\beta}\big((g_\theta)_\sharp \rho, \mu\big) \leq \epsilon_1$.
Proof Sketch.
For any given $\delta > 0$, Theorem 2 implies that the chosen network architecture can yield a data transformation $g_\theta$ satisfying $\|g_\theta - T\|_\infty \leq \delta$. Here $T$ is the data transformation given by the Monge map or Moser's coupling, so that it satisfies $T_\sharp \rho = \mu$.
The remaining step is to choose $\delta$ so that $d_{\mathcal{H}^\beta}\big((g_\theta)_\sharp \rho, \mu\big) \leq \epsilon_1$. Using the definition of the IPM, we derive
$$d_{\mathcal{H}^\beta}\big((g_\theta)_\sharp \rho, T_\sharp \rho\big) = \sup_{f \in \mathcal{H}^\beta} \mathbb{E}_{z \sim \rho}\big[f(g_\theta(z)) - f(T(z))\big] \leq \|g_\theta - T\|_\infty^{\beta \wedge 1} \leq \delta^{\beta \wedge 1}.$$
The proof is complete by choosing $\delta = \epsilon_1^{1/(\beta \wedge 1)}$. The details are provided in Appendix B. ∎
Bounding the Discriminator Approximation Error $\mathcal{E}_2$. Analogous to the generator, we pre-define an error $\epsilon_2 > 0$, and determine the discriminator architecture accordingly.
The discriminator is expected to approximate any function $f \in \mathcal{H}^\beta$. It suffices to consider functions with a bounded range. The reason is that the IPM is invariant under translations, i.e., $d_{\mathcal{F}}(\mu, \nu) = d_{\tilde{\mathcal{F}}}(\mu, \nu)$ for any constant $c$, where $\tilde{\mathcal{F}} = \{f + c : f \in \mathcal{F}\}$. Therefore, we may assume there exists $x_0 \in \mathcal{X}$ such that $f(x_0) = 0$ for all $f \in \mathcal{H}^\beta$. By the Hölder continuity and the compactness of the support $\mathcal{X}$, we have $\|f\|_\infty$ uniformly bounded over such $f$. Given any $\epsilon_2 > 0$, there exists a ReLU network architecture with $O\big(\log \frac{1}{\epsilon_2}\big)$ layers and $O\big(\epsilon_2^{-d/\beta} \log \frac{1}{\epsilon_2}\big)$ neurons and weight parameters
such that, for any $f \in \mathcal{H}^\beta$, if the weight parameters are properly chosen, this network architecture yields a function $f_\omega$ satisfying $\|f_\omega - f\|_\infty \leq \epsilon_2$.
Proof Sketch.
The ReLU approximation lemma immediately yields a network architecture for uniformly approximating functions in $\mathcal{H}^\beta$. Let the approximation error be $\epsilon_2$. Then the network architecture consists of $O\big(\log \frac{1}{\epsilon_2}\big)$ layers and $O\big(\epsilon_2^{-d/\beta} \log \frac{1}{\epsilon_2}\big)$ total neurons and weight parameters. To this end, we can establish that for any $f \in \mathcal{H}^\beta$, the bound $\inf_{f_\omega \in \mathcal{F}_{\rm NN}} \|f_\omega - f\|_\infty \leq \epsilon_2$ holds. ∎
Bounding the Statistical Error $\mathcal{E}_3$. The statistical error term is essentially the concentration of the empirical data distribution around its population counterpart. Given a symmetric function class $\mathcal{F}$, we show that $\mathbb{E}\big[d_{\mathcal{F}}(\hat\mu_n, \mu)\big]$ scales with the complexity of the function class $\mathcal{F}$. For a symmetric function class $\mathcal{F}$ with $\|f\|_\infty \leq M$ for all $f \in \mathcal{F}$ and some constant $M$, we have
$$\mathbb{E}\big[d_{\mathcal{F}}(\hat\mu_n, \mu)\big] \leq \inf_{0 < \delta < M} \bigg( 4\delta + \frac{12}{\sqrt{n}} \int_{\delta}^{M} \sqrt{\log \mathcal{N}\big(\epsilon, \mathcal{F}, \|\cdot\|_\infty\big)} \, d\epsilon \bigg),$$
where $\mathcal{N}\big(\epsilon, \mathcal{F}, \|\cdot\|_\infty\big)$ denotes the $\epsilon$-covering number of $\mathcal{F}$ with respect to the $\|\cdot\|_\infty$ norm.
Proof Sketch.
The proof utilizes the symmetrization technique and Dudley's entropy integral, with details provided in Appendix C. In short, the first step relates $\mathbb{E}\big[d_{\mathcal{F}}(\hat\mu_n, \mu)\big]$ to the Rademacher complexity of $\mathcal{F}$:
$$\mathbb{E}\big[d_{\mathcal{F}}(\hat\mu_n, \mu)\big] \leq \mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big(f(x_i) - f(x_i')\big)\bigg] = \mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i \big(f(x_i) - f(x_i')\big)\bigg] \leq 2\, \mathbb{E}\bigg[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i)\bigg],$$
where the $x_i'$'s are independent copies of the $x_i$'s, and the $\sigma_i$'s are i.i.d. Rademacher random variables. The equality holds due to symmetrization. The proof then proceeds with Dudley's chaining argument (Dudley, 1967). ∎
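The symmetrization step can be explored numerically; the following Monte Carlo sketch estimates the empirical Rademacher complexity of a small symmetric function class (all choices below are our own illustrative ones):

```python
import numpy as np

def empirical_rademacher(fvals, n_rounds, rng):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(x_i).

    fvals: array of shape (num_functions, n) holding f(x_i) for each f;
    the class is symmetrized by including -f for every f."""
    fvals = np.vstack([fvals, -fvals])  # enforce symmetry
    n = fvals.shape[1]
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(fvals @ sigma) / n
    return total / n_rounds

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
fvals = np.array([np.sin(k * x) for k in range(1, 6)])

print(empirical_rademacher(fvals, 500, rng))  # small for this simple class
```

For a fixed finite class, the estimate decays roughly like $1/\sqrt{n}$ as the sample size grows, mirroring the $\sqrt{\log \mathcal{N}}/\sqrt{n}$ scaling in Dudley's bound.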
Now we need to find the covering number of the Hölder function class and that of the discriminator networks. Classical results show that the $\epsilon$-covering number of $\mathcal{H}^\beta$ is bounded by $\exp\big(c\, \epsilon^{-d/\beta}\big)$, with $c$ being a constant depending on the diameter of $\mathcal{X}$ (Nickl and Pötscher, 2007).
On the other hand, the following lemma quantifies the covering number of the discriminator class $\mathcal{F}_{\rm NN}$. The $\epsilon$-covering number of $\mathcal{F}_{\rm NN}$ satisfies the upper bound
$$\mathcal{N}\big(\epsilon, \mathcal{F}_{\rm NN}, \|\cdot\|_\infty\big) \leq \bigg(\frac{2 \kappa L B (p\kappa)^{L}}{\epsilon}\bigg)^{K},$$
where $L$, $p$, $\kappa$, and $K$ denote the depth, the width, the bound on the weight magnitudes, and the number of nonzero weight parameters of the network, respectively.
Proof Sketch.
The detailed proof is in Appendix D. Since each weight parameter in the network is bounded by a constant $\kappa$, we construct a covering by partitioning the range of each weight parameter into a uniform grid of size $h$. Consider two networks $f, f' \in \mathcal{F}_{\rm NN}$ with each weight parameter differing by at most $h$. By an induction on the number of layers in the network, we show that the norm of the difference scales as
$$\|f - f'\|_\infty \leq h \cdot L B (p\kappa)^{L}.$$
As a result, to achieve an $\epsilon$-covering, it suffices to choose $h$ such that $h \cdot L B (p\kappa)^{L} \leq \epsilon$. Therefore, the covering number is bounded by
$$\mathcal{N}\big(\epsilon, \mathcal{F}_{\rm NN}, \|\cdot\|_\infty\big) \leq \bigg(\frac{2\kappa}{h}\bigg)^{K} = \bigg(\frac{2 \kappa L B (p\kappa)^{L}}{\epsilon}\bigg)^{K}.$$
The proof is complete. ∎
Combining the bound on $\mathcal{E}_3$ with the two covering number estimates above, the statistical error can be bounded by
We find that the first infimum is attained at $\delta = n^{-\beta/d}$. It suffices to take the same choice in the second infimum. By omitting constants and polynomials in $\beta$ and $d$, we derive
Balancing the Approximation Error and Statistical Error. Combining the previous three ingredients, we can establish, by the oracle inequality above,
We choose $\epsilon_1$ and $\epsilon_2$ to be of the same order as the statistical error $\mathcal{E}_3$, so that the approximation and statistical errors are balanced. This gives rise to
4 Comparison with Related Works
The statistical properties of GANs have been studied in several recent works (Arora et al., 2017; Bai et al., 2018; Liang, 2018; Jiang et al., 2018; Thanh-Tung et al., 2019). Among these works, Arora et al. (2017) studied the generalization error of GANs. Lemma 1 of Arora et al. (2017) shows that GANs cannot generalize under the Wasserstein distance and the Jensen-Shannon divergence unless the sample size is $\tilde{\Omega}(\epsilon^{-d})$, where $\epsilon$ is the generalization gap. Alternatively, they defined a surrogate metric, the “neural net distance” $d_{\mathcal{F}_{\rm NN}}$, where $\mathcal{F}_{\rm NN}$ is the class of discriminator networks. They showed that GANs generalize under the neural net distance with sample complexity $\tilde{O}(p/\epsilon^2)$, where $p$ is the number of parameters in the discriminator. These results have two limitations: 1) the sample complexity depends on some unknown parameters of the discriminator network class (e.g., the Lipschitz constant of discriminators with respect to parameters); 2) a small neural net distance does not necessarily imply that two distributions are close (see Corollary 3.2 in Arora et al. (2017)). In contrast, our results are explicit in the network architectures, and show the statistical convergence of GANs under the Wasserstein distance (the case $\beta = 1$).
Some follow-up works attempted to address the first limitation in Arora et al. (2017). Specifically, Thanh-Tung et al. (2019) explicitly quantified the Lipschitz constant and the covering number of the discriminator network. They improved the generalization bound in Arora et al. (2017) based on the framework of Bartlett et al. (2017). However, the bound has an exponential dependence on the depth of the discriminator. Jiang et al. (2018) further showed a tighter generalization bound under spectral normalization of the discriminator. The bound has a polynomial dependence on the size of the discriminator. These generalization theories are derived under the assumption that the generator can well approximate the data distribution with respect to the neural net distance. Nevertheless, how to choose such a generator remains unknown.
Other works (Bai et al., 2018; Liang, 2018) studied the estimation error of GANs under the Wasserstein distance for a special class of distributions implemented by a generator, while the discriminator is designed to guarantee zero bias (or approximation error). Bai et al. (2018) showed that for certain generator classes, there exist corresponding discriminator classes with a strong distinguishing power against the generator. Particular examples include two-layer ReLU network discriminators (half spaces) for distinguishing Gaussian distributions and mixtures of Gaussians, and discriminators of matching depth for multi-layer invertible generators. In these examples, if the data distribution can be exactly implemented by some generator, then the neural net distance can provably approximate the Wasserstein distance. Consequently, GANs can generalize under the Wasserstein distance. This result is specific to certain data distributions, and the generator network needs to satisfy restrictive assumptions, e.g., all the weight matrices and the activation function must be invertible.
Another work in this direction is Liang (2018), where the estimation error of GANs was studied under Sobolev IPMs. Liang (2018) considered both nonparametric and parametric settings. In the nonparametric setting, no generator or discriminator network architectures are chosen, so that the bias of the distribution estimation remains unknown. As a result, the bound cannot provide an explicit sample complexity for distribution estimation. The parametric setting in Liang (2018) is similar to the one in Bai et al. (2018), where all weight matrices in the generator are full rank, and the activation function is the invertible leaky ReLU function. This ensures that the generator network is invertible, and the log density of the generated distribution can be calculated. The discriminator is then chosen as a feedforward network of matching depth using the dual leaky ReLU activation. The main result in Corollary 1 shows that the squared Wasserstein distance between the GAN estimator and the data distribution converges at a parametric rate in the sample size, up to a factor depending on the width of the generator (discriminator) network. This result requires strong assumptions on the data distribution and the generator, i.e., the generator needs to be invertible and the data distribution needs to be exactly implementable by the generator.
Apart from the aforementioned results, Liang (2017) and Singh et al. (2018) studied nonparametric density estimation under Sobolev IPMs. Later, Uppal et al. (2019) generalized the results to Besov IPMs. The main results are similar to Liang (2018) in the nonparametric setting: the bias of the distribution estimation is assumed to be small, yet no generator or discriminator network architectures are provided to guarantee this. Our main result is also in the nonparametric setting, but the generator and discriminator network architectures are explicitly chosen to learn distributions with Hölder densities.
where $\hat{\mu}$ is any estimator of $\mu$ based on $n$ data points. The minimax rate suggests that the curse of data dimensionality is unavoidable regardless of the approach.
The empirical performance of GANs, however, can mysteriously circumvent such a curse of data dimensionality. This largely owes to the fact that practical data sets often exhibit low-dimensional geometric structures. Many images, for instance, consist of projections of a three-dimensional object followed by some transformations, such as rotation, translation, and skeletonization. This generating mechanism induces a small number of intrinsic parameters (Hinton and Salakhutdinov, 2006; Osher et al., 2017; Chen et al., 2019b). Several existing works show that neural networks are adaptive to low-dimensional data structures in function approximation (Shaham et al., 2018; Chui and Mhaskar, 2016; Chen et al., 2019a) and regression (Chen et al., 2019b). It is worthwhile to investigate the performance of GANs for learning distributions supported on low-dimensional sets.
Convolutional Filters. Convolutional filters (Krizhevsky et al., 2012)
are widely used in GANs for image generation and processing. Empirical results show that convolutional filters can learn hidden representations that align with various patterns in images (Zeiler and Fergus, 2014; Zhou et al., 2018), e.g., background, objects, and colors. An interesting question is whether convolutional filters can capture the aforementioned low-dimensional structures in data.
Smoothness of Data Distributions and Regularized Distribution Estimation. Theorem 2 indicates a convergence rate independent of the smoothness of the data distribution. The reason is that the empirical data distribution cannot inherit the same smoothness as the underlying data distribution. This limitation exists in all previous works (Liang, 2017; Singh et al., 2018; Uppal et al., 2019). It is interesting to investigate whether GANs can achieve a faster convergence rate (e.g., attain the minimax optimal rate).
From a theoretical perspective, Liang (2018) suggested first obtaining a smooth kernel estimator from the empirical data distribution, and then replacing the empirical distribution by this kernel estimator to train GANs. In practice, kernel smoothing is hardly used in GANs. Instead, regularization (e.g., entropy regularization) and normalization (e.g., spectral normalization and batch normalization) are widely applied as implicit approaches to encourage the smoothness of the learned distribution. Several empirical studies of GANs suggest that divergence-based and mutual information-based regularization can stabilize the training and improve the performance of GANs (Che et al., 2016; Cao et al., 2018). We leave the statistical analysis of regularized GANs for future investigation.