On Computation and Generalization of GANs with Spectrum Control

by Haoming Jiang, et al.

Generative Adversarial Networks (GANs), though powerful, are hard to train. Several recent works (Brock et al., 2016; Miyato et al., 2018) suggest that controlling the spectra of weight matrices in the discriminator can significantly improve the training of GANs. Motivated by their discovery, we propose a new framework for training GANs, which allows more flexible spectrum control (e.g., making the weight matrices of the discriminator have slow singular value decays). Specifically, we propose a new reparameterization approach for the weight matrices of the discriminator in GANs, which allows us to directly manipulate the spectra of the weight matrices through various regularizers and constraints, without intensively computing singular value decompositions. Theoretically, we further show that the spectrum control improves the generalization ability of GANs. Our experiments on the CIFAR-10, STL-10, and ImageNet datasets confirm that, compared to other methods, our proposed method is capable of generating images with competitive quality by utilizing spectral normalization and encouraging slow singular value decay.



1 Introduction

Many efforts have recently been devoted to studying Generative Adversarial Networks (GANs; Goodfellow et al., 2014). GANs provide a general unsupervised framework to learn a generative model from unlabeled real data. Successful applications of GANs include many unsupervised learning tasks, such as image generation, dialogue generation, and image inpainting (Abadi and Andersen, 2016; Goodfellow, 2016; Ho and Ermon, 2016; Li et al., 2017; Yu et al., 2018). Different from other unsupervised learning methods, which directly maximize the likelihood of deep generative models (e.g., Variational Auto-encoder, Nonlinear ICA, and Restricted Boltzmann Machine), GANs introduce a competition between two neural networks. Specifically, one neural network serves as the generator that yields artificial samples, and the other serves as the discriminator that distinguishes the artificial samples from the real data.

Mathematically, GANs can be formulated as the following min-max optimization problem:

$$\min_{\theta}\max_{\mathcal{W}} f(\theta,\mathcal{W}) = \frac{1}{n}\sum_{i=1}^{n}\phi\big(A(D_{\mathcal{W}}(x_i))\big) + \mathbb{E}_{x\sim\mathcal{D}_{G_\theta}}\big[\phi\big(1-A(D_{\mathcal{W}}(x))\big)\big], \tag{1.1}$$

where $\{x_i\}_{i=1}^n$ are $n$ real data points, $G_\theta$ denotes the generative deep neural network parameterized by $\theta$, $D_{\mathcal{W}}$ denotes the discriminative neural network parameterized by $\mathcal{W}$, $\mathcal{D}_{G_\theta}$ denotes the distribution generated by $G_\theta$, $\phi(\cdot)$ is a properly chosen monotone function, and $A(\cdot)$ denotes a monotone function related to the function $\phi$. There have been many options for $\phi$ and $A$ in the existing literature. For example, the original GAN proposed in Goodfellow et al. (2014) chooses $\phi(x)=\log(x)$, $A(x)=1/(1+\exp(-x))$; Arjovsky et al. (2017) use $\phi(x)=x$, $A(x)=x$, and (1.1) becomes the Wasserstein GAN. Min-max problem (1.1) has a natural interpretation: the maximization problem aims to find a discriminator $D_{\mathcal{W}}$ that can distinguish between the real data and the artificial samples generated by $G_\theta$, while the minimization problem aims to find a generator $G_\theta$ that can fool the discriminator $D_{\mathcal{W}}$. From the perspective of game theory, the generator and discriminator are essentially two players competing with each other and eventually achieving some equilibrium.

From an optimization perspective, problem (1.1) is a nonconvex-nonconcave min-max problem, that is, $f(\theta,\mathcal{W})$ is nonconvex in $\theta$ given a fixed $\mathcal{W}$ and nonconcave in $\mathcal{W}$ given a fixed $\theta$. Unlike convex-concave min-max problems, which have been well studied in the existing optimization literature, there is very limited understanding of general nonconvex-nonconcave min-max problems. Thus, most existing algorithms for training GANs are heuristics. Although some theoretical guarantees have been established for a few algorithms, they all require very strong assumptions, which are not satisfied in practice (Heusel et al., 2017).

Despite the lack of theoretical justification, significant progress has been made in empirical studies of training GANs. Considerable empirical evidence suggests several approaches for stabilizing the training of the discriminator, which can eventually improve the training of the generator. For example, Goodfellow et al. (2014) adopt a simple algorithmic trick that updates $\mathcal{W}$ for multiple iterations after updating $\theta$ for one iteration, i.e., training the discriminator more frequently than the generator. Besides, Xiang and Li (2017) suggest that the weight normalization approach proposed in Salimans and Kingma (2016) can also stabilize the training of the discriminator. More recently, Miyato et al. (2018) propose a spectral normalization approach to control the spectral norm of the weight matrix in each layer. Specifically, in each forward step, they normalize the weight matrix by an approximation of its spectral norm, which is obtained by the one-step power method. They further show that spectral normalization essentially controls the Lipschitz constant of the discriminator with respect to the input. Compared to other methods for controlling the Lipschitz constant of the discriminator, e.g., gradient penalty (Gulrajani et al., 2017; Gao et al., 2017), the experiments in Miyato et al. (2018) show that the spectral normalization approach achieves better performance with fairly low computational cost. Moreover, Miyato et al. (2018) show that spectral normalization suffers less from mode collapse, that is, the phenomenon where the generator outputs samples only over a fairly small support. Such a phenomenon, though not well understood, suggests that spectral normalization balances discrimination and representation well.
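The one-step power method mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' or Miyato et al.'s implementation; the function name `spectral_norm_power_iteration`, the matrix sizes, and the random initialization are assumptions made for illustration only.

```python
import numpy as np

def spectral_norm_power_iteration(W, u, n_steps=1):
    """Estimate the largest singular value of W by power iteration.

    u is a running estimate of the leading left singular vector
    (length W.shape[0]); the updated u is returned for reuse.
    """
    for _ in range(n_steps):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = float(u @ W @ v)  # u and v have (at most) unit norm
    return sigma, u

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # hypothetical layer weight
u0 = rng.standard_normal(64)

sigma_one_step, _ = spectral_norm_power_iteration(W, u0, n_steps=1)
sigma_many, _ = spectral_norm_power_iteration(W, u0, n_steps=200)
sigma_exact = float(np.linalg.norm(W, 2))  # exact spectral norm via SVD

# Weight as it would be used in the forward pass after normalization.
W_sn = W / sigma_one_step
```

Because `u` and `v` have at most unit norm, the estimate `u @ W @ v` can never exceed the true spectral norm, so a single step gives a lower estimate, and the normalized matrix `W_sn` may have a spectral norm somewhat above 1.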

Besides the aforementioned algorithmic tricks and normalization approaches, regularization can also stabilize the training of the discriminator (Brock et al., 2016; Roth et al., 2017; Nagarajan and Kolter, 2017; Liu et al., 2018). For instance, orthogonal regularization, proposed by Brock et al. (2016), forces the columns of weight matrices in the discriminator to be orthonormal by augmenting the objective function with $\lambda\sum_{i=1}^{L}\|W_i^\top W_i - I\|_F^2$, where $\lambda>0$ is the regularization parameter, $W_i$ denotes the weight matrix of the $i$-th layer in the discriminator, $I$ denotes the identity matrix, and $L$ is the depth of the discriminator. The experimental results in Brock et al. (2016) show that orthogonal regularization improves the performance and generalization ability of GANs. However, the empirical evidence in Miyato et al. (2018) shows that orthogonal regularization is still less competitive than the spectral normalization approach. One possible explanation is that orthogonal regularization, which forces all non-zero singular values to be $1$, is more restrictive than spectral normalization, which only forces the largest singular value of each weight matrix to be $1$.
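As a small numerical illustration of the orthogonal penalty for a single layer, the following NumPy sketch evaluates the squared Frobenius distance between the Gram matrix and the identity. The helper name `orthogonal_penalty` and the matrix sizes are hypothetical; the full regularizer would sum this quantity over layers and scale it by the regularization parameter.

```python
import numpy as np

def orthogonal_penalty(W):
    """Squared Frobenius distance between W^T W and the identity."""
    k = W.shape[1]
    return float(np.linalg.norm(W.T @ W - np.eye(k), "fro") ** 2)

rng = np.random.default_rng(0)
W_random = rng.standard_normal((8, 4))
Q, _ = np.linalg.qr(W_random)   # Q has orthonormal columns

penalty_random = orthogonal_penalty(W_random)
penalty_orthonormal = orthogonal_penalty(Q)

# With orthonormal columns, every singular value equals 1.
singular_values_Q = np.linalg.svd(Q, compute_uv=False)
```

The penalty vanishes exactly when all singular values equal 1, which is why this regularizer is more restrictive than bounding only the largest singular value.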

Motivated by spectral normalization, we propose a novel training framework, which provides more flexible and precise control over the spectra of weight matrices in the discriminator. Specifically, we reparameterize each weight matrix $W$ as $W = UEV^\top$, where $U$ and $V$ are required to have orthonormal columns, $E$ denotes a diagonal matrix with $E=\mathrm{diag}(e_1,\dots,e_r)$, and $e_1,\dots,e_r$ are the singular values of $W$. With such a reparameterization, an $L$-layer discriminator becomes

$$D(x;\mathcal{U},\mathcal{E},\mathcal{V}) = U_L E_L V_L^\top\,\sigma_{L-1}\big(\cdots\sigma_1(U_1 E_1 V_1^\top x)\cdots\big),$$

where $\sigma_i(\cdot)$ is the entry-wise activation operator of the $i$-th layer, $\mathcal{U}=\{U_i\}_{i=1}^L$, $\mathcal{E}=\{E_i\}_{i=1}^L$, and $\mathcal{V}=\{V_i\}_{i=1}^L$ denote the parameters of the discriminator $D$, and $x$ denotes the input vector. This reparameterization allows us to control the spectra of the original weight matrix $W$ by manipulating $E$. For example, we can rescale $E$ by its largest diagonal element, which essentially is the spectral normalization. Besides, we can also manipulate the diagonal entries of $E$ to control the decays in singular values (e.g., fast or slow decays).

Recall that our reparameterization requires $U$ and $V$ to have orthonormal columns. This requirement can be achieved by several methods in the existing literature, such as the Stiefel manifold gradient method. However, Huang et al. (2017) show that the stochastic Stiefel manifold gradient method is unstable. Moreover, other methods, such as the Cayley transformation and the Householder transformation, suffer from several disadvantages: (I) high computational cost (without a sparse matrix implementation, these methods are highly unscalable and inefficient, and such an implementation is not supported on GPU by existing deep learning libraries such as TensorFlow and PyTorch); (II) sophisticated implementation (Shepard et al., 2015). Different from the methods mentioned above, our framework applies the orthogonal regularization to all $U_i$'s and $V_i$'s. Such a regularization suffices to guarantee the approximate orthogonality of $U_i$'s and $V_i$'s in practice, which is supported by our experiments.

Moreover, our experimental results on the CIFAR-10, STL-10, and ImageNet datasets show that our proposed method achieves competitive performance on CIFAR-10 and better performance than spectral normalization and other competing approaches on STL-10 and ImageNet. Besides the empirical studies, we provide a theoretical analysis, which characterizes how the spectrum control benefits the generalization ability of GANs. Specifically, denote $\mu$ as the underlying data distribution and $\nu_n$ as the distribution given by the well trained generator. We establish a generalization bound under spectrum control as follows (informal):

where , is the -distance, and denotes the class of distributions generated by generators. Compared to the results in Zhang et al. (2017), our result improves the generalization bound up to an exponential factor of the depth of the discriminator. More details will be discussed in Section 3.

The rest of the paper is organized as follows: Section 2 introduces our proposed training framework in detail; Section 3 presents the generalization bound for GANs under spectrum control; Section 4 presents numerical experiments on CIFAR-10, STL-10, and ImageNet datasets.

Notations: Given an integer $n$, we denote $[n]=\{1,\dots,n\}$. Given a vector $v$, we denote by $\|v\|_2$ its Euclidean norm. Given a matrix $W$, we denote by $\|W\|_2$ its spectral norm, i.e., the largest singular value of $W$. We adopt the standard $O(\cdot)$ notation, which is defined as $f(x)=O(g(x))$ as $x\to\infty$ if and only if there exist $M>0$ and $x_0$ such that $|f(x)|\le M g(x)$ for $x\ge x_0$. We use $\tilde{O}(\cdot)$ to denote $O(\cdot)$ with hidden logarithmic factors.

2 Methodology

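We present a new framework for flexibly controlling the spectra of weight matrices. We first consider an $L$-layer discriminator as follows:

$$D_{\mathcal{W}}(x) = W_L\,\sigma_{L-1}\big(W_{L-1}\cdots\sigma_1(W_1 x)\cdots\big), \tag{2.1}$$

where $\sigma_i(\cdot)$ denotes the entry-wise activation operator of the $i$-th layer, $W_i$ denotes the weight matrix of the $i$-th layer, $x$ denotes the input feature, and $\mathcal{W}=\{W_i\}_{i=1}^L$ denotes the parameters of the discriminator $D_{\mathcal{W}}$.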

2.1 SVD reparameterization

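Our framework directly applies an SVD reparameterization to each weight matrix in the discriminator $D$, i.e., $W_i = U_i E_i V_i^\top$, where $U_i$ and $V_i$ denote two matrices with orthonormal columns, $E_i=\mathrm{diag}(e^i_1,\dots,e^i_{r_i})$ denotes a diagonal matrix, and $e^i_1,\dots,e^i_{r_i}$ are the singular values of $W_i$. The discriminator can be rewritten as follows:

$$D(x;\mathcal{U},\mathcal{E},\mathcal{V}) = U_L E_L V_L^\top\,\sigma_{L-1}\big(\cdots\sigma_1(U_1 E_1 V_1^\top x)\cdots\big),$$

where $\mathcal{U}=\{U_i\}_{i=1}^L$, $\mathcal{E}=\{E_i\}_{i=1}^L$, and $\mathcal{V}=\{V_i\}_{i=1}^L$ denote the parameters of the discriminator $D$. (Footnote: $W_L$ essentially is a vector. To be consistent, we still use $U_L E_L V_L^\top$ to reparameterize $W_L$, although this is not necessary; we can directly control the norm of $W_L$ in practice.) Throughout the rest of the paper, if not clearly specified, we denote $D(x;\mathcal{U},\mathcal{E},\mathcal{V})$ by $D$ for notational simplicity. The motivation behind this reparameterization is to control the singular values of each weight matrix by explicitly manipulating $E_i$. We then consider a new min-max problem as follows:

$$\min_{\theta}\max_{\mathcal{U},\mathcal{E},\mathcal{V}}\ f(\theta,\mathcal{U},\mathcal{E},\mathcal{V}) - \gamma\sum_{i=1}^{L}\mathcal{R}(E_i) \quad \text{subject to}\quad U_i^\top U_i = I_{r_i},\ V_i^\top V_i = I_{r_i},\ E_i\in\Omega,\ i=1,\dots,L,$$

where $I_r$ denotes the identity matrix of size $r$, $\mathcal{R}(\cdot)$ is the regularizer with a regularization parameter $\gamma$, and $\Omega$ denotes a feasible set. By choosing different $\mathcal{R}$ and $\Omega$, this problem can control the spectrum of each weight matrix flexibly. For example, if the feasible set $\Omega$ contains only the identity matrix and no regularizer is used, then our method essentially is the orthogonal regularization. We will discuss some options of $\mathcal{R}$ and $\Omega$ later in detail.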

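As mentioned earlier, enforcing the orthogonal constraints above suffers from high computational cost and sophisticated implementation. To address these drawbacks, we directly apply the orthogonal regularization to all $U_i$'s and $V_i$'s. The problem then becomes

$$\min_{\theta}\max_{\mathcal{U},\mathcal{E},\mathcal{V}}\ f(\theta,\mathcal{U},\mathcal{E},\mathcal{V}) - \gamma\sum_{i=1}^{L}\mathcal{R}(E_i) - \lambda\sum_{i=1}^{L}\Big(\|U_i^\top U_i - I_{r_i}\|_F^2 + \|V_i^\top V_i - I_{r_i}\|_F^2\Big) \quad \text{subject to}\quad E_i\in\Omega,\ i=1,\dots,L, \tag{2.4}$$

where $\lambda$ is a regularization parameter. A relatively large $\lambda$ ensures the approximate orthogonality of $U_i$ and $V_i$; see more details in Section 4.1. Moreover, (2.4) can be efficiently solved by stochastic gradient algorithms. A projection may be needed to handle the constraint $E_i\in\Omega$; see more details later.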
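To make the reparameterization concrete, here is a minimal NumPy sketch, under the assumption that the two outer factors have exactly orthonormal columns (obtained here by QR rather than by training). The variable names and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 4

# Stand-ins for trained factors: orthonormal columns via QR.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
e = np.array([2.0, 0.8, 0.5, 0.3])   # diagonal of E, freely chosen

W = U @ np.diag(e) @ V.T             # reparameterized weight matrix

# The nonzero singular values of W are exactly the entries of e.
top_singular_values = np.linalg.svd(W, compute_uv=False)[:r]

# Spectral normalization becomes a rescaling of the diagonal only.
e_sn = e / np.max(np.abs(e))
W_sn = U @ np.diag(e_sn) @ V.T
```

No singular value decomposition is needed to control the spectrum: any manipulation of the diagonal entries carries over directly to the singular values of the weight matrix, as long as the outer factors stay close to orthonormal.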

2.2 Spectrum Control

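We provide a few options of $\mathcal{R}$ and $\Omega$ for controlling the spectra of weight matrices in the discriminator, motivated by Miyato et al. (2018), who show that for an $L$-layer discriminator $D$, we have:

$$\|D\|_{\mathrm{Lip}} \le \prod_{i=1}^{L-1}\|\sigma_i\|_{\mathrm{Lip}}\prod_{i=1}^{L}\|W_i\|_2 = \prod_{i=1}^{L-1}\|\sigma_i\|_{\mathrm{Lip}}\prod_{i=1}^{L}\|E_i\|_2,$$

where $\|\sigma_i\|_{\mathrm{Lip}}$ is the Lipschitz constant of $\sigma_i$. The last equality holds for our proposed reparameterization. For commonly used activation operators, such as the sigmoid, ReLU, and leaky ReLU functions, $\|\sigma_i\|_{\mathrm{Lip}}\le 1$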
. Therefore, is essentially an upper bound for the Lipschitz constant, which can be controlled by our proposed and . Note that is a vector with only one singular value. For simplicity, we set in the following analysis.
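The bound on the Lipschitz constant can be checked numerically. The sketch below builds a toy three-layer ReLU network (sizes and names are assumptions, not the architecture used in the paper) and compares empirical difference quotients against the product of layer-wise spectral norms.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 10)),
           rng.standard_normal((16, 16)),
           rng.standard_normal((1, 16))]

def discriminator(x):
    """Toy 3-layer network with ReLU (1-Lipschitz) hidden activations."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

# Product of layer-wise spectral norms: an upper bound on the Lipschitz constant.
lipschitz_bound = float(np.prod([np.linalg.norm(W, 2) for W in weights]))

# Empirical difference quotients on random input pairs.
ratios = []
for _ in range(1000):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    ratios.append(np.linalg.norm(discriminator(x) - discriminator(y))
                  / np.linalg.norm(x - y))
max_ratio = float(max(ratios))
```

The observed ratios are typically far below the bound, which is expected: the product of spectral norms is a worst-case quantity.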

2.2.1 Flexible Spectral Control

Compared to the orthogonal regularization, Miyato et al. (2018) suggest that we should allow more flexibility by using spectral normalization, which only bounds the largest singular value. They implement spectral normalization by one-step power iteration.

Spectrum Normalization: We can also easily implement spectral normalization under our SVD reparameterization framework. Specifically, the spectral normalization rescales each weight matrix $W_i$ by its spectral norm $\|W_i\|_2$, which is equivalent to solving the following problem:

where .

Spectrum Constraint: Note that the spectral normalization essentially reparameterizes the Lipschitz constraint :


This essentially controls by forcing each . Instead of spectral normalization, we consider directly solving the problem with the Lipschitz constraint. To maintain the feasibility of , we only need a simple projection for each in the back propagation, which can be implemented by a simple entry-wise clipping operator defined as


where if , and otherwise.

These two methods essentially solve the same problem, but in different formulations, and therefore call for different algorithms. Due to the nonconvex-nonconcave structure of (2.4), the two algorithms generally obtain different solutions.
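The projection used for the spectrum constraint can be sketched as an entry-wise clipping of the diagonal entries. The interval $[0, 1]$ used below is an assumption made for this illustration (an upper limit of 1 bounds the spectral norm by 1); the helper name `clip_spectrum` is hypothetical.

```python
import numpy as np

def clip_spectrum(e, lower=0.0, upper=1.0):
    """Entry-wise projection of the diagonal of E onto [lower, upper].

    The interval [0, 1] is an assumption made for this sketch.
    """
    return np.clip(e, lower, upper)

# Diagonal entries after a hypothetical gradient update.
e = np.array([1.7, 1.0, 0.6, -0.2])
e_projected = clip_spectrum(e)
```

Applying such a clipping after every gradient update keeps the diagonal entries feasible at negligible cost, in contrast to spectral normalization, which rescales all entries jointly.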

Lipschitz Regularizer: We can also directly penalize to control the Lipschitz constant of the discriminator . Specifically, we define the Lipschitz regularizer as:

Compared to the spectral constraint, which enforces all , the Lipschitz regularizer is more flexible since it allows for some .

2.2.2 Slow singular value decay

Miyato et al. (2018) attribute the empirical success of training SN-GAN to controlling the spectral norm while allowing flexibility. This explanation, however, is not very concrete. Orthogonal regularization and spectral normalization with SVD can both control the spectral norm, yet their empirical performance is worse than that of SN-GAN. For example, on the STL-10 dataset, SN-GAN achieves an inception score of 8.83, while singular value truncation only achieves 8.69 and orthogonal regularization achieves 8.77.

The reason is that SN-GAN implements the spectral normalization via one-step power iteration. This procedure consistently underestimates the spectral norms of weight matrices. Consequently, in addition to controlling the spectral norms, the spectral normalization in SN-GAN affects the whole spectrum of the weight matrix (it encourages slow singular value decay, as in Figure 1), which we refer to as "flexibility". Encouraging slow decay essentially encourages the network to capture as many features as possible while allowing correlation between neurons. Building upon these empirical observations, we conjecture that controlling the whole spectrum further improves the performance of GANs, which is corroborated by our numerical experiments (Section 4).


Figure 1: An illustration of smooth singular value decays with different methods. Panels: Orthogonal Reg. (no decay), SN w/ Power Iteration (slow decay), SN w/ SVD (fast decay), D-optimal Reg. (slower decay). The vertical axis denotes the value and the horizontal axis denotes the normalized rank. The inception scores on STL-10 are also reported.
Figure 2: The plot of normalized ranks versus values of K order statistics sampled from reference distributions. The vertical axis denotes the value; the horizontal axis denotes the normalized rank.

D-Optimal Regularizer: We propose the $D$-optimal regularizer as follows:


which is motivated by $D$-optimal design. $D$-optimal design (Wu and Hamada, 2011) is a popular principle in experimental design, where people aim to estimate parameters of statistical models with a minimum number of experiments. Specifically, $D$-optimal design maximizes the determinant of the Fisher information matrix while allowing correlation between features in experiments. Existing literature has shown the superiority of $D$-optimal design over the orthogonal (uncorrelated) design for nonlinear model estimation (Yang et al., 2013; Li and Majumdar, 2009; Mentre et al., 1997). Analogously, our proposed $D$-optimal regularizer essentially maximizes the log Gram determinant of the weight matrix,

$$\log\left|\det(W^\top W)\right| \approx \log\left|\det(E^2)\right| = \sum_{k}\log e_k^2.$$

The approximation holds due to the SVD reparameterization $W = UEV^\top$, with $U$ and $V$ approximately orthogonal. Moreover, note that the derivative of $\log(e)$ is $1/e$, a monotonically decreasing function of $e$. The regularizer therefore has a significant impact when $e_k$ is small. Thus, the $D$-optimal regularizer encourages a slow singular value decay.
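The relation between the Gram determinant and the diagonal entries can be verified with a short NumPy sketch. The penalty below is the negative log-determinant of $E^2$; its exact scaling and the name `d_optimal_penalty` are assumptions for illustration, and the square, exactly orthogonal factors are a simplification.

```python
import numpy as np

def d_optimal_penalty(e):
    """Negative log-determinant of E^2 (the scaling is an assumption)."""
    return float(-np.sum(np.log(e ** 2)))

rng = np.random.default_rng(0)
r = 4
U, _ = np.linalg.qr(rng.standard_normal((r, r)))
V, _ = np.linalg.qr(rng.standard_normal((r, r)))

slow = np.array([1.0, 0.9, 0.8, 0.7])    # slow singular value decay
fast = np.array([1.0, 0.3, 0.1, 0.01])   # fast singular value decay

W_slow = U @ np.diag(slow) @ V.T
sign, logdet_gram = np.linalg.slogdet(W_slow.T @ W_slow)

penalty_slow = d_optimal_penalty(slow)
penalty_fast = d_optimal_penalty(fast)
```

With exactly orthogonal factors the log Gram determinant equals the negated penalty, and a spectrum with fast decay is penalized far more heavily than one with slow decay, since small singular values dominate the sum of logarithms.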

Divergence Regularizer: We propose a divergence regularization to precisely control the slow decay as shown in Figure 1. To mimic such a decay, we consider a reference distribution, , where . Figure 2 shows the decays of K order statistics sampled from . We then denote the density function of as

and the probability mass function of a uniform discrete distribution over

as . Note that the K-L divergence between a discrete distribution and a continuous distribution is . To address this issue, we discretize . Specifically, given , we construct a discrete distribution over with a probability mass function , defined as follows:

Ignoring the normalization term in the denominator, we then define the regularizer as follows:

Note that the divergence regularizer requires the singular values to lie in a fixed interval, and the $D$-optimal regularizer cannot control the Lipschitz constant of the discriminator $D$. Therefore, we combine the divergence regularizer with the spectrum constraint, and the $D$-optimal regularizer with the spectral normalization, to bound the Lipschitz constant. Our experimental results show that both combinations improve the training of GANs on the CIFAR-10 and STL-10 datasets.

3 Theory

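We show how the spectrum control benefits the generalization of GANs. Before proceeding, we define the $\mathcal{F}$-distance as follows.

Definition ($\mathcal{F}$-distance). Let $\mathcal{F}$ be a class of functions from $\mathbb{R}^d$ to $[0,1]$ such that if $f\in\mathcal{F}$, then $1-f\in\mathcal{F}$. Let $\phi$ be a concave function. Then given two distributions $\mu$ and $\nu$ supported on $\mathbb{R}^d$, the $\mathcal{F}$-distance with respect to $\phi$ is defined as

$$d_{\mathcal{F},\phi}(\mu,\nu) = \sup_{f\in\mathcal{F}}\ \mathbb{E}_{x\sim\mu}\big[\phi(f(x))\big] + \mathbb{E}_{x\sim\nu}\big[\phi(1-f(x))\big] - 2\phi(1/2).$$
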
Note that -distance unifies Jensen-Shannon distance, Wasserstein distance and neural distance as proposed in Arora et al. (2017). For example, when taking and all 1-Lipschitz functions from , the -distance is the Wasserstein distance. Recall that by (1.1), the training of GANs is essentially minimizing the -distance with being the collection of composite functions , where is the -layer discriminator network defined by (2.1). To establish the generalization bound, we impose the following assumption. The activation operator is -Lipschitz with for any . is -Lipschitz such that if , . is -Lipschitz. The spectral norms of weight matrices are bounded respectively, i.e., for any . Note that commonly used functions

, such as the sigmoid function, satisfy the assumption. We denote by

the underlying data distribution, and by the empirical data distribution. We further denote as the distribution given by the generator that minimizes the loss (1.1) up to accuracy , i.e.,

where is the class of distributions generated by generators. Then we give the generalization bound based on the PAC-learning framework as follows. Under Assumption 3, assume that the input data is bounded, i.e., for . Then given activation operators , , and , with probability at least

over the joint distribution of

, we have

where and . The detailed proof is provided in Appendix A.1. By constraining each , the generalization bound is reduced to of the order , which is polynomial in and . On the contrary, without such spectrum constraints, the bound can be exponentially dependent on . For example, if with some constant for any , we have , which implies that GANs cannot generalize with polynomial number of samples.

Empirical Rademacher complexity (ERC) is adopted to derive our generalization bound, which is of the order . Directly applying the ERC based generalization bound in Bartlett et al. (2017) yields a bound of the order . Our bound is tighter, and is derived by exploiting the Lipschitz continuity of the discriminator with respect to its model parameters (weight matrices). Similar idea is used in Zhang et al. (2017), however, we derive sharper Lipschitz constants 333The Lipschitz constant in Zhang et al. (2017) can be of the order . by the key step of decoupling the spectral norms of weight matrices and the number of parameters, i.e., separating and .

Theorem 3 shows the advantage of spectrum control in generalization by constraining the class of discriminators. However, as suggested in Arora et al. (2017), the class of discriminators needs to be large enough to detect a lack of diversity. Despite the lack of theoretical justification, empirical results in Miyato et al. (2018) show that discriminators with spectral normalization are powerful in distinguishing the generated distribution from the data distribution, and suffer less from mode collapse. We conjecture that the observed singular value decay (as illustrated in Figure 1) contributes to preventing mode collapse. We leave this for future theoretical investigation.

4 Experiment

To demonstrate our proposed new methods, we conduct experiments on CIFAR-10 (Krizhevsky and Hinton, 2009), STL-10 (Coates et al., 2011), and ImageNet (Russakovsky et al., 2015). We illustrate the importance of spectrum control in GANs training by revealing a close relation between the performance and the singular value decays.

All implementations are done in Chainer, following the official implementation of SN-GAN (Miyato et al., 2018). Note that SN-GAN uses power iteration. If not specified, all other Spectral Normalization (SN) methods are implemented under the SVD framework. For quantitative assessment of generated examples, we use the inception score (Salimans et al., 2016) and the Fréchet inception distance (FID; Heusel et al., 2017). All reported results correspond to 10 runs of GAN training with different random initializations. The discussion in this paper is based on fully connected layers. When dealing with a convolutional layer, we only need to reshape the 4D weight tensor into a 2D matrix. Denote the weight tensor of a convolutional layer as $W\in\mathbb{R}^{c_{out}\times c_{in}\times k_w\times k_h}$, where $c_{out}$ denotes the number of output channels, $c_{in}$ the number of input channels, and $k_w$, $k_h$ the kernel size. We reshape $W$ as $W'\in\mathbb{R}^{c_{out}\times(c_{in}k_wk_h)}$ (Huang et al., 2017), i.e., merging the last three dimensions while preserving the first dimension. See more implementation details in Appendix C.1.
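The reshaping of a convolutional weight tensor can be sketched in a few lines of NumPy. The channel counts and kernel size are arbitrary illustrative values, and the axis order (output channels first) is an assumption matching the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
c_out, c_in, k_h, k_w = 8, 3, 4, 4
W4d = rng.standard_normal((c_out, c_in, k_h, k_w))

# Merge the last three dimensions while preserving the first one.
W2d = W4d.reshape(c_out, -1)           # shape (c_out, c_in * k_h * k_w)

# The spectrum of the layer is then read off the 2D matrix.
singular_values = np.linalg.svd(W2d, compute_uv=False)
```

The reshape is lossless, so any spectrum control applied to the 2D matrix can be mapped back to the original 4D tensor by the inverse reshape.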

4.1 DC-GAN

We test our methods on DC-GANs with two datasets, CIFAR-10 and STL-10. Specifically, we adopt a -layer CNN as the generator and a -layer CNN as the discriminator. Recall that our proposed training framework tries to solve the equilibrium for equation (2.4). We set and being the sigmoid function. Denote for a fixed and for fixed , and , where . We maximize for iterations () followed by minimizing for one iteration. Note that we use a trick (Goodfellow et al., 2014) to ease the computation of minimizing . Detailed implementations are provided in Appendices B and C.2. We choose tuning parameters444

In fact, the performance is not sensitive to these hyperparameters, since we only observe negligible difference by fine tuning these parameters. Specifically, when

and ( for Divergence regularizer), the algorithm yields similar results. and in all the experiments except for the Divergence regularizer, where we pick and . is chosen according to the output range of different regularizers. We set a smaller gamma for Divergence Regularizer, since its output is much larger than other regularizers. We take K iterations in all the experiments on CIFAR- and K iterations on STL- as suggested in Miyato et al. (2018).

To solve (2.4), we adopt the setting in Radford et al. (2015), which has been shown to be robust for different GANs by Miyato et al. (2018). Specifically, we use the Adam optimizer (Kingma and Ba, 2014) with the following hyperparameters: (1) ; (2) , the initial learning rate; (3) , the first and second order momentum parameters of Adam respectively.

Before we present our results, we show the effectiveness of our proposed reparameterization, which aims to approximate the singular values of weight matrices while avoiding direct SVDs. As can be seen in Table 1, $U_i$ and $V_i$ have nearly orthonormal columns: every entry reported in Table 1 is below 1e-4. Although the reparameterization introduces more model parameters, it maintains comparable computational efficiency. See more details in Appendix D.2.

Layer 0 Layer 1 Layer 2 Layer 3 Layer 4 Layer 5 Layer 6
2.3e-5 1.2e-5 1.5e-5 1.6e-5 2.7e-5 2.5e-5 2.1e-5
7.9e-5 1.0e-5 1.7e-5 2.5e-5 4.1e-5 7.1e-5 3.9e-5
Table 1: The sub-orthogonality of $U_i$'s and $V_i$'s in the discriminator with the divergence regularizer on CIFAR-10 after 100K iterations. For other settings, we also observe that all $U_i$'s and $V_i$'s have nearly orthonormal columns.
Figure 3: Inception scores on ImageNet. We can see that our method outperforms SN-GAN.

Figure 4 shows the singular value decays of weight matrices under two different methods: SN-GAN and the D-optimal regularizer with spectral normalization. As can be seen, our method achieves a slower decay in singular values than SN-GAN. See more results of other methods in Appendix D.3. Such a slower decay improves the performance of GANs. Specifically, Table 2 presents the inception scores and FIDs of our proposed methods as well as other methods on CIFAR-10 and STL-10. Under the CNN architecture, our methods achieve significant improvements on STL-10. Compared with STL-10, CIFAR-10 is easy to learn, and thus GAN training benefits only to a limited extent from encouraging slow singular value decay. As a result, on CIFAR-10, our methods only slightly improve on the result of SN-GAN. Moreover, as shown in Figure 4, at the early stage (5K iterations), our method already achieves slow decay while SN-GAN still decays fast. Thus, it converges faster than SN-GAN, as shown in Figure 5.

Figure 4: Illustrations of singular value decay in layers at four training iterations. The top row is for SN-GAN; the bottom row is for the D-optimal regularizer with SN.
Figure 5: The inception scores and FIDs with error bars over 10 runs. Panels: CIFAR: FID; CIFAR: Inception Score; STL: FID; STL: Inception Score. Due to the space limit, we only present the comparison between SN-GAN and the D-optimal regularizer with SN, which is the best among our proposed methods. The full comparison with all proposed methods is in Appendix D.4.

4.2 ResNet-GAN

We also test our proposed method with ResNet, a more advanced architecture, for both the discriminator and the generator (Appendix C.2). For these experiments, we adopt the hinge loss for adversarial training of the discriminator:

We also adopt the commonly used hyperparameter settings for the Adam optimizer on ResNet: and (Gulrajani et al., 2017). Due to our limited computational resources, we only test spectral normalization (our version) with the D-optimal regularizer, which achieves the best performance in the CNN experiments. We also test on the official subsampled ImageNet data using the conditional GAN with a projection discriminator (Miyato and Koyama, 2018).
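For readers unfamiliar with the hinge loss used in the ResNet experiments, the following NumPy sketch shows the form commonly used with SN-GAN-style training. It is an assumption that the paper uses exactly this form; the function names and the sample discriminator outputs are hypothetical.

```python
import numpy as np

def discriminator_hinge_loss(d_real, d_fake):
    """Hinge loss minimized by the discriminator (assumed standard form)."""
    return float(np.mean(np.maximum(0.0, 1.0 - d_real))
                 + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def generator_hinge_loss(d_fake):
    """Generator loss usually paired with the hinge discriminator loss."""
    return float(-np.mean(d_fake))

# Hypothetical raw discriminator outputs on real and generated batches.
d_real = np.array([2.0, 0.5, -1.0])
d_fake = np.array([-2.0, 0.0, 1.5])

loss_d = discriminator_hinge_loss(d_real, d_fake)
loss_g = generator_hinge_loss(d_fake)
```

Real samples scored above 1 and generated samples scored below -1 contribute nothing to the discriminator loss, so only samples inside the margin drive the update.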

The results of our experiments on CIFAR-10 and STL-10 are listed in Table 2, and the results on ImageNet are shown in Figure 3. Our method is much better than the other methods on STL-10 and ImageNet. On CIFAR-10, our method is better than the orthogonal regularizer but slightly worse than SN-GAN. We believe the reason is that CIFAR-10 is relatively easy. As can be seen, for CIFAR-10, the inception scores of all methods are around , while the inception score of real data is around . In contrast, for STL-10, the inception score of real data is around , while the inception scores of all methods are less than . As a result, when the dataset is complicated and the network needs high capacity, our method performs better than SN-GAN.

Method Inception Score FID
Real Data
CNN Baseline
Orthogonal Reg.
SN-GAN (Power Iter.)
Ours CNN (Under SVD)
Spectral Norm.
Spectral Constraint
Lipschitz Reg.
SC + Divergence Reg.
SN + D-Optimal Reg.
ResNet Structure
Orthogonal Reg.
SN-GAN (Power Iter.)
SN + D-Optimal Reg.
Table 2: The inception scores and FIDs on CIFAR-10 and STL-10. For consistency, we reimplement baselines under our Chainer environment.

5 Conclusion

In this paper, we propose a new SVD-type reparameterization for the weight matrices of the discriminator in GANs, allowing us to efficiently manipulate the spectra of weight matrices. We then establish a new generalization bound for GANs to justify the importance of spectrum control on weight matrices. Moreover, we propose new regularizers to encourage slow singular value decay. Our experiments on the CIFAR-10, STL-10, and ImageNet datasets support our proposed methods, theory, and discoveries.


Appendix A Proof in Section 3

A.1 Proof of Theorem 3


We bound the output of as follows,

Consider . We have


Note that given and , we have

Then McDiarmid’s inequality gives us, with probability at least ,


By the argument of symmetrization, we have



’s are i.i.d. Rademacher random variables, i.e.,

McDiarmid’s inequality again gives us, with probability at least , we have


Note that is essentially the empirical Rademacher complexity of . Since and are both Lipschitz, by Talagrand’s lemma, we have

We then use the standard Dudley’s entropy integral to bound . We exploit the parametric form of discriminators to find a tight covering number. We have to investigate the Lipschitz continuity of with respect to the weight matrices . We base our argument on telescoping. Given two sets of weight matrices and , and fixing the activation operators and , we have