1 Introduction
Many efforts have recently been devoted to studying Generative Adversarial Networks (GANs, Goodfellow et al. (2014)). GANs provide a general unsupervised framework for learning a generative model from unlabeled real data. Successful applications of GANs span many unsupervised learning tasks, such as image generation, dialogue generation, and image inpainting (Abadi and Andersen, 2016; Goodfellow, 2016; Ho and Ermon, 2016; Li et al., 2017; Yu et al., 2018). Different from other unsupervised learning methods, which directly maximize the likelihood of deep generative models (e.g., Variational Autoencoders, Nonlinear ICA, and Restricted Boltzmann Machines), GANs introduce a competition between two neural networks. Specifically, one network serves as the generator, which yields artificial samples, and the other serves as the discriminator, which distinguishes the artificial samples from the real data.
Mathematically, GANs can be formulated as the following minmax optimization problem:

(1.1) $\quad \min_{\theta} \max_{\omega}\; \frac{1}{n}\sum_{i=1}^{n} f\big(D_\omega(x_i)\big) + \mathbb{E}_{x \sim P_\theta}\big[\bar{f}\big(D_\omega(x)\big)\big],$

where $x_1, \dots, x_n$ are real data points, $G_\theta$ denotes the generative deep neural network parameterized by $\theta$, $D_\omega$ denotes the discriminative neural network parameterized by $\omega$, $P_\theta$ denotes the distribution generated by $G_\theta$, $f$ is a properly chosen monotone function, and $\bar{f}$ denotes a monotone function related to $f$. There have been many options for $f$ and $\bar{f}$ in the existing literature. For example, the original GAN proposed in Goodfellow et al. (2014) chooses $f(a) = \log a$ and $\bar{f}(a) = \log(1-a)$; Arjovsky et al. (2017) use $f(a) = a$ and $\bar{f}(a) = -a$, and (1.1) becomes the Wasserstein GAN. The minmax problem (1.1) has a natural interpretation: the inner maximization aims to find a discriminator $D_\omega$ that can distinguish between the real data and the artificial samples generated by $G_\theta$, while the outer minimization aims to find a generator $G_\theta$ that can fool the discriminator $D_\omega$. From the perspective of game theory, the generator and discriminator are essentially two players competing with each other and eventually achieving some equilibrium.
From an optimization perspective, problem (1.1) is a nonconvex-nonconcave minmax problem; that is, the objective is nonconvex in $\theta$ given a fixed $\omega$, and nonconcave in $\omega$ given a fixed $\theta$. Unlike convex-concave minmax problems, which have been well studied in the existing optimization literature, general nonconvex-nonconcave minmax problems are only poorly understood. Thus, most existing algorithms for training GANs are heuristics. Although some theoretical guarantees have been established for a few algorithms, they all require very strong assumptions, which are not satisfied in practice
(Heusel et al., 2017).

Despite the lack of theoretical justification, significant progress has been made in empirical studies of training GANs. Substantial empirical evidence suggests several approaches for stabilizing the training of the discriminator, which can eventually improve the training of the generator. For example, Goodfellow et al. (2014) adopt a simple algorithmic trick that updates the discriminator for multiple iterations after updating the generator for one iteration, i.e., training the discriminator more frequently than the generator. Besides, Xiang and Li (2017) suggest that the weight normalization approach proposed in Salimans and Kingma (2016) can also stabilize the training of the discriminator. More recently, Miyato et al. (2018) propose a spectral normalization approach to control the spectral norm of the weight matrix in each layer. Specifically, in each forward step, they normalize the weight matrix by an approximation of its spectral norm, obtained by a one-step power method. They further show that spectral normalization essentially controls the Lipschitz constant of the discriminator with respect to the input. Compared to other methods for controlling the Lipschitz constant of the discriminator, e.g., gradient penalty (Gulrajani et al., 2017; Gao et al., 2017), the experiments in Miyato et al. (2018) show that the spectral normalization approach achieves better performance with fairly low computational cost. Moreover, Miyato et al. (2018) show that spectral normalization suffers less from mode collapse, that is, the phenomenon where the generator outputs samples only over a fairly small support. This observation, though not well understood, suggests that spectral normalization balances discrimination and representation well.
Besides the aforementioned algorithmic tricks and normalization approaches, regularization can also stabilize the training of the discriminator (Brock et al., 2016; Roth et al., 2017; Nagarajan and Kolter, 2017; Liu et al., 2018). For instance, orthogonal regularization, proposed by Brock et al. (2016), forces the columns of the weight matrices in the discriminator to be orthonormal by augmenting the objective function with $\lambda \sum_{l=1}^{L} \|W_l^\top W_l - I\|_{\mathrm{F}}^2$, where $\lambda$ is the regularization parameter, $W_l$ denotes the weight matrix of the $l$-th layer in the discriminator, $I$ denotes the identity matrix, and $L$ is the depth of the discriminator. The experimental results in Brock et al. (2016) show that orthogonal regularization improves the performance and generalization ability of GANs. However, the empirical evidence in Miyato et al. (2018) shows that orthogonal regularization is still less competitive than the spectral normalization approach. One possible explanation is that orthogonal regularization, forcing all nonzero singular values to be $1$, is more restrictive than spectral normalization, which only forces the largest singular value of each weight matrix to be $1$.

Motivated by spectral normalization, we propose a novel training framework, which provides more flexible and precise control over the spectra of the weight matrices in the discriminator. Specifically, we reparameterize each weight matrix as $W_l = U_l D_l V_l^\top$, where $U_l$ and $V_l$ are required to have orthonormal columns, and $D_l$ denotes a diagonal matrix whose diagonal entries $d_{l,1}, \dots, d_{l,k_l}$ are the singular values of $W_l$. With such a reparameterization, an $L$-layer discriminator becomes
$$ D_\omega(x) = \sigma_L\big(U_L D_L V_L^\top\, \sigma_{L-1}\big(\cdots \sigma_1\big(U_1 D_1 V_1^\top x\big)\big)\big), $$
where $\sigma_l$ is the entrywise activation operator of the $l$-th layer, $\omega = \{U_l, D_l, V_l\}_{l=1}^{L}$ denotes the parameters of the discriminator $D_\omega$, and $x$ denotes the input vector. This reparameterization allows us to control the spectrum of the original weight matrix $W_l$
by manipulating $D_l$. For example, we can rescale $D_l$ by its largest diagonal element, which essentially is spectral normalization. Besides, we can also manipulate the diagonal entries of $D_l$ to control the decay of the singular values (e.g., fast or slow decay). Recall that our reparameterization requires $U_l$ and $V_l$ to have orthonormal columns. This requirement can be achieved by several methods in the existing literature, such as the Stiefel manifold gradient method. However, Huang et al. (2017) show that the stochastic Stiefel manifold gradient method is unstable. Moreover, other methods, such as the Cayley transformation and the Householder transformation, suffer from several disadvantages: (I) high computational cost^1; (II) sophisticated implementation (Shepard et al., 2015). Different from the methods mentioned above, our framework applies orthogonal regularization to all $U_l$'s and $V_l$'s. Such a regularization suffices to guarantee the approximate orthogonality of the $U_l$'s and $V_l$'s in practice, which is supported by our experiments. Moreover, our experimental results on the CIFAR10, STL10, and ImageNet datasets show that our proposed method achieves competitive performance on CIFAR10 and better performance than spectral normalization and other competing approaches on STL10 and ImageNet. Besides the empirical studies, we provide a theoretical analysis, which characterizes how spectrum control benefits the generalization ability of GANs. Specifically, denote by $\mu$ the underlying data distribution and by $\nu$ the distribution given by the well-trained generator. We establish a generalization bound under spectrum control (informal): the $\mathcal{F}$-distance $d_{\mathcal{F}}(\mu, \nu)$ is bounded by the best achievable distance over the class of distributions generated by the generators, plus the optimization accuracy, plus a statistical error that scales only polynomially in the depth of the discriminator. Compared to the results in Zhang et al. (2017), our result improves the generalization bound by up to a factor exponential in the depth of the discriminator. More details will be discussed in Section 3.

^1 Without a sparse matrix implementation, these methods are highly unscalable and inefficient (not supported in GPU mode by existing deep learning libraries such as TensorFlow and PyTorch).
The rest of the paper is organized as follows: Section 2 introduces our proposed training framework in detail; Section 3 presents the generalization bound for GANs under spectrum control; Section 4 presents numerical experiments on CIFAR10, STL10, and ImageNet datasets.
Notations: Given an integer $n$, we denote $[n] = \{1, \dots, n\}$. Given a vector $v$, we denote by $\|v\|_2$ its Euclidean norm. Given a matrix $A$, we denote by $\|A\|_2$ its spectral norm, i.e., its largest singular value. We adopt the standard big-O notation: $a_n = O(b_n)$ as $n \to \infty$ if and only if there exist constants $C$ and $N$ such that $a_n \le C\, b_n$ for all $n \ge N$. We use $\widetilde{O}(\cdot)$ to denote $O(\cdot)$ with logarithmic factors hidden.
2 Methodology
We present a new framework for flexibly controlling the spectra of weight matrices. We first consider an $L$-layer discriminator of the form

(2.1) $\quad D_\omega(x) = \sigma_L\big(W_L\, \sigma_{L-1}\big(W_{L-1} \cdots \sigma_1\big(W_1 x\big)\big)\big),$

where $\sigma_l$ denotes the entrywise activation operator of the $l$-th layer, $W_l$ denotes the weight matrix of the $l$-th layer, $x$ denotes the input feature, and $\omega = \{W_l\}_{l=1}^{L}$ denotes the parameters of the discriminator $D_\omega$.
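As a concrete reference point, the plain feedforward discriminator of (2.1) can be sketched in a few lines of NumPy (ReLU is used here purely for illustration; the paper also considers sigmoid and leaky ReLU activations):

```python
import numpy as np

def relu(z):
    """Entrywise ReLU activation."""
    return np.maximum(z, 0.0)

def discriminator_forward(x, weights):
    """Plain L-layer discriminator of (2.1): h_l = sigma_l(W_l h_{l-1})."""
    h = x
    for W in weights:
        h = relu(W @ h)
    return h
```

The SVD reparameterization introduced next replaces each `W` in this loop with a factored form whose spectrum can be manipulated directly.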
2.1 SVD reparameterization
Our framework directly applies an SVD reparameterization to each weight matrix in the discriminator $D_\omega$, i.e., $W_l = U_l D_l V_l^\top$, where $U_l$ and $V_l$ denote two matrices with orthonormal columns, and $D_l$ denotes a diagonal matrix whose diagonal entries $d_{l,1}, \dots, d_{l,k_l}$ are the singular values of $W_l$. The discriminator can be rewritten as follows:

(2.2) $\quad D_\omega(x) = \sigma_L\big(U_L D_L V_L^\top\, \sigma_{L-1}\big(\cdots \sigma_1\big(U_1 D_1 V_1^\top x\big)\big)\big),$

where $\omega = \{U_l, D_l, V_l\}_{l=1}^{L}$ denotes the parameters of the discriminator $D_\omega$.^2 Throughout the rest of the paper, if not clearly specified, we write $D_\omega$ for the reparameterized discriminator for notational simplicity. The motivation behind this reparameterization is to control the singular values of each weight matrix by explicitly manipulating $D_l$. We then consider a new minmax problem as follows:

^2 The last-layer weight $W_L$ is essentially a vector. To be consistent, we still use $U_L D_L V_L^\top$ to reparameterize it. Actually, this is not necessary: we can directly control the norm of $W_L$ in practice.
(2.3) $\quad \min_\theta \max_\omega\; \mathcal{L}(\theta, \omega) - \gamma \sum_{l=1}^{L} \mathcal{R}(D_l) \quad \text{subject to} \quad U_l^\top U_l = I_{k_l},\; V_l^\top V_l = I_{k_l},\; D_l \in \mathbb{D},\; l \in [L],$

where $\mathcal{L}(\theta, \omega)$ denotes the minmax objective in (1.1), $I_{k_l}$ denotes the identity matrix of size $k_l$, $\mathcal{R}(\cdot)$ is the regularizer with a regularization parameter $\gamma$, and $\mathbb{D}$ denotes a feasible set for the $D_l$'s. By choosing different $\mathcal{R}$ and $\mathbb{D}$, (2.3) can control the spectrum of each weight matrix flexibly. For example, if we take the feasible set $\mathbb{D} = \{I\}$ and $\mathcal{R} = 0$, then our method essentially reduces to orthogonal regularization. We will discuss some options of $\mathcal{R}$ and $\mathbb{D}$ later in detail.
As mentioned earlier, the orthogonality constraints in (2.3) suffer from high computational cost and sophisticated implementation. To address these drawbacks, we instead apply orthogonal regularization to all $U_l$'s and $V_l$'s. Problem (2.3) then becomes

(2.4) $\quad \min_\theta \max_\omega\; \mathcal{L}(\theta, \omega) - \gamma \sum_{l=1}^{L} \mathcal{R}(D_l) - \lambda \sum_{l=1}^{L} \big( \|U_l^\top U_l - I\|_{\mathrm{F}}^2 + \|V_l^\top V_l - I\|_{\mathrm{F}}^2 \big) \quad \text{subject to} \quad D_l \in \mathbb{D},\; l \in [L],$

where $\lambda$ is a regularization parameter. A relatively large $\lambda$ ensures the approximate orthogonality of the $U_l$'s and $V_l$'s; see more details in Section 4.1. Moreover, (2.4) can be efficiently solved by stochastic gradient algorithms. A projection may be needed to handle the constraint $D_l \in \mathbb{D}$; see more details later.
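To make the reparameterization concrete, the following NumPy sketch (our notation, not the paper's implementation) shows a forward pass through one SVD-reparameterized layer and the per-layer orthogonal penalty appearing in (2.4):

```python
import numpy as np

def svd_layer(x, U, d, V):
    """One reparameterized layer: y = U diag(d) V^T x."""
    return U @ (d * (V.T @ x))

def orthogonal_penalty(U, V):
    """||U^T U - I||_F^2 + ||V^T V - I||_F^2, the per-layer penalty in (2.4)."""
    gap_U = U.T @ U - np.eye(U.shape[1])
    gap_V = V.T @ V - np.eye(V.shape[1])
    return np.sum(gap_U ** 2) + np.sum(gap_V ** 2)
```

When `U` and `V` are exactly orthonormal, the penalty vanishes and the entries of `d` coincide with the singular values of `W = U diag(d) V^T`, which is the point of the construction.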
2.2 Spectrum Control
We provide a few options of $\mathcal{R}$ and $\mathbb{D}$ for controlling the spectra of the weight matrices in the discriminator, motivated by Miyato et al. (2018), who have shown that for an $L$-layer discriminator $D_\omega$ we have

(2.5) $\quad \|D_\omega\|_{\mathrm{Lip}} \le \prod_{l=1}^{L} \|\sigma_l\|_{\mathrm{Lip}} \cdot \prod_{l=1}^{L} \|W_l\|_2 = \prod_{l=1}^{L} \|\sigma_l\|_{\mathrm{Lip}} \cdot \prod_{l=1}^{L} \max_i d_{l,i},$

where $\|D_\omega\|_{\mathrm{Lip}}$ is the Lipschitz constant of $D_\omega$. The last equality holds for our proposed reparameterization. For commonly used activation operators, such as the sigmoid, ReLU, and leaky ReLU functions, $\|\sigma_l\|_{\mathrm{Lip}} \le 1$. Therefore, $\prod_{l=1}^{L} \max_i d_{l,i}$ is essentially an upper bound for the Lipschitz constant, which can be controlled by our proposed $\mathcal{R}$ and $\mathbb{D}$. Note that $W_L$ is a vector with only one singular value. For simplicity, we set $D_L = 1$ in the following analysis.

2.2.1 Flexible Spectral Control
Compared to orthogonal regularization, Miyato et al. (2018) suggest allowing more flexibility by using spectral normalization, which only bounds the largest singular value. They implement spectral normalization by one-step power iteration.
Spectral Normalization: We can also easily implement spectral normalization under our SVD reparameterization framework. Specifically, spectral normalization rescales the weight matrix $W_l$ by its spectral norm $\|W_l\|_2 = \max_i d_{l,i}$, which under our reparameterization amounts to replacing $D_l$ by $\bar{D}_l = D_l / \max_i d_{l,i}$, i.e., solving the problem over the feasible set $\mathbb{D} = \{D : \max_i d_i = 1\}$.
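Under the SVD reparameterization, this rescaling reduces to one operation on the diagonal entries (a sketch in our notation; `eps` guards against division by zero and is our addition):

```python
import numpy as np

def spectral_normalize(d, eps=1e-12):
    """Rescale the diagonal entries so the largest magnitude becomes 1.
    Under W = U diag(d) V^T with orthonormal U and V, this divides W by its
    spectral norm, mirroring spectral normalization."""
    return d / (np.max(np.abs(d)) + eps)
```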
Spectrum Constraint: Note that spectral normalization essentially reparameterizes the Lipschitz constraint $\|D_\omega\|_{\mathrm{Lip}} \le 1$:

(2.6) $\quad \prod_{l=1}^{L} \max_i d_{l,i} \le 1.$

This essentially controls $\|D_\omega\|_{\mathrm{Lip}}$ by forcing each $\max_i |d_{l,i}| \le 1$. Instead of spectral normalization, we consider directly solving the problem with the Lipschitz constraint. To maintain the feasibility of each $D_l$, we only need a simple projection for each $D_l$ in the back propagation, which can be implemented by a simple entrywise clipping operator defined as

(2.7) $\quad [\mathrm{Clip}(D)]_{ii} = \begin{cases} D_{ii} & \text{if } |D_{ii}| \le 1, \\ \mathrm{sign}(D_{ii}) & \text{otherwise.} \end{cases}$
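The projection in (2.7) is an entrywise clip; a sketch (assuming, as in our reading of (2.7), that entries are clipped to the interval $[-1, 1]$):

```python
import numpy as np

def clip_projection(d):
    """Entrywise projection of the diagonal entries onto [-1, 1]:
    entries within the interval are untouched, the rest map to +/-1."""
    return np.clip(d, -1.0, 1.0)
```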
These two methods essentially solve the same problem, but in different formulations; therefore, different algorithms are adopted, and due to the nonconvex-nonconcave structure of (2.4), different solutions are obtained.
Lipschitz Regularizer: We can also directly penalize the upper bound $\prod_{l=1}^{L} \max_i |d_{l,i}|$ in (2.5) to control the Lipschitz constant of the discriminator $D_\omega$. Specifically, we define the Lipschitz regularizer on this product. Compared to the spectrum constraint, which enforces all $\max_i |d_{l,i}| \le 1$, the Lipschitz regularizer is more flexible, since it allows some $|d_{l,i}| > 1$ as long as the product remains controlled.
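One plausible instantiation of such a penalty (our sketch; the paper's exact functional form is not reproduced here) charges the product-form Lipschitz upper bound from (2.5) only when it exceeds 1:

```python
import numpy as np

def lipschitz_upper_bound(d_list):
    """Product over layers of the largest |d|; by (2.5) this upper-bounds the
    discriminator's Lipschitz constant when activations are 1-Lipschitz."""
    return float(np.prod([np.max(np.abs(d)) for d in d_list]))

def lipschitz_regularizer(d_list):
    # Hypothetical penalty: charge only the excess over 1, so individual
    # entries may exceed 1 as long as the overall product stays small.
    return max(lipschitz_upper_bound(d_list) - 1.0, 0.0)
```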
2.2.2 Slow singular value decay
Miyato et al. (2018) attribute the empirical success of SNGAN to controlling the spectral norm while allowing flexibility. This explanation, however, is not very concrete. As we know, orthogonal regularization and spectral normalization with exact SVD can both control the spectral norm, yet their empirical performance is actually worse than that of SNGAN. For example, on the STL10 dataset, SNGAN achieves an inception score of 8.83, while singular value truncation only achieves 8.69 and orthogonal regularization achieves 8.77.
The reason behind this is that SNGAN implements spectral normalization via one-step power iteration. This procedure consistently underestimates the spectral norms of the weight matrices. Consequently, in addition to controlling the spectral norms, the spectral normalization in SNGAN affects the whole spectrum of the weight matrix (it encourages a slow singular value decay, as shown in Figure 1), which we refer to as "flexibility". Encouraging slow decay essentially encourages the network to capture as many features as possible while allowing correlation between neurons. Built upon these empirical observations, we conjecture that controlling the whole spectrum better improves the performance of GANs, which is further corroborated by our numerical experiments (Section 4).

[Figure 1: Singular value decays and inception scores of different methods — Orthogonal Reg. (no decay), SN with power iteration (slow decay), SN with SVD (fast decay), and D-optimal Reg. (slower decay).]
D-Optimal Regularizer: We propose the D-optimal regularizer

(2.8) $\quad \mathcal{R}(D_l) = -\sum_i \log d_{l,i},$

which is motivated by D-optimal design. D-optimal design (Wu and Hamada, 2011) is a popular principle in experimental design, where one aims to estimate the parameters of a statistical model with a minimum number of experiments. Specifically, D-optimal design maximizes the determinant of the Fisher information matrix while allowing correlation between features in experiments. Existing literature has shown the superiority of D-optimal design over the orthogonal (uncorrelated) design for nonlinear model estimation (Yang et al., 2013; Li and Majumdar, 2009; Mentre et al., 1997). Analogously, our proposed D-optimal regularizer essentially maximizes the log Gram determinant of the weight matrix,
$$ \log\det\big(W_l^\top W_l\big) \approx 2 \sum_i \log d_{l,i} = -2\, \mathcal{R}(D_l). $$
The approximation holds due to the SVD reparameterization $W_l = U_l D_l V_l^\top$, with $U_l$ and $V_l$ approximately orthogonal. Moreover, note that the derivative of $\log d$ is $1/d$, a monotone decreasing function, so the regularizer has a significant impact when $d$ is small. Thus, the D-optimal regularizer encourages a slow singular value decay.
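The regularizer in (2.8) and its connection to the log Gram determinant can be sketched as follows (our notation; for exactly orthonormal factors the relation is an equality):

```python
import numpy as np

def d_optimal_regularizer(d):
    """R(D) = -sum_i log d_i, the D-optimal regularizer of (2.8)."""
    return -np.sum(np.log(d))

def log_gram_det(W):
    """log det(W^T W); for W = U diag(d) V^T with orthonormal U, V this
    equals 2 * sum_i log d_i = -2 * R(D)."""
    sign, logdet = np.linalg.slogdet(W.T @ W)
    return logdet
```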
Divergence Regularizer: We propose a divergence regularizer to control the slow decay shown in Figure 1 more precisely. To mimic such a decay, we consider a reference distribution whose order statistics exhibit the desired decay profile (Figure 2 shows the decay of the order statistics of $K$ samples drawn from it). We denote the density function of the reference distribution by $p$, and the probability mass function of the uniform discrete distribution over the singular values $d_1, \dots, d_K$ by $q$. Note that the KL divergence between a discrete distribution and a continuous distribution is infinite. To address this issue, we discretize $p$: given $d_1, \dots, d_K$, we construct a discrete distribution over $\{d_1, \dots, d_K\}$ with probability mass at each $d_i$ proportional to $p$ evaluated there. Ignoring the normalization term in the denominator, we then define the regularizer as the resulting (unnormalized) divergence between the two discrete distributions.

Note that the divergence regularizer requires the singular values to lie in a bounded interval, and the D-optimal regularizer cannot control the Lipschitz constant of the discriminator $D_\omega$. Therefore, we incorporate the divergence regularizer with the spectrum constraint, and combine the D-optimal regularizer with spectral normalization, to bound the Lipschitz constant. Our experimental results show that both combinations improve the training of GANs on the CIFAR10 and STL10 datasets.
3 Theory
We show how spectrum control benefits the generalization of GANs. Before proceeding, we define the $\mathcal{F}$-distance as follows.

[$\mathcal{F}$-distance] Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$ such that if $f \in \mathcal{F}$, then $-f \in \mathcal{F}$. Let $\phi$ be a concave function. Then, given two distributions $\mu$ and $\nu$ supported on $\mathcal{X}$, the $\mathcal{F}$-distance with respect to $\phi$ is defined as
$$ d_{\mathcal{F}, \phi}(\mu, \nu) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim \mu}\,\phi\big(f(x)\big) - \mathbb{E}_{x \sim \nu}\,\phi\big(f(x)\big). $$

Note that the $\mathcal{F}$-distance unifies the Jensen–Shannon distance, the Wasserstein distance, and the neural distance proposed in Arora et al. (2017). For example, when taking $\phi(t) = t$ and $\mathcal{F}$ to be all 1-Lipschitz functions from $\mathcal{X}$ to $\mathbb{R}$, the $\mathcal{F}$-distance is the Wasserstein distance. Recall that, by (1.1), the training of GANs essentially minimizes the $\mathcal{F}$-distance with $\mathcal{F}$ being the collection of composite functions $f \circ D_\omega$, where $D_\omega$ is the $L$-layer discriminator network defined by (2.1). To establish the generalization bound, we impose the following assumption: (i) the activation operator $\sigma_l$ is Lipschitz with $\sigma_l(0) = 0$ for any $l \in [L]$; (ii) $\phi$ is Lipschitz; (iii) $f$ is Lipschitz; (iv) the spectral norms of the weight matrices are bounded respectively, i.e., $\|W_l\|_2 \le B_l$ for any $l \in [L]$. Note that commonly used functions
, such as the sigmoid function, satisfy this assumption. We denote by $\mu$ the underlying data distribution, and by $\hat{\mu}_n$ the empirical data distribution. We further denote by $\nu$ the distribution given by the generator that minimizes the loss (1.1) up to accuracy $\epsilon$, i.e.,
$$ d_{\mathcal{F}, \phi}(\hat{\mu}_n, \nu) \le \inf_{\nu' \in \mathcal{G}} d_{\mathcal{F}, \phi}(\hat{\mu}_n, \nu') + \epsilon, $$
where $\mathcal{G}$ is the class of distributions generated by the generators. We then give the generalization bound, based on the PAC-learning framework, as follows. Under the assumption above, suppose the input data are bounded, i.e., $\|x_i\|_2 \le B_x$ for $i \in [n]$. Then, with probability at least $1 - \delta$ over the random draw of $x_1, \dots, x_n$, we have
$$ d_{\mathcal{F}, \phi}(\mu, \nu) \le \inf_{\nu' \in \mathcal{G}} d_{\mathcal{F}, \phi}(\mu, \nu') + \epsilon + \Delta_n, $$
where $\Delta_n$ is a statistical error term depending on $\prod_{l=1}^{L} B_l$, the network size, $\log(1/\delta)$, and the sample size $n$. The detailed proof, including the exact form of $\Delta_n$, is provided in Appendix A.1. By constraining each $B_l \le 1$, the product $\prod_{l=1}^{L} B_l \le 1$, and the generalization bound becomes polynomial in the depth $L$ and the network width. On the contrary, without such spectrum constraints, the bound can depend exponentially on $L$. For example, if $B_l = 1 + c$ with some constant $c > 0$ for any $l \in [L]$, we have $\prod_{l=1}^{L} B_l = (1 + c)^L$, which implies that GANs cannot generalize with a polynomial number of samples.
The empirical Rademacher complexity (ERC) is adopted to derive our generalization bound. Directly applying the ERC-based generalization bound in Bartlett et al. (2017) yields a looser bound. Our bound is tighter, and is derived by exploiting the Lipschitz continuity of the discriminator with respect to its model parameters (weight matrices). A similar idea is used in Zhang et al. (2017); however, we derive sharper Lipschitz constants^3 by the key step of decoupling the spectral norms of the weight matrices from the number of parameters.

^3 The Lipschitz constants in Zhang et al. (2017) are of a larger order.
Theorem 3 shows the advantage of spectrum control in generalization by constraining the class of discriminators. However, as suggested in Arora et al. (2017), the class of discriminators needs to be large enough to detect a lack of diversity. Despite the lack of theoretical justification, empirical results in Miyato et al. (2018) show that discriminators with spectral normalization are powerful in distinguishing real data from generated samples, and suffer less from mode collapse. We conjecture that the observed singular value decay (as illustrated in Figure 1) contributes to preventing mode collapse. We leave this for future theoretical investigation.
4 Experiments
To demonstrate our proposed new methods, we conduct experiments on CIFAR10 (Krizhevsky and Hinton, 2009), STL10 (Coates et al., 2011), and ImageNet (Russakovsky et al., 2015). We illustrate the importance of spectrum control in GANs training by revealing a close relation between the performance and the singular value decays.
All implementations are done in Chainer, following the official implementation of SNGAN (Miyato et al., 2018). Note that SNGAN uses power iteration; if not specified otherwise, all other spectral normalization (SN) methods are implemented under our SVD framework. For quantitative assessment of generated examples, we use the inception score (Salimans et al., 2016) and the Fréchet inception distance (FID, Heusel et al. (2017)). All reported results correspond to 10 runs of the GAN training with different random initializations. The discussion in this paper is based on fully connected layers. When dealing with a convolutional layer, we only need to reshape the 4D weight tensor into a 2D matrix. Denote the weight tensor of a convolutional layer by $W \in \mathbb{R}^{c_{\mathrm{out}} \times c_{\mathrm{in}} \times h \times w}$, where $c_{\mathrm{out}}$ denotes the output channel, $c_{\mathrm{in}}$ the input channel, and $h \times w$ the kernel size. We reshape $W$ as a $c_{\mathrm{out}} \times (c_{\mathrm{in}} h w)$ matrix (Huang et al., 2017), i.e., merging the last three dimensions while preserving the first dimension. See more implementation details in Appendix C.1.

4.1 DCGAN
We test our methods on DCGANs with two datasets, CIFAR10 and STL10. Specifically, we adopt a multi-layer CNN as the generator and a multi-layer CNN as the discriminator, with the sigmoid function applied to the discriminator output. Recall that our proposed training framework solves for the equilibrium of (2.4). Denote the objective of (2.4) by $J(\theta, \omega)$. We maximize $J(\theta, \omega)$ over $\omega$ for $n_{\mathrm{dis}}$ iterations, followed by minimizing over $\theta$ for one iteration. Note that we use the trick of Goodfellow et al. (2014) to ease the computation of the minimization step. Detailed implementations are provided in Appendices B and C.2. We choose the tuning parameters^4 $\lambda$ and $\gamma$ to be the same in all the experiments, except for the divergence regularizer, where we pick a smaller $\gamma$, since its output is much larger than that of the other regularizers; $\gamma$ is chosen according to the output range of the different regularizers. We take the numbers of training iterations on CIFAR and STL as suggested in Miyato et al. (2018). To solve (2.4), we adopt the setting in Radford et al. (2015), which has been shown to be robust for different GANs by Miyato et al. (2018). Specifically, we use the Adam optimizer (Kingma and Ba, 2014), with its initial learning rate and its first- and second-order momentum parameters set accordingly.

^4 In fact, the performance is not sensitive to these hyperparameters: we observe only negligible differences when fine-tuning them.
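The weight-tensor reshaping for convolutional layers described above (merging the input-channel and kernel dimensions while preserving the output-channel dimension) can be sketched as:

```python
import numpy as np

def reshape_conv_weight(w):
    """Flatten a 4D convolution weight (c_out, c_in, h, w) into a 2D matrix
    of shape (c_out, c_in * h * w), merging the last three dimensions while
    preserving the first, so the SVD reparameterization applies to
    convolutional layers as well (cf. Huang et al., 2017)."""
    return w.reshape(w.shape[0], -1)
```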
Before we present our results, we verify the effectiveness of our proposed reparameterization, which aims to track the singular values of the weight matrices while avoiding direct SVDs. As can be seen in Table 1, the $U_l$'s and $V_l$'s have nearly orthonormal columns, i.e., $\|U_l^\top U_l - I\|_{\mathrm{F}}$ and $\|V_l^\top V_l - I\|_{\mathrm{F}}$ are on the order of $10^{-5}$. Although the reparameterization introduces more model parameters, it maintains comparable computational efficiency. See more details in Appendix D.2.
Layer 0  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  
2.3e5  1.2e5  1.5e5  1.6e5  2.7e5  2.5e5  2.1e5  
7.9e5  1.0e5  1.7e5  2.5e5  4.1e5  7.1e5  3.9e5 
Figure 4 shows the singular value decays of the weight matrices under two different methods: SNGAN, and the D-optimal regularizer with spectral normalization. As can be seen, our method achieves a slower decay in singular values than SNGAN; see more results for other methods in Appendix D.3. Such a slower decay improves the performance of GANs. Specifically, Table 2 presents the inception scores and FIDs of our proposed methods as well as other methods on CIFAR10 and STL10. As can be seen, under the CNN architecture, our methods achieve significant improvements on STL. Compared with STL, CIFAR is easy to learn, and thus GAN training benefits only marginally from encouraging the slow singular value decay. As a result, on CIFAR, our methods only slightly improve over SNGAN. Moreover, as shown in Figure 4, at an early stage (5k iterations), our method already achieves a slow decay while SNGAN still decays fast; thus, it converges faster than SNGAN, as shown in Figure 5.
[Figure 5: Training curves — FID and inception score on CIFAR and STL.]
4.2 ResNetGAN
We also test our proposed method with ResNet, a more advanced structure, for both the discriminator and the generator (Appendix C.2). For these experiments, we adopt the hinge loss for adversarial training of the discriminator:
$$ \mathcal{L}_D = \mathbb{E}_{x \sim \hat{\mu}_n}\big[\max\big(0,\, 1 - D_\omega(x)\big)\big] + \mathbb{E}_{x \sim P_\theta}\big[\max\big(0,\, 1 + D_\omega(x)\big)\big]. $$
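The hinge loss above can be computed over a mini-batch of discriminator outputs as follows (a sketch in our notation):

```python
import numpy as np

def hinge_loss_discriminator(d_real, d_fake):
    """Hinge loss for the discriminator:
    E[max(0, 1 - D(x_real))] + E[max(0, 1 + D(x_fake))]."""
    return (np.mean(np.maximum(0.0, 1.0 - np.asarray(d_real)))
            + np.mean(np.maximum(0.0, 1.0 + np.asarray(d_fake))))
```

Real samples with scores above 1 and fake samples with scores below -1 contribute nothing, so only margin violations are penalized.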
We also adopt the commonly used hyperparameter settings of the Adam optimizer for ResNet (Gulrajani et al., 2017). Due to our computational resource limit, we only test spectral normalization (our version) with the D-optimal regularizer, which achieves the best performance in the CNN experiments. We also test on the official subsampled ImageNet data using the conditional GAN with a projection discriminator (Miyato and Koyama, 2018).
The results of our experiments on CIFAR and STL are listed in Table 2, and the results on ImageNet are shown in Figure 3. We see that our method is considerably better than the other methods on STL10 and ImageNet. As for CIFAR10, our method is better than the orthogonal regularizer but slightly worse than SNGAN. We believe the reason is that CIFAR10 is relatively easy: the inception scores of all methods on CIFAR10 are fairly close to that of the real data, whereas on STL10 all methods remain well below the inception score of the real data. As a result, when the dataset is complicated and the network needs high capacity, our method performs better than SNGAN.
Method  Inception Score  FID  
CIFAR10  STL10  CIFAR10  STL10  
Real Data  
CNN Baseline  
WGANGP  
Orthogonal Reg.  
SNGAN (Power Iter.)  
Ours CNN (Under SVD)  
Spectral Norm.  
Spectral Constraint  
Lipschitz Reg.  
SC + Divergence Reg.  
SN + Optimal Reg.  
ResNet Structure  
Orthogonal Reg.  
SNGAN (Power Iter.)  
SN + Optimal Reg. 
5 Conclusion
In this paper, we propose a new SVD-type reparameterization for the weight matrices of the discriminator in GANs, which allows us to efficiently manipulate the spectra of the weight matrices. We then establish a new generalization bound for GANs to justify the importance of spectrum control on the weight matrices. Moreover, we propose new regularizers to encourage a slow singular value decay. Our experiments on the CIFAR10, STL10, and ImageNet datasets support our proposed methods, theory, and discoveries.
References
 Abadi and Andersen (2016) Abadi, M. and Andersen, D. G. (2016). Learning to protect communications with adversarial neural cryptography. arXiv preprint arXiv:1610.06918 .
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein gan. arXiv preprint arXiv:1701.07875 .
 Arora et al. (2017) Arora, S., Ge, R., Liang, Y., Ma, T. and Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573 .
 Barratt and Sharma (2018) Barratt, S. and Sharma, R. (2018). A note on the inception score. arXiv preprint arXiv:1801.01973 .
 Bartlett et al. (2017) Bartlett, P. L., Foster, D. J. and Telgarsky, M. J. (2017). Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems.
 Brock et al. (2016) Brock, A., Lim, T., Ritchie, J. M. and Weston, N. (2016). Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093 .

 Coates et al. (2011) Coates, A., Ng, A. and Lee, H. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics.
 Dowson and Landau (1982) Dowson, D. and Landau, B. (1982). The fréchet distance between multivariate normal distributions. Journal of multivariate analysis 12 450–455.
 Gao et al. (2017) Gao, R., Chen, X. and Kleywegt, A. J. (2017). Wasserstein distributional robustness and regularization in statistical learning. CoRR abs/1712.06050. URL http://arxiv.org/abs/1712.06050
 Goodfellow (2016) Goodfellow, I. (2016). Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 .
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. and Courville, A. C. (2017). Improved training of wasserstein gans. In Advances in Neural Information Processing Systems.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G. and Hochreiter, S. (2017). Gans trained by a two timescale update rule converge to a nash equilibrium. arXiv preprint arXiv:1706.08500 .

 Ho and Ermon (2016) Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems.
 Huang et al. (2017) Huang, L., Liu, X., Lang, B., Yu, A. W. and Li, B. (2017). Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079 .
 Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
 Krizhevsky and Hinton (2009) Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., Citeseer.
 Li and Majumdar (2009) Li, G. and Majumdar, D. (2009). Some results on doptimal designs for nonlinear models with applications. Biometrika 96 487–493.
 Li et al. (2017) Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A. and Jurafsky, D. (2017). Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547 .
 Liu et al. (2018) Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B. and Song, L. (2018). Learning towards minimum hyperspherical energy. arXiv preprint arXiv:1805.09298 .
 Mentre et al. (1997) Mentre, F., Mallet, A. and Baccar, D. (1997). Optimal design in randomeffects regression models. Biometrika 84 429–442.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M. and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 .
 Miyato and Koyama (2018) Miyato, T. and Koyama, M. (2018). cgans with projection discriminator. arXiv preprint arXiv:1802.05637 .
 Nagarajan and Kolter (2017) Nagarajan, V. and Kolter, J. Z. (2017). Gradient descent gan optimization is locally stable. In Advances in Neural Information Processing Systems.
 Radford et al. (2015) Radford, A., Metz, L. and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
 Roth et al. (2017) Roth, K., Lucchi, A., Nowozin, S. and Hofmann, T. (2017). Stabilizing training of generative adversarial networks through regularization. In Advances in Neural Information Processing Systems.

 Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 211–252.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. and Chen, X. (2016). Improved techniques for training gans. In Advances in Neural Information Processing Systems.
 Salimans and Kingma (2016) Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems.
 Shepard et al. (2015) Shepard, R., Brozell, S. R. and Gidofalvi, G. (2015). The representation and parametrization of orthogonal matrices. The Journal of Physical Chemistry A 119 7924–7939.
 Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A. et al. (2015). Going deeper with convolutions. Cvpr.
 Wu and Hamada (2011) Wu, C. J. and Hamada, M. S. (2011). Experiments: planning, analysis, and optimization, vol. 552. John Wiley & Sons.
 Xiang and Li (2017) Xiang, S. and Li, H. (2017). On the effects of batch and weight normalization in generative adversarial networks. stat 1050 22.
 Yang et al. (2013) Yang, M., Biedermann, S. and Tang, E. (2013). On optimal designs for nonlinear models: a general and efficient algorithm. Journal of the American Statistical Association 108 1411–1420.
 Yu et al. (2018) Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X. and Huang, T. S. (2018). Generative image inpainting with contextual attention. arXiv preprint arXiv:1801.07892 .
 Zhang et al. (2017) Zhang, P., Liu, Q., Zhou, D., Xu, T. and He, X. (2017). On the discriminationgeneralization tradeoff in gans. arXiv preprint arXiv:1711.02771 .
Appendix A Proof in Section 3
A.1 Proof of Theorem 3
Proof.
We first bound the output of the discriminator $D_\omega$ as follows,
Consider . We have
(A.1) 
Note that given and , we have
Then McDiarmid’s inequality gives us, with probability at least ,
(A.2) 
By the argument of symmetrization, we have
(A.3) 
where
’s are i.i.d. Rademacher random variables, i.e.,
McDiarmid’s inequality again gives us, with probability at least , we have(A.4) 
Note that the quantity above is essentially the empirical Rademacher complexity. Since $f$ and $\phi$ are both Lipschitz, by Talagrand's lemma, we have
We then use the standard Dudley entropy integral to bound the empirical Rademacher complexity, exploiting the parametric form of the discriminators to find a tight covering number. To this end, we investigate the Lipschitz continuity of $D_\omega$ with respect to the weight matrices $\{W_l\}_{l=1}^{L}$, based on a telescoping argument. Given two sets of weight matrices $\{W_l\}_{l=1}^{L}$ and $\{W_l'\}_{l=1}^{L}$, and fixing the activation operators, we have