1 Introduction
Generative modelling has gained attention due to its improvements for numerous applications including semi-supervised learning, image and 3D modelling, data completion and super-resolution. Inspired by game theory, generative adversarial networks (GANs) are based on the competition of two players – represented by respective generator and discriminator networks – where the generator tries to generate samples such that the discriminator cannot distinguish whether they are real or generated. In the original definition, the objective to be minimized is the Jensen-Shannon divergence
[goodfellow2014], which is a symmetric extension of the Kullback-Leibler divergence and measures the overlap between two distributions.
However, GANs in their original formulation face several problems such as a lack of stability during training, which manifests as vanishing gradients, mode collapse, and a non-converging loss for both generator and discriminator. Replacing the Jensen-Shannon divergence with the Wasserstein distance, whose dual form is given by the Kantorovich duality [villani2008optimal], mitigates the convergence problem: it preserves gradient information, guarantees differentiability of the objective function, and reduces susceptibility to mode collapse, partially by enabling the discriminator to differentiate between overlapping manifolds [arjovsky2017]. This requires enforcing a Lipschitz constraint (introduced by the Kantorovich duality) on the discriminator, as the unconstrained problem would result in exploding gradients. The constraint can be enforced by clipping the weights to lie within a compact interval [arjovsky2017]. Other methods soften this constraint via regularization with the gradient norm to improve the framework's robustness with regard to different architectures and the quality of generated samples [salimans2016improved, petzka2018on]. In this paper, we will demonstrate, among other things, that this regularization does not encourage a broad distribution of singular values in the discriminator weights. A narrow distribution of singular values results in a model which is unable to capture all details of the distribution
[miyato2018spectral]. Approaches which have been valuable in the context of standard GANs, such as spectral normalization (SN) [miyato2018spectral], can only improve a WGAN's stability when used in addition to a gradient penalty. We found in initial experiments that WGANs regularized only with SN did not converge, which is consistent with Miyato's comment [miyatocomment]. SN forces a network to learn a Lipschitz continuous function by bounding the 2-norm of the weights. The discriminator of a WGAN, however, has a gradient norm of 1 almost everywhere; therefore, according to Theorems 1 and 2 by Anil et al. [anil2018sorting], orthogonality is necessary. These improvements were achieved at the cost of a higher computational burden due to the additional regularization term that must also be considered during backpropagation, which dramatically increased the training time of WGANs compared to the original GAN framework. Initially, WGAN discriminators were trained until convergence (or at least for multiple steps) before the generator was updated. While this problem has been addressed by the two time-scale update rule
[heusel2017gans], which reduces the number of discriminator updates per generator update and influenced more recent architectural methods for reducing computational complexity such as the progressive insertion of layers [karras2017progressive], the additional cost of computing the regularization remains. Furthermore, we demonstrate that the two time-scale update rule leads to a reduced ability to capture the modes of a distribution. In this paper, we direct our attention to increasing the fidelity of the learned distribution by investigating the possibility of substituting the Lipschitz constraint required by Wasserstein GANs with an orthogonality constraint on the weight matrices during training. The major contributions of this work are:

We investigate the possibility of replacing the Lipschitz constraint with an orthogonality constraint on the weights, comparing three weight orthogonalization methods regarding their convergence properties, their ability to ensure the Lipschitz condition, and the achieved quality of the learned distribution.

We introduce a new metric to compare Wasserstein GAN discriminators based on their approximated Wasserstein distance in order to compare their fitness, i.e. the generalization capabilities of the discriminators.

We demonstrate the benefits of using weight orthogonalization during the training of Wasserstein GANs to enforce the Lipschitz constraint and increase generalization capability.
2 Background
As we focus on the use of orthogonality constraints to enforce the Lipschitz constraint in the WGAN setting, we first provide a general overview of orthogonality regularization for CNNs. This is followed by a review of the Wasserstein objective function for GANs [arjovsky2017], a discussion of improvements to Wasserstein GANs, and a survey of standard evaluation measures applied for comparing the performance of GANs.
2.1 Orthogonality regularization for CNNs
The training of deep convolutional neural networks (CNNs) is complicated by a multitude of phenomena such as vanishing/exploding gradients or shifting feature statistics
[ioffe2015batch]. Besides solutions such as parameter initialization, residual connections, and normalization of internal activations
[ioffe2015batch], much attention has been paid to regularization. In particular, structural regularization such as energy-preserving orthogonality regularization has been explored to stabilize the optimization and increase its efficiency [Rodriguez:2017, Desjardins2015NaturalNN]. Further investigations [Jia:2017, Harandi:2016, Ozay:2016, Xie:2017, Huang:2018] proposed specialized orthogonality regularizations or constraints for various tasks, such as Stiefel manifold-based hard orthogonality constraints on the weights [Harandi:2016, Ozay:2016, Huang:2018] during optimization, or singular value bounding (SVB) [Jia:2017], i.e. enforcing the singular values of weight matrices to be close to one based on a pre-specified threshold. Recent work [Xie:2017] additionally investigated soft orthonormal regularization by penalizing the deviation of each weight matrix's Gram matrix from the identity matrix in the Frobenius sense. The benefits of such soft orthonormal regularization are its differentiability and its reduced computational burden due to not relying on singular value decomposition. However, Frobenius-norm-based orthogonality regularization represents only a rough approximation and may be inaccurate, especially for dense matrices. Other work focused on penalizing the spectral norm of weight matrices in CNNs
[Yoshida:2017]. A further generalization of soft orthogonality regularization to non-square weight matrices, with consistent performance gains for different network architectures, has been achieved by Bansal et al. [Bansal:2018], who introduced double soft orthogonality regularization, mutual coherence regularization and spectral restricted isometry property regularization. While impressive results have been achieved (especially with spectral restricted isometry property regularization), the extension of enforcing orthogonality in the training of GANs has been left as future work.
2.2 Wasserstein objective function for GANs
Let $(\mathcal{X}, d)$ be a compact metric space with $\sigma$-algebra $\Sigma$. We denote the set of probability distributions over $(\mathcal{X}, \Sigma)$ with $\mathcal{P}(\mathcal{X})$ and the distribution of real data as $\mathbb{P}_r$. Furthermore, $z$ is a random variable over a space $\mathcal{Z}$ and we assume an a-priori probability density $p(z)$ for $z$. The distance between two distributions $\mathbb{P}_r, \mathbb{P}_g \in \mathcal{P}(\mathcal{X})$ can be measured by the Wasserstein distance
$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x,y) \sim \gamma}[d(x,y)],$$
where $d$ is the metric on $\mathcal{X}$ and $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ denotes the set of all joint distributions over $\mathcal{X} \times \mathcal{X}$ whose marginals are $\mathbb{P}_r$ and $\mathbb{P}_g$. Its dual representation given by the Kantorovich duality [villani2008optimal] is the following optimization problem over the set of real-valued 1-Lipschitz continuous functions:
$$W(\mathbb{P}_r, \mathbb{P}_g) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f(x)]. \quad (1)$$
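For intuition, the dual form of equation 1 can be checked numerically in one dimension, where the exact 1-Wasserstein distance between two equal-sized empirical samples is the mean absolute difference of the sorted samples, and any 1-Lipschitz test function (here simply $f(x) = x$) yields a lower bound. A minimal sketch; the sample sizes and the two Gaussians are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=10_000)  # samples from P_r
y = rng.normal(loc=2.0, scale=1.0, size=10_000)  # samples from P_g

# Exact empirical 1-Wasserstein distance in 1D: match sorted samples.
w1_exact = np.mean(np.abs(np.sort(x) - np.sort(y)))

# Kantorovich dual with the 1-Lipschitz test function f(x) = x.
# Since the sup in eq. 1 also ranges over -f, |E[f(x)] - E[f(y)]| is a
# lower bound on the distance.
dual_bound = abs(np.mean(x) - np.mean(y))

print(w1_exact, dual_bound)
assert dual_bound <= w1_exact + 1e-9
```

Training a WGAN critic amounts to searching over a parametrized family of (approximately) 1-Lipschitz functions to push this lower bound toward the supremum.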
The Wasserstein GAN can now be modelled by two parametrized functions $f_w \colon \mathcal{X} \to \mathbb{R}$ and $g_\theta \colon \mathcal{Z} \to \mathcal{X}$, where $f_w$ denotes the critic and $g_\theta$ the generator. The generator (with its objective to produce samples that cannot be distinguished from real samples) and the critic (with its objective to approximate the dual potential $f$ with the parametrized function $f_w$) compete in the minimax game
$$\min_\theta \max_w \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))].$$
When implemented, this minimax game is relaxed: the critic is trained until convergence, and the Lipschitz constraint is enforced either by clipping the weights to a compact space [arjovsky2017] or by regularizing the critic's objective with an estimated gradient norm [gulrajani2017improved].
2.3 Improving Wasserstein GANs
Enforcing the Lipschitz constraint on the Wasserstein GAN's discriminator is crucial to ensure the model's convergence. Numerous normalization procedures have been demonstrated to increase a network's adversarial robustness by limiting its Lipschitz constant [NIPS2018_7515]. Most prominent in the literature on GANs are instance/batch normalization. Other techniques such as weight normalization [salimans2016weight] have been found to be limiting when compared to spectral normalization [miyato2018spectral]. However, we have found that weight and spectral normalization do not ensure successful training, although they limit the discriminator's Lipschitz constant. Only additional regularization with the gradient norm led to successful training of a Wasserstein GAN, and we discuss its theoretical problems in Section 3.1. In its original formulation, the Lipschitz constraint is enforced by clipping the weights so that they are contained in a compact interval [arjovsky2017]. Based on the theoretical insight that an optimal discriminator has a gradient norm of 1 almost everywhere [gulrajani2017improved], further improvements have been made by enforcing this constraint with regularization [gulrajani2017improved, petzka2018on]. These improvements increase the stability of the model and the quality of the generated images but require additional computation during training. The application of a two time-scale update rule for GANs (WGAN-TTUR) allows the number of discriminator updates per generator update to be reduced and further enhances the convergence properties and the sample quality [heusel2017gans]. We demonstrate that WGAN-TTUR decreases the ability to represent all modes of the real distribution, and introduce a training procedure to mitigate this problem by allowing the network to increase its capacity.
Further relevant investigations, which focus on improvements to the network architectures, include the progressive insertion of layers [karras2017progressive] and the use of large conditional GANs [brock2018large]. The progressive insertion of layers by fade-in [karras2017progressive] further increases the computational efficiency and quality on image datasets by architectural means. In contrast, training large-scale conditional GANs [brock2018large] has been approached based on a hinge version of the original GAN objective and a regularization by conditioning the GAN according to the large annotated JFT-300M dataset to mitigate mode collapse. However, these improvements are specific to large-scale GANs and not related to improving WGANs, which mitigate mode collapse through the Wasserstein objective function. In this paper, we will discuss the suitability of soft orthogonality-enforcing techniques as well as hard constraints, and discuss why constraints on the norm alone are not sufficient to enforce the Lipschitz constraint.
2.4 Measures for evaluating GANs
When applied to image datasets, the current state-of-the-art approaches for automatically evaluating image quality are the Inception Score [salimans2016improved] and the Fréchet Inception Distance (FID) [heusel2017gans]. However, the scores computed by both methods improve if the network overfits. Methods to directly evaluate overfitting or mode collapse in GANs [srivastava2017veegan, arora2017gans, santurkar2018classification] either require human supervision or knowledge about the modes or label distribution of a dataset. An estimate of the Wasserstein distance between local image features [karras2017progressive], denoted the sliced Wasserstein distance (SWD), provides a value that indicates the difficulty of distinguishing real from generated images; however, its high computational complexity makes this approach less feasible. In contrast, we propose a novel and easy-to-compute WGAN evaluation metric that scores a model's generalization capabilities based on the estimated Wasserstein distance.
3 What can we gain from orthogonality regularization?
Wasserstein GANs [arjovsky2017] have been introduced to mitigate the major problems of standard GANs [goodfellow2014] regarding their unstable training, vanishing gradients, erratic convergence behaviour and mode collapse. However, enforcing the Lipschitz constraint introduced by the Kantorovich duality in Equation 1 is necessary, as the unconstrained maximization problem would diverge and the discriminator would provide no meaningful gradient to the generator. In this section, we demonstrate drawbacks of previous methods to enforce the Lipschitz constraint and elucidate how Wasserstein GANs can benefit from an orthogonal weight constraint.
3.1 Problems of regularization based on the gradient norm
Stochastic gradient descent does not directly allow for constrained optimization, and therefore additional techniques have been established to enforce the Lipschitz constraint for a neural network $f_w$ which approximates the dual potential $f$. Methods such as clipping the weights to lie within a compact interval [arjovsky2017] or enforcing hard norm constraints do not achieve state-of-the-art results, as these constraints allow the discriminator to collapse to a linear function [anil2018sorting]. Recent state-of-the-art methods which aim to minimize a Wasserstein loss [karras2017progressive, adler2018banach] have adopted regularization to enforce the Lipschitz constraint. The discriminator is regularized with its gradient norm
$$L_{GP} = \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}} \left[ \left( \|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1 \right)^2 \right], \quad (2)$$
where $\mathbb{P}_{\hat{x}}$ is based on interpolated samples between the generated and target distributions [gulrajani2017improved] to mitigate vanishing/exploding gradients. Such regularization noticeably increases the computational cost of training and scales (almost) linearly with the number of layers, as demonstrated in the supplemental. The improved stability offered by this regularization [gulrajani2017improved] allows the number of discriminator updates between each generator update to be reduced, using a two time-scale update rule (TTUR) [heusel2017gans] to avoid losses in image quality. In this TTUR, the generator and discriminator are trained in an alternating scheme with different learning rates, allowing the use of a higher learning rate for the discriminator to reduce the training time. However, as demonstrated in Figure 1, a Wasserstein GAN trained according to the two time-scale update rule has a reduced ability to capture the modes of the target distribution.
Furthermore, this problem does not only occur for synthetic distributions. We utilize the method introduced by [richardson2018gans] to evaluate the mode collapse of a Wasserstein GAN regularized with the gradient penalty from Equation 2 (WGAN-GP) [gulrajani2017improved] and a Wasserstein GAN using the same regularization but trained according to the TTUR on the benchmark dataset CelebA. The modes are approximated by computing a Voronoi partition. As the true number of modes is unknown for the distribution assumed to underlie the datasets, we tested with a range of Voronoi partition counts. A statistical analysis reveals that the number of modes is significantly less well represented on both the CIFAR-10 dataset [krizhevsky2009learning] and the CelebA dataset [Liu:2015:DLF] for WGAN-TTUR in comparison to WGAN-GP, as demonstrated in Figure 0(a). Results for other datasets are included in the supplemental. A relevant question is therefore how the representation can be improved without drastically increasing the training time.
3.2 Relation between Lipschitz continuity and orthogonality constraints
Orthogonal weight constraints have been proven to stabilize the training of RNNs [wisdom2016full] and to increase generalization capabilities [Bansal:2018]. A square matrix $W$ is orthogonal if and only if $W^T W = W W^T = I$. For simplicity of notation, we call a non-square matrix $W$ orthogonal if it has dimensions $m \times n$ with $m \geq n$ and orthonormal columns ($W^T W = I$), or $m < n$ and orthonormal rows ($W W^T = I$).
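This definition can be checked directly in code. A small sketch using a QR factorization to produce matrices with orthonormal columns (the matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Square orthogonal matrix: W^T W = W W^T = I.
W, _ = np.linalg.qr(rng.normal(size=(4, 4)))
assert np.allclose(W.T @ W, np.eye(4), atol=1e-10)
assert np.allclose(W @ W.T, np.eye(4), atol=1e-10)

# Non-square case, m > n: only the columns can be orthonormal (W^T W = I).
V, _ = np.linalg.qr(rng.normal(size=(6, 3)))   # 6 x 3, orthonormal columns
assert np.allclose(V.T @ V, np.eye(3), atol=1e-10)
assert not np.allclose(V @ V.T, np.eye(6))     # rows cannot be orthonormal too
```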
It is well known that a differentiable function between two metric spaces is Lipschitz continuous if and only if its gradient is bounded. Let $f(x) = Wx + b$ be a linear layer with weight matrix $W$ and input $x$; then $f$ is Lipschitz continuous with constant $\|W\|_2$. If we assume that the discriminator is a feedforward network built from linear layers $f_1, \ldots, f_n$ and 1-Lipschitz continuous activation functions $\sigma$, then
$$\|f_n \circ \sigma \circ \cdots \circ \sigma \circ f_1\|_L \leq \prod_{i=1}^{n} \|W_i\|_2, \quad (3)$$
which implies that such a discriminator is Lipschitz continuous if the norm of all weight matrices is bounded.
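The bound in equation 3 can be verified empirically for a small random ReLU network; a sketch with arbitrary layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three layers: 8 -> 16 -> 16 -> scalar output.
Ws = [rng.normal(size=(16, 8)), rng.normal(size=(16, 16)), rng.normal(size=16)]

def net(x):
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)   # ReLU is 1-Lipschitz
    return Ws[-1] @ h                # scalar output

# Upper bound on the Lipschitz constant: product of spectral norms (eq. 3).
bound = np.prod([np.linalg.norm(W, 2) for W in Ws])

# Empirical difference quotients never exceed the bound.
for _ in range(100):
    x1, x2 = rng.normal(size=8), rng.normal(size=8)
    slope = abs(net(x1) - net(x2)) / np.linalg.norm(x1 - x2)
    assert slope <= bound + 1e-9
```

Note that the bound is usually very loose: random ReLU networks realize only a small fraction of it, which is exactly the slack that the orthogonality argument below removes.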
One might assume that limiting the norm of the weight matrices would be sufficient to guarantee a Lipschitz constant of at most 1. However, upper-bounding the 2-norm of the network's weight matrices by 1 without any additional constraint only bounds the Lipschitz constant and does not prevent the network from collapsing to a linear function (assuming 1-Lipschitz and monotonic activations, such as ReLU, are used) [anil2018sorting]. This explains the limited performance reported in [gulrajani2017improved] for hard weight constraints, and the limited performance when using spectral normalization [miyato2018spectral] to enforce the discriminator's Lipschitz condition.
Theorem 1 ([anil2018sorting]). If a function $f$ with $\|\nabla f(x)\|_2 = 1$ almost everywhere is represented by a neural network with weights $W_i$ that have a 2-norm of at most 1, then each $W_i$ can be replaced by an orthogonal matrix without changing the represented function.
The sufficient condition provided by Theorem 1 for enforcing the Lipschitz constraint of a neural network is thus to constrain the weight matrices to be orthogonal.
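The gap that Theorem 1 closes can be illustrated numerically: a matrix with spectral norm 1 but small remaining singular values shrinks almost every input, so gradient norms collapse below 1, while an orthogonal matrix preserves every norm. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Spectral norm 1, but far from orthogonal: singular values (1, 0.1, ..., 0.1).
W_bounded = np.diag([1.0] + [0.1] * (n - 1))
# Orthogonal: all singular values equal 1.
W_orth, _ = np.linalg.qr(rng.normal(size=(n, n)))

x = rng.normal(size=n)
x /= np.linalg.norm(x)

# Both maps are 1-Lipschitz, but only the orthogonal one preserves norms,
# which is what keeps the gradient norm at 1 as the WGAN dual potential needs.
assert np.linalg.norm(W_bounded @ x) <= 1.0 + 1e-12
assert np.linalg.norm(W_bounded @ x) < 0.99   # shrunk unless x aligns with e_1
assert np.isclose(np.linalg.norm(W_orth @ x), 1.0)
```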
4 Orthogonal WassersteinGAN
Motivated by the theoretical connection between a network's Lipschitz constant and an orthogonal weight constraint, we discuss three methods to enforce such a constraint as well as their runtimes, and analyse their suitability in the context of training a Wasserstein GAN. Based on these findings, we propose a new procedure to train a Wasserstein GAN.
4.1 Enforcing the Lipschitz constraint with orthogonalization
An intuitive approach to enforcing orthogonality of the weight matrices is to add a regularization term to the discriminator's objective according to
$$L_{orth} = \lambda \sum_{l} \left\| W_l^T W_l - I \right\|_F^2, \quad (4)$$
where $W_l$ is a weight matrix, $I$ represents the identity matrix, and $\lambda$ weights the contribution of the regularization to the overall objective function. Such and similar regularization methods have gained increased adoption in deep neural classifier networks [Bansal:2018] due to the relatively low computational overhead required. For each layer, the computational costs are dominated by the matrix multiplication, which scales linearly with the number of layers, but an additional gradient evaluation is needed. Orthogonal regularization is only a soft constraint, and there is no guarantee that this additional condition is fulfilled. The set of all orthogonal matrices of a given size is a submanifold of the space of matrices called the Stiefel manifold. To perform the optimization on this manifold, the weights should move along a geodesic, for which the direction is given by the gradient of the objective. Solving optimization problems on the Stiefel manifold has been made tractable with Cayley transformations [wen2013feasible]
$$W(\tau) = \left( I + \frac{\tau}{2} A \right)^{-1} \left( I - \frac{\tau}{2} A \right) W, \quad (5)$$
where $A$ is a skew-symmetric matrix derived from the gradient and $\tau$ is the remaining variable to be estimated. This retraction reduces the optimization problem to a one-dimensional search: for each weight matrix $W$ of the network, we now have to find a $\tau$ such that, if we set the new weights to be $W(\tau)$, they minimize equation 1. However, solving this optimization problem after each generator update does not yield an efficient training procedure for Wasserstein GANs. It has been demonstrated that it is sufficient to fix $\tau$ to a small value proportional to the learning rate [wisdom2016full]. Even though this procedure does not require additional gradient computations, the matrix inversion results in a significantly higher computational burden and higher memory requirements than the regularization according to equation 4. A more efficient but less accurate orthogonalization algorithm has been introduced by Björck and Bowie [bjorck1971iterative]. For a given weight matrix $W_0 = W$, the algorithm iteratively computes the best orthogonal matrix in a least-squares sense by applying
$$W_{k+1} = W_k \left( I + \frac{1}{2} \left( I - W_k^T W_k \right) \right), \quad (6)$$
where $k$ is the current iteration. Since the algorithm is inherently iterative, it is particularly suitable in the context of neural networks. We found that the orthogonality and Lipschitz conditions are sufficiently fulfilled by applying one iteration before each discriminator update. The asymptotic time complexity is equal to that of the regularization in equation 4, but no additional gradient computation is required, which makes it the fastest method in an empirical evaluation (see Table 1).
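The soft penalty of equation 4 and the iterative orthogonalization of equation 6 can be sketched together. Starting from a weight matrix rescaled to spectral norm 1 (an illustrative assumption that keeps all singular values in the iteration's convergence region), repeated Björck steps drive the orthogonality residual toward zero:

```python
import numpy as np

def orth_penalty(W):
    """Soft orthogonality residual ||W^T W - I||_F^2 from eq. 4."""
    n = W.shape[1]
    return np.linalg.norm(W.T @ W - np.eye(n)) ** 2

def bjorck_step(W):
    """One first-order Bjorck-Bowie iteration (eq. 6)."""
    n = W.shape[1]
    return W @ (np.eye(n) + 0.5 * (np.eye(n) - W.T @ W))

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
W /= np.linalg.norm(W, 2)      # rescale so all singular values lie in (0, 1]

before = orth_penalty(W)
for _ in range(40):
    W = bjorck_step(W)
after = orth_penalty(W)

print(before, after)
assert after < before
assert after < 1e-8            # columns are near-orthonormal
```

On singular values the step acts as $\sigma \mapsto \sigma(3 - \sigma^2)/2$, whose stable fixed point is 1; this is why, during training, a single step per discriminator update suffices to keep weights close to the Stiefel manifold once they are near it.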
4.2 Suitability and comparison of different orthogonality regularizers for WGANs
We now compare the aforementioned procedures with regard to their suitability for training Wasserstein GANs. First, we evaluate the models' adherence to the Lipschitz and orthogonality conditions, because a model's convergence behaviour directly depends on its Lipschitz constant. The adherence to the orthogonality constraint is quantified by $\|W^T W - I\|_F$ for a weight matrix $W$. Based on Proposition 1 in [gulrajani2017improved], we estimate the network's Lipschitz constant with equation 2, where the points are drawn from the convex combination of the supports of $\mathbb{P}_r$ and $\mathbb{P}_g$.
We plot the estimated Lipschitz constant and the orthogonality deviation for models trained on CIFAR-10 with each of the three methods in Figure 2. We see that all models converge and the Lipschitz constant is bounded in all cases. However, orthogonal regularization does not ensure orthogonal weight matrices, even for high values of $\lambda$, and we observe a drift with Cayley transformations, which we believe to be a result of numerical inaccuracy. Iterative orthogonalization enforces both constraints while being significantly faster than Cayley transformations and comparable in speed to orthogonal regularization, as shown in Table 1.
The comparison between the learned synthetic distribution and the real distribution, as illustrated in Figure 3, shows that a Wasserstein GAN trained with iterative orthogonalization captures the target distribution best. The regularized orthogonalization and the Cayley transformation method both introduce noise, shifts, and distortions in the learned distribution, whereas the iterative orthogonalization method is significantly less affected by these phenomena.
Similar results can be observed in Table 1, which quantifies the quality of sampled images from a learned CIFAR-10 representation in terms of both Inception Score (IS) and Fréchet Inception Distance (FID). The Wasserstein GAN using iterative orthogonalization has a significantly higher Inception Score and lower Fréchet Inception Distance than Wasserstein GANs trained using the other two methods.
4.3 Proposed method
In the previous section, we demonstrated that the training of a Wasserstein GAN converges when we only apply iterative orthogonalization in the discriminator. Note that a solution to equation 1 is only feasible if the discriminator's Lipschitz constant is at most 1. If we compare the estimated Lipschitz constants in Figure 2, we observe that a Wasserstein GAN trained using iterative orthogonalization in the discriminator reaches a feasible solution with fewer iterations than WGAN-TTUR. However, the resulting scores are not better than WGAN-TTUR's scores, as shown in Table 1.
The strict orthogonalization strongly increases the discriminator's robustness against adversarial samples, hinders the discriminator from collapsing to a linear function, and shows a faster convergence of its Lipschitz constant. However, the normalization of the row and column vectors resulting from orthogonalization leads to less fidelity in the learned distribution [miyato2018spectral]. In our proposed method, we use the advantages provided by orthogonalization during the beginning of the model's training. As the changes to the generator's output are largest during this initial training phase, we leverage the increased stability provided by iterative orthogonalization during this phase. We relax this condition for the later training phase and ensure the Lipschitz condition using the one-sided gradient penalty introduced in [petzka2018on]. A detailed description of our procedure is provided in Algorithm 1. Note that in an efficient implementation we can omit the gradient penalty during the initial orthogonalization phase. We provide additional information regarding the algorithm's extension to CNNs and the used initialization in the supplemental.
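The phase switch of our procedure can be summarized in a short sketch. The threshold `orth_steps`, the single Björck step and the one-sided penalty shape are illustrative assumptions based on Algorithm 1 and [petzka2018on], not the exact hyperparameters:

```python
import numpy as np

def bjorck_orthogonalize(W):
    """One Bjorck iteration (eq. 6), applied before each critic update."""
    n = W.shape[1]
    return W @ (np.eye(n) + 0.5 * (np.eye(n) - W.T @ W))

def one_sided_penalty(grad_norms, lam=10.0):
    """One-sided penalty [petzka2018on]: only gradient norms above 1 are punished."""
    return lam * np.mean(np.maximum(grad_norms - 1.0, 0.0) ** 2)

def critic_regularization(step, weights, grad_norms, orth_steps=10_000):
    """Early phase: hard orthogonalization, no penalty term.
    Late phase: weights are left free, Lipschitz kept via the penalty."""
    if step < orth_steps:
        return [bjorck_orthogonalize(W) for W in weights], 0.0
    return weights, one_sided_penalty(grad_norms)

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 8)) / 4.0]

# Early step: weights are orthogonalized, penalty is zero.
_, pen_early = critic_regularization(0, Ws, grad_norms=np.array([1.3]))
assert pen_early == 0.0

# Late step: weights untouched, one-sided penalty is active.
_, pen_late = critic_regularization(20_000, Ws, grad_norms=np.array([1.3]))
assert np.isclose(pen_late, 10.0 * 0.3 ** 2)
```

The one-sided form only pushes the Lipschitz constant down toward 1 rather than pinning it at 1, which is what allows the relaxed phase to recover fidelity lost to strict orthogonalization.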
5 Experimental results
In this section, we first introduce a new metric to compare the generalization capabilities of Wasserstein GAN discriminators. Subsequently, we compare our method to both the Wasserstein GAN regularized with a gradient penalty (WGAN-GP) [gulrajani2017improved] and the Wasserstein GAN trained according to the two time-scale update rule (WGAN-TTUR) [heusel2017gans]. As recommended in [lucic2017gans], we trained all models with an equal computational budget and architecture.
5.1 New evaluation metric for the generalization capability of WGANs
While the Inception Score (IS) [salimans2016improved] and Fréchet Inception Distance (FID) [heusel2017gans] are well-established metrics to evaluate the perceived image quality of generated samples and to compare different models with a common architecture, neither of them measures overfitting. Evaluating overfitting in GANs is non-trivial, because the discriminator can overfit with respect to the real data distribution or the generated samples. A solution to this problem has been presented in the form of a tournament between different GANs in which the generator/discriminator pairs are compared element-wise using an error function [im2016generating]. However, the error function assumes the discriminator to be a classifier, and therefore this method cannot be applied to Wasserstein GANs, as their discriminator approximates a dual potential. Instead, we adapt the idea of using the generator of a different model to provide samples of a learned distribution, and use the estimated Wasserstein distance as a metric for comparison. Let $\{(g_1, f_1), \ldots, (g_n, f_n)\}$ be a set of Wasserstein GANs where the $j$-th WGAN's generator is denoted as $g_j$ and the $i$-th WGAN's critic as $f_i$. Then
$$\hat{W}_{ij} = \mathbb{E}_{x \sim \mathbb{P}_{test}}[f_i(x)] - \mathbb{E}_{z \sim p(z)}[f_i(g_j(z))] \quad (7)$$
provides an estimate of the Wasserstein distance between the real data and the distribution generated by $g_j$, where we use unseen samples from the real data. The estimate allows us to draw the following conclusions about the relative generalization capabilities of the Wasserstein GANs when we compare it to the corresponding estimate on the training data:

If the estimate on unseen data exceeds the estimate on the training data, the discriminator's ability to differentiate between the two distributions increases.

If the estimate on unseen data is smaller than the estimate on the training data, the discriminator's ability to differentiate between the distributions decreases.
Note that if a Wasserstein GAN has a Lipschitz constant of $K$, it estimates $K$ times the Wasserstein distance [arjovsky2017]. To avoid this scaling error, we define the generalization score of the $i$-th WGAN's discriminator with the $j$-th WGAN's generator as the relative error between the test and training estimates. For a given generator, one discriminator can better distinguish the data than another if its generalization score is higher. An overall generalization score can be computed by averaging these scores over all generator-discriminator pairs.
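A sketch of the cross-evaluation in equation 7 with toy one-dimensional "models": linear critics and fixed-offset generators stand in for trained networks, and the relative score shown divides the test estimate by the training estimate, one plausible reading of the relative error, which is what cancels the critic's unknown Lipschitz constant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for trained models: critics f_i(x) = a_i * x (a_i-Lipschitz)
# and generators g_j producing Gaussians with different means.
critics = [lambda x: 1.0 * x, lambda x: 0.5 * x]
generators = [lambda z: z + 2.0, lambda z: z + 3.0]

def w_hat(f, g, real, z):
    """Estimated Wasserstein distance of eq. 7 for critic f and generator g."""
    return np.mean(f(real)) - np.mean(f(g(z)))

real_train = rng.normal(size=5_000)
real_test = rng.normal(size=5_000)
z = rng.normal(size=5_000)

scores = np.empty((2, 2))
for i, f in enumerate(critics):
    for j, g in enumerate(generators):
        # The critic's scale appears in numerator and denominator and cancels.
        scores[i, j] = w_hat(f, g, real_test, z) / w_hat(f, g, real_train, z)

# Scaling the critic by 0.5 changes both estimates but not the score.
assert np.allclose(scores[0], scores[1])
```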
5.2 Empirical evaluation
To evaluate our approach, we compare it to the Wasserstein GAN with gradient penalty (WGAN-GP) [gulrajani2017improved] as a baseline, and to WGAN-GP trained with the two time-scale update rule as described in [heusel2017gans], which, to the best of our knowledge, is the state-of-the-art Wasserstein GAN approach that minimizes the 1-Wasserstein distance without requiring a special architecture. We consider synthetic distributions, as they allow for a more detailed comparison of the captured modes, as well as the benchmark dataset CIFAR-10 [krizhevsky2009learning], on which we compute both the models' Inception Score and Fréchet Inception Distance.
Datasets, architecture and parameters: To learn the synthetic distribution, we use an MLP with linear outputs to represent both the generator and the discriminator. Furthermore, we use Rectified Linear Units (ReLU) as activations for the hidden layers and do not consider additional normalizations or constraints in the network. For image datasets, we use a convolutional architecture based on the DCGAN [radford2015unsupervised]. For WGAN-GP and WGAN-TTUR, we replaced the batch normalization in the discriminator with layer normalization [ba2016layer], as recommended in [salimans2016improved]. On the synthetic dataset we trained all models for 10 minutes, and on CIFAR-10 all models were trained for 60 minutes, with fixed batch sizes, on a Nvidia GTX 1080. For WGAN-GP and WGAN-TTUR, we used the hyperparameters provided in the original publications.
Comparison: Samples drawn from a synthetic distribution and samples generated by Wasserstein GANs trained with the different procedures are visualized in Figure 5. Both WGAN-GP and WGAN-TTUR do not accurately represent the ends of the spiral arms, while samples generated by our method completely cover the target distribution.
In Table 1, we report the Fréchet Inception Distance and Inception Score of the different procedures on the CIFAR-10 dataset. In addition, we also report the number of iterations per second during training for each method. Our method outperforms the other methods with respect to the Fréchet Inception Distance, while also outperforming WGAN-GP with respect to the Inception Score. Note that our method additionally offers the highest computational efficiency.
Model | FID | IS
WGAN-GP |  | 
WGAN-TTUR |  | 
Standard reg. |  | 
Cayley reg. |  | 
Iterative reg. |  | 
Ours |  | 
Comparing the estimated Wasserstein distance using our new metric in Figure 3(c), we observe that our proposed method has the highest overall generalization score, while the next best model, WGAN-GP, reaches a lower score. As the diagonal reflects the discriminators' overfitting with respect to the test data, it is of special interest; our method achieves the highest performance in distinguishing generated data from unseen real data.
To further compare the different procedures for training Wasserstein GANs and to gain additional insights regarding the benefits of our method, we plot the discriminators' gradient norms in Figure 3(b). The sudden increase in the gradient norm results from relaxing the orthogonality constraint. As the generator learns to minimize the critic's output on generated samples, the gradient norm of the critic is crucial during training. In general, our procedure provides a stronger gradient to the generator for the majority of iterations when compared to WGAN-TTUR. Furthermore, the gradient is more stable than that of the competing techniques, as it shows the lowest amount of noise over the iterations, even though all models have been trained with the same batch sizes. An additional benefit of our method is a more even distribution of the weights' singular values in the discriminator, as shown in Figure 6. As argued in [miyato2018spectral], a more even distribution of singular values encourages the discriminator to capture more features of the real dataset.
6 Conclusion
In this work, we outlined a connection between orthogonal weight matrices in neural networks and the Lipschitz continuity required by Wasserstein GANs. We empirically investigated the possibility of replacing the gradient norm regularization with different orthogonalization methods. We found training with hard-constraint orthogonalization methods to be stable, and that all considered orthogonalization methods are able to enforce the Lipschitz constraint. However, the learned distributions did not exhibit the same fidelity as the distributions learned by established training methods. Based on the insights gained from this investigation, we proposed a new training method which utilizes the increased stability but avoids restricting the model's capacity. Finally, we demonstrated that a Wasserstein GAN discriminator trained with this procedure has an increased generalization capability and that its weight matrices exhibit more evenly distributed singular values, which enables the model to better represent the target distribution.
References
7 Demonstration of problems in conjunction with gradient norm penalties
Influence of gradient norm regularization on the runtime
We demonstrate the increase in computational complexity by training a Wasserstein GAN with weight clipping and a Wasserstein GAN with gradient normalization on a synthetic dataset. The architecture used is a multilayer perceptron (MLP), where we vary either the number of layers while keeping the number of hidden units fixed, or vice versa, keeping the respective other parameter at its default value. The results in Figure 7 show that the number of iterations per second decreases by up to 30% with gradient regularization. While increasing the number of layers at a small but constant overall number of units in the MLP does not change the computational efficiency, we observe a decrease in the number of iterations per second during training when the number of units is increased. This decrease is significantly larger when using regularization and, combined with multiple discriminator updates per generator update, slows down the training of WGAN-GP.
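A minimal sketch of the timing setup behind Figure 7 is given below. The layer and unit counts are illustrative stand-ins for the (elided) values used in the paper, and only a plain forward pass is timed, without any GAN training loop.

```python
# Sketch: measuring how iterations per second scale with MLP width.
# Layer/unit counts are illustrative, not the values from the paper.
import time
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(n_layers, n_units, n_in=2):
    """Build weight matrices for an MLP with n_layers hidden layers."""
    dims = [n_in] + [n_units] * n_layers + [1]
    return [rng.standard_normal((dims[i], dims[i + 1])) * 0.1
            for i in range(len(dims) - 1)]

def forward(weights, x):
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)   # ReLU hidden layers
    return x @ weights[-1]

def iterations_per_second(weights, batch=64, iters=200):
    x = rng.standard_normal((batch, weights[0].shape[0]))
    start = time.perf_counter()
    for _ in range(iters):
        forward(weights, x)
    return iters / (time.perf_counter() - start)

for units in (64, 256, 1024):
    ips = iterations_per_second(make_mlp(n_layers=3, n_units=units))
    print(f"{units:5d} units: {ips:8.1f} it/s")
```

Adding a gradient-penalty term would introduce an extra backward pass through the discriminator per update, which is the source of the overhead discussed above.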
Analysis of mode preservation for different WGAN approaches
One of the main benefits of Wasserstein GANs over standard GANs is their capability to mitigate mode collapse. However, applying techniques such as the TTUR [heusel2017gans] to reduce the training time weakens this effect. To compare mode collapse across different established Wasserstein GAN approaches, we trained a Wasserstein GAN with weight clipping (WGAN), a Wasserstein GAN with gradient penalty (WGAN-GP) and a Wasserstein GAN with TTUR (WGAN-TTUR) using the architecture described in the accompanying paper on the MNIST [lecun1998mnist], CIFAR-10 [krizhevsky2009learning] and CelebA [Liu:2015:DLF] datasets. Each of the models was trained using the hyperparameters provided in the original publications. Finally, we evaluated the mode collapse using the procedure proposed by Richardson and Weiss [richardson2018gans]. The results in Figure 8 demonstrate that WGAN-TTUR represents a significantly lower number of modes than WGAN-GP, which in turn outperforms the standard WGAN.
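To make the notion of "represented modes" concrete, the sketch below counts mode coverage on a synthetic Gaussian-mixture ring, a common WGAN toy problem: each generated sample is assigned to its nearest mode centre, and a mode counts as covered if it receives a minimum fraction of the samples. Note that Richardson and Weiss [richardson2018gans] use a more elaborate statistical procedure; this is only an illustration of the underlying idea.

```python
# Sketch: naive mode-coverage count on a ring of Gaussian modes.
# This is NOT the Richardson & Weiss procedure, only an illustration.
import numpy as np

rng = np.random.default_rng(0)

# Eight mode centres on a unit ring.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
centres = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def covered_modes(samples, centres, min_fraction=0.02):
    """Count modes that receive at least `min_fraction` of the samples."""
    d = np.linalg.norm(samples[:, None, :] - centres[None, :, :], axis=-1)
    assignment = d.argmin(axis=1)
    counts = np.bincount(assignment, minlength=len(centres))
    return int((counts >= min_fraction * len(samples)).sum())

# A "collapsed" generator that only produces two of the eight modes.
collapsed = (centres[rng.integers(0, 2, size=1000)]
             + 0.05 * rng.standard_normal((1000, 2)))
print(covered_modes(collapsed, centres))  # → 2
```

A generator covering all eight modes would score 8 under the same check, so the count directly reflects the degree of mode collapse.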
8 Additional implementation details of the proposed method
Initialization
The initialization of network weights has been studied extensively, and it has been demonstrated that a careful initialization alone already improves a network's performance significantly [Xie:2017]. Inspired by the initialization proposed by Saxe et al. [Saxe14exactsolutions], we initialize the weights by computing an SVD $W = U \Sigma V^T$, replacing all singular values by $1$ and setting the weights to $W = U V^T$. We found it beneficial to further relax the orthogonality constraint by instead setting the singular values to a constant larger than one. To motivate this parameter choice, we consider the derivative of the generator's objective function. Its gradient can be written as [arjovsky2017]
$$\nabla_\theta \, \mathbb{E}_{z \sim p(z)}\!\left[-D\big(G_\theta(z)\big)\right] = -\mathbb{E}_{z \sim p(z)}\!\left[\nabla_\theta D\big(G_\theta(z)\big)\right],$$
and the chain rule implies that this gradient can be factorized into a product between
$$\nabla_x D(x)\big|_{x = G_\theta(z)} \quad \text{and} \quad \nabla_\theta G_\theta(z),$$
with $G_\theta$ denoting the generator and $D$ the discriminator. If we recall Equation 3 from the accompanying paper, we see that there is a direct connection between the gradient norm of $D$ and the gradient norm of the generator's objective, and, as the spectral norm of a matrix is its largest singular value, we can increase the generator's training speed by scaling the singular values.

Extension to convolutions
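The SVD-based initialization above, combined with the kernel-flattening used for convolutional layers, can be sketched as follows; the `scale` parameter is an illustrative stand-in for the (elided) relaxation constant from the paper.

```python
# Sketch: SVD-based initialization (all singular values set to a constant),
# applied both to a dense weight matrix and to a reshaped conv tensor.
# `scale` is a hypothetical stand-in for the paper's relaxation constant.
import numpy as np

rng = np.random.default_rng(0)

def svd_init(weight_matrix, scale=1.0):
    """Replace all singular values of `weight_matrix` by `scale`."""
    u, _, vt = np.linalg.svd(weight_matrix, full_matrices=False)
    return scale * (u @ vt)

def svd_init_conv(kernel, scale=1.0):
    """Apply svd_init to a conv tensor of shape (out, in, kh, kw) by
    flattening each kernel into a row of the (out, in*kh*kw) matrix."""
    out_ch = kernel.shape[0]
    flat = kernel.reshape(out_ch, -1)          # (out, in*kh*kw)
    return svd_init(flat, scale).reshape(kernel.shape)

w = svd_init(rng.standard_normal((64, 32)))
print(np.allclose(np.linalg.svd(w, compute_uv=False), 1.0))  # → True
```

With `scale=1.0` the result is exactly the orthogonal initialization of Saxe et al.; a value larger than one relaxes the constraint as described above.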
We assumed that the discriminator is a feedforward network built from linear operations and Lipschitz-continuous activations. However, GANs are predominantly used in image-based applications, which heavily rely on network architectures based on convolution operations. A convolution can be unrolled into a linear operation; while this procedure is correct from a theoretical perspective, the resulting matrix would be too large to train a complex network in reasonable time. To avoid this problem, we extend the procedure by constructing a matrix from the modes of a tensor. Let $W \in \mathbb{R}^{c_{\text{out}} \times c_{\text{in}} \times k_h \times k_w}$ be the tensor representing a discrete convolution with a filter size of $k_h \times k_w$, where $c_{\text{in}}$ denotes the number of filters of the previous layer and $c_{\text{out}}$ the number of output filters. Instead of unrolling the operation, we reshape the tensor into a matrix by flattening each kernel into a row vector of length $c_{\text{in}} k_h k_w$ and concatenating the resulting row vectors vertically into a matrix with dimensions $c_{\text{out}} \times c_{\text{in}} k_h k_w$. An exemplary illustration of this tensor reshaping is shown in Figure 9.

9 Proposed metric to compare generalization capabilities
In this section, we further elaborate on the design of the generalization score in our proposed metric. Let $\hat{W}_{i,j}$ be the Wasserstein distance estimated by the discriminator $D_i$ of the $i$-th Wasserstein GAN between unseen real samples and data generated using the generator $G_j$ of the $j$-th Wasserstein GAN. Furthermore, let $\hat{W}_{i,i}$ be the baseline estimate for the $i$-th Wasserstein GAN, which is computed using its own generator and the training dataset. We define the difference $\hat{W}_{i,j} - \hat{W}_{i,i}$ as a measure of the increase or decrease in generalization capability. However, to compare these differences across different indices $i$ and $j$, we have to ensure that the distances have the same scale. If the discriminator of a Wasserstein GAN has a Lipschitz constant of $k_i$, it estimates $k_i \cdot W$ instead of $W$, which implies that the differences for different $i$ could have different scales as well. To avoid such scaling problems, we define the generalization score as
$$s_{i,j} = \frac{\hat{W}_{i,j} - \hat{W}_{i,i}}{\hat{W}_{i,i}}, \tag{8}$$
where the influence of the positive constant $k_i$ is cancelled out.
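The scale invariance of this score can be checked numerically. In the sketch below, the matrix of pairwise distance estimates is made-up illustrative data, not results from the paper; scaling one discriminator's row by an arbitrary positive constant leaves its scores unchanged.

```python
# Sketch: computing the generalization scores of Eq. (8) from a matrix of
# pairwise Wasserstein estimates. The matrix values are made-up examples.
import numpy as np

def generalization_scores(w_hat):
    """w_hat[i, j]: distance estimated by discriminator i on generator j's
    samples; w_hat[i, i] is discriminator i's own baseline estimate."""
    baseline = np.diag(w_hat)[:, None]
    return (w_hat - baseline) / baseline

# Scaling discriminator i's estimates by a positive constant k_i leaves the
# scores unchanged, since k_i appears in numerator and denominator alike.
w = np.array([[1.0, 1.5],
              [2.0, 4.0]])
k = np.array([[3.0], [0.5]])          # per-discriminator Lipschitz constants
print(np.allclose(generalization_scores(w), generalization_scores(k * w)))  # → True
```

A positive off-diagonal score means the discriminator assigns a larger distance to the other model's samples than to its own baseline, i.e. it generalizes less well to them.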
10 Effect of mode preservation on image quality and loss behaviour
Figures 10, 11 and 12 demonstrate the effect of the improved mode preservation on the resulting image quality of our approach in comparison to WGAN-GP and WGAN-TTUR for different datasets. For all datasets, WGAN-GP generates samples that show significantly more distortions and artefacts than the other methods. While WGAN-TTUR is able to create more realistic samples than WGAN-GP, it still generates more artefacts than our approach, which is especially prevalent in Figures 10 and 12. In addition, we provide the loss characteristics in Figures 13 and 14. Note that both the generator loss and the discriminator loss converge when using the proposed training procedure, while the other algorithms can lead to a diverging generator loss.