
Virtual Adversarial Lipschitz Regularization

by   Dávid Terjék, et al.

Generative adversarial networks (GANs) are one of the most popular approaches when it comes to training generative models, among which variants of Wasserstein GANs are considered superior to the standard GAN formulation in terms of learning stability and sample quality. However, Wasserstein GANs require the critic to be K-Lipschitz, which is often enforced implicitly by penalizing the norm of its gradient, or by globally restricting its Lipschitz constant via weight normalization techniques. Training with a regularization term penalizing the violation of the Lipschitz constraint explicitly, instead of through the norm of the gradient, was found to be practically infeasible in most situations. With a novel generalization of Virtual Adversarial Training, called Virtual Adversarial Lipschitz Regularization, we show that using an explicit Lipschitz penalty is indeed viable and leads to state-of-the-art performance in terms of Inception Score and Fréchet Inception Distance when applied to Wasserstein GANs trained on CIFAR-10.



1 Introduction

In recent years, generative adversarial networks (GANs) Goodfellow et al. (2014) have become the state of the art in several generative modeling tasks, ranging from image generation Karras et al. (2018) to imitation learning Ho and Ermon (2016). They are based on the idea of a two-player game, in which a discriminator tries to distinguish between real and generated data samples, while a generator tries to fool the discriminator, learning to produce realistic samples in the long run. Wasserstein GAN (WGAN) was proposed as a solution to the issues present in the original GAN formulation. WGAN trains a critic to approximate the Wasserstein distance between the real and generated distributions. This introduced a new challenge, as WGAN requires the function space of the critic to consist only of 1-Lipschitz functions.

To enforce the Lipschitz constraint on the WGAN critic, Arjovsky et al. (2017) originally used weight clipping, which was soon replaced by the much more effective Gradient Penalty (GP) Gulrajani et al. (2017), which penalizes the deviation of the critic's gradient norm from 1 at certain input points. Since then, several variants of gradient norm penalization have been introduced Petzka et al. (2018); Wei et al. (2018); Adler and Lunz (2018); Zhou et al. (2019b). As an alternative, a weight normalization technique called Spectral Normalization (SN) Miyato et al. (2018) is a very efficient and simple method for enforcing a Lipschitz constraint on a per-layer basis, applicable to neural networks consisting of affine layers and K-Lipschitz activation functions.

Virtual Adversarial Training (VAT) Miyato et al. (2017) is a well-known semi-supervised learning method for regularizing neural networks. It is applied to improve the network's robustness against local perturbations of the input. Using an iterative method based on power iteration, it approximates the adversarial direction corresponding to certain input points. Perturbing an input towards its adversarial direction changes the network's output the most.
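To make the mechanism concrete, the following minimal sketch (not from the paper; the toy model, weights and hyperparameter values are illustrative) approximates the virtual adversarial direction of a small softmax classifier with one power-iteration step, using numerical gradients in place of the automatic differentiation a real implementation would use:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # KL divergence between two discrete distributions.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def virtual_adversarial_direction(predict, x, xi=1e-2, epsilon=0.5, seed=0):
    """One power-iteration approximation of the VAT perturbation.

    predict maps an input vector to a probability vector (the "virtual
    label" assigned by the network itself). The gradient of the KL
    divergence is taken numerically here purely for the sketch.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=x.shape)
    u /= np.linalg.norm(u)  # random unit vector
    p = predict(x)
    # Numerical gradient of r -> KL(p(x) || p(x + r)) at r = xi * u.
    g, h = np.zeros_like(x), 1e-5
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (kl(p, predict(x + xi * u + e))
                - kl(p, predict(x + xi * u - e))) / (2 * h)
    return epsilon * g / np.linalg.norm(g)

# Toy two-class model p(y|x) = softmax(W x); W is an arbitrary example.
W = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0]])
x = np.array([0.3, -0.2, 0.1])
r = virtual_adversarial_direction(lambda v: softmax(W @ v), x)
print(np.linalg.norm(r))  # ≈ 0.5, the prescribed radius epsilon
```

The output changes most along the difference of the two weight rows, and the recovered direction concentrates on the first coordinate accordingly.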

We propose a method called Virtual Adversarial Lipschitz Regularization (VALR) as a generalization of VAT that enables training neural networks with regularization terms penalizing the violation of the Lipschitz constraint explicitly, instead of through the norm of the gradient. VALR can be used with all kinds of activation functions and neural network layers. It provides a means to generate, for each input point, a pair for which the Lipschitz constraint is violated with high probability. In general, enforcing Lipschitz continuity of complex models can be useful for many applications. In this work, we focus on applying VALR to Wasserstein GANs, as regularizing or constraining Lipschitz continuity has proven to have a high impact on training stability and on reducing mode collapse.

Our contributions are as follows:

  • We derive VALR as a generalization of VAT.

  • We apply VALR to penalize the violation of the Lipschitz constraint directly, resulting in Virtual Adversarial Lipschitz Penalty (VALP).

  • Applying VALP on the critic in WGAN (WGAN-VALP), we show state-of-the-art performance in terms of Inception Score and Fréchet Inception Distance when trained on CIFAR-10.

2 Background

2.1 Wasserstein Generative Adversarial Networks

Generative adversarial networks (GANs) provide generative modeling by a generator network g that transforms samples of a low-dimensional latent space Z into samples from the data space X, transporting mass from a fixed noise distribution P_z to the generated distribution P_g. The generator is trained simultaneously with another network d called the discriminator, which is trained to distinguish between fake samples drawn from P_g and real samples drawn from the real distribution P_r, which is often represented by a fixed dataset. This network provides the learning signal to the generator, which is trained to generate samples that the discriminator considers real. This iterative process implements the minimax game

    min_g max_d E_{x∼P_r}[log d(x)] + E_{x̃∼P_g}[log(1 − d(x̃))]    (1)

played by the networks g and d. This training procedure minimizes an approximation of the Jensen-Shannon divergence (JSD) between P_g and P_r Goodfellow et al. (2014). However, during training these two distributions might differ strongly or even have non-overlapping supports, which might result in gradients received by the generator that are unstable or zero Arjovsky and Bottou (2017).

Wasserstein GAN (WGAN) Arjovsky et al. (2017) was proposed as a solution to this instability. Originating from Optimal Transport theory Villani (2008), the Wasserstein metric provides a distance between probability distributions with much better theoretical and practical properties than the JSD. It provides a smooth optimizable distance even if the two distributions have non-overlapping supports, which is not the case for the JSD. It raises a metric d from the space X of the supports of the probability distributions P_r and P_g to the space of the probability distributions themselves. For these purposes, the Wasserstein-p distance requires the probability distributions to be defined on a metric space (X, d) and is defined as

    W_p(P_r, P_g) = ( inf_{π ∈ Π(P_r, P_g)} E_{(x,y)∼π}[ d(x, y)^p ] )^{1/p},    (2)

where Π(P_r, P_g) is the set of distributions on the product space X × X whose marginals are P_r and P_g, respectively. The optimal π achieving the infimum in (2) is called the optimal coupling of P_r and P_g, and is denoted by π*. The case of p = 1 has an equivalent formulation

    W_1(P_r, P_g) = sup_{||f||_L ≤ 1} E_{x∼P_r}[f(x)] − E_{x̃∼P_g}[f(x̃)],    (3)

called the Kantorovich-Rubinstein formula Villani (2008), where f is called the potential function, the supremum is taken over the set of all functions f: X → R that are 1-Lipschitz with respect to the ground metric d, and the Wasserstein-1 distance corresponds to the supremum over all 1-Lipschitz potential functions. The smallest Lipschitz constant for a real-valued function f with the metric space (X, d) as its domain is given by

    ||f||_L = sup_{x ≠ y ∈ X} |f(x) − f(y)| / d(x, y).    (4)

Based on (3), the critic f in WGAN Arjovsky et al. (2017) implements an approximation of the Wasserstein-1 distance between P_g and P_r. The minimax game played by the critic f and the generator g becomes

    min_g max_{||f||_L ≤ 1} E_{x∼P_r}[f(x)] − E_{x̃∼P_g}[f(x̃)],    (5)

a formulation that proved to be superior to the standard GAN in practice, with substantially more stable training behaviour and improved sample quality. The challenge became effectively restricting the smallest Lipschitz constant of the critic f, sparking the birth of a plethora of Lipschitz regularization techniques for neural networks.
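For intuition (an aside, not from the paper): in one dimension, with d the absolute difference, the Wasserstein-1 distance between two empirical distributions of equal size reduces to matching sorted samples, which makes its "mass transport" interpretation easy to check:

```python
import numpy as np

def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two 1-D empirical distributions
    with equally many samples: the optimal coupling matches sorted
    samples, so the distance is the mean absolute difference."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    assert xs.shape == ys.shape
    return float(np.mean(np.abs(xs - ys)))

# Shifting a distribution by a constant c moves it a W_1 distance of c.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10000)
print(wasserstein1_1d(a, a + 3.0))  # 3.0
```

This smooth dependence on a pure translation is exactly what the JSD lacks when supports do not overlap.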

2.2 Lipschitz Function Approximation

A general definition of the smallest Lipschitz constant of a function f: X → Y is

    ||f||_L = sup_{x ≠ y ∈ X} d_Y(f(x), f(y)) / d_X(x, y),    (6)

where the metric spaces (X, d_X) and (Y, d_Y) are the domain and codomain of the function f, respectively. The function f is called Lipschitz continuous if there exists a real constant K for which d_Y(f(x), f(y)) ≤ K · d_X(x, y) for any x, y ∈ X. In this case, the function f is also called K-Lipschitz. Theoretical properties of K-Lipschitz neural networks with low values of K were explored in Oberman and Calder (2018), showing that training neural networks with Lipschitz constraints aids generalization and convergence.
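The quotient in this definition can be probed empirically; a minimal sketch (illustrative only, with d_X the Euclidean distance and d_Y the absolute difference) lower-bounds ||f||_L by maximizing the quotient over random pairs:

```python
import numpy as np

def lipschitz_quotient(f, x, y):
    """The quotient d_Y(f(x), f(y)) / d_X(x, y) from the definition of
    ||f||_L, here with Euclidean d_X and absolute-difference d_Y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return abs(f(x) - f(y)) / np.linalg.norm(x - y)

def empirical_lipschitz(f, dim, n_pairs=10000, seed=0):
    # A lower bound on ||f||_L: the supremum restricted to random pairs.
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_pairs):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        best = max(best, lipschitz_quotient(f, x, y))
    return best

# f(x) = sin(w . x) is ||w||-Lipschitz; with ||w|| = 1 it is 1-Lipschitz,
# so every quotient stays at or below 1.
w = np.array([0.6, 0.8])  # ||w|| = 1
f = lambda x: float(np.sin(w @ x))
print(empirical_lipschitz(f, dim=2))  # close to, but never above, 1
```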

Learning mappings with Lipschitz constraints became prevalent in the field of deep learning with the introduction of WGAN Arjovsky et al. (2017). Enforcing the Lipschitz property on the critic was first done by clipping the weights of the network Arjovsky et al. (2017). This approach achieved superior results compared to the standard GAN formulation, but still sometimes yielded poor quality samples or even failed to converge. While clipping the weights enforces a global Lipschitz constant, it also reduces the function space, which might no longer include the optimal critic. This method was soon replaced by a softened one called Gradient Penalty (GP) Gulrajani et al. (2017). Motivated by the fact that the optimal critic has unit gradient norm on lines connecting the coupled points of π* according to (2), they proposed a regularizer that enforces unit gradient norm along these lines, which not only enforces the Lipschitz constraint, but other properties of the optimal solution as well. However, π* is not known in practice, which is why Gulrajani et al. (2017) proposed to apply GP at samples of an induced distribution P_i, obtained by interpolating pairs of samples from the marginals P_r and P_g. The critic in the WGAN-GP formulation is regularized with the loss

    λ E_{x̂∼P_i}[ (||∇_x̂ f(x̂)||_2 − 1)^2 ],    (7)

where P_i denotes the distribution of samples obtained by interpolating pairs of samples drawn from P_r and P_g, and λ is a hyperparameter acting as a Lagrange multiplier.
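A minimal sketch of (7) follows, assuming a linear critic f(x) = w · x so that its gradient is available in closed form (an autodiff framework would supply `critic_grad` in practice; all names here are illustrative):

```python
import numpy as np

def gradient_penalty(critic_grad, real, fake, lam=10.0, seed=0):
    """WGAN-GP term: lam * E[(||grad f(x_hat)||_2 - 1)^2], with x_hat
    drawn uniformly on straight lines between paired real and fake
    samples."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(size=(real.shape[0], 1))
    x_hat = t * real + (1.0 - t) * fake  # interpolated samples from P_i
    norms = np.array([np.linalg.norm(critic_grad(x)) for x in x_hat])
    return lam * float(np.mean((norms - 1.0) ** 2))

# A linear critic f(x) = w . x has constant gradient w, so the penalty
# equals lam * (||w|| - 1)^2 regardless of where it is evaluated.
w = np.array([3.0, 4.0])                       # ||w|| = 5
real = np.zeros((8, 2)); fake = np.ones((8, 2))
print(gradient_penalty(lambda x: w, real, fake))  # 10 * (5 - 1)^2 = 160.0
```

Note that the penalty is two-sided: a critic with gradient norm below 1 is also pushed toward norm 1, which is the overconstraining discussed below.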

Theoretical arguments against GP were pointed out in Petzka et al. (2018), arguing that enforcing unit gradient norm on samples of the distribution P_i is not valid, as the pairs of samples being interpolated are generally not from the optimal coupling π*, and thus the critic does not necessarily need to have gradient norm 1 at the interpolates. Furthermore, they point out that the differentiability assumptions on the optimal critic are not met. Therefore, the regularizing effect of GP might be too strong. As a solution, they suggested a loss penalizing the violation of the Lipschitz constraint either explicitly with

    λ E_{x,y} [ max(0, |f(x) − f(y)| / d(x, y) − 1)^2 ],    (8)

or implicitly with

    λ E_{x̂∼P_τ} [ max(0, ||∇_x̂ f(x̂)||_2 − 1)^2 ],    (9)

where in both cases λ is a hyperparameter. The first method only proved viable on toy datasets, and led to considerably worse results on relatively more complex datasets like CIFAR-10, which is why Petzka et al. (2018) used the second one, which they termed Lipschitz Penalty (LP). Compared to GP, this term only penalizes the gradient norm when it exceeds 1. As the sampling distribution P_τ, they evaluated the interpolation method described above, and also random local perturbations of real and generated samples, but found no significant difference between the two. Wei et al. (2018) proposed dropout in the critic as a way of creating perturbed input pairs at which to evaluate the explicit Lipschitz penalty, which led to improvements, but still relied on using GP simultaneously. One of the strengths of the Wasserstein distance is that it can be defined with any metric d, a fact that Adler and Lunz (2018) built on by proposing Banach WGAN (BWGAN), which generalizes WGAN to separable Banach spaces. They resort to these spaces because in order to use GP, they need a tractable dual metric on the topological dual of the underlying space. This approach brought considerable improvements, and Adler and Lunz (2018) emphasized that through explicit Lipschitz penalties one could extend WGANs to general metric spaces as well. We hypothesize that the explicit Lipschitz penalty has been insufficient on its own because if one takes pairs of samples randomly from P_r, P_g or P_i (or just one sample, generating a pair for it by random perturbation), the violation of the Lipschitz constraint evaluated at these sample pairs will have high variance; hence, a more sophisticated strategy for sampling pairs is required.
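For concreteness, the explicit penalty (8) evaluated at given sample pairs can be sketched as follows (a minimal illustration under random pairing, which is exactly the setting the hypothesis above concerns; names are illustrative):

```python
import numpy as np

def explicit_lipschitz_penalty(f, xs, ys, K=1.0, lam=10.0):
    """Penalty (8): mean squared hinged violation of the K-Lipschitz
    constraint, evaluated at explicit sample pairs (x, y) with
    Euclidean ground metric."""
    quotients = np.array([abs(f(x) - f(y)) / np.linalg.norm(x - y)
                          for x, y in zip(xs, ys)])
    return lam * float(np.mean(np.maximum(0.0, quotients - K) ** 2))

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=(64, 3)), rng.normal(size=(64, 3))

# A 1-Lipschitz function incurs (essentially) no penalty ...
print(explicit_lipschitz_penalty(lambda v: float(np.sum(v)) / np.sqrt(3), xs, ys))
# ... while a 2-Lipschitz one is penalized on many random pairs.
print(explicit_lipschitz_penalty(lambda v: 2.0 * v[0], xs, ys) > 0.0)  # True
```

With random pairs, the realized quotients scatter widely below the true Lipschitz constant, which is one way to see the high variance of this estimator.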

Recently, Zhou et al. (2019b) argued that both GP and LP introduce superfluous constraints, altering the optimal critic and hence damaging the gradient that the generator receives. Their contribution is twofold. They introduced a maximum gradient penalty, penalizing only the maximum of the Lipschitz constraint violations instead of their mean so as not to overconstrain the critic, and employed the augmented Lagrangian method, widely used for strict constraint satisfaction in constrained optimization problems, to enforce the Lipschitz constraint more strictly. Writing v = max_{x̂} (||∇_x̂ f(x̂)||_2 − 1) for the maximal violation, they formulated the regularization term as

    λ v + (ρ/2) v^2,    (10)

where ρ is a hyperparameter and λ is the Lagrange multiplier, which is updated during training iteratively by the rule

    λ ← λ + ρ v.    (11)

They found that this formulation makes tuning the hyperparameter ρ easier and restricts the Lipschitz constant of the network more strictly, but observed no significant increase in sample quality compared to GP and LP.

A second family of Lipschitz regularization methods is based on weight normalization, restricting the Lipschitz constant of a network globally instead of only at points of the input space. One such technique is Spectral Normalization (SN), proposed in Miyato et al. (2018), a very efficient and simple method for enforcing a Lipschitz constraint with respect to the L2 norm on a per-layer basis, applicable to neural networks consisting of affine layers and K-Lipschitz activation functions. Gouk et al. (2018) proposed a similar approach, which can be used to enforce a Lipschitz constraint with respect to the L1 and L-infinity norms in addition to the L2 norm, while also being compatible with batch normalization and dropout. Anil et al. (2018) argued that any Lipschitz-constrained neural network must preserve the norm of the gradient during backpropagation, and to this end proposed another weight normalization technique, showing that it compares favorably to SN, along with an activation function based on sorting.
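The per-layer idea behind SN can be sketched in a few lines (a simplified illustration: SN as published keeps one persistent power-iteration step per training step, while here we iterate to convergence):

```python
import numpy as np

def spectral_norm_estimate(W, n_iter=50, seed=0):
    """Estimate the largest singular value of W by power iteration,
    alternating between the row and column spaces of W."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)  # sigma_max(W) at convergence

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 6))
sigma = spectral_norm_estimate(W)
W_sn = W / sigma  # the normalized affine layer is 1-Lipschitz w.r.t. L2
print(np.linalg.norm(W_sn, 2))  # 1.0 up to numerical precision
```

Dividing every affine layer by its spectral norm bounds the whole network's Lipschitz constant by the product of the activation functions' constants, which is the global restriction discussed above.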

2.3 Virtual Adversarial Training

VAT Miyato et al. (2017) is a semi-supervised learning method that regularizes networks to be robust to local adversarial perturbation. A virtual adversarial perturbation perturbs an input sample point in such a way that the change in the output of the network induced by the perturbation is maximal in terms of a distance between distributions. This defines a direction for each sample point called the virtual adversarial direction, in which the perturbation is performed. It is called virtual to make the distinction with the adversarial direction introduced in Goodfellow et al. (2015) clear, as VAT uses unlabeled data with virtual labels, assigned to the sample points by the network being trained. The regularization term of VAT is called Local Distributional Smoothness (LDS). It is defined as

    LDS(x) = D( p(·|x, θ), p(·|x + r_vadv, θ) ),    (12)

where p(·|x, θ) is a conditional distribution implemented by a neural network with parameters θ, D(p, p′) is a divergence between two distributions p and p′ (for which the KL divergence is a natural choice), and

    r_vadv = argmax_{||r||_2 ≤ ε} D( p(·|x, θ), p(·|x + r, θ) )    (13)

is the virtual adversarial direction, where ε is a hyperparameter. It is approximated by

    r_vadv ≈ ε g / ||g||_2,    (14)

where

    g = ∇_r D( p(·|x, θ), p(·|x + r, θ) ) |_{r=ξu}    (15)

represents one iteration of a more general iterative approximation scheme, u is a randomly sampled unit vector and ξ is another hyperparameter. In this work, we generalize the formulation of VAT to show that it is actually a special case of a more general Lipschitz regularization scheme, and show that in practice this renders viable the use of the explicit Lipschitz penalty previously explored in Petzka et al. (2018).

3 Virtual Adversarial Lipschitz Regularization

VALR is the method of adding a regularization term to the training objective that penalizes the violation of the Lipschitz constraint evaluated at sample pairs obtained by virtual adversarial perturbation. We call this term the Virtual Adversarial Lipschitz Penalty (VALP) and define it as

    VALP(f) = E_{x,ε} [ max(0, d_Y(f(x), f(x + r_vadv)) − K d_X(x, x + r_vadv)) ],    (16)

where

    r_vadv = argmax_{d_X(x, x+r) ≤ ε} d_Y( f(x), f(x + r) )    (17)

is the virtual adversarial direction, f is a neural network and K is the Lipschitz constant that we would like to enforce. d_X and d_Y are metrics on the domain and codomain of f, respectively, and ε is drawn uniformly from the interval [ε_min, ε_max], where ε_min and ε_max are hyperparameters.

VALP can be seen as a generalization of VAT if we disregard the fact that d_Y is a metric while its analogue in VAT is a divergence. To recover VAT from the above formula, substitute the divergence D for d_Y, the conditional distribution p(·|x, θ) for f, the L2 distance for d_X, and set K = 0 and ε_min = ε_max = ε. Substituting these into (16) and (17) results in (12) and (13), respectively.

To put it in words, VALP measures the deviation of f from being K-Lipschitz, evaluated at pairs of sample points where one is the virtual adversarial perturbation of the other. If added to the training objective, it makes the learned mapping approximately K-Lipschitz in an ε radius around the sample points it is applied at.
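Assuming for the moment that a routine producing the virtual adversarial perturbation is available (it is derived in the next subsection; here `perturb` is a stand-in and all values are illustrative), the penalty with Euclidean d_X and absolute-difference d_Y can be sketched as:

```python
import numpy as np

def valp(f, xs, perturb, K=1.0):
    """VALP with d_X the Euclidean distance and d_Y the absolute
    difference: the mean hinged violation of the K-Lipschitz constraint
    at the pairs (x, x + r_vadv)."""
    total = 0.0
    for x in xs:
        x_adv = x + perturb(x)
        total += max(0.0, abs(f(x) - f(x_adv))
                     - K * np.linalg.norm(x - x_adv))
    return total / len(xs)

# A 3-Lipschitz f perturbed along its steep direction violates K = 1,
# while K = 3 is satisfied and the penalty (numerically) vanishes.
f = lambda v: 3.0 * float(v[0])
perturb = lambda v: np.array([0.1, 0.0])  # illustrative fixed perturbation
xs = [np.array([0.0, 0.0]), np.array([1.0, -1.0])]
print(valp(f, xs, perturb, K=1.0) > 0.0)  # True
print(valp(f, xs, perturb, K=3.0) < 1e-12)  # True
```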

3.1 Approximation of r_vadv

Similarly to VAT, the virtual adversarial perturbation is approximated by

    r_vadv ≈ ε g / ||g||_X,    (18)

where

    g = ∇_r d_Y( f(x), f(x + r) ) |_{r=ξu}    (19)

is the approximated virtual adversarial direction and ||·||_X is the norm corresponding to the metric d_X. In the case where the chosen d_X does not define a norm this way, one has to construct a different method for ensuring d_X(x, x + r_vadv) = ε.

The derivation of this formula is essentially the same as the one described in Miyato et al. (2017), but is included here for completeness. We assume that f and d_Y are both twice differentiable with respect to their arguments almost everywhere, the latter specifically at pairs of identical arguments. Note that one can easily find a d_Y for which the last assumption does not hold, for example the L1 distance. If d_Y is translation invariant, meaning that d_Y(y + z, y′ + z) = d_Y(y, y′) for all y, y′ and z, then its subderivatives at pairs of identical arguments will be independent of the argument, hence the method described below will still work. Otherwise, one can resort to using a proxy metric in place of d_Y for the approximation, for example the L2 distance.

We denote d_Y(f(x), f(x + r)) by d(x, r) for simplicity. Because d(x, 0) = 0 and d(x, r) ≥ 0, it is easy to see that

    ∇_r d(x, r) |_{r=0} = 0,    (20)

so that the second-order Taylor approximation of d(x, r) is d(x, r) ≈ (1/2) rᵀ H(x) r, where H(x) = ∇∇_r d(x, r) |_{r=0} is the Hessian matrix. The eigenvector u of H(x) corresponding to its eigenvalue with the greatest absolute value is the direction of greatest curvature, which is approximately the adversarial direction that we are looking for. The power iteration Householder (1964) defined by

    u_{k+1} = H(x) u_k / ||H(x) u_k||_2,    (21)

where u_0 is a randomly sampled unit vector, converges to u if u_0 and u are not perpendicular. We use the Euclidean norm here, as this numerical algorithm is oblivious of the metric spaces (X, d_X) and (Y, d_Y). Calculating H(x) is computationally heavy, which is why H(x) u_k is approximated using the finite differences method as

    H(x) u_k ≈ ( ∇_r d(x, r) |_{r=ξu_k} − ∇_r d(x, r) |_{r=0} ) / ξ = ∇_r d(x, r) |_{r=ξu_k} / ξ,    (22)

where the equality follows from (20). The hyperparameter ξ is introduced here. In summary, the virtual adversarial direction is approximated by the iterative scheme

    u_{k+1} = ∇_r d(x, r) |_{r=ξu_k} / || ∇_r d(x, r) |_{r=ξu_k} ||_2,    (23)

of which one iteration is found to be sufficient in practice, which is why we presented the formulas corresponding to the one-iteration case above in (18) and (19).
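The one-iteration scheme can be sketched for a scalar-valued f with d_Y the absolute difference (a numerical illustration: the finite-difference gradient stands in for automatic differentiation, and the function and hyperparameter values are arbitrary):

```python
import numpy as np

def virtual_adversarial_perturbation(f, x, xi=1e-2, epsilon=1.0, seed=0):
    """One power-iteration step: g = grad_r |f(x + r) - f(x)| evaluated
    at r = xi * u for a random unit vector u, then r_vadv = eps * g / ||g||."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=x.shape)
    u /= np.linalg.norm(u)
    fx = f(x)
    d = lambda r: abs(f(x + r) - fx)  # d(x, r) from the derivation
    g, h = np.zeros_like(x), 1e-5
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (d(xi * u + e) - d(xi * u - e)) / (2 * h)  # central differences
    return epsilon * g / np.linalg.norm(g)

# f changes fastest along the first coordinate near x = 0, and the
# approximation recovers (roughly) that direction at radius epsilon.
f = lambda v: float(np.sin(5.0 * v[0]) + 0.1 * v[1])
r = virtual_adversarial_perturbation(f, np.array([0.0, 0.0]))
print(np.linalg.norm(r))      # 1.0, the prescribed radius
print(abs(r[0]) > abs(r[1]))  # True
```

For a scalar critic with this choice of d_Y, the recovered direction is essentially the (sign-ambiguous) steepest direction of f at x, which is exactly where a Lipschitz violation is most likely.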

3.2 Comparison with other Lipschitz regularization techniques

In terms of applicability and usage of metrics, VALR is, to the best of our knowledge, the most flexible Lipschitz regularization method. Theoretically, it can be used with any metrics d_X and d_Y, and any kind of model that is twice differentiable, but the approximation of r_vadv described above imposes a practical restriction. It searches for the adversarial perturbation around x in the L2 ball of radius ξ, which is why the topology induced by d_X has to be similar enough to the one induced by the L2 distance for VALR to be efficient. Additionally, the normalization in (18) might be difficult with certain metrics.

We experimented with WGAN variants in which we regularize the critic to be 1-Lipschitz with respect to different metrics, such as the Sobolev norms introduced in (Adler and Lunz, 2018), and also learned metrics such as the L2 distance on the activation values of the upper hidden layers of a pretrained Inception network Szegedy et al. (2015). The trainings did converge and the results were competitive, but still slightly underperformed the plain L2 setting. One can also easily think of metrics for which the approximation scheme would most probably fail, such as certain string metrics like the Levenshtein distance Deza and Deza (2009). More work is needed to explore the family of metrics for which VALR is efficient in its current form, to improve the approximation of the virtual adversarial direction so that it can handle a wider range of metrics, or to replace it entirely with a more flexible one.

In terms of efficiency, VALR compares favorably to the implicit methods penalizing the gradient norms, and to weight normalization techniques as well, as demonstrated in the experiments section. Adler and Lunz (2018) argued that penalizing the norm of the gradient as in (9) is more effective than penalizing the Lipschitz quotient directly as in (8), as the former regularizes f in all spatial directions around x̂, unlike the latter, which does so only along the direction of the sample pair. We argue that this is the exact reason why the explicit method works better when the samples at which (8) is evaluated are chosen appropriately, as the regularization effect in all spatial directions can result in f being overregularized. Weight normalization techniques such as SN (Miyato et al., 2018) can be prone to overregularization as well. Compared to the actual spectral norm of a weight matrix, its approximation can be either lower or higher; dividing the weights by a too-high estimate results in overregularization, which can even exclude the optimal critic from the hypothesis space during training. A too-low estimate could theoretically result in underregularization, but there is no evidence that this actually happens in practice. We argue that VALR with the explicit penalty (8) outperforms the implicit methods GP and LP, as well as weight normalization methods like SN, because it results in the softest form of regularization, producing a nonzero penalty only when, where and in which direction the Lipschitz constraint is actually violated.

Regarding performance, the approximation of r_vadv is about as computationally demanding as the evaluation of the gradient norms, resulting in a running speed for VALR similar to that of GP and LP, which also means that VALR cannot compete with SN on these grounds. However, one can trade a slight decrease in Inception Score and FID for speed by applying VALP to only a fraction of each minibatch; in the case of WGAN, regularizing half of each batch decreased performance only marginally.

See Table 1 for a summary of the comparison detailed above.

Method | Applicable models                                  | Usable metrics         | Optimal critic in hypothesis space | Speed
GP     | 2x differentiable                                  | Sobolev norms          | No                                 | Slower
LP     | 2x differentiable                                  | Sobolev norms          | Yes                                | Slower
SN     | Affine layers and K-Lipschitz activation functions | L2 distance            | No                                 | Fast
VALR   | 2x differentiable                                  | Any, with restrictions | Yes                                | Slower, but scalable

Table 1: Comparison of Lipschitz regularization techniques

3.3 WGAN-VALP

We specialize the VALP formula (16) with f being the critic, d_X the Euclidean distance, d_Y the absolute difference and K = 1, to arrive at a version of the explicit penalty described in Petzka et al. (2018) that uses virtual adversarial perturbations as its sampling strategy. It is formulated as

    VALP(f) = E_x [ max(0, |f(x) − f(x_adv(x))| − ||x − x_adv(x)||_2) ],    (24)

where we denote the virtual adversarial perturbation of x as x_adv(x) to emphasize that it is a function of x. We found it beneficial to use the augmented Lagrangian method described in Zhou et al. (2019b), albeit with lower values of ρ, and penalizing the mean of the Lipschitz constraint violations instead of their maximum. To sum up, we define the training objective of the critic in WGAN-VALP as

    L = E_{x̃∼P_g}[f(x̃)] − E_{x∼P_r}[f(x)] + λ VALP(f) + (ρ/2) VALP(f)^2,    (25)

where the expectation in VALP(f) is taken over a combination of the real and generated distributions, meaning that a sample x can come from both, and ρ and λ are, respectively, the hyperparameter and the Lagrange multiplier of the augmented Lagrangian method. λ is updated iteratively according to the rule

    λ ← λ + ρ VALP(f).    (26)

This formulation of WGAN results in a stable explicit Lipschitz penalty, overcoming the difficulties experienced when one tries to apply it to random sample pairs, and resulting in state-of-the-art sample quality when trained on CIFAR-10, as demonstrated below.
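Putting the pieces together, a schematic, framework-free sketch of the critic objective (25) with the multiplier update (26) might look as follows (the `perturb` routine stands in for the virtual adversarial perturbation, and all names and values are illustrative):

```python
import numpy as np

def critic_objective(f, real, fake, perturb, lam, rho=1.0):
    """Schematic WGAN-VALP critic loss: Wasserstein term plus the
    augmented-Lagrangian VALP term. In practice an autodiff framework
    would minimize this w.r.t. the critic's parameters."""
    wasserstein = (np.mean([f(x) for x in fake])
                   - np.mean([f(x) for x in real]))
    xs = list(real) + list(fake)  # samples can come from both distributions
    v = np.mean([max(0.0, abs(f(x) - f(x + perturb(x)))
                          - np.linalg.norm(perturb(x))) for x in xs])
    loss = wasserstein + lam * v + 0.5 * rho * v ** 2
    new_lam = lam + rho * v  # the update rule (26)
    return loss, new_lam

# Toy check with a 2-Lipschitz critic: the penalty is active and the
# Lagrange multiplier grows between iterations.
f = lambda x: 2.0 * float(x[0])
perturb = lambda x: np.array([0.05, 0.0])
real = [np.array([1.0, 0.0])]; fake = [np.array([-1.0, 0.0])]
loss, lam = critic_objective(f, real, fake, perturb, lam=0.0)
print(lam > 0.0)  # True: the Lipschitz constraint is violated
```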

4 Computational Results

To evaluate the performance of WGAN-VALP, we used the residual architecture from Gulrajani et al. (2017), with the number of channels in the generator layers doubled for a slight increase in performance. The implementation was done in TensorFlow, and the trainings were run on a single NVIDIA GTX 1080Ti GPU. Following Gulrajani et al. (2017), we used the Adam optimizer Kingma and Ba (2015) with the same parameters, with an initial learning rate decaying linearly to 0 over the course of training, performing several critic steps per generator step, with minibatches whose size was doubled for the generator. We used an exponential moving average (EMA) Yazıcı et al. (2019) of the weights to evaluate performance. We used (25) as the loss function to optimize the critic. The hyperparameters of the approximation of r_vadv were set following Miyato et al. (2017), with one power iteration. The only difference compared to their values is that they used a fixed ε, while for our purposes it is important to apply the penalty at different scales randomly; the lower end ε_min of the interval was an obvious choice, while the upper end ε_max was found by tuning. Both batches from P_r and P_g were used for regularization, but using half of them, selected randomly, decreased performance only marginally while considerably reducing the run time.

We monitored the Inception Score and FID at regular intervals during training, and evaluated them at the end of training on a larger number of samples. To measure the performance of WGAN-VALP, we ran the training setting described above 10 times, and calculated the mean, standard deviation and maximum of the final Inception Scores and FIDs, which we report for WGAN-VALP and other relevant GANs Gulrajani et al. (2017); Petzka et al. (2018); Zhou et al. (2019a); Wei et al. (2018); Miyato et al. (2018); Adler and Lunz (2018); Karras et al. (2018); Yazıcı et al. (2019) in Table 2. Competing variants reported either or both of the average and the best Inception Scores in the corresponding papers, which is why we report both. When trained on CIFAR-10, our model is state of the art both in terms of Inception Score and FID. We note that this is achieved without using any progressive growing techniques, which could possibly be combined with VALP to reach even higher performance. We show some generated samples in Figure 1.

Method              | Inception Score (Average) | Inception Score (Best) | FID
Progressive GAN     |                           |                        |
EMA Progressive GAN |                           |                        |
WGAN-VALP (ours)    |                           |                        |

Table 2: Inception Scores and FIDs on CIFAR-10
Figure 1: Generated CIFAR-10 samples

5 Conclusions

We have shown that VALR, derived as a generalization of VAT, is an efficient and powerful method for learning Lipschitz-constrained mappings implemented by neural networks. Already resulting in state-of-the-art performance when applied to the training of WGANs, VALR is a generally applicable regularization method for a potentially wide range of applications, providing more flexibility than other Lipschitz regularization methods. The growing interest in Lipschitz-constrained deep learning suggests an increasing demand for such methods in the future.