# Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

We generalize the concept of maximum-margin classifiers (MMCs) to arbitrary norms and non-linear functions. Support Vector Machines (SVMs) are a special case of MMC. We find that MMCs can be formulated as Integral Probability Metrics (IPMs) or classifiers with some form of gradient norm penalty. This implies a direct link to a class of Generative adversarial networks (GANs) which penalize a gradient norm. We show that the Discriminator in Wasserstein, Standard, Least-Squares, and Hinge GAN with Gradient Penalty is an MMC. We explain why maximizing a margin may be helpful in GANs. We hypothesize and confirm experimentally that L^∞-norm penalties with Hinge loss produce better GANs than L^2-norm penalties (based on common evaluation metrics). We derive the margins of Relativistic paired (Rp) and average (Ra) GANs.

## Code Repositories

### MaximumMarginGANs

Code for paper: "Support Vector Machines, Wasserstein's distance and gradient-penalty GANs maximize a margin"

## 1 Introduction

Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) are a very popular type of maximum-margin classifier (MMC). The margin can be conceptualized as the minimum distance between the decision boundary of the classifier and any data point. An SVM is a linear classifier which maximizes the minimum margin. A significant body of work has generalized SVMs beyond a simple linear classifier through the kernel trick (Aizerman, 1964). However, until very recently, SVMs had not been generalized to arbitrary norms with non-linear classifiers (e.g., neural networks). In this paper, we describe how to train MMCs (which generalize SVMs) through different approximations of the $L^p$-norm margin and we show that this results in loss functions with a gradient norm penalty.

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a very successful class of generative models. Their most common formulation involves a game played between two competing neural networks, the discriminator $D$ and the generator $G$. $D$ is a classifier trained to distinguish real from fake examples, while $G$ is trained to generate fake examples that will confuse $D$ into recognizing them as real. When the discriminator's objective is maximized, it yields the value of a specific divergence (i.e., a distance between probability distributions) between the distributions of real and fake examples. The generator then aims to minimize that divergence (although this interpretation is not perfect; see Jolicoeur-Martineau (2018a)). Importantly, many GANs apply some form of gradient norm penalty to the discriminator (Gulrajani et al., 2017; Fedus et al., 2017a; Mescheder et al., 2018; Karras et al., 2019). This penalty is motivated by a Wasserstein distance formulation in Gulrajani et al. (2017), or by numerical stability arguments (Mescheder et al., 2018; Karras et al., 2019).

In this paper, we show that discriminator loss functions that use a gradient penalty correspond to specific types of MMCs. Our contributions are the following:

1. We define the concept of expected margin maximization and show that Wasserstein-, Standard-, Least-Squares-, and Hinge-GANs can be derived from this framework.

2. We derive a new method from this framework, a GAN that penalizes gradient norm values above 1 (instead of penalizing all values unequal to 1 as done by Gulrajani et al. (2017)). We hypothesize and experimentally show that this method leads to better generated outputs.

3. We describe how margin maximization (and thereby gradient penalties) helps reduce vanishing gradients at fake (generated) samples, a known problem in many GANs.

4. We derive the margins of Relativistic paired and average GANs (Jolicoeur-Martineau, 2018b).

It is worth noting that Lim and Ye (2017) explore a similar connection between GANs and SVMs, which they use to propose Geometric GANs. The main difference from our work is that they assume a linear classifier working on the feature space of a neural network's output instead of the input space. Furthermore, that work does not exploit the duality theory of SVMs and therefore does not draw a connection to gradient penalty terms. Our work explores this new connection, which motivates an $L^\infty$-norm gradient penalty and shows great promise over the standard $L^2$-norm gradient penalty.

The paper is organized as follows. In Section 2, we review SVMs and GANs. In Section 3, we generalize the concept of maximum-margin classifiers (MMCs). In Section 4, we explain the connections between MMCs and GANs with gradient penalty. In Section 4.1, we mention that enforcing the 1-Lipschitz property is equivalent to assuming a bounded gradient; this implies that Wasserstein's distance can be approximated with an MMC formulation. In Section 4.2, we describe the benefits of using MMCs in GANs. In Section 4.3, we hypothesize that $L^1$-norm margins may lead to more robust classifiers. In Section 4.4, we derive margins for Relativistic paired and average GANs. Finally, in Section 5, we provide experiments to support the hypotheses in our contributions.

## 2 Review of SVMs and GANs

### 2.1 Notation

We focus on binary classifiers. Let $f$ be the classifier and $(x, y) \sim \mathbb{D}$ the distribution (of a dataset $D$) with $n$ data samples $x$ and labels $y$. As per SVM literature, $y = 1$ when $x$ is sampled from class 1 and $y = -1$ when $x$ is sampled from class 2. Furthermore, we denote $x_1 = x \mid (y = 1) \sim \mathbb{P}$ and $x_2 = x \mid (y = -1) \sim \mathbb{Q}$ as the data samples from class 1 and class 2 respectively (with distributions $\mathbb{P}$ and $\mathbb{Q}$). When discussing GANs, $x_1 \sim \mathbb{P}$ (class 1) refers to real data samples and $x_2 \sim \mathbb{Q}$ (class 2) refers to fake data samples (produced by the generator). The $L^\infty$-norm is defined as $\|x\|_\infty = \max(|x_{(1)}|, |x_{(2)}|, \ldots, |x_{(k)}|)$. Note that we sometimes refer to a function $F$; this is an objective function to be maximized (not to be confused with the classifier $f$).

The critic ($C$) is the discriminator ($D$) before applying any activation function (i.e., $D(x) = a(C(x))$, where $a$ is the activation function). For consistency with existing literature, we will generally refer to the critic rather than the discriminator.

### 2.2 SVMs

In this section, we explain how to obtain a linear maximum-margin classifier (MMC) which maximizes the minimum $L^2$-norm margin (i.e., an SVM).

#### 2.2.1 Decision boundary and margin

The decision boundary of a classifier is defined as the set of points $x_0$ such that $f(x_0) = 0$.

The margin is either defined as i) the minimum distance between a sample and the boundary, or ii) the minimum distance between the closest sample to the boundary and the boundary. The former thus corresponds to the margin of a sample $x$ and the latter corresponds to the margin of a dataset $D$. In order to disambiguate the two cases, we refer to the former as the margin and the latter as the minimum margin.

The first step towards obtaining a linear MMC is to define the $L^p$-norm margin:

$$\gamma(x) = \min_{x_0} \|x_0 - x\|_p \quad \text{s.t.} \quad f(x_0) = 0 \tag{1}$$

With a linear classifier (i.e., $f(x) = w^T x - b$) and $p = 2$, we have:

$$\gamma(x) = \frac{|w^T x - b|}{\|w\|_2} = \frac{\alpha(x)}{\beta}$$

Our goal is to maximize this margin, but we also want to obtain a classifier. To do so, we simply replace $\alpha(x) = |w^T x - b|$ by $\tilde{\alpha}(x, y) = y(w^T x - b)$. We call $\tilde{\alpha}$ the functional margin. After replacement, we obtain the geometric margin:

$$\tilde{\gamma}(x, y) = \frac{y(w^T x - b)}{\|w\|_2} = \frac{\tilde{\alpha}(x, y)}{\beta}$$

The specific goal of SVMs is to find a linear classifier which maximizes the minimum $L^2$-norm geometric margin (in each class):

$$\max_{w, b} \min_{(x, y) \in D} \tilde{\gamma}(x, y). \tag{2}$$

#### 2.2.2 Formulations

Directly solving equation (2) is an ill-posed problem for multiple reasons. Firstly, the numerator and denominator are dependent on one another; increasing the functional margin also increases the norm of the weights (and vice-versa). Thereby, there are infinite solutions which maximize the geometric margin. Secondly, maximizing means minimizing the denominator which can cause numerical issues (division by near zero). Thirdly, it makes for a very difficult optimization given the max-min formulation.

For these reasons, we generally prefer to i) constrain the numerator and minimize the denominator, or ii) constrain the denominator and maximize the numerator.

The classical approach is to minimize the denominator and constrain the numerator using the following formulation:

$$\min_{w, b} \|w\|_2^2 \quad \text{s.t.} \quad y(w^T x - b) \ge 1 \;\; \forall (x, y) \in D \tag{3}$$

This formulation corresponds to Hard-Margin SVM. The main limitation of this approach is that it only works when the data are separable. However, if we take the opposite approach of maximizing a function of $y(w^T x - b)$ and constraining the denominator $\|w\|_2$, we can still solve the problem with non-separable data. For this reason, we prefer solving the following Soft-Margin SVM:

$$\min_{w, b} \frac{1}{n} \sum_{(x, y) \in D} \left[\max(0, 1 - y(w^T x - b))\right] \quad \text{s.t.} \quad \|w\|_2 = 1$$

This can be rewritten equivalently with a KKT multiplier in the following way:

$$\min_{w, b} \frac{1}{n} \sum_{(x, y) \in D} \left[\max(0, 1 - y(w^T x - b))\right] + \lambda \left(\|w\|_2^2 - 1\right).$$

Note that the Hinge function $\max(0, 1 - y(w^T x - b))$ is simply a relaxation of the hard constraint $y(w^T x - b) \ge 1 \;\; \forall (x, y) \in D$. Thereby, we are not actually solving equation (2) anymore.
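As a rough numerical illustration of the quantities above, the following sketch evaluates the $L^2$ geometric margin and the penalized soft-margin objective for a fixed linear classifier. The toy data, the function names, and the value of `lam` are assumptions made for this example only; no optimization is performed.

```python
import numpy as np

def geometric_margins(w, b, X, y):
    """Signed L2 geometric margin y (w^T x - b) / ||w||_2 of every sample."""
    return y * (X @ w - b) / np.linalg.norm(w)

def soft_margin_objective(w, b, X, y, lam=1.0):
    """Mean hinge loss plus the multiplier term lam * (||w||_2^2 - 1)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))
    return hinge.mean() + lam * (w @ w - 1.0)

# Hypothetical separable toy data: class 1 on the right, class 2 on the left.
X = np.array([[2.0, 0.0], [2.0, 1.0], [-2.0, 0.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0

min_margin = geometric_margins(w, b, X, y).min()  # minimum margin of the dataset
objective = soft_margin_objective(w, b, X, y)     # hinge term and penalty both vanish here
rescaled_min_margin = geometric_margins(2.0 * w, b, X, y).min()
```

Rescaling `w` leaves the geometric margin unchanged while changing the functional margin, which is the scale ambiguity that motivates constraining either the numerator or the denominator.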

### 2.3 GANs

GANs can be formulated in the following way:

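$$\max_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{x_1 \sim \mathbb{P}}\left[f_1(C(x_1))\right] + \mathbb{E}_{z \sim \mathbb{Z}}\left[f_2(C(G(z)))\right], \tag{4}$$

$$\min_{G: \mathcal{Z} \to \mathcal{X}} \; \mathbb{E}_{z \sim \mathbb{Z}}\left[f_3(C(G(z)))\right], \tag{5}$$

where $f_1, f_2, f_3: \mathbb{R} \to \mathbb{R}$, $\mathbb{P}$ is the distribution of real data with support $\mathcal{X}$, $\mathbb{Z}$ is a multivariate normal distribution with support $\mathcal{Z}$, $C(x)$ is the critic evaluated at $x$, $G(z)$ is the generator evaluated at $z$, and $G(z) \sim \mathbb{Q}$, where $\mathbb{Q}$ is the distribution of fake data.

Many variants exist; to name a few: Standard GAN (SGAN) (Goodfellow et al., 2014) corresponds to $f_1(z) = \log(\text{sigmoid}(z))$, $f_2(z) = \log(\text{sigmoid}(-z))$, and $f_3(z) = -\log(\text{sigmoid}(z))$. Least-Squares GAN (LSGAN) (Mao et al., 2017) corresponds to $f_1(z) = -(1 - z)^2$, $f_2(z) = -(1 + z)^2$, and $f_3(z) = (1 - z)^2$. HingeGAN (Lim and Ye, 2017) corresponds to $f_1(z) = -\max(0, 1 - z)$, $f_2(z) = -\max(0, 1 + z)$, and $f_3(z) = -z$.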
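The loss triples $(f_1, f_2, f_3)$ of equations (4) and (5) can be sketched in a few lines. The definitions below follow the standard SGAN, LSGAN, and HingeGAN losses as they are usually written; the dictionary `GAN_LOSSES` and the two helper functions are names invented for this illustration, and the critic outputs are passed in as plain arrays rather than computed by a network.

```python
import numpy as np

def log_sigmoid(z):
    # log(sigmoid(z)) = -log(1 + exp(-z)), computed in a numerically stable way
    return -np.logaddexp(0.0, -z)

# (f1, f2, f3) for each variant; f1 and f2 are maximized by the critic,
# f3 is minimized by the generator.
GAN_LOSSES = {
    "SGAN": (lambda z: log_sigmoid(z),
             lambda z: log_sigmoid(-z),
             lambda z: -log_sigmoid(z)),
    "LSGAN": (lambda z: -(1.0 - z) ** 2,
              lambda z: -(1.0 + z) ** 2,
              lambda z: (1.0 - z) ** 2),
    "HingeGAN": (lambda z: -np.maximum(0.0, 1.0 - z),
                 lambda z: -np.maximum(0.0, 1.0 + z),
                 lambda z: -z),
}

def critic_objective(name, c_real, c_fake):
    """Monte Carlo estimate of equation (4) from critic outputs."""
    f1, f2, _ = GAN_LOSSES[name]
    return np.mean(f1(c_real)) + np.mean(f2(c_fake))

def generator_objective(name, c_fake):
    """Monte Carlo estimate of equation (5) from critic outputs on fakes."""
    _, _, f3 = GAN_LOSSES[name]
    return np.mean(f3(c_fake))
```

For instance, `critic_objective("HingeGAN", np.array([2.0]), np.array([-2.0]))` is zero because both samples already sit outside the unit functional margin.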

An important class of GANs are those based on Integral probability metrics (IPMs) (Müller, 1997). IPMs are statistical divergences (distances between probability distributions) defined in the following way:

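$$\text{IPM}_{\mathcal{F}}(\mathbb{P} \| \mathbb{Q}) = \sup_{C \in \mathcal{F}} \; \mathbb{E}_{x_1 \sim \mathbb{P}}\left[C(x_1)\right] - \mathbb{E}_{x_2 \sim \mathbb{Q}}\left[C(x_2)\right],$$

where $\mathcal{F}$ is a class of real-valued functions. Of note, certain connections between IPMs and SVMs have been identified in Sriperumbudur et al. (2009).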

IPM-based GANs attempt to solve the following problem

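$$\min_{G} \max_{C \in \mathcal{F}} \; \mathbb{E}_{x_1 \sim \mathbb{P}}\left[C(x_1)\right] - \mathbb{E}_{z \sim \mathbb{Z}}\left[C(G(z))\right].$$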

There are many GANs based on IPMs (Mroueh et al., 2017; Mroueh and Sercu, 2017), but we will focus on two of them: WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017).

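WGAN is an IPM-based GAN which uses the first-order Wasserstein's distance ($W_1$), the IPM restricted to the class of all 1-Lipschitz functions. This corresponds to the set of functions $C$ such that $\frac{C(x_1) - C(x_2)}{d(x_1, x_2)} \le 1$ for all $x_1$, $x_2$, where $d(x_1, x_2)$ is a metric. $W_1$ also has a primal form which can be written in the following way:

$$W_1(\mathbb{P}, \mathbb{Q}) := \inf_{\pi \in \Pi(\mathbb{P}, \mathbb{Q})} \int_{M \times M} d(x_1, x_2) \, d\pi(x_1, x_2),$$

where $\Pi(\mathbb{P}, \mathbb{Q})$ is the set of all distributions with marginals $\mathbb{P}$ and $\mathbb{Q}$, and we call $\pi$ a coupling.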

The original way to enforce the 1-Lipschitz property on the critic was to clamp its weights after each update. This was later shown to be problematic (Gulrajani et al., 2017). Albeit with its issues, WGAN improves the stability of GANs and reduces the incidence of mode collapse (when the generator produces less diversity than the training dataset) (Arjovsky et al., 2017).

Gulrajani et al. (2017) showed that if the optimal critic $C^*(x)$ is differentiable everywhere and $\hat{x} = \alpha x_1 + (1 - \alpha) x_2$ for $\alpha \in [0, 1]$, we have that $\|\nabla C^*(\hat{x})\|_2 = 1$ almost everywhere for all pairs $(x_1, x_2)$ which come from the optimal coupling $\pi^*$. Sampling from the optimal coupling is difficult, so they suggest softly penalizing $\mathbb{E}_{\tilde{x}}\left[(\|\nabla_{\tilde{x}} C(\tilde{x})\|_2 - 1)^2\right]$, where $\tilde{x} = \alpha x_1 + (1 - \alpha) x_2$, $\alpha \sim U(0, 1)$, $x_1 \sim \mathbb{P}$, and $x_2 \sim \mathbb{Q}$. This penalty works well in practice and is a popular way to approximate Wasserstein's distance. However, this is not equivalent to estimating Wasserstein's distance since we are not sampling from $\pi^*$ and $C^*$ does not need to be differentiable everywhere (Petzka et al., 2017).

Of importance, gradient norm penalties of the form $\mathbb{E}_{x}\left[(\|\nabla_x D(x)\|_2 - \delta)^2\right]$, for some $\delta \in \mathbb{R}$, are very popular in GANs. Remember that $D(x) = a(C(x))$; in the case of IPM-based GANs, we have that $D(x) = C(x)$. It has been shown that the GP-1 penalty ($\delta = 1$), as in WGAN-GP, also improves the performance of non-IPM-based GANs (Fedus et al., 2017b). Another successful variant is GP-0 ($\delta = 0$ and $x \sim \mathbb{P}$) (Mescheder et al., 2018; Karras et al., 2019). Although there are explanations as to why gradient penalties may be helpful (Mescheder et al., 2018; Kodali et al., 2017; Gulrajani et al., 2017), the theory is still lacking.
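To make the penalty concrete, here is a minimal sketch of a GP-$\delta$ estimate evaluated at random interpolates between real and fake batches. It assumes the input gradient of the critic is available as a function `grad_fn` (in practice this would come from automatic differentiation); the linear critic, the batch contents, and all names are hypothetical choices for the example.

```python
import numpy as np

def gradient_penalty(grad_fn, x_real, x_fake, delta=1.0, rng=None):
    """Estimate E[(||grad C(x_tilde)||_2 - delta)^2] at random interpolates."""
    rng = np.random.default_rng(0) if rng is None else rng
    # One interpolation coefficient alpha ~ U(0, 1) per pair of samples.
    alpha = rng.uniform(0.0, 1.0, size=(x_real.shape[0], 1))
    x_tilde = alpha * x_real + (1.0 - alpha) * x_fake
    grad_norm = np.linalg.norm(grad_fn(x_tilde), axis=1)
    return np.mean((grad_norm - delta) ** 2)

# Hypothetical linear critic C(x) = w^T x: its input gradient is w everywhere.
w = np.array([0.6, 0.8])  # ||w||_2 = 1
linear_grad = lambda x: np.tile(w, (x.shape[0], 1))

x_real = np.ones((4, 2))
x_fake = np.zeros((4, 2))
gp1 = gradient_penalty(linear_grad, x_real, x_fake, delta=1.0)  # GP-1 style
gp0 = gradient_penalty(linear_grad, x_real, x_fake, delta=0.0)  # delta = 0
```

With a unit-norm linear critic the $\delta = 1$ penalty vanishes while the $\delta = 0$ penalty equals one, which shows how the choice of $\delta$ changes what the critic is pushed towards.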

There are other GAN variants which improve the stability of training and will be relevant to our discussion. The first one is HingeGAN (Lim and Ye, 2017), which uses the Hinge loss as objective function. This corresponds to using equations (4) and (5) with $f_1(z) = -\max(0, 1 - z)$, $f_2(z) = -\max(0, 1 + z)$, and $f_3(z) = -z$.

Another class of GANs relevant to our discussion are Relativistic paired GANs (RpGANs) (Jolicoeur-Martineau, 2018b, 2019):

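$$\max_{C: \mathcal{X} \to \mathbb{R}} \; \mathbb{E}_{\substack{x_1 \sim \mathbb{P} \\ x_2 \sim \mathbb{Q}}}\left[f_1(C(x_1) - C(x_2))\right], \qquad \min_{G} \; \mathbb{E}_{\substack{x_1 \sim \mathbb{P} \\ x_2 \sim \mathbb{Q}}}\left[f_2(C(x_1) - C(x_2))\right],$$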

and Relativistic average GANs (RaGANs) (Jolicoeur-Martineau, 2018b, 2019):

 maxC:X→R maxG

Most loss functions can be represented as RaGANs or RpGANs; SGAN, LSGAN, and HingeGAN all have relativistic counterparts.

## 3 Generalizing SVMs

The main approach used to generalize SVMs beyond the linear classifier is to apply the kernel trick (Aizerman, 1964). This simply consists in replacing $f(x) = w^T x - b$ by $f(x) = w^T \phi(x) - b$, where $\phi(x)$ is a kernel. Kernels can be chosen a priori or learned (Goodfellow et al., 2016). In this section, we generalize SVMs to arbitrary classifiers $f(x)$, $L^p$-norms and loss functions. We start by showing how to derive an $L^p$-norm geometric margin. Then, we present the concept of maximizing the expected margin, rather than the minimum margin.

### 3.1 Approximating the geometric margin

Calculating the geometric margin involves computing a projection, which has no closed form for general norms. One way to approximate it is to use a Taylor expansion. Depending on when the Taylor expansion is applied (before or after solving for the projection), we get two different approximations: one new and one existing.

#### 3.1.1 Taylor approximation (After solving)

The formulation of the $L^p$-norm margin (1) has no closed form for arbitrary non-linear classifiers. However, when $p = 2$, if we use a Taylor's expansion, we can show that

$$\begin{aligned}
\gamma_2(x) &= \frac{|\nabla_{x_0} f(x_0)^T (x - x_0)|}{\|\nabla_{x_0} f(x_0)\|_2} \\
&\approx \frac{|f(x)|}{\|\nabla_{x_0} f(x_0)\|_2} && \text{(Taylor's expansion)} \\
&\approx \frac{|f(x)|}{\|\nabla_x f(x)\|_2} && \text{(if } f(x) \approx w^T x - b\text{)}
\end{aligned}$$

This approximation depends on approximate linearity of the classifier. If $f(x) = w^T x - b$, we have that $\|\nabla_x f(x)\|_2 = \|w\|_2$ (and vice-versa). This means that if we enforce $\|\nabla_x f(x)\|_2 \approx 1$ for all $x$, we have $\|\nabla_{x_0} f(x_0)\|_2 \approx 1$ for all points $x_0$ on the boundary. This may appear to bring us back to the original scenario with a linear classifier. However, in practice, we only penalize the gradient norm in expectation, which means that we do not obtain a linear classifier.

Thus, we can use the following pseudo-margin:

$$\gamma_2^+(x) = \frac{y f(x)}{\|\nabla_x f(x)\|_2}.$$

#### 3.1.2 Taylor approximation (Before solving)

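An alternative approach to derive a pseudo-margin is to use Taylor's approximation before solving the problem rather than after (as done by Matyasko and Chau (2017) and Elsayed et al. (2018)):

$$\begin{aligned}
\gamma_p(x) &= \min_{r} \|r\|_p \quad \text{s.t.} \quad f(x + r) = 0 \\
&\approx \min_{r} \|r\|_p \quad \text{s.t.} \quad f(x) + \nabla_x f(x)^T r = 0 \\
&= \frac{|f(x)|}{\|\nabla_x f(x)\|_q},
\end{aligned}$$

where $\|\cdot\|_q$ is the dual norm (Boyd and Vandenberghe, 2004) of $\|\cdot\|_p$. By Hölder's inequality (Hölder, 1889; Rogers, 1888), we have that $\frac{1}{p} + \frac{1}{q} = 1$. This means that if $p = 2$, we still get $q = 2$; if $p = \infty$, we get $q = 1$; if $p = 1$, we get $q = \infty$.

We can then define the geometric margin as:

$$\gamma_p^-(x) = \frac{y f(x)}{\|\nabla_x f(x)\|_q}.$$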
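The role of the dual norm can be checked on a linear classifier, for which the linearized margin $|f(x)| / \|\nabla_x f(x)\|_q$ is exact. The sketch below is only an illustration: the helper names and the example weights are made up, and the gradient is supplied by hand rather than computed automatically.

```python
import numpy as np

def dual_exponent(p):
    """Return q such that 1/p + 1/q = 1, with the conventions 1 <-> inf."""
    if p == 1:
        return np.inf
    if np.isinf(p):
        return 1.0
    return p / (p - 1.0)

def linearized_margin(f_x, grad_x, p):
    """L^p margin estimate |f(x)| / ||grad f(x)||_q, q being the dual of p."""
    q = dual_exponent(p)
    return abs(f_x) / np.linalg.norm(grad_x, ord=q)

# Hypothetical linear classifier f(x) = w^T x - b, so grad f(x) = w.
w, b = np.array([3.0, 4.0]), 0.0
x = np.array([1.0, 1.0])
f_x = w @ x - b  # = 7

margin_l2 = linearized_margin(f_x, w, p=2)         # divides by ||w||_2 = 5
margin_linf = linearized_margin(f_x, w, p=np.inf)  # divides by ||w||_1 = 7
margin_l1 = linearized_margin(f_x, w, p=1)         # divides by ||w||_inf = 4
```

The three values differ only through the norm applied to the gradient, which is why a choice of margin norm translates directly into a choice of gradient norm penalty.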

### 3.2 Maximizing the expected margin

As previously discussed, the goal of hard-margin SVMs is to maximize the minimum margin as in equation (2). However, this problem is infeasible in non-linearly separable datasets. In these cases, the soft-margin formulation of SVM is most common:

$$\max_{f} \; \mathbb{E}_{(x, y) \sim \mathbb{D}}\left[F(\gamma(x, y))\right], \tag{6}$$

where $F: \mathbb{R} \to \mathbb{R}$ is an objective to be maximized (not to be confused with the classifier $f$) and the expectation represents the empirical average over a sampled dataset $D$. For large datasets, the empirical average is a good approximation of the expectation over the data distribution, $\mathbb{D}$.

This is an easier optimization problem to solve compared to equation (2), and it is also always feasible. If $F$ is chosen to be the negative hinge function (i.e., $F(z) = -\max(0, 1 - z)$), we ignore samples far from the boundary (as in SVMs). For general choices of $F$, every sample may influence the solution. The identity function $F(z) = z$, cross entropy with sigmoid activation $F(z) = \log(\text{sigmoid}(z))$ and least-squares $F(z) = -(1 - z)^2$ are also valid choices.

However, as before, we prefer to separate the numerator from the denominator of the margin. Furthermore, the denominator (the norm of the gradient) is now a random variable. To make things as general as possible, we use the following formulation:

$$\max_{f} \; \mathbb{E}_{(x, y) \sim \mathbb{D}}\left[F(y f(x)) - \lambda g(\|\nabla_x f(x)\|_q)\right], \tag{7}$$

where $F, g: \mathbb{R} \to \mathbb{R}$ and $\lambda$ is a scalar penalty term. There are many potential choices of $F$ and $g$ which we can use.

The standard choice of $g$ (in SVMs) is $g(z) = z^2 - 1$. This corresponds to constraining $\|\nabla_x f(x)\| = 1$ or $\|\nabla_x f(x)\| \le 1$ for all $x$ (by KKT conditions). Since the gradient norm is a random variable, we do not want it to be equal to one everywhere. For this reason, we will generally work with softer constraints of the form $g(z) = (z - 1)^2$ or $g(z) = \max(0, z - 1)$. The first function enforces a soft equality constraint so that $z \approx 1$, while the second function enforces a soft inequality constraint so that $z \le 1$.
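The penalized objective in equation (7) can be sketched as a plain Monte Carlo estimate. The dictionaries `F_CHOICES` and `G_CHOICES`, the default `lam`, and the toy values below are assumptions for illustration; classifier outputs and input gradients are passed in as arrays rather than computed from a model.

```python
import numpy as np

# A few objectives F and gradient penalties g discussed in the text.
F_CHOICES = {
    "identity": lambda z: z,
    "hinge": lambda z: -np.maximum(0.0, 1.0 - z),
}
G_CHOICES = {
    "ls": lambda z: (z - 1.0) ** 2,               # soft equality: norm close to 1
    "hinge": lambda z: np.maximum(0.0, z - 1.0),  # soft inequality: norm at most 1
}

def expected_margin_objective(f_vals, grad_vals, y, F, g, lam=10.0, q=2):
    """Empirical version of E[F(y f(x)) - lam * g(||grad f(x)||_q)]."""
    grad_norm = np.linalg.norm(grad_vals, ord=q, axis=1)
    return np.mean(F(y * f_vals) - lam * g(grad_norm))

# Hypothetical classifier outputs and input gradients for two samples.
f_vals = np.array([2.0, -2.0])
y = np.array([1.0, -1.0])
grads = np.array([[0.5, 0.0], [2.0, 0.0]])

obj_hinge = expected_margin_objective(f_vals, grads, y, F_CHOICES["hinge"], G_CHOICES["hinge"])
obj_ls = expected_margin_objective(f_vals, grads, y, F_CHOICES["hinge"], G_CHOICES["ls"])
```

The hinge penalty leaves the sample with gradient norm 0.5 untouched, whereas the least-squares penalty also charges it for being below 1; this is the practical difference between the soft inequality and soft equality constraints.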

Of note, under perfect separation of the data and with a linear classifier, it has been shown that the empirical version of equation (7) (integrating over a dataset $D$ drawn from distribution $\mathbb{D}$) divided by its norm is equivalent to (2) under the constraint $\|w\| = 1$ when $\lambda \to 0$ (Rosset et al., 2004). This is true for cross-entropy and Hinge loss functions, but not least-squares. This implies that, under strong assumptions, maximizing the expected margin could also maximize the minimum margin.

### 3.3 Better approximation of the margin

In Section 3.1.1, we showed an approximation to the $L^2$-norm geometric margin. To reach a closed form, we had to assume that the classifier was approximately linear. This approximation is problematic since samples are pushed away from the boundary, so we may never minimize the gradient norm at the boundary (as needed to actually maximize the geometric margin).

Given that we separate the problem of estimating an MMC into maximizing a function of the numerator ($y f(x)$) and minimizing a function of the denominator (gradient norm), we do not need to make this approximation. Rather than finding the closest element of the decision boundary $x_0$ for a given sample $x$, we can simply apply the penalty on the decision boundary. However, working on the boundary is intractable given the infinite size of the decision boundary.

Although sampling from the decision boundary is difficult, sampling around it is easy. Rather than working on the decision boundary, we can instead apply the constraint in a bigger region encompassing all points of the decision boundary. A simple way to do so is to sample from all linear interpolations between samples from classes 1 and 2. This can be formulated as:

$$\max_{f} \; \mathbb{E}_{(x, y) \sim \mathbb{D}}\left[F(y f(x))\right] - \lambda \, \mathbb{E}_{\tilde{x}}\left[g(\|\nabla_{\tilde{x}} f(\tilde{x})\|_2)\right], \tag{8}$$

where $\tilde{x} = \alpha x_1 + (1 - \alpha) x_2$, $\alpha \sim U(0, 1)$, $x_1 \sim \mathbb{P}$, and $x_2 \sim \mathbb{Q}$. This is the same interpolation as used in WGAN-GP; this provides an additional argument in favor of this practice.

## 4 Connections to GANs

Let $f(x) = C(x)$. Although not immediately clear given the different notations, the objective functions of the discriminator/critic in many penalized GANs are equivalent to the ones from MMCs based on (8). If $g(z) = (z - 1)^2$, we have that $F(z) = z$ corresponds to WGAN-GP, $F(z) = \log(\text{sigmoid}(z))$ corresponds to SGAN, $F(z) = -(1 - z)^2$ corresponds to LSGAN, and $F(z) = -\max(0, 1 - z)$ corresponds to HingeGAN with gradient penalty. Thus, all of these penalized GANs maximize an expected $L^2$-norm margin.

### 4.1 Equivalence between gradient norm constraints and Lipschitz functions

As stated in Section 2.3, the popular approach of softly enforcing $\|\nabla_x f(x)\|_2 \approx 1$ at all interpolations between real and fake samples does not ensure that we estimate the Wasserstein distance ($W_1$). On the other hand, we show here that enforcing $\|\nabla_x f(x)\|_2 \le 1$ is sufficient in order to estimate $W_1$.

Assuming $d(x_1, x_2)$ is an $L^p$-norm with $p \ge 1$, and $f(x)$ is differentiable, we have that:

$$\|\nabla f(x)\|_p \le K \iff f \text{ is } K\text{-Lipschitz on } L^p.$$

See the appendix for the proof. Adler and Lunz (2018) showed a similar result on dual norms. This suggests that, in order to work on the set of Lipschitz functions, we can penalize violations of $\|\nabla_x f(x)\| \le 1$ for all $x$. This can be done easily through (7) by choosing $g(z) = \max(0, z - 1)$. Petzka et al. (2017) also suggested using a similar function (the square hinge) in order to only penalize gradient norms above 1.

If we let $F(z) = z$ and $g(z) = \max(0, z - 1)$, we have an IPM over all Lipschitz functions; thus, we effectively approximate $W_1$. This means that $W_1$ can be found through maximizing a geometric margin.
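The link between a bounded gradient norm and the Lipschitz property can be sanity-checked numerically for $p = 2$. The function below, $f(x) = \sin(w^T x)$ with $\|w\|_2 = 1$, is an arbitrary example chosen for this sketch (it is not taken from the paper); its gradient norm is $|\cos(w^T x)| \le 1$, so every difference quotient should also be at most 1.

```python
import numpy as np

# Hypothetical non-linear function with gradient norm bounded by 1.
w = np.array([0.6, 0.8])  # ||w||_2 = 1
f = lambda x: np.sin(x @ w)
grad_f = lambda x: np.cos(x @ w)[:, None] * w  # analytic input gradient

rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 2))
b = rng.normal(size=(1000, 2))

# Largest gradient norm seen over all sampled points.
max_grad_norm = np.linalg.norm(grad_f(np.vstack([a, b])), axis=1).max()

# Difference quotients |f(a) - f(b)| / ||a - b||_2 over random pairs.
lipschitz_ratio = np.abs(f(a) - f(b)) / np.linalg.norm(a - b, axis=1)
```

Both `max_grad_norm` and `lipschitz_ratio.max()` stay below 1 here, in line with the equivalence stated above; a random check of this kind illustrates the statement but does not prove it.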

Importantly, most successful GANs (Brock et al., 2018; Karras et al., 2019, 2017) either enforce the 1-Lipschitz property using Spectral normalization (Miyato et al., 2018) or use some form of gradient norm penalty (Gulrajani et al., 2017; Mescheder et al., 2018). Since 1-Lipschitz is equivalent to enforcing a gradient norm constraint (as shown above), we have that most successful GANs effectively train a discriminator/critic to maximize a geometric margin. This suggests that the key contributor to stable and effective GAN training may not be having a 1-Lipschitz discriminator, but may be maximizing a geometric margin.

### 4.2 Why do maximum-margin classifiers make good GAN discriminators/critics?

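To answer this question, we focus on a simple two-dimensional example where $x = (x_{(1)}, x_{(2)})$. Let real data (class 1) be uniformly distributed on the line between $(1, -1)$ and $(1, 1)$. Let fake data (class 2) be uniformly distributed on the line between $(-1, -1)$ and $(-1, 1)$. This is represented by Figure 1. Clearly, the maximum-margin boundary is the line $x_{(1)} = 0$ and any classifier should learn to ignore $x_{(2)}$. For a classifier of the form $f(x) = c + w_1 x_{(1)} + w_2 x_{(2)}$, the maximum-margin classifier is $f^*(x) = w_1 x_{(1)}$ for any choice of $w_1 > 0$. We can see this by looking at the expected geometric margin:

$$\mathbb{E}_{(x, y) \sim \mathbb{D}}\left[\gamma(x, y)\right] = \mathbb{E}_{(x, y) \sim \mathbb{D}}\left[\frac{y w_1 x_{(1)}}{|w_1|}\right] = \frac{2 w_1 x_{(1)}}{|w_1|} = \begin{cases} 2 & \text{if } w_1 > 0 \\ -2 & \text{if } w_1 < 0. \end{cases}$$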

This means that the problem is overparameterized (there are infinitely many solutions). We will show that this is problematic in GANs.

In GANs, the dynamics of the game depend in great part on $\|\nabla_{x_2} f(x_2)\|$, where the $x_2$'s are samples from the fake, or generated, distribution (not to be confused with the coordinate $x_{(2)}$; see Section 2.1 for the definition). This is because the generator only learns through the discriminator/critic and it uses $\nabla_{x_2} f(x_2)$ in order to improve its objective function. Thus, for stable training, $\|\nabla_{x_2} f(x_2)\|$ should not be too big or too small. There are two ways of ensuring this property in this example: we can either i) fix the gradient norm to 1 or ii) fix $y w_1 x_{(1)} = 1$. Both solutions lead to $w_1 = 1$. The former is the approach taken by soft-margin SVMs and the latter is the approach taken by hard-margin SVMs.

This means that, in order to get stable GAN training, maximizing a margin is not enough. We need to ensure that we obtain a solution with a stable non-zero gradient around fake samples. Thus, it is preferable to solve the penalized formulation from equation (8) and choose a large penalty term in order to obtain a small-gradient solution.

When the gradient norm is 1 everywhere, the only solution is a linear classifier which leads to the gradient being fixed everywhere. In this case, the placement of the margin may not be particularly important for GAN stability since the gradient is the same everywhere (although we do still obtain an MMC).

When we have a non-linear classifier and we impose $\|\nabla_x f(x)\|_2 \le 1$ through $g(z) = \max(0, z - 1)$, the gradient norm will fade toward zero as we move away from the boundary. Thus, in this case, obtaining a maximum-margin solution is important because it reduces the risk of vanishing gradients at fake samples. To see this, we can consider our simple example, but assume $f(x) = \text{sigmoid}(w_1 x_{(1)} + w_0) - \frac{1}{2}$ (see Figure 2).

We can enforce $\|\nabla_x f(x)\|_2 \le 1$ by choosing $w_1 \le 4$. We let $w_1 = 4$ because it leads to the best classifier. The maximum-margin boundary is at $x_{(1)} = 0$ (which we get by taking $w_0 = 0$; blue curve in Figure 2); for this choice, we have that $f(x_1) = 0.48$ and $f(x_2) = -0.48$ for real and fake samples respectively. Meanwhile, if we take a slightly worse margin with boundary at $x_{(1)} = \frac{1}{4}$ (equivalent to choosing $w_0 = -1$; red curve in Figure 2), we have that $f(x_1) = 0.45$ and $f(x_2) = -0.49$ for real and fake samples respectively. Thus, both solutions almost perfectly classify the samples. However, the optimal margin has gradient $0.07$, while the worse margin has gradient $0.03$ at fake samples. Thus, the maximum-margin solution provides a stronger signal for the generator. Had we not imposed a gradient penalty constraint, we could have chosen $w_1 = 8$ (green curve in Figure 2) and we would have ended up with vanishing gradients at fake samples while still using a maximum-margin classifier.
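The effect described in this example can be reproduced with a few lines of arithmetic. The sketch below takes a sigmoid critic of the form $f(x) = \text{sigmoid}(w_1 x_{(1)} + w_0) - \frac{1}{2}$, with real samples at $x_{(1)} = 1$ and fake samples at $x_{(1)} = -1$; the three parameter settings are illustrative choices (a centered boundary, a shifted boundary, and a steeper critic without gradient constraint), and all names are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def critic(x1, w1, w0):
    """Critic value f(x) = sigmoid(w1 * x1 + w0) - 1/2 (depends on x_(1) only)."""
    return sigmoid(w1 * x1 + w0) - 0.5

def critic_grad(x1, w1, w0):
    """Derivative of the critic with respect to x_(1)."""
    s = sigmoid(w1 * x1 + w0)
    return w1 * s * (1.0 - s)

X_REAL, X_FAKE = 1.0, -1.0
settings = {
    "max-margin (w1=4, w0=0)": (4.0, 0.0),
    "shifted boundary (w1=4, w0=-1)": (4.0, -1.0),
    "no gradient constraint (w1=8, w0=0)": (8.0, 0.0),
}
for name, (w1, w0) in settings.items():
    print(name,
          "f(real)=%.2f" % critic(X_REAL, w1, w0),
          "f(fake)=%.2f" % critic(X_FAKE, w1, w0),
          "grad at fake=%.3f" % critic_grad(X_FAKE, w1, w0))
```

The gradient of this critic peaks at $w_1 / 4$ on the boundary, so $w_1 \le 4$ keeps it at most 1. Moving the boundary away from the midpoint, or making the critic steeper, shrinks the gradient that reaches the fake samples.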

In summary, imposing $\|\nabla_x f(x)\|_2 \approx 1$, as done in WGAN-GP, may be helpful because it approximately fixes the gradient norm to 1 everywhere, which stabilizes the generator's training. However, it imposes a strong constraint (linear discriminator) which only leads to a lower bound on Wasserstein's distance. Meanwhile, imposing $\|\nabla_x f(x)\|_2 \le 1$, as we suggested to properly estimate Wasserstein's distance, may be helpful because it reduces the risk of having no gradient at fake samples.

### 4.3 Are certain margins better than others?

It is well known that $L^p$-norms (with $p \ge 1$) are more sensitive to outliers as $p$ increases, which is why many robust methods minimize the $L^1$-norm (Bloomfield and Steiger, 1983). Furthermore, minimizing the $L^1$-norm loss results in a median estimator (Bloomfield and Steiger, 1983). This suggests that penalizing the $L^2$ gradient norm ($p = 2$) may not lead to the most robust classifier. We hypothesize that $L^\infty$ gradient norm penalties may improve robustness in comparison to $L^2$ gradient norm penalties since they correspond to maximizing the $L^1$-norm margin. In Section 5 we provide experimental evidence in support of our hypothesis.

### 4.4 Margins in Relativistic GANs

Relativistic paired GANs (RpGANs) and Relativistic average GANs (RaGAN) (Jolicoeur-Martineau, 2018b) are GAN variants which tend to be more stable than their non-relativistic counterparts. Below, we explain how we can link both approaches to MMCs.

#### 4.4.1 Relativistic average GANs

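From the loss function of RaGAN, we can deduce its decision boundary. Contrary to typical classifiers, we define two boundaries, depending on the label. The two surfaces are defined as two sets of points $(x_0, y_0)$ such that:

$$\begin{aligned}
f(x_0) &= \mathbb{E}_{x \sim \mathbb{Q}}\left[f(x)\right], && \text{when } y_0 = 1 \text{ (real)} \\
f(x_0) &= \mathbb{E}_{x \sim \mathbb{P}}\left[f(x)\right], && \text{when } y_0 = -1 \text{ (fake)}
\end{aligned}$$

It can be shown that the relativistic average geometric margin is approximated as:

$$\gamma_p^{Ra-}(x, y) = \frac{((y + 1)/2)\left(f(x) - \mathbb{E}_{x \sim \mathbb{Q}}[f(x)]\right)}{\|\nabla_x f(x)\|_q} + \frac{((y - 1)/2)\left(f(x) - \mathbb{E}_{x \sim \mathbb{P}}[f(x)]\right)}{\|\nabla_x f(x)\|_q} = \frac{\alpha_{Ra}(x, y)}{\beta(x)}.$$

Maximizing the relativistic average margin of RaGANs can be done in the following way:

$$\max_{f} \; \mathbb{E}_{(x, y) \sim \mathbb{D}}\left[F(\alpha_{Ra}(x, y)) - \lambda g(\|\nabla_x f(x)\|_q)\right].$$

Of note, when $F(z) = z$ (identity function), the loss is equivalent to its non-relativistic counterpart (WGAN-GP). Of all the RaGAN variants presented by Jolicoeur-Martineau (2018b), RaHingeGAN with $F(z) = -\max(0, 1 - z)$ is the only one which maximizes the relativistic average geometric margin when using a gradient norm penalty.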
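The relativistic average functional margin $\alpha_{Ra}$ can be written out per class from the margin formula above: for $y = 1$ only the first term survives, and for $y = -1$ the factor $(y - 1)/2 = -1$ flips the sign of the second. The sketch below estimates the two expectations with batch means; the function names and the hinge choice of $F$ are assumptions for this illustration, and the gradient penalty term is left out.

```python
import numpy as np

def relativistic_average_alpha(f_real, f_fake):
    """Functional margins alpha_Ra for real (y = 1) and fake (y = -1) samples."""
    alpha_real = f_real - f_fake.mean()  # f(x) - E_Q[f(x)]
    alpha_fake = f_real.mean() - f_fake  # -(f(x) - E_P[f(x)])
    return alpha_real, alpha_fake

def ra_hinge_term(f_real, f_fake):
    """Empirical E[F(alpha_Ra)] with the negative hinge F(z) = -max(0, 1 - z)."""
    F = lambda z: -np.maximum(0.0, 1.0 - z)
    alpha_real, alpha_fake = relativistic_average_alpha(f_real, f_fake)
    return np.mean(F(np.concatenate([alpha_real, alpha_fake])))

# Hypothetical critic outputs on a small batch.
f_real = np.array([2.0, 4.0])
f_fake = np.array([-1.0, -3.0])
alpha_real, alpha_fake = relativistic_average_alpha(f_real, f_fake)
well_separated = ra_hinge_term(f_real, f_fake)
too_close = ra_hinge_term(np.array([0.5]), np.array([0.0]))
```

When the real and fake critic outputs are far apart relative to each other's mean, the hinge term saturates at zero regardless of their absolute level, which is the sense in which the margin is relativistic.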

#### 4.4.2 Relativistic paired GANs

From its loss function (as described in Section 2.3), it is not clear what the boundary of RpGANs can be. However, through reverse engineering, it is possible to realize that the boundary is the same as the one from non-relativistic GANs, but using a different margin. We previously derived that the approximated margin (non-geometric) for any point is $\gamma_p(x) = \frac{|f(x)|}{\|\nabla_x f(x)\|_q}$. We define the geometric margin as the margin after replacing $|f(x)|$ by $y f(x)$ so that it depends on both $x$ and $y$. However, there is an alternative way to transform the margin in order to achieve a classifier. We call it the relativistic paired margin:

$$\gamma_p^*(x_1, x_2) = \gamma_p(x_1) - \gamma_p(x_2) = \frac{f(x_1)}{\|\nabla_{x_1} f(x_1)\|_q} - \frac{f(x_2)}{\|\nabla_{x_2} f(x_2)\|_q},$$

where $x_1$ is a sample from $\mathbb{P}$ and $x_2$ is a sample from $\mathbb{Q}$. This alternate margin does not depend on the label $y$, but only asks that, for any pair of class 1 (real) and class 2 (fake) samples, we maximize the relativistic paired margin. This margin is hard to work with, but if we enforce $\|\nabla_{x_1} f(x_1)\|_q \approx \|\nabla_{x_2} f(x_2)\|_q$ for all $x_1 \sim \mathbb{P}$, $x_2 \sim \mathbb{Q}$, we have that:

$$\gamma_p^*(x_1, x_2) \approx \frac{f(x_1) - f(x_2)}{\|\nabla_x f(x)\|_q},$$

where $x$ is any sample (from class 1 or 2).

Thus, we can train an MMC to maximize the relativistic paired margin in the following way:

$$\max_{f} \; \mathbb{E}_{\substack{x_1 \sim \mathbb{P} \\ x_2 \sim \mathbb{Q}}}\left[F(f(x_1) - f(x_2))\right] - \lambda \, \mathbb{E}_{(x, y) \sim \mathbb{D}}\left[g(\|\nabla_x f(x)\|_q)\right],$$

where $g$ must constrain $\|\nabla_x f(x)\|_q$ to a constant.

This means that maximizing $F(f(x_1) - f(x_2))$ without gradient penalty can be problematic if we have different gradient norms at samples from class 1 (real) and 2 (fake). This provides an explanation as to why RpGANs do not perform very well unless using a gradient penalty (Jolicoeur-Martineau, 2018b).
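A small numerical sketch of the penalized relativistic paired objective is given below. It pairs real and fake samples one to one, applies $F$ to the difference of critic outputs, and applies $g$ to the gradient norms at both real and fake samples. The function name, the hinge choice of $F$, the least-squares choice of $g$, and the default `lam` are assumptions for the example, and gradients are supplied as arrays rather than computed from a network.

```python
import numpy as np

def relativistic_paired_objective(f_real, f_fake, grad_real, grad_fake,
                                  lam=10.0, q=2):
    """Empirical E[F(f(x1) - f(x2))] - lam * E[g(||grad f(x)||_q)]."""
    F = lambda z: -np.maximum(0.0, 1.0 - z)  # negative hinge
    g = lambda z: (z - 1.0) ** 2             # pushes every gradient norm to 1
    paired_term = np.mean(F(f_real - f_fake))
    grads = np.vstack([grad_real, grad_fake])  # penalty over both classes
    penalty = np.mean(g(np.linalg.norm(grads, ord=q, axis=1)))
    return paired_term - lam * penalty

# Hypothetical critic outputs for one real/fake pair.
f_real, f_fake = np.array([1.5]), np.array([-0.5])

# Equal gradient norms at real and fake samples: no penalty is incurred.
balanced = relativistic_paired_objective(
    f_real, f_fake, np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))

# Gradient norm of 2 at the fake sample: the penalty term becomes active.
unbalanced = relativistic_paired_objective(
    f_real, f_fake, np.array([[1.0, 0.0]]), np.array([[0.0, 2.0]]))
```

The paired term is identical in both cases; only the penalty distinguishes them, which reflects the requirement that $g$ hold the gradient norm to a common constant at real and fake samples.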

## 5 Experiments

Following our analysis and discussion in the previous sections, we hypothesized that $L^1$ margins, corresponding to an $L^\infty$ gradient norm penalty, would perform better than $L^2$ margins ($L^2$ gradient norm penalty). As far as we know, researchers have not yet tried using an $L^\infty$ gradient norm penalty in GANs. In addition, we showed that it would be more sensible to penalize violations of $\|\nabla_x f(x)\|_q \le 1$ rather than $\|\nabla_x f(x)\|_q \approx 1$.

To test these hypotheses, we ran experiments on CIFAR-10 (Krizhevsky et al., 2009) using HingeGAN ($F(z) = -\max(0, 1 - z)$) and WGAN ($F(z) = z$) loss functions with $L^1$, $L^2$, and $L^\infty$ gradient norm penalties. We enforce either $\|\nabla_x f(x)\|_q \approx 1$ using Least Squares (LS), $g(z) = (z - 1)^2$, or $\|\nabla_x f(x)\|_q \le 1$ using Hinge, $g(z) = \max(0, z - 1)$. We used the ADAM optimizer (Kingma and Ba, 2014) and a DCGAN architecture (Radford et al., 2015). As per convention, we report the Fréchet Inception Distance (FID) (Heusel et al., 2017); lower values correspond to better generated outputs (higher quality and diversity). We ran all experiments using seed 1 and with the same gradient penalty coefficient $\lambda$. Details on the architectures are in the Appendix. Code is available at https://github.com/AlexiaJM/MaximumMarginGANs. The results are shown in Table 1.

Due to space constraints, we only show the previously stated experiments in Table 1. However, we also ran additional experiments on CIFAR-10 with 1) Relativistic paired and average HingeGAN, 2) , 3) the standard CNN architecture from Miyato et al. (2018). Furthermore, we ran experiments on CAT (Zhang et al., 2008) with 1) Standard CNN (in 32x32), and 2) DCGAN (in 64x64). These experiments correspond to Tables 2, 3, 4, 5, and 6 in the appendix.

In all sets of experiments, we generally observed that we obtain smaller FIDs by using: i) a larger $q$ (as theorized), ii) the Hinge penalty to enforce an inequality gradient norm constraint (in both WGAN and HingeGAN), and iii) HingeGAN instead of WGAN.

## 6 Conclusion

This work provides a framework in which to derive MMCs that result in very effective GAN loss functions. In the future, this could be used to derive new gradient norm penalties which further improve the performance of GANs. Rather than trying to devise better ways of enforcing the 1-Lipschitz property, researchers may instead want to focus on constructing better MMCs (possibly by devising better margins).

This research shows a strong link between GANs with gradient penalties, Wasserstein's distance, and SVMs. Maximizing the minimum $L^2$-norm geometric margin, as done in SVMs, has been shown to lower bounds on the VC dimension, which implies lower generalization error (Vapnik and Vapnik, 1998; Mount, 2015). This paper may help researchers bridge the gap needed to derive PAC bounds on Wasserstein's distance and GANs/IPMs with gradient penalty. Furthermore, it may be of interest to theoreticians whether certain margins lead to lower bounds on the VC dimension.

## 7 Acknowledgements

This work was supported by Borealis AI through the Borealis AI Global Fellowship Award. We would also like to thank Compute Canada and Calcul Québec for the GPUs which were used in this work. This work was also partially supported by the FRQNT new researcher program (2019-NC-257943), the NSERC Discovery grant (RGPIN-2019-06512), a startup grant by IVADO, a grant by Microsoft Research and a Canada CIFAR AI chair.

## Appendices

Note that the smooth maximum is defined as

$$\text{smax}(x_{(1)}, \ldots, x_{(k)}) = \frac{\sum_{i=1}^{k} x_{(i)} e^{x_{(i)}}}{\sum_{i=1}^{k} e^{x_{(i)}}}.$$

We sometimes use the smooth maximum as a smooth alternative to the $L^\infty$-norm margin; results are worse with it.
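The smooth maximum defined above is a softmax-weighted average of its inputs. A minimal sketch follows; the function name and the max-shift used for numerical stability are choices made for this example (the shift cancels between numerator and denominator, so the value matches the formula).

```python
import numpy as np

def smooth_max(x):
    """Softmax-weighted average sum_i x_i e^{x_i} / sum_i e^{x_i}."""
    x = np.asarray(x, dtype=float)
    weights = np.exp(x - x.max())  # shifting by max(x) avoids overflow
    return float(np.sum(x * weights) / np.sum(weights))
```

The result never exceeds the true maximum and approaches it when one entry dominates, for example `smooth_max([0.0, 100.0])` is essentially 100, while equal entries are returned unchanged.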

### B Proofs

Note that both of the following formulations represent the margin:

$$\gamma(x) = \min_{x_0} \|x_0 - x\| \;\; \text{s.t.} \;\; f(x_0) = 0 \;\; = \;\; \min_{r} \|r\| \;\; \text{s.t.} \;\; f(x + r) = 0$$

#### B.1 Bounded Gradient ⟺ Lipschitz

Assume that $f: \mathcal{X} \to \mathbb{R}$ and $\mathcal{X}$ is a convex set.

Let , where be the interpolation between any two points