Summable Reparameterizations of Wasserstein Critics in the One-Dimensional Setting

09/19/2017 ∙ Christopher Grimm et al. ∙ Brown University, University of Michigan, Beihang University

Generative adversarial networks (GANs) are an exciting alternative to algorithms for solving density estimation problems---using data to assess how likely samples are to be drawn from the same distribution. Instead of explicitly computing these probabilities, GANs learn a generator that can match the given probabilistic source. This paper looks particularly at this matching capability in the context of problems with one-dimensional outputs. We identify a class of function decompositions with properties that make them well suited to the critic role in a leading approach to GANs known as Wasserstein GANs. We show that Taylor and Fourier series decompositions belong to our class, provide examples of these critics outperforming standard GAN approaches, and suggest how they can be scaled to higher dimensional problems in the future.

Introduction

Generative Adversarial Networks (GANs), introduced by Goodfellow et al. (2014), have quickly become a leading technique for learning to generate data points matching samples from a distribution. GANs produce samples without directly modeling the target probability distribution. They do so by jointly training two neural networks: a generator, which attempts to produce synthetic data points in a way that is consistent with the source distribution, and a discriminator, which seeks to determine whether any given data point was drawn from the source distribution or the generated one.

This joint training procedure is difficult to stabilize and many conceptual variants of the GAN framework have been proposed to improve results. We focus specifically on one such variant: the Wasserstein GAN or wGAN [Arjovsky, Chintala, and Bottou2017]. While the standard GAN framework is derived as a minimax game between two agents, the wGAN framework reformulates the problem in terms of minimizing a distance metric between two probability distributions. Particularly, wGAN is formulated using the dual form of the Earth-Mover’s distance, which can be reasonably approximated by a neural network. This construction results in a similar two-network setup, with one network acting as a generator and another acting as a critic—its role is to maintain an estimate of the Earth-Mover’s distance between the generator’s distribution and the target distribution in a functional form that can be used as a guide to improving the generator.

Informally, the Earth-Mover’s distance between two probability distributions can be thought of as the amount of “work” that would go into transporting probability mass within each distribution to make them indistinguishable. A particularly nice property of Earth-Mover’s distance is that, under mild constraints, it has a defined gradient almost everywhere [Arjovsky, Chintala, and Bottou2017], making it ideal for gradient-based optimization. To optimize over the space of critics, the optimizer must ensure that the functions it produces are K-Lipschitz, that is, that the norm of their gradients is less than some scalar K over the domain. A popular approach toward enforcing this constraint [Gulrajani et al.2017] involves assigning a penalty to functions that violate it on a subset of the domain. While this approach has been shown to produce visually appealing results on a variety of popular image benchmarks [Gulrajani et al.2017], there is no guarantee that the critic network will converge to the optimal critic. This failure of the critic to achieve optimality can result in generators that diverge or suffer from mode collapse. In this work, we introduce a reparameterization of the critic network in the one-dimensional setting that has guarantees on its convergence. We show that this reparameterized critic performs better than standard gradient-penalty wGAN approaches on a set of one-dimensional simulated domains.

Background

This section provides necessary mathematical background and also summarizes related work.

Generative Adversarial Networks

Generative Adversarial Networks are traditionally introduced in a setting where there is some target (“real”) data source $\mathbb{P}_r$ from which samples can be drawn. The GAN itself is defined in terms of two distinct network components: a generator, $G$, and a discriminator, $D$. The generator takes randomly sampled noise $z \sim p(z)$ as input and produces “generated” samples, distributed according to $\mathbb{P}_g$, as output. The discriminator takes real or generated data points as input and returns a scalar indicating whether a given input is real or generated. These networks are trained together in a mini-max game with the following objective:

$$\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \qquad (1)$$

To optimize this objective, the generator and discriminator networks take turns, updating their own parameters while the other network’s parameters are held fixed. Collectively, this objective can be thought of as the “certainty” of the discriminator. The generator aims to minimize this certainty, and in doing so, produce generated samples that are distributionally indistinguishable from those drawn from the real data-generating source. Conversely, the discriminator aims to maximize its own certainty by learning to discern real samples from generated ones, providing pressure on the generator to more closely match the real distribution.
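As a rough illustration of this alternating scheme, the sketch below trains a tiny one-dimensional GAN in PyTorch; the architectures, learning rates, and toy data source are assumptions made for the example, not details from the paper.

```python
import torch
import torch.nn as nn

# Illustrative one-dimensional GAN; all sizes and rates here are placeholders.
G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))                # generator
D = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def sample_real(n):
    # Stand-in for the real one-dimensional data source.
    return torch.randn(n, 1) * 0.5 + 2.0

for step in range(10000):
    # Discriminator turn: maximize E[log D(x)] + E[log(1 - D(G(z)))].
    x, z = sample_real(128), torch.randn(128, 8)
    loss_d = -(torch.log(D(x) + 1e-8) + torch.log(1 - D(G(z).detach()) + 1e-8)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator turn: minimize E[log(1 - D(G(z)))] while D is held fixed.
    z = torch.randn(128, 8)
    loss_g = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```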

Wasserstein Generative Adversarial Networks

As an alternative to this game-theoretic approach, Wasserstein GANs seek to minimize the Earth-Mover’s distance between two probability distributions,

$$W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right] \qquad (2)$$

where $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ represents the set of all joint probability distributions with $\mathbb{P}_r$ and $\mathbb{P}_g$ as marginal distributions.

While the representation of Earth-Mover’s distance provided in Eq. (2) is not tractable to compute, it can be approximated in its dual form [Villani2008],

$$W(\mathbb{P}_r, \mathbb{P}_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f(x)] \qquad (3)$$

where the supremum is taken over the space of 1-Lipschitz functions $f$.

Eq. (3) can be optimized similarly to the GAN setup described above. The same form of the generator network is used to produce generated data samples. However, in place of a discriminator, a critic network is used to represent the function $f$ in Eq. (3). Collectively, the resulting optimization procedure takes the following form:

$$\min_G \max_{w} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(G(z))] \qquad (4)$$

where the critic network, $f_w$, is required to span a sufficiently large class of functions to approximate the supremum in Eq. (3).

In this setting, the critic updates its parameters $w$ successively while the generator network is held fixed, so as to maximize the above expression. After Eq. (4) is sufficiently maximized, the critic’s parameters are frozen and the generator network takes a step to minimize Eq. (4).
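A minimal sketch of this alternation for a one-dimensional wGAN is given below; the layer sizes, learning rates, number of critic steps per generator step, and the crude weight clipping used to keep the critic roughly Lipschitz (as in the original wGAN) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))  # generator
f = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))  # critic
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_f = torch.optim.RMSprop(f.parameters(), lr=5e-5)

def sample_real(n):
    # Stand-in for the real one-dimensional data source.
    return torch.randn(n, 1) * 0.5 + 2.0

for step in range(5000):
    # Several critic steps per generator step, maximizing E[f(x)] - E[f(G(z))].
    for _ in range(5):
        x, z = sample_real(128), torch.randn(128, 8)
        loss_f = -(f(x).mean() - f(G(z).detach()).mean())
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()
        for p in f.parameters():
            p.data.clamp_(-0.01, 0.01)   # crude Lipschitz control by clipping

    # Generator step: follow the critic's estimate, i.e. maximize E[f(G(z))].
    z = torch.randn(128, 8)
    loss_g = -f(G(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```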

It is important to note that during this procedure special care must be taken to ensure that the critic function, $f_w$, belongs to the class of 1-Lipschitz functions. The most successful method of ensuring this property is to apply a gradient penalty as an additional term in the loss function [Gulrajani et al.2017].
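One common implementation of such a penalty evaluates the critic’s gradient at points interpolated between real and generated samples, as in Gulrajani et al. (2017); the penalty coefficient of 10 below is a typical default and an assumption here rather than a value reported in this paper.

```python
import torch

def gradient_penalty(critic, real, fake, coeff=10.0):
    # Penalize the critic where its gradient norm at interpolated points
    # deviates from 1 (Gulrajani et al., 2017). coeff=10 is an assumed default.
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real.detach() + (1 - eps) * fake.detach()).requires_grad_(True)
    grads = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    return coeff * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# Typical use inside the critic update of the sketch above:
#   loss_f = -(f(x).mean() - f(G(z).detach()).mean()) + gradient_penalty(f, x, G(z))
```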

Bearing this constraint in mind, the success of gradient-based optimization relies heavily on the parameter space of the optimized function being “nice” in a topological sense. In particular, a highly non-convex or disconnected parameter space poses a much harder optimization problem and increases the likelihood of the optimizer settling on a local optimum. Imposing a 1-Lipschitz constraint on the optimization procedure certainly complicates the topology of the parameter space.

Taylor Series Approximations

The Taylor series is a popular method that approximates a function with a sum of polynomials of increasing degree. In the one-dimensional setting, it can be expressed as

$$f(x) \approx \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!} (x - a)^n \qquad (5)$$

where $f^{(n)}$ denotes the $n$th derivative of $f$, $n!$ denotes the factorial of $n$, and $a$ is an arbitrary point in the domain of $f$. It is important to note that the approximation is centered at $a$: as $x$ moves away from this central point, the approximation can become less accurate.
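For a concrete, if standard, example, the snippet below truncates the Taylor expansion of exp(x) around a = 0 and shows the accuracy degrading away from the center:

```python
import math

def taylor_exp(x, a=0.0, terms=10):
    # Truncated Taylor expansion of exp around a.  Every derivative of exp
    # equals exp(a), so each term is exp(a) * (x - a)^n / n!.
    return sum(math.exp(a) * (x - a) ** n / math.factorial(n) for n in range(terms))

print(taylor_exp(1.0), math.exp(1.0))   # close to the true value near a = 0
print(taylor_exp(5.0), math.exp(5.0))   # noticeably less accurate far from a
```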

Fourier Series Approximations

The Fourier series is another method of approximating functions with a sum of simpler functions, this time sinusoidal functions of decreasing periodicity. In the one-dimensional setting, it can be expressed as

$$f(x) \approx \frac{a_0}{2} + \sum_{n=1}^{\infty} \left[ a_n \cos\!\left(\frac{2\pi n x}{P}\right) + b_n \sin\!\left(\frac{2\pi n x}{P}\right) \right] \qquad (6)$$

where the sequences $\{a_n\}$ and $\{b_n\}$ are parameters particular to the function $f$ and $P$ is the period of the resulting approximation.
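The following sketch estimates a handful of Fourier coefficients by numerical integration and evaluates the truncated series; the target function and number of terms are arbitrary illustrations:

```python
import numpy as np

def fourier_approx(f, x, n_terms=10, period=2 * np.pi):
    # Truncated Fourier approximation of f on one period, with the
    # coefficients a_n, b_n estimated by numerical integration.
    t = np.linspace(0.0, period, 2048, endpoint=False)
    y = f(t)
    approx = np.full_like(x, np.mean(y))          # the a_0 / 2 term
    for n in range(1, n_terms + 1):
        a_n = 2.0 / period * np.trapz(y * np.cos(2 * np.pi * n * t / period), t)
        b_n = 2.0 / period * np.trapz(y * np.sin(2 * np.pi * n * t / period), t)
        approx += a_n * np.cos(2 * np.pi * n * x / period) \
                + b_n * np.sin(2 * np.pi * n * x / period)
    return approx

sawtooth = lambda t: (t % (2 * np.pi)) / (2 * np.pi)   # simple periodic target
xs = np.linspace(0.0, 2 * np.pi, 5)
print(fourier_approx(sawtooth, xs), sawtooth(xs))
```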

Derivation of the Summable Critic

We now define a set of properties for a critic representation that we will show leads to improvements on the wGAN framework.

Let $\{f_n\}_{n=0}^{\infty}$ be some sequence of functions such that both the functions and their derivatives are bounded: $|f_n(x)| \le B$ and $|f_n'(x)| \le B$ for all $x$, where $B$ is some bounding constant that does not depend on $n$.

We then define the weighted sum of these functions as follows:

$$h_{\boldsymbol{\alpha}}(x) = \sum_{n=0}^{\infty} \alpha_n f_n(x) \qquad (7)$$

where $\boldsymbol{\alpha} = (\alpha_0, \alpha_1, \ldots)$ and $\sum_{n=0}^{\infty} |\alpha_n| < \infty$.

From the above properties, we can derive an upper bound on the gradient of $h_{\boldsymbol{\alpha}}$:

$$\left| \frac{d}{dx} h_{\boldsymbol{\alpha}}(x) \right| = \left| \sum_{n=0}^{\infty} \alpha_n f_n'(x) \right| \le \sum_{n=0}^{\infty} |\alpha_n|\,|f_n'(x)| \le B \sum_{n=0}^{\infty} |\alpha_n|. \qquad (8)$$

To simplify our notation, we refer to this upper bound as

$$K(\boldsymbol{\alpha}) = B \sum_{n=0}^{\infty} |\alpha_n|. \qquad (9)$$
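A small numerical sketch of this construction is given below; the particular bounded basis ($f_0(x) = 1$, $f_n(x) = \sin(nx)/n$, so $B = 1$) is an assumption chosen for the example, not the basis used in the paper:

```python
import numpy as np

def basis(x, n):
    # Bounded basis with B = 1: |f_n(x)| <= 1 and |f_n'(x)| = |cos(n x)| <= 1.
    return np.ones_like(x) if n == 0 else np.sin(n * x) / n

def critic(x, alpha):
    # Summable critic h_alpha(x) = sum_n alpha_n f_n(x).
    return sum(a * basis(x, n) for n, a in enumerate(alpha))

def lipschitz_bound(alpha, B=1.0):
    # K(alpha) = B * sum_n |alpha_n|, an upper bound on |h_alpha'(x)|.
    return B * np.sum(np.abs(alpha))

alpha = np.array([0.3, -0.4, 0.2])
xs = np.linspace(-np.pi, np.pi, 1000)
numeric_grad = np.gradient(critic(xs, alpha), xs)
print(np.max(np.abs(numeric_grad)), "<=", lipschitz_bound(alpha))
```

As Eq. (8) guarantees, the printed maximum of the numerical derivative stays below $K(\boldsymbol{\alpha})$.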

If we further assume that most functions of interest can be expressed as $h_{\boldsymbol{\alpha}}$ for some sequence of coefficients $\boldsymbol{\alpha}$, then we can express the dual form of the Wasserstein distance between distributions $\mathbb{P}_r$ and $\mathbb{P}_g$ over a common support as follows:

$$W(\mathbb{P}_r, \mathbb{P}_g) \approx \sup_{K(\boldsymbol{\alpha}) \le 1} \; \mathbb{E}_{x \sim \mathbb{P}_r}[h_{\boldsymbol{\alpha}}(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[h_{\boldsymbol{\alpha}}(x)] = \sup_{K(\boldsymbol{\alpha}) \le 1} \; \sum_{n=0}^{\infty} \alpha_n \Big( \mathbb{E}_{x \sim \mathbb{P}_r}[f_n(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f_n(x)] \Big)$$

where the last step follows from the property that $|f_n(x)| \le B$ for all $n$, which, together with the summability of the coefficients, allows the expectation and the infinite sum to be exchanged.

We emphasize again how, under this new parameterization, the critic’s parameters are the coefficients $\boldsymbol{\alpha}$. The structure of these parameters in the constraint and objective function gives us useful properties. Specifically, our approximation of the 1-Lipschitz constraint is convex and the optimization objective with respect to the critic is linear. Hence, we have the following theorem.

Theorem 1.

During the optimization of a summable critic network, any setting of parameters that is a local maximum is also a global maximum.

Proof.

Consider two critic networks, $h_{\boldsymbol{\alpha}}$ and $h_{\boldsymbol{\beta}}$, and suppose that each critic satisfies the constraint given by Eq. (9): $K(\boldsymbol{\alpha}) \le 1$ and $K(\boldsymbol{\beta}) \le 1$.

Next, consider any critic produced by linearly interpolating between the parameters of $h_{\boldsymbol{\alpha}}$ and $h_{\boldsymbol{\beta}}$. Denote such a critic as $h_{\lambda \boldsymbol{\alpha} + (1 - \lambda)\boldsymbol{\beta}}$ for some $\lambda \in [0, 1]$.

We can then bound the derivative of the interpolated critic by

$$\left| \frac{d}{dx} h_{\lambda \boldsymbol{\alpha} + (1 - \lambda)\boldsymbol{\beta}}(x) \right| \le B \sum_{n=0}^{\infty} \left| \lambda \alpha_n + (1 - \lambda)\beta_n \right| \le \lambda K(\boldsymbol{\alpha}) + (1 - \lambda) K(\boldsymbol{\beta}) \le 1,$$

where the last inequality is due to both $h_{\boldsymbol{\alpha}}$ and $h_{\boldsymbol{\beta}}$ satisfying our modified 1-Lipschitz constraint: $K(\boldsymbol{\alpha}) \le 1$ and $K(\boldsymbol{\beta}) \le 1$.

Hence, any critic linearly interpolated between two critics satisfying $K(\cdot) \le 1$ also satisfies the constraint. Thus, the space of critics that satisfies our constraint is convex under our summable parameterization.

Next, we observe that the Wasserstein distance under a summable critic parameterization is linear in the parameters of the critic. Collectively, the procedure of maximizing the Wasserstein distance with respect to the critic parameters now has a linear objective and a convex constraint set.

Let

$$L(\boldsymbol{\alpha}) = \sum_{n=0}^{\infty} \alpha_n \Big( \mathbb{E}_{x \sim \mathbb{P}_r}[f_n(x)] - \mathbb{E}_{x \sim \mathbb{P}_g}[f_n(x)] \Big) \qquad (13)$$

be the critic’s objective function. Notice how, under the approximation in the derivation above, $L$ is the equation for a hyperplane. Thus, we have

$$L\big(\lambda \boldsymbol{\alpha} + (1 - \lambda)\boldsymbol{\beta}\big) = \lambda L(\boldsymbol{\alpha}) + (1 - \lambda) L(\boldsymbol{\beta})$$

for any settings $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ of the critic’s parameters.

Next, suppose that $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ satisfy our constraint and are local maxima of $L$ with $L(\boldsymbol{\alpha}) < L(\boldsymbol{\beta})$, without loss of generality. If such $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ exist, then, from the convexity of the constraint set, all interpolations between them also satisfy the constraint. Thus, we can consider the sequence of parameter settings given by $\boldsymbol{\gamma}_k = (1 - \tfrac{1}{k})\boldsymbol{\alpha} + \tfrac{1}{k}\boldsymbol{\beta}$. Clearly, $\boldsymbol{\gamma}_k \to \boldsymbol{\alpha}$, and each $\boldsymbol{\gamma}_k$ satisfies the constraint. Moreover, we can write

$$L(\boldsymbol{\gamma}_k) = \left(1 - \tfrac{1}{k}\right) L(\boldsymbol{\alpha}) + \tfrac{1}{k} L(\boldsymbol{\beta}) > L(\boldsymbol{\alpha}).$$

Since we can construct a sequence of parameters that approaches $\boldsymbol{\alpha}$ with $L(\boldsymbol{\gamma}_k) > L(\boldsymbol{\alpha})$ for each $k$, this contradicts $\boldsymbol{\alpha}$ being a local maximum. Hence, any local maximum must be a global maximum. ∎

As a result, we can frame the optimization of the critic as a convex optimization problem, where all local maxima are global maxima.

For appropriate settings of the basis functions $f_n$, we can use the Fourier and Taylor bases, giving Fourier and Taylor reparameterizations of the critic, respectively.

Thus, by representing the class of critic functions as Taylor or Fourier expansions, we obtain a clean way to enforce the 1-Lipschitz constraint over the entire domain, while ensuring that gradient-based optimization schemes can find the globally optimal critic. We note that by enforcing an upper bound on the Lipschitz constant, we are optimizing over a smaller set of functions. However, we have not observed this additional restriction to affect performance empirically.

When optimizing the expressions above, slight modifications must be made for computational tractability. First, we must choose some finite $N$ and cut off the remaining terms in the outer sum. Second, we must enforce our constraint as a penalty term in the loss function. Fortunately, neither of these practical considerations changes the theoretical guarantees proved above. In particular, limiting the number of terms in the outer sum to $N$ does not affect our convexity-based arguments, and embedding the constraint into the loss function as a penalty still results in a convex optimization problem.
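The sketch below puts these pieces together for a fixed pair of sample sets: the critic objective is estimated from samples, truncated at N terms, and the constraint is enforced through a squared penalty. The basis, N, penalty weight, and step size are all illustrative assumptions rather than the paper's settings.

```python
import numpy as np

PEN, B, N, LR = 10.0, 1.0, 8, 0.05   # assumed penalty weight, bound, truncation, step size

def basis(x, n):
    # Bounded basis with B = 1: f_0(x) = 1 and f_n(x) = sin(n x) / n for n >= 1.
    return np.ones_like(x) if n == 0 else np.sin(n * x) / n

def objective_terms(n_terms, x_real, x_gen):
    # d_n = E_real[f_n(x)] - E_gen[f_n(x)], estimated from samples.
    return np.array([basis(x_real, n).mean() - basis(x_gen, n).mean()
                     for n in range(n_terms)])

rng = np.random.default_rng(0)
x_real, x_gen = rng.normal(1.0, 0.3, 5000), rng.normal(-1.0, 0.3, 5000)
diffs = objective_terms(N, x_real, x_gen)

alpha = np.zeros(N)
for _ in range(500):
    violation = max(0.0, B * np.sum(np.abs(alpha)) - 1.0)
    # Gradient ascent on  sum_n alpha_n d_n  -  PEN * max(0, K(alpha) - 1)^2.
    grad = diffs - PEN * 2.0 * violation * B * np.sign(alpha)
    alpha += LR * grad

print(alpha, np.dot(alpha, diffs))   # coefficients and the truncated estimate
```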

Connection with Moment Matching and Maximum Mean Discrepancy

In this section, we review two methods in statistics that exhibit similar characteristics to the Wasserstein distance metric and the summable parameterization we presented in this work.

Maximum Mean Discrepancy

Recent work by Gretton et al. (2012) explores the “Maximum Mean Discrepancy” technique to distinguish between samples drawn from different data sources. The Maximum Mean Discrepancy (an instance of an Integral Probability Metric) between two data sources is defined as

$$\mathrm{MMD}(\mathcal{F}, \mathbb{P}, \mathbb{Q}) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim \mathbb{P}}[f(x)] - \mathbb{E}_{y \sim \mathbb{Q}}[f(y)] \qquad (16)$$

where $\mathbb{P}$ and $\mathbb{Q}$ are the distributions of the data sources and $\mathcal{F}$ is some function class that is sufficiently rich that $\mathrm{MMD}(\mathcal{F}, \mathbb{P}, \mathbb{Q}) = 0$ only when $\mathbb{P} = \mathbb{Q}$.

Notice how the dual form of the Wasserstein distance is a special case of the above Integral Probability Metric when $\mathcal{F}$ is the set of 1-Lipschitz functions.

In their work, Gretton et al. (2012) explore using Reproducing Kernel Hilbert Spaces [Berlinet and Thomas-Agnan2011] as the function class $\mathcal{F}$ to perform their maximum mean discrepancy tests. This kernel-based approach is adopted by Li et al. (2015) in their work on Generative Moment Matching Networks, which offers a method that competes directly with GANs. Rather than a mini-max game between a generator and a discriminator, Generative Moment Matching Networks need only a generator network, trained to minimize the Maximum Mean Discrepancy between the real and generated sources. The authors note that their use of kernels approximates matching the moments of the sampled and generated random variables.
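For reference, here is a small sketch of a (biased) squared-MMD estimate with a Gaussian RKHS kernel; the kernel bandwidth is an arbitrary choice:

```python
import numpy as np

def mmd_gaussian(x, y, bandwidth=1.0):
    # Biased MMD^2 estimate between 1-D samples x and y with the kernel
    # k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2)).
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
same = mmd_gaussian(rng.normal(0, 1, 500), rng.normal(0, 1, 500))
diff = mmd_gaussian(rng.normal(0, 1, 500), rng.normal(2, 1, 500))
print(same, diff)   # the estimate is near zero when the two sources match
```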

Moment Matching

Moment matching, also known as the “method of moments,” is the process of fitting a model to a distribution by sampling from that distribution and choosing the model’s parameters so that its moments match the distribution’s sampled moments. In general, moments can refer to any set of functions that characterize the behavior of a random variable, but they are most commonly represented as the random variable raised to different powers. For any $n \in \mathbb{N}$, we denote the $n$th moment of a random variable $X$ as

$$M_n(X) = \mathbb{E}[X^n]. \qquad (17)$$

In particular, notice that for critics represented by the Taylor series parameterization, the Wasserstein distance can be expressed as a sum of weighted moments:

$$W(\mathbb{P}_r, \mathbb{P}_g) \approx \sup_{K(\boldsymbol{\alpha}) \le 1} \; \sum_{n=0}^{\infty} \alpha_n \big( M_n(X_r) - M_n(X_g) \big), \qquad (18)$$

where $X_r \sim \mathbb{P}_r$ and $X_g \sim \mathbb{P}_g$.

Experiments

In the following subsections, we describe our experimental procedure.

Domains

We evaluated our method against three different synthetic data sources and one real-world data source. Our synthetic data sources consisted of a “sawtooth” distribution, a discrete distribution with three possible values, and a mixture of two Gaussian distributions. Each distribution was sampled to construct a dataset that was then used across all experiments and models.
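Hypothetical samplers of the same three flavors are sketched below; every numeric value (supports, weights, means, scales) is a placeholder rather than the setting used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sawtooth(n, teeth=4):
    # Periodic "sawtooth" density on [0, 1): each tooth rises linearly.
    tooth = rng.integers(0, teeth, size=n)
    return (tooth + np.sqrt(rng.uniform(size=n))) / teeth

def sample_discrete(n):
    # Discrete distribution over three values with assumed weights.
    return rng.choice([-0.5, 0.0, 0.7], size=n, p=[0.3, 0.5, 0.2])

def sample_mixture(n):
    # Mixture of two Gaussians with assumed means, scales, and weights.
    pick = rng.random(n) < 0.5
    return np.where(pick, rng.normal(-0.5, 0.1, n), rng.normal(0.5, 0.2, n))
```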

Our real-world data source is a collection of city populations from the Free World Cities Database (https://www.maxmind.com/en/free-world-cities-database). We pre-processed this data by applying a logarithmic scaling to the population numbers and normalizing the resulting log-populations to a fixed interval.
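A sketch of this preprocessing is shown below; the file name, column name, and the [0, 1] normalization range are assumptions about the downloaded database rather than details taken from the paper.

```python
import csv
import numpy as np

def load_log_populations(path="worldcitiespop.csv"):
    # Read populations, drop missing or zero entries, log-scale, then
    # min-max normalize the log-populations to [0, 1].
    pops = []
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            value = row.get("Population", "").strip()
            if value and float(value) > 0:
                pops.append(float(value))
    logs = np.log(np.array(pops))
    return (logs - logs.min()) / (logs.max() - logs.min())
```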

Figure 1 characterizes each of these data sources by sampling a million points from each and plotting their histograms.

Figure 1: Histograms generated by drawing 1,000,000 samples from each of the four data sources, (a) through (d).

Network Architectures

To maintain consistency between experiments, we used the same generator network architecture for both the wGAN-GP experiments and for our method. This generator architecture consists of batch-normalized, fully connected layers with leaky ReLU activations, followed by a single fully connected output layer with one neuron and a tanh activation. The wGAN-GP experiments used a discriminator with fully connected layers with leaky ReLU activations, followed by a single fully connected output layer with one neuron and a linear activation. Following Gulrajani et al. (2017), we used a penalty term to enforce the constraints across all experiments. Additionally, we used the Adam optimizer [Kingma and Ba2014]. For all reparameterized critic models, we truncated the infinite sums in the expansions at a fixed number of terms. Additionally, batch normalization [Ioffe and Szegedy2015] is used for all generator networks in our experiments.
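The sketch below shows PyTorch versions of networks in this spirit; the number of hidden layers, the layer width, and the leaky-ReLU slope are placeholders, since the exact sizes are not reproduced here.

```python
import torch.nn as nn

def make_generator(noise_dim=8, width=64, hidden_layers=3):
    # Batch-normalized fully connected layers with leaky ReLU, tanh output.
    layers, in_dim = [], noise_dim
    for _ in range(hidden_layers):
        layers += [nn.Linear(in_dim, width), nn.BatchNorm1d(width), nn.LeakyReLU(0.2)]
        in_dim = width
    layers += [nn.Linear(in_dim, 1), nn.Tanh()]   # single output neuron
    return nn.Sequential(*layers)

def make_critic(width=64, hidden_layers=3):
    # Fully connected layers with leaky ReLU, linear output activation.
    layers, in_dim = [], 1
    for _ in range(hidden_layers):
        layers += [nn.Linear(in_dim, width), nn.LeakyReLU(0.2)]
        in_dim = width
    layers += [nn.Linear(in_dim, 1)]
    return nn.Sequential(*layers)
```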

Evaluation Procedure

All of the comparison algorithms attempt to learn a representation of the target one-dimensional probability distribution. For each model, we measure its accuracy by computing the sample Earth-Mover’s distance. This quantity is computed by sampling the model and constructing a histogram of its sampled outputs. The entries in this histogram are then normalized so that the sum of the bin values is 1. A similar histogram is then constructed using the true data source, and the Earth-Mover’s distance is computed between them. For computing the Earth-Mover’s distance, we used the publicly available Python library pyEMD. For each of the GAN methods, training was conducted over a fixed number of iterations, with an estimate of the Earth-Mover’s distance computed against the training data at regular intervals. At the end of training, the lowest estimate over the course of training is reported as the model’s Earth-Mover’s distance (EMD).
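For one-dimensional histograms over shared, equally spaced bins, the Earth-Mover’s distance reduces to the area between the two cumulative histograms; the snippet below computes it directly with numpy as an illustration (the experiments themselves used pyEMD).

```python
import numpy as np

def emd_1d(hist_p, hist_q, bin_width=1.0):
    # EMD between two normalized 1-D histograms on the same equally spaced
    # bins; in one dimension it equals the area between the two CDFs.
    p = hist_p / hist_p.sum()
    q = hist_q / hist_q.sum()
    return bin_width * np.abs(np.cumsum(p) - np.cumsum(q)).sum()

# Usage: histogram model samples and data samples on shared bins, then compare.
rng = np.random.default_rng(0)
bins = np.linspace(-3, 3, 101)
h_model, _ = np.histogram(rng.normal(0.1, 1.0, 10000), bins=bins)
h_data, _ = np.histogram(rng.normal(0.0, 1.0, 10000), bins=bins)
print(emd_1d(h_model.astype(float), h_data.astype(float), bin_width=bins[1] - bins[0]))
```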

Results

We present the results of running four trials for each of the GAN-based models. We denote our runs with reparameterized critics as “Taylor Critic” and “Fourier Critic” for the Taylor series and Fourier series reparameterizations, respectively. The best obtained Earth-Mover’s distances for each run and model are reported in Tables 2, 3, 4 and 5. We additionally report the average Earth-Mover’s distances across the four trials and compare these numbers to the performance of a Kernel Density Estimator as a nonparametric baseline. These results are reported in Table 1.

                 Gaussian Mixture   Discrete   Sawtooth   City Population
KDE                    0.0073       0.01002     0.0040        0.0027
wGAN-GP                0.0822       0.1318      0.26055       0.0188
Taylor Critic          0.0216       0.0106      0.0151        0.0096
Fourier Critic         0.0186       0.0193      0.0109        0.0103
Table 1: Average Earth-Mover’s distances over the runs detailed in Tables 2, 3, 4 and 5 for each GAN-based model, alongside the KDE baseline.

                 Run 1     Run 2     Run 3     Run 4
wGAN-GP          0.0206    0.0279    0.2578    0.0226
Taylor Critic    0.0179    0.0216    0.0217    0.0250
Fourier Critic   0.0204    0.0164    0.0201    0.0175
Table 2: Earth-Mover’s distances for runs of the wGAN-GP, Taylor Critic and Fourier Critic on the Gaussian Mixture dataset. Run 3 of the wGAN-GP illustrates its instability.

                 Run 1     Run 2     Run 3     Run 4
wGAN-GP          0.1381    0.1333    0.1314    0.1242
Taylor Critic    0.0091    0.0121    0.0103    0.0110
Fourier Critic   0.0129    0.0129    0.0287    0.0226
Table 3: Earth-Mover’s distances for runs of the wGAN-GP, Taylor Critic and Fourier Critic on the Discrete dataset.

                 Run 1     Run 2     Run 3     Run 4
wGAN-GP          0.4891    0.0226    0.4653    0.0652
Taylor Critic    0.0157    0.0133    0.0152    0.0161
Fourier Critic   0.0081    0.0132    0.0132    0.0091
Table 4: Earth-Mover’s distances for runs of the wGAN-GP, Taylor Critic and Fourier Critic on the Sawtooth dataset. Runs 1 and 3 of the wGAN-GP illustrate its instability.

                 Run 1     Run 2     Run 3     Run 4
wGAN-GP          0.0225    0.0189    0.0182    0.0157
Taylor Critic    0.0120    0.0103    0.0066    0.0094
Fourier Critic   0.0108    0.0069    0.0086    0.0150
Table 5: Earth-Mover’s distances for runs of the wGAN-GP, Taylor Critic and Fourier Critic on the City Population dataset.

We observe that both models with reparameterized critics significantly outperform wGAN-GP and are frequently competitive with Kernel Density Estimation. From Tables 2, 3, 4 and 5, we observe that the reparameterized critic models’ worst runs are generally better than the wGAN-GP model’s best runs, and that the reparameterized critic models have significantly lower variance across runs than wGAN-GP.

Since all GAN-based methods in this paper use the same network architecture for their generators, it is reasonable to attribute this difference to the forms of the critics. As we showed in Theorem 1, the process of optimizing the critic with respect to a given generator cannot “get stuck” in some locally maximal region of the space of critics. Thus, as long as the set of critics satisfying $K(\boldsymbol{\alpha}) \le 1$ is sufficiently close to the set of critics satisfying the true 1-Lipschitz constraint, the generator should always have a clean gradient to follow during its optimization, as shown in Lemma 1 of Gulrajani et al. (2017).

While this does not preclude the possibility that the generator itself could “get stuck” during its own optimization against the critic, the difference in consistency across runs between the reparameterized critic models and the wGAN-GP models is evidence that the additional guarantees on reparameterized critics help empirically. Note that we made every effort to set the hyperparameters of wGAN-GP to reduce or eliminate its instability. It is possible that it would perform better with some other parameter setting, but we were not able to find such a setting. That being said, the performance of the reparameterized critic models was relatively unchanged across the parameter settings we explored.

Conclusion and Future Work

In this work, we illustrated an alternative parameterization of the critic network that has desirable theoretical properties for gradient-based optimization. We demonstrated that, in the one-dimensional setting, our summable critic models consistently outperform the Wasserstein GAN with gradient penalty and are competitive with Kernel Density Estimation on a variety of synthetic and real-world domains.

While our work in this paper focuses on the one-dimensional setting, there is considerable room to explore extending the approach to higher dimensions. Both the Taylor and Fourier series expansions have higher-dimensional analogues. These higher-dimensional decompositions generally require a number of terms that is exponential in the number of input dimensions. It may be possible to alleviate this computational cost by exploiting recent techniques for learning sparse polynomials or Fourier series [Andoni et al.2014, Hassanieh et al.2012]. In particular, while all exponentially many terms of these series may be necessary to model arbitrarily messy functions, it is unlikely that all or even most of them will be required to reasonably approximate the space of 1-Lipschitz functions.

References