On reproduction of On the regularization of Wasserstein GANs

12/16/2017 ∙ by Junghoon Seo, et al. ∙ Satrec Initiative Co., Ltd.

This report has several purposes. First, it investigates the reproducibility of the submitted paper On the regularization of Wasserstein GANs (2018). Second, among the experiments performed in the submitted paper, five aspects were emphasized and reproduced: learning speed, stability, robustness against hyperparameters, estimation of the Wasserstein distance, and various sampling methods. Finally, we identify which parts of the contribution can be reproduced, and at what cost in terms of resources. All source code for the reproduction is publicly available.

1 Introduction

This report is a submission to the ICLR 2018 Reproducibility Challenge (for more information about the challenge, please visit http://www.cs.mcgill.ca/~jpineau/ICLR2018-ReproducibilityChallenge.html). The target paper of our reproducibility verification is Anonymous (2018). The target paper proposes a new regularization term that stabilizes the training of generative adversarial networks (GANs) based on the 1-Wasserstein metric and makes it robust to the selection of hyperparameters.

1.1 Wasserstein distance and regularization

Generative adversarial networks (GANs), regarded as one of the most attractive frameworks among generative models, were first introduced by Goodfellow et al. (2014). It is well known that the training process of GANs can be interpreted as minimizing a metric between the data distribution and the generated distribution. For example, the original GAN paper argues that its training process is equivalent to minimizing the Jensen-Shannon divergence between the data distribution and the generated distribution. Moreover, works such as Nowozin et al. (2016); Zhao et al. (2016); Li et al. (2017); Chen et al. (2016) have explored the optimization of metrics other than the original Jensen-Shannon divergence.

Arjovsky et al. (2017) proposed the Wasserstein GAN (WGAN), which introduces the concept of the Wasserstein distance into the GAN framework. The authors argued that optimizing the Wasserstein distance has advantages over other metrics that measure the difference between probability measures; in Arjovsky et al. (2017), this advantage is described in Theorem 1 and theoretically justified in Appendix C. However, uncapacitated minimum-cost flow algorithms such as Orlin (2013), which are commonly used to compute optimal transport distances, are highly intractable in this setting. According to Villani (2008), the infimum problem of computing the Wasserstein distance can be converted into a supremum problem thanks to the Kantorovich-Rubinstein duality. The resulting optimization problem for solving the optimal transport problem adversarially is the following:

    W_1(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_\theta}[f(\tilde{x})]    (1)

where the notation is as follows:

  • P_r is the probability distribution of the sample data.

  • P_θ is the probability distribution of the generated data, which is modeled by the generator neural network and parameterized by θ.

  • f is the critic function, which is modeled by the discriminator neural network and parameterized by w (written f_w below).

  • {f : ‖f‖_Lip ≤ 1} is the set of all 1-Lipschitz functions, over which the supremum is taken.

The most critical part of Equation (1) is that the critic function must be 1-Lipschitz, and enforcing this constraint on a neural network is the main practical difficulty. In Arjovsky et al. (2017), a weight clipping technique is used to enforce 1-Lipschitz continuity of the critic, but it is rarely effective. Several follow-up studies therefore address this problem.
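
As a rough illustration (not taken from the original implementation; the variable scope name "critic" and the clip value are assumptions of ours), weight clipping in TensorFlow 1.x can be realized as an extra op that is run after every critic update:

    import tensorflow as tf

    # Collect the critic's trainable variables; the scope name "critic" is assumed here.
    critic_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="critic")

    # Clip every critic weight into [-c, c] after each optimizer step
    # (c = 0.01 is the value used in the original WGAN paper).
    c = 0.01
    clip_critic_weights = tf.group(
        *[w.assign(tf.clip_by_value(w, -c, c)) for w in critic_vars])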

Gulrajani et al. (2017) proposed a new regularization term that drives the gradient norm of the critic toward one at points sampled by interpolating between sample points and generated points. Kodali et al. (2017) modified the sampling strategy into local perturbations of the sample data. Anonymous (2018) improved the regularization term by relaxing it so that gradients whose norm is below one at the sampled points are not penalized.

Taken together, the ultimate optimization problems in each paper are:

  1. Arjovsky et al. (2017) (WGAN)

    \min_\theta \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_\theta}[f_w(\tilde{x})]    (2)

    where \mathcal{W} = \{w : \|w\|_\infty \le c\}, i.e., the critic weights are clipped to [-c, c].

  2. Gulrajani et al. (2017) (WGAN-GP)

    \min_\theta \max_w \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_\theta}[f_w(\tilde{x})] + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\big[(\|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1)^2\big]    (3)

    where x̂ = εx + (1 − ε)x̃, x ∼ P_r, x̃ ∼ P_θ, ε ∼ U[0, 1], and λ > 0.

  3. Kodali et al. (2017) (DRAGAN). Note that the content of the most recent revision of this paper differs greatly from the original arXiv version; the sampling strategy is one of the major differences between the two. This report follows the old version, because the reproduced paper most likely followed the old version as well.

    \min_\theta \max_w \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_\theta}[f_w(\tilde{x})] + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\big[(\|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1)^2\big]    (4)

    where x̂ = x + δ is a local perturbation of a training sample x ∼ P_r by noise δ, and λ > 0.

  4. Anonymous (2018) (WGAN-LP), which is the reproduction target of this report

    \min_\theta \max_w \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_\theta}[f_w(\tilde{x})] + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}\big[\big(\max\{0, \|\nabla_{\hat{x}} f_w(\hat{x})\|_2 - 1\}\big)^2\big]    (5)

    where x̂ follows the sampling strategy used in either Equation (3) or Equation (4), and λ > 0.
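
For concreteness, here is a minimal TensorFlow 1.x sketch of the GP- and LP-penalty terms of Equations (3) and (5). It is our own illustration rather than the code of the target paper; the function name, the critic callable, and the tensor shapes are placeholders.

    import tensorflow as tf

    def wgan_penalty(critic, real_data, fake_data, lam=10.0, one_sided=False):
        """Gradient-based penalty on interpolated points, as in Equations (3) and (5)."""
        # Sample x_hat on straight lines between real and generated points.
        batch_size = tf.shape(real_data)[0]
        eps = tf.random_uniform([batch_size, 1], minval=0.0, maxval=1.0)
        x_hat = eps * real_data + (1.0 - eps) * fake_data

        # Gradient norm of the critic at the sampled points.
        grad = tf.gradients(critic(x_hat), [x_hat])[0]
        grad_norm = tf.norm(grad, axis=1)

        if one_sided:
            # LP-penalty (Equation (5)): only gradient norms above one are penalized.
            penalty = tf.square(tf.maximum(0.0, grad_norm - 1.0))
        else:
            # GP-penalty (Equation (3)): the gradient norm is pushed toward exactly one.
            penalty = tf.square(grad_norm - 1.0)
        return lam * tf.reduce_mean(penalty)

Replacing the interpolation by a noisy copy of real_data would give the DRAGAN-style sampling of Equation (4); the penalty term itself stays the same.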

1.2 The target questions and experimental methodology

The ultimate goal of all experiments in Anonymous (2018) is to show that Equation (5) is better than Equation (3) and Equation (4) in terms of training stability and convergence speed when optimizing the optimal transport problem. The detailed target questions of this reproducibility report are listed below (all subsection names appear in the experiment section of Anonymous (2018)):

  1. Does the critic learn the function faster?

    • Subsection Level sets of the critic.

  2. Does the critic learn the function more stably?

    • Subsection Evolution of the critic loss.

  3. Is learning of the critic more robust against changes of the hyperparameter λ?

    • Subsections Level sets of the critic and Evolution of the critic loss.

  4. Is the optimal transport problem solved better?

    • Subsection Estimating the Wasserstein distance.

  5. Is the learning more robust against change of sampling method?

    • Subsection Different sampling methods.

Note that the subsection Sample quality on CIFAR-10 in the experiment section of Anonymous (2018) and Appendix D.5 Optimizing the Wasserstein-2 distance are outside the scope of our reproduction. (At the beginning of the reproduction we referred to the version of the paper uploaded to arXiv. We later noticed that the revised OpenReview version added the subsection Sample quality on CIFAR-10; for reasons of time, the experiments in this subsection were excluded from the scope of this report. In addition, we had trouble implementing a regularization loss term that minimizes the Wasserstein-2 distance in Tensorflow.)

Please refer to the experimental section of the original article to see the intent of each experiment.

2 Details of implementation

Since the implementation of the target paper is not public, we implemented it ourselves. Before describing the implementation in detail, we list some GitHub repositories that were helpful for implementing the paper. All of them are written in Python and Tensorflow.

  1. @igul222/improved_wgan_training (https://github.com/igul222/improved_wgan_training): Implementation of Gulrajani et al. (2017).

  2. @kodalinaveen3/DRAGAN (https://github.com/kodalinaveen3/DRAGAN): Implementation of Kodali et al. (2017).

  3. @hwalsuklee/tensorflow-generative-model-collections (https://github.com/hwalsuklee/tensorflow-generative-model-collections): Implementations of GAN variants.

All source code for this report is publicly available at https://github.com/mikigom/WGAN-LP-tensorflow. We implemented the experiments on Tensorflow 1.3.0. The experiments were conducted on an Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz and a GeForce GTX 1080 Ti graphics card.

2.1 Project structure

The project consists of four Python modules: data_generator.py, model.py, reg_losses.py, and trainer.py. data_generator.py is a module that provides a class that generates the sample data needed for learning. model.py is a module that implements 3-layer neural networks for a generator and a critic. reg_losses.py defines the sampling method and loss term for regularization. trainer.py includes a pipeline for model learning and visualization.

2.2 Notes for our implementation

  • All hyperparameters follow those presented in the original paper. The hyperparameters of the RMSProp optimizer, the only unknown hyperparameters (except for the learning rate), follow the Tensorflow default values.

  • data_generator.py contains Python classes refactored from the data generator code in @igul222/improved_wgan_training. Three sample distributions, 8Gaussians, 25Gaussians, and Swiss Roll, are implemented. For generating the Swiss Roll dataset, sklearn.datasets.make_swiss_roll() is used.

  • model.py is implemented in Python class form with TF-slim.

  • All sample perturbation methods in reg_losses.py are implemented in Tensorflow, not in numpy.

  • For drawing level sets and 2-D data, matplotlib is used. For visualizing loss, Tensorboard summary operator is simply used.

  • scipy.optimize.linear_sum_assignment() is used for computing the earth mover's distance (EMD); it implements the Hungarian method, also known as the Kuhn–Munkres algorithm (see the sketch below).
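
As a rough sketch of how such an assignment-based EMD estimate can be computed between two equally sized point sets (the function and variable names below are ours, not those of the repository):

    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def empirical_emd(real_points, generated_points):
        """Estimate the Wasserstein-1 distance between two equally sized point sets
        by solving the optimal one-to-one assignment (Hungarian method)."""
        cost = cdist(real_points, generated_points)      # pairwise Euclidean distances
        row_ind, col_ind = linear_sum_assignment(cost)   # minimum-cost matching
        return cost[row_ind, col_ind].mean()             # average transport cost

For two empirical distributions with the same number of points and uniform weights, this assignment gives the exact EMD; its computational cost is what makes the EMD evaluation the most time-consuming part of the experiments.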

To see our implementation in more detail, please check out the repository mentioned above.

3 Analysis and discussion of reproduction

We first note that the experiments follow the trends presented in the paper as a whole, but the overall learning speed is relatively slow. We assume this is due to differences in the RMSProp hyperparameters or in elements we did not recognize. However, since this does not contradict the overall tendency of the experiments, we proceeded with the reproduction without trying to resolve it. Accordingly, for the figures of subsection 3.1 Level sets of the critic we drew the level sets at iterations 500, 2500, 5000, and 10000 (instead of the iterations presented in the original paper). To reproduce the experiments of subsection 3.2 Evolution of the critic loss, training was run for up to 20k steps. For the reproduction of subsection 3.3 Estimating the Wasserstein distance, however, training was run for only 2k steps (the same value as in the original paper), since the calculation of the EMD takes a considerable amount of time.

It takes about 12 minutes to train for 20k steps without the EMD calculation, and about 2 hours to train for 2k steps when the EMD calculation is included. In this report, we simplified some experiments: instead of reporting the median of the critic's negative loss over repeated runs, we show only single-run results. Note that the sign of the critic's negative loss in our implementation is the opposite of that in the original paper.

The original paper does not state on which dataset the experiments other than those in subsection Level sets of the critic were carried out. In this report, the Swiss Roll dataset is used.

3.1 Level sets of the critic

Figure 1: Reproduction results of the original paper’s Fig. 3. Level sets of the critic f of WGANs during training on Swiss Roll dataset. The bright line corresponds to high, dark line to low values of the critic. Training samples are indicated in yellow, generated samples in green, and samples used for the penalty term in red. Top: GP-penalty with . Middle: GP-penalty with . Bottom: LP-penalty with .

Fig. 1, Fig. 4, and Fig. 5 correspond to the reproduction results of this section. They confirm that WGAN-LP is more robust to the selection of λ than WGAN-GP. Unlike on the other two datasets, for WGAN-GP training on 8Gaussians the critic network does ultimately learn the correct function (although, of course, more slowly and less stably than under the other conditions), which is consistent between the original paper and our implementation.

3.2 Evolution of the critic loss

Figure 2: Reproduction results of the original paper’s Fig. 4. Evolution of the WGAN critic’s negative loss (without the regularization term) for . Blue line: For the GP-penalty. Red line: For the LP-penalty.

Fig. 2 and Fig. 6 correspond to the reproduction results of this section. Again, we can see that the discussion of the original experiments is properly reproduced: WGAN-LP is more robust to the selection of λ than WGAN-GP, and the critic function is learned much more stably.

3.3 Estimating the Wasserstein distance

Figure 3: Reproduction results of the original paper’s Fig. 5. Evolution of the approximated Wasserstein-1 distance during training of WGANs on . Mint line: For the GP-penalty. Gray line: For the LP-penalty

Fig. 3 and Fig. 7 correspond to the reproduction results of this section. The WGAN-LP model reaches a distribution with a smaller EMD faster than the WGAN-GP model, while WGAN-GP is not robust to the choice of λ. Also, as discussed in the original paper, the WGAN-LP model shows a stable EMD across the tested values of λ.

3.4 Different sampling methods

Fig. 8 and Fig. 9 correspond to the reproduction results of this section. Slightly differing from the discussion in the paper, we could not observe a difference in learning stability between the sampling methods for the WGAN-GP model in a single run. However, the top panel of the original paper's Fig. 11 shows that cases of remarkable fluctuation do occur despite an overall stable trend, while the middle panel exhibits an overall unstable trend with several extremely stable cases. Thus, our failure to observe the stability difference between sampling methods in WGAN-GP is likely a kind of sampling error in our single-run experiment.

4 Conclusions

We set out to reproduce the target paper on regularization for the Wasserstein distance and showed that all experiments within our scope are well reproducible. We wrote all the source code needed to reproduce the target paper; since no source code for the target paper exists, we built it based on various repositories related to WGAN. It took 3 days to write all the source code and reproduce the experiments. As mentioned above, we did not perform any further extended experiments during the review process. We have confirmed that the experimental results are reproducible in a verifiable and easy way. However, some parts were difficult to implement and reproduce (e.g., how to implement the Wasserstein-2-based regularization term in Tensorflow). We therefore prefer that authors working on machine learning always publish their code, for the target paper as well as for other papers.

We have confirmed what the target paper claims: First, WGAN-LP has more stable learning and faster convergence than WGAN-GP. Second, WGAN-LP is much more robust to the regularization coefficient λ than WGAN-GP. Finding and tuning appropriate hyperparameters is an important but cumbersome part of machine learning research. Therefore, presenting a model that is robust to the selection of hyperparameters is a substantial contribution to other researchers and to the field itself. We can accept that the target paper makes this contribution in a reproducible way.

References

Appendix

Figure 4: Reproduction results of the original paper’s Fig. 7. Level sets of the critic f of WGANs during training on 8Gaussians dataset. The same representation with Fig. 1 is used. Top: GP-penalty with . Middle: GP-penalty with . Bottom: LP-penalty with .
Figure 5: Reproduction results of the original paper’s Fig. 8. Level sets of the critic f of WGANs during training on 25Gaussians dataset. The same representation with Fig. 1 is used. Top: GP-penalty with . Middle: GP-penalty with . Bottom: LP-penalty with .
Figure 6: Reproduction results of the original paper's Fig. 9. Evolution of the WGAN-GP critic's loss without the regularization term on .
Figure 7: Reproduction results of the original paper’s Fig. 10. Evolution of the approximated EM distance during training WGAN-GPs on .
Figure 8: Reproduction results of the original paper’s Fig. 11. Evolution of the WGAN critic’s negative loss with local sampling (without the regularization term). Top: GP-penalty when generating samples by perturbing training samples only. Middle: For GP-penalty, perturbing training and generated samples. Bottom: LP-penalty, perturbing training and generated samples.
Figure 9: Reproduction results of the original paper’s Fig. 12. Evolution of the approximated EM distance during training of WGANs with local perturbation with . Orange line: For the GP-penalty. Blue line: For the LP-penalty