1 Introduction
Normalizing flows and generative adversarial networks (GANs) can be seen as alternative approaches: both are flexible generative models that, in contrast to both variational autoencoders and traditional “shallow” Bayesian models, do not assume that either the likelihood or the prior has a simple parametric form. This flexibility makes both approaches appealing for modeling complex scientific data. GANs in particular have caught the attention of scientists (Mustafa et al., 2019; Choi et al., 2017; Baowaly et al., 2018).
To date, however, GANs have been validated primarily using image data. Little research exists investigating whether GANs are suitable for general statistical modeling, as would be required for scientific applications. Performance metrics used to assess GANs thus far have unfortunately revealed only half the story: they measure whether the data GANs generate are realistic (i.e., precision) but not whether the fitted GAN model has support for held-out samples (i.e., recall). The difficulty of measuring recall stems from the intractability of the likelihood GANs assign to high-dimensional data, a well-known limitation of implicit models (Lucic et al., 2018).
Synthetic low-dimensional data, on the other hand, offers us the potential to establish a negative result, because with such data we can accurately assess the performance of both GANs and flows. We study the performance of both on synthetic univariate data drawn from mixture distributions (Section 2). Although accurately learning distributions of univariate data is not sufficient for scientific modeling, it is a necessary condition for a tool to be reliable.
Even with univariate data, measuring performance is not trivial. We confront subtle issues of selecting a kernel density estimation bandwidth, and present a low-variance estimator for the Wasserstein distance between low-dimensional distributions (Section 3). The visualization methods we develop allow us to demonstrate several distinct failure modes of GANs.
An additional challenge of establishing negative results with respect to GANs in general is their diversity: there are hundreds of different GAN algorithms, each of which can be combined with numerous tricks and tweaks, implying an exponential number of combinations. Using eight NVIDIA GeForce RTX 2080 Ti GPUs and weeks of runtime, we systematically search these combinations. For WGAN, we experiment with gradient penalties, spectral normalization, batch normalization, cyclic learning rates, ResNet architectures, various layer widths, different noise distributions, and additional tuning parameters (Section 4). We also experiment extensively with normalizing flows.
GANs failed to learn even basic structures in our synthetic data, whereas some normalizing flows modeled the data well (Section 5). Surprisingly, normalizing flows outperform WGANs even in terms of the metric that only WGAN targets: minimizing Wasserstein-1 distance.
The common wisdom is that GANs are “difficult to train;” however, perhaps we should instead be asking when tuning GANs properly, itself an optimization problem, is simpler than the original density estimation problem. At present, normalizing flows show greater potential for applications where recall is of primary importance (Section 6).
2 Synthetic data
We developed two synthetic datasets for evaluating GANs and normalizing flows. Both datasets are univariate. To avoid confounding our results with issues of data efficiency and overfitting, which are beyond the scope of this work, we make the size of both datasets effectively infinite by drawing fresh data at every epoch.
2.1 Unimodal dataset
One data model we consider is a two-component univariate mixture with equal means, specified by the following generative process:
(1)  
(2)  
(3) 
Here
is an unobserved random variable and
is the data. We set . Figure 1 (a) shows this density. The density has several qualitative aspects we would expect a good model to recover: sharply changing density at 4.5 and 5.5, symmetry around 5.0, and one bell-shaped mode.
2.2 Multimodal dataset
We also consider a mixture of unimodal distributions with unequal means, as specified by the following generative process:
(4)  
(5)  
(6)  
(7) 
Here is the data and and are unobserved random variables, used only to facilitate data generation. We set , , , and . For , we set . Figure 1 (e) shows this density.
We designed this density to have multiple modes, with non-negligible density between them. Compared to our unimodal mixture, our multimodal mixture has narrower modes with sharper boundaries. The density has several qualitative aspects that we would expect a good model to recover: the correct number of modes, the correct high- and low-density areas, equal densities at the modes, and equal densities within the low-probability regions between modes.
3 Metrics
Evaluating the quality of GANs is not trivial because the density of GAN generators cannot be evaluated directly; generators can only be sampled. Our strategy is to first draw a large number of samples from the generator. Because our data is low-dimensional, we can attain high sample density. Then, we compute two metrics, one qualitative and the other quantitative, using these samples.
3.1 Kernel density estimation
Kernel density estimation (KDE) lets us visualize the GAN density and compare it qualitatively to the normalizing flow density and the true data generating density. However, KDE can be inaccurate if the bandwidths are chosen improperly: too large and the GAN appears smoother than it is, too small and the GAN density incorrectly appears to be highly variable. Either case can mask the extent to which a GAN captures structure in the true data distribution.
We strike a good balance by using a large sample size (), and by choosing a bandwidth such that each band contains, on average, samples, where is chosen to approximately maximize the likelihood of heldout data.
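The held-out-likelihood criterion can be illustrated with a minimal sketch. It assumes a Gaussian kernel and a hypothetical logarithmic grid of candidate bandwidths, neither of which is specified in the text, and it uses Gaussian draws as a stand-in for generator samples; the function names are illustrative only.

```python
import numpy as np

def gaussian_kde_logpdf(train, query, bandwidth):
    """Log-density of a Gaussian KDE fitted on `train`, evaluated at `query`."""
    # Standardized pairwise differences, shape (len(query), len(train)).
    z = (query[:, None] - train[None, :]) / bandwidth
    log_kernel = -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi) - np.log(bandwidth)
    # Average the kernels in log space (log-mean-exp) for numerical stability.
    m = log_kernel.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_kernel - m).mean(axis=1, keepdims=True))).ravel()

def select_bandwidth(train, heldout, candidates):
    """Return the candidate bandwidth with the highest mean held-out log-likelihood."""
    scores = [gaussian_kde_logpdf(train, heldout, h).mean() for h in candidates]
    return candidates[int(np.argmax(scores))], scores

rng = np.random.default_rng(0)
samples = rng.normal(5.0, 0.3, size=2000)   # stand-in for samples from a generator
train, heldout = samples[:1500], samples[1500:]
candidates = np.logspace(-3, 0, 13)         # hypothetical bandwidth grid
best_h, scores = select_bandwidth(train, heldout, candidates)
print(f"selected bandwidth: {best_h:.4f}")
```

Bandwidths at either end of the grid score poorly for opposite reasons: very small values leave held-out points far from every kernel center, while very large values flatten the estimate.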
3.2 Wasserstein-1 distance
Wasserstein-1 distance, also known as Earth Mover's distance, measures the difference between two distributions. Wasserstein-1 distance is an especially interesting distance for our research questions because it is the objective function that WGAN targets. While computing Wasserstein-1 distance in general requires solving a constrained optimization problem, for univariate data it can be readily estimated to high precision (Ramdas et al., 2015; Panaretos & Zemel, 2019). Suppose (resp. ) is a random variable following (resp. ). Let (resp. ) denote i.i.d. samples with size (resp. ) of (resp. ) and let (resp.
) be the empirical cumulative distribution function based on
(resp. ). Then, the Wasserstein-1 distance can be estimated with
(8)
This formula provides a convenient method of validating the predictions of a GAN critic (Arjovsky et al., 2017; Gulrajani et al., 2017), which estimates the Wasserstein-1 distance as
(9) 
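As an illustration of the empirical-CDF estimator, the sketch below relies on the standard one-dimensional identity that the Wasserstein-1 distance equals the integral of the absolute difference between the two CDFs. The function name and the use of NumPy are assumptions made here for illustration; this is not the authors' released implementation.

```python
import numpy as np

def wasserstein1_1d(x, y):
    """Estimate W1 between two univariate samples from their empirical CDFs.

    Uses W1 = integral over t of |F(t) - G(t)|, where F and G are the
    empirical CDFs of x and y. Sample sizes may differ.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    # Both empirical CDFs are piecewise constant between pooled sample points.
    grid = np.sort(np.concatenate([x, y]))
    widths = np.diff(grid)
    # CDF values on each interval [grid[i], grid[i + 1]).
    f = np.searchsorted(x, grid[:-1], side="right") / x.size
    g = np.searchsorted(y, grid[:-1], side="right") / y.size
    return float(np.sum(np.abs(f - g) * widths))

print(wasserstein1_1d([0.0, 1.0], [1.0, 2.0]))  # 1.0: every point moves by one unit
```

Because the integrand is piecewise constant, the sum is exact for the two empirical distributions; the only error comes from sampling.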
4 Methods
We initially set out to demonstrate a positive result: that GANs are versatile statistical tools. As evidence to the contrary accumulated, we embarked on the more challenging task of showing a negative result of some generality by searching large numbers of permutations of GAN types, architectures, and training techniques. For normalizing flows, a positive result emerged fairly soon, so it was not necessary to try many combinations of flow training techniques and flow architectures.
4.1 Methods for training GANs
We considered the original GAN (Goodfellow et al., 2014) at first, but focused on the Wasserstein GAN (WGAN) after it became apparent the former would not adequately model our data. Prior work establishes that WGAN is among the best performing GANs if architectures are well-chosen (Rosca et al., 2018; Lucic et al., 2018; Dinh et al., 2017).
The WGAN hyperparameters can be divided into two sets based on whether they determine network architecture or training strategy. The former includes activation functions, width, and depth, as well as different ways to initialize weights, namely uniform and Xavier (Glorot & Bengio, 2010). We also considered both fully connected and ResNet architectures (He et al., 2016). For the generator, we considered both Gaussian and uniform priors of various dimensions, as well as batch normalization (Ioffe & Szegedy, 2015). For the critic, we explored both gradient penalties (Gulrajani et al., 2017) and spectral normalization (Miyato et al., 2018), and considered different strengths for both regularizers.
The WGAN hyperparameters dictating the training strategy include optimizer learning rate, weight decay, and both coefficients for Adam (Kingma & Ba, 2014). We also considered cyclic learning rates (Smith, 2017) in response to some failure modes we observed, and experimented with various numbers of critic updates per generator update.
Exhaustive grid search over such a large search space is prohibitively expensive. We approximate an exhaustive grid search with random search (Bergstra & Bengio, 2012) scheduled by ASHA (Li et al., 2018), an asynchronous successive halving algorithm that uses early stopping to tune hyperparameters in parallel, to study the performance of different combinations.
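The early-stopping idea behind this search can be sketched with synchronous successive halving, a simplification of ASHA, which instead promotes configurations asynchronously across parallel workers. The sampler, the toy scoring function, and all parameter values below are hypothetical stand-ins for training a WGAN and measuring its Wasserstein-1 distance.

```python
import math
import random

def successive_halving(sample_config, train_and_score, n_configs=27, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations at each rung (lower score is better)."""
    configs = [sample_config() for _ in range(n_configs)]  # random search
    budget = min_budget
    while len(configs) > 1:
        # Evaluate every surviving configuration at the current budget.
        ranked = sorted(configs, key=lambda c: train_and_score(c, budget))
        # Stop the worst performers early; survivors get eta times more budget.
        configs = ranked[: max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

random.seed(0)
sampled = []

def sample_config():
    # Hypothetical search space: learning rate drawn log-uniformly.
    cfg = {"lr": 10 ** random.uniform(-5, -1)}
    sampled.append(cfg)
    return cfg

def toy_score(cfg, budget):
    # Toy objective: best near lr = 1e-3, improving as the budget grows.
    return (math.log10(cfg["lr"]) + 3.0) ** 2 + 1.0 / budget

best = successive_halving(sample_config, toy_score)
print(best)
```

With 27 configurations and eta = 3, the rungs contain 27, 9, 3, and 1 configurations, so most of the compute goes to the few configurations that looked promising early.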
4.2 Methods for training flows
We first considered Masked Autoregressive Flow (Papamakarios et al., 2017), Inverse Autoregressive Flow (Kingma et al., 2016), RealNVP (Dinh et al., 2017) and Planar flows (Rezende & Mohamed, 2015). For univariate data such as ours, however, these flows have just two learnable scalar parameters. With so few learnable parameters, the capacity of such flows is limited for univariate data, so we did not consider them further.
Our experiments with flows focused instead on Free-form Jacobian of Reversible Dynamics (FFJORD) (Grathwohl et al., 2019) and Gaussianization Flows (GF) (Meng et al., 2020). The former, FFJORD, is a continuous flow based on an ordinary differential equation. The latter, GF, stacks two types of learnable transformations: a linear transformation that rotates the data so that correlations between different dimensions are minimized, and a nonlinear transformation that learns each marginal distribution separately by composing the inverse Gaussian CDF with a mixture of logistic distributions. The first type of transformation was unnecessary for our univariate data. We did not run a grid search to find optimal parameters for either FFJORD or GF. Instead, we used the same architectures and training strategies as Grathwohl et al. (2019) and Meng et al. (2020).
5 Results
We evaluated WGAN, FFJORD and Gaussianization Flows on both our synthetic datasets. The results reported for WGAN are always for models tuned with extensive hyperparameter optimization. Typically spectral normalization led to the best results for WGAN. We consider the experimental results both qualitatively and quantitatively.
5.1 Qualitative
Figure 1 is a key result of ours that shows the dataset densities (ground truth) and the learned densities. Only the best performing WGAN is shown. Even the best WGAN failed to reconstruct key qualitative features of the datasets. On the unimodal dataset, WGAN assigned too much mass both to the regions around the boundaries of the data distribution’s support and to the mode of the data distribution. Further, WGAN failed to capture the symmetry and bell-shaped structure. On the multimodal dataset, WGAN recovered all modes and gaps between modes. However, WGAN misrepresented the local structure of the individual modes (e.g., the symmetry), as well as the relative densities of the modes, which should have been equal.
Both types of normalizing flows, in contrast, recovered both the local and global structure of the data distributions. GF appears to be slightly more accurate than FFJORD.
5.2 Quantitative
Table 1 summarizes the quantitative performance on the two datasets in terms of the Wasserstein-1 distance, as estimated by Equation 8. Surprisingly, both flows outperformed the best WGAN in terms of Wasserstein-1 distance, a metric that only WGAN targets.
The best performing WGAN for unimodal data used spectral normalization to constrain the Lipschitz constant of the critic, whereas for multimodal data, a gradient penalty worked better than spectral normalization. Many of the modifications of the WGAN that we thought might help did not. We report results for several of these modifications in Table 2.
We made some progress in understanding why WGAN does not perform better. Because our data is low-dimensional, Equation 8 gives us low-variance (and unbiased) estimates of the Wasserstein-1 distance. The WGAN critic computes Wasserstein distance differently, using Equation 9. When WGAN converged, the Wasserstein-1 distance estimated by the critic often severely underestimated the true Wasserstein-1 distance. Surprisingly, in some cases, the critic estimate of the Wasserstein-1 distance was negative. Without reliable estimates from the critic of the distance between the data distribution and the model/generator density at the current iterate, there is no sound basis for updates to the generator.
Table 1: Wasserstein-1 distance (Equation 8) on each dataset.
Model   Unimodal  Multimodal
WGAN    0.0087    0.4814
FFJORD  0.0066    0.289
GF      0.0035    0.138
Table 2: Wasserstein-1 distance for modifications of the WGAN.
WGAN variant        Unimodal  Multimodal
baseline            0.0087    0.4814
with uniform prior  0.0490    0.6648
with cyclic LR      0.0491    0.7664
with dropout        0.1171    1.0461
with ResNet         0.1827    0.7238
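The gap between the critic estimate (Equation 9) and the sample-based estimate (Equation 8) discussed in this section can be illustrated with a small sketch. By Kantorovich-Rubinstein duality, the Wasserstein-1 distance is the supremum of the mean critic score on data minus the mean critic score on generated samples over all 1-Lipschitz critics, so any particular 1-Lipschitz critic yields a lower bound, and a poorly fitted one can yield a negative value. The critics and the "generator" samples below are hand-picked and hypothetical; this is an illustration of the bound, not a diagnosis of the trained models.

```python
import numpy as np

def w1_equal_size(x, y):
    # For equal-size univariate samples, W1 is the mean gap between order statistics.
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def critic_estimate(f, real, fake):
    # WGAN-style estimate: mean critic score on data minus mean score on generated samples.
    return float(np.mean(f(real)) - np.mean(f(fake)))

rng = np.random.default_rng(0)
real = rng.normal(5.0, 0.3, size=5000)   # stand-in for data
fake = rng.normal(5.2, 0.3, size=5000)   # hypothetical generator output

reference = w1_equal_size(real, fake)

# Three hand-picked 1-Lipschitz critics (hypothetical, not trained networks).
good = critic_estimate(lambda t: -t, real, fake)                         # well oriented
weak = critic_estimate(lambda t: -0.1 * np.tanh(t - 5.0), real, fake)    # too flat
flipped = critic_estimate(lambda t: t, real, fake)                       # wrong orientation

print(f"reference W1 {reference:.4f}, good {good:.4f}, weak {weak:.4f}, flipped {flipped:.4f}")
```

The well-oriented critic comes close to the reference value, the flat critic underestimates it severely, and the wrongly oriented critic reports a negative distance, mirroring the two symptoms described above.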
5.3 Runtime
Gaussianization Flows was by far the fastest model to train, requiring just 1.5 hours. FFJORD required nearly two weeks to converge on our multimodal dataset. However, this extreme runtime may have in part been due to an issue with the reference implementation, which caused the ODE solver to run more slowly with each iteration. WGAN with spectral normalization and 100 critic updates per generator update required around two days per run. Hundreds of runs were necessary to find good hyperparameters.
6 Discussion
We developed synthetic datasets, reporting metrics, and a model-search methodology for evaluating both GANs and normalizing flows. Our results are surprising: GANs failed to learn key qualitative aspects of both unimodal and multimodal data. Quantitatively, normalizing flows outperformed the Wasserstein GAN in terms of the very metric that only the latter targets: Wasserstein-1 distance. These negative results echo some concerns raised in Rosca et al. (2018).
There are caveats to our results. First, the lessons from low-dimensional data may not generalize to higher dimensional settings. However, for applications that require good recall, including many scientific applications, these results from low-dimensional data are enough to raise serious doubts about the performance of GANs in high dimensions, where there is no rigorous way to detect that an implicit model has poor recall.
Another caveat to our work is that establishing a general negative result requires exhaustively searching an infinite number of GANs, including GAN variants that have not yet been invented. We did our best. Our benchmarking software is publicly available at https://github.com/lliutianc/ganflow, and we invite others to test additional GAN variants with it. As it stands, however, at least for problems that require high recall, our results suggest that normalizing flows are more reliable tools for inference than GANs.
References

Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017.
Baowaly et al. (2018) Baowaly, M. K., Lin, C.-C., Liu, C.-L., and Chen, K.-T. Synthesizing electronic health records using improved generative adversarial networks. Journal of the American Medical Informatics Association, 26(3):228–241, 2018.
Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(10):281–305, 2012.
Choi et al. (2017) Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W. F., and Sun, J. Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare Conference, 2017.
Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Teh, Y. W. and Titterington, M. (eds.), International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.
Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Neural Information Processing Systems, 2014.
Grathwohl et al. (2019) Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019.
Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. In Neural Information Processing Systems, 2017.

He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, 2015.
 Kingma & Ba (2014) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Neural Information Processing Systems, 2016.
 Li et al. (2018) Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Hardt, M., Recht, B., and Talwalkar, A. Massively parallel hyperparameter tuning. arXiv:1810.05934, 2018.
Lucic et al. (2018) Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Neural Information Processing Systems, 2018.
 Meng et al. (2020) Meng, C., Song, Y., Song, J., and Ermon, S. Gaussianization flows. ArXiv, abs/2003.01941, 2020.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Mustafa et al. (2019) Mustafa, M., Bard, D., Bhimji, W., Lukić, Z., Al-Rfou, R., and Kratochvil, J. M. CosmoGAN: creating high-fidelity weak lensing convergence maps using generative adversarial networks. Computational Astrophysics and Cosmology, 6(1):1–13, 2019.
 Panaretos & Zemel (2019) Panaretos, V. M. and Zemel, Y. Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application, 6(1):405–431, 2019.
 Papamakarios et al. (2017) Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation, 2017.
 Ramdas et al. (2015) Ramdas, A., Garcia, N., and Cuturi, M. On Wasserstein two sample testing and related families of nonparametric tests, 2015.
 Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.
 Rosca et al. (2018) Rosca, M., Lakshminarayanan, B., and Mohamed, S. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2018.
 Smith (2017) Smith, L. N. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision, 2017.