AWGAN: Empowering High-Dimensional Discriminator Output for Generative Adversarial Networks

Empirically, multidimensional discriminator (critic) output can be advantageous, but a solid explanation for this has been lacking. In this paper, (i) we rigorously prove that high-dimensional critic output has an advantage in distinguishing real and fake distributions; (ii) we also introduce a square-root velocity transformation (SRVT) block which further magnifies this advantage. The proof is based on our proposed maximal p-centrality discrepancy, which is bounded above by the p-Wasserstein distance and fits the Wasserstein GAN framework with critic output of dimension n. We also show that when n = 1, the proposed discrepancy is equivalent to the 1-Wasserstein distance. The SRVT block is applied to break the symmetric structure of high-dimensional critic output and improve the generalization capability of the discriminator network. In terms of implementation, the proposed framework does not require additional hyper-parameter tuning, which largely facilitates its usage. Experiments on image generation tasks show performance improvement on benchmark datasets.




1 Introduction

Generative Adversarial Networks (GANs) have led to numerous success stories in various tasks in recent years Niemeyer and Geiger (2021); Chan et al. (2021); Han et al. (2021); Karras et al. (2020); Nauata et al. (2020); Heim (2019); Zhan et al. (2019). The goal in a GAN framework is to learn a distribution (and generate fake data) that is as close to the real data distribution as possible. This is achieved by playing a two-player game, in which a generator and a discriminator compete with each other and try to reach a Nash equilibrium Goodfellow et al. (2014). Arjovsky and Bottou (2017) and Arjovsky et al. (2017) pointed out the shortcomings of using the Jensen-Shannon divergence in formulating the objective function, and proposed using the $1$-Wasserstein distance instead. Numerous promising frameworks Li et al. (2017); Mroueh et al. (2017b); Mroueh and Sercu (2017); Mroueh et al. (2017a); Wu et al. (2019); Deshpande et al. (2019); Ansari et al. (2020) based on other discrepancies were developed afterwards. Although some of these works use a critic output of dimension $n = 1$, there is empirical evidence that using a multidimensional output could be advantageous. For example, in Li et al. (2017) the authors pick different values of $n$ for different datasets; in Sphere GAN Park and Kwon (2019) the ablation study shows the best performance with a multidimensional output. However, the reason for this phenomenon has not been well explored yet.

One contribution of this paper is to rigorously prove that high-dimensional critic output is advantageous in the revised WGAN framework. In particular, we propose a new metric on the space of probability distributions, called the maximal $p$-centrality discrepancy. This metric is closely related to the $p$-Wasserstein distance (Theorem 3.8) and can serve as a replacement for the objective function in the WGAN framework, especially when the discriminator has high-dimensional output. In this revised WGAN framework we are able to prove that using high-dimensional critic output makes the discriminator more sensitive in distinguishing real and fake distributions (Proposition 3.10). In classical WGAN with only one critic output, the discriminator pushes forward (or projects) the real and fake distributions to $1$-dimensional space, and then looks at their maximal mean discrepancy. This $1$-dimensional push-forward may hide significant differences between the distributions. Even though ideally there exists a "perfect" push-forward which reveals any tiny difference, in practice the discriminator has difficulty reaching that globally optimal push-forward. Using the $p$-centrality, however, allows us to push distributions forward to a higher-dimensional space. Since even an average high-dimensional push-forward may reveal more differences than a good $1$-dimensional push-forward, this reduces the burden on the discriminator.

Another novelty of our framework is to break the symmetric structure of the discriminator network by composing it with an asymmetrical square-root velocity transformation (SRVT). In the general architecture of GANs, the output layer of the discriminator is assumed to be fully connected. This setup puts all output neurons in equal and symmetric positions. As a result, any permutation of the high-dimensional output vector leaves the value of the objective function unchanged. This permutation symmetry implies that the weights connected to the output layer are correlated, which would undermine the generalization power of the discriminator network Liang et al. (2019); Badrinarayanan et al. (2015). After adding the asymmetrical SRVT block, each output neuron is structurally unique (Proposition 3.13). Our intuition is that the structural uniqueness of the output neurons implies their functional uniqueness. This way, different output neurons are forced to reflect distinct features of the input distribution. Hence SRVT serves as a magnifier which makes full use of the high-dimensional critic output. Since the proposed framework utilizes an Asymmetric transformation (SRVT) in a general WGAN framework, we name it AWGAN.

In terms of implementation, our experiments show performance improvement on unconditional and conditional image generation tasks. The novelty of our work is summarised as follows:

1. We propose a metric between probability distributions which is bounded by a scaled $p$-Wasserstein distance. We prove theoretically that it is a valid metric and use it in GAN objectives;
2. We utilize an asymmetrical (square-root velocity) transformation which breaks the symmetric structure of the discriminator network. We have conducted experiments to show its effectiveness in our work;
3. The proposed framework improves the stability of training in the setting of high-dimensional critic output without introducing additional hyper-parameters to tune.

2 Related work

Wasserstein Distance and Other Discrepancies Used in GAN: Arjovsky et al. (2017) applied the Kantorovich-Rubinstein duality for the $1$-Wasserstein distance as the loss function in the GAN objective. WGAN makes great progress toward stable training compared with previous GANs, and marks the start of using the Wasserstein distance in GANs. However, it may still converge to sub-optimal points or fail to converge, because the Lipschitz condition is enforced crudely by weight clipping. To resolve these issues, researchers proposed more sophisticated ways Gulrajani et al. (2017); Wei et al. (2018); Miyato et al. (2018) to enforce the Lipschitz condition for stable training. More recently, another way of involving the Wasserstein distance in GANs has been proposed Wu et al. (2019); Kolouri et al. (2019); Deshpande et al. (2018); Lee et al. (2019). These works use the Sliced Wasserstein Distance Rabin et al. (2011); Kolouri et al. (2016) to estimate the Wasserstein distance from samples, based on a summation over projections along random directions. Both families of methods rely on pushforwards of the real and fake distributions through Lipschitz functions or projections onto $1$-dimensional space. In our work, we attempt to distinguish two distributions by looking at their pushforwards in high-dimensional space. This adds considerable flexibility to the convergence path, which may prevent the minimizer from getting stuck at a poor local optimum.

Another way to distinguish real and fake data distributions in generative networks is moment matching Li et al. (2015); Dziugaite et al. (2015). In particular, in Li et al. (2017) the authors used the kernel maximum mean discrepancy (MMD) in the GAN objective, which aims to match moments of all orders. In our work we propose to use the maximum discrepancy between $p$-centrality functions to measure the distance between two distributions. The $p$-centrality function (Definition 3.1) is exactly the $p$-th root of the $p$-th moment of a distribution. Hence, the maximal $p$-centrality discrepancy we propose can be viewed as an attempt to match the $p$-th moment for any given $p$.

$p$-Centrality Functions: The mean or expectation of a distribution is a basic statistic. In particular, in Euclidean spaces, it is well known that the mean is the unique minimizer of the so-called Fréchet function of order $2$ (cf. Grove and Karcher (1973); Bhattacharya and Patrangenaru (2003); Arnaudon et al. (2013)). Generally speaking, a Fréchet function of order $p$ summarizes the $p$-th moment of a distribution with respect to any base point. A topological study of Fréchet functions is carried out in Hang et al. (2019), which shows that by taking the $p$-th root of a Fréchet function, the $p$-centrality function can derive topological summaries of a distribution that are robust with respect to the $p$-Wasserstein distance. In our work, we propose using $p$-centrality functions to build a discrepancy distance between distributions, which benefits from this close connection with the $p$-Wasserstein distance.

Asymmetrical Networks: Symmetries occur frequently in deep neural networks. By symmetry we refer to certain group actions on the weight parameter space which keep the objective function invariant. These symmetries cause redundancy in the weight space and affect the generalization capacity of the network Liang et al. (2019); Badrinarayanan et al. (2015). There are two types of symmetry: (i) permutation invariance; (ii) rescaling invariance. A straightforward way to break symmetry is random initialization (cf. Glorot and Bengio (2010); He et al. (2015)). Another way is to use skip connections, which add extra connections between nodes in different layers He et al. (2016a, b); Huang et al. (2017). In our work, we attempt to break the permutation symmetry of the output layer in the discriminator using a nonparametric asymmetrical transformation specified by the square-root velocity function (SRVF) Srivastava et al. (2011); Srivastava and Klassen (2016). The simple transformation that converts functions into their SRVFs changes the Fisher-Rao metric into the $L^2$ norm, enabling efficient analysis of high-dimensional data. Since the discretized formulation of SRVF is equivalent to a non-fully connected network (as depicted in Fig. 2), our approach can be viewed as breaking symmetry by deleting specific connections from the network.

Figure 1: The discriminator design of our framework. D represents a general discriminator network with multidimensional output. The output of its last dense layer is then transformed by the SRVT block. The -block implements the proposed objective function.

3 Proposed Framework

In this section we introduce the proposed GAN framework, in which the objective function is built on the maximal $p$-centrality discrepancy and the discriminator is composed with an SRVT block.

3.1 Objective Function

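The objective function of the proposed GAN is as follows:

$$\min_G \max_D \; \left(\mathbb{E}_{x \sim \mathbb{P}_r}\left[\|D(x)\|^p\right]\right)^{1/p} - \left(\mathbb{E}_{z \sim \mathbb{P}_z}\left[\|D(G(z))\|^p\right]\right)^{1/p},$$

where $\|\cdot\|$ denotes the $L^2$ norm, $G$ and $D$ denote the generator and discriminator respectively, and $p$ refers to the order of moments. Here $x \sim \mathbb{P}_r$ is the input real sample and $z \sim \mathbb{P}_z$ is a noise vector for the generated sample. The output of the last dense layer of the discriminator is an $n$-dimensional vector in the Euclidean space $\mathbb{R}^n$.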
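As a rough illustration of how such an objective could be evaluated on a mini-batch, the sketch below assumes the objective is the difference between the $p$-th roots of the mean $p$-th powers of the Euclidean norms of the critic outputs on real and generated samples. The function names `centrality_at_origin` and `critic_objective` are hypothetical and are not taken from the authors' code; the critic outputs are given directly as lists of vectors rather than produced by a network.

```python
import math

def centrality_at_origin(outputs, p):
    """p-th root of the mean p-th power of the Euclidean norms of `outputs`."""
    norms = [math.sqrt(sum(c * c for c in v)) for v in outputs]
    return (sum(r ** p for r in norms) / len(norms)) ** (1.0 / p)

def critic_objective(real_outputs, fake_outputs, p=1):
    """Assumed batch estimate: centrality of real outputs minus that of fake outputs."""
    return centrality_at_origin(real_outputs, p) - centrality_at_origin(fake_outputs, p)

# Toy critic outputs in R^2 (hypothetical values, not from the paper).
real = [[3.0, 4.0], [0.0, 0.0]]   # Euclidean norms 5 and 0
fake = [[1.0, 0.0], [0.0, 1.0]]   # Euclidean norms 1 and 1

value_p1 = critic_objective(real, fake, p=1)   # 2.5 - 1.0
value_p2 = critic_objective(real, fake, p=2)   # sqrt(12.5) - 1.0
```

In a training loop the critic would take gradient steps to increase this quantity and the generator would take steps to decrease it; that part is omitted here.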

The forward pass pipeline of our framework is shown in Fig. 1. In contrast to traditional WGAN with $1$-dimensional discriminator output, our framework allows the last dense layer of the discriminator to have multidimensional output, which is required for the subsequent asymmetrical transformation (SRVT) block. We discuss the motivation for this transformation in Section 3.3. Here we use residual blocks as feature extractors for illustration and implementation; in practice they can be replaced by any other reasonable feature extractors.

3.2 The maximal $p$-centrality discrepancy

The $p$-centrality function was introduced in Hang et al. (2019) and offers a way to obtain robust topological summaries of a probability distribution. In this section, we introduce a metric on the space of probability distributions formed by the $p$-centrality functions and show its relation to the $p$-Wasserstein distance and to an $L^\infty$-type distance.

Definition 3.1 ($p$-centrality function).

Given a Borel probability measure $\mu$ on a metric space $(M, d)$ and $p \ge 1$, the $p$-centrality function is defined as

$$\sigma_{\mu,p}(x) := \left(\int_M d^p(x, y)\, d\mu(y)\right)^{1/p} = \left(\mathbb{E}_{y \sim \mu}\left[d^p(x, y)\right]\right)^{1/p}.$$

In particular, the value of the $p$-centrality function at $x$ is the $p$-th root of the $p$-th moment of $\mu$ with respect to $x$. The $p$-th moments are important statistics of a probability distribution. After taking the $p$-th root, the $p$-centrality function retains the information in the $p$-th moments, and it also shows a direct connection with the $p$-Wasserstein distance $W_p$:

Lemma 3.2.

For any $x \in M$, let $\delta_x$ be the Dirac measure centered at $x$. Then $\sigma_{\mu,p}(x) = W_p(\mu, \delta_x)$.

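Lemma 3.3.

For any two Borel probability measures $\mu$ and $\nu$ on $M$, we have

$$\|\sigma_{\mu,p} - \sigma_{\nu,p}\|_\infty \le W_p(\mu, \nu).$$

Proof. For any $x \in M$, by Lemma 3.2 and the triangle inequality we have

$$|\sigma_{\mu,p}(x) - \sigma_{\nu,p}(x)| = |W_p(\mu, \delta_x) - W_p(\nu, \delta_x)| \le W_p(\mu, \nu).$$

The result follows by letting $x$ run over all of $M$. ∎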
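The bound can be checked numerically in a simple setting. The sketch below assumes the lemma reads $\sup_x |\sigma_{\mu,p}(x) - \sigma_{\nu,p}(x)| \le W_p(\mu, \nu)$ and works with empirical measures on the real line, where the $p$-Wasserstein distance between two equal-size samples is obtained by matching sorted values. The Gaussian parameters, sample size and evaluation grid are arbitrary choices for illustration.

```python
import math
import random

def centrality(sample, x, p):
    """Empirical p-centrality of a 1-D sample at base point x."""
    return (sum(abs(x - y) ** p for y in sample) / len(sample)) ** (1.0 / p)

def wasserstein_1d(a, b, p):
    """p-Wasserstein distance between equal-size empirical measures on the line.

    In one dimension the optimal coupling pairs the sorted samples.
    """
    assert len(a) == len(b)
    cost = sum(abs(s - t) ** p for s, t in zip(sorted(a), sorted(b))) / len(a)
    return cost ** (1.0 / p)

random.seed(0)
mu = [random.gauss(0.0, 1.0) for _ in range(200)]
nu = [random.gauss(0.5, 2.0) for _ in range(200)]
p = 2

# Largest gap between the two centrality functions over a grid of base points.
grid = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(centrality(mu, x, p) - centrality(nu, x, p)) for x in grid)
w_p = wasserstein_1d(mu, nu, p)

# Centrality at x also equals the distance to a point mass at x.
dirac_check = wasserstein_1d(mu, [0.3] * len(mu), p)
```

The grid only samples the supremum, so `max_gap` is a lower estimate of the left-hand side; the inequality against `w_p` should nevertheless hold at every grid point.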

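Let $\mathcal{P}(M)$ be the set of all probability measures on $M$ and let $C(M)$ be the set of all continuous functions on $M$. We define an operator $\sigma_p: \mathcal{P}(M) \to C(M)$ s.t. $\sigma_p(\mu) = \sigma_{\mu,p}$. The above lemma implies that $\sigma_p$ is $1$-Lipschitz, which makes $\sigma_{\mu,p}$ a powerful indicator of a probability measure $\mu$. Specifically, since the $p$-Wasserstein distance metrizes weak convergence when $M$ is compact, we have:

Proposition 3.4.

If $M$ is compact and $\mu_k$ weakly converges to $\mu$, then $\sigma_{\mu_k,p}$ converges to $\sigma_{\mu,p}$ with respect to the $L^\infty$ distance.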

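Recall that in WGAN, the discriminator is viewed as a $1$-Lipschitz function. In our understanding, this requirement is enforced to prevent the discriminator from distorting the input distributions too much. More precisely, in the more general setting, the following is true:

Proposition 3.5.

Given any $K$-Lipschitz map $f: (M, d_M) \to (N, d_N)$ and Borel probability distributions $\mu, \nu$ on $M$, the pushforward distributions $f_*\mu$ and $f_*\nu$ satisfy

$$W_p(f_*\mu, f_*\nu) \le K \cdot W_p(\mu, \nu).$$

Proof. Let $\Gamma(\mu, \nu)$ be the set of all joint probability measures of $\mu$ and $\nu$. For any $\gamma \in \Gamma(\mu, \nu)$, we have $(f \times f)_*\gamma \in \Gamma(f_*\mu, f_*\nu)$. By definition of the $p$-Wasserstein distance,

$$W_p(f_*\mu, f_*\nu)^p \le \int d_N(f(x), f(y))^p \, d\gamma(x, y) \le K^p \int d_M(x, y)^p \, d\gamma(x, y),$$

and taking the infimum over $\gamma \in \Gamma(\mu, \nu)$ gives the claim. ∎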
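For the purpose of our paper, we focus on Lipschitz maps to Euclidean spaces. Denote by $\mathrm{Lip}_1(M, \mathbb{R}^n)$ the set of all $1$-Lipschitz functions $f: M \to \mathbb{R}^n$. When $n = 1$, the dual formulation of $W_1$ gives

$$W_1(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1(M, \mathbb{R})} \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{y \sim \nu}[f(y)].$$

Motivated by this, we replace the expectations by values of the $p$-centrality functions at a base point $x_0 \in \mathbb{R}^n$ and define:

$$L_{p,n}(\mu, \nu) := \sup_{f \in \mathrm{Lip}_1(M, \mathbb{R}^n)} \left|\sigma_{f_*\mu,p}(x_0) - \sigma_{f_*\nu,p}(x_0)\right|.$$

Lemma 3.6.

The definition of $L_{p,n}$ is independent of the choice of the base point. Or simply

$$L_{p,n}(\mu, \nu) = \sup_{f \in \mathrm{Lip}_1(M, \mathbb{R}^n)} \left|\left(\mathbb{E}_{x \sim \mu}\|f(x)\|^p\right)^{1/p} - \left(\mathbb{E}_{y \sim \nu}\|f(y)\|^p\right)^{1/p}\right|.$$

Proof. Let $T$ be the translation map on $\mathbb{R}^n$ with $T(x) = x - x_0$. Then $f \in \mathrm{Lip}_1(M, \mathbb{R}^n)$ iff $T \circ f \in \mathrm{Lip}_1(M, \mathbb{R}^n)$, and

$$\sigma_{f_*\mu,p}(x_0) = \sigma_{(T \circ f)_*\mu,p}(0). \qquad ∎$$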
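The following proposition implies that $L_{p,n}$ is a direct generalization of the Wasserstein distance:

Proposition 3.7.

If the supports of $\mu$ and $\nu$ are both compact, then

$$L_{1,1}(\mu, \nu) = W_1(\mu, \nu).$$

Proof. Since $f \in \mathrm{Lip}_1(M, \mathbb{R})$ implies $|f| \in \mathrm{Lip}_1(M, \mathbb{R})$, we easily have $L_{1,1}(\mu, \nu) \le W_1(\mu, \nu)$.

On the other hand, for any $\varepsilon > 0$, there exists a $1$-Lipschitz map $f$ s.t. $\mathbb{E}_{\mu}[f] - \mathbb{E}_{\nu}[f] > W_1(\mu, \nu) - \varepsilon$. Let $c$ be the minimum of $f$ over the two supports and $g = f - c$, then $g \in \mathrm{Lip}_1(M, \mathbb{R})$ and $g \ge 0$ on both supports. Hence $L_{1,1}(\mu, \nu) \ge \mathbb{E}_{\mu}[|g|] - \mathbb{E}_{\nu}[|g|] > W_1(\mu, \nu) - \varepsilon$ for any $\varepsilon > 0$, which implies $L_{1,1}(\mu, \nu) \ge W_1(\mu, \nu)$. ∎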

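More generally, $L_{p,n}$ is closely related to the $p$-Wasserstein distance:

Theorem 3.8.

For any $K$-Lipschitz map $f: M \to \mathbb{R}^n$ and any base point $x_0 \in \mathbb{R}^n$,

$$\left|\sigma_{f_*\mu,p}(x_0) - \sigma_{f_*\nu,p}(x_0)\right| \le K \cdot W_p(\mu, \nu).$$

In particular, $L_{p,n}(\mu, \nu) \le W_p(\mu, \nu)$.

Proof. By Lemma 3.2, we have

$$\left|\sigma_{f_*\mu,p}(x_0) - \sigma_{f_*\nu,p}(x_0)\right| = \left|W_p(f_*\mu, \delta_{x_0}) - W_p(f_*\nu, \delta_{x_0})\right|.$$

Applying the triangle inequality and Proposition 3.5, we have

$$\left|W_p(f_*\mu, \delta_{x_0}) - W_p(f_*\nu, \delta_{x_0})\right| \le W_p(f_*\mu, f_*\nu) \le K \cdot W_p(\mu, \nu). \qquad ∎$$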
Also is closely related with an distance:

Proposition 3.9.

For any -Lipschitz map ,


The lower bound implies that, when we feed two distributions into the discriminator , as long as some differences retained in the pushforwards and , they would be detected by . The upper bound implies that, if and only differ a little bit under distance , then would not change too much. Furthermore,

Proposition 3.10.

If integers $n < n'$, then for any $\mu, \nu$, we have $L_{p,n}(\mu, \nu) \le L_{p,n'}(\mu, \nu)$.

Proof. For any $n < n'$ we have the natural embedding $\mathbb{R}^n \hookrightarrow \mathbb{R}^{n'}$. Hence any $1$-Lipschitz function with codomain $\mathbb{R}^n$ can also be viewed as a $1$-Lipschitz function with codomain $\mathbb{R}^{n'}$. A larger $n$ therefore gives a larger candidate pool in the search for the maximal discrepancy, and the result follows. ∎

Hence the maximal $p$-centrality discrepancy becomes more sensitive to the differences between distributions with larger $n$.
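A small worked example may help make this concrete. The sketch below compares two fixed $1$-Lipschitz maps on a pair of toy distributions in the plane: projection onto the first coordinate (output dimension 1) and the identity (output dimension 2). It evaluates the gap between $p$-centralities at the origin for each map; it does not compute the supremum over all $1$-Lipschitz maps, so the numbers are only lower bounds on the respective discrepancies. The point sets and helper names are hypothetical.

```python
import math

def centrality_at_origin(points, p):
    """p-th root of the mean p-th power of the Euclidean norms of `points`."""
    norms = [math.sqrt(sum(c * c for c in pt)) for pt in points]
    return (sum(r ** p for r in norms) / len(norms)) ** (1.0 / p)

# Two uniform distributions on two points each; they differ only in the
# second coordinate, so the first-coordinate marginals coincide.
mu = [(1.0, 0.0), (-1.0, 0.0)]
nu = [(1.0, 2.0), (-1.0, 2.0)]

def project(pt):
    """1-Lipschitz map to R^1: keep the first coordinate."""
    return (pt[0],)

def identity(pt):
    """1-Lipschitz map to R^2."""
    return pt

def gap(f, p=2):
    """Absolute difference of centralities at the origin after pushing forward by f."""
    pushed_mu = [f(x) for x in mu]
    pushed_nu = [f(x) for x in nu]
    return abs(centrality_at_origin(pushed_mu, p) - centrality_at_origin(pushed_nu, p))

gap_1d = gap(project)    # the projection hides the difference entirely
gap_2d = gap(identity)   # sqrt(5) - 1, the difference is visible
```

A different one-dimensional map (for example projection onto the second coordinate) would of course detect the difference here; the example only shows that a mediocre choice in one dimension can miss what an ordinary two-dimensional map reveals.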

3.3 Square Root Velocity Transformation

Proposition 3.10 suggests choosing a high-dimensional discriminator output to improve the performance of the GAN. However, if the last layer of the discriminator is fully connected, then all output neurons are in symmetric positions and the loss function is permutation invariant. Thus the generalization power of the discriminator depends only on the equivalence class obtained by identifying each output vector with its permutations Badrinarayanan et al. (2015); Liang et al. (2019). Correspondingly, the advantage of a high-dimensional output vector would be significantly undermined. In order to further improve the performance of our proposed framework, we consider adding an SRVT block to the discriminator to break the symmetric structure. SRVT is usually used in shape analysis to define a distance between curves or functional data.

Particularly, we view the high-dimensional discriminator output as an ordered sequence.

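Definition 3.11.

The signed square root function $Q: \mathbb{R} \to \mathbb{R}$ is given by $Q(x) = \mathrm{sgn}(x)\sqrt{|x|}$.

Given any differentiable function $f: [0, 1] \to \mathbb{R}$, its SRVT is a function $q: [0, 1] \to \mathbb{R}$ with

$$q(t) = Q(f'(t)) = \mathrm{sgn}(f'(t))\sqrt{|f'(t)|}.$$

SRVT is invertible. In particular, from $q$ and the initial value $f(0)$ we can recover $f$:

$$f(t) = f(0) + \int_0^t q(s)\,|q(s)|\, ds.$$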

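Lemma 3.12.

By assuming $x_0 = 0$, a discretized SRVT

$$S: \mathbb{R}^n \to \mathbb{R}^n, \quad (x_1, \dots, x_n) \mapsto (y_1, \dots, y_n)$$

is given by

$$y_i = Q(x_i - x_{i-1}), \quad i = 1, \dots, n.$$

Similarly, $S^{-1}$ is given by

$$x_i = \sum_{j=1}^{i} y_j\,|y_j|, \quad i = 1, \dots, n.$$

With this transformation, the pullback of the $L^2$ norm gives

$$\|S(x)\| = \sqrt{\sum_{i=1}^{n} |x_i - x_{i-1}|}.$$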

Applying SRVT to a high-dimensional vector results in an ordered sequence which captures the velocity difference at each consecutive position. The discretized SRVT can be represented as a neural network whose activation function is the signed square root function, as depicted in Fig. 2. In particular, for the purpose of our paper, each output neuron of SRVT is structurally unique:

Proposition 3.13.

Any (directed graph) automorphism of the SRVT block leaves each output neuron fixed.


Proof. View the SRVT block as a directed graph; then all output neurons have out-degree $0$. By the definition of the discretized SRVT, there is a unique output neuron with in-degree $1$, and any two different output neurons have different distances to it. Since any automorphism of a directed graph preserves in-degrees, out-degrees and distances, it has to map each output neuron to itself. ∎
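For a small block this claim can be verified by brute force. The sketch below assumes the SRVT block on $n$ inputs has directed edges $x_i \to y_i$ for every $i$ and $x_{i-1} \to y_i$ for $i > 1$ (that is, each output depends on one input and its predecessor, with the first output depending on a single input). This wiring is our reading of the discretized transformation rather than a specification taken from the authors' implementation. The code enumerates all permutations of the nodes and keeps those that map the edge set onto itself.

```python
from itertools import permutations

def srvt_graph(n):
    """Nodes and directed edges of an assumed SRVT block with n inputs and outputs."""
    nodes = [("x", i) for i in range(1, n + 1)] + [("y", i) for i in range(1, n + 1)]
    edges = set()
    for i in range(1, n + 1):
        edges.add((("x", i), ("y", i)))          # current input feeds output i
        if i > 1:
            edges.add((("x", i - 1), ("y", i)))  # previous input feeds output i
    return nodes, edges

def automorphisms(nodes, edges):
    """Yield every node permutation that preserves the directed edge set."""
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        if {(mapping[u], mapping[v]) for u, v in edges} == edges:
            yield mapping

n = 4
nodes, edges = srvt_graph(n)
autos = list(automorphisms(nodes, edges))

# Every automorphism found must fix each output neuron.
outputs_fixed = all(m[("y", i)] == ("y", i) for m in autos for i in range(1, n + 1))
```

The enumeration is factorial in the number of nodes, so it is only practical for very small `n`; it is meant as a sanity check of the statement, not as a proof.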

Also, the square-root operation has a smoothing effect which forces the magnitudes of derivatives to be more concentrated. Thus, the values at each output neuron contribute more evenly to the overall resulting discrepancy. This reduces the risk of over-emphasizing features on certain dimensions while ignoring the remaining ones.
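The following sketch shows one way a discretized SRVT and its inverse could be written, assuming a unit step size and an implicit leading value of zero, so that each output is the signed square root of the difference between consecutive inputs. The names `srvt` and `inverse_srvt` are hypothetical. The example also illustrates the loss of permutation symmetry: permuting the input vector leaves its Euclidean norm unchanged but changes the norm after the transformation.

```python
import math

def signed_sqrt(v):
    """Signed square root: sign(v) * sqrt(|v|)."""
    return math.copysign(math.sqrt(abs(v)), v)

def srvt(x):
    """Assumed discretized SRVT: signed square root of consecutive differences."""
    out, prev = [], 0.0          # implicit leading value x_0 = 0 (assumption)
    for v in x:
        out.append(signed_sqrt(v - prev))
        prev = v
    return out

def inverse_srvt(y):
    """Recover x by accumulating y_j * |y_j|."""
    x, total = [], 0.0
    for v in y:
        total += v * abs(v)
        x.append(total)
    return x

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

x = [1.0, 5.0, 4.0, 4.0]
y = srvt(x)                       # differences 1, 4, -1, 0
x_back = inverse_srvt(y)          # round trip

x_perm = [5.0, 1.0, 4.0, 4.0]     # same entries in a different order
norm_plain = (l2_norm(x), l2_norm(x_perm))               # identical
norm_srvt = (l2_norm(srvt(x)), l2_norm(srvt(x_perm)))    # sqrt(6) vs sqrt(12)
```

In a network this transformation would be applied to the output of the last dense layer before the norm is taken, and it has no trainable parameters.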

Figure 2: A representation of the SRVT block.

The whole training procedure of the proposed framework is summarized in Algorithm 1.

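Input: real data distribution and a prior (noise) distribution
    Output: generator and discriminator parameters

  while the generator parameters have not converged do
     for each discriminator update step do
        Sample a batch of real data from the real data distribution; sample a batch of random noise from the prior;
        Evaluate the objective of Section 3.1 on the batch; update the discriminator parameters to increase it;
     end for
     for each generator update step do
        Sample a batch of random noise from the prior;
        Evaluate the objective of Section 3.1 on the batch; update the generator parameters to decrease it;
     end for
  end while
Algorithm 1 Our proposed framework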

4 Experiments

In this section we provide experimental results supporting the proposed framework. We explore various setups to study characteristics of the proposed blocks. Since the proposed framework is closely related to WGAN, we also make a comparison between the two approaches in ablation study. Final evaluation results on benchmark datasets are presented afterwards.

4.1 Implementation Details

We applied the proposed framework in unconditional and conditional image generation tasks. For unconditional generation task, the generator and discriminator architectures were built following the ResNet architecture provided in Miyato et al. (2018). The SRVT block and -block were added to the last dense layer in the discriminator consecutively. Spectral normalization was utilized to ensure Lipschitz condition. Adam optimizer was used with learning rate , and . The length of input noise vector was set to , and batch size was fixed to in all experiments. Dimension of output from the last dense layer in discriminator was set to except for ablation study. For conditional generation task, we adopted the network architectures in BigGAN Brock et al. (2019) and used their default parameter settings except for necessary changes for the proposed framework. All training tasks were conducted on Tesla V100 GPU.

4.2 Datasets and Evaluation Metrics

For the unconditional image generation task, we ran experiments on the CIFAR-10 Krizhevsky et al. (2010), STL-10 Coates et al. (2011) and LSUN bedroom Yu et al. (2015) datasets. We used 60K images including 50K training images and 10K test images in CIFAR-10, 100K unlabelled images in STL-10, and 3M images in LSUN bedroom, respectively. For the conditional generation task we used the CIFAR-10 and CIFAR-100 datasets. For each dataset, we center-cropped and resized the images, where images in STL-10 were resized to and images in LSUN bedroom were resized to . Results were evaluated with Fréchet Inception Distance (FID) Heusel et al. (2017), Kernel Inception Distance (KID) Bińkowski et al. (2018b) and Precision and Recall (PR) Sajjadi et al. (2018). Lower FID and KID scores and higher PR indicate better performance. In the ablation study we generated 10K images for fast evaluation. For the final evaluation on the three datasets we used 50K generated samples. Please refer to the supplementary material for more details.

4.3 Results

In the following sections we first present ablation experimental results on CIFAR-10 with analysis, and then report final evaluation scores on all datasets. We display randomly generated samples in supplementary material.

Ablation Study:

We first conducted experiments under different settings to explore the effects of the $p$-centrality function and SRVT used in our framework. Since our approach is closely related to WGAN, we also include results from WGAN-GP and WGAN-SN for comparison. For each setup we trained for 100K generator iterations on the CIFAR-10 dataset, and report average FID scores calculated from 5 runs in Fig 3. For this experiment we used 10K generated samples for fast evaluation. One can see that without SRVT (three green curves), settings with higher-dimensional critic output resulted in better evaluation performance. The pattern is the same when comparing cases with SRVT (three blue curves). These observations are consistent with Proposition 3.10. Furthermore, the results show that the asymmetric transformation boosts performance for different choices of $n$, especially when (blue green). Our settings with high-dimensional critic output outperform both WGAN-GP and WGAN-SN. In fact, sample quality can be improved further with more training iterations, and we observe that our training sessions can reach a better convergence point.

Figure 3: FID comparison under different settings.

In Fig 4 we also present plots of precision and recall from these settings. Note that for WGAN-GP we obtained recall and precision. Since the scores are far behind those from other methods, for display convenience we did not include WGAN-GP in the figure. As we see our default setting with the highest dimensional critic output and with the use of SRVT outperforms results from other settings.

Figure 4: Precision and recall plot under different settings.

We further present comparisons using KID under different settings in Fig 5. Results in Fig 5(a) are aligned with previous evaluations which shows the advantage of using higher dimensional critic output. Performance was further boosted with SRVT. Fig 5(b) shows KID evaluations under different choices of s, where SRVT was used with fixed . We observe using only, or both and resulted in better performance compared with using only. In practice one can customize for usage.

To keep a stable training session on CIFAR-10, WGAN-GP requires , WGAN-SN requires , and the proposed framework is able to train stably with with spectral normalization. For the same number of generator iterations, our approach can produce better sample qualities, with nearly the same amount of network parameters and training time. In experiments we observe generally a higher dimensional critic output requires less to result in a stable training session. This is consistent with our theoretical results that a bigger leads to a “stronger” discriminator, and to result in a balanced game for the two networks, a smaller can be used to ensure the training session going forward stably.

Figure 5: KID evaluation under different settings. (a) Left: without SRVT; Right: default setting with SRVT. (b) Evaluation with SRVT under different s with fixed .

We further conducted experiments to validate the effect of SRVT with MMD-GAN objective Li et al. (2017). For implementation we used the authors’ default hyper-parameter settings and network architectures.

Dimension of critic output    16       128      1024
w/o SRVT (Default)            17(1)    16(1)    20(1)
w/ SRVT                       14(1)    13(1)    16(1)
Table 1: Evaluation of KID(x) on the effect of SRVT with MMD-GAN objective and DCGAN architectures.

From Table 1 one can see that SRVT improves KID at every critic output dimension tested. The best result in the table was obtained with $n = 128$ (default setup in [28]). We also notice that for MMD-GAN a higher $n$ (for example $n = 1024$ in Table 1) did not improve performance Bińkowski et al. (2018a), while we have shown that our framework can take advantage of higher-dimensional critic output features.

In the following section we display our final evaluation results on four datasets.

Quantitative Results:
Since GAN training depends heavily on network architectures, for fair comparison we only list comparable results obtained with the same network architectures. For the unconditional generation task, we present FID scores on the three datasets, averaged over 5 random runs, in Table 2. We compare with methods that are related to our work, including WGAN-GP Gulrajani et al. (2017), MMD GAN-rq Li et al. (2017), SNGAN Miyato et al. (2018), CTGAN Wei et al. (2018), Sphere GAN Park and Kwon (2019), SWGAN Wu et al. (2019), CRGAN Zhang et al. (2020) and DGflow Ansari et al. (2021).

Method        CIFAR-10     STL-10       LSUN
WGAN-GP       19.0(0.8)    55.1         26.9(1.1)
SNGAN         14.1(0.6)    40.1(0.5)    31.3(2.1)
MMD GAN-rq    -            -            32.0
CTGAN         17.6(0.7)    -            19.5(1.2)
Sphere GAN    17.1         31.4         16.9
SWGAN         17.0(1.0)    -            14.9(1.0)
CRGAN         14.6         -            -
DGflow        9.6(0.1)     -            -
Ours          8.5(0.3)     26.1(0.4)    14.2(0.2)
Table 2: FIDs from unconditional generation experiments with ResNet architectures.

Objective    CIFAR-10    CIFAR-100
Hinge        9.7(0.1)    13.6(0.1)
Ours         8.9(0.1)    12.3(0.1)
Table 3: FIDs from conditional generation experiments with BigGAN architectures.

CIFAR-10: As we see in Table 2, the proposed method outperforms other comparable approaches under the same network architectures. It is able to generate high quality samples from different classes of objects using the simple architectures as presented in supplementary material.

STL-10: Data distribution of STL-10 is more diverse and complicated compared to CIFAR-10. Despite the difficulty of training on the dataset using simple network architectures, our method was able to obtain competitive results. We also conducted experiments with original images on the dataset, and display randomly generated samples in supplementary material. While STL-10 is harder to train using ResNet architectures compared with the other two datasets, we observe our method manages to generate visually distinguishable samples from different classes for the diverse and complicated data distribution.

LSUN bedroom: The LSUN bedroom dataset has a relatively simple data distribution, and most of the methods listed in Table 2 are able to generate high quality samples. Our method also achieves a competitive FID on this dataset.

For conditional generation, we show evaluation results from the original BigGAN setting and the proposed objective in Table 3. The results indicate the proposed framework can also be applied in the more sophisticated training setting and obtain competitive performance.

Overall the proposed method is capable of obtaining competitive performances with different network architectures on the four datasets. In addition, compared to some classic approaches Li et al. (2017); Wu et al. (2019); Ansari et al. (2020), the proposed framework does not require additional parameter tuning, which greatly facilitates implementation. In our experiments we did not see evidence of mode collapse.

5 Conclusion and Discussion

In this paper we have proposed the maximal $p$-centrality discrepancy as a metric on the space of probability distributions, and used it in GAN objectives. The proposed metric fits well in the WGAN framework, especially when the critic has multidimensional output. We have also proved that when $n = 1$, the maximal $p$-centrality discrepancy is equivalent to the $1$-Wasserstein distance. We have further added an asymmetrical (square-root velocity) transformation to the discriminator to break the symmetric structure of its network output. This nonparametric transformation takes advantage of multidimensional features and improves the generalization capability of the critic network. In terms of implementation, the proposed framework improves training performance without the need for extra hyper-parameter tuning. Experiments on unconditional and conditional image generation tasks show its effectiveness.


  • A. F. Ansari, M. L. Ang, and H. Soh (2021) Refining deep generative models via discriminator gradient flow. In International Conference on Learning Representations, Cited by: §4.3.
  • A. F. Ansari, J. Scarlett, and H. Soh (2020) A characteristic function approach to deep implicit generative modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.3.
  • M. Arjovsky and L. Bottou (2017) Towards principled methods for training generative adversarial networks. Cited by: §1.
  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 214–223. Cited by: §1, §2.
  • M. Arnaudon, F. Barbaresco, and L. Yang (2013) Medians and means in riemannian geometry: existence, uniqueness and computation. In Matrix Information Geometry, pp. 169–197. Cited by: §2.
  • V. Badrinarayanan, B. Mishra, and R. Cipolla (2015) Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029. Cited by: §1, §2, §3.3.
  • R. Bhattacharya and V. Patrangenaru (2003) Large sample theory of intrinsic and extrinsic sample means on manifolds. The Annals of Statistics 31 (1), pp. 1–29. Cited by: §2.
  • M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018a) Demystifying MMD GANs. In International Conference on Learning Representations, Cited by: §4.3.
  • M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018b) Demystifying MMD GANs. In International Conference on Learning Representations, Cited by: §4.2.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, Cited by: §4.1.
  • E. R. Chan, M. Monteiro, P. Kellnhofer, J. Wu, and G. Wetzstein (2021) Pi-gan: periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5799–5809. Cited by: §1.
  • A. Coates, A. Ng, and H. Lee (2011) An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Vol. 15, pp. 215–223. Cited by: §4.2.
  • I. Deshpande, Y. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. G. Schwing (2019) Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • I. Deshpande, Z. Zhang, and A. G. Schwing (2018) Generative modeling using the sliced wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • G. K. Dziugaite, D. M. Roy, and Z. Ghahramani (2015) Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 258–267. Cited by: §2.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • K. Grove and H. Karcher (1973) How to conjugate C¹-close group actions. Mathematische Zeitschrift 132 (1), pp. 11–20. Cited by: §2.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved training of wasserstein gans. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 5769–5779. Cited by: §2, §4.3.
  • X. Han, X. Chen, and L. Liu (2021) GAN ensemble for anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence 35 (5), pp. 4090–4097. Cited by: §1.
  • H. Hang, F. Mémoli, and W. Mio (2019) A topological study of functional data and fréchet functions of metric measure spaces. Journal of Applied and Computational Topology 3 (4), pp. 359–380. Cited by: §2, §3.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016a) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016b) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §2.
  • E. Heim (2019) Constrained generative adversarial networks for interactive image generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 6629–6640. Cited by: §4.2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.
  • T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. In Proc. NeurIPS, Cited by: §1.
  • S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Rohde (2019) Generalized sliced wasserstein distances. In Advances in Neural Information Processing Systems, Vol. 32, pp. 261–272. Cited by: §2.
  • S. Kolouri, Y. Zou, and G. K. Rohde (2016) Sliced wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5258–5267. Cited by: §2.
  • A. Krizhevsky, V. Nair, and G. Hinton (2010) CIFAR-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html. Cited by: §4.2.
  • C. Lee, T. Batra, M. H. Baig, and D. Ulbricht (2019) Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • C. Li, W. Chang, Y. Cheng, Y. Yang, and B. Póczos (2017) Mmd gan: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pp. 2203–2213. Cited by: §1, §2, §4.3, §4.3, §4.3.
  • Y. Li, K. Swersky, and R. S. Zemel (2015) Generative moment matching networks. CoRR abs/1502.02761. Cited by: §2.
  • T. Liang, T. Poggio, A. Rakhlin, and J. Stokes (2019) Fisher-rao metric, geometry, and complexity of neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 888–896. Cited by: §1, §2, §3.3.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, Cited by: §2, §4.1, §4.3, §6.
  • Y. Mroueh, C. Li, T. Sercu, A. Raj, and Y. Cheng (2017a) Sobolev GAN. CoRR abs/1711.04894. Cited by: §1.
  • Y. Mroueh, T. Sercu, and V. Goel (2017b) McGan: mean and covariance feature matching GAN. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 2527–2535. Cited by: §1.
  • Y. Mroueh and T. Sercu (2017) Fisher gan. In Advances in Neural Information Processing Systems, pp. 2513–2523. Cited by: §1.
  • N. Nauata, K. Chang, C. Cheng, G. Mori, and Y. Furukawa (2020) House-gan: relational generative adversarial networks for graph-constrained house layout generation. In European Conference on Computer Vision, pp. 162–177. Cited by: §1.
  • M. Niemeyer and A. Geiger (2021) GIRAFFE: representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11453–11464. Cited by: §1.
  • S. W. Park and J. Kwon (2019) Sphere generative adversarial network based on geometric moment matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §4.3.
  • J. Rabin, G. Peyré, J. Delon, and M. Bernot (2011) Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 435–446. Cited by: §2.
  • M. S. M. Sajjadi, O. Bachem, M. Lučić, O. Bousquet, and S. Gelly (2018) Assessing Generative Models via Precision and Recall. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §4.2.
  • A. Srivastava, E. Klassen, S. H. Joshi, and I. H. Jermyn (2011) Shape analysis of elastic curves in euclidean spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (7), pp. 1415–1428. Cited by: §2.
  • A. Srivastava and E. P. Klassen (2016) Functional and shape data analysis. Vol. 1, Springer. Cited by: §2.
  • X. Wei, Z. Liu, L. Wang, and B. Gong (2018) Improving the improved training of wasserstein GANs. In International Conference on Learning Representations, Cited by: §2, §4.3.
  • J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool (2019) Sliced wasserstein generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §4.3, §4.3.
  • F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4.2.
  • F. Zhan, H. Zhu, and S. Lu (2019) Spatial fusion gan for image synthesis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • H. Zhang, Z. Zhang, A. Odena, and H. Lee (2020) Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, Cited by: §4.3.

6 Appendix

Network architectures for the unconditional generation task: We employed the ResNet architectures of Miyato et al. (2018). The detailed generator and discriminator architectures are shown in Tables 4 and 5.

ResBlock up 256
ResBlock up 256
ResBlock up 256
BN, ReLU, conv, 3 Tanh
Table 4: Generator architecture for images.

ResBlock down 128
ResBlock down 128
ResBlock 128
ResBlock 128
Global avg pooling
SRVT block
Table 5: Discriminator architecture for images.

For STL-10 with image size and LSUN with image size , we changed the number of units of the dense layer in the generator to and respectively. All other settings were the same as above. Tables 6 and 7 show the network architectures implemented on images.

ResBlock up 512
ResBlock up 256
ResBlock up 128
ResBlock up 64
BN, ReLU, conv, 3 Tanh
Table 6: Generator architecture for images.

ResBlock down 64
ResBlock down 128
ResBlock down 256
ResBlock down 512
ResBlock 512
Global avg pooling
SRVT block
Table 7: Discriminator architecture for images.

Figure 6: Randomly generated samples on CIFAR-10 (left), LSUN bedroom (middle) and STL-10 (right) images with ResNet architectures using the proposed method.

Evaluation setup details: For testing, we compared 50,000 randomly generated samples against the real sets. For each dataset, we randomly sampled 50,000 images and computed FID using 10 bootstrap resamplings. Features were extracted from the pool3 layer of a pre-trained Inception network. Randomly generated examples from the unconditional tasks are displayed in Figure 6.
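
The FID protocol above can be illustrated with a minimal sketch. It is not the evaluation code used for the reported results. It assumes the Inception pool3 features have already been extracted into NumPy arrays with one row per image, and it assumes that each bootstrap resampling draws indices with replacement from both feature sets, which the text does not specify. The names frechet_distance and bootstrap_fid are hypothetical, and small synthetic Gaussian features stand in for real Inception features.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative when both
    # covariances are positive semi-definite. Clipping removes round-off noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def bootstrap_fid(real_feats, fake_feats, n_boot=10, seed=0):
    """Mean and standard deviation of the distance over bootstrap resamplings.

    Assumption: both sets are resampled with replacement at their full size.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx_r = rng.integers(0, len(real_feats), size=len(real_feats))
        idx_f = rng.integers(0, len(fake_feats), size=len(fake_feats))
        scores.append(frechet_distance(real_feats[idx_r], fake_feats[idx_f]))
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-ins for pool3 features (real features are 2048-dimensional).
demo_rng = np.random.default_rng(1)
real = demo_rng.normal(0.0, 1.0, size=(2000, 8))
fake = demo_rng.normal(0.5, 1.0, size=(2000, 8))
mean_fid, std_fid = bootstrap_fid(real, fake, n_boot=10)
print(f"FID (bootstrap mean, std): {mean_fid:.3f}, {std_fid:.3f}")
```

In the synthetic example the two feature distributions differ only by a mean shift of 0.5 in each of 8 dimensions, so the score should be close to 8 × 0.25 = 2. With real data, the two arrays would instead hold the pool3 features of the 50,000 real and 50,000 generated images.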