1 Introduction
Generative Adversarial Networks (GANs; Goodfellow et al., 2014) and their variants (Radford et al., 2016; Arjovsky et al., 2017; Li et al., 2017) are powerful models for learning complex distributions. Broadly, these methods rely on an adversary that compares samples from the true and learned distributions, giving rise to a notion of divergence between them. The divergences implied by current methods require the two distributions to be supported on sets that are identical or at the very least comparable; examples include Optimal Transport (OT) distances (Salimans et al., 2018; Genevay et al., 2018)
or Integral Probability Metrics (IPMs)
(Müller, 1997; Sriperumbudur et al., 2012; Mroueh et al., 2017). In all of these cases, the spaces over which the distributions are defined must have the same dimensionality (e.g., the space of pixel vectors for MNIST), and the generated distribution that minimizes the objective has the same support as the reference one. This is of course desirable when the goal is to generate samples that are
indistinguishable from those of the reference distribution. Many other applications, however, require modeling only topological or relational aspects of the reference distribution. In such cases, the absolute location of the data manifold is irrelevant (e.g., distributions over learned representations, such as word embeddings, are defined only up to rotations), or it is not available (e.g., if the data is accessible only as a weighted graph indicating similarities among sample points). Another reason for modeling only topological aspects is the desire to, e.g., change the appearance or style of the samples, or to downscale images. Divergences that directly compare samples from the two distributions, and hence most current generative models, do not apply in those settings.
In this work, we develop a novel class of generative models that can learn across incomparable spaces, e.g., spaces of different dimensionality or data type. Here, the relational information between samples, i.e., the topology of the reference data manifold, is preserved, but other characteristics, such as the ambient dimension, can vary. A key component of our approach is the Gromov-Wasserstein (GW) distance (Mémoli, 2011), a generalization of classic Optimal Transport distances to incomparable ground spaces. Instead of directly comparing points in the two spaces, the GW distance computes pairwise intra-space distances, and compares those distances across spaces, greatly increasing the modeling scope. Figure 1 illustrates the new model.
To realize this model, we address several challenges. First, we enable the use of the Gromov-Wasserstein distance in various learning settings by improving its robustness and ensuring unbiased learning. Similar to existing OT-based generative models (Salimans et al., 2018; Genevay et al., 2018), we leverage the differentiability of this distance to provide gradients for the generator. Second, for efficiency, we further parametrize it via a learnable adversary. The added flexibility of the GW distance necessitates constraining the adversary. To this end, we propose a novel orthogonality regularization, which may be of independent interest.
A final challenge, which doubles as one of the main advantages of this approach, arises from the added flexibility of the generator: it allows us to freely alter superficial characteristics of the generated distribution while still learning the basic structure of the reference distribution. We show examples of how to steer these additional degrees of freedom via regularization or adversaries in the model. The resulting model subsumes the traditional (i.e., same-space) adversarial models as a special case, but can do much more. For example, it learns cluster structure across spaces of different dimensionality and across different data types, e.g., from graphs to Euclidean space. Thus, our
GW GAN can also be viewed as performing dimensionality reduction or manifold learning; however, departing from classical approaches to these problems, it recovers, in addition to the manifold structure of the data, the probability distribution defined over it. Moreover, we propose a general framework for stylistic modifications by integrating a style adversary; we demonstrate its use by changing the thickness of learned MNIST digits. In summary, this work provides a framework that substantially expands the potential applications of generative adversarial learning.
Contributions.
We make the following contributions:


We introduce a new class of generative models that can learn distributions across different dimensionalities or data types.

We demonstrate the model’s range of applications by deploying it to manifold learning, relational learning, and cross-domain learning tasks.

More generally, our modifications of the Gromov-Wasserstein discrepancy enable its use as a loss function in various machine learning applications.

Our new approach to approximately enforcing orthogonality in neural networks, based on the orthogonal Procrustes problem, also applies beyond our model.
2 Model
Given a dataset of observations drawn from a reference distribution $p$, we aim to learn a generative model $g_\theta$, parametrized by $\theta$, purely based on relational and intra-structural characteristics of the dataset. The generative model $g_\theta$, typically a neural network, maps random noise $z$ to a generator space $\mathcal{Y}$ that is independent of the data space $\mathcal{X}$.
2.1 Gromov-Wasserstein Discrepancy
Learning generative models typically relies on a statistical divergence between the target distribution and the model's current estimate. Classical statistical divergences only apply when comparing distributions whose supports lie in the same metric space, or when at least a meaningful distance between points in the two supports can be computed. When the data space
$\mathcal{X}$ and generator space $\mathcal{Y}$ are different, these divergences no longer apply. Hence, instead, we will use a more suitable divergence measure. Rather than relying on a metric across the spaces, the Gromov-Wasserstein (GW) distance (Mémoli, 2011) compares distributions by computing a discrepancy between the metrics defined within each of the spaces. As a consequence, it is oblivious to specific characteristics or the dimensionality of the spaces.
Given samples of the compared distributions $p$ and $q$, the discrete formulation of the GW distance needs a similarity (or distance) matrix between the samples and a probability vector for each space, say $(D, p)$ and $(\bar{D}, q)$, with $p \in \Sigma_n$ and $q \in \Sigma_m$, where $\Sigma_n$ is the $n$-dimensional probability simplex. Then, the GW discrepancy is

$$\mathrm{GW}(D, \bar{D}, p, q) = \min_{T \in \mathcal{U}_{p,q}} \sum_{i,j,k,l} L(D_{ik}, \bar{D}_{jl})\, T_{ij} T_{kl}, \tag{1}$$

where $\mathcal{U}_{p,q} = \{T \in \mathbb{R}_{+}^{n \times m} \mid T \mathbf{1}_m = p,\ T^{\top} \mathbf{1}_n = q\}$ is the set of all couplings between $p$ and $q$. The loss function in our case is $L(a, b) = \tfrac{1}{2}(a - b)^2$. With this choice of $L$, $\mathrm{GW}^{1/2}$ defines a (true) distance (Mémoli, 2011).
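For intuition, the objective of Eq. (1) can be evaluated for a fixed coupling by direct summation. The NumPy sketch below is an illustrative toy, not the solver used in the paper; it uses $L(a,b)=\tfrac{1}{2}(a-b)^2$ and shows that the cost vanishes for isometric spaces (including under a permutation of the points) but penalizes stretched intra-space distances.

```python
import numpy as np

def gw_objective(D, D_bar, T):
    """Evaluate the GW objective of Eq. (1) for a fixed coupling T,
    with loss L(a, b) = 0.5 * (a - b)**2:
        sum_{i,j,k,l} L(D[i,k], D_bar[j,l]) * T[i,j] * T[k,l]
    (naive O(n^2 m^2) summation, for small examples only)."""
    n, m = T.shape
    total = 0.0
    for i in range(n):
        for j in range(m):
            for k in range(n):
                for l in range(m):
                    total += 0.5 * (D[i, k] - D_bar[j, l]) ** 2 * T[i, j] * T[k, l]
    return total

D = np.array([[0.0, 1.0], [1.0, 0.0]])        # two points at distance 1
D_bar = np.array([[0.0, 2.0], [2.0, 0.0]])    # a stretched copy (distance 2)
T_id = np.array([[0.5, 0.0], [0.0, 0.5]])     # couple point i to point i
T_swap = np.array([[0.0, 0.5], [0.5, 0.0]])   # couple point i to the other point

print(gw_objective(D, D, T_id))     # 0.0: identical intra-space distances
print(gw_objective(D, D, T_swap))   # 0.0: a permutation is an isometry here
print(gw_objective(D, D_bar, T_id)) # > 0: stretched distances are penalized
```

Note that the swap coupling also incurs zero cost on identical spaces: GW only sees intra-space distances, so it is blind to isometries, which is exactly the invariance the model exploits.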
2.2 Gromov-Wasserstein Generative Model
To learn across incomparable spaces, one key idea of our model is to use the Gromov-Wasserstein distance as a loss function to compare the generated and true distributions. As in traditional adversarial approaches, we parametrize the generator as a neural network $g_\theta$ that maps noise samples $z$ to features. We train $g_\theta$ by using GW as a loss, i.e., for minibatches of reference and generated samples, respectively, we compute pairwise distance matrices $D$ and $\bar{D}$ and solve the GW problem, taking $p$ and $q$ as uniform distributions.
While this procedure alone is often sufficient for simple problems, in high dimensions the statistical efficiency of classical divergence measures can be poor, and a large number of input samples is needed to achieve good discrimination between the generated and data distributions (Salimans et al., 2018). To improve discriminability, we learn the intra-space metrics adversarially. An adversary $f_\beta$ parametrized by $\beta$ maps data and generator samples into feature spaces in which we compute Euclidean intra-space distances:

$$D^{(f_\beta)}_{ij} = \big\| f_\beta(x_i) - f_\beta(x_j) \big\|_2, \tag{2}$$

with $f_\beta$ modeled by a neural network. The feature mapping $f_\beta$ may, for instance, reduce the dimensionality of the samples and extract important features. The original loss minimization problem of the generator thus becomes a minimax problem

$$\min_\theta \max_\beta \ \mathrm{GW}\big(D^{(f_\beta)}, \bar{D}^{(f_\beta)}, p, q\big), \tag{3}$$

where $D^{(f_\beta)}$ and $\bar{D}^{(f_\beta)}$ denote pairwise distance matrices of samples originating from the generator and reference domain, respectively, mapped into the feature space via $f_\beta$ (Eq. (2)). We refer to our model as the GW GAN.
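The Euclidean intra-space distances of Eq. (2) can be computed from the adversary's features in a vectorized way. In the sketch below, a fixed linear map stands in for the trained adversary network $f_\beta$ (a placeholder assumption, purely for illustration):

```python
import numpy as np

def pairwise_dist(F):
    """Pairwise Euclidean distance matrix D[i, j] = ||F[i] - F[j]||_2
    for feature vectors F (one row per sample), as in Eq. (2)."""
    sq = np.sum(F ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * F @ F.T
    return np.sqrt(np.maximum(D2, 0.0))  # clip tiny negatives from round-off

# Stand-in for the adversary f_beta: a fixed linear map that stretches axis 2.
f_beta = lambda X: X @ np.array([[1.0, 0.0], [0.0, 2.0]])

X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = pairwise_dist(f_beta(X))
print(D)  # symmetric, zero diagonal; distances measured in feature space
```

During training, one such matrix is built per minibatch for the reference and generated samples, and both are fed to the GW problem of Eq. (3).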
3 Training
We optimize the adversary and generator in an alternating scheme, where we train the generator more frequently than the adversary to prevent the adversarially learned distance function from becoming degenerate (Salimans et al., 2018). Algorithm 1 shows the GW GAN training algorithm.
While the training of standard GANs suffers from undamped oscillations and mode collapse (Metz et al., 2017; Salimans et al., 2016), following an argument of Salimans et al. (2018), the GW objective is well defined and statistically consistent if the adversarially learned intra-space distances are non-degenerate. Trained with the GW loss, the generator thus does not diverge even when the adversary is kept fixed. Empirical validation (see Appendix A) confirms this: we stopped updating the adversary while continuing to update the generator. Even with a fixed adversary, the generator further improved its learned distribution and did not diverge.
Note that Problem (3) makes very few assumptions on the spaces $\mathcal{X}$ and $\mathcal{Y}$, requiring only that a metric be defined on them. This remarkable flexibility can be exploited to enforce various characteristics on the generated distribution; we discuss examples in Section 3.1. However, this same flexibility, combined with the added degrees of freedom due to the learned metric, demands that we regularize the adversary to ensure stable training and prevent it from overpowering the generator. We propose an effective method to do so in Section 3.2.
Moreover, using the Gromov-Wasserstein distance as a differentiable loss function for training a generative model requires modifying its original formulation to ensure robust and fast computation, unbiased gradients, and numerical stability, as described in detail in Section 3.3.
3.1 Constraining the Generator
The GW loss encourages the generator to recover the relational and geometric properties of the reference dataset, but leaves other global aspects undetermined. We can thus shape the generated distribution by enforcing desired properties through constraints. For example, while any translation of a distribution would achieve the same GW loss, we can enforce centering around the origin by penalizing the norm of the generated samples. Figure 2a illustrates an example.
For computer vision tasks, we need to ensure that the generated samples still look like natural images. We found that a total variation regularization
(Rudin et al., 1992) induces the right bias here and hence greatly improves the results (see Figures 2b, c, and d).

Moreover, the invariances of the GW loss allow for shaping stylistic characteristics of the generated samples by integrating design constraints into the learning process. In contrast, current generative models (Arjovsky et al., 2017; Salimans et al., 2018; Genevay et al., 2018; Li et al., 2017) cannot perform style transfer, as modifications of surface-level features of the generated samples conflict with the adversarially computed loss used for training the generator. We propose a modular framework which enables style transfer to the generated samples when an additional style reference is given besides the provided data samples. We incorporate design constraints into the generator's objective via a style adversary $\mathcal{S}$, i.e., any function that quantifies a certain style and thereby, as a penalty, enforces this style on the generated samples. The resulting objective is

$$\min_\theta \max_\beta \ \mathrm{GW}\big(D^{(f_\beta)}, \bar{D}^{(f_\beta)}, p, q\big) + \gamma\, \mathcal{S}\big(g_\theta(z)\big), \tag{4}$$

where $\gamma$ weights the style penalty. As a result, the generator learns the structural content of the target distribution via the adversarially learned GW loss, and stylistic characteristics via the style adversary. We demonstrate this framework via the example of stylistic changes to learned MNIST digits in Section 4.4 and in Appendix G.
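The total variation regularizer mentioned above (Rudin et al., 1992) is one concrete example of such a penalty on generated images. A minimal anisotropic variant in NumPy might look as follows (the 4x4 toy images are illustrative only):

```python
import numpy as np

def total_variation(img):
    """Anisotropic total variation of a 2-D image: the summed absolute
    differences between horizontally and vertically adjacent pixels.
    Low for smooth, natural-looking images; high for noisy ones."""
    dh = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbors
    dv = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbors
    return dh + dv

flat = np.ones((4, 4))       # constant image: no variation
noisy = np.zeros((4, 4))
noisy[::2, ::2] = 1.0        # checker-like pattern: high variation
print(total_variation(flat))
print(total_variation(noisy))
```

Added to the generator loss with a small weight, this term biases the generator toward piecewise-smooth outputs without constraining their global structure, which stays governed by the GW loss.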
3.2 Regularizing the Adversary
During training, the adversary maximizes the objective function (3). However, the GW distance is easily maximized by stretching the space and thus distorting the intra-space distances used for its computation. To avoid such arbitrary distortion of the space, we propose to regularize the adversary by (approximately) enforcing it to define a unitary transformation, thus restricting the magnitude of stretching it can do. Note that directly parametrizing
$f_\beta$ as an orthogonal matrix would defeat its purpose: since the Euclidean norm is unitarily invariant, an exactly orthogonal map would leave all pairwise distances unchanged. Instead, we allow $f_\beta$
to take a more general form, but limit its expansivity and contractivity through approximate orthogonality.

Previous work has explored various orthogonality-based regularization methods to stabilize neural network training (Vorontsov et al., 2017). Saxe et al. (2014) introduced a new class of random orthogonal initial conditions on the weights of neural networks, stabilizing the initial training phase. By enforcing the weight matrices to be Parseval tight frames, layer-wise orthogonality constraints are introduced in Cisse et al. (2017) and Brock et al. (2017, 2019); they penalize deviations of the weights from orthogonality via $\| W_k^{\top} W_k - I \|_F^2$, where $W_k$ are the weights of layer $k$ and $\| \cdot \|_F$ is the Frobenius norm.
However, these approaches enforce orthogonality on the weights of each layer rather than constraining the network in its entirety to function as an approximately orthogonal operator. An empirical comparison to these layer-wise approaches (shown in Appendix D) reveals that, for the GW GAN, regularizing the full network is desirable. To enforce the approximation of $f_\beta$ as an orthogonal operator, we introduce a new orthogonal regularization approach, which ensures orthogonality of a network by minimizing its distance to the closest orthogonal matrix $R$. The regularization term is defined as

$$\mathcal{R}_\beta(x) = \lambda \, \big\| f_\beta(x) - x R \big\|_F^2, \tag{5}$$

where $R$ is an orthogonal matrix that most closely maps $x$ to $f_\beta(x)$, and $\lambda$
is a hyperparameter. The matrix
$R \in \mathbb{R}^{d \times s}$, where $d$ is the input dimension and $s$ is the dimensionality of the feature space, can be obtained by solving an orthogonal Procrustes problem. If the dimensionality of the feature space equals the input dimension, then $R$ has a closed-form solution $R = U V^{\top}$, where $U$ and $V$ are the left and right singular vectors of $x^{\top} f_\beta(x)$, i.e., $U \Sigma V^{\top} = \mathrm{SVD}(x^{\top} f_\beta(x))$ (Schönemann, 1966). Otherwise, we need to solve for $R$ with an iterative optimization method.

This novel Procrustes-based regularization principle for neural networks is remarkably flexible, since it constrains global input-output behavior without making assumptions about specific layers or activations. It preserves the expressiveness of the network while efficiently enforcing orthogonality. We use this orthogonal regularization principle for training the adversary of the GW generative model across different applications and network architectures.
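In the equal-dimension case, the closest orthogonal map has the stated closed form via the SVD (Schönemann, 1966). The sketch below computes it and the resulting penalty for a batch of inputs; the maps are stand-in linear transformations rather than trained networks, and λ = 1 is an arbitrary choice:

```python
import numpy as np

def closest_orthogonal(X, Y):
    """Orthogonal Procrustes: the orthogonal R minimizing ||X @ R - Y||_F,
    via the SVD of X^T Y (Schönemann, 1966). Assumes X^T Y is square,
    i.e., input and feature dimensions agree."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def procrustes_penalty(X, Y, lam=1.0):
    """Penalize deviation of the map X -> Y from the nearest orthogonal map,
    as in Eq. (5)."""
    R = closest_orthogonal(X, Y)
    return lam * np.linalg.norm(X @ R - Y) ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# A pure rotation of the input incurs (numerically) zero penalty ...
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
print(procrustes_penalty(X, X @ rot))

# ... while a map that stretches the space is strongly penalized.
print(procrustes_penalty(X, 3.0 * X))
```

This matches the intent of the regularizer: rotations, which leave pairwise distances untouched, are free, whereas expansive or contractive behavior is what gets charged.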
3.3 Gromov-Wasserstein as a Loss Function
To serve as a robust training objective in general machine learning settings, we modify the naïve formulation of the Gromov-Wasserstein discrepancy in several ways.
Regularization of Gromov-Wasserstein
Optimal transport metrics and extensions such as the Gromov-Wasserstein distance are particularly appealing because they take into account the underlying geometry of the data when comparing distributions. However, their computational cost is prohibitive for large-scale machine learning problems. More precisely, Problem (1) is a quadratic programming problem, and solving it directly is intractable for large $n$. Regularizing this objective with an entropy term results in significantly more efficient optimization (Peyré et al., 2016). The resulting smoothed problem can be solved through projected gradient descent methods, where the projection steps rely on the Sinkhorn-Knopp scaling algorithm (Cuturi, 2013). Concretely, the entropy-regularized version of the Gromov-Wasserstein discrepancy proposed by Peyré et al. (2016) has the form

$$\mathrm{GW}_\varepsilon(D, \bar{D}, p, q) = \min_{T \in \mathcal{U}_{p,q}} \sum_{i,j,k,l} L(D_{ik}, \bar{D}_{jl})\, T_{ij} T_{kl} - \varepsilon H(T), \tag{6}$$

where the first term is the objective of Equation (1), $H(T) = -\sum_{i,j} T_{ij} (\log T_{ij} - 1)$ is the entropy of the coupling $T$, and $\varepsilon > 0$ is a parameter controlling the strength of the regularization. Besides leading to significant speedups, entropy smoothing of optimal transport discrepancies results in distances that are differentiable with respect to their inputs, making them a more convenient choice as loss functions for machine learning algorithms. Since the Gromov-Wasserstein distance as a loss function in generative models compares noisy features of the generator and the data by computing correspondences between intra-space distances, a soft rather than a hard alignment is a desirable property. The entropy smoothing yields couplings that are denser than their non-regularized counterparts, ideal for applications where such soft alignments are desired (Cuturi & Peyré, 2016).
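Inside the projected gradient scheme, each projection step reduces to an entropy-regularized optimal transport problem for a local cost matrix, solvable with Sinkhorn-Knopp scaling (Cuturi, 2013). A minimal sketch of that inner routine, not the paper's implementation:

```python
import numpy as np

def sinkhorn(C, p, q, eps=0.1, n_iter=500):
    """Entropy-regularized OT via Sinkhorn-Knopp (Cuturi, 2013): returns the
    coupling T = diag(u) K diag(v) with Gibbs kernel K = exp(-C / eps), whose
    marginals are (approximately) p and q after alternating rescaling."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)   # match column marginals
        u = p / (K @ v)     # match row marginals
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 1.0], [1.0, 0.0]])  # cheap diagonal, expensive off-diagonal
p = q = np.array([0.5, 0.5])
T = sinkhorn(C, p, q, eps=0.05)
print(T)  # mass concentrates on the cheap diagonal; marginals stay uniform
```

Larger `eps` spreads mass more evenly across the coupling (softer alignments), which is the behavior exploited above; smaller `eps` approaches the unregularized transport plan.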
The effectiveness of entropy smoothing of the Gromov-Wasserstein discrepancy has been shown in other downstream applications, such as shape correspondence (Solomon et al., 2016) or the alignment of word embedding spaces (Alvarez-Melis & Jaakkola, 2018).
Motivated by Salimans et al. (2018) and justified by the envelope theorem (Carter, 2001), we do not backpropagate the gradient through the iterative computation of the coupling $T$ (Problem (6)).

Normalization of Gromov-Wasserstein
With entropy regularization, $\mathrm{GW}_\varepsilon$ is not a distance any more, as the discrepancy between identical metric measure spaces is then no longer zero. Similar to the Wasserstein metric (Bellemare et al., 2017), the estimation of $\mathrm{GW}_\varepsilon$ from samples yields biased gradients. Inspired by Bellemare et al. (2017), we use a normalized entropy-regularized Gromov-Wasserstein discrepancy defined as

$$\overline{\mathrm{GW}}_\varepsilon(D, \bar{D}, p, q) = \mathrm{GW}_\varepsilon(D, \bar{D}, p, q) - \tfrac{1}{2}\big(\mathrm{GW}_\varepsilon(D, D, p, p) + \mathrm{GW}_\varepsilon(\bar{D}, \bar{D}, q, q)\big). \tag{7}$$
Numerical Stability of Gromov-Wasserstein
Computing the entropy-regularized Gromov-Wasserstein formulation relies on a projected gradient algorithm (Peyré et al., 2016), in which each iteration involves a projection onto the transportation polytope, efficiently computed with the Sinkhorn-Knopp algorithm (Cuturi, 2013), a matrix-scaling procedure that alternately updates marginal scaling variables. In the limit of vanishing regularization ($\varepsilon \to 0$), these scaling factors diverge, resulting in numerical instabilities.

To improve the numerical stability, we compute $\mathrm{GW}_\varepsilon$ using a stabilized version of the Sinkhorn algorithm (Schmitzer, 2016). This significantly increases the robustness of the Gromov-Wasserstein computation. Performing the Sinkhorn updates in the log domain further increases the stability of the algorithm, by avoiding numerical overflow while preserving its efficient matrix multiplication structure.
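The log-domain idea can be sketched as follows: iterate on dual potentials with logsumexp, so that a small ε never makes the scaling factors under- or overflow. This is an illustrative reimplementation under simplified assumptions, not the stabilized algorithm of Schmitzer (2016) itself:

```python
import numpy as np

def logsumexp(A, axis):
    """Numerically stable log(sum(exp(A))) along an axis."""
    m = A.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(A - m).sum(axis=axis))

def sinkhorn_log(C, p, q, eps=1e-3, n_iter=200):
    """Sinkhorn updates in the log domain: iterate on dual potentials f, g
    (instead of the scaling factors, which diverge as eps -> 0). The coupling
    is recovered as T_ij = exp((f_i + g_j - C_ij) / eps)."""
    f, g = np.zeros(len(p)), np.zeros(len(q))
    for _ in range(n_iter):
        f = eps * np.log(p) - eps * logsumexp((g[None, :] - C) / eps, axis=1)
        g = eps * np.log(q) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
    return np.exp((f[:, None] + g[None, :] - C) / eps)

C = np.array([[5.0, 6.0], [6.0, 5.0]])
p = q = np.array([0.5, 0.5])
# The naive Gibbs kernel exp(-C / eps) underflows to all zeros at this eps,
# so plain Sinkhorn scaling would divide by zero ...
print((np.exp(-C / 1e-3) == 0.0).all())
# ... but the log-domain iteration recovers the coupling without issue.
T = sinkhorn_log(C, p, q, eps=1e-3)
print(T)
```

Because logsumexp subtracts the running maximum, the updates are invariant to large constant offsets in the cost matrix, which is precisely where the naive kernel breaks down.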
Normalizing the intra-space distances of the generated and data samples, respectively, further improves the numerical stability of the Gromov-Wasserstein computation. However, to preserve information about the scale of the samples, we use the normalized distances only for the Sinkhorn iterates, while the final loss is calculated using the original distances.
4 Empirical Results
In this section, we empirically demonstrate the effectiveness of the GW GAN formulation and regularization, and illustrate its versatility by tackling various novel settings for generative modeling, including learning distributions across different dimensionalities, data types, and styles.
4.1 Learning across Identical Spaces
As a sanity check, we first consider the special case where the two distributions are defined on identical spaces (i.e., the usual GAN setting). Specifically, we test the model's ability to recover 2D mixtures of Gaussians, a common proof-of-concept task for mode recovery (Che et al., 2017; Metz et al., 2017; Li et al., 2018). For the experiments on synthetic datasets, the generator and adversary architectures are multilayer perceptrons (MLPs) with ReLU activation functions. Figure 2a shows that the GW GAN reproduces a mixture of Gaussians, with the learned adversary stabilizing the learning. We observe that regularization indeed helps position the learned distributions around the origin. Comparative results with and without regularization are shown in Appendix B. As opposed to the OT GAN proposed by Salimans et al. (2018), our model robustly learns Gaussian mixtures with differing numbers of modes and arrangements (see Appendix E). The Appendix shows several training runs; while the generated distributions vary in orientation in the Euclidean plane, the cluster structure is clearly preserved.

To illustrate the ability of the GW GAN to generate images, we train the model on MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and grayscale CIFAR-10 (Krizhevsky et al., 2014). Both the generator and the adversary follow the deep convolutional architecture introduced by Chen et al. (2016), whereby the adversary maps into an unbounded Euclidean feature space rather than applying a final tanh. To stabilize the initial training phase, the weights of the adversary network were initialized with random orthogonal matrices, as proposed by Saxe et al. (2014). We train the model using Adam (Kingma & Ba, 2015). Figures 2b, c, and d display generated images throughout the training process; the adversary was constrained to approximate an orthogonal operator. The results highlight the effectiveness of the orthogonal Procrustes regularization, which allows successful learning of complex distributions using different network architectures. Additional experiments on the influence of the adversary are provided in Appendix C.
Having validated the overall soundness of the GW GAN in traditional settings, we now demonstrate its usefulness in tasks that go beyond the scope of traditional generative adversarial models, namely, learning across spaces that are not directly comparable.
4.2 Learning across Dimensionalities
Arguably, the simplest instance of incomparable spaces are Euclidean spaces of different dimensionality. In this section, we investigate whether the Gromov-Wasserstein GAN can learn to generate a distribution defined on a space of different dimensionality than that of the reference. We consider both directions: mapping to a lower- and to a higher-dimensional space. In this experimental setup, we compute intra-space distances using the Euclidean distance, without a parametrized adversary. The generator network follows an MLP architecture with ReLU activation functions. The training task consists of translating between mixtures of Gaussian distributions in two and three dimensions. The results, shown in Figure 3, demonstrate that our model successfully recovers the global structure and relative distances of the modes of the reference distribution, despite the different dimensionality.
4.3 Learning across Data Modalities and Manifolds
Next, we consider distributions with more complex structure, and test whether our model is able to recover manifold structure in the generated distribution. Using the popular three-dimensional S-shaped dataset as an example, we define distances between the samples via shortest paths on their k-nearest-neighbor graph, computed using the Floyd-Warshall algorithm (Floyd, 1962). For the generated distribution, we use a space of the same intrinsic dimensionality (two) as the reference manifold. The results in Figure 4a show that the distribution learned with the GW GAN successfully recovers the manifold structure of the data.

Taking the notion of incomparability further, we next consider a setting where the reference distribution is accessible only through relational information, i.e., a weighted graph without absolute representations of the samples. While conceptually very different from the previous scenarios, applying our model to this setting is just as simple. Once a notion of distance is defined over the reference graph, our model learns the distribution based on pairwise relations as before. Given merely a graph, we use pairwise shortest paths as the intra-space distance metric, and use the 2D Euclidean space for the generated distribution. Figure 4b shows that the GW GAN successfully learns a distribution that approximately recovers the neighborhood structure of the reference graph.
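The graph-based intra-space distances used in these experiments can be sketched as follows: build a Euclidean k-nearest-neighbor graph and run Floyd-Warshall over it. This is a simplified O(n^3) illustration; the exact construction used in the experiments may differ in details such as tie-breaking and symmetrization.

```python
import numpy as np

def knn_shortest_paths(X, k=2):
    """Geodesic-style distances: connect each point to its k nearest
    Euclidean neighbors (symmetrized), then compute all-pairs shortest
    paths with Floyd-Warshall (Floyd, 1962). O(n^3), so intended for
    minibatch-sized n."""
    n = len(X)
    E = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for i in range(n):
        for j in np.argsort(E[i])[1:k + 1]:   # k nearest neighbors of i
            D[i, j] = D[j, i] = E[i, j]       # keep the graph symmetric
    for m in range(n):                        # Floyd-Warshall relaxation
        D = np.minimum(D, D[:, m:m + 1] + D[m:m + 1, :])
    return D

# Four points on a line: the graph distance from 0 to 3 follows the chain,
# summing the edge lengths rather than taking the straight-line distance.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
D = knn_shortest_paths(X, k=2)
print(D[0, 3])
```

The resulting matrix D plays the role of one of the two intra-space distance matrices in Eq. (1); for a reference given only as a weighted graph, the edge weights replace the Euclidean k-NN step.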
4.4 Shaping Learned Distributions
The Gromov-Wasserstein GAN enjoys remarkable flexibility, allowing us to actively influence stylistic characteristics of the generated distribution.
While the structure and content of the distribution are learned via the adversary $f_\beta$, stylistic features can be introduced via a style adversary $\mathcal{S}$, as outlined in Section 3.1. As a proof of concept of this modular framework, we learn MNIST digits and enforce their font style to be bold via additional design constraints. The style adversary is parametrized by a binary classifier trained on handwritten letters of the EMNIST dataset
(Cohen et al., 2017), which were assigned thin and bold class labels. The training objective of the generator is augmented with the classification result of the trained binary classifier (Eq. (4)). Further details are provided in Appendix G. After the generator has satisfactorily learned the data distribution based on training with the GW loss alone, the style adversary is activated. Figure 5 shows that the style adversary influences the generator to increase the thickness of the MNIST digits, while the structural content learned in the first stage is retained.

5 Related Work
Generative adversarial models have been extensively studied and applied in various fields, including image synthesis (Brock et al., 2019), semantic image editing (Wang et al., 2018), style transfer (Zhu et al., 2017), and semi-supervised learning (Kingma et al., 2014). As the literature is extensive, we provide a brief overview of GANs and focus on selected approaches targeting tasks in cross-domain learning.

Generative Adversarial Networks

Goodfellow et al. (2014) proposed generative adversarial networks (GANs) as a zero-sum game between a generator and a discriminator, which learns to distinguish between generated and data samples. Despite their success and improvements in optimization, the training of GANs is difficult and unstable (Salimans et al., 2016; Arjovsky & Bottou, 2017). To remedy these issues, various extensions of this framework have been proposed, most of which seek to replace the game objective with more stable or general losses. These include using Maximum Mean Discrepancy (MMD) (Dziugaite et al., 2015; Li et al., 2017; Bińkowski et al., 2018), other IPMs (Mroueh et al., 2017, 2018), or Optimal Transport distances (Arjovsky et al., 2017; Salimans et al., 2018; Genevay et al., 2018). Due to their relevance, we discuss the latter in detail below. A crucial characteristic that distinguishes our approach from other generative models is its ability to learn across different domains and modalities.
GANs and Optimal Transport (OT)
To compare probability distributions supported on low-dimensional manifolds in high-dimensional spaces, recent GAN variants integrate OT metrics into their training objective (Arjovsky et al., 2017; Salimans et al., 2018; Genevay et al., 2018; Gulrajani et al., 2017). Since OT metrics are computationally expensive, Arjovsky et al. (2017) use the dual formulation of the 1-Wasserstein distance. Other approaches approximate the primal via entropically smoothed generalizations of the Wasserstein distance (Salimans et al., 2018; Genevay et al., 2018). Our work departs from these methods in that it relies on a much more general instance of Optimal Transport, the Gromov-Wasserstein distance, as a loss function, which allows us to compare distributions even if cross-domain pairwise distances are not available.
GANs for Cross-Domain Learning
GANs have been successfully applied to style transfer between images (Isola et al., 2017; Karacan et al., 2016; Zhu et al., 2017), text-to-image synthesis (Reed et al., 2016; Zhang et al., 2017), visual manipulation (Zhu et al., 2016; Engel et al., 2018), and font style transfer (Azadi et al., 2018). However, to achieve this, these methods depend on conditional variables, training sets of aligned data pairs, or cycle-consistency constraints. Kim et al. (2017) utilize two different, coupled GANs to discover cross-domain relations given unpaired data. However, the method's applicability is limited, as all images in one domain need to be representable by images in the other domain.
Gromov-Wasserstein Learning
Since its introduction by Mémoli (2011), the Gromov-Wasserstein discrepancy has found applications in many learning problems that rely on a coupling between different metric spaces. Being an effective method for solving matching problems, it has been used in shape and object matching (Mémoli, 2009, 2011; Solomon et al., 2016; Ezuz et al., 2017), for aligning word embedding spaces (Alvarez-Melis & Jaakkola, 2018), and for matching weighted directed networks (Chowdhury & Mémoli, 2018). Other recent applications of the GW distance include the computation of barycenters of a set of distance or kernel matrices (Peyré et al., 2016) and heterogeneous domain adaptation, where source and target samples are represented in different feature spaces (Yan et al., 2018). While relying on a shared tool, the GW discrepancy, this paper leverages it in a very different framework, generative modeling, where questions of efficiency, degrees of freedom, minimax objectives, and end-to-end learning pose various challenges that need to be addressed to successfully use this tool.
6 Conclusion
In this paper, we presented a new generative model that can learn a distribution in a space that is different from, and even incomparable to, that of the reference distribution. Our model accomplishes this by relying on relational, rather than absolute, comparisons of samples via the Gromov-Wasserstein distance. Such disentanglement of data and generator spaces opens up a wide array of novel possibilities for generative modeling, as portrayed by our experiments on learning across different dimensional representations and learning across modalities (weighted graph to Euclidean representations). Validated here through simple experiments on digit thickness control, the use of crafted regularization losses on the generator to impose certain stylistic characteristics makes for an exciting avenue of future work.
Acknowledgements
This research was supported in part by NSF CAREER Award 1553284 and The Defense Advanced Research Projects Agency (grant number YFA17 N660011714039). The views, opinions, and/or findings contained in this article are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense. Charlotte Bunne was supported by the Zeno Karl Schindler Foundation. We thank Suvrit Sra for a question that initiated this research, and MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing computational resources.
References

AlvarezMelis & Jaakkola (2018)
AlvarezMelis, D. and Jaakkola, T.
GromovWasserstein Alignment of Word Embedding Spaces.
In
Conference on Empirical Methods in Natural Language Processing (EMNLP)
. Association for Computational Linguistics, 2018.  Arjovsky & Bottou (2017) Arjovsky, M. and Bottou, L. Towards Principled Methods for Training Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein Generative Adversarial Networks. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.

Azadi et al. (2018)
Azadi, S., Fisher, M., Kim, V. G., Wang, Z., Shechtman, E., and Darrell, T.
MultiContent GAN for FewShot Font Style Transfer.
In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2018.  Bellemare et al. (2017) Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. The Cramer Distance as a Solution to Biased Wasserstein Gradients. arXiv preprint arXiv:1705.10743, 2017.
 Bińkowski et al. (2018) Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
 Brock et al. (2017) Brock, A., Lim, T., Ritchie, J. M., and Weston, N. Neural Photo Editing with Introspective Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.
 Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. International Conference on Learning Representations (ICLR), 2019.
 Carter (2001) Carter, M. Foundations of Mathematical Economics. MIT Press, 2001.
 Che et al. (2017) Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode Regularized Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2017.
 Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
 Chowdhury & Mémoli (2018) Chowdhury, S. and Mémoli, F. The Gromov-Wasserstein distance between networks and stable network invariants. arXiv preprint arXiv:1808.04337, 2018.
 Cisse et al. (2017) Cisse, M., Bojanowski, P., Grave, E., Dauphin, Y., and Usunier, N. Parseval Networks: Improving Robustness to Adversarial Examples. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.
 Cohen et al. (2017) Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
 Cuturi (2013) Cuturi, M. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In Advances in Neural Information Processing Systems (NeurIPS), 2013.
 Cuturi & Peyré (2016) Cuturi, M. and Peyré, G. A Smoothed Dual Approach for Variational Wasserstein Problems. SIAM Journal on Imaging Sciences, 9, 2016.

 Dziugaite et al. (2015) Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. Training generative neural networks via Maximum Mean Discrepancy optimization. In Conference on Uncertainty in Artificial Intelligence (UAI), 2015.
 Engel et al. (2018) Engel, J., Hoffman, M., and Roberts, A. Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models. In International Conference on Learning Representations (ICLR), 2018.
 Ezuz et al. (2017) Ezuz, D., Solomon, J., Kim, V. G., and Ben-Chen, M. GWCNN: A Metric Alignment Layer for Deep Shape Analysis. In Computer Graphics Forum, volume 36. Wiley Online Library, 2017.
 Floyd (1962) Floyd, R. W. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, 1962.
 Genevay et al. (2018) Genevay, A., Peyré, G., and Cuturi, M. Learning Generative Models with Sinkhorn Divergences. In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84. PMLR, 2018.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Isola et al. (2017) Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-Image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
 Karacan et al. (2016) Karacan, L., Akata, Z., Erdem, A., and Erdem, E. Learning to Generate Images of Outdoor Scenes from Attributes and Semantic Layouts. arXiv preprint arXiv:1612.00215, 2016.
 Kim et al. (2017) Kim, T., Cha, M., Kim, H., Lee, J. K., and Kim, J. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In International Conference on Machine Learning (ICML), volume 70, 2017.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), volume 5, 2015.
 Kingma et al. (2014) Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. Semi-Supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
 Krizhevsky et al. (2014) Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 Dataset, 2014.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
 Li et al. (2018) Li, C., Alvarez-Melis, D., Xu, K., Jegelka, S., and Sra, S. Distributional Adversarial Networks. International Conference on Learning Representations (ICLR), Workshop Track, 2018.

 Li et al. (2017) Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards Deeper Understanding of Moment Matching Network. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
 Mémoli (2009) Mémoli, F. Spectral Gromov-Wasserstein Distances for Shape Matching. In Computer Vision Workshops (ICCV Workshops). IEEE, 2009.
 Mémoli (2011) Mémoli, F. Gromov-Wasserstein Distances and the Metric Approach to Object Matching. Foundations of Computational Mathematics, 11(4), 2011.
 Metz et al. (2017) Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled Generative Adversarial Networks. International Conference on Learning Representations (ICLR), 2017.
 Mroueh et al. (2017) Mroueh, Y., Sercu, T., and Goel, V. McGAN: Mean and Covariance Feature Matching GAN. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.
 Mroueh et al. (2018) Mroueh, Y., Li, C.-L., Sercu, T., Raj, A., and Cheng, Y. Sobolev GAN. In International Conference on Learning Representations (ICLR), 2018.
 Müller (1997) Müller, A. Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability, 29(2), 1997.
 Peyré et al. (2016) Peyré, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein Averaging of Kernel and Distance Matrices. In International Conference on Machine Learning (ICML), volume 48, 2016.
 Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In International Conference on Learning Representations (ICLR), 2016.
 Reed et al. (2016) Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning (ICML), volume 48, 2016.
 Rudin et al. (1992) Rudin, L. I., Osher, S., and Fatemi, E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4), 1992.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., and Chen, X. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems (NeurIPS). 2016.
 Salimans et al. (2018) Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs Using Optimal Transport. In International Conference on Learning Representations (ICLR), 2018.
 Saxe et al. (2014) Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), 2014.
 Schmitzer (2016) Schmitzer, B. Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems. arXiv preprint arXiv:1610.06519, 2016.
 Schönemann (1966) Schönemann, P. H. A generalized solution of the Orthogonal Procrustes problem. Psychometrika, 31(1), 1966.
 Solomon et al. (2016) Solomon, J., Peyré, G., Kim, V. G., and Sra, S. Entropic Metric Alignment for Correspondence Problems. ACM Transactions on Graphics (TOG), 35(4), 2016.
 Sriperumbudur et al. (2012) Sriperumbudur, B., Fukumizu, K., Gretton, A., Schölkopf, B., and Lanckriet, G. On the Empirical Estimation of Integral Probability Metrics. Electronic Journal of Statistics, 6, 2012.
 Vorontsov et al. (2017) Vorontsov, E., Trabelsi, C., Kadoury, S., and Pal, C. On orthogonality and learning recurrent networks with long term dependencies. In International Conference on Machine Learning (ICML), volume 70. PMLR, 2017.
 Wang et al. (2018) Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Yan et al. (2018) Yan, Y., Li, W., Wu, H., Min, H., Tan, M., and Wu, Q. Semi-Supervised Optimal Transport for Heterogeneous Domain Adaptation. International Joint Conference on Artificial Intelligence (IJCAI), 2018.
 Zhang et al. (2017) Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In International Conference on Computer Vision (ICCV), 2017.
 Zhu et al. (2016) Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. Generative Visual Manipulation on the Natural Image Manifold. In European Conference on Computer Vision (ECCV), 2016.
 Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In International Conference on Computer Vision (ICCV), 2017.
Appendix
Appendix A Training of the GW GAN with Fixed Adversary
Appendix B Influence of Generator Constraints on the GW GAN
Appendix C Influence of the Adversary
Appendix D Comparison of the Effectiveness of Orthogonal Regularization Approaches
Appendix E Comparison of the GW GAN with Salimans et al. (2018)
Appendix F Comparison of Training Times
Model | Average Training Time (Seconds per Epoch)
Wasserstein GAN with Gradient Penalty (Gulrajani et al., 2017) | 
Sinkhorn GAN (Genevay et al., 2018) (default configuration, ) | 
Sinkhorn GAN (Genevay et al., 2018) (default configuration, ) | 
GW GAN (this paper, ) | 
Training time comparison of PyTorch implementations of different GAN architectures. The generative models were trained to generate MNIST digits, and their average training time per epoch was recorded. All experiments were performed on a single GPU for consistency.
Appendix G Training Details of the Style Adversary
We introduce a novel framework that allows modular application of style transfer tasks by integrating a style adversary into the architecture of the Gromov-Wasserstein GAN. To demonstrate the practicality of this modular framework, we learn MNIST digits and enforce their font style to be bold via additional design constraints. The style adversary is parametrized by a binary classifier trained on handwritten letters of the EMNIST dataset (Cohen et al., 2017), which were assigned bold and thin class labels based on the letter-wise norm of each image. As the style adversary is trained on a separate dataset, it is independent of the original learning task. The binary classifier is parametrized by a convolutional neural network and trained with a binary cross-entropy loss. The dataset, the classification results for bold and thin letters, and the loss curve of training the binary classifier are shown in Figure 11.
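The style adversary described above can be sketched in PyTorch as a small convolutional binary classifier trained with binary cross-entropy. The architecture, layer sizes, and optimizer settings below are illustrative assumptions, not the exact configuration used in our experiments:

```python
import torch
import torch.nn as nn

class StyleAdversary(nn.Module):
    """Binary CNN classifier (bold vs. thin letters); an illustrative
    sketch, not the paper's exact architecture."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(32 * 7 * 7, 1)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)  # raw logits; apply sigmoid for probabilities

# One hypothetical training step on a batch of EMNIST-sized images:
model = StyleAdversary()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.randn(8, 1, 28, 28)            # stand-in for EMNIST letters
labels = torch.randint(0, 2, (8, 1)).float()  # 1 = bold, 0 = thin
loss = nn.functional.binary_cross_entropy_with_logits(model(images), labels)
opt.zero_grad()
loss.backward()
opt.step()
```

Because the classifier only consumes images and labels, it can be trained entirely offline and then plugged into the GW GAN objective as an additional loss term on the generator's output.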