Flat Metric Minimization with Applications in Generative Modeling

05/12/2019, by Thomas Möllenhoff et al.

We take the novel perspective to view data not as a probability distribution but rather as a current. Primarily studied in the field of geometric measure theory, k-currents are continuous linear functionals acting on compactly supported smooth differential forms and can be understood as a generalized notion of oriented k-dimensional manifold. By moving from distributions (which are 0-currents) to k-currents, we can explicitly orient the data by attaching a k-dimensional tangent plane to each sample point. Based on the flat metric which is a fundamental distance between currents, we derive FlatGAN, a formulation in the spirit of generative adversarial networks but generalized to k-currents. In our theoretical contribution we prove that the flat metric between a parametrized current and a reference current is Lipschitz continuous in the parameters. In experiments, we show that the proposed shift to k>0 leads to interpretable and disentangled latent representations which behave equivariantly to the specified oriented tangent planes.


1 Introduction

This work is concerned with the problem of representation learning, which has important consequences for many tasks in artificial intelligence, cf. the work of Bengio et al. (2013). More specifically, our aim is to learn representations which behave equivariantly with respect to selected transformations of the data. Such variations are often known beforehand and could for example describe changes in stroke width or rotation of a digit, changes in viewpoint or lighting in a three-dimensional scene, but also the arrow of time (Pickup et al., 2014; Wei et al., 2018) in time series, describing how a video changes from one frame to the next; see Fig. 1.

We tackle this problem by introducing a novel formalism based on geometric measure theory (Federer, 1969), which we find to be interesting in itself. To motivate our application in generative modeling, recall the manifold hypothesis, which states that the distribution of real-world data tends to concentrate near a low-dimensional manifold; see Fefferman et al. (2016) and the references therein. Under that hypothesis, a possible unifying view on prominent methods in unsupervised and representation learning, such as generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational auto-encoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014), is the following: both approaches aim to approximate the true distribution concentrating near the manifold with a distribution on some low-dimensional latent space that is pushed through a decoder or generator mapping to the (high-dimensional) data space (Genevay et al., 2017; Bottou et al., 2017).

We argue that treating data as a distribution potentially ignores useful available geometric information such as orientation and tangent vectors to the data manifold. Such tangent vectors describe the aforementioned local variations or perturbations. Therefore we postulate that data should not be viewed as a distribution but rather as a k-current.

We postpone the definition of k-currents (de Rham, 1955) to Sec. 3, and informally think of them as distributions over k-dimensional oriented planes. For the limiting case k = 0, currents simply reduce to distributions in the sense of Schwartz (1951, 1957), and positive 0-currents with unit mass are probability measures. A seminal work in the theory of currents is Federer & Fleming (1960), which established compactness theorems for subsets of currents (normal and integral currents). In this paper, we will work in the space of normal k-currents with support in a compact set K ⊂ ℝ^n, denoted by N_k(K).

Just as probabilistic models build upon f-divergences (Csiszár et al., 2004), integral probability metrics (Sriperumbudur et al., 2012), or more general optimal-transport-related divergences (Peyré & Cuturi, 2018; Feydy et al., 2018), we require a sensible notion of “distance” between k-currents.

In this work, we will focus on the flat norm due to Whitney (1957). (The terminology “flat” carries no geometrical significance and refers to Whitney’s use of the musical notation ♭ (flat) and ♯ (sharp).) To be precise, we consider a scaled variant introduced and studied by Morgan & Vixie (2007); Vixie et al. (2010). This choice is motivated in Sec. 4, where we show that the flat norm enjoys certain attractive properties similar to the celebrated Wasserstein distances. For example, it metrizes weak convergence for normal currents.

A potential alternative to the flat norm are kernel metrics on spaces of currents (Vaillant & Glaunès, 2005; Glaunès et al., 2008). These have been proposed for diffeomorphic registration, but kernel distances on distributions have also been successfully employed for generative modeling, see Li et al. (2017). Constructions similar to the Kantorovich relaxation in optimal transport but generalized to k-currents recently appeared in the context of convexifications for certain variational problems (Möllenhoff & Cremers, 2019).

2 Related Work

Our main idea is illustrated in Fig. 2, which is inspired by the optimal transportation point of view on GANs given by Genevay et al. (2017).

Tangent vectors of the data manifold, either prespecified (Simard et al., 1992, 1998; Fraser et al., 2003) or learned with a contractive autoencoder (Rifai et al., 2011), have been used to train classifiers that aim to be invariant to changes relative to the data manifold. In contrast to these works, we use tangent vectors to learn interpretable representations and a generative model that aims to be equivariant. The principled introduction of tangent k-vectors into probabilistic generative models is one of our main contributions.

Various approaches to learning informative or disentangled latent representations in a completely unsupervised fashion exist (Schmidhuber, 1992; Higgins et al., 2016; Chen et al., 2016; Kim & Mnih, 2018). Our approach is orthogonal to these works, as specifying tangent vectors further encourages informative representations to be learned. For example, our GAN formulation could be combined with a mutual information term as in InfoGAN (Chen et al., 2016).

Our work is more closely related to semi-supervised approaches on learning disentangled latent representations, which similarly also require some form of knowledge of the underlying factors (Hinton et al., 2011; Denton et al., 2017; Mathieu et al., 2016; Narayanaswamy et al., 2017) and also to conditional GANs (Mirza & Osindero, 2014; Odena et al., 2017). However, the difference is the connection to geometric measure theory which we believe to be completely novel, and our specific FlatGAN formulation that seamlessly extends the Wasserstein GAN (Arjovsky et al., 2017), cf. Fig. 2.

Since the concepts we need from geometric measure theory are not commonly used in machine learning, we briefly review them in the following section.

3 Geometric Measure Theory

The book by Federer (1969) is still the formidable, definitive reference on the subject. As a more accessible introduction we recommend (Krantz & Parks, 2008) or (Morgan, 2016). While our aim is to keep the manuscript self-contained, we invite the interested reader to consult Chapter 4 in (Morgan, 2016), which in turn refers to the corresponding chapters in the book of Federer (1969) for more details.

3.1 Grassmann Algebra

Notation.

Denote by e_1, …, e_n a basis of ℝ^n with dual basis dx_1, …, dx_n, such that dx_i is the linear functional that maps every v ∈ ℝ^n to its i-th component v_i. For k ≤ n, denote by I(n, k) the set of ordered multi-indices i = (i_1, …, i_k) with 1 ≤ i_1 < ⋯ < i_k ≤ n.

One can multiply k vectors v_1, …, v_k ∈ ℝ^n to obtain a new object

v = v_1 ∧ v_2 ∧ ⋯ ∧ v_k, (1)

called a k-vector in ℝ^n. The wedge (or exterior) product is characterized by multilinearity,

(a u + b w) ∧ v_2 ∧ ⋯ ∧ v_k = a (u ∧ v_2 ∧ ⋯ ∧ v_k) + b (w ∧ v_2 ∧ ⋯ ∧ v_k), (2)

and it is alternating,

u ∧ w = −w ∧ u, in particular u ∧ u = 0. (3)

In general, any k-vector v can be written as

v = Σ_{i ∈ I(n,k)} v_i e_i, where e_i = e_{i_1} ∧ ⋯ ∧ e_{i_k}, (4)

for coefficients v_i ∈ ℝ. The vector space of k-vectors is denoted by Λ_k ℝ^n and has dimension given by the binomial coefficient (n choose k). For two k-vectors v, w ∈ Λ_k ℝ^n we define the inner product ⟨v, w⟩ = Σ_{i ∈ I(n,k)} v_i w_i and the Euclidean norm |v| = √⟨v, v⟩.

A simple (or decomposable) k-vector is any v that can be written as a product of k 1-vectors. Simple k-vectors such as (1) are uniquely determined by the k-dimensional space spanned by the v_i, their orientation, and the norm |v|, which corresponds to the area of the parallelotope spanned by the v_i. Simple k-vectors with unit norm can therefore be thought of as oriented k-dimensional subspaces, and the rules (2)–(3) can be thought of as equivalence relations.

It turns out that the inner product of two simple k-vectors can be computed by a k × k determinant,

⟨v_1 ∧ ⋯ ∧ v_k, w_1 ∧ ⋯ ∧ w_k⟩ = det(V^⊤ W), (5)

where the columns of V, W ∈ ℝ^{n×k} contain the individual vectors v_i and w_i. This will be useful later for our practical implementation.
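As a quick illustration of (5) (our own sketch, not part of the paper's implementation), the following NumPy snippet evaluates the inner product of two simple k-vectors whose factors are stored column-wise:

```python
import numpy as np

def simple_inner(V, W):
    """Inner product of the simple k-vectors v_1 ^ ... ^ v_k and w_1 ^ ... ^ w_k.

    V, W are (n, k) matrices whose columns hold the individual vectors;
    by (5) the pairing is the k x k determinant of V^T W.
    """
    return np.linalg.det(V.T @ W)

# In R^3: <e1 ^ e2, e1 ^ e2> = 1, and swapping the factors flips the orientation.
V = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # columns e1, e2
W = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])  # columns e2, e1
print(simple_inner(V, V))  # 1.0
print(simple_inner(V, W))  # -1.0 (same plane, opposite orientation)
```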

Not all k-vectors are simple. An illustrative example is e_1 ∧ e_2 + e_3 ∧ e_4 ∈ Λ_2 ℝ^4, which describes two 2-dimensional subspaces in ℝ^4 intersecting only at zero.

The dual space of Λ_k ℝ^n is denoted by Λ^k ℝ^n, and its elements are called k-covectors. They are represented similarly to (4) but with the dual basis dx_{i_1} ∧ ⋯ ∧ dx_{i_k}. Analogously to the above, we can define a pairing ⟨ω, v⟩ between k-covectors ω and k-vectors v. Next to the Euclidean norm |·|, we define two additional norms due to Whitney (1957).

Definition 1 (Mass and comass).

The comass norm ‖ω‖, defined for k-covectors ω ∈ Λ^k ℝ^n, is given by

‖ω‖ := sup { ⟨ω, v⟩ : v ∈ Λ_k ℝ^n simple, |v| ≤ 1 }, (6)

and the mass norm for v ∈ Λ_k ℝ^n is given by

‖v‖ := sup { ⟨ω, v⟩ : ω ∈ Λ^k ℝ^n, ‖ω‖ ≤ 1 }. (7)

The mass norm is by construction the largest norm that agrees with the Euclidean norm on simple k-vectors. For the non-simple 2-vector from before, we compute

‖e_1 ∧ e_2 + e_3 ∧ e_4‖ = 2, while |e_1 ∧ e_2 + e_3 ∧ e_4| = √2. (8)

Interpreting the non-simple vector as two tangent planes, we see that the mass norm gives the correct area, while the Euclidean norm underestimates it. The comass will be used later to define the mass of currents and the flat norm.

3.2 Differential Forms

In order to define currents, we first need to introduce differential forms. A differential k-form is a k-covectorfield ω : ℝ^n → Λ^k ℝ^n. Its support, supp ω, is defined as the closure of the set {x ∈ ℝ^n : ω(x) ≠ 0}.

Differential forms allow one to perform coordinate-free integration over oriented manifolds. Given a k-dimensional manifold M ⊂ ℝ^n, possibly with boundary, an orientation is a continuous map ξ : M → Λ_k ℝ^n which assigns to each point a simple k-vector with unit norm that spans the tangent space at that point. Integration of a differential k-form ω over the oriented manifold is then defined by

∫_M ω := ∫_M ⟨ω(x), ξ(x)⟩ dH^k(x), (9)

where the second integral is the standard Lebesgue integral with respect to the k-dimensional Hausdorff measure restricted to M, i.e., H^k ⌞ M. The k-dimensional Hausdorff measure H^k assigns to sets in ℝ^n their k-dimensional volume; see Chapter 2 in Morgan (2016) for a nice illustration. For k = n, the Hausdorff measure coincides with the Lebesgue measure.
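As a simple numerical check of (9) (our own example, not taken from the paper), the following snippet integrates the 1-form ω = −x_2 dx_1 + x_1 dx_2 over the unit circle with its counter-clockwise orientation; the result is 2π, twice the enclosed area.

```python
import numpy as np

# k = 1: integrate omega = -x2 dx1 + x1 dx2 over the unit circle.
t = np.linspace(0.0, 2.0 * np.pi, 10_000, endpoint=False)
gamma = np.stack([np.cos(t), np.sin(t)], axis=1)        # points on the circle
xi = np.stack([-np.sin(t), np.cos(t)], axis=1)          # unit orientation xi(gamma(t))
omega = np.stack([-gamma[:, 1], gamma[:, 0]], axis=1)   # coefficients of the 1-form
integral = 2.0 * np.pi * np.mean(np.sum(omega * xi, axis=1))
print(integral)  # approx 6.283 = 2*pi
```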

The exterior derivative of a differential k-form ω is the (k+1)-form dω defined by

⟨dω(x), v_1 ∧ ⋯ ∧ v_{k+1}⟩ := lim_{h → 0} (1/h^{k+1}) ∫_{∂P_x(h v_1, …, h v_{k+1})} ω, (10)

where ∂P_x(h v_1, …, h v_{k+1}) is the oriented boundary of the parallelotope spanned by the vectors h v_1, …, h v_{k+1} at point x. The above definition is for example used in the textbook of Hubbard & Hubbard (2015). To get an intuition, note that for k = 0 this reduces to the familiar directional derivative ⟨dω(x), v⟩ = lim_{h → 0} (ω(x + h v) − ω(x)) / h. In case ω is sufficiently smooth, the limit in (10) is given by

⟨dω(x), v_1 ∧ ⋯ ∧ v_{k+1}⟩ = Σ_{i=1}^{k+1} (−1)^{i+1} D_{v_i} ⟨ω(x), v_1 ∧ ⋯ ∧ v̂_i ∧ ⋯ ∧ v_{k+1}⟩, (11)

where v̂_i means that the vector v_i is omitted and D_{v_i} denotes the directional derivative along v_i. The formulation (11) will be used in the practical implementation. Interestingly, with (9) and (10) in mind, Stokes’ theorem

∫_M dω = ∫_{∂M} ω (12)

becomes almost obvious, as (informally speaking) integrating (10) over M one obtains (12), since the oppositely oriented boundaries of neighbouring parallelotopes cancel each other out in the interior of M.

To define the pushforward of currents, which is central to our formulation, we require the pullback of differential forms. The pullback of the k-form ω by a smooth map f is the k-form f^# ω given by

⟨(f^# ω)(x), v_1 ∧ ⋯ ∧ v_k⟩ := ⟨ω(f(x)), Jf(x) v_1 ∧ ⋯ ∧ Jf(x) v_k⟩, (13)

where v_1, …, v_k are tangent vectors at x and Jf(x) is the Jacobian. We will also require (13) for the practical implementation.
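Since (13) only requires the Jacobian-vector products Jf(x) v_i, it can be evaluated with automatic differentiation without forming the full Jacobian. A minimal PyTorch sketch follows (the function name `pullback_pairing` is ours, and `omega` is assumed to be a callable returning the pairing of the k-covector at a point with a simple k-vector given column-wise):

```python
import torch
from torch.autograd.functional import jvp

def pullback_pairing(omega, f, x, tangents):
    """Evaluate <(f^# omega)(x), v_1 ^ ... ^ v_k> as in (13).

    tangents is a list of vectors v_1, ..., v_k at x; each Jf(x) v_i is
    computed as a Jacobian-vector product, so the full Jacobian is never built.
    """
    y = f(x)
    pushed = [jvp(f, (x,), (v,))[1] for v in tangents]  # Jf(x) v_i
    V = torch.stack([p.flatten() for p in pushed], dim=1)  # columns: pushed tangents
    return omega(y, V)
```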

3.3 Currents

We now have the necessary tools to define currents and the required operations on them, which are defined through duality with differential forms. Consider the space of compactly supported, smooth k-forms in ℝ^n, which we denote by D^k(ℝ^n). When furnished with an appropriate topology (cf. §4.1 in Federer (1969) for the details), this is a locally convex topological vector space. k-currents are continuous linear functionals on smooth, compactly supported differential k-forms, i.e., elements of the topological dual space D_k(ℝ^n). Some examples of currents are given in Fig. 3. The 0-current in (a) could be an empirical data distribution, and the 2-current in (b) represents the data distribution with a two-dimensional oriented tangent plane at each data point. The 2-current in (c) simply represents the depicted set as an oriented manifold; its action on a differential form is given as in (9).

A natural notion of convergence for currents is given by the weak topology:

T_j ⇀ T :⟺ T_j(ω) → T(ω) for all ω ∈ D^k(ℝ^n). (14)

Figure 3: Example of a 0-current (a) and of 2-currents (b), (c).

The support of a current T, denoted supp T, is the complement of the largest open set U such that T(ω) = 0 whenever ω is compactly supported in U. Currents with compact support are denoted by E_k(ℝ^n). The boundary operator is defined using the exterior derivative,

∂T(ω) := T(dω), (15)

and Stokes’ theorem (12) ensures that this coincides with the intuitive notion of boundary for currents which are represented by integration over manifolds in the sense of (9).

The pushforward f_# T of a current T under a smooth map f is defined using the pullback,

(f_# T)(ω) := T(f^# ω), (16)

where the intuition is that the pushforward transforms the current with the map f; see the illustration in Fig. 2.

The mass of a current T is given by

M(T) := sup { T(ω) : ω ∈ D^k(ℝ^n), ‖ω(x)‖ ≤ 1 for all x }. (17)

If the current is an oriented manifold, then the mass is the k-dimensional volume of that manifold. One convenient way to construct k-currents is by combining a smooth k-vectorfield ξ with a Radon measure μ:

(μ ∧ ξ)(ω) := ∫ ⟨ω(x), ξ(x)⟩ dμ(x). (18)

A concrete example is illustrated in Fig. 3 (b), where a 2-current is constructed from samples x_i and tangent 2-vectors ξ(x_i).

For a current T with finite mass there is a measure μ_T and a map ξ_T : ℝ^n → Λ_k ℝ^n with ‖ξ_T(x)‖ = 1 μ_T-almost everywhere, so that we can represent T by integration as follows:

T(ω) = ∫ ⟨ω(x), ξ_T(x)⟩ dμ_T(x). (19)

Another perspective is that finite-mass currents are simply k-vector-valued Radon measures. Currents T with finite mass M(T) and finite boundary mass M(∂T) are called normal currents (Federer & Fleming, 1960). The space of normal k-currents with support in a compact set K is denoted by N_k(K).

Figure 4: Illustration of distances between 0-currents on the example of two Dirac measures δ_x, δ_y. The flat metric has the following advantages: unlike the mass it is continuous, and unlike Wasserstein-1 it easily generalizes to k-currents (see Fig. 5).

4 The Flat Metric

As indicated in Fig. 2, we wish to fit a current that is the pushforward of a low-dimensional latent current to the current given by the data. A more meaningful norm on currents than the mass turns out to be the flat norm.

Definition 2 (Flat norm and flat metric).

The flat norm with scale λ > 0 is defined for any k-current T as

F_λ(T) := sup { T(ω) : ω ∈ D^k(ℝ^n), ‖ω(x)‖ ≤ λ, ‖dω(x)‖ ≤ 1 for all x }. (20)

(We pick a different convention for λ than Morgan & Vixie (2007), where it bounds the other constraint, in order to emphasize the connection to the Wasserstein-1 distance.) For λ = 1 we simply write F := F_1, and F(S − T) will be denoted as the flat metric between S and T.

The flat norm also has a primal formulation,

F_λ(T) = min_{A, B} λ M(A) + M(B) (21)
subject to T = A + ∂B, (22)

where the minimization is over k-currents A and (k+1)-currents B of finite mass, and the minimum in (21)–(22) can be shown to exist, see §4.1.12 in Federer (1969). The flat norm is finite if T is a normal current, and it can be verified that it is indeed a norm.

To get an intuition, we compare the flat norm to the mass (17) and the Wasserstein-1 distance in Fig. 4 on the example of two Dirac measures δ_x, δ_y. The mass M(δ_x − δ_y) is discontinuous and has zero gradient, and is therefore unsuitable as a distance between currents. While the Wasserstein-1 metric is continuous in the distance between x and y, it does not easily generalize from probability measures to k-currents. In contrast, the flat metric has a meaningful geometric interpretation also for arbitrary k-currents. In Fig. 5 we illustrate the flat norm for two 1-currents S and T. In that figure, if S and T are of length one and are δ apart, then F_λ(S − T) ≤ (1 + 2λ)δ (take B to be the strip between them), which converges to zero as δ → 0.
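As a sanity check (our own computation, assuming the convention reconstructed in (20)), for 0-currents the flat metric between the two Diracs of Fig. 4 can be evaluated in closed form:

F_λ(δ_x − δ_y) = sup { ω(x) − ω(y) : |ω(z)| ≤ λ, ‖dω(z)‖ ≤ 1 for all z } = min( ‖x − y‖, 2λ ),

since the constraint on dω makes ω 1-Lipschitz (capping the gap at ‖x − y‖), while the bound |ω| ≤ λ truncates it at 2λ. The mass M(δ_x − δ_y), by contrast, jumps from 0 to 2 as soon as x ≠ y.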

Note that for 0-currents, the flat norm (20) is strongly related to the Wasserstein-1 distance except for the additional constraint ‖ω(x)‖ ≤ λ on the dual variable ω, which in the example of Fig. 4 controls the truncation cutoff. Notice also the similarity of (21) to the Beckmann formulation of the Wasserstein-1 distance (Beckmann, 1952; Santambrogio, 2015), with the difference being the implementation of the “divergence constraint” T = ∂B with a soft penalty λ M(T − ∂B). Imposing this constraint exactly, as in the Wasserstein distance, is problematic in the case k > 0, since not every k-current is the boundary of a (k+1)-current; see the example above in Fig. 5.

The following proposition studies the effect of the scale parameter λ on the flat norm.

Proposition 1.

For any λ > 0, the following relation holds,

min(λ, 1) F_1(T) ≤ F_λ(T) ≤ max(λ, 1) F_1(T), (23)

meaning that F_λ and F_1 are equivalent norms.

Proof.

By a result of Morgan & Vixie (2007) we have the interesting relation

(24)

where is the -dilation. Using the bound , §4.1.14 in Federer (1969), and the fact that , one inequality directly follows. For the other side, notice that

(25)

and dividing by yields the result. ∎

Figure 5: The flat metric F_λ(S − T) is given by an optimal decomposition of S − T into a k-current A and the boundary of a (k+1)-current B with minimal weighted mass λ M(A) + M(B). An intuition is that λ is a penalty that controls how closely ∂B should approximate S − T, while M(B) is the (k+1)-dimensional volume of B.

The importance of the flat norm is due to the fact that it metrizes the weak-convergence (14) on compactly supported normal currents with uniformly bounded mass and boundary mass.

Proposition 2.

Let K be a compact set and C > 0 some fixed constant. For a sequence T_1, T_2, … ∈ N_k(K) and T ∈ N_k(K) with M(T_j) + M(∂T_j) ≤ C we have that:

F_λ(T_j − T) → 0 ⟺ T_j ⇀ T. (26)
Proof.

Due to Prop. 1 it is enough to consider the case λ = 1, which is given by Corollary 7.3 in the paper of Federer & Fleming (1960). ∎

5 Flat Metric Minimization

Motivated by the theoretical properties of the flat metric shown in the previous section, we consider the following optimization problem:

min_{θ ∈ Θ} F_λ(f_θ# S − T), (27)

where S is a normal k-current on the latent space and T ∈ N_k(K) is the data current. We will assume that the map f is parametrized with parameters θ in a compact set Θ and write f_θ to abbreviate f(·; θ) for some θ ∈ Θ. We need the following assumption to be able to prove the existence of minimizers for the problem (27).

Assumption 1.

The map (x, θ) ↦ f(x; θ) is smooth in x with uniformly bounded derivative. Furthermore, we assume that it is locally Lipschitz continuous in θ and that the parameter set Θ is compact.

Under this assumption, we will show that the objective in (27) is Lipschitz continuous. This will in turn guarantee existence of minimizers, as the domain is assumed to be compact.

Proposition 3.

Let S, T be normal k-currents with compact support. If the pushforward map f fulfills Assumption 1, then the function θ ↦ F_λ(f_θ# S − T) is Lipschitz continuous and hence differentiable almost everywhere.

Proof.

In Appendix A. ∎

5.1 Application to Generative Modeling

We now turn towards our considered application illustrated in Fig. 2. There, we denote by k the number of tangent vectors we specify at each sample point. The latent current S is constructed by combining a probability distribution μ_Z on the latent space, which could for example be the uniform distribution, with the unit k-vectorfield e_1 ∧ ⋯ ∧ e_k as follows:

S = μ_Z ∧ (e_1 ∧ ⋯ ∧ e_k). (28)

For an illustration, see the right side of Fig. 2 and Fig. 3. The data current T is constructed from the samples x_1, …, x_N and tangent vectorfields t_1, …, t_k:

T = (1/N) Σ_{i=1}^N δ_{x_i} ∧ (t_1(x_i) ∧ ⋯ ∧ t_k(x_i)). (29)

The tangent k-vectorfields are given by individual tangent vectors to the data manifold. For an illustration, see the left side of Fig. 2 or Fig. 3. After solving (27), the map f_θ will be our generative model, where changes in the latent space along the unit directions e_1, …, e_k are expected to behave equivariantly to the specified tangent directions near the data manifold.
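To make the construction concrete, the following sketch shows tensor shapes one might use to represent (28) and (29) in code; all names, shapes, and the random stand-in data are illustrative assumptions, not the authors' implementation.

```python
import torch

# Illustrative sizes: N data samples, data_dim ambient dimension, d latent dims, k tangents.
N, data_dim, d, k = 64, 784, 16, 2

# Data current (29): samples x_i with k tangent vectors t_1(x_i), ..., t_k(x_i) each.
x = torch.randn(N, data_dim)             # stand-in for real samples
data_tangents = torch.randn(N, data_dim, k)  # columns span the oriented tangent k-plane at x_i

# Latent current (28): samples z ~ mu_Z, oriented by the constant unit directions e_1, ..., e_k.
z = torch.rand(N, d) * 2 - 1             # e.g. uniform on [-1, 1]^d
latent_dirs = torch.eye(d)[:, :k]        # shared by every latent sample
```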

Figure 6: We illustrate the effect of moving from k = 0 to k = 1 and plot the measure of the pushforward current (shown in orange) for different epochs. The black curve illustrates a walk along the first latent dimension. For k = 0, which is similar to WGAN-GP (Gulrajani et al., 2017), the latent walk is not meaningful. The proposed approach (k = 1) allows one to specify tangent vectors at the samples to which the first latent dimension behaves equivariantly, yielding an interpretable representation.

5.2 FlatGAN Formulation

To get a primal-dual formulation (or two-player zero-sum game) in the spirit of GANs, we insert the definition of the flat norm (20) into the primal problem (27):

min_{θ ∈ Θ} max_φ S(f_θ^# ω_φ) − T(ω_φ) subject to ‖ω_φ(x)‖ ≤ λ, ‖dω_φ(x)‖ ≤ 1 for all x, (30)

where φ are for example the parameters of a neural network representing the dual variable ω_φ. In the above equation, we also used the definition of the pushforward (16). Notice that for k = 0 the exterior derivative in (30) specializes to the gradient. This yields a Lipschitz constraint, and as for sufficiently large λ the other constraint becomes irrelevant, the problem (30) is closely related to the Wasserstein GAN (Bottou et al., 2017). The novelty in this work is the generalization to k > 0.

Combining (28) and (29) into (30), we arrive at the objective

min_{θ ∈ Θ} max_φ E_{z ∼ μ_Z} ⟨ω_φ(f_θ(z)), Jf_θ(z) e_1 ∧ ⋯ ∧ Jf_θ(z) e_k⟩ − (1/N) Σ_{i=1}^N ⟨ω_φ(x_i), t_1(x_i) ∧ ⋯ ∧ t_k(x_i)⟩. (31)

Interestingly, due to the pullback, the discriminator inspects not only the output of the generator, but also parts of its Jacobian matrix. As a remark, relations between the generator Jacobian and GAN performance have recently been studied by Odena et al. (2018).
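To illustrate how the discriminator can access the generator Jacobian in practice, here is a minimal PyTorch sketch of the generator-side term in (31) under our reading of the formulation; the names (generator, critic_scalar, critic_tangents) are our own, and the split of the critic into a scalar part and k tangent outputs anticipates the approximation (34) described below.

```python
import torch
from torch.autograd.functional import jvp

def generator_term(generator, critic_scalar, critic_tangents, z, k):
    """One (unbatched) sample of the generator side of (31); a sketch, not the authors' code.

    The latent tangent plane at z is spanned by the unit directions e_1, ..., e_k;
    its pushforward is spanned by the Jacobian-vector products Jf(z) e_i. The critic
    pairs this simple k-vector with its own k tangent outputs via a k x k determinant
    as in (5), plus a scalar "affine" term.
    """
    x = generator(z)
    eye = torch.eye(z.numel(), device=z.device, dtype=z.dtype)
    pushed = [jvp(generator, (z,), (eye[:, i].reshape_as(z),))[1].flatten() for i in range(k)]
    pushed = torch.stack(pushed, dim=1)                  # (dim(x), k): Jf(z) e_i as columns
    tang = critic_tangents(x).reshape(x.numel(), k)      # k covector-like critic outputs
    return critic_scalar(x).squeeze() + torch.det(tang.T @ pushed)
```

The data-side term of (31) is analogous, with the prescribed tangent vectors t_i(x_i) taking the place of the pushed latent directions.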

The constraints in (30) are implemented using penalty terms. First notice that, due to the definition of the comass norm (6), the first constraint ‖ω_φ(x)‖ ≤ λ is equivalent to imposing ⟨ω_φ(x), v⟩ ≤ λ for all simple k-vectors v with |v| ≤ 1. We implement this with a penalty term as follows:

(32)

where the integration over the simple k-vectors is with respect to the Haar measure on the Grassmannian manifold of k-dimensional subspaces in ℝ^n, see Chapter 3.2 in Krantz & Parks (2008). Similarly, the constraint ‖dω_φ(x)‖ ≤ 1 on the exterior derivative is implemented by another penalty term as follows:

(33)

5.3 Implementation with Deep Neural Networks

For high-dimensional practical problems it is completely infeasible to work directly with k-covectorfields ω : ℝ^n → Λ^k ℝ^n due to the curse of dimensionality. For example, already for the MNIST dataset augmented with two tangent vectors (n = 784, k = 2), we have that dim Λ^k ℝ^n = (784 choose 2) = 306,936.

Figure 7: We show the effect of varying the first two components of the latent space, corresponding to the two selected tangent vectors, which are rotation (left) and thickness (right). As seen in the figure, varying the corresponding latent representation yields an interpretable effect on the output, corresponding to the specified tangent direction.

To overcome this issue, we unfortunately have to resort to a few heuristic approximations. To that end, we first notice that in the formulations the dual variable ω_φ only appears through inner products with simple k-vectors, so we can implement it by implicitly describing its action, i.e., interpret it as a map of a point x and a simple k-vector v_1 ∧ ⋯ ∧ v_k:

(34)

Theoretically, the “affine term” in (34) is not fully justified, as the map no longer describes an inner product on Λ_k ℝ^n, but we found it to improve the quality of the generative model. An attempt to justify this in the context of GANs is that the scalar function is the usual “discriminator”, while the remaining outputs are combined to discriminate oriented tangent planes.

In practice, we parametrize these maps using deep neural networks. For efficiency reasons, the networks share their parameters up until the last few layers.

The inner product in (34) between the simple k-vectors is implemented by a k × k determinant, see (5). The reason we do this is to satisfy the properties of the Grassmann algebra (2)–(3). This is important, since otherwise the “discriminator” could distinguish between different representations of the same oriented tangent plane.

Figure 8: From left to right we vary the latent codes corresponding to lighting, elevation, and azimuth after training on the smallNORB dataset (LeCun et al., 2004).

For the implementation of the penalty term (33), we use the definition of the exterior derivative (11) together with the “approximate form” (34). To be compatible with the affine term, we use a separate penalty on it, which we also found to give better results:

(35)

In the above equation, one matrix collects as columns the vectors v_1, …, v_{k+1} with v_i omitted, and the other the corresponding outputs of the approximate form (34). Another motivation for this implementation is that in the case k = 0 the second term in (35) disappears and one recovers the well-known “gradient penalty” regularizer proposed by Gulrajani et al. (2017).
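For reference, the k = 0 special case just mentioned is the familiar gradient penalty; below is a minimal PyTorch sketch of it (our own, in a one-sided variant enforced at the data points as discussed in the next paragraph, not the authors' exact implementation).

```python
import torch

def gradient_penalty(critic, x, weight=10.0):
    """k = 0 special case of (35): penalize scalar critic gradients with norm > 1.

    Enforced at the data points x (a sketch of one common variant; Gulrajani et al.
    (2017) originally use a two-sided penalty at random interpolates).
    """
    x = x.detach().clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(critic(x).sum(), x, create_graph=True)
    norms = grad.flatten(start_dim=1).norm(dim=1)
    return weight * torch.clamp(norms - 1.0, min=0.0).pow(2).mean()
```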

For the stochastic approximation of the penalty terms (32)–(33) we sample from the Haar measure on the Grassmannian (i.e., taking random k-dimensional and (k+1)-dimensional subspaces in ℝ^n) by computing singular value decompositions of random Gaussian matrices. Furthermore, we found it beneficial in practice to enforce the penalty terms only at the data points, as for example advocated in the recent work of Mescheder et al. (2018). The right-multiplied Jacobian-vector products (also referred to as “rop” in some frameworks) in (35) as well as in the loss function (31) are implemented using two additional backpropagations.
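A minimal sketch of the subspace sampling step described above (the function name is ours): the left singular vectors of a standard Gaussian matrix give an orthonormal basis of a subspace distributed according to the Haar measure on the Grassmannian.

```python
import torch

def random_subspace_basis(n, m):
    """Orthonormal basis of a Haar-random m-dimensional subspace of R^n.

    The column span of an n x m standard Gaussian matrix is uniformly distributed
    on the Grassmannian; its left singular vectors provide an orthonormal basis.
    """
    g = torch.randn(n, m)
    u, _, _ = torch.linalg.svd(g, full_matrices=False)
    return u  # (n, m) with orthonormal columns
```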


Figure 9: Varying the learned latent representation of time. The model captures behaviours such as people walking on the beach, see also the results shown in Fig. 1.

6 Experiments

The specific hyperparameters, architectures, and tangent vector setups used in practice are detailed in Appendix B. (See https://github.com/moellenh/flatgan for a PyTorch implementation to reproduce Fig. 6 and Fig. 7.)

6.1 Illustrative 2D Example

As a first proof of concept, we illustrate the effect of moving from k = 0 to k = 1 on a very simple dataset consisting of five points on a circle. As shown in Fig. 6, for k = 0 (corresponding to a WGAN-GP formulation), varying the first latent variable has no clear meaning. In contrast, with the proposed FlatGAN formulation (k = 1), we can specify vectors tangent to the circle from which the data is sampled. This yields an interpretable latent representation that corresponds to an angular movement along the circle. As the number of epochs increases, both formulations tend to concentrate most of the probability mass on the five data points. However, since the generator f_θ is continuous by construction, an interpretable path remains.

6.2 Equivariant Representation Learning

In Fig. 7 and Fig. 8 we show examples for k = 2 on MNIST and k = 3 on the smallNORB dataset of LeCun et al. (2004), respectively. For MNIST, we compute the tangent vectors manually by rotation and dilation of the digits, similarly to Simard et al. (1992, 1998). For the smallNORB example, the tangent vectors are given as differences between the corresponding images. As observed in the figures, the proposed formulation leads to interpretable latent codes which behave equivariantly with the generated images. We remark that the goal was not to achieve state-of-the-art image quality but rather to demonstrate that specifying tangent vectors yields disentangled representations. As remarked by Jaderberg et al. (2015), representing a 3D scene with a sequence of 2D convolutions is challenging, and a specialized architecture based on a voxel representation would be more appropriate for the smallNORB example.
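To illustrate how such tangent vectors can be obtained, the following sketch approximates the rotation tangent of a digit image by a central finite difference of slightly rotated copies; this is our own illustration in the spirit of Simard et al. (1992, 1998), not the authors' preprocessing code (a thickness tangent could be obtained analogously, e.g. by differencing small morphological dilations).

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_tangent(img, eps_deg=1.0):
    """Finite-difference tangent vector for in-plane rotation of a digit image.

    img: (28, 28) array. Returns an array of the same shape approximating the
    derivative of the rotated image with respect to the rotation angle at 0.
    """
    plus = rotate(img, eps_deg, reshape=False, order=1)
    minus = rotate(img, -eps_deg, reshape=False, order=1)
    return (plus - minus) / (2.0 * np.deg2rad(eps_deg))
```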

6.3 Discovering the Arrow of Time

In our last experiment, we set k = 1 and specify the tangent vector as the difference of two neighbouring frames in video data. We train on the tinyvideo beach dataset (Vondrick et al., 2016), which consists of more than 36 million frames. After training for about half an epoch, we can already observe a learned latent representation of time, see Fig. 1 and Fig. 9. We generate individual frames by varying the corresponding latent coordinate.

Even though the model is trained on individual frames in random order, a somewhat coherent representation of time is discovered which captures phenomena such as ocean waves or people walking on the beach.

7 Discussion and Conclusion

In this work, we demonstrated that k-currents can be used to introduce a notion of orientation into probabilistic models. Furthermore, in experiments we have shown that specifying partial tangent information of the data manifold leads to interpretable and equivariant latent representations, such as the camera position and lighting in a 3D scene or the arrow of time in time-series data.

The difference to purely unsupervised approaches such as InfoGAN or β-VAE is that we can encourage potentially very complex latent representations to be learned. Nevertheless, an additional mutual information term as in (Chen et al., 2016) can be directly added to the formulation, so that some representations are encouraged through tangent vectors while the remaining ones are hoped to be discovered in an unsupervised fashion.

Generally speaking, we believe that geometric measure theory is a rather underexploited field with many possible application areas in probabilistic machine learning. We see this work as a step towards leveraging this potential.

Acknowledgements

We thank Kevin R. Vixie for his detailed feedback and comments on the manuscript. The work was partially supported by the German Research Foundation (DFG); project 394737018 “Functional Lifting 2.0 – Efficient Convexifications for Imaging and Vision”.

References

  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017.
  • Beckmann (1952) Beckmann, M. A continuous model of transportation. Econometrica: Journal of the Econometric Society, pp. 643–660, 1952.
  • Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  • Bottou et al. (2017) Bottou, L., Arjovsky, M., Lopez-Paz, D., and Oquab, M. Geometrical insights for implicit generative modeling. arXiv:1712.07822, 2017.
  • Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
  • Csiszár et al. (2004) Csiszár, I., Shields, P. C., et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
  • de Rham (1955) de Rham, G. Variétés différentiables, formes, courants, formes harmoniques, volume 1222. Hermann, 1955.
  • Denton et al. (2017) Denton, E. L. et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, 2017.
  • Federer (1969) Federer, H. Geometric Measure Theory. Springer, 1969.
  • Federer & Fleming (1960) Federer, H. and Fleming, W. H. Normal and integral currents. Annals of Mathematics, pp. 458–520, 1960.
  • Fefferman et al. (2016) Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  • Feydy et al. (2018) Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-I., Trouvé, A., and Peyré, G. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. arXiv:1810.08278, 2018.
  • Fraser et al. (2003) Fraser, A. M., Hengartner, N. W., Vixie, K. R., and Wohlberg, B. E. Incorporating invariants in Mahalanobis distance based classifiers: Application to face recognition. In International Joint Conference on Neural Networks, 2003.
  • Genevay et al. (2017) Genevay, A., Peyré, G., and Cuturi, M. GAN and VAE from an optimal transport point of view. arXiv:1706.01807, 2017.
  • Glaunès et al. (2008) Glaunès, J., Qiu, A., Miller, M. I., and Younes, L. Large deformation diffeomorphic metric curve mapping. International Journal of Computer Vision (IJCV), 80(3):317, 2008.
  • Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
  • Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. arXiv:1704.00028, 2017.
  • Higgins et al. (2016) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
  • Hinton et al. (2011) Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming auto-encoders. In International Conference on Artificial Neural Networks, 2011.
  • Hubbard & Hubbard (2015) Hubbard, J. H. and Hubbard, B. B. Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach. Matrix Editions, 2015.
  • Jaderberg et al. (2015) Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems, 2015.
  • Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv:1802.05983, 2018.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv:1312.6114, 2014.
  • Krantz & Parks (2008) Krantz, S. G. and Parks, H. R. Geometric Integration Theory. Birkhäuser Boston, 2008.
  • LeCun et al. (2004) LeCun, Y., Huang, F. J., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
  • Li et al. (2017) Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2017.
  • Mathieu et al. (2016) Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, 2016.
  • Mescheder et al. (2018) Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually Converge? In International Conference on Machine Learning, 2018.
  • Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
  • Möllenhoff & Cremers (2019) Möllenhoff, T. and Cremers, D. Lifting vectorial variational problems: A natural formulation based on geometric measure theory and discrete exterior calculus. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • Morgan (2016) Morgan, F. Geometric Measure Theory: A Beginner’s Guide. Academic Press, 5th edition, 2016.
  • Morgan & Vixie (2007) Morgan, S. P. and Vixie, K. R. LTV computes the flat norm for boundaries. In Abstract and Applied Analysis, 2007.
  • Narayanaswamy et al. (2017) Narayanaswamy, S., Paige, T. B., Van de Meent, J.-W., Desmaison, A., Goodman, N., Kohli, P., Wood, F., and Torr, P. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, 2017.
  • Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, 2017.
  • Odena et al. (2018) Odena, A., Buckman, J., Olsson, C., Brown, T. B., Olah, C., Raffel, C., and Goodfellow, I. Is generator conditioning causally related to GAN performance? In International Conference on Machine Learning, 2018.
  • Peyré & Cuturi (2018) Peyré, G. and Cuturi, M. Computational optimal transport. arXiv:1803.00567, 2018.
  • Pickup et al. (2014) Pickup, L. C., Pan, Z., Wei, D., Shih, Y., Zhang, C., Zisserman, A., Schölkopf, B., and Freeman, W. T. Seeing the arrow of time. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082, 2014.
  • Rifai et al. (2011) Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. In Advances in Neural Information Processing Systems, 2011.
  • Santambrogio (2015) Santambrogio, F. Optimal Transport for Applied Mathematicians. Birkhäuser, New York, 2015.
  • Schmidhuber (1992) Schmidhuber, J. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
  • Schwartz (1951, 1957) Schwartz, L. Théorie des distributions I, II, volume 1245, 1122. Hermann, 1951, 1957.
  • Simard et al. (1992) Simard, P., Victorri, B., LeCun, Y., and Denker, J. Tangent prop – a formalism for specifying selected invariances in an adaptive network. In Advances in Neural Information Processing Systems, 1992.
  • Simard et al. (1998) Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. Transformation invariance in pattern recognition – tangent distance and tangent propagation. In Neural networks: tricks of the trade, pp. 239–274, 1998.
  • Sriperumbudur et al. (2012) Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G. R., et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
  • Vaillant & Glaunès (2005) Vaillant, M. and Glaunès, J. Surface matching via currents. In Biennial International Conference on Information Processing in Medical Imaging, 2005.
  • Vixie et al. (2010) Vixie, K. R., Clawson, K., Asaki, T. J., Sandine, G., Morgan, S. P., and Price, B. Multiscale flat norm signatures for shapes and images. Applied Mathematical Sciences, 4(14):667–680, 2010.
  • Vondrick et al. (2016) Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, 2016.
  • Wei et al. (2018) Wei, D., Lim, J. J., Zisserman, A., and Freeman, W. T. Learning and using the arrow of time. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • Whitney (1957) Whitney, H. Geometric Integration Theory. Princeton University Press, 1957.

A Proof of Proposition 3

Since and are normal currents we know for all .

We now directly show Lipschitz continuity. First notice that

(36)
(37)

yields the following bound:

(38)

Due to Prop. 1 we have that

(39)

Now define the compact set as

(40)

and as in §4.1.12 in Federer (1969) for compact the “stronger” flat norm

(41)

Since the constraint in the supremum in (41) is less restrictive than in the definition of the flat norm (20), we have

(42)

Then, the inequality after §4.1.13 in Federer (1969) bounds the right side of (42) for by

(43)

where due to Assumption 1 and we write , where is defined in the sense of (19). For , a similar bound can be derived without the term .

For , by setting we can further bound the term in (43) by

(44)

where . For , the bound is derived analogously.

Now since is locally Lipschitz and is compact, is Lipschitz and we denote the constant as , leading to the bound

(45)

Since is a normal current, . Thus by combining (38), (39), (42), (43), (44) and (45) there is a finite such that

(46)

Therefore, the cost in (27) is Lipschitz in θ and, by Rademacher’s theorem, §3.1.6 in Federer (1969), also differentiable almost everywhere.

B Parameters and Network Architectures

For all experiments we use Adam optimizer (Kingma & Ba, 2014), with step size and momentum parameters , . The batch size is set to in all experiments except the first one (which runs full batch with batch size ). We always set .

B.1 Illustrative 2D Example

We pick the same parameters for k = 0 and k = 1. We set the penalty to and use discriminator updates per generator update as in (Gulrajani et al., 2017). The generator is a fully connected network with leaky ReLU activations. The first layer ensures that the latent coordinate has the topology of a circle. The discriminators are fully connected nets with leaky ReLUs. The distribution on the latent is a uniform and