1 Introduction
This work is concerned with the problem of representation learning, which has important consequences for many tasks in artificial intelligence, cf. Bengio et al. (2013). More specifically, our aim is to learn representations which behave equivariantly with respect to selected transformations of the data. Such variations are often known beforehand and could, for example, describe changes in stroke width or rotation of a digit, changes in viewpoint or lighting in a three-dimensional scene, but also the arrow of time (Pickup et al., 2014; Wei et al., 2018) in time series, describing how a video changes from one frame to the next, see Fig. 1. We tackle this problem by introducing a novel formalism based on geometric measure theory (Federer, 1969)
, which we find to be interesting in itself. To motivate our application in generative modeling, recall the manifold hypothesis, which states that the distribution of real-world data tends to concentrate near a low-dimensional manifold, see
Fefferman et al. (2016) and the references therein. Under that hypothesis, a possible unifying view on prominent methods in unsupervised and representation learning, such as generative adversarial networks (GANs) (Goodfellow et al., 2014) and variational autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014), is the following: both approaches aim to approximate the true distribution concentrating near the manifold with a distribution on some low-dimensional latent space that is pushed through a decoder or generator mapping to the (high-dimensional) data space
(Genevay et al., 2017; Bottou et al., 2017). We argue that treating data as a distribution potentially ignores useful available geometric information such as orientation and tangent vectors to the data manifold. Such tangent vectors describe the aforementioned local variations or perturbations. Therefore, we postulate that data should not be viewed as a distribution but rather as a current.
We postpone the definition of currents (de Rham, 1955) to Sec. 3, and informally think of $k$-currents as distributions over $k$-dimensional oriented planes. For the limiting case $k = 0$, currents simply reduce to distributions in the sense of Schwartz (1951, 1957), and positive $0$-currents with unit mass are probability measures. A seminal work in the theory of currents was written by Federer & Fleming (1960), which established compactness theorems for subsets of currents (normal and integral currents). In this paper, we will work in the space of normal $k$-currents with support in a compact set $K \subset \mathbb{R}^n$, denoted by $\mathbf{N}_{k,K}$.
Just as probabilistic models build upon divergences (Csiszár et al., 2004), integral probability metrics (Sriperumbudur et al., 2012), or more general optimal transportation related divergences (Peyré & Cuturi, 2018; Feydy et al., 2018), we require a sensible notion to measure “distance” between currents.
In this work, we will focus on the flat norm due to Whitney (1957).¹ To be precise, we consider a scaled variant introduced and studied by Morgan & Vixie (2007); Vixie et al. (2010). This choice is motivated in Sec. 4, where we show that the flat norm enjoys certain attractive properties similar to the celebrated Wasserstein distances. For example, it metrizes the weak-* convergence of normal currents.

¹The terminology “flat” carries no geometrical significance and refers to Whitney’s use of the musical notations flat (♭) and sharp (♯).
A potential alternative to the flat norm is given by kernel metrics on spaces of currents (Vaillant & Glaunès, 2005; Glaunès et al., 2008). These have been proposed for diffeomorphic registration, but kernel distances on distributions have also been successfully employed for generative modeling, see Li et al. (2017). Constructions similar to the Kantorovich relaxation in optimal transport but generalized to currents recently appeared in the context of convexifications for certain variational problems (Möllenhoff & Cremers, 2019).
2 Related Work
Our main idea is illustrated in Fig. 2, which was inspired by the optimal transportation point of view on GANs given by Genevay et al. (2017).
Tangent vectors of the data manifold, either prespecified (Simard et al., 1992, 1998; Fraser et al., 2003)
or learned with a contractive autoencoder
(Rifai et al., 2011), have been used to train classifiers that aim to be
invariant to changes relative to the data manifold. In contrast to these works, we use tangent vectors to learn interpretable representations and a generative model that aims to be equivariant. The principled introduction of tangent vectors into probabilistic generative models is one of our main contributions. Various approaches to learning informative or disentangled latent representations in a completely unsupervised fashion exist (Schmidhuber, 1992; Higgins et al., 2016; Chen et al., 2016; Kim & Mnih, 2018). Our approach is orthogonal to these works, as specifying tangent vectors further encourages informative representations to be learned. For example, our GAN formulation could be combined with a mutual information term as in InfoGAN (Chen et al., 2016).
Our work is more closely related to semi-supervised approaches to learning disentangled latent representations, which similarly require some form of knowledge of the underlying factors (Hinton et al., 2011; Denton et al., 2017; Mathieu et al., 2016; Narayanaswamy et al., 2017), and also to conditional GANs (Mirza & Osindero, 2014; Odena et al., 2017). However, the difference is the connection to geometric measure theory, which we believe to be completely novel, and our specific FlatGAN formulation that seamlessly extends the Wasserstein GAN (Arjovsky et al., 2017), cf. Fig. 2.
Since the concepts we need from geometric measure theory are not commonly used in machine learning, we briefly review them in the following section.
3 Geometric Measure Theory
The book by Federer (1969) is still the formidable, definitive reference on the subject. As a more accessible introduction we recommend Krantz & Parks (2008) or Morgan (2016). While our aim is to keep the manuscript self-contained, we invite the interested reader to consult Chapter 4 in Morgan (2016), which in turn refers to the corresponding chapters in the book of Federer (1969) for more details.
3.1 Grassmann Algebra
Notation.
Denote by $e_1, \dots, e_n$ a basis of $\mathbb{R}^n$ with dual basis $e^1, \dots, e^n$, such that $e^i$ is the linear functional that maps every $x \in \mathbb{R}^n$ to its $i$-th component $x_i$. For $0 \le k \le n$, denote by $I(n,k)$ the set of ordered multi-indices $i = (i_1, \dots, i_k)$ with $1 \le i_1 < \dots < i_k \le n$.
One can multiply $k$ vectors $v_1, \dots, v_k \in \mathbb{R}^n$ to obtain a new object:

(1) $v = v_1 \wedge v_2 \wedge \dots \wedge v_k,$

called a $k$-vector in $\mathbb{R}^n$. The wedge (or exterior) product is characterized by multilinearity,

(2) $(\alpha u + \beta v) \wedge w = \alpha\, (u \wedge w) + \beta\, (v \wedge w),$

and it is alternating:

(3) $u \wedge v = -\, v \wedge u.$
In general, any $k$-vector $v$ can be written as

(4) $v = \sum_{i \in I(n,k)} v_i\; e_{i_1} \wedge \dots \wedge e_{i_k},$

for coefficients $v_i \in \mathbb{R}$. The vector space of $k$-vectors is denoted by $\Lambda_k \mathbb{R}^n$ and has dimension $\binom{n}{k}$. We define for two $k$-vectors $v, w$ an inner product $\langle v, w \rangle := \sum_{i \in I(n,k)} v_i w_i$ and the Euclidean norm $|v| := \sqrt{\langle v, v \rangle}$.
A simple (or decomposable) $k$-vector is any $v \in \Lambda_k \mathbb{R}^n$ that can be written as a product of $k$ vectors as in (1). Simple $k$-vectors such as (1) are uniquely determined by the $k$-dimensional space spanned by the $v_i$, their orientation, and the norm corresponding to the area of the parallelotope spanned by the $v_i$. Simple $k$-vectors with unit norm can therefore be thought of as oriented $k$-dimensional subspaces, and the rules (2)–(3) can be thought of as equivalence relations.
It turns out that the inner product of two simple $k$-vectors can be computed by the determinant

(5) $\langle v_1 \wedge \dots \wedge v_k,\; w_1 \wedge \dots \wedge w_k \rangle = \det(V^\top W),$

where the columns of $V, W \in \mathbb{R}^{n \times k}$ contain the individual vectors. This will be useful later for our practical implementation.
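As a quick numerical sketch of the determinant formula (the helper name is ours, not from the paper):

```python
import numpy as np

def simple_inner(V, W):
    """Inner product of the simple k-vectors v_1 ^ ... ^ v_k and
    w_1 ^ ... ^ w_k, with the vectors stored as the columns of V and W."""
    return np.linalg.det(V.T @ W)

# e1 ^ e2 in R^3, stored column-wise.
V = np.array([[1., 0.],
              [0., 1.],
              [0., 0.]])
# The unit square paired with itself gives area 1.
print(simple_inner(V, V))   # 1.0

# Pairing with an orthogonal plane e1 ^ e3 gives 0.
W = np.array([[1., 0.],
              [0., 0.],
              [0., 1.]])
print(simple_inner(V, W))   # 0.0
```

Note that swapping two columns of `W` flips the sign, matching the alternating rule (3).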
Not all $k$-vectors are simple. An illustrative example is $e_1 \wedge e_2 + e_3 \wedge e_4 \in \Lambda_2 \mathbb{R}^4$, which describes two $2$-dimensional subspaces in $\mathbb{R}^4$ intersecting only at zero.
The dual space of $\Lambda_k \mathbb{R}^n$ is denoted as $\Lambda^k \mathbb{R}^n$, and its elements are called $k$-covectors. They are similarly represented as in (4) but with the dual basis elements $e^{i_1} \wedge \dots \wedge e^{i_k}$. Analogously, we can define an inner product $\langle v, \omega \rangle$ between $k$-vectors and $k$-covectors. Next to the Euclidean norm $|\cdot|$, we define two additional norms due to Whitney (1957).
Definition 1 (Mass and comass).
The comass norm defined for $k$-covectors $\omega \in \Lambda^k \mathbb{R}^n$ is given by

(6) $\|\omega\| := \sup\big\{ \langle v, \omega \rangle \;:\; v \text{ simple},\; |v| \le 1 \big\},$

and the mass norm for $v \in \Lambda_k \mathbb{R}^n$ is given by

(7) $\|v\| := \sup\big\{ \langle v, \omega \rangle \;:\; \|\omega\| \le 1 \big\}.$
The mass norm is by construction the largest norm that agrees with the Euclidean norm on simple $k$-vectors. For the non-simple $2$-vector from before, we compute

(8) $\| e_1 \wedge e_2 + e_3 \wedge e_4 \| = 2 \;>\; \sqrt{2} = | e_1 \wedge e_2 + e_3 \wedge e_4 |.$

Interpreting the non-simple $2$-vector as two tangent planes, we see that the mass norm gives the correct area, while the Euclidean norm underestimates it. The comass will be used later to define the mass of currents and the flat norm.
3.2 Differential Forms
In order to define currents, we first need to introduce differential forms. A differential $k$-form is a $k$-covector field $\omega : \mathbb{R}^n \to \Lambda^k \mathbb{R}^n$. The support $\mathrm{spt}(\omega)$ is defined as the closure of the set $\{ x \in \mathbb{R}^n : \omega(x) \neq 0 \}$.
Differential forms allow one to perform coordinate-free integration over oriented manifolds. Given some $k$-dimensional manifold $M$, possibly with boundary, an orientation is a continuous map $\vec{M}$ which assigns to each point $x \in M$ a simple $k$-vector with unit norm that spans the tangent space at that point. Integration of a differential $k$-form over an oriented manifold is then defined by:

(9) $\int_M \omega := \int_M \big\langle \vec{M}(x),\, \omega(x) \big\rangle \,\mathrm{d}\mathcal{H}^k(x),$

where the second integral is the standard Lebesgue integral with respect to the $k$-dimensional Hausdorff measure restricted to $M$, i.e., $\mathcal{H}^k \llcorner M$. The $k$-dimensional Hausdorff measure assigns to sets in $\mathbb{R}^n$ their $k$-dimensional volume, see Chapter 2 in Morgan (2016) for a nice illustration. For $k = n$ the Hausdorff measure coincides with the Lebesgue measure.
The exterior derivative of a differential $k$-form $\omega$ is the $(k+1)$-form $\mathrm{d}\omega$ defined by

(10) $\big\langle v_1 \wedge \dots \wedge v_{k+1},\; \mathrm{d}\omega(x) \big\rangle := \lim_{h \to 0} \frac{1}{h^{k+1}} \int_{\partial P(x;\, h v_1, \dots, h v_{k+1})} \omega,$

where $\partial P(x;\, h v_1, \dots, h v_{k+1})$ is the oriented boundary of the parallelotope spanned by the $h v_i$ at the point $x$. The above definition is for example used in the textbook of Hubbard & Hubbard (2015). To get an intuition, note that for $k = 0$ this reduces to the familiar directional derivative. In case $\omega$ is sufficiently smooth, the limit in (10) is given by

(11) $\big\langle v_1 \wedge \dots \wedge v_{k+1},\; \mathrm{d}\omega(x) \big\rangle = \sum_{i=1}^{k+1} (-1)^{i+1}\, \big\langle v_1 \wedge \dots \wedge \hat{v}_i \wedge \dots \wedge v_{k+1},\; \partial_{v_i} \omega(x) \big\rangle,$

where $\hat{v}_i$ means that the vector $v_i$ is omitted and $\partial_{v_i}$ denotes the directional derivative. The formulation (11) will be used in the practical implementation. Interestingly, with (9) and (10) in mind, Stokes’ theorem

(12) $\int_M \mathrm{d}\omega = \int_{\partial M} \omega$
becomes almost obvious, as (informally speaking) integrating (10) one obtains (12), since the oppositely oriented boundaries of neighbouring parallelotopes cancel each other out in the interior of $M$.
To define the pushforward of currents, which is central to our formulation, we require the pullback of differential forms. The pullback of a $k$-form $\omega$ by a smooth map $g : \mathbb{R}^m \to \mathbb{R}^n$ is given by

(13) $\big\langle v_1 \wedge \dots \wedge v_k,\; (g^\# \omega)(x) \big\rangle := \big\langle Dg(x) v_1 \wedge \dots \wedge Dg(x) v_k,\; \omega(g(x)) \big\rangle,$

where $Dg(x) \in \mathbb{R}^{n \times m}$ is the Jacobian. We will also require (13) for the practical implementation.
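The pullback can be evaluated without ever forming the full Jacobian: push each tangent vector through a Jacobian-vector product and pair the result with the form at the image point. A minimal numerical sketch (the function names `pullback_pairing` and `omega` are our own illustrative conventions; `omega(y, W)` stands for the pairing of the form at `y` with the simple vector given by the columns of `W`):

```python
import numpy as np

def pullback_pairing(omega, g, x, V, eps=1e-6):
    """Evaluate <v_1 ^ ... ^ v_k, (g# omega)(x)> numerically:
    approximate each Dg(x) v_i by central finite differences, then
    pair the pushed simple k-vector with omega at g(x)."""
    n, k = V.shape
    JV = np.stack(
        [(g(x + eps * V[:, i]) - g(x - eps * V[:, i])) / (2 * eps)
         for i in range(k)],
        axis=1,
    )
    return omega(g(x), JV)

# Sanity check: for a linear map g(x) = A x and the area form in R^2,
# the pullback of the unit square's 2-vector gives det(A).
A = np.array([[2., 0.],
              [0., 3.]])
area_form = lambda y, W: np.linalg.det(W)   # <w1 ^ w2, dx ^ dy> = det(W)
val = pullback_pairing(area_form, lambda x: A @ x, np.zeros(2), np.eye(2))
print(val)   # approximately 6.0 = det(A)
```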
3.3 Currents
We now have the necessary tools to define currents and the required operations on them, which will be defined through duality with differential forms. Consider the space of compactly supported and smooth $k$-forms in $\mathbb{R}^n$, which we denote by $\mathcal{D}^k$. When furnished with an appropriate topology (cf. §4.1 in Federer (1969) for the details) this is a locally convex topological vector space. $k$-currents are continuous linear functionals on smooth, compactly supported differential $k$-forms, i.e., elements from the topological dual space $\mathcal{D}_k$. Some examples of currents are given in Fig. 3. The current in (a) could be an empirical data distribution, and the current in (b) represents the data distribution with a two-dimensional oriented tangent space at each data point. The current in (c) simply represents the set as an oriented manifold; its action on a differential form is given as in (9).
A natural notion of convergence for currents is given by the weak-* topology:

(14) $T_j \rightharpoonup T \quad :\Longleftrightarrow \quad T_j(\omega) \to T(\omega) \;\text{ for all } \omega \in \mathcal{D}^k.$
The support of a current $T$, denoted $\mathrm{spt}(T)$, is the complement of the largest open set such that testing with forms compactly supported in that open set gives zero. Currents with compact support are denoted by $\mathcal{E}_k$. The boundary operator $\partial$ is defined using the exterior derivative:

(15) $(\partial T)(\omega) := T(\mathrm{d}\omega),$
and Stokes’ theorem (12) ensures that this coincides with the intuitive notion of boundary for currents which are represented by integration over manifolds in the sense of (9).
The pushforward of a current by a smooth map $g$ is defined using the pullback:

(16) $(g_\# T)(\omega) := T(g^\# \omega),$
where the intuition is that the pushforward transforms the current with the map , see the illustration in Fig. 2.
The mass of a current is given by

(17) $\mathbf{M}(T) := \sup\big\{ T(\omega) \;:\; \omega \in \mathcal{D}^k,\; \|\omega(x)\| \le 1 \text{ for all } x \big\}.$
If the current is an oriented manifold, then the mass is the volume of that manifold. One convenient way to construct currents is by combining a smooth $k$-vector field $v$ with a Radon measure $\mu$:

(18) $T(\omega) = \int \big\langle v(x),\, \omega(x) \big\rangle \,\mathrm{d}\mu(x).$
A concrete example is illustrated in Fig. 3 (b), where a current is constructed from given samples and tangent vectors.
For currents with finite mass there is a Radon measure $\mu$ and a measurable $k$-vector field $v$ with $\|v(x)\| = 1$ $\mu$-almost everywhere, so that we can represent the current by integration as follows:

(19) $T(\omega) = \int \big\langle v(x),\, \omega(x) \big\rangle \,\mathrm{d}\mu(x).$
Another perspective is that finite-mass currents are simply vector-valued Radon measures. Currents with finite mass and finite boundary mass are called normal currents (Federer & Fleming, 1960). The space of normal $k$-currents with support in a compact set $K$ is denoted by $\mathbf{N}_{k,K}$.
4 The Flat Metric
As indicated in Fig. 2, we wish to fit a current that is the pushforward of a low-dimensional latent current to the current given by the data. A more meaningful norm on currents than the mass turns out to be the flat norm.
Definition 2 (Flat norm and flat metric).
The flat norm with scale $\lambda > 0$ is defined for any current $T \in \mathcal{D}_k$ as²

(20) $\mathbb{F}_\lambda(T) := \sup\big\{ T(\omega) \;:\; \omega \in \mathcal{D}^k,\; \|\omega(x)\| \le \lambda,\; \|\mathrm{d}\omega(x)\| \le 1 \text{ for all } x \big\}.$

For $\lambda = 1$ we simply write $\mathbb{F}(T)$, and $\mathbb{F}(S - T)$ will be denoted as the flat metric.

²We picked a different convention for $\lambda$ than in (Morgan & Vixie, 2007), where it bounds the other constraint, to emphasize the connection to the Wasserstein-1 distance.
The flat norm also has a primal formulation:

(21) $\mathbb{F}_\lambda(T) = \min_{S}\; \lambda\, \mathbf{M}(T - \partial S) + \mathbf{M}(S),$
(22) where the minimum is taken over $(k+1)$-currents $S$ with compact support,

and the minimum in (21)–(22) can be shown to exist, see §4.1.12 in Federer (1969). The flat norm is finite if $T$ is a normal current, and it can be verified that it is indeed a norm.
To get an intuition, we compare the flat norm to the mass (17) and the Wasserstein-1 distance in Fig. 4 on the example of Dirac measures $\delta_x$, $\delta_y$. The mass is discontinuous in $x$ and has zero gradient, and is therefore unsuitable as a distance between currents. While the Wasserstein-1 metric is continuous in $x$, it does not easily generalize from probability measures to currents. In contrast, the flat metric has a meaningful geometric interpretation also for arbitrary currents. In Fig. 5 we illustrate the flat norm for two $1$-currents $T_1$, $T_2$: if $T_1$ and $T_2$ are of length one and are $\varepsilon$ apart, then $\mathbb{F}(T_1 - T_2) \le 3\varepsilon$, which converges to zero for $\varepsilon \to 0$.
Note that for $0$-currents, the flat norm (20) is strongly related to the Wasserstein-1 distance except for the additional constraint $\|\omega(x)\| \le \lambda$ on the dual variable, which in the example of Fig. 4 controls the truncation cutoff. Notice also the similarity of (21) to the Beckmann formulation of the Wasserstein-1 distance (Beckmann, 1952; Santambrogio, 2015), with the difference being the implementation of the “divergence constraint” with a soft penalty. Considering the case $\lambda \to \infty$ as in the Wasserstein distance is problematic, since not every current is the boundary of another current, see the example above in Fig. 5.
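To make the truncation behaviour in Fig. 4 concrete, here is a toy computation of the flat metric between two Dirac masses on the real line. The closed form `min(2*lam, |x - y|)` is our reading of the primal formulation under the scale convention used in this paper: either keep the two point masses (paying the scale times their total mass) or fill in the connecting oriented segment (paying its length); the helper name is hypothetical.

```python
def flat_metric_diracs(x, y, lam):
    """Flat metric F_lam(delta_x - delta_y) on the real line, sketched via
    the primal formulation: take S = 0 and pay lam * 2 for the two Dirac
    masses, or take S = oriented segment between x and y (whose boundary
    is delta_x - delta_y, up to orientation) and pay its length |x - y|."""
    return min(2 * lam, abs(x - y))

print(flat_metric_diracs(0.0, 1.0, lam=10.0))   # 1.0  (agrees with Wasserstein-1)
print(flat_metric_diracs(0.0, 5.0, lam=1.0))    # 2.0  (truncated at 2 * lam)
```

For nearby points the metric behaves exactly like the Wasserstein-1 distance, while for distant points it saturates, which is the truncation effect controlled by the scale parameter.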
The following proposition studies the effect of the scale parameter on the flat norm.
Proposition 1.
For any $0 < \lambda_1 \le \lambda_2$, the following relation holds:

(23) $\frac{\lambda_1}{\lambda_2}\, \mathbb{F}_{\lambda_2}(T) \;\le\; \mathbb{F}_{\lambda_1}(T) \;\le\; \mathbb{F}_{\lambda_2}(T),$

meaning that $\mathbb{F}_{\lambda_1}$ and $\mathbb{F}_{\lambda_2}$ are equivalent norms.
Proof. The bounds follow directly by rescaling the dual variable $\omega$ in the definition (20). ∎
The importance of the flat norm is due to the fact that it metrizes the weak-* convergence (14) on compactly supported normal currents with uniformly bounded mass and boundary mass.
Proposition 2.
Let $K \subset \mathbb{R}^n$ be a compact set and $C < \infty$ some fixed constant. For a sequence of currents $T_j \in \mathbf{N}_{k,K}$ with $\mathbf{M}(T_j) + \mathbf{M}(\partial T_j) \le C$ we have that:

(26) $T_j \rightharpoonup T \quad \Longleftrightarrow \quad \mathbb{F}(T_j - T) \to 0.$
5 Flat Metric Minimization
Motivated by the theoretical properties of the flat metric shown in the previous section, we consider the following optimization problem:

(27) $\min_{\theta \in \Theta}\; \mathbb{F}_\lambda\big( g_{\theta\#} S - T \big),$

where $S$ is a latent current and $T$ the data current, both normal currents with compact support. We will assume that the pushforward map $g$ is parametrized with parameters $\theta$ in a compact set $\Theta$ and write $g_\theta$ to abbreviate $g(\theta, \cdot)$ for some $\theta \in \Theta$. We need the following assumption to be able to prove the existence of minimizers for the problem (27).
Assumption 1.
The map $(\theta, z) \mapsto g(\theta, z)$ is smooth in $z$ with uniformly bounded derivative. Furthermore, we assume that $\theta \mapsto g(\theta, \cdot)$ is locally Lipschitz continuous and that the parameter set $\Theta$ is compact.
Under this assumption, we will show that the objective in (27) is Lipschitz continuous. This will in turn guarantee existence of minimizers, as the domain is assumed to be compact.
Proposition 3.
Let $S$, $T$ be normal currents with compact support. If the pushforward map fulfills Assumption 1, then the function $\theta \mapsto \mathbb{F}_\lambda(g_{\theta\#} S - T)$ is Lipschitz continuous and hence differentiable almost everywhere.
Proof.
In Appendix A. ∎
5.1 Application to Generative Modeling
We now turn towards our considered application illustrated in Fig. 2. There, we denote by $k$ the number of tangent vectors we specify at each sample point. The latent current $S$ is constructed by combining a probability distribution $\mu$ on the latent space, which could for example be the uniform distribution, with the constant unit $k$-vector field $e_1 \wedge \dots \wedge e_k$ as follows:

(28) $S(\omega) = \int \big\langle e_1 \wedge \dots \wedge e_k,\; \omega(z) \big\rangle \,\mathrm{d}\mu(z).$
For an illustration, see the right side of Fig. 2 and Fig. 3. The data current $T$ is constructed from the samples $x_1, \dots, x_N$ and tangent vector fields $\tau_1, \dots, \tau_k$:

(29) $T(\omega) = \frac{1}{N} \sum_{i=1}^{N} \big\langle \tau_1(x_i) \wedge \dots \wedge \tau_k(x_i),\; \omega(x_i) \big\rangle.$

The tangent vector fields are given by individual tangent vectors to the data manifold at the sample points. For an illustration, see the left side of Fig. 2 or Fig. 3. After solving (27), the map $g_\theta$ will be our generative model, where changes in the latent space along the unit directions $e_1, \dots, e_k$ are expected to behave equivariantly with the specified tangent directions near the data manifold.





5.2 FlatGAN Formulation
To get a primal-dual formulation (or two-player zero-sum game) in the spirit of GANs, we insert the definition of the flat norm (20) into the primal problem (27):

(30) $\min_{\theta}\; \max_{\xi}\;\; S\big( g_\theta^{\#} \omega_\xi \big) - T(\omega_\xi), \quad \text{s.t. } \|\omega_\xi(x)\| \le \lambda,\; \|\mathrm{d}\omega_\xi(x)\| \le 1 \text{ for all } x,$

where $\xi$ are for example the parameters of a neural network. In the above equation, we also used the definition of the pushforward (16). Notice that for $k = 0$ the exterior derivative in (30) specializes to the gradient. This yields a Lipschitz constraint, and as for sufficiently large $\lambda$ the other constraint becomes irrelevant, the problem (30) is closely related to the Wasserstein GAN (Bottou et al., 2017). The novelty in this work is the generalization to $k \ge 1$. Combining (28) and (29) into (30), we arrive at the objective
(31) $\min_{\theta}\; \max_{\xi}\;\; \int \big\langle Dg_\theta(z) e_1 \wedge \dots \wedge Dg_\theta(z) e_k,\; \omega_\xi(g_\theta(z)) \big\rangle \,\mathrm{d}\mu(z) \;-\; \frac{1}{N} \sum_{i=1}^{N} \big\langle \tau_1(x_i) \wedge \dots \wedge \tau_k(x_i),\; \omega_\xi(x_i) \big\rangle,$ subject to the constraints of (30).
Interestingly, due to the pullback, the discriminator inspects not only the output of the generator, but also parts of its Jacobian matrix. As a remark, relations between the generator Jacobian and GAN performance have recently been studied by Odena et al. (2018).
The constraints in (30) are implemented using penalty terms. First, notice that due to the definition of the comass norm (6), the first constraint is equivalent to imposing $\langle v_1 \wedge \dots \wedge v_k,\, \omega_\xi(x) \rangle \le \lambda$ for all simple $k$-vectors with unit norm. We implement this with a penalty term with parameter $\gamma > 0$ as follows:

(32) $\gamma\, \mathbb{E}\Big( \max\big(0,\; \langle v_1 \wedge \dots \wedge v_k,\, \omega_\xi(x) \rangle - \lambda \big) \Big)^2,$

where the subspace spanned by the $v_i$ is sampled from the Haar measure on the Grassmannian manifold of $k$-dimensional subspaces in $\mathbb{R}^n$, see Chapter 3.2 in Krantz & Parks (2008). Similarly, the constraint on the exterior derivative is implemented by another penalty term as follows:

(33) $\gamma\, \mathbb{E}\Big( \max\big(0,\; \langle v_1 \wedge \dots \wedge v_{k+1},\, \mathrm{d}\omega_\xi(x) \rangle - 1 \big) \Big)^2.$
5.3 Implementation with Deep Neural Networks
For high-dimensional practical problems it is completely infeasible to directly work with a $k$-covector field $\omega_\xi : \mathbb{R}^n \to \Lambda^k \mathbb{R}^n$ due to the curse of dimensionality. For example, already for the MNIST dataset augmented with two tangent vectors ($n = 784$, $k = 2$), we have that $\dim \Lambda^2 \mathbb{R}^{784} = \binom{784}{2} = 306{,}936$.
To overcome this issue, we unfortunately have to resort to a few heuristic approximations. To that end, we first notice that in the formulations the dual variable $\omega_\xi$ only appears as an inner product with simple $k$-vectors, so we can implement it by implicitly describing its action, i.e., interpret it as a map of the point $x$ and the vectors $v_1, \dots, v_k$:

(34) $\big\langle v_1 \wedge \dots \wedge v_k,\; \omega_\xi(x) \big\rangle \;\approx\; f_\xi(x) + \big\langle w_{\xi,1}(x) \wedge \dots \wedge w_{\xi,k}(x),\; v_1 \wedge \dots \wedge v_k \big\rangle.$

Theoretically, the “affine term” $f_\xi(x)$ is not fully justified, as the resulting map no longer describes an inner product on $\Lambda_k \mathbb{R}^n$, but we found it to improve the quality of the generative model. An attempt to justify this in the context of GANs is that the function $f_\xi$ is the usual “discriminator” while the vector fields $w_{\xi,i}$ are combined to discriminate oriented tangent planes.
In practice, we parametrize the scalar function and the vector fields in (34) using deep neural networks. For efficiency reasons, the networks share their parameters up until the last few layers. The inner product in (34) between the simple $k$-vectors is implemented by a determinant, see (5). The reason we do this is to satisfy the properties of the Grassmann algebra (2)–(3). This is important, since otherwise the “discriminator” could distinguish between different representations of the same oriented tangent plane.
For the implementation of the penalty term (33), we use the definition of the exterior derivative (11) together with the “approximate form” (34). To be compatible with the affine term, we use a separate penalty on the scalar part, which we also found to give better results:

(35) $\gamma\, \mathbb{E}\Big( \max\big(0,\; \|\nabla f_\xi(x)\| - 1 \big) \Big)^2 \;+\; \gamma\, \mathbb{E}\Big( \max\big(0,\; \big| \textstyle\sum_{i=1}^{k+1} (-1)^{i+1}\, \partial_{v_i} \det\big( W_\xi(x)^\top V_{-i} \big) \big| - 1 \big) \Big)^2.$

In the above equation, $V_{-i}$ is the matrix with columns given by the vectors $v_j$ but with $v_i$ omitted, and $W_\xi(x)$ is the matrix with columns given by the vector fields from (34). Another motivation for this implementation is that in the case $k = 0$ the second term in (35) disappears and one recovers the well-known “gradient penalty” regularizer proposed by Gulrajani et al. (2017).
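The directional derivatives above are right-multiplied Jacobian-vector products, which a reverse-mode framework like PyTorch does not provide directly. A minimal sketch of the standard double-backward trick for obtaining them (the helper name `jvp` is ours; the identity used is $J v = \partial_u \langle J^\top u, v \rangle$):

```python
import torch

def jvp(f, x, v):
    """Right Jacobian-vector product J_f(x) v via two backward passes:
    first compute the vector-Jacobian product J^T u for a dummy cotangent u
    with the graph retained, then differentiate <J^T u, v> with respect to u."""
    x = x.clone().requires_grad_(True)
    y = f(x)
    u = torch.ones_like(y, requires_grad=True)           # dummy cotangent
    vjp, = torch.autograd.grad(y, x, grad_outputs=u, create_graph=True)
    Jv, = torch.autograd.grad(vjp, u, grad_outputs=v)
    return Jv

# Sanity check: f(x) = x^2 elementwise has Jacobian diag(2x).
x = torch.tensor([1.0, 2.0, 3.0])
v = torch.ones(3)
print(jvp(lambda t: t ** 2, x, v))    # tensor([2., 4., 6.])
```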
For the stochastic approximation of the penalty terms (32)–(33) we sample from the Haar measure on the Grassmannian (i.e., taking random $k$-dimensional and $(k+1)$-dimensional subspaces in $\mathbb{R}^n$) by computing singular value decompositions of random Gaussian matrices. Furthermore, we found it beneficial in practice to enforce the penalty terms only at the data points, as for example advocated in the recent work of Mescheder et al. (2018). The right-multiplied Jacobian-vector products (also referred to as “Rop” in some frameworks) in (35), as well as in the loss function (31), are implemented using two additional backpropagations.
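The Haar sampling step can be sketched as follows (a minimal NumPy version; the function name is ours). By rotational invariance of the Gaussian distribution, the column span of an $n \times k$ Gaussian matrix is already Haar-distributed on the Grassmannian; the thin SVD merely returns an orthonormal basis of it:

```python
import numpy as np

def random_subspace(n, k, rng):
    """Orthonormal basis of a Haar-distributed k-dimensional subspace of R^n:
    the span of an n x k standard Gaussian matrix is uniform on the
    Grassmannian by rotation invariance; the thin SVD orthonormalizes it."""
    A = rng.standard_normal((n, k))
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U                                    # shape (n, k), U^T U = I_k

rng = np.random.default_rng(0)
B = random_subspace(5, 2, rng)
print(np.allclose(B.T @ B, np.eye(2)))          # True
```

The columns of `B` can then serve directly as the vectors at which the penalty terms are evaluated.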
6 Experiments
The specific hyperparameters, architectures and tangent vector setups used in practice³ are detailed in Appendix B.

³See https://github.com/moellenh/flatgan for a PyTorch implementation to reproduce Fig. 6 and Fig. 7.

6.1 Illustrative 2D Example
As a first proof of concept, we illustrate the effect of moving from $k = 0$ to $k = 1$ on a very simple dataset consisting of five points on a circle. As shown in Fig. 6, for $k = 0$ (corresponding to a WGAN-GP formulation) varying the first latent variable has no clear meaning. In contrast, with the proposed FlatGAN formulation, we can specify vectors tangent to the circle from which the data is sampled. This yields an interpretable latent representation that corresponds to an angular movement along the circle. As the number of epochs increases, both formulations tend to concentrate most of the probability mass on the five data points. However, since the generator is continuous by construction, an interpretable path remains.
6.2 Equivariant Representation Learning
In Fig. 7 and Fig. 8 we show examples for $k = 2$ on MNIST and for $k = 3$ on the smallNORB dataset of LeCun et al. (2004), respectively. For MNIST, we compute the tangent vectors manually by rotation and dilation of the digits, similarly as done by Simard et al. (1992, 1998). For the smallNORB example, the tangent vectors are given as differences between the corresponding images. As observed in the figures, the proposed formulation leads to interpretable latent codes which behave equivariantly with the generated images. We remark that the goal was not to achieve state-of-the-art image quality, but rather to demonstrate that specifying tangent vectors yields disentangled representations. As remarked by Jaderberg et al. (2015), representing a 3D scene with a sequence of 2D convolutions is challenging, and a specialized architecture based on a voxel representation would be more appropriate for the smallNORB example.
6.3 Discovering the Arrow of Time
In our last experiment, we set $k = 1$ and specify the tangent vector as the difference of two neighbouring frames in video data. We train on the tiny-video beach dataset (Vondrick et al., 2016), which consists of more than 36 million frames. After training for about half an epoch, we can already observe a learned latent representation of time, see Fig. 1 and Fig. 9. We generate individual frames by varying the time-like latent coordinate.
Even though the model is trained on individual frames in random order, a somewhat coherent representation of time is discovered which captures phenomena such as ocean waves or people walking on the beach.
7 Discussion and Conclusion
In this work, we demonstrated that currents can be used to introduce a notion of orientation into probabilistic models. Furthermore, in experiments we have shown that specifying partial tangent information of the data manifold leads to interpretable and equivariant latent representations, such as the camera position and lighting in a 3D scene or the arrow of time in time series data.
The difference to purely unsupervised approaches such as InfoGAN or VAE is that we can encourage potentially very complex latent representations to be learned. Nevertheless, an additional mutual information term as in (Chen et al., 2016) can be directly added to the formulation, so that some representations are encouraged through tangent vectors while the remaining ones are hoped to be discovered in an unsupervised fashion.
Generally speaking, we believe that geometric measure theory is a rather underexploited field with many possible application areas in probabilistic machine learning. We see this work as a step towards leveraging this potential.
Acknowledgements
We thank Kevin R. Vixie for his detailed feedback and comments on the manuscript. The work was partially supported by the German Research Foundation (DFG); project 394737018 “Functional Lifting 2.0 – Efficient Convexifications for Imaging and Vision”.
References
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017.
 Beckmann (1952) Beckmann, M. A continuous model of transportation. Econometrica: Journal of the Econometric Society, pp. 643–660, 1952.
 Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
 Bottou et al. (2017) Bottou, L., Arjovsky, M., Lopez-Paz, D., and Oquab, M. Geometrical insights for implicit generative modeling. arXiv:1712.07822, 2017.
 Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, 2016.
 Csiszár et al. (2004) Csiszár, I., Shields, P. C., et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
 de Rham (1955) de Rham, G. Variétés différentiables, formes, courants, formes harmoniques, volume 1222. Hermann, 1955.
 Denton et al. (2017) Denton, E. L. et al. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, 2017.
 Federer (1969) Federer, H. Geometric Measure Theory. Springer, 1969.
 Federer & Fleming (1960) Federer, H. and Fleming, W. H. Normal and integral currents. Annals of Mathematics, pp. 458–520, 1960.
 Fefferman et al. (2016) Fefferman, C., Mitter, S., and Narayanan, H. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
 Feydy et al. (2018) Feydy, J., Séjourné, T., Vialard, F.X., Amari, S.I., Trouvé, A., and Peyré, G. Interpolating between Optimal Transport and MMD using Sinkhorn Divergences. arXiv:1810.08278, 2018.
 Fraser et al. (2003) Fraser, A. M., Hengartner, N. W., Vixie, K. R., and Wohlberg, B. E. Incorporating invariants in Mahalanobis distance based classifiers: Application to face recognition. In International Joint Conference on Neural Networks, 2003.
 Genevay et al. (2017) Genevay, A., Peyré, G., and Cuturi, M. GAN and VAE from an optimal transport point of view. arXiv:1706.01807, 2017.
 Glaunès et al. (2008) Glaunès, J., Qiu, A., Miller, M. I., and Younes, L. Large deformation diffeomorphic metric curve mapping. International Journal of Computer Vision (IJCV), 80(3):317, 2008.
 Goodfellow et al. (2014) Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
 Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. arXiv:1704.00028, 2017.
 Higgins et al. (2016) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
 Hinton et al. (2011) Hinton, G. E., Krizhevsky, A., and Wang, S. D. Transforming autoencoders. In International Conference on Artificial Neural Networks, 2011.
 Hubbard & Hubbard (2015) Hubbard, J. H. and Hubbard, B. B. Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach. Matrix Editions, 2015.
 Jaderberg et al. (2015) Jaderberg, M., Simonyan, K., Zisserman, A., and Kavukcuoglu, K. Spatial transformer networks. In Advances in Neural Information Processing Systems, 2015.
 Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv:1802.05983, 2018.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Autoencoding variational Bayes. arXiv:1312.6114, 2014.
 Krantz & Parks (2008) Krantz, S. G. and Parks, H. R. Geometric Integration Theory. Birkhäuser Boston, 2008.
 LeCun et al. (2004) LeCun, Y., Huang, F. J., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Conference on Computer Vision and Pattern Recognition, 2004.
 Li et al. (2017) Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2017.
 Mathieu et al. (2016) Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, 2016.
 Mescheder et al. (2018) Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually Converge? In International Conference on Machine Learning, 2018.
 Mirza & Osindero (2014) Mirza, M. and Osindero, S. Conditional generative adversarial nets. arXiv:1411.1784, 2014.
 Möllenhoff & Cremers (2019) Möllenhoff, T. and Cremers, D. Lifting vectorial variational problems: A natural formulation based on geometric measure theory and discrete exterior calculus. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 Morgan (2016) Morgan, F. Geometric Measure Theory: A Beginner’s Guide. Academic Press, 5th edition, 2016.
 Morgan & Vixie (2007) Morgan, S. P. and Vixie, K. R. LTV computes the flat norm for boundaries. In Abstract and Applied Analysis, 2007.
 Narayanaswamy et al. (2017) Narayanaswamy, S., Paige, T. B., Van de Meent, J.W., Desmaison, A., Goodman, N., Kohli, P., Wood, F., and Torr, P. Learning disentangled representations with semisupervised deep generative models. In Advances in Neural Information Processing Systems, 2017.
 Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning, 2017.
 Odena et al. (2018) Odena, A., Buckman, J., Olsson, C., Brown, T. B., Olah, C., Raffel, C., and Goodfellow, I. Is generator conditioning causally related to GAN performance? In International Conference on Machine Learning, 2018.
 Peyré & Cuturi (2018) Peyré, G. and Cuturi, M. Computational optimal transport. arXiv:1803.00567, 2018.
 Pickup et al. (2014) Pickup, L. C., Pan, Z., Wei, D., Shih, Y., Zhang, C., Zisserman, A., Schölkopf, B., and Freeman, W. T. Seeing the arrow of time. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082, 2014.
 Rifai et al. (2011) Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. In Advances in Neural Information Processing Systems, 2011.
 Santambrogio (2015) Santambrogio, F. Optimal Transport for Applied Mathematicians. Birkhäuser, New York, 2015.
 Schmidhuber (1992) Schmidhuber, J. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
 Schwartz (1951, 1957) Schwartz, L. Théorie des distributions I, II, volume 1245, 1122. Hermann, 1951, 1957.
 Simard et al. (1992) Simard, P., Victorri, B., LeCun, Y., and Denker, J. Tangent prop – a formalism for specifying selected invariances in an adaptive network. In Advances in Neural Information Processing Systems, 1992.
 Simard et al. (1998) Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. Transformation invariance in pattern recognition – tangent distance and tangent propagation. In Neural networks: tricks of the trade, pp. 239–274, 1998.
 Sriperumbudur et al. (2012) Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G. R., et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.  Vaillant & Glaunès (2005) Vaillant, M. and Glaunès, J. Surface matching via currents. In Biennial International Conference on Information Processing in Medical Imaging, 2005.
 Vixie et al. (2010) Vixie, K. R., Clawson, K., Asaki, T. J., Sandine, G., Morgan, S. P., and Price, B. Multiscale flat norm signatures for shapes and images. Applied Mathematical Sciences, 4(14):667–680, 2010.
 Vondrick et al. (2016) Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. In Advances In Neural Information Processing Systems, 2016.
 Wei et al. (2018) Wei, D., Lim, J. J., Zisserman, A., and Freeman, W. T. Learning and using the arrow of time. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
 Whitney (1957) Whitney, H. Geometric Integration Theory. Princeton University Press, 1957.
A Proof of Proposition 3
Since and are normal currents we know for all .
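For orientation (the specific quantities in this sentence are elided in the extracted text), recall the standard definition from geometric measure theory (cf. Federer, 1969): a current $T$ is normal precisely when both its mass and the mass of its boundary are finite,

```latex
% Standard definition of a normal current (Federer, 1969):
\mathbf{N}(T) \;:=\; \mathbf{M}(T) + \mathbf{M}(\partial T) \;<\; \infty .
```

This finiteness is what licenses the mass bounds used throughout the proof below.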
We now directly show Lipschitz continuity. First notice that
(36)  
(37) 
yields the following bound:
(38) 
Due to Prop. 1 we have that
(39) 
Now define the compact set as
(40)  
and as in §4.1.12 in Federer (1969) for compact the “stronger” flat norm
(41)  
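For reference, since equation (41) is not reproduced in the extracted text, we recall the classical characterization of the flat norm for currents supported in a compact set $K$ (a standard fact; see Federer, 1969, §4.1.12):

```latex
% Dual (supremum) form: smooth forms with comass bounds on phi and d phi.
\mathbf{F}_K(T) \;=\; \sup\bigl\{\, T(\varphi) \;:\; \|\varphi\|_\infty \le 1,\ \|d\varphi\|_\infty \le 1 \,\bigr\}
% Primal (decomposition) form:
\;=\; \min\bigl\{\, \mathbf{M}(T - \partial S) + \mathbf{M}(S) \;:\; \operatorname{spt} S \subset K \,\bigr\}.
```

The comparison in the next step is between the constraint sets of two such suprema.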
Since the constraint in the supremum in (41) is less restrictive than in the definition of the flat norm (20), we have
(42) 
Then, the inequality after §4.1.13 in Federer (1969) bounds the right side of (42) for by
(43)  
where due to Assumption 1 and we write , where is defined in the sense of (19). For , a similar bound can be derived without the term .
For , by setting we can further bound the term in (43) by
(44)  
where . For , the bound is derived analogously.
Now since is locally Lipschitz and is compact, is Lipschitz, and we denote its Lipschitz constant by , leading to the bound
(45) 
Since is a normal current, . Thus by combining (38), (39), (42), (43), (44) and (45) there is a finite such that
(46) 
Therefore, the cost in (27) is Lipschitz in and, by Rademacher’s theorem (§3.1.6 in Federer, 1969), also differentiable almost everywhere.
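Rademacher’s theorem, as invoked here, states (standard; see Federer, 1969, §3.1.6):

```latex
f:\mathbb{R}^n \to \mathbb{R}^m \ \text{locally Lipschitz}
\;\Longrightarrow\;
f \ \text{is differentiable at}\ \mathcal{L}^n\text{-a.e.}\ x \in \mathbb{R}^n .
```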
B Parameters and Network Architectures
For all experiments we use the Adam optimizer (Kingma & Ba, 2014), with step size and momentum parameters , . The batch size is set to in all experiments except the first one (which runs full batch with batch size ). We always set .
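Since the exact step size and momentum values are elided in the extracted text, the following minimal sketch only recalls the Adam update itself (Kingma & Ba, 2014); the hyperparameter defaults `lr`, `beta1`, `beta2` below are illustrative placeholders, not the paper’s values:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.5, beta2=0.9, eps=1e-8):
    """One Adam update for a scalar parameter. Hyperparameter defaults are
    placeholders -- the paper's exact values are elided in the extracted text."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimizing f(x) = x**2 from x = 1.0 with gradient 2x:
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.01)
```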
B.1 Illustrative 2D Example
We pick the same parameters for . We set the penalty to and use discriminator updates per generator update as in Gulrajani et al. (2017). The generator is a – – – – – fully connected network with leaky ReLU activations. The first layer ensures that the latent coordinate has the topology of a circle, i.e., it is implemented as . The discriminators and are – – – – and – – – nets, respectively, with leaky ReLUs. The distribution on the latent is a uniform and
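The circular latent parametrization described above can be sketched as follows; the exact formula is elided in the extracted text, so the standard $(\cos, \sin)$ embedding used here is an assumption:

```python
import math

def circle_embed(z):
    """Hypothetical reconstruction of the first generator layer: map a scalar
    latent z to the unit circle, so the latent coordinate is 2*pi-periodic."""
    return (math.cos(z), math.sin(z))

# z and z + 2*pi land on the same point, giving the circle topology.
p = circle_embed(0.25)
q = circle_embed(0.25 + 2 * math.pi)
```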