# Explorations in Homeomorphic Variational Auto-Encoding

The manifold hypothesis states that many kinds of high-dimensional data are concentrated near a low-dimensional manifold. If the topology of this data manifold is non-trivial, a continuous encoder network cannot embed it in a one-to-one manner without creating holes of low density in the latent space. This is at odds with the Gaussian prior assumption typically made in Variational Auto-Encoders (VAEs), because the density of a Gaussian concentrates near a blob-like manifold. In this paper we investigate the use of manifold-valued latent variables. Specifically, we focus on the important case of continuously differentiable symmetry groups (Lie groups), such as the group of 3D rotations SO(3). We show how a VAE with SO(3)-valued latent variables can be constructed, by extending the reparameterization trick to compact connected Lie groups. Our experiments show that choosing manifold-valued latent variables that match the topology of the latent data manifold is crucial to preserve the topological structure and learn a well-behaved latent space.


## 1 Introduction

Many complex probability distributions can be represented more compactly by introducing latent variables. Intuitively, the idea is that there is some simple underlying latent structure, which is mapped to the observation space by a potentially complex nonlinear function. It will come as no surprise then, that most research effort has aimed at using maximally simple priors for the latent variables (e.g. Gaussians), combined with flexible likelihood functions (e.g. based on neural networks).

However, it is not hard to see (Fig. 1) that if the data is concentrated near a low-dimensional manifold with non-trivial topology, there is no continuous and invertible mapping to a blob-like manifold (the region where prior mass is concentrated). We believe that for purposes of representation learning, the embedding map (encoder) should be homeomorphic (i.e. continuous and invertible, with continuous inverse), which means that although dimensionality reduction and geometrical simplification (flattening) may be possible, the topological structure should be preserved.

One could encode such a manifold in a higher-dimensional flat space with a regular variational auto-encoder (VAE; Kingma & Welling, 2013; Rezende et al., 2014), rather than learning a homeomorphism. This has two disadvantages: the prior on the flat space will place density outside of the embedding, and traversals along the extra dimensions normal to the manifold will either leave the decoding invariant or move out of the data manifold. This is because at each point there will be more degrees of freedom than the dimensionality of the manifold.

In this paper we investigate this idea for the special case of Lie groups, which are symmetry groups that are simultaneously differentiable manifolds. Lie groups include rotations, translations, scaling, and other geometric transformations, which play an important role in many application domains such as robotics and computer vision. More specifically, we show how to construct a VAE with latent variables that live on a Lie group, which is done by generalizing the reparameterization trick (our implementation is available at https://github.com/pimdh/lie-vae).

We will describe an approach for reparameterizing densities on SO(3), the group of 3D rotations, which can be extended to general compact and connected Lie group VAEs in a straightforward manner. The primary technical difficulty in the construction of this theory is to show that the pushforward measure induced by our reparameterization has a density that is absolutely continuous w.r.t. the Haar measure. Moreover, we show how to construct the encoder such that it can learn a homeomorphic map from the data manifold to the mean parameter of the posterior. Finally, we propose a decoder that uses the group action to further encourage the latent space to respect the group structure.

We perform experiments on two types of synthetic data: SO(3) embedded into a high-dimensional space through its group representation, and images of 3D rotations of a single colored cube. We find that a theoretically sound architecture is capable of continuously mapping the data manifold to the latent space. On the other hand, models that do not respect topological structure, and in particular those with a standard Gaussian latent space, show discontinuities when trajectories in the latent space are visualized. To better study this phenomenon, we introduce a way to measure the continuity of the embedding based on the concept of Lipschitz continuity. We empirically demonstrate that only a manifold-valued latent variable with the required topological structure is capable of fully solving the difficult task of the more complicated experiment.

Our main contributions in this work are threefold:

1. A reparameterization trick for distributions on the group of rotations in three dimensions.

2. An encoder for the mean parameter that learns a homeomorphism between the SO(3) manifold embedded in the data and SO(3) itself.

3. A decoder that uses the group action to respect the group structure.

## 2 Preliminary Concepts

In this section we will first cover a number of preliminary concepts that will be used in the rest of the paper.

### 2.1 Variational Auto-Encoders

The VAE is a latent variable model, in which x denotes a set of observed variables, z a set of stochastic latent variables, and p(x, z) a parameterized model of the joint distribution, called the generative model. Given a dataset X = {x_1, …, x_N}, we typically wish to maximize the average marginal log-likelihood w.r.t. the parameters. However, when the model is parameterized by neural networks, the marginalization in this expression is generally intractable. One solution to overcome this issue is to apply variational inference in order to maximize the Evidence Lower Bound (ELBO) for each observation:

$$\log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z})\, d\mathbf{z} \geq \mathbb{E}_{q(\mathbf{z})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}\left(q(\mathbf{z})\,\|\,p(\mathbf{z})\right), \tag{1}$$

where the approximate posterior q(z) belongs to the variational family. To make inference scalable, an inference network is introduced that outputs a probability distribution q(z|x) for each data point x, leading to the final objective

$$\mathcal{L}(\mathbf{x};\theta,\phi) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})}\left[\log p(\mathbf{x}|\mathbf{z})\right] - \mathrm{KL}\left(q(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right), \tag{2}$$

with θ, ϕ representing the parameters of p and q. The ELBO can be efficiently approximated for continuous latent variables z by Monte Carlo estimates using the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014).
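To make the objective of Eq. (2) and the reparameterization trick concrete, here is a minimal numpy sketch for a diagonal-Gaussian posterior and standard-normal prior; the function and parameter names (`elbo_estimate`, `decoder_loglik`) are our own illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, log_sigma, decoder_loglik, n_samples=16):
    """Monte Carlo estimate of Eq. (2) for a diagonal-Gaussian posterior
    q(z|x) = N(mu, diag(sigma^2)) and a standard-normal prior p(z)."""
    sigma = np.exp(log_sigma)
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so samples are differentiable functions of mu and sigma.
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps
    recon = np.mean([decoder_loglik(x, zi) for zi in z])
    # KL(N(mu, sigma^2) || N(0, I)) in closed form, summed over dimensions.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    return recon - kl
```

With a zero-mean, unit-variance posterior the KL term vanishes, which gives a quick sanity check of the implementation.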

### 2.2 Lie Groups and Lie Algebras

#### Lie Group

A group is a set equipped with a product that follows the four group axioms: the product is closed and associative, there exists an identity element, and every group element has an inverse. This is closely linked to symmetry transformations that leave some property invariant. For example, composing two symmetry transformations should still maintain the invariance. A Lie group has additional structure, as its set is also a smooth manifold. This means that we can, at least in local regions, describe group elements continuously with parameters. The number of parameters equals the dimension of the group. We can see (connected) Lie groups as continuous symmetries where we can continuously traverse between group elements (we refer the interested reader to Hall (2003)).

#### Lie Algebra

The Lie algebra g of an n-dimensional Lie group G is its tangent space at the identity, which is a vector space of n dimensions. We can see the algebra elements as infinitesimal generators, from which all other elements in the group can be created. For matrix Lie groups we can represent vectors v in the tangent space as matrices v×.

#### Exponential Map

The structure of the algebra creates a map from an element of the algebra to a vector field on the group manifold. This gives rise to the exponential map exp : g → G, which maps an algebra element to the group element at unit length from the identity along the flow of the vector field. The zero vector is thus mapped to the identity. For compact connected Lie groups, such as SO(3), the exponential map is surjective.

### 2.3 The Group SO(3)

The special orthogonal Lie group of three-dimensional rotations is defined as:

$$\mathrm{SO}(3) := \{R \in GL(\mathbb{R}^3) : R^\top R = I \wedge \det(R) = 1\}, \tag{3}$$

where GL(ℝ³) is the general linear group, the set of square invertible matrices under the operation of matrix multiplication. Note that SO(3) is not homeomorphic to ℝ³, since in ℝ³ every continuous path can be continuously contracted to a point, while this is not true for SO(3). Consider for example a full rotation around a fixed axis.

The elements of the Lie algebra so(3) of the group SO(3) are represented by the three-dimensional vector space of the skew-symmetric 3 × 3 matrices. We choose a basis for the algebra:

$$L_1 = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}, \quad L_2 = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix}, \quad L_3 = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \tag{4}$$

This provides a vector space isomorphism between ℝ³ and so(3), written as v ↦ v× = Σᵢ vᵢ Lᵢ.

Assuming the decomposition v = θu, s.t. θ ∈ ℝ≥0 and u a unit vector, the exponential map is given by the Rodrigues rotation formula (Rodrigues, 1840):

$$\exp(\mathbf{v}_\times) = I + \sin(\theta)\,\mathbf{u}_\times + (1 - \cos(\theta))\,\mathbf{u}_\times^2. \tag{5}$$

Since SO(3) is a compact and connected Lie group this map is surjective; however, it is not injective.
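The hat map of Eq. (4) and the Rodrigues formula of Eq. (5) can be sketched in a few lines of numpy; this is our own minimal implementation, not the paper's code.

```python
import numpy as np

def hat(v):
    """Map v in R^3 to the skew-symmetric matrix v_x = sum_i v_i L_i,
    using the basis of Eq. (4)."""
    return np.array([[0., -v[2], v[1]],
                     [v[2], 0., -v[0]],
                     [-v[1], v[0], 0.]])

def exp_so3(v):
    """Rodrigues formula, Eq. (5): exp(v_x) = I + sin(t) u_x + (1 - cos(t)) u_x^2,
    where v = t * u with t = ||v||."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return np.eye(3)  # exp maps the zero vector to the identity
    u = hat(v / t)
    return np.eye(3) + np.sin(t) * u + (1.0 - np.cos(t)) * (u @ u)
```

Note that `exp_so3` of any vector of length 2π returns the identity again, illustrating the non-injectivity mentioned above.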

## 3 Reparameterizing SO(3)

In this section we will explain our reparameterization trick by analogy to the classic version described in (Kingma & Welling, 2013; Rezende et al., 2014). An overview of the different steps and their relation to the classical case is given in Figure 2.

We sample from a scale-reparameterizable distribution r(v|σ) on ℝ³ that is concentrated around the origin. Due to the isomorphism between ℝ³ and so(3), this can be identified with a sample v× from the Lie algebra. Next we apply the exponential map to obtain a sample R = exp(v×) of the group, as visualized in Figure 2 (a) to (b). Since the distribution r is concentrated around the origin, the distribution of R will be concentrated around the group identity. In order to change the location of the distribution, we left-multiply R by another element R_μ, see Figure 2 (b) to (c).

To see the connection with the classical case, identify ℝⁿ under addition as a Lie group, whose Lie algebra is isomorphic to ℝⁿ itself. As the group and the algebra are in this case isomorphic, the exponential map can be taken to be the identity operation, so that a sample ε from the algebra is itself a group element. The multiplication with a group element to change the location then corresponds to a translation by the mean: z = μ + ε.
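The three steps (sample in the algebra, exponentiate, left-multiply) can be sketched as follows; we assume an isotropic Gaussian for the concentrated distribution r(v|σ), which is one possible choice rather than the paper's prescribed one.

```python
import numpy as np

rng = np.random.default_rng(1)

def exp_so3(v):
    """Rodrigues formula: exponential map from the algebra to SO(3)."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return np.eye(3)
    u = v / t
    ux = np.array([[0., -u[2], u[1]],
                   [u[2], 0., -u[0]],
                   [-u[1], u[0], 0.]])
    return np.eye(3) + np.sin(t) * ux + (1.0 - np.cos(t)) * (ux @ ux)

def sample_so3(R_mu, sigma):
    """Reparameterized sample on SO(3)."""
    v = sigma * rng.standard_normal(3)   # (a) sample in R^3, identified with so(3)
    R_eps = exp_so3(v)                   # (b) exponential map, concentrated at identity
    return R_mu @ R_eps                  # (c) left-multiply to change location
```

The returned matrix is a valid rotation for any `R_mu` in SO(3), since SO(3) is closed under multiplication.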

One critical complication is that it is not obvious that the measure we defined above through the exp map has a density function q̂(R|σ). For this to be the case we need the constructed measure to be absolutely continuous with respect to the Haar measure, the natural measure on the Lie group. This is proven by the following theorem.

###### Theorem.

Let (ℝ³, B(ℝ³), λ) be the real space, provided with the Lebesgue measure λ on the Borel σ-algebra B(ℝ³). Let (SO(3), B(SO(3)), μ) be the group of three-dimensional rotations, provided with the normalized Haar measure μ on the Borel σ-algebra B(SO(3)). Consider a probability measure ν absolutely continuous w.r.t. λ, with density r. The exponential map exp : ℝ³ → SO(3) is differentiable, thus continuous, thus measurable. Let ν̂ := exp∗ν be the pushforward of ν by the exp function. Then ν̂ is absolutely continuous with respect to the Haar measure μ (ν̂ ≪ μ).

###### Proof.

See Appendix B

As further derived in Appendix B, this implies the pushforward measure on SO(3) to be absolutely continuous w.r.t. the Haar measure, where the density

$$\hat{q}(R|\sigma) = \sum_{k \in \mathbb{Z}} r\!\left(\frac{\log(R)}{\theta(R)}\left(\theta(R) + 2k\pi\right) \,\Big|\, \sigma\right) \frac{\left(\theta(R) + 2k\pi\right)^2}{3 - \mathrm{tr}(R)}, \tag{6}$$

is defined almost everywhere. Here

$$\theta(R) = \|\log(R)\| = \cos^{-1}\left(\frac{\mathrm{tr}(R) - 1}{2}\right). \tag{7}$$

Further, log is defined as a principal branch and maps the group element back to the unique Lie algebra element closest to the origin. Notice that even though the density is singular at the identity (where 3 − tr(R) = 0), it still integrates to 1. After changing location by left-multiplying with another element R_μ, we obtain the final sample:

$$R_z \sim q(R_z | R_\mu, \sigma) = \hat{q}(R_\mu^\top R_z | \sigma), \tag{8}$$

where the second step is valid because of the left invariance of the Haar measure.
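Evaluating the density of Eqs. (6)-(8) requires the angle θ(R) of Eq. (7) and the principal branch of the matrix logarithm. A minimal numpy sketch of both, valid away from θ = π where the closed-form principal branch below degenerates:

```python
import numpy as np

def theta(R):
    """Rotation angle of Eq. (7): theta(R) = arccos((tr(R) - 1) / 2).
    The clip guards against floating-point values just outside [-1, 1]."""
    return np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))

def log_so3(R):
    """Principal branch of the matrix logarithm: the unique algebra
    element closest to the origin (assumes theta(R) < pi)."""
    t = theta(R)
    if t < 1e-12:
        return np.zeros((3, 3))  # log of the identity is the zero element
    return t / (2.0 * np.sin(t)) * (R - R.T)
```

By construction, `log_so3(exp(v_x)) = v_x` whenever the sampled algebra element has norm below π, which is the regime where the k = 0 term of Eq. (6) dominates.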

#### Kullback-Leibler Divergence

The KL divergence, or relative entropy, can be decomposed into an entropy and a cross-entropy term: KL(q‖p) = −H(q) + H(q, p). Since the Haar measure is invariant to left multiplication, we can compute the entropy of q̂ instead of the location-shifted q. As we have an expression for the density, the entropy can be computed using Monte Carlo samples:

$$\begin{aligned} H(q) = H(\hat{q}) &\approx -\frac{1}{N}\sum_{i=1}^{N} \log \hat{q}(R_i|\sigma), \qquad R_i \sim \hat{q}(R|\sigma) \\ &= -\frac{1}{N}\sum_{i=1}^{N} \log \hat{q}(\exp(\mathbf{v}_{i\times})|\sigma) \\ &= -\frac{1}{N}\sum_{i=1}^{N} \log \sum_{k \in \mathbb{Z}} r\!\left(\frac{\mathbf{v}_i}{\|\mathbf{v}_i\|}\left(\|\mathbf{v}_i\| + 2k\pi\right) \,\Big|\, \sigma\right) \cdot \frac{\left(\|\mathbf{v}_i\| + 2k\pi\right)^2}{2 - 2\cos(\|\mathbf{v}_i\|)}, \qquad \mathbf{v}_i \sim r(\mathbf{v}|\sigma) \end{aligned} \tag{9}$$

Notice that the last expression only depends on the samples taken in the Lie algebra. We found that one sample suffices when mini-batches are used. In general the cross-entropy term can be similarly approximated by MC estimates. However, in the special but important case of a uniform prior, p(R) = 1 w.r.t. the normalized Haar measure, the cross-entropy reduces to H(q, p) = 0.
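A sketch of the Monte Carlo entropy estimate of Eq. (9), truncating the sum over winding numbers k and again assuming an isotropic Gaussian for r(v|σ); the sample count and truncation level are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_r_gauss(v, sigma):
    """Log-density of an isotropic Gaussian r(v|sigma) on R^3 (one
    possible scale-reparameterizable choice for the algebra distribution)."""
    return (-0.5 * np.dot(v, v) / sigma**2
            - 1.5 * np.log(2.0 * np.pi * sigma**2))

def entropy_mc(sigma, n_samples=1000, k_max=3):
    """Monte Carlo estimate of H(q) = H(q_hat), Eq. (9), truncating the
    winding-number sum to |k| <= k_max."""
    h = 0.0
    for _ in range(n_samples):
        v = sigma * rng.standard_normal(3)
        n = np.linalg.norm(v)
        u = v / n
        # Sum over the algebra elements u * (||v|| + 2 k pi) that map to
        # the same rotation, weighted by the Jacobian factor of Eq. (9).
        dens = sum(
            np.exp(log_r_gauss(u * (n + 2 * np.pi * k), sigma))
            * (n + 2 * np.pi * k) ** 2 / (2.0 - 2.0 * np.cos(n))
            for k in range(-k_max, k_max + 1)
        )
        h -= np.log(dens)
    return h / n_samples
```

For small σ the pushforward is concentrated near the identity, so the estimate should approach the entropy of the Gaussian itself, 1.5·log(2πeσ²), providing a useful sanity check.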

## 4 Encoder and Decoder Networks

Having defined the reparameterizable density q(R_z|R_μ, σ), we need to design encoder networks which map elements from the input space to the reparameterization parameters (R_μ, σ), and decoder networks which map group elements to the output prediction.

### 4.1 Homeomorphic Encoder

We split the encoder network in two parts, which predict the reparameterization parameters R_μ and σ respectively. Since σ parameterizes a distribution on ℝ³, the corresponding network does not pose any problems and can be chosen similarly as in classical VAEs. However, special attention needs to be paid to designing the network that predicts a group element R_μ ∈ SO(3).

We consider the data as lying on a lower-dimensional manifold M_D, embedded in the input space. In our particular problem the manifold is assumed to be generated by SO(3) acting on a canonical object, followed by a projection into ambient space (e.g. pixel space) which, for simplicity, we assume to be injective. This means that we can make the simplifying assumption that the group element can be recovered from its image, i.e. that the map from SO(3) to M_D is a homeomorphism. The encoder is now meant to learn the inverse map, i.e. a map from the input space to SO(3) which, when restricted to M_D, is a homeomorphism and thus preserves the topological structure of SO(3).

In general there is no standard way to define and parameterize, via a neural network, the class of functions that are guaranteed to have these properties by design. Instead we give a general way to build an encoder capable of learning such a mapping. We divide the encoder network in two functions: a map h to some intermediate space Y, parameterized by a neural network, followed by a fixed surjective function π : Y → SO(3). Not any space Y or function π is suited: since neural networks can only model continuous functions, a necessary condition for the composition π ∘ h to be able to learn a homeomorphism (when its domain is restricted to M_D) is that there exists an embedding of SO(3) into Y. Then by definition a map exists such that the composition is a homeomorphism, and any continuous extension of it to the rest of Y is a suitable candidate for π. Moreover, if we choose Y to be a Euclidean space, a continuous h exists (which we can approximate with neural networks) such that an appropriate composition exists. Several choices for Y and π are proposed in Appendix D and investigated in the experimental Section 6.
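One concrete choice of the surjection π, in the spirit of the S² × S² (s2s2) parameterization evaluated in Section 6, orthonormalizes two predicted vectors via Gram-Schmidt; this is our own sketch of the idea, and the paper's Appendix D should be consulted for the exact constructions used.

```python
import numpy as np

def pi_gram_schmidt(a, b):
    """A continuous surjection pi from pairs of (generic) vectors in R^3
    onto SO(3): orthonormalize (a, b) and complete the frame with the
    cross product, giving a right-handed rotation matrix."""
    e1 = a / np.linalg.norm(a)
    b_perp = b - np.dot(e1, b) * e1      # remove the component along e1
    e2 = b_perp / np.linalg.norm(b_perp)
    e3 = np.cross(e1, e2)                # completes a right-handed basis
    return np.stack([e1, e2, e3], axis=1)  # columns are the frame vectors
```

A network head can then output two unconstrained 3-vectors, and this fixed map turns them into a valid mean rotation R_μ; the map is continuous wherever the two vectors are linearly independent.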

### 4.2 Group Action Decoder

Our decoder must be capable of mapping a group element, and optionally additional latent structure, back to the original high-dimensional input space. When the factor of variation in the input is the pose of an object, and we learn a latent variable R_z ∈ SO(3), we desire that a transformation applied to the latent variable results in a corresponding transformation of the pose of the decoded object. The task of the decoder is thus to learn a three-dimensional representation of the object, to rotate it according to the latent variable, and finally to project it back to the two-dimensional frame of the input image. A naive approach could be to simply provide the 9 elements of the rotation matrix to a neural network. Although such a decoder may learn to reconstruct the input well, it provides no guarantee that the latent space accurately reflects the pose variations of the object. Therefore, we make the method more explicit.

Hypothetically, one could learn a vector-valued signal on the sphere, f : S² → ℝᴷ, to represent the three-dimensional object in the input, as 3D shapes can be well represented by their information projected to a sphere (Cohen et al., 2018a). The decoder can rotate this signal with the latent variable before projecting it back to pixel space. A major downside of this approach is that parameterizing and projecting a function on the sphere is highly non-trivial.

Alternatively, we propose a method based on the representation theory of groups (Hall, 2003). Rather than learning the function f directly, we learn its (band-limited) Fourier modes, which form a simple vector space. It can be shown (Chirikjian & Kyatkin, 2000) that rotating a signal on the sphere corresponds to a linear transformation of its Fourier modes. The transformed Fourier modes are subsequently fed through an image generative network. The linear transformation in question is the Wigner D-matrix, which is a function of the SO(3) element. Technically, the Wigner D-matrices form representations of the group SO(3). This means that the mapping D from group elements to linear transformations is a homomorphism, so it preserves the group structure: D(R₁R₂) = D(R₁)D(R₂), for R₁, R₂ ∈ SO(3) and D a Wigner D-matrix. This method encourages the latent space to represent the actual pose of the input, while only requiring the construction of the D-matrices and a linear transformation. We refer to Figure 3 and Appendix F for details.

An additional advantage is that this decoder allows for disentangling of content and pose, as it is forced to encode the pose in a meaningful way. In that case the Fourier modes are also generated by the encoder. We leave this for future work.

## 5 Related Work

As VAEs utilize variational inference to recover some distribution on a latent manifold responsible for generating the observed data, the majority of extensions focus on increasing the flexibility of the prior and approximate posterior. Although most approaches use a standard Gaussian prior, there has recently been a surge of alternatives intended to offset some of this distribution's perceived limitations. Tomczak & Welling (2017) propose to directly tie the prior to the approximate posterior and learn it as a mixture over approximate posteriors. Nalisnick & Smyth (2017) introduce a non-parametric prior applying a truncated stick-breaking method. Support for discrete latent variables was developed in Jang et al. (2017); Maddison et al. (2017), while Naesseth et al. (2017); Figurnov et al. (2018) recently introduced novel techniques to reparameterize a suite of continuous distributions. In Davidson et al. (2018), the reparameterization technique of Naesseth et al. (2017) is extended to explore the properties of the hyperspherical von Mises-Fisher distribution to better capture intrinsically hyperspherical data. This is done in the context of avoiding manifold mismatches, and as such is closely related to the motivation of this work.

The predominant procedure to generate a more complex approximate posterior is through normalizing flows (Rezende & Mohamed, 2015), in which a class of invertible transformations is applied sequentially to a reparameterizable density. This general idea has later been extended in Kingma et al. (2016); Berg et al. (2018) to improve flexibility even further. As this framework does not place any specific distributional requirements on the prior besides being reparameterizable, it would be interesting to investigate possible applications to SO(3) in future work.

The problem of defining distributions on homogeneous spaces, including Lie groups, was investigated in (Chirikjian & Kyatkin, 2000; Chirikjian, 2010; Chirikjian & Kyatkin, 2016; Chirikjian, 2012). Cohen & Welling (2015) devised harmonic exponential families which are a powerful family of distributions defined on homogeneous spaces. These works did not concentrate on making the distributions reparameterizable.

Rendering complex scenes from multiple poses has been explored in (Eslami et al., 2018). However, this work assumes access to ground truth poses and does not do unsupervised pose learning as in the presented framework.

The idea of incorporating prior knowledge on mathematical structures into machine learning models has proven fruitful in many works. Cohen et al. (2018a) adapt convolutional networks to operate on spherical and SO(3)-valued data. Equivariant networks, investigated in Cohen & Welling (2016, 2017); Worrall et al. (2017); Weiler et al. (2018); Cohen et al. (2018b), reduce the complexity of a learning task by taking a quotient over group orbits which explain a subset of dimensions of the data manifold.

## 6 Experiments

We perform two experiments to investigate the importance of using a homeomorphic parameterization of the VAE in recovering the original underlying manifold. In both experiments we explore three main axes of comparison: (1) manifold topology, (2) decoder architecture, and (3) specifically for the models we compare different mean parameterizations as discussed in section 4.1. For each model we compute a tight bound on the negative log likelihood (NLL) through importance sampling following Burda et al. (2016).

For manifold topology we examine VAEs with the Gaussian parameterization (N-VAE), the hyperspherical parameterization of Davidson et al. (2018) (S-VAE), and the SO(3) latent variable discussed above. The two decoder variants are a simple MLP versus the group action decoder described in Section 4.2. Lastly, specifically for the SO(3) models, we explore mean parameterizations through unit quaternions (q), the Lie algebra (alg), S² × S¹ (s2s1), and S² × S² (s2s2). These parameterizations are chosen to be either valid (s2s2) or invalid (q, alg, s2s1) for the purpose of investigating the soundness of our theoretical considerations and to compare their behaviour. Details and derivations on the properties of these different parameterizations can be found in Appendix D.

### 6.1 Toy experiment

The simplest way of creating a non-linear embedding of SO(3) in a high-dimensional space is through a group representation, as discussed in Section 4.2. The data is created in the following way: a fixed representation ρ is chosen, in this experiment three copies of the direct sum of the Wigner D-matrices up to order 3. Subsequently a single element x of the embedding space is generated. The data set then consists of the vectors ρ(R)x, where R is sampled uniformly from SO(3). Since the representation is faithful (ρ is injective), the data points lie on a three-dimensional submanifold of the embedding space homeomorphic to SO(3).
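A sketch of this generation process, substituting a smaller faithful representation (three copies of the standard vector representation) for the Wigner-D blocks used in the actual experiment, to keep the example short:

```python
import numpy as np

rng = np.random.default_rng(4)

def rho(R):
    """A faithful block-diagonal representation of SO(3): three copies of
    the standard representation acting on R^9 (the experiment instead uses
    Wigner-D matrices up to order 3)."""
    out = np.zeros((9, 9))
    for i in range(3):
        out[3 * i:3 * i + 3, 3 * i:3 * i + 3] = R
    return out

def random_rotation():
    """Rotation sampled via QR decomposition, sign-fixed to det = +1
    (a simple approximate stand-in for exact Haar-uniform sampling)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q *= np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

x = rng.standard_normal(9)  # fixed embedded element
data = np.stack([rho(random_rotation()) @ x for _ in range(1000)])
```

Since each rotation acts orthogonally within each block, the per-block norms of every data point match those of `x`, confirming the samples lie on a single group orbit.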

In order to verify the ability of the models to correctly learn to encode from the embedded manifold to the manifold itself, we train various variational and non-variational auto-encoders on this data set. The encoder is a 3-layer MLP, and for the decoder we use the group action decoder of Section 4.2. The same representation is used as in the data generation, but the embedded element is learned. In addition to the SO(3) models, we use a 3-dimensional normal, which we map to SO(3) using ZYZ-Euler angles, and a von Mises-Fisher, which we map to SO(3) by identifying the points of the 3-sphere with the unit quaternions.

#### Results

The quantitative results are shown in Table 1. We observe that the choice of mean parametrization significantly impacts the ability of the model to correctly learn the manifold. The s2s2 method strongly outperforms the competing methods in the non-variational auto-encoder, achieving near-perfect reconstructions. Additionally, the metric indicating the continuity of the encoder, which we define in Appendix E, shows it is the only model that does not have discontinuities in the latent space. These results are in line with our theoretical predictions outlined in Section 4.1 and Appendix D.

The qualitative results in Figure 4 and Figures 6, 7 in Appendix A tell a similar story. These plots are created by taking a subgroup of SO(3) and making a submanifold in the data space using the same process with which the data was generated. This embedded trajectory is then encoded and reconstructed. The trajectory is divided into four equally sized partitions, each shown in a different color. We clearly see that only the s2s2 method is able to learn a continuous latent space.

Moreover, the worst performing models are the 3-dimensional normal and the algebra mean models. These share the property that the mean is predicted in a flat three-dimensional space which is then mapped onto the group. This indicates that using a flat space to represent a non-trivial manifold results in a poorly structured latent space, and in worse reconstruction performance and log-likelihoods.

### 6.2 Sphere-Cube

For this experiment we train auto-encoders on renderings of a cube. The cube is made highly asymmetrical through the colors of the faces and the colored spheres at the vertices. This should make it easier for the encoder to detect the orientation. This sphere-cube is then rotated by applying uniformly sampled group elements from SO(3), to create a training set of 1M images. Ideally the model learns to correctly represent these rotations in the latent space.

The encoder consists of 5 convolutional layers, followed by one of the mean encoders and reparameterization methods. The decoder uses either the group action or a 3-layer MLP, both followed by 5 deconvolutional layers. In order to balance the reconstruction and the KL divergence in a controlled manner, we follow Burgess et al. (2018) and replace the negative KL term in the original VAE loss with a squared difference between the computed KL value and a target value. We found that annealing the target value from 7 early in training to 15 at the end of training gave good results. This allows the model to first organize the space and later become more certain of its predictions. We found that two additional regularizing loss terms were needed to correctly learn the latent space. Details can be found in Appendix G.
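The modified objective can be sketched as follows; the linear annealing schedule and the weighting coefficient `gamma` are illustrative assumptions on our part (only the 7-to-15 target range comes from the text above).

```python
def constrained_elbo_loss(recon_loss, kl_value, step, total_steps,
                          kl_start=7.0, kl_end=15.0, gamma=1.0):
    """Loss in the style of Burgess et al. (2018): the negative KL term is
    replaced by a squared deviation of the computed KL from a target that
    is annealed from kl_start to kl_end over training."""
    target = kl_start + (kl_end - kl_start) * step / total_steps
    return recon_loss + gamma * (kl_value - target) ** 2
```

Early in training the target is low, letting the posterior stay broad while the latent space organizes; as the target rises, the model is pushed toward more certain (lower-entropy) posteriors.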

#### Results

Quantitative results comparing the best performing SO(3) parameterization to N-VAEs of different dimensionality are shown in Table 2. Although the higher-dimensional N-VAEs are able to achieve competitive metrics compared to the best SO(3) model, they only learn to embed the data in a high-dimensional space in an unstructured fashion. As can be seen in Figure 5, the SO(3) latent space learns a nearly perfect encoding, while the 10-dimensional normal learns disconnected patches of the data manifold. (Animated interpolations can be found at https://sites.google.com/view/lie-vae.)

It can be seen in Table 3 that the results from the toy experiment extend to this more complicated task. We observe that only the continuous s2s2 encoding achieves low negative log-likelihood and reconstruction losses compared to the other mean parameterizations.

Lastly, we observe that the group action decoder yields significantly higher performance than the MLP decoder. This is in line with the hypothesis that using the group action encourages structure in the latent space.

## 7 Discussion & Conclusion

In this paper we explored the use of manifold-valued latent variables, by proposing an extension of the reparameterization trick to compact connected Lie groups. We worked out the implementation details for the specific case of SO(3), and highlighted the various subtleties that must be taken into account to ensure a successful parameterization of the VAE. Through a series of experiments, we showed the importance of matching the topology of the latent data manifold with that of the latent variables to induce a continuous, well-behaved latent space. Additionally we demonstrated the improvement in learned latent space structure by using a group action decoder, and the need for care in choosing an embedding space for the posterior distribution's mean parameter.

We believe that the use of SO(3) and other well-known manifold-valued latent variables could present an interesting addition to tackling problems in fields such as model-based RL and computer vision. Moving forward we thus aim to extend this theory to other Lie groups such as SE(3). A limitation of the current work, and of reparameterizing distributions on specific manifolds in general, is that it relies on a priori knowledge of the observed data's latent structure. Hence in future work our ambition is to find a general theory to learn arbitrary manifolds not known in advance.

## Acknowledgements

The authors would like to thank Rianne van den Berg, Jakub Tomczak, and Yvan Scher for their suggestions and support in improving this paper.

## References

• Berg et al. (2018) Berg, Rianne van den, Hasenclever, Leonard, Tomczak, Jakub M, and Welling, Max. Sylvester normalizing flows for variational inference. UAI, 2018.
• Burda et al. (2016) Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. ICLR, 2016.
• Burgess et al. (2018) Burgess, Christopher P, Higgins, Irina, Pal, Arka, Matthey, Loic, Watters, Nick, Desjardins, Guillaume, and Lerchner, Alexander. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
• Chirikjian (2010) Chirikjian, Gregory S. Information-theoretic inequalities on unimodular lie groups. Journal of geometric mechanics, 2(2):119, 2010.
• Chirikjian (2012) Chirikjian, Gregory S. Stochastic Models, Information Theory, and Lie Groups. 2012.
• Chirikjian & Kyatkin (2000) Chirikjian, Gregory S and Kyatkin, Alexander B. Engineering applications of noncommutative harmonic analysis: with emphasis on rotation and motion groups. CRC press, 2000.
• Chirikjian & Kyatkin (2016) Chirikjian, Gregory S and Kyatkin, Alexander B. Harmonic Analysis for Engineers and Applied Scientists: Updated and Expanded Edition. Courier Dover Publications, July 2016.
• Cohen & Welling (2016) Cohen, Taco and Welling, Max. Group equivariant convolutional networks. In ICML, pp. 2990–2999, 2016.
• Cohen & Welling (2015) Cohen, Taco S and Welling, Max. Harmonic exponential families on manifolds. ICML, 2015.
• Cohen & Welling (2017) Cohen, Taco S and Welling, Max. Steerable CNNs. ICLR, 2017.
• Cohen et al. (2018a) Cohen, Taco S., Geiger, Mario, Köhler, Jonas, and Welling, Max. Spherical CNNs. ICLR, 2018a.
• Cohen et al. (2018b) Cohen, Taco S, Geiger, Mario, and Weiler, Maurice. Intertwiners between induced representations (with applications to the theory of equivariant neural networks). March 2018b.
• Davidson et al. (2018) Davidson, Tim R., Falorsi, Luca, Cao, Nicola De, Kipf, Thomas, and Tomczak, Jakub M. Hyperspherical Variational Auto-Encoders. UAI, 2018.
• Eslami et al. (2018) Eslami, S. M. Ali, Jimenez Rezende, Danilo, Besse, Frederic, Viola, Fabio, Morcos, Ari S., Garnelo, Marta, Ruderman, Avraham, Rusu, Andrei A., Danihelka, Ivo, Gregor, Karol, Reichert, David P., Buesing, Lars, Weber, Theophane, Vinyals, Oriol, Rosenbaum, Dan, Rabinowitz, Neil, King, Helen, Hillier, Chloe, Botvinick, Matt, Wierstra, Daan, Kavukcuoglu, Koray, and Hassabis, Demis. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018. ISSN 0036-8075.
• Figurnov et al. (2018) Figurnov, Michael, Mohamed, Shakir, and Mnih, Andriy. Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498, 2018.
• Hall (2003) Hall, B. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Graduate Texts in Mathematics. Springer, 2003. ISBN 9780387401225.
• Jang et al. (2017) Jang, Eric, Gu, Shixiang, and Poole, Ben. Categorical reparameterization with gumbel-softmax. ICLR, abs/1611.01144, 2017.
• Kingma & Welling (2013) Kingma, Diederik P. and Welling, Max. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
• Kingma et al. (2016) Kingma, Diederik P, Salimans, Tim, Jozefowicz, Rafal, Chen, Xi, Sutskever, Ilya, and Welling, Max. Improved variational inference with inverse autoregressive flow. In NIPS, pp. 4743–4751, 2016.
• Maddison et al. (2017) Maddison, Chris J, Mnih, Andriy, and Teh, Yee Whye. The concrete distribution: A continuous relaxation of discrete random variables. ICLR, 2017.
• Naesseth et al. (2017) Naesseth, Christian, Ruiz, Francisco, Linderman, Scott, and Blei, David. Reparameterization gradients through acceptance-rejection sampling algorithms. In AISTATS, pp. 489–498, 2017.
• Nalisnick & Smyth (2017) Nalisnick, Eric and Smyth, Padhraic. Stick-breaking variational autoencoders. ICLR, 2017.
• Rezende & Mohamed (2015) Rezende, Danilo and Mohamed, Shakir. Variational inference with normalizing flows. ICML, 37:1530–1538, 2015.
• Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. ICML, pp. 1278–1286, 2014.
• Rodrigues (1840) Rodrigues, Olinde. Des lois géométriques qui régissent les déplacements d’un système solide dans l’espace: et de la variation des cordonnées provenant de ces déplacements considérés indépendamment des causes qui peuvent les produire. 1840.
• Tomczak & Welling (2017) Tomczak, Jakub M and Welling, Max. VAE with a VampPrior. AISTATS, 2017.
• Weiler et al. (2018) Weiler, Maurice, Hamprecht, Fred A, and Storath, Martin. Learning steerable filters for rotation equivariant CNNs. In CVPR, 2018.
• Worrall et al. (2017) Worrall, Daniel E, Garbin, Stephan J, Turmukhambetov, Daniyar, and Brostow, Gabriel J. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017.

## Appendix B Pushforward Measure on SO(3)

###### Theorem 1.

Let (ℝ³, B[ℝ³], λ) be the real space provided with the Lebesgue measure λ on the Borel σ-algebra B[ℝ³]. Let (SO(3), B[SO(3)], ν) be the group of 3-dimensional rotations provided with the normalized Haar measure ν on the Borel σ-algebra B[SO(3)]. Consider then a probability measure μ absolutely continuous w.r.t. λ with density r, and the exponential map exp: ℝ³ → SO(3), which is differentiable, thus continuous, thus measurable. Let exp∗(μ) be the pushforward of μ by exp. Then exp∗(μ) is absolutely continuous with respect to the Haar measure ν.

###### Proof.

Define the sets:

 A_k = \{x \in \mathbb{R}^3 : \|x\| \in (k\pi, (k+1)\pi)\}, \quad B_k = \{x \in \mathbb{R}^3 : \|x\| = k\pi\}, \quad k \in \mathbb{N} (10)

Then note that:

 \mathbb{R}^3 = \Big(\bigcup_{k \in \mathbb{N}} A_k\Big) \cup \Big(\bigcup_{k \in \mathbb{N}} B_k\Big) (11)

And since λ(B_k) = 0 for every k and μ ≪ λ, then μ(B_k) = 0. Therefore:

 \mu(E) = \sum_{k \in \mathbb{N}} \mu(E \cap A_k) =: \sum_{k \in \mathbb{N}} \mu_{A_k}(E) \quad \forall E \in \mathcal{B}[\mathbb{R}^3] (12)

Then we have μ = ∑_{k∈ℕ} μ_{A_k}.
Now consider the pushforward measure exp∗(μ); we then have that:

 (\exp_*(\mu))(E) = \mu(\exp^{-1}(E)) = \sum_{k \in \mathbb{N}} \mu_{A_k}(\exp^{-1}(E)) = \sum_{k \in \mathbb{N}} (\exp_*(\mu_{A_k}))(E) (13)
 = \sum_{k \in \mathbb{N}} ((\exp|_{A_k})_*(\mu))(E) \quad \forall E \in \mathcal{B}[\mathrm{SO}(3)] (14)

Then we have exp∗(μ) = ∑_{k∈ℕ} (exp|_{A_k})∗(μ), where exp|_{A_k} is the exponential map restricted to A_k. Moreover, notice that exp|_{A_k} is injective; therefore we can apply the change-of-variables formula:

 ((\exp|_{A_k})_*(\mu))(E) = \int_{\exp|_{A_k}^{-1}(E)} r \, d\lambda = \int_E (r \circ \exp|_{A_k}^{-1}) \cdot |J_{\exp|_{A_k}^{-1}}| \, d\nu (15)

Then (exp|_{A_k})∗(μ) ≪ ν for every k, and since exp∗(μ) = ∑_{k∈ℕ} (exp|_{A_k})∗(μ), then exp∗(μ) ≪ ν. ∎

The proof then tells us how to compute the Radon-Nikodym derivative of the pushforward with respect to the Haar measure. In fact:

 \frac{d(\exp|_{A_k})_*(\mu)}{d\nu} = (r \circ \exp|_{A_k}^{-1}) \cdot |J_{\exp|_{A_k}^{-1}}|, \qquad \frac{d\exp_*(\mu)}{d\nu} = \sum_{k \in \mathbb{N}} \frac{d(\exp|_{A_k})_*(\mu)}{d\nu} (16)

Defining \hat{q} := d\exp_*(\mu)/d\nu, we then have:

 \hat{q}(R) = \sum_{k \in \mathbb{N}} (r \circ \exp|_{A_k}^{-1}(R)) \cdot |J_{\exp|_{A_k}^{-1}}(R)| = \sum_{v \in \exp^{-1}(R)} r(v) \cdot |J_{\exp}(v)|^{-1} (17)

From (Chirikjian, 2010) we have that

 |J_{\exp}(v)| = \frac{2 - 2\cos\|v\|}{\|v\|^2} (18)

We then have:

 \hat{q}(R) = \sum_{v \in \exp^{-1}(R)} r(v) \frac{\|v\|^2}{2 - 2\cos\|v\|} (19)

To then obtain an expression explicitly dependent on R, consider that:

 \exp|_{A_k}^{-1}(R) = \frac{\log(R)}{\|\log(R)\|}(\|\log(R)\| + k\pi) \quad \text{if } k \text{ is even} (20)
 \exp|_{A_k}^{-1}(R) = \frac{\log(R)}{\|\log(R)\|}(\|\log(R)\| - (k+1)\pi) \quad \text{if } k \text{ is odd} (21)

Where we have defined \log(R) := \exp|_{A_0}^{-1}(R). Moreover we then have:

 |J_{\exp|_{A_k}^{-1}}(R)| = \frac{\|\exp|_{A_k}^{-1}(R)\|^2}{2 - 2\cos\|\exp|_{A_k}^{-1}(R)\|} = \frac{(\|\log(R)\| + k\pi)^2}{2 - 2\cos\|\log(R)\|} \quad \text{if } k \text{ is even} (22)
 |J_{\exp|_{A_k}^{-1}}(R)| = \frac{(\|\log(R)\| - (k+1)\pi)^2}{2 - 2\cos\|\log(R)\|} \quad \text{if } k \text{ is odd} (23)

Putting everything together:

 \hat{q}(R) = \sum_{k \in \mathbb{Z}} r\Big(\frac{\log(R)}{\|\log(R)\|}(\|\log(R)\| + 2k\pi)\Big) \frac{(\|\log(R)\| + 2k\pi)^2}{2 - 2\cos\|\log(R)\|} (24)

Where from (Chirikjian, 2010) we have:

 \log(R) = \frac{\theta(R)}{2\sin\theta(R)}(R - R^\top), \qquad \theta(R) = \cos^{-1}\Big(\frac{\mathrm{tr}(R) - 1}{2}\Big) (25)

This gives us the final expression:

 \hat{q}(R|\sigma) = \sum_{k \in \mathbb{Z}} r\Big(\frac{\log(R)}{\theta(R)}(\theta(R) + 2k\pi)\Big) \frac{(\theta(R) + 2k\pi)^2}{3 - \mathrm{tr}(R)} (26)
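As a concrete illustration, Equation (26) can be evaluated numerically. The sketch below assumes r is an isotropic Gaussian N(0, σ²I) (an illustrative choice of r; the function and parameter names are ours) and truncates the sum over k:

```python
import numpy as np

def so3_log(R, eps=1e-8):
    """Principal branch of the log map, Eq. (25): returns v = log(R) and theta = ||v||."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    A = (R - R.T) * theta / (2.0 * np.sin(theta) + eps)  # theta/(2 sin theta) (R - R^T)
    return np.array([A[2, 1], A[0, 2], A[1, 0]]), theta

def hat_q(R, sigma, K=5):
    """Truncated density of Eq. (26), with r an isotropic Gaussian N(0, sigma^2 I)."""
    v, theta = so3_log(R)
    axis = v / (np.linalg.norm(v) + 1e-12)
    total = 0.0
    for k in range(-K, K + 1):
        s = theta + 2.0 * np.pi * k                # signed norm of the k-th preimage
        x = axis * s                               # k-th preimage in the Lie algebra
        r_x = np.exp(-x @ x / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** 1.5
        total += r_x * s**2 / (3.0 - np.trace(R))  # 3 - tr(R) = 2 - 2 cos(theta)
    return total
```

For a concentrated r, only the k = 0 term contributes noticeably; the remaining terms account for the other branches of the log map.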

## Appendix C Entropy computation

We optimize Monte Carlo (MC) estimates of the entropy:

 H(q) = H(\hat{q}) \approx -\frac{1}{N}\sum_{i=1}^N \log \hat{q}(R_i), \qquad R_i \sim \hat{q}

(where we dropped the dependency on the parameters for simplicity). Then, using Equation (24):

 H(q) = H(\hat{q}) \approx -\frac{1}{N}\sum_{i=1}^N \log \sum_{k \in \mathbb{Z}} r\Big(\frac{\log(R_i)}{\|\log(R_i)\|}(\|\log(R_i)\| + 2k\pi)\Big) \frac{(\|\log(R_i)\| + 2k\pi)^2}{2 - 2\cos\|\log(R_i)\|}, \qquad R_i \sim \hat{q}

Given the way we defined \hat{q}, we obtain samples from it in the following way:

 R_i = \exp(v_i), \qquad v_i \sim r (27)

Substituting this into the previous expression we get:

 H(q) \approx -\frac{1}{N}\sum_{i=1}^N \log \hat{q}(\exp(v_i)) = -\frac{1}{N}\sum_{i=1}^N \log \sum_{k \in \mathbb{Z}} r\Big(\frac{v_i}{\|v_i\|}(\|v_i\| + 2k\pi)\Big) \cdot \frac{(\|v_i\| + 2k\pi)^2}{2 - 2\cos\|v_i\|}, \qquad v_i \sim r

Notice that this expression depends only on the samples v_i from r in the Lie algebra.

Assuming the density r decays quickly enough to zero, the above infinite summation can be truncated. This is always the case for exponentially decaying distributions, like the Normal. The truncated summation can then be computed using the log-sum-exp trick:

 H(q) \approx -\frac{1}{N}\sum_{i=1}^N \operatorname{logsumexp}_k\Big(\log r\Big(\frac{v_i}{\|v_i\|}(\|v_i\| + 2k\pi)\Big) + \log\frac{(\|v_i\| + 2k\pi)^2}{2 - 2\cos\|v_i\|}\Big), \qquad v_i \sim r (28)
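A minimal numerical sketch of the estimator in Equation (28), again assuming r is an isotropic Gaussian N(0, σ²I) and truncating to |k| ≤ K (both assumptions are ours):

```python
import numpy as np

def entropy_mc(vs, sigma, K=5):
    """MC estimate of H(q) per Eq. (28).
    vs: (N, 3) samples from r, here assumed to be N(0, sigma^2 I)."""
    def log_r(x):  # log-density of the isotropic Gaussian (illustrative choice of r)
        return -np.sum(x**2, axis=-1) / (2 * sigma**2) - 1.5 * np.log(2 * np.pi * sigma**2)

    norms = np.linalg.norm(vs, axis=1, keepdims=True)     # ||v_i||, shape (N, 1)
    axes = vs / norms                                     # unit axes
    ks = 2 * np.pi * np.arange(-K, K + 1)                 # 2*pi*k, shape (2K+1,)
    s = norms + ks                                        # shifted norms, shape (N, 2K+1)
    shifted = axes[:, None, :] * s[:, :, None]            # k-th preimage of exp(v_i)
    terms = log_r(shifted) + np.log(s**2 / (2 - 2 * np.cos(norms)))
    # log-sum-exp over k, then average over samples
    m = terms.max(axis=1, keepdims=True)
    log_q = m[:, 0] + np.log(np.exp(terms - m).sum(axis=1))
    return -log_q.mean()
```

For small σ the estimate approaches the entropy of the Gaussian in ℝ³, since only the k = 0 term contributes and the Jacobian factor is close to one.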

## Appendix D Mean parameterization

As discussed above, some requirements exist on the mean parameterization for the encoder to correctly represent the data manifold.

We split the mean map into the composition of a projection π and a map f, both generally discontinuous. We assume f to be a neural network output. The projections π below are known ways to surjectively map a simpler space onto SO(3); the maps f are constructed to map to the domain of π.

We discuss the existence of a continuous map that is a right inverse of π, which is necessary for the correct encoder to exist.

1. Algebra, with π: ℝ³ → SO(3). This method simply uses the exponential map:

 \pi: \mathbb{R}^3 \to \mathrm{SO}(3) (29)
 v \mapsto \exp(v_\times) (30)

Its inverses are the branches of the log map. However, a path in SO(3) that is a full rotation around a fixed axis is continuous in SO(3) but discontinuous in the algebra when mapped with the log map. Thus the log map is not continuous.

2. Quaternions, with π: S³ → SO(3). The unit quaternions, which are homeomorphic to S³, are a 'double cover' of SO(3), which means that a continuous surjective projection π exists that is two-to-one. The projection map can be found in Chirikjian & Kyatkin (2000, Eqn. (5.60)). Using the theory of fiber bundles (recognizing S³ as a non-trivial principal bundle over base space SO(3)), one can show that no continuous embedding of SO(3) into S³, i.e. no continuous right inverse of π, exists.

3. s2s1, with π: S² × S¹ → SO(3). This is the map from an axis in S² and an angle, represented as a point v = (v₁, v₂) on S¹.

 \pi: S^2 \times S^1 \to \mathrm{SO}(3) (31)
 (u, v) \mapsto I + v_2 u_\times + (1 - v_1) u_\times^2 (32)

For a right inverse of π to be continuous, its image must be closed (as it is a compact subset of a Hausdorff space). Thus so must the set of axes it selects. However, for a fixed angle the set of selected axes is a hemisphere (times a point) that does not contain its entire boundary; thus it is not closed and the right inverse is not continuous.

4. s2s2, with π: S² × S² → SO(3). This method creates two orthonormal vectors.

 \pi: S^2 \times S^2 \to \mathrm{SO}(3) (33)
 (u, v) \mapsto \mathrm{concat}(w_1, w_2, w_3) (34)
 \text{where:} (35)
 w_1 = u (36)
 w_2' = v - \langle u, v \rangle u (37)
 w_2 = w_2' / \|w_2'\| (38)
 w_3 = w_1 \times w_2 (39)

Notice that there exists a continuous and injective right inverse SO(3) → S² × S². It simply consists of taking the first two rows of the matrix representation of the element (the third row is the vector product of the first two, so it can always be recovered). Moreover, composing it with π gives the identity on SO(3).
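The s2s2 map of Equations (33)–(39) and its continuous right inverse can be sketched as follows (function names are ours):

```python
import numpy as np

def pi_s2s2(u, v, eps=1e-12):
    """Gram-Schmidt map of Eqs. (33)-(39): two non-collinear vectors -> rotation matrix."""
    w1 = u / np.linalg.norm(u)
    w2 = v - np.dot(w1, v) * w1          # remove the component of v along w1
    w2 = w2 / (np.linalg.norm(w2) + eps)
    w3 = np.cross(w1, w2)                # completes a right-handed orthonormal frame
    return np.stack([w1, w2, w3])        # rows w1, w2, w3

def right_inverse(R):
    """Continuous right inverse: the first two rows of R (the third is recoverable)."""
    return R[0], R[1]
```

Composing `right_inverse` with `pi_s2s2` recovers the original rotation, since the first two rows of a rotation matrix are already orthonormal.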

## Appendix E Continuity Metric

Consider a map f: X → Y, where X, Y are metric spaces with metrics d_X and d_Y respectively. In order to compute the proposed continuity metric, we take a discretized continuous path (x_i)_{i ∈ [N]}, defined as pairwise close points, and compute the relative distances

 L_i = \frac{d_Y(f(x_{i+1}), f(x_i))}{d_X(x_{i+1}, x_i)} (40)

From this we further compute the quantities

 M := \max_i L_i \quad \text{and} \quad P_\alpha := \alpha\text{-th percentile of } \{L_i : i \in [N-1]\}. (41)

By comparing these two values, we want to discover whether there is at least one outlier in the set of relative distances {L_i}. Such an outlier corresponds to a transition with a big jump, signalling a discontinuity point. We define a path to be discontinuous if M exceeds P_α by more than a fixed factor.

In order to capture stochastic effects we repeat the above procedure with several paths. The final score is the fraction of discontinuous paths. In the practical implementation we sample a fixed number of paths and use a high percentile for P_α.
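The procedure can be sketched as follows; since the concrete number of paths, percentile, and outlier threshold are not restated here, the values below are illustrative assumptions:

```python
import numpy as np

def fraction_discontinuous(paths, f, d_x, d_y, alpha=90.0, threshold=10.0):
    """Continuity score of Eqs. (40)-(41): fraction of paths whose largest relative
    jump M exceeds the alpha-th percentile P_alpha by the factor `threshold`
    (alpha and threshold are illustrative assumptions)."""
    n_disc = 0
    for xs in paths:                                   # xs: pairwise-close points
        ys = [f(x) for x in xs]
        L = np.array([d_y(ys[i + 1], ys[i]) / d_x(xs[i + 1], xs[i])
                      for i in range(len(xs) - 1)])    # relative distances, Eq. (40)
        M, P = L.max(), np.percentile(L, alpha)        # Eq. (41)
        n_disc += M > threshold * P
    return n_disc / len(paths)
```

A continuous map yields uniformly small ratios, so M stays close to P_α; a jump discontinuity makes M an outlier and flags the path.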

## Appendix F Group Action

For each degree l, the Wigner D-matrix can be expressed in a real basis. We choose the number of copies of each degree and stack the matrices in block-diagonal form to create our representation, which amounts to taking the direct sum of the representations.

The Wigner D-matrices represent rotations of the Fourier modes of a signal on the sphere, which provides an interpretation for using the group action. We consider a real signal f on the sphere S². It has a generalized Fourier transformation (Chirikjian & Kyatkin, 2000):

 f(s) = \sum_{l=0}^{\infty} (2l+1) \sum_{m=-l}^{l} \hat{f}_{lm} D^l_{m0}(\alpha_s, \beta_s, 0)

where \hat{f}_{lm} are the Fourier components and D^l is the Wigner D-matrix. We use the identity relating the spherical harmonics, which are the basis functions of the Fourier modes, to the Wigner D-matrices D^l_{m0}(\alpha_s, \beta_s, 0), where \alpha_s, \beta_s are the first two Euler angles of s. Then for a rotation g, using the homomorphism property D^l(gh) = D^l(g) D^l(h):

 f(g(s)) = \sum_{l=0}^{\infty} (2l+1) \sum_{m=-l}^{l} \hat{f}_{lm} \sum_{r=-l}^{l} D^l_{mr}(g) D^l_{r0}(\alpha_s, \beta_s, 0) = \sum_{l=0}^{\infty} (2l+1) \sum_{r=-l}^{l} \Big(\sum_{m=-l}^{l} \hat{f}_{lm} D^l_{mr}(g)\Big) D^l_{r0}(\alpha_s, \beta_s, 0)

where g(s) corresponds to rotating a point on the sphere.

We see that our method of using representations in the decoder corresponds to having the content latent code represent the Fourier coefficients of a virtual signal on the sphere.

## Appendix G Regularizers

Even when an appropriate mean parametrization is selected and proper behaviour of the decoder is encouraged by the group action decoder, the network can still learn a discontinuous latent space. To encourage it to learn the data manifold correctly, we employ two additional loss terms that act as regularizers. An ablative analysis of the effectiveness of these regularizers is shown in Table 4.

### g.1 Equivariance regularizer

If the 3D object on which SO(3) acts is centered in the frame, then a subgroup of SO(3) exists such that its action corresponds to the rotations whose axis is orthogonal to the camera frame. For any R ∈ SO(3), angle θ, decoder g, action ψ_θ of this subgroup on the latent space, and action ϕ_θ on the pixels through planar rotations, we have the equivariance relationship:

 g(\psi_\theta(R)) = \phi_\theta(g(R)) (42)

This equivariance is shown in Figure 8. The relationship is exact if the object is centered, the rotation acts on all pixels, and the camera is orthographic (located infinitely far away from the subject). If the object is off center, the pixel rotation can be performed around a learned center point. If the images have a rotation-invariant background, a learned mask can be applied. If the camera is not orthographic, the equivariance relationship is not exact, but approximate. The decoder is regularized by enforcing Equation (42) through a mean squared error loss on the pixels, for uniformly sampled R and θ. We choose the subgroup to correspond to rotation around a fixed axis.
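A toy sketch of the regularizer, with a 3-vector standing in for the decoded image and rot_z playing the roles of both the latent subgroup action ψ_θ and the planar pixel action ϕ_θ (the toy decoder and all names are ours):

```python
import numpy as np

def rot_z(theta):
    """Rotation about the z-axis; acts on the xy-plane like a planar pixel rotation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def equivariance_loss(decoder, R, theta):
    """MSE between decoding the rotated latent and rotating the decoded output, Eq. (42)."""
    out_rotated = rot_z(theta) @ decoder(R)   # phi_theta applied to the decoder output
    rotated_out = decoder(rot_z(theta) @ R)   # decoding psi_theta(R)
    return np.mean((out_rotated - rotated_out) ** 2)
```

An exactly equivariant decoder drives this loss to zero, while a decoder that breaks the symmetry incurs a positive penalty.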

This regularizer helps align all rotations in each orbit, but does not help in correctly aligning the orbits among each other. Thus we reduce the problem from aligning SO(3) to aligning S², since the coset space of the subgroup, obtained after the orbits are identified, is homeomorphic to the sphere.

### g.2 Continuity regularizer

If the learner is provided with pairs of images that are nearby with respect to the manifold metric, the encoder can be regularized by penalizing differences in the encodings of the two inputs. This is done by penalizing the Frobenius norm of the difference of the two encoded rotation matrices, which is a proper metric on the manifold.

This simplifies the problem from unsupervised learning on i.i.d. samples to learning a VAE on two-frame samples from random trajectories of data lying on the manifold.