
Equivariant Representation Learning via Class-Pose Decomposition

by   Giovanni Luca Marchetti, et al.

We introduce a general method for learning representations that are equivariant to symmetries of data. Our central idea is to decompose the latent space into an invariant factor and the symmetry group itself. The components semantically correspond to intrinsic data classes and poses respectively. The learner is self-supervised and infers these semantics based on relative symmetry information. The approach is motivated by theoretical results from group theory and guarantees representations that are lossless, interpretable and disentangled. We provide an empirical investigation via experiments involving datasets with a variety of symmetries. Results show that our representations capture the geometry of data and outperform other equivariant representation learning frameworks.



1 Introduction

*Equal contribution.

For an intelligent agent aiming to understand the world and to operate therein, it is crucial to construct rich representations reflecting the intrinsic structures of the perceived data [bengio_disentanglement]. A variety of self-supervised approaches have been proposed to address the problem of representation learning, among which (variational) auto-encoders [kingma2013auto, baldi2012autoencoders] and contrastive learning methods [chen2020simple, jaiswal2020survey] are the most prominent. These approaches are applicable to a wide range of scenarios due to their generality but often fail to capture structures hidden in data. For example, it has been recently shown that disentangling intrinsic factors of variation in data requires specific biases or some form of supervision [challenging_disentanglement]. This raises the need for representations relying on additional structures and paradigms.

Figure 1: A depiction of our equivariant representation decomposing intrinsic class and pose.

A fundamental geometric structure of datasets is given by symmetries. Consider for example a space of images depicting three-dimensional rigid objects in different poses. In this case the symmetries are rigid transformations (translations and rotations), in the sense that the latter act on an arbitrary datapoint by transforming the pose of the depicted object. Symmetries not only capture the geometry of the pose but additionally preserve the object’s shape, partitioning the dataset into intrinsic invariant classes. The joint information of shape (i.e., the invariant class) and pose describes the data completely and is recoverable from the symmetry structure alone. Another example of symmetries arises in the context of an agent exploring an environment. Actions performed by a mobile robot can be interpreted as changes of frame, i.e., symmetries transforming the data the agent perceives. Assuming the agent is capable of odometry (measurement of its own movement), such symmetries are collectable and available for learning. All this motivates the design of representations that rely on symmetries and behave coherently with respect to them, a property known as equivariance.

In this work we introduce a general framework for equivariant representation learning. Our central idea is to encode class and pose separately by decomposing the latent space into an invariant factor and a symmetry component (see Figure 1). We then train a model with a loss encouraging equivariance. The pose component of our latent space preserves the geometry of data while the class component is interpretable and necessary for a lossless representation. To this end, we show theoretically that under mild assumptions an ideal learner achieves lossless representations by being trained on equivariance alone. In the presence of multiple symmetry factors one can vary each of them separately by acting on the pose component. This realizes disentanglement in the sense of [higgins2018towards] which, as mentioned, would not be possible without the information carried by symmetries.

We rely on the abstract language of (Lie) group theory in order to formalize symmetries and equivariance. In this sense, our framework is general and applicable to arbitrary groups of symmetries. This is in contrast with previous works on equivariance, which often focus on specific scenarios. For example, some works [guo2019affine, worrall2017interpretable, dynamic_enviroments] consider Euclidean representations and linear or affine symmetry groups, which is limiting and does not lead to lossless representations. Another popular technique is forcing equivariance through group convolutions [taco_group_equiv, cohengeneraltheory]. The latter are limited to the specific case when data consists of signals over a base space (typically, a pixel plane or a voxel grid) and the symmetries of the data are induced by those of the base space. This is the case for two-dimensional translations and dilations of an image, but does not hold for images of rotating rigid objects or first-person views of a scene explored by an agent. On the other hand, some recently introduced frameworks [kipf2019contrastive, van2020plannable] aim to jointly learn the equivariant representation together with the latent dynamics/symmetries. Although this has the advantage that the group of symmetries is not assumed to be known a priori, the obtained representation is unstructured, uninterpretable, and comes with no theoretical guarantees. In summary, our contributions include:

  • A method for learning equivariant representations separating intrinsic data classes from poses.

  • A general mathematical formalism based on group theory, which ideally guarantees lossless and disentangled representations.

  • An empirical validation of the performance of our method via a set of experiments involving various group actions, together with applications to scene mapping through visual data.

We provide an implementation of our framework together with data and code for all the experiments at the repository:

2 The Mathematics of Symmetries

We now introduce the necessary mathematical background on symmetries and equivariance. The modern axiomatization of symmetries relies on their algebraic structure i.e., composition and inversion. The properties of those operations are captured by the abstract concept of a group [rotman2012introduction].

Definition 2.1.

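A group is a set $G$ equipped with a composition map $G \times G \to G$ denoted by $(g, h) \mapsto gh$, an inversion map $G \to G$ denoted by $g \mapsto g^{-1}$, and a distinguished identity element $1 \in G$ such that for all $g, h, k \in G$:

Associativity: $g(hk) = (gh)k$. Inversion: $g^{-1}g = gg^{-1} = 1$. Identity: $g1 = 1g = g$.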

Examples of groups include the permutations of a set and the general linear group $GL(n)$ of invertible $n \times n$ real matrices, both equipped with usual composition and inversion of functions. An interesting subgroup of the latter is the special orthogonal group $SO(n)$, which consists of linear orientation-preserving isometries of the Euclidean space. An example of a commutative group (i.e., such that $gh = hg$ for all $g, h \in G$) is $\mathbb{R}^n$ equipped with vector sum as composition.

The idea of a space $X$ having $G$ as group of symmetries is abstracted by the notion of group action.

Definition 2.2.

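An action by a group $G$ on a set $X$ is a map $G \times X \to X$ denoted by $(g, x) \mapsto g \cdot x$, satisfying for all $g, h \in G$, $x \in X$:

Associativity: $g \cdot (h \cdot x) = (gh) \cdot x$. Identity: $1 \cdot x = x$.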

In general, the following actions can be defined for arbitrary groups: $G$ acts on any set trivially by $g \cdot x = x$, and $G$ acts on itself seen as a set via (left) multiplication by $g \cdot h = gh$. A further example of group action is $GL(n)$ acting on $\mathbb{R}^n$ by matrix multiplication.
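The two action axioms can be checked numerically on the matrix example. The following is a minimal sketch for 2x2 matrices acting on the plane; the helper names (`matmul`, `act`, `rotation`, `close`) and the chosen sample values are illustrative assumptions, not part of the paper.

```python
import math

def matmul(a, b):
    """Compose two 2x2 matrices (the group composition gh)."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def act(g, x):
    """Action of a 2x2 matrix g on a vector x of the plane (g . x)."""
    return tuple(sum(g[i][k] * x[k] for k in range(2)) for i in range(2))

def rotation(theta):
    """Element of SO(2): rotation by the angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def close(u, v, tol=1e-9):
    """Compare two vectors up to floating-point error."""
    return all(abs(a - b) < tol for a, b in zip(u, v))

identity = ((1.0, 0.0), (0.0, 1.0))
g = rotation(0.3)
h = ((2.0, 1.0), (0.0, 1.0))   # invertible, but not a rotation
x = (1.0, -2.0)

assoc_ok = close(act(g, act(h, x)), act(matmul(g, h), x))   # g.(h.x) = (gh).x
ident_ok = close(act(identity, x), x)                        # 1.x = x
```

Both flags evaluate to true for any choice of invertible matrices and vectors, since the axioms reduce to associativity of matrix multiplication.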

Maps which preserve symmetries are called equivariant and will constitute the fundamental notion of our representation learning framework.

Definition 2.3.

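A map $\varphi: X \to Z$ between sets acted upon by $G$ is called equivariant if $\varphi(g \cdot x) = g \cdot \varphi(x)$ for all $g \in G$, $x \in X$. It is called invariant if moreover $G$ acts trivially on $Z$ or, explicitly, if $\varphi(g \cdot x) = \varphi(x)$. It is called an isomorphism if it is bijective.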
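As an illustration of the definition, the sketch below tests equivariance and invariance on a finite set of samples for rotations of the plane, with the plane identified with the complex numbers. The function names and the sample values are assumptions made for this example, and a finite check is of course only evidence, not a proof.

```python
import cmath

def act(theta, x):
    """SO(2) acting on the plane, identified with the complex numbers."""
    return cmath.exp(1j * theta) * x

def trivial(theta, z):
    """Trivial action on the target: every rotation leaves z unchanged."""
    return z

def is_equivariant(phi, act_out, samples, tol=1e-9):
    """Check phi(g . x) == g . phi(x) on a finite list of (theta, x) samples."""
    return all(abs(phi(act(t, x)) - act_out(t, phi(x))) < tol for t, x in samples)

samples = [(0.4, 1 + 2j), (2.0, -0.5j), (-1.3, 3 - 1j)]

scale_equivariant = is_equivariant(lambda x: 3 * x, act, samples)   # scaling commutes with rotations
norm_invariant = is_equivariant(abs, trivial, samples)              # the norm ignores rotations
shift_equivariant = is_equivariant(lambda x: x + 1, act, samples)   # a shift does not commute
```

Here scaling is equivariant, the norm is invariant, and the shift fails the test because rotating and then shifting differs from shifting and then rotating.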

Now, group actions induce classes in $X$ called orbits by identifying points related by a symmetry.

Definition 2.4.

Consider the equivalence relation on $X$ given by deeming $x$ and $y$ equivalent if $y = g \cdot x$ for some $g \in G$. The induced equivalence classes are called orbits, and the set of orbits is denoted by $X/G$.

Figure 2: Orbits of a (free) group action represent intrinsic classes of data. Each orbit is isomorphic to the symmetry group itself.

For example, single points constitute orbits of the trivial action, while the multiplication action has a single orbit. Data-theoretically, an orbit may be thought of as an invariant, maximal class of data induced by the symmetry structure. In the example of rigid objects acted upon by translations and rotations, orbits indeed correspond to shapes (see Figure 2).

It is intuitive to assume that a nontrivial symmetry has to produce a change in data. If no difference is perceived, one might indeed consider the given transformation as trivial. We can thus assume that no point in $X$ is fixed by an element of $G$ different from the identity or, in other words, $g \cdot x \neq x$ for $g \neq 1$. Such actions are called free and will be the ones relevant to the present work.

Assumption 2.1.

The action by $G$ on $X$ is free.
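Orbits and freeness are easy to inspect on a finite toy example. The sketch below assumes a hypothetical dataset of two shapes in four discrete poses, acted upon by the cyclic group of order four; all names are illustrative.

```python
N = 4                      # cyclic group Z/4 = {0, 1, 2, 3} under addition mod 4
GROUP = range(N)
X = [(shape, k) for shape in ("heart", "square") for k in range(N)]

def act(g, x):
    """g shifts the pose index and leaves the shape untouched."""
    shape, k = x
    return (shape, (k + g) % N)

def orbit(x):
    """All points reachable from x by some symmetry."""
    return frozenset(act(g, x) for g in GROUP)

orbits = {orbit(x) for x in X}    # the set of orbits X/G
is_free = all(act(g, x) != x for g in GROUP if g != 0 for x in X)
```

The action has one orbit per shape, it is free, and each orbit has exactly as many elements as the group.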

The following is the core theoretical result motivating our representation learning framework, which we will discuss in the following section. The result guarantees a general decomposition into a trivial and a multiplicative action and describes all the equivariant isomorphisms of such a decomposition.

Proposition 2.1.

The following holds:

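  • There is an equivariant isomorphism

    $$X \simeq (X/G) \times G \qquad (1)$$

    where $G$ acts trivially on the orbits and via multiplication on itself, i.e., $g \cdot (O, h) = (O, gh)$ for $g, h \in G$, $O \in X/G$. In other words, each orbit can be identified equivariantly with the group itself.

  • Any equivariant map $\varphi: (X/G) \times G \to (X/G) \times G$ is a right multiplication on each orbit, i.e., for each orbit $O \in X/G$ there is an $h_O \in G$ such that $\varphi(O, g) = (O', g h_O)$ for all $g \in G$, where $O'$ denotes the orbit that $O$ is mapped to. In particular, if $\varphi$ induces a bijection on orbits then it is an isomorphism.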

We refer to the Appendix for a proof. The first part of the statement can be interpreted in plain words as a decomposition of classes from poses for any free group action. According to this terminology, a pose is abstractly an element of an arbitrary group while a class is an orbit. The intuition behind the second part of the statement is that any equivariant map performs an orbit-dependent ‘change of frame’, in the sense that elements $g$ of an orbit $O$ get composed on the right by a symmetry $h_O$ depending on $O$. This will imply that our representations can differ from ground-truth ones only by such changes of frame and will in fact guarantee isomorphic representations for our framework.
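The class-pose decomposition can be made concrete on a finite free action. The following sketch assumes a hypothetical dataset of two shapes in five cyclic poses and an arbitrary choice of one representative per orbit; a different choice of representatives only changes the frame within each orbit.

```python
N = 5
X = [(shape, k) for shape in ("cube", "pill") for k in range(N)]

def act(g, x):
    """The cyclic group Z/5 shifts the pose index of a datapoint."""
    shape, k = x
    return (shape, (k + g) % N)

# One representative per orbit (an arbitrary, hypothetical choice).
representatives = {"cube": ("cube", 2), "pill": ("pill", 0)}

def decompose(x):
    """Map x to (class, pose) such that x = pose . representative(class)."""
    shape, _ = x
    pose = next(g for g in range(N) if act(g, representatives[shape]) == x)
    return shape, pose

def recompose(cls, pose):
    """Inverse map: act on the representative of the class by the pose."""
    return act(pose, representatives[cls])

round_trip = all(recompose(*decompose(x)) == x for x in X)
injective = len({decompose(x) for x in X}) == len(X)
equivariant = all(
    decompose(act(g, x)) == (decompose(x)[0], (g + decompose(x)[1]) % N)
    for g in range(N) for x in X
)
```

The map `decompose` is a bijection onto classes times poses, and acting on a datapoint only multiplies its pose while leaving its class fixed.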

3 Method

3.1 General Equivariant Representation Learning

In the context of representation learning the goal of the model is to learn a map $\varphi: X \to Z$ (‘representation’) from the data space $X$ to a latent space $Z$. The learner optimizes a loss $\mathcal{L}$ over parameters $\theta$ of the map $\varphi = \varphi_\theta$. The so-obtained representation can be deployed in downstream applications for an improved performance with respect to operating in the original data space.

The central assumption of equivariant representation learning is that data carries symmetries which the representation has to preserve. As discussed in Section 2, this means that a group of symmetries $G$ acts on both $X$ and $Z$ and that the representation $\varphi$ is encouraged to be equivariant through the loss. While the action on $Z$ is designed as part of the model, the action on $X$ is unknown in general and has to be conveyed through data. Concretely, the dataset consists of triples $(x, g, y)$ with $x \in X$, $g \in G$ and $y = g \cdot x$. The group element $g$ carries symmetry information which is relative between $x$ and $y$. Equivariance is then naturally encouraged through a loss in the form:

$$\mathcal{L}(\theta) = \mathbb{E}_{x, g, y = g \cdot x}\left[ d\left(g \cdot \varphi_\theta(x), \varphi_\theta(y)\right) \right] \qquad (2)$$

Here $d$ is a similarity function on $Z$ (not necessarily satisfying the axioms for a distance). Note that we assume the group $G$ together with its algebraic structure to be known a priori and not inferred during the learning process. Its action over the latent space is defined in advance and constitutes the primary inductive bias for equivariant representation learning.
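The equivariance loss can be sketched as an empirical average over triples. The example below assumes, purely for illustration, that the group consists of translations acting on a Euclidean latent space by vector sum and that the similarity is the squared Euclidean distance; the function names and toy data are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two tuples of floats."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def act(g, z):
    """Latent action of a translation g on a latent point z."""
    return tuple(a + b for a, b in zip(g, z))

def equivariance_loss(phi, triples, d=sq_dist):
    """Empirical mean of d(g . phi(x), phi(y)) over triples (x, g, y = g . x)."""
    return sum(d(act(g, phi(x)), phi(y)) for x, g, y in triples) / len(triples)

# Toy data in the plane, where the data action is also a translation.
triples = [
    ((0.0, 1.0), (1.0, 1.0), (1.0, 2.0)),
    ((2.0, -1.0), (-0.5, 0.0), (1.5, -1.0)),
]

loss_equivariant = equivariance_loss(lambda x: (x[0] + 3.0, x[1]), triples)   # shifted frame
loss_broken = equivariance_loss(lambda x: (2.0 * x[0], x[1]), triples)        # scaling breaks equivariance
```

A representation that only changes the frame reaches zero loss, while one that rescales a coordinate is penalized.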

3.2 Learning to Decompose Class and Pose

Motivated by Proposition 2.1, we propose to set the latent space as:

$$Z = E \times G \qquad (3)$$

with $G$ acting trivially on $E$ and via multiplication on itself. Here, $E$ is any set which is meant to represent classes of the encoded data. Since there is in general no prior information about the action by symmetries on $X$ and its orbits, $E$ has to be set beforehand. Assuming $E$ is big enough to contain $X/G$, Proposition 2.1 shows that an isomorphic equivariant data representation is possible in $Z$. By fixing distances $d_E$ and $d_G$ on $E$ and $G$ respectively, we obtain a joint latent distance $d(z, z') = d_E(z_E, z'_E) + d_G(z_G, z'_G)$, where the subscripts denote the corresponding components. When $G$ is a group of matrices, a typical choice for $d_G$ is the (squared) Frobenius distance, i.e., the Euclidean distance for matrices seen as flattened vectors. The equivariance loss in Equation 2 then reads:

$$\mathcal{L} = \mathbb{E}_{x, g, y = g \cdot x}\left[ d_E\left(\varphi_E(x), \varphi_E(y)\right) + d_G\left(g \varphi_G(x), \varphi_G(y)\right) \right] \qquad (4)$$

Here we denoted the components of the representation map by $\varphi = (\varphi_E, \varphi_G)$ and omitted the parameter $\theta$ for simplicity. To spell things out, $d_E$ encourages data from the same orbit to lie close in $E$ (i.e., $\varphi_E$ is ideally invariant) while $d_G$ aims for equivariance with respect to multiplication on the pose component $G$.
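The decomposed loss can be sketched by treating the class and pose branches as two separate functions. The example assumes a translation group acting by vector sum, squared Euclidean distances on both components, and hypothetical toy data in which each datapoint is a pair of a shape label and a position.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two tuples of floats."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def class_pose_loss(phi_E, phi_G, triples):
    """Mean of d_E(phi_E(x), phi_E(y)) + d_G(g + phi_G(x), phi_G(y))."""
    total = 0.0
    for x, g, y in triples:
        class_term = sq_dist(phi_E(x), phi_E(y))          # invariance of the class
        moved_pose = tuple(a + b for a, b in zip(g, phi_G(x)))
        pose_term = sq_dist(moved_pose, phi_G(y))         # equivariance of the pose
        total += class_term + pose_term
    return total / len(triples)

ONE_HOT = {"heart": (1.0, 0.0), "square": (0.0, 1.0)}
triples = [
    (("heart", (0.0, 0.0)), (1.0, 2.0), ("heart", (1.0, 2.0))),
    (("square", (3.0, 1.0)), (-1.0, 0.5), ("square", (2.0, 1.5))),
]

pose = lambda x: x[1]
loss_ideal = class_pose_loss(lambda x: ONE_HOT[x[0]], pose, triples)
loss_leaky = class_pose_loss(lambda x: x[1], pose, triples)   # class branch leaks the pose
```

The ideal encoder reaches zero loss, whereas a class branch that depends on the pose is penalized by the invariance term.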

If $\varphi_E$ is injective then Proposition 2.1 guarantees lossless (i.e., isomorphic) representations, which we summarize in the following corollary:

Suppose that $\varphi_E$ is injective. Then $\varphi(g \cdot x) = g \cdot \varphi(x)$ for all $g \in G$, $x \in X$ if and only if $\varphi$ is an equivariant isomorphism.

In order to force injectivity, we propose a typical solution from the contrastive learning literature [chen2020simple], encouraging latent points to spread apart through a prior. To this end, we opt for the standard InfoNCE loss [oord2018representation], although other choices are possible. This means that we replace the term $d_E(\varphi_E(x), \varphi_E(y))$ in Equation 4 with the corresponding InfoNCE term.

The temperature hyperparameter of the InfoNCE loss controls the intensity of the spreading prior. Following [oord2018representation, chen2020simple], we set the class component $E$ as a sphere, which amounts to normalizing the output of $\varphi_E$ when $\varphi$ is a deep neural network. This allows one to deploy the cosine dissimilarity as $d_E$ and is known to lead to improved performance due to the compactness of $E$ [wang2020understanding].
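A minimal sketch of an InfoNCE-style class term on the sphere follows. It assumes a common alignment-plus-spread form of the loss and the convention $1 - \langle u, v \rangle$ for the cosine dissimilarity; both are assumptions made for illustration rather than the exact expression used by the authors, and the function names and default temperature are hypothetical.

```python
import math

def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    n = math.sqrt(sum(a * a for a in v))
    return tuple(a / n for a in v)

def cos_dissim(u, v):
    """Cosine dissimilarity 1 - <u, v> for unit vectors (assumed convention)."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

def info_nce(pairs, tau=0.5):
    """pairs: list of (e_x, e_y), class embeddings of x and of y = g . x."""
    anchors = [normalize(ex) for ex, _ in pairs]
    total = 0.0
    for ex, ey in pairs:
        ex, ey = normalize(ex), normalize(ey)
        alignment = cos_dissim(ex, ey) / tau     # pulls x and y together
        spread = math.log(                       # pushes the batch apart
            sum(math.exp(-cos_dissim(ex, a) / tau) for a in anchors) / len(anchors)
        )
        total += alignment + spread
    return total / len(pairs)

collapsed = info_nce([((1.0, 0.0), (1.0, 0.0)), ((1.0, 0.0), (1.0, 0.0))])
separated = info_nce([((1.0, 0.0), (1.0, 0.0)), ((0.0, 1.0), (0.0, 1.0))])
```

Under these assumptions, embeddings that collapse all classes to a single point receive a higher loss than embeddings that keep paired points aligned and distinct classes apart, which is the behaviour that encourages injectivity. A lower temperature sharpens this effect.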

3.3 Parametrizing via the Exponential Map

The output space of usual machine learning models such as deep neural networks is Euclidean. Our latent space (Equation 3) contains $G$ as a factor, which might be non-Euclidean as in the case of rotation groups such as $SO(n)$. In order to implement our representation learner it is thus necessary to parametrize the group $G$. To this end, we assume that $G$ is a differentiable manifold with composition and inversion being differentiable maps, i.e., that $G$ is a Lie group. One can then define the Lie algebra $\mathfrak{g}$ of $G$ as the tangent space to $G$ at $1$.

We propose to rely on the exponential map $\mathfrak{g} \to G$, denoted by $v \mapsto e^v$, to parametrize $G$. This means that the model outputs an element $v$ of $\mathfrak{g}$ that gets mapped into $G$ as $e^v$. Although the exponential map can be defined for general Lie groups by solving an appropriate ordinary differential equation, we focus on the case $G \subseteq GL(n)$. The Lie algebra is then contained in the space of $n \times n$ matrices and the exponential map amounts to the matrix Taylor expansion $e^v = \sum_{k \geq 0} v^k / k!$. For specific groups the latter can be simplified via closed formulas. For example, the exponential map of $\mathbb{R}^n$ is the identity while for $SO(3)$ it can be efficiently computed via Rodrigues’ formula [liang2018efficient].
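For three-dimensional rotations, the closed formula and the series can be compared directly. The sketch below uses plain nested lists for matrices; the helper names (`hat`, `exp_taylor`, `exp_rodrigues`) and the number of series terms are illustrative choices.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def hat(v):
    """Skew-symmetric matrix (element of the Lie algebra) associated with v."""
    x, y, z = v
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def exp_taylor(a, terms=30):
    """Truncated series sum_k a^k / k! for a 3x3 matrix a."""
    result = [[float(i == j) for j in range(3)] for i in range(3)]
    power = [row[:] for row in result]
    for k in range(1, terms):
        power = [[entry / k for entry in row] for row in matmul(power, a)]   # a^k / k!
        result = [[r + p for r, p in zip(rr, pr)] for rr, pr in zip(result, power)]
    return result

def exp_rodrigues(v):
    """Rodrigues' formula: I + sin(t)/t K + (1 - cos(t))/t^2 K^2 with t = |v|, K = hat(v)."""
    t = math.sqrt(sum(c * c for c in v))
    k = hat(v)
    k2 = matmul(k, k)
    if t < 1e-12:
        a, b = 1.0, 0.5                      # limits of the coefficients as t -> 0
    else:
        a, b = math.sin(t) / t, (1.0 - math.cos(t)) / t ** 2
    return [[float(i == j) + a * k[i][j] + b * k2[i][j] for j in range(3)] for i in range(3)]

v = (0.3, -1.0, 0.5)
rot = exp_rodrigues(v)
series = exp_taylor(hat(v))
max_gap = max(abs(rot[i][j] - series[i][j]) for i in range(3) for j in range(3))
```

The two computations agree up to floating-point error, and the output is an orthogonal matrix, so an unconstrained network output `v` is always mapped to a valid rotation.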

3.4 Connections to Disentanglement

Our equivariant representation learning framework is related to the popular notion of disentanglement [bengio_disentanglement, bvae]. Intuitively, in a disentangled representation a variation of a distinguished aspect in the data is reflected by a change of a single component in the latent space. Although there is no common agreement on a rigorous formulation of the notion [challenging_disentanglement], a proposal has been put forward in [higgins2018towards]. The presence of multiple dynamic aspects in the data is formalized as an action on $X$ by a decomposed group

$$G = G_1 \times \cdots \times G_n \qquad (6)$$

where each of the factors $G_i$ is responsible for the variation of a single aspect. A representation $\varphi: X \to Z$ is then defined to be disentangled if (i) there is a decomposition $Z = Z_1 \times \cdots \times Z_n$ where each $Z_i$ is acted upon trivially by the factors $G_j$ with $j \neq i$ and (ii) $\varphi$ is equivariant.

Our latent space (Equation 3) automatically yields disentanglement in this sense. Indeed, in the case of a group as in Equation 6 we put $Z_i = G_i$. In order to deal with the remaining factor $E$, a copy of the trivial group $G_{n+1} = \{1\}$ can be added to $G$ without altering it up to isomorphism, with $Z_{n+1} = E$. The group then acts on $Z$ as required for a disentangled latent space.
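A small sketch of this construction, assuming a hypothetical product of two translation factors and a latent point made of a class label and two pose coordinates:

```python
def act(g, z):
    """g = (g1, g2) acts on z = (e, z1, z2): trivially on e, by addition on each pose factor."""
    e, z1, z2 = z
    return (e, z1 + g[0], z2 + g[1])

z = ("heart", 0.5, -1.0)
moved_first = act((2.0, 0.0), z)    # only the first factor acts non-trivially
moved_second = act((0.0, 3.0), z)   # only the second factor acts non-trivially
```

Acting with an element of a single factor changes exactly one latent coordinate and leaves both the class and the other pose coordinate untouched.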

4 Related Work

Equivariant Representation Learning. Models relying on symmetry and equivariance have been studied in the context of representation learning. Those are typically trained on variations of the equivariance loss (Equation 2) and are designed for specific groups $G$ and actions on the latent space $Z$. The pioneering Transforming Autoencoders [hinton2011transforming] learn to represent image data translated by $G = \mathbb{R}^2$ in the pixel-plane, with $Z$ consisting of several copies of $G$ (‘capsules’) acting on itself. Although such models are capable of learning isomorphic representations, the orbits are not explicitly modelled in the latent space. In contrast, our invariant component $E$ is an interpretable alternative to multiple capsules, making orbits recoverable from the representation. Homeomorphic Autoencoders [homeomorphic] represent data in the group of three-dimensional rotations $SO(3)$. Such a latent space has no additional components dedicated to orbits, obtaining a representation that loses information about the intrinsic classes of data. Affine Equivariant Autoencoders [guo2019affine] deal with specific affine transformations of the pixel-plane (shearing an image, for example) and implement a latent action through a hand-crafted map. Groups of rotations linearly acting on a Euclidean latent space are explored in [worrall2017interpretable, dynamic_enviroments]. Since rotating a vector around itself has no effect, linear actions are not free (for $SO(n)$ with $n \geq 3$), which makes isomorphic representations impossible. Equivariant Neural Rendering [renderer] proposes a latent voxel grid on which $SO(3)$ acts approximately by rotating and interpolating values. In contrast, our latent group action is exact and thus induces no loss of information. We provide an empirical comparison to both linear Euclidean actions and Equivariant Neural Rendering in Section 5.

Convolutional Networks. Convolutional layers in neural networks [taco_group_equiv, cohengeneraltheory] satisfy equivariance a priori with respect to transformations of the pixel plane. They were originally introduced for (discretized) translations and later extended to more general groups [sphericalcnn, symmetricsets, cohen2019gauge]. However, they require data and group actions in a specific form. Abstractly speaking, data need to consist of vector fields over a base space (images seen as RGB fields over the pixel plane, for example) acted upon by $G$, which does not hold in general. Examples of symmetries not in this form are changes in perspective of first-person images of a scene and rotations of rigid objects on an image. Our model is instead applicable to arbitrary (Lie) group actions and infers equivariance in a data-driven manner. Moreover, equivariance through $G$-convolutions alone is hardly suitable for representation learning as the output dimension coincides with the input one. Dimensionality reduction techniques deployed together with convolutions, such as max-pooling or fully-connected layers, disrupt equivariance completely. The latent space in our framework is instead compressed and is ideally isomorphic to the data space $X$ (Proposition 2.1).

World Models. Analogously to group actions, Markov Decision Processes (MDPs) from reinforcement learning and control theory involve a possibly stochastic interaction with an environment via a set of moves. In general, no algebraic structure (such as a group composition) is assumed on the set of moves. In this context, a representation equivariant with respect to the action is referred to as World Model [ha2018world, kipf2019contrastive, park2022learning] or Markov Decision Process Homomorphism (MDPH) [van2020plannable]. MDPHs are usually deployed as pre-trained representations for downstream tasks or trained jointly with the agent for exploration purposes [curiosity]. However, the latent action of an MDPH is learned since no prior knowledge is assumed about the moves or the environment. This implies that the obtained representation is unstructured and uninterpretable. We instead assume that $G$ is a group acting (freely) on $X$, which enables us to define an interpretable, disentangled latent space that guarantees isomorphic equivariant representations. We provide an empirical comparison to MDPHs in Section 5.

Columns: Sprites, Shapes, Multi-Sprites, Chairs, Apartments.

Table 1: Datasets involved in our experiments, with the corresponding group of symmetries and number of orbits.

5 Experiments

Figure 3: Left: visualization of encodings through $\varphi$ from the Sprites, Chairs and Apartments datasets. The images display the projection to the annotated components of $Z$ and data are colored by their ground-truth class. Each latent orbit from Apartments is compared to the view from the top of the corresponding scene. Right: same visualization for the baseline models MDPH and Linear on the Chairs dataset.

5.1 Dataset Description

Our empirical investigation aims to assess our framework via both qualitative and quantitative analysis on datasets with a variety of symmetries. To this end we deploy five datasets summarized in Table 1: three with translational symmetry extracted from dSprites and 3DShapes [sprites, shapes], one with rotational symmetry extracted from ShapeNet [chang2015shapenet] and one simulating a mobile agent exploring apartments in first-person extracted from Gibson [xiazamirhe2018gibsonenv] and generated via the Habitat simulator [habitat19iccv]. Datapoints are triples $(x, g, y)$ where $x, y$ are images, $g \in G$ and $y = g \cdot x$. We refer to the Appendix for a more detailed description of the datasets.

5.2 Baselines and Implementation Details

We select MDP Homomorphisms (MDPHs; [van2020plannable, kipf2019contrastive, park2022learning]) as the main baseline model for comparison. In an MDPH the representation is learnt jointly with the latent action. Differently from us, an MDPH does not assume any prior knowledge on the set of moves nor any algebraic structure on the latter. This however comes at the cost of training an additional uninterpretable model for the latent action. The so-obtained representation is thus geometrically unstructured and does not come with the theoretical guarantees of our framework. For the Chairs dataset we additionally compare with the following two models suitable for groups of rotations. First, a model (Linear) with $Z = \mathbb{R}^3$ on which $SO(3)$ acts by matrix multiplication ([worrall2017interpretable, dynamic_enviroments]). Note that the action on $Z$ is no longer free and the model is thus forced to lose information in order to learn an equivariant representation. Second, a model called Equivariant Neural Renderer (ENR; [renderer]). In this case the latent space $Z$ is a discretized grid of vectors on which $SO(3)$ acts approximately by rotating the grid and interpolating the obtained values. Although the action on $Z$ is free, the latent discretization and consequent interpolation make the model only approximately equivariant. We refer to the Appendix for a more detailed description of the baselines and their implementation.

We implement the equivariant representation learner $\varphi$ as a ResNet [resnet], which is a deep convolutional neural network with residual connections. We train our models through stochastic gradient descent by means of the Adam optimizer. The distance $d_G$ is set as the squared Euclidean one for the Euclidean groups, while for the rotation group we instead deploy the Frobenius metric. The invariant component $E$ consists of a sphere (see Section 3) parametrized by the normalized output of the neurons in the last layer of $\varphi_E$.

5.3 Visualizations of the Representation

In this section we present visualizations of the latent space of our model (Equation 3), showcasing its geometric benefits. The preservation of symmetries coming from equivariance indeed makes it possible to transfer the intrinsic geometry of data explicitly to the representation. Moreover, the invariant component separates the orbits of the group action, making it possible to distinguish the intrinsic classes of data in the latent space. Finally, the representation from our model automatically disentangles factors of the group as discussed in Section 3.4.

Figure 3 (left) presents visualizations of encodings through $\varphi$ for the datasets Sprites, Chairs and Apartments. For each dataset we display the projection to one component of $E$ as well as a relevant component of the group $G$. Specifically, for Sprites we display the component corresponding to translations in the pixel plane, for Chairs we display a circle corresponding to one Euler angle, while for Apartments we display the component corresponding to translations in the physical world. For Apartments, we additionally compare the representation of each of the two apartments with the ground-truth view from the top.

As can be seen, in all cases the model correctly separates the orbits in $E$ through self-supervision alone. Since the orbits are isomorphic to the group $G$ itself, the model moreover preserves the geometry of each orbit separately. For Sprites, this means that (the displayed component of) each orbit is an isometric copy of the pixel-plane, with disentangled horizontal and vertical translations (Figure 3, top-left). For Apartments, this similarly means that each orbit exhibits an isometric copy of the real-world scene. One can recover a map of each of the explored scenes by, for example, estimating the density of data in the latent space (Figure 3, bottom-right) and further use the model to localize the agent within such a map. Our equivariant representation thus solves a localization and mapping task in a self-supervised manner and for multiple scenes simultaneously.

As a qualitative comparison, Figure 3 (right) includes visualizations for the models MDPH and Linear on the Chairs dataset. As can be seen, the latent space of MDPH is unstructured: the geometry of $G$ is not preserved and classes are not separated. This is because the latent action of MDPH is learned end-to-end and is thus uninterpretable and unconstrained a priori. For Linear the classes are organized as spheres in $Z = \mathbb{R}^3$, which are the orbits of the latent action by $SO(3)$. Such orbits are not isomorphic to $SO(3)$ (one Euler angle is forgotten) since the action is not free. This means that $\varphi$ loses information and does not represent the dataset faithfully.

5.4 Performance Comparison

Dataset Model 1 Step 10 Steps 20 Steps
Sprites Ours
Shapes Ours
Multi-Sprites Ours
Chairs Ours
Apartments Ours
Table 2: Hit-rate (mean and std over runs), with test trajectories of increasing length.

In this section we numerically compare our method to the equivariant representation learning frameworks described at the beginning of Section 5. We evaluate the models through hit-rate, which is a standard score that makes it possible to compare equivariant representations with different latent space geometries [kipf2019contrastive]. Given a test triple $(x, g, y)$, we say that ‘$x$ hits $y$’ if $\varphi(y)$ is the nearest neighbour in $Z$ of $g \cdot \varphi(x)$ among a random batch of encodings. For a test set, the hit-rate is then defined as the number of times $x$ hits $y$ divided by the test set size. The number of aforementioned random encodings is fixed. For each model, the nearest neighbour is computed with respect to the same latent metric as the one used for training. In order to test the performance of the models when acted upon multiple times in a row, we generate test sets where $g$ is a trajectory, i.e., it is factorized as $g = g_1 g_2 \cdots g_T$ for $T \in \{1, 10, 20\}$. Hit-rate is then computed after sequentially acting by the $g_i$’s in the latent space. This captures the accumulation of errors in the equivariant representation and thus evaluates the performance for long-term predictions. The size of each test set is a fixed fraction of the corresponding dataset.
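The hit-rate computation can be sketched as follows. The example assumes a one-dimensional translation group, an identity encoder and a small hand-made pool of competing datapoints; these choices and all function names are hypothetical and only serve to illustrate the score.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two tuples of floats."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hit_rate(phi, act, test_set, candidates, d=sq_dist):
    """test_set: list of (x, [g_1, ..., g_T], y); candidates: datapoints competing with y."""
    hits = 0
    for x, trajectory, y in test_set:
        z = phi(x)
        for g in trajectory:                 # act sequentially in the latent space
            z = act(g, z)
        pool = [phi(y)] + [phi(c) for c in candidates]
        nearest = min(range(len(pool)), key=lambda i: d(z, pool[i]))
        hits += nearest == 0                 # a hit: phi(y) is the nearest neighbour
    return hits / len(test_set)

identity = lambda x: x
exact_action = lambda g, z: (z[0] + g[0],)
drifting_action = lambda g, z: (z[0] + g[0] + 2.0,)   # error accumulating at every step

test_set = [((0.0,), [(1.0,), (1.0,)], (2.0,))]
candidates = [(5.0,), (-3.0,)]

rate_exact = hit_rate(identity, exact_action, test_set, candidates)
rate_drift = hit_rate(identity, drifting_action, test_set, candidates)
```

With an exact latent action the prediction lands on the target, while a latent action that drifts at every step ends closer to a competing datapoint, which is how errors accumulated along a trajectory lower the score.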

The results are presented in Table 2. As can be seen, all the models perform nearly perfectly on single-step predictions with the exception of Linear. For the latter the latent group action is not free, which prevents the model from learning a lossless equivariant representation and thus degrades the quality of predictions. On longer trajectories, however, our model outperforms the baselines by an increasing margin. MDPH accumulates errors due to the lack of structure in its latent space. The latent action of MDPH is indeed learned, which does not guarantee stability with respect to composition of multiple symmetries. The degradation of performance for MDPH is particularly evident in the case of Multi-Sprites, which is probably due to the large number of orbits and the consequent complexity of the prediction task. Our model is instead robust even in the presence of many orbits due to the dedicated invariant component in its latent space.

When the latent space is equipped with a group action, stability on long trajectories follows from associativity of the group composition and the action (see Definitions 2.1 and 2.2). This is evident from the results for the Chairs dataset, where our model and Linear outperform MDPH on longer trajectories and exhibit a stable hit-rate as the number of steps increases. Even though ENR carries a latent group action, it still accumulates errors due to the discretization of its latent space, i.e., the latent grid acted upon by $SO(3)$. Such discretization and the consequent interpolation make the latent action only approximately associative, causing errors to accumulate on long trajectories.

6 Conclusions and Limitations

In this work we addressed the problem of learning equivariant representations by decomposing the latent space into a group component and an invariant one. We theoretically showed that our representations are lossless, disentangled and preserve the geometry of data. We empirically validated our approach on a variety of groups, compared it to other equivariant representation learning frameworks and discussed applications to the problem of scene mapping.

Our formalism builds on the assumption that the group of symmetries is known a priori and not inferred from data. Additionally, we require access to relative symmetries between pairs of data. This is viable in applications to robotics, but can be problematic in other domains. If a data feature is not taken into account among symmetries, it will formally define distinct orbits. For example, a possible change in texture for images of rigid objects has to be part of the symmetries in order to still maintain shapes as the only intrinsic classes. A framework where the group structure is learned might be a feasible, although less interpretable, alternative to prior symmetry knowledge and constitutes an interesting line of future investigation.

7 Acknowledgements

This work was supported by the Swedish Research Council, the Knut and Alice Wallenberg Foundation and the European Research Council (ERC-BIRD-884807).


Appendix A

A.1 Proofs of Theoretical Results


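The following holds:

  • There is an equivariant isomorphism

    $$\mathcal{X} \simeq (\mathcal{X}/G) \times G,$$

    where $G$ acts trivially on the orbits and via multiplication on itself, i.e., $g \cdot (O, h) = (O, gh)$ for $g, h \in G$, $O \in \mathcal{X}/G$. In other words, each orbit can be identified equivariantly with the group itself.

  • Any equivariant map $\varphi: (\mathcal{X}/G) \times G \to (\mathcal{X}/G) \times G$ is a right multiplication on each orbit, i.e., for each orbit $O \in \mathcal{X}/G$ there is an $h_O \in G$ such that $\varphi(O, g) = (O', g h_O)$ for all $g \in G$, where $O'$ denotes the orbit component of $\varphi(O, 1)$. In particular, if $\varphi$ induces a bijection on orbits then it is an isomorphism.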


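We start by proving the first statement. Choose a system of representatives $S \subseteq \mathcal{X}$ for orbits, that is, $S$ contains exactly one element for each class. Consider the map $\psi: (\mathcal{X}/G) \times G \to \mathcal{X}$ given, for $s \in S$ and $g \in G$, by $\psi([s], g) = g \cdot s$, where $[s]$ denotes the orbit of $s$. It is straightforward to check that $\psi$ is indeed equivariant. Now if $g \cdot s = g' \cdot s'$ for $s, s' \in S$ and $g, g' \in G$, then $s$ and $s'$ are in the same orbit, which implies $s = s'$ because of uniqueness of representatives. But then $g \cdot s = g' \cdot s$ and, equivalently, $(g'^{-1} g) \cdot s = s$, from which we deduce $g = g'$ since the action is free. This shows that $\psi$ is injective. Finally, for $x \in \mathcal{X}$, one can write $x = g \cdot s$ for the representative $s$ of the orbit of $x$ and some $g \in G$, which means that $x = \psi([s], g)$. That is, $\psi$ is surjective and thus also bijective, which concludes the proof of the first statement.

Let us now prove the second statement. Consider an equivariant map $\varphi: (\mathcal{X}/G) \times G \to (\mathcal{X}/G) \times G$. For each orbit $O$ denote by $O'$ and $h_O$ the orbit and the group element such that $\varphi(O, 1) = (O', h_O)$. Then by equivariance $\varphi(O, g) = g \cdot \varphi(O, 1) = (O', g h_O)$, as desired. ∎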
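The first statement can be checked by brute force on a small example. The sketch below is illustrative only and is not taken from the paper: it assumes a toy setting where the cyclic group of order 4 acts on the integers modulo 12 by x ↦ x + 3g (mod 12), and all names (`act`, `orbit`, `iso`) are hypothetical. It builds the map that sends an orbit and a group element to the group element applied to the orbit's representative, and verifies that this map is bijective and equivariant.

```python
from itertools import product

# Toy assumption: X = Z_12, G = Z_4, action g.x = x + 3g (mod 12).
N, STEP, ORDER = 12, 3, 4
X = range(N)
G = range(ORDER)

def act(g, x):
    """Apply the group element g to the data point x."""
    return (x + STEP * g) % N

def orbit(x):
    """Return the orbit of x as a hashable set."""
    return frozenset(act(g, x) for g in G)

# The action is free: only the identity fixes a point.
is_free = all(act(g, x) != x for g in G if g != 0 for x in X)

# System of representatives: the smallest element of each orbit.
orbits = sorted({orbit(x) for x in X}, key=min)
rep = {O: min(O) for O in orbits}

def iso(O, g):
    """Map (orbit, group element) to the point g.s_O in X."""
    return act(g, rep[O])

# Bijectivity: every point of X is hit exactly once.
image = [iso(O, g) for O, g in product(orbits, G)]
is_bijective = sorted(image) == list(X)

# Equivariance: acting on the group factor matches acting on X.
is_equivariant = all(
    iso(O, (g + h) % ORDER) == act(g, iso(O, h))
    for O in orbits for g in G for h in G
)
```

Since the toy group is abelian, this check does not distinguish left from right multiplication; it only illustrates the orbit-times-group decomposition.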

A.2 Description of Datasets

In our experiments we deploy the following datasets, which are also summarized in Table 1:

Sprites: extracted from dSprites [sprites]. It consists of grayscale images depicting three sprites (heart, square, ellipse) translating and dilating in the pixel plane. The group of symmetries is $G = \mathbb{R}^3$: a factor $\mathbb{R}^2$ translates the sprites in the pixel plane while the last copy of $\mathbb{R}$, which is isomorphic via exponentiation to $\mathbb{R}_{>0}$ equipped with multiplication, acts through dilations. The dataset size is and there are three orbits, each corresponding to a sprite.

Shapes: extracted from 3DShapes [shapes]. It consists of colored images depicting four objects (cube, cylinder, sphere, pill) on a background divided into wall, floor and sky. The group of symmetries is again a product of three factors, but with the action given by color shifting: each factor acts by changing the color of the corresponding scene component among object, wall and floor. The dataset size is and there are four orbits, each corresponding to a shape.

Multi-Sprites: obtained from Sprites by overlapping images of three colored sprites (with fixed scale). The group of symmetries is $G = \mathbb{R}^2 \times \mathbb{R}^2 \times \mathbb{R}^2$, each of whose factors translates one of the three sprites in the pixel plane. The added colors endow the sprites with an implicit ordering, which is necessary for the action to be well-defined. The dataset size is . Since the scene is composed of three possibly repeating sprites, there are $3^3 = 27$ orbits corresponding to all the configurations.

Chairs: extracted from ShapeNet [chang2015shapenet]. It consists of colored images depicting three types of chair from different angles. The group of symmetries is $G = SO(3)$, which rotates the depicted chair. The dataset size is and there are three orbits, each corresponding to a type of chair.

Apartments: extracted from Gibson [xiazamirhe2018gibsonenv] and generated via the Habitat simulator [habitat19iccv]. It consists of colored images of first-person renderings of two apartments (‘Elmira’ and ‘Convoy’). The data simulate the visual perception of an agent, such as a mobile robot, exploring the two apartments and collecting images and symmetries. The latter belong to the group $G = SE(2)$ of two-dimensional orientation-preserving Euclidean isometries and coincide with the possible moves (translations and rotations) of the agent. One can realistically imagine the agent perceiving the symmetries through some form of odometry, i.e., measurement of movement. Note that the action of $G$ on $\mathcal{X}$ is only partially defined, since applying $g \in G$ to $x \in \mathcal{X}$ is not always possible because of obstacles. As long as the agent is able to reach any part of each of the apartments, the latter still coincide with the two orbits of the group action. The dataset size .

All the datasets consist of triples $(x, g, y)$ where $x, y \in \mathcal{X}$ are images, $g \in G$ and $y = g \cdot x$.
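For a group such as the two-dimensional orientation-preserving isometries used in Apartments, the relative symmetry in a triple can be illustrated with explicit pose arithmetic. The following is a minimal sketch under assumptions that are not stated in the paper: group elements are stored as tuples `(theta, tx, ty)`, the agent is assumed to know an absolute pose for each image, and the helper names (`compose`, `inverse`, `relative_symmetry`, `close`) are hypothetical.

```python
import math

# An element (theta, tx, ty) rotates the plane by theta, then translates by (tx, ty).

def compose(g, h):
    """Return the product g*h: apply h first, then g."""
    tg, xg, yg = g
    th, xh, yh = h
    c, s = math.cos(tg), math.sin(tg)
    return (tg + th, xg + c * xh - s * yh, yg + s * xh + c * yh)

def inverse(g):
    """Return the inverse isometry of g."""
    t, x, y = g
    c, s = math.cos(t), math.sin(t)
    return (-t, -(c * x + s * y), s * x - c * y)

def relative_symmetry(pose_x, pose_y):
    """Return the group element g such that g * pose_x equals pose_y."""
    return compose(pose_y, inverse(pose_x))

def close(g, h, tol=1e-9):
    """Compare two elements componentwise (angles are not wrapped)."""
    return all(abs(a - b) < tol for a, b in zip(g, h))

# Hypothetical absolute poses at which two images were taken.
pose_x = (0.3, 1.0, 2.0)
pose_y = (1.2, -0.5, 4.0)
g = relative_symmetry(pose_x, pose_y)
recovered = compose(g, pose_x)

# Composition is associative up to floating-point error.
a, b, c = (0.4, 1.0, 0.0), (-1.1, 0.0, 2.0), (2.0, -3.0, 0.5)
is_associative = close(compose(compose(a, b), c), compose(a, compose(b, c)))
```

In practice the relative move would come from odometry, as described above, rather than from absolute poses; the sketch only shows that such a move is well-defined and composes consistently.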

A.3 Description of Baselines

In our experiments we compare our framework with the following models:

MDP Homomorphisms (MDPH; [van2020plannable, kipf2019contrastive]): a framework where the representation is learnt jointly with the latent action. The two models are trained with the equivariance loss (cf. Equation 2). In order to avoid trivial solutions, such as a constant representation together with a latent action equal to the identity for every symmetry, an additional ‘hinge’ loss term is optimized that encourages encodings to spread apart in the latent space. This is analogous to (the denominator of) the InfoNCE loss (Equation 5), which we rely upon to avoid orbit collapse. Unlike our method, an MDPH assumes neither prior knowledge of the group of symmetries nor any algebraic structure on it. However, this comes at the cost of training an additional model and losing the guarantees provided by a group structure and by our framework in particular.
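The role of the hinge term can be illustrated with a toy computation. The exact form of the loss is not reproduced in this section, so the sketch below makes its own assumptions: squared Euclidean distance, a margin-based hinge on a single negative sample in the spirit of contrastive losses, and hypothetical names (`sq_dist`, `toy_loss`, `margin`). It shows that a collapsed encoding has zero equivariance error yet is still penalized, while a spread-out encoding is not.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two latent vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def toy_loss(z_pred, z_target, z_negative, margin=1.0):
    """Equivariance term plus a hinge term on one negative sample.

    z_pred:     latent action applied to the encoding of x (assumed given)
    z_target:   encoding of y = g.x
    z_negative: encoding of an unrelated data point
    """
    equivariance = sq_dist(z_pred, z_target)
    hinge = max(0.0, margin - sq_dist(z_negative, z_target))
    return equivariance + hinge

# Constant encoder with identity latent action: equivariant but collapsed.
collapsed = toy_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0])

# Encodings that are equivariant and spread apart beyond the margin.
spread = toy_loss([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
```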

Linear: a model with latent space $\mathbb{R}^3$ on which $G = SO(3)$ acts by matrix multiplication. Such a latent space has been employed in previous works [worrall2017interpretable, dynamic_enviroments]. The model is trained with the same loss as MDPH, i.e., the equivariance loss together with the additional hinge term that prevents collapse of the representation. Note that the action on $\mathbb{R}^3$ is no longer free (even away from the origin), since rotating a vector around itself has no effect. Unlike our method, the model is thus forced to lose information in order to learn an equivariant representation.
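The lack of freeness can be verified numerically. The sketch below is an illustration and not part of the original experiments: it uses Rodrigues' rotation formula in plain Python, with hypothetical helper names (`dot`, `cross`, `rotate`), to show that a non-trivial rotation about the axis spanned by a vector leaves that vector fixed, while a rotation about a different axis moves it.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate(axis, angle, v):
    """Rotate v by `angle` radians about `axis` (Rodrigues' formula)."""
    norm = math.sqrt(dot(axis, axis))
    k = tuple(c / norm for c in axis)
    c, s = math.cos(angle), math.sin(angle)
    kxv = cross(k, v)
    kv = dot(k, v)
    return tuple(v[i] * c + kxv[i] * s + k[i] * kv * (1 - c) for i in range(3))

v = (1.0, 2.0, 2.0)

# A rotation by 1 radian about v itself is not the identity, yet it fixes v.
fixed = rotate(v, 1.0, v)

# The same angle about the z-axis moves v.
moved = rotate((0.0, 0.0, 1.0), 1.0, v)
```

Two distinct rotations can therefore send an encoding to the same point, which is the sense in which such a latent space cannot retain the full pose.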

Equivariant Neural Renderer (ENR; [renderer]): a model with a tensorial latent space, thought of as a feature vector attached to each point of a grid in $\mathbb{R}^3$. The group approximately acts on the latent space by rotating the grid and interpolating the obtained values. The model is trained jointly with a decoder and optimizes a variation of the equivariance loss that incorporates reconstruction. We use the standard binary cross-entropy as the reconstruction metric for (normalized) images. Although the action on the latent space is free, the latent discretization and consequent interpolation make the model only approximately equivariant.
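The effect of interpolation on a discretized latent action can be seen already in one dimension. The sketch below is an analogy under stated assumptions and not the ENR implementation: it replaces grid rotation with a translation of a periodic signal on a six-point grid, resampled by linear interpolation, and the function name `shift` is hypothetical. Two half-step shifts do not compose to one full-step shift, so the interpolated action only approximately respects the group composition.

```python
import math

def shift(signal, t):
    """Translate a periodic signal by t grid cells using linear interpolation."""
    n = len(signal)
    out = []
    for i in range(n):
        pos = (i - t) % n          # location to sample in the original signal
        lo = math.floor(pos)       # left neighbor on the grid
        w = pos - lo               # interpolation weight of the right neighbor
        out.append((1 - w) * signal[lo % n] + w * signal[(lo + 1) % n])
    return out

signal = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]

# Acting once with the composed group element: an exact relabeling of the grid.
direct = shift(signal, 1.0)

# Acting twice with half of it: interpolation blurs the signal.
twice = shift(shift(signal, 0.5), 0.5)

# Largest discrepancy between the two results.
composition_error = max(abs(a - b) for a, b in zip(direct, twice))
```

Here `direct` is `[0.0, 0.0, 0.0, 1.0, 0.0, 0.0]` while `twice` is `[0.0, 0.0, 0.25, 0.5, 0.25, 0.0]`, so the discrepancy is 0.5; repeating such steps along a trajectory compounds the blur.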

All the models are implemented with the same architecture (ResNet-; [resnet]), with the exception of ENR. For the latter we deploy the original architecture from [renderer], consisting of a similar ResNet, with the main difference being 3D-convolutional layers around the latent space. The latent action model for MDPH is implemented as a two-layer deep neural network ( neurons per layer) with ReLU activation functions. For MDPH we set the latent dimensionality to coincide with the output dimensionality of our model.