# Symmetry-Based Disentangled Representation Learning requires Interaction with Environments

Finding a generally accepted formal definition of a disentangled representation in the context of an agent behaving in an environment is an important challenge towards the construction of data-efficient autonomous agents. Higgins et al. recently proposed Symmetry-Based Disentangled Representation Learning, a definition based on a characterization of symmetries in the environment using group theory. We build on their work and make observations, theoretical and empirical, that lead us to argue that Symmetry-Based Disentangled Representation Learning cannot be based only on fixed data samples. Agents should interact with the environment to discover its symmetries. All of our experiments can be reproduced on Colab: http://bit.do/eKpqv.


## 1 Introduction

Disentangled Representation Learning aims at finding a low-dimensional vectorial representation of the world for which the underlying structure of the world is separated into disjoint parts (i.e., disentangled) corresponding to the actual compositional nature of the world. Previous work (Raffin et al., 2019) has shown that agents capable of learning disentangled representations can perform data-efficient policy learning. However, there is no generally accepted formal definition of disentanglement in Representation Learning, which prevents significant progress in this emerging field.

Recent efforts have been made towards finding a proper definition (Locatello et al., 2018). In particular, Higgins et al. (2018) define Symmetry-Based Disentangled Representation Learning (SBDRL), taking inspiration from the successful study of symmetry transformations in Physics. Their definition focuses on the transformation properties of the world. They argue that transformations that change only some properties of the underlying world state, while leaving all other properties invariant, are what gives exploitable structure to any kind of data. They distinguish between linear disentangled representations, which model the effect of these transformations on the representation linearly, and non-linear ones. Supposedly, the former should be more useful for downstream tasks such as Reinforcement Learning or auxiliary prediction tasks. Their definition is intuitive and provides principled resolutions to several points of contention regarding what disentangling is. For clarity, we refer to a representation as SB-disentangled if it is disentangled in the sense of SBDRL, and as LSB-disentangled if linearly disentangled.

We build on the work of Higgins et al. (2018) and make observations, theoretical and empirical, that lead us to argue that SBDRL requires interaction with environments. The necessity of having interaction has been suggested before (Thomas et al., 2017). We are able to give theoretical and empirical evidence of why it is needed for SBDRL. As in the original work, we base our analysis on a simple environment, where we can formally define and manipulate a SB-disentangled representation. This simple environment is 2D, composed of one circular agent on a plane that can move left-right and up-down. The world is cyclic: whenever the agent steps beyond the boundary of the world, it is placed at the opposite end (e.g. stepping up at the top of the grid places the object at the bottom of the grid).

We prove, for this environment, that the minimal number of dimensions of the representation required for it to be LSB-disentangled is counter-intuitive. Indeed, the natural number of dimensions required to describe the state of the world is not enough to describe its symmetries in a linear way, which is supposedly ideal for subsequent tasks. Additionally, learning a non-linear SB-disentangled representation is possible, but current approaches are not designed to model the effect of the world’s symmetries on the representation, a key aspect of SBDRL which we present later. We thus ask: how is one supposed to, in practice, learn a (L)SB-disentangled representation?

We propose two options that arise naturally: one where the representation and the effect of the world's symmetries on it are learned separately, and one where they are learned jointly. For both scenarios, we formally define what the proper representation to learn could be, using the formalism of SBDRL. We propose empirical implementations that successfully approximate these analytically defined representations. Both empirical approaches make use of transitions rather than still images, which validates the main point of this paper: Symmetry-Based Disentangled Representation Learning requires interaction with the environment.

Our contributions are the following:

• We prove that learning a LSB-disentangled representation of dimension 2 is impossible in the world considered in this paper, due to the cyclical nature of the environment dynamics.

• We propose alternatives for learning linear and non-linear SB-disentangled representations, both using transitions rather than still observations. We validate both approaches empirically.

• Based on these observations, we take a step back and argue that interaction with the environment, i.e. the use of transitions, is necessary for SBDRL.

## 2 Symmetry-Based Disentangled Representation Learning

Higgins et al. (2018) define Symmetry-Based Disentangled Representation Learning (SBDRL) as an attempt to formalize disentanglement in Representation Learning. The core idea is that SB-disentanglement of a representation is defined with respect to a particular decomposition of the symmetries of the environment. Symmetries are transformations of the environment that leave some aspects of it unchanged. For instance, for an agent on a plane, translations of the agent along the x-axis leave its y coordinate unchanged. They formalize this using group theory: groups are composed of these transformations, and group actions describe the effect of the transformations on the state of the world and on the representation.

The proposed definition of SB-disentanglement supposes that these symmetries are formally defined as a group G that can be decomposed into a direct product G = G_1 × … × G_n. We now recall the formal definition of a SB-disentangled representation w.r.t. this group decomposition. We advise the reader to refer to the detailed work of Higgins et al. (2018) for any clarification. Let W be the set of world-states. We suppose that there is a generative process b: W → O leading from world-states to observations (these could be pixel, retinal, or any other potentially multi-sensory observations), and an inference process h: O → Z leading from observations to an agent's representations. We consider the composition f = h ∘ b: W → Z. Suppose also that there is a group G of symmetries acting on W via a group action ·_W: G × W → W. What we would like is to find a corresponding group action ·_Z: G × Z → Z so that the symmetry structure of W is reflected in Z. We also want the group action ·_Z to be disentangled, which means that applying g_i ∈ G_i to z ∈ Z leaves all sub-spaces of Z unchanged but the one corresponding to the transformation g_i. Formally, the representation Z is SB-disentangled with respect to the decomposition G = G_1 × … × G_n if:

1. There is a group action ·_Z: G × Z → Z.

2. The map f: W → Z is equivariant between the group actions on W and Z: g ·_Z f(w) = f(g ·_W w).

3. There is a decomposition Z = Z_1 × … × Z_n such that each Z_i is fixed by the action of all G_j, j ≠ i, and affected only by G_i.

This definition of SB-disentangled representations does not make any assumptions on what form the group action should take when acting on the relevant disentangled vector subspace. However, many subsequent tasks may benefit from a SB-disentangled representation where the group actions transform their corresponding disentangled subspace linearly. Such representations are termed linear SB-disentangled representations, which we refer to as LSB-disentangled representations.
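To make the three conditions concrete, here is a minimal numeric sketch for the cyclic grid world of this paper. It assumes a 10×10 grid and takes f to be the (scaled) position itself — an illustrative choice of ours, not the learned representation:

```python
import numpy as np

N = 10  # grid size: an N x N cyclic world (illustrative choice)

def act_W(g, w):
    """Action of g = (n_x, n_y) on a world-state w = (x, y): cyclic translation."""
    return ((w[0] + g[0]) % N, (w[1] + g[1]) % N)

def f(w):
    """Toy representation f: W -> Z, here the scaled position itself."""
    return np.array([w[0] / N, w[1] / N])

def act_Z(g, z):
    """Candidate group action on Z mirroring the cyclic translations."""
    return np.array([(z[0] + g[0] / N) % 1.0, (z[1] + g[1] / N) % 1.0])

# Condition 2 (equivariance): f(g . w) = g . f(w) for every state and generator
generators = [(1, 0), (-1, 0), (0, 1), (0, -1)]
for x in range(N):
    for y in range(N):
        for g in generators:
            assert np.allclose(f(act_W(g, (x, y))), act_Z(g, f((x, y))))

# Condition 3 (disentangled decomposition): G_x only affects the first coordinate of Z
z = f((3, 7))
z_moved = act_Z((1, 0), z)
assert z_moved[1] == z[1] and z_moved[0] != z[0]
```

Note that the action on Z here is non-linear because of the wrap-around (the modulo); the theoretical analysis below shows this cannot be avoided in two dimensions.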

## 3 Considered environment

In this paper, we consider a simplification of the environment studied in the original paper (Higgins et al., 2018). This environment is 2D, composed of one circular agent on a plane that can move left-right and up-down, see Fig.1. Whenever the agent steps beyond the boundary of the world, it is placed at the opposite end (e.g. stepping up at the top of the grid places the object at the bottom of the grid). The world-states can be described in two dimensions: the (x, y) position of the agent. All of our results are based on this environment. It is simple, yet presents the basis for a navigation environment in 2D. We chose this environment because we are able to theoretically define SB-disentangled representations for it, without making any approximation. We implement this simple environment using Flatland (Caselles-Dupré et al., 2018).
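As a reference point, the dynamics described above can be sketched in a few lines (a hypothetical minimal re-implementation; the experiments in the paper use Flatland):

```python
import numpy as np

class CyclicGridWorld:
    """Minimal stand-in for the paper's 2D cyclic world (not the Flatland implementation)."""
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # left, right, down, up

    def __init__(self, size=10):
        self.size = size
        self.pos = (0, 0)

    def step(self, action):
        dx, dy = self.ACTIONS[action]
        # Wrap around: stepping past a boundary re-enters on the opposite side
        self.pos = ((self.pos[0] + dx) % self.size, (self.pos[1] + dy) % self.size)
        return self.pos

env = CyclicGridWorld(size=5)
env.pos = (4, 0)
assert env.step(1) == (0, 0)  # stepping right at the edge wraps to the left
```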

## 4 Theoretical analysis

We provide a theorem that proves it is impossible to learn a LSB-disentangled representation of dimension 2 in the environment presented in Sec.3 (the result also applies to the environment considered in Higgins et al. (2018)).

###### Theorem 1.

For the considered world, there exists no LSB-disentangled representation f: W → Z w.r.t. the group decomposition G = G_x × G_y, such that dim(Z) = 2 and f is not trivial.

###### Proof.

Proof by contradiction. The key element of the proof is that the two actual dimensions of the environment are not linear but cyclic, hence the impossibility of modelling two cyclic dimensions using two linear dimensions. See Appendix A for the full proof.

## 5 Symmetry-Based Disentangled Representation Learning in practice requires transitions

We now consider the problem of learning, in practice, SB-disentangled and LSB-disentangled representations for the world considered in Sec.3.

### 5.1 First option: SB-disentanglement with 2 dimensions

Theorem 1 states that we cannot learn a 2-dimensional LSB-disentangled representation for the environment. We thus consider learning a 2-dimensional SB-disentangled representation. We started by reproducing the results in (Higgins et al., 2018): we used a variant of the current state-of-the-art disentangled representation learning model, CCI-VAE. The learned representation corresponds (up to a scaling factor) to the world-state (x, y), i.e. the position of the agent. This intuitively seems like a reasonable approximation to a disentangled representation.

However, once the representation is learned, we have no idea how the group action of symmetries affect the representation, even though it is at the core of the definition of SBDRL. This is where the necessity for transitions rather than still observations comes into play. Indeed, learning about the effect of transformations on the world implies learning about the dynamics of the environment, which requires transitions.

Starting from the learned non-linear SB-disentangled representation, we propose to learn the group action of G on Z using a separate model. This way, we have a complete description of the SB-disentangled representation. This approach effectively decouples the learning of physics from vision, as in (Ha & Schmidhuber, 2018). The second option would be to jointly learn vision and physics, which we demonstrate in the next experiment with LSB-disentangled representations.

In practice, we learn f with a variant of CCI-VAE, and then use a multi-layer perceptron to learn the group action on Z, such that f is an equivariant map between the actions on W and Z. The results are presented in Fig.2, where we observe that the learned group action correctly approximates the cyclical movement of the agent. We thus have learned a properly SB-disentangled representation of the world, w.r.t. the group decomposition G = G_x × G_y.

### 5.2 Second option: LSB-disentanglement with 4 dimensions

We now propose a method to learn a LSB-disentangled representation and the effect of the group action on it. To accomplish this, we start with a theoretically constructed LSB-disentangled representation, based on an example given in Higgins et al. (2018). The representation is defined as follows, using 4 dimensions:

• f is defined as f: (x, y) ↦ (e^(i·2πx/N), e^(i·2πy/N)) ∈ C², where N is the size of the grid

• ρ is defined as ρ(g): (z_1, z_2) ↦ (e^(i·2πn_x/N)·z_1, e^(i·2πn_y/N)·z_2), for g a translation by (n_x, n_y)

In this representation, the position (x, y) is mapped to two complex numbers (z_1, z_2). For each translation (on the x-axis or y-axis), the associated group representation is a rotation of the complex plane associated with the specific axis. This representation linearly accounts for the cyclic symmetry present in the environment. Using CCI-VAE with 4 dimensions fails to learn this representation: we verified experimentally that only 2 dimensions were actually used when learning (for encoding the position), and the two remaining were ignored. We need to take transitions into account and enforce linearity in order to learn this specific representation.
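Assuming a grid of size N, this construction can be checked numerically; the explicit complex coordinates and rotation representation below are our reconstruction of the example:

```python
import numpy as np

N = 10  # grid size (illustrative choice)

def f(x, y):
    """Map the discrete position to two unit complex numbers, one per axis."""
    return np.array([np.exp(1j * 2 * np.pi * x / N), np.exp(1j * 2 * np.pi * y / N)])

def rho(n_x, n_y):
    """Linear group representation: a translation by (n_x, n_y) acts as two plane rotations."""
    return np.diag([np.exp(1j * 2 * np.pi * n_x / N), np.exp(1j * 2 * np.pi * n_y / N)])

# Linear equivariance: rho(g) @ f(w) == f(g . w), with cyclic translations on W
rng = np.random.default_rng(0)
for _ in range(100):
    x, y, n_x, n_y = rng.integers(0, N, size=4)
    lhs = rho(n_x, n_y) @ f(x, y)
    rhs = f((x + n_x) % N, (y + n_y) % N)
    assert np.allclose(lhs, rhs)
```

The wrap-around of the world is absorbed by the periodicity of the complex exponential, which is exactly why the action becomes linear in 4 (real) dimensions.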

We propose a method that allows us to learn this LSB-disentangled representation. Once again, rather than using still observations, we generate a dataset of transitions, and use it to learn the 4-dimensional LSB-disentangled representation with a specific VAE architecture we term Forward-VAE. The core idea is to enforce linearity of the transitions in the representation space.

We begin by re-writing the complex-valued function ρ(g) as a real-valued function:

 ρ(g): R⁴ → R⁴, v ↦ ρ(g)(v) = A*(g)·v (1)

A*(g) is a block-diagonal matrix, composed of 2×2 rotation matrices. Let's consider the environment in Sec.3. The agent has 4 actions: go left, right, up or down. We associate each action with a corresponding matrix Â with trainable weights.

For instance, if g is a translation on the x-axis, the corresponding matrix is

A*(g) = [cos(α) −sin(α) 0 0; sin(α) cos(α) 0 0; 0 0 1 0; 0 0 0 1]

and we associate the actions go right/left with corresponding matrices Â(go right/left) of the same form, where the entries of the upper-left 2×2 block are trainable parameters.

We would like the representation f that we learn to satisfy f(g ·_W w) = Â(g)·f(w). We thus enforce the representation to satisfy it, as illustrated in Fig.1. The training procedure is presented in Algorithm 1 in Appendix B. For each transition in a batch, we compute z_t and z_{t+1} using the encoder part of the VAE. Then we decode z_t with the decoder and compute the reconstruction loss and annealed KL divergence as in (Caselles-Dupré et al., 2019). Then we compute ẑ_{t+1} = Â(a_t)·z_t and compute the forward loss, which is the MSE with z_{t+1}. We then backpropagate w.r.t. the full loss function of Forward-VAE:

 LForward−VAE =Lreconstruction+γt⋅LKL+Lforward (2)
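The forward-loss term alone can be sketched as follows, with hypothetical names (the encoder, decoder, reconstruction and KL terms are omitted):

```python
import numpy as np

def rotation_block(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def action_matrix(theta_x, theta_y):
    """4x4 block-diagonal matrix: one 2x2 rotation per axis."""
    A = np.eye(4)
    A[:2, :2] = rotation_block(theta_x)
    A[2:, 2:] = rotation_block(theta_y)
    return A

def forward_loss(z_t, z_tp1, A_hat):
    """MSE between the predicted next code A_hat @ z_t and the encoded next code z_tp1."""
    z_pred = A_hat @ z_t
    return np.mean((z_pred - z_tp1) ** 2)

# With the ideal matrix, the forward loss on a consistent transition is zero
N = 10
z_t = np.array([np.cos(0.0), np.sin(0.0), np.cos(0.3), np.sin(0.3)])
A_go_right = action_matrix(2 * np.pi / N, 0.0)
z_tp1 = A_go_right @ z_t
assert np.isclose(forward_loss(z_t, z_tp1, A_go_right), 0.0)
```

In the actual model the gradient of this loss flows both into the trainable entries of Â and into the encoder producing z_t and z_{t+1}.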

The results are presented in Fig.2 and Appendix C. Forward-VAE correctly learns a representation where the two complex dimensions correspond to the position of the agent. Plus, we observe that the learned matrices Â are very good approximations of the ideal matrices A* defined above: the mean squared difference between them is very small.

## 6 Discussion & Conclusion

Discussion. We used the inductive bias given by the theoretical construction of a LSB-disentangled representation to design the action matrices and their trainable weights. This construction is specific to this example. However, the idea of having an action matrix for each action is extendable. If each action is high-level and associated with a symmetry, then one can perform SBDRL. Still, it requires high-level actions that represent these symmetries. One potential way to find these actions is through active search (Soatto, 2011), as suggested in (Higgins et al., 2018).

Learning a LSB-disentangled representation is supposedly beneficial for subsequent tasks. However, this remains to be demonstrated, as other work has even challenged the benefit of learning disentangled representations over non-disentangled ones (Locatello et al., 2018). In Appendix E, we present preliminary results indicating that (L)SB-disentangled representations might indeed be beneficial for subsequent tasks. Overall, the field of Disentangled Representation Learning needs more investigation on this matter in order to move forward. We further discuss our results and the approach of SBDRL in Appendix D.

Conclusion. Using theoretical and empirical arguments, we demonstrated that SBDRL (Higgins et al., 2018), a proposed definition for disentanglement in Representation Learning, requires interaction with the environment. We then proposed two methods to perform SBDRL in practice, both of which are successful empirically. We believe SBDRL provides a new perspective on disentanglement which can be promising for Representation Learning in the context of an agent acting in an environment.

## Acknowledgements

We thank Irina Higgins for insightful mail discussions.

## Appendix A Proofs

### a.1 Trivial representations

We first define trivial representations and then prove that they are LSB-disentangled. We will then use this definition to prove Theorem 1.

###### Definition 1.

f: W → Z is a trivial representation if and only if f is constant.

If f is a trivial representation, we thus have that each state of the world has the same representation.

###### Proposition 1.

If f is a trivial representation, then f is LSB-disentangled w.r.t. every group decomposition.

We prove Proposition 1 which states that trivial representations are LSB-disentangled.

###### Proof.

The definition of LSB-disentangled representation of dimension 2 is:

1. There is a linear group action ·_Z: G × Z → Z. It can thus be viewed as a group representation ρ: G → GL(Z).

2. The map f: W → Z is equivariant between the actions on W and Z.

3. There is a decomposition Z = Z_1 × Z_2 or Z = Z_1 ⊕ Z_2 such that each Z_i is fixed by the action of all G_j, j ≠ i, and affected only by G_i.

Let the action ·_Z be the identity: ∀g ∈ G, ∀z ∈ Z, g ·_Z z = z, which is linear.

We have that f is constant: ∀w ∈ W, f(w) = z_0. We can verify that f is equivariant between the actions on W and Z:

 ρ(g)(f(w))=f(w)=f(g⋅Ww) (3)

Finally, every w ∈ W has the same representation z_0, so each Z_i is fixed by the action of any subgroup of G. Hence, for all decompositions of G, point 3. of the definition is satisfied. ∎

### a.2 It is impossible to learn a LSB-disentangled representation of dimension 2 in the considered environment

We prove Theorem 1, which states that it is impossible to learn a LSB-disentangled representation of dimension 2 in the environment presented in Sec.3 (the result also applies to the environment considered in Higgins et al. (2018)).

###### Theorem.

For the considered world, there exists no LSB-disentangled representation f: W → Z w.r.t. the group decomposition G = G_x × G_y, such that dim(Z) = 2 and f is not trivial.

###### Proof.

Suppose that there exists a LSB-disentangled representation f: W → Z w.r.t. the group decomposition G = G_x × G_y, such that dim(Z) = 2. Then, by definition:

1. There is a linear group action ·_Z: G × Z → Z. It can thus be viewed as a group representation ρ: G → GL(Z).

2. The map f: W → Z is equivariant between the actions on W and Z.

3. There is a decomposition Z = Z_1 × Z_2 or Z = Z_1 ⊕ Z_2 such that each Z_i is fixed by the action of all G_j, j ≠ i, and affected only by G_i.

We now prove that if these conditions are verified, f is necessarily constant. Consequently, each state of the world has the same representation, which is a trivial representation. So, if f is a LSB-disentangled representation of dimension 2 w.r.t. G = G_x × G_y, then f is the trivial representation.

We thus suppose that there exists a LSB-disentangled representation f of dimension 2 w.r.t. the group decomposition G = G_x × G_y. Hence, we have, by point 2. of the definition:

 g⋅Zf(w) =f(g⋅Ww) (4)

Since the action ·_Z is linear, we can view it as a group representation ρ: G → GL(Z), as mentioned in point 1. of the definition:

 g⋅Zf(w) =ρ(g)(f(w)) (5)

Because w = (x, y) and dim(Z) = 2, we can re-write f as:

 f(w) =f((x,y)) (6) =(f1(x,y),f2(x,y))

Hence, combining (4) and (5):

 f(g⋅W(x,y)) =ρ(g)((f1(x,y),f2(x,y))) (7)

We can decompose any g ∈ G into the composition of elements of each subgroup of G, i.e. g = g_x ∘ g_y with (g_x, g_y) ∈ G_x × G_y. Plus, by definition of the decomposition Z = Z_1 × Z_2, each Z_i is fixed by the action of all G_j, j ≠ i, and affected only by G_i. We can thus re-write both terms of Equation (7).

 f(g⋅W(x,y)) =(f1((gx(x),gy(y))),f2((gx(x),gy(y)))) since g⋅W(x,y)=(gx(x),gy(y)) (8)
 ρ(g)((f1(x,y),f2(x,y))) =(ρx(gx)(f1(x,y)),ρy(gy)(f2(x,y))) by definition of ρ (9)

Hence, Equation 7 becomes:

 (f1((gx(x),gy(y))),f2((gx(x),gy(y)))) =(ρx(gx)(f1(x,y)),ρy(gy)(f2(x,y))) (10)

We will now prove that f1 is necessarily constant. The same argument applies for f2.

From Equation (10), we have:

 f1((gx(x),gy(y))) =ρx(gx)(f1(x,y)) (11)

gx and gy are respectively translations on the x-axis and y-axis. Let N be the size of the grid; then there exist integers nx, ny s.t. gx(x) = (x + nx) mod N and gy(y) = (y + ny) mod N. When at the edge of the world, if the object translates to the right, it returns to the left, hence the modulo operation that represents this cycle. Hence:

 f1((gx(x),gy(y))) =f1(((x+nx)modN,(y+ny)modN)) (12) =ρx(gx)(f1(x,y))

The key argument of the proof lies in the fact that ρx(gx) is necessarily cyclic of order 2N (the minimal order can be smaller than 2N, but characterizing the minimal order is not useful for this proof). Let's compose ρx(gx) 2N times:

 ρx(gx)(2N)(f1(x,y)) =f1(((x+2N⋅nx)modN,(y+2N⋅ny)modN)) (13) =f1((x,y))

We now use the fact that ρx(gx) is a linear application of R, thus:

 ρx(gx)∈GL(R)⟹∃(a(gx),b(gx))∈R2s.t.∀x∈Rρx(gx)(x)=a(gx)⋅x+b(gx) (14)

For notation purposes, we drop the dependence on gx of the coefficients a and b of the real linear application ρx(gx), and Equation (14) applied to f1(x, y) gives:

 ρx(gx)(f1(x,y)) =a⋅f1(x,y)+b (15)

Hence, using Equation (13), we can develop the term ρx(gx)^(2N)(f1(x,y)):

 ρ2Nx(gx)(f1(x,y)) =a2N⋅f1(x,y)+b⋅2N−1∑i=0ai (16) =f1(x,y)

Define c = b·∑_{i=0}^{2N−1} a^i; we have:

 a2N⋅f1(x,y)+c =f1(x,y) (17) ⟺ (a2N−1)⋅f1(x,y)+c =0

Equation (17) is verified for all (x, y) ∈ W. Let (x1, y1), (x2, y2) ∈ W:

 {(a2N−1)⋅f1(x1,y1)+c=0(a2N−1)⋅f1(x2,y2)+c=0⟹(a2N−1)⋅(f1(x1,y1)−f1(x2,y2))=0 (18)

We can now derive conditions on a or f1. From Equation (18), we know that either f1 is constant or a^(2N) = 1. If a^(2N) = 1, then Equation (17) simplifies to c = 0. So either ρx(gx) is the identity function or f1 is constant. The same argument applies to f2 and ρy(gy), hence we have that either f is constant or ρ(g) is the identity. By plugging the second option into Equation (7), and using the fact that the translations act transitively on W, we have that f is constant.

Hence f is necessarily constant, which implies that f is a trivial representation. ∎

## Appendix B Details about Forward-VAE

### b.1 Definition of Â

A*(g) is a block-diagonal rotation matrix of dimension 4, composed of 2×2 blocks. For instance, if g is a translation on the x-axis, the corresponding matrix is:

[cos(α) −sin(α) 0 0; sin(α) cos(α) 0 0; 0 0 1 0; 0 0 0 1]

Similarly, for g a translation on the y-axis, the corresponding matrix is:

[1 0 0 0; 0 1 0 0; 0 0 cos(α) −sin(α); 0 0 sin(α) cos(α)]

Let's consider the environment in Sec.3. The agent has 4 actions: go left, right, up or down. We associate each action with a corresponding matrix Â with trainable weights. Thus, we associate the actions go up and go down with a matrix whose lower-right 2×2 block is trainable (the upper-left block fixed to the identity), and the actions go left and go right with a matrix whose upper-left 2×2 block is trainable (the lower-right block fixed to the identity).
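A quick numerical sanity check of this construction, assuming the ideal angle α = 2π/N for a grid of size N = 10 (an illustrative choice of ours):

```python
import numpy as np

N = 10
alpha = 2 * np.pi / N  # rotation angle for a one-step translation (assumption: grid of size N)

def A_star_x(sign=+1):
    """Ideal matrix for go right (+) / go left (-): rotation in the upper-left block."""
    c, s = np.cos(sign * alpha), np.sin(sign * alpha)
    A = np.eye(4)
    A[:2, :2] = [[c, -s], [s, c]]
    return A

def A_star_y(sign=+1):
    """Ideal matrix for go up (+) / go down (-): rotation in the lower-right block."""
    c, s = np.cos(sign * alpha), np.sin(sign * alpha)
    A = np.eye(4)
    A[2:, 2:] = [[c, -s], [s, c]]
    return A

# N one-step translations bring the agent back to its start: A*^N = I
assert np.allclose(np.linalg.matrix_power(A_star_x(), N), np.eye(4))
# x and y matrices commute (the two axes are independent subgroups)
assert np.allclose(A_star_x() @ A_star_y(), A_star_y() @ A_star_x())
# go left is the inverse of go right
assert np.allclose(A_star_x(+1) @ A_star_x(-1), np.eye(4))
```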

## Appendix C Additional results

We observe that the mean squared difference between the ideal matrices A* and the learned matrices Â is very small. Hence, we have:

 Â(go left/go right) ≈ A*(go left/go right) = [cos(±α) −sin(±α) 0 0; sin(±α) cos(±α) 0 0; 0 0 1 0; 0 0 0 1]

The result is quite surprising, as we did not completely explicitly optimize for this matrix (at least for the cos/sin structure). Plus, there is no instability in training.

One issue with the approximation not being exact is instability under composition. The determinants of rotation matrices are stable under composition, since:

 det(AB)=det(A)det(B)

As rotation matrices have a determinant equal to 1, the composition operation is stable for rotations.

However, as the matrices Â are only approximations of rotation matrices, their determinant is approximately 1 but not exactly 1. This is why, as many compositions are performed, the determinant of the resulting matrix either collapses to zero or explodes towards infinity. We provide evidence for this phenomenon in Fig.3.
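This compounding effect can be reproduced with a deliberately perturbed rotation matrix; the 0.1% scale error below is an arbitrary stand-in for the learned approximation error:

```python
import numpy as np

theta = 2 * np.pi / 10
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # exact rotation, det(R) = 1

A_hat = (1 + 1e-3) * R  # perturbed "learned" matrix: det ~ 1.002, not exactly 1

det_exact = np.linalg.det(np.linalg.matrix_power(R, 5000))
det_drift = np.linalg.det(np.linalg.matrix_power(A_hat, 5000))

assert np.isclose(det_exact, 1.0)  # compositions of exact rotations keep det = 1
assert det_drift > 100             # the small error compounds multiplicatively and explodes
```

Since det(AB) = det(A)det(B), after k compositions the determinant is det(Â)^k, which drifts exponentially away from 1 for any det(Â) ≠ 1.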

## Appendix D Discussion

It is important to note that Forward-VAE successfully learns even though there is a possible source of instability in training: the targets for the physics loss are constantly changing throughout training, as the encoder is being trained.

The benefit of using transitions rather than still observations for representation learning in the context of an agent acting in an environment has been proposed, discussed and implemented in previous work (Thomas et al., 2017; Raffin et al., 2019). In this work, however, we emphasize that using transitions is not merely a beneficial option, but is compulsory in the context of the current definition of SBDRL for an agent acting in an environment.

We make a connection between SBDRL and the Good Regulator Theorem (Conant & Ross Ashby, 1970). This principle states that, with regard to the brain, insofar as it is successful and efficient as a regulator for survival, it must proceed, in learning, by the formation of a model (or models) of its environment. In SBDRL, the aim is likewise to find a representation that incorporates information about the dynamics of the environment.

Applying SBDRL to more complex environments is not straightforward. For instance, consider adding an object to the environment studied in this paper. Then the group structure of the symmetries of the world is broken when the agent is close to the object. However, the symmetries are conserved locally. One approach would be to start from this local property to learn an approximate SB-disentangled representation.

## Appendix E Using (L)SB-disentangled representations for downstream tasks

In this section we wish to answer the following question: is it increasingly better to use a non-disentangled, a non-linear SB-disentangled, or a LSB-disentangled representation for downstream tasks?

We define better in terms of sample efficiency, final performance, and performance with restricted capacity classifiers/restricted amount of data.

For the choice of downstream task, we select the task of learning an inverse model, which consists in predicting the action a_t from two consecutive states (s_t, s_{t+1}).

As a LSB-disentangled representation models the interaction with the environment linearly, it intuitively should be increasingly easier to learn an inverse model from: a non-disentangled representation, a non-linear SB-disentangled representation, and a LSB-disentangled representation.

In order to test this hypothesis, we selected a well-established implementation (Scikit-learn (Pedregosa et al., 2011)) of a well-studied classifier (Random Forest (Breiman, 2001)). We collected 10k transitions (o_t, a_t, o_{t+1}). We trained the following models and baselines to compare:

• LSB-disentangled representation of dimension 4: Forward-VAE trained as in Sec.5.2.

• SB-disentangled representation of dimension 2: CCI-VAE variant trained as in Sec.5.1.

• Non-disentangled representation of dimension 2: Auto-encoder, non-disentangled baseline.

• SB-disentangled representation of dimension 4: CCI-VAE trained as in Sec.5.1 but with 4 dimensions, baseline to control for the effect of the size of the representation.

For each model, once trained, we created a dataset of transitions (z_t, a_t, z_{t+1}) in the corresponding representation space. We then report the 10-fold cross-validation mean validation accuracy as a function of the maximum depth parameter of the random forest, which controls the capacity of the classifier.
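The evaluation protocol can be sketched with synthetic codes standing in for the learned representations — the ideal LSB codes and action matrices below are our stand-ins, not the trained models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
N = 10
alpha = 2 * np.pi / N

def block_rot(theta_x, theta_y):
    """4x4 block-diagonal matrix: one 2x2 rotation per axis."""
    A = np.eye(4)
    for i, t in ((0, theta_x), (2, theta_y)):
        A[i:i+2, i:i+2] = [[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]]
    return A

# One ideal LSB action matrix per action: left, right, down, up
action_mats = [block_rot(-alpha, 0), block_rot(alpha, 0),
               block_rot(0, -alpha), block_rot(0, alpha)]

# Synthetic transitions (z_t, a_t, z_t+1) in the LSB representation space
X, Y = [], []
for _ in range(2000):
    x, y = rng.integers(0, N, size=2)
    z_t = np.array([np.cos(alpha * x), np.sin(alpha * x),
                    np.cos(alpha * y), np.sin(alpha * y)])
    a = int(rng.integers(0, 4))
    X.append(np.concatenate([z_t, action_mats[a] @ z_t]))
    Y.append(a)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(clf, np.array(X), np.array(Y), cv=5)
assert scores.mean() > 0.9  # the action is easily recovered from pairs of LSB codes
```

In the actual experiments the codes z_t come from the trained models listed above, and capacity is varied through the forest's maximum depth.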

We first observe that in all cases, either LSB or SB-disentangled representations are performing best. In terms of final performance, all models meet at the upper 100% accuracy limit if given enough data and a classifier with enough capacity.

However, if we consider a constraint on training set size and a fixed high capacity (see Fig.??), we can see that using a SB-disentangled representation is superior to the other options. We refer to the capacity of the classifier as "high" if increasing the capacity parameter does not lead to an increase in validation accuracy.

Moreover, if we consider a fixed training set size and a constraint on the classifier’s capacity, using LSB-disentangled representation is the best option.

In conclusion, we observed that it is easier for a small capacity classifier to solve the task using a LSB-disentangled representation and it is easier to solve the task using less data with a SB-disentangled representation. This indicates that (L)SB-disentanglement is indeed useful for downstream task solving.