1 Introduction
Disentangled Representation Learning aims at finding a low-dimensional vector representation of the world in which the underlying structure of the world is separated into disjoint parts (i.e., disentangled) corresponding to the actual compositional nature of the world. Previous work (Raffin et al., 2019) has shown that agents capable of learning disentangled representations can perform data-efficient policy learning. However, there is no generally accepted formal definition of disentanglement in Representation Learning, which prevents significant progress in this emerging field.
Recent efforts have been made towards finding a proper definition (Locatello et al., 2018). In particular, Higgins et al. (2018) define Symmetry-Based Disentangled Representation Learning (SBDRL), taking inspiration from the successful study of symmetry transformations in Physics. Their definition focuses on the transformation properties of the world. They argue that transformations that change only some properties of the underlying world state, while leaving all other properties invariant, are what gives exploitable structure to any kind of data. They distinguish between linear disentangled representations, which model the effect of these transformations on the representation linearly, and non-linear ones. Supposedly, the former should be more useful for downstream tasks such as Reinforcement Learning or auxiliary prediction tasks. Their definition is intuitive and provides principled resolutions to several points of contention regarding what disentangling is. For clarity, we refer to a representation as SB-disentangled if it is disentangled in the sense of SBDRL, and as LSB-disentangled if linearly disentangled.
We build on the work of Higgins et al. (2018) and make observations, theoretical and empirical, that lead us to argue that SBDRL requires interaction with environments. The necessity of having interaction has been suggested before (Thomas et al., 2017). We are able to give theoretical and empirical evidence of why it is needed for SBDRL. As in the original work, we base our analysis on a simple environment, where we can formally define and manipulate an SB-disentangled representation. This simple environment is 2D, composed of one circular agent on a plane that can move left-right and up-down. The world is cyclic: whenever the agent steps beyond the boundary of the world, it is placed at the opposite end (e.g., stepping up at the top of the grid places the agent at the bottom of the grid).
We prove, for this environment, that the minimal number of dimensions of the representation required for it to be LSB-disentangled is counterintuitive. Indeed, the natural number of dimensions required to describe the state of the world is not enough to describe its symmetries in a linear way, which is supposedly ideal for subsequent tasks. Additionally, learning a non-linear SB-disentangled representation is possible, but current approaches are not designed to model the effect of the world's symmetries on the representation, a key aspect of SBDRL which we present later. We thus ask: how is one supposed to, in practice, learn a (L)SB-disentangled representation?
We propose two options that arise naturally, one where the representation and the effect of the world's symmetries on it are learned separately, and one where they are learned jointly. For both scenarios, we formally define what could be the proper representation to learn, using the formalism of SBDRL. We propose empirical implementations that are able to successfully approximate these analytically defined representations. Both empirical approaches make use of transitions rather than still images, which validates the main point of this paper: Symmetry-Based Disentangled Representation Learning requires interaction with the environment.
Our contributions are the following:

We prove that learning an LSB-disentangled representation of dimension 2 is impossible in the world considered in this paper, due to the cyclical nature of the environment dynamics.

We propose alternatives for learning linear and non-linear SB-disentangled representations, both using transitions rather than still observations. We validate both approaches empirically.

Based on these observations, we take a step back and argue that interaction with the environment, i.e., the use of transitions, is necessary for SBDRL.
2 Symmetry-Based Disentangled Representation Learning
Higgins et al. (2018) define Symmetry-Based Disentangled Representation Learning (SBDRL) as an attempt to formalize disentanglement in Representation Learning. The core idea is that SB-disentanglement of a representation is defined with respect to a particular decomposition of the symmetries of the environment. Symmetries are transformations of the environment that leave some aspects of it unchanged. For instance, for an agent on a plane, translations of the agent along the x-axis leave its y-coordinate unchanged. They formalize this using group theory. Groups are composed of these transformations, and group actions are the effect of the transformations on the state of the world and on the representation.
The proposed definition of SB-disentanglement supposes that these symmetries are formally defined as a group G that can be decomposed into a direct product G = G_1 × ... × G_n. We now recall the formal definition of an SB-disentangled representation w.r.t. this group decomposition. We advise the reader to refer to the detailed work of Higgins et al. (2018) for any clarification. Let W be the set of world-states. We suppose that there is a generative process b : W → O leading from world-states to observations (these could be pixel, retinal, or any other potentially multisensory observations), and an inference process h : O → Z leading from observations to an agent's representations. We consider the composition f = h ∘ b : W → Z. Suppose also that there is a group G of symmetries acting on W via a group action ·_W : G × W → W. What we would like is to find a corresponding group action ·_Z : G × Z → Z so that the symmetry structure of W is reflected in Z. We also want the group action to be disentangled, which means that applying G_i to Z leaves all subspaces of Z unchanged but the one corresponding to the transformation G_i. Formally, the representation Z is SB-disentangled with respect to the decomposition G = G_1 × ... × G_n if:

1. There is a group action ·_Z : G × Z → Z.

2. The map f : W → Z is equivariant between the group actions on W and Z: f(g ·_W w) = g ·_Z f(w).

3. There is a decomposition Z = Z_1 × ... × Z_n such that each Z_i is fixed by the action of all G_j, j ≠ i, and affected only by G_i.
This definition of SB-disentangled representations does not make any assumptions on what form the group action should take when acting on the relevant disentangled vector subspace. However, many subsequent tasks may benefit from an SB-disentangled representation where the group actions transform their corresponding disentangled subspace linearly. Such representations are termed linear SB-disentangled representations, which we refer to as LSB-disentangled representations.
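To make the definition concrete, the following minimal sketch checks equivariance and disentanglement on a toy cyclic world. The grid size n, the identity-like representation, and the integer actions are illustrative assumptions, not the paper's learned models:

```python
n = 4  # assumed grid size (illustrative; any n works)

def act_w(g, w):
    """Action of g = (dx, dy) on a world-state w = (x, y), with wrap-around."""
    return ((w[0] + g[0]) % n, (w[1] + g[1]) % n)

def f(w):
    """A candidate representation map f = h . b; here simply the state itself."""
    return w

def act_z(g, z):
    """Corresponding action on Z: disentangled (but not linear)."""
    return ((z[0] + g[0]) % n, (z[1] + g[1]) % n)

states = [(x, y) for x in range(n) for y in range(n)]
group = [(dx, dy) for dx in range(n) for dy in range(n)]

# Equivariance: f(g . w) == g . f(w) for every transformation and state
assert all(f(act_w(g, w)) == act_z(g, f(w)) for g in group for w in states)

# Disentanglement: elements of G_x = {(dx, 0)} leave the Z_y coordinate fixed
assert all(act_z((dx, 0), z)[1] == z[1] for dx in range(n) for z in states)
```

Note that the action on Z here is disentangled but not linear, which is exactly the distinction between SB- and LSB-disentanglement.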
3 Considered environment
In this paper, we consider a simplification of the environment studied in the original paper (Higgins et al., 2018). This environment is 2D, composed of one circular agent on a plane that can move left-right and up-down, see Fig.1. Whenever the agent steps beyond the boundary of the world, it is placed at the opposite end (e.g., stepping up at the top of the grid places the agent at the bottom of the grid). The world-states can be described in two dimensions: the (x, y) position of the agent. All of our results are based on this environment. It is simple, yet presents the basis for a navigation environment in 2D. We chose this environment because we are able to define SB-disentangled representations theoretically, without making any approximation. We implement this simple environment using Flatland (Caselles-Dupré et al., 2018).
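A minimal sketch of this environment's discrete dynamics (the grid size n and the action names are assumptions; the paper's implementation uses Flatland):

```python
# Assumed grid size; the paper implements this world in Flatland with a
# circular agent, here reduced to its discrete dynamics.
n = 10

MOVES = {"left": (-1, 0), "right": (1, 0), "down": (0, -1), "up": (0, 1)}

def step(state, action):
    """One environment transition: translate the agent, wrapping cyclically."""
    x, y = state
    dx, dy = MOVES[action]
    return ((x + dx) % n, (y + dy) % n)

# Stepping up at the top of the grid places the agent at the bottom:
assert step((3, n - 1), "up") == (3, 0)
assert step((0, 5), "left") == (n - 1, 5)
```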
4 Theoretical analysis
We provide a theorem that proves it is impossible to learn an LSB-disentangled representation of dimension 2 in the environment presented in Sec.3 (the result also applies to the environment considered in Higgins et al. (2018)).
Theorem 1.
For the considered world, there exists no LSB-disentangled representation f : W → Z w.r.t. the group decomposition G = G_x × G_y such that dim(Z) = 2 and f is not trivial.
Proof.
Proof by contradiction. The key element of the proof is that the two actual dimensions of the environment are not linear but cyclic, hence the impossibility of modelling two cyclic dimensions using two linear dimensions. See Appendix A for the full proof.
5 Symmetry-Based Disentangled Representation Learning in practice requires transitions
We now consider the problem of learning, in practice, SB-disentangled and LSB-disentangled representations for the world considered in Sec.3.
5.1 First option: SB-disentanglement with 2 dimensions
Theorem 1 states that we cannot learn a 2-dimensional LSB-disentangled representation for the environment. We thus consider learning a 2-dimensional SB-disentangled representation. We started by reproducing the results in (Higgins et al., 2018): we used a variant of the current state-of-the-art disentangled representation learning model, CCI-VAE. The learned representation corresponds (up to a scaling factor) to the world-state (x, y), i.e., the position of the agent. This intuitively seems like a reasonable approximation of a disentangled representation.
However, once the representation is learned, we have no idea how the group action of the symmetries affects the representation, even though it is at the core of the definition of SBDRL. This is where the necessity for transitions rather than still observations comes into play. Indeed, learning about the effect of transformations on the world implies learning about the dynamics of the environment, which requires transitions.
Starting from the learned non-linear SB-disentangled representation, we propose to learn the group action of G on Z using a separate model. This way, we have a complete description of the SB-disentangled representation. This approach effectively decouples the learning of physics from the learning of vision, as in (Ha & Schmidhuber, 2018). The second option is to jointly learn vision and physics, which we demonstrate in the next experiment with LSB-disentangled representations.
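The paper trains a multi-layer perceptron for this second step. As a dependency-free stand-in, the per-action effect on a 1-dimensional cyclic coordinate (the learned z for one axis, up to scale) can be estimated from transitions with a circular mean; the grid size and the exact form of the learned representation are assumptions:

```python
import math

n = 10  # assumed grid size; z approximates the agent's x position (up to scale)

def step_x(x, dx):
    return (x + dx) % n  # cyclic world dynamics along one axis

# Transitions (z_t, a_t, z_{t+1}) for actions left (-1) and right (+1)
transitions = [(x, a, step_x(x, a)) for x in range(n) for a in (-1, 1)]

def fit_action_shift(trans, action):
    """Estimate the effect of one action on the representation with a circular
    mean, so wrap-around transitions (e.g. n-1 -> 0) are handled correctly."""
    angles = [2 * math.pi * (z1 - z0) / n for z0, a, z1 in trans if a == action]
    s = sum(math.sin(t) for t in angles) / len(angles)
    c = sum(math.cos(t) for t in angles) / len(angles)
    return math.atan2(s, c) * n / (2 * math.pi)

# The recovered action effects match the true dynamics, wrap-around included
assert abs(fit_action_shift(transitions, +1) - 1.0) < 1e-9
assert abs(fit_action_shift(transitions, -1) + 1.0) < 1e-9
```

The circular mean is what makes this work: a naive average of z_{t+1} − z_t would be biased by the boundary-crossing transitions, which is precisely the cyclical structure the learned group action must capture.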
In practice, we learn f with a variant of CCI-VAE, and then use a multi-layer perceptron to learn the group action on Z, such that f is an equivariant map between the actions on W and Z. The results are presented in Fig.2, where we observe that the learned group action correctly approximates the cyclical movement of the agent. We thus have learned a properly SB-disentangled representation of the world, w.r.t. the group decomposition G = G_x × G_y.

5.2 Second option: LSB-disentanglement with 4 dimensions
We now propose a method to learn an LSB-disentangled representation and the effect of the group action on it. To accomplish this, we start with a theoretically constructed LSB-disentangled representation, based on an example given in Higgins et al. (2018). The representation is defined as follows, using 4 dimensions:

z_x is defined as z_x = e^{i·2πx/n}, where n is the size of the grid

z_y is defined as z_y = e^{i·2πy/n}
In this representation, the (x, y) position is mapped to two complex numbers (z_x, z_y). For each translation (on the x-axis or y-axis), the associated group representation is a rotation of the complex plane associated to the specific axis. This representation linearly accounts for the cyclic symmetry present in the environment. Using CCI-VAE with 4 dimensions fails to learn this representation: we verified experimentally that only 2 dimensions were actually used when learning (for encoding the (x, y) position), and the two remaining ones were ignored. We need to take transitions into account and enforce linearity in order to learn this specific representation.
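This ideal representation and its linear group action can be checked numerically; the grid size n is an assumption:

```python
import math

n = 10  # assumed grid size

def f(x, y):
    """The 4-dimensional representation: (Re z_x, Im z_x, Re z_y, Im z_y)."""
    tx, ty = 2 * math.pi * x / n, 2 * math.pi * y / n
    return [math.cos(tx), math.sin(tx), math.cos(ty), math.sin(ty)]

def go_right(z, theta=2 * math.pi / n):
    """Group representation of a +1 translation on x: rotate the z_x plane."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * z[0] - s * z[1], s * z[0] + c * z[1], z[2], z[3]]

# The rotation maps f(x, y) exactly onto f(x+1 mod n, y), including across
# the cyclic boundary: the symmetry acts linearly on this representation.
for x in range(n):
    z_next = go_right(f(x, 3))
    assert all(abs(a - b) < 1e-9 for a, b in zip(z_next, f((x + 1) % n, 3)))
```

Note that the same linear map handles the wrap-around transition (x = n−1 to x = 0), which no linear action on a 2-dimensional representation can do, per Theorem 1.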
We propose a method that allows learning this LSB-disentangled representation. Once again, rather than using still observations, we generate a dataset of transitions, and use it to learn the 4-dimensional LSB-disentangled representation with a specific VAE architecture we term Forward-VAE. The core idea is to enforce linearity of the transitions in the representation space.
We begin by rewriting the complex-valued function f as a real-valued function:

f(x, y) = (cos(2πx/n), sin(2πx/n), cos(2πy/n), sin(2πy/n))    (1)
The associated group representation ρ(g) is a block-diagonal matrix of dimension 4, composed of 2x2 rotation matrices. Let's consider the environment in Sec.3. The agent has 4 actions: go left, right, up or down. We associate each action a with a corresponding matrix A_a with trainable weights.
For instance, if g_x is a translation on the x-axis, the corresponding matrix is the block-diagonal rotation acting on the (z_x) block with angle ±2π/n and leaving the (z_y) block fixed, and we associate actions go right/left with corresponding matrices A_right/A_left of the form

A = [[w_1, w_2, 0, 0], [w_3, w_4, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],

where the w_i are trainable parameters.
We would like the representation f that we learn to satisfy f(g ·_W w) = ρ(g)(f(w)). We thus enforce the representation to satisfy it, as illustrated in Fig.1. The training procedure is presented in Algorithm 1 in Appendix B. For each observation o_t in a batch, we compute z_t = f(o_t) and z_{t+1} = f(o_{t+1}) using the encoder part of the VAE. Then we decode z_t with the decoder and compute the reconstruction loss and annealed KL divergence as in (Caselles-Dupré et al., 2019). Then we compute ẑ_{t+1} = A_{a_t} · z_t and compute the forward loss, which is the MSE with z_{t+1}: L_forward = ||A_{a_t} · z_t − z_{t+1}||². We then backpropagate w.r.t. the full loss function of Forward-VAE:
L_Forward-VAE = L_reconstruction + L_KL + L_forward    (2)
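The objective above can be sketched in plain Python; the equal weighting of the three terms is an assumption (the paper anneals the KL term), and the tensor machinery of the actual implementation is omitted:

```python
import math

def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def matvec(A, z):
    return [sum(a * b for a, b in zip(row, z)) for row in A]

def forward_vae_loss(recon, target, mu, logvar, A_a, z_t, z_t1):
    """Sketch of the Forward-VAE objective of Equation (2). The weighting of
    the terms is an assumption; the paper anneals the KL term as in
    Caselles-Dupre et al. (2019)."""
    l_recon = mse(recon, target)                       # reconstruction loss
    l_kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)   # Gaussian KL divergence
                      for m, lv in zip(mu, logvar))
    l_forward = mse(matvec(A_a, z_t), z_t1)            # || A_a z_t - z_{t+1} ||^2
    return l_recon + l_kl + l_forward

# Perfect reconstruction, prior-matching posterior and exact forward
# prediction give a loss of zero:
I = [[1.0, 0.0], [0.0, 1.0]]
assert forward_vae_loss([0.2], [0.2], [0.0], [0.0], I, [1.0, 2.0], [1.0, 2.0]) == 0.0
```

The forward term is what couples the action matrices A_a to the encoder: both are trained jointly, which is the "learn vision and physics jointly" option.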
The results are presented in Fig.2 and Appendix C. Forward-VAE correctly learns a representation where the two complex dimensions correspond to the position of the agent. Plus, we observe that the learned matrices A_a are very good approximations of the ideal matrices ρ(g) defined above: the mean squared difference between them is very small.
6 Discussion & Conclusion
Discussion. We used the inductive bias given by the theoretical construction of an LSB-disentangled representation to design the action matrices and their trainable weights. This construction is specific to this example. However, the idea of having an action matrix for each action is extendable. If each action is high-level and associated to a symmetry, then one can perform SBDRL. Still, it requires high-level actions that represent these symmetries. One potential way to find these actions is through active search (Soatto, 2011), as suggested in (Higgins et al., 2018).
Learning an LSB-disentangled representation is supposedly beneficial for subsequent tasks. However, this remains to be demonstrated, as other works have even challenged the benefit of learning disentangled representations over non-disentangled ones (Locatello et al., 2018). In Appendix E, we present preliminary results that indicate (L)SB-disentangled representations might indeed be beneficial for subsequent tasks. Overall, the field of Disentangled Representation Learning needs more investigation on this matter in order to move forward. We further discuss our results and the approach of SBDRL in Appendix D.
Conclusion. Using theoretical and empirical arguments, we demonstrated that SBDRL (Higgins et al., 2018), a proposed definition of disentanglement in Representation Learning, requires interaction with the environment. We then proposed two methods to perform SBDRL in practice, both of which are empirically successful. We believe SBDRL provides a new perspective on disentanglement which can be promising for Representation Learning in the context of an agent acting in an environment.
Acknowledgements
We thank Irina Higgins for insightful mail discussions.
References
 Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 Caselles-Dupré et al. (2018) Hugo Caselles-Dupré, Louis Annabi, Oksana Hagen, Michael Garcia-Ortiz, and David Filliat. Flatland: a lightweight first-person 2d environment for reinforcement learning. arXiv preprint arXiv:1809.00510, 2018.
 Caselles-Dupré et al. (2019) Hugo Caselles-Dupré, Michael Garcia-Ortiz, and David Filliat. S-TRIGGER: Continual state representation learning via self-triggered generative replay. arXiv preprint arXiv:1902.09434, 2019.
 Conant & Ross Ashby (1970) Roger C Conant and W Ross Ashby. Every good regulator of a system must be a model of that system. International journal of systems science, 1(2):89–97, 1970.
 Ha & Schmidhuber (2018) David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. arXiv preprint arXiv:1809.01999, 2018.
 Higgins et al. (2018) Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.
 Locatello et al. (2018) Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
 Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Raffin et al. (2019) Antonin Raffin, Ashley Hill, Kalifou René Traoré, Timothée Lesort, Natalia Díaz-Rodríguez, and David Filliat. Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. arXiv preprint arXiv:1901.08651, 2019.
 Soatto (2011) Stefano Soatto. Steps towards a theory of visual information: Active perception, signal-to-symbol conversion and the interplay between sensing and control. arXiv preprint arXiv:1110.2053, 2011.
 Thomas et al. (2017) Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
Appendix A Proofs
A.1 Trivial representations
We first define trivial representations and then prove that they are LSB-disentangled. We will then use this definition to prove Theorem 1.
Definition 1.
f : W → Z is a trivial representation if and only if f is constant.

If f is a trivial representation, we thus have that each state of the world has the same representation.
Proposition 1.
If f is a trivial representation, then f is LSB-disentangled w.r.t. every group decomposition.
We prove Proposition 1, which states that trivial representations are LSB-disentangled.
Proof.
The definition of an LSB-disentangled representation of dimension 2 is:

1. There is a linear group action ·_Z : G × Z → Z. It thus can be viewed as a group representation ρ : G → GL(Z).

2. The map f : W → Z is equivariant between the actions on W and Z.

3. There is a decomposition Z = Z_x × Z_y or Z = Z_x ⊕ Z_y such that each Z_i is fixed by the action of all G_j, j ≠ i, and affected only by G_i.

Let f be trivial, i.e., constant: f(w) = z_0 for all w ∈ W, and let the action of every g ∈ G on Z be the identity function Id : z ↦ z, which is linear. We can verify that f is equivariant between the actions on W and Z:

f(g ·_W w) = z_0 = Id(z_0) = g ·_Z f(w)    (3)

Finally, every world-state has the same representation z_0, so Z is fixed by the action of any subgroup of G. Hence for every decomposition of G, point 3. of the definition is satisfied.
∎
A.2 It is impossible to learn an LSB-disentangled representation of dimension 2 in the considered environment

We prove Theorem 1, which states that it is impossible to learn an LSB-disentangled representation of dimension 2 in the environment presented in Sec.3 (the result also applies to the environment considered in Higgins et al. (2018)).
Theorem.
For the considered world, there exists no LSB-disentangled representation f : W → Z w.r.t. the group decomposition G = G_x × G_y such that dim(Z) = 2 and f is not trivial.
Proof.
Proof by contradiction.

Suppose that there exists an LSB-disentangled representation f : W → Z w.r.t. the group decomposition G = G_x × G_y, such that dim(Z) = 2. Then, by definition:

1. There is a linear group action ·_Z : G × Z → Z. It thus can be viewed as a group representation ρ : G → GL(Z).

2. The map f : W → Z is equivariant between the actions on W and Z.

3. There is a decomposition Z = Z_x × Z_y or Z = Z_x ⊕ Z_y such that each Z_i is fixed by the action of all G_j, j ≠ i, and affected only by G_i.

We now prove that if these conditions are verified, f is necessarily constant. Consequently, each state of the world has the same representation, which is a trivial representation. So, if f is an LSB-disentangled representation of dimension 2 w.r.t. G = G_x × G_y, then f is the trivial representation.

We thus suppose that there exists an LSB-disentangled representation f of dimension 2 w.r.t. the group decomposition G = G_x × G_y. Hence, we have, by point 2. of the definition:

f(g ·_W w) = g ·_Z f(w)    (4)

Since the action ·_Z is linear, we can view it as a group representation ρ, as mentioned in point 1. of the definition:

f(g ·_W w) = ρ(g)(f(w))    (5)

Because Z = Z_x × Z_y and f = (f_x, f_y), we can rewrite Equation (5) component-wise:

(f_x(g ·_W w), f_y(g ·_W w)) = ρ(g)((f_x(w), f_y(w)))    (6)

f_x(g ·_W w) = ρ(g)(f(w))|_{Z_x}    (7)

We can decompose any g ∈ G into the composition of elements of each subgroup of G, i.e., g = g_x ∘ g_y with (g_x, g_y) ∈ G_x × G_y. Moreover, by point 3. of the definition, Z_x is fixed by the action of G_y and affected only by G_x. We can thus rewrite both terms of Equation (7):

f_x(g ·_W w) = f_x(g_x ·_W w)    since Z_x is fixed by the action of G_y    (8)

ρ(g)(f(w))|_{Z_x} = ρ(g_x)(f_x(w))    by definition of the disentangled action    (9)

Hence, Equation (7) becomes:

f_x(g_x ·_W w) = ρ(g_x)(f_x(w))    (10)
We will now prove that f_x is necessarily constant. The same argument applies for f_y.
From Equation (10), applied to a state w = (x, y), we have:

f_x(g_x ·_W (x, y)) = ρ(g_x)(f_x(x, y))    (11)

g_x and g_y are respectively translations on the x-axis and the y-axis. Let n be the size of the grid; then ∀(x, y) ∈ W, g_x ·_W (x, y) = (x + 1 mod n, y). When at the edge of the world, if the agent translates to the right, it returns to the left, hence the modulo operation that represents this cycle. Hence:

f_x(x + 1 mod n, y) = ρ(g_x)(f_x(x, y))    (12)

The key argument of the proof lies in the fact that g_x is necessarily cyclic of order n (the minimal order can be inferior to n, but it is not useful to characterize the minimal order in this proof). Let's compose g_x n times:

f_x(x, y) = f_x(x + n mod n, y) = ρ(g_x)ⁿ(f_x(x, y))    (13)

We now use the fact that ρ(g_x) is a linear application of R, thus of the form:

ρ(g_x)(z) = a_{g_x} · z + b_{g_x}    (14)

For notation purposes, we drop the dependence on g_x of the coefficients of the real linear application ρ(g_x), writing a and b, and we can rewrite Equation (10):

f_x(g_x ·_W w) = a · f_x(w) + b    (15)

Hence, using Equation (13), we can develop the term ρ(g_x)ⁿ(f_x(x, y)):

ρ(g_x)ⁿ(f_x(x, y)) = aⁿ · f_x(x, y) + b · (aⁿ⁻¹ + ... + a + 1)    (16)

Define c = b · (aⁿ⁻¹ + ... + a + 1); with Equation (13), we have:

aⁿ · f_x(x, y) + c = f_x(x, y)    (17)

Equation (17) is verified ∀(x, y) ∈ W. Let (x₁, y₁), (x₂, y₂) ∈ W; subtracting the two corresponding instances of Equation (17) gives:

aⁿ · (f_x(x₁, y₁) − f_x(x₂, y₂)) = f_x(x₁, y₁) − f_x(x₂, y₂)    (18)

We can now derive conditions on ρ(g_x) or f_x. From Equation (18) we know that either f_x is constant or aⁿ = 1. If aⁿ = 1, then Equation (17) simplifies to c = 0. So either ρ(g_x) is the identity function or f_x is constant. The same argument applies to ρ(g_y) and f_y; hence either f_x is constant or ρ(g_x) = Id, and either f_y is constant or ρ(g_y) = Id. By plugging ρ(g_x) = Id (and likewise ρ(g_y) = Id) into Equation (7), we have that f is constant.
Hence f is necessarily constant, which implies that f is a trivial representation.
∎
Appendix B Details about Forward-VAE
B.1 Definition of the action matrices
ρ(g) is a block-diagonal matrix of dimension 4, composed of 2x2 rotation blocks. For instance, if g_x is a translation on the x-axis, the corresponding matrix is:

ρ(g_x) = [[cos θ, −sin θ, 0, 0], [sin θ, cos θ, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], with θ = 2π/n.

Similarly, for g_y, which is a translation on the y-axis, the corresponding matrix is:

ρ(g_y) = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, cos θ, −sin θ], [0, 0, sin θ, cos θ]].

Let's consider the environment in Sec.3. The agent has 4 actions: go left, right, up or down. We associate each action with a corresponding matrix with trainable weights. Thus, we associate actions go up and go down with a matrix of the form

A = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, w₁, w₂], [0, 0, w₃, w₄]],

and we associate actions go left and go right with a matrix of the form

A = [[w₁, w₂, 0, 0], [w₃, w₄, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],

where the wᵢ represent trainable parameters.
B.2 Pseudo-code of Forward-VAE
Appendix C Additional results
We observe that the mean squared difference between the ideal matrices ρ(g) and the learned matrices A is very small. Hence, the learned matrices are close approximations of the ideal block-diagonal rotation matrices.

The result is quite surprising, as we did not completely explicitly optimize for this matrix (at least for the cos/sin structure). Plus, there is no instability in training.
One issue with the fact that the approximation is not exact is instability under composition. Rotation matrices' determinants are stable under composition, as we have:

det(A · B) = det(A) · det(B)

As rotation matrices have a determinant equal to 1, composition keeps the determinant equal to 1 for rotations.

However, as the learned matrices A are only approximations of rotation matrices, their determinant is approximately 1 but not exactly 1. This is why, as many compositions are performed, the determinant of the resulting matrix either collapses to zero or explodes to infinity. We provide evidence for this phenomenon in Fig.3.
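This drift can be illustrated numerically: since det(A·B) = det(A)·det(B), a determinant of 1 + ε compounds exponentially under repeated composition (the values of ε and the number of compositions below are illustrative assumptions):

```python
import math

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

eps = 1e-3                   # assumed approximation error on the determinant
theta = 2 * math.pi / 10     # rotation angle for a grid of size 10 (assumed)
s = math.sqrt(1 + eps)       # scale the rotation so that det(A) = 1 + eps
A = [[s * math.cos(theta), -s * math.sin(theta)],
     [s * math.sin(theta),  s * math.cos(theta)]]

M = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(5000):        # many compositions, as when rolling out actions
    M = matmul2(A, M)

# det(A^k) = det(A)^k = (1 + eps)^k: the determinant explodes (here ~148)
assert det2(M) > 100
```

With ε < 0 the determinant instead collapses to zero, matching the two failure modes described above.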
Appendix D Discussion
It is important to note that Forward-VAE successfully learns even though there is a possible source of instability in training: the targets for the physics (forward) loss are constantly changing throughout training, as the encoder is being trained.
The benefit of using transitions rather than still observations for representation learning in the context of an agent acting in an environment has been proposed, discussed and implemented in previous work (Thomas et al., 2017; Raffin et al., 2019). In this work, however, we emphasize that using transitions is not merely a beneficial option, but is compulsory in the context of the current definition of SBDRL for an agent acting in an environment.
We make a connection between SBDRL and the Good Regulator Theorem (Conant & Ross Ashby, 1970). This principle states that, with regard to the brain, insofar as it is successful and efficient as a regulator for survival, it must proceed, in learning, by the formation of a model (or models) of its environment. In SBDRL, the aim is to find a representation that incorporates information about the dynamics of the environment.
Applying SBDRL to more complex environments is not straightforward. For instance, consider that we add an object to the environment studied in this paper. Then the group structure of the symmetries of the world is broken when the agent is close to the object. However, the symmetries are conserved locally. One approach would be to start from this local property to learn an approximate SB-disentangled representation.
Appendix E Using (L)SB-disentangled representations for downstream tasks
In this section we wish to answer the following question: is it increasingly better to use a non-disentangled / non-linear SB-disentangled / LSB-disentangled representation for downstream tasks?
We define better in terms of sample efficiency, final performance, and performance with restricted capacity classifiers/restricted amount of data.
For the choice of downstream task, we select the task of learning an inverse model, which consists in predicting the action a_t from two consecutive states (s_t, s_{t+1}).
As an LSB-disentangled representation models the interaction with the environment linearly, it intuitively should be increasingly easier to learn an inverse model from: a non-disentangled representation, a non-linear SB-disentangled representation, and an LSB-disentangled representation.
In order to test this hypothesis, we selected a well-established implementation (Scikit-learn (Pedregosa et al., 2011)) of a well-studied classifier (Random Forest (Breiman, 2001)). We collected 10k transitions (o_t, a_t, o_{t+1}). We trained the following models and baselines to compare:

LSB-disentangled representation of dimension 4: Forward-VAE trained as in Sec.5.2.

SB-disentangled representation of dimension 2: CCI-VAE variant trained as in Sec.5.1.

Non-disentangled representation of dimension 2: auto-encoder, the non-disentangled baseline.

SB-disentangled representation of dimension 4: CCI-VAE trained as in Sec.5.1 but with 4 dimensions, a baseline to control for the effect of the size of the representation.
For each model, once trained, we created a dataset of transitions in the corresponding representation space: ((z_t, z_{t+1}), a_t). We then report the 10-fold cross-validation mean validation accuracy as a function of the maximum depth parameter of the random forest, which controls the capacity of the classifier.
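The evaluation protocol can be sketched as follows, with two stand-ins to keep the sketch dependency-free: a 1-nearest-neighbour classifier replaces the paper's Scikit-learn random forest, and the ideal representation of Sec.5.2 replaces the learned one (grid size and noise level are assumptions):

```python
import math, random

n = 10  # assumed grid size
ACTIONS = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}  # right/left/up/down

def f(x, y):
    """Stand-in for a learned encoder: the ideal 4-d representation of Sec.5.2."""
    tx, ty = 2 * math.pi * x / n, 2 * math.pi * y / n
    return (math.cos(tx), math.sin(tx), math.cos(ty), math.sin(ty))

# Inverse-model dataset: features (z_t, z_{t+1}), label a_t
data = []
for x in range(n):
    for y in range(n):
        for a, (dx, dy) in ACTIONS.items():
            data.append((f(x, y) + f((x + dx) % n, (y + dy) % n), a))

def predict(features, train):
    """1-nearest-neighbour stand-in for the paper's random-forest classifier."""
    dist = lambda u, v: sum((p - q) ** 2 for p, q in zip(u, v))
    return min(train, key=lambda item: dist(item[0], features))[1]

# Evaluate on slightly perturbed queries (a stand-in for held-out samples)
rng = random.Random(0)
noisy = lambda feat: tuple(v + rng.uniform(-0.01, 0.01) for v in feat)
accuracy = sum(predict(noisy(feat), data) == a for feat, a in data) / len(data)
assert accuracy == 1.0  # the inverse model is easy in this representation
```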
We first observe that in all cases, either the LSB- or SB-disentangled representations perform best. In terms of final performance, all models reach the 100% accuracy upper limit if given enough data and a classifier with enough capacity.
However, if we consider a constraint on training set size and a fixed high capacity (see Fig.??), we can see that using an SB-disentangled representation is superior to the other options. We refer to the capacity of the classifier as "high" if increasing the capacity parameter does not lead to an increase in validation accuracy.
Moreover, if we consider a fixed training set size and a constraint on the classifier's capacity, using an LSB-disentangled representation is the best option.
In conclusion, we observed that it is easier for a small-capacity classifier to solve the task using an LSB-disentangled representation, and it is easier to solve the task using less data with an SB-disentangled representation. This indicates that (L)SB-disentanglement is indeed useful for downstream task solving.