learning-group-structure
Code associated with our paper "Learning Group Structure and Disentangled Representations of Dynamical Environments"
Discovering the underlying structure of a dynamical environment involves learning representations that are interpretable and disentangled, which is a challenging task. In physics, interpretable representations of our universe and its underlying dynamics are formulated in terms of representations of groups of symmetry transformations. We propose a physics-inspired method, built upon the theory of group representation, that learns a representation of an environment structured around the transformations that generate its evolution. Experimentally, we learn the structure of explicitly symmetric environments without supervision while ensuring the interpretability of the representations. We show that the learned representations allow for accurate long-horizon predictions and further demonstrate a correlation between the quality of predictions and disentanglement in the latent space.
The notion of representation learning occupies a central place in the machine learning literature (Bengio et al. (2013); Ridgeway (2016)). What is at stake with representation learning is that a clever agent must understand its environment and its generative (or variational/explanatory) hidden factors in order to make predictions, classifications and generalizations efficiently. However, learning interpretable representations of data that explicitly disentangle the underlying mechanisms structuring this data remains a challenge.

To address this, one can begin by drawing a parallel between the pursuit of underlying structure in machine learning and in physics. Representation learning considers the generative factors underlying data: in essence, the underlying degrees of freedom that, when modified independently, tractably modify the generated data. Physics, by contrast, often searches for structure using group representation theory, by considering the infinitesimal transformations that generate the symmetry group of a physical environment (Lie (1893); Weinberg (1995)). In both cases, one has to find a faithful – and, ideally, interpretable – representation of these factors or transformations in order to structure one's representation of the environment. This connection between representation learning in machine learning and representations in physics was previously highlighted by Higgins et al. (2018). However, although they define representations with respect to the analogy with physics, they do not propose a method for learning these representations from data.

In this work, motivated by the parallel between transformations in physics and in machine learning, we propose a method for learning disentangled representations of dynamical environments. Our method focuses on learning the structure of the symmetry group governing the environment's transformations, where symmetry transformations are understood as transformations that do not change the nature of the objects they act on. For this purpose we represent dynamical environments through a joint representation of observations and transformations: we encode observations as elements of a latent space and represent transformations as special orthogonal matrices that act linearly on that latent space. As we consider not only representations of observations but also representations of transformations, in the following we semantically overload the term "representation" to describe the full representation of an environment through its transformations and its observations.
Different definitions of what constitutes a disentangled representation, and of how to learn one, have been put forward (Locatello et al. (2018)). Generative Adversarial Networks (GANs) (Goodfellow et al. (2014)) and Variational AutoEncoders (VAEs) (Kingma and Welling (2013)) have been used with some success to identify loosely defined generative factors of data (Higgins et al. (2017a); Chen et al. (2016); Karras et al. (2019)) in non-interactive datasets. These approaches have also been applied to dynamical environments (Burgess et al. (2018)), where they have focused on learning disentangled state representations that can, for example, be used for domain adaptation (Higgins et al. (2017b)).

However, Higgins et al. (2018) argued that a disentangled representation of an environment should capture not only its states but also its transformations, and proposed a more formal definition of disentangled representations based on the physical notion of symmetry transformations. Building on this definition, Caselles-Dupré et al. (2019) showed that symmetry-based disentangled representation learning requires interaction with environments. However, Higgins et al. (2018) did not propose a method for learning such representations, and the method of Caselles-Dupré et al. (2019), which in contrast to our work uses a type of VAE, requires prior knowledge of the symmetries in the system. Prior work on learning underlying group structure from data (Cohen and Welling (2014a, b)) also assumed prior knowledge of the symmetry group. To the best of our knowledge, our work is the first to learn the underlying group structure of environments, and disentangled representations as defined in Higgins et al. (2018), without any prior knowledge of the symmetry group.

Interesting parallels can also be drawn between our work and state-space models in machine learning. Learning disentangled representations of dynamical environments is important for state-space models to robustly predict the evolution of complex systems (Miladinović et al. (2019)). Moreover, the state-space model of Fraccaro et al. (2017) is theoretically close to our method, as it considers representations of both the observations and the dynamics that act linearly on the latent space; however, their method does not reveal the group structure of the transformations. In parallel, physics and machine learning have been forging strong ties, mostly based upon Hamiltonian theory and the integration of ordinary differential equations in the latent space to describe the evolution of dynamical systems (Chen et al. (2018); Toth et al. (2019); Greydanus et al. (2019)). Finally, Hamiltonian-based methods have also been used to discover specific symmetries of physical systems (Bondesan and Lamacraft (2019)).

In this section we review the notion of symmetry-based disentangled representations, based upon the definition provided by Higgins et al. (2018). As we intend to represent observations as elements of a vector space Z, the representation (or realization) of the symmetry group we wish to learn has to be a linear representation on Z. By focusing on linear disentangled representations as described by group theory (Hall (2015)), we enforce that all transformations of the environment are represented by only linear transformations in the learnt latent space.
The goal of representation learning is to discover useful representations of data (Ridgeway (2016)). Not only must representations be faithful and preserve the information held in the data, they must also be explicit and interpretable, such that every generative factor is expressed and can be easily identified when looking at the representation. Specifically, in our case, we want to represent observations of a dynamical environment in a latent space that exhibits all of those qualities. Learning such representations requires inductive biases (Locatello et al. (2018)), which in our case take the form of the symmetry transformations that operate on the environment.
Consider a dynamical environment from which we can extract observations, together with a set of transformations that act on the environment. As in physics, we assume that those transformations generate a symmetry group G. We will have learnt a representation of this environment if we can map observations of the environment to elements of a latent space and map symmetry transformations to linear maps on this latent space, such that the group structure of G is preserved in the latent space.
Formally, learning a symmetry-based representation requires learning both the structure of the symmetry group G and a latent representation of the observations. This means finding a homomorphism ρ between the symmetry group G and the general linear group GL(Z) of the latent space Z. In order to learn a full representation of an environment, the observation space O shall be mapped to the latent space by an encoder e : O → Z, and transformations shall be represented as matrices acting linearly on the latent space, so that the map in Figure 1 is equivariant.
For example, let o be the observation of a ball and g the transformation of moving the ball to the left, so that the observation of the ball after the transformation is g · o. Writing e for the encoder and ρ(g) for the matrix representing g, equivariance requires that e(g · o) = ρ(g) e(o). We make the common shortcut of dropping the notation ρ(g) and use g to denote the transformation in both the observation space and the latent space.
Another requirement we make of the representation to be learned is that it is disentangled in the sense of Higgins et al. (2018). Disentanglement matters because it ensures the interpretability of the model of the environment we learn. Furthermore, if a robot or any intelligent agent wants to learn a representation of its environment, it should learn a disentangled representation so that it can associate simple actions with distinct subspaces of its representation of the environment. It is then simpler for the agent to perform tasks (Raffin et al. (2019)) and to learn complex actions in this representation, as they become combinations of simple disentangled actions.
Formally, if there exists a subgroup decomposition of the symmetry group G = G_1 × … × G_k, we would like to decompose the representation ρ into subrepresentations ρ = ρ_1 ⊕ … ⊕ ρ_k, acting on a corresponding decomposition Z = Z_1 ⊕ … ⊕ Z_k of the latent space, such that each subrepresentation ρ_i restricted to G_i is non-trivial while its restriction to every G_j with j ≠ i is trivial (we recall that a trivial representation of a group maps every element of the group to the identity).
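As a small sketch of this decomposition (a toy construction of ours, not the paper's learned matrices): for G = G_1 × G_2, with each factor represented in its own 2-dimensional subspace, the matrix representing a generator of G_1 acts as the identity on G_2's subspace, i.e. its restriction there is trivial.

```python
import numpy as np

def rot2(theta):
    """2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def direct_sum(a, b):
    """Block-diagonal direct sum of two square matrices."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

# rho(g1) acts non-trivially on the first subspace, trivially on the second,
# and vice versa for rho(g2): a disentangled direct-sum representation.
rho_g1 = direct_sum(rot2(np.pi / 3), np.eye(2))
rho_g2 = direct_sum(np.eye(2), rot2(np.pi / 5))

assert np.allclose(rho_g1[2:, 2:], np.eye(2))       # trivial restriction
assert not np.allclose(rho_g1[:2, :2], np.eye(2))   # non-trivial restriction
```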
Our goal is to learn a disentangled representation of the symmetry group G with no prior knowledge of the actual symmetries of the environment, so that any transformation of the environment can be represented by a linear operator in the latent space. Because we are looking for a matrix representation of an a priori unknown symmetry group G, we need to use a parameterization of a group of matrices large enough to potentially contain a subgroup that is a representation of G. We will also restrict ourselves to real representations.
We assume that G can be represented by a group of matrices belonging to SO(n), the group of orthogonal matrices with unit determinant. Given its prevalence in physics and in the natural world, we can expect SO(n) to be broadly expressive of the types of symmetries we are most likely to want to learn. As orthogonal matrices conserve the norm of the vectors they act on, this choice corresponds to encoding observations in a unit-norm spherical latent space (Davidson et al. (2018); Connor and Rozell (2019)).
We parameterize the n-dimensional representation of any transformation g as a product of the n(n−1)/2 planar rotations (Pinchon and Siohan (2016); Clements et al. (2016)):

    g = ∏_{i<j} R_{ij}(θ_{ij})    (1)

where R_{ij}(θ_{ij}) denotes the rotation by an angle θ_{ij} in the (i, j) plane, embedded in the n-dimensional representation. For example, in a 3-dimensional representation, one of the 3 rotations is:

    R_{12}(θ_{12}) =
    [ cos θ_{12}   −sin θ_{12}   0 ]
    [ sin θ_{12}    cos θ_{12}   0 ]
    [     0             0        1 ]    (2)

and any transformation in a 3-dimensional representation has 3 learnable parameters θ_{12}, θ_{13} and θ_{23}, such that:

    g = R_{12}(θ_{12}) R_{13}(θ_{13}) R_{23}(θ_{23})    (3)
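This parameterization can be sketched in a few lines of NumPy (the helper names below are ours, for illustration): an n-dimensional representation matrix is assembled as a product of one planar rotation per (i, j) plane.

```python
import numpy as np

def planar_rotation(n, i, j, theta):
    """Rotation by theta in the (i, j) plane, embedded in n dimensions."""
    R = np.eye(n)
    R[i, i] = np.cos(theta)
    R[j, j] = np.cos(theta)
    R[i, j] = -np.sin(theta)
    R[j, i] = np.sin(theta)
    return R

def representation_matrix(n, thetas):
    """Product of the n(n-1)/2 planar rotations, one angle per plane."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    assert len(thetas) == len(pairs)
    g = np.eye(n)
    for (i, j), theta in zip(pairs, thetas):
        g = g @ planar_rotation(n, i, j, theta)
    return g

# A 3-dimensional representation has 3 learnable angles; the result is
# always an element of SO(3): orthogonal with unit determinant.
g = representation_matrix(3, [np.pi / 2, 0.0, 0.0])
```

Because each factor is orthogonal, any product of them is as well, so the parameterization never leaves SO(n) regardless of the angle values.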
These parameters, θ_{ij}, are learnt jointly with the parameters of an encoder e mapping observations to the n-dimensional latent space and a decoder d mapping the latent space back to observations. The training procedure is as follows: we encode a random observation o_0 with e, then transform in parallel the environment and the latent vector, using random transformations g_1, …, g_t in the environment and their representation matrices in the latent space. The result of those linear transformations in the latent space is decoded with d, yielding the prediction:

    õ_t = d(g_t ⋯ g_1 e(o_0))    (4)
The training objective is the minimization of the reconstruction loss (in the following we use a binary cross entropy) between the true observations obtained after the successive transformations in the environment and the reconstructed observations obtained after the successive linear transformations using the representations on the latent space.
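A minimal sketch of this training signal (the encoder, decoder and function names here are illustrative stand-ins, not the paper's implementation) accumulates a binary cross-entropy between each decoded latent prediction and the corresponding true observation:

```python
import numpy as np

def binary_cross_entropy(target, pred, eps=1e-7):
    """Mean BCE between a target observation and a decoded prediction."""
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def rollout_loss(encode, decode, rep_matrices, o0, true_observations):
    """Encode the initial observation once, then apply the representation
    matrix of each sampled transformation in the latent space and compare
    each decoded prediction against the true next observation."""
    z = encode(o0)
    loss = 0.0
    for g, o_true in zip(rep_matrices, true_observations):
        z = g @ z                       # linear action in the latent space
        loss += binary_cross_entropy(o_true, decode(z))
    return loss / len(rep_matrices)

# Toy usage with hand-built maps in a 2-dimensional latent space.
encode = lambda o: o / np.linalg.norm(o)        # unit-norm latent code
decode = lambda z: 1.0 / (1.0 + np.exp(-z))     # sigmoid "decoder"
R = np.array([[0.0, -1.0], [1.0, 0.0]])         # a 90-degree rotation
loss = rollout_loss(encode, decode, [R, R],
                    np.array([1.0, 0.0]),
                    [np.array([0.0, 1.0]), np.array([1.0, 0.0])])
```

In an actual implementation the gradients of this loss would flow into the encoder, the decoder and the rotation angles simultaneously.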
As explained in section 3.3, for a representation to be disentangled, each subgroup of the symmetry group should act on a specific subspace of the latent space. We want to impose, without supervision, this disentanglement constraint on the set of transformations that act on the environment. In order to do so without any prior knowledge of the structure of the symmetry group, our intuition is that if each transformation acts on a minimal number of dimensions of the latent space, then the representation can naturally disentangle itself.
We formalize this notion of entanglement into a metric proper to our parameterization, one that quantifies sparsity and interpretability through the number of rotations (each parameterized by an angle θ_{ij}) involved in the transformation matrices. The smallest non-trivial transformation matrix involves a single rotation, so the metric L_ent measures the use of any additional rotations:

    L_ent = ∑_g ( ∑_{i<j} |θ^g_{ij}| − max_{i<j} |θ^g_{ij}| )    (5)

The higher L_ent, the higher the entanglement of the representation of the set of transformations. Minimizing this metric ensures that, for each transformation g, most of the parameters θ^g_{ij} appearing in its representation go to 0, which implies that the transformation acts on a minimal number of dimensions of the latent space. If there is only one non-zero parameter in the parameterization of a transformation, then it acts on only 2 dimensions of the latent space.
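Under our reading of this description (the exact functional form used in the paper may differ), the metric can be sketched as: for each transformation, sum the magnitudes of all rotation angles except the single largest one.

```python
import numpy as np

def entanglement(angles_per_transformation):
    """Total magnitude of all rotation angles beyond the largest one per
    transformation; zero exactly when every transformation uses at most
    one planar rotation, i.e. acts on only 2 latent dimensions."""
    total = 0.0
    for angles in angles_per_transformation:
        mags = np.abs(np.asarray(angles, dtype=float))
        total += mags.sum() - mags.max()
    return total

# A transformation using a single rotation contributes nothing...
single = entanglement([[np.pi / 2, 0.0, 0.0]])
# ...while spreading the action over two planes is penalized.
spread = entanglement([[0.4, 0.3, 0.0]])
```

Used as a regularizer with a positive weight this term pushes all but one angle of each transformation towards zero; with a negative weight it does the opposite, which is how the entangled baselines below are produced.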
The code to reproduce these experiments is available at https://github.com/IndustAI/learning-group-structure.
Our first goal is to show that the parameterization and the training method described in 4.1 allow us to extract useful information about the structure of an environment by looking at the topology of the learnt latent space. We use a simple environment similar to that of Higgins et al. (2018), consisting of a ball evolving in a 2-dimensional grid-world with periodic boundary conditions. At each timestep, the ball can move one step left, right, up or down, and observations are returned as matrices with value 1 at the position of the ball and 0 elsewhere. A 2-dimensional plane with periodic boundary conditions is topologically equivalent to a torus; it is this topology that we aim to learn from the dynamics of the environment.
Concretely, the symmetry group of this environment is the finite group C_n × C_n, where C_n denotes the cyclic group of order n (also written Z/nZ or Z_n), a finite subgroup of SO(2). In order to learn a representation of this environment, we need to learn an encoder, a decoder and the representation matrices for the 4 transformations (up, down, left and right) that generate C_n × C_n.
To learn this group structure from data, our only assumption is that it can be represented with 4-dimensional orthogonal matrices, i.e. elements of SO(4). This choice is motivated by the fact that real representations of cyclic groups can be seen as rotations in planes. Since the symmetry group is the direct product of 2 cyclic groups, we need 2 planes, hence 4 dimensions, to learn a real representation of it. We recall that a matrix of SO(n) has n(n−1)/2 degrees of freedom, so the matrices of this 4-dimensional representation each have 6 parameters.
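As an illustrative check (a hand-built construction, not learned weights), the expected disentangled 4-dimensional representation places each cyclic factor in its own plane, and one can verify the C_n × C_n group axioms numerically:

```python
import numpy as np

def block_rotation(theta_12, theta_34):
    """4x4 direct sum of a rotation by theta_12 in dimensions (1, 2)
    and a rotation by theta_34 in dimensions (3, 4)."""
    g = np.eye(4)
    c1, s1 = np.cos(theta_12), np.sin(theta_12)
    c2, s2 = np.cos(theta_34), np.sin(theta_34)
    g[0:2, 0:2] = [[c1, -s1], [s1, c1]]
    g[2:4, 2:4] = [[c2, -s2], [s2, c2]]
    return g

n = 5
g_right = block_rotation(2 * np.pi / n, 0.0)  # one C_n factor, plane (1, 2)
g_up = block_rotation(0.0, 2 * np.pi / n)     # other C_n factor, plane (3, 4)

# The two generators commute and each has order n, as C_n x C_n requires.
assert np.allclose(g_right @ g_up, g_up @ g_right)
assert np.allclose(np.linalg.matrix_power(g_right, n), np.eye(4))
```

A learner that minimizes the entanglement metric should converge to matrices of this block form, up to a change of basis and relabeling of the planes.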
We use neural networks for the encoder and the decoder. The encoder has normalized outputs, so that it always maps observations to unit-norm latent vectors. We learn jointly the encoder parameters, the decoder parameters and the parameters of the 4 transformation matrices.

Our results are shown in Figure 2, in which we see that we learn a 4-dimensional representation of the finite symmetry group C_n × C_n. The explicit toroidal structure of the latent space thus respects the structure of the symmetry group. We are therefore able to learn, without supervision, the underlying symmetry structure of the environment and an equivariant map between the observation space and the latent space.
Having shown that we can learn a representation of this environment, we now show that we can control its entanglement using the metric introduced in 4.2. Indeed, many 4-dimensional representations of C_n × C_n exist, and most of them are entangled. Since the transformation matrices are parameterized as products of 6 rotations, if most of the 6 parameters are non-zero, then the transformation is poorly interpretable, because all dimensions of the latent space are mixed after acting on it with this representation.
Using the entanglement metric from section 4.2 as a regularization term, we are able to control the entanglement of the learnt representation. Figure 3 compares the transformations learnt with a regularization minimizing the entanglement (Figure 2(a)) and with a regularization maximizing it (Figure 2(b)). Even though both representations encode the environment and exhibit the corresponding toroidal structure, the maximally disentangled representation is much more interpretable: the up/down transformations rotate in a single plane (dimensions 1 and 3), whereas the left/right transformations act equivalently in an orthogonal subspace (dimensions 2 and 4). This is the most intuitive 4-dimensional disentangled representation of the symmetry group we could have learnt.
One could imagine comparing the representations learnt by our method with those learnt by VAEs such as the CCI-VAE (Burgess et al. (2018)). However, we cannot directly compare our results to such representations, as the latent space we use is not a Gaussian distribution but a spherical latent space, and our entanglement metric does not make sense in the context of VAEs.
We now show that we are able to learn disentangled representations of environments with more complex symmetry group structures to prove the robustness of our method. We consider a colored point evolving on a 3-dimensional sphere. The transformations that we consider are discrete rotations around the sphere and periodic discrete changes of color of the point. We wish to disentangle two factors of variation: the spatial rotations and the changes of color.
We use a set of 5 colors, visualized on a periodic line; the transformations corresponding to the periodic changes of color are denoted color+ and color−. We also learn a set of 3-dimensional rotations around the axes of the sphere that the colored point lives on. These 6 transformations are denoted x+, x−, y+, y−, z+ and z−; they correspond to rotations by a fixed angle in either direction around each axis. The symmetry group generated by those transformations, and that we aim to learn, therefore lies in SO(3) × C_5.
As explained in Higgins et al. (2018), learning a disentangled representation of 3D rotations directly questions the definition we give of a disentangled representation. Indeed, SO(3) cannot be written as a non-trivial direct product of subgroups; therefore, we cannot find a representation in which two rotations around different axes act on two different subspaces of the latent space. We can still satisfy ourselves with a disentangled representation in which rotations around the x, y and z axes each act on a minimal number of dimensions of the latent space, a definition of entanglement aligned with the metric we introduced in 4.2.
We choose to learn a 5-dimensional representation of this environment, because an interpretable disentangled representation would associate a 3-dimensional subspace with the spatial transformations and a 2-dimensional subspace with the color transformations. Nevertheless, using a higher-dimensional latent space does not change the learnt representation, as the disentanglement objective renders unnecessary dimensions impactless: they appear in none of the transformation matrices.
Figure 3(b) shows that, when also minimizing the entanglement metric, we effectively learn a 5-dimensional disentangled representation of the environment that conserves the symmetry group structure, and in which the spatial transformations act on a subspace distinct from that of the color transformations.
As a final step towards learning disentangled representations of symmetry groups, we now show that we are able to learn continuous groups corresponding to infinite sets of transformations, and are therefore not limited to discrete environments. We consider a point evolving on a sphere under the continuous set of rotations around the 3 axes, with rotation angles drawn from a bounded interval.
To learn a continuous group of symmetry transformations, also known as a Lie group, we approximate the group homomorphism with a neural network ψ. The network takes as input the transformation applied in the environment, given as the concatenation of a scalar denoting the rotation axis (0 for x, 1 for y and 2 for z) and the value of the angle of the rotation around this axis. It outputs the scalars parameterizing the n-dimensional representation of this transformation.
In this case, we aim to learn the 3D rotations around the axes of a sphere, so we use a 3-dimensional latent space to represent the symmetry group SO(3). We recall that, in our parameterization, 3-dimensional representations have 3 parameters, corresponding to the rotations R_{12}, R_{13} and R_{23}. For example, the representation of a rotation by an angle α around the z axis is parameterized as a product of 3 rotations such that:

    g = R_{12}(α) R_{13}(0) R_{23}(0)    (6)
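For the 3-dimensional case, the disentangled map that the network ψ should converge to can be written down directly (the axis-to-plane pairing below is an assumed convention, not taken from the paper): a rotation about one axis becomes a single planar rotation angle, the other two angles staying zero.

```python
import numpy as np

def target_homomorphism(axis, alpha):
    """Map a transformation (axis index, rotation angle) to the 3 angles
    (theta_12, theta_13, theta_23) of the parameterization.
    axis: 0 for x, 1 for y, 2 for z."""
    thetas = np.zeros(3)
    plane_index = {0: 2, 1: 1, 2: 0}[axis]  # assumed axis/plane pairing
    thetas[plane_index] = alpha
    return thetas

# A rotation of alpha about the z axis uses only the (1, 2) plane.
z_thetas = target_homomorphism(2, 0.3)
```

The network ψ only has to approximate this piecewise-linear map over its bounded input interval, which is what makes the continuous case tractable.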
As in previous sections, the training objective is the minimization of a loss combining a reconstruction loss and an entanglement regularization term.
Our results are shown in Figure 5, proving that we effectively learn a 3-dimensional representation of SO(3), where rotation about each axis acts only within a single plane of the latent space. With this 3-dimensional disentangled representation of a continuous group of symmetry transformations in a 3-dimensional space, we show that we can learn a perfectly interpretable representation of an environment with continuous dynamical transformations.
We now show that the learnt representations are capable of excellent long-term predictions. We go back to the torus-world environment of a ball evolving on a grid with periodic boundary conditions. Using our entanglement metric, we control the entanglement of the representations so that it reaches a target value. We train the parameters of the encoder, the decoder and the representation matrices to minimize an objective combining the reconstruction loss with a regularization term driving the entanglement metric towards its target value.
During training, we minimize this objective on sequences of 10 successive random transformations sampled among up, down, left, and right. After training, we use the learnt representations to predict the state of the environment over 500 random transformation steps in the latent space and we measure the reconstruction error between the decoded observation and the true one.
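A sketch of why exact orthogonal representations roll out stably over horizons far beyond the training length (the C_4 representation below is hand-built, not learned):

```python
import numpy as np

def latent_rollout(z0, reps, actions):
    """Predict purely in the latent space: apply the representation matrix
    of each action in sequence, never querying the environment."""
    z = z0
    for a in actions:
        z = reps[a] @ z
    return z

c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
R = np.array([[c, -s], [s, c]])           # exact representation of C_4
reps = {"right": R, "left": R.T}          # R.T is the inverse rotation

rng = np.random.default_rng(0)
actions = rng.choice(["left", "right"], size=500)
z = latent_rollout(np.array([1.0, 0.0]), reps, actions)
```

Because orthogonal matrices neither grow nor shrink the latent code, the prediction stays on the unit sphere even after 500 composed transformations, and any error comes solely from how well the learned matrices match the true group elements.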
Figure 6 shows that the lower the entanglement of the learnt representation, the better the long-term predictions. The label "min" stands for minimization of the entanglement and "max" for its maximization; for "0.3" and "0.6", the label is the target value of the entanglement metric. This result is in agreement with the widespread notion that disentangled representations make for better predictions (Bengio et al. (2013); Ridgeway (2016); Higgins et al. (2017b)).
In this work, we have opened the possibility of applying representation theory to the problem of learning disentangled representations of dynamical environments. We have exhibited the faithful, explicit and interpretable structure of the latent representations learnt with this method, for simple symmetrical environments and with a specific parameterization. With this method, the structure of the latent space naturally respects the structure of the symmetry group without imposing any constraint on the latent space during training. Adding a very general regularization on the parameters of the transformation matrices makes the representations easily interpretable, yielding representations very similar to those a physicist would derive when formulating their conception of the symmetries of the environment. Whilst performance in complex real-world environments remains untested, we think that this further evidences the benefits of applying physics-based biases to representation learning.
We would like to thank Yaël Frégier, Sébastien Toth, Irina Higgins, and the team at indust.ai for helpful discussions.
Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pp. 3601–3610.
Higgins, I. et al. DARLA: improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1480–1490.
Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
Raffin, A. et al. Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. arXiv preprint arXiv:1901.08651.