Disentangling the independently controllable factors of variation by interacting with the world

02/26/2018 · by Valentin Thomas et al.

It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors, and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.


1 Introduction

When solving Reinforcement Learning problems, what separates great results from random policies is often having the right feature representation. Even with function approximation, learning the right features can lead to faster convergence than blindly attempting to solve the given problem (Jaderberg et al., 2016).

The idea that learning good representations is vital for solving most kinds of real-world problems is not new, both in the supervised learning literature (Bengio, 2009; Goodfellow et al., 2016) and in the RL literature (Dayan, 1993; Precup, 2000). An alternate idea is that these representations do not need to be learned explicitly, and that learning can be guided through internal mechanisms of reward, usually called intrinsic motivation (Barto et al.; Oudeyer and Kaplan, 2009; Salge et al., 2013; Gregor et al., 2017).

We build on a previously studied (Thomas et al., 2017) mechanism for representation learning that has close ties to intrinsic motivation mechanisms and causality. This mechanism explicitly links the agent’s control over its environment to the representation of the environment that is learned by the agent. More specifically, this mechanism’s hypothesis is that most of the underlying factors of variation in the environment can be controlled by the agent independently of one another.

We propose a general and easily computable objective for this mechanism that can be used in any RL algorithm that uses function approximation to learn a latent space. We show that our mechanism can push a model to disentangle its input in a meaningful way and to represent factors that take multiple actions to change, and that these representations make it possible to perform model-based predictions in the learned latent space rather than in a low-level input space (e.g., pixels).

2 Learning disentangled representations

The canonical deep learning framework to learn representations is the autoencoder framework (Hinton and Salakhutdinov, 2006). There, an encoder $f$ and a decoder $g$ are trained to minimize the reconstruction error $\|x - g(f(x))\|^2$. $H = f(X)$ is called the latent (or representation) space, and is usually constrained in order to push the autoencoder towards more desirable solutions. For example, imposing that $\dim(H) < \dim(X)$ pushes $f$ to learn to compress the input; there the bottleneck often forces $f$ to extract the principal factors of variation from $X$. However, this does not necessarily imply that the learned latent space disentangles the different factors of variation. Such a problem motivates the approach presented in this work.
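As a concrete illustration of this framework, here is a minimal PyTorch sketch of an autoencoder trained with a reconstruction loss; the fully connected layers and their sizes are illustrative assumptions and do not correspond to the convolutional architecture described in Appendix A.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: encoder f maps x to a latent h, decoder g maps h back.
# Layer sizes are illustrative, not the paper's convolutional architecture.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=64, latent_dim=2):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(input_dim, 32), nn.ReLU(),
                               nn.Linear(32, latent_dim))        # encoder
        self.g = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                               nn.Linear(32, input_dim))         # decoder

    def forward(self, x):
        h = self.f(x)                    # latent representation
        return self.g(h), h

model = AutoEncoder()
x = torch.randn(16, 64)                  # a batch of flattened observations
x_hat, h = model(x)
recon_loss = ((x - x_hat) ** 2).mean()   # reconstruction error ||x - g(f(x))||^2
recon_loss.backward()
```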

Other authors have proposed mechanisms to disentangle underlying factors of variation. Many deep generative models, including variational autoencoders (Kingma and Welling, 2014), generative adversarial networks (Goodfellow et al., 2014) and non-linear versions of ICA (Dinh et al., 2014; Hyvarinen and Morioka, 2016), attempt to disentangle the underlying factors of variation by assuming that their joint distribution (marginalizing out the observed $x$) factorizes, i.e., that they are marginally independent.

Here we explore another direction, trying to exploit the ability of a learning agent to act in the world in order to impose a further constraint on the representation. We hypothesize that interactions can be the key to learning how to disentangle the various causal factors of the stream of observations that an agent is faced with, and that such learning can be done in an unsupervised way.

3 The selectivity objective

We consider the classical reinforcement learning setting, but in the case where extrinsic rewards are not available. We introduce the notion of controllable factors of variation $\varphi$, which are generated from a neural network $\varphi = g(h, z)$, where $h$ is the current latent state and $z$ is random noise. The factor $\varphi$ represents an embedding of a policy $\pi_\varphi$ whose goal is to realize the variation $\varphi$ in the environment.

To discover meaningful factors of variation $\varphi$ and their associated policies $\pi_\varphi$, we consider the following general quantity, which we refer to as selectivity and which is used as a reward signal for $\pi_\varphi$:

$$\mathcal{S}(h, \varphi) \;=\; \mathbb{E}_{h'}\!\left[\log \frac{A(h', h, \varphi)}{\mathbb{E}_{\varphi' \sim p(\varphi \mid h)}\big[A(h', h, \varphi')\big]} \,\middle|\, h, \varphi\right] \qquad (1)$$

Here $h$ is the encoded initial state before executing $\pi_\varphi$ and $h'$ is the encoded terminal state. $\varphi$ and $\varphi'$ represent factors of variation. $A(h', h, \varphi)$ should be understood as a score describing how close $\varphi$ is to the variation it caused in the latent space, i.e., to $h' - h$. For example, in the experiments of section 4.1 we choose $A$ to be a gaussian kernel between $h' - h$ and $\varphi$, while the experiments of section 4.2 use a different choice of $A$. The intuition behind these objectives is that, in expectation, a factor $\varphi$ should be close to the variation it caused when following $\pi_\varphi$, compared to other factors $\varphi'$ that could have been sampled and followed; this encourages independence among the factors.
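As an illustration, the following sketch computes a Monte-Carlo estimate of the selectivity reward of equation (1), assuming the gaussian-kernel score of section 4.1; the function names, the bandwidth and the numerical constants are illustrative assumptions.

```python
import torch

def gaussian_score(h_next, h, phi, sigma=1.0):
    # A(h', h, phi): gaussian kernel between the latent variation (h' - h) and phi
    return torch.exp(-((h_next - h - phi) ** 2).sum(-1) / (2 * sigma ** 2))

def selectivity_reward(h, h_next, phi, phi_alternatives):
    # Log-ratio form of equation (1): the factor phi that was actually followed
    # should explain the observed variation better than other sampled factors.
    numerator = gaussian_score(h_next, h, phi)
    denominator = torch.stack([gaussian_score(h_next, h, p)
                               for p in phi_alternatives]).mean(0)
    return torch.log(numerator + 1e-8) - torch.log(denominator + 1e-8)
```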

Conditioned on a scene representation $h$, a distribution $p(\varphi \mid h)$ of policies is feasible. Samples from this distribution represent ways to modify the scene and thus may trigger an internal selectivity reward signal. For instance, $h$ might represent a room with objects such as a light switch. $\varphi$ can be thought of as the distributed representation for the "name" of an underlying factor, to which is associated a policy and a value. In this setting, the light in a room could be a factor that could be either on or off. It could be associated with a policy to turn it on, and a binary value referring to its state, called an attribute or a feature value. We wish to jointly learn the policy $\pi_\varphi$ that modifies the scene, so as to control the corresponding value of the attribute in the scene, whose variation is computed by the scoring function $A$. In order to get a distribution of such embeddings, we compute $\varphi$ as a function of $h$ and some random noise $z$.
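A minimal sketch of such a factor generator, mapping the latent state and noise to $\varphi$, could look as follows; the concatenation MLP and the layer sizes are assumptions (the paper's appendix instead integrates the two inputs bilinearly, see A.1).

```python
import torch
import torch.nn as nn

class FactorGenerator(nn.Module):
    """Maps (latent state h, noise z) to a controllable-factor embedding phi.
    A simple concatenation MLP for illustration; the paper's appendix uses a
    bilinear integration of the two inputs instead."""
    def __init__(self, latent_dim=2, noise_dim=2, factor_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + noise_dim, 32), nn.ReLU(),
            nn.Linear(32, factor_dim))

    def forward(self, h, z):
        return self.net(torch.cat([h, z], dim=-1))

# Sampling a factor for the current latent state:
# h = encoder(s0); z = torch.randn(batch_size, 2); phi = generator(h, z)
```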

The goal of a selectivity-maximizing model is to find the density of factors $p(\varphi \mid h)$, the latent representation $h$, as well as the policies $\pi_\varphi$ that maximize $\mathbb{E}\big[\mathcal{S}(h, \varphi)\big]$.

Figure 1: The computational model of our architecture. $s_0$ is the first state; from its encoding $h = f(s_0)$ and a noise sample $z$, the factor $\varphi$ is generated. $\varphi$ is used to compute the policy $\pi_\varphi$, which is used to act in the world. The resulting sequence of states and actions is used to update our model through the selectivity loss, as well as an optional autoencoder loss on $h$.

3.1 Link with mutual information and causality

The selectivity objective, while intuitive, can also be related to information theoretic quantities defined in the latent space. From (Donsker and Varadhan, 1975; Ruderman et al., 2012) we have the variational representation $D_{\mathrm{KL}}(p \,\|\, q) = \sup_{T} \mathbb{E}_{p}[T] - \log \mathbb{E}_{q}\big[e^{T}\big]$. Applying this equality to the mutual information $I(h'; \varphi \mid h) = D_{\mathrm{KL}}\big(p(h', \varphi \mid h) \,\|\, p(h' \mid h)\, p(\varphi \mid h)\big)$ gives

$$I(h'; \varphi \mid h) \;\ge\; \mathbb{E}_{p(h', \varphi \mid h)}\big[\log A_\theta(h', h, \varphi)\big] - \log \mathbb{E}_{p(h' \mid h)\, p(\varphi \mid h)}\big[A_\theta(h', h, \varphi)\big],$$

where $\theta$ is the set of weights shared by the factor generator, the policy network and the encoder.

Thus, our total objective along entire trajectories is a lower bound on the causal (Ziebart, 2010) or directed (Massey, 1990) information from the sequence of factors $\varphi$ to the sequence of latent states $h$, which is a measure of the causality the process $\varphi$ exercises on the process $h$. See Appendix C for details.

4 Experiments

We use MazeBase (Sukhbaatar et al., 2015) to assess the performance of our approach. We do not aim to solve the game. In this setting, the agent (a red circle) can move in a small pixel-based environment and perform the actions down, left, right, up. The agent can go anywhere except on the orange blocks.

4.1 Learned representations

Figure 2: (a) Sampling of variations $h' - h$ and its kernel density estimation encountered when sampling random controllable factors $\varphi$. We observe that our algorithm disentangles these representations into separate main modes, each corresponding to the action that was actually taken by the agent (pink and white for up, light blue for down+left, green for right, purple-black for down and night blue for left). (b) The disentangled structure in the latent space. The $x$ and $y$ axes are disentangled such that we can recover the $x$ and $y$ position of the agent in any observation simply by looking at its latent encoding $h$. The missing point on this grid is the only position the agent cannot reach, as it lies on an orange block.

After jointly training the reconstruction and selectivity losses, our algorithm disentangles four directed factors of variation, as seen in Figure 2: the $x$-position and $y$-position of the agent. For visualization purposes, we chose a small bottleneck for the autoencoder. To complicate the disentanglement task, we added a redundant up action as well as the action down+left in this experiment.

The disentanglement appears clearly, as the latent features corresponding to the $x$ and $y$ positions are orthogonal in the latent space. Moreover, we notice that our algorithm assigns both up actions (white and pink dots in Figure 2(a)) to the same feature. It also does not create a significant mode for the feature corresponding to the action down+left (light blue dots in Figure 2(a)), as this variation is already explained by the features for down and left.

4.2 Multistep embedding of policies

In this experiment, the factors $\varphi$ are embeddings of $k$-step policies $\pi_\varphi$. We add a model-based loss defined only in the latent space, and jointly train a decoder alongside the encoder. Notice that we never train our model-based cost at the pixel level. While we currently suffer from mode collapse for some factors of variation, we show that we are able to make predictions in the latent space, reconstruct the latent prediction with the decoder, and that our factor space disentangles several types of variation.
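The exact form of this latent model-based loss is not spelled out here; a minimal sketch, assuming a learned transition network and a squared error computed purely in the latent space, is:

```python
import torch
import torch.nn as nn

# Hypothetical latent transition model: predicts the next latent state from
# the current latent state h and the factor embedding phi.
transition = nn.Sequential(nn.Linear(2 + 2, 32), nn.ReLU(), nn.Linear(32, 2))

def latent_model_loss(h, phi, h_next):
    # Squared error purely in latent space; pixels are never reconstructed here.
    h_pred = transition(torch.cat([h, phi], dim=-1))
    return ((h_pred - h_next) ** 2).mean()
```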

Figure 3: (a) The actual 3-step trajectory taken by the agent. (b) PCA view of the latent space. Each arrow points to the reconstruction of the latent prediction made with a different $\varphi$. The $\varphi$ at the start of the green arrow is the one used by the policy in (a). Notice how its prediction accurately matches the actual final state.

5 Conclusion, success and limitations

Pushing representations to model independently controllable features currently yields encouraging results. Visualizing our features clearly shows the different controllable aspects of simple environments; yet our learning algorithm is unstable. What seems to be the strength of our approach could also be its weakness, as the independence prior forces a very strict separation of concerns in the learned representation, and should perhaps be relaxed.

Some sources of instability also seem to slow our progress: the learned conditional distribution over controllable aspects often collapses to fewer modes than desired, the learned stochastic policies often optimistically converge to a single action, and the multiple parts of our model require tuning many hyperparameters. Nonetheless, we are hopeful about the steps that we are now taking. Disentangling happens, but understanding our optimization process as well as our current objective function will be key to further progress.


Appendix A Additional details

A.1 Architecture

Our architecture is as follows: the encoder, mapping the raw pixel state to a latent representation, is a 4-layer convolutional neural network with batch normalization [Ioffe and Szegedy, 2015] and leaky ReLU activations. The decoder uses the transposed architecture with ReLU activations. The noise $z$ is sampled from a 2-dimensional gaussian distribution, and both the generator $g(h, z)$ and the policy $\pi_\varphi$ are neural networks consisting of 2 fully-connected layers. In practice, a minibatch of $\varphi$ vectors is sampled at each step. The agent randomly chooses one of them and samples actions from its policy $\pi_\varphi$. Our model parameters are then updated using policy gradient with the REINFORCE estimator, a state-dependent baseline and importance sampling. For each selectivity reward, the expectation over alternative factors $\varphi'$ in the denominator of equation (1) is estimated using the other factors of the minibatch.

In practice, we do not simply concatenate the two vectors fed as input to a network (such as $(h, z)$ for the factor generator or $(h, \varphi)$ for the policy). Instead, we use a bilinear operation on the two vectors, as in Florensa et al. [2017]. We observe that the bilinearly integrated input more strongly enforces dependence on both vectors; in contrast, our models often ignored one of the inputs when using a simple concatenation.
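A plausible implementation of this bilinear integration, using PyTorch's built-in bilinear layer (the exact parameterization used in the paper and in Florensa et al. may differ), is:

```python
import torch
import torch.nn as nn

class BilinearIntegration(nn.Module):
    """Combines two input vectors u and v with a learned bilinear map, so the
    output depends multiplicatively on both (unlike a simple concatenation).
    One plausible parameterization, not necessarily the paper's."""
    def __init__(self, u_dim, v_dim, out_dim):
        super().__init__()
        self.bilinear = nn.Bilinear(u_dim, v_dim, out_dim)

    def forward(self, u, v):
        return self.bilinear(u, v)

# Example: combining the latent state h with the factor phi before the policy head.
layer = BilinearIntegration(u_dim=2, v_dim=2, out_dim=32)
h, phi = torch.randn(16, 2), torch.randn(16, 2)
features = layer(h, phi)
```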

Throughout our research, we experimented with different outputs for our generator $g(h, z)$. We explored embedding the $\varphi$-vectors into a hypercube, a hypersphere, a simplex, and also a simplex multiplied by a scalar obtained from an additional squashing operation.

A.2 First experiment

In the first experiment (Figure 2), we used a gaussian similarity kernel, i.e. $A(h', h, \varphi) = \exp\!\big(-\|(h' - h) - \varphi\|^2 / (2\sigma^2)\big)$ with a fixed bandwidth $\sigma$. In this experiment only, for clarity of the figure, we only allowed permissible actions in the environment (no no-op action).

Appendix B Additional Figures

B.1 Discrete simple case

Here we consider the case where we learn a latent space $H$ of size $K$, with factors corresponding to the coordinates of $h$, and learn $K$ separately parameterized policies $\pi_k$. We train our model with the selectivity objective, but no autoencoder loss, and find that we correctly recover independently controllable features on a simple environment. Albeit slower than when jointly training an autoencoder, this shows that the objective we propose is strong enough on its own to provide a learning signal for discovering a disentangled latent representation.
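One natural per-coordinate instantiation of the selectivity in this discrete case rewards policy $\pi_k$ for changing its own coordinate $h_k$ more than the others; the normalization below is an assumed form rather than a quotation from the paper:

```latex
\mathcal{S}_k(h, h') \;=\; \frac{\left| h'_k - h_k \right|}{\sum_{k'=1}^{K} \left| h'_{k'} - h_{k'} \right|}
```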

We train such a model on a gridworld MNIST environment, where there are two MNIST digits. The two digits can be moved on the grid via 4 directional actions each (so there are 8 actions total); the first digit is always odd and the second digit always even, so they are distinguishable. In Figure 4 we plot each latent feature as a curve, as a function of each ground-truth factor. For example, we see that the black feature recovers the horizontal position of the first digit, and that the purple feature recovers the vertical position of the second digit.

Figure 4: In a gridworld environment with 2 objects (in this case 2 MNIST digits), we know there are 4 underlying features: the horizontal and vertical position of each digit. Here each of the four plots represents the evolution of the learned features as a function of one of these underlying features. We see that for each of them, at least one learned feature recovers it almost linearly, from the raw pixels only.

B.2 Planning and policy inference example in 1-step

This disentangled structure could be used to address many challenging issues in reinforcement learning. We give two examples in Figure 5:

  • Model-based predictions: given an initial state $s_0$ and an action sequence $\{a_t\}_{t=1\dots T}$, we want to predict the resulting state $s_T$.

  • A simplified deterministic policy inference problem: given an initial state $s_{\text{start}}$ and a terminal state $s_{\text{goal}}$, we aim to find a suitable action sequence $\{a_t\}_{t=1\dots T}$ such that $s_{\text{goal}}$ can be reached from $s_{\text{start}}$ by following it.

Because of the activation used on the last layer of the factor generator, the different factors of variation are placed on the vertices of a hypercube, and we can think of the policy inference problem as finding a path in that simpler space, where the starting point is $h_{\text{start}} = f(s_{\text{start}})$ and the goal is $h_{\text{goal}} = f(s_{\text{goal}})$. We believe this could prove to be a much easier problem to solve.
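As a toy illustration of this idea, the sketch below greedily decomposes the difference between the start and goal encodings into a sequence of candidate factors; the greedy procedure and all names are assumptions, not the paper's algorithm.

```python
import torch

def greedy_factor_plan(h_start, h_goal, candidate_phis, max_steps=10):
    """Greedily picks, at each step, the candidate factor that moves the latent
    state closest to the goal. A toy illustration of planning in latent space,
    not the procedure used in the paper."""
    h = h_start.clone()
    plan = []
    for _ in range(max_steps):
        # Distance to the goal if we were to apply each candidate variation.
        dists = torch.stack([((h + phi - h_goal) ** 2).sum() for phi in candidate_phis])
        best = torch.argmin(dists).item()
        if dists[best] >= ((h - h_goal) ** 2).sum():
            break  # no candidate brings us closer; stop
        h = h + candidate_phis[best]
        plan.append(best)
    return plan  # indices of the chosen factors, to be executed by their policies
```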

Figure 5: (a) Predicting the effect of a cause on MazeBase. The leftmost image is the visual input of the environment, where the agent is the round circle and the switch states are represented by shades of green. After training, we are able to distinguish one cluster per $\varphi$ (Figure 2), that is to say per variation obtained after performing an action, independently of the position of the agent. Therefore, we are able to move the agent just by adding the corresponding $\varphi$ to its latent representation $h$. The second image is the reconstruction obtained by feeding the resulting $h + \varphi$ into the decoder. (b) Given a starting state and a goal state, we are able to decompose the difference of the two representations into a (non-directed) sequence of movements.

B.3 Multistep example

We demonstrate an instance of ICF operating in a 4×4 MazeBase environment over five time steps in Figure 6. We consistently witness mode collapse in our generator, and therefore the generator only produces a subset of all possible variations. In Figure 6, we observe that the $\varphi$ governing the agent's policy appears to correspond to moving two positions down and then repeatedly toggling the switch. A random action due to $\epsilon$-greedy exploration led to the agent moving up and off the switch at time step 4. This perturbation is corrected by the policy by moving down in order to return to toggling the relevant switch.

Figure 6: (a) MazeBase environment over five time steps. Here the red dot denotes the position of the agent. The $\varphi$ governing the agent's policy appears to control toggling the switch indicated by the red rounded box. (b) Visualization of the policies instantiated by different $\varphi$s. Each box represents the probability distribution of the policy at that time step. Each row is generated by a different $\varphi$ and each column corresponds to an action (up, left, pass, right, toggle, down), in that order. The boxed column marks the action taken by the agent. The symbols below each box represent the most probable action for the behavioral policy, where the grey circle indicates toggling the switch.

Appendix C Variational bound and the selectivity

Let us call $p_\theta(h' \mid h, \varphi)$ the probability distribution over final hidden states starting from $h$ and using the policy parametrized by the embedding $\varphi$. It is obtained by marginalizing over the trajectories generated by $\pi_\varphi$ under the transition probability of the environment.

For simplicity, in what follows we keep the conditioning on $h$ implicit in the notation.

C.1 Lower bound on the mutual information

The bound

$$I(h'; \varphi \mid h) \;\ge\; \mathbb{E}_{p(h', \varphi \mid h)}\big[\log A_\theta(h', h, \varphi)\big] - \log \mathbb{E}_{p(h' \mid h)\, p(\varphi \mid h)}\big[A_\theta(h', h, \varphi)\big]$$

can be proven by using the Donsker-Varadhan variational representation of the KL divergence [Donsker and Varadhan, 1975, Ruderman et al., 2012]:

$$D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sup_{T} \; \mathbb{E}_{p}[T] - \log \mathbb{E}_{q}\big[e^{T}\big].$$

For $I(h'; \varphi \mid h) = D_{\mathrm{KL}}\big(p(h', \varphi \mid h) \,\|\, p(h' \mid h)\, p(\varphi \mid h)\big)$ and using this identity with $T = \log A_\theta$, we have:

$$I(h'; \varphi \mid h) \;\ge\; \mathbb{E}_{p(h', \varphi \mid h)}\big[\log A_\theta(h', h, \varphi)\big] - \log \mathbb{E}_{p(h' \mid h)\, p(\varphi \mid h)}\big[A_\theta(h', h, \varphi)\big]$$

for parametric functions $A_\theta$, since restricting the supremum to a parametric family can only decrease the right-hand side.

As we sample the factors uniformly, our total objective along a trajectory is then a lower bound on the sum of the conditional mutual informations between the factors and the resulting latent states, which corresponds here to the directed information [Massey, 1990, Ziebart, 2010], as $\varphi$ is sampled independently from the past of the process $h$.

Appendix D Additional information on the training

In our experiments, we use the selectivity objective, an autoencoding loss and an entropy regularization loss for each of the policies $\pi_\varphi$. Furthermore, in experiment 4.2 we added the model-based cost, computed in the latent space with a learned two-layer fully connected neural network.

The selectivity is used to update the parameters of the encoder, factor generator and policy networks. We use the following equation for computing the gradients:

$$\nabla_\theta \, \mathbb{E}_{\tau}\big[\mathcal{S}\big] \;=\; \mathbb{E}_{\tau}\Big[\mathcal{S}\, \nabla_\theta \log p_\theta(\tau) \;+\; \nabla_\theta \mathcal{S}\Big],$$

where $\tau$ denotes the trajectory. We also use a state-dependent baseline as a control variate to reduce the variance of the REINFORCE estimator.

Furthermore, to be able to train the factor generator efficiently, we train on all the factors $\varphi$ sampled in a mini-batch by importance sampling on the probability ratio of the trajectory under each $\varphi$.
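A compact sketch of the resulting update, combining the REINFORCE estimator, a state-dependent baseline and per-factor importance weights, is given below; the exact weighting and all names are assumptions rather than the paper's estimator.

```python
import torch

def reinforce_update(log_probs, selectivity, baseline, log_probs_behavior=None):
    """Score-function (REINFORCE) surrogate loss with a state-dependent baseline.
    log_probs: sum over the trajectory of log pi(a_t | h_t, phi) under the factor
    being trained; log_probs_behavior: same quantity under the factor that actually
    generated the trajectory (for importance sampling). Illustrative only."""
    advantage = (selectivity - baseline).detach()
    if log_probs_behavior is not None:
        # Off-policy correction: ratio of trajectory probabilities under the two factors.
        weight = torch.exp(log_probs - log_probs_behavior).detach()
    else:
        weight = torch.ones_like(advantage)
    # Minimizing this surrogate ascends the expected selectivity.
    return -(weight * advantage * log_probs).mean()
```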