Independently Controllable Factors

08/03/2017 ∙ by Valentin Thomas, et al.

It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.


1 Introduction

Whether in static or dynamic environments, decision making for real world problems is often confronted with the hard challenge of finding a “good” representation of the problem. In the context of supervised or semi-supervised learning, it has been argued (Bengio, 2009) that good representations separate out underlying explanatory factors, which may be causes of the observed data. In such problems, feature learning often involves mechanisms such as autoencoders (Hinton and Salakhutdinov, 2006; Vincent et al., 2008), which find latent features that explain the observed data. In interactive environments, the temporal dependency between successive observations creates a new opportunity to notice causal structure in data which may not be apparent using only observational studies. The need to experiment in order to discover causal relationships has already been well explored in psychology (e.g. Gopnik and Wellman (2012)). In reinforcement learning, several approaches explore mechanisms that push the internal representations of learned models to be “good” in the sense that they provide better control (see §4), and control is a particularly important causal relationship between an agent and elements of its environment.

We propose and explore a more direct mechanism for representation learning, which explicitly links an agent’s control over its environment with its internal feature representations. Specifically, we hypothesize that some of the factors explaining variations in the data correspond to aspects of the world that can be controlled by the agent. For example, an object that could be pushed around or picked up independently of others is an independently controllable aspect of the environment. Our approach therefore aims to jointly discover a set of features (functions of the environment state) and policies (which change the state) such that each policy controls the associated feature while leaving the other features unchanged as much as possible. In §2 and §3 we explain this mechanism and show experimental results for the simplest instantiation of this new principle. In §5 we discuss how this principle could be applied more generally, and which research challenges emerge.

2 Independently controllable features

To make the above intuitions concrete, assume that there are factors of variation underlying the observations coming from an interactive environment that are independently controllable. That is, a controllable factor of variation is one for which there exists a policy which will modify that factor only, and not the others. For example, the object associated with a set of pixels could be acted on independently from other objects, which would explain variations in its pose and scale when we move it around while leaving the others generally unchanged. The object position in this case is a factor of variation. What poses a challenge for discovering and mapping such factors into computed features is the fact that the factors are not explicitly observed. Our goal is for the agent to autonomously discover such factors – which we call independently controllable features – along with policies that control them. While these may seem like strong assumptions about the nature of the environment, we argue that these assumptions are similar to regularizers, and are meant to make a difficult learning problem (that of learning good representations which disentangle underlying factors) better constrained.

There are many possible ways to express the preference for learning independently controllable features as an objective. §2.2 proposes such an objective for a simple scenario. §3.1 illustrates the effect of this objective when all the features of the environment are simple and controllable by the agent. Moreover, in §3.2, we show that by itself the objective we propose is strong enough to recover underlying factors of variation without additional reconstruction loss. In §2.3, we aim to generalize such an objective for a continuous representation of factors and policies. In §3.3, we present our experiments on the Mazebase domain (Sukhbaatar et al., 2015). §3.3.1 shows that, using these continuous embeddings, we are able to disentangle the latent space and the controllable factors. In §3.3.2, we show how the learnt representations can be used for planning and for inferring the sequence of actions performed between two states.

2.1 Capturing the main factors of variation

Since not all factors of variation present in the data are controllable, we propose to combine two objectives: (1) one to encourage the learned representation to capture the main factors of variation, and (2) one to encourage the representation to be structured so that the controllable factors are disentangled from each other and from other factors. Any common method for representation learning could be used for (1); for simplicity we use a simple autoencoder framework throughout this paper (Hinton and Salakhutdinov, 2006). The encoder f and decoder g of the autoencoder are viewed as function approximators with parameters θ_f and θ_g, such that f maps the input space S to some latent space H, and g maps H back to the input space S. Autoencoders are trained to minimize the discrepancy between s and g(f(s)), a.k.a. the reconstruction error, e.g.,

$$\mathcal{R} = \mathbb{E}_s\left[\tfrac{1}{2}\,\lVert s - g(f(s))\rVert_2^2\right].$$

We call h = f(s) the latent feature representation of s, with K features f_1(s), …, f_K(s).

It is common in the case of a vanilla autoencoder to assume that dim(H) < dim(S). This causes f and g to perform dimensionality reduction of s, i.e. compression, since there is a dimension bottleneck through which information about the input data must pass. Often, this bottleneck forces the optimization procedure to uncover principal factors of variation of the data on which they are trained. However, this does not necessarily imply that the different components of the vector h are individually meaningful. In fact, note that for any bijective transformation T of the latent space, we could obtain the same reconstruction error by replacing f by T ∘ f and g by g ∘ T^{-1}, so we should not expect any form of disentangling of the factors of variation unless some additional constraints or penalties are imposed on h. This motivates the approach we are about to present. Specifically, we have a preference for policies that can separately influence one of the coordinates f_k(s) of h, and we want to express a preference for learning representations that make such policies possible.
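To make this concrete, the sketch below implements a bottlenecked autoencoder and the reconstruction loss above in PyTorch; the framework, layer sizes and input dimension are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal bottlenecked autoencoder sketch (illustrative assumptions, not the authors' code).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=144, n_features=4):
        super().__init__()
        # f: encoder mapping an observation s to K latent features h = f(s)
        self.f = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                               nn.Linear(64, n_features))
        # g: decoder mapping h back to the input space
        self.g = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                               nn.Linear(64, input_dim))

    def forward(self, s):
        h = self.f(s)
        return h, self.g(h)

ae = AutoEncoder()
s = torch.rand(32, 144)                                # a batch of flattened observations
h, s_hat = ae(s)
recon = 0.5 * ((s - s_hat) ** 2).sum(dim=1).mean()     # E_s[ 1/2 ||s - g(f(s))||^2 ]
```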

Note that there may be several other ways to discover and disentangle underlying factors of variation. Many deep generative models, including variational autoencoders (Kingma and Welling, 2014) and other descendants of the Helmholtz machine (Dayan et al., 1995), generative adversarial networks (Goodfellow et al., 2014) or non-linear versions of ICA (Dinh et al., 2014; Hyvarinen and Morioka, 2016), attempt to disentangle the underlying factors of variation by assuming that their joint distribution (marginalizing out the observed s) factorizes, i.e., that they are marginally independent. Here we explore another direction, trying to exploit the ability of a learning agent to act in the world in order to impose a further constraint on the representation.

2.2 Disentangling independently controllable factors in the simplest case

Consider the following simple scenario: we train an autoencoder f, g producing K latent features f_k(s), k ∈ {1, …, K}. In tandem with these features we train K policies, denoted π_k(a|s), each mapping an agent’s observation to a categorical distribution over a discrete set of actions A. Autoencoders can learn relatively arbitrary feature representations, but we would like many of these features to correspond to controllable factors in the learner’s environment. Specifically, we would like policy π_k to cause a change only in f_k and not in any other feature. We think of f_k and π_k as a feature-policy pair.

In order to quantify the change in f_k when actions are taken according to π_k, we define the selectivity of a feature as:

$$\mathrm{sel}(s, a, k) = \mathbb{E}_{s' \sim \mathcal{P}^a_{ss'}}\left[\frac{|f_k(s') - f_k(s)|}{\sum_{k'} |f_{k'}(s') - f_{k'}(s)|}\right] \qquad (1)$$

where s, s' are successive raw state representations (e.g. pixels), a is the action, and \mathcal{P}^a_{ss'} is the environment transition distribution from s to s' under action a. The normalization factor in the denominator of the above equation ensures that the selectivity of f_k is maximal when only that single feature changes as a result of some action.
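As a concrete reading of (1), the following sketch estimates the selectivity of feature k by Monte Carlo from a batch of transitions collected while following π_k (the batching convention and the epsilon for numerical safety are assumptions):

```python
import torch

def selectivity(f, s, s_next, k, eps=1e-8):
    """Monte Carlo estimate of sel(s, a, k) from a batch of transitions (s, a, s')
    gathered by following policy pi_k. `f` maps observations to the K latent features."""
    dh = (f(s_next) - f(s)).abs()                 # |f_k'(s') - f_k'(s)| for every feature k'
    return (dh[:, k] / (dh.sum(dim=1) + eps)).mean()
```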

By having an objective that maximizes selectivity and minimizes the autoencoder reconstruction error, we can ensure that the features learned both capture the main factors of variation in the data and recover independently controllable factors. Hence, we define the following objective, which can be minimized jointly over f, g and the policies π_k via stochastic gradient descent:

$$\mathbb{E}_s\left[\tfrac{1}{2}\,\lVert s - g(f(s))\rVert_2^2\right] \;-\; \lambda \sum_k \mathbb{E}_s\!\left[\mathbb{E}_{a \sim \pi_k(\cdot\mid s)}\big[\mathrm{sel}(s, a, k)\big]\right] \qquad (2)$$

where λ > 0 trades off reconstruction against selectivity. Here one can think of sel(s, a, k) as the reward signal of a control problem, and the expected reward is maximized by finding the optimal set of policies π_k.

Note that many variations of this objective are possible. For example, it is also possible to have directed selectivity: by using max(0, f_k(s') − f_k(s)) (denoted (f_k(s') − f_k(s))_+) or simply the signed difference f_k(s') − f_k(s) instead of the absolute value in the numerator of (1), the policies must learn to increase the learned latent feature rather than simply change it. This may be useful if the policy that gradually increases a feature is distinct from the policy that decreases it. Using log-selectivity, log sel, or a sharpened form of it, may also lead to easier optimization.
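A sketch of the combined objective (2), with the numerator variants above exposed as options, might look as follows (a single feature k is shown for brevity; λ, the variant names and the epsilon are assumptions):

```python
import torch

def feature_change(dh_k, variant="abs"):
    """Numerator variants for the selectivity (illustrative); dh_k is a torch tensor."""
    if variant == "abs":          # plain selectivity: |f_k(s') - f_k(s)|
        return dh_k.abs()
    if variant == "directed":     # rectified change: the policy must increase f_k
        return dh_k.clamp(min=0.0)
    return dh_k                   # signed change

def objective(s, s_next, f, g, k, lam=1.0, variant="abs", eps=1e-8):
    """Objective (2) for feature k: reconstruction error minus lambda * selectivity."""
    recon = 0.5 * ((s - g(f(s))) ** 2).sum(dim=1).mean()
    dh = f(s_next) - f(s)
    sel = (feature_change(dh[:, k], variant) / (dh.abs().sum(dim=1) + eps)).mean()
    return recon - lam * sel
```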

The learning algorithm we propose is summarized in Algorithm 1, where θ_f and θ_g are the parameters of f and g, θ_{π_k} the parameters of π_k, and α denotes the learning rate.

1: for t = 1, …, T do
2:     Sample a state s from the environment
3:     θ_f ← θ_f − α ∇_{θ_f} ½ ||s − g(f(s))||²
4:     θ_g ← θ_g − α ∇_{θ_g} ½ ||s − g(f(s))||²
5:     for k = 1, …, K do
6:         θ_f ← θ_f + α λ ∇_{θ_f} E_{a∼π_k(·|s)}[ sel(s, a, k) ]
7:         θ_{π_k} ← θ_{π_k} + α λ ∇_{θ_{π_k}} E_{a∼π_k(·|s)}[ sel(s, a, k) ]
Algorithm 1 Training an autoencoder with disentangled factors

The gradients on lines 3 and 4 are computed exactly via backpropagation. In our experiments, the gradient on line 6 is also computed by backpropagation and sampling of the expectation, while the gradient on line 7 is computed with the REINFORCE (Glynn, 1987; Williams, 1992) estimator:

$$\nabla_{\theta_{\pi_k}} \mathbb{E}_{a \sim \pi_k(\cdot\mid s)}\big[\mathrm{sel}(s, a, k)\big] = \mathbb{E}_{a \sim \pi_k(\cdot\mid s)}\big[(\mathrm{sel}(s, a, k) - b(s))\,\nabla_{\theta_{\pi_k}} \log \pi_k(a\mid s)\big]$$

where b(s) is a baseline function, which can for example be chosen to be the mean reward or an estimate of the value of the state.
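Putting Algorithm 1 together, a minimal single-transition training step could look like the sketch below. The environment interface (reset/step returning raw observations), the optimizers and the zero baseline are assumptions made for illustration; a running-mean or learned value baseline would be used in practice.

```python
# Sketch of one iteration of Algorithm 1 with a REINFORCE policy update.
# Assumed interface: env.reset() / env.step(a) return the raw (next) observation.
# opt_ae optimizes the parameters of f and g; opt_pi those of all the policies.
import torch

def train_step(env, f, g, policies, opt_ae, opt_pi, lam=1.0, eps=1e-8):
    s = torch.as_tensor(env.reset(), dtype=torch.float32).unsqueeze(0)

    # Lines 3-4: exact gradients of the reconstruction loss via backpropagation.
    recon = 0.5 * ((s - g(f(s))) ** 2).sum()
    opt_ae.zero_grad()
    recon.backward()
    opt_ae.step()

    for k, pi_k in enumerate(policies):
        # Sample a ~ pi_k(.|s) and observe s' ~ P(s'|s,a).
        probs = pi_k(f(s).detach())               # action probabilities from the (detached) features
        a = torch.distributions.Categorical(probs=probs).sample().item()
        s_next = torch.as_tensor(env.step(a), dtype=torch.float32).unsqueeze(0)

        dh = (f(s_next) - f(s)).abs()
        sel = dh[0, k] / (dh.sum() + eps)         # selectivity, eq. (1)

        # Line 6: ascend the selectivity with respect to the encoder parameters.
        opt_ae.zero_grad()
        (-lam * sel).backward()
        opt_ae.step()

        # Line 7: REINFORCE update for pi_k, using sel as the reward and a zero baseline.
        opt_pi.zero_grad()
        (-(sel.detach() - 0.0) * torch.log(probs[0, a] + eps)).backward()
        opt_pi.step()
```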

2.3 From enumerated factors to continuous embeddings of individual factors

A limitation of the approach in Algorithm 1 is that it requires the set of potentially controllable factors to be small and enumerated. This makes sense in a simple environment where we always have the same set of objects in the scene. But in more realistic environments, the number of possible objects present in the set can be combinatorially large (and better described by notions such as types), while an individual scene will only comprise a finite number of instances of such objects. Therefore, instead of indexing the possible factors by an integer k, we propose to index them by an embedding, i.e., a real-valued vector φ. In the last section, we enforced variations in the environment to be captured by a coordinate f_k of the representation. We can view this as having a finite set of attribute variations which are influenced separately by the policies π_k. We now relax this assumption by indexing this set with a learned real-valued vector φ, leading to a continuous set of attributes. The idea of mapping symbolic entities to a distributed representation is one of the key ingredients of the success of deep learning (Goodfellow et al., 2016), and can be exploited here as well.

Selecting attributes

Conditioned on a scene representation h = f(s), a distribution of feasible policies exists. Samples from this distribution represent ways to modify the scene and thus may trigger an internal selectivity reward signal. For instance, h might represent a room with objects such as a light switch. The embedding φ can be thought of as the distributed representation for the “name” of an underlying factor, to which are associated a policy and a value. In this setting, the light in a room could be a factor that can be either on or off. It could be associated with a policy to turn it on, and a binary value referring to its state, called an attribute or a feature value. We wish to jointly learn the policy π(·|s, φ) that modifies the scene, so as to control the corresponding value of the attribute in the scene, whose variation is computed by an attribute variation selector function A(h', h, φ). In order to get a distribution of such embeddings, we compute φ as a function of h and some random noise z.

In this scenario, one strategy to determine whether a selected attribute variation evolves independently from other attribute variations is to compare its value (in expectation over the policy actions) to the values obtained with other factors. We thus compute the following selectivity, which acts as an intrinsic reward signal generalizing (1):

$$\mathrm{sel}(s, a, \varphi) = \mathbb{E}_{s' \sim \mathcal{P}^a_{ss'}}\left[\frac{A(h', h, \varphi)}{\mathbb{E}_{\varphi'}\big[A(h', h, \varphi')\big]}\right] \qquad (3)$$

where h = f(s) and h' = f(s'). We approximate the expectation over φ' by sampling a fixed number of factor embeddings. This model is then trained by jointly minimizing the autoencoder reconstruction cost and the disentanglement objective, as depicted in Figure 1.
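A small sketch of this normalization, with the denominator estimated from a minibatch of sampled factor embeddings and the selector A passed in as a callable (names and the epsilon are assumptions):

```python
import torch

def continuous_selectivity(h, h_next, phi, phi_samples, selector, eps=1e-8):
    """Estimate of sel(s, a, phi) in (3): the variation picked out by phi, normalised
    by the average variation picked out by a minibatch of sampled embeddings phi'."""
    num = selector(h_next, h, phi)
    den = torch.stack([selector(h_next, h, p) for p in phi_samples]).mean()
    return num / (den + eps)
```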

Figure 1: The proposed distributed representation architecture, with the reconstruction and selectivity objectives trained jointly.
Implementing an attribute selector

Ideally, A could be an arbitrary function, e.g. a neural network, but such a function may be harder to optimize. Instead, we observe that in the discrete case mentioned previously, using the k-th coordinate of the latent variation to select attribute k is equivalent to computing A(h', h, φ) = (h' − h) · φ, where φ is a one-hot vector at index k. One simple step towards continuous embeddings is to relax this constraint, and let φ be a function of h and a random vector z drawn from a uniform distribution, while keeping A as this dot product. However, in most of our experiments we used a gaussian kernel between the latent variation h' − h and φ, because of the better numerical stability it provides.
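Both selector choices plug directly into the selectivity sketch above. The dot-product form follows from the discussion; the exact gaussian-kernel expression and its bandwidth are assumptions, since the text only names the kernel:

```python
import torch

def dot_selector(h_next, h, phi):
    # With phi a one-hot vector at index k this reduces to the k-th coordinate
    # of the latent variation, recovering the discrete case.
    return (h_next - h) @ phi

def gaussian_selector(h_next, h, phi, sigma=1.0):
    # Assumed kernel form: similarity between the latent variation h' - h and phi.
    return torch.exp(-((h_next - h - phi) ** 2).sum() / (2 * sigma ** 2))
```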

Unlike in the finite case, we are not sampling uniformly over policies, as we now let a neural network choose the probability distribution of φ. This could lead to exploration issues. We demonstrate in the experiments of §3.3 that simple strategies allow such a network to learn simple distributions.

3 Experimental results

In order to validate that our method learns independently controllable features, we perform several experiments. First, in the most basic gridworld-like setting, an agent is allowed to move around in four directions. This basic domain allows us to verify whether, in the discrete case, the learning process disentangles the underlying features and recovers the ground-truth properties of the environment.

Then, we show results of our continuous factors embeddings method applied to MazeBase (Sukhbaatar et al., 2015), as well as how we can use the learned representations to tackle policy inference or planning problems.

3.1 A simple gridworld

Our first experiment is performed on a gridworld-like setting, illustrated in Figure 2(a): the agent sees a square on a pixel grid, and has 4 actions that move it up, down, left or right. By interacting with the environment, an autoencoder with directed selectivity (objective (1) without the absolute value in the numerator) learns latent features that map to the x and y position of the square (see Figure 2(b, c)), without ever having explicit access to these values, and while reconstructing its input properly. In contrast, a plain autoencoder also reconstructs properly, but without learning the two latent features explicitly. (Architecture: f has two ReLU convolutional layers with stride 2, followed by a fully connected ReLU layer of 32 units and a layer of K features; g is the transposed architecture of f; each π_k is a softmax policy over the 4 actions, computed from the output of the ReLU fully connected layer. We use Adam (Kingma and Ba, 2014) to perform gradient descent.)
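A transcription of this architecture as a sketch is given below (the transposed decoder is omitted); the input resolution, channel widths, kernel sizes and the number of features K are not stated in the text and are therefore assumptions.

```python
import torch
import torch.nn as nn

K = 4                                                 # number of latent features (assumed)

trunk = nn.Sequential(                                # two strided ReLU convolutions + 32-unit FC layer
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 3 * 3, 32), nn.ReLU(),
)
feature_head = nn.Linear(32, K)                       # the layer of K features f(s)
policies = nn.ModuleList(                             # one softmax policy over the 4 actions per feature,
    [nn.Sequential(nn.Linear(32, 4), nn.Softmax(dim=-1))  # computed from the 32-unit hidden layer
     for _ in range(K)]
)

x = torch.rand(1, 1, 12, 12)                          # a 12x12 single-channel observation (assumed size)
hidden = trunk(x)
h = feature_head(hidden)                              # latent features, shape (1, K)
probs = policies[0](hidden)                           # pi_1's distribution over the 4 actions
```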

Note that in this setting, the learning process is robust to a stochastic version of the environment, where with some probability ε either no action or a random action is carried out instead of the chosen one. We have successfully trained models recovering x and y under such noise, using the same architecture but a smaller learning rate.

Figure 2: A simple gridworld with 4 actions that push a square left, right, up or down. (a) Left: an example ground truth; right: the reconstruction of the model trained with selectivity. (b) The slope of a linear regression of the true features (the real x and y position of the agent) as a function of each latent feature. White indicates no correlation; blue and red indicate strong negative or positive slopes respectively. Some of the latent features recover x while others recover y. (c) Each row is a policy π_k, each column corresponds to an action (left/right/up/down); cell (k, a) shows π_k(a|s) averaged over states s.

3.2 Selectivity as the only objective

We also find experimentally that training discrete independently controllable features without the autoencoder objective still correctly recovers ground-truth features and their associated control policies. Although convergence is slower than when jointly training an autoencoder, this shows that the objective we propose is by itself strong enough to provide a learning signal for discovering a disentangled latent representation.

We train such a model on a gridworld MNIST environment, where instead of a square there are two MNIST digits. The two digits can be moved on the grid via 4 directional actions each (so there are 8 actions in total); the first digit is always odd and the second always even, so they are distinguishable. In Figure 3 we plot each latent feature as a curve, as a function of each ground-truth factor. For example, we see that the black feature recovers x_1, the horizontal position of the first digit, and that the purple feature recovers y_2, the vertical position of the second digit.

Figure 3: In a gridworld environment with 2 objects (in this case 2 MNIST digits), we know there are 4 underlying features, the positions (x_1, y_1) and (x_2, y_2) of the two digits. Each of the four plots represents the evolution of the learned features as a function of one underlying feature, from left to right x_1, y_1, x_2, y_2. We see that for each of them, at least one learned feature recovers it almost linearly, from the raw pixels only.

3.3 Experiments on MazeBase

We use MazeBase (Sukhbaatar et al., 2015) to assess the performance of our continuous-embedding approach in a more complex and well-known environment. MazeBase contains 10 different 2D games in which an agent has to solve a specific task (going to a certain location on the board, activating switches, moving a block to a specific place, etc.). We do not aim to solve the games here, and only deal with one-step policies.

In this setting, the agent (a red circle) can move in a small environment and perform the actions down, left, right and up. To make the disentanglement task harder, we add a redundant up action as well as the composite action down+left. The agent can go anywhere except on the orange blocks.

In Figure 4, we show that the learned representation clusters the latent variations by underlying factor of variation, such that it is possible to decompose the variation between two arbitrary state representations into a sum of small variations along a trajectory (Figure 5).

3.3.1 Continuous policy embeddings

We consider the model described in §2.3. Our architecture is as follows: the encoder, mapping the raw pixel state to a latent representation, is a 4-layer convolutional neural network with batch normalization (Ioffe and Szegedy, 2015) and leaky ReLU activations. The decoder uses the transposed architecture with ReLU activations. The noise z is sampled from a 6-dimensional gaussian distribution, and both the generator producing φ and the policy π(·|h, φ) are neural networks consisting of 2 fully-connected layers. Our attribute selector A is a gaussian kernel. In practice, a minibatch of noise vectors z (and hence of embeddings φ) is sampled at each step. The agent randomly chooses one embedding and samples an action a ∼ π(·|h, φ). Our model parameters are then updated using policy gradient and importance sampling. For each selectivity reward, the denominator of (3) is estimated by averaging A(h', h, φ') over the other embeddings of the minibatch.

After jointly training the reconstruction and selectivity losses, our algorithm disentangles four directed factors of variation, as seen in Figure 4: increasing or decreasing the x-position and the y-position of the agent. For visualization purposes, in the rest of the section we chose the bottleneck of the autoencoder to be of size 2.

The disentanglement appears clearly, as the latent features corresponding to the x and y position are orthogonal in the latent space. Moreover, we notice that our algorithm assigns both up actions (white and pink dots in Figure 4(a)) to the same feature. It also does not create a significant mode for the feature corresponding to the action down+left (light blue dots in Figure 4(a)), as this feature is already explained by the features for down and left.

Figure 4: (a) Sampled latent variations h' − h and their kernel density estimate, obtained when sampling random controllable factors φ. We observe that our algorithm disentangles these representations into distinct modes, each corresponding to the action that was actually taken by the agent (pink and white for up, light blue for down+left, green for right, purple and black for down, and night blue for left). (b) The disentangled structure in the latent space. The x and y axes are disentangled, such that we can recover the x and y position of the agent in any observation simply by looking at its latent encoding h = f(s). The missing point on the grid is the only position the agent cannot reach, as it lies on an orange block.

3.3.2 Towards planning and policy inference

This disentangled structure could be used to address many challenging issues in reinforcement learning. We give two examples in Figure 5:

  • Model-based predictions: Given an initial state s_0 and a sequence of actions (a_1, …, a_T), we want to predict the resulting state s_T.

  • A simplified deterministic policy inference problem: Given an initial state s_0 and a terminal state s_T, we aim to find a suitable sequence of actions (a_1, …, a_T) such that s_T can be reached from s_0 by following it.

Because of the activation used on the last layer of f, the different factors of variation are placed on the vertices of a hypercube of dimension K, and we can think of the policy inference problem as finding a path in that simpler space, where the starting point is f(s_0) and the goal is f(s_T). We believe this could prove to be a much easier problem to solve.
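As a toy illustration of this idea, the sketch below greedily decomposes the gap between a start and a goal encoding into steps taken from a dictionary of per-action latent variations; the hard-coded 2-D effects are placeholders, standing in for quantities that would in practice be estimated from the learned representation (e.g. the cluster centres of Figure 4(a)).

```python
import numpy as np

# Placeholder latent effects dh for each primitive action (assumed values).
action_effects = {
    "up":    np.array([0.0,  1.0]),
    "down":  np.array([0.0, -1.0]),
    "left":  np.array([-1.0, 0.0]),
    "right": np.array([ 1.0, 0.0]),
}

def infer_action_sequence(h_start, h_goal, max_steps=50):
    """Greedy policy inference in latent space: repeatedly pick the action whose
    latent effect most reduces the distance to the goal encoding."""
    h, plan = h_start.astype(float), []
    for _ in range(max_steps):
        best = min(action_effects,
                   key=lambda a: np.linalg.norm(h + action_effects[a] - h_goal))
        if np.linalg.norm(h + action_effects[best] - h_goal) >= np.linalg.norm(h - h_goal):
            break                                 # no action gets us closer: stop
        h = h + action_effects[best]
        plan.append(best)
    return plan

print(infer_action_sequence(np.array([0.0, 0.0]), np.array([2.0, -1.0])))
# -> ['right', 'down', 'right']  (ties between equally good actions are broken arbitrarily)
```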

Figure 5: (a) Predicting the effect of a cause on MazeBase. The leftmost image is the visual input of the environment, where the agent is the round circle and the switch states are represented by shades of green. After training, we are able to distinguish one cluster per latent variation h' − h (Figure 4), that is to say per variation obtained after performing an action, independently of the agent’s position. Therefore, we are able to move the agent simply by adding the corresponding variation to its latent representation h. The second image is the reconstruction obtained by feeding the resulting latent vector into the decoder. (b) Given a starting state and a goal state, we are able to decompose the difference of their two representations into a (non-directed) sequence of movements.

However, this disentangled representation alone cannot completely solve these two problems in an arbitrary environment. Indeed, the only factors we are able to disentangle are those directly controllable by the agent; thus, we cannot account for the ambient dynamics or the influence of other agents.

4 Related work

There is a large body of work on learning features in RL focusing on indirectly learning good internal representations. In Jaderberg et al. (2016), agents learn off-policy to control their pixel inputs, forcing them to learn features that help control the environment (at the pixel level). Oh et al. (2015) propose models that learn to predict the future, conditioned on action sequences, which push the agent to capture temporal features. Many more works go in this direction, such as (deep) successor feature representations (Dayan, 1993; Kulkarni et al., 2016) or the options framework (Sutton et al., 1999; Precup, 2000) when used in conjunction with neural networks (Bacon et al., 2016).

Our approach is similar in spirit to the Horde architecture (Sutton et al., 2011). In that scenario, agents learn policies that maximize specific inputs, whereas we learn policies that control simultaneously learned features of the input. The predictions for all these policies then become features for the agent. Our objective is defined specifically in the context of autoencoders but can be generalized to other representation-learning frameworks. Unlike recent work on the predictron (Silver et al., 2017), our approach is not focused on solving a planning task; the goal is simply to learn how agents control their environment.

5 Conclusion and discussion: Scaling to general environments, controllability and the binding problem

We have introduced a novel kind of training clue for learning representations that disentangle the underlying factors of variation. The main assumption is that some of those factors correspond to independently controllable aspects of the environment. This leads to training frameworks in which one jointly learns a set of exploratory policies and the corresponding features of the learned representation, which disentangle the controlled aspects. This is only a first step towards training agents that learn to control their environments at the same time as learning good representations of them.

We focused on the simpler setup in which the environment is made of a static set of objects. In this case, if the objective posited in §2.2 is optimized correctly, we can assume that each feature of the representation unambiguously refers to some controllable property of some specific object in the environment. For example, the agent’s world might contain only a red circle and a green rectangle, which are only affected by the actions of the agent (they do not move on their own), and we only change the positions and colours of these objects from one trial to the next. Hence, a specific feature can learn to unambiguously refer to the position or the colour of one of these two objects.

In reality, environments are stochastic, and the set of objects in a given scene is drawn from some distribution. The number of objects may vary and their types may differ. It then becomes less obvious how a fixed feature could refer in a clear way to some attribute of one of the objects in a particular scene. If we have multiple instances of objects of different types, some addressing or naming scheme is required to refer to the particular objects (instances) present in the scene, so as to match the policy with a particular attribute of a particular object to selectively modify. While our proposed distributed alternative (§2.3) is an attempt to address this, a fundamental representational problem remains.

This is connected to the binding problem in neuro-cognitive science: how to represent a set of objects, each having different attributes, so that we do not confuse, for example, the set {red circle, blue square} with {red square, blue circle}. The binding problem has received some attention in the representation learning literature (Minin et al., 2012; Greff et al., 2016), but still remains mostly unsolved. Jointly considering this problem and learning controllable features may prove fruitful.

These ideas may also lead to interesting ways of performing exploration. The RL exploration process could be driven by a notion of controllability, predicting the interestingness of objects in a scene and choosing features and associated policies with which to attempt controlling them – such ideas have only been briefly explored in the literature (e.g. Ratitch and Precup (2003)). How do humans choose with which object to play? We are attracted to objects for which we do not yet know if and how we can control them, and such a process may be critical to learn how the world works.

References