1 Introduction
Whether in static or dynamic environments, decision making for real-world problems is often confronted with the hard challenge of finding a “good” representation of the problem. In the context of supervised or semi-supervised learning, it has been argued (Bengio, 2009) that good representations separate out underlying explanatory factors, which may be causes of the observed data. In such problems, feature learning often involves mechanisms such as autoencoders (Hinton and Salakhutdinov, 2006; Vincent et al., 2008), which find latent features that explain the observed data. In interactive environments, the temporal dependency between successive observations creates a new opportunity to notice causal structure in data which may not be apparent using only observational studies. The need to experiment in order to discover causal relationships has already been well explored in psychology (e.g. Gopnik and Wellman (2012)). In reinforcement learning, several approaches explore mechanisms that push the internal representations of learned models to be “good” in the sense that they provide better control (see §4), and control is a particularly important causal relationship between an agent and elements of its environment.
We propose and explore a more direct mechanism for representation learning, which explicitly links an agent’s control over its environment with its internal feature representations. Specifically, we hypothesize that some of the factors explaining variations in the data correspond to aspects of the world that can be controlled by the agent. For example, an object that could be pushed around or picked up independently of others is an independently controllable aspect of the environment. Our approach therefore aims to jointly discover a set of features (functions of the environment state) and policies (which change the state) such that each policy controls the associated feature while leaving the other features unchanged as much as possible. In §2 and §3 we explain this mechanism and show experimental results for the simplest instantiation of this new principle. In §5 we discuss how this principle could be applied more generally, and which research challenges emerge.
2 Independently controllable features
To make the above intuitions concrete, assume that there are factors of variation underlying the observations coming from an interactive environment that are independently controllable. That is, a controllable factor of variation is one for which there exists a policy which will modify that factor only, and not the others. For example, the object associated with a set of pixels could be acted on independently from other objects, which would explain variations in its pose and scale when we move it around while leaving the others generally unchanged. The object position in this case is a factor of variation. What poses a challenge for discovering and mapping such factors into computed features is the fact that the factors are not explicitly observed. Our goal is for the agent to autonomously discover such factors – which we call independently controllable features – along with policies that control them. While these may seem like strong assumptions about the nature of the environment, we argue that these assumptions are similar to regularizers, and are meant to make a difficult learning problem (that of learning good representations which disentangle underlying factors) better constrained.
There are many possible ways to express the preference for learning independently controllable features as an objective. §2.2 proposes such an objective for a simple scenario. §3.1 illustrates the effect of this objective when all the features of the environment are simple and controllable by the agent. Moreover, in §3.2, we show that by itself the objective we propose is strong enough to recover underlying factors of variation without additional reconstruction loss. In §2.3, we aim to generalize such an objective for a continuous representation of factors and policies. In §3.3, we present our experiments on the MazeBase domain (Sukhbaatar et al., 2015). §3.3.1 shows that, using these continuous embeddings, we are able to disentangle the latent space and the controllable factors. In §3.3.2, we show how the learned representations can be used for planning and for inferring the sequence of actions performed between two states.
2.1 Capturing the main factors of variation
Since not all factors of variation present in the data are controllable, we propose to combine two objectives: (1) one to encourage the learned representation to capture the main factors of variation, and (2) one to encourage the representation to be structured so that the controllable factors are disentangled from each other and from other factors. Any common method for representation learning could be used for (1); for simplicity we use a simple autoencoder framework throughout this paper (Hinton and Salakhutdinov, 2006). The encoder $f$ and decoder $g$ of the autoencoder are viewed as function approximators with parameters $\theta$, such that $f: X \to H$ maps the input space $X$ to some latent space $H$, and $g: H \to X$ maps $H$ back to the input space $X$. Autoencoders are trained to minimize the discrepancy between $x$ and $g(f(x))$, a.k.a. the reconstruction error, e.g.:
$$\min_{\theta} \; \frac{1}{2} \left\| x - g(f(x)) \right\|_2^2$$
We call $h = f(x)$ the latent feature representation of $x$, with $n$ features.
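As a minimal sketch of the reconstruction error described above, the snippet below uses a linear encoder and decoder in NumPy. The names `f`, `g`, `W_enc`, `W_dec` and the layer sizes are illustrative assumptions, not the architecture used in the experiments, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, latent_dim = 16, 4  # hypothetical sizes, with latent_dim < input_dim

# Linear encoder and decoder weights; a real model would use deeper networks.
W_enc = rng.normal(scale=0.1, size=(latent_dim, input_dim))
W_dec = rng.normal(scale=0.1, size=(input_dim, latent_dim))


def f(x):
    """Encoder: map inputs of shape (batch, input_dim) to latent features."""
    return x @ W_enc.T


def g(h):
    """Decoder: map latent features back to the input space."""
    return h @ W_dec.T


def reconstruction_error(x):
    """Batch mean of 0.5 * ||x - g(f(x))||^2."""
    diff = x - g(f(x))
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))


x = rng.normal(size=(8, input_dim))
h = f(x)                        # latent feature representation
loss = reconstruction_error(x)  # quantity an optimizer would minimize
```

Minimizing `loss` with respect to the two weight matrices would train this toy autoencoder; the bottleneck (`latent_dim < input_dim`) is what forces compression.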
It is common in the case of a vanilla autoencoder to assume that the latent space has far fewer dimensions than the input space, $\dim(H) \ll \dim(X)$. This causes $f$ and $g$ to perform dimensionality reduction of $X$, i.e. compression, since there is a dimension bottleneck through which information about the input data must pass. Often, this bottleneck forces the optimization procedure to uncover principal factors of variation of the data on which the autoencoder is trained. However, this does not necessarily imply that the different components of the vector $h = f(x)$ are individually meaningful. In fact, note that for any bijective transformation $T$ of the latent space, we could obtain the same reconstruction error by replacing $f$ by $T \circ f$ and $g$ by $g \circ T^{-1}$, so we should not expect any form of disentangling of the factors of variation unless some additional constraints or penalties are imposed on $h$. This motivates the approach we are about to present. Specifically, we have a preference for policies that can separately influence one of the coordinates of $h$, and we want to express a preference for learning representations that make such policies possible.
Note that there may be several other ways to discover and disentangle underlying factors of variation. Many deep generative models, including variational autoencoders (Kingma and Welling, 2014) and other descendants of the Helmholtz machine (Dayan et al., 1995), generative adversarial networks (Goodfellow et al., 2014) or nonlinear versions of ICA (Dinh et al., 2014; Hyvarinen and Morioka, 2016) attempt to disentangle the underlying factors of variation by assuming that their joint distribution (marginalizing out the observed $x$) factorizes, i.e., that they are marginally independent. Here we explore another direction, trying to exploit the ability of a learning agent to act in the world in order to impose a further constraint on the representation.
2.2 Disentangling independently controllable factors in the simplest case
Consider the following simple scenario: we train an autoencoder producing $n$ latent features $f_k$, $k \in \{1, \dots, n\}$. In tandem with these features we train $n$ policies, denoted $\pi_k(a \mid s)$, that map an agent’s observation $s$ to a categorical distribution over a set of actions. Autoencoders can learn relatively arbitrary feature representations, but we would like many of these features to correspond to controllable factors in the learner’s environment. Specifically, we would like policy $\pi_k$ to cause a change only in $f_k$ and not in any other features. We think of $f_k$ and $\pi_k$ as a feature-policy pair.
In order to quantify the change in $f_k$ when actions are taken according to $\pi_k$, we define the selectivity of a feature as:
$$\mathrm{sel}(s, a, k) = \mathbb{E}_{s' \sim \mathcal{P}^a_{ss'}} \left[ \frac{|f_k(s') - f_k(s)|}{\sum_{k'} |f_{k'}(s') - f_{k'}(s)|} \right] \qquad (1)$$
where $s$, $s'$ are successive raw state representations (e.g. pixels), $a$ is the action, and $\mathcal{P}^a_{ss'}$ is the environment transition distribution from $s$ to $s'$ under action $a$. The normalization factor in the denominator of the above equation ensures that the selectivity of $f_k$ is maximal when only that single feature changes as a result of some action.
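The selectivity measure can be illustrated on latent vectors directly. The sketch below assumes the features have already been computed as arrays `h` (for the current state) and `h_next` (for the successor state); the function names and the small constant `eps`, added to avoid division by zero, are assumptions made for this example.

```python
import numpy as np


def selectivity(h, h_next, k, eps=1e-8):
    """Selectivity of feature k for one observed transition.

    h and h_next are the latent feature vectors of two successive states.
    The result lies in [0, 1] and approaches 1 when only feature k changed.
    """
    change = np.abs(h_next - h)
    return change[k] / (change.sum() + eps)


def expected_selectivity(h, h_next_samples, k):
    """Monte Carlo estimate of the expectation over sampled next states."""
    return float(np.mean([selectivity(h, hn, k) for hn in h_next_samples]))


h = np.zeros(3)
only_k = selectivity(h, np.array([1.0, 0.0, 0.0]), k=0)  # only feature 0 moved
shared = selectivity(h, np.array([1.0, 1.0, 0.0]), k=0)  # features 0 and 1 moved
```

Here `only_k` is close to 1 and `shared` is close to 0.5, matching the intent of the normalization: a feature is rewarded most when it is the only one that changes.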
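By having an objective that maximizes selectivity and minimizes the autoencoder objective, we can ensure that the features learned can both capture the main factors of variation in the data and recover independently controllable factors. Hence, we define the following objective, which can be minimized jointly on $f$, $\pi_k$ and $g$ via stochastic gradient descent:
$$\mathbb{E}_{s}\left[\frac{1}{2} \left\| s - g(f(s)) \right\|_2^2\right] - \lambda \sum_{k} \mathbb{E}_{s}\left[\sum_{a} \pi_k(a \mid s)\, \mathrm{sel}(s, a, k)\right] \qquad (2)$$
where $\lambda$ weights the selectivity term. Here one can think of $\mathrm{sel}(s, a, k)$ as the reward signal of a control problem, and the expected reward $\mathbb{E}_{a \sim \pi_k}[\mathrm{sel}(s, a, k)]$ is maximized by finding the optimal set of policies $\pi_k$.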
Note that many variations of this objective are possible. For example, it is also possible to have directed selectivity: by replacing the absolute value in the numerator of (1) with a signed quantity, for instance simply the difference $f_k(s') - f_k(s)$, the policies must learn to increase the learned latent feature rather than simply change it. This may be useful if the policy to gradually increase a feature is distinct from the policy that decreases it. Using log-selectivity, $\log \mathrm{sel}(s, a, k)$, or a sharpened form of selectivity, may also lead to easier optimization.
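The learning algorithm we propose is summarized in Algorithm 1, which updates the parameters of the autoencoder ($f$ and $g$) and the parameters of the policies $\pi_k$.
The gradients on lines 3 and 4 are computed exactly via backpropagation. In our experiments, the gradient on line 6 is also computed by backpropagation and sampling of the expectation, while the gradient on line 7 is computed with the REINFORCE (Glynn, 1987; Williams, 1992) estimator:
$$\nabla\, \mathbb{E}_{a \sim \pi_k}\left[\mathrm{sel}(s, a, k)\right] = \mathbb{E}_{a \sim \pi_k}\left[\nabla \log \pi_k(a \mid s)\, \big(\mathrm{sel}(s, a, k) - b(s)\big)\right]$$
where the gradient is taken with respect to the policy parameters and $b(s)$ is a baseline function, which can for example be chosen to be the mean reward or an estimate of the value of the state.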
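To make the REINFORCE estimator with a baseline concrete, here is a small sketch for a softmax policy over four actions. The per-action rewards stand in for selectivity values and are made up for the example; the names `reinforce_gradient` and `exact_gradient` are likewise assumptions. The sampled estimate is compared with the exact gradient, which is available in closed form for this toy case.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def reinforce_gradient(logits, rewards, n_samples, rng, baseline=0.0):
    """Score-function estimate of d E_{a~pi}[reward(a)] / d logits."""
    pi = softmax(logits)
    actions = rng.choice(len(pi), size=n_samples, p=pi)
    grad = np.zeros_like(logits)
    for a in actions:
        grad_log_pi = -pi.copy()
        grad_log_pi[a] += 1.0  # gradient of log softmax w.r.t. the logits
        grad += grad_log_pi * (rewards[a] - baseline)
    return grad / n_samples


def exact_gradient(logits, rewards):
    """Closed-form gradient of the expected reward for a softmax policy."""
    pi = softmax(logits)
    return pi * (rewards - pi @ rewards)


rng = np.random.default_rng(0)
logits = np.array([0.2, -0.1, 0.0, 0.5])
rewards = np.array([0.9, 0.1, 0.3, 0.2])  # stand-in selectivity values
baseline = softmax(logits) @ rewards      # mean reward used as baseline
estimate = reinforce_gradient(logits, rewards, 20000, rng, baseline)
exact = exact_gradient(logits, rewards)
```

Subtracting the baseline leaves the expected gradient unchanged while typically reducing the variance of the estimate, which is why the mean reward is a common choice.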
2.3 From enumerated factors to continuous embeddings of individual factors
A limitation of the approach in Algorithm 1 is that it requires the set of potentially controllable factors to be small and enumerated. This makes sense in a simple environment where we always have the same set of objects in the scene. But in more realistic environments, the number of possible objects present in the set can be combinatorially large (and better described by notions such as types), while an individual scene will only comprise a finite number of instances of such objects. Therefore, instead of indexing the possible factors by an integer, we propose to index them by an embedding, i.e., a real-valued vector. In the previous section, we enforced variations in the environment to be captured by a coordinate of $h$. We can view this as having a set of attribute variations that are influenced separately by the policies $\pi_k$. We now relax this assumption by indexing this set by a learned real-valued vector $\phi$, leading to a continuous set of attributes. The idea of mapping symbolic entities to a distributed representation is one of the key ingredients of the success of deep learning (Goodfellow et al., 2016), and can be exploited here as well.
Selecting attributes
Conditioned on a scene representation $h$, a distribution of policies is feasible. Samples from this distribution represent ways to modify the scene and thus may trigger an internal selectivity reward signal. For instance, $h$ might represent a room with objects such as a light switch. $\phi$ can be thought of as the distributed representation for the “name” of an underlying factor, to which is associated a policy and a value. In this setting, the light in a room could be a factor that could be either on or off. It could be associated with a policy to turn it on, and a binary value referring to its state, called an attribute or a feature value.
We wish to jointly learn the policy $\pi(a \mid h, \phi)$ that modifies the scene, so as to control the corresponding value of the attribute in the scene, whose variation is computed by an attribute variation selector function $A(h', h, \phi)$. In order to get a distribution of such embeddings, we compute $\phi$ as a function of $h$ and some random noise $z$.
In this scenario, one strategy to determine whether some selected attribute variation evolves independently from other attribute variations is to compare its value (in expectation over the policy actions) to the values obtained with other factors. We thus compute the following selectivity that acts as an intrinsic reward signal, generalizing (1):
$$\mathbb{E}_{s' \sim \mathcal{P}^a_{ss'}}\left[\log \frac{A(h', h, \phi)}{\mathbb{E}_{\varphi \sim p(\varphi \mid h)}\left[A(h', h, \varphi)\right]}\right] \qquad (3)$$
where $h = f(s)$ and $h' = f(s')$. We approximate the expectation over $\varphi$ by sampling a fixed number of factor embeddings. This model is then trained by jointly minimizing the autoencoder reconstruction cost and the disentanglement objective as depicted in Figure 1.
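The comparison between a selected factor and other sampled factors can be sketched as follows. Both the log-ratio form of the reward and the particular Gaussian kernel used as the attribute selector are assumptions made for this illustration, as are the names `gaussian_selector`, `continuous_selectivity`, `sigma` and `eps`; the expectation over other factors is approximated by a plain average over a fixed set of sampled embeddings.

```python
import numpy as np


def gaussian_selector(h, h_next, phi, sigma=1.0):
    """Hypothetical attribute selector: large when h_next - h is close to phi."""
    diff = (h_next - h) - phi
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2))


def continuous_selectivity(h, h_next, phi, phi_samples, sigma=1.0, eps=1e-8):
    """Log-ratio of the selected factor's score to the mean score of sampled factors."""
    numerator = gaussian_selector(h, h_next, phi, sigma)
    denominator = np.mean(gaussian_selector(h, h_next, phi_samples, sigma))
    return float(np.log(numerator + eps) - np.log(denominator + eps))


h = np.zeros(2)
h_next = np.array([1.0, 0.0])  # the latent state moved along the first axis
phi_samples = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

r_match = continuous_selectivity(h, h_next, phi_samples[0], phi_samples)
r_mismatch = continuous_selectivity(h, h_next, phi_samples[2], phi_samples)
```

With these inputs the factor that matches the observed change receives a positive reward and the opposite factor a negative one, so the reward separates factors that explain the transition from those that do not.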
Implementing an attribute selector
Ideally $A$ could be an arbitrary function, e.g. a neural network, but such a function may be harder to optimize. Instead, we observe that in the discrete case mentioned previously, using $f_k$ to select attribute $k$ is equivalent to taking the inner product $\phi_k^\top f(s)$, where $\phi_k$ is a one-hot vector at index $k$. One simple step towards continuous embeddings is to relax this constraint, let $\phi$ be a function of $h$ and a random vector $z$ drawn from a uniform distribution, and compute $A$ as the corresponding inner product with $\phi$. However, in most of our experiments, we used a Gaussian kernel because of the better numerical stability it provides.
Unlike in the finite case, we are not sampling uniformly over policies $\pi_k$, as we now let a neural network choose $\phi$’s probability distribution. This could lead to exploration issues. We demonstrate that simple strategies allow for a network to learn simple distributions in the experiments of §3.3.
3 Experimental results
In order to validate that our method learns independently controllable features, we perform several experiments. First, in the most basic gridworld-like setting, an agent is allowed to move around in four directions. This basic domain allows us to verify whether, in the discrete case, the learning process disentangles the underlying features and recovers the ground truth properties of the environment.
Then, we show results of our continuous factor embedding method applied to MazeBase (Sukhbaatar et al., 2015), as well as how we can use the learned representations to tackle policy inference or planning problems.
3.1 A simple gridworld
Our first experiment is performed on a gridworld-like setting, illustrated in Figure 2(a): the agent sees a square on a pixel grid, and has 4 actions that move it up, down, left or right. By interacting with the environment, an autoencoder¹ with directed selectivity (objective (1) without absolute value in the numerator) learns latent features that map to the $(x, y)$ position of the square (see Figure 2(b,c)), without ever having explicit access to these values, and while reconstructing its input properly. In contrast, a plain autoencoder also reconstructs properly but without learning the two latent features explicitly.
¹ We use the following architecture: $f$ has two ReLU convolutional layers with stride 2, followed by a fully connected ReLU layer of 32 units, and a layer of $n$ features; $g$ is the transpose architecture of $f$; $\pi_k$ is a softmax policy over 4 actions, computed from the output of the ReLU fully connected layer. We use Adam (Kingma and Ba, 2014) to perform gradient descent.
Note that in this setting, the learning process is robust to a stochastic version of the environment, where with probability $p$ either no action is taken or a random action is taken. We have successfully trained models recovering $x$ and $y$ in this stochastic setting, using the same architecture but a smaller learning rate.
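For readers who want to reproduce a setting of this kind, the following is a minimal sketch of a noisy gridworld. It is not the environment used in the experiments: the class name `NoisyGridworld`, the default grid size, the single-pixel agent and the choice of the random-action noise variant are all assumptions made for the example.

```python
import numpy as np

# Displacements (row, col) for the four actions: up, down, left, right.
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


class NoisyGridworld:
    """Toy gridworld: a single lit pixel marks the agent on a size x size grid."""

    def __init__(self, size=12, noise=0.0, seed=0):
        self.size = size    # hypothetical grid size
        self.noise = noise  # probability of replacing the action by a random one
        self.rng = np.random.default_rng(seed)
        self.pos = np.array([size // 2, size // 2])

    def observe(self):
        """Return the raw pixel observation."""
        grid = np.zeros((self.size, self.size), dtype=np.float32)
        grid[self.pos[0], self.pos[1]] = 1.0
        return grid

    def step(self, action):
        """Apply an action (possibly replaced by a random one) and observe."""
        if self.rng.random() < self.noise:
            action = int(self.rng.integers(4))
        move = np.array(MOVES[action])
        self.pos = np.clip(self.pos + move, 0, self.size - 1)
        return self.observe()


env = NoisyGridworld(size=5, noise=0.0)
obs = env.step(3)  # move right from the centre cell
```

Pairs of successive observations collected from such an environment are the kind of transition data on which the selectivity objective operates.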
Figure 2: A simple gridworld with 4 actions that push a square left, right, up or down. (a) Left is an example ground truth, right is the reconstruction of the model trained with selectivity. (b) The slope of a linear regression of the true features (the real $x$ and $y$ position of the agent) as a function of each latent feature. White is no correlation, blue and red indicate strong negative or positive slopes respectively. Two latent features recover $x$ and two recover $y$. (c) Each row is a policy $\pi_k$, each column corresponds to an action (left/right/up/down). Cell $(k, a)$ is the average over $s$ of $\pi_k(a \mid s)$.
3.2 Selectivity as an only objective
We also find experimentally that training discrete independently controllable features without training the autoencoder objective correctly recovers ground truth features and their associated control policies. Albeit slower than when jointly training an autoencoder, this shows that the objective we propose is strong enough to provide a learning signal for discovering a disentangled latent representation.
We train such a model on a gridworld MNIST environment, where instead of a square there are two MNIST digits. The two digits can be moved on the grid via 4 directional actions each (so there are 8 actions in total). The first digit is always odd and the second digit always even, so they are distinguishable. In Figure 3 we plot each latent feature as a curve, as a function of each ground truth. For example, we see that the black feature recovers $x_1$, the horizontal position of the first digit, or that the purple feature recovers $y_2$, the vertical position of the second digit.
3.3 Experiments on MazeBase
We use MazeBase (Sukhbaatar et al., 2015) to assess the performance of our continuous embeddings approach on a more complex and well-known environment. MazeBase contains 10 different 2D games in which an agent has to solve a specific task (going to a certain location on the board, activating switches, moving a block to a specific place…). We do not aim to solve the game, and only deal with one-step policies.
In this setting, the agent (a red circle) can move in a small environment and perform the actions down, left, right and up. To make the disentanglement task harder, we add the redundant action up as well as the action down+left. The agent can go anywhere except on the orange blocks.
In Figure 2, we show that the learned representation is such that for each underlying factor of variation, the learned representation clusters vectors such that it is possible to decompose the variation between two arbitrary state representations as a sum of small variations along a trajectory (Figure 5).
3.3.1 Continuous policy embeddings
We consider the model described in §2.3. Our architecture is as follows: the encoder, mapping the raw pixel state to a latent representation, is a 4-layer convolutional neural network with batch normalization (Ioffe and Szegedy, 2015) and leaky ReLU activations. The decoder uses the transposed architecture with ReLU activations. The noise $z$ is sampled from a 6-dimensional Gaussian distribution, and both the generator of $\phi$ and the policy $\pi$ are neural networks consisting of 2 fully-connected layers. Our attribute selector is a Gaussian kernel. In practice, a minibatch of $n$ vectors $\phi_1, \dots, \phi_n$ is sampled at each step. The agent randomly chooses one of them, $\phi$, and samples an action $a \sim \pi(\cdot \mid h, \phi)$. Our model parameters are then updated using policy gradient and importance sampling. For each selectivity reward, the term $\mathbb{E}_{\varphi}[A(h', h, \varphi)]$ is estimated as $\frac{1}{n} \sum_{i} A(h', h, \phi_i)$.
After jointly training the reconstruction and selectivity losses, our algorithm disentangles four directed factors of variations as seen in Figure 2: $\pm x$ position and $\pm y$ position of the agent. For visualization purposes, in the rest of the section, we chose a low-dimensional bottleneck for the autoencoder.
The disentanglement appears clearly as the latent features corresponding to the $x$ and $y$ position are orthogonal in the latent space. Moreover, we notice that our algorithm assigns both actions up (white and pink dots in Figure 2.a) to the same feature. It also does not create a significant mode for the feature corresponding to the action down+left (light blue dots in Figure 2.a) as this feature is already explained by features down and left.
(a) Sampling of the latent variations and its kernel density estimation encountered when sampling random controllable factors $\phi$. We observe that our algorithm disentangles these representations into main modes, each corresponding to the action that was actually taken by the agent (pink and white for up, light blue for down+left, green for right, purple black for down and night blue for left). (b) The disentangled structure in the latent space. The $x$ and $y$ axes are disentangled such that we can recover the $x$ and $y$ position of the agent in any observation simply by looking at its latent encoding $h = f(s)$. The missing point on this grid is the only position the agent cannot reach as it lies on an orange block.
3.3.2 Towards planning and policy inference
This disentangled structure could be used to address many challenging issues in reinforcement learning. We give two examples in Figure 5:
• Model-based predictions: Given an initial state $s_0$ and an action sequence $a_{0:T-1}$, we want to predict the resulting state $s_T$.
• A simplified deterministic policy inference problem: Given an initial state $s_0$ and a terminal state $s_T$, we aim to find a suitable action sequence $a_{0:T-1}$ such that $s_T$ can be reached from $s_0$ by following it.
Because of the activation on the last layer of the factor generator, the different factors of variation are placed on the vertices of a hypercube, and we can think of the policy inference problem as finding a path in that simpler space, where the starting point is $h_0 = f(s_0)$ and the goal is $h_T = f(s_T)$. We believe this could prove to be a much easier problem to solve.
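One way to picture policy inference as path finding in the latent space is the greedy sketch below. It assumes, purely for illustration, that each action shifts the latent code by a fixed, known displacement (the dictionary `action_effects`), so that the gap between two encodings can be decomposed into a sum of such steps; the function name `infer_actions` and the stopping rule are hypothetical and this is not the procedure used in the experiments.

```python
import numpy as np


def infer_actions(h_start, h_goal, action_effects, max_steps=50, tol=1e-6):
    """Greedily pick actions whose latent displacement moves h toward h_goal."""
    h = np.asarray(h_start, dtype=float).copy()
    h_goal = np.asarray(h_goal, dtype=float)
    plan = []
    for _ in range(max_steps):
        current = np.linalg.norm(h_goal - h)
        if current <= tol:
            break
        # Distance to the goal after applying each candidate action.
        dists = {a: np.linalg.norm(h_goal - (h + d)) for a, d in action_effects.items()}
        best = min(dists, key=dists.get)
        if dists[best] >= current:
            break  # no action brings us closer; stop rather than loop
        h = h + action_effects[best]
        plan.append(best)
    return plan, h


# Assumed per-action displacements in a two-dimensional latent space.
action_effects = {
    "up": np.array([0.0, 1.0]),
    "down": np.array([0.0, -1.0]),
    "left": np.array([-1.0, 0.0]),
    "right": np.array([1.0, 0.0]),
}
plan, h_final = infer_actions([0.0, 0.0], [2.0, -1.0], action_effects)
```

In this toy case the plan consists of two moves to the right and one move down. A greedy search is only one option; any shortest-path method over the latent grid would serve the same illustrative purpose.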
However, this disentangled representation alone cannot completely solve these two problems in an arbitrary environment. Indeed, the only factors we are able to disentangle are the factors directly controllable by the agent; thus, we are not able to account for the ambient dynamics or other agents’ influence.
4 Related work
There is a large body of work on learning features in RL focusing on indirectly learning good internal representations. In Jaderberg et al. (2016), agents learn off-policy to control their pixel inputs, forcing them to learn features that help control the environment (at the pixel level). Oh et al. (2015) propose models that learn to predict the future, conditioned on action sequences, which push the agent to capture temporal features. Many more works go in this direction, such as (deep) successor feature representations (Dayan, 1993; Kulkarni et al., 2016) or the options framework (Sutton et al., 1999; Precup, 2000) when used in conjunction with neural networks (Bacon et al., 2016).
Our approach is similar in spirit to the Horde architecture (Sutton et al., 2011). In that scenario, agents learn policies that maximize specific inputs, whereas we learn policies that control simultaneously learned features of the input. The predictions for all these policies then become features for the agent. Our objective is defined specifically in the context of autoencoders but can be generalized to other representation-learning frameworks. Unlike recent work on the predictron (Silver et al., 2017), our approach is not focused on solving a planning task, and the goal is simply to learn how agents control their environment.
5 Conclusion and discussion: Scaling to general environments, controllability and the binding problem
We have introduced a novel type of clue aiming at learning representations which disentangle the underlying factors of variation. The main assumption is that some of those factors correspond to independently controllable aspects of the environment. This leads to training frameworks in which one learns jointly a set of exploratory policies and corresponding features of the learned representation which disentangle those controlled aspects. This is only a first step towards training agents which learn to control their environments at the same time as learning good representations of it.
We focused on the simpler setups in which the environment is made of a static set of objects. In this case, if the objective posited in §2.2 is learned correctly, we can assume that feature $k$ of the representation can unambiguously refer to some controllable property of some specific object in the environment. For example, the agent’s world might contain only a red circle and a green rectangle, which are only affected by the actions of the agent (they do not move on their own) and we only change the positions and colours of these objects from one trial to the next. Hence, a specific feature can learn to unambiguously refer to the position or the colour of one of these two objects.
In reality, environments are stochastic, and the set of objects in a given scene is drawn from some distribution. The number of objects may vary and their types may be different. It then becomes less obvious how feature $k$ could refer in a clear way to some feature of one of the objects in a particular scene. If we have instances of objects of different types, some addressing or naming scheme is required to refer to the particular objects (instances) present in the scene, so as to match the policy with a particular attribute of a particular object to selectively modify. While our proposed distributed alternative (§2.3) is an attempt to address this, a fundamental representational problem remains.
This is connected to the binding problem in neurocognitive science: how to represent a set of objects, each having different attributes, so that we do not confuse, for example, the set {red circle, blue square} with {red square, blue circle}. The binding problem has received some attention in the representation learning literature (Minin et al., 2012; Greff et al., 2016), but still remains mostly unsolved. Jointly considering this problem and learning controllable features may prove fruitful.
These ideas may also lead to interesting ways of performing exploration. The RL exploration process could be driven by a notion of controllability, predicting the interestingness of objects in a scene and choosing features and associated policies with which to attempt controlling them – such ideas have only been briefly explored in the literature (e.g. Ratitch and Precup (2003)). How do humans choose with which object to play? We are attracted to objects for which we do not yet know if and how we can control them, and such a process may be critical to learn how the world works.
References
 Bacon et al. (2016) Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. arXiv preprint arXiv:1609.05140, 2016.
 Bengio (2009) Yoshua Bengio. Learning deep architectures for AI. Now Publishers, 2009.
 Silver et al. (2017) David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, and Thomas Degris. The Predictron: End-To-End Learning and Planning. arXiv preprint arXiv:1612.08810, 2017.
 Dayan (1993) Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
 Dayan et al. (1995) Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural computation, 7(5):889–904, 1995.
 Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Nonlinear Independent Components Estimation. arXiv:1410.8516, ICLR 2015 workshop, 2014.
 Glynn (1987) Peter W Glynn. Likelihood ratio gradient estimation: an overview. In Proceedings of the 19th conference on Winter simulation, pages 366–375. ACM, 1987.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In NIPS’2014, 2014.
 Gopnik and Wellman (2012) Alison Gopnik and Henry M. Wellman. Reconstructing constructivism: Causal models, Bayesian learning mechanisms and the theory theory. Psychological Bulletin, 138(6):1085, 2012.
 Greff et al. (2016) Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and Juergen Schmidhuber. Tagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing Systems, pages 4484–4492, 2016.
 Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

 Hyvarinen and Morioka (2016) Aapo Hyvarinen and Hiroshi Morioka. Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA. In NIPS, 2016.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling (2014) Durk P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
 Kulkarni et al. (2016) Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.

 LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 Minin et al. (2012) Alexey Minin, Alois Knoll, Hans-Georg Zimmermann, AG Siemens, and LLC Siemens. Complex Valued Artificial Recurrent Neural Network as a Novel Approach to Model the Perceptual Binding Problem. In ESANN. Citeseer, 2012.
 Oh et al. (2015) Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
 Precup (2000) Doina Precup. Temporal abstraction in reinforcement learning. 2000.
 Ratitch and Precup (2003) Bohdana Ratitch and Doina Precup. Using MDP Characteristics to Guide Exploration in Reinforcement Learning. In ECML, pages 313–324, 2003.
 Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, and Rob Fergus. MazeBase: A sandbox for learning from games. arXiv preprint arXiv:1511.07401, 2015.
 Sutton et al. (2011) R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In AAMAS, 2011.
 Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

 Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML 2008, 2008.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.