1 Introduction
When solving reinforcement learning (RL) problems, what separates great results from random policies is often having the right feature representation. Even with function approximation, learning the right features can lead to faster convergence than blindly attempting to solve given problems (Jaderberg et al., 2016).
The idea that learning good representations is vital for solving most kinds of real-world problems is not new, both in the supervised learning literature (Bengio, 2009; Goodfellow et al., 2016) and in the RL literature (Dayan, 1993; Precup, 2000). An alternate idea is that these representations do not need to be learned explicitly, and that learning can be guided through internal mechanisms of reward, usually called intrinsic motivation (Barto et al.; Oudeyer and Kaplan, 2009; Salge et al., 2013; Gregor et al., 2017).
We build on a previously studied mechanism for representation learning (Thomas et al., 2017) that has close ties to intrinsic motivation mechanisms and causality. This mechanism explicitly links the agent's control over its environment to the representation of the environment that is learned by the agent. More specifically, this mechanism's hypothesis is that most of the underlying factors of variation in the environment can be controlled by the agent independently of one another.
We propose a general and easily computable objective for this mechanism that can be used in any RL algorithm that uses function approximation to learn a latent space. We show that our mechanism can push a model to learn to disentangle its input in a meaningful way and to represent factors which take multiple actions to change. We also show that these representations make it possible to perform model-based predictions in the learned latent space, rather than in a low-level input space (e.g. pixels).
2 Learning disentangled representations
The canonical deep learning framework to learn representations is the autoencoder framework (Hinton and Salakhutdinov, 2006). There, an encoder $f: X \to H$ and a decoder $g: H \to X$ are trained to minimize the reconstruction error, $\|x - g(f(x))\|_2^2$. $H$ is called the latent (or representation) space, and is usually constrained in order to push the autoencoder towards more desirable solutions. For example, imposing that $H \subseteq \mathbb{R}^K$ and $X \subseteq \mathbb{R}^N$ with $K \ll N$ pushes $f$ to learn to compress the input; there the bottleneck often forces $f$ to extract the principal factors of variation from $X$. However, this does not necessarily imply that the learned latent space disentangles the different factors of variation. Such a problem motivates the approach presented in this work.
Other authors have proposed mechanisms to disentangle underlying factors of variation. Many deep generative models, including variational autoencoders (Kingma and Welling, 2014), generative adversarial networks (Goodfellow et al., 2014) or nonlinear versions of ICA (Dinh et al., 2014; Hyvarinen and Morioka, 2016), attempt to disentangle the underlying factors of variation by assuming that their joint distribution (marginalizing out the observed $x$) factorizes, i.e., that they are marginally independent.
Here we explore another direction, trying to exploit the ability of a learning agent to act in the world in order to impose a further constraint on the representation. We hypothesize that interactions can be the key to learning how to disentangle the various causal factors of the stream of observations that an agent is faced with, and that such learning can be done in an unsupervised way.
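The bottleneck argument above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: linear maps stand in for the encoder and decoder, and the names `matvec`, `reconstruction_error`, `W_enc` and `W_dec` are hypothetical. With a 3-dimensional input and a 2-dimensional latent space, an input that lies in the subspace kept by the encoder is reconstructed exactly, while a component outside it is lost.

```python
# Minimal sketch of a bottleneck autoencoder's reconstruction error.
# Assumption: linear maps stand in for the (normally nonlinear) encoder and decoder.

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def reconstruction_error(x, W_enc, W_dec):
    """Squared L2 error between x and its reconstruction decode(encode(x))."""
    h = matvec(W_enc, x)        # latent code, dimension K
    x_hat = matvec(W_dec, h)    # reconstruction, dimension N
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

# N = 3 input dimensions compressed to K = 2 latent dimensions.
W_enc = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]       # keeps only the first two coordinates
W_dec = [[1.0, 0.0],
         [0.0, 1.0],
         [0.0, 0.0]]            # cannot restore the third coordinate

err_in_span = reconstruction_error([2.0, -1.0, 0.0], W_enc, W_dec)
err_off_span = reconstruction_error([2.0, -1.0, 3.0], W_enc, W_dec)
```

A low reconstruction error only says that the retained subspace captures most of the input; it says nothing about whether each latent coordinate corresponds to a separate factor of variation, which is the gap the rest of the paper addresses.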
3 The selectivity objective
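We consider the classical reinforcement learning setting but in the case where extrinsic rewards are not available. We introduce the notion of controllable factors of variation $\phi \in \mathbb{R}^K$, which are generated from a neural network $\Phi(h, z)$ with $z \sim \mathcal{N}(0, 1)$, where $h = f(s)$ is the current latent state. The factor $\phi$ represents an embedding of a policy $\pi_\phi$ whose goal is to realize the variation $\phi$ in the environment.
To discover meaningful factors of variation $\phi$ and their associated policies $\pi_\phi$, we consider the following general quantity $S$, which we refer to as selectivity and which is used as a reward signal for $\pi_\phi$:
$$S(h, \phi) = \mathbb{E}\left[\log \frac{A(h', h, \phi)}{\mathbb{E}_{p(\varphi \mid h)}\left[A(h', h, \varphi)\right]} \;\middle|\; s' \sim \mathcal{P}^{\pi_\phi}_{ss'}\right] \qquad (1)$$
Here $h = f(s)$ is the encoded initial state before executing $\pi_\phi$ and $h' = f(s')$ is the encoded terminal state. $\phi$ and $\varphi$ both denote factors of variation: $\phi$ is the factor that was followed, while $\varphi$ ranges over the factors that could have been sampled instead. $A(h', h, \phi)$ should be understood as a score describing how close $\phi$ is to the variation it caused in $(h', h)$. For example, in the experiments of section 4.1, we choose $A$ to be a Gaussian kernel between $h' - h$ and $\phi$, while in the experiments of section 4.2, we choose $A(h', h, \phi) = \max\{0, \langle h' - h, \phi \rangle\}$. The intuition behind these objectives is that, in expectation, a factor $\phi$ should be close to the variation $(h', h)$ it caused when following $\pi_\phi$, compared to other factors $\varphi$ that could have been sampled and followed, thus encouraging independence within the factors.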
Conditioned on a scene representation $h$, a distribution of policies is feasible. Samples from this distribution represent ways to modify the scene and thus may trigger an internal selectivity reward signal. For instance, $h$ might represent a room with objects such as a light switch. $\phi$ can be thought of as the distributed representation for the "name" of an underlying factor, to which is associated a policy and a value. In this setting, the light in a room could be a factor that could be either on or off. It could be associated with a policy to turn it on, and a binary value referring to its state, called an attribute or a feature value.
We wish to jointly learn the policy $\pi_\phi(\cdot \mid s)$ that modifies the scene, so as to control the corresponding value of the attribute in the scene, whose variation is computed by a scoring function $A(h', h, \phi) \in \mathbb{R}$. In order to get a distribution of such embeddings, we compute $\phi$ as a function of $h$ and some random noise $z$.
The goal of a selectivity-maximizing model is to find the density of factors $p(\phi \mid h)$, the latent representation $h$, as well as the policies $\pi_\phi$ that maximize $\mathbb{E}_{p(\phi \mid h)}\left[S(h, \phi)\right]$.
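To make the selectivity reward concrete, here is a minimal sketch of a single-sample estimate for one observed transition. It is not the authors' implementation. It assumes a Gaussian-kernel score, a hypothetical bandwidth `sigma=1.0`, and that the normalizing expectation over factors is approximated by an average over a small set of sampled factors. The names `gaussian_score` and `selectivity_sample` are illustrative.

```python
import math

def gaussian_score(h_next, h, phi, sigma=1.0):
    """Gaussian kernel between the latent change h' - h and the factor phi.
    The bandwidth sigma is an assumed value, not one taken from the paper."""
    sq = sum((hn - hc - p) ** 2 for hn, hc, p in zip(h_next, h, phi))
    return math.exp(-sq / (2.0 * sigma ** 2))

def selectivity_sample(h_next, h, phi, sampled_phis, score=gaussian_score):
    """Estimate log(score(phi) / mean score over sampled factors) for one transition."""
    numerator = score(h_next, h, phi)
    denominator = sum(score(h_next, h, p) for p in sampled_phis) / len(sampled_phis)
    return math.log(numerator / denominator)

# Toy transition: the latent state moved by +1 along the first coordinate.
h = [0.0, 0.0]
h_next = [1.0, 0.0]
phis = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

# Reward each candidate factor would receive had it been the one followed.
rewards = [selectivity_sample(h_next, h, p, phis) for p in phis]
```

The factor that matches the observed change receives the largest (positive) reward, while factors pointing elsewhere receive lower rewards. This relative normalization is what discourages two factors from explaining the same variation.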
3.1 Link with mutual information and causality
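The selectivity objective, while intuitive, can also be related to information-theoretic quantities defined in the latent space. From (Donsker and Varadhan, 1975; Ruderman et al., 2012) we have $D_{KL}(p \,\|\, q) = \sup_{A \in L^\infty(q)} \mathbb{E}_p[\log A] - \log \mathbb{E}_q[A]$. Applying this equality to the mutual information $I_p(\phi, h' \mid h) = \mathbb{E}_{p(h' \mid h)}\left[D_{KL}\big(p(\phi \mid h', h) \,\|\, p(\phi \mid h)\big)\right]$ gives
$$I_p(\phi, h' \mid h) \geq \sup_\theta \mathbb{E}_{p(\phi \mid h)}\left[S(h, \phi)\right]$$
where $\theta$ is the set of weights shared by the factor generator, the policy network and the encoder.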
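The Donsker-Varadhan representation behind this bound can be checked numerically on a small discrete example. The sketch below is illustrative only: the distributions `p` and `q` and the test functions are arbitrary choices. It verifies that the variational objective never exceeds the KL divergence and is tight when the test function equals the density ratio.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_objective(p, q, A):
    """Donsker-Varadhan objective E_p[log A] - log E_q[A] for a positive function A."""
    e_p_log_a = sum(pi * math.log(ai) for pi, ai in zip(p, A))
    log_e_q_a = math.log(sum(qi * ai for qi, ai in zip(q, A)))
    return e_p_log_a - log_e_q_a

p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]

optimal_A = [pi / qi for pi, qi in zip(p, q)]   # density ratio attains the supremum
arbitrary_A = [1.0, 2.0, 0.5]                   # any other positive function gives a lower bound

kl_value = kl(p, q)
tight = dv_objective(p, q, optimal_A)
loose = dv_objective(p, q, arbitrary_A)
```

In the selectivity setting, the score function plays the role of the test function, so maximizing selectivity over a parametric family tightens a lower bound rather than computing the mutual information exactly.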
4 Experiments
We use MazeBase (Sukhbaatar et al., 2015) to assess the performance of our approach. We do not aim to solve the game. In this setting, the agent (a red circle) can move in a small environment ( pixels) and perform the actions down, left, right, up. The agent can go anywhere except on the orange blocks.
4.1 Learned representations


Figure 1: (a) Sampled variations $h' - h$ and their kernel density estimate, encountered when sampling random controllable factors $\phi$. We observe that our algorithm separates these representations into main modes, each corresponding to the action that was actually taken by the agent (pink and white for up, light blue for down+left, green for right, purple and black for down, and night blue for left). (b) The disentangled structure in the latent space. The $x$ and $y$ axes are disentangled such that we can recover the $x$ and $y$ position of the agent in any observation simply by looking at its latent encoding $h = f(s)$. The missing point on this grid is the only position the agent cannot reach, as it lies on an orange block.
After jointly training the reconstruction and selectivity losses, our algorithm disentangles four directed factors of variation, as seen in Figure 1: the $\pm x$ position and $\pm y$ position of the agent. For visualization purposes we chose the bottleneck of the autoencoder to be of size 2. To complicate the disentanglement task, we added the redundant action up as well as the action down+left in this experiment.
The disentanglement appears clearly, as the latent features corresponding to the $x$ and $y$ position are orthogonal in the latent space. Moreover, we notice that our algorithm assigns both up actions (white and pink dots in Figure 1.a) to the same feature. It also does not create a significant mode for the feature corresponding to the action down+left (light blue dots in Figure 1.a), as this feature is already explained by the features down and left.
4.2 Multi-step embedding of policies
In this experiment, the factors $\phi$ are embeddings of multi-step policies $\pi_\phi$. We add a model-based loss defined only in the latent space, and jointly train a decoder alongside the encoder. Notice that we never train our model-based cost at the pixel level. While we currently suffer from mode collapse of some factors of variation, we show that we are able to make predictions in latent space and reconstruct the latent prediction with the decoder, and that our factor space disentangles several types of variation.
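A model-based loss defined only in the latent space can be sketched as follows. This is a hedged illustration, not the model used in the experiments. The transition model here is a hypothetical lookup table of latent displacements (`DISPLACEMENT`, `toy_transition`), standing in for a learned network, and the loss is assumed to be a squared error between the predicted and the encoded next latent state.

```python
def latent_model_loss(transition, h, action, h_next):
    """Squared error between the predicted next latent state and the encoded one.
    No pixels are involved: both prediction and target live in latent space."""
    h_pred = transition(h, action)
    return sum((p - t) ** 2 for p, t in zip(h_pred, h_next))

# Hypothetical stand-in for a learned transition model: each action shifts
# a 2-D latent state by a fixed displacement.
DISPLACEMENT = {
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
}

def toy_transition(h, action):
    dx, dy = DISPLACEMENT[action]
    return [h[0] + dx, h[1] + dy]

loss_match = latent_model_loss(toy_transition, [0.0, 0.0], "right", [1.0, 0.0])
loss_mismatch = latent_model_loss(toy_transition, [0.0, 0.0], "right", [0.0, 1.0])
```

Because the target is the encoder's own output, such a loss is cheap to evaluate; a decoder is only needed when a latent prediction has to be rendered back to an observation.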


5 Conclusion, success and limitations
Pushing representations to model independently controllable features currently yields some encouraging success. Visualizing our features clearly shows the different controllable aspects of simple environments, yet, our learning algorithm is unstable. What seems to be the strength of our approach could also be its weakness, as the independence prior forces a very strict separation of concerns in the learned representation, and should maybe be relaxed.
Some sources of instability also seem to slow our progress: learning a conditional distribution on controllable aspects that often collapses to fewer modes than desired, learning stochastic policies that often optimistically converge to a single action, tuning many hyperparameters due to the multiple parts of our model. Nonetheless, we are hopeful in the steps that we are now taking. Disentangling happens, but understanding our optimization process as well as our current objective function will be key to further progress.
References
 (1) Andrew G Barto, Satinder Singh, and Nuttapong Chentanez. Intrinsically motivated learning of hierarchical collections of skills.
 Bengio (2009) Yoshua Bengio. Learning deep architectures for AI. Now Publishers, 2009.
 Dayan (1993) Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
 Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Nonlinear Independent Components Estimation. arXiv:1410.8516, ICLR 2015 workshop, 2014.
 Donsker and Varadhan (1975) Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov process expectations for large time, i. Communications on Pure and Applied Mathematics, 28(1):1–47, 1975.
 Florensa et al. (2017) Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012, 2017.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In NIPS'2014, 2014.
 Gregor et al. (2017) K. Gregor, D. Jimenez Rezende, and D. Wierstra. Variational Intrinsic Control. In Proceedings of the International Conference on Learning Representations (ICLR), November 2017.
 Hinton and Salakhutdinov (2006) Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 Hyvarinen and Morioka (2016) Aapo Hyvarinen and Hiroshi Morioka. Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA. In NIPS, 2016.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 Jaderberg et al. (2016) Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Kingma and Welling (2014) Durk P. Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
 LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 Massey (1990) James Massey. Causality, feedback and directed information. In Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), pages 303–305, 1990.
 Oudeyer and Kaplan (2009) Pierre-Yves Oudeyer and Frederic Kaplan. What is intrinsic motivation? A typology of computational approaches. Frontiers in Neurorobotics, 1:6, 2009.
 Precup (2000) Doina Precup. Temporal abstraction in reinforcement learning. 2000.
 Ruderman et al. (2012) Avraham Ruderman, Mark Reid, Darío García-García, and James Petterson. Tighter variational representations of f-divergences via restriction to probability measures. arXiv preprint arXiv:1206.4664, 2012.
 Salge et al. (2013) Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment – an introduction. CoRR, abs/1310.1863, 2013. URL http://arxiv.org/abs/1310.1863.
 Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Gabriel Synnaeve, Soumith Chintala, and Rob Fergus. MazeBase: A sandbox for learning from games. arXiv preprint arXiv:1511.07401, 2015.
 Thomas et al. (2017) Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable factors. CoRR, abs/1708.01289, 2017. URL http://arxiv.org/abs/1708.01289.
 Ziebart (2010) Brian D Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Carnegie Mellon University, 2010.
Appendix A Additional details
a.1 Architecture
Our architecture is as follows: the encoder, mapping the raw pixel state to a latent representation, is a 4-layer convolutional neural network with batch normalization [Ioffe and Szegedy, 2015] and leaky ReLU activations. The decoder uses the transposed architecture with ReLU activations. The noise $z$ is sampled from a 2-dimensional Gaussian distribution, and both the generator $\Phi(h, z)$ and the policy $\pi(h, \phi)$ are neural networks consisting of 2 fully-connected layers. In practice, a minibatch of $\phi$ vectors is sampled at each step. The agent randomly chooses one $\phi$ and samples actions from its policy $\pi_\phi$. Our model parameters are then updated using policy gradient with the REINFORCE estimator, a state-dependent baseline and importance sampling. For each selectivity reward, the normalizing expectation of the score over factors is estimated by its empirical average over the sampled minibatch.
In practice, we don't use concatenation of vectors when feeding two vectors as input for a network (like $(h, z)$ for the factor generator or $(h, \phi)$ for the policy). Instead, we combine the two vectors with a bilinear operation as in Florensa et al. [2017]. We observe that the bilinear integrated input more strongly enforces dependence on both vectors; in contrast, our models often ignored one input when using a simple concatenation.
Through our research, we experiment with different outputs for our generator . We explored embedding the vectors into a hypercube, a hypersphere, a simplex and also a simplex multiplied by the output of a operation on a scalar.
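The difference between concatenation and a bilinear input can be sketched as below. This assumes that the bilinear operation is a flattened outer product of the two input vectors, which is one common reading of bilinear integration. The exact operation used in the experiments is not recoverable from the text, so the functions `concat` and `bilinear` are illustrative only.

```python
def concat(a, b):
    """Plain concatenation: the network sees a and b side by side."""
    return list(a) + list(b)

def bilinear(a, b):
    """Flattened outer product (assumed form of the bilinear operation):
    every feature of a is multiplied by every feature of b."""
    return [ai * bj for ai in a for bj in b]

h = [0.5, -1.0, 2.0]     # toy latent state
phi = [1.0, 0.0]         # toy factor embedding

concat_features = concat(h, phi)       # length 3 + 2 = 5
bilinear_features = bilinear(h, phi)   # length 3 * 2 = 6

# With a zero second input, concatenation still exposes h unchanged,
# whereas the bilinear features vanish entirely.
concat_zero = concat(h, [0.0, 0.0])
bilinear_zero = bilinear(h, [0.0, 0.0])
```

Under this form no output feature depends on one input alone, which is consistent with the observation that the bilinear input is harder for a network to partially ignore than a concatenation.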
a.2 First experiment
In the first experiment (Figure 1), we used a Gaussian similarity kernel, i.e. $A(h', h, \phi) = \exp\left(-\frac{\|h' - h - \phi\|_2^2}{2\sigma^2}\right)$ with bandwidth $\sigma$. In this experiment only, for clarity of the figure, we only allowed permissible actions in the environment (no no-op action).
Appendix B Additional Figures
b.1 Discrete simple case
Here we consider the case where we learn a latent space of size $K$, with $K$ factors corresponding to the coordinates of $h$, and learn $K$ separately parameterized policies $\pi_k$. We train our model with the selectivity objective, but no autoencoder loss, and find that we correctly recover independently controllable features on a simple environment. Although this is slower than jointly training an autoencoder, it shows that the objective we propose is strong enough to provide a learning signal for discovering a disentangled latent representation.
We train such a model on a gridworld MNIST environment containing two MNIST digits. Each digit can be moved on the grid via 4 directional actions (so there are 8 actions in total); the first digit is always odd and the second digit always even, so they are distinguishable. In Figure 4 we plot each latent feature as a curve, as a function of each ground truth. For example, we see that the black feature recovers $x_1$, the horizontal position of the first digit, and that the purple feature recovers $y_2$, the vertical position of the second digit.
b.2 Planning and policy inference example in 1-step
This disentangled structure could be used to address many challenging issues in reinforcement learning. We give two examples in figure 5:

Model-based predictions: Given an initial state, $s_0$, and an action sequence $a_{0:T-1}$, we want to predict the resulting state $s_T$.

A simplified deterministic policy inference problem: Given an initial state $s_0$ and a terminal state $s_T$, we aim to find a suitable action sequence $a_{0:T-1}$ such that $s_T$ can be reached from $s_0$ by following it.
Because of the activation on the last layer of $\Phi$, the different factors of variation are placed on the vertices of a hypercube of dimension $K$, and we can think of the policy inference problem as finding a path in that simpler space, where the starting point is $h_0 = f(s_0)$ and the goal is $h_T = f(s_T)$. We believe this could prove to be a much easier problem to solve.
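As a toy illustration of treating policy inference as path finding, the sketch below runs a breadth-first search over latent states. It makes a strong simplifying assumption that is not stated in the text: following a factor shifts the latent state additively by that factor. The factors are taken to be the vertices of a 2-dimensional hypercube, and `plan` is a hypothetical helper.

```python
from collections import deque
from itertools import product

def plan(start, goal, factors, max_depth=6):
    """Breadth-first search for a shortest sequence of factors leading from
    start to goal, assuming each factor shifts the latent state additively."""
    start, goal = tuple(start), tuple(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        h, path = queue.popleft()
        if h == goal:
            return path
        if len(path) == max_depth:
            continue  # do not expand beyond the depth budget
        for phi in factors:
            nxt = tuple(a + b for a, b in zip(h, phi))
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [phi]))
    return None  # goal not reachable within max_depth steps

# Factors on the vertices of a 2-D hypercube: (-1,-1), (-1,1), (1,-1), (1,1).
factors = list(product((-1, 1), repeat=2))
path = plan((0, 0), (2, 0), factors)
```

The search space here has only as many branches as there are factors, regardless of the size of the observation space, which is the sense in which planning over factors could be easier than planning over raw actions and pixels.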


b.3 Multistep Example
We demonstrate an instance of ICF operating in a 4×4 MazeBase environment over five time steps in Figure 6. We consistently observe mode collapse in our generator, which therefore only produces a subset of all possible variations. In Figure 6, the $\phi$ governing the agent's policy appears to correspond to moving two positions down and then repeatedly toggling the switch. A random action due to $\epsilon$-greedy exploration led to the agent moving up and off the switch at time step 4. This perturbation is corrected by the policy, which moves down in order to return to toggling the relevant switch.


Figure 6: Each box represents the probability distribution of the policies at that time step. Each row is generated by a different $\phi$ and each column corresponds to an action (up, left, pass, right, toggle, down) in order. The symbols below each box represent the most probable action for the behavioral policy, where the grey circle indicates toggling the switch.
Appendix C Variational bound and the selectivity
Let us call $p(h' \mid h, \phi)$ the probability distribution over final hidden states $h'$ starting from $h$ and using the policy parametrized by the embedding $\phi$.
. where is the transition probability of the environment.
For simplicity, let’s refer to as , as and as .
c.1 Lower bound on the mutual information
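The bound
$$I_p(\phi, h' \mid h) \geq \sup_\theta \mathbb{E}_{p(\phi \mid h)}\left[S(h, \phi)\right],$$
where $S$ is the selectivity and $A$ its scoring function, can be proven by using the Donsker-Varadhan variational representation of the KL divergence [Donsker and Varadhan, 1975, Ruderman et al., 2012]:
$$D_{KL}(p \,\|\, q) = \sup_{A \in L^\infty(q)} \mathbb{E}_p[\log A] - \log \mathbb{E}_q[A]$$
Using the identity $I_p(\phi, h' \mid h) = \mathbb{E}_{p(h' \mid h)}\left[D_{KL}\big(p(\phi \mid h', h) \,\|\, p(\phi \mid h)\big)\right]$ and applying the representation above with $p = p(\phi \mid h', h)$ and $q = p(\phi \mid h)$, we have:
$$\begin{aligned}
I_p(\phi, h' \mid h) &= \mathbb{E}_{p(h' \mid h)}\left[\sup_{A} \; \mathbb{E}_{p(\phi \mid h', h)}\left[\log A(h', h, \phi)\right] - \log \mathbb{E}_{p(\varphi \mid h)}\left[A(h', h, \varphi)\right]\right] \\
&\geq \sup_{A} \; \mathbb{E}_{p(h', \phi \mid h)}\left[\log \frac{A(h', h, \phi)}{\mathbb{E}_{p(\varphi \mid h)}\left[A(h', h, \varphi)\right]}\right] \\
&= \sup_{A} \; \mathbb{E}_{p(\phi \mid h)}\left[S(h, \phi)\right] \\
&\geq \sup_{\theta} \; \mathbb{E}_{p(\phi \mid h)}\left[S(h, \phi)\right]
\end{aligned}$$
where the first inequality holds because a single $A$ is shared across all $h'$, and the last inequality holds because the supremum is restricted to parametric functions.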
Appendix D Additional information on the training
In our experiments, we use the selectivity objective, an autoencoding loss and an entropy regularization loss for each of the policies $\pi_\phi$. Furthermore, in experiment 4.2 we added the model-based cost with a learned two-layer fully-connected neural network.
The selectivity is used to update the parameters of the encoder, factor generator and policy networks. We use the following equation for computing the gradients
We also use a state-dependent baseline as a control variate to reduce the variance of the REINFORCE estimator.
Furthermore, to be able to train the factor generator efficiently, we train all $\phi$ sampled in a minibatch by importance sampling on the probability ratio of the trajectory under each $\phi$.
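The role of the baseline as a control variate can be illustrated with a single-sample REINFORCE gradient for a softmax policy. This is a generic textbook sketch under stated assumptions, not the training code of the experiments: the policy is parameterized directly by its logits, the reward stands in for the selectivity, and the baseline value is supplied by hand.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(logits, action, reward, baseline):
    """Single-sample REINFORCE gradient with respect to the logits:
    (reward - baseline) * d/d logits of log pi(action)."""
    probs = softmax(logits)
    advantage = reward - baseline
    # d/d logit_k of log softmax(logits)[action] = 1[k == action] - probs[k]
    return [advantage * ((1.0 if k == action else 0.0) - p)
            for k, p in enumerate(probs)]

# Uniform policy over 4 actions; the sampled action 2 earned more than the baseline.
grad = reinforce_grad([0.0, 0.0, 0.0, 0.0], action=2, reward=1.5, baseline=0.5)
```

Subtracting the baseline leaves the expected gradient unchanged, because the expected score function is zero, but it shrinks the magnitude of individual updates when the reward is close to what was expected in that state.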