1 Introduction
Reinforcement Learning methods have shown great success in learning complex behaviors, like robotic locomotion [Heess et al., 2017, Florensa et al., 2017a], manipulation of objects [Florensa et al., 2017b, Rajeswaran et al., 2018], or superhuman game strategies [Mnih et al., 2015, Silver et al., 2016]. These approaches hold the promise of end-to-end learning of policies that select the optimal action to apply at every state, leveraging only direct interaction with the environment and a reward signal indicating the desired behavior. Nevertheless, in continuous control many of these approaches still avoid learning from raw sensory input like camera images, or rely on a separately engineered vision module trained to provide some state information that is known to be relevant for the task [Andrychowicz et al., 2018]. This is a considerable obstacle to scaling up robotic learning and bringing it out of instrumented lab environments.
In this work we ask what a robot could learn solely through self-supervised interaction with the environment and its high-dimensional sensory input, without any further instrumentation or reward. We show that we can learn a goal-reaching value function [Schaul et al., 2015] as well as a corresponding policy that is conditioned on a previously collected observation, and is able to bring the system to a state where the observation closely matches the desired one. This policy is able to connect states even if it has never attempted to reach one from the other. It could later be used in a hierarchical fashion to perform a sequence of operations (like following a desired path, or an assembly instruction manual), or for more meaningful exploration in other downstream tasks [Hausman et al., 2018].
The underlying idea of our algorithm is to learn a state representation where the Euclidean distance between two embedded states corresponds to the minimum number of actions or timesteps needed to connect them. Such an embedding must satisfy a recursive formula that corresponds to the Bellman equation when the reward is the indicator function signaling a perfect match of the observation and the goal. This reward requires no additional instrumentation, but unfortunately the probability of observing it is negligible for any high-dimensional sensory input. To sidestep this issue we can relabel the trajectories as trying to recreate an observation that happened along the trajectory [Andrychowicz et al., 2017], therefore observing this reward as often as needed. This idea was originally studied with low-dimensional state representations, and required a suitably defined distance function to give rewards based on balls around the goals. Here, we show that both limitations can be lifted. Furthermore, we introduce a novel architecture for the state-action value function that ties this model-free Reinforcement Learning technique to model-based counterparts.
2 Background
In this section we formally introduce the Reinforcement Learning framework, its extension for goal-oriented tasks, and efficient model-free learning algorithms that leverage goal relabeling. We also review the basics of model-based learning as we will draw a connection between the two later on.
2.1 Reinforcement Learning framework
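We define a discrete-time finite-horizon Markov decision process (MDP) by a tuple $M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \rho_0, T)$, in which $\mathcal{S}$ is a state set, $\mathcal{A}$ an action set, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_{+}$ is a transition probability distribution, $r: \mathcal{S} \times \mathcal{A} \to [-R_{\max}, R_{\max}]$ is a bounded reward function, $\rho_0: \mathcal{S} \to \mathbb{R}_{+}$ is a start state distribution, and $T$ is the horizon. Our aim is to learn a stochastic policy $\pi_\theta: \mathcal{S} \times \mathcal{A} \to \mathbb{R}_{+}$ parametrized by $\theta$ that maximizes the expected return, $\eta_{\rho_0}(\pi_\theta) = \mathbb{E}_{s_0 \sim \rho_0} R(\pi_\theta, s_0)$. We denote by $R(\pi_\theta, s_0) := \mathbb{E}_{\tau | s_0} \left[ \sum_{t=0}^{T} r(s_t, a_t) \right]$ the expected cumulative reward when starting from a state $s_0 \sim \rho_0$, where $\tau = (s_0, a_0, \ldots, a_{T-1}, s_T)$ denotes a whole trajectory, with $a_t \sim \pi_\theta(a_t | s_t)$, and $s_{t+1} \sim \mathcal{P}(s_{t+1} | s_t, a_t)$.
2.2 Visual goal tasks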
In this work we assume that we have access to the state only through camera readings, or observations $o$. Observations are a function of the current state of the world $s$, usually higher dimensional, noisy, and prone to aliasing, like robotic sensory inputs. Technically observations might not be Markovian, but if this is critical for the task we can always set as state the collection of all previous observations (see Section 6 for further discussion). Our objective is to train a policy $\pi(a_t | o_t, o_g)$ that is conditioned on the current observation $o_t$ and a goal observation $o_g$, such that its sampled actions modify the world to obtain an observation that matches $o_g$. We employ ideas from Universal Value Function Approximators [Schaul et al., 2015], but for tasks in continuous action spaces and where the goals are in the same space as the observations.
2.3 Off-policy training with goal relabeling
When we execute a policy conditioned on a certain goal, the policy might fail to reach that goal and hence not receive any reward. If no reward is ever observed, no learning can take place. Fortunately, the goal only affects the reward, not the dynamics of the environment. Therefore, if the training is done with an off-policy RL algorithm like DDPG [Lillicrap et al., 2016], we can relabel the trajectories in hindsight to learn about goals other than the one we were originally trying to achieve [Andrychowicz et al., 2017]. In particular, if we use as goal a point along the trajectory, we are sure to observe a reward.
2.4 Model-based learning
A forward model $f$ is a function that predicts the next observation $o_{t+1}$ from the current observation $o_t$ and action $a_t$. A standard way to learn such a model is to fit a parameterized function $f_\theta$ by solving
$$\min_\theta \sum_{i} \sum_{t} \left\| f_\theta(o_t^i, a_t^i) - o_{t+1}^i \right\|^2 \tag{1}$$
where $(o_t^i, a_t^i, o_{t+1}^i)$ are observed transitions of different trajectories $\tau_i$. Once such a model has been learned, it could be used by search and planning algorithms to select actions that reach a given goal observation $o_g$. One approach would be Model Predictive Control (MPC) methods [Rao, 2009, Nagabandi et al., 2017], where a full sequence of actions is selected based on the solution to
$$\min_{a_t, \ldots, a_{t+H-1}} \left\| \hat{o}_{t+H} - o_g \right\|^2 \quad \text{s.t.} \quad \hat{o}_t = o_t, \;\; \hat{o}_{k+1} = f_\theta(\hat{o}_k, a_k) \tag{2}$$
Despite considerable progress in the area [Finn et al., 2016, Lee et al., 2018], it is still challenging to learn a visual model useful for long-horizon planning, especially due to the intractability of solving equation 2 exactly, the model bias, and compounding errors in higher dimensions.
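As a rough illustration of the planning step described above, the following sketch uses random shooting, which is only one of several possible MPC solvers and not necessarily the one used in the cited works. The toy `forward_model`, the horizon, the number of candidates, and the use of 2D points instead of images are all assumptions made for this example.

```python
import math
import random


def forward_model(obs, action):
    """Hypothetical learned model: toy additive dynamics on 2D points."""
    return (obs[0] + action[0], obs[1] + action[1])


def rollout(obs, actions):
    """Apply the model recursively and return the predicted final observation."""
    for action in actions:
        obs = forward_model(obs, action)
    return obs


def plan_random_shooting(obs, goal, horizon=5, n_candidates=200, seed=0):
    """Return the sampled action sequence whose predicted end point is closest to the goal."""
    rng = random.Random(seed)
    # The all-zeros sequence is a candidate, so the plan is never worse than doing nothing.
    best_actions = [(0.0, 0.0)] * horizon
    best_cost = math.dist(rollout(obs, best_actions), goal)
    for _ in range(n_candidates):
        actions = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        cost = math.dist(rollout(obs, actions), goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


start, goal = (0.0, 0.0), (2.0, -1.0)
actions, cost = plan_random_shooting(start, goal)
print(f"predicted final distance to goal: {cost:.3f}")
```

In a typical MPC loop only the first action of the returned sequence is executed before replanning from the next observation. Because each candidate is evaluated by rolling the model forward over the full horizon, any one-step prediction error compounds along the rollout.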
3 Related work
One of the closest recent works is Temporal Difference Models [Pong et al., 2018], where the authors also try to link model-based and model-free RL through the state-action value function of trying to reach all states. Nevertheless, instead of directly using the minimum number of timesteps to reach a state as we do, they fit the distance at which they will be from the intended goal after $\tau$ timesteps (with $\tau$ also given as input to the Q function). Therefore they need to define a distance metric, which is exactly what we try to avoid in our work. Furthermore, this limitation makes them work only from state space (i.e. positions), where the Euclidean distance is more informative than in image space. Finally, although they sketch a similar connection between MB and MF, theirs only holds for $\gamma = 0$, which is not the discount used in practice.
Several works learn a representation of the image observations with some autoencoding technique [Lange and Riedmiller, 2010]. For example, in [Finn et al., 2015] an autoencoder with spatial softmax is learned on data from trajectories that already succeed at a task 10% of the time, and then its features are used as the state in a GPS algorithm [Levine and Abbeel, 2014] to solve that specific task. On the other hand, Visual RL from Imagined Goals [Nair et al., 2018] has the same objective of trying to reach any reachable observation. They do so by training a VAE [Higgins et al., 2016] and directly using a distance in that learned space. They obtain interesting results, despite their encoding reflecting neither the dynamics of the environment nor which states are actually close to each other in terms of the minimum number of actions needed to join them.
Other recent methods have proposed to learn image embedding spaces where the planning problem could be easier [Kurutach et al., 2018, Zhang et al., 2018]. All these methods require more complex procedures to learn the embedding than just replacing $o$ by $\phi(o)$ in equation 1. Only performing this change would just lead to collapse of the embedding space ($\phi \equiv 0$, $f \equiv 0$ trivially satisfies the equation).
Successor representations [Dayan, 1993] or features [Barreto et al., 2018, 2016] constitute another related body of work. They also tackle the problem of using the collected transitions for learning policies that can perform well under many rewards. Unfortunately such rewards need to be expressed as a linear combination of some features, and the collection of all indicator rewards cannot be expressed with a finite number of such features. See Appendix C for more details.
4 Method
In an MDP setting, two states are considered “nearby” if a small number of actions are required to reach one from the other. Unfortunately, a simple distance between states, like the Euclidean, might be uninformative about how nearby two states are. Most prior work in Reinforcement Learning assumes access to a carefully defined state representation such that a reward based on a distance in that space provides enough shaping to learn the desired policy. The problem is considerably harder when we only have access to high-dimensional observations, like images.
Our aim is to learn an embedding space where the distance between embedded observations is representative of (e.g. proportional to) the minimum number of timesteps needed to reach one observation from any other. The main idea is to leverage all collected trajectories to learn such an embedding and, as a byproduct, obtain a policy able to reach any past observation upon request. In this section we first recast this as an RL problem, and explain a goal relabeling strategy to solve it. Then we introduce a novel structure for the Q function that can be seen as bridging the gap between model-based and model-free RL.
4.1 Minimum time to observation as RL
Knowing the minimum number of timesteps needed to reach a desired observation $o_g$ as a function of the current state and action is arguably sufficient to perform goal-reaching tasks. A discount-based equivalent of this function can be defined by the recursive equation
$$Q(o_t, a_t, o_g) = \mathbb{1}[o_{t+1} = o_g] + \gamma \left(1 - \mathbb{1}[o_{t+1} = o_g]\right) \max_{a} Q(o_{t+1}, a, o_g) \tag{3}$$
If the environment is deterministic and the transition $(o_t, a_t) \to o_g$ exists, then $Q(o_t, a_t, o_g) = 1$. Equivalently, if at least $n$ steps are needed to reach a certain state $o_g$ from $o_t$, then we have $Q(o_t, a_t, o_g) = \gamma^{n-1}$ when $a_t$ is the first action of a shortest path. Note that the equation above is exactly the Bellman equation defining the optimal $Q$ function [Sutton and Barto, 1998] for the reward $r(o_t, a_t, o_g) = \mathbb{1}[o_{t+1} = o_g]$, with the episode ending once the goal is matched. In continuous observation spaces (or discrete but high-dimensional ones) this reward is not practical, since even a near-optimal policy might never reach exactly the observation that it is trying to achieve. Therefore, to train this value function we use the relabeling strategy outlined in Section 2.3. In the next section we describe the specific algorithm we use.
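To make the relation between the discounted value and the step count concrete, here is a small tabular check on a deterministic chain. It assumes that the reward is 1 on the transition that reaches the goal and that the episode ends there; the chain environment and the names `step` and `optimal_q` are hypothetical and only serve this illustration.

```python
import math

GAMMA = 0.9
N_STATES = 6
ACTIONS = (-1, +1)
GOAL = 5


def step(state, action):
    """Deterministic chain dynamics with walls at both ends."""
    return min(max(state + action, 0), N_STATES - 1)


def optimal_q(goal, gamma=GAMMA, sweeps=50):
    """Iterate the Bellman backup for the indicator reward until it has converged."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        new_q = {}
        for (s, a) in q:
            nxt = step(s, a)
            if nxt == goal:
                # Reward 1 and termination: nothing is bootstrapped past the goal.
                new_q[(s, a)] = 1.0
            else:
                new_q[(s, a)] = gamma * max(q[(nxt, b)] for b in ACTIONS)
        q = new_q
    return q


q = optimal_q(GOAL)
for s in range(GOAL):
    n = GOAL - s  # minimum number of steps from s to the goal
    print(s, n, round(q[(s, +1)], 4), round(GAMMA ** (n - 1), 4))
```

Under these assumptions the value of moving toward the goal from a state that is $n$ steps away equals $\gamma^{n-1}$, so the step count can be read off as $n = 1 + \log Q / \log \gamma$.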
4.2 Goal relabeling and Q fitting
The ideas outlined in the previous section are not tied to any particular algorithm choice for learning $Q$. For our experiments we use a goal-conditioned variant of MPO [Abdolmaleki et al., 2018]. This algorithm combines Q fitting done with Retrace [Munos et al., 2016], which propagates the discounted returns faster, with a policy improvement step. See Appendix A for details.
The learning loop of our full algorithm is described in Algorithm 1. Note that there are three sampling procedures involved. First, we obtain a trajectory from the replay buffer. Then we decide whether to relabel the trajectory, with probability $p$ (we use 0.5, but we have seen that, surprisingly, any value between 0.2 and 0.8 yields similar results). Finally, if we are relabeling the trajectory, we pick a timestep within a fixed window in the future and use the observation at that time as the goal (also replacing the reward at that time).
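The relabeling part of this loop can be sketched as follows. This is a minimal illustration and not the authors' implementation: the dictionary layout of a transition, the function name `maybe_relabel`, and the `max_offset` window parameter are assumptions, and rewards at the other timesteps are simply left untouched.

```python
import random


def maybe_relabel(trajectory, p_relabel=0.5, max_offset=None, rng=random):
    """Relabel a sampled trajectory with a goal taken from its own future.

    Each transition is a dict with keys 'obs', 'action', 'next_obs',
    'goal' and 'reward' (a hypothetical layout chosen for this sketch).
    """
    if rng.random() >= p_relabel:
        return trajectory  # keep the goal the policy was originally conditioned on
    last = len(trajectory) - 1
    if max_offset is not None:
        last = min(last, max_offset)
    k = rng.randint(0, last)  # timestep whose outcome becomes the new goal
    goal = trajectory[k]["next_obs"]
    # Copy every transition with the new goal; the input trajectory is not modified.
    relabeled = [dict(transition, goal=goal) for transition in trajectory]
    relabeled[k]["reward"] = 1.0  # the indicator reward fires exactly at step k
    return relabeled


traj = [
    {"obs": i, "action": 0, "next_obs": i + 1, "goal": 99, "reward": 0.0}
    for i in range(5)
]
relabeled = maybe_relabel(traj, p_relabel=1.0, rng=random.Random(0))
print(relabeled[0]["goal"], [t["reward"] for t in relabeled])
```

Because the goal only enters through the reward, the stored observations and actions stay valid after relabeling, which is what makes this compatible with off-policy learning.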
The data collection loop executes the most recent policy by conditioning it on any previously observed state as goal. In the next subsection we motivate a new architecture for the Q function, and outline a connection with model learning.
4.3 Q structure and model learning
A general architecture for the $Q$ function would be to have two independent feature extractors, one for the current observation and another for the goal observation, followed by a Multi-Layer Perceptron (MLP) acting on the concatenated representations. Such an architecture is depicted in Fig. 0(a). Is there a more sensible choice of structure for $Q$? First of all, note that this is a universal value function [Schaul et al., 2015] that depends on a goal, and in this case the goal also belongs to the observation space. Therefore it is reasonable to apply the same processing to both inputs. Furthermore, the action has no effect on the goal; it only affects the current state. Finally, the Q function should be positive everywhere and evolve exponentially in the minimum number of timesteps needed to reach a state. This suggests the following architecture:
$$Q(o_t, a_t, o_g) = \exp\left(-\left\| f(\phi(o_t), a_t) - \phi(o_g) \right\|_2\right) \tag{4}$$
where $\phi$ and $f$ are parameterized functions. A scheme of our proposed architecture can be found in Fig. 0(b).
Given the observations from the previous subsections, we realize that using this architecture to optimize equation 3 enforces:
$$f(\phi(o_t), a_t) = \phi(o_{t+1}) \tag{5}$$
Therefore $f$ can be understood as a model in embedding space. Critically, this is not the only equation being fitted, otherwise the embedding would tend to collapse (equation 5 is trivially satisfied by $\phi \equiv 0$, $f \equiv 0$). Indeed $Q$ is also trained to satisfy equation 3, which enforces $Q(o_t, a_t, o_g) = \gamma^{n-1}$ when $n$ steps are the minimum needed to reach $o_g$ from $o_t$. Assuming that equation 5 holds true, we are imposing that the embedding satisfies $\left\| \phi(o_{t+1}) - \phi(o_g) \right\|_2 = -(n-1) \log \gamma$. Therefore this can be understood as a model enhanced with stronger planning capabilities, also giving an embedding where distances are proportional to shortest paths between points.
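The structure discussed in this subsection can be sketched in a few lines. The sketch assumes the form $Q = \exp(-\|f(\phi(o), a) - \phi(o_g)\|)$ and replaces the learned networks with fixed toy functions: `embed` stands in for the shared vision stack $\phi$ and `latent_model` for the embedding-space model $f$. Both, along with `implied_steps`, are hypothetical names introduced for illustration.

```python
import math


def embed(obs):
    """Hypothetical embedding phi: a fixed linear map standing in for the vision stack."""
    x, y = obs
    return (0.5 * x + 0.1 * y, -0.2 * x + 0.4 * y)


def latent_model(z, action):
    """Hypothetical model f in embedding space: predicts the next embedding."""
    return (z[0] + 0.05 * action[0], z[1] + 0.05 * action[1])


def q_structured(obs, action, goal_obs):
    """Q = exp(-||f(phi(o), a) - phi(o_g)||), which lies in (0, 1] by construction."""
    predicted = latent_model(embed(obs), action)
    return math.exp(-math.dist(predicted, embed(goal_obs)))


def implied_steps(q_value, gamma):
    """Invert Q = gamma**(n - 1) to read off the step count n a value corresponds to."""
    return 1.0 + math.log(q_value) / math.log(gamma)


q_value = q_structured((0.0, 0.0), (1.0, 0.0), (3.0, 2.0))
print(f"Q = {q_value:.3f}, implied steps = {implied_steps(q_value, 0.99):.1f}")
```

Note that the same `embed` function processes both the current and the goal observation, and that the action only enters through the branch of the current observation. The value equals 1 exactly when the predicted next embedding coincides with the goal embedding.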
As an additional observation, the sample efficiency of model-based RL is often attributed to the ability to make use of all observed data, as any valid transition is informative of the dynamics. Relabeling procedures such as the one used here achieve a similar effect, and together with the specific structure introduced above they blur the lines between both types of methods.
5 Experiments
We investigate the following questions: 1) Can we learn a goal-reaching policy solely from visual input and no rewards? 2) Does adding structure to the Q function improve the performance of the algorithm? 3) Does the learned embedding carry dynamics information like timesteps between points along a trajectory? We analyze these questions in three environments implemented with the physics simulator MuJoCo [Todorov et al., 2012]: visual point-mass, wall point-mass, and Jaco arm reacher. The first two environments have a two-dimensional action space corresponding to a force applied to a spherical object. For both, the observation is a fully top-down view of the scene, as seen in Fig. 1(a)-1(b). The only difference is a wall blocking 3/4 of the middle division in the wall task, creating a sort of U-maze. The Jaco arm has a seven-dimensional action space to control the velocity of each joint actuator. The observation is a frontal view. All trajectories last 10 seconds and the agents operate at 10 Hz for the point-mass and at 20 Hz for the Jaco. Our algorithm works solely from visual inputs with the resolution observed in Fig. 2: 64x64 pixels for the point-mass environments and 96x96 pixels for the Jaco arm.
5.1 Self-supervised learning of image-conditioned policies
In this subsection we show that we can learn goal-conditioned policies that control the agent to reach a state whose observation matches any previously seen goal observation. Neither reward nor state information is ever used in the learning process. Nevertheless, we use the L1 distance in position space as the learning progress metric, given that distances in pixel space are noisier and less interpretable. For the point-mass tasks, the position is given by the coordinates of the center of mass. For the Jaco arm, it is the seven-dimensional vector of joint angles.
In Fig. 3 we compare three versions of the algorithm, differing only in the structure of the Q function. Q unstructured does not impose any structure on the function, Q shared encoding uses the same vision stack to process the current observation and the goal observation, and Q structured additionally imposes the structure given in equation 4. To answer our first question, we observe that the algorithm is able to reduce the final distance to the given goal using any of the three models. The performance reported in these plots is computed by collecting trajectories conditioned on previously seen observations as goals. Therefore, as the replay buffer grows, the evaluation criterion gets harder at the start of learning, especially for the higher-dimensional environments like the Jaco. To answer our second question, we see that the structure that we introduced in the previous section substantially increases convergence speed and the final performance attained.
5.2 Embedding analysis
In this section we study the evolution of three types of distances to the goal along a successful trajectory for the Wall point-mass task. Similar plots can be found for the other environments. In the top row of Fig. 4 we observe three frames obtained at 0, 2, and 8 seconds, as well as the goal image $o_g$. In the bottom row we monitor, from left to right, our structured $Q$ function, the distance in pixel space, and the distance in position space. From the right figure we see that the agent reaches the exact position that generated the goal observation, and stays there with small oscillations. We see in the pixel distance plot that this is an uninformative distance before having reached the vicinity of the goal (all earlier observations are at the same noisy distance), and even after reaching the goal it is never reduced to 0 because the observations never match exactly. Therefore it is hard to interpret or use this distance as a reward to learn a goal-reaching policy. Finally, in the left figure we see that the distance in embedding space trained through our structured $Q$ follows $\gamma^{n-1}$, where $n$ is the remaining number of timesteps to the goal. It does not reach exactly 1 because the agent never reaches exactly the same observation, but it understands that only a few timesteps would be needed (in theory, as in practice it will never match the exact same observation).
6 Failure modes and future work
In both the Wall point-mass and the Jaco arm environments we do not obtain complete convergence with our algorithm. Here we describe some existing issues, suggest hypotheses about their source, and propose some experiments to check their validity.
First, in all the environments we observe some oscillation around the goal position. For the point-mass environments this is not critical, but for the Jaco arm we have observed that this is the cause of most of the final distance to goal. We know that in many robotic environments, knowledge of the velocities is critical to act optimally to reach specified states (for example, to estimate whether the agent is going in the right direction). Nevertheless, this information cannot be conveyed in a single image. Furthermore, working from images directly may introduce issues with observation aliasing when several states produce similar observations. Both problems could be mitigated by adding as input to the $Q$ function and the policy the observations from some previous timesteps. For example, the model could take the embeddings of a few consecutive observations as input, or be an RNN taking in all previous observations. Another solution to the oscillation problem would be a change in the action space: if the system allows the use of delta-position commands, this would greatly alleviate the issue.
Second, in the wall point-mass we observe some difficulties in reaching goals that are very far into the other leg of the U-shape. We think this might be an exploration problem. Indeed, our method completely overlooks this issue (i.e., it is orthogonal to it), relying solely on the random initialization and on the maximum entropy policy given by MPO. This might not give enough structure to the space, specifically to link far away states. To alleviate this issue we could either use some intrinsic motivation reward to expand the set of observed goals, or add to the replay buffer some demonstrations performing the hardest connections.
Finally, we would like to point out some limitations of the exact formulation we propose here, and possible fixes. An underlying assumption of our work is that the environment is reversible; otherwise there is no embedding space where a distance (which by definition is symmetric) can be equal to the minimum number of timesteps between states. This is true for a wide range of practical tasks like all quasi-static manipulation, but might not hold when acting on deformable objects or in highly dynamic tasks like throwing objects. In such cases, we should replace the distance in the Q function model by a non-symmetric comparison between states.
7 Conclusions
We have shown that it is possible to learn goal-reaching policies in a completely self-supervised setup, and only from high-dimensional sensory inputs like images. Our approach does not use any auxiliary learning signal. Instead, it relies solely on computing the minimum number of timesteps needed to connect different states. This can be written as a Bellman equation that we efficiently solve with a modified off-policy algorithm paired with goal relabeling. We also introduce a novel structure of the $Q$ function that connects model-free and model-based RL methods, and improves both the learning speed and the final performance.
References
 Abdolmaleki et al. [2018] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. International Conference on Learning Representations, 2018.
 Andrychowicz et al. [2017] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, 2017.
 Andrychowicz et al. [2018] Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous In-Hand manipulation. https://arxiv.org/abs/1808.00177, 2018.
 Barreto et al. [2016] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. http://arxiv.org/abs/1606.05312, 2016.

 Barreto et al. [2018] Andre Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Mankowitz, Augustin Zidek, and Remi Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 501–510, Stockholmsmässan, Stockholm Sweden, 2018. PMLR.
 Dayan [1993] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Comput., 5(4):613–624, 1993.
 Finn et al. [2015] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In International Conference on Robotics and Automation, 2015.
 Finn et al. [2016] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016.
 Florensa et al. [2017a] Carlos Florensa, Yan Duan, and Pieter Abbeel. Stochastic neural networks for hierarchical reinforcement learning. International Conference in Learning Representations, 2017a.
 Florensa et al. [2017b] Carlos Florensa, David Held, Markus Wulfmeier, Michael Zhang, and Pieter Abbeel. Reverse curriculum generation for reinforcement learning. Conference on Robot Learning, 2017b.
 Hausman et al. [2018] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. International Conference on Learning Representations, 2018.
 Heess et al. [2017] Nicolas Heess, T B Dhruva, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S M Ali Eslami, Martin Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. http://arxiv.org/abs/1707.02286, 2017.
 Higgins et al. [2016] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference in Learning Representations, 2016.
 Kurutach et al. [2018] Thanard Kurutach, Aviv Tamar, Ge Yang, Stuart Russell, and Pieter Abbeel. Learning plannable representations with causal InfoGAN. http://arxiv.org/abs/1807.09341, 2018.
 Lange and Riedmiller [2010] Sascha Lange and Martin A Riedmiller. Deep learning of visual control policies. In ESANN, 2010.
 Lee et al. [2018] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. http://arxiv.org/abs/1804.01523, 2018.
 Levine and Abbeel [2014] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems 27, pages 1071–1079. Curran Associates, Inc., 2014.
 Lillicrap et al. [2016] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Munos et al. [2016] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G Bellemare. Safe and efficient Off-Policy reinforcement learning. In Advances in Neural Information Processing Systems, 2016.
 Nagabandi et al. [2017] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for Model-Based deep reinforcement learning with Model-Free Fine-Tuning. http://arxiv.org/abs/1708.02596, 2017.
 Nair et al. [2018] Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. http://arxiv.org/abs/1807.04742, 2018.
 Peters et al. [2010] J Peters, K Mülling, and Y Altun. Relative entropy policy search. AAAI, 2010.
 Pong et al. [2018] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-Free deep RL for Model-Based control. In International Conference on Learning Representations, 2018.
 Rajeswaran et al. [2018] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Robotics: Science and Systems, 2018.
 Rao [2009] Anil V Rao. A survey of numerical methods for optimal control. Advances in the Astronautical Sciences, AAS 09-334, 2009.
 Schaul et al. [2015] Tom Schaul, Dan Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, 2015.
 Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 1998.
 Todorov et al. [2012] E Todorov, T Erez, and Y Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
 Zhang et al. [2018] Marvin Zhang, Sharad Vikram, Laura Smith, Pieter Abbeel, Matthew J Johnson, and Sergey Levine. Solar: Deep structured latent representations for model-based reinforcement learning. http://arxiv.org/abs/1808.09105, 2018.
Appendix A Q fitting
Let’s denote by all the parameters of the function, their target values that are updated with the value of every few learning iterations, an arbitrary behavior policy, its induced state visitation, and our current policy. Then the value estimation step amounts to solving:
(6) 
Once an approximation of the value of the current policy is known, we can express the policy that maximizes , under the trust region as
where the value of is the solution of a convex dual problem [Peters et al., 2010]. This is a nonparametric form, so to recover a policy from which we can sample, we can solve another KL-constrained maximum likelihood problem, where is the corresponding dual variable [Abdolmaleki et al., 2018]:
(7) 
Appendix B Hyperparameters used
B.1 Architecture choices
The vision stack for the critic consists of five convolutions of strides and output channels . All kernel shapes are . The number of encoded features (i.e., the output dimension of the convolution stack) is 128. The current representations are then concatenated with the action, and passed through two fully connected layers of sizes with ReLU nonlinearities. The final output is a scalar with a tanh nonlinearity. The policy has the same architecture as the critic, except that its output has twice the dimension of the action space in order to parameterize a Gaussian distribution (with diagonal covariance matrix).
B.2 Algorithm hyperparameters
The batch size to form the Retrace loss and the MPO objective consists of 128 sequences of 32 steps from the replay buffer. The MPO objective has an initial temperature of , and a KL constraint of . The optimization algorithm to minimize these losses is Adam, with a learning rate of . The critic target is updated every 8 learning iterations. The capacity of the replay buffer is set to trajectories.
Appendix C Successor features connection
Successor Representations (SR) have been used to train state-reaching tasks in simple continuous cases. Nevertheless there is a major limitation to the previously proposed approaches: all states for which we will ever be interested in learning a reaching policy need to be pre-specified before starting any learning. In the case of [Barreto et al., 2016], those are only 12. The reason behind this limitation is that SR require the reward function to be defined as a linear function of some state features . Therefore, if we want to express rewards related to reaching a particular state , like , we need to have a component in the state representation vector that gives exactly this value, such that the reward can be expressed with . This is because rewards like the ones specified above cannot be expressed linearly as a function of . Unfortunately this trick can only be done a finite number of times, as many as we are willing to increase the dimensionality of .
In fact, taking this process to the extreme, the feature "vector" becomes a function , and then the reward needs to be expressed as . In other words, in this case the “vector” is simply any function of the state, meaning we can represent any reward! This seems to indicate that if we compute the SR for a policy (now also dependent on the goal ):
we could directly find the action-value function for any reward: . Of course this is not very practical, as computing this integral is probably as hard as computing the Q from scratch.