1 Introduction
Reinforcement learning (RL) algorithms hold the promise of providing a broadly applicable tool for automating control, and combining high-capacity deep neural network models with RL extends their applicability to settings with complex observations that require intricate policies. However, RL with function approximation, including deep RL, presents a challenging optimization problem. Despite years of research, current deep RL methods are far from a turnkey solution: most popular methods lack convergence guarantees (Baird, 1995; Tsitsiklis and Van Roy, 1997) or require prohibitive numbers of samples (Schulman et al., 2015; Lillicrap et al., 2015). Moreover, in practice, many commonly used algorithms are extremely sensitive to hyperparameters (Henderson et al., 2018). Beyond these optimization challenges, another usability challenge of RL is reward function design: although RL automatically determines how to solve the task, the task itself must be specified in a form that the RL algorithm can interpret and optimize. These challenges prompt us to consider whether there might exist a general method for learning behaviors without the need for complex, deep RL algorithms.
Imitation learning is an alternative paradigm to RL that provides a simple and straightforward approach for training control policies via standard supervised learning methods. By maximizing the likelihood of good actions provided by an expert demonstrator, supervised imitation learning can produce effective policies without the algorithmic complexities and optimization challenges of RL. Supervised learning algorithms in deep learning have matured to the point of being robust and reliable, and imitation learning algorithms have demonstrated success in acquiring behaviors robustly and reliably from high-dimensional sensory data such as images (Rajeswaran et al., 2017; Lynch et al., 2019). The catch is that imitation learning methods require an expert demonstrator, typically a person, to provide a number of demonstrations of optimal behavior. Obtaining expert demonstrations can be challenging, and the large number of demonstrations required limits the scalability of such algorithms. In this paper, we ask: can we use ideas from imitation learning to train effective goal-directed policies without any expert demonstrations, retaining the benefits of imitation learning, but making it possible to learn goal-directed behavior autonomously from scratch?
The key observation for making progress on this problem is that, in the multi-task setting, trajectories that are generated by a suboptimal policy can serve as optimal examples for other tasks. In particular, in the setting where the tasks correspond to reaching different goal states, every trajectory is a successful demonstration for the state that it actually reaches, even if it performs sub-optimally for the goal that was originally commanded. Similar observations have been made in prior works as well (Kaelbling, 1993; Andrychowicz et al., 2017; Nair et al., 2018; Mavrin et al., 2019; Savinov et al., 2018), but have been used to motivate data reuse in off-policy RL or semi-parametric methods. Our approach will leverage this idea to obtain near-optimal goal-conditioned policies without explicit off-policy RL or complex hand-designed reward functions.
The algorithm that we study is, at its core, very simple: at each iteration, we run our latest goal-conditioned policy, collect data, and then use this data to train a policy with supervised learning. Supervision is obtained by noting that each action taken is a good action for reaching the states that actually occurred in future time steps along the same trajectory. This algorithm resembles imitation learning, but is self-supervised. The procedure combines the benefits of goal-conditioned policies with the simplicity of supervised learning, and we show theoretically that it performs policy learning by optimizing a lower bound on the goal-reaching objective. While several prior works have proposed training goal-conditioned policies via imitation learning based on a superficially similar algorithm (Ding et al., 2019; Lynch et al., 2019), to our knowledge no prior work proposes a complete policy learning algorithm based on this idea that learns from scratch, without expert demonstrations. This procedure reaps the benefits of off-policy data reuse without needing to learn reward functions or value functions. Moreover, we can bootstrap our algorithm with a small number of expert demonstrations, so that it can continue to improve its behavior in a self-supervised manner, without dealing with the challenges of combining imitation learning with off-policy RL.
The main contribution of our work is a complete algorithm for learning policies from scratch via goal-conditioned imitation learning, together with a demonstration that this algorithm can successfully train goal-conditioned policies. Our theoretical analysis of self-supervised goal-conditioned imitation learning shows that this method optimizes a lower bound on the probability that the agent reaches the desired goal. Empirically, we show that our proposed algorithm is able to learn goal-reaching behaviors from scratch without the need for an explicit reward function or expert demonstrations.
2 Related Work
Our work addresses the same problem statement as goal-conditioned reinforcement learning (RL) (Kaelbling, 1993; Andrychowicz et al., 2017; Pong et al., 2018; Nair et al., 2018; Held et al., 2018), where we aim to learn a policy via RL that can reach different goals. Learning goal-conditioned policies is quite challenging, especially when provided only sparse rewards. This challenge can be partially mitigated by hindsight relabeling approaches that relabel goals retroactively (Kaelbling, 1993; Schaul et al., 2015; Pong et al., 2018; Andrychowicz et al., 2017). However, even with relabeling, the goal-conditioned optimization problem still relies on unstable off-policy RL methods. In this work, we take a different approach and leverage ideas from supervised learning and data relabeling to build off-policy goal-reaching algorithms that do not require any explicit RL. This allows GCSL to inherit the benefits of supervised learning without the pitfalls of off-policy RL. While, in theory, on-policy algorithms might solve the goal-reaching problem as well, their inefficient use of data makes it challenging to apply them to real-world settings.
Our algorithm is based on ideas from imitation learning (Billard et al., 2008; Hussein et al., 2017) via behavioral cloning (Pomerleau, 1989) but it is not an imitation learning method. While it is built on top of ideas from supervised learning, we are not trying to imitate externally provided expert demonstrations. Instead, we build an algorithm which can learn to reach goals from scratch, without explicit rewards. A related line of work (Hester et al., 2018; Brown et al., 2019; Rajeswaran et al., 2017) has explored how agents can leverage expert demonstrations to bootstrap the process of reinforcement learning. While GCSL is an algorithm to learn goal-reaching policies from scratch, it lends itself naturally to bootstrapping from demonstrations. As we show in Section 5.4, GCSL can easily incorporate demonstrations into off-policy learning and continue improving, avoiding many of the challenges described in Kumar et al. (2019b).
Recent imitation learning algorithms propose methods that are closely related to GCSL. Lynch et al. (2019) aim to learn general goal-conditioned policies from “play” data collected by a human demonstrator, and Ding et al. (2019) perform goal-conditioned imitation learning where expert goal-directed demonstrations are relabeled for imitation learning. However, neither of these methods is iterative, and both require human-provided expert demonstrations. Our method instead iteratively performs goal-conditioned behavioral cloning, starting from scratch. Our analysis shows that performing such iterated imitation learning on the policy’s own sampled data actually optimizes a lower bound on the probability of successfully reaching goals, without the need for any expert demonstrations. Relay Policy Learning (Gupta et al., 2019) extends some of these insights to the hierarchical long-horizon setting, but requires on-policy fine-tuning with RL and also requires expert demonstrations.
The cross-entropy method (Mannor et al., 2003), self-imitation learning (Oh et al., 2018), reward-weighted regression (Peters and Schaal, 2007), path-integral policy improvement (Theodorou et al., 2010), reward-augmented maximum likelihood (Norouzi et al., 2016; Nachum et al., 2016), and the proportional cross-entropy method (Goschin et al., 2013) selectively weight policies or trajectories by their performance during learning, as measured by the environment’s reward function. While these methods may appear procedurally similar to GCSL, our method is fully self-supervised, as it does not require a reward function, and is applicable in the goal-conditioned setting.
A few works similar to ours in spirit study the problem of learning goal-conditioned policies without external supervision. Pathak et al. (2018) use an inverse model with forward consistency to learn from novelty-seeking behavior, but their method lacks convergence guarantees and requires learning a complex inverse model. Semi-parametric methods (Savinov et al., 2018; Eysenbach et al., 2019) learn a policy similar to ours, but do so by building a connectivity graph over the visited states in order to navigate environments, which requires memory storage and computation time that grow with the number of states.
3 Preliminaries
We consider goal reaching in an environment defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \rho, T, p(g) \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ is the transition function, $\rho(s_0)$ is the initial state distribution, $T$ the horizon length, and $p(g)$ is the distribution over goal states $g \in \mathcal{G} \subseteq \mathcal{S}$. We aim to find a time-varying goal-conditioned policy $\pi: \mathcal{S} \times \mathcal{G} \times [T] \to \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the probability simplex over the action space and $h \in [T]$ is the remaining horizon. We will say that a policy is optimal if it maximizes the probability that the specified goal is reached at the end of the episode:

$$\max_\pi \; \mathbb{E}_{g \sim p(g)}\left[ P(s_T = g \mid \pi, g) \right]. \qquad (1)$$
It is important to note here that optimality does not correspond to finding the shortest path to the goal, but any path that reaches the goal at the end of $T$ time steps. This can equivalently be cast as an RL problem. The modified state space contains the current state, goal, and remaining horizon. The modified transition function appropriately handles the modified state space. The reward function depends on both the goal and the time step. Because of the special structure of this formulation, off-policy RL methods can relabel an observed transition $(s, a, s', g, h)$ to one with a different goal $g'$ and different horizon $h'$, such as $(s, a, s', g', h')$, analogously to temporal difference models (Pong et al., 2018).
We consider algorithms for goal-reaching that use behavior cloning, a standard method for imitation learning. In behavior cloning for goal-conditioned policies, an expert policy provides demonstrations for reaching some target goals at the very last timestep, and we aim to find a policy that best predicts the expert actions from the observations. More formally, given a dataset $\mathcal{D} = \{\tau_1, \dots, \tau_N\}$ of expert behavior and a set of stochastic, time-varying policies $\Pi$, the behavior-cloned policy corresponds to

$$\pi_{BC} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \sum_{t=0}^{T-1} \log \pi(a_t \mid s_t, g, T - t) \right].$$
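As a concrete illustration, the maximum-likelihood objective above can be sketched as a mean negative log-likelihood over (state, action, goal, horizon) tuples. This is a hedged sketch: the function names and tuple layout are our own, not from the paper.

```python
import math

def goal_bc_loss(logp_fn, dataset):
    """Mean negative log-likelihood of expert actions under the policy.

    logp_fn(s, a, g, h) -> log pi(a | s, g, h), the policy's log-probability.
    dataset: list of (state, action, goal, horizon) tuples.
    """
    total = sum(-logp_fn(s, a, g, h) for (s, a, g, h) in dataset)
    return total / len(dataset)

# A policy that is uniform over two actions incurs a loss of log(2).
uniform_logp = lambda s, a, g, h: math.log(0.5)
loss = goal_bc_loss(uniform_logp, [(0, 1, 3, 2), (1, -1, 0, 1)])
```

Minimizing this loss over the policy class yields the behavior-cloned policy $\pi_{BC}$.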
4 Learning Goal-Conditioned Policies with Self-Imitation
Goal-conditioned behavior cloning using expert demonstrations can provide supervision not only for the task the expert was aiming for, but also for reaching any state along the expert’s trajectory (Lynch et al., 2019; Ding et al., 2019). Can we design a procedure to learn goal-reaching behaviors from scratch that uses goal-conditioned behavior cloning as a subroutine without requiring any expert demonstrations?
In this work, we show how imitation learning with data relabeling can be utilized in an iterative procedure that optimizes a lower bound on the RL objective while providing a number of benefits over standard RL algorithms. It is important to note that we are not proposing an imitation learning algorithm, but an algorithm for learning goal-reaching behaviors from scratch without any expert demonstrations.
4.1 Goal Reaching via Iterated Imitation Learning
First, consider goal-conditioned imitation learning via behavioral cloning with demonstrations, as described by Lynch et al. (2019) and Ding et al. (2019). This scheme works well given expert data, but expert data is unavailable when learning to reach goals from scratch. To leverage this scheme when learning from scratch, we use the following insight: while an arbitrary trajectory from a suboptimal policy may be suboptimal for reaching the intended goal, it may be optimal for reaching some other goal. In the goal-reaching formalism defined in Equation 1, recall that a policy is optimal if it maximizes the probability that the goal is reached at the last time step of an episode. An optimal path under this objective is not necessarily the shortest one.
Under this notion of optimality, we can use a simple data relabeling scheme to construct an expert dataset from an arbitrary set of trajectories. Consider a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$ obtained by commanding the policy to reach some goal $g$. Although the actions may be suboptimal for reaching the commanded goal $g$, they do succeed at reaching the states that occur later in the observed trajectory. More precisely, for any time step $t$ and horizon $h$, the action $a_t$ in state $s_t$ is likely to be a good action for reaching $s_{t+h}$ in $h$ time steps, and thus useful supervision for $\pi(\cdot \mid s_t, s_{t+h}, h)$. This autonomous relabeling method allows us to convert suboptimal trajectories into optimal goal-reaching trajectories for different goals, without the need for any human supervision or intervention. To obtain a concrete algorithm, we can relabel all time steps and horizons in a trajectory to create an expert dataset of tuples $(s_t, a_t, s_{t+h}, h)$ for all $0 \le t < t + h \le T$. Because the relabeling procedure is valid for any horizon $h$, we can relabel every such combination to create $O(T^2)$ optimal tuples from a single trajectory.
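The relabeling step described above can be sketched in a few lines. The function name and tuple layout are illustrative, not from the paper; one trajectory of length $T$ yields $T(T+1)/2$ relabeled tuples.

```python
def relabel_trajectory(states, actions):
    """Relabel one trajectory into goal-conditioned supervision.

    states: s_0 .. s_T (length T+1); actions: a_0 .. a_{T-1} (length T).
    Returns tuples (s_t, a_t, goal=s_{t+h}, h) for every valid t and h.
    """
    T = len(actions)
    data = []
    for t in range(T):
        for h in range(1, T - t + 1):
            # a_t is treated as a good action for reaching s_{t+h} in h steps
            data.append((states[t], actions[t], states[t + h], h))
    return data
```

For example, a trajectory with T = 3 steps produces 6 relabeled tuples, each usable directly as supervised-learning data.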
This relabeled dataset can then be used to perform goal-conditioned behavioral cloning to update the policy $\pi$. While performing one iteration of goal-conditioned behavioral cloning on the relabeled dataset is not immediately sufficient to reach all desired goals, we will show that this procedure does in fact optimize a lower bound on a well-defined reinforcement learning objective. As described concretely in Algorithm 1, the algorithm proceeds as follows: (1) Sample a goal $g$ from the target goal distribution $p(g)$. (2) Execute the current policy $\pi(\cdot \mid s, g, h)$ for $T$ steps in the environment to collect a potentially suboptimal trajectory $\tau$. (3) Relabel the trajectory according to the previous paragraph to add new expert tuples $(s_t, a_t, s_{t+h}, h)$ to the training dataset. (4) Perform supervised learning on the entire dataset to update the policy $\pi$ via maximum likelihood. We term this iterative procedure of sampling trajectories, relabeling them, and training a policy until convergence goal-conditioned supervised learning (GCSL). This algorithm can use all of the prior off-policy data in the training dataset because this data remains optimal under the notion of goal-reaching optimality defined in Section 3.
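The four steps above can be sketched end to end on a toy problem. Everything below is our own illustration under strong simplifying assumptions, not the paper's setup: a deterministic 1-D chain environment, a tabular policy fit by action-frequency counts (the maximum-likelihood solution for a tabular class), and epsilon-random exploration.

```python
import random

def gcsl_toy(num_iters=300, N=5, T=6, eps=0.3, seed=0):
    """GCSL sketch on a 1-D chain: states 0..N, actions -1/+1, start at 0."""
    rng = random.Random(seed)
    counts = {}  # (state, goal) -> {action: count}; argmax is the ML policy

    def policy(s, g):
        c = counts.get((s, g))
        if c and rng.random() > eps:
            return max(c, key=c.get)       # act greedily w.r.t. the fit
        return rng.choice([-1, 1])         # exploration noise

    for _ in range(num_iters):
        g = rng.randrange(N + 1)                       # (1) sample a goal
        states, actions = [0], []
        for _ in range(T):                             # (2) roll out policy
            a = policy(states[-1], g)
            actions.append(a)
            states.append(min(max(states[-1] + a, 0), N))
        for t in range(T):                             # (3) hindsight relabel
            for h in range(1, T - t + 1):
                key = (states[t], states[t + h])
                # (4) "supervised learning": accumulate ML action counts
                counts.setdefault(key, {}).setdefault(actions[t], 0)
                counts[key][actions[t]] += 1
    return counts
```

In this tabular setting, step (4) collapses to counting, but the control flow mirrors the sample-relabel-fit loop of GCSL; with a neural policy, step (4) would instead be a maximum-likelihood gradient update on the relabeled dataset.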
The GCSL algorithm described above can learn to reach goals from the target distribution $p(g)$ simply by iterated behavioral cloning. The resulting goal-reaching algorithm is off-policy, uses low-variance gradients, and is simple to implement and tune, without the need for any explicit reward function engineering or demonstrations. Additionally, since the algorithm is off-policy and does not require a value function estimator, it is substantially easier to bootstrap from demonstrations when real demonstrations are available, as our experiments show in Section 5.4.
4.2 Theoretical Analysis
While the GCSL algorithm is simple to implement, does it actually solve a well-defined policy learning problem? In this section, we argue that GCSL maximizes a lower bound on the probability that the policy reaches commanded goals.
We start by writing the probability that policy $\pi$ conditioned on goal $g$ produces trajectory $\tau$ as $\pi(\tau \mid g)$. We define $s_T(\tau)$ as the final state of a trajectory. Recalling Equation 1, the target goal-reaching objective we wish to maximize is the probability of reaching a commanded goal:

$$J(\pi) = \mathbb{E}_{g \sim p(g),\, \tau \sim \pi(\tau \mid g)}\left[ \mathbb{1}\{ s_T(\tau) = g \} \right].$$
That is, we are optimizing a multi-task problem where the reward for each task is an indicator that its goal was reached. The distribution over tasks (goals) of interest is assumed to be pre-specified as $p(g)$. In the on-policy setting, GCSL performs imitation learning on trajectories commanded by the goals that were reached by the current policy. For the sake of brevity, we simplify the objective to only consider relabeling to the last time step of a trajectory, resulting in the following objective:

$$J_{GCSL}(\pi) = \mathbb{E}_{g \sim p(g),\, \tau \sim \bar{\pi}(\tau \mid g)} \left[ \sum_{t=0}^{T-1} \log \pi(a_t \mid s_t, s_T(\tau), T - t) \right].$$
Here, $\bar{\pi}$ corresponds to a copy of $\pi$ through which gradients do not propagate, following the notation of Schulman et al. (2015). Our main result, which is derived in the on-policy data collection setting, shows that optimizing $J_{GCSL}(\pi)$ optimizes a lower bound on the desired objective $J(\pi)$ (proof in Appendix B.1):
Let $J(\pi)$ and $J_{GCSL}(\pi)$ be as defined above. Then $J(\pi) \ge J_{GCSL}(\pi) + C$, where $C$ is a constant that does not depend on $\pi$.
Note that, to keep $J(\pi)$ and $J_{GCSL}(\pi)$ well-defined, the probability of reaching a goal under $\pi$ must be nonzero. In scenarios where this condition does not hold, the bound remains true, albeit vacuous. The tightness of the bound is controlled by the effective error in the GCSL objective. We present a technical analysis of the bound in Appendix B.2, and further performance guarantees when the GCSL loss is sufficiently minimized in Appendix B.3. This indicates that in the regime of expressive policies, where the loss function can be minimized well, GCSL will improve the expected reward.
5 Experimental Evaluation
In our experimental evaluation, we aim to answer the following questions:
Does GCSL effectively learn goal-conditioned policies from scratch?
Does the performance of GCSL improve over successive iterations?
Can GCSL learn goal-conditioned policies from high-dimensional image observations?
Can GCSL incorporate demonstration data more effectively than standard RL algorithms?
5.1 Experimental Details
We consider a number of simulated control environments: 2D room navigation, object pushing with a robotic arm, and the classic Lunar Lander game, shown in Figure 2. The tasks allow us to study the performance of our method under a variety of system dynamics, with both low-dimensional state inputs and high-dimensional image observations, and in settings with both easy and difficult exploration. For each task, the target goal distribution corresponds to a uniform distribution over all reachable configurations. The performance of a method is quantified by the distance of the agent to the goal at the last timestep. We present full details about the environments, evaluation protocol, and hyperparameter choices in Appendix A.
For the practical implementation of GCSL, we parametrize the policy as a neural network that takes in state, goal, and horizon as input and outputs a parameterized action distribution. We find that omitting the horizon from the input to the policy still provides good results, despite the formulation suggesting that the optimal policy is most likely non-Markovian. We speculate that this is due to optimal actions changing only mildly with different horizons in our tasks. Full details about the implementation for GCSL are presented in Appendix A.1.
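As an illustration of this parameterization, the sketch below (our own; the shapes are illustrative and a linear map stands in for the actual network) concatenates state and goal features and outputs a categorical action distribution via a softmax.

```python
import math

def policy_distribution(state, goal, weights):
    """Categorical policy head over the concatenated [state; goal] features.

    weights: one weight vector per action, each the length of the
    concatenated feature vector. Returns action probabilities.
    """
    x = list(state) + list(goal)                       # concatenate inputs
    logits = [sum(w * xi for w, xi in zip(wa, x)) for wa in weights]
    m = max(logits)                                    # numerically stable softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The horizon, when used, would simply be appended to the feature vector; as noted above, omitting it still works well in practice.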
Figure 3: GCSL is competitive with state-of-the-art off-policy value-function RL algorithms for goal-reaching from low-dimensional sensory observations. Shaded regions denote the standard deviation across 3 random seeds (lower is better).
5.2 Learning Goal-Conditioned Policies
We evaluate the effectiveness of GCSL for reaching goals on the domains visualized in Figure 2, both from low-dimensional proprioception and from images. To better understand the performance of our algorithm, we compare to two families of reinforcement learning algorithms for solving goal-conditioned tasks. First, we consider off-policy temporal-difference RL algorithms, particularly TD3-HER (Eysenbach et al., 2019; Held et al., 2018), which uses hindsight experience replay to more efficiently learn goal-conditioned value functions. TD3-HER requires significantly more machinery than the relatively simple GCSL algorithm: it maintains a policy, a value function, a target policy, and a target value function, all of which are required to prevent degradation of the learning procedure. We also compare with TRPO (Schulman et al., 2015), an on-policy reinforcement learning algorithm that cannot leverage data relabeling but is known to provide more stable optimization than off-policy methods. Because these methods cannot relabel data, we provide an epsilon-ball reward corresponding to reaching the goal. Details of the training procedure for these comparisons, along with hyperparameter and architectural choices, are presented in Appendix A.2. Videos and further details can be found at https://sites.google.com/view/gcsl/.
We first investigate the learning performance of these algorithms from low-dimensional state observations, as shown in Figure 3. We find that on the pushing and Lunar Lander domains, GCSL is able to reach a larger set of goals more consistently than either RL algorithm. TD3 is competitive on the navigation task; however, on the other domains, which require synthesizing more complicated skills, it makes little to no learning progress. Given a limited amount of data, TRPO performs poorly, as it cannot relabel or reuse data, and so cannot match the performance of the other two algorithms. When scaling these algorithms to image-based domains, which we evaluate in Figure 4, we find that GCSL is still able to learn goal-reaching behaviors on several of these tasks, albeit more slowly than from state. For most tasks, from both state and images, GCSL is able to reach within 80% of the desired goals and learn at a rate comparable to or better than previously proposed off-policy RL methods. This evaluation demonstrates that simple iterative self-imitation is a competitive scheme for reaching goals in challenging environments, one that scales favorably with the dimensionality of the state and the complexity of the behaviors.
5.3 Analysis of Learning Progress and Learned Behaviors
Next, we investigate how GCSL performs as we vary the quality and quantity of data, the policy class we optimize over, and the relabelling technique (Figure 5). Full details for these scenarios can be found in Appendix A.4.
First, we consider how varying the policy class can affect the performance of GCSL. In Section 5.1, we hypothesized that optimizing over a Markovian policy class would be more performant than maintaining a non-Markovian policy. We find that allowing policies to be time-varying (“Time-Varying Policy” in Figure 5) can drastically speed up training on small domains, as these non-Markovian optimal policies can fit the training data more accurately. However, on domains with active exploration challenges such as the Pusher, exploration using time-varying policies is ineffective and degrades performance.
Second, we investigate how the quality of the data used to train the policy affects the learned policy. We consider two variations of GCSL: one that collects data using a fixed policy (“Fixed Data Collection” in Figure 5) and another that limits the size of the dataset, forcing all the data to be on-policy (“On-Policy” in Figure 5). When collecting data using a fixed policy, the learning progress of the algorithm noticeably degrades, which indicates that the iterative loop of collecting data and training the policy is crucial for converging to a performant solution. By forcing the data to be on-policy, the algorithm cannot utilize the full set of experiences seen thus far and must discard data. Although this on-policy variant remains effective on simple domains, it leads to slower learning progress on tasks requiring more challenging control. We additionally compare to a simple one-step inverse model, which relabels only states and goals that are one step apart. This performs poorly compared to GCSL, indicating that multi-horizon relabeling is an important component for effective learning relative to one-step inverse models (Pathak et al., 2018).
5.4 Initializing with Demonstrations
Because GCSL can learn from arbitrary data sources, the algorithm is amenable to initialization from prior experience or from demonstrations. In this section, we study how GCSL performs when incorporating expert demonstrations into the dataset, as compared to prior methods. Our results comparing GCSL and TD3 in this setting corroborate the existing hypothesis that off-policy value function-based RL algorithms are challenging to integrate with initialization from demonstrations (Kumar et al., 2019a).
We consider the setting where an expert provides a set of demonstration trajectories, each for reaching a different goal. GCSL requires no modifications to incorporate these demonstrations: they are simply added to the initial dataset. Off-policy TD methods often have a harder time with such demonstration data because training the critic on only demonstration data is difficult, which can result in drastic drops in performance (Kumar et al., 2019a). We compare the performance of GCSL to a variant of TD3-HER at incorporating expert demonstrations on the four rooms navigation and robotic pushing environments in Figure 6 (details in Appendix A.5). On both tasks, although both algorithms start at the same initial performance, TD3 encounters a characteristic drop in performance in the early stages of training. We postulate that this occurs because the TD3 critic is poorly initialized: the value function learned by policy evaluation on demonstration data alone is inaccurate, which degrades the behavioral-cloned policy. In contrast, GCSL scales favorably, learns faster than from scratch, does not regress in behavior from the beginning of training, and effectively incorporates the expert demonstrations. We believe this benefit largely comes from not needing to train an explicit critic, which can be unstable when trained on highly off-policy data such as demonstrations (Kumar et al., 2019b).
6 Discussion and Future Work
In this work, we proposed GCSL, a simple algorithm for learning goal-conditioned policies that uses imitation learning, while still learning autonomously from scratch. This method is exceptionally simple, relying entirely on supervised learning to learn policies by relabeling its own previously collected data. This method can easily utilize off-policy data, seamlessly incorporate expert demonstrations when they are available, and can learn directly from image observations. Although several prior works have explored similar algorithm designs in an imitation learning setting (Ding et al., 2019; Lynch et al., 2019), to our knowledge our work is the first to derive a complete iterated algorithm based on this principle for learning from scratch, and the first to theoretically show that this method optimizes a lower bound on a well-defined reinforcement learning objective.
While our proposed method is simple, scalable, and readily applicable, it does have a number of limitations. The current instantiation of this approach provides limited facilities for effective exploration, relying entirely on random noise during the rollouts to explore. More sophisticated exploration methods, such as exploration bonuses (Mohamed and Rezende, 2015; Storck et al., 1995), are difficult to apply to our method, since there is no explicit reward function that is used during learning. However, a promising direction for future work would be to reweight the sampled rollouts based on novelty to effectively incorporate a novelty-seeking exploration procedure. A further direction for future work is to study whether the simplicity and scalability of our method can make it possible to perform goal-conditioned reinforcement learning on substantially larger and more varied datasets. This can in principle enable wider generalization, and realize a central goal in goal-conditioned reinforcement learning — universal policies that can succeed at a wide range of tasks in diverse environments.
This research was supported by an NSF graduate fellowship, Berkeley DeepDrive, the National Science Foundation, the Office of Naval Research, and support from Google, Amazon, and NVIDIA. We thank Karol Hausman, Ignasi Clavera, Aviral Kumar, Marvin Zhang, and Vikash Kumar for thoughtful discussions, insights, and feedback on paper drafts.
- Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: §A.2, §1, §2.
- Residual algorithms: reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pp. 30–37. Cited by: §1.
- Robot programming by demonstration. Springer handbook of robotics, pp. 1371–1394. Cited by: §2.
- Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. arXiv preprint arXiv:1904.06387. Cited by: §2.
- Goal conditioned imitation learning. In Advances in Neural Information Processing Systems, Cited by: §1, §2, §4.1, §4, §6.
- Search on the replay buffer: bridging planning and reinforcement learning. arXiv preprint arXiv:1906.05253. Cited by: §2, §5.2.
- Off-policy deep reinforcement learning without exploration. arXiv preprint arXiv:1812.02900. Cited by: §A.2.
- The cross-entropy method optimizes for quantiles. In International Conference on Machine Learning, pp. 1193–1201. Cited by: §2.
- Relay policy learning: solving long-horizon tasks via imitation and reinforcement learning. CoRR abs/1910.11956. External Links: Cited by: §2.
- Automatic goal generation for reinforcement learning agents. ICML. Cited by: §2, §5.2.
- Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1.
- Deep q-learning from demonstrations. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.
- Imitation learning: a survey of learning methods. ACM Computing Surveys (CSUR) 50 (2), pp. 21. Cited by: §2.
- Learning to achieve goals. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 1094–1098. Cited by: §1, §2.
- Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML ’02, San Francisco, CA, USA, pp. 267–274. External Links: Cited by: §B.3.
- Stabilizing off-policy q-learning via bootstrapping error reduction. CoRR abs/1906.00949. Cited by: §5.4, §5.4.
- Stabilizing off-policy q-learning via bootstrapping error reduction. arXiv preprint arXiv:1906.00949. Cited by: §2, §5.4.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
- Learning latent plans from play. arXiv preprint arXiv:1903.01973. Cited by: §1, §1, §2, §4.1, §4, §6.
- The cross entropy method for fast policy search. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 512–519. Cited by: §2.
- Distributional reinforcement learning for efficient exploration. arXiv preprint arXiv:1905.06125. Cited by: §1.
- Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pp. 2125–2133. Cited by: §6.
- Improving policy gradient by exploring under-appreciated rewards. arXiv preprint arXiv:1611.09321. Cited by: §2.
- Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, pp. 9191–9200. Cited by: §1, §2.
- Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, pp. 1723–1731. Cited by: §2.
- Self-imitation learning. In International Conference on Machine Learning, pp. 3875–3884. Cited by: §2.
- Zero-shot visual imitation. pp. 2050–2053. Cited by: §2, §5.3.
- Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pp. 745–750. Cited by: §2.
- Alvinn: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, pp. 305–313. Cited by: §2.
- Temporal difference models: model-free deep rl for model-based control. arXiv preprint arXiv:1802.09081. Cited by: §2, §3.
- Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087. Cited by: §1, §2.
- A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík (Eds.), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 627–635. Cited by: §B.3.
- Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653. Cited by: §1, §2.
- Universal value function approximators. In International conference on machine learning, pp. 1312–1320. Cited by: §2.
- Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1889–1897. Cited by: §A.2, §B.1, §B.3, §1, §4.2, §5.2.
- Reinforcement driven information acquisition in non-deterministic environments. In Proceedings of the international conference on artificial neural networks, Paris, Vol. 2, pp. 159–164. Cited by: §6.
- A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research 11 (Nov), pp. 3137–3181. Cited by: §2.
- Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, pp. 1075–1081. Cited by: §1.
Appendix A Experimental Details
a.1 Goal-Conditioned Supervised Learning (GCSL)
GCSL iteratively performs maximum likelihood estimation using a dataset of relabelled trajectories that have been previously collected by the agent. Here we present details about the policy class, data collection procedure, and other design choices. We parametrize a time-invariant policy using a neural network which takes the state and goal as input, and returns probabilities for a discretized grid of actions over the action space. For the state-based domains, the neural network is a feedforward network with two hidden layers. For the image-based domains, both the observation image and the goal image are first preprocessed through three convolutional layers. When executing in the environment, data is sampled according to an exploratory policy obtained by increasing the temperature of the current policy. The replay buffer stores trajectories and relabels them on the fly, with the size of the buffer subject only to memory constraints.
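The relabeling step at the core of GCSL can be sketched as follows; the dictionary-based trajectory format and uniform sampling of timesteps are illustrative assumptions, not the exact implementation:

```python
import random

def relabel(trajectory, rng=random):
    """Sample a (state, action, goal) training tuple from a stored trajectory.

    A timestep t and a later timestep t' are drawn, and the state reached
    at t' is treated as the goal that the action at t makes progress toward.
    GCSL then maximizes log pi(a_t | s_t, g = s_{t'}) over such tuples.
    """
    num_actions = len(trajectory["actions"])
    t = rng.randrange(num_actions)               # 0 <= t < T
    t_goal = rng.randrange(t, num_actions) + 1   # t < t' <= T
    return (trajectory["states"][t],
            trajectory["actions"][t],
            trajectory["states"][t_goal])
```

Because every sampled goal was actually reached by the agent, the resulting maximum-likelihood problem is ordinary supervised learning and needs no reward signal.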
a.2 RL Comparisons
We perform experimental comparisons with TD3-HER (Fujimoto et al., 2018; Andrychowicz et al., 2017). We relabel transitions as follows: the transition $(s_t, a_t, s_{t+1}, g)$ gets relabelled to $(s_t, a_t, s_{t+1}, g')$, where $g'$ is either the original goal $g$, the next state $s_{t+1}$, or a future state $s_{t+k}$ in the trajectory, each with fixed probability. As described in Section 3, the agent receives a terminal reward and the trajectory ends if the transition is relabelled to $g' = s_{t+1}$, and a constant per-step reward otherwise. Under this formalism, the optimal Q-function is a monotone function of the minimum expected time to go from $s$ to $g$. Both the Q-function and the actor for TD3 are parametrized as neural networks, with the same architectures (except final layers) for the state-based and image domains as those used for GCSL.
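The relabeling scheme above can be sketched as below; the mixture weights `p_orig` and `p_next` are left as parameters because the specific values are not reproduced here, and the 0/-1 sparse-reward convention is one common choice, not necessarily the exact constants used:

```python
import random

def relabel_transition(states, goal, t, p_orig, p_next, rng=random):
    """Relabel the goal of transition (s_t, a_t, s_{t+1}, g).

    Keep the commanded goal with probability p_orig, use the next state
    with probability p_next, and otherwise use a random future state.
    """
    u = rng.random()
    if u < p_orig:
        new_goal = goal
    elif u < p_orig + p_next:
        new_goal = states[t + 1]
    else:
        new_goal = states[rng.randrange(t + 1, len(states))]
    done = (states[t + 1] == new_goal)   # goal reached => episode ends
    reward = 0.0 if done else -1.0       # sparse-reward convention (assumed)
    return new_goal, reward, done
```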
We also compare to TRPO (Schulman et al., 2015), an on-policy RL algorithm. Because TRPO is on-policy, we cannot relabel goals, and so we provide a surrogate $\epsilon$-ball indicator reward function, $r(s, g) = \mathbb{1}\{\lVert s - g\rVert \le \epsilon\}$, where $\epsilon$ is chosen appropriately for each environment. To maximize the data efficiency of TRPO, we performed a coarse hyperparameter sweep over the batch size. Just as with TD3, the policy uses the same neural network architecture as GCSL.
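A minimal sketch of the $\epsilon$-ball indicator reward; the Euclidean norm is an assumption, since the text only specifies an $\epsilon$-ball:

```python
import numpy as np

def indicator_reward(state, goal, eps):
    """Return 1.0 if the state lies within an eps-ball of the goal, else 0.0."""
    dist = np.linalg.norm(np.asarray(state, dtype=float) -
                          np.asarray(goal, dtype=float))
    return float(dist <= eps)
```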
a.3 Task Descriptions
For each environment, the goal space is identical to the state space. For the image-based experiments, images were rendered at a fixed resolution.
2D Room Navigation This environment requires an agent to navigate to points in an environment with four rooms that connect to adjacent rooms. The state space has two dimensions, consisting of the Cartesian coordinates of the agent. The agent has acceleration control, and the action space has two dimensions. The distribution of goals is uniform on the state space, and the agent starts in a fixed location in the bottom-left room.
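Acceleration control means the action changes the agent's velocity rather than its position directly; a toy point-mass step under this scheme might look like the following (the timestep and speed limit are hypothetical constants, not the environment's):

```python
import numpy as np

def step(pos, vel, action, dt=0.1, max_speed=1.0):
    """One step of acceleration-control point-mass dynamics (illustrative)."""
    vel = np.clip(vel + dt * np.asarray(action, dtype=float),
                  -max_speed, max_speed)   # action accelerates the agent
    pos = pos + dt * vel                   # velocity then moves the agent
    return pos, vel
```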
Robotic Pushing This environment requires a Sawyer manipulator to move a freely moving block in an enclosed play area. The state space consists of the Cartesian coordinates of the end-effector of the Sawyer arm and the Cartesian coordinates of the block. The Sawyer is controlled via end-effector position control with a three-dimensional action space. The distribution of goals is uniform on the state space (uniform block location and uniform end-effector location), and the agent starts with the block and end-effector both in the bottom-left corner of the play area.
Lunar Lander This environment requires a rocket to land in a specified region. The state space includes the normalized position of the rocket, the angle of the rocket, whether the legs of the rocket are touching the ground, and velocity information. Goals are sampled uniformly along the landing region, either touching the ground or hovering slightly above, with zero velocity.
a.4 Ablations
Inverse Model This model relabels only states and goals that are one step apart: the policy is trained to predict $a_t$ given $s_t$ and $g = s_{t+1}$.
On-Policy Only the most recent transitions are stored and trained on.
Fixed Data Collection Data is collected according to a uniform policy over actions.
Time-Varying Policy Policies are conditioned on the remaining horizon. Alongside the state and goal, the policy receives a reverse temperature encoding of the remaining horizon as input.
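One plausible reading of the "reverse temperature encoding" is a thermometer-style binary vector whose leading entries mark the remaining steps; this concrete form is an assumption for illustration:

```python
import numpy as np

def horizon_encoding(remaining, max_horizon):
    """Thermometer-style encoding of the remaining horizon:
    the first `remaining` entries are 1, the rest 0 (assumed form)."""
    vec = np.zeros(max_horizon, dtype=np.float32)
    vec[:remaining] = 1.0
    return vec
```

This vector is concatenated with the state and goal before being fed to the policy network.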
a.5 Initializing with Demonstrations
We train an expert policy for robotic pushing using TRPO with a shaped dense reward function, and collect a dataset of 200 trajectories, each corresponding to a different goal. To train GCSL using these demonstrations, we simply populate the replay buffer with these trajectories at the beginning of training, and optimize the GCSL objective on them to warm-start the algorithm. Initializing a value function method using demonstrations requires significantly more attention; we perform the following procedure. First, we perform goal-conditioned behavior cloning to learn an initial policy $\pi_{BC}$. Next, we collect 200 new trajectories in the environment using a uniform data collection scheme. Using this dataset of trajectories, we perform policy evaluation via bootstrapping to learn $Q^{\pi_{BC}}$. Having trained such an estimate of the Q-function, we initialize the policy and Q-function to these estimates, and run the appropriate value function RL algorithm.
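The policy-evaluation-by-bootstrapping step can be illustrated in tabular form; the actual method fits a neural Q-function on sampled trajectories, so the exact MDP and backup loop below are a toy sketch:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.99, iters=500):
    """Tabular policy evaluation via bootstrapping:
    Q(s,a) <- R(s,a) + gamma * sum_s' P(s'|s,a) * V_pi(s'),
    where V_pi(s') = sum_a' pi(a'|s') Q(s',a').

    P has shape (S, A, S), R has shape (S, A), pi has shape (S, A).
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)   # V_pi(s) under the fixed policy
        Q = R + gamma * (P @ V)    # bootstrapped Bellman backup
    return Q
```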
Appendix B Theoretical Analysis
b.1 Proof of Theorem 4.1
We will assume a discrete state space in this proof, and denote a trajectory as $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$. Let the notation $s_T(\tau)$ denote the final state of a trajectory, which represents the goal that the trajectory reached. As there can be multiple paths to a goal, we let $\mathcal{T}_g$ denote the set of trajectories that reach a particular goal $g$. We abbreviate a policy's trajectory distribution as $\pi(\tau \mid g)$. The target goal-reaching objective we wish to optimize is the probability of reaching a commanded goal,

$$J(\pi) = \mathbb{E}_{g \sim p(g)}\Big[\sum_{\tau \in \mathcal{T}_g} \pi(\tau \mid g)\Big].$$
The distribution over tasks (goals) is assumed to be pre-specified as $p(g)$. GCSL optimizes the following objective, the log-likelihood of the actions conditioned on the goals actually reached by the policy, $s_T(\tau)$:

$$J_{GCSL}(\pi) = \mathbb{E}_{g \sim p(g)}\,\mathbb{E}_{\tau \sim \pi_{old}(\tau \mid g)}\Big[\sum_{t=0}^{T-1} \log \pi(a_t \mid s_t, s_T(\tau))\Big].$$
Here, using notation from Schulman et al. (2015), $\pi_{old}$ is a copy of the policy through which gradients do not propagate. To analyze how this objective relates to $J(\pi)$, we first analyze the relationship between $J(\pi)$ and a surrogate objective, given by

$$J_{surr}(\pi) = \mathbb{E}_{g \sim p(g)}\,\mathbb{E}_{\tau \sim \pi_{old}(\tau \mid g)}\big[\mathbb{1}\{s_T(\tau) = g\}\,\log \pi(\tau \mid g)\big].$$
As $J(\pi)$ and $J_{surr}(\pi)$ have the same gradient for all $\pi = \pi_{old}$, they differ by some $\pi$-independent constant $C$, i.e. $J(\pi) = J_{surr}(\pi) + C$.
We can now lower-bound the surrogate objective via the following:

$$J_{surr}(\pi) = \mathbb{E}_{g \sim p(g),\,\tau \sim \pi_{old}}\big[\mathbb{1}\{s_T(\tau) = g\}\,\log \pi(\tau \mid s_T(\tau))\big] \;\ge\; \mathbb{E}_{g \sim p(g),\,\tau \sim \pi_{old}}\big[\log \pi(\tau \mid s_T(\tau))\big].$$

The final line is our goal-relabeling objective: we train the policy to reach the goals we actually reached. The inequality holds since $\log \pi(\tau \mid s_T(\tau))$ is always negative, so dropping the indicator only adds nonpositive terms. The inequality is loose by a term related to the probability of not reaching the commanded goal.
Since the initial state and transition probabilities do not depend on the policy, we can simplify $\log \pi(\tau \mid s_T(\tau))$ as follows (by absorbing the non-$\pi$-dependent terms into $C'$):

$$\mathbb{E}_{\tau \sim \pi_{old}}\big[\log \pi(\tau \mid s_T(\tau))\big] = \mathbb{E}_{\tau \sim \pi_{old}}\Big[\sum_{t=0}^{T-1} \log \pi(a_t \mid s_t, s_T(\tau))\Big] + C' = J_{GCSL}(\pi) + C'.$$
Combining this result with the bound on the expected return completes the proof, namely that $J(\pi) \ge J_{GCSL}(\pi) + C$. Note that in order for $J_{surr}$ and $J_{GCSL}$ to not be degenerate, the probability of reaching a goal under $\pi_{old}$ must be non-zero. This assumption is reasonable, and matches the assumptions on "exploratory data collection" and full-support policies that are required by Q-learning and policy gradient convergence guarantees.
b.2 Quantifying the Quality of the Approximation
We now seek to better understand the gap introduced by Equation 2 in the analysis above. We define $\delta$ to be the probability of failing to reach the commanded goal under $\pi_{old}$ and $p(g)$. Overloading notation, we additionally define the conditional distributions of trajectories under $\pi_{old}$ given that the policy did not, and did, reach the commanded goal, respectively.
In the following section, we show that the gap can be controlled by the probability of making a mistake, $\delta$, and by a divergence between the distribution of trajectories that must be relabelled and those that need not be.
We rewrite Equation 2 by splitting the expectation over trajectories into those that reached the commanded goal and those that did not, and define the Radon–Nikodym derivative of the failure-conditioned trajectory distribution with respect to the success-conditioned one.
The first term is affine with respect to the GCSL loss, so the second term is the error we seek to understand.
The inequality is maintained because of the nonpositivity of the log-likelihood, and the final step holds because the trajectory distribution of $\pi_{old}$ is a mixture of the success-conditioned and failure-conditioned distributions. This derivation shows that the gap between the surrogate and relabelled objectives (up to affine considerations) can be controlled by 1) the probability of reaching the wrong goal and 2) the divergence between the conditional distribution of successful trajectories and those which must be relabelled. As either term goes to $0$, the bound becomes tight.
b.3 Performance Guarantees
We now show that sufficiently optimizing the GCSL objective bounds the probability of reaching the wrong goal close to $0$, and thus bounds the gap close to $0$.
Suppose we collect trajectories from a behavior policy $\pi_b$. Mimicking the notation from before, we define $\delta$ to be the probability of reaching the wrong goal. For convenience, we define $\tilde\pi(a \mid s, g, t)$ to be the conditional distribution of actions at a given state, conditioned on the commanded goal being reached at the end of the trajectory. If this conditional distribution is not defined, we let $\tilde\pi$ be uniform, so that $\tilde\pi(a \mid s, g, t)$ is well-defined for all states, goals, and timesteps.
Consider an environment with deterministic dynamics in which all goals are reachable in $T$ timesteps, and a behavior policy $\pi_b$ which is exploratory: $\pi_b(a \mid s, g, t) > 0$ for all $a$ (for example, epsilon-greedy exploration). Suppose the GCSL objective is sufficiently optimized, so that for all states $s$, goals $g$, and time-steps $t$, $D_{TV}\big(\pi(\cdot \mid s, g, t),\, \tilde\pi(\cdot \mid s, g, t)\big) \le \epsilon$.
Then, the probability of making a mistake can be bounded above by $\delta \le T\epsilon$.
We show the result through a coupling argument, similar to Schulman et al. (2015); Kakade and Langford (2002); Ross et al. (2011). Because $D_{TV}(\pi, \tilde\pi) \le \epsilon$ at every state, we can define an $\epsilon$-coupled policy pair $(\pi, \tilde\pi)$, which takes differing actions with probability at most $\epsilon$ at each step. By a union bound over all timesteps, the probability that $\pi$ and $\tilde\pi$ take any different actions throughout the trajectory is bounded by $T\epsilon$, and under the assumption of deterministic dynamics, they take the same trajectory with probability at least $1 - T\epsilon$. Under the assumptions of deterministic dynamics and all goals being reachable from the initial state distribution in $T$ timesteps, the policy $\tilde\pi$ satisfies $J(\tilde\pi) = 1$. Because $\tilde\pi$ reaches the goal with probability $1$, this implies that $\pi$ must also reach the goal with probability at least $1 - T\epsilon$. Thus, $\delta \le T\epsilon$. ∎
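The proof's chain of bounds can be written compactly; the symbols below ($\tilde\pi$ for the success-conditioned policy, $\epsilon$ for the per-step total-variation error, $T$ for the horizon, $\delta$ for the probability of a mistake) are reconstructions, since the inline math was lost in extraction:

```latex
% Union bound over the T timesteps of an eps-coupled policy pair:
\Pr\big(\pi \text{ and } \tilde\pi \text{ disagree at some } t < T\big)
  \;\le\; \sum_{t=0}^{T-1} \epsilon \;=\; T\epsilon.
% Deterministic dynamics: agreeing on every action yields the same trajectory,
% and J(\tilde\pi) = 1 under the reachability assumption, so
\delta \;=\; \Pr_{\pi}\big(s_T \neq g\big)
  \;\le\; \Pr_{\tilde\pi}\big(s_T \neq g\big) + T\epsilon
  \;=\; 0 + T\epsilon \;=\; T\epsilon.
```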
Appendix C Example Trajectories
Figure 8 below shows parts of the state along trajectories produced by GCSL. In Lunar Lander, this state is captured by the rocket's position, and in 2D Room Navigation it is the agent's position. While these trajectories do not always take the shortest path to the goal, they often take fairly direct paths from the initial position, avoiding very roundabout routes.