When people solve tasks, there is often a clear ordering in which some steps are preferable to others. For example, when building a piece of furniture, we may assume that it is better to have the pieces outside of the box than in, and a leg screwed into its base than lying on the floor. The final steps of a problem are often more desirable than earlier ones, assuming the task is being completed optimally, because they indicate that we have fewer steps left to go. Put in other terms, these later steps are typically more valuable than those seen at the beginning of the problem.
In this paper, we use this insight to compute values from expert observations without access to the underlying actions or rewards. Because task goals are often achieved at or near the end of a demonstration trajectory, it is likely that later states should have more value than early ones. Hence, given expert state trajectories, we compute the expected value of each state by assuming that the reward at the last state in the trajectory is 1 and 0 everywhere else, and then backing up values to the start of the trace by utilizing knowledge of the length of the trajectory in a self-supervised manner. We show how these values can be used to learn action-values for reinforcement learning more efficiently than training from sparse rewards.
We formally introduce our approach, Perceptual Values from Observation (PVO), which aims to learn values directly from expert observations. We show that this approach learns meaningful values that increase as the goal nears, and demonstrate that these values can be used to train a reinforcement learning agent. We demonstrate the learned values in a maze environment (Zuo, 2018), a liquid pouring task (Sermanet et al., 2016), and a task for picking up objects (Goyal et al., 2017), and show that PVO can be used to train RL agents in OpenAI's CoinRun environment (Cobbe et al., 2018).
2 Related work
There has recently been a large amount of interest in learning behaviors from expert state observations. This mechanism introduces several opportunities to obtain training examples for agents; there is a wealth of pre-existing videos of humans and other entities (such as animated characters and animals) performing tasks that we might like an agent to learn. Learning in this manner is more difficult, however, because the underlying actions and rewards are unknown. In order to make use of the abundance of video data available on the web, we should consider how we can learn goals and values without access to this information.
Given a set of single-goal observations, one recent approach is to train a classifier to predict if a state is a goal or not and then use this discriminator as a reward signal (Xie et al., 2018; Singh et al., 2019). However, goal-prediction is essentially a sparse-reward problem and thus may not shape behavior. As such, while single-goal representations require little demonstration data, they may require more environment interactions to train reinforcement learning agents than methods that provide more guidance. In general, we should expect a trade-off between the amount of experience we need to provide an agent and the amount of time it will take the agent to learn.
To that point, we can train models using already existing videos or other forms of observation. This work focuses on learning to imitate from such sequences. One approach to this problem is to learn or use pre-existing features for computing rewards (Edwards et al., 2016; Liu et al., 2017; Sermanet et al., 2017; Aytar et al., 2018; Yu et al., 2019). Such approaches likely offer a better shaped reward than goal-prediction based rewards because they are based on the distance to the goal. Another approach is to learn rewards directly (Sermanet et al., 2016; Edwards & Isbell Jr, 2017) or in an adversarial manner (Torabi et al., 2018b). Finally, we can avoid learning rewards at all by learning dynamics that aim to infer the actions taken in the state sequences (Pathak et al., 2018; Torabi et al., 2018a; Edwards et al., 2018). However, learning dynamics can often be difficult and may require a large amount of demonstration or environmental data.
This paper introduces another mechanism for learning from state observations. In particular, we are interested in learning values because they allow us to bypass engineering reward functions that may be susceptible to locally sub-optimal solutions. As we will show, using values additionally lets us remove the bootstrapped component of training reinforcement learning agents.
We are interested in solving problems specified through a Markov Decision Process, where we do not have access to the transition function or environment rewards, and the states consist of visual inputs. We are given a set of expert state observations, where we assume we also do not have access to the underlying expert actions or rewards.
Given a trajectory of expert observations, PVO aims to learn a value function that approximates the expert value function. As we noted, we are not given the underlying reward function with these demonstrations. Rather, we enforce a surrogate reward based on a simple assumption that tasks obtained from expert observations can be specified through a sparse reward of 1 at the end of the trajectory and 0 elsewhere.
This hypothesis comes from the observation that the goal will often occur at the end of the trajectory, especially in goal-directed tasks. However, we enforce this reward function even if a trajectory does not actually end at the goal. Using this assumption, we may backtrack values from the end of a trajectory to the start without knowing the actions taken. We then use this value function to learn values of novel states and to learn action-values for RL.
4.1 Step 1: Learning values from observation
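The first step of this approach aims to obtain values from expert observations. Given a trajectory $s_1, \ldots, s_T$ of length $T$, we first make the assumption that $s_T$ is a terminal goal state, and so its reward is assigned to $1$.
Note that the expected value of some state $s_t$ can be expressed as:
$$V(s_t) = \mathbb{E}\left[ r_t + \gamma V(s_{t+1}) \right]$$
where $\gamma$ is the discount factor. Because the reward at $s_T$ is $1$, we can assign the values using samples from the demonstration as:
$$V(s_T) = 1, \quad V(s_{T-1}) = \gamma V(s_T) = \gamma, \quad V(s_{T-2}) = \gamma V(s_{T-1}) = \gamma^2, \quad \ldots$$
In general, we express the value of some observation $s_t$ as:
$$V(s_t) = \gamma^{T-t}$$
This update is shown in figure 1. Here $T - t$ is effectively the number of steps remaining in the trajectory. It corresponds to how much the value at the goal will be discounted from state $s_t$ before reaching the terminal state. Because we have a sequence of optimal expert observations, we know how many steps remain.
We use a deep neural network $V_\theta$ to learn the values, and aim to minimize the following loss:
$$\mathcal{L}(\theta) = \left\lVert V_\theta(s_t) - \gamma^{T-t} \right\rVert$$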
This simple yet effective approach is shown in Algorithm 1.
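As a minimal sketch of the regression targets described above, the following computes, for every state in one demonstration, the discount raised to the number of steps remaining, together with an error between predictions and those targets. The function names, the discount `gamma`, and the choice of a mean squared error are assumptions made for illustration; the source does not specify them, and the network that would produce the predictions is omitted.

```python
def value_targets(num_states, gamma=0.99):
    """Return one target per state in a demonstration.

    The last state is treated as the goal (target 1.0); every earlier
    state gets gamma raised to the number of steps remaining.
    """
    return [gamma ** (num_states - 1 - i) for i in range(num_states)]


def squared_error_loss(predictions, targets):
    """Mean squared error between predicted values and targets.

    Assumption: the source only says a loss is minimized; the squared
    error is used here purely as an example.
    """
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# A four-state demonstration with a discount of 0.5.
targets = value_targets(4, gamma=0.5)
print(targets)  # [0.125, 0.25, 0.5, 1.0]
```

Note that no actions or environment rewards appear anywhere in the computation: the targets depend only on each state's position in the trajectory.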
4.2 Step 2: Learning action-values from values
Given the learned values, we aim to use RL to learn action-values and a corresponding policy. We introduce two approaches to this problem: 1) using the values to replace bootstrapping in Q-learning and 2) using the values as a potential-based shaping reward.
4.2.1 Replacing bootstrapping in Q-learning
The typical loss update for Q-learning can be defined as:
$$\mathcal{L}(\phi) = \left\lVert r + \gamma \max_{a'} Q_{\phi'}(s', a') - Q_\phi(s, a) \right\rVert \qquad (4)$$
where $Q_{\phi'}$ is a target network. The problem with this approach is that it requires making estimates based on a moving target. We aim to remove this bootstrapped step by replacing the target network with our estimate of the value function.
The Bellman equation states that the maximal action-value is equivalent to the value of a state under the optimal policy (Sutton & Barto, 1998):
$$V^*(s) = \max_a Q^*(s, a)$$
Given this definition, we can replace the max operator from equation 4 with the learned value function $V_\theta$, and modify the target accordingly:
$$y = r + \gamma V_\theta(s')$$
Because we assume a sparse reward obtained only at the goal, and because we do not compute action-values at terminal states, $r$ can be replaced with the surrogate reward of $0$, and so the target becomes:
$$y = \gamma V_\theta(s')$$
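The difference between the two targets can be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: the function names and the squared-error form are hypothetical, the surrogate reward at non-terminal states is taken to be 0 as assumed earlier in the paper, and the value of the next state is passed in as a plain number in place of a trained network.

```python
def bootstrapped_target(reward, next_q_values, gamma=0.99):
    """Standard Q-learning target: depends on the current Q estimates."""
    return reward + gamma * max(next_q_values)


def pvo_value_target(next_state_value, gamma=0.99):
    """Target with the max over Q replaced by a learned state value.

    Assumption: the surrogate reward at non-terminal states is 0, so the
    target reduces to the discounted value of the next state.
    """
    return gamma * next_state_value


def q_loss(q_value, target):
    """Squared error between a Q estimate and its target (assumed form)."""
    return (q_value - target) ** 2


# The PVO-style target does not move as the Q estimates change.
print(pvo_value_target(0.5, gamma=0.5))                 # 0.25
print(bootstrapped_target(0.0, [0.25, 0.5], gamma=0.5))  # 0.25
```

Because the second target is computed from a fixed value function, it stays constant while the action-values are being trained, which is the property the text is after.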
4.2.2 Potential-based shaping reward
If the value function is incorrect for some states, using it as a replacement for bootstrapping might be too strong a signal. That is because the formulation aims to directly maximize the value function, and so the agent may get stuck in locally sub-optimal areas if the learned value function is not truly optimal.
As such, we also introduce using the learned values as the potential in a potential-based shaping reward (Ng et al., 1999):
$$F(s, s') = \gamma V_\theta(s') - V_\theta(s)$$
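The shaping term of Ng et al. (1999) is the discounted potential of the next state minus the potential of the current state. The sketch below uses the learned value as that potential. The function names are hypothetical, and adding the term to an environment reward is an assumption taken from the general shaping formulation, not something the source spells out.

```python
def shaping_term(value_s, value_next, gamma=0.99):
    """Potential-based shaping: gamma * Phi(s') - Phi(s), with Phi = learned value."""
    return gamma * value_next - value_s


def shaped_reward(env_reward, value_s, value_next, gamma=0.99):
    """Environment reward plus the shaping term (assumed combination)."""
    return env_reward + shaping_term(value_s, value_next, gamma)


# Moving toward the goal (value 0.25 -> 0.5) versus away from it (0.5 -> 0.25).
print(shaping_term(0.25, 0.5, gamma=0.5))  # 0.0
print(shaping_term(0.5, 0.25, gamma=0.5))  # -0.375
```

In this toy case the values are exactly the discount raised to the steps remaining, so a step along the demonstrated path receives a shaping term of zero while a step backwards is penalized.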
Our experiments aim to demonstrate that PVO can learn values from observation only and that these values can be used to train reinforcement learning agents. We evaluate the agent within unseen environments and aim to determine if PVO learns a general value function that can infer values outside of the training environments.
In this section, we discuss the environments used for evaluation. We were interested in goal-directed tasks that consisted of a desired target state. We were additionally interested in demonstrating generalization and thus also evaluated within procedurally generated environments.
5.0.2 Maze environment
The maze environment, shown in figure 2, consists of procedurally generated mazes. The agent can take the actions up, down, left, and right. The game ends when the agent (blue) reaches some target goal (green). We used search to obtain demonstrations in this environment. The demonstration set only consisted of mazes of sizes 4x4 to 20x20. We aim to determine if PVO can learn values in unseen mazes of size 25x25. We obtained 1000 episodes of demonstrations for a simple empty maze and a more complicated one where the agent must navigate around obstacles to reach the goal.
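The source does not say which search procedure produced the demonstrations. As one possible illustration, a breadth-first search over a grid returns a shortest state sequence from start to goal, which is the kind of action-free trajectory PVO consumes. The grid encoding (`#` for walls) and the function name are assumptions made for this sketch and are not taken from the mazelab package.

```python
from collections import deque


def bfs_demonstration(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    grid is a list of equal-length strings; '#' marks a wall (assumed encoding).
    """
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        row, col = cell
        for d_row, d_col in moves:
            nxt = (row + d_row, col + d_col)
            in_bounds = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if in_bounds and grid[nxt[0]][nxt[1]] != "#" and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable


maze = ["..#.",
        ".##.",
        "....",
        "#..."]
path = bfs_demonstration(maze, (0, 0), (0, 3))
print(len(path))  # 8
```

Only the sequence of visited cells is kept; the moves that produced it are discarded, matching the observation-only setting.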
5.0.3 Liquid pouring dataset
The liquid pouring dataset has been used to train robots to learn to pour from videos of humans (Sermanet et al., 2016). We use pouring demonstrations to train values and aim to determine if PVO can infer values in an unseen video.
5.0.4 Something something dataset
The something something dataset (Goyal et al., 2017) consists of videos of humans doing something to something, for example, pouring something into something, plugging something into something, etc. We use videos of people picking up something from a surface to determine if PVO can infer values in an unseen video.
5.0.5 CoinRun environment
The CoinRun environment (Cobbe et al., 2018) consists of procedurally generated platform environments. The background, player, enemies, platforms, obstacles, and goal locations are all randomly instantiated. The agent can take the actions left, right, jump, down, jump-left, jump-right, and do-nothing. The game ends when the agent reaches a single coin in the game. We trained PPO (Schulman et al., 2017) for 2.5 million steps to obtain 1000 episodes of expert demonstrations. We evaluate on unseen easy levels.
In this section, we discuss the results of using PVO to learn values and to train RL agents. Our experiments in the maze environment aim to demonstrate that PVO can learn meaningful values in unseen environments. Figure 2 shows a heatmap of the values learned using this approach. Not only is PVO capable of detecting where the goal is, it can also infer the values of states around the goal.
We also demonstrate value learning in the liquid pouring task, as shown in figure 3. PVO has clearly learned a meaningful value function for this task, even though it was only trained with 10 demonstrations. The initial image is an empty glass without any pouring and the value is clearly low. As the glass becomes more full, the values increase.
Finally, we show value learning for the “picking up something” task in the something something dataset, as shown in figure 4. PVO has again learned meaningful values that increase as the task becomes completed.
Our experiments in the CoinRun environment aim to demonstrate that PVO can be used for training reinforcement learning agents in unseen environments. The results are shown in figure 5. We call the PVO method that replaces bootstrapping with the learned values PVO value and the method that uses the values as a shaping reward PVO shaping. Both methods learn significantly faster than standard RL. We have thus demonstrated that PVO can be used for imitation and can generalize to unseen environments after receiving observation data only. Additionally, PVO can be used to replace bootstrapping for RL, but the shaping reward was also powerful. One reason for this may be that using the value function to replace bootstrapping essentially initializes the Q-values, which has been shown to be equivalent to potential-based reward shaping (Wiewiora, 2003).
In this paper, we have demonstrated that PVO is able to learn values for difficult tasks, and that it can be used to train reinforcement learning agents. We have shown that this approach can generalize to unseen configurations. Finally, we have demonstrated that PVO can significantly speed up reinforcement learning within sparse reward settings.
- Aytar et al. (2018) Aytar, Y., Pfaff, T., Budden, D., Paine, T. L., Wang, Z., and de Freitas, N. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018.
- Cobbe et al. (2018) Cobbe, K., Klimov, O., Hesse, C., Kim, T., and Schulman, J. Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341, 2018.
- Edwards et al. (2016) Edwards, A., Isbell, C., and Takanishi, A. Perceptual reward functions. Deep Reinforcement Learning: Frontiers and Challenges, IJCAI Workshop, 2016.
- Edwards & Isbell Jr (2017) Edwards, A. D. and Isbell Jr, C. L. Cross-domain perceptual reward functions. RLDM 2017, 2017.
- Edwards et al. (2018) Edwards, A. D., Sahni, H., Schroecker, Y., and Isbell, C. L. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.
- Goyal et al. (2017) Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 1, pp. 3, 2017.
- Liu et al. (2017) Liu, Y., Gupta, A., Abbeel, P., and Levine, S. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374, 2017.
- Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
- Pathak et al. (2018) Pathak, D., Mahmoudieh, P., Luo, G., Agrawal, P., Chen, D., Shentu, Y., Shelhamer, E., Malik, J., Efros, A. A., and Darrell, T. Zero-shot visual imitation. arXiv preprint arXiv:1804.08606, 2018.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sermanet et al. (2016) Sermanet, P., Xu, K., and Levine, S. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.
- Sermanet et al. (2017) Sermanet, P., Lynch, C., Hsu, J., and Levine, S. Time-contrastive networks: Self-supervised learning from multi-view observation. arXiv preprint arXiv:1704.06888, 2017.
- Singh et al. (2019) Singh, A., Yang, L., Hartikainen, K., Finn, C., and Levine, S. End-to-end robotic reinforcement learning without reward engineering. Robotics: Science and Systems, 2019.
- Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction, volume 1. Cambridge Univ Press, 1998.
- Torabi et al. (2018a) Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954, 2018a.
- Torabi et al. (2018b) Torabi, F., Warnell, G., and Stone, P. Generative adversarial imitation from observation. arXiv preprint arXiv:1807.06158, 2018b.
- Wiewiora (2003) Wiewiora, E. Potential-based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research, 19:205–208, 2003.
- Xie et al. (2018) Xie, A., Singh, A., Levine, S., and Finn, C. Few-shot goal inference for visuomotor learning and planning. arXiv preprint arXiv:1810.00482, 2018.
- Yu et al. (2019) Yu, T., Shevchuk, G., Sadigh, D., and Finn, C. Unsupervised visuomotor control through distributional planning networks. arXiv preprint arXiv:1902.05542, 2019.
- Zuo (2018) Zuo, X. mazelab: A customizable framework to create maze and gridworld environments. https://github.com/zuoxingdong/mazelab, 2018.