It is important that robots and other autonomous agents can safely learn from and adapt to a variety of human preferences and goals. One common way to learn preferences and goals is via imitation learning, in which an autonomous agent learns how to perform a task by observing demonstrations of the task (Argall et al., 2009). When learning from demonstrations, it is important for an agent to be able to provide high-confidence bounds on its performance with respect to the demonstrator; however, while there exists much work on high-confidence off-policy evaluation in the reinforcement learning (RL) setting, there has been much less work on high-confidence policy evaluation in the imitation learning setting, where the reward samples are unavailable.
Prior work on high-confidence policy evaluation for imitation learning has used Bayesian inverse reinforcement learning (IRL) (Ramachandran & Amir, 2007) to allow an agent to reason about reward uncertainty and policy generalization error (Brown et al., 2018). However, Bayesian IRL is typically intractable for complex problems due to the need to repeatedly solve an MDP in the inner loop, resulting in high computational cost as well as high sample cost if a model is not available. This precludes robust safety and uncertainty analysis for imitation learning in high-dimensional problems or in problems in which a useful model of the MDP is unavailable. We seek to remedy this problem by proposing and evaluating a method for safe and efficient Bayesian reward learning via preferences over demonstrations. Preferences over demonstrations are intuitive for humans to provide (Sadigh et al., 2017; Christiano et al., 2017; Palan et al., 2019) and allow for better-than-demonstrator performance (Brown et al., 2019a). To the best of our knowledge, we are the first to show that preferences over demonstrations enable fast Bayesian reward learning in high-dimensional control tasks and also enable efficient high-confidence performance bounds for imitation learning.
We first formalize the problem of high-confidence policy evaluation (Thomas et al., 2015) for imitation learning. We next propose a novel algorithm, Bayesian Reward Extrapolation (Bayesian REX), that uses a pairwise ranking likelihood to significantly reduce the computational complexity of generating samples from the posterior distribution over reward functions. We demonstrate that Bayesian REX can leverage neural network function approximation to learn useful reward features via self-supervised learning in order to efficiently perform deep Bayesian reward inference from visual demonstrations. Finally, we demonstrate that samples obtained from Bayesian REX can be used to solve the high-confidence policy evaluation problem for imitation learning. We evaluate our method on imitation learning for Atari games and demonstrate that we can efficiently compute high-confidence bounds on policy performance, without requiring samples of the reward function. Furthermore, we demonstrate that these high-confidence bounds can be used to accurately rank different evaluation policies according to their risk and performance under the distribution over the unknown ground-truth reward function. Finally, we provide evidence that bounds on uncertainty and risk may provide a useful tool for detecting reward hacking/gaming (Amodei et al., 2016), a common problem in reward inference from demonstrations (Ibarz et al., 2018) as well as reinforcement learning (Ng et al., 1999; Leike et al., 2017).
2 Related work
2.1 Imitation Learning
Imitation learning is the problem of learning a policy from demonstrations and can roughly be divided into techniques that use behavioral cloning and techniques that use inverse reinforcement learning. Behavioral cloning methods (Pomerleau, 1991; Torabi et al., 2018) seek to solve the imitation learning problem via supervised learning, in which the goal is to learn a mapping from states to actions that mimics the demonstrator. While computationally efficient, these methods suffer from compounding errors (Ross et al., 2011). Methods such as DAgger (Ross et al., 2011) and DART (Laskey et al., 2017) avoid this problem by repeatedly collecting additional state-action pairs from an expert.
Inverse reinforcement learning (IRL) methods (Ng & Russell, 2000) seek to solve the imitation learning problem by estimating a reward function that makes the demonstrations appear near optimal. Classical approaches repeatedly alternate between a reward estimation step and a full policy optimization step (Abbeel & Ng, 2004; Ziebart et al., 2008; Ramachandran & Amir, 2007). Bayesian IRL (Ramachandran & Amir, 2007) samples from the posterior distribution over reward functions, whereas other methods seek a single reward function that induces the demonstrator’s feature expectations (Abbeel & Ng, 2004), often while also seeking to maximize the entropy of the resulting policy (Ziebart et al., 2008).
Deep learning approaches to imitation learning are typically based on maximum entropy policy optimization and divergence minimization between marginal state-action distributions (Ho & Ermon, 2016; Fu et al., 2017; Ghasemipour et al., 2019) and are related to Generative Adversarial Networks (Finn et al., 2016a). These methods scale to complex control problems by iterating between reward learning and policy learning steps. Alternatively, Brown et al. (2019b) use ranked demonstrations to learn a reward function via supervised learning without requiring an MDP solver or any inference time data collection. The learned reward function can then be used to optimize a potentially better-than-demonstrator policy (Brown et al., 2019b). Subsequent research showed that preferences over demonstrations can be automatically generated via noise injection, allowing better-than-demonstrator performance even in the absence of explicit preference labels (Brown et al., 2019a). Despite the success of deep imitation learning methods, existing methods typically return a point estimate of the reward function, precluding uncertainty and robustness analysis.
2.2 Safe Imitation Learning
While there has been much recent interest in imitation learning, less attention has been given to problems related to safe imitation learning. Zhang & Cho (2017) propose SafeDAgger, a supervised learning approach to imitation learning that predicts in which states the imitation learning policy will have a large action difference from the demonstrator. Control is given to the demonstrator only if the predicted action difference of the novice is above some hand-tuned threshold. Lacotte et al. (2019) propose an imitation learning algorithm that seeks to match the tail risk of the expert as well as find a policy that is indistinguishable from the demonstrations. Brown & Niekum (2018) propose a Bayesian sampling approach to provide explicit high-confidence safety bounds on value-at-risk in the imitation learning setting. Their method uses samples from the posterior distribution to compute sample-efficient probabilistic upper bounds on the policy loss of any evaluation policy. Other work considers robust policy optimization over a distribution of reward functions conditioned on demonstrations or a partially specified reward function, but these methods require an MDP solver in the inner loop, limiting their scalability (Hadfield-Menell et al., 2017; Brown et al., 2018; Huang et al., 2018). We extend and generalize the work of Brown & Niekum (2018) by demonstrating, for the first time, that high-confidence performance bounds can be efficiently obtained when performing imitation learning from high-dimensional visual demonstrations without requiring access to a model of the MDP for reward inference.
2.3 Value Alignment and Active Preference Learning
Safe imitation learning is closely related to the problem of value alignment, which seeks to design methods that prevent AI systems from acting in ways that violate human values (Hadfield-Menell et al., 2016; Fisac et al., 2020). Research has shown that difficulties arise when an agent seeks to align its values with those of a human who is not perfectly rational (Milli et al., 2017) and that there are fundamental impossibility results regarding value alignment (Eckersley, 2018); however, Eckersley (2018) demonstrates that these impossibility results do not hold if the objective is represented as a set of partially ordered preferences.
Prior work has used active queries to perform Bayesian reward inference on low-dimensional, hand-crafted reward features (Sadigh et al., 2017; Brown et al., 2018; Bıyık et al., 2019). Christiano et al. (2017) and Ibarz et al. (2018) use deep networks to scale active preference learning to high-dimensional tasks, but require large numbers of active queries during policy optimization and do not perform Bayesian reward inference. Our work complements and extends prior work by: (1) removing the requirement for active queries during reward inference or policy optimization, (2) showing that preferences over demonstrations enable efficient Bayesian reward inference in high-dimensional visual control tasks, and (3) providing an efficient method for computing high-confidence bounds on the performance of any evaluation policy in the imitation learning setting.
2.4 Safe Reinforcement Learning
Research on safe reinforcement learning (RL) usually focuses on safe exploration strategies or optimization objectives other than expected return (García & Fernández, 2015). Recently, objectives based on measures of risk such as Value at Risk (VaR) and Conditional VaR have been shown to provide tractable and useful risk-sensitive measures of performance for MDPs (Tamar et al., 2015; Chow et al., 2015). Other work focuses on finding robust solutions to MDPs (Ghavamzadeh et al., 2016; Petrik & Russell, 2019), using model-based RL to safely improve upon suboptimal demonstrations (Thananjeyan et al., 2019), and obtaining high-confidence off-policy bounds on the performance of an evaluation policy (Thomas et al., 2015; Hanna et al., 2019). Our work provides an efficient solution to the problem of high-confidence policy evaluation in the imitation learning setting, in which samples of rewards are not observed and the demonstrator’s policy is unknown.
2.5 Bayesian Neural Networks
Bayesian neural networks typically either perform Markov Chain Monte Carlo (MCMC) sampling (MacKay, 1992), variational inference (Sun et al., 2019; Khan et al., 2018), or use hybrid methods such as particle-based inference (Liu & Wang, 2016) to approximate the posterior distribution over neural network weights. Alternative approaches such as ensembles (Lakshminarayanan et al., 2017) or approximations such as Bayesian dropout (Gal & Ghahramani, 2016; Kendall & Gal, 2017) have also been used to obtain a distribution on the outputs of a neural network in order to provide uncertainty quantification (Maddox et al., 2019). In this work, we are not only interested in the uncertainty of the output of the reward function network, but also in the uncertainty over the performance of a policy when evaluated in an MDP with an unknown reward function. Thus, we face the doubly hard problem of needing to measure the uncertainty in the evaluation of a policy, which depends on both the stochasticity of the policy and the uncertainty over the rewards that the policy will obtain.
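We model the environment as a Markov Decision Process (MDP) consisting of states $\mathcal{S}$, actions $\mathcal{A}$, transition dynamics $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, reward function $R: \mathcal{S} \to \mathbb{R}$, initial state distribution $S_0$, and discount factor $\gamma$. A policy $\pi$ is a mapping from states to a probability distribution over actions. We denote the value of a policy $\pi$ under reward function $R$ as $V^\pi_R = \mathbb{E}_\pi[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 \sim S_0]$ and denote the value of executing policy $\pi$ starting at state $s \in \mathcal{S}$ as $V^\pi_R(s)$. Given a reward function $R$, the Q-value of a state-action pair $(s,a)$ is $Q^\pi_R(s,a) = R(s) + \gamma \mathbb{E}_{s' \sim T(\cdot \mid s,a)}[V^\pi_R(s')]$. We also denote $V^*_R = \max_\pi V^\pi_R$ and $Q^*_R(s,a) = \max_\pi Q^\pi_R(s,a)$.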
3.2 Bayesian Inverse Reinforcement Learning
Bayesian inverse reinforcement learning (IRL) (Ramachandran & Amir, 2007) models the environment as an MDP\R in which the reward function is unavailable. Bayesian IRL seeks to infer the latent reward function of a Boltzmann-rational demonstrator that executes the following policy
$$\pi_\beta(a \mid s) = \frac{e^{\beta Q^*_{R^*}(s,a)}}{\sum_{b \in \mathcal{A}} e^{\beta Q^*_{R^*}(s,b)}} \tag{1}$$
in which $R^*$ is the true reward function of the demonstrator, and $\beta \in [0, \infty)$ represents the confidence that the demonstrator is acting optimally. Under the assumption of Boltzmann rationality, the likelihood of a set of demonstrated state-action pairs, $D$, given a specific reward function hypothesis $R$, can be written as
$$P(D \mid R) = \prod_{(s,a) \in D} \frac{e^{\beta Q^*_R(s,a)}}{\sum_{b \in \mathcal{A}} e^{\beta Q^*_R(s,b)}} \tag{2}$$
Bayesian IRL generates samples from the posterior distribution $P(R \mid D) \propto P(D \mid R) P(R)$ via Markov Chain Monte Carlo (MCMC) sampling, but this requires solving for $Q^*_{R'}$ to compute the likelihood of each new proposal $R'$. Thus, Bayesian IRL methods are only used for low-dimensional problems with reward functions that are often linear combinations of a small number of hand-crafted features (Bobu et al., 2018; Bıyık et al., 2019). One of our contributions is an efficient Bayesian reward inference algorithm that leverages preferences over demonstrations in order to significantly improve the efficiency of Bayesian reward inference.
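To make the cost structure of the likelihood in Equation (2) concrete, the following minimal sketch computes the Boltzmann-rational log-likelihood of demonstrated state-action pairs, assuming that optimal Q-values for the hypothesized reward are already available as a lookup table. The function name, the dictionary layout, and the toy values are illustrative assumptions rather than part of the original method. In practice, producing `q_values` for every new proposal is the expensive MDP solve that limits standard Bayesian IRL.

```python
import math

def boltzmann_log_likelihood(demos, q_values, beta=1.0):
    """Log-likelihood of (state, action) pairs under a Boltzmann-rational demonstrator.

    demos:    list of (state, action) pairs
    q_values: dict mapping state -> dict mapping action -> optimal Q-value
              under the hypothesized reward (assumed to be precomputed)
    beta:     confidence that the demonstrator acts optimally
    """
    total = 0.0
    for state, action in demos:
        scaled = [beta * q for q in q_values[state].values()]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
        total += beta * q_values[state][action] - log_z
    return total

# Toy example with hypothetical Q-values: two states, two actions each.
q = {"s0": {"left": 1.0, "right": 0.0}, "s1": {"left": 0.0, "right": 2.0}}
good = boltzmann_log_likelihood([("s0", "left"), ("s1", "right")], q, beta=2.0)
bad = boltzmann_log_likelihood([("s0", "right"), ("s1", "left")], q, beta=2.0)
```

Demonstrations that follow the higher-valued actions (`good`) receive a higher log-likelihood than those that do not (`bad`), and with `beta=0.0` every action becomes equally likely.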
4 High Confidence Evaluation for Imitation Learning
Before detailing our approach, we first formalize the problem of high-confidence policy evaluation for imitation learning. We assume access to an MDP\R, an evaluation policy $\pi_{\rm eval}$, a set of demonstrations, $D = \{\tau_1, \ldots, \tau_m\}$, in which $\tau_i$ is either a complete or partial trajectory comprised of states or state-action pairs, a confidence level $\delta$, and performance statistic $g: \Pi \times \mathcal{R} \to \mathbb{R}$, in which $\mathcal{R}$ denotes the space of reward functions and $\Pi$ is the space of all policies.
The High-Confidence Policy Evaluation problem for Imitation Learning (HCPE-IL) is to find a high-confidence lower bound $\hat{g}: \Pi \times \mathcal{D} \to \mathbb{R}$ such that
$$P\big(g(\pi_{\rm eval}, R^*) \geq \hat{g}(\pi_{\rm eval}, D)\big) \geq 1 - \delta \tag{3}$$
in which $R^*$ denotes the demonstrator’s true reward function and $\mathcal{D}$ denotes the space of all possible demonstration sets. HCPE-IL takes as input an evaluation policy $\pi_{\rm eval}$, a set of demonstrations $D$, and a performance statistic, $g$, which evaluates a policy under a reward function. The goal of HCPE-IL is to return a high-confidence lower bound $\hat{g}$ on the performance statistic $g$.
5 Deep Bayesian Reward Extrapolation
In this and the following sections we describe our main contribution: a method for scaling Bayesian IRL to high-dimensional visual control tasks as a way to efficiently solve the HCPE-IL problem for complex imitation learning tasks. Our first insight is that the main bottleneck for standard Bayesian IRL (Ramachandran & Amir, 2007) is computing the likelihood function in Equation (2), which requires optimal Q-values. Thus, to make Bayesian IRL scale to high-dimensional visual domains, it is necessary to either efficiently approximate optimal Q-values or to formulate a new likelihood. Value-based reinforcement learning focuses on efficiently learning optimal Q-values; however, for visual control tasks such as Atari, RL algorithms can take several hours or even days to train (Mnih et al., 2015; Hessel et al., 2018). This makes MCMC, which requires evaluating large numbers of likelihood ratios, infeasible given the current state-of-the-art in value-based RL. Methods such as transfer learning have great potential to reduce the time needed to calculate $Q^*_{R'}$ for a new proposed reward function $R'$; however, transfer learning is not guaranteed to speed up reinforcement learning (Taylor & Stone, 2009). Thus, we choose to focus on reformulating the likelihood function as a way to speed up Bayesian reward inference.
An ideal likelihood function requires little computation and minimal interaction with the environment. To accomplish this, we leverage recent work on learning control policies from preferences (Christiano et al., 2017; Palan et al., 2019; Bıyık et al., 2019). Given ranked demonstrations, Brown et al. (2019b) propose Trajectory-ranked Reward Extrapolation (T-REX): an efficient reward inference algorithm that transforms reward function learning into a classification problem via a pairwise ranking loss. T-REX removes the need to repeatedly sample from or partially solve an MDP in the inner loop, allowing IRL to scale to visual imitation learning domains such as Atari and to extrapolate beyond the performance of the best demonstration. However, T-REX only solves for a point estimate of the reward function. We now discuss how a similar approach based on a pairwise preference likelihood allows for efficient sampling from the posterior distribution over reward functions.
We assume access to a sequence of $m$ trajectories, $D = \{\tau_1, \ldots, \tau_m\}$, along with a set of pairwise preferences over trajectories $\mathcal{P} = \{(i,j) : \tau_i \prec \tau_j\}$. Note that we do not require a total ordering over trajectories. These preferences may come from a human demonstrator or could be automatically generated by watching a learner improve at a task (Jacq et al., 2019) or via noise injection (Brown et al., 2019a). The benefit of pairwise preferences over trajectories is that we can now leverage a pairwise ranking loss to compute the likelihood of a set of preferences over demonstrations $\mathcal{P}$, given a parameterized reward function hypothesis $R_\theta$. We use the standard Bradley-Terry model (Bradley & Terry, 1952) to obtain the following pairwise ranking likelihood function, commonly used in learning-to-rank applications such as collaborative filtering (Volkovs & Zemel, 2014):
$$P(D, \mathcal{P} \mid R_\theta) = \prod_{(i,j) \in \mathcal{P}} \frac{e^{\beta R_\theta(\tau_j)}}{e^{\beta R_\theta(\tau_i)} + e^{\beta R_\theta(\tau_j)}} \tag{4}$$
in which $R_\theta(\tau) = \sum_{s \in \tau} R_\theta(s)$ is the predicted return of trajectory $\tau$ under the reward function $R_\theta$, and $\beta$ is the inverse temperature parameter that models the confidence in the preference labels. We can then perform Bayesian inference via MCMC to obtain samples from $P(R_\theta \mid D, \mathcal{P}) \propto P(D, \mathcal{P} \mid R_\theta) P(R_\theta)$. We call this approach Bayesian Reward Extrapolation or Bayesian REX.
Note that using the likelihood function defined in Equation (4) does not require solving an MDP. In fact, it does not require any rollouts or access to the MDP. All that is required is that we first calculate the return of each trajectory under $R_\theta$ and compare the relative predicted returns to the preference labels to determine the likelihood of the demonstrations under the reward hypothesis $R_\theta$. Thus, given preferences over demonstrations, Bayesian REX is significantly more efficient than standard Bayesian IRL. In the following section, we discuss further optimizations that improve the efficiency of Bayesian REX and make it more amenable to our end goal of high-confidence policy evaluation bounds.
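The pairwise ranking likelihood of Equation (4) can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: trajectories are plain lists of states, `reward_fn` is any callable that maps a state to a scalar, and a preference pair `(i, j)` means that trajectory `j` is preferred over trajectory `i`. All names and the toy data are hypothetical. Note that no MDP solver, rollout, or environment access appears anywhere in the computation.

```python
import math

def trajectory_return(reward_fn, trajectory):
    """Predicted return: sum of per-state rewards along the trajectory."""
    return sum(reward_fn(s) for s in trajectory)

def preference_log_likelihood(reward_fn, trajectories, preferences, beta=1.0):
    """Bradley-Terry log-likelihood of pairwise preferences over trajectories.

    preferences: list of (i, j) index pairs, meaning trajectory j is
                 preferred over trajectory i.
    beta:        inverse temperature (confidence in the preference labels).
    """
    returns = [trajectory_return(reward_fn, t) for t in trajectories]
    total = 0.0
    for i, j in preferences:
        ri, rj = beta * returns[i], beta * returns[j]
        m = max(ri, rj)
        # log( exp(rj) / (exp(ri) + exp(rj)) ), computed stably.
        total += rj - (m + math.log(math.exp(ri - m) + math.exp(rj - m)))
    return total

# Toy data: states are numbers, later trajectories are preferred.
trajectories = [[0, 0, 1], [1, 1, 1], [2, 2, 2]]
preferences = [(0, 1), (1, 2), (0, 2)]
agree = preference_log_likelihood(lambda s: s, trajectories, preferences)
disagree = preference_log_likelihood(lambda s: -s, trajectories, preferences)
```

A reward hypothesis whose predicted returns agree with the preference labels (`agree`) scores higher than one that reverses them (`disagree`).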
In order to learn rich, complex reward functions, it is desirable to use a deep network to represent the reward function $R_\theta$. While MCMC remains the gold standard for Bayesian neural networks, it is often challenging to scale to deep networks. To make Bayesian REX more efficient and practical, we propose to limit the proposal to only change the last layer of weights in $R_\theta$ when generating MCMC proposals (we discuss pre-training the bottom layers of $R_\theta$ in the next section). After pre-training, we freeze all but the last layer of weights and use the activations of the penultimate layer as the latent reward features $\phi(s) \in \mathbb{R}^k$. This allows the reward at a state to be represented as a linear combination of $k$ features: $R_\theta(s) = w^T \phi(s)$. Similar to work by Pradier et al. (2018), operating in the lower-dimensional latent space makes full Bayesian inference tractable.
A second advantage of using a learned linear reward function is that it allows us to efficiently compute likelihood ratios when performing MCMC. Consider the likelihood function in Equation (4). If we do not represent $R_\theta$ as a linear combination of pretrained features, and instead let any parameter in $R_\theta$ change during each proposal, then for $m$ demonstrations of length $T$, computing $P(D, \mathcal{P} \mid R_\theta)$ for a new proposal $R_\theta$ requires $O(mT)$ forward passes through the entire network to compute $R_\theta(\tau_i)$. Thus, the complexity of generating $N$ samples from the posterior is $O(mTN|R_\theta|)$, where $|R_\theta|$ is the number of computations required for a full forward pass through the entire network $R_\theta$. Given that we would like to use a deep network to parameterize $R_\theta$ and generate thousands of samples from the posterior distribution over $R_\theta$, this many computations will significantly slow down MCMC proposal evaluation.
If we represent $R_\theta$ as a linear combination of pre-trained features, we can reduce this computational cost because
$$R_\theta(\tau) = \sum_{s \in \tau} w^T \phi(s) = w^T \sum_{s \in \tau} \phi(s) = w^T \Phi_\tau \tag{5}$$
Thus, we can precompute and cache $\Phi_{\tau_i} = \sum_{s \in \tau_i} \phi(s)$ for $i = 1, \ldots, m$ and the likelihood becomes
$$P(D, \mathcal{P} \mid R_\theta) = \prod_{(i,j) \in \mathcal{P}} \frac{e^{\beta w^T \Phi_{\tau_j}}}{e^{\beta w^T \Phi_{\tau_i}} + e^{\beta w^T \Phi_{\tau_j}}} \tag{6}$$
Note that demonstrations only need to be passed through the reward network once to compute $\Phi_{\tau_i}$ since the pre-trained embedding remains constant during MCMC proposal generation. This results in an initial $O(mT)$ passes through all but the last layer of $R_\theta$ to obtain $\Phi_{\tau_i}$, $i = 1, \ldots, m$, and then only $O(mk)$ multiplications per proposal evaluation thereafter, since each proposal requires that we compute $w^T \Phi_{\tau_i}$ for $i = 1, \ldots, m$ and $\Phi_{\tau_i} \in \mathbb{R}^k$. Thus, when using feature pre-training, the total complexity is only $O(mT|R_\theta| + mkN)$ to generate $N$ samples via MCMC. This reduction in the complexity of MCMC from $O(mTN|R_\theta|)$ to $O(mT|R_\theta| + mkN)$ results in significant and practical computational savings because (1) we want to make $N$ and $|R_\theta|$ large and (2) the number of demonstrations, $m$, and the size of the latent embedding, $k$, are typically several orders of magnitude smaller than $N$ and $|R_\theta|$.
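The effect of caching the per-trajectory feature sums can be illustrated with a small random-walk Metropolis sampler over the last-layer weights. This is a minimal sketch, not the authors' implementation: the standard normal prior, the Gaussian proposal, the step size, and the toy feature sums are all assumptions made for illustration, and a preference pair `(i, j)` is taken to mean that trajectory `j` is preferred over trajectory `i`. Each proposal evaluation touches only the cached `feature_sums` (m vectors of size k), never the underlying network or the raw trajectories.

```python
import math
import random

def log_likelihood(w, feature_sums, preferences, beta=1.0):
    """Pairwise ranking log-likelihood using cached per-trajectory feature sums."""
    # O(m k): one dot product per trajectory.
    returns = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in feature_sums]
    total = 0.0
    for i, j in preferences:
        ri, rj = beta * returns[i], beta * returns[j]
        m = max(ri, rj)
        total += rj - (m + math.log(math.exp(ri - m) + math.exp(rj - m)))
    return total

def log_prior(w):
    """Assumed standard normal prior over weights (illustrative choice)."""
    return -0.5 * sum(x * x for x in w)

def sample_posterior(feature_sums, preferences, num_samples,
                     step=0.5, beta=1.0, seed=0):
    """Random-walk Metropolis over the linear reward weights w."""
    rng = random.Random(seed)
    w = [0.0] * len(feature_sums[0])
    log_p = log_likelihood(w, feature_sums, preferences, beta) + log_prior(w)
    samples = []
    for _ in range(num_samples):
        proposal = [x + rng.gauss(0.0, step) for x in w]
        log_p_new = (log_likelihood(proposal, feature_sums, preferences, beta)
                     + log_prior(proposal))
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            w, log_p = proposal, log_p_new
        samples.append(list(w))
    return samples

# Hypothetical cached feature sums for m = 3 trajectories with k = 2 features.
feature_sums = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]
preferences = [(0, 1), (1, 2), (0, 2)]
samples = sample_posterior(feature_sums, preferences, num_samples=5000)
mean_w0 = sum(s[0] for s in samples) / len(samples)
```

In this toy problem the first feature increases along the preference ordering, so the posterior mass on its weight shifts toward positive values. A practical sampler would also discard burn-in samples and tune the step size, which this sketch omits.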
A third, and critical, advantage of using a learned linear reward function is that it makes solving the HCPE-IL problem discussed in Section 4 tractable. Performing a single policy evaluation is a non-trivial task (Sutton et al., 2000) and even in tabular settings has complexity $O(|\mathcal{S}|^3)$ in which $|\mathcal{S}|$ is the size of the state-space (Littman et al., 1995). Because we are in an imitation learning setting, we would like to be able to efficiently evaluate any given policy across the posterior distribution over reward functions found via Bayesian REX. Given a posterior distribution over $N$ reward function hypotheses we would need to solve $N$ policy evaluations. However, note that given $R(s) = w^T \phi(s)$, the value function of a policy can be written as
$$V^\pi_R = \mathbb{E}_\pi\Big[\sum_{t=0}^{T} R(s_t)\Big] = w^T \mathbb{E}_\pi\Big[\sum_{t=0}^{T} \phi(s_t)\Big] = w^T \Phi_\pi \tag{7}$$
in which we assume a finite horizon MDP with horizon $T$ and in which $\Phi_\pi$ are the expected feature counts (Abbeel & Ng, 2004; Barreto et al., 2017) of $\pi$. Thus, given any evaluation policy $\pi_{\rm eval}$, we only need to solve one policy evaluation problem to compute $\Phi_{\pi_{\rm eval}}$. We can then compute the expected value of $\pi_{\rm eval}$ over the entire posterior distribution of reward functions via a single matrix vector multiplication $W \Phi_{\pi_{\rm eval}}$, where $W$ is an $N$-by-$k$ matrix with each row corresponding to a single reward function weight hypothesis $w$. This significantly reduces the complexity of policy evaluation over the reward function posterior distribution from $O(N|\mathcal{S}|^3)$ to $O(|\mathcal{S}|^3 + Nk)$.
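The single matrix-vector product described above can be sketched as follows. The example assumes that the expected feature counts of the evaluation policy are estimated by averaging summed features over a handful of rollouts, which is one plausible estimator and not necessarily the one used in the paper. The function names, the toy embedding, and the toy weight samples are hypothetical.

```python
def estimate_expected_features(rollouts, embed):
    """Monte Carlo estimate of expected feature counts from policy rollouts.

    rollouts: list of trajectories, each a list of states
    embed:    callable mapping a state to a list of k features
    """
    k = len(embed(rollouts[0][0]))
    totals = [0.0] * k
    for trajectory in rollouts:
        for state in trajectory:
            for d, value in enumerate(embed(state)):
                totals[d] += value
    # Average the summed features over the number of rollouts.
    return [t / len(rollouts) for t in totals]

def posterior_returns(weight_samples, expected_features):
    """Matrix-vector product: one predicted return per sampled weight vector."""
    return [sum(w_i * f_i for w_i, f_i in zip(w, expected_features))
            for w in weight_samples]

# Toy setup: scalar states, a two-dimensional embedding, three posterior samples.
embed = lambda s: [s, 1.0]
rollouts = [[1.0, 2.0], [3.0]]
phi_eval = estimate_expected_features(rollouts, embed)
W = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
returns = posterior_returns(W, phi_eval)
```

The feature counts are computed once per evaluation policy. Scoring that policy under every posterior sample then costs only N dot products of length k.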
When we refer to Bayesian REX we will refer to the optimized version described in this section (see the Appendix for full implementation details and pseudo-code). Running Bayesian REX with 144 preference labels to generate 100,000 reward hypotheses for Atari imitation learning tasks takes approximately 5 minutes on a Dell Inspiron 5577 personal laptop with an Intel i7-7700 processor without using the GPU. In comparison, using standard Bayesian IRL to generate one sample from the posterior takes 10+ hours of training for a parallelized PPO reinforcement learning agent (Dhariwal et al., 2017) on an NVIDIA TITAN V GPU.
5.2 Pre-training the Reward Function Network
The previous section presupposed access to a pretrained latent embedding function $\phi: \mathcal{S} \to \mathbb{R}^k$. We now discuss our pre-training process. Because we are interested in imitation learning problems, we need to be able to train $\phi$ from the demonstrations without access to the ground-truth reward function. One potential method is to train $R_\theta$ using the pairwise ranking likelihood function in Equation (4) and then freeze all but the last layer of weights; however, the learned embedding may overfit to the limited number of preferences over demonstrations and fail to capture features relevant to the ground-truth reward function. Thus, we supplement the pairwise ranking objective with auxiliary objectives that can be optimized in a self-supervised fashion using data from the demonstrations.
We use the following self-supervised tasks to pre-train $\phi$: (1) Learn an inverse dynamics model that uses embeddings $\phi(s_t)$ and $\phi(s_{t+1})$ to predict the corresponding action $a_t$ (Torabi et al., 2018; Hanna & Stone, 2017), (2) Learn a forward dynamics model that predicts $s_{t+1}$ from $\phi(s_t)$ and $a_t$ (Oh et al., 2015; Thananjeyan et al., 2019), (3) Learn an embedding $\phi$ that predicts the temporal distance between two randomly chosen states from the same demonstration (Aytar et al., 2018), and (4) Train a variational pixel-to-pixel autoencoder in which $\phi(s)$ is the learned latent encoding (Makhzani & Frey, 2017; Doersch, 2016). Table 5 summarizes the auxiliary tasks used to train $\phi$.
There are many possibilities for pre-training $\phi$; however, we found that each objective described above encourages the embedding to encode different features. For example, an accurate inverse dynamics model can be learned by only attending to the movement of the agent. Learning forward dynamics supplements this by requiring $\phi$ to encode information about short-term changes to the environment. Learning to predict the temporal distance between states in a trajectory forces $\phi$ to encode long-term progress. Finally, the autoencoder loss acts as a regularizer to the other losses as it seeks to embed all aspects of the state (see the Appendix for details and visualizations of the learned embedding). The full Bayesian REX pipeline for generating samples from the posterior over reward functions is summarized in Figure 1.
5.3 HCPE-IL via Bayesian REX
We now discuss how to use Bayesian REX to find an efficient solution to the high-confidence policy evaluation for imitation learning (HCPE-IL) problem (see Section 4). Given samples from the distribution $P(w \mid D, \mathcal{P})$, in which $R(s) = w^T \phi(s)$, we compute the posterior distribution over any performance statistic $g(\pi_{\rm eval}, R^*)$ as follows. For each sampled weight vector $w$ produced by Bayesian REX, we compute $g(\pi_{\rm eval}, w)$. This results in a sample from the posterior distribution $P(g(\pi_{\rm eval}, R) \mid D, \mathcal{P})$, i.e., the posterior distribution over performance statistic $g$. We then compute a $(1-\delta)$ confidence lower bound, $\hat{g}(\pi_{\rm eval}, D)$, by finding the $\delta$-quantile of $g(\pi_{\rm eval}, w)$ for $w \sim P(w \mid D, \mathcal{P})$.
While there are many potential performance statistics $g$, in this paper we focus on bounding the expected value of the evaluation policy, i.e., $g(\pi_{\rm eval}, R^*) = V^{\pi_{\rm eval}}_{R^*}$. To compute a $(1-\delta)$ confidence bound on $V^{\pi_{\rm eval}}_{R^*}$, we take full advantage of the learned linear reward representation to efficiently calculate the posterior distribution over policy returns given preferences and demonstrations. The posterior distribution over returns is calculated via a matrix vector product, $W \Phi_{\pi_{\rm eval}}$, in which each row of $W$ is a sample, $w$, from the MCMC chain and $\pi_{\rm eval}$ is the evaluation policy. We then sort the resulting vector and select the $\delta$-quantile lowest value. This results in a $(1-\delta)$ confidence lower bound on $V^{\pi_{\rm eval}}_{R^*}$ and corresponds to the $\delta$-Value at Risk (VaR) over $V^{\pi_{\rm eval}}_R \sim P(R \mid D, \mathcal{P})$ (Jorion, 1997).
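The sort-and-select step can be sketched as below. This is a minimal illustration that assumes the posterior returns have already been computed, one per MCMC sample, and that uses one common empirical quantile convention (the smallest sampled value whose empirical cumulative frequency reaches the requested level). The function name and the toy return values are hypothetical and are not taken from the paper's tables.

```python
import math

def var_lower_bound(returns, delta=0.05):
    """Empirical delta-quantile (delta-VaR) of sampled policy returns.

    Used as a (1 - delta) confidence lower bound on the policy's return
    under the reward posterior.
    """
    ordered = sorted(returns)
    # Index of the empirical delta-quantile, clamped to a valid position.
    idx = min(len(ordered) - 1, max(0, math.ceil(delta * len(ordered)) - 1))
    return ordered[idx]

def mean(values):
    return sum(values) / len(values)

# Hypothetical posterior returns for two policies (100 samples each).
policy_a = [1.0] * 100                   # modest but consistent
no_op = [-10.0] * 10 + [3.0] * 90        # higher mean, heavy lower tail

bound_a = var_lower_bound(policy_a, delta=0.05)
bound_no_op = var_lower_bound(no_op, delta=0.05)
```

In this toy example the second policy has the higher mean return but a far lower 0.05-VaR, so ranking by the high-confidence bound favors the first policy while ranking by the mean favors the second.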
6 Experimental Results
6.1 Imitation Learning via Bayesian REX
We first tested the imitation learning performance of Bayesian REX. We pre-trained a 64-dimensional latent state embedding $\phi(s)$ using the self-supervised losses shown in Table 5 along with the T-REX pairwise preference loss. We found via ablation studies that combining the T-REX loss with the self-supervised losses resulted in better performance than training only with the T-REX loss or only with the self-supervised losses (see Appendix for details). We then used Bayesian REX to generate 200,000 samples from the posterior $P(R \mid D, \mathcal{P})$. We then took the MAP and mean reward function estimates from the posterior and optimized a policy using Proximal Policy Optimization (PPO) (Schulman et al., 2017) (see Appendix for details).
|Game||Ranked Demonstrations Best||Ranked Demonstrations Avg||Bayesian REX Mean, Avg (Std)||Bayesian REX MAP, Avg (Std)||T-REX Avg||GAIL Avg|
|Beam Rider||1,332||686.0||5,504.7 (2,121.2)||5,870.3 (1,905.1)||3,335.7||355.5|
|Breakout||32||14.5||390.7 (48.8)||393.1 (63.7)||221.3||0.28|
|Enduro||84||39.8||487.7 (89.4)||135.0 (24.8)||586.8||0.28|
|Seaquest||600||373.3||734.7 (41.9)||606.0 (37.6)||747.3||0.0|
|Space Invaders||600||332.9||1,118.8 (483.1)||961.3 (392.3)||1,032.5||370.2|
To test whether Bayesian REX scales to complex imitation learning tasks we selected five Atari games from the Arcade Learning Environment (Bellemare et al., 2013). We do not give the RL agent access to the ground-truth reward signal and mask the game scores and number of lives when learning and when sampling from the posterior distribution over reward functions. Table 2 shows the imitation learning performance of Bayesian REX. We also compare against the results reported by Brown et al. (2019b) for T-REX and GAIL (Ho & Ermon, 2016), and use the same 12 suboptimal demonstrations used by Brown et al. (2019b) to train Bayesian REX (see Appendix for details).
Table 2 shows that Bayesian REX is able to utilize preferences over demonstrations to infer an accurate reward function that enables better-than-demonstrator performance. The average ground-truth return for Bayesian REX surpasses the performance of the best demonstration across all 5 games. In comparison, GAIL seeks to match the demonstrator’s state-action distributions which makes imitation learning difficult when demonstrations are suboptimal and noisy. In addition to providing uncertainty information, Bayesian REX remains competitive with T-REX (which only finds a maximum likelihood estimate of the reward function) and achieves better performance on 3 out of 5 games.
6.2 High-Confidence Policy Performance Bounds
Next, we ran an experiment to validate whether the posterior distribution generated by Bayesian REX can be used to solve the HCPE-IL problem described in Section 4. We first evaluated four different evaluation policies, A, B, C, and D, created by partially training a PPO agent on the ground-truth reward function and checkpointing the policy at various stages of learning. We ran Bayesian REX to generate 200,000 samples from $P(R \mid D, \mathcal{P})$. Because IRL is fundamentally ill-posed, we do not know the scale of the true reward $R^*$. Thus, the results from Bayesian REX are most useful when used to compare the relative performance of several different evaluation policies.
[Tables 3 and 4: table bodies not recovered. Column groups: Predicted; Ground Truth Avg. Table 4 panels: "Risk profiles given initial preferences" and "Risk profiles after rankings w.r.t. MAP and No-Op".]
The results for Beam Rider are shown in Table 3. We show results for partially trained RL policies A-D. We found that the ground-truth returns for the checkpoints were highly correlated with the mean and 0.05-VaR (5th percentile policy return) returns under the posterior. However, we also noticed that the trajectory length was highly correlated with the ground-truth reward. If the reward function learned via IRL gives a small positive reward at every timestep, then long policies that do the wrong thing may look good under the posterior. To test this hypothesis we used a No-Op policy that seeks to exploit the learned reward function by not taking any actions. This allows the agent to live until the Atari emulator times out after 99,994 steps.
Table 3 shows that while the No-Op policy has a high expected return over the chain, its 0.05-VaR shows that it has high risk under the posterior distribution, with a worst-case bound much lower than that of evaluation policy A. Our results demonstrate that reasoning about probabilistic worst-case performance may be one potential way to detect policies that exhibit so-called reward hacking (Amodei et al., 2016) or that have overfit to certain features in the demonstrations that are correlated with the intent of the demonstrations, but do not lead to desired behavior, a common problem in imitation learning (Ibarz et al., 2018; de Haan et al., 2019).
Table 4 contains policy evaluation results for the game Breakout. The top half of the table shows the mean return and 95%-confidence lower bound on the expected return under the reward function posterior for four evaluation policies as well as the MAP policy found via Bayesian IRL and a No-Op policy that never chooses to release the ball. Both the MAP and No-Op policies have high expected returns under the reward function posterior, but also have high risk (low 0.05-VaR). The MAP policy has much higher risk than the No-Op policy, despite good true performance. One likely reason is that, as shown in Table 2, the best demonstrations given to Bayesian REX only achieved a game score of 32. Thus, the MAP policy represents an out-of-distribution sample and has potentially high risk, since Bayesian REX was not trained on policies that hit any of the top layers of bricks. The ranked demonstrations do not give enough evidence to eliminate the possibility that only lower layers of bricks should be hit, but they do give strong evidence that missing the ball and quickly losing the game is bad. This results in the No-Op policy having a higher high-confidence performance bound over the posterior.
To test this hypothesis, we added two new ranked demonstrations, a single rollout from the MAP policy and a single rollout from the No-Op policy, to the original set of 12 ranked demonstrations and performed MCMC with these new preferences. As the bottom of Table 4 shows, adding two more ranked demonstrations results in a significant change in the risk profiles of the MAP and No-Op policies: the No-Op policy is now correctly predicted to have high risk and low expected returns, and the MAP policy now has a much higher 95%-confidence lower bound on performance.
6.3 Additional Experiments
We also tested Bayesian REX on a wide variety of human demonstrations and found that Bayesian REX is able to robustly rank evaluation policies relative to each other, even when some of the demonstrations are deceptive, e.g., moving and firing but not destroying enemies. We also conducted a more rigorous exploration of the risk-return trade-off for safe policy selection by varying the risk tolerance from 0 (maximally risk averse) to 1 (maximally risk tolerant) and then performing high-confidence policy selection based on the desired risk tolerance threshold. Due to space constraints, we have included these additional experimental results in the Appendix.
Bayesian reasoning is a powerful tool when dealing with uncertainty and risk; however, existing Bayesian inverse reinforcement learning algorithms require solving an MDP in the inner loop, rendering them intractable for complex problems in which solving an MDP may take several hours or even days. In this paper we propose a novel deep learning algorithm, Bayesian Reward Extrapolation (Bayesian REX), that leverages preference labels over demonstrations to make Bayesian IRL tractable for high-dimensional visual imitation learning tasks. Bayesian REX can sample tens of thousands of reward functions from the posterior in a matter of minutes using a consumer laptop. We tested our approach on five Atari imitation learning tasks and demonstrated Bayesian REX achieves state-of-the-art performance in 3 out of 5 games. Furthermore, Bayesian REX enables efficient high-confidence performance bounds for arbitrary evaluation policies. We demonstrated that these high-confidence bounds allow accurate comparison of different evaluation policies and provide a potential way to detect reward hacking and value misalignment.
We note that our proposed safety bounds are only safe with respect to the assumptions that we make: good feature pre-training, rapid MCMC mixing, and accurate preferences over demonstrations. Future work includes using exploratory trajectories for better pre-training of the latent feature embeddings, developing methods to determine when a relevant feature is missing from the learned latent space, and using high-confidence performance bounds to perform safe policy optimization in the imitation learning setting.
- Abbeel & Ng (2004) Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st international conference on Machine learning, 2004.
- Amodei et al. (2016) Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
- Argall et al. (2009) Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
- Aytar et al. (2018) Aytar, Y., Pfaff, T., Budden, D., Paine, T. L., Wang, Z., and de Freitas, N. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018.
- Barreto et al. (2017) Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.
- Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bıyık et al. (2019) Bıyık, E., Palan, M., Landolfi, N. C., Losey, D. P., and Sadigh, D. Asking easy questions: A user-friendly approach to active reward learning. In Conference on Robot Learning (CoRL), 2019.
- Bobu et al. (2018) Bobu, A., Bajcsy, A., Fisac, J. F., and Dragan, A. D. Learning under misspecified objective spaces. arXiv preprint arXiv:1810.05157, 2018.
- Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Brown & Niekum (2018) Brown, D. S. and Niekum, S. Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2018.
- Brown et al. (2018) Brown, D. S., Cui, Y., and Niekum, S. Risk-aware active inverse reinforcement learning. In Conference on Robot Learning (CoRL), 2018.
- Brown et al. (2019a) Brown, D. S., Goo, W., and Niekum, S. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning (CoRL), 2019a.
- Brown et al. (2019b) Brown, D. S., Goo, W., Prabhat, N., and Niekum, S. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 2019b.
- Chow et al. (2015) Chow, Y., Tamar, A., Mannor, S., and Pavone, M. Risk-sensitive and robust decision-making: a cvar optimization approach. In Advances in Neural Information Processing Systems, 2015.
- Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307, 2017.
- de Haan et al. (2019) de Haan, P., Jayaraman, D., and Levine, S. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pp. 11693–11704, 2019.
- Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. Openai baselines. https://github.com/openai/baselines, 2017.
- Doersch (2016) Doersch, C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
- Eckersley (2018) Eckersley, P. Impossibility and uncertainty theorems in ai value alignment (or why your agi should not have a utility function). arXiv preprint arXiv:1901.00064, 2018.
- Finn et al. (2016a) Finn, C., Christiano, P., Abbeel, P., and Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a.
- Finn et al. (2016b) Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 2016b.
- Fisac et al. (2020) Fisac, J. F., Gates, M. A., Hamrick, J. B., Liu, C., Hadfield-Menell, D., Palaniappan, M., Malik, D., Sastry, S. S., Griffiths, T. L., and Dragan, A. D. Pragmatic-pedagogic value alignment. In Robotics Research, pp. 49–57. Springer, 2020.
- Fu et al. (2017) Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
- Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
- Garcıa & Fernández (2015) Garcıa, J. and Fernández, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Ghasemipour et al. (2019) Ghasemipour, S. K. S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. arXiv preprint arXiv:1911.02256, 2019.
- Ghavamzadeh et al. (2016) Ghavamzadeh, M., Petrik, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298–2306, 2016.
- Hadfield-Menell et al. (2016) Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pp. 3909–3917, 2016.
- Hadfield-Menell et al. (2017) Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in neural information processing systems, pp. 6765–6774, 2017.
- Hanna et al. (2019) Hanna, J., Niekum, S., and Stone, P. Importance sampling policy evaluation with an estimated behavior policy. In Proceedings of the 36th International Conference on Machine Learning (ICML), June 2019.
- Hanna & Stone (2017) Hanna, J. P. and Stone, P. Grounded action transformation for robot learning in simulation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Hessel et al. (2018) Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
- Huang et al. (2018) Huang, J., Wu, F., Precup, D., and Cai, Y. Learning safe policies with expert guidance. In Advances in Neural Information Processing Systems, pp. 9105–9114, 2018.
- Ibarz et al. (2018) Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. Reward learning from human preferences and demonstrations in atari. In Advances in Neural Information Processing Systems, 2018.
- Jacq et al. (2019) Jacq, A., Geist, M., Paiva, A., and Pietquin, O. Learning from a learner. In International Conference on Machine Learning, pp. 2990–2999, 2019.
- Jorion (1997) Jorion, P. Value at risk. McGraw-Hill, New York, 1997.
- Kendall & Gal (2017) Kendall, A. and Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pp. 5574–5584, 2017.
- Khan et al. (2018) Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint arXiv:1806.04854, 2018.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Lacotte et al. (2019) Lacotte, J., Ghavamzadeh, M., Chow, Y., and Pavone, M. Risk-sensitive generative adversarial imitation learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2154–2163, 2019.
- Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413, 2017.
- Laskey et al. (2017) Laskey, M., Lee, J., Fox, R., Dragan, A., and Goldberg, K. Dart: Noise injection for robust imitation learning. Conference on Robot Learning (CoRL), 2017.
- Leike et al. (2017) Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
- Littman et al. (1995) Littman, M. L., Dean, T. L., and Kaelbling, L. P. On the complexity of solving markov decision problems. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.
- Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pp. 2378–2386, 2016.
- MacKay (1992) MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural computation, 4(3):448–472, 1992.
- Maddox et al. (2019) Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pp. 13132–13143, 2019.
- Makhzani & Frey (2017) Makhzani, A. and Frey, B. J. Pixelgan autoencoders. In Advances in Neural Information Processing Systems, pp. 1975–1985, 2017.
- Milli et al. (2017) Milli, S., Hadfield-Menell, D., Dragan, A., and Russell, S. Should robots be obedient? In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4754–4760, 2017.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Ng & Russell (2000) Ng, A. Y. and Russell, S. J. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 663–670, 2000.
- Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
- Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pp. 2863–2871, 2015.
- Palan et al. (2019) Palan, M., Landolfi, N. C., Shevchuk, G., and Sadigh, D. Learning reward functions by integrating human demonstrations and preferences. In Proceedings of Robotics: Science and Systems (RSS), June 2019.
- Petrik & Russell (2019) Petrik, M. and Russell, R. H. Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. arXiv preprint arXiv:1902.07605, 2019.
- Pomerleau (1991) Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
- Pradier et al. (2018) Pradier, M. F., Pan, W., Yao, J., Ghosh, S., and Doshi-Velez, F. Projected bnns: Avoiding weight-space pathologies by learning latent representations of neural network weights. arXiv preprint arXiv:1811.07006, 2018.
- Ramachandran & Amir (2007) Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 2586–2591, 2007.
- Ross et al. (2011) Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
- Sadigh et al. (2017) Sadigh, D., Dragan, A. D., Sastry, S. S., and Seshia, S. A. Active preference-based learning of reward functions. In Proceedings of Robotics: Science and Systems (RSS), 2017.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sun et al. (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
- Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
- Tamar et al. (2015) Tamar, A., Glassner, Y., and Mannor, S. Optimizing the cvar via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2993–2999, 2015.
- Taylor & Stone (2009) Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
- Thananjeyan et al. (2019) Thananjeyan, B., Balakrishna, A., Rosolia, U., Li, F., McAllister, R., Gonzalez, J. E., Levine, S., Borrelli, F., and Goldberg, K. Safety augmented value estimation from demonstrations (saved): Safe deep model-based rl for sparse cost robotic tasks. arXiv preprint arXiv:1905.13402, 2019.
- Thomas et al. (2015) Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3000–3006, 2015.
- Torabi et al. (2018) Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), July 2018.
- Volkovs & Zemel (2014) Volkovs, M. N. and Zemel, R. S. New learning methods for supervised and unsupervised preference aggregation. The Journal of Machine Learning Research, 15(1):1135–1176, 2014.
- Zhang & Cho (2017) Zhang, J. and Cho, K. Query-efficient imitation learning for end-to-end simulated driving. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.
Appendix A MCMC Details
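We represent $R_\theta$ as a linear combination of pre-trained features:
$$R_\theta(\tau) = \sum_{s \in \tau} w^T \phi(s) = w^T \sum_{s \in \tau} \phi(s) = w^T \Phi_\tau.$$
We pre-compute and cache $\Phi_{\tau_i}$ for $i = 1, \ldots, m$, and the likelihood becomes
$$P(\mathcal{P}, D \mid R_\theta) = \prod_{(i,j) \in \mathcal{P}} \frac{e^{\beta w^T \Phi_{\tau_j}}}{e^{\beta w^T \Phi_{\tau_j}} + e^{\beta w^T \Phi_{\tau_i}}},$$
where $(i,j) \in \mathcal{P}$ denotes that $\tau_j$ is preferred to $\tau_i$.
We enforce constraints on the weight vectors by normalizing the output of the weight vector proposal such that $\|w\|_2 = 1$ and use a Gaussian proposal function centered on $w$ with standard deviation $\sigma$. Thus, given the current sample $w_t$, the proposal is defined as $w_{t+1} = \text{normalize}(\mathcal{N}(w_t, \sigma))$, in which normalize divides by the L2 norm of the sample to project back to the surface of the L2-unit ball.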
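The likelihood and proposal described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: it assumes the cached feature counts of each demonstration are plain Python lists, that a preference pair `(i, j)` means trajectory `j` is preferred to trajectory `i`, and that the inverse temperature `beta` defaults to 1. The helper names (`dot`, `log_likelihood`, `normalize`, `propose`) and the toy data are hypothetical.

```python
import math
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def log_likelihood(w, cached_features, preferences, beta=1.0):
    """Sum of log pairwise-ranking terms; (i, j) means trajectory j is preferred to i."""
    total = 0.0
    for i, j in preferences:
        r_i = beta * dot(w, cached_features[i])
        r_j = beta * dot(w, cached_features[j])
        m = max(r_i, r_j)  # subtract the max for a numerically stable log-sum-exp
        total += r_j - (m + math.log(math.exp(r_i - m) + math.exp(r_j - m)))
    return total

def normalize(v):
    """Divide by the L2 norm so the vector lies on the unit sphere."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def propose(w, sigma, rng):
    """Gaussian step centered on w, projected back onto the L2 unit sphere."""
    return normalize([x + rng.gauss(0.0, sigma) for x in w])

rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0]]  # toy cached feature counts for two trajectories
prefs = [(0, 1)]                     # trajectory 1 is preferred to trajectory 0
w = normalize([0.0, 1.0])
w_new = propose(w, 0.005, rng)
```

Because the feature counts are cached, each likelihood evaluation only requires dot products, which is what makes drawing many posterior samples cheap.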
For all experiments except Seaquest, we used a default step size of 0.005. For Seaquest, we increased the step size to 0.05. We run 200,000 steps of MCMC and use a burn-in of 5000 and skip every 20th sample to reduce auto-correlation. We initialize the MCMC chain with a randomly chosen vector on the L2-unit ball. Because inverse reinforcement learning is ill-posed, there are an infinite number of reward functions that could match any set of demonstrations. Prior work by Finn et al. (2016b) demonstrates that strong regularization is needed when learning cost functions via deep neural networks. To ensure that the rewards learned allow good policy optimization when fed into an RL algorithm, we used a non-negative return prior on the return of the lowest ranked demonstration. The prior takes the following form:
$$\log P(w) = \begin{cases} 0 & \text{if } w^T \Phi_{\tau_1} \geq 0 \\ -\infty & \text{otherwise,} \end{cases}$$
where $\tau_1$ denotes the lowest ranked demonstration.
This forces MCMC to not only find reward function weights that match the rankings, but to also find weights such that the return of the worst demonstration is non-negative. If the return of the worst demonstration was negative during proposal generation, then we assigned it a log prior probability of $-\infty$. Because the ranking likelihood is invariant to affine transformations of the rewards, this prior simply shifts the range of learned returns and does not affect the log likelihood ratios.
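Putting the prior together with the sampling settings, a Metropolis-style loop could look like the sketch below. Several choices here are assumptions made for illustration: the proposal is treated as symmetric, the chain is re-initialized until it starts inside the support of the prior, and thinning is implemented by keeping every `thin`-th sample after burn-in. The function names and toy features are hypothetical, and the small step counts are chosen only so the example runs quickly.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_prior(w, worst_features):
    """0 if the lowest-ranked demonstration has non-negative return, else -inf."""
    return 0.0 if dot(w, worst_features) >= 0.0 else -math.inf

def log_posterior(w, features, prefs, worst, beta=1.0):
    """Unnormalized log posterior: log prior plus pairwise-ranking log likelihood."""
    lp = log_prior(w, features[worst])
    if lp == -math.inf:
        return lp
    for i, j in prefs:  # (i, j): trajectory j is preferred to trajectory i
        r_i, r_j = beta * dot(w, features[i]), beta * dot(w, features[j])
        m = max(r_i, r_j)
        lp += r_j - m - math.log(math.exp(r_i - m) + math.exp(r_j - m))
    return lp

def run_mcmc(features, prefs, worst, steps, sigma, burn_in, thin, seed=0):
    rng = random.Random(seed)
    dim = len(features[0])
    w = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    # Assumption: resample the start point until the prior is satisfied.
    while log_prior(w, features[worst]) == -math.inf:
        w = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    lp = log_posterior(w, features, prefs, worst)
    chain = []
    for _ in range(steps):
        cand = normalize([x + rng.gauss(0.0, sigma) for x in w])
        cand_lp = log_posterior(cand, features, prefs, worst)
        # Metropolis acceptance test; 1 - random() lies in (0, 1], so log is defined.
        if math.log(1.0 - rng.random()) < cand_lp - lp:
            w, lp = cand, cand_lp
        chain.append(w)
    return chain[burn_in::thin]  # drop burn-in, then keep every thin-th sample

features = [[0.5, 0.0], [1.0, 1.0], [1.0, 3.0]]  # toy feature counts, worst to best
prefs = [(0, 1), (1, 2)]
samples = run_mcmc(features, prefs, worst=0, steps=2000, sigma=0.3,
                   burn_in=500, thin=20)
```

With the settings reported above, the corresponding call would use `steps=200000`, `sigma=0.005`, `burn_in=5000`, and `thin=20`. Candidates that violate the prior have a log posterior of negative infinity and are therefore always rejected.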
Appendix B Pre-training Latent Reward Features
We experimented with several pretraining methods. One method is to train using the pairwise ranking likelihood function in Equation (9) and then freeze all but the last layer of weights; however, the learned embedding may overfit to the limited number of preferences over demonstrations and fail to capture features relevant to the ground-truth reward function. Thus, we supplement the pairwise ranking objective with auxiliary objectives that can be optimized in a self-supervised fashion using data from the demonstrations.
We use the following self-supervised tasks to pre-train $\phi$: (1) Learn an inverse dynamics model that uses embeddings $\phi(s_t)$ and $\phi(s_{t+1})$ to predict the corresponding action $a_t$ (Torabi et al., 2018; Hanna & Stone, 2017), (2) Learn a forward dynamics model that predicts $s_{t+1}$ from $\phi(s_t)$ and $a_t$ (Oh et al., 2015; Thananjeyan et al., 2019), (3) Learn an embedding $\phi$ that predicts the temporal distance between two randomly chosen states from the same demonstration (Aytar et al., 2018), and (4) Train a variational pixel-to-pixel autoencoder in which $\phi(s)$ is the learned latent encoding (Makhzani & Frey, 2017; Doersch, 2016). Table 5 summarizes the auxiliary tasks used to train $\phi$.
There are many possibilities for pre-training $\phi$; however, we found that each objective described above encourages the embedding to encode different features. For example, an accurate inverse dynamics model can be learned by only attending to the movement of the agent. Learning forward dynamics supplements this by requiring $\phi$ to encode information about short-term changes to the environment. Learning to predict the temporal distance between states in a trajectory forces $\phi$ to encode long-term progress. Finally, the autoencoder loss acts as a regularizer to the other losses as it seeks to embed all aspects of the state.
In the Atari domain, input to the network is given visually as grayscale frames resized to $84 \times 84$. To provide temporal information, four sequential frames are stacked one on top of another to create a framestack which provides a brief snapshot of activity. The network architecture takes a framestack and applies four convolutional layers, following a similar architecture to Christiano et al. (2017) and Brown et al. (2019b), with leaky ReLU units as non-linearities following each convolution layer. The convolutions have the following structure:
|#||Filter size||Image size||Stride|
The convolved image is then flattened. Two sequential fully connected layers, with leaky ReLU applied to the first layer, transform the flattened image into the encoding $\phi(s)$, where $s$ is the initial framestack. The width of these layers depends on the size of the feature encoding chosen. In our experiments with a latent dimension of 64, the first layer transforms from size 784 to 128 and the second from 128 to 64.
See Figure 2 for a complete diagram of this process.
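As a sanity check on the flattened size of 784, the spatial dimensions can be traced through four unpadded convolutions. The kernel sizes, strides, and final channel count below are assumptions borrowed from the T-REX-style encoder of Brown et al. (2019b), and the 84 x 84 input resolution is the standard Atari preprocessing; none of these values are read from the table above, so treat the sketch as one configuration that is consistent with the reported sizes.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution with no padding."""
    return (size - kernel) // stride + 1

# Assumed (kernel, stride) for the four convolutional layers.
layers = [(7, 3), (5, 2), (3, 1), (3, 1)]
final_channels = 16  # assumed number of output channels of the last layer

size = 84  # assumed input resolution (84 x 84 grayscale frames)
sizes = [size]
for kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

flattened = size * size * final_channels
```

Under these assumptions the spatial size shrinks from 84 to 26, 11, 9, and finally 7, so the flattened vector has 7 x 7 x 16 = 784 entries, matching the input width of the first fully connected layer.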
Architectural information for each auxiliary task is given below.
The variational autoencoder (VAE) tries to reconstruct the original framestack from the feature encoding using transposed convolutions. Mirroring the structure of the initial convolutions, two fully connected layers precede four transposed convolution layers. These first two layers transform the 64-dimensional feature encoding from 64 to 128, and from 128 to 1568. The following four layers’ structures are summarized below:
# Filter size Image size Stride Input - - 1 1 2 1 3 2 4 1
A cross-entropy loss is applied between the reconstructed image and the original, as well as a term added to penalize the KL divergence of the distribution from the unit normal.
A temporal difference estimator, which takes two random feature encodings from the same demonstration and predicts the number of timesteps in between. It is a single fully-connected layer, transforming the concatenated feature encodings into a scalar time difference. A mean-squared error loss is applied between the real difference and predicted.
An inverse dynamics model, which takes two sequential feature encodings and predicts the action taken in between. It is again a single fully-connected layer, trained as a classification problem with a binary cross-entropy loss over the discrete action set.
A forward dynamics model, which takes a concatenated feature encoding and action and predicts the next feature encoding with a single fully-connected layer. This is repeated 5 times, which increases the difference between the initial and final encoding. It is trained using a mean-squared error between the predicted and real feature encoding.
A T-REX loss, which samples feature encodings from two different demonstrations and tries to predict which one of them has preference over the other. This is done with a single fully-connected layer that transforms an encoding into scalar reward, and is then trained as a classification problem with a binary cross-entropy loss. A 1 is assigned to the demonstration sample with higher preference and a 0 to the demonstration sample with lower preference.
In order to encourage a feature encoding that has information easily interpretable via linear combinations, the temporal difference, T-REX, inverse dynamics, and forward dynamics tasks consist of only a single layer atop the feature encoding space rather than multiple layers.
To compute the final loss on which to do the backwards pass, all of the losses described above are summed with weights determined empirically to balance out their values.
B.1 Training Specifics
We used an NVIDIA TITAN V GPU for training the embedding. We used the same 12 demonstrations used for MCMC to train the self-supervised and ranking losses described above. We sample 60,000 trajectory snippet pairs from the demonstration pool, where each snippet is between 50 and 100 timesteps long. We use a learning rate of 0.001 and a weight decay of 0.001. We make a single pass through all of the training data using a batch size of 1, resulting in 60,000 updates using the Adam (Kingma & Ba, 2014) optimizer. For Enduro, prior work (Brown et al., 2019b) showed that full trajectories resulted in better performance than subsampling trajectories. Thus, for Enduro we subsample 10,000 pairs of entire trajectories by randomly selecting a starting time between 0 and 5 steps after the initial state and then skipping every t frames, where t is chosen uniformly at random from a fixed range, and we train with two passes through the training data. When performing subsampling for either snippets or full trajectories, we subsample pairs of trajectories such that one is from a worse ranked demonstration and one is from a better ranked demonstration, following the procedure outlined by Brown et al. (2019b).
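The snippet-pair sampling described above can be sketched as follows. This is an illustrative approximation, not the procedure of Brown et al. (2019b) in full: it assumes the demonstrations are stored as lists ordered from worst to best, that both snippets in a pair share the same length, and that start points are chosen uniformly. The function name and the stand-in data are hypothetical.

```python
import random

def sample_snippet_pair(demos, rng, min_len=50, max_len=100):
    """Sample equal-length snippets from two demos; demos are ordered worst to best."""
    i, j = sorted(rng.sample(range(len(demos)), 2))  # i < j, so demos[j] ranks higher
    length = min(rng.randint(min_len, max_len), len(demos[i]), len(demos[j]))
    start_i = rng.randint(0, len(demos[i]) - length)
    start_j = rng.randint(0, len(demos[j]) - length)
    worse = demos[i][start_i:start_i + length]
    better = demos[j][start_j:start_j + length]
    return worse, better

rng = random.Random(0)
# Hypothetical stand-in data: 12 ranked demos, each a list of (rank, timestep) frames.
demos = [[(rank, t) for t in range(300)] for rank in range(12)]
pairs = [sample_snippet_pair(demos, rng) for _ in range(1000)]
```

Each returned pair can then be labeled with the better snippet as the preferred one when computing the ranking loss.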
Appendix C Visualizations of Latent Space
C.1 Individual Dimensions
Figure 3 contains samples taken from the feature encoding space for Space Invaders by linearly varying dimension 16. Notice how the value of this dimension has a clear effect on the network’s reconstruction of the game image. The framestacks on the left, with a lower value in dimension 16, have enemy ships concentrated toward the top left, while the framestacks with a higher value in dimension 16 have enemy ships concentrated toward the top right.
This is in contrast to Seaquest, observable in Figure 4, where the autoencoder predicts roughly the same output in every framestack. This may be due to the freely moving objects in Seaquest, which tend to be fixed in common positions much less often than in Space Invaders. Future work is needed to investigate the exact cause of why some domains work so much better than others, and how to encourage the autoencoder to pick up on more fleeting features.
C.2 Random Samples
While the samples in Figure 3 and Figure 4 were selected to illustrate specific points about the latent encoding, Figure 8 and Figure 9 provide four entirely random samples from the latent space of all five Atari games tested. Figure 8, trained with the T-REX loss, has noticeably lower-quality reconstructions than Figure 9, which was trained without the T-REX loss. However, the networks trained with the T-REX loss tend to perform better when running MCMC, suggesting that the feature encoding contains additional information not recognized by the autoencoder.
C.3 Visualizations of Learned Features
A video containing an Enduro demonstration trajectory, its decoding with respect to the pre-trained autoencoder, and a plot of the dimensions in the latent encoding over time is available at https://www.youtube.com/watch?v=DMf8kNH9nVg. Observe how changes in the demonstration, such as turning right or left or a shift, correspond to changes in the plots of the feature embedding. We noticed that certain features increase when the agent passes other cars while other features decrease when the agent gets passed by other cars. This is evidence that the pretraining has learned features that are relevant to the ground truth reward, which gives +1 every time the agent passes a car and -1 every time the agent gets passed.
A similar visualization of the latent space for Space Invaders is available at https://www.youtube.com/watch?v=2uN5uD17H6M. Notice how it tends to focus on the movement of enemy ships, which is useful for tracking game progress in tasks such as the temporal difference loss, but seems to ignore the player’s ship despite its utility for the inverse dynamics loss. The information likely exists in the encoding but is not included in the output of the autoencoder.
A visualization of the latent space for Breakout is available at https://www.youtube.com/watch?v=8zgbD1fZOH8. Observe that breaking a brick often results in a small spike in the latent encoding. Many dimensions, like the dark green curve which begins at the lowest value, seem to invert as the game progresses, thus acting as a measure of how much time has passed.
Appendix D Imitation Learning Ablations for Pre-training
We compared three pre-training schemes: using only the T-REX ranking loss, using only the self-supervised losses shown in Table 1 of the main paper, and using both the T-REX ranking loss and the self-supervised losses. We found that performance varied over the different pre-training schemes, but that using Ranking + Self-Supervised achieved high performance across all games, clearly outperforming only using self-supervised losses and achieving superior performance to only using the ranking loss on 3 out of 5 games.
|Ranking Loss||Self-Supervised||Ranking + Self-Supervised|
Appendix E Suboptimal Demonstration Details
We used the same suboptimal demonstrations used by Brown et al. (2019b) for comparison. These demonstrations were obtained by running PPO on the ground truth reward and checkpointing every 50 updates using OpenAI Baselines (Dhariwal et al., 2017). Brown et al. (2019b) make the checkpoint files available, so to generate the demonstration data we used their saved checkpoints and followed the instructions in their released code to generate the data for our algorithm (code from Brown et al. (2019b) was downloaded from https://github.com/hiwonjoon/ICML2019-TREX). We gave Bayesian REX these demonstrations as well as ground-truth rankings using the game score; however, other than the rankings, Bayesian REX has no access to the true reward samples. Following the recommendations of Brown et al. (2019b), we mask the game score and other parts of the game that are directly indicative of the game score such as the number of enemy ships left, the number of lives left, the level number, etc. See Brown et al. (2019b) for full details.
Appendix F Reinforcement Learning Details
We used the OpenAI Baselines implementation of PPO (Schulman et al., 2017; Dhariwal et al., 2017). We used the default hyperparameters for all games and all experiments. We run RL for 50 million frames and then take the final checkpoint to perform evaluations. We adapted the OpenAI Baselines code so that even though the RL agent receives a standard preprocessed observation, it only receives samples of the reward learned via Bayesian REX, rather than the ground-truth reward. T-REX (Brown et al., 2019b) uses a sigmoid to normalize rewards before passing them to the RL algorithm; however, we obtained better performance for Bayesian REX by feeding the unnormalized predicted reward into PPO for policy optimization. We follow the OpenAI Baselines default preprocessing for the framestacks that are fed into the RL algorithm as observations. We also apply the default OpenAI Baselines wrappers to the environments. We run PPO with 9 workers on an NVIDIA TITAN V GPU.
Appendix G High-Confidence Policy Performance Bounds
In this section we describe the details of the policy performance bounds.
G.1 Policy Evaluation Details
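We estimated $\Phi_{\pi_{\rm eval}}$ using $N$ Monte Carlo rollouts for each evaluation policy. Thus, after generating $N$ rollouts $\tau_1, \ldots, \tau_N$ from $\pi_{\rm eval}$, the feature expectations are computed as
$$\Phi_{\pi_{\rm eval}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{s \in \tau_i} \phi(s).$$
We used the same number of rollouts $N$ for all experiments.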
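The Monte Carlo estimate of a policy's feature expectations can be sketched as below. It assumes undiscounted sums of state features over each rollout, averaged across rollouts, and a feature map `phi` that returns a list of floats; both the function names and the toy states are hypothetical.

```python
def feature_expectations(rollouts, phi):
    """Average, over rollouts, of the summed state features of each rollout."""
    dim = len(phi(rollouts[0][0]))
    totals = [0.0] * dim
    for rollout in rollouts:
        for state in rollout:
            for k, value in enumerate(phi(state)):
                totals[k] += value
    return [t / len(rollouts) for t in totals]

def expected_return(w, phi_pi):
    """Expected return of the policy under a linear reward with weights w."""
    return sum(w_k * f_k for w_k, f_k in zip(w, phi_pi))

# Toy example: states are numbers and phi maps a state to two features.
toy_rollouts = [[1.0, 2.0], [3.0]]
toy_phi = lambda s: [s, 1.0]
phi_pi = feature_expectations(toy_rollouts, toy_phi)
toy_return = expected_return([0.5, -1.0], phi_pi)
```

Once the feature expectations of a policy are cached, its return under every posterior sample is a single dot product, so no further rollouts are needed to evaluate the policy across the whole chain.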
G.2 Evaluation Policies
We evaluated several different evaluation policies. To see if the learned reward function posterior can interpolate and extrapolate we created four different evaluation policies: A, B, C, and D. These policies were created by running RL via PPO on the ground truth reward for the different Atari games. We then checkpointed the policy and selected checkpoints that would result in different levels of performance. For all games except for Enduro these checkpoints correspond to 25, 325, 800, and 1450 update steps using OpenAI baselines. For Enduro, PPO performance was stuck at 0 return until much later in learning. To ensure diversity in the evaluation policies, we chose to use evaluation policies corresponding to 3125, 3425, 3900, and 4875 steps. We also evaluated each game with a No-Op policy. These policies are often adversarial for some games, such as Seaquest, Breakout, and Beam Rider, since they allow the agent to live for a very long time without actually playing the game—a potential way to hack the learned reward since most learned rewards for Atari will incentivize longer gameplay.
The results for Beam Rider and Breakout are shown in the main paper. For completeness, we include the high-confidence policy evaluation results for the other games here in the Appendix. Table 7 shows the high-confidence policy evaluation results for Enduro. Both the average returns over the posterior and the high-confidence performance bounds (0.05-VaR) give accurate predictions relative to the ground-truth performance. The No-Op policy results in the racecar slowly moving along the track and losing the race. This policy is accurately predicted as being much worse than the other evaluation policies. We also evaluated the Mean and MAP policies found by optimizing the Mean reward and MAP reward from the posterior obtained using Bayesian REX. We found that the learned posterior is able to capture that the MAP policy is more than twice as good as evaluation policy D and that the Mean policy has performance somewhere between that of policies B and C. These results show that Bayesian REX has the potential to predict better-than-demonstrator performance (Brown et al., 2019a).
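The two quantities reported in these tables, the mean return over the posterior and a high-confidence lower bound, can be sketched as follows. The sketch assumes a reward that is linear in the features, so that each posterior weight sample gives a predicted return as a dot product with the policy's feature expectations. The function name and the toy weight samples are hypothetical, and `np.quantile` with its default linear interpolation stands in for whatever quantile estimator was actually used.

```python
import numpy as np


def mean_and_var_bound(weight_samples, phi, delta=0.05):
    """Posterior mean return and delta-quantile (delta-VaR) of the return.

    weight_samples: array of shape (N, d), one reward weight vector per
        posterior sample (for example, from MCMC).
    phi: array of shape (d,), feature expectations of the evaluated policy.
    """
    returns = np.asarray(weight_samples, dtype=float) @ np.asarray(phi, dtype=float)
    return returns.mean(), np.quantile(returns, delta)


# Toy posterior with three weight samples over two features.
weight_samples = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
phi = np.array([2.0, 4.0])
mean_return, lower_bound = mean_and_var_bound(weight_samples, phi, delta=0.05)
```

Under these assumptions, a policy whose predicted return varies widely across posterior samples receives a low bound even if its mean is high, which is the sense in which the bound reflects risk.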
Table 8 shows the results for high-confidence policy evaluation for Seaquest. The results show that the high-confidence performance bounds are able to accurately predict that evaluation policies A and B are worse than C and D. The ground-truth performance of policies C and D is too close, and neither the mean performance over the posterior nor the 0.05-VaR bound on the posterior finds any statistical difference between them. Interestingly, the No-Op policy has a very high mean and 95%-confidence lower bound, despite not scoring any points. However, as shown in the bottom half of Table 8, adding one more ranked demonstration from a 3000-step segment of a No-Op policy solves this problem. These results motivate a natural human-in-the-loop approach for safe imitation learning.
|Results after adding one ranked demo from No-Op|
Finally, Table 9 shows the results for high-confidence policy evaluation for Space Invaders. The results show that both the mean performance and the 95%-confidence lower bound are good indicators of ground-truth performance for the evaluation policies. The No-Op policy for Space Invaders results in the agent getting hit by alien lasers early in the game. The learned reward function posterior correctly assigns it low average performance and high risk (a low 95%-confidence lower bound).
Appendix H Different Evaluation Policies
To test Bayesian REX on different learned policies, we took a policy trained with RL on the ground-truth reward function for Beam Rider, the MAP policy learned via Bayesian REX for Beam Rider, and a policy trained with an earlier version of Bayesian REX (trained without all of the auxiliary losses) that learned a novel reward hack in which the policy repeatedly presses left and then right, enabling the agent’s ship to stay between two of the enemies’ firing lanes. This reward hack allows the agent to live for a very long time. We took a 2000-step prefix of each policy and evaluated the expected and 5th-percentile worst-case predicted returns for each policy. We found that Bayesian REX is able to accurately predict that the reward-hacking policy is worse than both the RL policy and the policy optimizing the Bayesian REX reward. However, we found that the Bayesian REX policy, while not performing as well as the RL policy, was given a higher expected return and a higher lower bound on performance than the RL policy. Results are shown in Table 10.
Appendix I Human Demonstrations
To investigate whether Bayesian REX is able to correctly rank human demonstrations, one of the authors provided demonstrations of a variety of different behaviors. We then took the latent embeddings of the demonstrations and used the posterior distribution to find high-confidence performance bounds for these rollouts.
We generated four human demonstrations: (1) good, a demonstration that plays the game well, (2) bad, a demonstration that seeks to play the game but does a poor job, (3) suicidal, a demonstration that does not shoot enemies and seeks enemy bullets, and (4) adversarial, a demonstration that pretends to play the game by moving and shooting as much as possible but tries to avoid actually shooting enemies. The results of high-confidence policy evaluation are shown in Table 11. The high-confidence bounds and average performance over the posterior correctly rank the behaviors. This provides evidence that the learned linear reward correctly rewards actually destroying aliens and avoiding getting shot, rather than just flying around and shooting.
I.2 Space Invaders
For Space Invaders we demonstrated an even wider variety of behaviors to see how Bayesian REX would rank their relative performance. We evaluated the following policies: (1) good, a demonstration that attempts to play the game as well as possible, (2) every other, a demonstration that only shoots aliens in the 2nd and 4th columns, (3) flee, a demonstration that does not shoot aliens but tries to always be moving while avoiding enemy lasers, (4) hide, a demonstration that does not shoot and hides behind one of the barriers to avoid enemy bullets, (5) suicidal, a demonstration that seeks enemy bullets while not shooting, (6) shoot shelters, a demonstration that tries to destroy its own shelters by shooting at them, (7) hold ’n fire, a demonstration where the player rapidly fires but does not move to avoid enemy lasers, and (8) miss, a demonstration where the demonstrator tries to fire without hitting any aliens while avoiding enemy lasers.
|hold ’n fire||44.3||18.6||210||638.0|
Table 12 shows the results of evaluating the different demonstrations. The good demonstration is clearly the best-performing demonstration in terms of mean performance and 95%-confidence lower bound on performance, and the suicidal policy is correctly given the lowest performance lower bound. However, we found that the length of the demonstration appears to have a strong effect on the predicted performance for Space Invaders. Demonstrations such as hide and miss are able to live longer than policies that actually hit aliens. As a result, they receive higher 0.05-quantile worst-case predicted performance and higher mean performance.
To study this further, we looked at only the first 600 timesteps of each policy, to remove any confounding by the length of the trajectory. The results are shown in Table 13. With fixed-length demonstrations, Bayesian REX is able to correctly rank good, every other, and hold ’n fire as the best demonstrations, despite the presence of deceptive evaluation policies.
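The fixed-length comparison can be sketched as below, under the assumption that a demonstration's predicted return is the dot product between a posterior weight sample and the sum of its per-timestep features. The function name and the toy feature arrays are hypothetical; only the 600-timestep horizon comes from the text.

```python
import numpy as np


def truncated_return_samples(features, weight_samples, horizon=600):
    """Predicted returns of a demonstration restricted to a fixed-length prefix.

    features: array of shape (T, d), one feature vector per timestep.
    weight_samples: array of shape (N, d), posterior reward weight samples.
    """
    prefix = np.asarray(features, dtype=float)[:horizon]  # drop steps past horizon
    phi = prefix.sum(axis=0)                               # feature counts of prefix
    return np.asarray(weight_samples, dtype=float) @ phi


# Toy case: identical per-step features, but one demonstration survives longer.
weights = np.array([[1.0, 1.0]])
long_demo = np.ones((1000, 2))
short_demo = np.ones((600, 2))

full_long = truncated_return_samples(long_demo, weights, horizon=1000)
cut_long = truncated_return_samples(long_demo, weights, horizon=600)
cut_short = truncated_return_samples(short_demo, weights, horizon=600)
```

In this toy case the longer demonstration scores higher only because it accumulates more timesteps; after truncation the two receive the same predicted return, which is the confound the fixed-length evaluation is meant to remove.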
|hold ’n fire||40.9||17.1||210||600.0|
For Enduro we tested four different human demonstrations: (1) good, a demonstration that seeks to play the game well, (2) periodic, a demonstration that alternates between speeding up and passing cars and then slowing down and being passed, (3) neutral, a demonstration that stays right next to the last car in the race and doesn’t try to pass or get passed, and (4) ram, a demonstration that tries to ram into as many cars as possible while going fast. Table 14 shows that Bayesian REX is able to accurately predict the performance and risk of each of these demonstrations, assigning the highest risk (lowest 0.05-VaR) to the ram demonstration and the least risk to the good demonstration.
Appendix J Trading off risk and reward
We next tested whether we can trade off risk and reward using the posterior distribution learned via Bayesian REX. We obtained a variety of evaluation policies by partially training and checkpointing RL policies trained on the ground-truth reward (see Appendix for details). We then varied the risk tolerance from 0 (maximally risk averse) to 1 (maximally risk tolerant) and performed a policy evaluation for each evaluation policy. We then sorted the evaluation policies based on the risk tolerance threshold and selected the best policy. In Figures 9(a) and 9(b) we plot the risk tolerance along the x-axis and the ground-truth expected performance along the y-axis. For Breakout, we noticed that increasing the risk tolerance improves expected return, at the expense of possibly riskier policies. For Beam Rider, we found that increasing the risk tolerance from 0 to 1 did not change the policy selected. We investigated this and found that one policy dominated the other evaluation policies, as demonstrated in Figure 9(c), where the posterior distribution for the dominating policy (green) is compared to the distributions for several other evaluation policies.
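One possible reading of this selection procedure is sketched below. It assumes that a risk tolerance of alpha corresponds to scoring each policy by the alpha-quantile of its predicted return over the posterior, so that 0 uses the worst-case sample and 1 the best-case sample. That mapping, the function name, and the toy return samples are assumptions made for illustration and may differ from the procedure actually used.

```python
import numpy as np


def select_policy(returns_by_policy, alpha):
    """Pick the policy with the highest alpha-quantile of predicted return.

    returns_by_policy: dict mapping a policy name to an array of predicted
        returns, one per posterior sample.
    alpha: risk tolerance in [0, 1]; 0 is maximally risk averse.
    """
    scores = {name: np.quantile(r, alpha) for name, r in returns_by_policy.items()}
    return max(scores, key=scores.get)


# Toy posterior return samples (hypothetical numbers).
candidates = {
    "safe": np.array([4.0, 5.0, 6.0]),
    "risky": np.array([0.0, 5.0, 12.0]),
}
choice_averse = select_policy(candidates, alpha=0.0)
choice_tolerant = select_policy(candidates, alpha=1.0)

# A policy whose return distribution dominates is chosen at every tolerance.
with_dominant = dict(candidates, dominant=np.array([13.0, 14.0, 15.0]))
choices = {select_policy(with_dominant, a) for a in np.linspace(0.0, 1.0, 11)}
```

The last case mirrors the Beam Rider observation: when one policy's posterior return distribution lies above all the others, the selected policy does not change with the risk tolerance.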