1 Introduction
It is important that robots and other autonomous agents can safely learn from and adapt to a variety of human preferences and goals. One common way to learn preferences and goals is via imitation learning, in which an autonomous agent learns how to perform a task by observing demonstrations of the task (Argall et al., 2009). When learning from demonstrations, it is important for an agent to be able to provide high-confidence bounds on its performance with respect to the demonstrator; however, while there exists much work on high-confidence off-policy evaluation in the reinforcement learning (RL) setting, there has been much less work on high-confidence policy evaluation in the imitation learning setting, where reward samples are unavailable.
Prior work on high-confidence policy evaluation for imitation learning has used Bayesian inverse reinforcement learning (IRL) (Ramachandran & Amir, 2007) to allow an agent to reason about reward uncertainty and policy generalization error (Brown et al., 2018). However, Bayesian IRL is typically intractable for complex problems due to the need to repeatedly solve an MDP in the inner loop, resulting in high computational cost as well as high sample cost if a model is not available. This precludes robust safety and uncertainty analysis for imitation learning in high-dimensional problems or in problems in which a useful model of the MDP is unavailable. We seek to remedy this problem by proposing and evaluating a method for safe and efficient Bayesian reward learning via preferences over demonstrations. Preferences over demonstrations are intuitive for humans to provide (Sadigh et al., 2017; Christiano et al., 2017; Palan et al., 2019) and allow for better-than-demonstrator performance (Brown et al., 2019a). To the best of our knowledge, we are the first to show that preferences over demonstrations enable fast Bayesian reward learning in high-dimensional control tasks and also enable efficient high-confidence performance bounds for imitation learning.
We first formalize the problem of high-confidence policy evaluation (Thomas et al., 2015) for imitation learning. We next propose a novel algorithm, Bayesian Reward Extrapolation (Bayesian REX), that uses a pairwise ranking likelihood to significantly reduce the computational complexity of generating samples from the posterior distribution over reward functions. We demonstrate that Bayesian REX can leverage neural network function approximation to learn useful reward features via self-supervised learning in order to efficiently perform deep Bayesian reward inference from visual demonstrations. Finally, we demonstrate that samples obtained from Bayesian REX can be used to solve the high-confidence policy evaluation problem for imitation learning. We evaluate our method on imitation learning for Atari games and demonstrate that we can efficiently compute high-confidence bounds on policy performance without requiring samples of the reward function. Furthermore, we demonstrate that these high-confidence bounds can be used to accurately rank different evaluation policies according to their risk and performance under the distribution over the unknown ground-truth reward function. Finally, we provide evidence that bounds on uncertainty and risk may provide a useful tool for detecting reward hacking/gaming (Amodei et al., 2016), a common problem in reward inference from demonstrations (Ibarz et al., 2018) as well as reinforcement learning (Ng et al., 1999; Leike et al., 2017).

2 Related Work
2.1 Imitation Learning
Imitation learning is the problem of learning a policy from demonstrations and can roughly be divided into techniques that use behavioral cloning and techniques that use inverse reinforcement learning. Behavioral cloning methods (Pomerleau, 1991; Torabi et al., 2018) seek to solve the imitation learning problem via supervised learning, in which the goal is to learn a mapping from states to actions that mimics the demonstrator. While computationally efficient, these methods suffer from compounding errors (Ross et al., 2011). Methods such as DAgger (Ross et al., 2011) and DART (Laskey et al., 2017) avoid this problem by repeatedly collecting additional state-action pairs from an expert.
Inverse reinforcement learning (IRL) methods (Ng & Russell, 2000) seek to solve the imitation learning problem by estimating a reward function that makes the demonstrations appear near optimal. Classical approaches repeatedly alternate between a reward estimation step and a full policy optimization step (Abbeel & Ng, 2004; Ziebart et al., 2008; Ramachandran & Amir, 2007). Bayesian IRL (Ramachandran & Amir, 2007) samples from the posterior distribution over reward functions, whereas other methods seek a single reward function that induces the demonstrator's feature expectations (Abbeel & Ng, 2004), often while also seeking to maximize the entropy of the resulting policy (Ziebart et al., 2008). Deep learning approaches to imitation learning are typically based on maximum entropy policy optimization and divergence minimization between marginal state-action distributions (Ho & Ermon, 2016; Fu et al., 2017; Ghasemipour et al., 2019) and are related to Generative Adversarial Networks (Finn et al., 2016a). These methods scale to complex control problems by iterating between reward learning and policy learning steps. Alternatively, Brown et al. (2019b) use ranked demonstrations to learn a reward function via supervised learning without requiring an MDP solver or any inference-time data collection. The learned reward function can then be used to optimize a potentially better-than-demonstrator policy (Brown et al., 2019b). Subsequent research showed that preferences over demonstrations can be automatically generated via noise injection, allowing better-than-demonstrator performance even in the absence of explicit preference labels (Brown et al., 2019a). Despite the success of deep imitation learning methods, existing methods typically return a point estimate of the reward function, precluding uncertainty and robustness analysis.

2.2 Safe Imitation Learning
While there has been much recent interest in imitation learning, less attention has been given to problems related to safe imitation learning. Zhang & Cho (2017) propose SafeDAgger, a supervised learning approach to imitation learning that predicts in which states the imitation learning policy will have a large action difference from the demonstrator. Control is given to the demonstrator only if the predicted action difference of the novice is above some hand-tuned threshold parameter. Lacotte et al. (2019) propose an imitation learning algorithm that seeks to match the tail risk of the expert as well as find a policy that is indistinguishable from the demonstrations. Brown & Niekum (2018) propose a Bayesian sampling approach to provide explicit high-confidence safety bounds on value-at-risk in the imitation learning setting. Their method uses samples from the posterior distribution to compute sample-efficient probabilistic upper bounds on the policy loss of any evaluation policy. Other work considers robust policy optimization over a distribution of reward functions conditioned on demonstrations or a partially specified reward function, but these methods require an MDP solver in the inner loop, limiting their scalability (Hadfield-Menell et al., 2017; Brown et al., 2018; Huang et al., 2018). We extend and generalize the work of Brown & Niekum (2018) by demonstrating, for the first time, that high-confidence performance bounds can be efficiently obtained when performing imitation learning from high-dimensional visual demonstrations without requiring access to a model of the MDP for reward inference.
2.3 Value Alignment and Active Preference Learning
Safe imitation learning is closely related to the problem of value alignment, which seeks to design methods that prevent AI systems from acting in ways that violate human values (Hadfield-Menell et al., 2016; Fisac et al., 2020). Research has shown that difficulties arise when an agent seeks to align its values with a human who is not perfectly rational (Milli et al., 2017) and that there are fundamental impossibility results regarding value alignment (Eckersley, 2018); however, Eckersley (2018) demonstrates that these impossibility results do not hold if the objective is represented as a set of partially ordered preferences.
Prior work has used active queries to perform Bayesian reward inference on low-dimensional, hand-crafted reward features (Sadigh et al., 2017; Brown et al., 2018; Bıyık et al., 2019). Christiano et al. (2017) and Ibarz et al. (2018) use deep networks to scale active preference learning to high-dimensional tasks, but require large numbers of active queries during policy optimization and do not perform Bayesian reward inference. Our work complements and extends prior work by: (1) removing the requirement for active queries during reward inference or policy optimization, (2) showing that preferences over demonstrations enable efficient Bayesian reward inference in high-dimensional visual control tasks, and (3) providing an efficient method for computing high-confidence bounds on the performance of any evaluation policy in the imitation learning setting.
2.4 Safe Reinforcement Learning
Research on safe reinforcement learning (RL) usually focuses on safe exploration strategies or optimization objectives other than expected return (Garcıa & Fernández, 2015). Recently, objectives based on measures of risk such as Value at Risk (VaR) and Conditional VaR have been shown to provide tractable and useful risk-sensitive measures of performance for MDPs (Tamar et al., 2015; Chow et al., 2015). Other work focuses on finding robust solutions to MDPs (Ghavamzadeh et al., 2016; Petrik & Russell, 2019), using model-based RL to safely improve upon suboptimal demonstrations (Thananjeyan et al., 2019), and obtaining high-confidence off-policy bounds on the performance of an evaluation policy (Thomas et al., 2015; Hanna et al., 2019). Our work provides an efficient solution to the problem of high-confidence policy evaluation in the imitation learning setting, in which samples of rewards are not observed and the demonstrator's policy is unknown.
2.5 Bayesian Neural Networks
Bayesian neural networks typically either perform Markov Chain Monte Carlo (MCMC) sampling (MacKay, 1992), variational inference (Sun et al., 2019; Khan et al., 2018), or use hybrid methods such as particle-based inference (Liu & Wang, 2016) to approximate the posterior distribution over neural network weights. Alternative approaches such as ensembles (Lakshminarayanan et al., 2017) or approximations such as Bayesian dropout (Gal & Ghahramani, 2016; Kendall & Gal, 2017) have also been used to obtain a distribution over the outputs of a neural network in order to provide uncertainty quantification (Maddox et al., 2019). In this work, we are not only interested in the uncertainty of the output of the reward function network, but also in the uncertainty over the performance of a policy when evaluated in an MDP with an unknown reward function. Thus, we face the doubly difficult problem of measuring the uncertainty in the evaluation of a policy, which depends both on the stochasticity of the policy and on the uncertainty over the rewards that the policy will obtain.

3 Preliminaries
3.1 Notation
We model the environment as a Markov Decision Process (MDP) consisting of states $\mathcal{S}$, actions $\mathcal{A}$, transition dynamics $T : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, reward function $R : \mathcal{S} \to \mathbb{R}$, initial state distribution $S_0$, and discount factor $\gamma$. A policy $\pi$ is a mapping from states to a probability distribution over actions. We denote the value of a policy $\pi$ under reward function $R$ as $V^\pi_R = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t R(s_t) \mid s_0 \sim S_0]$ and denote the value of executing policy $\pi$ starting at state $s \in \mathcal{S}$ as $V^\pi_R(s) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t R(s_t) \mid s_0 = s]$. Given a reward function $R$, the Q-value of a state-action pair $(s,a)$ is $Q^\pi_R(s,a) = \mathbb{E}_\pi[\sum_{t=0}^\infty \gamma^t R(s_t) \mid s_0 = s, a_0 = a]$. We also denote $V^*_R = \max_\pi V^\pi_R$ and $Q^*_R(s,a) = \max_\pi Q^\pi_R(s,a)$.

3.2 Bayesian Inverse Reinforcement Learning
Bayesian inverse reinforcement learning (IRL) (Ramachandran & Amir, 2007) models the environment as an MDP$\setminus$R in which the reward function is unavailable. Bayesian IRL seeks to infer the latent reward function of a Boltzmann-rational demonstrator that executes the following policy:

$$\pi^{\beta_D}_{R^*}(a \mid s) = \frac{e^{\beta_D Q^*_{R^*}(s,a)}}{\sum_{b \in \mathcal{A}} e^{\beta_D Q^*_{R^*}(s,b)}}, \quad (1)$$

in which $R^*$ is the true reward function of the demonstrator, and $\beta_D$ represents the confidence that the demonstrator is acting optimally. Under the assumption of Boltzmann rationality, the likelihood of a set of demonstrated state-action pairs, $D = \{(s,a)\}$, given a specific reward function hypothesis $R$, can be written as

$$P(D \mid R) = \prod_{(s,a) \in D} \frac{e^{\beta_D Q^*_R(s,a)}}{\sum_{b \in \mathcal{A}} e^{\beta_D Q^*_R(s,b)}}. \quad (2)$$

Bayesian IRL generates samples from the posterior distribution $P(R \mid D) \propto P(D \mid R) P(R)$ via Markov Chain Monte Carlo (MCMC) sampling, but this requires solving for $Q^*_{R'}$ to compute the likelihood of each new reward proposal $R'$. Thus, Bayesian IRL methods are only used for low-dimensional problems with reward functions that are often linear combinations of a small number of hand-crafted features (Bobu et al., 2018; Bıyık et al., 2019). One of our contributions is an efficient Bayesian reward inference algorithm that leverages preferences over demonstrations in order to significantly improve the efficiency of Bayesian reward inference.
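To make the bottleneck concrete, the following Python sketch evaluates the Boltzmann likelihood of Equation (2) for a tabular MDP, assuming the optimal Q-values under the proposed reward have already been computed. The function and variable names are illustrative, not the paper's code.

```python
import math

# Sketch of the standard Bayesian IRL likelihood (Equation 2) for a tabular
# MDP. Names are illustrative, not from the paper's implementation.

def boltzmann_log_likelihood(demos, q_star, actions, beta=1.0):
    """log P(D | R) for a Boltzmann-rational demonstrator.

    demos   -- list of (state, action) pairs
    q_star  -- dict: (state, action) -> optimal Q-value under the proposed
               reward R (recomputing this per proposal is the bottleneck)
    actions -- list of all actions
    beta    -- demonstrator rationality / confidence parameter
    """
    log_lik = 0.0
    for s, a in demos:
        # log of  exp(beta*Q*(s,a)) / sum_b exp(beta*Q*(s,b))
        log_z = math.log(sum(math.exp(beta * q_star[(s, b)]) for b in actions))
        log_lik += beta * q_star[(s, a)] - log_z
    return log_lik
```

The expensive part is not this function but producing `q_star`: every MCMC proposal $R'$ requires re-solving the MDP for fresh optimal Q-values, which is exactly the cost Bayesian REX removes.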
4 High Confidence Evaluation for Imitation Learning
Before detailing our approach, we first formalize the problem of high-confidence policy evaluation for imitation learning. We assume access to an MDP$\setminus$R, an evaluation policy $\pi_{\mathrm{eval}}$, a set of demonstrations $D = \{\tau_1, \ldots, \tau_m\}$, in which each $\tau_i$ is either a complete or partial trajectory comprised of states or state-action pairs, a confidence level $\delta$, and a performance statistic $g : \Pi \times \mathcal{R} \to \mathbb{R}$, in which $\mathcal{R}$ denotes the space of reward functions and $\Pi$ is the space of all policies.

The High-Confidence Policy Evaluation problem for Imitation Learning (HCPE-IL) is to find a high-confidence lower bound $\hat{g} : \Pi \times \mathcal{D} \to \mathbb{R}$ such that

$$P\big(g(\pi_{\mathrm{eval}}, R^*) \geq \hat{g}(\pi_{\mathrm{eval}}, D)\big) \geq 1 - \delta, \quad (3)$$

in which $R^*$ denotes the demonstrator's true reward function and $\mathcal{D}$ denotes the space of all possible demonstration sets. HCPE-IL takes as input an evaluation policy $\pi_{\mathrm{eval}}$, a set of demonstrations $D$, and a performance statistic $g$, which evaluates a policy under a reward function. The goal of HCPE-IL is to return a high-confidence lower bound on the performance statistic $g(\pi_{\mathrm{eval}}, R^*)$.
5 Deep Bayesian Reward Extrapolation
In this and the following sections we describe our main contribution: a method for scaling Bayesian IRL to high-dimensional visual control tasks as a way to efficiently solve the HCPE-IL problem for complex imitation learning tasks. Our first insight is that the main bottleneck for standard Bayesian IRL (Ramachandran & Amir, 2007) is computing the likelihood function in Equation (2), which requires optimal Q-values. Thus, to make Bayesian IRL scale to high-dimensional visual domains, it is necessary to either efficiently approximate optimal Q-values or to formulate a new likelihood. Value-based reinforcement learning focuses on efficiently learning optimal Q-values; however, for visual control tasks such as Atari, RL algorithms can take several hours or even days to train (Mnih et al., 2015; Hessel et al., 2018). This makes MCMC, which requires evaluating large numbers of likelihood ratios, infeasible given the current state-of-the-art in value-based RL. Methods such as transfer learning have great potential to reduce the time needed to calculate optimal Q-values for a new proposed reward function; however, transfer learning is not guaranteed to speed up reinforcement learning (Taylor & Stone, 2009). Thus, we choose to focus on reformulating the likelihood function as a way to speed up Bayesian reward inference.

An ideal likelihood function requires little computation and minimal interaction with the environment. To accomplish this, we leverage recent work on learning control policies from preferences (Christiano et al., 2017; Palan et al., 2019; Bıyık et al., 2019). Given ranked demonstrations, Brown et al. (2019b) propose Trajectory-ranked Reward Extrapolation (T-REX): an efficient reward inference algorithm that transforms reward function learning into a classification problem via a pairwise ranking loss. T-REX removes the need to repeatedly sample from or partially solve an MDP in the inner loop, allowing IRL to scale to visual imitation learning domains such as Atari and to extrapolate beyond the performance of the best demonstration. However, T-REX only solves for a point estimate of the reward function. We now discuss how a similar approach based on a pairwise preference likelihood allows for efficient sampling from the posterior distribution over reward functions.
We assume access to a sequence of $m$ trajectories, $D = \{\tau_1, \ldots, \tau_m\}$, along with a set of pairwise preferences over trajectories, $\mathcal{P} = \{(i, j) : \tau_i \prec \tau_j\}$. Note that we do not require a total ordering over trajectories. These preferences may come from a human demonstrator or could be automatically generated by watching a learner improve at a task (Jacq et al., 2019) or via noise injection (Brown et al., 2019a). The benefit of pairwise preferences over trajectories is that we can now leverage a pairwise ranking loss to compute the likelihood of a set of preferences over demonstrations, given a parameterized reward function hypothesis $R_\theta$. We use the standard Bradley-Terry model (Bradley & Terry, 1952) to obtain the following pairwise ranking likelihood function, commonly used in learning-to-rank applications such as collaborative filtering (Volkovs & Zemel, 2014):

$$P(D, \mathcal{P} \mid R_\theta) = \prod_{(i,j) \in \mathcal{P}} \frac{e^{\beta R_\theta(\tau_j)}}{e^{\beta R_\theta(\tau_i)} + e^{\beta R_\theta(\tau_j)}}, \quad (4)$$

in which $R_\theta(\tau) = \sum_{s \in \tau} R_\theta(s)$ is the predicted return of trajectory $\tau$ under the reward function $R_\theta$, and $\beta$ is the inverse temperature parameter that models the confidence in the preference labels. We can then perform Bayesian inference via MCMC to obtain samples from $P(R_\theta \mid D, \mathcal{P}) \propto P(D, \mathcal{P} \mid R_\theta) P(R_\theta)$. We call this approach Bayesian Reward Extrapolation or Bayesian REX.

Note that using the likelihood function defined in Equation (4) does not require solving an MDP. In fact, it does not require any rollouts or access to the MDP. All that is required is that we first calculate the return of each trajectory under $R_\theta$ and then compare the relative predicted returns to the preference labels to determine the likelihood of the demonstrations under the reward hypothesis $R_\theta$. Thus, given preferences over demonstrations, Bayesian REX is significantly more efficient than standard Bayesian IRL. In the following section, we discuss further optimizations that improve the efficiency of Bayesian REX and make it more amenable to our end goal of high-confidence policy evaluation bounds.
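As a concrete illustration, the pairwise ranking likelihood in Equation (4) reduces to a few lines once trajectory returns are available; this is a minimal sketch with illustrative names, computed in log space for numerical stability.

```python
import math

# Minimal sketch of the Bayesian REX pairwise ranking likelihood
# (Equation 4). Names are illustrative, not the paper's code.

def pairwise_log_likelihood(returns, prefs, beta=1.0):
    """log P(P | R_theta) under the Bradley-Terry model.

    returns -- predicted return R_theta(tau_k) for each trajectory k
    prefs   -- (i, j) pairs meaning trajectory j is preferred over i
    beta    -- inverse temperature (confidence in the preference labels)
    """
    log_lik = 0.0
    for i, j in prefs:
        ri, rj = beta * returns[i], beta * returns[j]
        # log( e^rj / (e^ri + e^rj) ) via a stable log-sum-exp
        m = max(ri, rj)
        log_lik += rj - (m + math.log(math.exp(ri - m) + math.exp(rj - m)))
    return log_lik
```

Evaluating this likelihood touches only the $m$ trajectory returns, with no rollouts and no MDP solves, which is why MCMC over Equation (4) is fast.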
5.1 Optimizations
In order to learn rich, complex reward functions, it is desirable to use a deep network to represent the reward function $R_\theta$. While MCMC remains the gold standard for Bayesian neural networks, it is often challenging to scale to deep networks. To make Bayesian REX more efficient and practical, we propose to limit the proposal to only change the last layer of weights in $R_\theta$ when generating MCMC proposals—we will discuss pretraining the bottom layers of $R_\theta$ in the next section. After pretraining, we freeze all but the last layer of weights and use the activations of the penultimate layer as the latent reward features $\phi(s) \in \mathbb{R}^k$. This allows the reward at a state $s$ to be represented as a linear combination of features: $R_\theta(s) = w^T \phi(s)$. Similar to work by Pradier et al. (2018), operating in this lower-dimensional latent space makes full Bayesian inference tractable.

A second advantage of using a learned linear reward function is that it allows us to efficiently compute likelihood ratios when performing MCMC. Consider the likelihood function in Equation (4). If we do not represent $R_\theta$ as a linear combination of pretrained features, and instead let any parameter in $R_\theta$ change during each proposal, then for $m$ demonstrations of length $T$, computing the likelihood for a new proposal requires $O(mT)$ forward passes through the entire network to compute the trajectory returns $R_\theta(\tau_i)$. Thus, the complexity of generating $N$ samples from the posterior is $O(mTN \cdot |R_\theta|)$, where $|R_\theta|$ is the number of computations required for a full forward pass through the entire network $R_\theta$. Given that we would like to use a deep network to parameterize $R_\theta$ and generate thousands of samples from the posterior distribution over $R_\theta$, this many computations will significantly slow down MCMC proposal evaluation.
If we represent $R_\theta$ as a linear combination of pretrained features, we can reduce this computational cost because

$$R_\theta(\tau) = \sum_{s \in \tau} w^T \phi(s) = w^T \sum_{s \in \tau} \phi(s) = w^T \Phi_\tau. \quad (5)$$

Thus, we can precompute and cache $\Phi_{\tau_i} = \sum_{s \in \tau_i} \phi(s)$ for $i = 1, \ldots, m$, and the likelihood becomes

$$P(D, \mathcal{P} \mid R_\theta) = \prod_{(i,j) \in \mathcal{P}} \frac{e^{\beta w^T \Phi_{\tau_j}}}{e^{\beta w^T \Phi_{\tau_i}} + e^{\beta w^T \Phi_{\tau_j}}}. \quad (6)$$

Note that the demonstrations only need to be passed through the reward network once to compute $\Phi_{\tau_i}$, since the pretrained embedding $\phi$ remains constant during MCMC proposal generation. This results in an initial $O(mT)$ passes through all but the last layer of $R_\theta$ to obtain $\Phi_{\tau_i}$, for $i = 1, \ldots, m$, and then only $O(mk)$ multiplications per proposal evaluation thereafter—each proposal requires that we compute $w^T \Phi_{\tau_i}$ for $i = 1, \ldots, m$. Thus, when using feature pretraining, the total complexity is only $O(mT \cdot |R_\theta| + mkN)$ to generate $N$ samples via MCMC. This reduction in the complexity of MCMC from $O(mTN \cdot |R_\theta|)$ to $O(mT \cdot |R_\theta| + mkN)$ results in significant and practical computational savings because (1) we want to make $N$ and $T$ large and (2) the number of demonstrations, $m$, and the size of the latent embedding, $k$, are typically several orders of magnitude smaller than $N$ and $T$.
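The cached-feature scheme above can be sketched as a simple Metropolis-Hastings loop. This is a toy illustration: the flat prior, Gaussian proposal scale, and all names are assumptions for the sketch, not the paper's implementation (which is detailed in its Appendix). Note that each proposal costs only $k$-dimensional dot products over the cached $\Phi$ values.

```python
import math
import random

# Toy Metropolis-Hastings over the last-layer weights w, using cached
# per-trajectory feature sums Phi_tau. Flat prior and proposal scale
# are illustrative choices.

def mcmc_over_last_layer(phis, prefs, k, n_samples, beta=1.0, step=0.05):
    """phis  -- cached feature sums Phi_tau, one length-k list per trajectory
    prefs -- (i, j) pairs, trajectory j preferred over i
    Returns a list of n_samples weight vectors from the chain."""
    def dot(w, phi):
        return sum(wi * pi for wi, pi in zip(w, phi))

    def log_lik(w):
        ll = 0.0
        for i, j in prefs:  # Equation (6), in log space
            ri, rj = beta * dot(w, phis[i]), beta * dot(w, phis[j])
            m = max(ri, rj)
            ll += rj - (m + math.log(math.exp(ri - m) + math.exp(rj - m)))
        return ll

    w, ll = [0.0] * k, log_lik([0.0] * k)
    samples = []
    for _ in range(n_samples):
        w_new = [wi + random.gauss(0.0, step) for wi in w]  # random-walk proposal
        ll_new = log_lik(w_new)
        if math.log(random.random() + 1e-12) < ll_new - ll:  # Metropolis accept
            w, ll = w_new, ll_new
        samples.append(list(w))
    return samples
```

Because `log_lik` only multiplies $k$-vectors, thousands of proposals can be evaluated per second even when the frozen network $\phi$ is deep.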
A third, and critical, advantage of using a learned linear reward function is that it makes solving the HCPE-IL problem discussed in Section 4 tractable. Performing a single policy evaluation is a non-trivial task (Sutton et al., 2000) and even in tabular settings has complexity $O(|\mathcal{S}|^3)$, in which $|\mathcal{S}|$ is the size of the state space (Littman et al., 1995). Because we are in an imitation learning setting, we would like to be able to efficiently evaluate any given policy across the posterior distribution over reward functions found via Bayesian REX. Given a posterior distribution over $N$ reward function hypotheses, we would need to solve $N$ policy evaluations. However, note that given $R(s) = w^T \phi(s)$, the value function of a policy $\pi$ can be written as

$$V^\pi_R = \mathbb{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t w^T \phi(s_t)\Big] = w^T \mathbb{E}_\pi\Big[\sum_{t=0}^{T} \gamma^t \phi(s_t)\Big] = w^T \Phi_\pi, \quad (7)$$

in which we assume a finite-horizon MDP with horizon $T$ and in which $\Phi_\pi = \mathbb{E}_\pi[\sum_{t=0}^{T} \gamma^t \phi(s_t)]$ are the expected feature counts (Abbeel & Ng, 2004; Barreto et al., 2017) of $\pi$. Thus, given any evaluation policy $\pi_{\mathrm{eval}}$, we only need to solve one policy evaluation problem to compute $\Phi_{\pi_{\mathrm{eval}}}$. We can then compute the expected value of $\pi_{\mathrm{eval}}$ over the entire posterior distribution of reward functions via a single matrix-vector multiplication, $W \Phi_{\pi_{\mathrm{eval}}}$, where $W$ is an $N$ by $k$ matrix with each row corresponding to a single reward function weight hypothesis $w^T$. This significantly reduces the complexity of policy evaluation over the reward function posterior distribution from $O(N |\mathcal{S}|^3)$ to $O(Nk)$.

When we refer to Bayesian REX we will refer to the optimized version described in this section (see the Appendix for full implementation details and pseudocode). Running Bayesian REX with 144 preference labels to generate 100,000 reward hypotheses for Atari imitation learning tasks takes approximately 5 minutes on a Dell Inspiron 5577 personal laptop with an Intel i7-7700 processor, without using the GPU. In comparison, using standard Bayesian IRL to generate one sample from the posterior takes 10+ hours of training for a parallelized PPO reinforcement learning agent (Dhariwal et al., 2017) on an NVIDIA TITAN V GPU.
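The posterior policy-evaluation step described above can be sketched in a few lines; the weight samples in `W`, the feature counts `phi_pi`, and all names are illustrative values, not from the paper.

```python
# Sketch of evaluating a policy across the entire reward posterior with a
# single matrix-vector product (Equation 7 plus the W * Phi_pi trick).
# All numbers below are illustrative.

def posterior_returns(W, phi_pi):
    """V^pi_w = w . Phi_pi for every posterior weight sample w (rows of W)."""
    return [sum(wi * pi for wi, pi in zip(w, phi_pi)) for w in W]

W = [[1.0, 0.0],   # three posterior weight samples over k = 2 features
     [0.5, 0.5],
     [0.0, 1.0]]
phi_pi = [2.0, 4.0]  # expected discounted feature counts of the policy
post = posterior_returns(W, phi_pi)  # -> [2.0, 3.0, 4.0], one return per sample
```

The single expensive step is estimating $\Phi_\pi$ once; every additional posterior sample then costs only one $k$-dimensional dot product.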
5.2 Pretraining the Reward Function Network
The previous section presupposed access to a pretrained latent embedding function $\phi : \mathcal{S} \to \mathbb{R}^k$. We now discuss our pretraining process. Because we are interested in imitation learning problems, we need to be able to train $\phi$ from the demonstrations without access to the ground-truth reward function. One potential method is to train $R_\theta$ using the pairwise ranking likelihood function in Equation (4) and then freeze all but the last layer of weights; however, the learned embedding may overfit to the limited number of preferences over demonstrations and fail to capture features relevant to the ground-truth reward function. Thus, we supplement the pairwise ranking objective with auxiliary objectives that can be optimized in a self-supervised fashion using data from the demonstrations.
We use the following self-supervised tasks to pretrain $\phi$: (1) learn an inverse dynamics model that uses embeddings $\phi(s_t)$ and $\phi(s_{t+1})$ to predict the corresponding action $a_t$ (Torabi et al., 2018; Hanna & Stone, 2017), (2) learn a forward dynamics model that predicts $s_{t+1}$ from $\phi(s_t)$ and $a_t$ (Oh et al., 2015; Thananjeyan et al., 2019), (3) learn an embedding $\phi(s)$ that predicts the temporal distance between two randomly chosen states from the same demonstration (Aytar et al., 2018), and (4) train a variational pixel-to-pixel autoencoder in which $\phi(s)$ is the learned latent encoding (Makhzani & Frey, 2017; Doersch, 2016). Table 5 summarizes the auxiliary tasks used to train $\phi$.

There are many possibilities for pretraining $\phi$; however, we found that each objective described above encourages the embedding to encode different features. For example, an accurate inverse dynamics model can be learned by only attending to the movement of the agent. Learning forward dynamics supplements this by requiring $\phi$ to encode information about short-term changes to the environment. Learning to predict the temporal distance between states in a trajectory forces $\phi$ to encode long-term progress. Finally, the autoencoder loss acts as a regularizer to the other losses as it seeks to embed all aspects of the state (see the Appendix for details and visualizations of the learned embedding). The full Bayesian REX pipeline for generating samples from the posterior over reward functions is summarized in Figure 1.
Table 5: Self-supervised pretraining tasks used to learn $\phi$: inverse dynamics, forward dynamics, temporal distance, and variational autoencoder.
5.3 HCPE-IL via Bayesian REX
We now discuss how to use Bayesian REX to find an efficient solution to the high-confidence policy evaluation for imitation learning (HCPE-IL) problem (see Section 4). Given $N$ samples of $w$ from the posterior $P(R_\theta \mid D, \mathcal{P})$, in which $R_\theta(s) = w^T \phi(s)$, we compute the posterior distribution over any performance statistic $g$ as follows. For each sampled weight vector $w$ produced by Bayesian REX, we compute $g(\pi_{\mathrm{eval}}, w)$. This results in a sample from the posterior distribution $P(g(\pi_{\mathrm{eval}}, R) \mid D, \mathcal{P})$, i.e., the posterior distribution over the performance statistic $g$. We then compute a $(1 - \delta)$ confidence lower bound, $\hat{g}(\pi_{\mathrm{eval}}, D)$, by finding the $\delta$-quantile of $g(\pi_{\mathrm{eval}}, w)$ over the posterior samples.

While there are many potential performance statistics $g$, in this paper we focus on bounding the expected value of the evaluation policy, i.e., $g(\pi_{\mathrm{eval}}, R^*) = V^{\pi_{\mathrm{eval}}}_{R^*}$. To compute a $(1 - \delta)$ confidence bound on $V^{\pi_{\mathrm{eval}}}_{R^*}$, we take full advantage of the learned linear reward representation to efficiently calculate the posterior distribution over policy returns given preferences and demonstrations. The posterior distribution over returns is calculated via a matrix-vector product, $W \Phi_{\pi_{\mathrm{eval}}}$, in which each row of $W$ is a sample, $w$, from the MCMC chain and $\pi_{\mathrm{eval}}$ is the evaluation policy. We then sort the resulting vector and select the $\lfloor \delta N \rfloor$-th lowest value. This results in a $(1 - \delta)$ confidence lower bound on $V^{\pi_{\mathrm{eval}}}_{R^*}$ and corresponds to the $\delta$-Value at Risk (VaR) over $P(V^{\pi_{\mathrm{eval}}}_{R^*} \mid D, \mathcal{P})$ (Jorion, 1997).
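The $\delta$-VaR bound described above is simply an order statistic of the posterior return samples; a minimal sketch follows, with illustrative sample values.

```python
# Sketch of the delta-VaR lower bound: sort the posterior return samples and
# take the delta-quantile as the (1 - delta)-confidence lower bound.
# The sample values below are illustrative.

def var_lower_bound(posterior_returns, delta=0.05):
    """delta-quantile (delta-VaR) of the posterior return samples."""
    ordered = sorted(posterior_returns)
    return ordered[int(delta * len(ordered))]

post = [12.0, 9.5, 14.2, 8.1, 11.3, 10.0, 13.6, 9.9, 12.8, 10.7]
bound = var_lower_bound(post, delta=0.1)  # -> 9.5
```

With the 200,000-sample chains used in the experiments, this amounts to one sort and one index, so bounds for many candidate policies can be compared cheaply.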
6 Experimental Results
6.1 Imitation Learning via Bayesian REX
We first tested the imitation learning performance of Bayesian REX. We pretrained a 64-dimensional latent state embedding $\phi$ using the self-supervised losses shown in Table 5 along with the T-REX pairwise preference loss. We found via ablation studies that combining the T-REX loss with the self-supervised losses resulted in better performance than training with either the T-REX loss or the self-supervised losses alone (see Appendix for details). We then used Bayesian REX to generate 200,000 samples from the posterior $P(R_\theta \mid D, \mathcal{P})$. Finally, we took the MAP and mean reward function estimates from the posterior and optimized a policy for each using Proximal Policy Optimization (PPO) (Schulman et al., 2017) (see Appendix for details).
Table 2: Imitation learning performance (ground-truth game scores).

                 Ranked Demonstrations   Bayesian REX Mean   Bayesian REX MAP    T-REX     GAIL
Game             Best       Avg          Avg (Std)           Avg (Std)           Avg       Avg
Beam Rider       1332       686.0        5,504.7 (2121.2)    5,870.3 (1905.1)    3,335.7   355.5
Breakout         32         14.5         390.7 (48.8)        393.1 (63.7)        221.3     0.28
Enduro           84         39.8         487.7 (89.4)        135.0 (24.8)        586.8     0.28
Seaquest         600        373.3        734.7 (41.9)        606.0 (37.6)        747.3     0.0
Space Invaders   600        332.9        1,118.8 (483.1)     961.3 (392.3)       1,032.5   370.2
To test whether Bayesian REX scales to complex imitation learning tasks, we selected five Atari games from the Arcade Learning Environment (Bellemare et al., 2013). We do not give the RL agent access to the ground-truth reward signal and mask the game scores and number of lives both when learning and when sampling from the posterior distribution over reward functions. Table 2 shows the imitation learning performance of Bayesian REX. We also compare against the results reported by Brown et al. (2019b) for T-REX and GAIL (Ho & Ermon, 2016) and use the same 12 suboptimal demonstrations used by Brown et al. (2019b) to train Bayesian REX (see Appendix for details).

Table 2 shows that Bayesian REX is able to utilize preferences over demonstrations to infer an accurate reward function that enables better-than-demonstrator performance. The average ground-truth return for Bayesian REX surpasses the performance of the best demonstration across all 5 games. In comparison, GAIL seeks to match the demonstrator's state-action distributions, which makes imitation learning difficult when demonstrations are suboptimal and noisy. In addition to providing uncertainty information, Bayesian REX remains competitive with T-REX (which only finds a maximum-likelihood estimate of the reward function) and achieves better performance on 3 out of 5 games.
6.2 HighConfidence Policy Performance Bounds
Next, we ran an experiment to validate whether the posterior distribution generated by Bayesian REX can be used to solve the HCPE-IL problem described in Section 4. We first evaluated four different evaluation policies, A, B, C, and D, created by partially training a PPO agent on the ground-truth reward function and checkpointing the policy at various stages of learning. We ran Bayesian REX to generate 200,000 samples from $P(R_\theta \mid D, \mathcal{P})$. Because IRL is fundamentally ill-posed, we do not know the scale of the true reward $R^*$. Thus, the results from Bayesian REX are most useful when used to compare the relative performance of several different evaluation policies.
Table 3: Beam Rider policy evaluation.

            Predicted               Ground Truth Avg.
Policy      Mean       0.05-VaR     Score       Length
A           17.1       7.9          480.6       1372.6
B           22.7       11.9         703.4       1,412.8
C           45.5       24.9         1828.5      2,389.9
D           57.6       31.5         2586.7      2,965.0
NoOp        102.5      -1557.1      0.0         99,994.0
Table 4: Breakout policy evaluation.

Risk profiles given initial preferences
            Predicted               Ground Truth Avg.
Policy      Mean       0.05-VaR     Score       Length
A           1.5        0.5          1.9         202.7
B           6.3        3.7          15.8        608.4
C           10.6       5.8          27.7        849.3
D           13.9       6.2          41.2        1020.8
MAP         98.2       -370.2       401.0       8780.0
NoOp        41.2       1.0          0.0         7000.0

Risk profiles after rankings w.r.t. MAP and NoOp
A           0.7        0.3          1.9         202.7
B           8.7        5.5          15.8        608.4
C           18.3       12.1         27.7        849.3
D           26.3       17.1         41.2        1020.8
MAP         606.8      289.1        401.0       8780.0
NoOp        -5.0       -13.5        0.0         7000.0
The results for Beam Rider are shown in Table 3. We show results for partially trained RL policies AD. We found that the groundtruth returns for the checkpoints were highly correlated with the mean and 0.05VaR (5th percentile policy return) returns under the posterior. However, we also noticed that the trajectory length was also highly correlated with the groundtruth reward. If the reward function learned via IRL gives a small positive reward at every timestep, then long polices that do the wrong thing may look good under the posterior. To test this hypothesis we used a NoOp policy that seeks to exploit the learned reward function by not taking any actions. This allows the agent to live until the Atari emulator times out after 99,994 steps.
Table 3 shows that while the NoOp policy has a high expected return over the chain, looking at the 0.05VaR shows that the NoOp policy has high risk under the distribution, much lower than evaluation policy A. Our results demonstrate that reasoning about probabilistic worstcase performance may be one potential way to detect policies that exhibit socalled reward hacking (Amodei et al., 2016) or that have overfit to certain features in the demonstrations that are correlated with the intent of the demonstrations, but do not lead to desired behavior, a common problem in imitation learning (Ibarz et al., 2018; de Haan et al., 2019).
Table 4 contains policy evaluation results for the game Breakout. The top half of the table shows the mean return and 95%confidence lower bound on the expected return under the reward function posterior for four evaluation policies as well as the MAP policy found via Bayesian IRL and a NoOp policy that never chooses to release the ball. Both the MAP and NoOp policies have high expected returns under the reward function posterior, but also have high risk (low 0.05VaR). The MAP policy has much higher risk than the NoOp policy, despite good true performance. One likely reason is that, as shown in Table 2, the best demonstrations given to Bayesian REX only achieved a game score of 32. Thus, the MAP policy represents an out of distribution sample and thus has potentially high risk, since Bayesian REX was not trained on policies that hit any of the top layers of bricks. The ranked demonstrations do not give enough evidence to eliminate the possibility that only lower layers of bricks should be hit, but they do give strong evidence that missing the ball and quickly losing the game is bad. This results in the NoOp policy have a higher highconfidence performance bound over the posterior.
To test this hypothesis, we added two new ranked demonstrations, a single rollout from the MAP policy and a single rollout from the NoOp policy, to the original set of 12 ranked demonstrations and performed MCMC with these new preferences. As the bottom of Table 4 shows, adding two more ranked demonstrations results in a significant change in the risk profiles of the MAP and NoOp policies: the NoOp policy is now correctly predicted to have high risk and low expected returns, and the MAP policy now has a much higher 95%-confidence lower bound on performance.
6.3 Additional Experiments
We also tested Bayesian REX on a wide variety of human demonstrations and found that it is able to robustly rank evaluation policies relative to each other, even when some of the demonstrations are deceptive, e.g., moving and firing but not destroying enemies. We also conducted a more rigorous exploration of the risk-return trade-off for safe policy selection by varying the risk tolerance from 0 (maximally risk-averse) to 1 (maximally risk-tolerant) and then performing high-confidence policy selection based on the desired risk-tolerance threshold. Due to space constraints, these additional experimental results are included in the Appendix.
7 Conclusion
Bayesian reasoning is a powerful tool when dealing with uncertainty and risk; however, existing Bayesian inverse reinforcement learning algorithms require solving an MDP in the inner loop, rendering them intractable for complex problems in which solving an MDP may take several hours or even days. In this paper we propose a novel deep-learning algorithm, Bayesian Reward Extrapolation (Bayesian REX), that leverages preference labels over demonstrations to make Bayesian IRL tractable for high-dimensional visual imitation learning tasks. Bayesian REX can sample tens of thousands of reward functions from the posterior in a matter of minutes using a consumer laptop. We tested our approach on five Atari imitation learning tasks and demonstrated that Bayesian REX achieves state-of-the-art performance in 3 out of 5 games. Furthermore, Bayesian REX enables efficient high-confidence performance bounds for arbitrary evaluation policies. We demonstrated that these high-confidence bounds allow accurate comparison of different evaluation policies and provide a potential way to detect reward hacking and value misalignment.
We note that our proposed safety bounds are only safe with respect to the assumptions we make: good feature pretraining, rapid MCMC mixing, and accurate preferences over demonstrations. Future work includes using exploratory trajectories for better pretraining of the latent feature embeddings, developing methods to determine when a relevant feature is missing from the learned latent space, and using high-confidence performance bounds to perform safe policy optimization in the imitation learning setting.
References

Abbeel & Ng (2004)
Abbeel, P. and Ng, A. Y.
Apprenticeship learning via inverse reinforcement learning.
In
Proceedings of the 21st international conference on Machine learning
, 2004.  Amodei et al. (2016) Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
 Argall et al. (2009) Argall, B. D., Chernova, S., Veloso, M., and Browning, B. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009.
 Aytar et al. (2018) Aytar, Y., Pfaff, T., Budden, D., Paine, T. L., Wang, Z., and de Freitas, N. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018.
 Barreto et al. (2017) Barreto, A., Dabney, W., Munos, R., Hunt, J. J., Schaul, T., van Hasselt, H. P., and Silver, D. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pp. 4055–4065, 2017.

Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bıyık et al. (2019) Bıyık, E., Palan, M., Landolfi, N. C., Losey, D. P., and Sadigh, D. Asking easy questions: A user-friendly approach to active reward learning. In Conference on Robot Learning (CoRL), 2019.
 Bobu et al. (2018) Bobu, A., Bajcsy, A., Fisac, J. F., and Dragan, A. D. Learning under misspecified objective spaces. arXiv preprint arXiv:1810.05157, 2018.
 Bradley & Terry (1952) Bradley, R. A. and Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
 Brown & Niekum (2018) Brown, D. S. and Niekum, S. Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2018.
 Brown et al. (2018) Brown, D. S., Cui, Y., and Niekum, S. Risk-aware active inverse reinforcement learning. In Conference on Robot Learning (CoRL), 2018.
 Brown et al. (2019a) Brown, D. S., Goo, W., and Niekum, S. Better-than-demonstrator imitation learning via automatically-ranked demonstrations. In Conference on Robot Learning (CoRL), 2019a.
 Brown et al. (2019b) Brown, D. S., Goo, W., Prabhat, N., and Niekum, S. Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 2019b.
 Chow et al. (2015) Chow, Y., Tamar, A., Mannor, S., and Pavone, M. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, 2015.
 Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pp. 4299–4307, 2017.
 de Haan et al. (2019) de Haan, P., Jayaraman, D., and Levine, S. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pp. 11693–11704, 2019.
 Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. Openai baselines. https://github.com/openai/baselines, 2017.
 Doersch (2016) Doersch, C. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
 Eckersley (2018) Eckersley, P. Impossibility and uncertainty theorems in AI value alignment (or why your AGI should not have a utility function). arXiv preprint arXiv:1901.00064, 2018.
 Finn et al. (2016a) Finn, C., Christiano, P., Abbeel, P., and Levine, S. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016a.
 Finn et al. (2016b) Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 2016b.
 Fisac et al. (2020) Fisac, J. F., Gates, M. A., Hamrick, J. B., Liu, C., Hadfield-Menell, D., Palaniappan, M., Malik, D., Sastry, S. S., Griffiths, T. L., and Dragan, A. D. Pragmatic-pedagogic value alignment. In Robotics Research, pp. 49–57. Springer, 2020.
 Fu et al. (2017) Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
 Gal & Ghahramani (2016) Gal, Y. and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059, 2016.
 Garcıa & Fernández (2015) Garcıa, J. and Fernández, F. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
 Ghasemipour et al. (2019) Ghasemipour, S. K. S., Zemel, R., and Gu, S. A divergence minimization perspective on imitation learning methods. arXiv preprint arXiv:1911.02256, 2019.
 Ghavamzadeh et al. (2016) Ghavamzadeh, M., Petrik, M., and Chow, Y. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, pp. 2298–2306, 2016.
 Hadfield-Menell et al. (2016) Hadfield-Menell, D., Russell, S. J., Abbeel, P., and Dragan, A. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pp. 3909–3917, 2016.
 Hadfield-Menell et al. (2017) Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. Inverse reward design. In Advances in neural information processing systems, pp. 6765–6774, 2017.
 Hanna et al. (2019) Hanna, J., Niekum, S., and Stone, P. Importance sampling policy evaluation with an estimated behavior policy. In Proceedings of the 36th International Conference on Machine Learning (ICML), June 2019.
 Hanna & Stone (2017) Hanna, J. P. and Stone, P. Grounded action transformation for robot learning in simulation. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Hessel et al. (2018) Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Ho & Ermon (2016) Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
 Huang et al. (2018) Huang, J., Wu, F., Precup, D., and Cai, Y. Learning safe policies with expert guidance. In Advances in Neural Information Processing Systems, pp. 9105–9114, 2018.
 Ibarz et al. (2018) Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. Reward learning from human preferences and demonstrations in atari. In Advances in Neural Information Processing Systems, 2018.
 Jacq et al. (2019) Jacq, A., Geist, M., Paiva, A., and Pietquin, O. Learning from a learner. In International Conference on Machine Learning, pp. 2990–2999, 2019.
 Jorion (1997) Jorion, P. Value at risk. McGrawHill, New York, 1997.

Kendall & Gal (2017) Kendall, A. and Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pp. 5574–5584, 2017.
 Khan et al. (2018) Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. Fast and scalable bayesian deep learning by weight-perturbation in adam. arXiv preprint arXiv:1806.04854, 2018.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Lacotte et al. (2019) Lacotte, J., Ghavamzadeh, M., Chow, Y., and Pavone, M. Risk-sensitive generative adversarial imitation learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2154–2163, 2019.
 Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413, 2017.
 Laskey et al. (2017) Laskey, M., Lee, J., Fox, R., Dragan, A., and Goldberg, K. Dart: Noise injection for robust imitation learning. Conference on Robot Learning (CoRL), 2017.
 Leike et al. (2017) Leike, J., Martic, M., Krakovna, V., Ortega, P. A., Everitt, T., Lefrancq, A., Orseau, L., and Legg, S. Ai safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
 Littman et al. (1995) Littman, M. L., Dean, T. L., and Kaelbling, L. P. On the complexity of solving markov decision problems. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.
 Liu & Wang (2016) Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances in neural information processing systems, pp. 2378–2386, 2016.

MacKay (1992) MacKay, D. J. A practical bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
 Maddox et al. (2019) Maddox, W. J., Izmailov, P., Garipov, T., Vetrov, D. P., and Wilson, A. G. A simple baseline for bayesian uncertainty in deep learning. In Advances in Neural Information Processing Systems, pp. 13132–13143, 2019.
 Makhzani & Frey (2017) Makhzani, A. and Frey, B. J. Pixelgan autoencoders. In Advances in Neural Information Processing Systems, pp. 1975–1985, 2017.
 Milli et al. (2017) Milli, S., Hadfield-Menell, D., Dragan, A., and Russell, S. Should robots be obedient? In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4754–4760, 2017.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Ng & Russell (2000) Ng, A. Y. and Russell, S. J. Algorithms for inverse reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 663–670, 2000.
 Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
 Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pp. 2863–2871, 2015.
 Palan et al. (2019) Palan, M., Landolfi, N. C., Shevchuk, G., and Sadigh, D. Learning reward functions by integrating human demonstrations and preferences. In Proceedings of Robotics: Science and Systems (RSS), June 2019.
 Petrik & Russell (2019) Petrik, M. and Russell, R. H. Beyond confidence regions: Tight bayesian ambiguity sets for robust mdps. arXiv preprint arXiv:1902.07605, 2019.
 Pomerleau (1991) Pomerleau, D. A. Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3(1):88–97, 1991.
 Pradier et al. (2018) Pradier, M. F., Pan, W., Yao, J., Ghosh, S., and DoshiVelez, F. Projected bnns: Avoiding weightspace pathologies by learning latent representations of neural network weights. arXiv preprint arXiv:1811.07006, 2018.
 Ramachandran & Amir (2007) Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artifical intelligence, pp. 2586–2591, 2007.
 Ross et al. (2011) Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635, 2011.
 Sadigh et al. (2017) Sadigh, D., Dragan, A. D., Sastry, S. S., and Seshia, S. A. Active preference-based learning of reward functions. In Proceedings of Robotics: Science and Systems (RSS), 2017.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Sun et al. (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional variational bayesian neural networks. arXiv preprint arXiv:1903.05779, 2019.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063, 2000.
 Tamar et al. (2015) Tamar, A., Glassner, Y., and Mannor, S. Optimizing the CVaR via sampling. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2993–2999, 2015.
 Taylor & Stone (2009) Taylor, M. E. and Stone, P. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
 Thananjeyan et al. (2019) Thananjeyan, B., Balakrishna, A., Rosolia, U., Li, F., McAllister, R., Gonzalez, J. E., Levine, S., Borrelli, F., and Goldberg, K. Safety augmented value estimation from demonstrations (SAVED): Safe deep model-based RL for sparse cost robotic tasks. arXiv preprint arXiv:1905.13402, 2019.
 Thomas et al. (2015) Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High-confidence off-policy evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3000–3006, 2015.
 Torabi et al. (2018) Torabi, F., Warnell, G., and Stone, P. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), July 2018.
 Volkovs & Zemel (2014) Volkovs, M. N. and Zemel, R. S. New learning methods for supervised and unsupervised preference aggregation. The Journal of Machine Learning Research, 15(1):1135–1176, 2014.
 Zhang & Cho (2017) Zhang, J. and Cho, K. Query-efficient imitation learning for end-to-end simulated driving. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 1433–1438, 2008.
Appendix A MCMC Details
We represent the reward function as a linear combination of pretrained features:

(8)  R(s) = w^T φ(s).

We precompute and cache Φ_{τ_i} = Σ_{s ∈ τ_i} φ(s) for each demonstration τ_i, and the likelihood becomes

(9)  P(P, D | w) = ∏_{(i,j) ∈ P} exp(β w^T Φ_{τ_j}) / ( exp(β w^T Φ_{τ_i}) + exp(β w^T Φ_{τ_j}) ).
We enforce constraints on the weight vectors by normalizing the output of the weight vector proposal such that ‖w‖₂ = 1, and we use a Gaussian proposal function centered on the current sample with standard deviation σ. Thus, given the current sample w_t, the proposal is defined as w_{t+1} = normalize(N(w_t, σ)), in which normalize divides by the L2 norm of the sample to project back onto the surface of the L2 unit ball. For all experiments except Seaquest we used a default step size of 0.005; for Seaquest we increased the step size to 0.05. We run 200,000 steps of MCMC, use a burn-in of 5,000 samples, and keep only every 20th sample to reduce autocorrelation. We initialize the MCMC chain with a randomly chosen vector on the L2 unit ball. Because inverse reinforcement learning is ill-posed, there are an infinite number of reward functions that could match any set of demonstrations. Prior work by Finn et al. (2016b) demonstrates that strong regularization is needed when learning cost functions via deep neural networks. To ensure that the learned rewards allow good policy optimization when fed into an RL algorithm, we used a non-negative prior on the return of the lowest-ranked demonstration. The prior takes the following form:
(10)  P(w) ∝ 1 if w^T Φ_{τ_worst} ≥ 0, and 0 otherwise,

where τ_worst is the lowest-ranked demonstration. This forces MCMC to not only find reward function weights that match the rankings, but also to find weights such that the return of the worst demonstration is non-negative. If the return of the worst demonstration was negative during proposal generation, then we assigned the proposal a prior probability of 0. Because the ranking likelihood is invariant to affine transformations of the rewards, this prior simply shifts the range of learned returns and does not affect the log-likelihood ratios.

Appendix B Pretraining Latent Reward Features
We experimented with several pretraining methods. One method is to train φ using the pairwise ranking likelihood in Equation (9) and then freeze all but the last layer of weights; however, the learned embedding may overfit to the limited number of preferences over demonstrations and fail to capture features relevant to the ground-truth reward function. Thus, we supplement the pairwise ranking objective with auxiliary objectives that can be optimized in a self-supervised fashion using data from the demonstrations.
We use the following self-supervised tasks to pretrain φ: (1) learn an inverse dynamics model that uses embeddings φ(s_t) and φ(s_{t+1}) to predict the corresponding action a_t (Torabi et al., 2018; Hanna & Stone, 2017); (2) learn a forward dynamics model that predicts φ(s_{t+1}) from φ(s_t) and a_t (Oh et al., 2015; Thananjeyan et al., 2019); (3) learn an embedding that predicts the temporal distance between two randomly chosen states from the same demonstration (Aytar et al., 2018); and (4) train a variational pixel-to-pixel autoencoder in which φ(s) is the learned latent encoding (Makhzani & Frey, 2017; Doersch, 2016). Table 5 summarizes the auxiliary tasks used to train φ.
There are many possibilities for pretraining φ; however, we found that each objective described above encourages the embedding to encode different features. For example, an accurate inverse dynamics model can be learned by attending only to the movement of the agent. Learning forward dynamics supplements this by requiring φ to encode information about short-term changes to the environment. Learning to predict the temporal distance between states in a trajectory forces φ to encode long-term progress. Finally, the autoencoder loss acts as a regularizer for the other losses, as it seeks to embed all aspects of the state.
Table 5: Self-supervised pretraining tasks.
Inverse Dynamics
Forward Dynamics
Temporal Distance
Variational Autoencoder
In the Atari domain, input to the network is given visually as grayscale frames resized to 84 × 84. To provide temporal information, four sequential frames are stacked one on top of another to create a framestack, which provides a brief snapshot of activity. The network takes a framestack and applies four convolutional layers, following an architecture similar to Christiano et al. (2017) and Brown et al. (2019b), with leaky ReLU units as nonlinearities after each convolutional layer. The four convolutional layers use strides of 3, 2, 1, and 1, respectively. The convolved image is then flattened. Two sequential fully connected layers, with a leaky ReLU applied to the first, transform the flattened image into the encoding φ(s), where s is the initial framestack. The width of these layers depends on the chosen size of the feature encoding. In our experiments, with a latent dimension of 64, the first layer transforms from size 784 to 128 and the second from 128 to 64.
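The shape arithmetic behind the flattened size of 784 can be checked with a short sketch. The strides (3, 2, 1, 1) come from the text; the kernel sizes (7, 5, 3, 3), the 16 output channels of the last convolution, and the 84 × 84 input are assumptions borrowed from the similar Brown et al. (2019b) architecture:

```python
def conv_out(size, kernel, stride):
    # spatial output size of a padding-free ("valid") convolution
    return (size - kernel) // stride + 1

# (kernel, stride) per layer; kernel sizes are assumed, strides are from the text
layers = [(7, 3), (5, 2), (3, 1), (3, 1)]
size = 84  # assumed 84x84 grayscale input frames
for kernel, stride in layers:
    size = conv_out(size, kernel, stride)

flattened = size * size * 16  # 16 channels assumed in the final conv layer
print(flattened)  # 784, matching the input width of the first fully connected layer
```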
See Figure 2 for a complete diagram of this process.
Architectural information for each auxiliary task is given below.

The variational autoencoder (VAE) tries to reconstruct the original framestack from the feature encoding using transposed convolutions. Mirroring the structure of the initial convolutions, two fully connected layers precede four transposed convolution layers. The first two layers transform the 64-dimensional feature encoding from 64 to 128 and from 128 to 1568. The four transposed convolution layers use strides of 1, 1, 2, and 1, respectively. A cross-entropy loss is applied between the reconstructed image and the original, and a term is added to penalize the KL divergence of the latent distribution from the unit normal.

A temporal difference estimator, which takes two random feature encodings from the same demonstration and predicts the number of timesteps between them. It is a single fully-connected layer, transforming the concatenated feature encodings into a scalar time difference. A mean-squared error loss is applied between the real and predicted differences.

An inverse dynamics model, which takes two sequential feature encodings and predicts the action taken between them. It is again a single fully-connected layer, trained as a classification problem with a cross-entropy loss over the discrete action set.

A forward dynamics model, which takes a concatenated feature encoding and action and predicts the next feature encoding with a single fully-connected layer. This prediction is repeated 5 times, which increases the difference between the initial and final encodings. It is trained using a mean-squared error loss between the predicted and real feature encodings.

A T-REX loss, which samples feature encodings from two different demonstrations and tries to predict which one is preferred over the other. This is done with a single fully-connected layer that transforms an encoding into a scalar reward; it is then trained as a classification problem with a binary cross-entropy loss, assigning a label of 1 to the sample from the preferred demonstration and 0 to the other.
In order to encourage a feature encoding whose information is easily interpretable via linear combinations, the temporal difference, T-REX, inverse dynamics, and forward dynamics tasks each consist of only a single layer atop the feature encoding space rather than multiple layers.
To compute the final loss for the backwards pass, all of the losses described above are summed, with weights determined empirically to balance their magnitudes.
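The single-layer heads and the weighted total loss can be sketched in numpy as follows. The head shapes, the 6-action set, the random inputs, and the unit loss weights are illustrative assumptions, not the trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # latent dimension used in the paper

# hypothetical single-layer heads on top of the 64-d encoding
W_time = rng.normal(size=(2 * D,))    # temporal distance: concat(phi_s, phi_t) -> scalar
W_inv = rng.normal(size=(2 * D, 6))   # inverse dynamics: concat -> action logits (6 actions assumed)
W_fwd = rng.normal(size=(D + 6, D))   # forward dynamics: concat(phi, one-hot action) -> next phi
w_rex = rng.normal(size=(D,))         # T-REX head: phi -> scalar reward

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def softmax_xent(logits, label):
    z = logits - logits.max()  # stabilized softmax cross-entropy
    return float(np.log(np.exp(z).sum()) - z[label])

phi_s, phi_t = rng.normal(size=D), rng.normal(size=D)  # two state encodings
action, dt = 3, 17.0                                   # fake action and time gap
a_onehot = np.eye(6)[action]

loss_time = mse(np.concatenate([phi_s, phi_t]) @ W_time, dt)
loss_inv = softmax_xent(np.concatenate([phi_s, phi_t]) @ W_inv, action)
loss_fwd = mse(np.concatenate([phi_s, a_onehot]) @ W_fwd, phi_t)
# T-REX ranking: label 1 means the second snippet is preferred
loss_rex = softmax_xent(np.array([phi_s @ w_rex, phi_t @ w_rex]), 1)

# weights balancing the loss magnitudes are determined empirically; 1.0 here
total = 1.0 * loss_time + 1.0 * loss_inv + 1.0 * loss_fwd + 1.0 * loss_rex
```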
B.1 Training Specifics
We used an NVIDIA TITAN V GPU for training the embedding. We used the same 12 demonstrations used for MCMC to train the self-supervised and ranking losses described above. We sample 60,000 trajectory snippet pairs from the demonstration pool, where each snippet is between 50 and 100 timesteps long. We use a learning rate of 0.001 and a weight decay of 0.001. We make a single pass through all of the training data with a batch size of 1, resulting in 60,000 updates using the Adam optimizer (Kingma & Ba, 2014). For Enduro, prior work (Brown et al., 2019b) showed that full trajectories resulted in better performance than subsampled snippets. Thus, for Enduro we subsample 10,000 pairs of entire trajectories by randomly selecting a starting time between 0 and 5 steps after the initial state, then skipping every t frames, where t is chosen uniformly at random, and we train with two passes through the training data. When subsampling either snippets or full trajectories, we subsample pairs of trajectories such that one is from a worse-ranked demonstration and one is from a better-ranked demonstration, following the procedure outlined in Brown et al. (2019b).
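The snippet-pair subsampling can be sketched as follows. This is a simplified illustration: Brown et al. (2019b) additionally constrain where each snippet may start within its trajectory, which is omitted here:

```python
import random

random.seed(0)

def sample_snippet_pair(demos, min_len=50, max_len=100):
    """demos is a list of trajectories sorted worst to best. Returns a
    (worse_snippet, better_snippet) pair drawn from two differently
    ranked demonstrations, each 50-100 timesteps long."""
    i, j = sorted(random.sample(range(len(demos)), 2))  # i is ranked below j
    length = random.randint(min_len, max_len)
    worse, better = demos[i], demos[j]
    si = random.randint(0, max(0, len(worse) - length))
    sj = random.randint(0, max(0, len(better) - length))
    return worse[si:si + length], better[sj:sj + length]
```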
Appendix C Visualizations of Latent Space
C.1 Individual Dimensions
Figure 3 contains samples taken from the feature encoding space for Space Invaders, generated by linearly varying dimension 16 of the encoding. Notice how the value of this dimension has a clear effect on the network's reconstruction of the game image. The framestacks on the left, with a lower value in dimension 16, have enemy ships concentrated toward the top left, while the framestacks with a higher value in dimension 16 have enemy ships concentrated toward the top right.
This is in contrast to Seaquest, shown in Figure 4, where the autoencoder predicts roughly the same output for every framestack. This may be due to the freely moving objects in Seaquest, which are fixed in common positions much less often than in Space Invaders. Future work is needed to investigate exactly why some domains work so much better than others, and how to encourage the autoencoder to pick up on more fleeting features.
C.2 Random Samples
While the samples in Figure 3 and Figure 4 were selected to illustrate specific points about the latent encoding, Figure 8 and Figure 9 provide four entirely random samples from the latent space of each of the five Atari games tested. Figure 8, from a network trained with the T-REX loss, has noticeably lower-quality reconstructions than Figure 9, which was trained without the T-REX loss. However, networks trained with the T-REX loss tend to perform better when running MCMC, suggesting that the feature encoding contains additional information not recognized by the autoencoder.
C.3 Visualizations of Learned Features
Viewable at https://www.youtube.com/watch?v=DMf8kNH9nVg is a video containing an Enduro demonstration trajectory, its decoding with respect to the pretrained autoencoder, and a plot of the dimensions of the latent encoding over time. Observe how changes in the demonstration, such as turning right or left or a shift, correspond to changes in the plots of the feature embedding. We noticed that certain features increase when the agent passes other cars, while other features decrease when the agent gets passed. This is evidence that pretraining has learned features relevant to the ground-truth reward, which gives +1 every time the agent passes a car and −1 every time the agent gets passed.
A similar visualization of the latent space for Space Invaders is viewable at https://www.youtube.com/watch?v=2uN5uD17H6M. Notice how it tends to focus on the movement of enemy ships, which is useful for measuring game progress in objectives such as the temporal distance loss, but seems to ignore the player's ship despite its utility for the inverse dynamics loss. The information likely exists in the encoding but is not included in the output of the autoencoder.
A visualization of the latent space for Breakout is viewable at https://www.youtube.com/watch?v=8zgbD1fZOH8. Observe that breaking a brick often results in a small spike in the latent encoding. Many dimensions, like the dark green curve that begins at the lowest value, seem to invert as the game progresses, thus acting as a measure of how much time has passed.
Appendix D Imitation Learning Ablations for Pretraining
Table 6 shows the results of pretraining reward features using different losses. We experimented with using only the T-REX ranking loss (Brown et al., 2019b), using only the self-supervised losses shown in Table 1 of the main paper, and using the T-REX ranking loss combined with the self-supervised losses. We found that performance varied across the different pretraining schemes, but that Ranking + Self-Supervised achieved high performance across all games, clearly outperforming the self-supervised losses alone and achieving superior performance to the ranking loss alone on 3 out of 5 games.
Ranking Loss  Self-Supervised  Ranking + Self-Supervised  

Game  Mean  MAP  Mean  MAP  Mean  MAP 
Beam Rider  3816.7  4275.7  180.4  143.7  5870.3  5504.7 
Breakout  389.9  409.5  360.1  367.4  393.1  390.7 
Enduro  472.7  479.3  0.0  0.0  135.0  487.7 
Seaquest  675.3  670.7  674.0  683.3  606.0  734.7 
Space Invaders  1482.0  1395.5  391.2  396.2  961.3  1118.8 
Appendix E Suboptimal Demonstration Details
We used the same suboptimal demonstrations used by Brown et al. (2019b) for comparison. These demonstrations were obtained by running PPO on the ground-truth reward and checkpointing every 50 updates using OpenAI Baselines (Dhariwal et al., 2017). Brown et al. (2019b) make the checkpoint files available, so to generate the demonstration data we used their saved checkpoints and followed the instructions in their released code (downloaded from https://github.com/hiwonjoon/ICML2019TREX). We gave Bayesian REX these demonstrations as well as ground-truth rankings derived from the game score; however, other than the rankings, Bayesian REX has no access to the true reward samples. Following the recommendations of Brown et al. (2019b), we mask the game score and other parts of the screen that are directly indicative of the score, such as the number of enemy ships left, the number of lives left, and the level number. See Brown et al. (2019b) for full details.
Appendix F Reinforcement Learning Details
We used the OpenAI Baselines implementation of Proximal Policy Optimization (PPO) (Schulman et al., 2017; Dhariwal et al., 2017). We used the default hyperparameters for all games and all experiments. We run RL for 50 million frames and then take the final checkpoint to perform evaluations. We adapted the OpenAI Baselines code so that, even though the RL agent receives a standard preprocessed observation, it only receives samples of the reward learned via Bayesian REX rather than the ground-truth reward. T-REX (Brown et al., 2019b) uses a sigmoid to normalize rewards before passing them to the RL algorithm; however, we obtained better performance for Bayesian REX by feeding the unnormalized predicted reward into PPO for policy optimization. We follow the OpenAI Baselines default preprocessing for the framestacks that are fed into the RL algorithm as observations. We also apply the default OpenAI Baselines wrappers to the environments. We run PPO with 9 workers on an NVIDIA TITAN V GPU.

Appendix G High-Confidence Policy Performance Bounds
In this section we describe the details of the policy performance bounds.
G.1 Policy Evaluation Details
We estimated the expected feature counts $\Phi_{\pi_{\text{eval}}}$ using $C$ Monte Carlo rollouts for each evaluation policy. Thus, after generating $C$ rollouts $\tau_1, \ldots, \tau_C$ from $\pi_{\text{eval}}$, the feature expectations are computed as

$$\hat{\Phi}_{\pi_{\text{eval}}} = \frac{1}{C} \sum_{i=1}^{C} \sum_{s \in \tau_i} \phi(s). \qquad (11)$$

We used $C = 100$ for all experiments.
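This computation, together with the 0.05-VaR bound reported in the tables, can be sketched as follows. The array shapes and helper names are our own assumptions; the released code may organize this differently.

```python
import numpy as np

def estimate_feature_expectations(rollout_features):
    """Monte Carlo estimate of Phi_pi (Eq. 11): sum the per-state feature
    vectors phi(s) within each rollout, then average over the C rollouts.

    rollout_features: list of C rollouts, each an array of shape (T_i, k)
    holding phi(s) for every state visited in that rollout.
    """
    C = len(rollout_features)
    return sum(np.asarray(traj).sum(axis=0) for traj in rollout_features) / C

def posterior_returns(W, phi_hat):
    """Expected return of the evaluation policy under each posterior sample.

    W: (M, k) matrix of reward-weight samples from the MCMC chain. Because
    the learned reward is linear in the features, each return is a dot
    product w @ Phi_pi.
    """
    return W @ phi_hat

def value_at_risk(returns, alpha=0.05):
    """alpha-quantile of the posterior return distribution; alpha = 0.05
    gives the 95%-confidence lower bound reported in the tables."""
    return np.quantile(returns, alpha)
```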
G.2 Evaluation Policies
We evaluated several different evaluation policies. To see if the learned reward function posterior can interpolate and extrapolate, we created four different evaluation policies: A, B, C, and D. These policies were created by running RL via PPO on the ground-truth reward for the different Atari games; we then checkpointed the policy and selected checkpoints that would result in different levels of performance. For all games except Enduro, these checkpoints correspond to 25, 325, 800, and 1450 update steps using OpenAI Baselines. For Enduro, PPO performance was stuck at 0 return until much later in learning, so to ensure diversity in the evaluation policies we used checkpoints corresponding to 3125, 3425, 3900, and 4875 update steps. We also evaluated each game with a No-Op policy. No-Op policies are often adversarial for games such as Seaquest, Breakout, and Beam Rider, since they allow the agent to live for a very long time without actually playing the game: a potential way to hack the learned reward, since most learned rewards for Atari will incentivize longer gameplay.
The results for Beam Rider and Breakout are shown in the main paper. For completeness, we include the high-confidence policy evaluation results for the other games here in the Appendix. Table 7 shows the high-confidence policy evaluation results for Enduro. Both the average returns over the posterior and the high-confidence performance bounds (0.05-VaR) are accurate predictors of the ground-truth performance. The No-Op policy results in the race car slowly moving along the track and losing the race; this policy is accurately predicted as being much worse than the other evaluation policies. We also evaluated the Mean and MAP policies found by optimizing the mean reward and MAP reward from the posterior obtained using Bayesian REX. We found that the learned posterior is able to capture that the MAP policy is more than twice as good as evaluation policy D and that the Mean policy has performance somewhere between that of policies B and C. These results show that Bayesian REX has the potential to predict better-than-demonstrator performance (Brown et al., 2019a).
Table 7: High-confidence policy evaluation for Enduro.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
A  324.7  48.2  7.3  3322.4 
B  328.9  52.0  26.0  3322.4 
C  424.5  135.8  145.0  3389.0 
D  526.2  192.9  199.8  3888.2 
Mean  1206.9  547.5  496.7  7249.4 
MAP  395.2  113.3  133.6  3355.7 
NoOp  245.9  31.7  0.0  3322.0 
Table 8 shows the results for high-confidence policy evaluation for Seaquest. The high-confidence performance bounds accurately predict that evaluation policies A and B are worse than C and D. The ground-truth performance of policies C and D is too close for the mean performance over the posterior and the 0.05-VaR bound to find any statistical difference between them. Interestingly, the No-Op policy has a very high mean and 95%-confidence lower bound, despite not scoring any points. However, as shown in the bottom half of Table 8, adding one more ranked demonstration from a 3000-step segment of a No-Op policy solves this problem. These results motivate a natural human-in-the-loop approach to safe imitation learning.
Table 8: High-confidence policy evaluation for Seaquest.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
A  24.3  10.8  338.6  1077.8 
B  53.6  24.1  827.2  2214.1 
C  56.0  25.4  872.2  2248.5 
D  55.8  25.3  887.6  2264.5 
NoOp  2471.6  842.5  0.0  99994.0 
Results after adding one ranked demo from NoOp  
A  0.5  0.5  338.6  1077.8 
B  3.7  2.0  827.2  2214.1 
C  3.8  2.1  872.2  2248.5 
D  3.2  1.5  887.6  2264.5 
NoOp  321.7  578.2  0.0  99994.0 
Finally, Table 9 shows the results for high-confidence policy evaluation for Space Invaders. Both the mean performance and the 95%-confidence lower bound are good indicators of ground-truth performance for the evaluation policies. The No-Op policy for Space Invaders results in the agent getting hit by alien lasers early in the game. The learned reward function posterior correctly assigns it low average performance and high risk (a low 95%-confidence lower bound).
Table 9: High-confidence policy evaluation for Space Invaders.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
A  45.1  20.6  195.3  550.1 
B  108.9  48.7  436.0  725.7 
C  148.7  63.6  575.2  870.6 
D  150.5  63.8  598.2  848.2 
Mean  417.4  171.7  1143.7  1885.7 
MAP  360.2  145.0  928.0  1629.5 
NoOp  18.8  3.8  0.0  504.0 
Appendix H Different Evaluation Policies
To test Bayesian REX on different learned policies, we took a policy trained with RL on the ground-truth reward function for Beam Rider, the MAP policy learned via Bayesian REX for Beam Rider, and a policy trained with an earlier version of Bayesian REX (trained without all of the auxiliary losses) that learned a novel reward hack: the policy repeatedly presses left and then right, enabling the agent's ship to stay in between two of the firing lanes of the enemies. This imitation-learning reward hack allows the agent to live for a very long time. We took a 2000-step prefix of each policy and evaluated the expected and 5th-percentile worst-case predicted returns for each. We found that Bayesian REX is able to accurately predict that the reward-hacking policy is worse than both the RL policy and the policy optimizing the Bayesian REX reward. However, we found that the Bayesian REX policy, while not performing as well as the RL policy, was given a higher expected return and a higher lower bound on performance than the RL policy. Results are shown in Table 10.
Table 10: Policy evaluation for an RL policy, the Bayesian REX (B-REX) policy, and a reward-hacking policy on Beam Rider.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
RL  36.7  19.5  2135.2  2000 
BREX  68.1  38.1  649.4  2000 
Hacking  28.8  10.2  2.2  2000 
Appendix I Human Demonstrations
To investigate whether Bayesian REX is able to correctly rank human demonstrations, one of the authors provided demonstrations of a variety of different behaviors. We then took the latent embeddings of the demonstrations and used the posterior distribution to find high-confidence performance bounds for these different rollouts.
I.1 Beam Rider
We generated four human demonstrations: (1) good, a demonstration that plays the game well; (2) bad, a demonstration that seeks to play the game but does a poor job; (3) suicidal, a demonstration that does not shoot enemies and seeks out enemy bullets; and (4) adversarial, a demonstration that pretends to play the game by moving and shooting as much as possible but tries to avoid actually shooting enemies. The results of high-confidence policy evaluation are shown in Table 11. The high-confidence bounds and average performance over the posterior correctly rank the behaviors. This provides evidence that the learned linear reward correctly rewards actually destroying aliens and avoiding getting shot, rather than just flying around and shooting.
Table 11: Policy evaluation for human demonstrations on Beam Rider.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
good  12.4  5.8  1092  1000.0 
bad  10.7  4.5  396  1000.0 
suicidal  6.6  0.8  0  1000.0 
adversarial  8.4  2.4  176  1000.0 
I.2 Space Invaders
For Space Invaders we demonstrated an even wider variety of behaviors to see how Bayesian REX would rank their relative performance. We evaluated the following demonstrations: (1) good, a demonstration that attempts to play the game as well as possible; (2) every other, a demonstration that only shoots aliens in the 2nd and 4th columns; (3) flee, a demonstration that does not shoot aliens but tries to always be moving while avoiding enemy lasers; (4) hide, a demonstration that does not shoot and hides behind one of the barriers to avoid enemy bullets; (5) suicidal, a demonstration that seeks out enemy bullets while not shooting; (6) shoot shelters, a demonstration that tries to destroy its own shelters by shooting at them; (7) hold 'n fire, a demonstration where the player rapidly fires but does not move to avoid enemy lasers; and (8) miss, a demonstration where the demonstrator tries to fire without hitting any aliens while avoiding enemy lasers.
Table 12: Policy evaluation for human demonstrations on Space Invaders.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
good  198.3  89.2  515  1225.0 
every other  56.2  25.9  315  728.0 
hold ’n fire  44.3  18.6  210  638.0 
shoot shelters  47.0  20.6  80  712.0 
flee  45.1  19.8  0  722.0 
hide  83.0  39.0  0  938.0 
miss  66.0  29.9  0  867.0 
suicidal  0.5  13.2  0  266.0 
Table 12 shows the results of evaluating the different demonstrations. The good demonstration is clearly the best-performing demonstration in terms of both mean performance and the 95%-confidence lower bound on performance, and the suicidal demonstration is correctly given the lowest performance lower bound. However, we found that the length of the demonstration appears to have a strong effect on the predicted performance for Space Invaders. Demonstrations such as hide and miss are able to survive for much longer than demonstrations that actually hit aliens, which results in them having higher 0.05-quantile worst-case predicted performance and higher mean performance.
To study this further, we looked at only the first 600 timesteps of each demonstration, to remove any confounding by the length of the trajectory. The results are shown in Table 13. With a fixed demonstration length, Bayesian REX is able to correctly rank good, every other, and hold 'n fire as the best demonstrations, despite the deceptive nature of the other evaluation policies.
Table 13: Policy evaluation for 600-step prefixes of the human demonstrations on Space Invaders.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
good  47.8  22.8  515  600.0 
every other  34.6  15.0  315  600.0 
hold ’n fire  40.9  17.1  210  600.0 
shoot shelters  33.0  13.3  80  600.0 
flee  31.3  11.9  0  600.0 
hide  32.4  13.8  0  600.0 
miss  30.0  11.3  0  600.0 
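The fixed-horizon comparison amounts to truncating each trajectory's feature counts before evaluation. A minimal sketch, with function and argument names of our own choosing:

```python
import numpy as np

def prefix_feature_counts(traj_features, horizon=600):
    """Sum the per-state features phi(s) over only the first `horizon`
    states of a trajectory, so demonstrations of different lengths are
    compared on an equal footing. Shorter trajectories are used whole.
    """
    return np.asarray(traj_features)[:horizon].sum(axis=0)
```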
I.3 Enduro
For Enduro we tested four different human demonstrations: (1) good, a demonstration that seeks to play the game well; (2) periodic, a demonstration that alternates between speeding up to pass cars and slowing down to be passed; (3) neutral, a demonstration that stays right next to the last car in the race and does not try to pass or get passed; and (4) ram, a demonstration that tries to ram into as many cars as possible while going fast. Table 14 shows that Bayesian REX is able to accurately predict the performance and risk of each of these demonstrations, assigning the highest risk (lowest 0.05-VaR) to the ram demonstration and the least risk to the good demonstration.
Table 14: Policy evaluation for human demonstrations on Enduro.
Policy  Predicted Mean  Predicted 0.05-VaR  Ground-Truth Avg.  Length
good  246.7  113.2  177  3325.0 
periodic  230.0  130.4  44  3325.0 
neutral  190.8  160.6  0  3325.0 
ram  148.4  214.3  0  3325.0 
Appendix J Trading Off Risk and Reward
We next tested whether we can trade off risk and reward using the posterior distribution learned via Bayesian REX. We obtained a variety of evaluation policies by partially training and checkpointing RL policies trained on the ground-truth reward (see Section G.2 for details). We then varied the risk tolerance from 0 (maximally risk-averse) to 1 (maximally risk-tolerant), performed a policy evaluation for each evaluation policy at each risk tolerance, sorted the evaluation policies by the corresponding quantile of their posterior return distributions, and selected the best policy. In Figures 9(a) and 9(b) we plot the risk tolerance along the x-axis and the ground-truth expected performance of the selected policy along the y-axis. For Breakout, we found that increasing the risk tolerance improves expected return, at the expense of possibly riskier policies. For Beam Rider, we found that increasing the risk tolerance from 0 to 1 did not change the selected policy. We investigated this and found that one policy dominated the other evaluation policies, as shown in Figure 9(c), where the posterior return distribution for the dominating policy (green) is compared to the distributions for several other evaluation policies.
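The selection sweep described above can be sketched as follows. The dictionary-based interface is our own assumption; a risk tolerance of 0.05 recovers the 0.05-VaR criterion used in the earlier tables.

```python
import numpy as np

def best_policy_at_risk_tolerance(policy_returns, alpha):
    """Select the evaluation policy that maximizes the alpha-quantile of
    its posterior return distribution.

    policy_returns: dict mapping policy name -> 1-D array of returns,
    one entry per posterior reward sample.
    alpha: risk tolerance in [0, 1]. Small alpha is risk-averse (policies
    are scored by a worst-case quantile); large alpha is risk-tolerant
    (policies are scored by a best-case quantile).
    """
    scores = {name: np.quantile(r, alpha) for name, r in policy_returns.items()}
    return max(scores, key=scores.get)
```

Sweeping alpha over a grid and recording the ground-truth return of each selected policy reproduces the kind of curve shown in Figures 9(a) and 9(b).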