Learning Human Objectives by Evaluating Hypothetical Behavior

12/05/2019
by Siddharth Reddy, et al.

We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.



1 Introduction

Users typically specify objectives for reinforcement learning (RL) agents through scalar-valued reward functions (sutton2018reinforcement). While users can easily define reward functions for tasks like playing games of Go or StarCraft, users may struggle to describe practical tasks like driving cars or controlling robotic arms in terms of rewards (hadfield2017inverse). Understanding user objectives in these settings can be challenging – not only for machines, but also for humans modeling each other and introspecting on themselves (premack1978does).

For example, consider the trolley problem (foot1967problem): if you were the train conductor in Figure 1, presented with the choice of either allowing multiple people to come to harm by letting the train continue on its current track, or harming one person by diverting the train, what would you do? The answer depends on whether your value system leans toward consequentialism or deontological ethics – a distinction that may not be captured by a reward function designed to evaluate common situations, in which ethical dilemmas like the trolley problem rarely occur. In complex domains, the user may not be able to anticipate all possible agent behaviors and specify a reward function that accurately describes user preferences over those behaviors.

Figure 1: Our method learns a reward model from user feedback on hypothetical behaviors, then deploys a model-based reinforcement learning agent that optimizes the learned rewards.

We address this problem by actively synthesizing hypothetical behaviors from scratch, and asking the user to label them with rewards. Figure 1 describes our algorithm: using a generative model of initial states and a forward dynamics model trained on off-policy data, we synthesize hypothetical behaviors, ask the user to label the behaviors with rewards, and train a neural network to predict the rewards. We repeat this process until the reward model converges, then deploy a model-based RL agent that optimizes the learned rewards.

The key idea in this paper is synthesizing informative hypotheticals (illustrated in Figure 2). Ideally, we would generate these hypotheticals by optimizing the value of information (VOI; savage1954foundations), but the VOI is intractable for real-world domains with high-dimensional, continuous states: it requires computing an expectation over all possible trajectories, conditioned on the optimal policy for the updated reward model (see Section 3.3 for details). Instead, we use trajectory optimization to produce four types of hypotheticals that improve the reward model in different ways: behaviors that (1) maximize reward model uncertainty, measured as disagreement within an ensemble of reward models (Section 3.3), to elicit labels that are likely to change the updated reward model’s outputs; (2) maximize predicted rewards, to detect and correct reward hacking; (3) minimize predicted rewards, to safely explore unsafe states; or (4) maximize the novelty of trajectories regardless of predicted rewards, to improve the diversity of the training data. To ensure that the hypothetical trajectories remain comprehensible to the user and resemble realistic behaviors, we use a generative model of initial states and a forward dynamics model for regularization. We call this method reward query synthesis via trajectory optimization (ReQueST).

Figure 2: Our method automatically synthesizes hypotheticals like the trolley problem. Consider a training environment in which the following two states are common: either one of the tracks is empty, or there are fewer people on the current track than the other track. In these states, the consequentialist and deontologist reward functions agree. After asking the user to label these states, we are not able to determine which of the two is the true reward function, since both are consistent with the training data. Our method queries the user for labels at states where the value of information is highest: states where there are more people on the current track than the other track, but there are still some people on the other track. By eliciting user labels at these unlikely-but-informative states, we learn a reward model that more accurately captures the user’s objectives.

Our primary contribution is ReQueST: an algorithm that synthesizes hypothetical behaviors in order to safely and efficiently train neural network reward models in environments with high-dimensional, continuous states. We evaluate ReQueST with simulated users in three domains: MNIST classification (lecun1998mnist), a state-based 2D navigation task, and the image-based Car Racing video game in the OpenAI Gym (brockman2016openai). Our experiments show that ReQueST learns robust reward models that transfer to new environments with different initial state distributions, achieving at least 2x better final performance than baselines adapted from prior work (e.g., see Figure 4). In the navigation task, ReQueST safely learns to classify 100% of unsafe states as unsafe and deploys an agent that never visits unsafe states, while the baselines fail to learn about even one unsafe state and deploy agents with a failure rate of 75%.

2 Related Work

In this work, we align agent behavior with a user’s objectives by learning a model of the user’s reward function and training the agent via RL (russell1998learning; leike2018scalable). The idea behind modeling the user’s reward function – as opposed to the user’s policy (ross2011reduction), value function (warnell2018deep; reddy2018shared), or advantage function (macglashan2017interactive) – is to acquire a compact, transferable representation of the user’s objectives that is useful not just in the training environment, but also in new environments with different dynamics or initial states.

The closest prior work is on active learning methods for learning rewards from pairwise comparisons (dorsa2017active; biyik2018batch; wirth2017survey), critiques (cui2018active), demonstrations (ibarz2018reward; brown2019risk), designs (mindermann2018active), and numerical feedback (daniel2014active). ReQueST differs in three key ways: it produces query trajectories using a generative model, in a way that enables trading off between producing realistic vs. informative queries; it optimizes queries not only to reduce model uncertainty, but also to detect reward hacking and safely explore unsafe states; and it scales to learning neural network reward models that operate on high-dimensional, continuous state spaces.

ReQueST shares ideas with prior work (saunders2018trial; prakash2019improving) on learning to detect unsafe behaviors by initially seeking out catastrophes, selectively querying the user, and using model-based RL. ReQueST differs primarily in that it learns a complete task specification, not just an unsafe state detector. ReQueST is also complementary to prior work on safe exploration, which typically assumes a known reward function and side constraints, and focuses on ensuring that the agent never visits unsafe states during policy optimization (dalal2018safe; garcia2015comprehensive).

3 Learning Rewards from User Feedback on Hypothetical Behavior

We formulate the reward modeling problem as follows. We assume access to a training environment that follows a Markov decision process (MDP; sutton2018reinforcement) with unknown state transition dynamics, an unknown initial state distribution, and an unknown reward function that can be evaluated on specific inputs by querying the user. We learn a model of the reward function by querying the user for reward signals. At test time, we train an RL agent with the learned reward function in a new environment with the same dynamics, but a potentially different initial state distribution. The goal is for the agent to perform well in the test environment with respect to the true reward function.

Our approach to this problem is outlined in Figure 1, and can be split into three steps. In step (1) we use off-policy data to train a generative model that can be used to evaluate the likelihood of a trajectory. This model enables us to synthesize hypothetical trajectories that can be shown to the user. In step (2) we produce synthetic trajectories, which consist of sequences of state transitions, that seek out different kinds of hypotheticals. We ask the user to label each transition with a scalar reward, and fit a reward model using standard supervised learning techniques. (In principle, other methods, such as pairwise comparisons (wirth2017survey) or implicit feedback (anonymous2020deep), could also be used to label the synthetic trajectories with user rewards.) In step (3) we use standard RL methods to train an agent using the learned rewards. Since we typically learn a forward dynamics model as part of the generative model in step (1), we find that model-based RL is a good fit for training the agent in step (3).

3.1 Learning a Generative Model of Trajectories

In order to synthesize hypothetical outcomes that may be unlikely to occur in the training environment, we cannot simply take actions in the training environment and collect the resulting trajectories, as is done in prior work. Instead, we resort to training a generative model of trajectories, so that we can more efficiently sample unusual behaviors using the model.

In step (1) we collect off-policy data by interacting with the training environment in an unsupervised fashion; i.e., without the user in the loop. To simplify our experiments, we sample trajectories by following random policies that explore a wide variety of states. (In principle, safe expert demonstrations could be used instead of random trajectories; we used random trajectories to simplify our experiments. See Section 3.6 for further discussion.) We use the observed trajectories τ = (s_1, a_1, s_2, a_2, ..., s_T) to train a likelihood model

p_φ(τ) = p_φ(s_1) ∏_{t=1}^{T−1} p_φ(s_{t+1} | s_t, a_t),    (1)

where p_φ(s_1) models the initial state distribution, p_φ(s_{t+1} | s_t, a_t) models the forward dynamics, and φ are the model parameters (e.g., neural network weights). We train the model using maximum-likelihood estimation, given the sampled trajectories. As described in the next section, the likelihood model is helpful for regularizing the synthetic trajectories shown to the user.

In environments with high-dimensional, continuous states, such as images, we also train a state encoder e_φ and decoder d_φ, where e_φ maps a raw state s_t to a low-dimensional latent embedding z_t, and d_φ maps the embedding back to a reconstructed state. As described in Section 3.4, embedding states in a low-dimensional latent space is helpful for trajectory optimization. In our experiments, we train e_φ and d_φ using the variational auto-encoder method (VAE; kingma2013auto).
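As a concrete illustration of how the learned likelihood model is used, the following sketch evaluates the log-likelihood of a trajectory under Equation 1. It is a minimal example of our own, not the released ReQueST code; `init_state_log_prob` and `dynamics_log_prob` are hypothetical stand-ins for the density models trained in step (1).

```python
def trajectory_log_likelihood(states, actions, init_state_log_prob, dynamics_log_prob):
    """Evaluate log p(tau) = log p(s_1) + sum_t log p(s_{t+1} | s_t, a_t).

    states:  sequence of T states; actions: sequence of T - 1 actions.
    init_state_log_prob(s) and dynamics_log_prob(s, a, s_next) are assumed to be
    the learned density models from step (1), each returning a scalar log-probability.
    """
    log_prob = init_state_log_prob(states[0])
    for t in range(len(actions)):
        log_prob += dynamics_log_prob(states[t], actions[t], states[t + 1])
    return log_prob
```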

3.2 Representing the Reward Model as a Classifier

Our goal is to learn a model r̂_θ of the user’s reward function. In step (2) we represent r̂_θ by classifying state transitions (s_t, a_t, s_{t+1}) as good, unsafe, or neutral – similar to cui2018active – and assigning a known, constant reward to each of these three categories:

r̂_θ(s_t, a_t, s_{t+1}) = ∑_{c} r_c · p̄_θ(c | s_t, a_t, s_{t+1}),    (2)

where p̄_θ is the mean of an ensemble of classifiers p_{θ_1}, ..., p_{θ_n}, and θ_i are the weights of the i-th neural network in the ensemble. r_c is the constant reward for any state transition in class c, where c ∈ {good, unsafe, neutral}. Modeling the reward function as a classifier simplifies our experiments and makes it easier for the user to provide labels. In principle, our method can also work with other architectures, such as a more straightforward regression model that predicts scalar rewards directly. As described in Section 3.3, we use an ensemble method to model uncertainty.
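To make Equation 2 concrete, here is a minimal sketch (our illustration, not the paper's implementation) of how the ensemble-mean class probabilities can be converted into a predicted reward; the specific reward constants and the classifier interface are assumptions.

```python
import numpy as np

# Illustrative reward constants; the paper uses domain-specific, asymmetric values.
CLASS_REWARDS = {"good": 1.0, "neutral": 0.0, "unsafe": -1.0}
CLASSES = list(CLASS_REWARDS)

def ensemble_class_probs(ensemble, transition):
    """Mean class probabilities over an ensemble of classifiers.

    Each ensemble member is assumed to map a transition (s, a, s') to a
    probability vector over CLASSES.
    """
    probs = np.stack([member(transition) for member in ensemble])  # (n_members, 3)
    return probs.mean(axis=0)

def predicted_reward(ensemble, transition):
    """Expected reward under the ensemble-mean classifier, as in Equation 2."""
    mean_probs = ensemble_class_probs(ensemble, transition)
    rewards = np.array([CLASS_REWARDS[c] for c in CLASSES])
    return float(mean_probs @ rewards)
```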

3.3 Designing Objectives for Informative Queries

Our approach to reward modeling involves asking the user to label trajectories with reward signals. In step (2) we synthesize query trajectories to elicit user labels that are informative for learning the reward model.

To generate a useful query, we synthesize a trajectory τ that maximizes an acquisition function (AF). The AF evaluates how useful it would be to elicit reward labels for τ, then update the reward model given the newly-labeled data. Since we do not assume knowledge of the test environment where the agent is deployed, we cannot optimize the ideal AF: the value of information (VOI; savage1954foundations), defined as the gain in performance of an agent that optimizes the updated reward model in the test environment. Prior work on active learning tackles this problem by optimizing proxies for the VOI (settles2009active). We use AFs adapted from prior work, as well as novel AFs that are particularly useful for reward modeling.

In this work, we use four AFs that are easy to optimize for neural network reward models. The first AF maximizes reward model uncertainty, eliciting user labels for behaviors that are likely to change the updated reward model’s outputs; it is a proxy for the VOI, since improving the agent requires improving the predicted rewards. The second AF maximizes predicted rewards, surfacing behaviors for which the reward model might be incorrectly predicting high rewards; it is another useful heuristic for reward modeling, since preventing reward hacking improves agent performance. The third AF minimizes predicted rewards, adding unsafe behaviors to the training data. While we do not consider reward minimization to be a proxy for the VOI, we find it helpful empirically for training neural network reward models, since it helps to balance the number of unsafe states (vs. neutral or good states) in the training data. The fourth AF maximizes the novelty of the training data, encouraging uniform coverage of the space of behaviors regardless of their predicted reward. Novelty is a naïve proxy for the VOI, but tends to be helpful in practice, due to the difficulty of estimating the uncertainty of neural network reward models.

Maximizing uncertainty. The first AF implements one of the simplest query selection strategies from the active learning literature: uncertainty sampling (lewis1994sequential). The idea is to elicit labels for examples that the model is least certain how to label, and thus reduce model uncertainty. To do so, we train an ensemble of neural network reward models, and generate trajectories that maximize the disagreement between ensemble members. Following lakshminarayanan2017simple, we measure ensemble disagreement using the average KL-divergence between the output of a single ensemble member and the ensemble mean,

(1/n) ∑_{i=1}^{n} D_KL( p_{θ_i}(· | s_t, a_t, s_{t+1}) ‖ p̄_θ(· | s_t, a_t, s_{t+1}) ),    (3)

where p̄_θ is the ensemble-mean reward classifier defined in Section 3.2. Although more sophisticated methods of modeling uncertainty in neural networks exist (gal2016uncertainty), we find that ensemble disagreement works well in practice. (We did not compare to other ensemble-based approximations, such as mutual information (houlsby2011bayesian).)
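A minimal sketch of the disagreement measure described above, assuming each ensemble member outputs a probability vector over the three classes for a given transition (our illustration; the clipping constant is an arbitrary numerical-stability choice).

```python
import numpy as np

def ensemble_disagreement(member_probs, eps=1e-8):
    """Average KL divergence between each member's output and the ensemble mean.

    member_probs: array of shape (n_members, n_classes); each row is one
    ensemble member's class distribution for a single state transition.
    """
    member_probs = np.clip(member_probs, eps, 1.0)  # avoid log(0)
    mean_probs = member_probs.mean(axis=0)
    kls = np.sum(member_probs * (np.log(member_probs) - np.log(mean_probs)), axis=1)
    return float(kls.mean())
```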

Maximizing reward. The second AF is intended to detect examples of false positives, or ‘reward hacking’: behaviors for which the reward model incorrectly outputs high reward (amodei2016concrete; christiano2017deep). The idea is to show the user what the reward model predicts to be good behavior, with the expectation that some of these behaviors are actually suboptimal, and will be labeled as such by the user. To do so, we simply synthesize trajectories that maximize the reward predicted by the current model.

Minimizing reward. The third AF is intended to augment the training data with more examples of unsafe states than would normally be encountered, e.g., by a reward-maximizing agent acting in the training environment. The idea is to show the user what the reward model considers to be unsafe behavior, with the expectation that the past training data may not contain egregiously unsafe behaviors, and that it would be helpful for the user to confirm whether the model has captured the correct notion of unsafe states. To do so, we produce trajectories that minimize the reward predicted by the current model.

Maximizing novelty. The fourth AF is intended to produce novel trajectories that differ from those already in the training data, regardless of their predicted reward; akin to prior work on geometric AFs (sener2017geometric). This is especially helpful early during training, when uncertainty estimates are not accurate, and the reward model has not yet captured interesting notions of reward-maximizing and reward-minimizing behavior. To do so, we produce trajectories that maximize the distance between the synthesized trajectory τ and the previously-labeled trajectories τ′ in the training data D,

∑_{τ′ ∈ D} d(τ, τ′).    (4)

In this work, we use a distance function that computes the Euclidean distance between state embeddings,

d(τ, τ′) = ∑_t ‖ e_φ(s_t) − e_φ(s′_t) ‖_2,    (5)

where e_φ is the state encoder trained in step (1).
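The sketch below illustrates one way to implement this distance, assuming trajectories of equal length whose states have already been embedded by the encoder; the time-aligned sum and the aggregation over the labeled set are our assumptions about how the distances are combined.

```python
import numpy as np

def trajectory_distance(query_embeds, ref_embeds):
    """Sum of Euclidean distances between time-aligned state embeddings (cf. Equation 5).

    Both arguments are arrays of shape (T, latent_dim); equal length is assumed.
    """
    return float(np.linalg.norm(query_embeds - ref_embeds, axis=1).sum())

def novelty(query_embeds, labeled_trajectory_embeds):
    """Aggregate distance from a candidate query to all previously labeled trajectories."""
    return sum(trajectory_distance(query_embeds, ref) for ref in labeled_trajectory_embeds)
```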

For the sake of simplicity, we synthesize a separate trajectory for each of the four AFs. In principle, multiple AFs could be combined to form hybrid AFs. For example, optimizing a combination of the reward-maximizing and novelty-maximizing AFs could yield trajectories that simultaneously maximize rewards and novelty.

3.4 Query Synthesis via Trajectory Optimization

We synthesize a query trajectory by solving the optimization problem

τ* = argmax_{z_{1:T}, a_{1:T−1}}  AF(τ) + λ log p_φ(τ),    (6)

where z_t is the embedding of state s_t in the latent space of the encoder trained in step (1), τ = (d_φ(z_1), a_1, ..., d_φ(z_T)) is the decoded trajectory, AF is the acquisition function (Section 3.3), λ is a regularization constant, and p_φ is the generative model of trajectories (Section 3.1). In this work, we assume the AF is differentiable, and optimize τ using Adam (kingma2014adam). (Our method can be extended to settings where the AF is not differentiable, by using a gradient-free optimization method to synthesize τ; this can be helpful, e.g., when using a non-differentiable simulator to model the environment.) Optimizing low-dimensional, latent states instead of high-dimensional, raw states reduces computational requirements, and regularizes the optimized states to be more realistic. (Our approach to query synthesis draws inspiration from direct collocation methods in the trajectory optimization literature (betts2010practical), feature visualization methods in the neural network interpretability literature (olah2017feature), and prior work on active learning with deep generative models (huijser2017active).)
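The following PyTorch sketch shows one way the gradient-based query synthesis in Equation 6 could be implemented, assuming differentiable `acquisition_fn`, `traj_log_prob`, and `decode` functions built from the learned models; it is an illustration of the idea, not the released implementation, and the step count and learning rate are arbitrary choices.

```python
import torch

def synthesize_query(acquisition_fn, traj_log_prob, decode, z_init, a_init,
                     reg_weight=1.0, steps=200, lr=1e-2):
    """Maximize AF(tau) + reg_weight * log p(tau) over latent states and actions.

    z_init: (T, latent_dim) initial latent states; a_init: (T - 1, action_dim) actions.
    decode maps latent states to raw states; all three callables are assumed to be
    differentiable torch functions built from the models in steps (1) and (2).
    """
    z = z_init.clone().requires_grad_(True)
    a = a_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([z, a], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        states = decode(z)                     # decode latents into raw states
        objective = acquisition_fn(states, a) + reg_weight * traj_log_prob(states, a)
        (-objective).backward()                # Adam minimizes, so negate the objective
        optimizer.step()
    with torch.no_grad():
        return decode(z), a.detach()
```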

The regularization constant λ in Equation 6 controls the trade-off between how realistic τ is and how aggressively it maximizes the AF. Setting λ too low can result in query trajectories that are incomprehensible to the user and unlikely to be seen in the test environment, while setting λ to a high value can constrain the query trajectories from seeking interesting hypotheticals. The experiments in Section 4.5 analyze this trade-off in further detail.

1:  Require: generative model p_φ, acquisition functions {uncertainty, max-reward, min-reward, novelty}, ensemble size n
2:  Initialize training data D (e.g., with labeled demonstrations)
3:  while reward model not converged do
4:     for each acquisition function AF do
5:        synthesize query trajectory τ by solving Equation 6 with AF
6:        for each transition (s_t, a_t, s_{t+1}) in τ do
7:           elicit class label c_t for (s_t, a_t, s_{t+1}) {Query the user}
8:           D ← D ∪ {((s_t, a_t, s_{t+1}), c_t)}
9:        end for
10:     end for
11:     for i = 1 to n do
12:        train ensemble member p_{θ_i} on all of D via maximum-likelihood estimation
13:     end for
14:  end while
15:  Return reward model r̂_θ {Defined via p̄_θ in Equation 2}
Algorithm 1: Reward Query Synthesis via Trajectory Optimization (ReQueST)

Our reward modeling algorithm is summarized in Algorithm 1. Given a generative model of trajectories p_φ, it generates one query trajectory for each of the four AFs, asks the user to label the states in the query trajectories, retrains the reward model ensemble on the updated training data using maximum-likelihood estimation, and repeats this process until the user is satisfied with the outputs of the reward model. (Note that, in line 12 of Algorithm 1, we train each ensemble member on all of the data D, instead of a random subset of the data (i.e., bootstrapping). As in lakshminarayanan2017simple, we find that simply training each reward network using a different random seed works well in practice for modeling uncertainty.) The ablation study in Section 4.6 analyzes the effect of using different subsets of the four AFs to generate queries.
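For concreteness, the sketch below lays out the outer loop of Algorithm 1 in Python. It is a schematic rendering under our own interface assumptions: `synthesize`, `query_user`, and `fit_member` stand in for the query optimizer (Section 3.4), the human labeler, and ensemble-member training, respectively.

```python
def request_loop(acquisition_fns, synthesize, query_user, fit_member,
                 dataset, n_members, n_rounds):
    """Schematic outer loop of ReQueST (Algorithm 1), not the released code.

    synthesize(af, ensemble) -> list of state transitions for one query trajectory;
    query_user(transition) -> class label; fit_member(dataset, seed) -> classifier.
    """
    ensemble = [fit_member(dataset, seed=i) for i in range(n_members)]
    for _ in range(n_rounds):                       # until the user is satisfied
        for af in acquisition_fns:                  # one query trajectory per AF
            for transition in synthesize(af, ensemble):
                label = query_user(transition)      # ask the user for a class label
                dataset.append((transition, label))
        # retrain every ensemble member on all labeled data, varying only the seed
        ensemble = [fit_member(dataset, seed=i) for i in range(n_members)]
    return ensemble
```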

3.5 Deploying a Model-Based RL Agent

Given the learned reward model r̂_θ, the agent can, in principle, be trained using any RL algorithm in step (3). In practice, since our method learns a forward dynamics model in step (1), we find that model-based RL is a good fit for training the agent in step (3). In this work, we deploy an agent that combines planning with model-predictive control (MPC):

a_t = argmax_{a_{t:t+H}}  ∑_{k=t}^{t+H} r̂_θ(s_k, a_k, s_{k+1}),    (7)

where the future states s_{k+1} are predicted using the forward dynamics model trained in step (1), H is the planning horizon, and r̂_θ is the reward model trained in step (2). We solve the optimization problem on the right-hand side using Adam (kingma2014adam).
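A minimal PyTorch sketch of the planning step, assuming a differentiable learned dynamics model and reward model; the horizon, step count, and zero-initialized actions are illustrative choices rather than the paper's settings.

```python
import torch

def mpc_action(state, dynamics, reward_model, action_dim, horizon=10,
               steps=100, lr=1e-2):
    """Plan an open-loop action sequence by gradient ascent on predicted reward,
    then return only the first action (model-predictive control).

    dynamics(s, a) -> next state and reward_model(s, a, s_next) -> scalar are
    assumed to be differentiable torch modules from steps (1) and (2).
    """
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        s, total_reward = state, 0.0
        for a in actions:                       # roll out the learned dynamics
            s_next = dynamics(s, a)
            total_reward = total_reward + reward_model(s, a, s_next)
            s = s_next
        (-total_reward).backward()              # maximize predicted return
        optimizer.step()
    return actions[0].detach()
```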

3.6 Safe Exploration

One of the benefits of our method is that, since it learns from synthetic trajectories instead of real trajectories, it only has to imagine visiting unsafe states, instead of actually visiting them. Although unsafe states may be visited during unsupervised exploration of the environment for training the generative model in step (1), the same generative model can be reused to learn reward models for any number of future tasks. Hence, the cost of visiting a fixed number of unsafe states in step (1) can be amortized across a large number of tasks in step (2). We could also train the generative model on other types of off-policy data instead, including safe expert demonstrations and examples of past failures.

Another benefit of our method is that, as part of the data collection process in step (2), the user gets to observe query trajectories that reveal what the reward model has learned. Thus, the user can choose to stop providing feedback when they are satisfied with the reward model’s notions of reward-maximizing and reward-minimizing behaviors; and when they see that uncertainty-maximizing queries are genuinely ambiguous, instead of merely uncertain to the model while being easy for the user to judge. This provides a safer alternative to debugging a reward model by immediately deploying the agent and observing its behavior without directly inspecting the reward model beforehand.

4 Experimental Evaluation

We seek to answer the following questions. Q1: Does synthesizing hypothetical trajectories elicit more informative labels than rolling out a policy in the training environment? Q2: Can our method detect and correct reward hacking? Q3: Can our method safely learn about unsafe states? Q4: Do the proposed AFs improve upon random sampling from the generative model? Q5: How does the regularization constant λ control the trade-off between realistic and informative queries? Q6: How much does each of the four AFs contribute to performance?

To answer these questions under ideal assumptions, we run experiments in three domains – MNIST (lecun1998mnist), state-based 2D navigation (Figure 3), and image-based Car Racing from the OpenAI Gym (brockman2016openai) – with simulated users that label trajectories using a ground-truth reward function. In each domain, we set up a training environment and a test environment, each with its own initial state distribution, as described in Section 3. In many real-world settings, the user can help initialize the reward model by providing a small number of (suboptimal) demonstrations and labeling them with rewards. Hence, we initialize the training data in line 2 of Algorithm 1 with a small set of labeled, suboptimal user demonstrations collected in the training environment.

MNIST classification. This domain enables us to focus on testing the active learning component of our method, since the standard digit classification task does not involve sequential decision-making. Here, the initial state is a grayscale image of a handwritten digit, and the action is a discrete classification. When we generate queries, we synthesize an image and ask the simulated user to label it with a classification action. The initial state distribution of the training environment concentrates on one subset of the digit classes. We intentionally introduce a significant shift in the state distribution between the training and test environments, by setting the initial state distribution of the test environment to the complement of the training distribution; i.e., concentrating on the remaining digit classes. This mismatch is intended to test the robustness of the learned classifier; i.e., how well it performs under distribution shift. We train a state encoder and decoder in step (1) by training a VAE with an 8-dimensional latent space on all the images in the MNIST training set. (Note that this differs from the random sampling method for collecting off-policy data described in Section 3.1. Though the initial state distribution of the training environment covers only a subset of the digits, we train the generative model on all ten digits. This simplifies our experiments, and enables ReQueST to synthesize hypothetical digits from outside the training distribution.)

State-based 2D navigation. This domain enables us to focus on the challenges of sequential decision-making, without dealing with high-dimensional states. Here, the state is the agent’s position, and the action is a velocity vector. The task requires navigating to a target region, while avoiding a trap region. The simulated user labels a state transition with a category by looking at the state and identifying whether it is inside the goal region (good), inside the trap region (unsafe), or outside both regions (neutral). The initial state distribution of the training environment is a delta function at the origin. We intentionally introduce a significant shift in the state distribution between the training and test environments, by setting the initial state distribution of the test environment to a delta function at the opposite corner of the unit square. As in MNIST, this mismatch is intended to test the generalization of reward models. The task is harder to complete in the test environment, since the agent starts closer to the trap, and must navigate around the trap to reach the goal (Figure 3).

Figure 3: Left: The 2D navigation task, where the agent navigates to the goal region (green) in the lower left while avoiding the trap region (red) in the upper right. The agent starts in the lower left corner in the training environment, and starts in the upper right corner in the test environment. Right: Examples of hypothetical states synthesized throughout learning, illustrating the qualitative differences in the behaviors targeted by each AF.
Figure 4: Experiments that address Q1 – does synthesizing hypothetical trajectories elicit more informative labels than rolling out a policy in the training environment? – by comparing our method, which uses synthetic trajectories, to baselines that only use real trajectories generated in the training environment. The results on MNIST, 2D navigation, and Car Racing show that our method (orange) significantly outperforms the baselines (blue and gray), which never succeed in 2D navigation. The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition. The shaded areas show standard error over three random seeds.

Image-based Car Racing. This domain enables us to test whether our method scales to learning sequential tasks with high-dimensional states. Here, the state is an RGB image with a top-down view of the car (Figure 13 in the appendix), and the action controls steering, gas, and brake. The simulated user labels a state transition with a category by looking at the state and identifying whether it shows the car driving onto a new patch of road (good), driving off-road (unsafe), or driving on a previously-visited road patch (neutral). Here, we set the same initial state distribution for the training and test environments, since the reward modeling problem is challenging even when the initial state distributions are identical. We train a generative model in step (1) using the unsupervised approach in ha2018recurrent, which trains a VAE that compresses images, a recurrent dynamics model that predicts state transitions under partial observability, and a mixture density network that predicts stochastic transitions.

Section A.1 in the appendix discusses the setup of each domain, including the methods used to train the generative model in step (1), in further detail.

4.1 Robustness Compared to Baselines

Our first experiment tests whether our method can learn a reward model that is robust enough to perform well in the test environment, and tracks how many queries to the user it takes to learn an accurate reward model.

Manipulated factors. To answer Q1, we compare our method to a baseline that, instead of generating hypothetical trajectories for the user to label, generates trajectories by rolling out a policy that optimizes the current reward model in the training environment – an approach adapted from prior work (christiano2017deep). The baseline generates the query trajectory in line 5 of Algorithm 1 by rolling out the MPC policy in Equation 7, instead of solving the optimization problem in Equation 6. To test how generating queries using a reward-maximizing policy compares to using a policy that does not depend on the reward model, we also evaluate a simpler baseline that generates query trajectories using a uniform random policy, instead of the MPC policy.

Dependent measures. We measure performance in MNIST using the agent’s classification accuracy in the test environment; in 2D navigation, the agent’s success rate at reaching the goal while avoiding the trap in the test environment; and in Car Racing, the agent’s true reward, which gives a bonus for driving onto new patches of road, and penalizes going off-road. (We also measure performance in the training environment, without state distribution shift; see Figure 11 in the appendix for details.) We establish a lower bound on performance using a uniform random policy, and an upper bound by deploying an MPC agent equipped with a reward model trained on a large, offline dataset of 100 expert trajectories and 100 random trajectories containing balanced classes of good, unsafe, and neutral state transitions.

Analysis. The results in Figure 4 show that our method produces reward models that transfer to the test environment better than the baselines. Our method also learns to outperform the suboptimal demonstrations used to initialize the reward model (Figure 10 in the appendix).

In MNIST, our method performs substantially better than the baseline, which samples queries from the initial state distribution of the training environment. The reason is simple: the initial state distribution of the test environment differs significantly from that of the training environment. Since our method is not restricted to sampling from the training environment, it performs better than the baseline.

In 2D navigation, our method substantially outperforms both baselines, which never succeed in the test environment. This is unsurprising: the training environment is set up in such a way that, because the agent starts out in the lower left corner, it rarely visits the trap region in the upper right simply by taking actions – whether reward-maximizing actions (as in the first baseline) or uniform random actions (as in the second baseline). Hence, when a reward model trained by the baselines is transferred to the test environment, it is not aware of the trap, so the agent tends to get caught in the trap on its way to the goal. Our method, however, is not restricted to feasible trajectories in the training environment, and can potentially query the label for any position in the environment – including the trap (see Figure 3). Hence, our method learns a reward model that is aware of the trap, which enables the agent to navigate around it in the test environment.

In Car Racing, our method outperforms both baselines. This is mostly due to the fact that the baselines tend to generate queries that are not diverse and rarely visit unsafe states, so the resulting reward models are not able to accurately distinguish between good, unsafe, and neutral states. Our method, on the other hand, explicitly seeks out a wide variety of states by maximizing the four AFs, which leads to more diverse training data, and a more accurate reward model.

4.2 Detecting Reward Hacking

Figure 5: Experiments that address Q2 – can our method detect and correct reward hacking? – by comparing our method, which uses synthetic trajectories, to baselines that only use real trajectories generated in the training environment. The results on 2D navigation show that our method (orange) significantly outperforms the baselines (blue and gray). The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition. The shaded areas show standard error over three random seeds.

One of the benefits of our method is that it can detect and correct reward hacking before deploying the agent, using reward-maximizing synthetic queries. In the next experiment, we test this claim.

Manipulated factors. We replicate the experimental setup in Section 4.1 for 2D navigation, including the same baselines.

Dependent measures. We measure performance using the false positive rate of the reward model: the fraction of neutral or unsafe states incorrectly classified as good, evaluated on the offline dataset of trajectories described in Section 4.1. A reward model that outputs false positives is susceptible to reward hacking, since a reward-maximizing agent can game the reward model into emitting high rewards by visiting incorrectly classified states.
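A small sketch of how this metric can be computed from labeled transitions; `predict_class` is a hypothetical helper that returns the argmax class of the ensemble-mean classifier.

```python
def false_positive_rate(predict_class, labeled_transitions):
    """Fraction of neutral or unsafe transitions that the reward model classifies as good.

    labeled_transitions: iterable of (transition, true_class) pairs;
    predict_class(transition) -> "good" | "neutral" | "unsafe" (assumed interface).
    """
    non_good = [(t, c) for t, c in labeled_transitions if c != "good"]
    if not non_good:
        return 0.0
    mistakes = sum(1 for t, _ in non_good if predict_class(t) == "good")
    return mistakes / len(non_good)
```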

Analysis. The results in Figure 5 show that our method drives down the false positive rate in 2D navigation: the learned reward model rarely incorrectly classifies an unsafe or neutral state as a good state. As a result, the deployed agent actually performs the desired task (center plot in Figure 4), instead of seeking false positives. As discussed in Section 4.3 and illustrated in the right-most plot of Figure 6, the baselines learn a reward model that incorrectly extrapolates that continuing up and to the right past the goal region is good behavior.

For a concrete example of reward-maximizing synthetic queries that detect reward hacking, consider the reward-maximizing queries in the upper right corner of Figure 3, which are analyzed in Section 4.6.

4.3 Safe Exploration

Figure 6: Experiments that address Q3 – can our method safely learn about unsafe states? – by comparing our method, which uses synthetic trajectories, to baselines that only use real trajectories generated in the training environment. The results on 2D navigation show that our method (orange) significantly outperforms the baselines (blue and gray). The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition. The shaded areas show standard error over three random seeds. The heat maps represent the reward models learned by our method (left) and by the baselines (right).
Figure 7: Experiments that address Q4 – do the proposed AFs improve upon random sampling from the generative model? – by comparing our method, which synthesizes trajectories by optimizing AFs, to a baseline that ignores the AFs and randomly samples from the generative model. The results on MNIST, 2D navigation, and Car Racing show that our method (orange) significantly outperforms the baseline (blue) in Car Racing, and learns faster in MNIST and 2D navigation. The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition. The shaded areas show standard error over three random seeds.

One of the benefits of our method is that it can learn a reward model that accurately detects unsafe states, without having to visit unsafe states during the training process. In the next experiment, we test this claim.

Manipulated factors. We replicate the experimental setup in Section 4.1 for 2D navigation, including the same baselines.

Dependent measures. We measure performance using the true negative rate of the reward model: the fraction of unsafe states correctly classified as unsafe, evaluated on the offline dataset of trajectories described in Section 4.1. We also use the crash rate of the deployed agent: the rate at which it gets caught in the trap region.

Analysis. The results in Figure 6 show that our method learns a reward model that classifies all unsafe states as unsafe, without visiting unsafe states during training (second and third figure from left); in fact, without visiting any states at all, since the queries are synthetic. This enables the agent to avoid crashing during deployment (first figure from left). The baselines differ from our method in that they actually have to visit unsafe states in order to query the user for labels at those states. Since the baselines tend to not visit unsafe states during training, they do not learn about unsafe states (second and fourth figure from left), and the agent frequently crashes during deployment (first figure from left).

4.4 Query Efficiency Compared to Baselines

Figure 8: Experiments that address Q5 – how does the regularization constant λ control the trade-off between realistic and informative queries? – by evaluating our method with different values of λ, which controls the trade-off between producing realistic trajectories (higher λ) and informative trajectories (lower λ). The results on MNIST, 2D navigation, and Car Racing show that, while intermediate and low values of λ work best for MNIST and 2D navigation respectively, a high value of λ works best for Car Racing. The x-axis is log-scaled. The error bars show standard error over three random seeds, which is negligible in the results for 2D navigation.

The previous experiment compared to baselines that are restricted to generating query trajectories by taking actions in the training environment. In this experiment, we lift this restriction on the baselines: instead of taking actions in the training environment, the baselines can make use of the generative model trained in step (1).

Manipulated factors. To answer Q4, we compare our method to a baseline that randomly samples trajectories from the generative model – using uniform random actions in Car Racing, samples from the VAE prior in MNIST, and uniform positions across the map in 2D navigation.

Dependent measures. We measure performance in MNIST using the reward model’s predicted log-likelihood of the ground-truth user labels in the test environment; in 2D navigation, the reward model’s classification accuracy on an offline dataset containing states sampled uniformly throughout the environment; and in Car Racing, the true reward collected by an MPC agent that optimizes the learned reward, where the true reward gives a bonus for driving onto new patches of road, and penalizes going off-road.

Analysis. The results in Figure 7 show that our method, which optimizes trajectories using various AFs, requires fewer queries to the user than the baseline, which randomly samples trajectories. This suggests that our four AFs guide query synthesis toward informative trajectories. These results, and the results from Section 4.1, suggest that our method benefits not only from using a generative model instead of the default training environment, but also from optimizing the AFs instead of randomly sampling from the generative model.

4.5 Effect of Regularization Constant

One of the core features of our method is that, in Equation 6, it can trade off between producing realistic queries that maximize the regularization term log p_φ(τ), and producing informative queries that maximize the AF. In this experiment, we examine how the regularization constant λ controls this trade-off, and how the trade-off affects performance. (Note that the scale of the optimal λ may depend on the scale of the AF. In our experiments, we find that the same value of λ generally works well for all four AFs. See Section A.1 in the appendix for details.)

Manipulated factors. To answer Q5, we sweep different values of the regularization constant λ. At one extreme, we constrain the query trajectories to be feasible under the generative model, by setting the next states to be the next states predicted by the dynamics model instead of free variables – we label this setting λ = ∞ for convenience (see Section A.1 in the appendix for details). At the other extreme, we set λ = 0, which allows the query trajectory to be infeasible under the model. Note that, even when λ = 0, the optimized trajectory is still regularized by the fact that it is optimized in the latent space of the state encoder e_φ, instead of, e.g., raw pixel space.

Dependent measures. We measure performance as in Section 4.1.

Figure 9: Experiments that address Q6 – how much does each of the four AFs contribute to performance? – by comparing our method to ablated variants that drop each AF, one at a time, from the set of four AFs in line 4 of Algorithm 1. The results on MNIST, 2D navigation, and Car Racing show that our method (orange) generally outperforms its ablated variants (blue, gray, red, and pink), although the usefulness of each AF depends on the domain and amount of training data. The x-axis represents the number of queries to the user, where each query elicits a label for a single state transition. The shaded areas show standard error over three random seeds.

Analysis. The results in Figure 8 show that the usefulness of generating unrealistic trajectories depends on the domain. In MNIST, producing unrealistic images by decreasing λ can improve performance, although an intermediate value works best. In 2D navigation, setting λ to a low value is critical for learning the task. Note that we only tested λ = 0 and λ = ∞ in this domain, since we intentionally set up the training and test environments as a sanity check, where λ = 0 should perform best, and λ = ∞ should not succeed. In Car Racing, constraining the queries to be feasible (λ = ∞) performs best.

There is a trade-off between being informative (by maximizing the AF) and staying on the distribution of states in the training environment (by maximizing likelihood). In domains like Car Racing – where the training and test environments have similar state distributions, and off-distribution queries can be difficult for the user to interpret and label – it makes sense to trade off being informative for staying on-distribution. In domains like MNIST and 2D navigation, where we intentionally create a significant shift in the state distribution between the training and test environments, it makes more sense to trade off staying on-distribution for being informative.

Visualizing synthesized queries. Figure 14 in the appendix shows examples of Car Racing query trajectories optimized with either λ = 0 or λ = ∞. Unsurprisingly, the λ = 0 queries appear less realistic, but clearly maximize the AF better than their λ = ∞ counterparts.

4.6 Acquisition Function Ablation Study

We propose four AFs intended to produce different types of hypotheticals. In this experiment, we investigate the contribution of each type of query to the performance of the overall method.

Manipulated factors. To answer Q6, we conduct an ablation study in which we drop each of the four AFs, one at a time, from line 4 of Algorithm 1, and measure the performance of generating queries using only the remaining three AFs. We also visualize the queries generated by each of the four AFs, to illustrate their qualitative properties.

Dependent measures. We measure performance as in Section 4.4.

Analysis. The results in Figure 9 show that the usefulness of each AF depends on the domain and the amount of training data collected.

In MNIST, dropping the uncertainty-maximizing AF hurts performance, suggesting that uncertainty-maximizing queries elicit useful labels. Dropping the novelty-maximizing AF also hurts performance when the number of queries is small, but actually improves performance once enough queries have been collected. Novelty-maximizing queries tend to be repetitive in practice: although they are distant from the training data in terms of Equation 5, they are visually similar to the existing training data in that they appear to be the same digits. Hence, while they are helpful at first, they hurt query efficiency later in training.

In 2D navigation, dropping the uncertainty-maximizing AF hurts performance, while dropping any of the other AFs individually does not hurt performance. These results suggest that uncertainty-maximizing queries can be useful in domains like MNIST and 2D navigation, where uncertainty can be modeled and estimated accurately.

In Car Racing, dropping the reward-minimizing AF hurts the most. Reward-minimizing queries elicit labels for unsafe states, which are rare in the training environment unless the agent explicitly seeks them out. Hence, this type of query performs the desired function of augmenting the training data with more examples of unsafe states, thereby making the reward model better at detecting unsafe states.

Visualizing synthesized queries. Figure 3 and Figures 12 and 14 in the appendix illustrate examples of queries generated by each of the four AFs.

In MNIST (Figure 12 in the appendix), the uncertainty-maximizing queries are digits that appear ambiguous but coherent, while the novelty-maximizing queries tend to cluster around a small subset of the digits and appear grainy.

In 2D navigation (Figure 3), the demonstrations contain mostly neutral states en route to the goal, and a few good states at the goal. If we were to train on only the demonstrations, the reward model would be unaware of the trap. Initially, the queries, which we restrict to a single state transition from the initial state to a synthesized next state, are relatively uniform. The first reward-maximizing queries are in the upper right corner, which makes sense: the demonstrations contain neutral states in the lower left, and good states farther up and to the right inside the goal region, so the reward model extrapolates that continuing up and to the right, past the goal region, is good behavior. The reward model, at this stage, is susceptible to reward hacking – a problem that gets addressed when the user labels the reward-maximizing queries in the upper right corner as neutral.

After a few more queries, the reward-maximizing queries start to cluster inside the goal region, and the reward-minimizing queries cluster inside the trap. This is helpful early during training for identifying the locations of the goal and trap. The uncertainty-maximizing queries cluster around the boundaries of the goal and the trap, since that is where model uncertainty is highest. This is helpful for refining the reward model’s knowledge of the shapes of the goal and trap. The novelty-maximizing queries get pushed to the corners of the environment. This is helpful for determining that the goal and trap are relatively small and circular, and do not bleed into the corners of the map.

In Car Racing (Figure 14 in the appendix), the reward-maximizing queries show the car driving down the road and making a turn. The reward-minimizing queries show the car going off-road as quickly as possible. The uncertainty-maximizing queries show the car driving to the edge of the road and slowing down. The novelty-maximizing queries show the car staying still, which makes sense since the training data tends to contain mostly trajectories of the car in motion.

5 Discussion

Summary.

We contribute the ReQueST algorithm for learning a reward model from user feedback on hypothetical behavior. The key idea is to automatically generate hypotheticals that efficiently determine the user’s objectives. Simulated experiments on MNIST, state-based 2D navigation, and image-based Car Racing show that our method produces accurate reward models that transfer well to new environments and require fewer queries to the user, compared to baseline methods adapted from prior work. Our method detects reward hacking before the agent is deployed, and safely learns about unsafe states. Through a hyperparameter sweep, we find that our method can trade off between producing realistic vs. informative queries, and that the optimal trade-off varies across domains. Through an ablation study, we find that the usefulness of each of the four acquisition functions we propose for optimizing queries depends on the domain and the amount of training data collected. Our experiments broadly illustrate how models of the environment can be used to improve the way we learn models of task rewards.

Limitations and future work. The main practical limitation of our method is that it requires a generative model of initial states and a forward dynamics model, which can be difficult to learn from purely off-policy data in complex, visual environments. One direction for future work is relaxing this assumption; e.g., by incrementally training a generative model on on-policy data collected from an RL agent in the training environment (kaiser2019model; hafner2018learning). Another direction is to address the safety concerns of training on unsupervised interactions by using safe expert demonstrations instead (as discussed in Section 3.6).

Our method assumes the user can label agent behaviors with rewards. For complex tasks that involve extremely long-term decision-making and high-dimensional state spaces, such as managing public transit systems or sequential drug treatments, the user may not be able to meaningfully evaluate the performance of the agent. To address this issue, one could implement ReQueST inside a framework that enables users to evaluate complex behaviors, such as recursive reward modeling (leike2018scalable) or iterated amplification (christiano2018supervising).

Acknowledgments

Thanks to Gabriella Bensinyor, Tim Genewein, Ramana Kumar, Tom McGrath, Victoria Krakovna, Tom Everitt, Zac Kenton, Richard Ngo, Miljan Martic, Adam Gleave, and Eric Langlois for useful suggestions and feedback. Thanks in particular to Gabriella Bensinyor, who proposed the name and acronym of our method: reward query synthesis via trajectory optimization, or ReQueST. This work was supported in part by an NVIDIA Graduate Fellowship.

References

Appendix A Appendix

A.1 Implementation Details

Shooting vs. collocation. We use the notation λ = ∞ to denote solving the optimization problem in Equation 6 with a shooting method instead of a collocation method. The shooting method optimizes only the actions, and represents each next state using the forward dynamics model learned in step (1).
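For comparison with the collocation-style sketch in Section 3.4, here is a minimal sketch of the shooting variant, in which only the actions are free variables and the next states are always produced by the learned dynamics model; the interfaces and optimization settings are our own assumptions.

```python
import torch

def synthesize_query_shooting(acquisition_fn, dynamics, s_init, a_init,
                              steps=200, lr=1e-2):
    """Shooting variant of the query synthesis in Equation 6 (the lambda = infinity
    setting): optimize the actions only, so the trajectory stays feasible under
    the learned dynamics model (assumed differentiable)."""
    actions = a_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        states, s = [s_init], s_init
        for a in actions:                       # next states come from the dynamics model
            s = dynamics(s, a)
            states.append(s)
        objective = acquisition_fn(torch.stack(states), actions)
        (-objective).backward()
        optimizer.step()
    return actions.detach()
```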

MNIST classification. We simulate the user in line 7 of Algorithm 1 as an expert k-nearest-neighbors classifier trained on all labeled data. We only generate queries using the uncertainty-maximizing and novelty-maximizing AFs in line 4 of Algorithm 1, since the reward-maximizing and reward-minimizing AFs are not useful for single-step classification. We replace the transition classifier with a classifier over classification actions in Equation 3 and line 12 of Algorithm 1. We represent each ensemble member in Equation 2 using a feedforward neural network with two fully-connected hidden layers containing 256 hidden units each, with several separate networks in the ensemble. The MPC agent in Equation 7 reduces to choosing the classification action with the highest predicted reward. The Gaussian prior distribution of the VAE yields the likelihood model used for regularization. The state inputs to the reward model are the latent embeddings produced by the encoder, instead of the raw pixel inputs. We set the regularization constant λ separately when synthesizing queries with the uncertainty-maximizing and novelty-maximizing AFs.
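As an illustration of how such a simulated user could be set up, the sketch below trains a k-nearest-neighbors classifier with scikit-learn; the value of k and the flattening of images are our own choices, since the paper does not specify these details here.

```python
from sklearn.neighbors import KNeighborsClassifier

def make_simulated_user(train_images, train_labels, k=5):
    """Simulated MNIST user: a k-NN classifier trained on all labeled data.

    Returns a function that labels a (possibly synthetic) image with a digit class.
    k=5 and flattening the images into vectors are illustrative choices.
    """
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_images.reshape(len(train_images), -1), train_labels)

    def label(image):
        return int(knn.predict(image.reshape(1, -1))[0])

    return label
```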

State-based 2D navigation. To encourage the agent to avoid the trap, the reward constants are asymmetric. Since the states are already low-dimensional, we simply use the identity function for the state encoder and decoder. We represent each ensemble member in Equation 2 using a feedforward neural network with two fully-connected hidden layers containing 32 hidden units each, with several separate networks in the ensemble. We hard-code a Gaussian forward dynamics model. Each episode lasts at most 1000 steps, and the maximum speed is capped. In Equation 7, we use a fixed planning horizon. In Equation 6, we use a query trajectory length of one; i.e., the query consists of a single state transition from the hard-coded initial state to a synthesized next state. We use the same regularization constant λ when synthesizing queries for any of the four AFs.

Car Racing. To encourage the agent to drive without being overly conservative, the reward constants are asymmetric. We represent each ensemble member in Equation 2 using a feedforward neural network with two fully-connected hidden layers containing 256 hidden units each, with several separate networks in the ensemble. We train a generative model using the unsupervised approach in ha2018recurrent, which learns a VAE state encoder and decoder with a 32-dimensional latent space, a recurrent dynamics model with a 256-dimensional latent space, and a mixture density network with 5 components that predicts stochastic transitions. Since the environment is partially observable, we represent the state input to the reward model by concatenating the VAE latent embedding with the RNN latent embedding. Each episode lasts at most 1000 timesteps. In Equation 7, we use a fixed planning horizon. In Equation 6, we use a fixed query trajectory length. We use the same regularization constant λ when synthesizing queries for any of the four AFs.

In the high-dimensional Car Racing environment, we find that optimizing Equation 6 directly leads to incomprehensible query trajectories, even for high values of the regularization constant λ. To address this issue, we modify the method in two ways that provide additional regularization. First, instead of optimizing the initial state, we set it to a real state sampled from the training environment during step (1). Second, instead of optimizing the latent next states directly, we optimize mixture coefficients that are used to compute the expected next state under the mixture density network, rather than using the mixture coefficients predicted by the network. Thus, the likelihood regularization term becomes a cross-entropy between the optimized and predicted mixture coefficients. This representation of the trajectory is easier to optimize, and results in more comprehensible queries.

Figure 10: Our method initializes the reward model with suboptimal user demonstrations, in line 2 of Algorithm 1. The experiments in Section 4.1 show that our method learns a reward model that enables the agent to outperform the suboptimal demonstrator. In 2D navigation (top), the agent gets to the goal faster than the demonstrator, even in the training environment – the demonstrator takes a tortuous path to the goal, while the agent goes straight to the goal. In Car Racing (bottom), the agent drives faster and visits more new road patches than the cautious, slow demonstrator. We do not include results for MNIST, since it does not make sense to initialize the classifier with incorrect labels in this domain.
Figure 11: Our method performs worse than or comparably to the baselines in Section 4.1, when the reward model is evaluated in the training environment instead of the test environment. Since there is no state distribution shift in this setting, training on real trajectories from the training environment (baselines) is more effective than training on hypothetical trajectories synthesized using our method (ReQueST). We do not include results for Car Racing, since the test environment is already identical to the training environment in this domain.
Figure 12: Examples of MNIST queries that optimize different AFs, illustrating the qualitative differences in the hypotheticals targeted by each AF. Top 10 rows: uncertainty-maximizing queries. Bottom 10 rows: novelty-maximizing queries. The uncertainty-maximizing queries are digits that appear ambiguous but coherent, while the novelty-maximizing queries tend to cluster around a small subset of the digits and appear grainy.
Figure 13: A screenshot of the image-based Car Racing video game in the OpenAI Gym.
Figure 14: Examples of Car Racing queries that optimize different AFs with different settings of the regularization constant λ, illustrating the qualitative differences in the hypotheticals targeted by each AF, and the trade-off between producing realistic (λ = ∞) vs. informative (λ = 0) queries. When λ = ∞, the reward-maximizing query shows the car driving down the road and making a turn; the reward-minimizing query shows the car going off-road as quickly as possible; the uncertainty-maximizing query shows the car driving to the edge of the road and slowing down; and the novelty-maximizing query shows the car staying still, which makes sense since the training data tends to contain mostly trajectories of the car in motion. When λ = 0, most of the behaviors are qualitatively similar to their λ = ∞ counterparts, but less realistic and more aggressive in optimizing the AF – only the novelty-maximizing query is qualitatively different, in that it seeks the boundaries of the map (the white void) instead of staying still. Full videos are available at https://sites.google.com/berkeley.edu/request.