## 1 Introduction

A typical approach for building AI systems breaks the problem into two steps: 1) design a reward function; and 2) write an algorithm to optimize that reward function. In practice, system designers interleave these steps – after optimizing the designed reward function, they see what bad behavior it incentivizes, allowing them to design a better reward function.

Reward design, the problem of selecting an appropriate reward function, is both critically important, as it encodes the task the system should perform, and challenging, as it requires the system designer to anticipate all possible behaviors in all possible environments and determine appropriate incentives or penalties for each one. Hadfield-Menell et al. (2017) note that in practice, the reward function provided by the designer is a *proxy* for the true reward, and should only be assumed to incentivize good behavior *in the training environment*. The authors use this assumption to attack the *inverse reward design* (IRD) problem: recovering the true reward function from the designer’s proxy reward function.

However, IRD only uses the final proxy reward function created after the designer has finished the entire reward design process. We may hope to do better by taking advantage of the information *within* the reward design process. In this paper, we propose structuring the reward design process as a series of queries about good behavior, each of which can be answered easily by the system designer. This lets us learn about the true reward from each query, instead of only from the final proxy reward.

Consider, for example, the personal shopping assistant in Fig. 1, top. Alice wants her robot to buy healthy foods from the supermarket. She designs a set of features that capture the ingredients and nutrients of the available products. She rewards vitamin A, so her robot gets carrots. However, unbeknownst to Alice, the supermarket introduces a new product: energy bars, which contain vitamin A, but also the rare unhealthy ingredient *Maltodextrin*. Alice forgot to penalize Maltodextrin because the store originally contained no products with both Maltodextrin and vitamin A. IRD would observe that vitamin A must be better than fat, since otherwise eggs would be chosen over carrots, but it cannot infer much about Maltodextrin, since for a wide range of weights for Maltodextrin the correct decision would still be to buy carrots. In contrast, if we could compare cake and eggs during the reward design process, we could learn that Maltodextrin is worse than fat.

We break the reward design problem into a series of smaller reward design problems, referred to as *queries*, and use IRD to update our beliefs about the true reward function. This is often a simpler space to optimize over and can reuse the same environment. For each query, the designer is presented with a set of candidate reward functions, and she chooses the best reward function *out of that set*. In the example above, active IRD will ask Alice to choose between two rewards: one under which the robot would choose cake and one under which it would choose eggs. Alice would choose the latter, allowing us to infer that Maltodextrin is worse than fat. The key idea is that *by choosing the set of rewards carefully, we can learn from Alice’s choices between different suboptimal rewards*, which a single iteration of IRD cannot do since it only observes a single (approximately) optimal proxy reward. In this work, we actively select queries that maximize the information gain about the true reward. This is in contrast to approaches that actively select environments (Amin et al., 2017) or trajectories (Sadigh et al., 2017; Cui and Niekum, 2018), which are often more complex and higher dimensional than reward parameters.

Our contributions are as follows: we 1) structure the reward design process as a series of queries from which we can learn using IRD; 2) design two kinds of queries, discrete and feature queries, emphasizing simplicity, usability and informativeness; 3) design algorithms that select queries for the designer that maximize expected information gain about the true reward; and 4) evaluate this approach with experiments using simulated human models in a personal shopping assistant domain and a 2D navigation domain. We find that our method leads to reduced regret at test time compared with vanilla IRD, often fully recovering the true reward function. Our results indicate that actively selecting the set of available reward functions is a promising direction for increasing the efficiency and effectiveness of reward design.

## 2 Background

### 2.1 Inverse Reward Design

In *inverse reward design* (IRD) (Hadfield-Menell et al., 2017) the goal is to infer a distribution over the true reward function the designer wants to optimize given an observed, but possibly misspecified, proxy reward function. This allows the agent to avoid problems that can arise in test environments due to an incorrect proxy reward function, such as unintended side effects. IRD achieves this by identifying and formalizing a key assumption: *proxy reward functions are likely to the extent that they lead to high true utility behavior in the training environment.*

We represent the fixed training environment faced by our agent as a Markov decision process without reward, $\tilde{M}$. The policy is given by $\pi(\xi \mid \tilde{r}, \tilde{M})$, a distribution over trajectories $\xi$. This could correspond to a planning or reinforcement learning algorithm. We assume that the system designer selects a proxy reward $\tilde{r}$ from a space of options $\tilde{\mathcal{R}}$ so that the resulting behavior approximately optimizes a true reward function $r^*$ (which the designer knows implicitly). That is, we assume the designer approximately solves the *reward design problem* (Singh et al., 2009).

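In the inverse reward design problem, the training environment $\tilde{M}$, the proxy space $\tilde{\mathcal{R}}$, and the chosen proxy reward $\tilde{r}$ are known, and the true reward $r^*$ must be inferred under the IRD assumption that $\tilde{r}$ incentivizes approximately optimal behavior in $\tilde{M}$.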

Note that the space of true rewards $\mathcal{R}$ need not be the same as the space of proxies $\tilde{\mathcal{R}}$. In this work, we assume that a proxy reward is a linear function of pre-specified features (either hand-coded or learned through some other technique), and so we write it as $\tilde{r}(\xi) = \tilde{w}^\top \phi(\xi)$, where $\tilde{w}$ is a weight vector and $\phi(\xi)$ are the features of trajectory $\xi$. However, the only assumption we make about the true reward space $\mathcal{R}$ is that we can perform Bayesian inference over it. For simplicity, in our evaluation we use linear functions of features for $\mathcal{R}$, but our techniques would work for more complex models such as Bayesian neural nets, which would allow us to infer rewards in complex environments where features are hard to obtain.

The next section describes how to formalize the IRD assumption into an invertible likelihood model for the proxy reward $\tilde{r}$.

### 2.2 Observation Model

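An optimal designer would choose the proxy reward that maximizes the expected true value in the training environment $\tilde{M}$. IRD models the designer as approximately optimal with rationality parameter $\beta$:

$$P(\tilde{r} \mid r^*, \tilde{M}) \propto \exp\Big(\beta \, \mathbb{E}\big[r^*(\xi) \mid \xi \sim \pi(\xi \mid \tilde{r}, \tilde{M})\big]\Big) \tag{1}$$

where $\tilde{r}$ is the proxy reward, $r^*$ the true reward, and $\pi(\xi \mid \tilde{r}, \tilde{M})$ the distribution over trajectories $\xi$ obtained by optimizing $\tilde{r}$ in $\tilde{M}$. This model is then inverted to obtain the object of interest, the posterior distribution over true rewards $P(r^* \mid \tilde{r}, \tilde{M})$.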

Cost of inference. Computing the likelihood (1) is expensive as it requires calculating the normalization constant by summing over all possible proxy rewards and solving a planning problem for each. We cache a sample of trajectories for each proxy reward and re-use them to evaluate the likelihood for different potential true reward functions.

Conversely, the normalization constant for the posterior integrates over the true reward space and requires no additional planning. Approximate inference methods such as MCMC do not compute this normalizer at all.
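The caching scheme described above can be illustrated with a minimal sketch. It assumes a finite set of proxy rewards, a finite set of candidate true rewards that are linear in the features, and cached mean trajectory features per proxy. The names `proxy_likelihood`, `posterior`, and the toy numbers are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def proxy_likelihood(true_w, proxy_feats, beta):
    """Return P(proxy i | true reward) for every proxy i.

    proxy_feats[i] holds cached mean trajectory features for proxy i, so
    dot(true_w, proxy_feats[i]) estimates the expected true value of the
    behavior that proxy i induces in the training environment.
    """
    scores = [beta * dot(true_w, f) for f in proxy_feats]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    normalizer = sum(exps)  # sums over every proxy reward
    return [e / normalizer for e in exps]

def posterior(prior, true_ws, proxy_feats, chosen, beta):
    """Bayes update over a finite set of candidate true rewards."""
    unnorm = [p * proxy_likelihood(w, proxy_feats, beta)[chosen]
              for p, w in zip(prior, true_ws)]
    total = sum(unnorm)  # posterior normalizer: no planning needed
    return [u / total for u in unnorm]

# Hypothetical toy problem: two proxies, two candidate true rewards.
proxy_feats = [[1.0, 0.0], [0.0, 1.0]]
true_ws = [[1.0, -1.0], [-1.0, 1.0]]
belief = posterior([0.5, 0.5], true_ws, proxy_feats, chosen=0, beta=2.0)
```

Because the cached features are reused for every candidate true reward, the planner runs once per proxy and never inside the posterior update.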

### 2.3 Related work

A variety of approaches for learning reward functions have been proposed. In *inverse reinforcement learning* (IRL) (Ng and Russell, 2000; Ramachandran and Amir, 2007; Ziebart et al., 2008), the agent observes demonstrations of (approximately) optimal behavior, and infers a reward function that explains this behavior. Reward functions have also been learned from *expert ratings* (Daniel et al., 2014) and *human reinforcement* (Knox and Stone, 2009; Warnell et al., 2017).

Methods that learn reward functions from preferences, surveyed in Wirth et al. (2017), are particularly relevant to our work. Christiano et al. (2017) learn a reward function from preferences over pairs of trajectories, by sampling trajectories from a policy learned by deep RL and querying the user about pairs with high uncertainty. A similar setup is used in Wirth et al. (2016) and Akrour et al. (2012) based around other policy optimization methods. It is also possible to learn reward functions from preferences on actions (Fürnkranz et al., 2012) and states (Runarsson and Lucas, 2014).

Our work is most similar to Sadigh et al. (2017), which finds queries through gradient-based optimization in the trajectory space of a continuous environment. Their objective is expected volume removed from the hypothesis space by the query, which has an effect similar to our method of optimizing for information gain. We differ by learning from preferences over reward functions rather than over trajectories (sequences of state-action pairs).

Of course, a preference over reward functions implies a preference over (sets of) trajectories. However, we can create more targeted queries by optimizing directly in the relatively small, structured space of proxy rewards. Moreover, since we can maintain a well-calibrated distribution over true rewards, we know how far we are from obtaining the true reward function (as long as it lies in the space of true rewards over which we perform inference). In some cases, we can exactly recover the true reward, guaranteeing generalization to new environments.

## 3 Query Design and Selection

In vanilla IRD, the designer selects an approximately optimal proxy reward from a large proxy space. In this work, the designer instead selects the reward function from small, actively chosen subsets of that space, which we call *queries*. We first outline the criterion used to choose between queries, before describing two query types: discrete and feature queries.

### 3.1 Active selection criterion

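We choose queries that maximize the expected information gained about the true reward given the user’s answer (Houlsby et al., 2011). Let $\mathcal{D}$ denote the previous queries and answers, and $P(r^* \mid \mathcal{D})$ the current belief over the true reward function $r^*$. For a candidate query $Q$, we can compute the predictive distribution over the user’s answer $\tilde{r} \in Q$, written $P(\tilde{r} \mid Q, \mathcal{D})$, from the IRD observation model in (1) by marginalizing over possible $r^*$. The mutual information between the random variables $r^*$ and $\tilde{r}$ is:

$$I(r^*; \tilde{r} \mid Q, \mathcal{D}) = H[\tilde{r} \mid Q, \mathcal{D}] - \mathbb{E}_{r^* \sim P(r^* \mid \mathcal{D})}\big[H[\tilde{r} \mid r^*, Q]\big] \tag{2}$$

where $H[X]$ is the entropy $-\sum_x P(x) \log P(x)$.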

This is also known as expected information gain. The first term in (2) is the predictive entropy, the model’s uncertainty about the user’s answer. This is a common active learning criterion, and works well for supervised active learning (Gal et al., 2017). Since predictive entropy selects for query-dependent noise in user answers, we subtract the second term, which ensures that, in expectation, the user is not uncertain about their answer.
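As a rough illustration of this criterion, the sketch below evaluates predictive entropy minus expected conditional entropy for a finite query and a finite set of candidate true rewards. The function names and the table `answer_probs` (the probability of each answer under each candidate true reward) are assumptions made for the example, not part of the method's specification.

```python
import math

def entropy(p):
    # Shannon entropy in nats; zero-probability outcomes contribute nothing.
    return -sum(x * math.log(x) for x in p if x > 0)

def expected_info_gain(belief, answer_probs):
    """belief[j] = P(true reward j); answer_probs[j][i] = P(answer i | true reward j)."""
    n_answers = len(answer_probs[0])
    # Predictive distribution: marginalize the answer model over the belief.
    predictive = [sum(b * row[i] for b, row in zip(belief, answer_probs))
                  for i in range(n_answers)]
    # Expected entropy of the answer if the true reward were known.
    conditional = sum(b * entropy(row) for b, row in zip(belief, answer_probs))
    return entropy(predictive) - conditional

uniform = [0.5, 0.5]
# A query whose answer reveals the true reward exactly.
informative = expected_info_gain(uniform, [[1.0, 0.0], [0.0, 1.0]])
# A query whose answer is pure noise regardless of the true reward.
noisy = expected_info_gain(uniform, [[0.5, 0.5], [0.5, 0.5]])
```

Both toy queries have the same predictive entropy, but only the first has positive information gain, which is why the second term matters.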

### 3.2 Discrete queries

A simple form of query is a finite set of reward functions. For small sets, the user can observe the effects of each of the proxy rewards on the policy, so we can expect their answer to be nearly optimal. This also implies that the features do not have to be interpretable and could be learned.

Exploiting information about suboptimal proxies.
Consider the case of a perfectly rational designer, i.e. rationality parameter $\beta \to \infty$. In this setting, vanilla IRD merely learns which proxy reward is best, which means that a priori there are no more than $|\tilde{\mathcal{R}}|$ possible outcomes, where $\tilde{\mathcal{R}}$ is the proxy space. However, using discrete queries of size 2, we can compare two arbitrary rewards, allowing us to learn a complete preference ordering on $\tilde{\mathcal{R}}$, which could have up to $|\tilde{\mathcal{R}}|!$ outcomes. This conveys the maximal amount of information about the true reward that can be learned using only the training environment, since the designer’s answer to *any* such query can be perfectly predicted using the ordering.
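The counting argument can be made concrete with a small calculation. The sketch below compares the number of possible outcomes of a single best-proxy choice with the number of complete orderings, and converts both to bits as an upper bound on the information conveyed. The proxy space size of 10 is a hypothetical value chosen for illustration.

```python
import math

def outcomes_single_choice(n_proxies):
    # Vanilla IRD with a perfectly rational designer: one best proxy.
    return n_proxies

def outcomes_full_ordering(n_proxies):
    # Pairwise queries can reveal a complete ordering of the proxies.
    return math.factorial(n_proxies)

n = 10  # hypothetical proxy space size
bits_choice = math.log2(outcomes_single_choice(n))
bits_ordering = math.log2(outcomes_full_ordering(n))
```

For this size the bound grows from about 3.3 bits to about 21.8 bits.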

Returning to Alice’s shopping assistant, Figure 1 (bottom) shows that the assistant can choose a discrete query that has Alice compare two *suboptimal* choices, cake and eggs, from which we can infer that Maltodextrin is worse than fat, after which the assistant avoids the newly introduced energy bars.

Greedy query selection. Searching over all queries of a given size requires evaluating the expected information gain for every subset of the proxy space of that size, which quickly becomes infeasible. We therefore grow queries greedily up to the desired size, which requires far fewer evaluations (Algorithm 1(a)). Empirically we find this compares favorably to a large random search.
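A minimal sketch of greedy query growth follows. It is not Algorithm 1(a) itself: it assumes a precomputed table `values[j][i]` giving the value, under candidate true reward `j`, of the behavior induced by proxy `i`, and it seeds the query with the best pair because a single-element query carries no information. Both choices, and all names, are assumptions of this example.

```python
import math
from itertools import combinations

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def info_gain(query, belief, values, beta):
    """Expected information gain of a discrete query (a sequence of proxy indices)."""
    # Answer distribution restricted to the query, for each candidate true reward.
    rows = [softmax([beta * v[i] for i in query]) for v in values]
    predictive = [sum(b * r[k] for b, r in zip(belief, rows))
                  for k in range(len(query))]
    return entropy(predictive) - sum(b * entropy(r) for b, r in zip(belief, rows))

def greedy_query(pool, size, belief, values, beta):
    """Grow a query one proxy at a time; requires 2 <= size <= len(pool)."""
    gain = lambda q: info_gain(q, belief, values, beta)
    # Seed with the best pair, since singletons always have zero gain.
    query = list(max(combinations(pool, 2), key=gain))
    while len(query) < size:
        rest = [i for i in pool if i not in query]
        query.append(max(rest, key=lambda i: gain(query + [i])))
    return query

# Hypothetical toy table: three candidate true rewards, four proxies.
values = [[1.0, 0.0, 0.0, 0.5],
          [0.0, 1.0, 0.0, 0.5],
          [0.0, 0.0, 1.0, 0.5]]
belief = [1.0 / 3.0] * 3
query = greedy_query(range(4), 3, belief, values, beta=5.0)
```

After the seeding step, each greedy step scores at most one candidate per remaining proxy, rather than every subset of the target size.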

Proxy pool. Many reward functions lead to the same optimal policy: a major problem for inverse reinforcement learning (Ng and Russell, 2000), but an advantage for us. Even if we discard many proxy rewards, most possible behaviors will remain. To this end, we initially sample a pool of proxy rewards uniformly from the proxy space, and perform active selection from this much smaller subset. We compute belief updates over the entire true reward space, so we can still recover the true reward function.

We precompute trajectory samples for every proxy reward in the pool, which are needed for the likelihood in (1). This means that we never need to run planning during query selection or inference, making our method very efficient during designer interaction.

### 3.3 Feature queries

Recent work (Basu et al., 2018) shows that determining the relevant features in a user’s preferences leads to more efficient learning. Inspired by this, we consider queries where the designer specifies weights for a small set of features while the query specifies fixed weight values for any feature *not* in the set.
Concretely, a feature query is characterized by a set of free weights and a *valuation* that assigns a value to each fixed weight; its size is the number of free weights. It corresponds to the set of all reward functions whose fixed weights agree with the valuation. The user must then specify the free weights to uniquely determine a proxy reward. We could imagine a graphical user interface in which the designer can move sliders for each weight, and see the effect on the policy, in order to answer the query.

Discretization. Exactly evaluating the expected information gain is only tractable for finite queries. We therefore discretize the free (but not the fixed) weights. A coarse discretization is used for query selection, while a finer grid is used for user input.
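To make the discretized feature query concrete, the sketch below enumerates the finite set of weight vectors obtained by holding the fixed weights at their valuation and letting each free weight range over a grid. The function name, the grid values, and the example valuation are hypothetical.

```python
import itertools

def feature_query(free_idx, valuation, grid):
    """Enumerate the weight vectors of a discretized feature query.

    valuation is a full-length weight vector; its entries at the free
    indices are placeholders that get overwritten by grid values.
    """
    rewards = []
    for combo in itertools.product(grid, repeat=len(free_idx)):
        w = list(valuation)
        for idx, val in zip(free_idx, combo):
            w[idx] = val
        rewards.append(w)
    return rewards

grid = [-1.0, 0.0, 1.0]            # coarse grid, as used for query selection
valuation = [0.5, 0.0, -1.0]       # weight 1 is free, weights 0 and 2 are fixed
one_free = feature_query([1], valuation, grid)
two_free = feature_query([0, 1], valuation, grid)
```

The query size grows exponentially in the number of free weights, which is one reason to keep that number small.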

Feature query selection. There are two variables to optimize over: which features are free, and the values of the fixed features. We select the free features greedily to maximize expected information gain, similarly to discrete queries, as shown in Algorithm 1(b).

Tuning the fixed weights is more difficult as we are optimizing over a continuous space. We use gradient descent, for which we wrote a differentiable implementation of value iteration based on Tamar et al. (2016). We found gradient descent often converges to a local maximum of (2), and so we used a small random search over valuations to find a good initialization, improving results considerably. Random search by itself works reasonably well and can be used when differentiable planning algorithms are not available.
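The random-search step could look roughly like the following sketch. It assumes the fixed weights lie in a bounded interval and that some objective (here a toy stand-in for the expected information gain) can be evaluated for a candidate valuation. The bounds, the objective, and all names are assumptions made for this example.

```python
import random

def random_search(objective, n_fixed, n_samples, low, high, rng):
    """Return the best of n_samples uniformly sampled valuations and its score."""
    best_val, best_score = None, float("-inf")
    for _ in range(n_samples):
        candidate = [rng.uniform(low, high) for _ in range(n_fixed)]
        score = objective(candidate)
        if score > best_score:
            best_val, best_score = candidate, score
    return best_val, best_score

def toy_objective(valuation):
    # Hypothetical stand-in for expected information gain; peaks at 0.3.
    return -sum((v - 0.3) ** 2 for v in valuation)

start, score = random_search(toy_objective, n_fixed=2, n_samples=50,
                             low=-1.0, high=1.0, rng=random.Random(0))
```

The returned valuation can be used directly or as the starting point for a gradient-based optimizer.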

Comparison to discrete queries. Discrete queries are computationally efficient, but are sample inefficient. The designer can easily choose from a small set of proxy reward functions, but each choice will yield only a small amount of information, necessitating many queries. Larger queries are more informative, but it is challenging for the designer to select from a large, unstructured set.

Feature queries are low-dimensional affine subspaces of the proxy reward function space. In each query, the designer can judge the effects of a few individual features that are currently most informative to tune. This increases the information received per query without overburdening the user, at the cost of a substantial increase in computational complexity.

## 4 Evaluation

Our primary metric is the test environment regret obtained when we plan using the posterior mean reward across a set of unseen test environments. We supplement this with another metric, the entropy of the agent’s belief over the true reward, which measures how uncertain the agent is about the true reward. Each experiment consists of a fixed number of queries and is averaged over multiple runs; we report the two measures after each query. Human input is simulated with the likelihood (1).
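The two metrics can be sketched for a one-step setting in which planning reduces to picking the highest-reward option. This simplification, the linear rewards, and the names below are assumptions of the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def posterior_mean(belief, true_ws):
    n = len(true_ws[0])
    return [sum(b * w[k] for b, w in zip(belief, true_ws)) for k in range(n)]

def best_product(w, products):
    return max(range(len(products)), key=lambda a: dot(w, products[a]))

def mean_test_regret(belief, true_ws, true_w, test_envs):
    """Average true-reward loss from planning with the posterior mean."""
    w_mean = posterior_mean(belief, true_ws)
    total = 0.0
    for products in test_envs:
        chosen = best_product(w_mean, products)
        optimal = best_product(true_w, products)
        total += dot(true_w, products[optimal]) - dot(true_w, products[chosen])
    return total / len(test_envs)

def belief_entropy(belief):
    return -sum(b * math.log(b) for b in belief if b > 0)

# Hypothetical toy setup: two candidate true rewards, one test environment.
true_ws = [[1.0, 0.0], [0.0, 1.0]]
true_w = true_ws[0]
test_envs = [[[1.0, 0.0], [0.0, 1.0]]]
regret_wrong = mean_test_regret([0.2, 0.8], true_ws, true_w, test_envs)
regret_right = mean_test_regret([1.0, 0.0], true_ws, true_w, test_envs)
```

Regret is zero whenever the posterior mean selects the same option as the true reward, even if the belief still has nonzero entropy.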

We seek to answer the following questions: (1) do many small queries help more than a single large query, as hypothesized in Section 3.2, (2) how much does active selection improve upon random selection, (3) does the heuristic of greedy selection sacrifice substantial performance, (4) for feature queries, how much does free feature selection and valuation optimization help, and (5) which queries are most sample efficient?

Environments. Active IRD is performed on one training environment, and evaluated on many test environments, all drawn from the same distribution and sharing the same true reward. We tested on two environment distributions. The shopping domain is a simple one-step decision problem similar to a bandit problem. There is only one state and a fixed set of actions (products), each described by a vector of features (ingredients) drawn i.i.d. from a Gaussian distribution for each environment and product. The features are weighted by the true reward weights, and the task is to pick the product with the highest reward. Conceptually, a robot is trained in one store to select unseen products in many others.

The 2D navigation task is a featurized GridWorld with random walls and objects in random positions for each environment. The features are given by the Euclidean distances to these objects, and the true reward weights describe the ‘temperatures’ of each object. Hot objects should be approached and cold ones avoided, so we call these environments ‘Chilly Worlds’. The policy is computed using a fixed number of steps of soft value iteration, which is differentiable.
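A shopping-style environment of the kind described above can be sketched in a few lines. The standard normal feature distribution, the store sizes, and the example weights are assumptions made for illustration and are not taken from the experiments.

```python
import random

def make_store(n_products, n_features, rng):
    """Each product is a feature vector with i.i.d. Gaussian entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_products)]

def reward(w, features):
    return sum(a * b for a, b in zip(w, features))

def choose_product(w, store):
    # One-step decision: pick the product with the highest linear reward.
    return max(range(len(store)), key=lambda a: reward(w, store[a]))

rng = random.Random(0)
true_w = [1.0, -0.5, 0.25]          # hypothetical true weights
train_store = make_store(5, 3, rng)
test_stores = [make_store(5, 3, rng) for _ in range(4)]
picks = [choose_product(true_w, s) for s in test_stores]
```

All stores share `true_w` but contain different products, mirroring the idea of training in one store and generalizing to others.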

True reward space. While in principle our method can be applied to any true reward space amenable to Bayesian inference, for computational efficiency we consider a finite space of true reward functions that are linear functions of the features, with its size held fixed unless otherwise specified. As a result, instead of approximating the distribution over true rewards, we can compute it exactly. This allows us to evaluate the effect of our queries without worrying about variance in the results arising from the randomness in approximate inference algorithms.

### 4.1 Benefits of small queries

In Section 3.2 we hypothesized that smaller queries allow us to learn from comparisons between *suboptimal* behaviors, which vanilla IRD cannot do. To test this, we compare the performance of randomly chosen discrete queries of various sizes. Note that full IRD is equivalent to a single query containing the entire proxy space. In this experiment, we used the maximal proxy space and reduced the size of the reward space to make exact IRD feasible. IRD was run repeatedly to show its convergence behavior, although it would normally be run only once.

Figure 2(a) shows that using a smaller query space attains better generalization performance than full IRD after as few as five queries, validating our hypothesis. In the Shopping environment, smaller queries can bring the entropy and regret down to nearly zero, indicating that the true reward function has been successfully identified. Note that performance on the first query always increases with query size: a small query size only helps after a few queries, when new queries must be able to explore the designer’s preferences among new behaviors that previous queries have not explored.

### 4.2 Discrete query selection

We next turn to greedy active selection of discrete queries (Algorithm 1(a)). To evaluate how useful active selection is, we compare to a baseline of random query selection. To evaluate whether the greedy heuristic sacrifices performance, we would like to compare to a baseline that searches the entire space of discrete queries to find the best one. However, this is computationally infeasible, and so we compare against a large search over randomly chosen queries (which still takes much longer to evaluate than greedy selection). The hyperparameter was set to a value that was more than enough to distinguish between potential true rewards.

Figure 2(b) shows that active selection substantially outperforms random queries and full IRD. Active selection becomes more important over time, likely because a random query is unlikely to target the small amount of remaining uncertainty at later stages. Moreover, greedy query selection matches a large search over random queries, confirming previous empirical results showing greedy algorithms are approximately optimal for information gain (Sharma et al., 2015).

### 4.3 Feature query selection

For feature queries, we would like to evaluate how useful it is to optimize each part of the query. So, we compare three alternatives: (1) randomly choosing free features, (2) actively selecting free features, and (3) actively selecting free features and optimizing the valuation of fixed features. In both (1) and (2), the fixed weights are held at a constant value rather than optimized.

Figure 3(a) compares these alternatives for feature queries with a single free feature. We find that both parts of the query should be optimized for maximal sample efficiency. However, if computational efficiency is paramount, greedy selection of the free features alone is a cheap and adequate alternative.

### 4.4 Sample efficiency of queries

To evaluate sample efficiency, we measure the *cumulative* test regret, that is, the sum of the test regrets after each query. An algorithm that learns faster will have lower test regret earlier in the process, leading to low cumulative test regret.
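As a small worked example of this metric, the sketch below accumulates per-query test regret for two hypothetical learners. The regret values are invented purely to illustrate the computation.

```python
from itertools import accumulate

def cumulative_regret(regrets):
    """Running sum of the test regret measured after each query."""
    return list(accumulate(regrets))

# Hypothetical regret curves: both start equal, one learns faster.
fast = cumulative_regret([1.0, 0.2, 0.0])
slow = cumulative_regret([1.0, 0.8, 0.5])
```

The faster learner ends with the lower cumulative value even though both curves start from the same regret.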

Figure 3(b) shows that for discrete queries, using larger query sizes (up to size 10) substantially reduces cumulative test regret. Feature queries are effective even when only a single feature is tuned at a time, matching size 10 discrete queries in the ‘chilly world’ and size 5 queries in the ‘shopping’ environment. There is a modest reduction in test regret from size 2 feature-based queries in the ‘shopping’ domain, matching size 10 discrete queries, but little improvement in the ‘chilly world’ domain.

## 5 Discussion

Summary.
Inverse reward design (IRD) only uses the final proxy reward chosen by the designer. This work structures the iterative reward design process as a series of simpler reward design queries. The simpler queries allow us to query the designer about their preference between *suboptimal* rewards. This provides information not available with vanilla IRD. Active IRD iteratively asks the user to choose from a set of reward functions, and uses IRD to update the belief about the true reward function. We designed two types of queries that trade off between usability, computational efficiency and sample efficiency. We demonstrate that this leads to better identification of the correct reward, less human effort, and reduced regret in novel environments.

Limitations and future work.
The primary contribution of our work is a conceptual investigation of a novel approach to learning reward functions. As a result, we have focused on simple environments which do not require a huge engineering effort to get results from. We do not expect any *conceptual* difficulty with realistic environments with non-linear rewards; indeed, the formalism in this paper already allows us to use Bayesian neural nets to represent the true reward. Naturally, doing inference in more complex spaces poses challenges that we hope to explore in future work. A second key question for future work is how to apply the feature selection approach with a broad set of potential features.

Another limitation is that when some feature is not seen at all in the training environment, as in the Lava world example in Hadfield-Menell et al. (2017), we retain our uncertainty over it no matter what queries we choose. In future work, we intend to investigate how this can be mitigated through active design of environments, e.g., as in Amin et al. (2017).

While our evaluation established the performance benefits of active IRD, this was under a simulated human model. We would like to perform user studies to test how accurately real system designers can answer various types of queries. This would also help test our hypothesis that users are more accurate at picking from a small set than a large proxy space.

We hope that our work inspires new methods for reward design, such as new types of reward design queries. Overall, we are excited about the implications of active IRD, not only in the short term but also for the general study of the value alignment problem.

## Acknowledgements

Removed for double blind submission.

## References

- Akrour et al. (2012) Akrour, R., Schoenauer, M., and Sebag, M. (2012). APRIL: Active preference learning-based reinforcement learning. In ECMLPKDD, pages 116–131.
- Amin et al. (2017) Amin, K., Jiang, N., and Singh, S. (2017). Repeated inverse reinforcement learning. In NIPS, pages 1815–1824.
- Basu et al. (2018) Basu, C., Singhal, M., and Dragan, A. D. (2018). Learning from richer human guidance: Augmenting comparison-based learning with feature queries. In HRI, pages 132–140.
- Christiano et al. (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. In NIPS, pages 4302–4310.
- Cui and Niekum (2018) Cui, Y. and Niekum, S. (2018). Active reward learning from critiques. In ICRA, pages 6907–6914. IEEE.
- Daniel et al. (2014) Daniel, C., Viering, M., Metz, J., Kroemer, O., and Peters, J. (2014). Active reward learning. In RSS.
- Fürnkranz et al. (2012) Fürnkranz, J., Hüllermeier, E., Cheng, W., and Park, S.-H. (2012). Preference-based reinforcement learning: a formal framework and a policy iteration algorithm. Machine Learning, 89(1):123–156.
- Gal et al. (2017) Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep Bayesian active learning with image data. In ICML.
- Hadfield-Menell et al. (2017) Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S. J., and Dragan, A. (2017). Inverse reward design. In NIPS.
- Houlsby et al. (2011) Houlsby, N., Huszar, F., Ghahramani, Z., and Lengyel, M. (2011). Bayesian active learning for classification and preference learning. CoRR, abs/1112.5745.
- Knox and Stone (2009) Knox, W. B. and Stone, P. (2009). Interactively shaping agents via human reinforcement: The TAMER framework. In KCAP, pages 9–16. ACM.
- Ng and Russell (2000) Ng, A. Y. and Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In ICML.
- Ramachandran and Amir (2007) Ramachandran, D. and Amir, E. (2007). Bayesian inverse reinforcement learning. In IJCAI.
- Runarsson and Lucas (2014) Runarsson, T. P. and Lucas, S. M. (2014). Preference learning for move prediction and evaluation function approximation in othello. IEEE Transactions on Computational Intelligence and AI in Games, 6(3):300–313.
- Sadigh et al. (2017) Sadigh, D., Dragan, A., Sastry, S., and Seshia, S. A. (2017). Active preference-based learning of reward functions. In RSS.
- Sharma et al. (2015) Sharma, D., Kapoor, A., and Deshpande, A. (2015). On greedy maximization of entropy. In ICML.
- Singh et al. (2009) Singh, S., Lewis, R. L., and Barto, A. G. (2009). Where do rewards come from? In CogSci, pages 2601–2606.
- Tamar et al. (2016) Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In NIPS, pages 2154–2162.
- Warnell et al. (2017) Warnell, G., Waytowich, N. R., Lawhern, V., and Stone, P. (2017). Deep TAMER: interactive agent shaping in high-dimensional state spaces. CoRR, abs/1709.10163.
- Wirth et al. (2017) Wirth, C., Akrour, R., Neumann, G., and Fürnkranz, J. (2017). A survey of preference-based reinforcement learning methods. JMLR, 18(1):4945–4990.
- Wirth et al. (2016) Wirth, C., Fürnkranz, J., Neumann, G., et al. (2016). Model-free preference-based reinforcement learning. In AAAI, pages 2222–2228.
- Ziebart et al. (2008) Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.