Preprocessing Reward Functions for Interpretability

03/25/2022
by Erik Jenner, et al.

In many real-world applications, the reward function is too complex to be manually specified. In such cases, reward functions must instead be learned from human feedback. Since the learned reward may fail to represent user preferences, it is important to be able to validate the learned reward function prior to deployment. One promising approach is to apply interpretability tools to the reward function to spot potential deviations from the user's intention. Existing work has applied general-purpose interpretability tools to understand learned reward functions. We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions, which are then visualized. We introduce a general framework for such reward preprocessing and propose concrete preprocessing algorithms. Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.


1 Introduction

Reinforcement learning (RL) agents have reached superhuman performance in many tasks, such as games, with clearly defined objectives [Silver et al., 2016, Berner et al., 2019, Vinyals et al., 2019]. However, real-world deployment of RL is often hampered by the difficulty of specifying an appropriate reward function by hand. A variety of methods to learn reward functions have been developed to address this challenge. These algorithms learn from human feedback such as demonstrations [Ng and Russell, 2000, Ramachandran and Amir, 2007, Ziebart et al., 2008, Fu et al., 2018, Bahdanau et al., 2019, Ibarz et al., 2018, Brown et al., 2019], preferences [Akrour et al., 2011, Wilson et al., 2012, Christiano et al., 2017, Sadigh et al., 2017, Ziegler et al., 2019, Ibarz et al., 2018, Brown et al., 2019], or even the initial state of the environment [Shah et al., 2018].

This gives rise to a new problem: how can we evaluate the learned reward function and spot potential failure modes before deployment? We could train a policy on the reward model and then evaluate this policy, for example by having humans judge rollouts from it. However, this approach has serious drawbacks. Training a policy can be very expensive in complex environments, which makes it hard to quickly compare different learned rewards. It is also brittle: if the trained policy does not perform well, it is unclear whether the fault lies with the reward function or with the policy training procedure.

Moreover, we might want to know how well a learned reward transfers to different environment dynamics. Indeed, transfer is a major motivation for learning a reward rather than just directly learning a policy. But it is often challenging to specify the set of dynamics under which we would like the reward to be robust. Even where this is possible, training and evaluating a policy on all such environments is typically infeasible.

To address these problems, prior work introduced the EPIC distance [Gleave et al., 2021] to quantify the difference between two reward functions. When a ground truth reward is available, we can evaluate a learned reward simply by computing its EPIC distance to the ground truth. However, reward learning is most useful precisely when we do not have access to a ground truth reward. In this setting, EPIC may still have some limited utility for comparing and perhaps clustering several learned reward models, but it cannot tell us if any of these learned models are correct.

Michaud et al. [2020] instead suggest interpreting reward models to verify they capture user preferences. They find existing interpretability methods such as saliency maps can help understand reward models. However, they also found significant limitations in this approach, concluding that “reward interpretability may need significantly different methods from policy interpretability”.

We believe that significant advances in reward interpretability can be made by taking advantage of the special structure of reward functions. In particular, many different reward functions are equivalent, in the sense that they induce the same optimal policies – no matter the environment dynamics. Given a learned reward model, we can apply transformations that do not change the optimal policy, but simplify the reward function. We can then visualize this simplified reward instead of the original. We call this approach reward preprocessing, as we “preprocess” the reward model prior to visualization.

Our framework for reward preprocessing involves two key components: 1) a class of reward transformations that yield equivalent reward functions in some sense (e.g. by preserving the optimal policy under arbitrary environment dynamics), and 2) an objective that measures how interpretable a given reward function is. We then optimize over the class of transformations using the given objective to find the most interpretable equivalent reward function. A key property of this framework is that the learned reward model is treated as a black box. This means that it may use an arbitrary function approximator and can be learned using any reward learning algorithm and feedback modality.

In summary, our key contributions are: 1) a novel framework that exploits the intrinsic structure of reward functions to increase their interpretability before visualization; 2) two concrete applications of this framework, using different objectives for interpretability; and 3) an empirical evaluation, finding that our methods often significantly improve interpretability.

2 The Reward Preprocessing Framework

Our interpretability method operates on a reward function $R(s, a, s')$, where $s$ is the current state, $a$ is the action taken in that state, and $s'$ is the next state. Our method only requires the ability to evaluate $R$: there are no restrictions on how $R$ is computed or how it was learned. From $R$, we produce a simpler but equivalent reward function $\hat{R}$, which we then visualize.

Concrete instantiations of this framework must make two choices. First, they must specify which reward functions are deemed equivalent via an equivalence relation $\sim$. Second, they must provide some measure of "simplicity" or "interpretability", represented by a cost function $C$. We then seek to find a minimum-cost reward function that is equivalent to $R$:

$$\hat{R} \in \operatorname*{arg\,min}_{R' \sim R} C(R') \tag{1}$$

In the following, we discuss how to choose the equivalence relation $\sim$ and the cost function $C$.
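To make the abstraction concrete, the following minimal Python sketch shows the shape of this optimization when the equivalence class is parametrized by a vector $\theta$. The function names and the derivative-free optimizer are illustrative assumptions, not the paper's implementation (section 3.1 instead uses stochastic gradient descent on a neural potential).

```python
import numpy as np
from scipy.optimize import minimize

def preprocess_reward(reward_fn, transform, cost, theta0):
    """Search the family {transform(reward_fn, theta)} for a minimum-cost
    member (eq. 1), treating reward_fn as a black box.

    reward_fn: callable mapping (s, a, s') transitions to rewards.
    transform: callable (reward_fn, theta) -> equivalent reward function,
               e.g. potential shaping parametrized by theta.
    cost:      callable reward function -> scalar interpretability cost.
    """
    objective = lambda theta: cost(transform(reward_fn, theta))
    result = minimize(objective, np.asarray(theta0, float), method="Nelder-Mead")
    return transform(reward_fn, result.x)
```

Because every candidate is produced by `transform`, the returned reward is equivalent to the input by construction; only the cost decides which member of the class is shown to the user.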

2.1 Equivalence Relation

We would like to treat two rewards as equivalent if they will produce the same behaviour in the intended downstream application. It is known that potential shaping [Ng et al., 1999] and rescaling by a positive constant never change the ordering of policies. It is therefore safe to treat such rewards as equivalent for most applications.

However, some applications permit a broader notion of equivalence. For example, if the reward model will only ever be used for policy optimization in a specific task, then we can include any transformation that preserves optimal policies in that task. A simple example is $S'$-redistribution: moving reward between different successor states $s'$ while preserving the expected reward $\mathbb{E}_{s'}\left[R(s,a,s')\right]$. This will not change the optimal policy, so long as the transition dynamics determining $s'$ remain fixed. Skalse et al. [2022] characterize a variety of such equivalence classes under varying assumptions.
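As an illustrative example (not from the paper): suppose taking action $a$ in state $s$ leads to successor $s_1$ or $s_2$ with probability $\tfrac{1}{2}$ each. Shifting an amount $c$ of reward from one successor to the other leaves the expectation intact:

$$\tfrac{1}{2}\big(R(s,a,s_1) + c\big) + \tfrac{1}{2}\big(R(s,a,s_2) - c\big) = \tfrac{1}{2}R(s,a,s_1) + \tfrac{1}{2}R(s,a,s_2),$$

so every policy's expected return, and hence the optimal policy, is unchanged as long as these transition probabilities stay fixed.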

2.2 Choosing Cost Functions

The cost function should represent the interpretability of a reward function. Of course, no simple objective can capture the entire concept of interpretability since reward functions may be interpretable for a variety of reasons. For example, sparse rewards are often interpretable, as the user can pay attention only to the few transitions on which the agent receives a non-zero reward. However, a dense reward could still be easy to understand if it has some other simple structure: for example, taking on only two different values depending on which region of the world the agent is in.

Instead of looking for a single cost function that completely characterizes interpretability, we therefore suggest using multiple cost functions, each of which describes some condition that is sufficient but not necessary for interpretability. We can then find an optimal equivalent reward for each of the cost functions and present all of these rewards for the user to choose between. We can also rank the reward functions provided the cost functions are on a comparable scale, presenting lowest-cost rewards first.

Another factor determining the appropriate cost functions is the method used for visualization. For example, a reward function that has sparse output is ideal if we wish to show the user the reward of particular transitions. However, we might prefer sparsity in the features that the reward depends on if using higher-level visualization methods like saliency maps.

3 Methodology

In this section, we describe a few simple concrete instances of our reward preprocessing framework. Despite their simplicity, we find in section 4 that they nonetheless can yield significant improvements. However, these choices are likely far from optimal, and so should be viewed as establishing a lower bound on the benefit obtainable from reward preprocessing.

3.1 Potential Shaping Equivalence Relation

We define two rewards to be equivalent, $R \sim R'$, if they are equal up to potential shaping [Ng et al., 1999]. Specifically, $R \sim R'$ if there exists some real-valued state-only function $\Phi$, called a potential, for which $R'(s, a, s') = R(s, a, s') + \gamma\Phi(s') - \Phi(s)$, where $\gamma$ is the discount factor. Potential shaping changes the return of an episode by only the potential of the initial state (in the finite-horizon case, the potential of terminal states is restricted to be zero). Since the policy does not affect the initial state, the ordering over policies is invariant under potential shaping. This holds for arbitrary transition dynamics and initial state distributions. Therefore, rewards related to each other by potential shaping can be considered equivalent even under transfer to different environment dynamics.
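To spell out the invariance argument (a standard derivation, included here for completeness in the infinite-horizon discounted case), the shaping terms telescope:

$$\sum_{t=0}^{\infty} \gamma^t \big[ R(s_t, a_t, s_{t+1}) + \gamma\Phi(s_{t+1}) - \Phi(s_t) \big] = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) - \Phi(s_0),$$

assuming $\gamma^t\Phi(s_t) \to 0$. Shaping therefore shifts every policy's expected return by the same constant $-\mathbb{E}[\Phi(s_0)]$, leaving the ordering over policies unchanged.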

A notable advantage of potential shaping for our purposes is that it is very easy to optimize over the resulting equivalence class. We simply parametrize the potential as a neural network $\Phi_\theta$ with parameters $\theta$. Then the optimization problem from eq. 1 becomes

$$\min_\theta \; C\Big( (s,a,s') \mapsto R(s,a,s') + \gamma\,\Phi_\theta(s') - \Phi_\theta(s) \Big). \tag{2}$$

We optimize eq. 2 using (stochastic) gradient descent. This requires a differentiable cost function $C$, but does not require the reward function $R$ to be differentiable.
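A minimal PyTorch sketch of this optimization, assuming a fixed batch of sampled transitions whose rewards have been precomputed with the black-box model (class and function names are illustrative, not the released implementation):

```python
import torch
import torch.nn as nn

class Potential(nn.Module):
    """Small MLP potential Phi_theta(s) over state feature vectors."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

def preprocess(states, next_states, rewards, cost_fn, gamma=0.99, steps=2000):
    """Minimize cost_fn over potential-shaped versions of the given rewards.

    states, next_states: (N, state_dim) tensors of sampled transitions.
    rewards: (N,) tensor of black-box reward values for those transitions;
             gradients never flow through the reward model itself.
    cost_fn: differentiable map from an (N,) reward tensor to a scalar cost.
    """
    potential = Potential(states.shape[-1])
    optim = torch.optim.Adam(potential.parameters(), lr=1e-3)
    for _ in range(steps):
        shaped = rewards + gamma * potential(next_states) - potential(states)
        loss = cost_fn(shaped)
        optim.zero_grad()
        loss.backward()
        optim.step()
    return potential  # visualize R(s,a,s') + gamma*Phi(s') - Phi(s)
```

In practice one would resample transition batches at each step; a single fixed batch keeps the sketch short.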

Rewards differing by a positive scale factor also produce the same policy ordering. However, since our visualization techniques can handle rewards at a range of scales, we choose to preserve the scale during preprocessing. Accordingly, we do not include rescaled rewards as equivalent.

3.2 Cost Functions

We evaluate two types of cost functions: a sparsity-inducing one based on the $L^1$ norm, and a smoothness-inducing measure of absolute deviation. In tabular (gridworld) settings, we evaluate these cost functions on a uniform distribution $\mathcal{D}$ over all possible transitions. In continuous control environments, we evaluate on transitions sampled from the same distribution used for visualization.

Sparse rewards are easy to understand, as the user only needs to attend to the transitions with non-zero reward. However, the $L^0$ "norm" (the number of non-zero rewards) is non-differentiable. Moreover, even if the ground-truth reward is sparse, learned reward functions are usually not exactly equivalent to a sparse reward due to the presence of noise. We therefore use two different relaxed notions of sparsity: the $L^1$ norm and the slightly transformed version $\log(1 + |x|)$. In particular, we minimize:

$$C_{\mathrm{sparse}}(R') = \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[ f\big( R'(s, a, s') \big) \Big], \tag{3}$$

where $\mathcal{D}$ is the distribution over transitions and $f$ is either $f(x) = |x|$ or $f(x) = \log(1 + |x|)$.

An alternative is to minimize the fluctuations between rewards of transitions that are adjacent in time. This creates a smoothly varying reward signal. The user can then understand the reward by looking at the trend over time, which frees the user from having to attend to the reward at every single transition, similar to sparsity. Again, we use an $L^1$ and a logarithmic version of such a smoothness cost:

$$C_{\mathrm{smooth}}(R') = \mathbb{E}\Big[ f\big( R'(s_{t+1}, a_{t+1}, s_{t+2}) - R'(s_t, a_t, s_{t+1}) \big) \Big], \tag{4}$$

where the expectation is over pairs of transitions that are adjacent in time.
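A sketch of both cost families in PyTorch, usable as the differentiable `cost_fn` in the sketch above (illustrative code, not the released implementation):

```python
import torch

def sparsity_cost(shaped_rewards: torch.Tensor, log_version: bool = True) -> torch.Tensor:
    """Eq. 3: mean of |r| (L1) or log(1 + |r|) over sampled transitions."""
    x = shaped_rewards.abs()
    return torch.log1p(x).mean() if log_version else x.mean()

def smoothness_cost(shaped_rewards: torch.Tensor, log_version: bool = True) -> torch.Tensor:
    """Eq. 4: penalize differences between rewards of consecutive transitions.
    Assumes shaped_rewards is ordered along a trajectory."""
    diffs = (shaped_rewards[1:] - shaped_rewards[:-1]).abs()
    return torch.log1p(diffs).mean() if log_version else diffs.mean()
```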

4 Results

We evaluate our methods in two environments: a gridworld, with varying rewards, and the classic mountain car continuous control task [Boyan and Moore, 1995]. We test our method with a mixture of hand-designed and learned rewards. The hand-designed rewards consist of a simple ground-truth reward, with shaping and/or noise added to challenge the preprocessing method. The learned rewards are trained via either adversarial inverse reinforcement learning [Fu et al., 2018, AIRL], or via deep reinforcement learning from human preferences [Christiano et al., 2017, DRLHP]. Both methods are trained on synthetic data, consisting of rollouts from an expert policy (AIRL) or preference comparisons induced by the ground-truth reward (DRLHP).

In gridworld experiments, we use a tabular potential and reward model. That is, we learn a separate value $\Phi(s)$ for each state and a separate reward $R'(s, a, s')$ for each transition. In mountain car, we use small MLPs for the reward model and potentials, except in some cases where a linear potential is sufficient to find a simple equivalent reward. Our code is available at https://github.com/HumanCompatibleAI/reward-preprocessing.
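In the tabular case the potential is just a learnable vector with one entry per state. A brief sketch of how the shaped reward table can be formed and optimized (variable names and the full (s, a, s') table are simplifying assumptions; in the actual gridworld only a subset of transitions is possible):

```python
import torch

n_states, n_actions, gamma = 100, 5, 0.99  # e.g. a 10x10 gridworld

# Reward values from the (tabular) reward model for every (s, a, s') triple,
# evaluated once up front; random placeholder values here.
reward_table = torch.randn(n_states, n_actions, n_states)

phi = torch.zeros(n_states, requires_grad=True)  # tabular potential
optim = torch.optim.Adam([phi], lr=1e-2)

for _ in range(1000):
    # Shaped reward R(s,a,s') + gamma*Phi(s') - Phi(s), broadcast over the table.
    shaped = reward_table + gamma * phi[None, None, :] - phi[:, None, None]
    loss = torch.log1p(shaped.abs()).mean()  # log-sparsity cost (eq. 3)
    optim.zero_grad()
    loss.backward()
    optim.step()
```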

4.1 Simplifying Shaped Rewards

We start by testing our method in a gridworld setting [Zuo, 2018]. While unrealistic, gridworlds have the considerable benefit of allowing the entire reward function to be easily visualized. This therefore allows a more thorough evaluation of our method than in other tasks.

In figs. 1 and 2, we visualize gridworld rewards before (leftmost column) and after (middle and right columns) applying our preprocessing methods. In the Goal environment in fig. 1, the reward is non-zero only on a single goal square in the top-right corner and zero everywhere else. This is readily understood, and our preprocessing leaves this reward largely unchanged. However, when we add shaping with the Manhattan distance from the goal (second row), the reward becomes much harder to understand. Our preprocessing, however, is able to simplify this shaped reward to something close to the original sparse objective. Similar results hold for the negative Manhattan distance from the goal (third row) and for the particularly confusing random shaping (last row).

In the Path environment in fig. 2, the original reward (top left) prefers a specific path for reaching the goal state. Once again, the shaped versions obscure this, but preprocessing reliably recovers a simple and interpretable reward.

In these plots, we use the $L^1$ version of the sparsity cost and the logarithmic version of the smoothness cost. These work slightly better than the other versions, but the difference is very small. The results for all versions can be found in figs. 4 to 7 in the appendix.

4.2 Understanding Learned Rewards

In the previous experiment, all the reward functions were exactly equivalent to the simple original reward. By contrast, learned reward models may be noisy or contain systematic errors, and may not be equivalent to any simple reward. To evaluate how our method performs in this more realistic setting, we trained reward models from demonstrations (AIRL) and from preference comparisons (DRLHP) on synthetic data in both the Goal and Path environments. The results of applying our preprocessing method are shown in figs. 8 to 15 in the appendix.

For the reward model learned using preference comparisons (DRLHP), even the preprocessed models look very noisy. The goal state does tend to be somewhat more visible in the preprocessed rewards than in the unprocessed ones, but neither is easy to understand. The reward model learned by AIRL differs even more from the ground truth reward, and potential shaping is unable to bridge that gap.

It might be possible to remove more of the noise by using a larger equivalence class than potential shaping. However, expanding the equivalence class might mean the preprocessed reward would no longer induce the same optimal policy as the unmodified reward in some environment dynamics. Indeed, the fact that potential shaping is not sufficient to remove the noise suggests that what DRLHP and AIRL have learned is not just a complex but validly shaped version of the ground truth reward.


Figure 1: Preprocessing can recover a sparse reward from complex shaping. The original sparse Goal reward is shown in the top-left, with three shaped versions below. These rewards are shown after preprocessing with the sparsity (middle) and smoothness (right) cost functions. The preprocessed rewards are easy to understand, and are similar across a range of shaping. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 2: Preprocessing can recover a simpler dense reward from complex shaped rewards. The original Path reward (top-left) incentivizes following a diagonal path to the goal state. The three shaped versions below largely obscure this pattern, but preprocessing is able to recover something similar to the original reward. This is notable because the original reward is not sparse, so it still incurs a relatively high cost under the $L^1$ norm, but is nonetheless lower cost than the highly complex shaped rewards. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.

4.3 Mountain Car

Since the mountain car environment has an infinite number of possible transitions, we cannot plot the rewards of all possible transitions as we did in the gridworld tasks. Instead, we visualize reward functions by plotting the reward signal over time during expert trajectories. Figure 3 visualizes two learned reward models (left) and the reward signal after preprocessing with a log sparse (middle) and log smooth (right) cost function.

The model in the top row was trained using DRLHP on synthetically generated preferences. Specifically, we sampled Boltzmann-rational preferences between trajectory fragments based on the ground-truth reward. In the second row, we first learned an optimal state value function for the mountain car environment and then used it to shape the ground-truth reward before generating preferences. This simulates human feedback, which may be shaped relative to a sparse ground truth since humans tend to reward incremental progress [Christiano et al., 2017].
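A brief sketch of the kind of Boltzmann-rational preference sampling described here (the temperature `beta` and the function name are illustrative assumptions):

```python
import numpy as np

def sample_preference(return_a: float, return_b: float, beta: float = 1.0, rng=None) -> int:
    """Return 1 if fragment A is preferred over fragment B, else 0.

    P(A preferred) = sigmoid(beta * (G_A - G_B)), where G_A and G_B are the
    ground-truth returns of the two trajectory fragments being compared.
    """
    rng = rng or np.random.default_rng()
    p_a = 1.0 / (1.0 + np.exp(-beta * (return_a - return_b)))
    return int(rng.random() < p_a)
```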

As in the gridworld setting, both the learned and preprocessed rewards are noisy. However, the preprocessed reward functions are still significantly simpler than the learned models, especially in the shaped case. The sparsity cost function performs better here than the smoothness cost. Figure 3 uses the logarithmic version of both, but the $L^1$ version in fig. 18 yields almost exactly the same results.

Notably, the residual noise after preprocessing in fig. 3 is likely not removable by potential shaping. In particular, we find in figs. 16 and 17 that preprocessing shaped versions of the ground-truth reward recovers simple, noise-free rewards. The residual noise is therefore likely an accurate depiction of errors in the learned reward.


Figure 3: Preprocessing can simplify complex learned reward models for mountain car. The left column shows reward models learned using synthetic preference comparisons based on the ground-truth reward (top), and the ground-truth shaped with an optimal value function (bottom). Preprocessing for sparsity (middle) and smoothness (right) produces simpler and less noisy reward curves, especially in the shaped setting. Each plot shows the reward during a rollout over five episodes (separated by the gray vertical lines).

5 Related Work

Interpreting reward models has recently begun to receive some attention. Russell and Santos [2019] apply standard interpretability methods from supervised learning to reward functions. Specifically, they use feature importance estimates from a simple fitted global model and from LIME [Ribeiro et al., 2016] to interpret the reward function.

Globally fitting a simpler model to a reward function has some similarities to our reward preprocessing approach. However, a major difference is that the simple model will usually not be equivalent to the original reward function. In contrast, we learn an equivalent but still simplified reward model. This is possible because we exploit the structure that reward functions naturally have, whereas Russell and Santos [2019] only apply preexisting interpretability methods.

Michaud et al. [2020] also apply existing interpretability methods to understand reward models. In contrast to Russell and Santos, they work directly with the given reward, without fitting a simpler model. They suggest and combine three different approaches, namely gradient saliency maps, occlusion maps and hand-crafted counterfactual inputs. All of these methods can also be applied to supervised learning more broadly and do not take advantage of the structure of reward functions.

Our reward preprocessing framework is complementary to these methods for interpreting reward functions. We advocate first preprocessing a given reward to select a maximally comprehensible equivalent reward function. The resulting reward function can then be visualized or otherwise interpreted using a range of techniques.

While only a handful of works seek to understand learned reward functions, considerably more work has focused on interpreting policies [Puiutta and Veith, 2020]. One approach is to learn a policy from a class of intrinsically simple functions rather than neural networks [Verma et al., 2018]. Alternatively, Juozapaitis et al. [2019] present a method that explains policy actions by an additive decomposition of $Q$-values. Another promising recent direction is using causal models to explain policy behavior [Madumal et al., 2020, Déletang et al., 2021].

Devidze et al. [2021] approach interpretability of reward functions from a different angle: rather than interpreting a complex learned reward function, they aim to design a reward function that trades off between interpretability (operationalized as sparsity) and ease of policy optimization.

6 Limitations and Future Work

One limitation of our approach is that while potential shaping does not change the optimal policy, it can make the policy optimization problem easier or harder. Consequently, the policy learned by an RL algorithm might well differ between the unmodified learned reward and the theoretically "equivalent" reward used for visualization. This issue is most significant in environments where policy optimization is challenging. Reasoning about how shaping affects RL algorithm performance is itself difficult, so this remains a significant factor even when the tool is being used by trained practitioners.

The above limitation is a way in which potential shaping can be too big an equivalence class. However, there is also a sense in which it is too small. In practice, we do not usually care about a reward function transferring to all possible transition dynamics. If it is known the transition dynamics satisfy certain invariants, then we may be able to use a larger equivalence class while still guaranteeing optimal policy preservation.

In addition to modifying the equivalence class, there are also numerous alternative cost functions that could be employed. In particular, the cost functions we suggest are targeted at the visualizations we use in this paper. Other visualizations might benefit from different cost functions. However, it seems likely that the basic concepts of sparsity and smoothness will be useful in many settings. For example, visualizations using gradient saliency maps might benefit from maximizing the sparsity of the gradients, rather than of the rewards themselves.

7 Conclusion

We have introduced a novel framework to preprocess reward functions prior to visualization. Our empirical results demonstrate this methodology can recover simple reward functions from shaped versions of ground-truth rewards. Moreover, our method can substantially simplify even noisy learned reward models. However, some low-quality learned reward models are still difficult to understand even with our method, suggesting that reward learning algorithms often converge to models significantly different from the user’s intended preferences.

References

  • R. Akrour, M. Schoenauer, and M. Sebag (2011) Preference-based policy learning. In Machine Learning and Knowledge Discovery in Databases, Cited by: §1.
  • D. Bahdanau, F. Hill, J. Leike, E. Hughes, A. Hosseini, P. Kohli, and E. Grefenstette (2019) Learning to understand goal specifications by modelling reward. In ICLR, Cited by: §1.
  • C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang (2019) Dota 2 with large scale deep reinforcement learning. Note: arXiv: 1912.06680 [cs.LG] External Links: 1912.06680 Cited by: §1.
  • J. Boyan and A. W. Moore (1995) Generalization in reinforcement learning: safely approximating the value function. In NIPS, pp. 369–376. Cited by: §4.
  • D. S. Brown, W. Goo, P. Nagarajan, and S. Niekum (2019) Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In ICML, Cited by: §1.
  • P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017) Deep reinforcement learning from human preferences. In NeurIPS, pp. 4299–4307. Cited by: §1, §4.3, §4.
  • G. Déletang, J. Grau-Moya, M. Martic, T. Genewein, T. McGrath, V. Mikulik, M. Kunesch, S. Legg, and P. A. Ortega (2021) Causal analysis of agent behavior for AI safety. Note: arXiv: 2103.03938 [cs.AI] External Links: 2103.03938 Cited by: §5.
  • R. Devidze, G. Radanovic, P. Kamalaruban, and A. Singla (2021) Explicable reward design for reinforcement learning agents. In NeurIPS, Cited by: §5.
  • J. Fu, K. Luo, and S. Levine (2018) Learning robust rewards with adversarial inverse reinforcement learning. In ICLR, Cited by: §1, §4.
  • A. Gleave, M. Dennis, S. Legg, S. Russell, and J. Leike (2021) Quantifying differences in reward functions. In ICLR, Cited by: §1.
  • B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei (2018) Reward learning from human preferences and demonstrations in Atari. In NeurIPS, Cited by: §1.
  • Z. Juozapaitis, A. Koul, A. Fern, M. Erwig, and F. Doshi-Velez (2019) Explainable reinforcement learning via reward decomposition. In IJCAI Workshop on Explainable Artificial Intelligence, Cited by: §5.
  • P. Madumal, T. Miller, L. Sonenberg, and F. Vetere (2020) Explainable reinforcement learning through a causal lens. In AAAI, Cited by: §5.
  • E. J. Michaud, A. Gleave, and S. Russell (2020) Understanding learned reward functions. Note: NeurIPS Deep RL workshop External Links: 2012.05862 Cited by: §1, §5.
  • A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Cited by: §2.1, §3.1.
  • A. Y. Ng and S. Russell (2000) Algorithms for inverse reinforcement learning. In ICML, Cited by: §1.
  • E. Puiutta and E. M. S. P. Veith (2020) Explainable reinforcement learning: a survey. In Machine Learning and Knowledge Extraction, Cited by: §5.
  • D. Ramachandran and E. Amir (2007) Bayesian inverse reinforcement learning. In IJCAI, Cited by: §1.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In ACM SIGKDD, Cited by: §5.
  • J. Russell and E. Santos (2019) Explaining reward functions in Markov decision processes. In International Florida Artificial Intelligence Research Society Conference, Cited by: §5.
  • D. Sadigh, A. D. Dragan, S. S. Sastry, and S. A. Seshia (2017) Active preference-based learning of reward functions. In RSS, Cited by: §1.
  • R. Shah, D. Krasheninnikov, J. Alexander, P. Abbeel, and A. Dragan (2018) Preferences implicit in the state of the world. In ICLR, Cited by: §1.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
  • J. Skalse, M. Farrugia-Roberts, S. Russell, A. Abate, and A. Gleave (2022) Invariance in policy optimisation and partial identifiability in reward learning. arXiv. Cited by: §2.1.
  • A. Verma, V. Murali, R. Singh, P. Kohli, and S. Chaudhuri (2018) Programmatically interpretable reinforcement learning. In ICML, Cited by: §5.
  • O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354. Cited by: §1.
  • A. Wilson, A. Fern, and P. Tadepalli (2012) A Bayesian approach for policy learning from trajectory preference queries. In NeurIPS, Cited by: §1.
  • B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In AAAI, Cited by: §1.
  • D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving (2019) Fine-tuning language models from human preferences. Note: arXiv: 1909.08593v2 [cs.CL] External Links: 1909.08593 Cited by: §1.
  • X. Zuo (2018) Mazelab: a customizable framework to create maze and gridworld environments.. Note: https://github.com/zuoxingdong/mazelab Cited by: §4.1.

Appendix A Appendix


Figure 4: Goal reward preprocessed with $L^1$ versions of the sparsity and smoothness cost functions. The results are very similar to those in fig. 1. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 5: Goal reward preprocessed with logarithmic versions of the sparsity and smoothness cost functions. Again, the results are qualitatively similar to those in fig. 1. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 6: Path reward preprocessed with $L^1$ versions of the sparsity and smoothness cost functions. The results are very similar to those in fig. 2. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 7: Path reward preprocessed with logarithmic versions of the sparsity and smoothness cost functions. Compared to the sparse cost function in fig. 6, the log sparse cost recovers a slightly less symmetric but still significantly simplified reward. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 8: Reward models trained on synthetic data from the Goal reward using preference comparison (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the $L^1$ versions of the sparsity and smoothness costs. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 9: Reward models trained on synthetic data from the Goal reward using preference comparison (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the logarithmic version of the sparsity and smoothness cost. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 10: Reward models trained on synthetic data from the Path reward using preference comparison (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the $L^1$ versions of the sparsity and smoothness costs. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 11: Reward models trained on synthetic data from the Path reward using preference comparison (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the logarithmic version of the sparsity and smoothness cost. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 12: Reward models learned using AIRL from expert demonstrations for the Goal reward (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the $L^1$ versions of the sparsity and smoothness costs. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 13: Reward models learned using AIRL from expert demonstrations for the Goal reward (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the logarithmic versions of the sparsity and smoothness cost. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 14: Reward models learned using AIRL from expert demonstrations for the Path reward (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the $L^1$ versions of the sparsity and smoothness costs. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 15: Reward models learned using AIRL from expert demonstrations for the Path reward (leftmost column) and preprocessed versions of these (middle and right). The cost functions used here are the logarithmic versions of the sparsity and smoothness cost. Each heatmap shows the rewards for all possible transitions in a gridworld. The circle in the center of each square represents the reward for staying in that state. The four triangles in each square represent the reward of transitions leaving that square in each of the four directions.


Figure 16: Preprocessing simplifies rewards in the continuous mountain car environment. The top-left shows the ground-truth reward over time, with three shaped versions below. The middle and right columns show these rewards after preprocessing using the logarithmic sparsity and smoothness metrics. For the first two (linear) shapings, preprocessing recovers the ground-truth reward exactly (up to a constant shift). In the more complex case in the last row, preprocessing still significantly simplifies the reward. See fig. 17 for versions with an $L^1$ cost function. Each plot shows the reward during a rollout over five episodes (separated by the gray vertical lines).


Figure 17: The top-left shows the ground-truth reward in mountain car over time, with three shaped versions below. The middle and right columns show these rewards after preprocessing using the $L^1$ sparsity and smoothness metrics. This works reasonably well for these simple shaped rewards, although in the more complex last row these cost functions appear to perform less well than the logarithmic versions in fig. 16. Each plot shows the reward during a rollout over five episodes (separated by the gray vertical lines).


Figure 18: Preprocessing can simplify complex learned reward models for mountain car. The left column shows reward models learned using synthetic preference comparisons based on the ground-truth reward (top), and the ground truth shaped with an optimal value function (bottom). Preprocessing for sparsity (middle) and smoothness (right) produces simpler and less noisy reward curves, especially in the shaped setting. The results here, using the $L^1$ cost functions, are extremely similar to, although perhaps slightly worse than, the logarithmic versions used in fig. 3. Each plot shows the reward during a rollout over five episodes (separated by the gray vertical lines).