Preprocessing Reward Functions for Interpretability

03/25/2022

by Erik Jenner, et al.

In many real-world applications, the reward function is too complex to be manually specified. In such cases, reward functions must instead be learned from human feedback. Since the learned reward may fail to represent user preferences, it is important to be able to validate the learned reward function prior to deployment. One promising approach is to apply interpretability tools to the reward function to spot potential deviations from the user's intention. Existing work has applied general-purpose interpretability tools to understand learned reward functions. We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions, which are then visualized. We introduce a general framework for such reward preprocessing and propose concrete preprocessing algorithms. Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
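One standard family of equivalence-preserving transformations is potential-based shaping (Ng et al., 1999): adding γΦ(s′) − Φ(s) to the reward for any potential function Φ leaves optimal policies unchanged, so preprocessing can search within this family for the version of a reward that is easiest to read. The sketch below illustrates the idea in a tabular setting; the function names, the smoothed-L1 simplicity objective, and plain gradient descent are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def shape(reward, phi, gamma):
    """Potential-shaped reward R'(s, s') = R(s, s') + gamma*phi(s') - phi(s).

    `reward` is an (n, n) array indexed by (state, next_state);
    `phi` is a length-n potential vector. Shaping by any phi preserves
    optimal policies (Ng et al., 1999), so R and R' are equivalent.
    """
    return reward + gamma * phi[None, :] - phi[:, None]

def simplify_reward(reward, gamma=0.9, lr=0.02, steps=5000, eps=0.1):
    """Illustrative preprocessing: pick phi to make the shaped reward sparse.

    Minimizes a smooth surrogate of the L1 norm, sum(sqrt(r^2 + eps)),
    over the shaped reward via gradient descent on phi. This is one
    plausible instantiation of 'find a simpler equivalent reward'.
    """
    n = reward.shape[0]
    phi = np.zeros(n)
    for _ in range(steps):
        shaped = shape(reward, phi, gamma)
        g = shaped / np.sqrt(shaped**2 + eps)  # d/dx sqrt(x^2 + eps)
        # phi[k] enters column k with +gamma and row k with -1:
        grad = gamma * g.sum(axis=0) - g.sum(axis=1)
        phi -= lr * grad
    return shape(reward, phi, gamma), phi

# Demo: start from a sparse reward, obfuscate it with a random-looking
# potential, then recover a (nearly) sparse equivalent by preprocessing.
gamma = 0.9
r_sparse = np.zeros((4, 4))
r_sparse[0, 1] = 1.0                      # reward only for the 0 -> 1 transition
phi_true = np.array([1.0, -1.0, 0.5, 2.0])
r_obfuscated = shape(r_sparse, phi_true, gamma)   # equivalent but hard to read
r_simple, phi_found = simplify_reward(r_obfuscated, gamma=gamma)
```

After preprocessing, `r_simple` has a much smaller L1 norm than `r_obfuscated` while remaining in the same equivalence class, which is exactly the property that makes the reward easier to visualize and inspect.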


Related research

- Understanding Learned Reward Functions (12/10/2020)
- Quantifying Differences in Reward Functions (06/24/2020)
- On The Fragility of Learned Reward Functions (01/09/2023)
- Programming by Rewards (07/14/2020)
- Driving Style Encoder: Situational Reward Adaptation for General-Purpose Planning in Automated Driving (12/07/2019)
- Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement (07/13/2023)
- Can Differentiable Decision Trees Learn Interpretable Reward Functions? (06/22/2023)
