Defining and Characterizing Reward Hacking

09/27/2022
by Joar Skalse, et al.

We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, ℛ̃, leads to poor performance according to the true reward function, ℛ. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it "narrower") or overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition. In particular, for the set of all stochastic policies, two reward functions can only be unhackable if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
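To make the definition concrete, here is a minimal sketch (ours, not the authors' code) of a brute-force unhackability check over a finite set of policies, the setting in which the paper shows non-trivial unhackable pairs always exist. It relies on the linearity noted in the abstract: in a tabular MDP, the expected return is the inner product of the reward vector with a policy's discounted state-action occupancy measure, J_ℛ(π) = ⟨ℛ, μ_π⟩. The function names, the tolerance, and the toy occupancy measures below are illustrative assumptions.

```python
# Hypothetical sketch (not code from the paper): brute-force check that a
# proxy reward is unhackable relative to the true reward over a *finite*
# policy set. All names and the toy data below are assumptions.
import itertools
import numpy as np

def expected_return(reward: np.ndarray, occupancy: np.ndarray) -> float:
    """Expected return is linear in the discounted state-action occupancy
    measure: J_R(pi) = <R, mu_pi>."""
    return float(np.dot(reward.ravel(), occupancy.ravel()))

def is_unhackable(proxy: np.ndarray, true_reward: np.ndarray,
                  occupancies: list, tol: float = 1e-9) -> bool:
    """The proxy is unhackable on this policy set iff there is no pair of
    policies pi1, pi2 where pi1 gets a strictly higher proxy return but a
    strictly lower true return, i.e. increasing the expected proxy return
    can never decrease the expected true return."""
    for mu1, mu2 in itertools.permutations(occupancies, 2):
        gains_proxy = expected_return(proxy, mu1) > expected_return(proxy, mu2) + tol
        loses_true = expected_return(true_reward, mu1) < expected_return(true_reward, mu2) - tol
        if gains_proxy and loses_true:
            return False  # found a reward-hacking pair
    return True

# Toy demo over 2 state-action pairs and 3 policies (occupancy measures).
true_r = np.array([1.0, 0.0])
scaled = 0.5 * true_r           # positive rescaling preserves the return ordering
narrow = np.array([1.0, -1.0])  # a "narrower" proxy penalising the second action
occs = [np.array([0.8, 0.2]), np.array([0.5, 0.5]), np.array([0.1, 0.9])]
print(is_unhackable(scaled, true_r, occs))  # True
print(is_unhackable(narrow, true_r, occs))  # True on this set, but not in general
```

Over the set of all stochastic policies no such finite enumeration is possible, which is consistent with the paper's result that, in that setting, two reward functions can only form an unhackable pair if one of them is constant.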


