Impossibility of deducing preferences and rationality from human policy
Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behaviour. However, human planning systematically deviates from rationality. Though some IRL work assumes humans are noisily rational, there has been little analysis of the general problem of inferring the reward of a human of unknown rationality. The observed behaviour can, in principle, be decomposed into two components: a reward function and a planning algorithm that maps the reward function to a policy. Both of these variables have to be inferred from behaviour. This paper presents a "No Free Lunch" theorem in this area, showing that, without making 'normative' assumptions beyond the data, nothing about the human reward function can be deduced from human behaviour. Unlike most No Free Lunch theorems, this one cannot be alleviated by regularising with simplicity assumptions: the simplest hypotheses are generally degenerate. The paper then sketches how one might begin to use normative assumptions to get around the problem, without which solving the general IRL problem is impossible. The reward-function/planning-algorithm formalism can also be used to encode what it means for an agent to manipulate or override human preferences.
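The non-identifiability of the (reward, planner) decomposition can be illustrated with a toy sketch. The example below is hypothetical and much simpler than the paper's formalism: a "planner" maps a reward function to a policy, and two very different (planner, reward) pairs compose to the same observed policy, so the policy alone cannot distinguish them.

```python
# Toy illustration (hypothetical, not the paper's formalism): two
# (planner, reward) pairs that produce the same observed policy.

ACTIONS = ["left", "right"]

def reward(action):
    """A simple reward function over actions."""
    return {"left": 0.0, "right": 1.0}[action]

def rational_planner(r):
    """Planner that picks the action maximising the reward r."""
    return max(ACTIONS, key=r)

def anti_rational_planner(r):
    """Planner that picks the action minimising the reward r."""
    return min(ACTIONS, key=r)

def negated(r):
    """The reward function -r."""
    return lambda a: -r(a)

# A rational agent maximising `reward` ...
policy_a = rational_planner(reward)
# ... and an anti-rational agent minimising `-reward` ...
policy_b = anti_rational_planner(negated(reward))

# ... produce the identical policy, so behaviour alone cannot
# tell us whether the agent's true reward is `reward` or `-reward`.
assert policy_a == policy_b == "right"
```

Both decompositions fit the data perfectly, and the "anti-rational" pair is no more complex than the "rational" one, which is why simplicity priors alone do not resolve the ambiguity.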