Inverse Reinforcement Learning via Matching of Optimality Profiles

by Luis Haug, et al.

The goal of inverse reinforcement learning (IRL) is to infer a reward function that explains the behavior of an agent performing a task. Most approaches assume that the demonstrated behavior is near-optimal. In many real-world scenarios, however, examples of truly optimal behavior are scarce, and it is desirable to effectively leverage sets of demonstrations of suboptimal or heterogeneous performance, which are easier to obtain. We propose an algorithm that learns a reward function from such demonstrations together with a weak supervision signal in the form of a distribution over rewards collected during the demonstrations (or, more generally, a distribution over cumulative discounted future rewards). We view such distributions, which we also refer to as optimality profiles, as summaries of the degree of optimality of the demonstrations that may, for example, reflect the opinion of a human expert. Given an optimality profile and a small amount of additional supervision, our algorithm fits a reward function, modeled as a neural network, by essentially minimizing the Wasserstein distance between the corresponding induced distribution and the optimality profile. We show that our method learns reward functions such that policies trained to optimize them outperform the demonstrations used for fitting those reward functions.
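The core fitting step described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it uses the closed-form 1-D Wasserstein-1 distance between equal-sized empirical samples (mean absolute difference of sorted values), a linear reward model standing in for the paper's neural network, and random search standing in for gradient-based optimization. All names, shapes, and data here are hypothetical.

```python
import numpy as np

def wasserstein_1d(a, b):
    # For equal-sized 1-D empirical samples, the Wasserstein-1 distance
    # is the mean absolute difference between the sorted values.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Hypothetical setup (illustrative, not from the paper):
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 4))        # features of demonstration states
profile = rng.normal(loc=1.0, size=500)   # target optimality profile samples

def induced_returns(theta):
    # Linear reward model: each state's return under parameters theta.
    return states @ theta

# Fit theta by accepting random perturbations that reduce the
# Wasserstein loss (a crude stand-in for gradient descent).
theta = np.zeros(4)
best = wasserstein_1d(induced_returns(theta), profile)
for _ in range(2000):
    cand = theta + 0.1 * rng.normal(size=4)
    loss = wasserstein_1d(induced_returns(cand), profile)
    if loss < best:
        theta, best = cand, loss
```

In this sketch, minimizing the loss pulls the distribution of returns induced by the learned reward toward the given optimality profile, which is the matching idea the abstract describes.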


Learning a Prior over Intent via Meta-Inverse Reinforcement Learning

A significant challenge for the practical application of reinforcement l...

On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference

Our goal is for agents to optimize the right reward function, despite ho...

Multiagent Inverse Reinforcement Learning via Theory of Mind Reasoning

We approach the problem of understanding how people interact with each o...

Versatile Inverse Reinforcement Learning via Cumulative Rewards

Inverse Reinforcement Learning infers a reward function from expert demo...

Variational Inverse Control with Events: A General Framework for Data-Driven Reward Definition

The design of a reward function often poses a major practical challenge ...

Batch Reinforcement Learning from Crowds

A shortcoming of batch reinforcement learning is its requirement for rew...

LiMIIRL: Lightweight Multiple-Intent Inverse Reinforcement Learning

Multiple-Intent Inverse Reinforcement Learning (MI-IRL) seeks to find a ...