Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning

05/30/2022
by Joseph Early, et al.

We generalise the problem of reward modelling (RM) for reinforcement learning (RL) to handle non-Markovian rewards. Existing work assumes that human evaluators observe each step in a trajectory independently when providing feedback on agent behaviour. In this work, we remove this assumption, extending RM to include hidden state information that captures temporal dependencies in human assessment of trajectories. We then show how RM can be approached as a multiple instance learning (MIL) problem, and develop new MIL models that are able to capture the time dependencies in labelled trajectories. We demonstrate on a range of RL tasks that our novel MIL models can reconstruct reward functions to a high level of accuracy, and that they provide interpretable learnt hidden information that can be used to train high-performing agent policies.
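To make the MIL formulation concrete: each trajectory is treated as a bag of per-timestep instances, and supervision is available only as a single trajectory-level return label. The sketch below is a minimal, hypothetical PyTorch illustration of this idea, not the authors' implementation; names such as TrajectoryRewardMIL, state_dim and hidden_dim are assumptions. An LSTM hidden state carries the temporal context that a Markovian model would discard, and a linear head predicts a per-timestep reward whose sum is trained to match the labelled return.

```python
# Minimal sketch (assumed names, not the authors' code) of reward modelling
# as multiple instance learning: a trajectory is a "bag" of per-timestep
# feature vectors, labelled only with its overall return. An LSTM hidden
# state captures non-Markovian (temporal) dependencies, and a small head
# predicts per-timestep rewards that are sum-pooled to the bag label.
import torch
import torch.nn as nn


class TrajectoryRewardMIL(nn.Module):
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Hidden state carries information across timesteps of the trajectory.
        self.encoder = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        # Instance-level (per-timestep) reward head.
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, trajectories: torch.Tensor):
        # trajectories: (batch, timesteps, state_dim)
        hidden_seq, _ = self.encoder(trajectories)
        step_rewards = self.reward_head(hidden_seq).squeeze(-1)  # (batch, timesteps)
        # Sum-pool instance rewards to get the bag-level (trajectory) prediction.
        return step_rewards.sum(dim=1), step_rewards


if __name__ == "__main__":
    model = TrajectoryRewardMIL(state_dim=8)
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Toy batch: 4 trajectories of 20 timesteps, each labelled with a scalar return.
    trajs = torch.randn(4, 20, 8)
    returns = torch.randn(4)

    pred_return, step_rewards = model(trajs)
    loss = loss_fn(pred_return, returns)  # supervision only at the trajectory level
    loss.backward()
    optimiser.step()
```

Under this kind of setup, the per-timestep reward predictions and the learnt hidden states are the quantities that can be inspected for interpretability and reused as a reward signal for policy training, as described in the abstract.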


