Online Learning of Non-Markovian Reward Models

09/26/2020
by Gavin Rens, et al.

There are situations in which an agent should receive rewards only after having accomplished a series of previous tasks; that is, rewards are non-Markovian. One natural and quite general way to represent history-dependent rewards is via a Mealy machine, a finite-state automaton that produces output sequences from input sequences. In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves, and a Mealy machine synchronized with this MDP that formalizes the non-Markovian reward function. While the MDP is known to the agent, the reward function is not and must be learned. To overcome this challenge, we use Angluin's L* active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine (MRM). Formal methods are used to determine the optimal strategy for answering the so-called membership queries posed by L*. Moreover, we prove that the expected reward achieved will eventually be at least a given, reasonable value provided by a domain expert. We evaluate our framework on three problems. The results show that using L* to learn an MRM in a non-Markovian reward decision process is effective.
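The abstract gives no code, so what follows is only a minimal sketch, under stated assumptions, of how a Mealy machine can encode a history-dependent reward. It assumes that each MDP step is mapped to a label, that the machine reads one label per step, and that it emits one reward per transition. Running the machine on a label sequence produces a reward sequence, which is the kind of answer a membership (output) query from L* expects. The class name `MealyRewardMachine`, the method `output_query`, the labels "a", "b", "n", and the reward values are hypothetical illustrations and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class MealyRewardMachine:
    """Hypothetical minimal Mealy reward machine (illustrative, not the paper's code)."""

    initial: int
    # (machine state, input label) -> (next machine state, emitted reward)
    delta: Dict[Tuple[int, str], Tuple[int, float]]

    def output_query(self, labels: Sequence[str]) -> List[float]:
        """Return the reward emitted at each step while reading `labels`.

        This is the answer an L*-style membership/output query would expect
        for the input word `labels`.
        """
        state = self.initial
        rewards: List[float] = []
        for label in labels:
            state, reward = self.delta[(state, label)]
            rewards.append(reward)
        return rewards


# Assumed task: reward 1.0 only the first time "b" is seen after an earlier "a".
# Label "n" stands for "neither a nor b". Machine states:
#   0 = no "a" seen yet, 1 = "a" seen, 2 = task done.
task_a_then_b = MealyRewardMachine(
    initial=0,
    delta={
        (0, "a"): (1, 0.0), (0, "b"): (0, 0.0), (0, "n"): (0, 0.0),
        (1, "a"): (1, 0.0), (1, "b"): (2, 1.0), (1, "n"): (1, 0.0),
        (2, "a"): (2, 0.0), (2, "b"): (2, 0.0), (2, "n"): (2, 0.0),
    },
)

# The same label "b" yields 0.0 at step 1 but 1.0 at step 4, because the
# reward depends on the history encoded in the machine state.
print(task_a_then_b.output_query(["b", "a", "n", "b", "b"]))
```

The design choice here is that the machine state carries exactly the history the reward depends on, so the pair (MDP state, machine state) is Markovian even though the reward over MDP states alone is not.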

