Reward Shaping for Reinforcement Learning with Omega-Regular Objectives

01/16/2020 · by E. M. Hahn, et al.

Recently, successful approaches have been made to exploit good-for-MDPs automata (Büchi automata with a restricted form of nondeterminism) for model-free reinforcement learning; this class of automata subsumes good-for-games automata and the most widespread class of limit-deterministic automata. The foundation for using these Büchi automata is that, for good-for-MDPs automata, the Büchi condition can be translated to reachability. The drawback of this translation is that the rewards are, on average, reaped very late, which requires long episodes during the learning process. We devise a new reward shaping approach that overcomes this issue. We show that the resulting model is equivalent to a discounted payoff objective with a biased discount that simplifies and improves on prior work in this direction.


1 Preliminaries

A nondeterministic Büchi automaton is a tuple $\mathcal{A} = \langle \Sigma, Q, q_0, \Delta, \Gamma \rangle$, where $\Sigma$ is a finite alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\Delta \subseteq Q \times \Sigma \times Q$ are transitions, and $\Gamma \subseteq \Delta$ is the transition-based acceptance condition.

A run $r$ of $\mathcal{A}$ on $w \in \Sigma^\omega$ is an $\omega$-word $r_0, w_0, r_1, w_1, \ldots$ in $(Q \times \Sigma)^\omega$ such that $r_0 = q_0$ and, for $i > 0$, it is $(r_{i-1}, w_{i-1}, r_i) \in \Delta$. We write $\mathrm{inf}(r)$ for the set of transitions that appear infinitely often in the run $r$. A run $r$ of $\mathcal{A}$ is accepting if $\mathrm{inf}(r) \cap \Gamma \neq \emptyset$.

The language, $\mathcal{L}_{\mathcal{A}}$, of $\mathcal{A}$ (or, recognised by $\mathcal{A}$) is the subset of words in $\Sigma^\omega$ that have accepting runs in $\mathcal{A}$. A language is $\omega$-regular if it is accepted by a Büchi automaton. An automaton is deterministic if $(q, \sigma, q'), (q, \sigma, q'') \in \Delta$ implies $q' = q''$. $\mathcal{A}$ is complete if, for all $\sigma \in \Sigma$ and all $q \in Q$, there is a transition $(q, \sigma, q') \in \Delta$. A word in $\Sigma^\omega$ has exactly one run in a deterministic, complete automaton.
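
Acceptance can be checked effectively on ultimately periodic words. The following minimal Python sketch (our own illustration; the names NBA and accepts_lasso are not from the paper) represents a Büchi automaton with transition-based acceptance and decides whether a lasso word $u\,v^\omega$ has an accepting run by searching for a reachable cycle through an accepting transition.

    class NBA:
        """Nondeterministic Buechi automaton with transition-based acceptance."""
        def __init__(self, alphabet, states, initial, transitions, accepting):
            self.alphabet = alphabet        # finite alphabet Sigma
            self.states = states            # finite set of states Q
            self.initial = initial          # initial state q0
            self.transitions = transitions  # set of triples (q, letter, q2)
            self.accepting = accepting      # accepting transitions Gamma, a subset

        def successors(self, q, letter):
            return {q2 for (q1, a, q2) in self.transitions if q1 == q and a == letter}

    def accepts_lasso(nba, u, v):
        """Does the automaton accept the ultimately periodic word u (v)^omega?  (v non-empty)"""
        letters, n, m = list(u) + list(v), len(u), len(v)
        nxt = lambda i: i + 1 if i + 1 < n + m else n   # positions loop back into v
        # Product of the automaton with the lasso-shaped word; edges carry acceptance flags.
        edges = []
        for q in nba.states:
            for i in range(n + m):
                for q2 in nba.successors(q, letters[i]):
                    edges.append(((q, i), (q2, nxt(i)),
                                  (q, letters[i], q2) in nba.accepting))
        def reachable(src):
            seen, stack = {src}, [src]
            while stack:
                node = stack.pop()
                for a, b, _ in edges:
                    if a == node and b not in seen:
                        seen.add(b)
                        stack.append(b)
            return seen
        start = reachable((nba.initial, 0))
        # Accepting iff some accepting edge lies on a cycle that is reachable from the start.
        return any(a in start and a in reachable(b) for a, b, acc in edges if acc)

    # Example: the automaton accepting exactly the words with infinitely many 'a'.
    inf_a = NBA({'a', 'b'}, {'q'}, 'q',
                {('q', 'a', 'q'), ('q', 'b', 'q')}, {('q', 'a', 'q')})
    assert accepts_lasso(inf_a, 'b', 'a') and not accepts_lasso(inf_a, 'a', 'b')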

A Markov decision process (MDP) is a tuple $\mathcal{M} = \langle S, A, T, \Sigma, L \rangle$, where $S$ is a finite set of states, $A$ is a finite set of actions, $T : S \times A \rightharpoonup \mathcal{D}(S)$, where $\mathcal{D}(S)$ is the set of probability distributions over $S$, is the probabilistic transition function, $\Sigma$ is an alphabet, and $L : S \times A \times S \to \Sigma$ is the labelling function of the set of transitions. For a state $s \in S$, $A(s)$ denotes the set of actions available in $s$. For states $s, s' \in S$ and $a \in A(s)$, we have that $T(s,a)(s')$ equals $\Pr(s' \mid s, a)$.

A run of $\mathcal{M}$ is an $\omega$-word $s_0, a_1, s_1, a_2, \ldots \in S \times (A \times S)^\omega$ such that $\Pr(s_{i+1} \mid s_i, a_{i+1}) > 0$ for all $i \geq 0$. A finite run is a finite such sequence. For a run $r = s_0, a_1, s_1, \ldots$ we define the corresponding labelled run as $L(r) = L(s_0, a_1, s_1), L(s_1, a_2, s_2), \ldots \in \Sigma^\omega$. We write $\mathrm{Runs}^{\mathcal{M}}$ ($\mathrm{FRuns}^{\mathcal{M}}$) for the set of runs (finite runs) of $\mathcal{M}$ and $\mathrm{Runs}^{\mathcal{M}}(s)$ ($\mathrm{FRuns}^{\mathcal{M}}(s)$) for the set of runs (finite runs) of $\mathcal{M}$ starting from state $s$. When the MDP is clear from the context, we drop the annotation $\mathcal{M}$.
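
As a concrete data structure, such a labelled MDP can be captured by two dictionaries. The sketch below (an illustration with invented names, not the paper's notation) samples a finite labelled run, i.e. a prefix of $L(r)$, under a uniformly random, and hence positional, strategy.

    import random

    class MDP:
        def __init__(self, transitions, labels, initial):
            self.transitions = transitions  # transitions[s][a] = {s2: T(s, a)(s2)}
            self.labels = labels            # labels[(s, a, s2)] = L(s, a, s2) in Sigma
            self.initial = initial          # initial state s0

        def actions(self, s):               # A(s): the actions available in s
            return list(self.transitions[s])

        def step(self, s, a):               # sample s2 with probability T(s, a)(s2)
            succ = self.transitions[s][a]
            return random.choices(list(succ), weights=list(succ.values()))[0]

    def sample_labelled_run(mdp, length):
        s, word = mdp.initial, []
        for _ in range(length):
            a = random.choice(mdp.actions(s))   # placeholder for a strategy
            s2 = mdp.step(s, a)
            word.append(mdp.labels[(s, a, s2)])
            s = s2
        return word                             # a finite prefix of the labelled run

    # Example: a two-state MDP over the alphabet {'a', 'b'}.
    toy = MDP({'s0': {'go': {'s0': 0.5, 's1': 0.5}}, 's1': {'go': {'s0': 1.0}}},
              {('s0', 'go', 's0'): 'a', ('s0', 'go', 's1'): 'b', ('s1', 'go', 's0'): 'a'},
              's0')
    print(''.join(sample_labelled_run(toy, 10)))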

A strategy in $\mathcal{M}$ is a function $\mu : \mathrm{FRuns}^{\mathcal{M}} \to \mathcal{D}(A)$ such that, for all finite runs $r$, we have $\mathrm{supp}(\mu(r)) \subseteq A(\mathrm{last}(r))$, where $\mathrm{supp}(d)$ is the support of the distribution $d$ and $\mathrm{last}(r)$ is the last state of $r$. Let $\mathrm{Runs}^{\mathcal{M}}_{\mu}(s)$ denote the subset of runs $\mathrm{Runs}^{\mathcal{M}}(s)$ that correspond to strategy $\mu$ and initial state $s$. Let $\mathfrak{S}_{\mathcal{M}}$ be the set of all strategies. We say that a strategy $\mu$ is pure if $\mu(r)$ is a point distribution for all runs $r$, and we say that $\mu$ is positional if $\mathrm{last}(r) = \mathrm{last}(r')$ implies $\mu(r) = \mu(r')$ for all finite runs $r, r'$.

The behaviour of an MDP $\mathcal{M}$ under a strategy $\mu$ with starting state $s$ is defined on a probability space $(\mathrm{Runs}^{\mathcal{M}}_{\mu}(s), \mathcal{F}^{\mathcal{M}}_{\mu}(s), \Pr^{\mathcal{M}}_{\mu}(s))$ over the set of infinite runs of $\mu$ from $s$. Given a random variable $f : \mathrm{Runs}^{\mathcal{M}} \to \mathbb{R}$ over the set of infinite runs, we write $\mathbb{E}^{\mathcal{M}}_{\mu}(s)\{f\}$ for the expectation of $f$ over the runs of $\mathcal{M}$ from state $s$ that follow strategy $\mu$.

Given an MDP $\mathcal{M}$ and an automaton $\mathcal{A} = \langle \Sigma, Q, q_0, \Delta, \Gamma \rangle$, we want to compute an optimal strategy satisfying the objective that the labelled run of $\mathcal{M}$ is in the language of $\mathcal{A}$. We define the semantic satisfaction probability for $\mathcal{A}$ and a strategy $\mu$ from state $s$ as:

    $\mathrm{PSem}^{\mathcal{M}}_{\mathcal{A}}(s, \mu) = \Pr^{\mathcal{M}}_{\mu}(s)\{ r \in \mathrm{Runs}^{\mathcal{M}}_{\mu}(s) : L(r) \in \mathcal{L}_{\mathcal{A}} \}$ and $\mathrm{PSem}^{\mathcal{M}}_{\mathcal{A}}(s) = \sup_{\mu \in \mathfrak{S}_{\mathcal{M}}} \mathrm{PSem}^{\mathcal{M}}_{\mathcal{A}}(s, \mu)$.

When using automata for the analysis of MDPs, we need a syntactic variant of the acceptance condition. Given an MDP $\mathcal{M} = \langle S, A, T, \Sigma, L \rangle$ with initial state $s_0 \in S$ and an automaton $\mathcal{A} = \langle \Sigma, Q, q_0, \Delta, \Gamma \rangle$, the product $\mathcal{M} \times \mathcal{A} = \langle S \times Q, A \times Q, T^\times, \Gamma^\times \rangle$ is an MDP augmented with an initial state $(s_0, q_0)$ and accepting transitions $\Gamma^\times$. The function $T^\times : (S \times Q) \times (A \times Q) \rightharpoonup \mathcal{D}(S \times Q)$ is defined by

    $T^\times\big((s,q),(a,q')\big)\big((s',q'')\big) = \begin{cases} T(s,a)(s') & \text{if } q'' = q' \text{ and } (q, L(s,a,s'), q') \in \Delta,\\ 0 & \text{otherwise.} \end{cases}$

Finally, $\Gamma^\times \subseteq (S \times Q) \times (A \times Q) \times (S \times Q)$ is defined by $\big((s,q),(a,q'),(s',q')\big) \in \Gamma^\times$ if, and only if, $(q, L(s,a,s'), q') \in \Gamma$ and $T(s,a)(s') > 0$. A strategy on the MDP defines a strategy on the product, and vice versa. We define the syntactic satisfaction probabilities as

    $\mathrm{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s, \mu) = \Pr^{\mathcal{M}\times\mathcal{A}}_{\mu}\big((s,q_0)\big)\{ r \in \mathrm{Runs}^{\mathcal{M}\times\mathcal{A}}_{\mu}((s,q_0)) : \mathrm{inf}(r) \cap \Gamma^\times \neq \emptyset \}$ and $\mathrm{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s) = \sup_{\mu} \mathrm{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s, \mu)$.

Note that $\mathrm{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s_0) = \mathrm{PSem}^{\mathcal{M}}_{\mathcal{A}}(s_0)$ holds for a deterministic $\mathcal{A}$. In general, $\mathrm{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s_0) \leq \mathrm{PSem}^{\mathcal{M}}_{\mathcal{A}}(s_0)$ holds, but equality is not guaranteed, because the optimal resolution of nondeterministic choices may require access to future events.
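
For concreteness, here is one way to compute the product with dictionary encodings as in the earlier sketches (our own code; in particular, enabling a product action $(a, q')$ only when the automaton move to $q'$ is possible for every letter the MDP can produce under $a$ is a convention of this sketch, chosen to keep every enabled distribution stochastic, and not a definition from the paper).

    def product_mdp(mdp_T, mdp_L, Delta, Gamma):
        """mdp_T[s][a] = {s2: prob}, mdp_L[(s, a, s2)] = letter; Delta and Gamma are sets of
        automaton transitions (q, letter, q2), with Gamma the accepting ones."""
        aut_states = {q for (q, _, _) in Delta} | {q2 for (_, _, q2) in Delta}
        prod_T = {}         # prod_T[(s, q)][(a, q2)] = {(s2, q2): prob}
        prod_Gamma = set()  # accepting product transitions ((s, q), (a, q2), (s2, q2))
        for s, actions in mdp_T.items():
            for q in aut_states:
                prod_T[(s, q)] = {}
                for a, succ in actions.items():
                    for q2 in aut_states:
                        # Enable (a, q2) only if q2 is a legal automaton move for every
                        # possible outcome letter L(s, a, s2) (a convention of this sketch).
                        if not all((q, mdp_L[(s, a, s2)], q2) in Delta for s2 in succ):
                            continue
                        prod_T[(s, q)][(a, q2)] = {(s2, q2): p for s2, p in succ.items()}
                        for s2, p in succ.items():
                            if p > 0 and (q, mdp_L[(s, a, s2)], q2) in Gamma:
                                prod_Gamma.add(((s, q), (a, q2), (s2, q2)))
        return prod_T, prod_Gamma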

Definition 1 (GFM automata [3])

An automaton $\mathcal{A}$ is good for MDPs if, for all MDPs $\mathcal{M}$, $\mathrm{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s_0) = \mathrm{PSem}^{\mathcal{M}}_{\mathcal{A}}(s_0)$ holds, where $s_0$ is the initial state of $\mathcal{M}$.

For an automaton to match $\mathrm{PSem}^{\mathcal{M}}_{\mathcal{A}}(s_0)$, its nondeterminism is restricted: it must not rely heavily on the future; rather, it must be possible to resolve the nondeterminism on-the-fly.

2 Undiscounted Reward Shaping

We build on the reduction from [2, 3] that reduces maximising the chance of realising an $\omega$-regular objective, given by a good-for-MDPs Büchi automaton $\mathcal{A}$, for an MDP $\mathcal{M}$ to maximising the chance of meeting the reachability objective in the augmented MDP $\mathcal{M}^\zeta$ (for $\zeta \in (0,1)$) obtained from $\mathcal{M} \times \mathcal{A}$ by

  • adding a new target state $t$ (either as a sink with a self-loop or as a point where the computation stops; we choose here the latter view) and

  • by making the target $t$ a destination of each accepting transition of $\mathcal{M} \times \mathcal{A}$ with probability $1-\zeta$ and
    multiplying the original probabilities of all other destinations of an accepting transition by $\zeta$.

Let $p^\zeta_\mu(s)$ denote the chance of reaching the target $t$ in $\mathcal{M}^\zeta$, and let $p^{\mathrm{B}}_\mu(s)$ denote the chance of satisfying the Büchi objective (taking transitions from $\Gamma^\times$ infinitely often) in $\mathcal{M} \times \mathcal{A}$, in both cases when starting from $s$ and following the strategy $\mu$.
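
In the same dictionary encoding as the product sketch above, the augmentation is a local transformation of each accepting transition (again our own sketch; the fresh state name TARGET is ours).

    TARGET = ('target', None)   # fresh target state t (the computation stops there)

    def augment(prod_T, prod_Gamma, zeta):
        """Redirect each accepting transition to TARGET with probability 1 - zeta and
        scale the probability of its original destination by zeta."""
        aug_T = {}
        for sq, actions in prod_T.items():
            aug_T[sq] = {}
            for aq, dist in actions.items():
                new_dist, to_target = {}, 0.0
                for sq2, p in dist.items():
                    if (sq, aq, sq2) in prod_Gamma:   # accepting transition
                        to_target += (1 - zeta) * p
                        new_dist[sq2] = new_dist.get(sq2, 0.0) + zeta * p
                    else:
                        new_dist[sq2] = new_dist.get(sq2, 0.0) + p
                if to_target > 0:
                    new_dist[TARGET] = to_target
                aug_T[sq][aq] = new_dist
        return aug_T   # reaching TARGET here replaces the Buechi objective on the product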

Theorem 2.1 ([2, 3])

The following holds:

  1. $\mathcal{M}^\zeta$ (for $\zeta \in (0,1)$) and $\mathcal{M} \times \mathcal{A}$ have the same set of strategies.

  2. For a strategy $\mu$, the chance of reaching the target in $\mathcal{M}^\zeta$ is $1$ if, and only if, the chance of satisfying the Büchi objective in $\mathcal{M} \times \mathcal{A}$ is $1$: $p^\zeta_\mu(s) = 1 \Leftrightarrow p^{\mathrm{B}}_\mu(s) = 1$.

  3. There is a $\zeta_0 \in (0,1)$ such that, for all $\zeta \in (\zeta_0, 1)$, an optimal reachability strategy for $\mathcal{M}^\zeta$ is an optimal strategy for satisfying the Büchi objective in $\mathcal{M} \times \mathcal{A}$:

    $\operatorname{arg\,max}_{\mu} p^\zeta_\mu(s) \subseteq \operatorname{arg\,max}_{\mu} p^{\mathrm{B}}_\mu(s)$.

This allows for analysing the much simpler reachability objective in $\mathcal{M}^\zeta$ instead of the Büchi objective in $\mathcal{M} \times \mathcal{A}$, and it is open to implementation in model-free reinforcement learning.

However, it has the drawback that rewards occur late when $\zeta$ is close to $1$. We address this with the following observation.

We build, for a good-for-MDPs Büchi automaton $\mathcal{A}$ and an MDP $\mathcal{M}$, the augmented MDP $\mathcal{M}^\zeta_r$ (for $\zeta \in (0,1)$) obtained from $\mathcal{M} \times \mathcal{A}$ in the same way as $\mathcal{M}^\zeta$, i.e. by

  • adding a new sink state $t$ (as a sink where the computation stops) and

  • by making the sink $t$ a destination of each accepting transition of $\mathcal{M} \times \mathcal{A}$ with probability $1-\zeta$ and
    multiplying the original probabilities of all other destinations of an accepting transition by $\zeta$.

Different to $\mathcal{M}^\zeta$, $\mathcal{M}^\zeta_r$ has an undiscounted reward objective, where taking an accepting (in $\mathcal{M} \times \mathcal{A}$) transition provides a reward of $1$, regardless of whether it leads to the sink $t$ or stays in the state space of $\mathcal{M} \times \mathcal{A}$.

Let, for a run $r$ of $\mathcal{M}^\zeta_r$ that contains $n$ accepting transitions, the total reward be $R(r) = n$, and let $R^\zeta_\mu(s) = \mathbb{E}^{\mathcal{M}^\zeta_r}_{\mu}(s)\{R\}$ denote the expected total reward when starting from $s$ and following the strategy $\mu$.

Note that the set of runs with $R(r) = \infty$ has probability $0$ in $\mathcal{M}^\zeta_r$: they are the runs that infinitely often do not move to $t$ on an accepting transition, where the chance that this happens at least $k$ times is $\zeta^k$ for all $k \in \mathbb{N}$.
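
To make the construction concrete, the following toy simulation (entirely our own example, with hand-picked parameters) draws episodes of $\mathcal{M}^\zeta_r$: accepting transitions pay reward $1$ and divert to the sink with probability $1-\zeta$, so the total reward is almost surely finite. Since $\mathcal{M}^\zeta$ and $\mathcal{M}^\zeta_r$ share their transition probabilities, one simulation also estimates the reaching probability, illustrating the relation that Theorem 2.2 below makes precise.

    import random

    zeta = 0.99
    # Toy product MDP under a fixed strategy: from 'x' an accepting self-loop is taken
    # with probability 0.5, otherwise the run falls into a rejecting sink 'z'.
    step = {
        'x': [(0.5, 'x', True), (0.5, 'z', False)],   # (probability, successor, accepting?)
        'z': [(1.0, 'z', False)],
    }

    def episode(max_len=10_000):
        s, reward, reached = 'x', 0, False
        for _ in range(max_len):
            _, s2, accepting = random.choices(step[s], weights=[p for p, _, _ in step[s]])[0]
            if accepting:
                reward += 1                      # reward 1 for the accepting transition
                if random.random() < 1 - zeta:   # the gadget diverts to the sink t
                    reached = True
                    break
            s = s2
            if s == 'z':
                break                            # rejecting sink: nothing more can happen
        return reward, reached

    runs = [episode() for _ in range(100_000)]
    mean_reward = sum(r for r, _ in runs) / len(runs)
    reach_freq = sum(hit for _, hit in runs) / len(runs)
    exact = 0.5 * (1 - zeta) / (1 - 0.5 * zeta)      # reaching probability, computed by hand
    print((1 - zeta) * mean_reward, reach_freq, exact)   # all three should be close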

Theorem 2.2

The following holds:

  1. $\mathcal{M}^\zeta$ (for $\zeta \in (0,1)$), $\mathcal{M}^\zeta_r$ (for $\zeta \in (0,1)$), and $\mathcal{M} \times \mathcal{A}$ have the same set of strategies.

  2. For a strategy $\mu$, the expected reward for $\mathcal{M}^\zeta_r$ is $\frac{p}{1-\zeta}$ if, and only if, the chance of reaching the target in $\mathcal{M}^\zeta$ is $p$:

    $(1-\zeta) \cdot R^\zeta_\mu(s) = p^\zeta_\mu(s)$.

  3. The expected reward for $\mathcal{M}^\zeta_r$ is in $\big[0, \frac{1}{1-\zeta}\big]$.

  4. The chance of satisfying the Büchi objective in $\mathcal{M} \times \mathcal{A}$ is $1$ if, and only if, the expected reward for $\mathcal{M}^\zeta_r$ is $\frac{1}{1-\zeta}$.

  5. There is a $\zeta_0 \in (0,1)$ such that, for all $\zeta \in (\zeta_0, 1)$, a strategy that maximises the reward for $\mathcal{M}^\zeta_r$ is an optimal strategy for satisfying the Büchi objective in $\mathcal{M} \times \mathcal{A}$.

Proof

(1) Obvious, because all the states and their actions are the same, apart from the sink state $t$, for which the strategy can be left undefined.

(2) The sink state $t$ can only be visited once along any run, so the expected number of times a run starting at $s$ is going to visit $t$ while using strategy $\mu$ is the same as its probability of visiting $t$, i.e., $p^\zeta_\mu(s)$. The only way $t$ can be reached is by traversing an accepting transition, and this always happens with the same probability $1-\zeta$. So the expected number of visits to $t$ is the expected number of times an accepting transition is used, i.e., $R^\zeta_\mu(s)$, multiplied by $1-\zeta$; the calculation displayed after this proof spells this out.

(3) follows from (2), because $p^\zeta_\mu(s)$ cannot be greater than $1$.

(4) follows from (2) and Theorem 2.1 (2).

(5) follows from (2) and Theorem 2.1 (3).
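
The calculation behind step (2), spelled out in our own notation with $N$ denoting the number of accepting transitions a run of $\mathcal{M}^\zeta_r$ takes (so that its total reward is $N$), and using that $\mathcal{M}^\zeta$ and $\mathcal{M}^\zeta_r$ have the same transition probabilities:

    $p^\zeta_\mu(s) = \sum_{i \geq 1} \Pr^{\mathcal{M}^\zeta_r}_{\mu}(s)\{\text{the run moves to } t \text{ at its } i\text{-th accepting transition}\}$
    $\phantom{p^\zeta_\mu(s)} = \sum_{i \geq 1} (1-\zeta) \cdot \Pr^{\mathcal{M}^\zeta_r}_{\mu}(s)\{N \geq i\} = (1-\zeta) \cdot \mathbb{E}^{\mathcal{M}^\zeta_r}_{\mu}(s)\{N\} = (1-\zeta) \cdot R^\zeta_\mu(s).$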

3 Discounted Reward Shaping

The expected undiscounted reward for $\mathcal{M}^\zeta_r$ can be viewed as a discounted reward for $\mathcal{M} \times \mathcal{A}$, obtained by giving a reward of $\zeta^i$ when passing through an accepting transition when $i$ accepting transitions have been passed before. We call this reward $\zeta$-biased.

Let, for a run $r$ of $\mathcal{M} \times \mathcal{A}$ that contains $n \in \mathbb{N} \cup \{\infty\}$ accepting transitions, the $\zeta$-biased discounted reward be $B(r) = \sum_{i=0}^{n-1} \zeta^i = \frac{1-\zeta^n}{1-\zeta}$ (with $\zeta^\infty = 0$), and let $B^\zeta_\mu(s) = \mathbb{E}^{\mathcal{M}\times\mathcal{A}}_{\mu}(s)\{B\}$ denote its expectation when starting from $s$ and following the strategy $\mu$.
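
For a finite run prefix this is exactly the quantity a learner would accumulate. A small helper (our own illustration, with invented names) computes it from a list of flags marking the accepting transitions and checks it against the closed form:

    def zeta_biased_return(accepting_flags, zeta):
        total, seen = 0.0, 0                  # seen = accepting transitions passed so far
        for accepting in accepting_flags:
            if accepting:
                total += zeta ** seen         # reward zeta^i for the (i+1)-st accepting transition
                seen += 1
        return total

    flags = [False, True, False, True, True, False]
    n, zeta = sum(flags), 0.99
    assert abs(zeta_biased_return(flags, zeta) - (1 - zeta ** n) / (1 - zeta)) < 1e-12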

Theorem 3.1

For every strategy $\mu$, the expected reward for $\mathcal{M}^\zeta_r$ is equal to the expected $\zeta$-biased reward for $\mathcal{M} \times \mathcal{A}$: $R^\zeta_\mu(s) = B^\zeta_\mu(s)$.

This is simply because the discounted reward $\zeta^i$ for each accepting transition is equal to the chance of not having reached $t$ before (and thus of still seeing this transition) in $\mathcal{M}^\zeta_r$.

This improves over [1] because it only uses one discount parameter, $\zeta$, instead of the two mutually dependent discount parameters used in [1]. It is also simpler and more intuitive: discount whenever you have earned a reward.
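
In model-free terms this is a one-line change to tabular Q-learning: reward $1$ and discount factor $\zeta$ are applied exactly on accepting transitions, while all other transitions pass the value through undiscounted. The following sketch uses our own toy environment and hyper-parameters, not an implementation or experiment from the paper, and the usual convergence caveats for undiscounted steps apply; it only illustrates the update rule.

    import random

    # Toy product MDP: env[s][a] = list of (probability, successor, accepting?).
    env = {
        'x': {'stay': [(1.0, 'x', True)],                       # accepting self-loop
              'risk': [(0.5, 'x', False), (0.5, 'z', False)]},  # may fall into a rejecting sink
        'z': {'stay': [(1.0, 'z', False)],
              'risk': [(1.0, 'z', False)]},
    }

    zeta, alpha, epsilon = 0.99, 0.1, 0.1
    Q = {(s, a): 0.0 for s in env for a in env[s]}

    def step(s, a):
        probs = [p for p, _, _ in env[s][a]]
        _, s2, accepting = random.choices(env[s][a], weights=probs)[0]
        return s2, accepting

    for _ in range(2_000):                  # short episodes with resets keep the sketch simple
        s = 'x'
        for _ in range(100):
            actions = list(env[s])
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda b: Q[(s, b)]))
            s2, accepting = step(s, a)
            reward, discount = (1.0, zeta) if accepting else (0.0, 1.0)  # the zeta-biased rule
            Q[(s, a)] += alpha * (reward + discount * max(Q[(s2, b)] for b in env[s2]) - Q[(s, a)])
            s = s2

    print({k: round(v, 1) for k, v in Q.items()})   # Q[('x', 'stay')] approaches 1 / (1 - zeta)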

References