1 Preliminaries
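A nondeterministic Büchi automaton is a tuple $\mathcal{A} = \langle \Sigma, Q, q_0, \Delta, \Gamma \rangle$, where $\Sigma$ is a finite alphabet, $Q$ is a finite set of states, $q_0 \in Q$ is the initial state, $\Delta \subseteq Q \times \Sigma \times Q$ are transitions, and $\Gamma \subseteq Q \times \Sigma \times Q$ is the transition-based acceptance condition.
A run $r$ of $\mathcal{A}$ on $w \in \Sigma^\omega$ is an $\omega$-word $r_0, w_0, r_1, w_1, \ldots$ in $(Q \times \Sigma)^\omega$ such that $r_0 = q_0$ and, for $i > 0$, $(r_{i-1}, w_{i-1}, r_i) \in \Delta$. We write $\inf(r)$ for the set of transitions that appear infinitely often in the run $r$. A run $r$ of $\mathcal{A}$ is accepting if $\inf(r) \cap \Gamma \neq \emptyset$.
The language, $L_{\mathcal{A}}$, of $\mathcal{A}$ (or, recognized by $\mathcal{A}$) is the subset of words in $\Sigma^\omega$ that have accepting runs in $\mathcal{A}$. A language is $\omega$-regular if it is accepted by a Büchi automaton. An automaton $\mathcal{A} = \langle \Sigma, Q, q_0, \Delta, \Gamma \rangle$ is deterministic if $(q, \sigma, q'), (q, \sigma, q'') \in \Delta$ implies $q' = q''$. $\mathcal{A}$ is complete if, for all $\sigma \in \Sigma$ and all $q \in Q$, there is a transition $(q, \sigma, q') \in \Delta$. A word in $\Sigma^\omega$ has exactly one run in a deterministic, complete automaton.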
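The definitions of determinism, completeness, and transition-based acceptance can be made concrete with a small sketch. The encoding (transitions as Python triples), the helper names, and the one-state automaton for "infinitely many a" are assumptions chosen for illustration; the acceptance check only handles ultimately periodic words (a finite prefix followed by a repeated cycle) on a deterministic, complete automaton.

```python
from itertools import product


def is_deterministic(delta):
    """True if no (state, letter) pair has two different successors."""
    successor = {}
    for q, sigma, q2 in delta:
        if successor.setdefault((q, sigma), q2) != q2:
            return False
    return True


def is_complete(states, alphabet, delta):
    """True if every (state, letter) pair has at least one transition."""
    enabled = {(q, sigma) for q, sigma, _ in delta}
    return all(pair in enabled for pair in product(states, alphabet))


def accepts_lasso(q0, delta, gamma, prefix, cycle):
    """Acceptance of the word prefix . cycle^omega.

    Assumes delta is deterministic and complete, so the run is unique.
    """
    step = {(q, sigma): q2 for q, sigma, q2 in delta}
    q = q0
    for sigma in prefix:
        q = step[(q, sigma)]
    first_pass = {}   # state at the start of a cycle pass -> index of that pass
    saw_accepting = []  # per pass: did it use a transition from gamma?
    while q not in first_pass:
        first_pass[q] = len(saw_accepting)
        hit = False
        for sigma in cycle:
            q2 = step[(q, sigma)]
            hit = hit or (q, sigma, q2) in gamma
            q = q2
        saw_accepting.append(hit)
    # The passes from first_pass[q] onwards repeat forever.
    return any(saw_accepting[first_pass[q]:])


# Hypothetical automaton over {a, b} accepting words with infinitely many a.
STATES = {0}
ALPHABET = {"a", "b"}
DELTA = {(0, "a", 0), (0, "b", 0)}
GAMMA = {(0, "a", 0)}
```

With these definitions, `accepts_lasso(0, DELTA, GAMMA, "b", "ab")` checks the word b(ab)^ω, and `accepts_lasso(0, DELTA, GAMMA, "a", "b")` checks ab^ω.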
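A Markov decision process (MDP) $\mathcal{M}$ is a tuple $(S, A, T, \Sigma, L)$ where $S$ is a finite set of states, $A$ is a finite set of actions, $T : S \times A \rightharpoonup \mathcal{D}(S)$, where $\mathcal{D}(S)$ is the set of probability distributions over $S$, is the probabilistic transition function, $\Sigma$ is an alphabet, and $L : S \times A \times S \to \Sigma$ is the labelling function of the set of transitions. For a state $s \in S$, $A(s)$ denotes the set of actions available in $s$. For states $s, s' \in S$ and $a \in A(s)$, we have that $T(s, a)(s')$ equals $\Pr(s' \mid s, a)$.
A run of $\mathcal{M}$ is an $\omega$-word $s_0, a_1, s_1, a_2, \ldots \in S \times (A \times S)^\omega$ such that $\Pr(s_{i+1} \mid s_i, a_{i+1}) > 0$ for all $i \geq 0$. A finite run is a finite such sequence. For a run $r = s_0, a_1, s_1, \ldots$ we define the corresponding labelled run as $L(r) = L(s_0, a_1, s_1), L(s_1, a_2, s_2), \ldots \in \Sigma^\omega$. We write $Runs(\mathcal{M})$ ($FRuns(\mathcal{M})$) for the set of runs (finite runs) of $\mathcal{M}$ and $Runs_s(\mathcal{M})$ ($FRuns_s(\mathcal{M})$) for the set of runs (finite runs) of $\mathcal{M}$ starting from state $s$. When the MDP is clear from the context we drop the argument $\mathcal{M}$.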
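A strategy in $\mathcal{M}$ is a function $\mu : FRuns \to \mathcal{D}(A)$ such that, for all finite runs $r$, we have $supp(\mu(r)) \subseteq A(last(r))$, where $supp(d)$ is the support of $d$ and $last(r)$ is the last state of $r$. Let $Runs^\mu_s(\mathcal{M})$ denote the subset of runs $Runs_s(\mathcal{M})$ that correspond to strategy $\mu$ and initial state $s$. Let $\Sigma_{\mathcal{M}}$ be the set of all strategies. We say that a strategy $\mu$ is pure if $\mu(r)$ is a point distribution for all runs $r \in FRuns$ and we say that $\mu$ is positional if $last(r) = last(r')$ implies $\mu(r) = \mu(r')$ for all runs $r, r' \in FRuns$.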
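The behaviour of an MDP $\mathcal{M}$ under a strategy $\mu$ with starting state $s$ is defined on a probability space $(Runs^\mu_s, \mathcal{F}^\mu_s, \Pr^\mu_s)$ over the set of infinite runs of $\mathcal{M}$ under $\mu$ from $s$. Given a random variable $f : Runs \to \mathbb{R}$ over the set of infinite runs, we write $\mathbb{E}^\mu_s\{f\}$ for the expectation of $f$ over the runs of $\mathcal{M}$ from state $s$ that follow strategy $\mu$.
Given an MDP $\mathcal{M}$ and an automaton $\mathcal{A} = \langle \Sigma, Q, q_0, \Delta, \Gamma \rangle$, we want to compute an optimal strategy satisfying the objective that the run of $\mathcal{M}$ is in the language of $\mathcal{A}$. We define the semantic satisfaction probability for $\mathcal{A}$ and a strategy $\mu$ from state $s$ as:
$$\mathsf{PSem}^{\mathcal{M}}_{\mathcal{A}}(s, \mu) = \Pr^\mu_s\{ r \in Runs^\mu_s : L(r) \in L_{\mathcal{A}} \} \quad\text{and}\quad \mathsf{PSem}^{\mathcal{M}}_{\mathcal{A}}(s) = \sup_\mu \mathsf{PSem}^{\mathcal{M}}_{\mathcal{A}}(s, \mu).$$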
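When using automata for the analysis of MDPs, we need a syntactic variant of the acceptance condition. Given an MDP $\mathcal{M} = (S, A, T, \Sigma, L)$ with initial state $s_0 \in S$ and an automaton $\mathcal{A} = \langle \Sigma, Q, q_0, \Delta, \Gamma \rangle$, the product $\mathcal{M} \times \mathcal{A} = (S \times Q, A \times Q, T^\times, \Sigma, L^\times)$ is an MDP augmented with an initial state $(s_0, q_0)$ and accepting transitions $\Gamma^\times$. The function $T^\times : (S \times Q) \times (A \times Q) \rightharpoonup \mathcal{D}(S \times Q)$ is defined by
$$T^\times((s, q), (a, q'))((s', q')) = \begin{cases} T(s, a)(s') & \text{if } (q, L(s, a, s'), q') \in \Delta, \\ 0 & \text{otherwise.} \end{cases}$$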
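Finally, $\Gamma^\times \subseteq (S \times Q) \times (A \times Q) \times (S \times Q)$ is defined by $((s, q), (a, q'), (s', q')) \in \Gamma^\times$ if, and only if, $(q, L(s, a, s'), q') \in \Gamma$ and $T(s, a)(s') > 0$. A strategy $\mu$ on the MDP defines a strategy $\mu^\times$ on the product, and vice versa. We define the syntactic satisfaction probabilities as
$$\mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}((s, q), \mu^\times) = \Pr^{\mu^\times}_{(s,q)}\{ r \in Runs^{\mu^\times}_{(s,q)}(\mathcal{M} \times \mathcal{A}) : \inf(r) \cap \Gamma^\times \neq \emptyset \} \quad\text{and}\quad \mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s) = \sup_{\mu^\times} \mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}((s, q_0), \mu^\times).$$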
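Note that $\mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s) = \mathsf{PSem}^{\mathcal{M}}_{\mathcal{A}}(s)$ holds for a deterministic $\mathcal{A}$. In general, $\mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s) \leq \mathsf{PSem}^{\mathcal{M}}_{\mathcal{A}}(s)$ holds, but equality is not guaranteed because the optimal resolution of nondeterministic choices may require access to future events.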
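The product construction can be sketched in a few lines. This is a minimal illustration under assumed encodings (dictionaries for the transition and labelling functions, triples for automaton transitions); the function name and the two-state example MDP are hypothetical. A product action pairs an MDP action with the automaton successor chosen in advance, and an MDP successor is kept only if the automaton has a matching transition for the letter of that step.

```python
def product_mdp(T, L, automaton_states, delta, gamma):
    """Build the product transition function and accepting transitions.

    T:     dict (s, a) -> dict s' -> probability
    L:     dict (s, a, s') -> letter
    delta: set of automaton transitions (q, letter, q')
    gamma: set of accepting automaton transitions (subset of delta)
    """
    T_x, gamma_x = {}, set()
    for (s, a), dist in T.items():
        for q in automaton_states:
            for q2 in automaton_states:
                successors = {}
                for s2, p in dist.items():
                    letter = L[(s, a, s2)]
                    if p > 0 and (q, letter, q2) in delta:
                        successors[(s2, q2)] = p
                        if (q, letter, q2) in gamma:
                            gamma_x.add(((s, q), (a, q2), (s2, q2)))
                if successors:  # keep only actions with some successor
                    T_x[((s, q), (a, q2))] = successors
    return T_x, gamma_x


# Hypothetical MDP: from u, "go" stays in u or moves to v; from v it returns.
T = {("u", "go"): {"u": 0.5, "v": 0.5}, ("v", "go"): {"u": 1.0}}
L = {("u", "go", "u"): "b", ("u", "go", "v"): "b", ("v", "go", "u"): "a"}

# Hypothetical one-state automaton accepting "infinitely many a".
DELTA = {(0, "a", 0), (0, "b", 0)}
GAMMA = {(0, "a", 0)}

T_X, GAMMA_X = product_mdp(T, L, {0}, DELTA, GAMMA)
```

Here `GAMMA_X` contains exactly the product transition that reads the letter a, namely the step from `("v", 0)` back to `("u", 0)`.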
Definition 1 (GFM automata [3])
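An automaton $\mathcal{A}$ is good for MDPs if, for all MDPs $\mathcal{M}$, $\mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s_0) = \mathsf{PSem}^{\mathcal{M}}_{\mathcal{A}}(s_0)$ holds, where $s_0$ is the initial state of $\mathcal{M}$.
For an automaton to match $\mathsf{PSem}^{\mathcal{M}}_{\mathcal{A}}(s_0)$, its nondeterminism must not rely heavily on the future; rather, it must be possible to resolve the nondeterminism on-the-fly.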
2 Undiscounted Reward Shaping
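We build on the reduction from [2, 3] that reduces maximising the chance to realise an $\omega$-regular objective given by a good-for-MDPs Büchi automaton $\mathcal{A}$ for an MDP $\mathcal{M}$ to maximising the chance to meet the reachability objective in the augmented MDP $\mathcal{M}^\zeta$ (for $\zeta \in ]0,1[$) obtained from $\mathcal{M} \times \mathcal{A}$ by
(i) adding a new target state $t$ (either as a sink with a self-loop or as a point where the computation stops; we choose here the latter view) and
(ii) making the target $t$ a destination of each accepting transition $\tau$ of $\mathcal{M} \times \mathcal{A}$ with probability $1 - \zeta$ and multiplying the original probabilities of all other destinations of an accepting transition $\tau$ by $\zeta$.
Let
$$\mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}((s, q), \mu) = \Pr^\mu_{(s,q)}\{ r \in Runs^\mu_{(s,q)}(\mathcal{M}^\zeta) : r \text{ reaches } t \} \quad\text{and}\quad \mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}(s) = \sup_\mu \mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}((s, q_0), \mu).$$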
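The augmentation step can be sketched directly on a product MDP. The encoding below (a dictionary from state-action pairs to successor distributions, accepting transitions as triples), the function name, and the two-state example are assumptions for illustration only: each accepting transition keeps its original destination with its probability scaled by ζ and sends the remaining 1 − ζ share to the new target.

```python
def augment_with_target(T_x, gamma_x, zeta, target="t"):
    """Redirect a (1 - zeta) share of every accepting transition to target.

    T_x:     dict (state, action) -> dict successor -> probability
    gamma_x: set of accepting transitions (state, action, successor)
    """
    augmented = {}
    for (state, action), dist in T_x.items():
        new_dist = {}
        for succ, p in dist.items():
            if (state, action, succ) in gamma_x:
                # Accepting: scale the original destination by zeta ...
                new_dist[succ] = new_dist.get(succ, 0.0) + zeta * p
                # ... and move the rest of the mass to the target.
                new_dist[target] = new_dist.get(target, 0.0) + (1.0 - zeta) * p
            else:
                new_dist[succ] = new_dist.get(succ, 0.0) + p
        augmented[(state, action)] = new_dist
    return augmented


# Hypothetical product MDP with one accepting transition, from y back to x.
T_X = {("x", "go"): {"x": 0.5, "y": 0.5}, ("y", "go"): {"x": 1.0}}
GAMMA_X = {("y", "go", "x")}

M_ZETA = augment_with_target(T_X, GAMMA_X, zeta=0.9)
```

The target has no outgoing actions in this encoding, which matches the view that the computation stops there.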
Theorem 2.1 ([2, 3])
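The following holds:
(1) $\mathcal{M}^\zeta$ (for $\zeta \in ]0,1[$) and $\mathcal{M} \times \mathcal{A}$ have the same set of strategies.
(2) For a strategy $\mu$, the chance of reaching the target $t$ in $\mathcal{M}^\zeta_\mu$ is $1$ if, and only if, the chance of satisfying the Büchi objective in $(\mathcal{M} \times \mathcal{A})_\mu$ is $1$:
$\mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}((s_0, q_0), \mu) = 1 \Leftrightarrow \mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}((s_0, q_0), \mu) = 1$.
(3) There is a $\zeta_0 \in ]0,1[$ such that, for all $\zeta \in [\zeta_0, 1[$, an optimal reachability strategy $\mu$ for $\mathcal{M}^\zeta$ is an optimal strategy for satisfying the Büchi objective in $\mathcal{M} \times \mathcal{A}$:
$\mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}((s_0, q_0), \mu) = \mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}(s_0) \Rightarrow \mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}((s_0, q_0), \mu) = \mathsf{PSyn}^{\mathcal{M}}_{\mathcal{A}}(s_0)$.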
This allows for analysing the much simpler reachability objective in $\mathcal{M}^\zeta$ instead of the Büchi objective in $\mathcal{M} \times \mathcal{A}$, and is open to implementation in model-free reinforcement learning.
However, it has the drawback that rewards occur late when $\zeta$ is close to $1$. We amend that by the following observation:
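We build, for a good-for-MDPs Büchi automaton $\mathcal{A}$ and an MDP $\mathcal{M}$, the augmented MDP $\overline{\mathcal{M}}^\zeta$ (for $\zeta \in ]0,1[$) obtained from $\mathcal{M} \times \mathcal{A}$ in the same way as $\mathcal{M}^\zeta$, i.e. by
(i) adding a new sink state $t$ (as a sink where the computation stops) and
(ii) making the sink $t$ a destination of each accepting transition $\tau$ of $\mathcal{M} \times \mathcal{A}$ with probability $1 - \zeta$ and multiplying the original probabilities of all other destinations of an accepting transition $\tau$ by $\zeta$.
In contrast to $\mathcal{M}^\zeta$, $\overline{\mathcal{M}}^\zeta$ has an undiscounted reward objective, where taking an accepting (in $\mathcal{M} \times \mathcal{A}$) transition $\tau$ provides a reward of $1$, regardless of whether it leads to the sink $t$ or stays in the state space of $\mathcal{M} \times \mathcal{A}$.
Let, for a run $r$ of $\overline{\mathcal{M}}^\zeta$ that contains $n \in \mathbb{N}_0 \cup \{\infty\}$ accepting transitions, the total reward be $\mathsf{Total}(r) = n$, and let
$$\mathsf{ETotal}^{\overline{\mathcal{M}}^\zeta}((s, q), \mu) = \mathbb{E}^\mu_{(s,q)}\{\mathsf{Total}(r)\}.$$
Note that the set of runs with $\mathsf{Total}(r) = \infty$ has probability $0$ in $\overline{\mathcal{M}}^\zeta$: they are the runs that infinitely often do not move to $t$ on an accepting transition, where the chance that this happens at least $n$ times is at most $\zeta^n$ for all $n \in \mathbb{N}_0$.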
Theorem 2.2
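The following holds:
(1) $\overline{\mathcal{M}}^\zeta$ (for $\zeta \in ]0,1[$), $\mathcal{M}^\zeta$ (for $\zeta \in ]0,1[$), and $\mathcal{M} \times \mathcal{A}$ have the same set of strategies.
(2) For a strategy $\mu$, the expected reward for $\overline{\mathcal{M}}^\zeta_\mu$ is $r$ if, and only if, the chance of reaching the target $t$ in $\mathcal{M}^\zeta_\mu$ is $(1 - \zeta)\, r$:
$\mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}((s_0, q_0), \mu) = (1 - \zeta) \cdot \mathsf{ETotal}^{\overline{\mathcal{M}}^\zeta}((s_0, q_0), \mu)$.
(3) The expected reward for $\overline{\mathcal{M}}^\zeta_\mu$ is in $[0, \frac{1}{1-\zeta}]$.
(4) The chance of satisfying the Büchi objective in $(\mathcal{M} \times \mathcal{A})_\mu$ is $1$ if, and only if, the expected reward for $\overline{\mathcal{M}}^\zeta_\mu$ is $\frac{1}{1-\zeta}$.
(5) There is a $\zeta_0 \in ]0,1[$ such that, for all $\zeta \in [\zeta_0, 1[$, a strategy $\mu$ that maximises the reward for $\overline{\mathcal{M}}^\zeta$ is an optimal strategy for satisfying the Büchi objective in $\mathcal{M} \times \mathcal{A}$.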
Proof
(1) Obvious, because all the states and their actions are the same apart from the sink state $t$, for which the strategy can be left undefined.
(2) The sink state $t$ can only be visited once along any run, so the expected number of times a run starting at $(s_0, q_0)$ is going to visit $t$ while using strategy $\mu$ is the same as its probability of visiting $t$, i.e., $\mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}((s_0, q_0), \mu)$. The only way $t$ can be reached is by traversing an accepting transition and this always happens with the same probability $1 - \zeta$. So the expected number of visits to $t$ is the expected number of times an accepting transition is used, i.e., $\mathsf{ETotal}^{\overline{\mathcal{M}}^\zeta}((s_0, q_0), \mu)$, multiplied by $1 - \zeta$.
(3) follows from (2), because $\mathsf{PSyn}^{\mathcal{M}^\zeta}_{t}((s_0, q_0), \mu)$ cannot be greater than $1$.
(4) follows from (2) and Theorem 2.1 (2).
(5) follows from (2) and Theorem 2.1 (3).
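The relation in (2) can be checked numerically on a small example. The sketch below fixes a strategy, so the product is a Markov chain; the three-state chain, the function name, and the use of plain fixed-point iteration are assumptions made for illustration. It computes, for each state, the probability of reaching the sink and the expected number of accepting transitions taken, where an accepting transition continues with probability ζ and stops with probability 1 − ζ.

```python
def evaluate(chain, accepting, zeta, sweeps=5000):
    """Iterate the fixed-point equations for both quantities from zero.

    chain:     dict state -> dict successor -> probability
    accepting: set of accepting transitions (state, successor)
    Returns (reach, total): chance of hitting the sink, and expected
    number of accepting transitions taken before the run stops.
    """
    reach = {s: 0.0 for s in chain}
    total = {s: 0.0 for s in chain}
    for _ in range(sweeps):
        new_reach, new_total = {}, {}
        for s, dist in chain.items():
            r = e = 0.0
            for s2, p in dist.items():
                if (s, s2) in accepting:
                    # Stop in the sink with 1 - zeta, else continue from s2.
                    r += p * ((1.0 - zeta) + zeta * reach[s2])
                    # Reward 1 is collected whether or not the run stops.
                    e += p * (1.0 + zeta * total[s2])
                else:
                    r += p * reach[s2]
                    e += p * total[s2]
            new_reach[s], new_total[s] = r, e
        reach, total = new_reach, new_total
    return reach, total


# Hypothetical chain: x loops, moves to y, or falls into the rejecting trap z;
# the only accepting transition leads from y back to x.
CHAIN = {
    "x": {"x": 0.25, "y": 0.5, "z": 0.25},
    "y": {"x": 1.0},
    "z": {"z": 1.0},
}
ACCEPTING = {("y", "x")}
ZETA = 0.9

REACH, TOTAL = evaluate(CHAIN, ACCEPTING, ZETA)
```

For every state `s` of this chain, `REACH[s]` agrees with `(1 - ZETA) * TOTAL[s]` up to numerical error, and `TOTAL[s]` stays below `1 / (1 - ZETA)`, in line with (2) and (3).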
3 Discounted Reward Shaping
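The expected undiscounted reward for $\overline{\mathcal{M}}^\zeta_\mu$ can be viewed as a discounted reward for $(\mathcal{M} \times \mathcal{A})_\mu$, by giving a reward $\zeta^i$ when passing through an accepting transition when $i$ accepting transitions have been passed before. We call this reward $\zeta$-biased.
Let, for a run $r$ of $\mathcal{M} \times \mathcal{A}$ that contains $n \in \mathbb{N}_0 \cup \{\infty\}$ accepting transitions, the $\zeta$-biased discounted reward be $\mathsf{Disct}_\zeta(r) = \sum_{i=0}^{n-1} \zeta^i$, and let
$$\mathsf{EDisct}^{\mathcal{M} \times \mathcal{A}}_\zeta((s, q), \mu) = \mathbb{E}^\mu_{(s,q)}\{\mathsf{Disct}_\zeta(r)\}.$$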
Theorem 3.1
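For every strategy $\mu$, the expected reward for $\overline{\mathcal{M}}^\zeta_\mu$ is equal to the expected $\zeta$-biased reward for $(\mathcal{M} \times \mathcal{A})_\mu$: $\mathsf{EDisct}^{\mathcal{M} \times \mathcal{A}}_\zeta((s_0, q_0), \mu) = \mathsf{ETotal}^{\overline{\mathcal{M}}^\zeta}((s_0, q_0), \mu)$.
This is simply because the discounted reward for each transition is equal to the chance of not having reached $t$ before (and thus still seeing this transition) in $\overline{\mathcal{M}}^\zeta_\mu$.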
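The argument can be spelled out as a short calculation. The following is a sketch rather than a full proof: it assumes that a run of the product under a fixed strategy contains $n$ accepting transitions (possibly infinitely many, with $\zeta^\infty = 0$) and that, in the sink-augmented MDP, each accepting transition independently moves to the sink with probability $1 - \zeta$. The worded events are shorthand introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}\bigl[\text{total reward in the sink-augmented MDP}\bigr]
  &= \sum_{i \ge 0} \Pr\bigl[\text{the $(i{+}1)$-th accepting transition is taken
     in the sink-augmented MDP}\bigr] \\
  &= \sum_{i \ge 0} \zeta^{i} \cdot \Pr\bigl[\text{the run of the product has
     more than $i$ accepting transitions}\bigr] \\
  &= \mathbb{E}\Bigl[\,\sum_{i=0}^{n-1} \zeta^{i}\Bigr]
   = \mathbb{E}\Bigl[\frac{1 - \zeta^{n}}{1 - \zeta}\Bigr].
\end{align*}
```

The second line uses that the $(i{+}1)$-th accepting transition is still seen exactly when none of the first $i$ accepting transitions moved to the sink, which has probability $\zeta^i$. The closed form in the last line also shows that the value lies between $0$ and $\frac{1}{1-\zeta}$.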
References
 [1] A. K. Bozkurt, Y. Wang, M. M. Zavlanos, and M. Pajic. Control synthesis from linear temporal logic specifications using model-free reinforcement learning. CoRR, abs/1909.07299, 2019.
 [2] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak. Omega-regular objectives in model-free reinforcement learning. In Tools and Algorithms for the Construction and Analysis of Systems, pages 395–412, 2019. LNCS 11427.
 [3] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak. Good-for-MDPs automata for probabilistic analysis and reinforcement learning. In Tools and Algorithms for the Construction and Analysis of Systems, to appear, 2020.