# Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can be achieved depending on how much memory one is willing to use. (i) For all ϵ and γ we can construct an online-learning finite-memory strategy that almost-surely satisfies the parity objective and which achieves an ϵ-optimal mean payoff with probability at least 1 - γ. (ii) Alternatively, for all ϵ and γ there exists an online-learning infinite-memory strategy that satisfies the parity objective surely and which achieves an ϵ-optimal mean payoff with probability at least 1 - γ. We extend the above results to MDPs consisting of more than one end component in a natural way. Finally, we show that the aforementioned guarantees are tight, i.e. there are MDPs for which stronger combinations of the guarantees cannot be ensured.

## 1 Introduction

##### Reactive synthesis and online reinforcement learning.

Reactive systems are systems that maintain a continuous interaction with the environment in which they operate. When designing such systems, we usually face two partially conflicting objectives. First, to ensure a safe execution, we want some basic and critical properties to be enforced by the system no matter how the environment behaves. Second, we want the reactive system to be as efficient as possible given the actual observed behaviour of the environment in which the system is executed. As an illustration, let us consider a robot that needs to explore an unknown environment as fast and as efficiently as possible while avoiding any collision with other objects, humans or robots in the environment. While operating the robot at low speed makes it easier to avoid collisions, it will impair our ability to explore the environment quickly even if this environment is free of other moving objects.

There has been, in the past, a large research effort to define mathematical models and algorithms in order to address the two objectives above, but in isolation only. To design safe control strategies, two-player zero-sum games with omega-regular objectives have been proposed [31, 4], while online-reinforcement-learning (RL, for short) algorithms for partially-specified Markov decision processes (MDPs) have been proposed (see e.g. [35, 23, 28, 30]) to learn strategies that reach optimal or near-optimal performance in the actual environment in which the system is executed. In this paper, we want to answer the following question: How efficient can online-learning techniques be if one imposes that only correct executions, i.e. executions that satisfy a specified omega-regular objective (defined by a parity objective), are explored during execution? So, we want to understand how to combine (reactive) synthesis and RL in order to construct reactive systems that are safe, yet, at the same time, can adapt their behaviour according to the actual environment in which they execute.

##### Contributions.

To answer the question above precisely, we consider a mathematical model which generalizes partially-specified MDPs with the mean-payoff function and two-player zero-sum games with a parity objective. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that, in MDPs consisting of a single end component (EC), two combinations of guarantees on the parity and mean-payoff objectives can be achieved. (i) For all ϵ and γ, we show how to construct an online-learning finite-memory strategy which almost-surely satisfies the parity objective and which achieves an ϵ-optimal mean payoff with probability at least 1 - γ, for all instantiations of the partially unknown MDP (Proposition 5.2). (ii) Alternatively, for all ϵ and γ, we show how to construct an online-learning infinite-memory strategy which satisfies the parity objective surely and which achieves an ϵ-optimal mean payoff with probability at least 1 - γ, for all instantiations of the partially unknown MDP (Proposition 4.2). We extend the above results to MDPs consisting of more than one EC in a natural way (Theorem 5.3 and Theorem 4.3). We also study special cases that allow for improved optimality results, as in the case of good ECs (Proposition 4.1 and Proposition 5.1). Finally, we show that there are partially-specified MDPs for which stronger combinations of the guarantees cannot be ensured.

##### Example.

We illustrate in this example how to synthesize a finite-memory learning strategy that almost-surely wins the parity objective and ensures, with high probability, outcomes that are near optimal for the mean payoff. (Footnote 1: We refer the reader interested in an example regarding sure winning for the parity objective to App. A.) Consider the MDP on the right-hand side of Fig. 1, for which we know the support of the transition function but not the two transition probabilities (for simplicity the rewards are assumed to be known). First, note that while there is no surely winning strategy for the parity objective in this MDP, playing one particular action forever guarantees visiting the upper-right state infinitely many times with probability one, i.e. this is a strategy that almost-surely wins the parity objective. Clearly, which of the two actions is better for optimizing the mean payoff depends on how the two unknown probabilities compare. As these probabilities are unknown, we need to learn estimates for them in order to take a decision. This can be done by playing both actions a number of times and observing how many times we go up and how many times we go down. If the estimates favour one action, we may choose to play it forever in order to optimize our mean payoff. Then we face two difficulties. First, after the learning episode, the estimates may favour one action while the true probabilities favour the other. This is because we may have been unlucky and observed statistics that differ from the real distribution. Second, if the favoured action is not the one that ensures the parity objective, playing it always is not an option if we want to satisfy the parity objective with probability 1 (almost surely). In this paper, we give algorithms to overcome the two problems and compute a finite-memory strategy that satisfies the parity objective with probability 1 and is close to the optimal mean-payoff value with high probability. The finite-memory learning strategy produced by our algorithm works as follows in this example. First, it chooses a number of trials large enough so that trying both actions that many times allows it to learn estimates that are close to the true probabilities with sufficiently high probability. Then, if the estimates favour the action that does not ensure parity, the strategy alternates between playing the favoured action for a long block of steps and playing the parity-ensuring action for a fixed number of steps. The length of the long block is chosen large enough so that the mean payoff of any outcome will be ϵ-close to the best obtainable mean payoff with probability at least 1 - γ. Furthermore, as the parity-ensuring action is played infinitely many times, the upper-right state will be visited infinitely many times with probability 1. Hence, the strategy also almost-surely satisfies the parity objective. Additionally, we show in the paper that if we allow for learning all along the execution of the strategy then we can get, on this example, the exact optimal value and satisfy the parity objective almost surely. However, to do so, we need infinite memory.

##### Related works.

In [11, 17, 8, 16], we initiated the study of a mathematical model that combines MDPs and two-player zero-sum games. With this new model, we provide formal grounds to synthesize strategies that guarantee both some minimal performance against any adversary and a higher expected performance against a given expected behaviour of the environment, thus essentially combining the two traditional standpoints from games and MDPs. Following this approach, in [1], Almagor et al. study MDPs equipped with a mean-payoff and parity objective. They study the problem of synthesizing a strategy that ensures an expected mean-payoff value that is as large as possible while satisfying a parity objective surely. In [15], Chatterjee and Doyen study how to enforce almost surely a parity objective together with a threshold constraint on the expected mean payoff. See also [10], where mean-payoff MDPs with energy constraints are studied. In all those works, the transition probabilities and the reward function are known in advance. In contrast, we consider the more complex setting in which the reward function is discovered on the fly during execution and the transition probabilities need to be learned.

In [20, 36, 22, 2], RL is combined with safety guarantees. In those works, there is an MDP with a set of unsafe states that must be avoided at all costs. This MDP is then restricted to states and actions that are safe and cannot lead to unsafe states. Thereafter, classical RL is exercised. The problem considered there is thus very similar to the problem that we study here, with the difference that they only consider safety constraints. For safety constraints, the reactive synthesis phase and the RL phase can be entirely decoupled with a two-phase algorithm. A simple two-phase approach cannot be applied to the more general setting of parity objectives. In our more challenging setting, we need to intertwine the learning with the satisfaction of the parity objective in a nontrivial way. It is easy to show that reducing parity to safety, as in [7], could lead to learning strategies that are arbitrarily far from the optimal value that our learning strategies achieve. In [37], Topcu and Wen study how to learn in an MDP with a discounted-sum (and not mean-payoff) function and liveness constraints expressed as deterministic Büchi automata that must be enforced almost surely. Contrary to our setting, they consider neither general omega-regular specifications expressed as parity objectives nor sure satisfaction.

In [9], we apply reinforcement learning to MDPs where even the topology is unknown; only a lower bound on the minimal transition probability and, for convenience, the size of the state space are given. That work optimizes the probability of satisfying an omega-regular property; however, no mean payoff is involved. Our use of such a lower bound follows [9, 18], which argue that it is necessary for a fully statistical analysis of unbounded-horizon properties, but also that this assumption is realistic in many scenarios.

##### Structure of the paper.

In Sect. 2, we introduce the necessary preliminaries. In Sect. 3, we study online finite and infinite-memory learning strategies for mean-payoff objectives without omega-regular constraints. In Sect. 4, we study strategies for mean-payoff objectives under a parity constraint that must be enforced surely. In Sect. 5, we study strategies for mean-payoff objectives under a parity constraint that must be enforced almost surely.

## 2 Preliminaries

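Let $S$ be a finite set. We denote by $\mathcal{D}(S)$ the set of all (rational) probabilistic distributions on $S$, i.e. the set of all functions $f : S \to \mathbb{Q}_{\geq 0}$ such that $\sum_{s \in S} f(s) = 1$. For sets $A$ and $B$ and functions $g : A \to \mathcal{D}(S)$ and $h : A \times B \to \mathcal{D}(S)$, we write $g(s \mid a)$ and $h(s \mid a, b)$ instead of $g(a)(s)$ and $h(a, b)(s)$ respectively. The support of a distribution $f \in \mathcal{D}(S)$ is the set $\mathrm{supp}(f) = \{ s \in S \mid f(s) > 0 \}$. The support of a function $g : A \to \mathcal{D}(S)$ is the relation $R \subseteq A \times S$ such that $(a, s) \in R$ if and only if $g(s \mid a) > 0$.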

### 2.1 Markov chains

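[Markov chains] A Markov chain (MC, for short) is a tuple $\mathcal{C} = (Q, \delta, p, r)$ where $Q$ is a (potentially countably infinite) set of states, $\delta$ is a (probabilistic) transition function mapping each state to a probability distribution over $Q$, $p : Q \to \mathbb{N}$ is a priority function, and $r$ is an (instantaneous) reward function. A run of an MC is an infinite sequence of states $q_0 q_1 \dots$ such that $\delta(q_i)$ assigns positive probability to $q_{i+1}$ for all $i \geq 0$. We denote by $\mathrm{Runs}_{q_0}(\mathcal{C})$ the set of all runs of $\mathcal{C}$ that start with the state $q_0$.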

Consider an initial state . The probability of every measurable event is uniquely defined [34, 27]. We denote by the probability of ; for a measurable function , we write for the expected value of the function under the probability measure .

##### Parity and mean payoff.

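Consider a run $\rho = q_0 q_1 \dots$ of a Markov chain with priority function $p$. We say $\rho$ satisfies the parity objective, written $\rho \models \mathrm{Parity}$, if the minimal priority of the states visited infinitely often along the run is even. That is to say, $\liminf_{i \to \infty} p(q_i)$ is even. In a slight abuse of notation, we sometimes write Parity to refer to the set of all runs of a Markov chain which satisfy the parity objective. The latter set of runs is clearly measurable.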

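The mean-payoff function $\mathrm{MP}$ is defined for all runs $\rho = q_0 q_1 \dots$ of a Markov chain with reward function $r$ as follows: $$\mathrm{MP}(\rho) = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} r(q_i).$$ This function is readily seen to be Borel definable [13], thus also measurable.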
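To make the two objectives concrete, the following minimal sketch evaluates them on an ultimately periodic run, i.e. a finite prefix followed by a cycle repeated forever. It assumes the usual reading of both objectives (the parity objective looks at the minimal priority seen infinitely often, and the mean payoff is the long-run average reward); the function names and the list-based encoding of a run are illustrative and not taken from the paper.

```python
from fractions import Fraction


def satisfies_parity(prefix_priorities, cycle_priorities):
    """Parity check for the run prefix . cycle^omega.

    Only the states on the cycle are visited infinitely often, so the
    prefix is irrelevant: the run is winning iff the minimal priority
    occurring on the cycle is even.
    """
    if not cycle_priorities:
        raise ValueError("the cycle must contain at least one state")
    return min(cycle_priorities) % 2 == 0


def mean_payoff(prefix_rewards, cycle_rewards):
    """Mean payoff of the run prefix . cycle^omega.

    The contribution of the finite prefix vanishes in the long-run
    average, so the result is the average reward over one cycle.
    """
    if not cycle_rewards:
        raise ValueError("the cycle must contain at least one state")
    return sum(Fraction(r) for r in cycle_rewards) / len(cycle_rewards)
```

Both functions ignore the prefix on purpose: this is the sense in which the two objectives are prefix-independent.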

### 2.2 Markov decision processes

[Markov decision processes] A (finite discrete-time) Markov decision process (MDP, for short) is a tuple where is a finite set of states, a finite set of actions, a function that assigns to its set of available actions, a (partial probabilistic) transition function with defined for all and all , a priority function, and a reward function. We make the assumption that for all . A history in an MDP is a finite state-action sequence that ends in a state and respects and , i.e. if then and for all . We write to denote the state . For two histories , we write if is a proper prefix if .

[Strategies] A strategy in an MDP is a function such that . We write that a strategy is memoryless if whenever ; deterministic if for all histories the distribution is Dirac.

Throughout this work we will speak of steps, episodes, and following strategies. We write that follows (from the history ) during steps if for all , such that and , we have that . An episode is simply a finite sequence of steps, i.e. a finite infix of the history, during which one or more strategies may have been sequentially followed.

A stochastic Moore machine is a tuple where is a (potentially countably infinite) set of memory elements, is the initial memory element, is an update function, and is an output function. The machine is said to implement a strategy if for all histories we have , where is inductively defined as for all . It is easy to see that any strategy can be implemented by such a machine. A strategy is said to have finite memory if there exists a stochastic Moore machine that implements it and such that its set of memory elements is finite.

A (possibly infinite) state-action sequence is consistent with strategy if for all .

##### From MDPs to MCs.

The MDP and a strategy implemented by the stochastic Moore machine induce the MC where , for any , , and for any . To avoid clutter, we write instead of .

A strategy is said to be unichain if has a single recurrent class (i.e. bottom strongly-connected component).

##### End components.

Consider a pair where and gives a subset of actions allowed per state (i.e. for all ). Let be the directed graph where is the set of all pairs such that for some . We say is an end component (EC) if the following hold: if , for , then ; and the graph is strongly connected. Furthermore, we say the EC is good (for the parity objective) (a GEC, for short) if the minimal priority of a state from is even; weakly good if it contains a GEC.
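The two conditions in the definition of an end component can be checked directly on the support of the transition function. The sketch below is one possible encoding, under stated assumptions: the support is given as a dictionary from (state, action) pairs to sets of successors, every state of the candidate component is required to have at least one allowed action, and all names are hypothetical.

```python
def is_end_component(states, allowed, support):
    """Check whether (states, allowed) is an end component.

    states:  iterable of states of the candidate component
    allowed: dict state -> set of actions allowed in that state
    support: dict (state, action) -> set of possible successor states
    """
    states = set(states)
    if not states:
        return False
    edges = {s: set() for s in states}
    for s in states:
        actions = allowed.get(s, set())
        if not actions:
            return False  # assumption: every state needs an allowed action
        for a in actions:
            successors = set(support[(s, a)])
            if not successors <= states:
                return False  # an allowed action may leave the component
            edges[s] |= successors

    def reachable(start, graph):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    reverse = {s: set() for s in states}
    for u, targets in edges.items():
        for v in targets:
            reverse[v].add(u)
    start = next(iter(states))
    # Strongly connected iff every state is reachable from `start`
    # in both the graph and its reverse.
    return reachable(start, edges) == states and reachable(start, reverse) == states


def is_good(states, priority):
    """A component is good if the minimal priority among its states is even."""
    return min(priority[s] for s in states) % 2 == 0
```

Since only the support is inspected, the answer does not depend on the actual transition probabilities.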

For ECs and , let us denote by the fact that and for all . We denote by the set of all maximal ECs (MECs) in with respect to . It is easy to see that for all we have that , i.e. every state belongs to at most one MEC.

##### Model learning and robust strategies.

In this work we will “approximate” the stochastic dynamics of an unknown EC in an MDP. Below, we formalize what we mean by approximation. [Approximating distributions] Let be an MDP, an EC, and . We say is -close to in , denoted , if for all and all . If the inequality holds for all and all , then we write .
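As a small illustration of this notion of approximation, the sketch below compares two transition functions entry by entry. It assumes that closeness means that every transition probability differs by at most the given tolerance in absolute value; the dictionary encoding and the function name are hypothetical.

```python
def is_epsilon_close(estimate, truth, epsilon, pairs=None):
    """Check entrywise closeness of two transition functions.

    estimate, truth: dict (state, action) -> dict successor -> probability
    pairs: the (state, action) pairs to compare; defaults to all pairs
           of `truth` (restrict it to the pairs of an EC if desired)
    """
    if pairs is None:
        pairs = truth.keys()
    for pair in pairs:
        estimated = estimate.get(pair, {})
        actual = truth[pair]
        for successor in set(estimated) | set(actual):
            gap = abs(estimated.get(successor, 0.0) - actual.get(successor, 0.0))
            if gap > epsilon:
                return False
    return True
```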

A strategy is said to be (uniformly) expectation-optimal if for all we have . The following result captures the idea that some expectation-optimal strategies for MDPs whose transition function have the same support are “robust”. That is, when used to play in another MDP with the same support and close transition functions, they achieve near-optimal expectation. [Adapted from [14, Theorem 5]] Consider values such that and a transition function such that and . For all memoryless deterministic expectation-optimal strategies in , for all , it holds that We say a strategy such as the one in the result above is -robust-optimal (with respect to the expected mean payoff).

### 2.3 Automata as proto-MDPs

We study MDPs with unknown transition and reward functions. It is therefore convenient to abstract those values and work with automata.

[Automata] A (finite-state parity) automaton is a tuple where is a finite set of states, is a finite alphabet of actions, is a transition relation, and is a priority function. We make the assumption that for all we have for some .

A transition function is then said to be compatible with if . For a transition function compatible with and a reward function , we denote by the MDP where . It is easy to see that the sets of ECs of MDPs and coincide for all compatible with and all reward functions . Hence, we will sometimes speak of the ECs of an automaton.

##### Transition-probability lower bound.

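Let $\pi_{\min} \in (0, 1]$ be a transition-probability lower bound. We say that a transition function $\delta$ is compatible with $\pi_{\min}$ if every transition probability it assigns is either $0$ or at least $\pi_{\min}$.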

## 3 Learning for MP: the Unconstrained Case

In this section, we focus on the design of optimal learning strategies for the mean-payoff function in the unconstrained single-end-component case. That is, we have an unknown strongly connected MDP with no parity objective.

We consider, in turn, learning strategies that use finite and infinite memory. Whereas classical RL algorithms focus on achieving an optimal expected value (see, e.g., [35]; cf. [6]), we prove here that a stronger result is achievable: one can ensure, using finite memory only, outcomes that are close to the best expected value with high probability. Further, with infinite memory the optimal outcomes can be ensured with probability 1. In both cases, we argue that our results are tight.

For the rest of this section, let us fix an automaton such that is an EC, and some .

##### Yardstick.

Let be a transition function compatible with and , and be a reward function. The optimal mean-payoff value that is achievable in the unique EC is defined as for any . Indeed, it is well known that the value on the right-hand side of the definition is the same for all states in the same EC.

Note that this value can always be obtained by a memoryless deterministic [21] and unichain [11] expectation-optimal strategy when and are known. We will use this value as a yardstick for measuring the performance of the learning strategies we describe below.

##### Model learning.

Our strategies learn approximate models of and to be able to compute near-optimal strategies. To obtain those models, we use an approach based on ideas from probably approximately correct (PAC) learning. Namely, we will execute a random exploration of the MDP for some number of steps and obtain an empirical estimation of its stochastic dynamics, see e.g. [33]. We say that a memoryless strategy is a (uniform random) exploration strategy for a function if for all . Each time the random exploration enters a state and chooses an action , we say that it performs an experiment on , and if the state reached is then we say that the result of the experiment is . Furthermore, the value is then known to us. To learn an approximation of the transition function , and to learn , the learning strategy remembers statistics about such experiments. If the random exploration strategy is executed long enough then it collects sufficiently many experiment results to accurately approximate the transition function and the exact reward function with high probability.
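The exploration-and-counting scheme described above can be sketched as follows. This is a simulation under stated assumptions, not the paper's construction: the true MDP is passed in only so that experiments can be sampled, the nested-dictionary encoding and the function name are hypothetical, and rewards (which could be recorded in the same loop) are omitted.

```python
import random
from collections import Counter, defaultdict


def explore_and_estimate(transitions, start, steps, rng):
    """Run uniform random exploration and return empirical frequencies.

    transitions: dict state -> dict action -> dict successor -> probability
                 (the unknown dynamics, used here only to sample outcomes)
    Returns a dict (state, action) -> dict successor -> relative frequency.
    """
    counts = defaultdict(Counter)
    state = start
    for _ in range(steps):
        # Uniform random exploration: every available action is equally likely.
        action = rng.choice(sorted(transitions[state]))
        successors = sorted(transitions[state][action])
        weights = [transitions[state][action][s] for s in successors]
        # One experiment on (state, action); its result is the next state.
        result = rng.choices(successors, weights=weights)[0]
        counts[(state, action)][result] += 1
        state = result
    estimate = {}
    for pair, counter in counts.items():
        total = sum(counter.values())
        estimate[pair] = {s: n / total for s, n in counter.items()}
    return estimate
```

With a seeded generator such as `random.Random(0)` the exploration is reproducible, which is convenient when experimenting with different numbers of steps.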

The next lemma gives us a bound on the number of -step episodes for which we need to exercise such a strategy to obtain the desired approximation with at least some given probability. It can be proved via a simple application of Hoeffding’s inequality. For all ECs and all one can compute (exponential in and polynomial in , , , and ) such that following an exploration strategy for during (potentially non-consecutive) episodes of -steps suffices to collect enough information to be able to compute a transition function such that
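To give a feel for the Hoeffding-based argument, the sketch below computes how many experiments on a single state-action pair suffice to estimate one transition probability to a given accuracy and confidence, and how a union bound extends this to several transitions. This covers only the per-experiment count; the lemma's bound on the number of episodes must additionally account for how often exploration actually reaches each pair, which is not modelled here. Function names are hypothetical.

```python
import math


def hoeffding_samples(epsilon, gamma):
    """Number n of independent experiments such that an empirical frequency
    deviates from the true probability by more than epsilon with probability
    at most gamma, using Hoeffding's bound 2 * exp(-2 * n * epsilon**2)."""
    if not (0 < epsilon < 1 and 0 < gamma < 1):
        raise ValueError("epsilon and gamma must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / gamma) / (2 * epsilon ** 2))


def samples_for_all_transitions(epsilon, gamma, num_transitions):
    """Union bound: give each transition a failure budget of gamma / num_transitions."""
    return hoeffding_samples(epsilon, gamma / num_transitions)
```

For instance, `hoeffding_samples(0.1, 0.05)` asks for accuracy 0.1 with confidence 0.95 on a single transition.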

### 3.1 Finite memory

We now present a family of finite-memory strategies that force, given any ϵ and γ, outcomes with a mean payoff that is ϵ-close to the optimal expected value with probability at least 1 - γ. The strategy is defined as follows.

1. First, the strategy follows the model-learning strategy above for the number of steps prescribed by Lemma 3, in order to obtain an approximation of the transition function that is close to the true one with the required probability. A reward function is also constructed from the observed rewards.

2. Then, the strategy follows a memoryless deterministic expectation-optimal strategy for the learned model.
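The second step needs an expectation-optimal memoryless deterministic strategy for the learned model. The paper does not prescribe how to compute it; one simple option, sketched below under stated assumptions, is undiscounted value iteration, where the value after n steps divided by n approximates the optimal mean payoff and the last greedy choice is used as the strategy. The encoding (rewards attached to state-action pairs), the function name, and the fixed horizon are assumptions, and the greedy strategy is a heuristic rather than a certified optimum.

```python
def average_reward_value_iteration(transitions, rewards, horizon=2000):
    """Approximate the optimal mean payoff of a strongly connected MDP.

    transitions: dict state -> dict action -> dict successor -> probability
    rewards:     dict (state, action) -> reward
    Returns (gain, policy): per-state estimates of the optimal mean payoff
    and a memoryless deterministic policy (dict state -> action).
    """
    value = {q: 0.0 for q in transitions}
    policy = {}
    for _ in range(horizon):
        new_value = {}
        for q, actions in transitions.items():
            best_action, best_total = None, float("-inf")
            for a in sorted(actions):
                total = rewards[(q, a)] + sum(
                    p * value[q2] for q2, p in actions[a].items()
                )
                if total > best_total:
                    best_action, best_total = a, total
            new_value[q] = best_total
            policy[q] = best_action
        value = new_value
    # The n-step optimal total reward grows like n times the optimal gain.
    gain = {q: v / horizon for q, v in value.items()}
    return gain, policy
```

In the finite-memory strategy, this computation would be applied to the learned transition and reward functions rather than to the true ones.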

The following result tells us that if the learning phase is sufficiently long, then this strategy obtains a near-optimal mean payoff with high probability.

###### Proposition .

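For all ϵ and γ, one can compute a length of the learning phase such that the resulting finite-memory strategy, for all initial states, for all transition functions compatible with the automaton and the transition-probability lower bound, and for all reward functions, obtains a mean payoff that is ϵ-close to the optimal value with probability at least 1 - γ.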

###### Proof.

We will make use of Lemma 2.2. For that purpose, let where is as in the statement of the lemma. Next, we set where is as dictated by Lemma 3 using and . By Lemma 3, with probability at least our approximation is such that . It follows that, since , we have and we now have learned , again with probability . Finally, since , Lemma 2.2 implies the desired result. ∎

[Finite-memory implementability] Note that , as we described it previously, is not immediately seen to be a computable finite stochastic Moore machine. However, for all possible runs of length , we can compute —an approximation of —and . Using that information, the required finite-memory expectation-optimal strategy can be computed. We encode these (finitely many) strategies into the machine implementing so that it only has to choose which one to follow forever after the (finite) learning phase has ended. Hence, one can indeed construct a finite-memory strategy realizing the described strategy.

##### Optimality.

The following tells us that we cannot do better with finite memory strategies.

###### Proposition .

Let be the single-EC automaton on the right-hand side of Fig. 1 and . For all , the following two statements hold.

• For all finite memory strategies , there exist compatible with and , and a reward function , such that

• For all finite memory strategies , there exist compatible with and , and a reward function such that

###### Proof sketch.

With a finite-memory strategy we cannot satisfy a stronger guarantee than being ϵ-optimal with probability at least 1 - γ in this example. Indeed, as we can only use finite memory, we can only learn imprecise models of the unknown quantities. That is, there is always a non-zero probability of having approximated them arbitrarily far from their actual values. It should then be clear that neither optimality with high probability nor almost-sure ϵ-optimality can be achieved. ∎

### 3.2 Infinite memory

While we have shown that probably approximately optimal is the best that can be obtained with finite memory learning strategies, we now establish that with infinite memory, one can guarantee almost sure optimality.

To this end, we define a strategy which operates in episodes consisting of two phases: learning and optimization. In each episode, the strategy does the following.

1. It first follows an exploration strategy for a prescribed number of learning steps, and then builds models of the transition and reward functions based on the experiments obtained throughout all the steps during which exploration has been followed so far.

2. Then, it follows a unichain memoryless deterministic expectation-optimal strategy for the learned model during a prescribed number of optimization steps.
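The episodic structure can be pictured with a simple schedule generator. The concrete phase lengths used below (linear learning phases, cubic optimization phases) are purely hypothetical placeholders chosen for illustration; the paper computes its own sequences. The sketch only shows the shape of the schedule and that, with such a choice, the fraction of steps spent learning shrinks as episodes accumulate.

```python
from itertools import count, islice


def episode_schedule(learn_len=lambda i: i, optimize_len=lambda i: i ** 3):
    """Yield (episode, phase, steps) forever: a learning phase followed by
    an optimization phase in each episode. The default lengths are
    hypothetical and only serve as an example of growing sequences."""
    for i in count(1):
        yield i, "learn", learn_len(i)
        yield i, "optimize", optimize_len(i)


def learning_fraction(num_episodes, learn_len=lambda i: i, optimize_len=lambda i: i ** 3):
    """Fraction of all steps spent exploring during the first episodes."""
    learn = sum(learn_len(i) for i in range(1, num_episodes + 1))
    optimize = sum(optimize_len(i) for i in range(1, num_episodes + 1))
    return learn / (learn + optimize)
```

A controller would consume the generator lazily, for example with `islice(episode_schedule(), 4)` to inspect the first two episodes.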

One can then argue that the strategy can be instantiated so that in every episode the finite average obtained so far gets ever closer to the optimal value with ever higher probability. This is achieved by choosing the lengths of the learning phases as an increasing sequence so that the approximations get ever better with ever higher probability. Then, the lengths of the optimization phases are chosen so as to compensate for the past history, for the time before the induced Markov chain reaches its limit distribution, and for the number of steps that will be spent learning in the next episode. The latter then allows us to use the Borel-Cantelli lemma to show that in the unknown EC we can obtain its value almost surely. The proof of this result is technically challenging and is given in full detail in the appendix.

###### Proposition .

One can compute a sequence such that for all ; additionally the resulting strategy is such that for all , for all compatible with and , and for all reward functions , we have

##### Optimality.

Note that this strategy is optimal since it obtains with probability 1 the best value that can be obtained when the MDP is fully known, i.e. when the transition and reward functions are known in advance.

## 4 Learning for MP under a Sure Parity Constraint

We show here how to design learning strategies that obtain near-optimal mean-payoff values while ensuring that all runs satisfy a given parity objective with certainty.

First, we note that all such learning strategies must avoid entering states from which there is no strategy to enforce the parity objective with certainty. Hence, we make the hypothesis that all such states have been removed from the automaton, and so we assume that for every state there exists a strategy such that, for all transition functions compatible with the automaton and for all reward functions, every run from that state consistent with the strategy satisfies the parity objective. It is worth noting that, in fact, there exists a memoryless deterministic strategy such that the condition holds for all states [4, 3]. Notice the swapping of the quantifiers over the initial states and the strategy; this is why we say such a strategy is uniformly winning (for the parity objective). The set of states to be removed, along with a uniformly winning strategy, can be computed in quasi-polynomial time [12]. We say that an automaton with no states from which there is no winning strategy is surely good.

We study the design of learning strategies for mean-payoff optimization under sure parity constraints for increasingly complex cases.

### 4.1 The case of a single good EC

Consider a surely-good automaton whose states and actions form a GEC, i.e. the minimal priority of a state in the EC is even, and some transition-probability lower bound.

##### Yardstick.

For this case, we use as yardstick the optimal expected mean-payoff value:

##### Learning strategy.

We show here that it is possible to obtain an optimal mean payoff with high probability. Note that our solution extends a result given by Almagor et al. [1] for known MDPs. The main idea behind our solution is to use the strategy from Proposition 3.2 in a controlled way: we verify that during all successive learning and optimization episodes, the minimal priority that is visited is even. If during some episode this is not the case, then we resort to a strategy that enforces the parity objective with certainty. Such a strategy is guaranteed to exist as the automaton is surely good.

###### Proposition .

For all , there exists a strategy such that for all , for all compatible with and , and for all reward functions , we have for all and .

###### Proof sketch.

We modify the strategy from Proposition 3.2 so as to “give up” on optimizing the mean payoff if the minimal even priority has not been seen during a long sequence of episodes. This will guarantee that the measure of runs which give up on the mean-payoff optimization is at most γ.

First, recall that we can instantiate so that for all . Hence, with some probability , during every learning phase, we visit a state with even minimal priority. We can then find a sequence of natural numbers such that , for some . Given this sequence, we apply the following monitoring. If for we write , then at the end of the -th episode we verify that during some learning phase from we have visited a state with minimal even priority, otherwise we switch to a parity-winning strategy forever. ∎
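The monitoring described in this proof sketch can be pictured as a small state machine that is consulted at the end of every episode. The sketch below is an illustration under stated assumptions: episodes are grouped into windows whose lengths come from a user-supplied sequence (the default, windows of length 1, 2, 3, ..., is a hypothetical placeholder for the sequence chosen in the proof), and the class and method names are not from the paper.

```python
from itertools import count


class ParityMonitor:
    """Decide, episode by episode, whether to keep optimizing the mean payoff
    or to switch forever to a parity-winning fallback strategy."""

    def __init__(self, min_even_priority, window_lengths=None):
        self.target = min_even_priority
        # Hypothetical default: windows of 1, 2, 3, ... episodes.
        self.windows = iter(window_lengths) if window_lengths is not None else count(1)
        self.remaining = next(self.windows)
        self.seen_target = False
        self.fallback = False

    def end_of_episode(self, visited_priorities):
        """Report the priorities visited in the episode that just ended.

        Returns True to keep optimizing, False once the fallback is active.
        """
        if self.fallback:
            return False
        if self.target in visited_priorities:
            self.seen_target = True
        self.remaining -= 1
        if self.remaining == 0:
            if not self.seen_target:
                # The minimal even priority was missed for a whole window.
                self.fallback = True
                return False
            self.seen_target = False
            self.remaining = next(self.windows)
        return True
```

Once `end_of_episode` has returned False it keeps returning False, mirroring the fact that the switch to the parity-winning strategy is permanent.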

##### Optimality.

The following proposition tells us that the guarantees from Proposition 4.1 are indeed optimal w.r.t. our chosen yardstick.

###### Proposition .

Let be the single-GEC automaton on the left-hand side of Fig. 1 and . For all parity-winning strategies , there exist compatible with and , and a reward function , such that

###### Proof sketch.

Consider a reward function such that and and an arbitrary . It is easy to see that . However, any strategy that ensures the parity objective is satisfied surely must be such that, with probability , it switches to follow a strategy forever. Hence, with probability at least its mean-payoff is sub-optimal. ∎

### 4.2 The case of a single EC

We now turn to the case where the surely-good automaton consists of a unique, not necessarily good, EC . Let us also fix some .

An important observation regarding single-end-component MDPs that are surely good is that they contain at least one GEC as stated in the following lemma. For all surely-good automata such that is an EC there exists such that is a GEC in for all compatible with and all reward functions , i.e. is weakly good.

##### Yardstick.

Let and be fixed in the single EC, our yardstick for this case is defined as follows: That is is the best MP expectation value that can be obtained from a state with a parity-winning strategy. It is remarkable to note that we take the maximal value over all states in . As noted by Almagor et al. [1], this value is not always achievable even when and are a priori known, but it can be approached arbitrarily close.

##### Learning strategy.

The following proposition tells us that we can obtain a value close to with arbitrarily high probability while satisfying the parity objective surely.

###### Proposition .

For all there exists a strategy such that for all , for all compatible with and , and for all reward functions , we have for all and .

###### Proof sketch.

We define a strategy as follows. Let for as defined for Lemma 2.2. The strategy plays as follows.

1. It first computes such that with probability at least and a reward function by following an exploration strategy for during steps (see Lemma 3).

2. It then selects a contained maximal good EC (MGEC) with maximal value (see Lemma 4.2) and tries to reach it with probability at least by following during steps.

3. Finally, if the component is reached, it follows a strategy as in Proposition 4.1 with from then onward.

If the learning “fails” or if the component is not reached, the strategy reverts to following a winning strategy forever. (A failed learning phase is one in which the approximated distribution function does not have as its support.) ∎

##### Optimality.

The following states that we cannot improve on the result of Proposition 4.2.

###### Proposition .

Let be the single-EC automaton in Fig. 2 and . For all , the two following statements hold.

• For all strategies , there exist compatible with and , and a reward function , such that

• For all strategies , there exist compatible with and , and a reward function , such that

###### Proof sketch.

Observe that the MEC is not a good EC. However, it does contain the GECs with states and respectively. Now, since those two GECs are separated by , whose priority is , any winning strategy must at some point stop playing to and commit to a single GEC. Thus, the learning of the global EC can only last for a finite number of steps. It is then straightforward to argue that near-optimality with high probability is the best achievable guarantee. ∎

### 4.3 General surely-good automata

In this section, we generalize our approach from single-EC automata to general automata. We will argue that, under a sure parity constraint, we can achieve a near-optimal mean payoff with high probability in any MEC in which we end up with non-zero probability, that is, given that we stay in that EC with non-zero probability.

Consider a surely-good automaton and some . For all there exists a strategy such that for all , for all compatible with and , and all reward functions , we have

• for all and

• for all such that is weakly good and .

###### Proof sketch.

The strategy we construct follows a parity-winning strategy until a state contained in a weakly good MEC that has not been visited before is entered. In this case, the strategy follows (the strategy from Proposition 4.2). Observe that when switches to (a parity-winning strategy) it may exit the end component. If this happens, then the component is marked as visited and is followed until a new, not previously visited, maximal good end component is entered. In that case, we switch to once more. Crucially, the new strategy ignores MECs that are revisited. ∎
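The bookkeeping in this proof sketch can be illustrated as follows. This is only a sketch: it assumes the MEC decomposition is given as a map from states to MEC identifiers, and the names `switching_modes`, `mec_of` and `weakly_good` are hypothetical. The mode `"learn"` stands for the strategy from Proposition 4.2 (including its own internal fallback), and `"parity"` for the parity-winning strategy.

```python
def switching_modes(trace, mec_of, weakly_good):
    """Return, for each state of `trace`, which sub-strategy is in control.

    trace       : list of visited states
    mec_of      : dict mapping a state to its MEC identifier (absent if none)
    weakly_good : set of identifiers of weakly good MECs
    """
    visited = set()   # MECs that were entered and later left
    current = None    # MEC in which the learning strategy is currently running
    modes = []
    for state in trace:
        mec = mec_of.get(state)
        if current is not None and mec != current:
            visited.add(current)   # the component was exited: mark it as visited
            current = None
        if current is None and mec in weakly_good and mec not in visited:
            current = mec          # fresh weakly good MEC: (re)start learning
        modes.append("learn" if current is not None else "parity")
    return modes

# Hypothetical run: MEC "M1" is entered, left, and then revisited.
modes = switching_modes(
    ["a", "b", "b", "c", "b", "d"],
    {"b": "M1", "d": "M2"},
    {"M1", "M2"},
)
print(modes)
```

On the second visit to `"M1"` the controller stays in `"parity"` mode, which reflects the rule that revisited MECs are ignored.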

[On the choice of MECs to reach] The strategy constructed for the proof of Theorem 4.3 has to deal with leaving a MEC due to the fallbacks to the parity-winning strategy . However, surprisingly, instead of actually following , upon entering a new MEC it has to restart the process of achieving a satisfactory mean-payoff. Indeed, otherwise the overall mass of sub-optimal runs from various MECs (each smaller than ) could get concentrated in a single MEC, thus violating the advertised guarantees.

The strategy could be simplified as follows. First, we follow any strategy to reach a bottom MEC (BMEC), that is, a MEC from which no other MEC is reachable. By definition, the winning strategy can be played here and the MEC cannot be escaped. Therefore, in the BMEC we run the strategy as described, and after the fallback we indeed simply follow . If we have not reached a BMEC after a long time, we can switch to the fallback, too. While this strategy is certainly simpler, our general strategy has the following advantage. Intuitively, we can force the strategy to stay in any current good MEC, even if it is not bottom, and thus possibly achieve a more satisfactory mean payoff. Further, whenever we want, we can force the strategy to leave the current MEC and go to a lower one. For instance, if the current estimate of the mean payoff is lower than what we hope for, we can try our luck in a lower MEC. We further comment on the choice of unknown MECs in the conclusions.
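Identifying the bottom MECs only requires the support graph. The following sketch assumes the MECs have already been computed and are passed in as sets of states; the function name and the input encoding are hypothetical.

```python
def bottom_mecs(successors, mecs):
    """Return the MECs from which no other MEC is reachable.

    successors : dict state -> set of states reachable in one step (any action)
    mecs       : list of frozensets of states, assumed to be the MECs
    """
    def reachable(start):
        # Plain graph search over the support, starting from all states of a MEC.
        seen, stack = set(start), list(start)
        while stack:
            s = stack.pop()
            for t in successors.get(s, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    bottom = []
    for mec in mecs:
        reach = reachable(mec)
        if all(other == mec or not (reach & other) for other in mecs):
            bottom.append(mec)
    return bottom

# Hypothetical example: the MEC {0} can be left towards the MEC {2}.
successors = {0: {0, 1}, 1: {2}, 2: {2}}
print(bottom_mecs(successors, [frozenset({0}), frozenset({2})]))
```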

## 5 Learning for MP under an Almost-Sure Parity Constraint

In this section, we turn our attention to learning strategies that must ensure a parity objective not with certainty (as in the previous section) but almost surely, i.e. with probability 1. As winning almost surely is less stringent, we can hope both for a stricter yardstick (i.e. better target values) and for better ways of achieving such high values. We show here that this is indeed the case. Additionally, we argue that several important learning results can now be obtained with finite-memory strategies.

As previously, we assume that we have removed from all states from which the parity objective cannot be forced with probability 1 (no such state can ever be entered). Note that to compute the set of states to remove, we do not need the knowledge of but only the support as given by . The states to remove can be computed in polynomial time using graph-based algorithms (see, e.g., [5]). An automaton that contains only almost-surely winning states for the parity objective is called almost-surely good.
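To illustrate why only the support is needed, here is a sketch of one standard graph-based ingredient: computing the states from which a target set can be reached with probability 1. It is not the full procedure for parity, which would additionally identify the good end components to use as targets. The function name and the encoding of the support as nested dictionaries are assumptions made for this example.

```python
def almost_sure_reach(actions, target):
    """States from which `target` can be reached with probability 1.

    actions : dict state -> dict action -> set of possible successors (the support)
    target  : set of target states
    """
    candidates = set(actions)
    while True:
        # States that can reach the target with positive probability while
        # only using actions whose successors all stay inside `candidates`.
        reach = set(target) & candidates
        changed = True
        while changed:
            changed = False
            for s in candidates - reach:
                for succ in actions[s].values():
                    if succ <= candidates and succ & reach:
                        reach.add(s)
                        changed = True
                        break
        if reach == candidates:
            return candidates
        candidates = reach

# Hypothetical example: state 2 is a losing sink, state 1 is the target.
actions = {
    0: {"a": {1, 2}, "b": {0, 1}},
    1: {"a": {1}},
    2: {"a": {2}},
    3: {"a": {1, 2}},
}
print(almost_sure_reach(actions, {1}))
```

No transition probability appears anywhere in the computation: only which successors are possible matters.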

We have, as in the previous section, that for all automata there exists a memoryless deterministic strategy such that for all , for all compatible with , for all , the measure of the subset of such that is equal to 1 (see e.g. [5]). Such a strategy is said to be uniformly almost-sure winning (for the parity objective). In the sequel, we denote such a strategy .

We now study the design of learning strategies for mean-payoff optimization under almost-sure parity constraints for increasingly complex cases.

### 5.1 The case of a good end component

Consider an automaton such that is a GEC, and some .

##### Yardstick.

For this case, we use as a yardstick the optimal expected mean-payoff value: for any .

##### Learning strategies.

We start by noting that from Section 3 also ensures that the parity objective is satisfied almost surely when exercised in a GEC.

###### Proposition .

One can compute a sequence such that for the resulting strategy we have that for all , for all compatible with and , and for all reward functions , we have and .

###### Proof.

By Proposition 3.2, one can choose parameter sequences such that for all and so that we obtain the second part of the claim. Then, since in every episode we have a non-zero probability of visiting a minimal even priority state, we obtain the first part of the claim as a simple consequence of the second Borel-Cantelli lemma. ∎
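For reference, the second Borel-Cantelli lemma invoked in this proof can be stated as follows. The notation $E_i$, for the event that episode $i$ visits a state of minimal even priority, is introduced here for illustration and is not taken from the original text.

```latex
% Second Borel-Cantelli lemma, for independent events E_1, E_2, ...
\sum_{i \ge 1} \Pr[E_i] = \infty
\quad\Longrightarrow\quad
\Pr\bigl[\, E_i \text{ occurs for infinitely many } i \,\bigr] = 1 .
```

In particular, a uniform lower bound $\Pr[E_i] \ge p > 0$ makes the series diverge. Since consecutive episodes of a strategy in an MDP need not be independent, a conditional variant of the lemma is presumably what a full proof relies on; this is our reading rather than a statement from the source.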

We now turn to learning using finite memory only. Consider parameters . Let for as defined for Lemma 2.2. The strategy that we construct does the following.

1. First, it computes such that with probability at least and a reward function by following an exploration strategy for during steps (see Lemma 3).

2. Then, it computes a unichain deterministic expectation-optimal strategy for and repeats the following forever: follow for steps, then follow for steps.
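The second step only needs a bounded counter, which is what makes the strategy finite-memory. The sketch below assumes that the two alternating phases are the expectation-optimal strategy and a parity strategy (the extracted text does not make the second one explicit); the phase lengths `k_opt` and `k_par` and the function name are hypothetical.

```python
from itertools import islice

def alternating_schedule(k_opt, k_par):
    """Yield forever which sub-strategy to follow at each step.

    A counter modulo k_opt + k_par is the only memory needed.
    """
    period = k_opt + k_par
    step = 0
    while True:
        yield "optimal" if step < k_opt else "parity"
        step = (step + 1) % period

# First seven steps of a schedule with 3 optimal steps followed by 2 parity steps.
print(list(islice(alternating_schedule(3, 2), 7)))
```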

Using the fact that, in a finite MC with a single BSCC, almost all runs obtain the expected mean payoff and the assumption that the EC is good, one can then prove the following result.

###### Proposition .

For all one can compute such that for the resulting strategy , for all , for all compatible with and , and for all reward functions , we have and .
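The fact about finite Markov chains used before this proposition can be made concrete: in a chain with a single BSCC, the long-run average reward of almost every run equals the stationary expectation of the reward. The following sketch computes that value exactly for a small chain. The matrix encoding and the function names are assumptions made for this example, and it is not part of the paper's construction.

```python
from fractions import Fraction

def stationary_distribution(P):
    """Solve pi P = pi, sum(pi) = 1 exactly for a unichain transition matrix P."""
    n = len(P)
    # Rows 0..n-2 encode (P^T - I) pi = 0; the last row is the normalisation.
    A = [[Fraction(P[j][i]) - (1 if i == j else 0) for j in range(n)] + [Fraction(0)]
         for i in range(n - 1)]
    A.append([Fraction(1)] * n + [Fraction(1)])
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

def mean_payoff(P, reward):
    """Expected mean payoff: stationary distribution weighted by state rewards."""
    pi = stationary_distribution(P)
    return sum(p * r for p, r in zip(pi, reward))

# Hypothetical two-state chain with reward 3 in state 0 and 0 in state 1.
P = [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1), Fraction(0)]]
print(mean_payoff(P, [3, 0]))
```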

##### Optimality.

Obviously, the result of Proposition 5.1 is optimal as we obtain the best possible value with probability one. We claim that the result of Proposition 5.1 is also optimal: as we have seen, when we use finite learning, we cannot do better than ϵ-optimality with high probability. This can be proved on the example of Fig. 2 with an argument similar to the one developed for the proof of Proposition 4.2.

### 5.2 The case of a single end component

Consider an almost-surely-good automaton such that is an EC and some . The EC is not necessarily good but, since the automaton is almost-surely good, we have the following analogue of Lemma 4.2 in this context.

For all almost-surely-good automata such that is an EC there exists such that is a GEC in for all compatible with and all reward functions , i.e. is weakly good.

##### Yardstick.

As a yardstick for this case, we use the following value: That is, is the best expected mean-payoff value that can be obtained in a GEC included in the EC. Such a good EC exists by Lemma 5.2.

##### Learning strategy.

We will now prove an analogue of Proposition 4.2. For any given we define the strategy as follows.

1. First, it follows an exploration strategy for during sufficiently many steps (say ) to compute an approximation of