Learning-Based Mean-Payoff Optimization in an Unknown MDP under Omega-Regular Constraints

04/24/2018
by   Jan Křetínský, et al.

We formalize the problem of maximizing the mean-payoff value with high probability while satisfying a parity objective in a Markov decision process (MDP) with unknown probabilistic transition function and unknown reward function. Assuming the support of the unknown transition function and a lower bound on the minimal transition probability are known in advance, we show that in MDPs consisting of a single end component, two combinations of guarantees on the parity and mean-payoff objectives can be achieved depending on how much memory one is willing to use. (i) For all ϵ and γ we can construct an online-learning finite-memory strategy that almost-surely satisfies the parity objective and which achieves an ϵ-optimal mean payoff with probability at least 1 - γ. (ii) Alternatively, for all ϵ and γ there exists an online-learning infinite-memory strategy that satisfies the parity objective surely and which achieves an ϵ-optimal mean payoff with probability at least 1 - γ. We extend the above results to MDPs consisting of more than one end component in a natural way. Finally, we show that the aforementioned guarantees are tight, i.e. there are MDPs for which stronger combinations of the guarantees cannot be ensured.
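
As a rough illustration of the "ϵ-optimal with probability at least 1 - γ" style of guarantee, the sketch below shows how many samples of a single unknown transition suffice to estimate its probability to a given accuracy with a given confidence. It uses the standard two-sided Hoeffding bound and is not the paper's algorithm. The names `hoeffding_sample_size` and `estimate_distribution`, and the per-transition parameters `eps` and `delta`, are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Number of i.i.d. samples after which an empirical frequency lies
    within eps of the true probability with probability >= 1 - delta
    (two-sided Hoeffding bound: n >= ln(2/delta) / (2 * eps^2))."""
    if not (0 < eps < 1 and 0 < delta < 1):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_distribution(sample_successor, n: int) -> dict:
    """Empirical successor distribution of one state-action pair.

    sample_successor is an assumed callable that plays the action once
    and returns the observed successor state."""
    counts = Counter(sample_successor() for _ in range(n))
    return {state: count / n for state, count in counts.items()}


if __name__ == "__main__":
    n = hoeffding_sample_size(eps=0.1, delta=0.05)
    print(n)  # samples needed for one transition probability
```

To cover every transition of an MDP at once, one would typically split the overall failure probability γ across the transitions with a union bound, which is one reason knowing the support of the transition function in advance is useful.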


research
12/26/2020

Transience in Countable MDPs

The Transience objective is not to visit any state infinitely often. Whi...
research
07/07/2020

Strategy Complexity of Parity Objectives in Countable MDPs

We study countably infinite MDPs with parity objectives. Unlike in finit...
research
04/10/2018

Combinations of Qualitative Winning for Stochastic Parity Games

We study Markov decision processes and turn-based stochastic games with ...
research
07/01/2021

Strategy Complexity of Mean Payoff, Total Payoff and Point Payoff Objectives in Countable MDPs

We study countably infinite Markov decision processes (MDPs) with real-v...
research
02/25/2021

Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally

We study the statistical limits of Imitation Learning (IL) in episodic M...
research
12/24/2019

Scenario-Based Verification of Uncertain MDPs

We consider Markov decision processes (MDPs) in which the transition pro...
