Efficient Strategy Iteration for Mean Payoff in Markov Decision Processes

07/06/2017
by   Jan Křetínský, et al.
0

Markov decision processes (MDPs) are standard models for probabilistic systems with non-deterministic behaviours. Mean payoff (or long-run average reward) provides a mathematically elegant formalism to express performance related properties. Strategy iteration is one of the solution techniques applicable in this context. While in many other contexts it is the technique of choice due to advantages over e.g. value iteration, such as precision or possibility of domain-knowledge-aware initialization, it is rarely used for MDPs, since there it scales worse than value iteration. We provide several techniques that speed up strategy iteration by orders of magnitude for many MDPs, eliminating the performance disadvantage while preserving all its advantages.

READ FULL TEXT

page 1

page 2

page 3

page 4

01/23/2013

A Method for Speeding Up Value Iteration in Partially Observable Markov Decision Processes

We present a technique for speeding up the convergence of value iteratio...
04/19/2020

Faster Algorithms for Quantitative Analysis of Markov Chains and Markov Decision Processes with Small Treewidth

Discrete-time Markov Chains (MCs) and Markov Decision Processes (MDPs) a...
06/07/2022

Concentration bounds for SSP Q-learning for average cost MDPs

We derive a concentration bound for a Q-learning algorithm for average c...
06/09/2019

Toward Solving 2-TBSG Efficiently

2-TBSG is a two-player game model which aims to find Nash equilibriums a...
01/29/2021

Optimistic Policy Iteration for MDPs with Acyclic Transient State Structure

We consider Markov Decision Processes (MDPs) in which every stationary p...
06/19/2019

Strategy Representation by Decision Trees with Linear Classifiers

Graph games and Markov decision processes (MDPs) are standard models in ...
04/21/2009

Optimistic Initialization and Greediness Lead to Polynomial Time Learning in Factored MDPs - Extended Version

In this paper we propose an algorithm for polynomial-time reinforcement ...