Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

06/03/2013
by   Bruno Scherrer, et al.
0

Given a Markov Decision Process (MDP) with n states and a totalnumber m of actions, we study the number of iterations needed byPolicy Iteration (PI) algorithms to converge to the optimalγ-discounted policy. We consider two variations of PI: Howard'sPI that changes the actions in all states with a positive advantage,and Simplex-PI that only changes the action in the state with maximaladvantage. We show that Howard's PI terminates after at most O(m/1-γ(1/1-γ))iterations, improving by a factor O( n) a result by Hansen etal., while Simplex-PI terminates after at most O(nm/1-γ(1/1-γ))iterations, improving by a factor O( n) a result by Ye. Undersome structural properties of the MDP, we then consider bounds thatare independent of the discount factor γ: quantities ofinterest are bounds τ_t and τ_r---uniform on all states andpolicies---respectively on the expected time spent in transientstates and the inverse of the frequency of visits in recurrentstates given that the process starts from the uniform distribution.Indeed, we show that Simplex-PI terminates after at most Õ(n^3 m^2 τ_t τ_r ) iterations. This extends arecent result for deterministic MDPs by Post & Ye, in which τ_t< 1 and τ_r < n, in particular it shows that Simplex-PI isstrongly polynomial for a much larger class of MDPs. We explain whysimilar results seem hard to derive for Howard's PI. Finally, underthe additional (restrictive) assumption that the state space ispartitioned in two sets, respectively states that are transient andrecurrent for all policies, we show that both Howard's PI andSimplex-PI terminate after at most Õ(m(n^2τ_t+nτ_r))iterations.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/16/2020

Lower Bounds for Policy Iteration on Multi-action MDPs

Policy Iteration (PI) is a classical family of algorithms to compute an ...
research
11/28/2019

Analysis of Lower Bounds for Simple Policy Iteration

Policy iteration is a family of algorithms that are used to find an opti...
research
01/23/2013

On the Complexity of Policy Iteration

Decision-making problems in uncertain or stochastic domains are often fo...
research
11/28/2022

Some Upper Bounds on the Running Time of Policy Iteration on Deterministic MDPs

Policy Iteration (PI) is a widely used family of algorithms to compute o...
research
02/21/2023

Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space

In this paper, we revisit the regret of undiscounted reinforcement learn...
research
02/23/2023

Intermittently Observable Markov Decision Processes

This paper investigates MDPs with intermittent state information. We con...
research
01/31/2006

A Study on the Global Convergence Time Complexity of Estimation of Distribution Algorithms

The Estimation of Distribution Algorithm is a new class of population ba...

Please sign up or login with your details

Forgot password? Click here to reset