Multi-Step Greedy and Approximate Real Time Dynamic Programming

09/10/2019
by Yonathan Efroni, et al.

Real Time Dynamic Programming (RTDP) is a well-known Dynamic Programming (DP) based algorithm that combines planning and learning to find an optimal policy for an MDP. It is a planning algorithm because it uses the MDP's model (reward and transition functions) to compute the 1-step greedy policy w.r.t. an optimistic value function, by which it acts. It is a learning algorithm because it updates its value function only at the states it visits while interacting with the environment. As a result, unlike DP, RTDP does not require uniform access to the state space in each iteration, which makes it particularly appealing when the state space is large and simultaneously updating all the states is not computationally feasible. In this paper, we study a generalized multi-step greedy version of RTDP, which we call h-RTDP, in its exact form as well as in three approximate settings: approximate model, approximate value updates, and approximate state abstraction. We analyze the sample, computation, and space complexities of h-RTDP and establish that increasing h improves sample and space complexity, at the cost of additional offline computation. For the approximate cases, we prove that the asymptotic performance of h-RTDP matches that of the corresponding approximate DP algorithm, which is the best one can hope for without further assumptions on the approximation errors. h-RTDP is the first algorithm with a provably improved sample complexity when increasing the lookahead horizon.
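To make the mechanism described above concrete, here is a minimal sketch of an h-step greedy RTDP loop on a small tabular MDP. This is not the authors' implementation: the MDP representation (explicit transition lists P and reward table R), the function names, the discount factor, and the episode loop are all illustrative assumptions. It shows the two ingredients the abstract highlights: an h-step lookahead computed from the model, and value updates applied only at the states the agent actually visits.

```python
"""Illustrative sketch of h-RTDP on a small finite MDP (assumed, not from the paper).

Assumed model format:
    P[s][a] -> list of (next_state, probability) pairs
    R[s][a] -> immediate reward
V is an (optimistically initialized) value table indexed by state.
"""
import random


def h_step_value(s, h, V, P, R, gamma):
    """Return (best h-step lookahead value, greedy first action) at state s.

    The lookahead is a depth-h finite-horizon DP rooted at s that uses the
    model (P, R) and bootstraps with the current value estimate V at depth h.
    """
    if h == 0:
        return V[s], None
    best_q, best_a = float("-inf"), None
    for a in P[s]:
        q = R[s][a] + gamma * sum(
            p * h_step_value(s2, h - 1, V, P, R, gamma)[0]
            for s2, p in P[s][a]
        )
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a


def h_rtdp_episode(s0, goal, h, V, P, R, gamma, max_steps=100):
    """Run one real-time episode: act h-step greedily, update V only at visited states."""
    s = s0
    for _ in range(max_steps):
        if s == goal:
            break
        v_h, a = h_step_value(s, h, V, P, R, gamma)
        V[s] = v_h  # learning step: only the visited state is updated
        # sample the next state from the model (or, equivalently, the real environment)
        next_states, probs = zip(*P[s][a])
        s = random.choices(next_states, probs)[0]
    return V
```

In this sketch, raising h makes each decision more expensive (the lookahead tree grows with the horizon), which corresponds to the additional offline computation the paper refers to, while the trade-off the paper establishes is that fewer real interactions and less stored value information are needed as h grows.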


