Self-guided Approximate Linear Programs

01/09/2020
by Parshan Pakiman, et al.

Approximate linear programs (ALPs) are well-known models based on value function approximations (VFAs) to obtain heuristic policies and lower bounds on the optimal policy cost of Markov decision processes (MDPs). The ALP VFA is a linear combination of predefined basis functions that are chosen using domain knowledge and updated heuristically if the ALP optimality gap is large. We side-step the need for such basis function engineering in ALP – an implementation bottleneck – by proposing a sequence of ALPs that embed increasing numbers of random basis functions obtained via inexpensive sampling. We provide a sampling guarantee and show that the VFAs from this sequence of models converge to the exact value function. Nevertheless, the performance of the ALP policy can fluctuate significantly as more basis functions are sampled. To mitigate these fluctuations, we "self-guide" our convergent sequence of ALPs using past VFA information such that a worst-case measure of policy performance is improved. We perform numerical experiments on perishable inventory control and generalized joint replenishment applications, which, respectively, give rise to challenging discounted-cost MDPs and average-cost semi-MDPs. We find that self-guided ALPs (i) significantly reduce policy cost fluctuations and improve the optimality gaps from an ALP approach that employs basis functions tailored to the former application, and (ii) deliver optimality gaps that are comparable to a known adaptive basis function generation approach targeting the latter application. More broadly, our methodology provides application-agnostic policies and lower bounds to benchmark approaches that exploit application structure.
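To make the idea concrete, the following is a minimal, hypothetical sketch of one self-guided ALP iteration on a toy discounted-cost MDP. It is not the authors' implementation: the random cosine basis functions, the uniform state-relevance weights, the nested sampling of features, and the use of scipy's linprog are all illustrative assumptions; the sketch only shows the structure of the ALP (Bellman inequalities on the VFA weights) and the self-guiding constraint that forces each new VFA to dominate the previous one.

```python
# Hedged sketch of self-guided ALPs with random basis functions on a toy MDP.
# All problem data and helper names here are illustrative, not from the paper.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Toy discounted-cost MDP: costs c(s, a) and transition kernel P[s, a, s'].
n_states, n_actions, gamma = 20, 3, 0.95
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
states = np.eye(n_states)  # one-hot state encoding fed to the basis functions

# Random (Fourier-style) basis functions phi_k(s) = cos(theta_k . s + b_k),
# sampled once so that smaller bases are nested inside larger ones.
max_K = 40
theta = rng.normal(size=(max_K, n_states))
offset = rng.uniform(0.0, 2.0 * np.pi, size=max_K)

def basis(K):
    """Feature matrix Phi with Phi[s, k] = phi_k(s) for the first K random bases."""
    return np.cos(states @ theta[:K].T + offset[:K])

def solve_alp(Phi, v_prev=None):
    """Solve one ALP: maximize sum_s V(s) over VFAs V = Phi @ beta subject to the
    Bellman inequalities V(s) <= c(s, a) + gamma * E[V(s') | s, a] for all (s, a),
    plus self-guiding constraints V(s) >= v_prev(s) when a previous VFA is given."""
    K = Phi.shape[1]
    rows, rhs = [], []
    for s in range(n_states):
        for a in range(n_actions):
            rows.append(Phi[s] - gamma * (P[s, a] @ Phi))
            rhs.append(cost[s, a])
    if v_prev is not None:                 # self-guiding: new VFA dominates the old one
        rows.extend(-Phi)
        rhs.extend(-v_prev)
    res = linprog(c=-Phi.sum(axis=0),      # linprog minimizes, so negate the objective
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * K, method="highs")
    return Phi @ res.x                     # VFA values at every state

# Grow the random basis and self-guide each ALP with the previous VFA.
v_prev = None
for K in (10, 20, 40):
    v_prev = solve_alp(basis(K), v_prev)
    print(f"K={K:2d}  objective (uniform-weight ALP bound) = {v_prev.sum():.3f}")
```

Because the random bases are nested, the previous optimal weights padded with zeros remain feasible for the next ALP, so the self-guiding constraints cannot make the sequence worse at the sampled states; this is the feasibility argument the sketch relies on.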


