Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management

05/10/2019
by Shipra Agrawal, et al.

We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times. This is a fundamental problem in inventory management, with a significant literature establishing the near-optimality of a simple class of policies called "base-stock policies" for the underlying Markov Decision Process (MDP), as well as the convexity of the long-run average cost under those policies. We consider the comparatively less studied problem of designing a learning algorithm for this setting when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm relative to the best base-stock policy. We utilize the convexity properties and a newly derived bound on the bias of base-stock policies to establish a connection to stochastic convex bandit optimization. Our main contribution is a learning algorithm with a regret bound of Õ(L√T + D) for the inventory control problem. Here L is the fixed and known lead time, and D is an unknown parameter of the demand distribution, described roughly as the number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, even though the state space of the underlying MDP is continuous and L-dimensional, our regret bound depends only linearly on L. Our results significantly improve the previously best known regret bounds for this problem, where the dependence on L was exponential and many further assumptions on the demand distribution were required. The techniques presented here may be of independent interest for other settings involving large structured MDPs with convex cost functions.
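To make the setting concrete, the following minimal Python sketch (not from the paper) simulates a fixed base-stock policy under lost sales, censored demand, and a positive lead time L. The holding cost h, lost-sales penalty p, and the demand sampler are illustrative assumptions; the base-stock rule itself (order up to level S on the inventory position, i.e., on-hand stock plus pipeline orders) follows the standard definition referenced in the abstract.

```python
import random

def simulate_base_stock(S, lead_time, horizon, demand_sampler, seed=0):
    """Simulate a base-stock policy with lost sales and censored demand.

    Each period, order up to base-stock level S on the inventory
    position (on-hand + pipeline). Assumes lead_time >= 1, matching
    the positive-lead-time setting. Cost parameters are illustrative.
    Returns the average per-period cost.
    """
    rng = random.Random(seed)
    h, p = 1.0, 4.0              # assumed holding cost / lost-sales penalty
    on_hand = S
    pipeline = [0] * lead_time   # orders placed but not yet arrived
    total_cost = 0.0
    for _ in range(horizon):
        # The oldest outstanding order arrives after lead_time periods.
        on_hand += pipeline.pop(0)
        # Base-stock rule: order up to S on the inventory position.
        position = on_hand + sum(pipeline)
        pipeline.append(max(0, S - position))
        # Demand is realized; only sales are observed (censoring),
        # and unmet demand is lost rather than backlogged.
        demand = demand_sampler(rng)
        sales = min(demand, on_hand)
        lost = demand - sales
        on_hand -= sales
        total_cost += h * on_hand + p * lost
    return total_cost / horizon

avg = simulate_base_stock(S=10, lead_time=2, horizon=10_000,
                          demand_sampler=lambda r: r.randint(0, 6))
```

A learning algorithm of the kind described in the abstract would not know the demand distribution and would instead tune S online; treating the long-run average cost as a convex function of S is what connects the problem to stochastic convex bandit optimization.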


Related research

- 03/05/2018: Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs
- 06/27/2021: Regret Analysis in Deterministic Reinforcement Learning
- 06/05/2023: Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space
- 11/21/2017: Posterior Sampling for Large Scale Reinforcement Learning
- 02/27/2014: Linear Programming for Large-Scale Markov Decision Problems
- 02/14/2023: Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization
- 02/21/2023: Reinforcement Learning in a Birth and Death Process: Breaking the Dependence on the State Space
