Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints

06/10/2020
by Qinbo Bai, et al.
In the optimization of dynamical systems, the variables are typically subject to constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward while satisfying the average constraints in each episode. The key challenge with long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be directly used. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDPs with long-term constraints. For any γ ∈ (0, 1/2), the proposed algorithm is shown to achieve an O(T^{1/2+γ}) regret bound for the obtained reward and an O(T^{1-γ/2}) regret bound for the constraint violation, where T is the total number of steps. We note that these are the first results on regret analysis for MDPs with long-term constraints where the transition probabilities are not known a priori.
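
The two bounds pull in opposite directions as γ varies: a larger γ loosens the reward regret exponent (1/2 + γ) and tightens the constraint-violation exponent (1 - γ/2). The short Python sketch below only does this arithmetic on the exponents stated in the abstract. The function name `regret_exponents` and the sample values of γ are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction


def regret_exponents(gamma):
    """Return (reward_exponent, violation_exponent) for gamma in (0, 1/2).

    Hypothetical helper: it evaluates only the exponents quoted in the
    abstract, O(T^{1/2 + gamma}) and O(T^{1 - gamma/2}); constants and
    logarithmic factors are ignored.
    """
    gamma = Fraction(gamma)
    if not (0 < gamma < Fraction(1, 2)):
        raise ValueError("gamma must lie strictly between 0 and 1/2")
    reward_exp = Fraction(1, 2) + gamma   # exponent of T in the reward regret
    violation_exp = 1 - gamma / 2         # exponent of T in the violation bound
    return reward_exp, violation_exp


# Setting 1/2 + gamma = 1 - gamma/2 gives gamma = 1/3, where both exponents match.
BALANCED_GAMMA = Fraction(1, 3)

if __name__ == "__main__":
    # Sample gamma values chosen for illustration only.
    for g in (Fraction(1, 10), Fraction(1, 4), BALANCED_GAMMA, Fraction(9, 20)):
        reward_exp, violation_exp = regret_exponents(g)
        print(f"gamma={g}: reward ~ T^{reward_exp}, violation ~ T^{violation_exp}")
```

Both exponents stay strictly below 1 for every admissible γ, so both quantities grow sublinearly in T. The two exponents coincide at γ = 1/3, where each equals 5/6.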


