Towards Painless Policy Optimization for Constrained MDPs
We study policy optimization in an infinite horizon, γ-discounted constrained Markov decision process (CMDP). Our objective is to return a policy that achieves large expected reward with a small constraint violation. We consider the online setting with linear function approximation and assume global access to the corresponding features. We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms in terms of their primal and dual regret on online linear optimization problems. We instantiate this framework to use coin-betting algorithms and propose the Coin Betting Politex (CBP) algorithm. Assuming that the action-value functions are ε_b-close to the span of the d-dimensional state-action features and no sampling errors, we prove that T iterations of CBP result in an O(1/((1 − γ)^3 √T) + ε_b √d/(1 − γ)^2) reward sub-optimality and an O(1/((1 − γ)^2 √T) + ε_b √d/(1 − γ)) constraint violation. Importantly, unlike gradient descent-ascent and other recent methods, CBP does not require extensive hyperparameter tuning. Via experiments on synthetic and Cartpole environments, we demonstrate the effectiveness and robustness of CBP.
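To illustrate the parameter-free flavor of the coin-betting learners that the abstract alludes to, the sketch below implements a standard Krichevsky-Trofimov (KT) coin-betting update and uses it to track a scalar dual variable against a synthetic constraint-violation signal. This is a minimal, hypothetical illustration, not the authors' CBP algorithm: the class name `KTCoinBetting`, the toy signal, and the projection onto λ ≥ 0 are assumptions made for the example, and no sampling or function-approximation machinery from the paper is modeled.

```python
# Minimal sketch (assumption: not the authors' implementation) of a KT
# coin-betting learner, the kind of parameter-free online learner CBP builds on.

import numpy as np


class KTCoinBetting:
    """Krichevsky-Trofimov coin-betting learner for a scalar iterate.

    Bets a fraction of its current wealth proportional to the running average
    of past gradients; no step-size tuning is required.
    """

    def __init__(self, initial_wealth=1.0):
        self.wealth = initial_wealth
        self.grad_sum = 0.0  # sum of past (bounded) gradients
        self.t = 0           # number of rounds played

    def predict(self):
        # Bet x_t = (sum of past gradients / t) * current wealth.
        self.t += 1
        return self.grad_sum / self.t * self.wealth

    def update(self, bet, grad):
        # grad is assumed bounded in [-1, 1]; wealth grows when the bet and
        # the observed gradient are aligned.
        self.wealth += grad * bet
        self.grad_sum += grad


if __name__ == "__main__":
    # Toy usage: the dual variable should settle near the point where the
    # synthetic "constraint violation" signal changes sign (here around 0.3).
    rng = np.random.default_rng(0)
    learner = KTCoinBetting()
    lam = 0.0
    for _ in range(1000):
        bet = learner.predict()
        lam = max(bet, 0.0)  # project the iterate onto lambda >= 0
        violation = np.clip(0.3 - lam + 0.1 * rng.standard_normal(), -1.0, 1.0)
        learner.update(bet, violation)
    print(f"final dual variable: {lam:.3f}")
```

The design point the example is meant to convey is that the only free parameter is the initial wealth, in contrast to gradient descent-ascent, whose step sizes typically need tuning against the unknown problem scale.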