Cautious Reinforcement Learning via Distributional Risk in the Dual Domain

by   Junyu Zhang, et al.

We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDPs) whose state and action spaces are countably finite. Prior efforts are predominately afflicted by computational challenges associated with the fact that risk-sensitive MDPs are time-inconsistent. To ameliorate this issue, we propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning. The caution measures the distributional risk of a policy, which is a function of the policy's long-term state occupancy distribution. To solve this problem in an online model-free manner, we propose a stochastic variant of primal-dual method that uses Kullback-Lieber (KL) divergence as its proximal term. We establish that the number of iterations/samples required to attain approximately optimal solutions of this scheme matches tight dependencies on the cardinality of the state and action spaces, but differs in its dependence on the infinity norm of the gradient of the risk measure. Experiments demonstrate the merits of this approach for improving the reliability of reward accumulation without additional computational burdens.



page 11

page 12


On Sample Complexity of Projection-Free Primal-Dual Methods for Learning Mixture Policies in Markov Decision Processes

We study the problem of learning policy of an infinite-horizon, discount...

Convergence Rates of Average-Reward Multi-agent Reinforcement Learning via Randomized Linear Programming

In tabular multi-agent reinforcement learning with average-cost criterio...

Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning

Constrained Markov Decision Process (CMDP) is a natural framework for re...

Risk-Sensitive Reinforcement Learning: a Martingale Approach to Reward Uncertainty

We introduce a novel framework to account for sensitivity to rewards unc...

Towards Painless Policy Optimization for Constrained MDPs

We study policy optimization in an infinite horizon, γ-discounted constr...

Algorithms for CVaR Optimization in MDPs

In many sequential decision-making problems we may want to manage risk b...

Scalable Bilinear π Learning Using State and Action Features

Approximate linear programming (ALP) represents one of the major algorit...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.