Per-Step Reward: A New Perspective for Risk-Averse Reinforcement Learning

by Shangtong Zhang, et al.

We present a new per-step reward perspective for risk-averse control in a discounted infinite-horizon MDP. Previous work uses the variance of the episodic return random variable for risk-averse control. We instead design a new random variable indicating the per-step reward and use its variance. The expectation of the per-step reward matches the expectation of the episodic return up to a constant multiplier, and the variance of the per-step reward bounds the variance of the episodic return from above. Furthermore, we derive the mean-variance policy iteration framework under this per-step reward perspective, in which all existing policy evaluation methods and risk-neutral control methods can be dropped in off the shelf for risk-averse control, in both on-policy and off-policy settings. As an example of mean-variance policy iteration, we propose risk-averse PPO, which outperforms PPO in many MuJoCo domains. By contrast, previous risk-averse control methods cannot be easily combined with advanced policy optimization techniques like PPO because they rely on the squared episodic return, and all those that we test perform poorly in MuJoCo domains with neural network function approximation.
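The two claims about the per-step reward can be checked numerically on a toy problem. The sketch below is not the authors' code, and it rests on assumptions the abstract does not state: the per-step reward R is taken to be the reward at a state drawn from the normalized discounted state distribution d(s) = (1 - γ) Σ_t γ^t Pr(S_t = s), the constant multiplier is taken to be (1 - γ), and the variance bound is taken in the form Var(G) ≤ Var(R) / (1 - γ)^2. The two-state Markov reward process, its rewards, and all names are made up for illustration.

```python
# Toy check of the per-step reward view on a 2-state Markov reward process.
# Assumption: R is the reward at a state sampled from the normalized
# discounted state distribution; all numbers below are illustrative.

GAMMA = 0.9
P = [[0.7, 0.3],
     [0.4, 0.6]]        # P[s][s2]: transition probabilities under a fixed policy
r = [1.0, -2.0]         # deterministic reward received in each state
mu = [0.5, 0.5]         # initial state distribution
N = len(r)
HORIZON = 2000          # truncation length; GAMMA ** HORIZON is negligible


def discounted_occupancy():
    """d(s) = (1 - gamma) * sum_t gamma^t * Pr(S_t = s), truncated at HORIZON."""
    d = [0.0] * N
    p_t = mu[:]                      # distribution of S_t, starting at t = 0
    weight = 1.0 - GAMMA
    for _ in range(HORIZON):
        for s in range(N):
            d[s] += weight * p_t[s]
        p_t = [sum(p_t[s] * P[s][s2] for s in range(N)) for s2 in range(N)]
        weight *= GAMMA
    return d


def return_moments():
    """Iterate the recursions for E[G | s] and E[G^2 | s], using G = r + gamma * G'."""
    v = [0.0] * N                    # v[s] = E[G | S_0 = s]
    m = [0.0] * N                    # m[s] = E[G^2 | S_0 = s]
    for _ in range(HORIZON):
        ev = [sum(P[s][s2] * v[s2] for s2 in range(N)) for s in range(N)]
        em = [sum(P[s][s2] * m[s2] for s2 in range(N)) for s in range(N)]
        v = [r[s] + GAMMA * ev[s] for s in range(N)]
        m = [r[s] ** 2 + 2 * GAMMA * r[s] * ev[s] + GAMMA ** 2 * em[s]
             for s in range(N)]
    return v, m


d = discounted_occupancy()
v, m = return_moments()

# Episodic return G: mean and variance under the initial distribution mu.
mean_G = sum(mu[s] * v[s] for s in range(N))
var_G = sum(mu[s] * m[s] for s in range(N)) - mean_G ** 2

# Per-step reward R: mean and variance under the discounted occupancy d.
mean_R = sum(d[s] * r[s] for s in range(N))
var_R = sum(d[s] * r[s] ** 2 for s in range(N)) - mean_R ** 2

print("E[R]          =", mean_R)
print("(1-gamma)E[G] =", (1 - GAMMA) * mean_G)
print("Var(G)        =", var_G)
print("Var(R)/(1-gamma)^2 =", var_R / (1 - GAMMA) ** 2)
```

Under these assumptions, the first two printed values should agree to within floating-point error, and the third should not exceed the fourth. The script only illustrates the relationship on one hand-picked chain and is not a proof.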




