Conservative State Value Estimation for Offline Reinforcement Learning

02/14/2023
by   Liting Chen, et al.

Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the current learned policy, which leads to learning failure in practice. The common approach is to add a penalty term to the reward or value estimate in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function by directly imposing a penalty on OOD states. Compared to prior work, CSVE allows more effective in-data policy optimization with conservative value guarantees. Further, we apply CSVE to develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset, and the actor applies advantage-weighted updates extended with state exploration to improve the policy. We evaluate on the classic continuous control tasks of D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
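The two ingredients named in the abstract, a critic loss that penalizes the value of states sampled around the dataset and an advantage-weighted actor update, can be illustrated with a minimal NumPy sketch. This is not the authors' objective or code: the function names, the arguments, the `alpha`, `beta`, and `max_weight` parameters, and the exact form of the penalty are assumptions made for illustration, and the value estimates are passed in as plain arrays as if a critic had already produced them. How the nearby states are sampled is left out.

```python
import numpy as np


def conservative_v_loss(v_data, v_next, rewards, v_nearby, gamma=0.99, alpha=1.0):
    """Hypothetical conservative V loss (illustrative, not the paper's exact form).

    v_data   : V(s) for states s drawn from the dataset
    v_next   : V(s') for the corresponding next states
    rewards  : rewards r observed in the dataset
    v_nearby : V(s~) for states sampled around the dataset (possibly OOD)
    """
    # Standard one-step Bellman target and squared error on in-data states.
    td_target = rewards + gamma * v_next
    bellman_error = 0.5 * np.mean((v_data - td_target) ** 2)
    # Assumed penalty: push V down on nearby states, up on dataset states.
    penalty = np.mean(v_nearby) - np.mean(v_data)
    return bellman_error + alpha * penalty


def advantage_weights(q_values, v_values, beta=1.0, max_weight=100.0):
    """Exponentiated-advantage weights for a weighted actor update.

    The advantage is clipped before exponentiation so the weight never
    exceeds max_weight and np.exp cannot overflow.
    """
    advantage = (q_values - v_values) / beta
    return np.exp(np.minimum(advantage, np.log(max_weight)))


if __name__ == "__main__":
    v_data = np.array([1.0, 2.0])
    v_next = np.array([0.0, 0.0])
    rewards = np.array([1.0, 2.0])
    v_nearby = np.array([3.0, 3.0])
    print(conservative_v_loss(v_data, v_next, rewards, v_nearby, alpha=2.0))
    print(advantage_weights(np.array([1.0, 0.0]), np.array([1.0, 1.0])))
```

In this sketch a larger `alpha` makes the critic more pessimistic about states outside the data, while `beta` controls how sharply the actor favors high-advantage actions; both would need tuning in any real setup.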

