Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates

10/28/2021
by Litian Liang, et al.

Temporal-Difference (TD) learning methods, such as Q-Learning, have proven effective at learning a policy to perform control tasks. One issue with methods like Q-Learning is that the value update introduces bias when predicting the TD target of an unfamiliar state. Estimation noise becomes bias after the max operator in the policy improvement step and carries over to value estimates of other states, causing Q-Learning to overestimate the Q value. Algorithms like Soft Q-Learning (SQL) introduce the notion of a soft-greedy policy, which reduces the estimation bias via soft updates in the early stages of training. However, the inverse temperature β that controls the softness of an update is usually set by a hand-designed heuristic, which can fail to capture the uncertainty in the target estimate. Under the belief that β is closely related to the (state-dependent) model uncertainty, Entropy Regularized Q-Learning (EQL) further introduces a principled scheduling of β by maintaining a collection of model parameters that characterizes model uncertainty. In this paper, we present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state-space Markov Decision Processes. We also provide a principled numerical scheduling of β during the optimization process, extended from SQL and driven by model uncertainty. We provide theoretical guarantees and demonstrate the effectiveness of this update method in experiments on several discrete control environments.
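To make the role of β concrete, the sketch below shows a soft Bellman backup of the kind SQL-style methods use, together with a hypothetical state-dependent scaling of β based on disagreement within an ensemble of Q estimates as a stand-in for model uncertainty. This is a minimal illustration under assumed names and constants, not the UQL scheduling proposed in the paper.

```python
import numpy as np

def soft_td_target(q_next, reward, gamma, beta):
    """Soft Bellman target: r + gamma * (1/beta) * log sum_a exp(beta * Q(s', a)).
    As beta grows this approaches the hard max used by Q-Learning; a smaller
    beta gives a softer backup that is less prone to overestimation."""
    # log-sum-exp computed in a numerically stable way
    scaled = beta * q_next
    m = np.max(scaled)
    soft_value = (m + np.log(np.sum(np.exp(scaled - m)))) / beta
    return reward + gamma * soft_value

def uncertainty_scaled_beta(q_ensemble, base_beta=10.0):
    """Hypothetical heuristic (not the paper's schedule): shrink beta when an
    ensemble of Q estimates disagrees, so uncertain targets are backed up softly."""
    # q_ensemble: shape (n_models, n_actions), predictions for a single next state
    disagreement = np.mean(np.std(q_ensemble, axis=0))
    return base_beta / (1.0 + disagreement)

# Usage on a single toy transition (s, a, r, s') with a 3-member ensemble
q_ensemble_next = np.array([[1.0, 2.1, 0.5],
                            [1.2, 1.8, 0.7],
                            [0.9, 2.3, 0.4]])
beta = uncertainty_scaled_beta(q_ensemble_next)
target = soft_td_target(q_ensemble_next.mean(axis=0), reward=0.0, gamma=0.99, beta=beta)
```

With a large β the log-sum-exp backup behaves like the greedy max on states the model is confident about, while a small β averages over actions, which is the mechanism by which soft updates dampen the overestimation described above.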
