In Hindsight: A Smooth Reward for Steady Exploration

by Hadi S. Jomaa, et al.

In classical Q-learning, the objective is to maximize the sum of discounted rewards by iteratively applying the Bellman equation as an update rule, in an attempt to estimate the action-value function of the optimal policy. Conventionally, the loss function is defined as the temporal difference between the action value and the expected (discounted) reward; however, it focuses solely on the future, leading to overestimation errors. We extend well-established Q-learning techniques by introducing the hindsight factor, an additional loss term that takes into account how the model progresses by integrating the historic temporal difference as part of the reward. The effect of this modification is examined on a deterministic continuous-state-space function-estimation problem, where the overestimation phenomenon is significantly reduced, resulting in improved stability. The underlying effect of the hindsight factor can be modeled as an adaptive learning rate which, unlike existing adaptive optimizers, takes into account the previously estimated action value. The proposed method outperforms variants of Q-learning, with an overall higher average reward and lower action values, which supports the deterministic evaluation and shows that the hindsight factor contributes to lower overestimation errors. The mean average score over 100 episodes, obtained after training for 10 million frames, shows that the hindsight factor outperforms deep Q-networks, double deep Q-networks, and dueling networks on a variety of ATARI games.
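To make the idea concrete, here is a minimal tabular sketch of how a hindsight-style term could enter the Q-learning update. The paper's exact formulation is not reproduced in this excerpt, so the blending rule and the weight `lam` below are illustrative assumptions: the forward-looking TD target is mixed with the previously stored action value (the "historic" estimate), which behaves like an adaptive learning rate that anchors updates to past values.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, lam=0.5):
    """Tabular Q-learning step with an assumed hindsight-style term.

    lam is a hypothetical hindsight weight: lam = 0 recovers the
    standard Q-learning update; lam > 0 pulls the target toward the
    previously estimated action value, damping overestimation.
    """
    q_old = Q[s, a]                            # historic estimate
    td_target = r + gamma * np.max(Q[s_next])  # standard forward TD target
    # Blend forward target with the historic value (illustrative choice);
    # the effective step toward new information shrinks as lam grows.
    target = (td_target + lam * q_old) / (1.0 + lam)
    Q[s, a] = q_old + alpha * (target - q_old)
    return Q[s, a]
```

With `lam = 0` the update reduces to the conventional rule `Q[s, a] += alpha * (r + gamma * max(Q[s_next]) - Q[s, a])`, so the hindsight term can be read as a regularizer on top of vanilla Q-learning rather than a replacement for it.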


