In Hindsight: A Smooth Reward for Steady Exploration

06/24/2019 · by Hadi S. Jomaa, et al.

In classical Q-learning, the objective is to maximize the sum of discounted rewards by iteratively using the Bellman equation as an update, in an attempt to estimate the action value function of the optimal policy. Conventionally, the loss function is defined as the temporal difference between the action value and the expected (discounted) reward; however, it focuses solely on the future, leading to overestimation errors. We extend well-established Q-learning techniques by introducing the hindsight factor, an additional loss term that takes into account how the model progresses, by integrating the historic temporal difference into the reward. The effect of this modification is examined in a deterministic continuous-state-space function estimation problem, where the overestimation phenomenon is significantly reduced, resulting in improved stability. The underlying effect of the hindsight factor is modeled as an adaptive learning rate which, unlike existing adaptive optimizers, takes into account the previously estimated action value. The proposed method outperforms variations of Q-learning, with an overall higher average reward and lower action values, which supports the deterministic evaluation and indicates that the hindsight factor contributes to lower overestimation errors. The mean score over 100 episodes obtained after training for 10 million frames shows that the hindsight factor outperforms deep Q-networks, double deep Q-networks, and dueling networks on a variety of ATARI games.




1 Introduction

Reinforcement learning (RL) has gained considerable attention over the past five years. In this field of research, an agent attempts to learn a behavior through trial-and-error interactions with a dynamic environment, in order to maximize an allocated reward, immediate or delayed. Broadly speaking, an agent selects an optimal policy either by using an optimal value function, or by manipulating the policy directly. Thanks to the rich representational power of neural networks, the high-dimensional input obtained from real-world problems can be reduced to a set of latent representations with no need for hand-engineered features. Recently, a variant of Q-learning based on convolutional neural networks demonstrated remarkable results on a majority of games within the Arcade Learning Environment, by reformulating the RL objective as a sequential supervised learning task [Mnih et al.2015]. One of the contributing factors to this approach is the presence of an experience replay memory, which stores the transitions at every step. This temporally de-correlates the experiences, and hence upholds the i.i.d. assumption that allows stochastic gradient-based learning [Lin1992]. Experience replay reduces the number of episodes required for training [Schaul et al.2015], even though some transitions might not be immediately useful [Schmidhuber1991].

A well-defined optimal value function also plays a major role in RL tasks. At its core, an optimal value function approximates the expected reward given a state-action pair. Higher rewards are hence achieved by navigating the environment, acting greedily with respect to the value function.
In this paper, we reshape the reward as a weighted average between the expected discounted reward, for a sampled state-action pair, and its previously selected action-value. In one-step Q-learning, the loss L_TD(θ_i) is calculated based on the temporal difference (TD) between the discounted reward and the estimated action-value at any given state, Equation 1:

L_{TD}(\theta_i) = \big( r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta^-) - Q(s_j, a_j; \theta_i) \big)^2

where s_j, a_j, and r_j represent respectively the state, action, and reward at sampled iteration j, with γ as the discount factor, θ_i as the network parameters at the current iteration i, such that j < i, and θ⁻ as the target network parameters.

One drawback of this equation is that it focuses only on the forward temporal difference, i.e. between the action-value at the current sampled state s_j and the next state s_{j+1}. It does not, however, address how the decision process has evolved between updates, since it ignores action values at previous iterations. We leverage this information under the assumption that it helps lower the variance of action-values throughout the learning process, while concurrently maintaining high rewards, and ultimately learns a better agent.

To that end, we introduce an additional term to the standard TD error, referred to as the hindsight factor, L_H, that represents the difference between the action-value of the sampled state at the current iteration and its previously selected (stored) action-value, obtained with network parameters θ_j:

L_H(\theta_i) = \big( Q(s_j, a_j; \theta_j) - Q(s_j, a_j; \theta_i) \big)^2
The key observation is that the hindsight factor acts as a regularizer on the Q-network, unlike conventional regularization techniques that force restrictions on the network parameters directly. The hindsight factor can also be considered an adaptive learning step controller that penalizes large deviations from previous models by incorporating the momentum of change in action values across updates. Inherently, if the hindsight factor increases, the model parameters have changed significantly, leading to higher (or lower) action-values. The introduction of the hindsight factor restructures the reward as a weighted average between what is expected and what was estimated, ensuring that model updates are carried out cautiously.
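To make the two loss terms concrete, the following NumPy sketch computes the combined objective for a single sampled transition. The function name and scalar inputs are ours, for illustration only:

```python
import numpy as np

def hindsight_loss(q_current, q_stored, target, lam=1.0):
    """Forward TD term plus the hindsight factor.

    q_current: Q(s_j, a_j; theta_i), action-value under current parameters
    q_stored:  Q(s_j, a_j; theta_j), stored action-value from iteration j
    target:    r_j + gamma * max_a' Q(s_{j+1}, a'; theta^-)
    lam:       hindsight coefficient (lambda)
    """
    td = (target - q_current) ** 2            # forward temporal difference
    hindsight = (q_stored - q_current) ** 2   # backward (historic) difference
    return td + lam * hindsight

# The hindsight term penalizes large deviations from previously stored values.
loss = hindsight_loss(q_current=1.0, q_stored=0.8, target=1.5, lam=1.0)
```

Note how, for a fixed stored value, a large jump of the current estimate is punished by both terms, which is exactly the cautious-update behavior described above.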
We summarize the contributions of the paper as follows:

  • A novel extension to the optimal value function technique, which leads to an overall improved performance

  • Deterministic evaluation on a continuous state-space that shows how the hindsight factor reduces the bias in function approximation

  • Experiments on ATARI games that highlight the effect of the hindsight factor

  • Comparative analysis that demonstrates the effect of adding the hindsight factor to multiple variations of deep Q-networks

2 Background

A standard reinforcement learning setup [Sutton and Barto2018] consists of an agent interacting with an environment E at discrete timesteps. This formulation is based on a Markov Decision Process (MDP) represented by the tuple (S, A, P, R, γ). At any given timestep t, the agent receives a state s_t ∈ S, upon which it selects an action a_t ∈ A, and a scalar reward r_t is observed. The transition function P generates a new state s_{t+1}. The agent's behaviour is governed by a policy, π, which induces the true state-action value

Q^\pi(s, a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a, \pi \right]

where γ ∈ [0, 1] represents the discount factor balancing between immediate and future rewards.
To solve this sequential decision problem, the optimal policy selects the action that maximizes the discounted cumulative reward, a* = argmax_a Q*(s, a), where Q*(s, a) = max_π Q^π(s, a) denotes the optimal action value.
One of the most prominent value-based methods for solving reinforcement learning problems is Q-learning [Watkins and Dayan1992], which directly estimates the optimal value function and obeys the fundamental identity known as the Bellman equation [Bellman1957]:

Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]

As the number of states increases, learning all action values per state separately becomes computationally taxing, which is why the value function is approximated via a parametrized network, resulting in Q(s, a; θ) ≈ Q*(s, a).
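As a minimal illustration of the Bellman update in the tabular case (the paper itself works with function approximation), the following NumPy sketch performs one Q-learning step toward the Bellman target; all names and numbers are ours:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])   # Bellman target for (s, a)
    Q[s, a] += alpha * (target - Q[s, a])    # move estimate toward the target
    return Q

Q = np.zeros((3, 2))   # toy table: 3 states, 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Acting greedily with respect to such a table (argmax over the row of the current state) recovers the policy described above.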

2.1 Deep Q-networks

Deep Q-network (DQN) is a model-free algorithm presented by [Mnih et al.2015], which learns the Q-function in a supervised fashion. The objective is to minimize the loss function

L_i(\theta_i) = \mathbb{E}_{(s, a) \sim \rho(\cdot)}\left[ \big( y_i - Q(s, a; \theta_i) \big)^2 \right]

with ρ(s, a) as the behaviour distribution over states and actions, and the target y_i as the expected discounted reward, Equation 6:

y_i = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta^-) \right]

In this approach, a target network Q(s, a; θ⁻) shares the same architecture as the online network Q(s, a; θ), but is only updated after a fixed number of iterations, which increases the stability of the algorithm. The correlation between sequential observations is reduced by uniformly sampling transitions of the form (s_t, a_t, r_t, s_{t+1}) for the off-policy update from a replay buffer [Lin1993].
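The target computation can be sketched as follows; the terminal-state (`done`) handling is standard practice rather than spelled out in the text above, and the numbers are arbitrary:

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99, done=False):
    """DQN target y = r + gamma * max_a' Q(s', a'; theta^-).

    q_next_target: vector of target-network action values at s'.
    done: if the episode ended at s', only the immediate reward remains.
    """
    if done:
        return r
    return r + gamma * float(np.max(q_next_target))

y = dqn_target(r=1.0, q_next_target=np.array([0.2, 0.5, 0.1]))
```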

2.2 Double Deep Q-networks

Double DQN (DDQN) [Van Hasselt et al.2016] is a variation of DQN where the selected action and its corresponding value are obtained from two separate networks. In other words, based on the sampled state s_{t+1}, the action is selected via the greedy policy of the online network, i.e. a* = argmax_a Q(s_{t+1}, a; θ_i), whereas its value is taken from the target network, resulting in the following target:

y_i = r + \gamma \, Q\big(s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta_i); \theta^-\big)

This comes as a solution to the overoptimistic value estimates that result from using the same network to select and evaluate the action.
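The decoupling of selection and evaluation can be sketched in a few lines (illustrative values only): the online network picks the action, the target network prices it.

```python
import numpy as np

def ddqn_target(r, q_next_online, q_next_target, gamma=0.99):
    """Double DQN target: select a' with theta_i, evaluate with theta^-."""
    a_star = int(np.argmax(q_next_online))    # greedy action, online network
    return r + gamma * float(q_next_target[a_star])

# The online network prefers action 0, so its *target-network* value is used,
# even though the target network itself would rate action 1 higher.
y = ddqn_target(1.0, np.array([0.9, 0.3]), np.array([0.4, 0.8]))
```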

2.3 Dueling Network Architectures

Sharing a similar lower-level representation, the dueling architecture (DUEL) [Wang et al.2015] extends DQN by explicitly separating the representation of the state values from the state-dependent action values. This allows the network to understand which states are valuable irrespective of the action selected, which is particularly important for setups where the action does not have a major effect on the environment. The action value is estimated as a function of two modules, the action advantage module A(s, a; θ, α) and the state-value module V(s; θ, β), as expressed in Equation 8:

Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \right)

with θ as the shared parameters, β as the parameters of the state-value module, and α as the parameters of the action advantage module.
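The aggregation in Equation 8 is easy to check numerically; subtracting the mean advantage is the identifiability choice of [Wang et al.2015], and the toy numbers below are ours:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine state value V(s) and advantages A(s, a) into Q(s, a).

    The mean advantage is subtracted so that V and A are identifiable."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

q = dueling_q(value=2.0, advantages=[1.0, -1.0, 0.0])
```

Since the advantages here already average to zero, Q is simply V shifted by each advantage; the greedy action is unchanged by the mean subtraction.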
Other variations that have been proposed on Q-networks include selecting an action based on the average of its value across previous networks [Anschel et al.2017], which preserves the original loss function, and ℓ2-regularization [Farebrother et al.2018] to avoid overfitting on the training environment.
The proposed modification to existing Q-learning networks shares the same input-output interface; however, it reformulates the existing loss function, whose objective is to minimize the difference between the action-value and the future discounted reward, by introducing the hindsight factor, which adaptively manages the reward expectation.


Figure 1: Illustration of overestimations. (a) True value and an estimate; (b) bias in DQN as a function of state; (c) bias in DDQN as a function of state.

3 In Hindsight

Hindsight is an extension of the conventional value iteration techniques in reinforcement learning that considers the previous performance of the network when calculating the reward. In the supervised formulation, the target is calculated based on the forward directional temporal difference, i.e. between the estimated action-value and the expected discounted reward (future), Equation 1, dropping any use of previous state-action values (past). To counter that effect, we introduce the hindsight factor, Equation 9, to balance the current action-value estimate and prevent overestimation [Thrun and Schwartz1993]. Intuitively, this term represents the confidence of the agent in previous actions. More specifically, if the estimated action-value at the current iteration i, Q(s_j, a_j; θ_i), is much higher than the previously estimated action-value at iteration j, Q(s_j, a_j; θ_j), given the same state representation s_j, then in hindsight action a_j was not optimal in the global sense, even though it was selected by the greedy policy as a_j = argmax_a Q(s_j, a; θ_j). If, on the other hand, the historic temporal difference is small, then the network is converging in the optimal direction, as given the same state and the same action, the corresponding action value is equally high.
The total loss is simply a weighted sum between the forward temporal difference L_TD and the backward temporal difference L_H. If we expand the loss and factorize the components, we end up with a new loss representation, Equation 10:

L(\theta_i) = \big( y_j - Q(s_j, a_j; \theta_i) \big)^2 + \lambda \big( Q(s_j, a_j; \theta_j) - Q(s_j, a_j; \theta_i) \big)^2

with λ as the hindsight coefficient and y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻). The hindsight factor inherently restructures the target reward as a smooth trade-off between the expected and the previously estimated reward. This is derived by first expanding the loss, Equation 11:

L(\theta_i) = y_j^2 + (1 + \lambda)\, Q(s_j, a_j; \theta_i)^2 - 2 \big( y_j + \lambda Q(s_j, a_j; \theta_j) \big) Q(s_j, a_j; \theta_i) + \lambda\, Q(s_j, a_j; \theta_j)^2

Since y_j^2 and λQ(s_j, a_j; θ_j)^2 are simply constants, i.e. independent of θ_i, we can ignore them, which leaves us with Equation 12:

L(\theta_i) \propto (1 + \lambda)\, Q(s_j, a_j; \theta_i)^2 - 2 \big( y_j + \lambda Q(s_j, a_j; \theta_j) \big) Q(s_j, a_j; \theta_i)

In order to complete the square, we introduce the constant (y_j + λQ(s_j, a_j; θ_j))^2 / (1 + λ) and divide by (1 + λ) to obtain the final loss as Equation 13:

L(\theta_i) \propto \big( \tilde{y}_j - Q(s_j, a_j; \theta_i) \big)^2, \qquad \tilde{y}_j = \frac{y_j + \lambda\, Q(s_j, a_j; \theta_j)}{1 + \lambda}

With this formulation, ỹ_j can be considered the smoothed reward, a balance between the current discounted reward and the previous action-value. Notice that the proposed model does not introduce any additional computations, as both loss terms L_TD and L_H share the same set of gradients, which results in the following parameter update, Equation 14:

\theta_{i+1} = \theta_i + \alpha \big( \tilde{y}_j - Q(s_j, a_j; \theta_i) \big) \nabla_{\theta_i} Q(s_j, a_j; \theta_i)

with α as the scalar step size.
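The completion-of-squares step can be verified numerically: the two-term loss and the single smoothed-target loss differ only by a constant in Q, so they induce identical gradients and the same minimizer. A small NumPy sketch with arbitrary values:

```python
import numpy as np

lam, y, q_old = 1.0, 1.5, 0.8              # hindsight coefficient, target, stored Q
y_smooth = (y + lam * q_old) / (1 + lam)   # smoothed reward of Equation 13

def original_loss(q):
    """Forward TD term plus hindsight factor (Equation 10)."""
    return (y - q) ** 2 + lam * (q_old - q) ** 2

def completed_loss(q):
    """Completed-square form around the smoothed reward."""
    return (1 + lam) * (y_smooth - q) ** 2

# The difference is constant in q: both forms share gradients and minimizer.
qs = np.linspace(-2.0, 2.0, 5)
diffs = original_loss(qs) - completed_loss(qs)
```

With λ = 1 the smoothed reward is the plain average of the expected and the stored value, matching the uniform weighting used in the experiments.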

1: Randomly initialize θ and θ⁻
2: Initialize modified experience replay buffer D
3: for episode = 1, …, M do
4:     Initialize environment
5:     for t = 1, …, T do
6:         Determine and execute action a_t
7:         Receive reward r_t and new state s_{t+1}
8:         Store experience (s_t, a_t, r_t, s_{t+1}, Q(s_t, a_t; θ)) in D
9:         Sample a minibatch of experiences from D
10:        Set target value ỹ_j = (r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻) + λ Q(s_j, a_j; θ_j)) / (1 + λ), with j as the index of the sampled observation
11:        Update the network by minimizing (ỹ_j − Q(s_j, a_j; θ))²
Algorithm 1: In Hindsight algorithm

In order to implement this algorithm, we modify the experience replay buffer to accommodate the action-values of the states. Hence, at every frame, we store the transition (s_t, a_t, r_t, s_{t+1}, Q(s_t, a_t; θ)) in the memory. The goal is to improve the performance of the Q-function by introducing updates that do not focus solely on the future discounted reward, but also take care not to deviate from the values associated with decisions in the agent's older encounters, thereby reducing overestimation errors.
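A minimal sketch of such a modified buffer, storing the action-value alongside each transition; the class and method names are ours, not from the paper's implementation:

```python
import random
from collections import deque

class HindsightReplayBuffer:
    """Replay buffer extended with the stored action-value Q(s_t, a_t; theta),
    as required by Algorithm 1 (a sketch, not the authors' code)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # old entries evicted automatically

    def store(self, s, a, r, s_next, q_value):
        # q_value is the online network's estimate at the time of acting.
        self.buffer.append((s, a, r, s_next, q_value))

    def sample(self, batch_size):
        # Uniform sampling de-correlates sequential observations.
        return random.sample(self.buffer, batch_size)

buf = HindsightReplayBuffer(capacity=10)
buf.store(0, 1, 1.0, 1, 0.42)
```

The only overhead versus a standard DQN buffer is one extra scalar per transition, which is why the method adds no meaningful computational cost.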

4 Overestimation and Approximation Errors

One of the issues of function estimation based on Q-learning is the overestimation phenomenon [Thrun and Schwartz1993], which leads asymptotically to sub-optimal policies. Assuming action-values are corrupted by uniformly distributed noise in an interval [−ε, ε], target values would be overestimated with an upper bound of γε(m − 1)/(m + 1), due to the max operator, with γ as the discount factor and m as the number of actions. Overestimations also have a tight lower bound [Van Hasselt et al.2016], derived as √(C/(m − 1)), with C > 0. The DDQN approach reduces this overestimation, replacing the positive bias with a negative one.

The effect of the hindsight factor on overestimation is demonstrated in the following function estimation experiment [Van Hasselt et al.2016]. The environment is described as a continuous real-valued state-space with 10 discrete actions per state. Each action represents a polynomial function, with a chosen degree of 6, fitted to a subset of integer states, with two adjacent states missing; for the first action, states −5 and −4 are removed, for the second, states −4 and −3, and so on. Each action has the same true value, defined as either Q*(s, a) = sin(s) or Q*(s, a) = 2 exp(−s²).

We are able to reproduce the experiment for DQN and DDQN and obtain the exact overestimation values presented in the original approximation [Van Hasselt et al.2016], as can be seen in Figure 1. Systemic overestimation is an artifact of recursive function approximation, which leads to a deterioration of value estimates, as the action values are assumed true when in fact they contain noise. Introducing the hindsight factor maintains low bias in the estimates, especially when applied to DQN. We also notice that even though the bias is slightly higher than DDQN's, it is much smoother, which translates into an overall better estimation. However, we realize later that applying the hindsight factor to DDQN can in some cases lead to extremely cautious exploration within the game, and correspondingly lower rewards.
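The single-step effect behind the Thrun and Schwartz bound is easy to check empirically: with all true action values equal to zero and uniform noise on the estimates, the max operator turns zero-mean noise into a positive bias approaching ε(m − 1)/(m + 1) before discounting. A Monte Carlo sketch (parameters ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 10, 1.0, 100_000       # actions, noise range, Monte Carlo trials

# True action values are all zero; estimates carry uniform noise in [-eps, eps].
noise = rng.uniform(-eps, eps, size=(trials, m))

# Taking the max over actions converts zero-mean noise into positive bias.
overestimation = noise.max(axis=1).mean()

# Expected maximum of m uniforms on [-eps, eps]: eps * (m - 1) / (m + 1).
upper_bound = eps * (m - 1) / (m + 1)
```

For m = 10 the bias is already above 0.8ε, which illustrates why recursive targets that reuse these maxima deteriorate so quickly.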

Method                   DQN   DQN-H   DDQN   DDQN-H   DUEL   DUEL-H
Wins w.r.t. all            2       4      2        2      6       17
Wins w.r.t. counterpart   10      23     15       18      9       24
Score                    676    2874   1632     2593   3247     4342

Table 1: Summarized performance for 33 games
Figure 2: Performance curves for various ATARI games using variants of Q-learning techniques; DDQN (dark blue), DDQN-H (green), DQN (red), DQN-H (cyan), DUEL (pink), DUEL-H (yellow)

5 Experimental Results

We now demonstrate the practical advantage of adding the hindsight factor to the Q-learning loss function. To do so, we reimplement several variations of deep Q-learning methods, namely: DQN [Mnih et al.2015], Double DQN [Van Hasselt et al.2016], and dueling networks [Wang et al.2015]. All the models are trained in TensorFlow [Abadi et al.2016] on a GeForce 1080 Ti GPU, using the hyperparameters provided by [Mnih et al.2015], with the average runtime per baseline amounting to 30 GPU hours. The proposed model modifies these existing architectures by introducing the hindsight factor as an additional loss term, and is denoted by the suffix -H. As shown in Algorithm 1, the buffer is extended to accommodate the action-value per state. We evaluated the proposed method on more than 30 ATARI games, which differ in terms of difficulty, number of actions, as well as the importance of memory, i.e. previous state-action values. We do not report on games that did not achieve any significant learning for the specified number of frames. The results showcase the importance of the hindsight factor under various settings, and its contribution to an overall improved performance over the deep Q-network counterparts. Table 1 summarizes the mean score of 100 episodes after training for 10 million frames.

Figure 2 represents the performance curves of the baselines and the proposed approach for a selection of games. The hindsight factor has a different effect on every approach depending on the nature of the game. However, the results are clearly indicative of the power of hindsight. Conventional Q-learning techniques lead to an early rise in performance, attributed to more courageous exploration, as compared to a delayed increase in the reward when using the hindsight factor, attributed to its cautious exploration. However, as learning progresses, the baselines seem to plateau at a local optimum, as their performance remains flat for several million frames.
As mentioned earlier, the hindsight factor models an adaptive learning rate controller, further discussed in the following section. This is also observed experimentally: for example in AMIDAR, DQN performance deteriorates (in red) over the final one million frames, whereas DQN-H keeps improving, a sign that introducing the hindsight factor prevents overfitting. Even for simple games such as BREAKOUT, the relative difference in performance between the proposed approach and the counterpart baseline is significant.
We also take a look at the values of the actions selected under the hindsight method. Smoothing the discounted reward with previous reward values turns out to have a great impact on the action values. Throughout the training process, the action-values selected by applying the hindsight factor increase at a steady (linear) pace with no signs of convergence as the number of frames exceeds ten million. The opposite can be said about the regular Q-learning techniques. Overestimation in standard Q-learning can be reduced with DDQN, which we observe yields the lowest set of action-values compared to DQN and DUEL. Nevertheless, these values are still higher than their hindsight counterparts, especially at the early stages of learning, which indicates that some remnant overestimation is inherent in DDQN.

5.0.1 The underlying effect

The underlying effect of the hindsight factor is that it adaptively changes the learning rate, as it establishes a direct dependence on the action value, unlike existing adaptive optimizers such as ADAM [Kingma and Ba2014] and RMSProp [Hinton et al.2012], which depend only on the evolution of the gradients. To estimate the value of state-action pairs in a discounted Markov Decision Process, Equation 15 is introduced [Watkins and Dayan1992]:

Q_{i+1}(s_t, a_t) = (1 - \alpha_i)\, Q_i(s_t, a_t) + \alpha_i \left( r_t + \gamma \max_{a} Q_i(s_{t+1}, a) \right)

where α_i represents the learning rate at iteration i and s_{t+1} represents the next state resulting from action a_t at state s_t. Given the hindsight factor, we replace the target reward with the smoothed reward ỹ_t = (r_t + γ max_a Q_i(s_{t+1}, a) + λ Q_j(s_t, a_t)) / (1 + λ), which leads to

Q_{i+1}(s_t, a_t) = (1 - \tilde{\alpha}_i)\, Q_i(s_t, a_t) + \tilde{\alpha}_i \left( r_t + \gamma \max_{a} Q_i(s_{t+1}, a) \right)

with α̃_i = α_i / (1 + λ) whenever the stored value Q_j(s_t, a_t) coincides with the current estimate; in general the stored value differs, so the effective step for each state-action pair dynamically changes over time. Now if we instead simply replace α_i with α_i / 2, i.e. mimic λ = 1, the effect would be scaling down the learning rate by a factor of 2. Nevertheless, the learning rate would still be fixed and would not adapt to the change in model parameters. We highlight this effect in Figure 3, where we see that a halved learning rate results in better performance for DQN and DUEL; however, it still yields higher overestimation errors and plateaus at an early stage. Reducing the learning rate only partially models the effect provided by the additional loss, as it remains independent of the action-values and does not particularly help with overfitting. This is evident from the early spike of DQN during the first two million frames of training.
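The reduction to a scaled step size when the stored value coincides with the current estimate can be verified directly; a sketch with arbitrary numbers:

```python
def hindsight_tabular_update(q, q_stored, y, alpha=0.1, lam=1.0):
    """Tabular update toward the smoothed target (Equation 15 with y replaced).

    When q_stored == q, this reduces to plain Q-learning with step size
    alpha / (1 + lam), i.e. the effective learning rate derived in the text."""
    y_smooth = (y + lam * q_stored) / (1 + lam)
    return q + alpha * (y_smooth - q)

# With the stored value equal to the current estimate, the hindsight update
# matches a plain TD step whose learning rate is scaled down by 1 / (1 + lam).
q, y, alpha, lam = 0.5, 1.5, 0.1, 1.0
updated = hindsight_tabular_update(q, q_stored=q, y=y, alpha=alpha, lam=lam)
plain = q + (alpha / (1 + lam)) * (y - q)
```

As soon as `q_stored` drifts away from `q`, the two updates disagree, which is precisely the adaptive behavior a fixed halved learning rate cannot reproduce.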

Figure 3: Performance curves for ASTERIX where the baselines have a lower learning rate. DQN-H (cyan), DDQN-H (green), DUEL-H (yellow), DQN-H-HALF (red), DDQN-H-HALF (dark blue), DUEL-H-HALF (pink)

5.0.2 Adjusting the Hindsight coefficient

For the previous experiments, we fixed the hindsight coefficient to λ = 1, hence uniformly weighing the reward between the expected gain and the historic achievement. In the following, we juxtapose the performance obtained by setting λ = 1 and λ = 1/2. First, we notice that if the hindsight coefficient is set to a larger value, the agent is prone to diverge almost immediately, so no results are shown. On the other hand, setting the hindsight coefficient to λ = 1/2, as expected, results in slightly higher action values, caused by the decreased dependence of new action-values on the history, and hence allowing for more overestimation, Figure 4. A lower hindsight coefficient has a positive impact at the early stages of learning, when the agent is still exploring the environment; however, as the number of frames increases, the agent becomes prone to overfitting, ultimately yielding lower performance.

Figure 4: Performance curves with λ = 1 and λ = 1/2; DQN-H (cyan), DDQN-H (green), DUEL-H (yellow), DQN-H-HALF (red), DDQN-H-HALF (dark blue), DUEL-H-HALF (pink)

Optimizing the Q-function using the hindsight factor as a regularizer that smooths the expected reward turns out to improve performance well before the action-values seem to converge. However, with some games we notice that performance is negatively affected by this formulation. This might be attributed to the penalty which the hindsight factor indirectly applies to exploration. In addition, it is worth noting that as the hindsight coefficient decreases towards 0, the action-values approach those of the counterpart models.

6 Conclusion

Existing Q-learning techniques aim at maximizing the expected reward by minimizing the difference between the current action-value and the expected discounted reward. However, they offer no insight into the past, as the progress of the estimator, measured through the difference between the current action-value and the action-value at the same state at a previous iteration, is ignored. In this paper, we proposed the hindsight factor, an additional loss term that shares the same gradients as the prediction network, hence incurring no extra computational effort. The hindsight factor acts as a reward regularizer, forcing the reward to be more realistic and hence avoiding overestimation. The new reward is a trade-off between the expected discounted reward and the historic temporal difference. Through a deterministic function estimation problem, we show that adding the hindsight factor to existing function estimators via Q-learning reduces the average error and produces a stable estimation. The underlying effect of the hindsight factor translates into an adaptively controlled learning rate that outperforms the respective base models. We have shown that the hindsight variants in general outperform deep Q-networks, double deep Q-networks, and dueling networks in 70%, 55%, and 73% of the games, respectively.
Moving forward, it would be interesting to study the effect of introducing an adaptive hindsight coefficient, based on the absolute reward improvement across frames.