1 Introduction
Reinforcement learning (RL) has gained considerable attention over the past five years. In this field of research, an agent attempts to learn a behavior through trial-and-error interactions with a dynamic environment, in order to maximize an allocated reward, immediate or delayed. Broadly speaking, an agent selects an optimal policy either by using an optimal value function, or by manipulating the policy directly. Thanks to the rich representational power of neural networks, the high-dimensional input obtained from real-world problems can be reduced to a set of latent representations with no need for hand-engineered features. Recently, a variant of Q-learning based on convolutional neural networks demonstrated remarkable results on a majority of games within the Arcade Learning Environment, by reformulating the RL objective as a sequential supervised learning task [Mnih et al.2015]. One of the contributing factors to this approach is the presence of an experience replay memory, which stores the transitions at every step. This leads to temporal decorrelation between experiences, and hence upholds the i.i.d. assumption which allows stochastic gradient-based learning [Lin1992]. Experience replay reduces the number of episodes required for training [Schaul et al.2015], even though some transitions might not be immediately useful [Schmidhuber1991].
A well-defined optimal value function also plays a major role in RL tasks. At its core, an optimal value function is an approximation of the expected reward given a state-action pair. Higher rewards are hence achieved by navigating the environment, acting greedily with respect to the value function.
In this paper, we reshape the reward as a weighted average between the expected discounted reward, for a sampled state-action pair, and its previously stored action-value. In one-step Q-learning, the loss, $\mathcal{L}_i(\theta_i)$, is calculated based on the temporal difference (TD) between the discounted reward and the estimated action-value at any given state, Equation 1:
$$\mathcal{L}_i(\theta_i) = \mathbb{E}\Big[\big(r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta_i)\big)^2\Big] \quad (1)$$
where $s_t$, $a_t$, and $r_t$ represent respectively the state, action, and reward at sampled iteration $t$, with $\gamma$ as the discount factor, $\theta_i$ as the network parameters at current iteration $i$, such that $i > t$, and $\theta^{-}$ as the target network parameters.
One drawback of this equation is that it only focuses on the forward temporal difference, i.e. between the action-value at the current sampled state $s_t$ and the next state $s_{t+1}$. It does not, however, address how the decision process has evolved between updates, since it ignores action-values at previous iterations. We leverage this information under the assumption that it helps lower the variance of action-values throughout the learning process, while concurrently maintaining high rewards, and ultimately learns a better agent.
To that end, we introduce an additional term to the standard TD error, referred to as the hindsight factor, $\mathcal{H}_i(\theta_i)$, that represents the difference between the action-value of the sampled state at the current iteration and its previously selected (stored) action-value, obtained with network parameters $\theta_t$:
$$\mathcal{H}_i(\theta_i) = \mathbb{E}\Big[\big(Q(s_t, a_t; \theta_i) - Q(s_t, a_t; \theta_t)\big)^2\Big] \quad (2)$$
The key observation is that the hindsight factor acts as a regularizer on the Q-network, unlike conventional regularization techniques that force restrictions on the network parameters directly. The hindsight factor can also be considered as an adaptive learning step controller that penalizes large deviations from previous models by incorporating the momentum of change in action-values across updates. Inherently, if the hindsight factor increases, this means that the model parameters have significantly changed, leading to higher (or lower) action-values. The introduction of the hindsight factor restructures the reward as a weighted average between what is expected and what was estimated, ensuring that model updates are carried out cautiously.
We summarize the contributions of the paper as follows:

- A novel extension to the optimal value function technique, which leads to an overall improved performance.
- A deterministic evaluation on a continuous state-space that shows how the hindsight factor reduces the bias in function approximation.
- Experiments on ATARI games that highlight the effect of the hindsight factor.
- A comparative analysis that demonstrates the effect of adding the hindsight factor to multiple variations of deep Q-networks.
2 Background
A standard reinforcement learning setup [Sutton and Barto2018] consists of an agent interacting with an environment $\mathcal{E}$ at discrete timesteps. This formulation is based on a Markov Decision Process (MDP) represented by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma)$. At any given timestep $t$, the agent receives a state $s_t$, upon which it selects an action $a_t$, and a scalar reward $r_t$ is observed. The transition function $\mathcal{T}$ generates a new state $s_{t+1}$. The agent's behaviour is governed by a policy, $\pi$, which computes the true state-action value as Equation 3:

$$Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\Big|\, s_t = s,\; a_t = a,\; \pi\Big] \quad (3)$$
where $\gamma \in [0, 1]$ represents the discount factor balancing between immediate and future rewards.
To solve this sequential decision problem, the optimal policy selects the action that maximizes the discounted cumulative reward, $a^{*} = \arg\max_{a} Q^{*}(s, a)$, where $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ denotes the optimal action-value.
One of the most prominent value-based methods for solving reinforcement learning problems is Q-learning [Watkins and Dayan1992], which directly estimates the optimal value function and obeys the fundamental identity known as the Bellman equation [Bellman1957]:
$$Q^{*}(s, a) = \mathbb{E}_{s'}\Big[r + \gamma \max_{a'} Q^{*}(s', a') \,\Big|\, s, a\Big] \quad (4)$$
As the number of states increases, learning all action-values per state separately becomes computationally taxing, which is why the value function is approximated via a parametrized network, resulting in $Q(s, a; \theta) \approx Q^{*}(s, a)$.
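Before approximation enters the picture, the tabular form of this Bellman backup can be sketched in a few lines of Python; the toy 2-state, 2-action MDP, rewards, and step size below are hypothetical and chosen only for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used only for illustration.
n_states, n_actions = 2, 2
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next):
    # One Bellman backup: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

With all entries initialized to zero, a single update moves $Q(0, 1)$ halfway toward the reward of 1.0.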
2.1 Deep Q-networks
Deep Q-network (DQN) is a model-free algorithm presented by [Mnih et al.2015], which learns the Q-function in a supervised fashion. The objective is to minimize the loss function,
$$\mathcal{L}_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\Big[\big(y_i - Q(s, a; \theta_i)\big)^2\Big] \quad (5)$$
with $\rho(s, a)$ as the probability distribution over states and actions, and the target $y_i$ as the expected discounted reward, Equation 6:

$$y_i = \mathbb{E}_{s'}\Big[r + \gamma \max_{a'} Q(s', a'; \theta^{-}) \,\Big|\, s, a\Big] \quad (6)$$
In this approach, a target network $Q(s, a; \theta^{-})$ shares the same architecture as the online network $Q(s, a; \theta)$; however, it is only updated after a fixed number of iterations, which increases the stability of the algorithm. The correlation between sequential observations is reduced by uniformly sampling transitions of the form $(s_t, a_t, r_t, s_{t+1})$ for the off-policy updates from a replay buffer [Lin1993].
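A minimal sketch of the replay mechanism described above; the buffer capacity, dummy transitions, and batch size are illustrative choices, not values from the paper:

```python
import random
from collections import deque

buffer = deque(maxlen=100_000)  # illustrative capacity

def store(s, a, r, s_next):
    buffer.append((s, a, r, s_next))

def sample(batch_size):
    # Uniform sampling without replacement decorrelates consecutive observations.
    return random.sample(buffer, batch_size)

for t in range(64):  # fill with dummy transitions
    store(s=t, a=t % 4, r=1.0, s_next=t + 1)
batch = sample(32)
```

Because consecutive frames land far apart in the buffer, a uniformly sampled batch mixes experiences from many points in time, which is what upholds the i.i.d. assumption of stochastic gradient descent.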
2.2 Double Deep Q-networks
Double DQN (DDQN) [Van Hasselt et al.2016] is a variation of DQN, where the selected action and its corresponding value are obtained from two separate networks. In other words, based on the sampled state $s_{t+1}$, the action is selected based on the greedy policy of the online network, i.e. $a^{*} = \arg\max_{a'} Q(s_{t+1}, a'; \theta_i)$, whereas the reward is calculated based on the action-value of the selected $a^{*}$ from the target network, resulting in the following target,
$$y_i = r + \gamma\, Q\big(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta_i);\, \theta^{-}\big) \quad (7)$$
This comes as a solution to the over-optimistic value estimates, which result from using the same network to select and evaluate the action.
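The decoupling of selection and evaluation can be made concrete with a small numeric sketch; the Q-vectors below are made up so that the target network overestimates an action the online network does not prefer:

```python
import numpy as np

gamma = 0.99

def dqn_target(r, q_target_next):
    # DQN: the target network both selects and evaluates the action.
    return r + gamma * np.max(q_target_next)

def ddqn_target(r, q_online_next, q_target_next):
    # DDQN: the online network selects, the target network evaluates.
    a_star = int(np.argmax(q_online_next))
    return r + gamma * q_target_next[a_star]

q_online_next = np.array([1.0, 3.0, 2.0])   # online net prefers action 1
q_target_next = np.array([5.0, 2.0, 4.0])   # target net overestimates action 0
```

Here `dqn_target(0.0, q_target_next)` evaluates the (noisy) value 5.0, while `ddqn_target` evaluates only 2.0, illustrating how the decoupling tempers over-optimism.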
2.3 Dueling Network Architectures
Sharing a similar lower-level representation, the dueling architecture (DUEL) [Wang et al.2015] extends DQN by explicitly separating the representation of the state values from the state-dependent action values. This allows the network to understand which states are valuable agnostic to the action selected, which is particularly important for setups where the action does not have a major effect on the environment. The action value is estimated as a function of two modules, the action advantage module $A(s, a; \theta, \alpha)$ and the state-value module $V(s; \theta, \beta)$, as expressed in Equation 8:
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\Big) \quad (8)$$
with $\theta$ as the shared parameters, $\beta$ as the parameters of the state-value module, and $\alpha$ as the parameters of the action-value module.
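Equation 8 reduces to a mean-centering of the advantages; a sketch with hypothetical value and advantage outputs:

```python
import numpy as np

def dueling_q(v, advantages):
    # Q(s, a) = V(s) + (A(s, a) - mean over actions of A), as in Equation 8.
    return v + (advantages - advantages.mean())

adv = np.array([1.0, 2.0, 3.0])  # hypothetical advantage-stream outputs
q = dueling_q(v=10.0, advantages=adv)
```

Subtracting the mean advantage makes the decomposition identifiable: the average of the resulting Q-values always equals the state value $V(s)$, so the value stream alone carries the "how good is this state" signal.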
Other variations that have been proposed on Q-networks include selecting an action based on the average of its value across previous networks [Anschel et al.2017], which preserves the original loss function, and $\ell_2$ regularization [Farebrother et al.2018] to avoid overfitting on the training environment.
The proposed modification to the existing Q-learning networks shares the same input-output interface; however, it reformulates the existing loss function $\mathcal{L}_i(\theta_i)$, where the objective is to minimize the difference between the action-value and the future discounted reward, by introducing the hindsight factor $\mathcal{H}_i(\theta_i)$ that adaptively manages reward expectation:
$$\mathcal{H}_i(\theta_i) = \mathbb{E}\Big[\big(Q(s_t, a_t; \theta_i) - Q(s_t, a_t; \theta_t)\big)^2\Big] \quad (9)$$
3 In Hindsight
Hindsight is an extension of the conventional value iteration techniques in reinforcement learning that considers the previous performance of the network when calculating the reward. In this supervised formulation, the target is calculated based on the forward directional temporal difference, i.e. between the estimated action-value and the expected discounted reward (future), Equation 1, dropping any use of previous state-action values (past). To counter that effect, we introduce the hindsight factor, Equation 9, to balance the current action-value estimate and prevent overestimation [Thrun and Schwartz1993].
Intuitively, this term represents the confidence of the agent in previous actions. More specifically, if the estimated action-value at current iteration $i$, $Q(s_t, a_t; \theta_i)$, is much higher than the previously estimated action-value at iteration $t$, $Q(s_t, a_t; \theta_t)$, given the same state representation $s_t$, then in hindsight action $a_t$ was not optimal in the global sense, even though it was selected based on the greedy policy as $a_t = \arg\max_{a} Q(s_t, a; \theta_t)$. If, on the other hand, the historical temporal difference is small, then the network is converging in the optimal direction, as given the same state and the same action, the corresponding action-value is equally high.
The total loss would simply be a weighted sum between the forward temporal difference $\mathcal{L}_i(\theta_i)$ and the backward temporal difference $\mathcal{H}_i(\theta_i)$.
If we expand the loss and factorize the components, we end up with a new loss representation, Equation 10,
$$\mathcal{L}^{H}_i(\theta_i) = \mathbb{E}\Big[\big(y_t - Q(s_t, a_t; \theta_i)\big)^2 + \lambda\,\big(Q(s_t, a_t; \theta_i) - Q(s_t, a_t; \theta_t)\big)^2\Big] \quad (10)$$
with $\lambda$ as the hindsight coefficient. The hindsight factor inherently restructures the target reward as a smooth trade-off between the expected and the previously estimated reward. This is derived by first expanding the loss, Equation 11,
$$\mathcal{L}^{H}_i(\theta_i) = \mathbb{E}\Big[y_t^2 - 2\, y_t\, Q(s_t, a_t; \theta_i) + (1+\lambda)\, Q(s_t, a_t; \theta_i)^2 - 2\lambda\, Q(s_t, a_t; \theta_i)\, Q(s_t, a_t; \theta_t) + \lambda\, Q(s_t, a_t; \theta_t)^2\Big] \quad (11)$$
since $y_t^2$ and $\lambda\, Q(s_t, a_t; \theta_t)^2$ are simply constants, i.e. independent of $\theta_i$, we can ignore them, which leaves us with Equation 12:
$$\mathcal{L}^{H}_i(\theta_i) \propto \mathbb{E}\Big[(1+\lambda)\, Q(s_t, a_t; \theta_i)^2 - 2\, Q(s_t, a_t; \theta_i)\,\big(y_t + \lambda\, Q(s_t, a_t; \theta_t)\big)\Big] \quad (12)$$
In order to complete the square, we introduce the constant $c = \frac{\big(y_t + \lambda\, Q(s_t, a_t; \theta_t)\big)^2}{1+\lambda}$, and divide by $(1+\lambda)$ to obtain the final loss as Equation 13:
$$\mathcal{L}^{H}_i(\theta_i) \propto \mathbb{E}\Big[\big(Q(s_t, a_t; \theta_i) - \tilde{y}_t\big)^2\Big], \qquad \tilde{y}_t = \frac{y_t + \lambda\, Q(s_t, a_t; \theta_t)}{1+\lambda} \quad (13)$$
With this formulation, $\tilde{y}_t$ can be considered as the smoothened reward, a balance between the current discounted reward and the previous action-value. Notice that the proposed model does not introduce any additional computations, as both loss terms $\mathcal{L}_i$ and $\mathcal{H}_i$ share the same set of gradients, which results in the following parameter update, Equation 14:
$$\theta_{i+1} = \theta_i + \alpha\,\big(\tilde{y}_t - Q(s_t, a_t; \theta_i)\big)\, \nabla_{\theta_i} Q(s_t, a_t; \theta_i) \quad (14)$$
with $\alpha$ as the scalar step size.
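The completing-the-square step can be checked numerically: the gradient of the combined loss with respect to the predicted value equals the gradient of a squared error to the smoothed target, once the $(1+\lambda)$ scale is accounted for. The transition values below are arbitrary illustrative numbers:

```python
# Arbitrary illustrative values for one sampled transition:
# y is the discounted target, q_old the stored value, q the current prediction.
y, q_old, q, lam = 2.0, 1.0, 0.5, 1.0

# Gradient of (y - q)^2 + lam * (q - q_old)^2 with respect to q.
grad_combined = -2.0 * (y - q) + 2.0 * lam * (q - q_old)

# Smoothed target of Equation 13, and the matching (1 + lam)-scaled gradient
# of the single squared error (q - y_smooth)^2.
y_smooth = (y + lam * q_old) / (1.0 + lam)
grad_smooth = 2.0 * (1.0 + lam) * (q - y_smooth)
```

The two gradients coincide for any choice of values, which is why both loss terms can share a single backward pass.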
In order to implement this algorithm, we modify the experience replay buffer to accommodate the action-values of the states. Hence, at every frame, we store the transitions $(s_t, a_t, r_t, s_{t+1}, Q(s_t, a_t; \theta_t))$ in the memory. The goal is to improve the performance of the Q-function by introducing updates that do not focus solely on the future discounted reward, but also take care not to deviate from the values associated with decisions in the agent's older encounters, which reduces overestimation errors.
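The only change to the replay memory is one extra field per transition; a sketch, where the field name `q_sa` and the capacity are our own illustrative choices:

```python
from collections import deque

buffer = deque(maxlen=100_000)  # illustrative capacity

def store(s, a, r, s_next, q_sa):
    # q_sa holds Q(s_t, a_t; theta_t), the action-value computed at storage time.
    buffer.append((s, a, r, s_next, q_sa))

store(s=0, a=1, r=0.5, s_next=1, q_sa=0.7)
s, a, r, s_next, q_old = buffer[0]
```

At training time, `q_old` is read back alongside the transition and fed into the smoothed target of Equation 13, so no extra forward pass through an old network is ever needed.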
4 Overestimation and Approximation Errors
One of the issues of function estimation based on Q-learning is the overestimation phenomenon [Thrun and Schwartz1993], which leads asymptotically to suboptimal policies. Assuming action-values are corrupted by uniformly distributed noise in an interval $[-\epsilon, \epsilon]$, target values would be overestimated by a value with an upper bound of $\gamma\epsilon\frac{m-1}{m+1}$, due to the max operator, with $\gamma$ as the discount factor and $m$ as the number of actions. Overestimations also have a tight lower bound [Van Hasselt et al.2016], which is derived as $\sqrt{\frac{C}{m-1}}$, with $C > 0$. The DDQN approach reduces overestimation, and replaces the positive bias with a negative one. The effect of the hindsight factor on overestimation is demonstrated in the following function estimation experiment [Van Hasselt et al.2016]. The environment is described as a continuous real-valued state-space with 10 discrete actions per state. Each action represents a polynomial function, with a chosen degree of 6, fitted to a subset of integer states, with two adjacent states missing; for action $a_1$, states $-5$ and $-4$ are removed, for action $a_2$, states $-4$ and $-3$, and so on. Each action has the same true value, defined as either $\sin(s)$ or $2\exp(-s^2)$.
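The bound quoted above can be illustrated empirically: with true values of zero and uniform noise in $[-\epsilon, \epsilon]$, the expected value of the max over $m$ independent estimates is $\epsilon\frac{m-1}{m+1}$ (the $\gamma$ factor aside). A quick Monte-Carlo check with our own parameter choices:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, m, trials = 1.0, 10, 100_000  # illustrative noise scale, actions, samples

# True action-values are all zero; estimates carry uniform noise.
noisy = rng.uniform(-eps, eps, size=(trials, m))
bias = noisy.max(axis=1).mean()  # empirical overestimation from the max operator

expected = eps * (m - 1) / (m + 1)  # analytic value, ~0.818 for m = 10
```

Even though each individual estimate is unbiased, taking the max injects a positive bias that grows with the number of actions, which is exactly the effect the hindsight factor and DDQN aim to dampen.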
We are able to reproduce the experiment for DQN and DDQN and obtain the exact overestimation values presented in the original approximation [Van Hasselt et al.2016], as can be seen in Figure 1. Systematic overestimation is an artifact of recursive function approximation, which leads to a deterioration of value estimates, as the action-values are assumed true when in fact they contain noise. Introducing the hindsight factor maintains low bias in the estimates, especially when applied to DQN. We also notice that, even though the bias is slightly higher than DDQN, it is indeed much smoother, which translates into an overall better estimation. However, we realize later that applying the hindsight factor to DDQN can in some cases lead to an extremely cautious exploration within the game, and respectively lower rewards.
Table 1: Mean score over 100 episodes after training for 10 million frames.

Method                     DQN    DQN-H
Wins w.r.t. all              2        4
Wins w.r.t. counterpart     10       23
Score                      676     2874

Method                    DDQN   DDQN-H
Wins w.r.t. all              2        2
Wins w.r.t. counterpart     15       18
Score                     1632     2593

Method                    DUEL   DUEL-H
Wins w.r.t. all              6       17
Wins w.r.t. counterpart      9       24
Score                     3247     4342
5 Experimental Results
We now demonstrate the practical advantage of adding the hindsight factor to the Q-learning loss function. To do so, we reimplement several variations of deep Q-learning methods, namely: DQN [Mnih et al.2015], Double DQN [Van Hasselt et al.2016], and dueling networks [Wang et al.2015]. All the models are trained in TensorFlow [Abadi et al.2016] on a GeForce 1080 TI GPU, using the hyperparameters provided by [Mnih et al.2015], with the average runtime duration per baseline amounting to 30 GPU hours. The proposed model modifies these existing architectures by introducing the hindsight factor as an additional loss term, and is referred to with the suffix -H. As shown in Algorithm 1, the buffer is extended to accommodate the action-value per state. We evaluated the proposed method on more than 30 ATARI games, which differ in terms of difficulty, number of actions, as well as the importance of memory, i.e. previous state-action values. We do not report on the games that did not achieve any significant learning for the specified number of frames. The results showcase the importance of the hindsight factor under various settings, and its contribution to an overall improved performance over the deep Q-network counterparts. Table 1 summarizes the mean score of 100 episodes after training for 10 million frames. Figure 2 represents the performance curves of the baselines and the proposed approach for a selection of games.
The hindsight factor has a different effect on every approach depending on the nature of the game. However, the results are clearly indicative of the power of hindsight. Conventional Q-learning techniques lead to an early rise in performance, which is attributed to a more courageous exploration, as compared to a delayed increase in the reward when using the hindsight factor, attributed to its cautious exploration. However, as the learning algorithm progresses, the baselines seem to plateau at local optima, as the performance remains consistent for several million frames.
As mentioned earlier, the hindsight factor models an adaptive learning rate controller, further discussed in the following section. This is also realized experimentally, for example in AMIDAR, where DQN performance deteriorates (in red) over the final one million frames, whereas DQN-H keeps on improving, which is also a sign that introducing the hindsight factor prevents overfitting. Even for simple games such as BREAKOUT, the relative difference in performance between the proposed approach and the counterpart baseline is significant.
We also take a look at the values of the selected actions using the hindsight method. Smoothing the discounted reward by previous reward values turns out to have a great impact on the action-values. Throughout the training process, the action-values selected by applying the hindsight factor seem to increase at a steady (linear) pace with no signs of convergence as the number of frames exceeds ten million. The opposite can be said about the regular Q-learning techniques. Overestimation in standard Q-learning can be avoided with DDQN, where we notice that it results in a lower set of action-values as compared to DQN and DUEL. Nevertheless, these values are still higher than their hindsight counterparts, especially at the early stages of learning, which indicates that some remnant overestimation is still inherent in DDQN.
5.0.1 The underlying effect
The underlying effect of the hindsight factor is that it adaptively changes the learning rate, as it establishes a direct dependence on the action-value, unlike existing adaptive optimizers such as ADAM [Kingma and Ba2014] and RMSProp [Hinton et al.2012], which depend only on the evolution of the gradients. To estimate the value of state-action pairs in a discounted Markov Decision Process, Equation 15 is introduced [Watkins and Dayan1992]:

$$Q_{i+1}(s_t, a_t) = (1 - \alpha_i)\, Q_i(s_t, a_t) + \alpha_i\,\big(r_t + \gamma \max_{a'} Q_i(s_{t+1}, a')\big) \quad (15)$$
where $\alpha_i$ represents the learning rate at iteration $i$ and $s_{t+1}$ represents the next state resulting from action $a_t$ at state $s_t$. Given the hindsight factor, we replace the target reward $y_t = r_t + \gamma \max_{a'} Q_i(s_{t+1}, a')$ with the smoothened reward $\tilde{y}_t$, which leads to,
$$Q_{i+1}(s_t, a_t) = (1 - \alpha_i)\, Q_i(s_t, a_t) + \alpha_i\, \frac{y_t + \lambda\, Q(s_t, a_t; \theta_t)}{1+\lambda} \quad (16)$$
with $\tilde{\alpha}_i = \frac{\alpha_i}{1+\lambda}$. The effect of introducing $\lambda$ is an adaptive state-action pair update that dynamically changes over time. Now if we replace $Q(s_t, a_t; \theta_t)$ with $Q_i(s_t, a_t)$, i.e. assume the stored value equals the current estimate, the effect would be scaling down the learning rate by a factor of $(1+\lambda)$. Nevertheless, the learning rate would still be fixed and would not adapt to the change in model parameters. We highlight this effect in Figure 3, as we see that a halved learning rate results in a better performance for DQN and DUEL; however, it still results in higher overestimation errors and plateaus at an early stage. Reducing the learning rate partially models the effect provided by the additional loss, as it is still independent of the action-values, and does not particularly help with overfitting. This is evident by the early spike of the DQN with halved learning rate during the first two million frames of training.
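A quick numeric check of this claim, using toy values of our own choosing: when the stored action-value coincides with the current estimate, the hindsight update collapses to a standard update with step $\alpha/(1+\lambda)$:

```python
alpha, lam, gamma = 0.1, 1.0, 0.99  # toy values for illustration
q, r, q_next_max = 1.0, 0.5, 2.0

y = r + gamma * q_next_max                 # standard discounted target
y_smooth = (y + lam * q) / (1.0 + lam)     # stored value assumed equal to q

# Hindsight update toward the smoothed target with the full step alpha.
q_hindsight = q + alpha * (y_smooth - q)

# Standard update toward y with the effective step alpha / (1 + lam).
alpha_eff = alpha / (1.0 + lam)            # halved learning rate for lam = 1
q_scaled = q + alpha_eff * (y - q)
```

The two updates agree exactly in this degenerate case; in general the stored value differs from the current estimate, which is precisely what makes the effective step adaptive rather than a constant rescaling.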
5.0.2 Adjusting the Hindsight coefficient
For the previous experiments, we have fixed the hindsight coefficient to 1, hence uniformly weighing the reward between the expected gain and the historic achievement. In the following, we juxtapose this with the performance obtained under other settings of $\lambda$. First, we notice that if the hindsight coefficient is increased further, the agent is prone to diverge almost immediately, so no results are shown. On the other hand, lowering the hindsight coefficient, as expected, results in slightly higher action values, caused by the decreased dependence of new action-values on the history, and hence allowing for more overestimation, Figure 4. A lower hindsight coefficient has a positive impact at the early stages of learning, when the agent is still exploring the environment; however, as the number of frames increases, the agent becomes prone to overfitting, which ultimately results in a lower performance.
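How $\lambda$ trades expectation against history can be read directly from the smoothed target of Equation 13; the numbers below are illustrative. Lowering $\lambda$ pulls $\tilde{y}_t$ toward the expected discounted reward, permitting larger, more optimistic updates:

```python
y, q_old = 3.0, 1.0  # illustrative expected reward and stored action-value

def smoothed_target(lam):
    # Equation 13: weighted average between expectation (y) and history (q_old).
    return (y + lam * q_old) / (1.0 + lam)

t_equal = smoothed_target(1.0)  # uniform weighting, as in the main experiments
t_low = smoothed_target(0.5)    # weaker dependence on history
```

With the lower coefficient the target sits closer to $y$, consistent with the higher action-values observed in Figure 4.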


Optimizing the Q-function using the hindsight factor as a regularizer to smoothen the expected reward turns out to improve the performance well before the action-values seem to converge. However, with some games, we notice that the performance is negatively affected by this formulation. This might be attributed to the penalty which the hindsight factor indirectly applies on exploration. In addition, it is worth noting that as the hindsight coefficient decreases, the action-values start to come closer to those of the counterpart models.
6 Conclusion
Existing Q-learning techniques aim at maximizing the expected reward by minimizing the difference between the current action-value and the expected discounted reward. However, they offer no insight into the past, as the progress of the estimator, measured through the difference between the current action-value and the action-value at the same state at a previous iteration, is ignored.
In this paper, we proposed the introduction of the hindsight factor, an additional loss function that shares the same gradients as the prediction network, hence incurring no extra computational effort. The hindsight factor acts as a reward regularizer, forcing the reward to be more realistic and hence avoiding overestimation. The new reward is a trade-off between the expected discounted reward and the historic temporal difference. Through a deterministic function estimation problem, we showed that by adding the hindsight factor to existing function estimators via Q-learning, we are able to reduce the average error and produce a stable estimation. The underlying effect of the hindsight factor is translated as an adaptively controlled learning rate that outperforms the respective base models. We have shown that the proposed approach in general outperforms deep Q-networks, double deep Q-networks, and dueling networks in the majority of the evaluated games (Table 1).
Moving forward, it would be interesting to study the effect of introducing an adaptive hindsight coefficient, based on the absolute reward improvement across frames.
References
[Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[Anschel et al.2017] Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 176–185. JMLR.org, 2017.
[Bellman1957] Richard Bellman. Functional equations in the theory of dynamic programming–VII. A partial differential equation for the Fredholm resolvent. Proceedings of the American Mathematical Society, 8(3):435–440, 1957.
[Farebrother et al.2018] Jesse Farebrother, Marlos C Machado, and Michael Bowling. Generalization and regularization in DQN. arXiv preprint arXiv:1810.00123, 2018.
[Hinton et al.2012] Geoffrey Hinton, N Srivastava, and Kevin Swersky. Lecture 6a: Overview of mini-batch gradient descent. https://class.coursera.org/neuralnets2012001/lecture, 2012. Online.
[Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[Lin1992] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[Lin1993] Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie Mellon University, Pittsburgh, PA, School of Computer Science, 1993.
[Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[Schaul et al.2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[Schmidhuber1991] Jürgen Schmidhuber. Curious model-building control systems. In 1991 IEEE International Joint Conference on Neural Networks, pages 1458–1463. IEEE, 1991.
[Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[Thrun and Schwartz1993] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ. Lawrence Erlbaum, 1993.
[Van Hasselt et al.2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 2, page 5. Phoenix, AZ, 2016.
[Wang et al.2015] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
[Watkins and Dayan1992] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.