Recurrent Value Functions

by Pierre Thodoroff, et al.
McGill University

Despite recent successes in Reinforcement Learning, value-based methods often suffer from high variance that hinders performance. In this paper, we illustrate this in a continuous control setting where state-of-the-art methods perform poorly whenever sensor noise is introduced. To overcome this issue, we introduce Recurrent Value Functions (RVFs) as an alternative way to estimate the value function of a state. We propose to estimate the value function of the current state using the value functions of past states visited along the trajectory. Due to the nature of their formulation, RVFs have a natural way of learning an emphasis function that selectively emphasizes important states. First, we establish the asymptotic convergence properties of RVFs in tabular settings. We then demonstrate their robustness on a partially observable domain and on continuous control tasks. Finally, we provide a qualitative interpretation of the learned emphasis function.





1 Introduction

Model-free Reinforcement Learning (RL) is a widely used framework for sequential decision making in many domains such as robotics Kober et al. [2013], Abbeel et al. [2010] and video games Vinyals et al. [2017], Mnih et al. [2013, 2016]. However, its use in the real world remains limited due, in part, to the high variance of value function estimates Greensmith et al. [2004], which leads to poor sample complexity Gläscher et al. [2010], Kakade et al. [2003]. This phenomenon is exacerbated by the noisy conditions of the real world Fox et al. [2015], Pendrith. Real-world applications remain challenging as they often involve noisy data, such as sensor noise, and partially observable environments.

The problem of disentangling signal from noise in sequential domains is not specific to Reinforcement Learning and has been extensively studied in the Supervised Learning literature. In this work, we leverage ideas from the time-series literature and from Recurrent Neural Networks to improve the robustness of value functions in Reinforcement Learning. We propose Recurrent Value Functions (RVFs): an exponential smoothing of the value function. The value function of the current state is defined as an exponential smoothing of the values of the states visited along the trajectory, where the values of past states are summarized by the previous RVF.
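The smoothing idea can be sketched in a few lines (a fixed coefficient `beta` is used here for illustration; RVFs instead learn a state-dependent coefficient):

```python
def smooth_values(values, beta):
    """Exponentially smooth a sequence of value estimates:
    out[t] = beta * values[t] + (1 - beta) * out[t - 1]."""
    smoothed = [values[0]]  # seed the recursion with the first estimate
    for v in values[1:]:
        smoothed.append(beta * v + (1 - beta) * smoothed[-1])
    return smoothed

# Noisy estimates fluctuating around a true value of 1.0:
print(smooth_values([1.0, 1.4, 0.6, 1.2, 0.8], beta=0.5))
```

Higher `beta` trusts the current estimate more; lower `beta` leans on the trajectory's history, which is exactly the trade-off the learned emphasis function controls per state.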

However, exponential smoothing along the trajectory can introduce a bias when the value function changes dramatically through the trajectory (non-stationarity). This bias is a problem if the environment exhibits sharp changes, such as falling off a cliff, and the estimates are heavily smoothed. To alleviate this issue, we smooth the value functions using a trainable, state-dependent emphasis function that controls the smoothing coefficients. Intuitively, the emphasis function adapts the amount of weight placed on the current value function versus the past RVF so as to reduce bias with respect to the optimal value estimate. In other words, the emphasis function identifies important states in the environment, where an important state is one whose value differs significantly from the previous values along the trajectory. For example, when falling off a cliff, the value estimate changes dramatically, making states around the cliff more salient. The emphasis function serves a purpose similar to the gating mechanism in a Long Short-Term Memory cell of a Recurrent Neural Network Hochreiter and Schmidhuber [1997].

To summarize the contributions of this work: we introduce RVFs, which estimate the value function of a state by exponentially smoothing the value estimates along the trajectory. The RVF formulation leads to a natural way of learning an emphasis function that mitigates the bias induced by smoothing. We provide an asymptotic convergence proof in tabular settings by leveraging the literature on asynchronous stochastic approximation Tsitsiklis [1994]. Finally, we perform a set of experiments demonstrating the robustness of RVFs to noise in continuous control tasks, and we give a qualitative analysis of the learned emphasis function, which offers interpretable insights into the structure of the solution.

2 Technical Background

A Markov Decision Process (MDP), as defined in Puterman [1994], consists of a discrete set of states $\mathcal{S}$, a transition function $\mathcal{P}$, and a reward function $r$. In each round $t$, the learner observes the current state $s_t \in \mathcal{S}$ and selects an action $a_t$. In response, it receives a reward $r_t$ and moves to a new state $s_{t+1} \sim \mathcal{P}(\cdot \mid s_t, a_t)$. We define a stationary policy $\pi$ as a probability distribution over actions conditioned on states, such that $a_t \sim \pi(\cdot \mid s_t)$. In policy evaluation, the goal is to find the value function $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \mid s_0 = s\right]$, the discounted expected return of policy $\pi$ from state $s$, with discount factor $\gamma \in [0, 1)$. In this paper we only consider policy evaluation, and we simplify the notation by writing $V$ for $V^{\pi}$.

In practice, $V^{\pi}$ is approximated using Monte Carlo rollouts Sutton and Barto [1998] or TD methods Sutton [1988]. For example, the target used in TD(0) is $r_t + \gamma V(s_{t+1})$. In Reinforcement Learning, the aim is to find a function $V_{\theta}$, parametrized by $\theta$, that approximates $V^{\pi}$. We thus learn a set of parameters $\theta$ that minimizes the squared loss:

$$\mathcal{L}(\theta) = \tfrac{1}{2}\,\big(V^{\pi}(s_t) - V_{\theta}(s_t)\big)^2, \qquad (1)$$

which yields the following update on the parameters by taking the derivative with respect to $\theta$:

$$\theta \leftarrow \theta + \alpha\,\big(V^{\pi}(s_t) - V_{\theta}(s_t)\big)\,\nabla_{\theta} V_{\theta}(s_t), \qquad (2)$$

where $\alpha$ is a learning rate.
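As a concrete illustration of this update in the tabular case, here is a standard TD(0) step (textbook material, not specific to this paper; the numbers are illustrative):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) update: move V[s] toward the bootstrapped
    target r + gamma * V[s_next] with step size alpha."""
    delta = r + gamma * V[s_next] - V[s]  # TD error
    V[s] += alpha * delta
    return delta

V = [0.0, 0.5, 0.0]
td0_update(V, s=0, r=1.0, s_next=1)
# target = 1.0 + 0.9 * 0.5 = 1.45, so V[0] moves to ~0.145 with alpha = 0.1
```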

3 Recurrent Value Functions (RVFs)

As mentioned earlier, the performance of value-based methods is often heavily impacted by the quality of the data obtained Fox et al. [2015], Pendrith. For example, in robotics, noisy sensors are common and can significantly hinder the performance of popular methods Romoff et al. [2018]. In this work, we propose a method to improve the robustness of value functions by estimating the value of a state $s_t$ using the estimate at time step $t$ and the estimates of previously visited states $s_i$, where $i < t$. Mathematically, the Recurrent Value Function (RVF) of a state $s_t$ at time step $t$ is given by:

$$V^{\beta}(s_t) = \beta_t\,V(s_t) + (1-\beta_t)\,V^{\beta}(s_{t-1}), \qquad (3)$$

where $\beta_t = \beta(s_t) \in [0, 1]$. $V^{\beta}$ estimates the value of a state as a convex combination of the current estimate $V(s_t)$ and the previous estimate $V^{\beta}(s_{t-1})$. $V^{\beta}(s_{t-1})$ can be recursively expanded further, hence the name Recurrent Value Function. $\beta$ is the emphasis function which updates the recurrent value estimate.

In contrast to traditional methods that attempt to minimize Eq. 1, the goal here is to find a set of parameters $\theta, \phi$ that minimizes the following error:

$$\mathcal{L}(\theta, \phi) = \tfrac{1}{2}\,\big(V^{\pi}(s_t) - V^{\beta}_{\theta,\phi}(s_t)\big)^2, \qquad (4)$$

where $V$ is a function parametrized by $\theta$, and $\beta$ is a function parametrized by $\phi$. This error is similar to the traditional error in Eq. 1, but we replace the value function with $V^{\beta}$. In practice, $V^{\pi}$ can be replaced by any target used in Reinforcement Learning, such as TD(0), TD(N), TD($\lambda$) or Monte Carlo Sutton and Barto [1998]. We minimize Eq. 4 by updating $\theta$ and $\phi$ using the semi-gradient technique, which results in the following update rule:

$$\theta \leftarrow \theta + \alpha\,\delta_t\,\nabla_{\theta} V^{\beta}(s_t), \qquad \phi \leftarrow \phi + \alpha\,\delta_t\,\nabla_{\phi} V^{\beta}(s_t), \qquad (5)$$

where $\delta_t = y_t - V^{\beta}(s_t)$ is the TD error, with the RVF in the place of the usual value function and $y_t$ the chosen target (e.g., $r_t + \gamma V(s_{t+1})$ for TD(0)). The complete algorithm using the above update rules can be found in Algorithm 1.

1: Input: policy $\pi$, initial parameters $\theta, \phi$, learning rate $\alpha$, discount $\gamma$
2: Initialize $V^{\beta}(s_0) = V_{\theta}(s_0)$
3: Output: Return the learned parameters $\theta, \phi$
4: for $t = 0, 1, 2, \ldots$ do
5:     Take action $a_t \sim \pi(\cdot \mid s_t)$, observe $r_t, s_{t+1}$
6:     Compute the RVF: $V^{\beta}(s_t) = \beta_t V_{\theta}(s_t) + (1-\beta_t) V^{\beta}(s_{t-1})$
7:     Compute the TD error with respect to the RVF: $\delta_t = r_t + \gamma V_{\theta}(s_{t+1}) - V^{\beta}(s_t)$
8:     Update parameters of the value function: $\theta \leftarrow \theta + \alpha\,\delta_t\,\nabla_{\theta} V^{\beta}(s_t)$
9:     Update parameters of the emphasis function: $\phi \leftarrow \phi + \alpha\,\delta_t\,\nabla_{\phi} V^{\beta}(s_t)$
10: end for
Algorithm 1 Recurrent Temporal Difference(0)
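A tabular sketch of Algorithm 1 in Python, under two simplifying assumptions: a fixed emphasis coefficient `beta` stands in for the learned emphasis function, and only the current state's entry is updated (the full algorithm would also propagate the gradient to past states):

```python
def recurrent_td0(episode, n_states, sweeps=500, alpha=0.1, gamma=0.9, beta=0.8):
    """Policy evaluation with a Recurrent Value Function on a fixed episode.
    `episode` is a list of (state, reward, next_state) transitions."""
    V = [0.0] * n_states
    for _ in range(sweeps):
        v_beta = V[episode[0][0]]  # initialize the RVF at the start state
        for s, r, s_next in episode:
            v_beta = beta * V[s] + (1 - beta) * v_beta  # recurrent estimate
            delta = r + gamma * V[s_next] - v_beta      # TD error w.r.t. the RVF
            V[s] += alpha * beta * delta  # the RVF depends on V[s] with weight beta
    return V

# Two-state chain: s0 --r=0--> s1 --r=1--> terminal (state 2, value fixed at 0).
V = recurrent_td0([(0, 0.0, 1), (1, 1.0, 2)], n_states=3)
# V[0] ≈ 0.92, V[1] ≈ 1.02 at the fixed point of this smoothed update.
```

Plain TD(0) on this chain converges to V = [0.9, 1.0, 0.0]; the smoothing shifts the fixed point slightly, which is exactly the bias the learned emphasis function is meant to mitigate.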

As discussed earlier, $\beta$ learns to identify states whose value significantly differs from previous estimates. While optimizing the loss function described in Eq. 4, $\beta$ learns to bring the RVF closer to the target. It does so by placing greater emphasis on whichever is closer to the target: the current estimate $V(s_t)$ or the past estimate $V^{\beta}(s_{t-1})$. Concisely, the update behaviour can be split into four scenarios, described in detail in Table 1. Intuitively, if the past is not aligned with the future, $\beta$ will emphasize the present; likewise, if the past is aligned with the future, $\beta$ will place less emphasis on the present. This behaviour is further explored in the experimental section.

Table 1: Behaviour of $\beta$ based on the loss

Note that the gradients of $V^{\beta}$ take a recursive form (gradient through time), as shown in Eq. 6. The gradient form is similar to that of LSTM Hochreiter and Schmidhuber [1997] and GRU Chung et al. [2014], where $\beta$ acts as a gating mechanism that controls the flow of gradient. LSTM uses a gated exponential smoothing function on the hidden representation to assign credit more effectively. In contrast, we propose to exponentially smooth the outputs (value functions) directly rather than the hidden state. This gradient can be estimated using backpropagation through time by recursively applying the chain rule:

$$\nabla_{\theta} V^{\beta}(s_t) = \beta_t\,\nabla_{\theta} V_{\theta}(s_t) + (1-\beta_t)\,\nabla_{\theta} V^{\beta}(s_{t-1}). \qquad (6)$$

However, this can become computationally expensive in environments with a large episodic length, such as continual learning. Therefore, we can approximate the gradient using a recursive eligibility trace:

$$e_t = (1-\beta_t)\,e_{t-1} + \beta_t\,\nabla_{\theta} V_{\theta}(s_t). \qquad (7)$$
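In the tabular case, such a recursive trace can be sketched as follows (a plausible form, not necessarily the authors' exact construction): every past entry decays by a factor $(1-\beta)$ at each step, and the current state gains weight $\beta$.

```python
def update_trace(trace, state, beta):
    """One step of a recursive eligibility trace for the RVF gradient:
    every past entry decays by (1 - beta); the current state gains beta."""
    new_trace = {s: (1.0 - beta) * e for s, e in trace.items()}
    new_trace[state] = new_trace.get(state, 0.0) + beta
    return new_trace

trace = {}
for state, beta in [(0, 0.5), (1, 0.5), (2, 1.0)]:
    trace = update_trace(trace, state, beta)
# With beta = 1 at the last step, the past is wiped out entirely:
# trace == {0: 0.0, 1: 0.0, 2: 1.0}
```

This makes the gating behaviour concrete: a state with $\beta = 1$ fully trusts its own estimate and cuts the gradient to everything before it.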
In the following section, we present the asymptotic convergence proof of RVFs.

Asymptotic convergence

For this analysis, we consider the simplest case: a tabular setting with TD(0) and a fixed set of $\beta$ values. In the tabular setting, each component of $\theta$ and $\phi$ estimates one particular state, allowing us to simplify the notation: we drop $\theta$ and $\phi$ and write simply $V$ and $\beta$. In the tabular setting, convergence to the fixed point of an operator is usually proven by casting the learning algorithm as a stochastic approximation Tsitsiklis [1994], Borkar [2009], Borkar and Meyn [2000] of the form:

$$V_{t+1} = V_t + \alpha_t\,\big(\mathcal{T} V_t - V_t + w_t\big), \qquad (8)$$

where $\mathcal{T}$ is a contraction operator and $w_t$ is a noise term. The main idea is to cast the Recurrent Value Function as an asynchronous stochastic approximation Tsitsiklis [1994] with an additional regularization term. By bounding the magnitude of this term, we show that the resulting operator is a contraction. The algorithm is asynchronous because the eligibility trace only updates certain states at each time step.

We consider the stochastic approximation formulation described in Eq. 8 with the following operator, defined for the current state $s_t$:

for all previously visited states $s_i$ with $i < t$. The additional term in the operator can be interpreted as a regularization term, composed of the difference between the current estimate $V(s_t)$ and the smoothed estimates of previously visited states.
To obtain this operator, we first examine the update to $V$ made during the trajectory at time step $t$:

where the regularization term is a convex combination of all the value estimates encountered in the trajectory, with the exception of $V(s_t)$, weighted by their respective contribution (eligibility) to the estimate $V^{\beta}(s_t)$. In particular, when updating a state whose eligibility has decayed, the error is dominated by this regularization term. An example of how to obtain this decomposition can be found in Section A.1.1 of the Appendix. In practice, one can observe an increase in the magnitude of this term as the eligibility decreases, suggesting that the biased updates contribute less to the learning. Bounding the magnitude of the regularization term to ensure contraction is the key concept used in this paper to establish asymptotic convergence.

We make the following assumptions to prove convergence. The first assumption deals with the ergodic nature of the Markov chain; it is a common assumption in theoretical Reinforcement Learning that guarantees an infinite number of visits to all states, thereby avoiding chains with transient states.

Assumption 1.

The Markov chain is ergodic.

The second assumption concerns the relative magnitude of the maximum and minimum reward, and allows us to bound the magnitude of the regularization term.

Assumption 2.

We define $r_{\max}$ and $r_{\min}$ as the maximum and minimum reward in an MDP. All rewards are assumed to be positive and scaled such that the scaled maximum reward satisfies the following:

where $C$ is a constant to be defined below.

In theory, scaling the reward is reasonable as it does not change the optimal solution of the MDP van Hasselt et al. [2016]. In practice, however, this may be constraining, as the range of the reward may not be known beforehand. This assumption could be relaxed by using the trajectory's information to bound the regularization term. As an example, one could consider any physical system where transitions in the state space are smooth (continuous state space) and bounded by some Lipschitz constant, in a similar manner to Shah and Xie [2018].

As mentioned earlier, the key component of the proof is to control the magnitude of the regularization term in Eq. 10. As the eligibility of an update gets smaller, the magnitude of this term gets bigger. This suggests that not updating states whose eligibility is below the threshold $C$ can help mitigate biased updates. Depending on the reward bounds and the discount factor, we may need to set such a threshold to guarantee convergence.

Theorem 1.

The operator defined above is a contraction operator if the following holds:

  • The value functions are initialized in a bounded set consistent with the reward bounds of Assumption 2.

  • For a given discount factor $\gamma$ and reward bound, the eligibility threshold $C$ is selected accordingly.

We outline the important details of the proof here; the full version can be found in Appendix A.1.2. For two value functions $V$ and $U$:

where the regularization terms are those described in Eq. 10 for the respective value functions. We can then guarantee that the operator is a contraction, where the required bound on the regularization term holds by Assumption 2. We provide an example in Appendix A.1.3 of how to set $C$ based on $\gamma$ and the reward bounds. Using Theorem 3 of Tsitsiklis [1994], we can guarantee that $V^{\beta}$ converges to a fixed point of the operator with probability 1. The assumptions of Theorem 3 of Tsitsiklis [1994] are discussed in Section A.1.4 of the Appendix.

4 Related work

One important similarity of RVFs is to the online implementation of the $\lambda$-return Sutton and Barto [1998], Dayan [1992]. Both RVFs and online returns have an eligibility-trace form, but they differ in RVFs' capacity to ignore a state based on $\beta$. In this paper we argue that this provides more robustness to noise and partial observability. The ability of RVFs to emphasize a state is similar to the interest function in emphatic TD Mahmood et al. [2015]; however, learning a state-dependent interest function remains an open problem, whereas RVFs have a natural way of learning $\beta$ by comparing the past and the future. The capacity to ignore states shares some motivation with semi-Markov decision processes Puterman [1990]: learning to ignore states can be interpreted as learning a temporal abstraction over the trajectory in policy evaluation. In the reward-shaping literature, several works such as Temporal Value Transport Hung et al. [2018], Temporal Regularization Thodoroff et al. [2018], and the Natural Value Approximator Xu et al. [2017] modify the target to either enforce temporal consistency or assign credit efficiently. This departs from our formulation, as we estimate the value function directly from previous estimates rather than by modifying the target. Because RVFs modify the estimate, they can choose to ignore a gradient while updating, which is not possible in these other works; for example, in settings where capacity is limited, updating on noisy states can be detrimental to learning. Finally, RVFs can also be considered a partially observable method Kaelbling et al. [1998]. However, they differ significantly from that literature: they do not attempt to infer the underlying hidden state explicitly, but only decide whether the past estimates align with the target. We argue that inferring an underlying state may be significantly harder than learning to ignore or emphasize a state based on its value. This is illustrated in the next section.

5 Experiments

In this section, we perform experiments on various tasks to demonstrate the effectiveness of RVFs. First, we explore their robustness to partial observability on a synthetic domain. We then showcase their robustness to noise on several complex continuous control tasks from the Mujoco suite Todorov et al. [2012]. An example of policy evaluation is also provided in Appendix A.3.1.

5.1 Partially observable multi-chain domain

We consider the simple chain MDP described in Figure 1(a). This MDP consists of three chains connected together to form a Y. Each of the three chains (left of the intersection, top right, bottom right) is made up of a sequence of three states. The agent starts at the leftmost state and navigates through the chain. At the intersection, there is an equal probability of going up or down. The chain on top ends with a reward of $+1$, while the one at the bottom ends with a reward of $-1$. Every other transition has a reward of 0, unless specified otherwise.

(a) Simple chain MDP.
(b) Results on the aliased Y-chain.
Figure 1: (a) Simple chain MDP: the agent starts at the leftmost state and navigates along the chain; one state on the top chain and one on the bottom chain are aliased. (b) Results of various methods on the aliased Y-chain, including TD, TD($\lambda$), GRU, RTD($\lambda$), and Optimal RTD (O-RTD), averaged over 20 random seeds.

We explore the capacity of recurrent learning to solve a partially observable task in the Y-chain. In particular, we consider the case where some states are aliased (share a common representation): one state on the top chain and one on the bottom chain share the same representation. The goal in this environment is to correctly estimate the values of the aliased states ($\pm 0.81$, due to the discount factor (0.9) and the length of each chain being 3). When TD methods such as TD(0) or TD($\lambda$) are used, the learned values of the aliased states are close to 0, as the rewards at the ends of the two chains are $+1$ and $-1$. However, when learning $\beta$ (the emphasis function is modelled using a sigmoid function), Recurrent Value Functions achieve almost no error in their estimates of the aliased states, as illustrated in Figure 1(b). This can be explained by observing that $\beta \approx 0$ on the aliased states, because the previous values along the trajectory are better estimates of the future than the value of the aliased state itself. As $\beta \to 0$, the RVF estimates of the aliased states come to rely on their more accurate previous estimates. We see that learning to ignore certain states can at times be sufficient to solve an aliased task. We also compare with a recurrent version (O-RTD) where optimal values of $\beta$ are used: $\beta = 0$ on the aliased states and $\beta = 1$ elsewhere. Another interesting observation concerns Recurrent Neural Networks. RNNs are known to solve tasks with partial observability by inferring the underlying state; LSTM and GRU use many parameters to do so, and correctly learning to keep the hidden state intact can be sample-inefficient. In comparison, RVFs can estimate whether or not to put emphasis (confidence) on a state value using a single parameter. This is illustrated in Figure 1(b), where RNNs take 10 times more episodes than RVFs to learn the optimal value: a case where learning to ignore a state is easier than inferring its hidden representation. The results displayed in Figure 1(b) are averaged over 20 random seeds; for every method, a hyperparameter search was done to obtain its optimal values (see Appendix A.3.2). We noticed that the emphasis function is easier to learn when the horizon of the target is longer, since a longer horizon provides a better prediction of the future. To account for this, we use the $\lambda$-return as the target.
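The failure of plain TD on the aliased states can be seen with a small calculation (assuming, as the chain lengths above suggest, that each aliased state sits two transitions before its terminal reward):

```python
gamma = 0.9

v_top = gamma ** 2 * 1.0      # true value of the aliased state on the +1 chain
v_bottom = gamma ** 2 * -1.0  # true value of the aliased state on the -1 chain

# A shared (aliased) representation forces TD to average the two targets,
# since both chains are visited with equal probability:
v_td_aliased = 0.5 * (v_top + v_bottom)

print(v_top, v_bottom, v_td_aliased)  # ≈ 0.81, ≈ -0.81, 0.0
```

An RVF with $\beta \approx 0$ on the aliased state instead falls back on the preceding, non-aliased state's estimate, recovering the correct sign on each chain.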

5.2 Deep Reinforcement Learning

Next, we test RVFs on several environments of the Mujoco suite Todorov et al. [2012]. We also evaluate the robustness of the different algorithms by adding sensor noise (drawn from a normal distribution) to the observations, as presented in Zhang et al. [2018]. We modify the critic of A2C Wu et al. [2017] (R-A2C) and of Proximal Policy Optimization Schulman et al. [2017] (R-PPO) to estimate the recurrent value function parametrized by $\theta$. We parametrize $\beta$ using a separate network with the same architecture as the value function (parametrized by $\phi$). We minimize the loss mentioned in Eq. 4, but replace the target with the generalized advantage estimate Schulman et al. [2015] for PPO and a TD target for A2C. Using an automatic differentiation library (PyTorch Paszke et al. [2017]), we differentiate the loss through the modified estimate to learn $\theta$ and $\phi$. The default optimal hyperparameters of PPO and A2C are used. Due to the batch nature of PPO, obtaining the trajectory information needed to create the computational graph can be costly. We therefore cut the backpropagation after a fixed number of timesteps, in a manner similar to truncated backpropagation through time; the number of backpropagation steps is obtained by hyperparameter search (details in Appendix A.3.3), and we found no empirical improvement from longer horizons. For a fairer comparison in the noisy case, we also compare against two versions of PPO with an LSTM. The first version processes one trajectory per update. The second uses a buffer in a manner similar to PPO, but the gradient is cut after 5 steps, as the computational overhead of building the graph at every step is too large. The performance reported is averaged over 20 different random seeds, with confidence intervals displayed.¹

¹ The base code used to develop this algorithm can be found in Kostrikov [2018].
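Why truncation loses little can be seen from the weights the RVF recurrence assigns to past value estimates: the contribution of a state $j$ steps in the past is multiplied by a product of $(1-\beta)$ factors and therefore decays geometrically. A small sketch (the uniform $\beta$ values are illustrative):

```python
def rvf_weights(betas):
    """Weight of each V(s_j) in the recurrent estimate at the final step:
    w_j = beta_j * prod_{i > j} (1 - beta_i)."""
    weights = []
    for j, b in enumerate(betas):
        w = b
        for later in betas[j + 1:]:
            w *= 1.0 - later
        weights.append(w)
    return weights

print(rvf_weights([0.5] * 5))  # [0.03125, 0.0625, 0.125, 0.25, 0.5]
```

With moderate $\beta$, states more than a few steps back contribute very little to the estimate, so cutting the computational graph after a fixed number of steps discards mostly negligible gradient terms.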

5.2.1 Performance

Figure 2: Performance on Mujoco tasks. Results on the first row are generated without noise, and on the second row by inducing Gaussian noise in the sensor inputs.

As demonstrated in Figure 2, we observe a marginal increase in performance on several tasks, such as Swimmer, Walker, Hopper, Half-Cheetah and Double Inverted Pendulum, in the fully observable setting. However, severe drops in performance were observed for vanilla PPO when we induced partial observability by adding Gaussian noise to the observations. In contrast, R-PPO (PPO with RVF) was found to be robust to the noise, achieving significantly higher performance on all the tasks considered. In both cases, R-PPO outperforms the partially observable models (LSTM). The mean and standard deviation of the emphasis function for both the noiseless and noisy versions can be found in the Appendix (Figures A2, A3). The performance of A2C, on both the vanilla and recurrent versions (referred to as R-A2C), was found to be poor. We increased the number of training steps for both versions and observed the same trends as above once A2C started to learn the task. The performance plots, along with the mean and standard deviation of the emphasis function during training, can be found in the Appendix (Figures A4–A7).

5.2.2 Qualitative interpretation of the emphasis function

Hopper: At the end of training, we can qualitatively analyze the emphasis function ($\beta$) through the trajectory. We observe the cyclical behaviour shown in Figure 3(b), where different colours describe the various stages of the cycle. The emphasis function has learned to identify important states and to ignore the others. One intuitive way to interpret the emphasis function is: if I were to give a different value to a state, would that alter my policy significantly? We observe an increase in the value of $\beta$ when the agent must make an important decision, such as jumping or landing, and a decrease when the agent must perform a trivial action. This pattern is illustrated in Figures 3(a) and 3(b). The behaviour is cyclic and repetitive; a video can be found at the link in the footnote.²

Phase 1: high $\beta$; Phase 2: low $\beta$; Phase 3: high $\beta$; Phase 4: low $\beta$
(a) Cyclical behaviour of $\beta$ on Hopper.
(b) Behaviour of $\beta$ through the trajectory.
Figure 3: (a) The emphasis function learns to emphasize key states in the environment. The emphasis function is high when the agent is making important decisions, such as landing or taking off (Phases 1 and 3), and low when the agent is making decisions in the air (Phases 2 and 4). (b) Behaviour of the emphasis function along the trajectory for the phases described in (a), over one period; the behaviour keeps repeating.

6 Discussions and Future Work

Temporal credit assignment:

As mentioned earlier, we can control the flow of the gradient using the emphasis function $\beta$, passing gradient to states that contributed to the reward but are located several time steps earlier. We could potentially assign credit to states that are temporally far away by forcing the emphasis function between those states to be close to 0. This setting could be useful in problems with long horizons, such as lifelong learning and continual learning.

$\beta$ as an interest function:

In Reinforcement Learning, having access to a function quantifying the interest Mahmood et al. [2015] of a state can be helpful. For example, one could decide to explore from those states, prioritize experience replay based on them, or use $\beta$ to decide which states to bootstrap from. Indeed, bootstrapping on states with a value similar to the current estimate (low $\beta$) only adds variance; the most informative updates come from bootstrapping on states with different values (high $\beta$). We also believe $\beta$ to be related to the concepts of bottleneck states Tishby and Polani [2011] and reversibility.

Partially observable domain:

As demonstrated earlier, RVFs are able to correctly estimate the value of an aliased/noisy state using the trajectory’s estimate. We believe that this is a promising area to explore because, as the experiments suggest, ignoring an uninformative state can sometimes suffice to learn its value function. This is in contrast to traditional POMDP methods which attempt to infer the belief state. Smoothing the output with a gating mechanism could also be useful for sequential tasks in Supervised Learning, such as regression or classification.

Adjusting for the reward:

In practice, some environments in Reinforcement Learning give a constant reward at every time step, potentially inducing bias in the estimates. It would be possible to modify the RVF formulation to account for the reward that was just seen. Whether or not subtracting the reward can reduce the bias will depend on the environment considered.
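One plausible form of such a reward-adjusted recurrence (our reconstruction, not necessarily the authors' exact formulation) subtracts the reward just received from the past estimate before smoothing:

```latex
V^{\beta}(s_t) \;=\; \beta_t\, V(s_t) \;+\; (1-\beta_t)\,\bigl(V^{\beta}(s_{t-1}) - r_{t-1}\bigr)
```

The intuition is that $V(s_{t-1}) \approx r_{t-1} + \gamma V(s_t)$, so removing $r_{t-1}$ makes the past estimate comparable, up to the discount, with the value of the current state.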


In this work we propose Recurrent Value Functions to address variance issues in Model-free Reinforcement Learning. First, we prove the asymptotic convergence of the proposed method. We then demonstrate the robustness of RVF to noise and partial observability in a synthetic example and on several tasks from the Mujoco suite. Finally, we describe the behaviour of the emphasis function qualitatively.


  • P. Abbeel, A. Coates, and A. Y. Ng (2010) Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research 29 (13), pp. 1608–1639. Cited by: §1.
  • V. S. Borkar and S. P. Meyn (2000) The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38 (2), pp. 447–469. Cited by: §3.
  • V. S. Borkar (2009) Stochastic approximation: a dynamical systems viewpoint. Vol. 48, Springer. Cited by: §3.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.
  • W. Chung, S. Nath, A. Joseph, and M. White (2018) Two-timescale networks for nonlinear value function approximation. Cited by: §A.3.1.
  • P. Dayan (1992) The convergence of td () for general . Machine learning 8 (3-4), pp. 341–362. Cited by: §4.
  • R. Fox, A. Pakman, and N. Tishby (2015) Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562. Cited by: §1, §3.
  • J. Gläscher, N. Daw, P. Dayan, and J. P. O’Doherty (2010) States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66 (4), pp. 585–595. Cited by: §1.
  • E. Greensmith, P. L. Bartlett, and J. Baxter (2004) Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5 (Nov), pp. 1471–1530. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §3.
  • C. Hung, T. Lillicrap, J. Abramson, Y. Wu, M. Mirza, F. Carnevale, A. Ahuja, and G. Wayne (2018) Optimizing agent behavior over long time scales by transporting value. External Links: arXiv:1810.06721 Cited by: §4.
  • L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1-2), pp. 99–134. Cited by: §4.
  • S. M. Kakade et al. (2003) On the sample complexity of reinforcement learning. Ph.D. Thesis. Cited by: §1.
  • J. Kober, J. A. Bagnell, and J. Peters (2013) Reinforcement learning in robotics: a survey. The International Journal of Robotics Research 32 (11), pp. 1238–1274. Cited by: §1.
  • I. Kostrikov (2018) PyTorch implementations of reinforcement learning algorithms. GitHub. Cited by: footnote 1.
  • A. R. Mahmood, H. Yu, M. White, and R. S. Sutton (2015) Emphatic temporal-difference learning. arXiv preprint arXiv:1507.01569. Cited by: §4, §6.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §5.2.
  • M. D. Pendrith. On reinforcement learning of control actions in noisy and non-markovian domains. Citeseer. Cited by: §1, §3.
  • M. L. Puterman (1990) Markov decision processes. Handbooks in operations research and management science 2, pp. 331–434. Cited by: §4.
  • M. L. Puterman (1994) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §2.
  • J. Romoff, P. Henderson, A. Piché, V. Francois-Lavet, and J. Pineau (2018) Reward estimation for variance reduction in deep reinforcement learning. arXiv preprint arXiv:1805.03359. Cited by: §3.
  • J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. Cited by: §5.2.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.2.
  • D. Shah and Q. Xie (2018) Q-learning with nearest neighbors. arXiv preprint arXiv:1802.03900. Cited by: §3.
  • R. S. Sutton and A. G. Barto. Reinforcement learning: an introduction. Cited by: §2.
  • R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §3, §4.
  • R. S. Sutton (1988) Learning to predict by the methods of temporal differences. Machine learning 3 (1), pp. 9–44. Cited by: §A.1, §2.
  • P. Thodoroff, A. Durand, J. Pineau, and D. Precup (2018) Temporal regularization for markov decision process. In Advances in Neural Information Processing Systems, pp. 1782–1792. Cited by: §4.
  • N. Tishby and D. Polani (2011) Information theory of decisions and actions. pp. 601–636. Cited by: §6.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. Cited by: §5.2, §5.
  • J. N. Tsitsiklis (1994) Asynchronous stochastic approximation and q-learning. Machine learning 16 (3), pp. 185–202. Cited by: §A.1.4, §A.1.4, §1, §3, §3.
  • H. P. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver (2016) Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pp. 4287–4295. Cited by: §3.
  • O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. (2017) Starcraft ii: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. Cited by: §1.
  • Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba (2017) Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. In Advances in neural information processing systems, pp. 5279–5288. Cited by: §5.2.
  • Z. Xu, J. Modayil, H. P. van Hasselt, A. Barreto, D. Silver, and T. Schaul (2017) Natural value approximators: learning when to trust past estimates. In Advances in Neural Information Processing Systems, pp. 2120–2128. Cited by: §4.
  • A. Zhang, N. Ballas, and J. Pineau (2018) A dissection of overfitting and generalization in continuous reinforcement learning. arXiv preprint arXiv:1806.07937. Cited by: §5.2.

Appendix A Appendix

A.1 Convergence Proof

TD(0) is known to converge to the fixed point of the Bellman operator $\mathcal{T}^{\pi}$ Sutton [1988]:

$$\mathcal{T}^{\pi}V = R^{\pi} + \gamma P^{\pi}V$$

However, in practice we only have access to a noisy version of the operator due to the sampling process, hence the noise term $w$:

$$\widetilde{\mathcal{T}}^{\pi}V = \mathcal{T}^{\pi}V + w$$
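As a concrete illustration of applying the sampled (noisy) operator, a minimal tabular TD(0) loop can be sketched as follows; the two-state chain, rewards, and step size are illustrative, not from the paper:

```python
import random

# Illustrative 2-state chain under a fixed policy: from each state we move
# to the other; reward 1 in state 0, reward 0 in state 1.
gamma, alpha = 0.9, 0.1
V = [0.0, 0.0]

random.seed(0)
for _ in range(5000):
    s = random.randint(0, 1)                    # sample a state to update
    s_next, r = 1 - s, (1.0 if s == 0 else 0.0)
    # Sampled application of the Bellman operator (TD(0) update).
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

# The fixed point solves V0 = 1 + 0.9*V1 and V1 = 0.9*V0,
# i.e. V0 = 1/(1 - 0.81) and V1 = 0.9/(1 - 0.81).
```

Here the chain is deterministic, so the only stochasticity is which state gets updated; with sampled transitions and rewards, the same loop corresponds to applying the operator plus a zero-mean noise term.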
A.1.1 Derivation of $V_{\beta}$, Eq. 10

We take an example with $t = 3$ and consider $V_{\beta}(s_3)$:

$$\begin{aligned}
V_{\beta}(s_3) &= \beta_3 V(s_3) + (1-\beta_3)V_{\beta}(s_2)\\
&= \beta_3 V(s_3) + (1-\beta_3)\big(\beta_2 V(s_2) + (1-\beta_2)V_{\beta}(s_1)\big)\\
&= \beta_3 V(s_3) + (1-\beta_3)\beta_2 V(s_2) + (1-\beta_3)(1-\beta_2)\beta_1 V(s_1)
\end{aligned}$$

To see that $V_{\beta}(s_t)$ is a convex combination of all the $V(s_i)$ encountered along the trajectory, weighted by $\beta_i\prod_{j=i+1}^{t}(1-\beta_j)$, it suffices to see that:

$$\beta_t + \sum_{i=1}^{t-1}\beta_i\prod_{j=i+1}^{t}(1-\beta_j) = 1$$

where the last line is true because $\beta_1 = 1$.
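The convex-combination property can be checked numerically. The sketch below iterates the recursion $V_{\beta}(s_t) = \beta_t V(s_t) + (1-\beta_t)V_{\beta}(s_{t-1})$ and verifies that the induced weights on $V(s_1),\dots,V(s_t)$ sum to one; the values and betas are illustrative:

```python
import math

V = [2.0, -1.0, 0.5, 3.0]      # V(s_1) ... V(s_4), illustrative
beta = [1.0, 0.7, 0.4, 0.9]    # beta_1 = 1 so that V_beta(s_1) = V(s_1)

# Recursion: V_beta(s_t) = beta_t * V(s_t) + (1 - beta_t) * V_beta(s_{t-1}).
v_beta = V[0]
for t in range(1, len(V)):
    v_beta = beta[t] * V[t] + (1 - beta[t]) * v_beta

# Closed form: the weight on V(s_i) is beta_i * prod_{j > i} (1 - beta_j).
weights = [beta[i] * math.prod(1 - b for b in beta[i + 1:]) for i in range(len(V))]

assert abs(sum(weights) - 1.0) < 1e-9                        # convex combination
assert abs(sum(w * v for w, v in zip(weights, V)) - v_beta) < 1e-9
```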

A.1.2 Proof of Theorem 1

Theorem 1.

Let us define and . If the following conditions hold:

  • Let be the set of functions such that . We assume the functions are initialized in .

  • For a given and we select C such that

then is a contractive operator.


The first step is to prove that the operator maps the set to itself for any noisy update . From condition 2), we can then deduce that




The next step is to show that the operator is contractive:


and from the assumption we know that . ∎

A.1.3 Selecting C

To select $C$ based on and , it suffices to solve analytically for:


which is satisfied only if:


As an example, for and , any satisfies this inequality.

A.1.4 Assumptions of asynchronous stochastic approximation

We now discuss the assumptions of Theorem 3 in Tsitsiklis [1994].

Assumption 1:

This assumption allows for delayed updates, which can happen in distributed systems, for example. In this algorithm all values are updated at each time step, so this assumption is not an issue here.

Assumption 2:

As described by Tsitsiklis [1994], Assumption 2 “allows for the possibility of deciding whether to update a particular component at time $t$, based on the past history of the process.” This assumption is defined to accommodate $\epsilon$-greedy exploration in Q-learning. In this paper we only consider policy evaluation, hence this assumption holds.

Assumption 3:

The learning rate of each state must satisfy the Robbins-Monro conditions, such that there exists a constant $C$:

$$\sum_{n}\alpha_{n}(s) = \infty, \qquad \sum_{n}\alpha_{n}(s)^{2} \leq C < \infty$$

This can be verified by assuming that each state is visited infinitely often and that an appropriate learning rate, decaying with the state visitation count, is used (linearly decaying, for example).
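For instance, the count-based schedule $\alpha_n = 1/n$ satisfies both conditions: the sum of the rates diverges while the sum of their squares stays bounded. A quick numeric sketch (the horizon $N$ is arbitrary):

```python
# Count-based schedule alpha_n = 1/n: sum(alpha_n) grows like log(N)
# (diverges), while sum(alpha_n^2) converges to pi^2/6.
N = 100_000
alphas = [1.0 / n for n in range(1, N + 1)]

sum_alpha = sum(alphas)                    # ~ log(N): unbounded in N
sum_alpha_sq = sum(a * a for a in alphas)  # bounded above by pi^2/6
```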

Assumption 5:

This assumption requires the operator to be a contraction. This has been proven in Theorem 1 of this paper.

A.2 Derivation of the update rule

We wish to find the parameters minimizing the loss :


Taking the derivative of the R.H.S. of Eq. 2 gives


We know that



Finally, the update rule is simply a gradient step using the above derivative.
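As a hedged tabular sketch of such a gradient step (the states, values, betas, target, and step size below are all illustrative): by the chain rule through the recursion $V_{\beta}(s_t)=\beta_t V(s_t)+(1-\beta_t)V_{\beta}(s_{t-1})$, the derivative of $V_{\beta}(s_t)$ with respect to $V(s_i)$ is $\beta_i\prod_{j=i+1}^{t}(1-\beta_j)$, so minimizing a squared error distributes the error along the trajectory much like an eligibility trace:

```python
# Hedged tabular sketch: one gradient step on L = (y - V_beta(s_t))^2.
V = {"s1": 2.0, "s2": -1.0, "s3": 0.5}       # illustrative tabular values
beta = {"s1": 1.0, "s2": 0.7, "s3": 0.4}     # illustrative emphasis values
trajectory = ["s1", "s2", "s3"]
y, alpha = 1.0, 0.5                          # illustrative target / step size

# Forward pass: recurrent value estimate along the trajectory.
v_beta = V[trajectory[0]]
for s in trajectory[1:]:
    v_beta = beta[s] * V[s] + (1 - beta[s]) * v_beta

# Backward pass: dV_beta/dV(s_i) = beta_i * prod_{j>i} (1 - beta_j),
# so each visited state receives a share of the error.
delta = y - v_beta
for i, s in enumerate(trajectory):
    weight = beta[s]
    for later in trajectory[i + 1:]:
        weight *= 1 - beta[later]
    V[s] += alpha * delta * weight
```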

A.3 Experiment

A.3.1 Policy Evaluation

Figure A1: Comparison of the RMSVE of various methods. The value function estimated using RVF has lower error than the one estimated using TD, and error comparable to eligibility traces.

In this experiment, we perform policy evaluation on the CartPole task, where the agent has to balance a pole on a cart. The agent can take one of two available actions, moving the cart either left or right. A reward of +1 is obtained at every time step, and an episode terminates if the cart moves too far from the center or the pole dips below a certain angle.

In this task, we use a pretrained network to obtain features that represent the underlying state, and train a linear function approximator on those features to estimate the value function. This experimental setup is similar to Chung et al. [2018], but the features in our case are obtained through a pretrained network instead of training a separate network. The samples are generated following a fixed policy, and the same samples were used across all methods to estimate the value function. Each sample consists of 5000 transitions, and the results are averaged over 40 such samples. We calculated the optimal value using a Monte Carlo estimate for 2000 random states following the same policy, then used the trained linear network to predict the value function on these 2000 states and computed the Root Mean Square Value Error (RMSVE) of the predictions. The best hyperparameters were obtained through a separate hyperparameter search for each method. The optimal learning rate was found to be for TD and RVF, while for eligibility traces, when we searched in {}. The optimal beta learning rate was found to be when we searched in {}, and the optimal lambda for eligibility traces was found to be when searched in {}. The RMSVE of the various methods (TD, eligibility traces with online TD(), and the value functions obtained using the RVF algorithm) is reported in Figure A1. We notice that the value function learned through the RVF algorithm has approximately the same error as the value function learned through eligibility traces. Both RVF and eligibility traces outperform TD.
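The evaluation metric can be sketched as follows; the predicted values and Monte Carlo targets below are placeholders, not the paper's data:

```python
import math

def rmsve(predicted, monte_carlo):
    """Root Mean Square Value Error between predictions and MC value estimates."""
    assert len(predicted) == len(monte_carlo)
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, monte_carlo)) / n)

# Placeholder numbers: predicted values vs. Monte Carlo ground truth.
v_pred = [10.2, 9.8, 11.1, 10.0]
v_mc = [10.0, 10.0, 11.0, 10.5]
error = rmsve(v_pred, v_mc)
```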

A.3.2 Hyper-parameters: Toy MDP

For every method, the learning rate and were tuned for optimal performance in the range .
For RTD, a learning rate of 0.5 for the value function and 1 for the beta function was found to be optimal, with a lambda of 0.9.
For the GRU model, we explored different numbers of hidden cells () to vary the capacity of the model. The optimal number of hidden cells found was 10, with a learning rate of 0.5 and a lambda of 0.9.

A.3.3 Deep Reinforcement Learning

The best hyperparameters were selected on 10 random seeds.


The following values were considered for the learning rate and . The optimal learning rate is the same as the one obtained in the original PPO paper, and . We also compared with a larger network for PPO to adjust for the additional parameters; the performance of vanilla PPO was found to be similar. In terms of computational cost, RVF introduces a computational overhead, slowing down training by a factor of 2 on a CPU (Intel Skylake cores, 2.4 GHz, AVX512) compared to PPO. The results are reported on 20 new random seeds, and a confidence interval of is displayed.


The following values were considered for the learning rate , and the best learning rate was found to be . We tested bootstrapping on steps; we found no empirical improvement between and , but noticed a significant drop in performance for bootstrapping beyond . We believe this is due to the increase in the variance of the target as we increase the number of bootstrapping steps. The best hyperparameters were found by averaging across 10 random seeds. The results reported are averaged across 20 random seeds, and a confidence interval of is displayed.
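The variance argument can be illustrated with a small simulation; the reward noise, discount, and sample counts below are illustrative assumptions. An $n$-step target accumulates noise from $n$ sampled rewards, so its variance grows with $n$:

```python
import random
import statistics

random.seed(1)
gamma, sigma = 0.99, 1.0   # illustrative discount and reward-noise scale

def n_step_target(n):
    # n noisy rewards plus a (noiseless, for illustration) bootstrapped tail.
    rewards = sum(gamma**k * (1.0 + random.gauss(0.0, sigma)) for k in range(n))
    return rewards + gamma**n * 0.0

# Empirical variance of the n-step target for increasing n.
var = {n: statistics.variance(n_step_target(n) for _ in range(20_000))
       for n in (1, 5, 20)}
```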

Figure A2: Mean beta values using recurrent PPO on Mujoco domains
Figure A3: Standard deviation of beta using recurrent PPO on Mujoco domains
Figure A4: Performance of A2C and recurrent A2C on Mujoco tasks without noise in observations for 10M steps
Figure A5: Performance of A2C and recurrent A2C on Mujoco tasks with Gaussian noise (0.1) in observations for 10M steps
Figure A6: The mean of the emphasis function on various Mujoco tasks with and without noise plotted against the number of updates
Figure A7: The standard deviation of the emphasis function on various Mujoco tasks with and without noise plotted against the number of updates