1 Non-linear Bellman equations
The recursive formulation is appealing as it allows for algorithms that can sample and update estimates of these values independently of temporal span (van Hasselt and Sutton, 2015). The canonical formulation limits the modelling power of Bellman equations to cumulative rewards that are discounted exponentially: the weight on a reward \(t\) steps in the future is discounted with a factor \(\gamma^t\), where \(\gamma\) is the discount factor.
We consider a broader class of Bellman equations that are non-linear in the rewards and future values:
\[ v(s) = \mathbb{E}\left[ f(R_{t+1}, v(S_{t+1})) \mid S_t = s \right] . \]
This generalises the standard Bellman equation, which is obtained for \(f(r, v) = r + \gamma v\). We conjecture that the additional flexibility this provides can be useful for at least two purposes: 1) to model a wider range of natural phenomena, including human and animal behaviour, and 2) to allow more efficient learning algorithms for prediction and control by widening the design space for such algorithms.
Three slightly more specific formulations will be of particular interest to us: non-linear transformations to the reward (\(f(r, v) = g(r) + \gamma v\)), to the value (\(f(r, v) = r + g(v)\)), and to the target as a whole (\(f(r, v) = h(r + \gamma g(v))\)), where, in each case, \(g\) and \(h\) are scalar functions. In the latter case, we may choose \(g = h^{-1}\), such that \(f(r, v) = h(r + \gamma h^{-1}(v))\), where \(h\) is a squashing function as will be defined later, such that \(v\) can be roughly interpreted as a squashed estimate of the expected return.
In all these cases the Bellman equations define the values of the states. The value should therefore be considered a function of the chosen non-linearity. Note that this is also true for the standard discounted formulation: the value of a state under a given discount \(\gamma\) differs from the value under a discount \(\gamma' \neq \gamma\), and neither is necessarily equivalent to the undiscounted objective if \(\gamma < 1\) and \(\gamma' < 1\), neither in terms of value nor in terms of induced behaviour.
1.1 Pre-existing non-linear Bellman equations
Humans and animals seem to exhibit a different type of weighting of the future than would emerge from the standard linear Bellman equation, which leads to exponential discounting when unrolled over multiple steps because of the repeated multiplication with \(\gamma\). One consequence is that the preference ordering of two different rewards occurring at different times can reverse, depending on how far in the future the first reward is. For instance, humans may prefer a single sparse reward of \(+1\) (e.g., $1) now over a reward of \(+2\) (e.g., $2) received a week later, but may also prefer a reward of \(+2\) received after 20 weeks over a reward of \(+1\) after 19 weeks. Such preference reversals (Thaler, 1981; Ainslie and Herrnstein, 1981; Green et al., 1994) have been observed in human and animal studies, but cannot be predicted from exponential discounting. Instead, hyperbolic discounting has been proposed as a well-fitting mathematical model, where a reward in \(t\) steps is discounted as \(1 / (1 + kt)\), or some variation of this equation.
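The reversal described above can be checked numerically. The following is a minimal sketch, assuming the hyperbolic weight \(1 / (1 + kt)\) with time measured in weeks; the parameter values `K` and `GAMMA` are hypothetical choices for illustration and are not taken from the cited studies.

```python
def hyperbolic_weight(delay, k):
    """Hyperbolic discount 1 / (1 + k * delay)."""
    return 1.0 / (1.0 + k * delay)


def exponential_weight(delay, gamma):
    """Exponential discount gamma ** delay."""
    return gamma ** delay


def prefers_later(weight, small, large, delay_small, delay_large):
    """True if the larger, later reward has the higher discounted value."""
    return large * weight(delay_large) > small * weight(delay_small)


K = 2.0      # hypothetical hyperbolic parameter (per week)
GAMMA = 0.4  # hypothetical weekly exponential discount


def hyp(delay):
    return hyperbolic_weight(delay, K)


def exp_(delay):
    return exponential_weight(delay, GAMMA)


# +1 now versus +2 in one week, then the same pair shifted 19 weeks out.
hyp_near = prefers_later(hyp, 1.0, 2.0, 0, 1)
hyp_far = prefers_later(hyp, 1.0, 2.0, 19, 20)
exp_near = prefers_later(exp_, 1.0, 2.0, 0, 1)
exp_far = prefers_later(exp_, 1.0, 2.0, 19, 20)
```

With these assumed parameters the hyperbolic preference flips between the near and far comparisons. The exponential preference cannot flip, because shifting both rewards by the same delay multiplies both sides of the comparison by the same factor.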
It has been shown that at least some of the data can be explained with a recursive formulation, called HDTD (Alexander and Brown, 2010), that uses a recursion involving a division by the value of the next state. This division makes it a non-linear Bellman equation. Interestingly, mixing values for multiple exponential discounts (as discussed by Sutton, 1995) can also closely approximate hyperbolic discounting (Kurth-Nelson and Redish, 2009; Fedus et al., 2019).
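To illustrate how a mixture of exponential discounts can approximate a hyperbolic one, the sketch below uses the identity \(\int_0^1 \gamma^{kt} \, d\gamma = 1 / (1 + kt)\) and approximates the integral with a uniform grid of discounts. The uniform weighting and the function names are assumptions made for this illustration; the cited works may mix discounts differently.

```python
def hyperbolic_weight(t, k):
    """Hyperbolic discount 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def mixed_exponential_weight(t, k, num_discounts=10_000):
    """Average of gamma ** (k * t) over a uniform grid of discounts in (0, 1)."""
    total = 0.0
    for i in range(num_discounts):
        gamma = (i + 0.5) / num_discounts  # midpoint of the i-th sub-interval
        total += gamma ** (k * t)
    return total / num_discounts
```

Each term `gamma ** (k * t)` is an ordinary exponential discount, so the average could in principle be formed from separately learnt exponentially-discounted values.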
Separately, discounting is a useful tool to increase control performance. In modern reinforcement learning applications, the discount factor is rarely set to \(\gamma = 1\), even if the goal is to optimise the average total return per episode (e.g., Mnih et al., 2015; van Hasselt et al., 2016b; Wang et al., 2016; Hessel et al., 2018). Indeed, also when learning this factor from data, the learnt value often stays below \(1\) (Xu et al., 2018). This makes intuitive sense: it can be substantially easier to learn a control policy that is somewhat myopic. Because we get to make more decisions later, when the sliding horizon into the future will have moved forward with time, the resulting behaviour is not necessarily much worse than the optimal policy for the far-sighted undiscounted formulation. In other words, it can be useful to learn a proxy for the true objective if the proxy is easier to learn.
A different non-linear Bellman equation was considered by Pohlen et al. (2018), who used targets of the form \(h(R_{t+1} + \gamma h^{-1}(v(S_{t+1})))\), where \(h\) is a squashing function. This effectively scales down the values before updating the parametric function (a multi-layer neural network) toward the scaled-down value, and then scales the values back up with \(h^{-1}\) before using these in the temporal difference update. The intuition is that it can be easier for the network to accurately represent the values in this transformed space, especially if the true values can have quite varying scales. For instance, in the popular Atari 2600 benchmark (Bellemare et al., 2013), the different games can have highly varying reward scales, which can be hard for learning algorithms that do not explicitly account for them (van Hasselt et al., 2016a).
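The scale-down and scale-up steps can be sketched as follows. This assumes the squashing function \(h(x) = \mathrm{sign}(x)(\sqrt{|x| + 1} - 1) + \epsilon x\) reported by Pohlen et al. (2018), together with its closed-form inverse; the function names and the value of `EPSILON` are illustrative choices.

```python
import math

EPSILON = 1e-2  # small linear term; value assumed here for illustration


def squash(x, eps=EPSILON):
    """h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + eps * x


def unsquash(y, eps=EPSILON):
    """Inverse of squash, obtained by solving a quadratic in sqrt(|x| + 1)."""
    u = (math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(u * u - 1.0, y)


def squashed_td_target(reward, next_value, gamma):
    """Target h(r + gamma * h^{-1}(v')) for a value estimate kept in squashed space."""
    return squash(reward + gamma * unsquash(next_value))
```

Here `next_value` is assumed to already live in the squashed space, so it is mapped back with `unsquash` before the reward is added and the sum is squashed again.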
Finally, we note that many distributional reinforcement learning algorithms (Bellemare et al., 2017; Dabney et al., 2018) can be interpreted as optimising non-linear Bellman equations. It may be helpful to view these as part of a larger set of possible transformations of the underlying reward signal (cf. Rowland et al., 2019).
2 Analysis of non-linear TD algorithms
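We now examine properties of temporal-difference algorithms (Sutton, 1988) derived from non-linear Bellman equations. When \(v_\theta\) is a parametric function with parameter \(\theta\), these algorithms have the following form
\[ \theta_{t+1} = \theta_t + \alpha \left( f(R_{t+1}, v_{\theta_t}(S_{t+1})) - v_{\theta_t}(S_t) \right) \nabla_\theta v_{\theta_t}(S_t) , \]
where \(\alpha\) is a step size and \(f\) is some function of the reward and next state value.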
This formulation subsumes the tabular case, where \(v_\theta(s)\) is just the scalar value at the cell in the table corresponding to state \(s\). The first question we can ask is whether the operator \(T\) defined by \((Tv)(s) = \mathbb{E}[f(R_{t+1}, v(S_{t+1})) \mid S_t = s]\) is a contraction, which is a sufficient condition for convergence under standard assumptions (Bertsekas and Tsitsiklis, 1996). This will, of course, depend on the choice of \(f\).
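In the tabular case the update reduces to moving one table entry toward the target \(f(r, v(s'))\). The sketch below shows this with a pluggable target function; the two-state loop used to exercise it, and all names, are hypothetical and chosen only to keep the example self-contained.

```python
import random


def td_update(values, state, reward, next_state, f, step_size):
    """One tabular update: v(s) += alpha * (f(r, v(s')) - v(s))."""
    target = f(reward, values[next_state])
    values[state] += step_size * (target - values[state])


def linear_target(gamma):
    """Standard linear target f(r, v) = r + gamma * v."""
    return lambda r, v: r + gamma * v


def run_chain(f, num_updates=20_000, step_size=0.05, seed=0):
    """Hypothetical two-state loop: 0 -> 1 (reward 1), 1 -> 0 (reward 0)."""
    rng = random.Random(seed)
    values = [0.0, 0.0]
    for _ in range(num_updates):
        s = rng.randrange(2)  # pick which state to update
        r, s_next = (1.0, 1) if s == 0 else (0.0, 0)
        td_update(values, s, r, s_next, f, step_size)
    return values
```

Passing a different callable for `f` (for instance a non-linear function of the bootstrap value) changes the Bellman equation being solved without changing the update code. Whether the iteration converges then depends on `f`.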
2.1 Reward transforms
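For \(f(r, v) = g(r) + \gamma v\) (reward transforms), all the standard convergence results will hold (as long as \(g(r)\) is bounded). For instance, the tabular algorithms will converge to the fixed point \(v_g\) of the corresponding Bellman equation, \(v_g(s) = \mathbb{E}[g(R_{t+1}) + \gamma v_g(S_{t+1}) \mid S_t = s]\), and the linear algorithms will converge to the minimum of the mean-squared projected Bellman error (Sutton et al., 2009; Sutton and Barto, 2018) \(\| v_\theta - \Pi T v_\theta \|_d^2\), where \(T\) is the Bellman operator for the transformed rewards and \(\Pi\) is a projection onto the space of representable value functions, such that \(\Pi v = v_{\theta'}\) where \(\theta' = \arg\min_\theta \| v - v_\theta \|_d\), and \(\| \cdot \|_d\) is a weighted norm \(\| x \|_d^2 = \sum_s d(s) x(s)^2\), where \(d(s)\) is the steady-state probability of being in state \(s\), which is assumed to be well-defined (e.g., the MDP is ergodic).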
Reward transforms are more interesting than they perhaps at first appear. For instance, a reward transform exists that approximates a hyperbolic discount factor quite well, in the context of sparse rewards, but with a standard exponential discounted value.
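Let \(H_k = \sum_{t \ge 0} R_t / (1 + kt)\) denote a hyperbolically-discounted return with discount parameter \(k > 0\), where \(R_t\) is the reward received after \(t\) steps. Consider episodes with a single non-zero reward on termination. This matches the setting of “do you prefer $X now or $Y later”, where we do not allow the possibility of both rewards happening. If the terminal reward occurred at time step \(T\), this implies \(H_k = R_T / (1 + kT)\). Then, for any \(r_0 > 0\), we would prefer a later non-zero reward \(R_T\) to an immediate reward of \(r_0\) if and only if \(R_T / (1 + kT) > r_0\). This induces a specific preference ordering on sparse rewards.
Theorem 1. Consider an exponentially-discounted return
\[ G_\gamma = \sum_{t \ge 0} \gamma^t g(R_t) \]
with non-linear reward transform \(g\). In the episodic setting, where only the terminal reward is non-zero, for any exponential discount with parameter \(\gamma \in (0, 1)\) and hyperbolic discount with parameter \(k > 0\), a reward transform \(g\) exists that induces the exact same reward ordering for the geometric return \(G_\gamma\) and the hyperbolic return \(H_k\), in the sense that for any episode terminating (randomly) after \(T\) steps with reward \(R_T\) we prefer a return with sparse non-zero reward \(R_T\) to an immediate reward \(r_0\) if and only if \(G_\gamma > g(r_0)\). Concretely, this holds for the transform defined by \(g(0) = 0\) and \(g(r) = r_0 \exp(c (r / r_0 - 1))\) for \(r \neq 0\), where \(c = -\ln(\gamma) / k\) is a positive constant that depends on the discount parameters \(k\) and \(\gamma\) of the hyperbolic and geometric returns, respectively, and where \(r_0 > 0\) is the reference reward.
Proof. Note that the reward transform function defined above satisfies \(g(0) = 0\) and \(g(r_0) = r_0\). Let \(T\) be the time step of the only non-zero reward \(R_T\). The exponentially-discounted return of the transformed rewards is then equal to
\[ G_\gamma = \gamma^T g(R_T) = \gamma^T r_0 \exp\left( c \left( R_T / r_0 - 1 \right) \right) . \]
This implies that \(G_\gamma > g(r_0) = r_0\) if and only if
\[ T \ln \gamma + c \left( R_T / r_0 - 1 \right) > 0 , \]
which is true if and only if \(c (R_T / r_0 - 1) > -T \ln \gamma\). By definition of \(c\), this is equivalent to
\[ \frac{R_T}{r_0} - 1 > kT \iff \frac{R_T}{1 + kT} > r_0 . \] ∎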
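The equivalence of the two preference orderings can be checked numerically. The sketch below assumes one particular transform that is consistent with the description above, namely \(g(r) = r_0 \exp(c (r / r_0 - 1))\) for \(r \neq 0\) with \(g(0) = 0\) and \(c = -\ln(\gamma) / k\); this specific form, the function names and the test values are assumptions made for illustration.

```python
import math


def hyperbolic_return(reward, delay, k):
    """Hyperbolically-discounted value of a single reward after `delay` steps."""
    return reward / (1.0 + k * delay)


def make_transform(gamma, k, r0):
    """Assumed transform g(r) = r0 * exp(c * (r / r0 - 1)) for r != 0, g(0) = 0."""
    c = -math.log(gamma) / k  # positive for 0 < gamma < 1 and k > 0

    def g(r):
        return 0.0 if r == 0 else r0 * math.exp(c * (r / r0 - 1.0))

    return g


def transformed_exponential_return(reward, delay, gamma, g):
    """Exponentially-discounted value of the transformed sparse reward."""
    return gamma ** delay * g(reward)


def orderings_agree(gamma, k, r0, rewards, delays):
    """Check that both returns rank each delayed reward against r0 identically."""
    g = make_transform(gamma, k, r0)
    for r in rewards:
        for t in delays:
            prefers_hyp = hyperbolic_return(r, t, k) > r0
            prefers_exp = transformed_exponential_return(r, t, gamma, g) > g(r0)
            if prefers_hyp != prefers_exp:
                return False
    return True
```

Exact ties, where the delayed reward is worth precisely \(r_0\), are sensitive to floating-point rounding, so a check like this is best run on reward and delay values that avoid them.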
Interestingly, the predictions made by the transform above will differ from both HDTD and hyperbolic discounting for dense rewards and/or stochastic rewards. It is also not the only transform, in the more general class of non-linear Bellman equations, that will match the hyperbolic discounts for sparse rewards. It is an interesting open question whether the predictions for any of these alternatives could fit reality better than existing models.
Some readers may find it unintuitive or undesirable that the return as defined in Theorem 1 depends exponentially on the reward. We note that we can easily add one additional transformation to map the returns into a different space, e.g., by taking the logarithm of the return. As long as this outer transform (in this case the logarithm) is monotonic, this does not change the preference orderings, so the conclusion of the theorem (and the equivalence in terms of preference ordering to hyperbolic discounting) still applies.
2.2 Non-linear discounting
Of separate interest are non-linear transformations to the bootstrap value, as in \(f(r, v) = r + g(v)\). Linear discounting is a special case, but we could instead use a non-linear function, which would imply that the discount factor could depend on the value (for instance similar to HDTD).
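We propose to consider functions \(g\) that have the following property:
\[ \left| \frac{\partial g(x)}{\partial x} \right| \le \gamma \quad \text{for all } x, \text{ for some } \gamma \in [0, 1) . \]
This is sufficient for the resulting Bellman operator to be a contraction with a factor \(\gamma\), as we prove below.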
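Theorem 2. Let \(T\) be defined by \((Tv)(s) = \mathbb{E}[R_{t+1} + g(v(S_{t+1})) \mid S_t = s]\), where \(g\) satisfies the property above with bound \(\gamma\). Define \(v_g\) as the solution of \(v_g = T v_g\). Then, \(\| Tv - Tv' \|_\infty \le \gamma \| v - v' \|_\infty\) for any \(v\) and \(v'\), which means that \(v_g\) is the (well-defined) fixed point, and that the operator contracts towards this fixed point with at least a factor \(\gamma\).
Proof.
\[ \| Tv - Tv' \|_\infty = \max_s \left| \mathbb{E}\left[ g(v(S_{t+1})) - g(v'(S_{t+1})) \mid S_t = s \right] \right| \le \max_s \left| g(v(s)) - g(v'(s)) \right| \le \gamma \max_s \left| v(s) - v'(s) \right| = \gamma \| v - v' \|_\infty . \]
For the first inequality, we used the fact that the maximum difference between expectations is always at most as large as the maximum difference over all elements in the support of the distribution of the expectation. For the last inequality, we used the fact that the derivative is bounded by \(\gamma\), and therefore the function is \(\gamma\)-Lipschitz. ∎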
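The contraction argument can be sanity-checked on a small random problem. The sketch below uses a hypothetical transform \(g(x) = \gamma \tanh(x)\), chosen only because its derivative is bounded by \(\gamma\), and a fixed-policy backup with expected rewards per state. All names and the random construction are assumptions for illustration.

```python
import math
import random


def bounded_slope_transform(x, gamma):
    """Hypothetical g(x) = gamma * tanh(x); its derivative never exceeds gamma."""
    return gamma * math.tanh(x)


def bellman_backup(values, rewards, transitions, g):
    """(T v)(s) = r(s) + sum over s' of P(s' | s) * g(v(s')) for a fixed policy."""
    return [
        rewards[s] + sum(p * g(values[s2]) for s2, p in enumerate(transitions[s]))
        for s in range(len(values))
    ]


def max_norm_distance(u, v):
    """Maximum absolute difference between two value tables."""
    return max(abs(a - b) for a, b in zip(u, v))


def random_transitions(num_states, rng):
    """Random row-stochastic transition matrix as a list of lists."""
    rows = []
    for _ in range(num_states):
        weights = [rng.random() for _ in range(num_states)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows
```

For any two value tables, the distance between their backups should be at most `gamma` times the distance between the tables themselves, up to floating-point rounding.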
More generally, with a slight extension of this result, we can say for transforms of the form \(f(r, v) = h(r + g(v))\) that these will contract with at least a factor \(L_h L_g\), where \(L_h\) and \(L_g\) are the Lipschitz constants of \(h\) and \(g\), respectively.
To restrict the search space of potentially interesting functions we can, in addition to the property above, consider certain additional restrictions. We could, for instance, require the space of functions we consider to be parametrised with a single number \(\gamma\) (we then denote these functions with \(g_\gamma\)), where we attain the undiscounted linear objective for \(\gamma = 1\) and a fully myopic objective that only looks at the immediate reward for \(\gamma = 0\), at the extremes of the allowed range of \(\gamma\). In addition, we may want the function to be symmetrical around the origin, and monotonic. This means we might want to consider the following properties.
1. \(g_0(x) = 0\) for all \(x\) (myopic for \(\gamma = 0\)),
2. \(g_1(x) = x\) for all \(x\) (undiscounted for \(\gamma = 1\)),
3. \(g_\gamma(-x) = -g_\gamma(x)\) for all \(x\) (symmetric),
4. \(g_\gamma(x) > g_\gamma(y)\) iff \(x > y\), for \(\gamma > 0\) (monotonic).
The symmetric requirement simply requires \(g_\gamma\) to be odd, which implies \(g_\gamma(0) = 0\) for all \(\gamma\).
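Linear transformations, where \(g_\gamma(x) = \gamma x\), share these properties, but the properties also allow for more general non-linear transformations. As an example, consider the following class that we will call power-discounting: \(g_\gamma(x) = (|x| + 1)^\gamma - 1\) for \(x \ge 0\) and \(g_\gamma(x) = -((|x| + 1)^\gamma - 1)\) for \(x < 0\). For large \(x\), this function becomes very similar to \(x^\gamma\) (or \(-|x|^\gamma\) for negative \(x\)), but it has the desired properties enumerated above. In particular, its derivative is \(\gamma (|x| + 1)^{\gamma - 1}\), which for \(\gamma < 1\) tends to \(0\) for large \(|x|\), which implies that larger values will contract faster, and is at most \(\gamma\), for \(x = 0\). Some examples are shown in Figure 1.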
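The properties discussed above are easy to verify numerically. The following minimal sketch assumes the power-discounting form \(g_\gamma(x) = \mathrm{sign}(x)((|x| + 1)^\gamma - 1)\); the function names are illustrative.

```python
def power_discount(x, gamma):
    """Assumed power-discounting g(x) = sign(x) * ((|x| + 1) ** gamma - 1)."""
    magnitude = (abs(x) + 1.0) ** gamma - 1.0
    return magnitude if x >= 0 else -magnitude


def power_discount_derivative(x, gamma):
    """Derivative gamma * (|x| + 1) ** (gamma - 1), largest at x = 0."""
    return gamma * (abs(x) + 1.0) ** (gamma - 1.0)
```

Setting `gamma` to 0 gives the myopic case, setting it to 1 recovers the identity, and intermediate values shrink large inputs more strongly than small ones.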
Note that the value of a state under non-linear discounting depends on the stochasticity in the environment’s dynamics and in the agent’s policy, as this stochasticity interacts with non-linear transformations such as power discounting in ways that might not be immediately obvious. To illustrate this, consider the action values under power discounting in a state \(s\), from which two actions \(a\) and \(b\) are available for the agent to select. In both cases, the immediate reward is zero, but selecting action \(a\) leads deterministically to a state with value \(v_a\), while selecting action \(b\) either leads, with probability \(p\), to a state with value \(v_b\), where \(p v_b > v_a > 0\), or it terminates. The action values under linear discounting are \(q(s, a) = \gamma v_a\) and \(q(s, b) = \gamma p v_b\). For any \(\gamma > 0\), \(q(s, b) > q(s, a)\), so action \(b\) is always optimal. Under power discounting the agent’s preference between the two actions may reverse for large \(v_a\): the agent can become risk averse. In Figure 2 we plot the action gaps \(q(s, b) - q(s, a)\) (y-axis) for multiple values of \(v_a\) (x-axis) and \(\gamma\) (different lines). Note that for small \(\gamma\) the action gaps can be negative: the agent prefers the certain value of the state reached by \(a\) over the uncertain value of the state reached by \(b\), even if the latter has a larger (undiscounted) expected value.
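The risk-averse reversal can be reproduced with a few lines of code. The sketch below assumes the power-discounting form \(g_\gamma(x) = \mathrm{sign}(x)((|x| + 1)^\gamma - 1)\), a terminal value of zero, and hypothetical numbers for the success probability and the two successor values; none of these numbers come from the figures in the text.

```python
def power_discount(x, gamma):
    """Assumed power-discounting g(x) = sign(x) * ((|x| + 1) ** gamma - 1)."""
    magnitude = (abs(x) + 1.0) ** gamma - 1.0
    return magnitude if x >= 0 else -magnitude


def linear_discount(x, gamma):
    """Standard linear discounting g(x) = gamma * x."""
    return gamma * x


def action_gap(safe_value, risky_value, success_prob, gamma, g):
    """q(s, b) - q(s, a) with zero immediate reward and terminal value 0."""
    q_safe = g(safe_value, gamma)
    q_risky = (
        success_prob * g(risky_value, gamma)
        + (1.0 - success_prob) * g(0.0, gamma)
    )
    return q_risky - q_safe
```

For example, with a success probability of 0.5 and a risky value 2.2 times the safe value, the risky action has the larger expected value, so the linear gap is positive at every scale. With the assumed power discounting and \(\gamma = 0.5\), the gap is positive for small values and becomes negative once the values are large.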
3 Prediction and control performance
In addition to a larger design space to model natural phenomena, like hyperbolic discounting, non-linear Bellman equations may offer an interesting path towards algorithms that work well for prediction or control. This is similar to how it is common practice to add a discount factor, even if we actually mostly care about the undiscounted returns, simply because the resulting performance is better. Apart from a few hints in the literature (e.g. Pohlen et al., 2018; Kapturowski et al., 2019), this is still a relatively unexplored area, and this seems an interesting avenue for future research.
Non-linear Bellman equations allow us to capture rich and varied information about the world. It is thought to be important for our agents to learn many things (Sutton et al., 2011), and non-linear Bellman equations offer a rich and powerful toolbox to express even more varied predictive questions.
- Ainslie and Herrnstein (1981) G. Ainslie and R. J. Herrnstein. Preference reversal and delayed reinforcement. Animal Learning & Behavior, 9(4):476–482, 1981.
- Alexander and Brown (2010) W. H. Alexander and J. W. Brown. Hyperbolically discounted temporal difference learning. Neural computation, 22(6):1511–1527, 2010.
- Bellemare et al. (2013) M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR), 47:253–279, 2013.
- Bellemare et al. (2017) M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 449–458, 2017.
- Bellman (1957) R. Bellman. Dynamic Programming. Princeton University Press, 1957.
- Bertsekas and Tsitsiklis (1996) D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, Belmont, MA, 1996.
- Dabney et al. (2018) W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Fedus et al. (2019) W. Fedus, C. Gelada, Y. Bengio, M. G. Bellemare, and H. Larochelle. Hyperbolic discounting and learning over multiple horizons. arXiv preprint arXiv:1902.06865, 2019.
- Green et al. (1994) L. Green, N. Fristoe, and J. Myerson. Temporal discounting and preference reversals in choice between delayed outcomes. Psychonomic Bulletin & Review, 1(3):383–389, 1994.
- Hessel et al. (2018) M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Howard (1960) R. A. Howard. Dynamic programming and Markov processes. MIT Press, 1960.
- Kapturowski et al. (2019) S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations, 2019.
- Kurth-Nelson and Redish (2009) Z. Kurth-Nelson and A. D. Redish. Temporal-difference reinforcement learning with distributed representations. PLoS One, 4(10), 2009.
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Pohlen et al. (2018) T. Pohlen, B. Piot, T. Hester, M. G. Azar, D. Horgan, D. Budden, G. Barth-Maron, H. van Hasselt, J. Quan, M. Vecerík, M. Hessel, R. Munos, and O. Pietquin. Observe and look further: Achieving consistent performance on atari. CoRR, abs/1805.11593, 2018.
- Puterman (1994) M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc. New York, NY, USA, 1994.
- Rowland et al. (2019) M. Rowland, R. Dadashi, S. Kumar, R. Munos, M. G. Bellemare, and W. Dabney. Statistics and samples in distributional reinforcement learning. In International Conference on Machine Learning, pages 5528–5536, 2019.
- Sutton (1988) R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
- Sutton (1995) R. S. Sutton. TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995, pages 531–539. Elsevier, 1995.
- Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT press, Cambridge MA, 2018.
- Sutton et al. (2009) R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pages 993–1000. ACM, 2009.
- Sutton et al. (2011) R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
- Thaler (1981) R. Thaler. Some empirical evidence on dynamic inconsistency. Economics letters, 8(3):201–207, 1981.
- van Hasselt and Sutton (2015) H. van Hasselt and R. S. Sutton. Learning to predict independent of span. CoRR, abs/1508.04582, 2015.
- van Hasselt et al. (2016a) H. van Hasselt, A. Guez, M. Hessel, V. Mnih, and D. Silver. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pages 4287–4295, 2016a.
- van Hasselt et al. (2016b) H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with Double Q-learning. AAAI, 2016b.
- Wang et al. (2016) Z. Wang, N. de Freitas, T. Schaul, M. Hessel, H. van Hasselt, and M. Lanctot. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, New York, NY, USA, 2016.
- Watkins (1989) C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England, 1989.
- Xu et al. (2018) Z. Xu, H. P. van Hasselt, and D. Silver. Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems, pages 2402–2413, 2018.