Risk-sensitive reinforcement learning (RL) is important for practical, high-stakes applications such as self-driving and robotic surgery. In contrast with standard, risk-neutral RL, it optimizes a risk measure of the cumulative rewards instead of their expectation. One foundational framework for risk-sensitive RL maximizes the entropic risk measure of the reward, which takes the form of
$$\frac{1}{\beta} \log \mathbb{E}\big[e^{\beta R}\big]$$
with respect to the policy $\pi$, where $\beta \neq 0$ is a given risk parameter and $R$ denotes the cumulative rewards.
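As a concrete illustration, the entropic risk measure can be estimated from sampled returns by a log-mean-exp computation. The sketch below (with an illustrative `entropic_risk` helper that is ours, not the paper's) shows how a negative risk parameter penalizes variance while a positive one rewards it.

```python
import math

def entropic_risk(returns, beta):
    """Sample estimate of (1/beta) * log E[exp(beta * R)].

    The max-shift keeps the exponentials numerically stable for large |beta|.
    """
    m = max(beta * r for r in returns)
    log_mgf = m + math.log(sum(math.exp(beta * r - m) for r in returns) / len(returns))
    return log_mgf / beta

# Two reward samples with the same mean but different variance:
safe = [1.0, 1.0, 1.0, 1.0]
risky = [0.0, 2.0, 0.0, 2.0]

# A risk-averse agent (beta < 0) ranks the risky lottery strictly lower,
# even though the expected rewards coincide.
print(entropic_risk(safe, -1.0))   # exactly the mean, 1.0
print(entropic_risk(risky, -1.0))  # strictly below 1.0
```

As $\beta \to 0$ the estimate approaches the sample mean, recovering the risk-neutral objective.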
Recently, the works of [20, 21] investigate the online setting of the above risk-sensitive RL problem. Under $K$-episode MDPs with horizon length $H$, they propose two model-free algorithms, namely RSVI and RSQ, and prove that their algorithms achieve the regret upper bound (with its informal form given by)
$$\widetilde{O}\big(e^{|\beta| H^2} \cdot \sqrt{K}\big)$$
without assuming knowledge of the transition distribution or access to a simulator. They also provide a lower bound (informally presented as)
$$\Omega\big(e^{c |\beta| H} \cdot \sqrt{K}\big)$$
that any algorithm has to incur, where the exponent $c |\beta| H$ is a linear function of $H$. Despite the non-asymptotic nature of their results, it is not hard to see that a wide gap exists between the two bounds. Specifically, the upper bound has an additional factor of $e^{|\beta| H^2}$ compared to the lower bound, and even worse, this factor dominates the upper bound since the quadratic exponent in $H$ makes it exponentially larger than the lower bound's factor $e^{c |\beta| H}$ even for moderate values of $|\beta|$ and $H$. It is unclear whether the factor of $e^{|\beta| H^2}$ is intrinsic to the upper bound.
In this paper, we show that the additional factor $e^{|\beta| H^2}$ in the upper bound is not intrinsic and can be eliminated by a refined algorithmic design and analysis. We identify two deficiencies in the existing algorithms and their analysis: (1) the main element of the analysis follows existing analysis of risk-neutral RL algorithms, which fails to exploit the special structure of the Bellman equations of risk-sensitive RL; (2) the existing algorithms use an excessively large bonus that results in the exponential blow-up in the regret upper bound.
To address the above shortcomings, we consider a simple transformation of the Bellman equations analyzed so far in the literature, which we call the exponential Bellman equation. A distinctive feature of the exponential Bellman equation is that it associates the instantaneous reward and the value function of the next step in a multiplicative way, rather than in an additive way as in the standard Bellman equations. From the exponential Bellman equation, we develop a novel analysis of the Bellman backup procedure for risk-sensitive RL algorithms that are based on the principle of optimism. The analysis further motivates a novel exploration mechanism called the doubly decaying bonus, which helps the algorithms adapt to their estimation error over each horizon step while at the same time exploring efficiently. These discoveries enable us to propose two model-free algorithms for RL with the entropic risk measure based on the novel bonus. By combining the new analysis and bonus design, we prove that the preceding algorithms attain nearly optimal regret bounds under episodic and finite-horizon MDPs. Compared to prior results, our regret bounds feature an exponential improvement with respect to the horizon length and risk parameter, removing the factor of $e^{|\beta| H^2}$ from existing upper bounds. This significantly narrows the gap between upper bounds and the existing lower bound of regret.
In summary, we make the following theoretical contributions in this paper.
We investigate the gap between existing upper and lower regret bounds in the context of risk-sensitive RL, and identify deficiencies of the existing algorithms and analysis;
We consider the exponential Bellman equation, which inspires us to propose a novel analysis of the Bellman backup procedure for RL algorithms based on the entropic risk measure. It further motivates a novel bonus design called doubly decaying bonus. We then design two model-free risk-sensitive RL algorithms equipped with the novel bonus.
The novel analytic framework and bonus design together enable us to prove that the preceding algorithms achieve nearly optimal regret bounds, which improve upon existing ones by an exponential factor in terms of the horizon length and risk sensitivity.
2 Related works
The problem of RL with respect to the entropic risk measure was first proposed in the classical work of [24], and has since inspired a large body of studies [2, 4, 5, 6, 7, 8, 13, 16, 17, 18, 22, 23, 25, 26, 31, 33, 37, 38, 40, 41, 43]. However, the algorithms from this line of work require knowledge of the transition kernel or assume access to a simulator of the underlying environment. Theoretical properties of these algorithms are investigated under these assumptions, but the results are mostly asymptotic in nature and do not shed light on their dependency on key parameters of the environment and agent.
The work of [20] represents the first effort to investigate the setting where transitions are unknown and simulators of the environment are unavailable. It establishes the first non-asymptotic regret and sample complexity guarantees under the tabular setting. Building upon [20], the authors of [21] extend the results to the function approximation setting, by considering linear and general function approximations of the underlying MDPs. Nevertheless, as discussed in Section 1, both works leave open an exponential gap between the regret upper and lower bounds, which the present work aims to address via novel algorithms and analysis motivated by the exponential Bellman equation.
We remark that although the exponential Bellman equation has been previously investigated in the literature of risk-sensitive RL [5, 2], this is the first time that it is explored for deriving regret and sample complexity guarantees of risk-sensitive RL algorithms. In Appendix A, we also make connections between risk-sensitive RL and distributional RL through the exponential Bellman equation.
Notation. For a positive integer $n$, we let $[n] := \{1, \dots, n\}$. For two non-negative sequences $\{a_n\}$ and $\{b_n\}$, we write $a_n = O(b_n)$ if there exists a universal constant $C > 0$ such that $a_n \le C b_n$ for all $n$, and write $a_n = \Theta(b_n)$ if $a_n = O(b_n)$ and $b_n = O(a_n)$. We use $\widetilde{O}(\cdot)$ to denote $O(\cdot)$ while hiding logarithmic factors. For functions $f, g : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ denotes their domain, we write $f \le g$ if $f(x) \le g(x)$ for any $x \in \mathcal{X}$. We denote by $\mathbb{1}\{\cdot\}$ the indicator function.
3 Problem background
3.1 Episodic and finite-horizon MDP
The setting of episodic Markov decision processes can be denoted by $\mathrm{MDP}(\mathcal{S}, \mathcal{A}, H, \{\mathbb{P}_h\}_{h \in [H]}, \{r_h\}_{h \in [H]})$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $H$ is the length of each episode, and $\{\mathbb{P}_h\}_{h \in [H]}$ and $\{r_h\}_{h \in [H]}$ are the sets of transition kernels and reward functions, respectively. We let $S := |\mathcal{S}|$ and $A := |\mathcal{A}|$, and we assume both are finite. We let $\mathbb{P}_h(\cdot \mid s, a)$ denote the probability distribution over successor states of step $h+1$ if action $a$ is executed in state $s$ at step $h$. We assume that the reward function $r_h : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is deterministic. We also assume that both $\{\mathbb{P}_h\}$ and $\{r_h\}$ are unknown to learning agents.
Under the setting of an episodic MDP, the agent aims to learn the optimal policy by interacting with the environment throughout $K$ episodes, described as follows. At the beginning of episode $k$, an initial state $s_1^k$ is selected by the environment, and we assume that $s_1^k = s_1$ stays the same for all $k \in [K]$. In each step $h \in [H]$ of episode $k$, the agent observes state $s_h^k$, executes an action $a_h^k$, and receives a reward equal to $r_h(s_h^k, a_h^k)$ from the environment. The MDP then transitions into state $s_{h+1}^k$ randomly drawn from the transition kernel $\mathbb{P}_h(\cdot \mid s_h^k, a_h^k)$. The episode terminates at step $H+1$, in which the agent does not take actions or receive rewards. We define a policy $\pi$ as a collection of functions $\{\pi_h : \mathcal{S} \to \mathcal{A}\}_{h \in [H]}$, where $\pi_h(s)$ is the action that the agent takes in state $s$ at step $h$ of the episode.
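The interaction protocol above can be sketched as a generic loop. The toy environment and the `step`/`run_episode` interfaces below are our own illustrations, not code from the paper.

```python
import random

random.seed(0)

# A toy two-state, two-action MDP with horizon H, standing in for the
# episodic protocol; the states, rewards, and transitions are illustrative.
H, states, actions = 3, [0, 1], [0, 1]

def step(h, s, a):
    """Deterministic reward in [0, 1] and a randomly drawn successor state."""
    reward = 0.5 if a == s else 0.25
    next_state = random.choice(states)
    return reward, next_state

def run_episode(policy, s1=0):
    """One episode: H rounds of observe, act, receive reward, transition."""
    s, rewards = s1, []
    for h in range(1, H + 1):
        a = policy[h][s]            # a_h = pi_h(s_h)
        r, s = step(h, s, a)
        rewards.append(r)
    return rewards                  # the episode ends at step H + 1

# A policy is a collection of per-step mappings pi_h : S -> A.
policy = {h: {s: s for s in states} for h in range(1, H + 1)}
print(run_episode(policy))
```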
3.2 Risk-sensitive RL
For each $(s, h) \in \mathcal{S} \times [H]$, we define the value function $V_h^\pi(s)$ of a policy $\pi$ as the cumulative utility of the agent at state $s$ of step $h$ under the entropic risk measure, assuming that the agent commits to policy $\pi$ in later steps. Specifically, we define
$$V_h^\pi(s) := \frac{1}{\beta} \log \mathbb{E}\bigg[ \exp\Big(\beta \sum_{h'=h}^{H} r_{h'}\big(s_{h'}, \pi_{h'}(s_{h'})\big)\Big) \,\bigg|\, s_h = s \bigg], \qquad (1)$$
where $\beta \neq 0$ is a given risk parameter. The agent aims to maximize the cumulative utility in step 1, that is, to find a policy $\pi$ such that $V_1^\pi(s)$ is maximized for every state $s$. Under this setting, if $\beta > 0$ the agent is risk-seeking, and if $\beta < 0$ the agent is risk-averse. Furthermore, as $\beta \to 0$, the agent tends to be risk-neutral and $V_h^\pi$ tends to the classical value function $\mathbb{E}\big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, \pi_{h'}(s_{h'})) \mid s_h = s\big]$.
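These risk attitudes can be read off from the standard second-order Taylor expansion of the entropic risk measure around $\beta = 0$ (a routine computation, sketched here for intuition):

```latex
% Second-order expansion of the entropic risk measure in the risk parameter:
\frac{1}{\beta}\log \mathbb{E}\big[e^{\beta R}\big]
  = \mathbb{E}[R] + \frac{\beta}{2}\,\mathrm{Var}(R) + O(\beta^2).
```

Thus a positive $\beta$ rewards variance (risk-seeking), a negative $\beta$ penalizes it (risk-averse), and the measure reduces to the expected return as $\beta \to 0$.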
We may also define the action-value function $Q_h^\pi(s, a)$, which is the cumulative utility of the agent who follows policy $\pi$, conditional on a particular state-action pair; formally, this is given by
$$Q_h^\pi(s, a) := \frac{1}{\beta} \log \mathbb{E}\bigg[ \exp\Big(\beta\Big( r_h(s, a) + \sum_{h'=h+1}^{H} r_{h'}\big(s_{h'}, \pi_{h'}(s_{h'})\big)\Big)\Big) \,\bigg|\, s_h = s, a_h = a \bigg]. \qquad (2)$$
Under some mild regularity conditions, there always exists an optimal policy, which we denote as $\pi^*$, that yields the optimal value $V_h^{\pi^*}(s) = \sup_\pi V_h^\pi(s)$ for all $(s, h) \in \mathcal{S} \times [H]$.
For all $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$, the Bellman equation associated with a policy $\pi$ is given by
$$Q_h^\pi(s, a) = r_h(s, a) + \frac{1}{\beta} \log \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a)}\big[e^{\beta V_{h+1}^\pi(s')}\big], \quad V_h^\pi(s) = Q_h^\pi\big(s, \pi_h(s)\big), \quad V_{H+1}^\pi \equiv 0, \qquad (3)$$
for $h \in [H]$. In Equation (3), it can be seen that the action value of step $h$ is a non-linear function of the value function of the later step $h+1$. This is in contrast with the linear Bellman equations in the risk-neutral setting ($\beta \to 0$), where $Q_h^\pi(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a)}\big[V_{h+1}^\pi(s')\big]$. Based on Equation (3), for $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$, the Bellman optimality equation is given by
$$Q_h^*(s, a) = r_h(s, a) + \frac{1}{\beta} \log \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a)}\big[e^{\beta V_{h+1}^*(s')}\big], \quad V_h^*(s) = \max_{a \in \mathcal{A}} Q_h^*(s, a), \quad V_{H+1}^* \equiv 0. \qquad (4)$$
Exponential Bellman equation.
Taking the exponential of both sides of Equation (3) yields what we call the exponential Bellman equation:
$$e^{\beta Q_h^\pi(s, a)} = \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a)}\Big[e^{\beta \big(r_h(s, a) + V_{h+1}^\pi(s')\big)}\Big]. \qquad (5)$$
When $\pi = \pi^*$, we obtain the corresponding optimality equation
$$e^{\beta Q_h^*(s, a)} = \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a)}\Big[e^{\beta \big(r_h(s, a) + V_{h+1}^*(s')\big)}\Big], \qquad (6)$$
which suggests the empirical update
$$e^{\beta \widehat{Q}_h(s, a)} \leftarrow \widehat{\mathbb{E}}_{s'}\Big[e^{\beta \big(r_h(s, a) + \widehat{V}_{h+1}(s')\big)}\Big] \qquad (7)$$
given some estimate $\widehat{V}_{h+1}$ of the value function $V_{h+1}^*$. Here, we denote by $\widehat{\mathbb{E}}_{s'}$ the sample average computed over successor states observed throughout past episodes, and the right-hand side of Equation (7) can be seen as an empirical MGF of cumulative rewards from step $h$. Equation (5) also suggests the following policy improvement procedure for a risk-sensitive policy $\pi$:
$$\pi_h(s) \in \operatorname*{argmax}_{a \in \mathcal{A}} \; \mathrm{sign}(\beta) \cdot e^{\beta \widehat{Q}_h(s, a)}, \qquad (8)$$
where $\widehat{Q}_h$ denotes some estimated action-value function, possibly obtained from the quantity $e^{\beta \widehat{Q}_h(s, a)}$ in Equation (7).
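To make the multiplicative structure concrete, the toy one-step backup below (our own illustration, not the paper's implementation) performs the backup once in log space via the standard risk-sensitive Bellman equation and once in exponential space, where the reward enters multiplicatively, and checks that the two agree.

```python
import math

beta = 0.5
reward = 0.8                      # deterministic reward r_h(s, a) in [0, 1]
next_values = [0.2, 1.0, 0.5]     # V_{h+1}(s') for three successor states
probs = [0.3, 0.5, 0.2]           # transition kernel P_h(s' | s, a)

# Log-space backup: Q = r + (1/beta) * log E[exp(beta * V')]
q_log = reward + math.log(sum(p * math.exp(beta * v)
                              for p, v in zip(probs, next_values))) / beta

# Exponential-space backup: exp(beta * Q) = E[exp(beta * (r + V'))],
# associating reward and next-step value multiplicatively.
exp_q = sum(p * math.exp(beta * (reward + v))
            for p, v in zip(probs, next_values))
q_exp = math.log(exp_q) / beta

print(abs(q_log - q_exp))  # agrees up to floating-point error
```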
In the next section, we will discuss how the exponential Bellman equation (5) inspires the development of a novel analytic framework for risk-sensitive RL. Before proceeding, we introduce a performance metric for the agent. For each episode $k \in [K]$, recall that $s_1^k$ is the initial state chosen by the environment, and let $\pi^k$ be the policy of the agent at the beginning of episode $k$. Then the difference $V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k)$ is called the regret of the agent in episode $k$. Therefore, after $K$ episodes, the total regret for the agent is given by
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \Big[ V_1^*(s_1^k) - V_1^{\pi^k}(s_1^k) \Big], \qquad (9)$$
which serves as the key performance metric studied in this paper.
4 Analysis of risk-sensitive RL
4.1 Mechanism of existing analysis
In this section, we provide an informal overview of the mechanism underlying the existing analysis of risk-sensitive RL. Let us focus on the case $\beta > 0$ for simplicity of exposition; similar reasoning holds for $\beta < 0$. A key step in the existing regret analysis of RL algorithms is to establish a recursion on the difference $V_h^k(s_h^k) - V_h^{\pi^k}(s_h^k)$ over $h \in [H]$, where $V_h^k$ is the iterate of an algorithm in step $h$ of episode $k$ and $V_h^{\pi^k}$ is the value function of the policy $\pi^k$ used in episode $k$. Such an approach can be commonly found in the literature of algorithms that use the upper confidence bound [27, 28], in which the recursion takes the form of
$$V_h^k(s_h^k) - V_h^{\pi^k}(s_h^k) \lesssim V_{h+1}^k(s_{h+1}^k) - V_{h+1}^{\pi^k}(s_{h+1}^k) + \xi_h^k \qquad (10)$$
for $h \in [H]$ and some quantity $\xi_h^k$ that collects bonus and martingale terms. The work of [20], which studies the risk-sensitive setting under the entropic risk measure, also follows this approach and derives regret bounds by establishing the recursion of the form
$$V_h^k(s_h^k) - V_h^{\pi^k}(s_h^k) \lesssim e^{|\beta|(H - h)} \Big( V_{h+1}^k(s_{h+1}^k) - V_{h+1}^{\pi^k}(s_{h+1}^k) + b_h^k + \zeta_h^k \Big), \qquad (11)$$
where $b_h^k$ denotes the bonus which enforces the upper confidence bound and leads to the inequality $V_h^k \ge V_h^\pi$ for any policy $\pi$, and $\zeta_h^k$ is part of a martingale difference sequence. The derivation of Equation (11) is based on the Bellman equation (3), which shows that the action value of step $h$ is the sum of the reward and the entropic risk measure of $V_{h+1}^\pi$. Following [20], we may then unroll the recursion (11) from $h = 1$ to $H$ to get
$$V_1^k(s_1^k) - V_1^{\pi^k}(s_1^k) \lesssim e^{|\beta| H (H - 1)/2} \sum_{h=1}^{H} \big( b_h^k + \zeta_h^k \big), \qquad (12)$$
given that $V_{H+1}^k = V_{H+1}^{\pi^k} \equiv 0$. Summing Equation (12) over $k \in [K]$ and controlling the bonus and martingale terms, one obtains the regret bound in [20] of order $e^{|\beta| H^2} \sqrt{K}$ (up to polynomial factors). Therefore, it can be seen that the dominating factor $e^{|\beta| H^2}$ in their regret bound originates in Equation (12), which can be further traced back to the exponential factor $e^{|\beta|(H - h)}$ in the error dynamics (11).
4.2 Refined approach via exponential Bellman equation
While the existing analysis in (11) is motivated by the Bellman equation of the form given in (3), we propose to work on the exponential Bellman equation (5). Equation (5) operates on the quantities $e^{\beta Q_h^\pi}$ and $e^{\beta V_{h+1}^\pi}$, which can be thought of as the MGFs of the current and future values, while the reward function is involved as a multiplicative term. This motivates us to derive a new recursion:
$$e^{\beta V_h^k(s_h^k)} - e^{\beta V_h^{\pi^k}(s_h^k)} \lesssim e^{\beta r_h^k} \Big( e^{\beta V_{h+1}^k(s_{h+1}^k)} - e^{\beta V_{h+1}^{\pi^k}(s_{h+1}^k)} \Big) + b_h^k + \zeta_h^k, \qquad (13)$$
where $b_h^k$ and $\zeta_h^k$ denote some bonus and martingale terms, respectively, and $r_h^k := r_h(s_h^k, a_h^k)$ stands for the reward in step $h$ of episode $k$. Unrolling Equation (13) yields
$$e^{\beta V_1^k(s_1^k)} - e^{\beta V_1^{\pi^k}(s_1^k)} \lesssim \sum_{h=1}^{H} w_h \big( b_h^k + \zeta_h^k \big), \qquad (14)$$
where $w_h := \exp\big(\beta \sum_{h'=1}^{h-1} r_{h'}^k\big)$. In words, the error of $e^{\beta V_1^k}$ is bounded by the weighted sum of bonus and martingale difference terms, where the weights are given by $\{w_h\}$, the exponential rewards up to step $h$. We may then apply a localized linearization of the logarithmic function, which gives $\log x - \log y \le (x - y)/y$ for $0 < y \le x$, and arrive at a regret upper bound (the formal regret bounds will be established in Theorems 1 and 2 below). Different from Equation (11), where rewards are only implicitly encoded in the value functions, in Equation (13) rewards are explicitly involved in the error dynamics via an exponential term.
To see why Equation (13) is intuitively correct, we may divide both sides of the equation by $\beta$ and take $\beta \to 0$. By doing so, we should expect to recover quantities from the error dynamics (10) of risk-neutral RL. Since the function $\beta \mapsto (e^{\beta x} - e^{\beta y})/\beta$ satisfies $(e^{\beta x} - e^{\beta y})/\beta \to x - y$ as $\beta \to 0$ for any fixed $x$ and $y$, we have
$$\frac{e^{\beta V_h^k(s_h^k)} - e^{\beta V_h^{\pi^k}(s_h^k)}}{\beta} \;\longrightarrow\; V_h^k(s_h^k) - V_h^{\pi^k}(s_h^k), \qquad \beta \to 0,$$
so that Equation (13) reduces to a recursion of the form (10).
By comparing Equations (13) and (11), we see that while both error dynamics are derived from the same underlying Bellman equation, they inspire drastically different forms of recursion. Note that the multiplicative factor $e^{\beta r_h^k}$ in Equation (13) is milder than the factor $e^{|\beta|(H - h)}$ in Equation (11), since $r_h^k \in [0, 1]$. This is the source of the improvement of our refined analysis over existing works. On the other hand, the success of applying the error dynamics (13) in our analysis crucially depends on the choice of the bonus terms $\{b_h^k\}$, as an improper choice would blow up the error $e^{\beta V_1^k} - e^{\beta V_1^{\pi^k}}$. This observation motivates our novel bonus design, as we explain next in Section 5.
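The gap between the two recursions can be checked numerically. The accounting below is an illustration under the factors described above (a per-step amplification of roughly $e^{|\beta|(H-h)}$ in the old recursion versus at most $e^{|\beta| r_h} \le e^{|\beta|}$ in the new one); it is not a regret computation.

```python
import math

beta, H = 0.3, 10
rewards = [0.6] * H               # per-step rewards, each in [0, 1]

# Accumulated multiplicative factor when unrolling each recursion over h:
old_factor = math.exp(beta * sum(H - h for h in range(1, H + 1)))  # ~ e^{|beta| H^2 / 2}
new_factor = math.exp(beta * sum(rewards))                          # <= e^{|beta| H}

# The old analysis pays an exponentially larger amplification.
print(old_factor / new_factor)
```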
5 Algorithms

5.1 Overview of algorithms
In this section, we propose two model-free algorithms for RL with the entropic risk measure. We first present RSVI2, which is based on value iteration, in Algorithm 1. The algorithm has two main stages: it first estimates the value function using data accumulated up to episode $k$ (Lines 3–10) and then executes the estimated policy to collect a new trajectory (Line 11). In value function estimation, it computes the empirical MGF of some estimated cumulative rewards evaluated at $\beta$, which can be seen as a simple moving average over past observations. Therefore, Line 5 functions as a concrete implementation of Equation (7), where the sample average is instantiated as a simple moving average. Then in Line 7, it computes an augmented estimate by combining the average with a bonus term (defined in Line 6). This is followed by thresholding to put the estimate in the proper range. Note that the result is an optimistic estimator of the quantity $e^{\beta Q_h(s, a)}$ in Equation (5): the construction is augmented by the bonus so that it encourages exploration of rarely visited state-action pairs in future episodes, and thereby follows the principle of Risk-Sensitive Optimism in the Face of Uncertainty [20]. When $\beta < 0$, the bonus is subtracted from the estimate, since a higher level of optimism corresponds to a smaller value of the estimate. In addition, Line 11 follows the reasoning of policy improvement suggested in Equation (8).
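The value-estimation stage can be pictured as a simple moving average of exponentiated targets. The snippet below is a schematic stand-in for this update (the names, shapes, and thresholding constants are ours, not Algorithm 1's pseudocode).

```python
import math

beta = -0.5   # a risk-averse setting

def sma_mgf_estimate(targets):
    """Simple moving average of exp(beta * (r + V_next)) over all past
    visits of a state-action pair, as in a value-iteration-style update."""
    samples = [math.exp(beta * t) for t in targets]
    return sum(samples) / len(samples)

# Targets r + V_{h+1}(s') observed across past episodes for one (s, a):
targets = [1.2, 0.9, 1.4, 1.1]
w = sma_mgf_estimate(targets)

# For beta < 0, optimism means a *smaller* exp-value estimate, so the
# bonus is subtracted before thresholding to a valid range.
bonus = 0.1 / math.sqrt(len(targets))
optimistic = max(w - bonus, math.exp(beta * 4))  # crude lower threshold
print(optimistic < w)  # True: the optimistic exp-value sits below the average
```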
Next, we introduce RSQ2 in Algorithm 2, which is based on Q-learning. Similar to Algorithm 1, it consists of value estimation (Lines 8–11) and policy execution (Line 6) steps. By combining Lines 9 and 10, we see that Algorithm 2 computes the optimistic estimate as a projection of an exponential moving average of empirical MGFs:
$$e^{\beta Q_h^{k+1}(s, a)} \leftarrow \Pi_h\Big[ (1 - \alpha_t)\, e^{\beta Q_h^k(s, a)} + \alpha_t \Big( e^{\beta \big(r_h(s, a) + V_{h+1}^k(s_{h+1}^k)\big)} + \mathrm{sign}(\beta)\, b_t \Big) \Big], \qquad (15)$$
where $\alpha_t$ is a learning rate depending on the visit count $t$, $b_t$ is a bonus term, and $\Pi_h$ denotes a projection that depends on step $h$. In particular, Line 9 can be interpreted as a computation of empirical MGFs evaluated at $\beta$ and thus a concrete implementation of Equation (7) using an exponential moving average. This is in contrast with the simple moving average update in Algorithm 1.
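The exponential moving average can be sketched with the classic Q-learning-style step size $\alpha_t = (H+1)/(H+t)$, a common choice in this literature; whether Algorithm 2 uses exactly this rate is an assumption of this sketch.

```python
def ema(targets, H):
    """Exponential moving average with step size a_t = (H + 1) / (H + t),
    applied to exponentiated targets exp(beta * (r + V_next))."""
    q = 0.0
    for t, y in enumerate(targets, start=1):
        a = (H + 1) / (H + t)
        q = (1 - a) * q + a * y
    return q

# With a constant target, the running average settles on that target;
# with this step size, the first update (a_1 = 1) already jumps to it.
H = 5
print(ema([2.0] * 200, H))  # 2.0
```

Unlike the simple moving average, this update weighs recent targets more heavily, which is what makes the Q-learning-style analysis go through.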
5.2 Doubly decaying bonus
Let us focus on $\beta > 0$ for this discussion. In optimism-based algorithms, the bonus term is used to enforce the upper confidence bound in order to encourage sufficient exploration in uncertain environments. It takes the form of a multiplier times a factor that is inversely proportional to the visit count $N_h^k(s, a)$. Our bonus follows this structure and is given by
$$b_h^k \propto e^{|\beta|(H - h)} \cdot \sqrt{\frac{1}{N_h^k(s, a)}}, \qquad (16)$$
ignoring factors that do not vary in $(k, h)$. In Equation (16), the quantity $e^{|\beta|(H - h)}$ plays the role of the multiplier and $\sqrt{1/N_h^k(s, a)}$ is the factor that decreases in the visit count. While the component $\sqrt{1/N_h^k(s, a)}$ is common in bonus terms, our new bonus is designed to shrink its multiplier deterministically and exponentially across the horizon steps, as $e^{|\beta|(H - h)}$ decreases from $e^{|\beta|(H - 1)}$ in step $h = 1$ to $1$ in step $h = H$. This is in sharp contrast with the bonus terms typically found in risk-neutral RL algorithms, where the multipliers are kept constant in $h$ (usually as a constant multiple of $H$). Furthermore, our bonus design is also in contrast with that in RSVI and RSQ proposed by [20], whose multiplier is of order $e^{|\beta| H}$ and kept fixed along the horizon. Because the bonus decays both in the visit count (across episodes) and in the multiplier (across the horizon), we name it the doubly decaying bonus. We remark that this is a novel feature of Algorithms 1 and 2, compared to RSVI and RSQ. Let us discuss how this new exploration mechanism is motivated from the error dynamics (14).
Motivation of exponential decay.
From Equation (14), we see that the error of the iterate $e^{\beta V_1^k}$ is bounded by the sum of weighted bonus terms, where the weights are of the form $w_h = \exp\big(\beta \sum_{h' < h} r_{h'}^k\big)$ and satisfy $w_h \le e^{|\beta| h}$. Choosing $b_h^k \propto e^{|\beta|(H - h)}$ ensures that the weighted bonus $w_h b_h^k$ is on the order of $e^{|\beta| H}$ at maximum. On the other hand, if we use the bonus as in [20], whose multiplier is kept at the order of $e^{|\beta| H}$ for every step, then the amplification through the error dynamics would yield a multiplicative factor of order $e^{|\beta| H^2}$ in regret, which is exponentially larger than $e^{|\beta| H}$. An alternative way to understand the exponential decay of our bonus is as follows. At step $h$, the estimated value function $V_h^k$ takes values in $[0, H - h + 1]$, which implies $e^{\beta V_h^k} \le e^{|\beta|(H - h + 1)}$. The iterate $e^{\beta Q_h^k}$ (of Algorithm 1 or 2) is used to estimate the conditional expectation of $e^{\beta (r_h + V_{h+1}^k)}$, with its estimation error given by
$$\Big| e^{\beta Q_h^k(s, a)} - \widehat{\mathbb{E}}_{s'}\Big[e^{\beta \big(r_h(s, a) + V_{h+1}^k(s')\big)}\Big] \Big| \lesssim e^{|\beta|(H - h)} \cdot \sqrt{\frac{1}{N_h^k(s, a)}},$$
where $\widehat{\mathbb{E}}_{s'}$ denotes an empirical average operator over historical data in step $h$. Therefore, the estimation error of $e^{\beta Q_h^k}$ shrinks exponentially across the horizon. Since the bonus is used to compensate for and dominate the estimation error, the minimal order of $b_h^k$ required is thus $e^{|\beta|(H - h)}$, which is exactly the multiplier in Equation (16).
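The doubly decaying structure can be sketched directly; the constant in front is a placeholder, and only the $e^{|\beta|(H-h)}/\sqrt{n}$ shape is the point.

```python
import math

def doubly_decaying_bonus(beta, H, h, n, c=1.0):
    """Bonus ~ e^{|beta| (H - h)} / sqrt(n): the multiplier decays
    exponentially across horizon steps h, while the 1 / sqrt(n) factor
    decays across episodes through the visit count n."""
    return c * math.exp(abs(beta) * (H - h)) / math.sqrt(n)

beta, H, n = 0.4, 8, 25
bonuses = [doubly_decaying_bonus(beta, H, h, n) for h in range(1, H + 1)]
print(bonuses[0] > bonuses[-1])                          # shrinks across the horizon
print(doubly_decaying_bonus(beta, H, 1, 100) < bonuses[0])  # and across episodes
```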
As a passing note, we remark that the decaying multiplier is not necessary in risk-neutral RL algorithms, since the estimation error therein scales with the range $H - h + 1$ of the value function, which is upper bounded by $H$ for all $h \in [H]$. This implies that it suffices to simply set the bonus multiplier as a constant multiple of $H$. In contrast, as we have explained, the estimation error of our algorithms decays exponentially in the step $h$, and an adaptive and exponentially decaying bonus is needed.
Comparison with Bernstein-type bonus.
In risk-neutral RL, a sharper Bernstein-type bonus of the (informal) form
$$b_h^k \propto \sqrt{\frac{\widehat{\mathbb{V}}_h^k\big[V_{h+1}\big]}{N_h^k(s, a)}} + \epsilon_h^k \qquad (17)$$
is often used, where $\widehat{\mathbb{V}}_h^k$ denotes an empirical variance operator over historical data and $\epsilon_h^k$ denotes a vanishing term as $N_h^k(s, a) \to \infty$. Our bonus in Equation (16) is different from the Bernstein-type bonus in Equation (17) in mechanism: our bonus features the multiplier $e^{|\beta|(H - h)}$, which decays exponentially and deterministically over $h$, whereas the Bernstein-type bonus uses the empirical standard deviation $\sqrt{\widehat{\mathbb{V}}_h^k[V_{h+1}]}$ as the multiplier (ignoring the vanishing term). The term $\widehat{\mathbb{V}}_h^k[V_{h+1}]$ depends on the trajectory of the learning process. Therefore, the multiplier is stochastic and stays on a polynomial order of $H$ across the horizon. Moreover, it is unclear how this multiplier behaves in terms of the step $h$.
6 Main results
The proofs of the two theorems are provided in Appendices B and C, respectively. Note that the above results generalize those in the literature of risk-neutral RL: as $\beta \to 0$, we recover the same regret bounds as those of LSVI in [28] and Q-learning in [27].

Recall from Section 1 that the works of [20, 21] establish a regret upper bound (informally) of
$$\widetilde{O}\big(e^{|\beta| H^2} \cdot \sqrt{K}\big) \qquad (18)$$
and a lower bound incurred by any algorithm
$$\Omega\big(e^{c |\beta| H} \cdot \sqrt{K}\big), \qquad (19)$$
where the exponent $c |\beta| H$ is a linear function in $H$; for simplicity of presentation, we exclude polynomial dependencies on other parameters and logarithmic factors from the two bounds. In particular, the proof of the lower bound is based on reducing a hard instance of MDP to a multi-armed bandit. It is a priori unclear whether the extra exponential factor $e^{|\beta| H^2}$ in the upper bound (18) is fundamental to the MDP setting, or is due to suboptimal analysis or algorithmic design. We would like to mention that although one trivial way of avoiding the factor $e^{|\beta| H^2}$ in the upper bound (18) is to use a sufficiently small $|\beta|$ in the algorithms of [20] (e.g., so that $e^{|\beta| H^2} = O(1)$), such a small $|\beta|$ defeats the very purpose of having an appropriate degree of risk sensitivity in the algorithms. Hence, an answer for all $\beta$ would be desirable.
In view of Theorems 1 and 2, we see that our Algorithms 1 and 2 achieve regret bounds that are exponentially sharper than those of RSVI and RSQ. In particular, our results eliminate the factor of $e^{|\beta| H^2}$ from Equation (18), thanks to the novel analysis and doubly decaying bonus in our algorithms, which are inspired by the exponential Bellman equation (5). As a result, our bounds significantly narrow the gap between upper bounds and the lower bound (19).
Acknowledgments

Z. Yang acknowledges Simons Institute (Theory of Reinforcement Learning). Y. Chen is partially supported by NSF grant CCF-1704828 and CAREER Award CCF-2047910. Z. Wang acknowledges National Science Foundation (Awards 2048075, 2008827, 2015568, 1934931), Simons Institute (Theory of Reinforcement Learning), Amazon, J.P. Morgan, and Two Sigma for their support.
References

-  Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272, 2017.
-  Nicole Bäuerle and Ulrich Rieder. More risk-sensitive Markov decision processes. Mathematics of Operations Research, 39(1):105–120, 2014.
-  Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
-  Vivek S. Borkar. A sensitivity formula for risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44(5):339–346, 2001.
-  Vivek S. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294–311, 2002.
-  Vivek S. Borkar. Learning algorithms for risk-sensitive control. In Proceedings of the 19th International Symposium on Mathematical Theory of Networks and Systems–MTNS, pages 55–60, 2010.
-  Vivek S. Borkar and Sean P. Meyn. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research, 27(1):192–209, 2002.
-  Rolando Cavazos-Cadena and Daniel Hernández-Hernández. Discounted approximations for risk-sensitive average criteria in Markov decision chains with finite state space. Mathematics of Operations Research, 36(1):133–146, 2011.
-  Lin Chen, Yifei Min, Mikhail Belkin, and Amin Karbasi. Multiple descent: Design your own generalization curve. In Advances in Neural Information Processing Systems, 2021.
-  Lin Chen, Bruno Scherrer, and Peter L Bartlett. Infinite-horizon offline reinforcement learning with linear function approximation: Curse of dimensionality and algorithm. arXiv preprint arXiv:2103.09847, 2021.
-  Lin Chen and Sheng Xu. Deep neural tangent kernel and Laplace kernel have the same RKHS. In International Conference on Learning Representations, 2021.
-  Lin Chen, Qian Yu, Hannah Lawrence, and Amin Karbasi. Minimax regret of switching-constrained online convex optimization: No phase transition. In Advances in Neural Information Processing Systems, 2020.
-  Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive, minimax, and mixed risk-neutral/minimax control of Markov decision processes. In Stochastic Analysis, Control, Optimization and Applications, pages 21–40. Springer, 1999.
-  Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for distributional reinforcement learning. In International Conference on Machine Learning, pages 1096–1105. PMLR, 2018.
-  Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
-  Giovanni B. Di Masi and Lukasz Stettner. Risk-sensitive control of discrete-time Markov processes with infinite horizon. SIAM Journal on Control and Optimization, 38(1):61–78, 1999.
-  Giovanni B. Di Masi and Lukasz Stettner. Infinite horizon risk sensitive control of discrete time Markov processes with small risk. Systems & Control Letters, 40(1):15–20, 2000.
-  Giovanni B. Di Masi and Łukasz Stettner. Infinite horizon risk sensitive control of discrete time Markov processes under minorization property. SIAM Journal on Control and Optimization, 46(1):231–252, 2007.
-  Amir-massoud Farahmand. Value function in frequency domain and the characteristic value iteration algorithm. In Advances in Neural Information Processing Systems, 2019.
-  Yingjie Fei, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. In Advances in Neural Information Processing Systems, 2020.
-  Yingjie Fei, Zhuoran Yang, and Zhaoran Wang. Risk-sensitive reinforcement learning with function approximation: A debiasing approach. In International Conference on Machine Learning, pages 3198–3207. PMLR, 2021.
-  Wendell H Fleming and William M McEneaney. Risk-sensitive control on an infinite time horizon. SIAM Journal on Control and Optimization, 33(6):1881–1915, 1995.
-  Daniel Hernández-Hernández and Steven I. Marcus. Risk sensitive control of Markov processes in countable state space. Systems & Control Letters, 29(3):147–155, 1996.
-  Ronald A. Howard and James E. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
-  Wenjie Huang and William B. Haskell. Stochastic approximation for risk-aware Markov decision processes. IEEE Transactions on Automatic Control, 66(3):1314–1320, 2020.
-  Anna Jaśkiewicz. Average optimality for risk-sensitive control with general state space. The Annals of Applied Probability, 17(2):654–675, 2007.
-  Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
-  Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I. Jordan. Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388, 2019.
-  Joel Z Leibo, Cyprien de Masson d’Autume, Daniel Zoran, David Amos, Charles Beattie, Keith Anderson, Antonio García Castañeda, Manuel Sanchez, Simon Green, Audrunas Gruslys, et al. Psychlab: a psychology laboratory for deep reinforcement learning agents. arXiv preprint arXiv:1801.08116, 2018.
-  Shuyang Ling, Ruitu Xu, and Afonso S Bandeira. On the landscape of synchronization networks: A perspective from nonconvex optimization. SIAM Journal on Optimization, 29(3):1879–1907, 2019.
-  Steven I. Marcus, Emmanual Fernández-Gaucherand, Daniel Hernández-Hernandez, Stefano Coraluppi, and Pedram Fard. Risk sensitive Markov decision processes. In Systems and Control in the Twenty-first Century, pages 263–279. Springer, 1997.
-  Borislav Mavrin, Hengshuai Yao, Linglong Kong, Kaiwen Wu, and Yaoliang Yu. Distributional reinforcement learning for efficient exploration. In International Conference on Machine Learning, pages 4424–4434. PMLR, 2019.
-  Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine Learning, 49(2-3):267–290, 2002.
-  Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In International Conference on Machine Learning, 2010.
-  Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497, 2012.
-  Yael Niv, Jeffrey A. Edlund, Peter Dayan, and John P. O’Doherty. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. Journal of Neuroscience, 32(2):551–562, 2012.
-  Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.
-  Stephen D. Patek. On terminating Markov decision processes with a risk-averse objective function. Automatica, 37(9):1379–1386, 2001.
-  Mark Rowland, Marc Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An analysis of categorical distributional reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 29–37. PMLR, 2018.
-  Yun Shen, Wilhelm Stannat, and Klaus Obermayer. Risk-sensitive Markov control processes. SIAM Journal on Control and Optimization, 51(5):3652–3672, 2013.
-  Yun Shen, Michael J. Tobia, Tobias Sommer, and Klaus Obermayer. Risk-sensitive reinforcement learning. Neural Computation, 26(7):1298–1328, 2014.
-  Ganlin Song, Ruitu Xu, and John Lafferty. Convergence and alignment of gradient descent with random back propagation weights. arXiv preprint arXiv:2106.06044, 2021.
-  Peter Whittle. Risk-sensitive Optimal Control, volume 20. Wiley New York, 1990.
-  Ruitu Xu, Lin Chen, and Amin Karbasi. Meta learning in the continuous time limit. In International Conference on Artificial Intelligence and Statistics, pages 3052–3060. PMLR, 2021.
-  Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parameterized quantile function for distributional reinforcement learning. Advances in Neural Information Processing Systems, 32:6193–6202, 2019.
Appendix A Connections to distributional RL
In this appendix, we establish connections between risk-sensitive RL and distributional RL via the lens of the exponential Bellman equation.
Distributional RL has been studied in the line of works [3, 15, 32, 34, 35, 39, 19, 45]. The framework of distributional RL is built upon the following key equation, namely the distributional Bellman equation:
$$G_h^\pi(s, a) \stackrel{d}{=} R_h(s, a) + G_{h+1}^\pi(s', a'), \qquad (20)$$
for a fixed policy $\pi$, where $s' \sim \mathbb{P}_h(\cdot \mid s, a)$, $a' = \pi_{h+1}(s')$, and $R_h$ is the reward distribution in step $h$. Here, we use $\stackrel{d}{=}$ to denote equality in distribution. It can be seen that $G_h^\pi(s, a)$ is the distribution of cumulative rewards under policy $\pi$ at step $h$, when the state $s$ and action $a$ are visited in step $h$. Based on Equation (20), a distributional Bellman optimality operator is given by
$$(\mathcal{T} G_{h+1})(s, a) \stackrel{d}{=} R_h(s, a) + G_{h+1}(s', a^*), \qquad a^* \in \operatorname*{argmax}_{a' \in \mathcal{A}} \mathbb{E}\big[G_{h+1}(s', a')\big], \qquad (21)$$
where again $s' \sim \mathbb{P}_h(\cdot \mid s, a)$. Note that in Equation (21), the optimal action $a^*$ is greedy with respect to the expectation of the distribution $G_{h+1}(s', \cdot)$. Most existing distributional RL algorithms work with distribution estimates such as quantiles [15, 14] or empirical distribution functions [3, 39].
Now recall the exponential Bellman equation (5), which takes the form
$$e^{\beta Q_h^\pi(s, a)} = \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a)}\Big[e^{\beta \big(r_h(s, a) + V_{h+1}^\pi(s')\big)}\Big]$$
for any fixed $\beta \neq 0$, where $V_{h+1}^\pi(s') = Q_{h+1}^\pi\big(s', \pi_{h+1}(s')\big)$, and $r_h$ is the deterministic reward function by our assumption. Given the definitions (1) and (2), we note that both $e^{\beta V_{h+1}^\pi}$ and $e^{\beta Q_h^\pi}$ in the above equation depend on the value of $\beta$ (which we omit for simplicity of notation). Then by the definition of $Q_h^\pi$ in Equation (2), one sees that $e^{\beta Q_h^\pi(s, a)}$ represents the MGF of the cumulative rewards at step $h$ when policy $\pi$ is executed. Hence, the exponential Bellman equation for risk-sensitive RL provides an instantiation of Equation (20) through the MGF of rewards.
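This link can be checked on a toy return distribution: for a deterministic reward $r$, the MGF of $r + G'$ factorizes as $e^{\beta r}$ times the MGF of $G'$, which is exactly the multiplicative update of the exponential Bellman equation. The sampling code below is an illustration, not from the paper.

```python
import math
import random

random.seed(1)
beta, r = 0.7, 0.5

# Samples from the distribution of the future return G' (a discrete toy law).
support, probs = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3]
g_samples = random.choices(support, weights=probs, k=10_000)

def mgf(samples, beta):
    """Empirical moment-generating function evaluated at beta."""
    return sum(math.exp(beta * x) for x in samples) / len(samples)

# MGF of r + G' versus e^{beta r} times the MGF of G': identical up to
# floating-point rounding, mirroring exp(beta Q) = e^{beta r} E[exp(beta V')].
lhs = mgf([r + g for g in g_samples], beta)
rhs = math.exp(beta * r) * mgf(g_samples, beta)
print(abs(lhs - rhs))
```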
Appendix B Proof of Theorem 1
First, we set up some notation and definitions. We let $N_h^k(s, a)$ denote the visit count of $(s, a)$ at the beginning of episode $k$. We denote by $Q_h^k$, $V_h^k$, and $\pi^k$ the values of the iterates $Q_h$, $V_h$, and the policy $\pi$ after the updates in step $h$ of episode $k$, respectively. We also set $V_{H+1}^k \equiv 0$.