Model-Free Risk-Sensitive Reinforcement Learning

by Grégoire Delétang et al.

We extend temporal-difference (TD) learning in order to obtain risk-sensitive, model-free reinforcement learning algorithms. This extension can be regarded as a modification of the Rescorla-Wagner rule, where the (sigmoidal) stimulus is taken to be the event of either over- or underestimating the TD target. As a result, one obtains a stochastic approximation rule for estimating the free energy from i.i.d. samples generated by a Gaussian distribution with unknown mean and variance. Since the Gaussian free energy is known to be a certainty-equivalent sensitive to the mean and the variance, the learning rule has applications in risk-sensitive decision-making.




1 Introduction

Risk-sensitivity, i.e. the susceptibility to the higher-order moments of the return, is necessary for the real-world deployment of AI agents. Wrong assumptions, lack of data, misspecification, limited computation, and adversarial attacks are just a handful of the countless sources of unforeseen perturbations that could be present at deployment time. Such perturbations can easily destabilize risk-neutral policies, because these focus exclusively on maximizing the expected return while neglecting the variance. This poses serious safety concerns [Russell et al., 2015, Amodei et al., 2016, Leike et al., 2017].

Risk-sensitive control has a long history in control theory [Coraluppi, 1997] and is an active area of research within reinforcement learning (RL). There are multiple different approaches to risk-sensitivity in RL: for instance in Minimax RL, inspired by classical robust control theory, one derives a conservative worst-case policy over MDP parameter intervals [Nilim and El Ghaoui, 2005, Tamar et al., 2014]; and the more recent CVaR approach relies on using the conditional-value-at-risk as a robust performance measure [Galichet et al., 2013, Cassel et al., 2018]. We refer the reader to García and Fernández [2015] for a comprehensive overview. Here we focus on one of the earliest and most popular approaches (see references), consisting of the use of exponentially-transformed values, or equivalently, the free energy as the risk-sensitive certainty-equivalent [Bellman, 1957, Howard and Matheson, 1972].

The certainty-equivalent of a stochastic value X is defined as the representative deterministic value that a decision-maker uses as a summary of X for valuation purposes. To illustrate, consider a first-order Markov chain over discrete states s ∈ S with transition kernel P(s′|s), state-emitted rewards r(s), and discount factor γ ∈ [0, 1). Such a process could for instance be the result of pairing a Markov Decision Process with a stationary policy. Typically, RL methods use the expectation as the certainty-equivalent of stochastic transitions [Bertsekas and Tsitsiklis, 1995, Sutton and Barto, 2018]. Therefore they compute the value of the current state by (recursively) aggregating the future values through their expectation, e.g.

V(s) = r(s) + γ E[V(s′) | s].    (1)
Instead, Howard and Matheson [1972] proposed using the free energy as the certainty-equivalent, that is,

V(s) = r(s) + γ (1/β) log E[exp(β V(s′)) | s],    (2)

where β ∈ ℝ is the inverse-temperature parameter, which determines whether the aggregation is risk-averse (β < 0), risk-seeking (β > 0), or risk-neutral as a special case (β → 0). Indeed, if the future values are bounded, then (1/β) log E[exp(β V(s′)) | s] is sigmoidal in shape as a function of β, with three special values given by

lim_{β→−∞} = min_{s′} V(s′),    lim_{β→0} = E[V(s′) | s],    lim_{β→+∞} = max_{s′} V(s′).    (3)
These limit values highlight the sensitivity to the higher-order moments of the return. Because of this property, the free energy has been used as the certainty-equivalent for assessing the value of both actions and observations under limited control and model uncertainty respectively, each effect having its own inverse temperature. The work by Grau-Moya et al. [2016] demonstrates how to incorporate multiple such effects into MDPs.
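To make the limiting behavior concrete, the following minimal sketch (illustrative values only; `free_energy` is a hypothetical helper of ours, not from the paper) evaluates the certainty-equivalent (1/β) log E[exp(βV)] for a two-outcome value distribution at extreme and near-zero inverse temperatures:

```python
import math

def free_energy(values, probs, beta):
    # certainty-equivalent (1/beta) * log E[exp(beta * V)], cf. eq. (2)
    return math.log(sum(p * math.exp(beta * v) for p, v in zip(probs, values))) / beta

values, probs = [0.0, 2.0], [0.5, 0.5]
worst = free_energy(values, probs, -50.0)   # close to min(V) = 0
mean = free_energy(values, probs, 1e-8)     # close to E[V] = 1
best = free_energy(values, probs, 50.0)     # close to max(V) = 2
```

The three evaluations trace the sigmoidal sweep from the worst-case to the best-case aggregation as β goes from very negative to very positive.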

The present work addresses a longstanding problem pointed out by Mihatsch and Neuneier [2002]. An advantage of using expectations is that certainty-equivalents such as (1) are easily estimated using stochastic approximation schemes. For instance, consider the classical Robbins-Monro update [Robbins and Monro, 1951]

v_{t+1} = v_t + α_t (x_t − v_t),    (4)

where x_t is a stochastic target value, α_t > 0 is a learning rate, and v_t is the estimate of E[x]. Substituting x_t := r_t + γ V(s_{t+1}) and v_t := V(s_t) leads to the popular TD(0) update [Sutton and Barto, 1990]:

V(s_t) ← V(s_t) + α_t (r_t + γ V(s_{t+1}) − V(s_t)).    (5)
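As a concrete illustration of this update, consider a minimal tabular TD(0) sketch on a deterministic two-state chain (a hypothetical toy setup of ours, not an experiment from the paper); the estimates converge to the Bellman fixed point V(s₁) = 1 and V(s₀) = 1 + γ V(s₁) = 1.9:

```python
def td0(sweeps=2000, alpha=0.1, gamma=0.9):
    # Deterministic chain s0 -> s1 -> terminal, reward 1 on each transition.
    V = {0: 0.0, 1: 0.0}
    for _ in range(sweeps):
        V[0] += alpha * (1.0 + gamma * V[1] - V[0])  # TD(0) update for s0
        V[1] += alpha * (1.0 + gamma * 0.0 - V[1])   # successor of s1 is terminal
    return V

V = td0()  # V[1] -> 1.0, V[0] -> 1.9
```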
However, there is no model-free counterpart for estimating free energies (2) under general unknown distributions. The difficulty lies in that model-free updates rely on single (Monte-Carlo) unbiased samples, but these are not available in the case of the free energy due to the log-term on the r.h.s. of (2). This shortcoming led Mihatsch and Neuneier [2002] to propose the alternative risk-sensitive learning rule

V(s_t) ← V(s_t) + α_t χ_κ(δ_t) δ_t,  where χ_κ(δ) = (1 − κ) if δ > 0 and (1 + κ) otherwise,    (6)

and where δ_t := r_t + γ V(s_{t+1}) − V(s_t) is the TD error and κ ∈ (−1, 1) is a risk-sensitivity parameter. While the heuristic (6) does produce risk-sensitive policies, these have no formal correspondence to free energies.


Our work contributes a simple model-free rule for risk-sensitive reinforcement learning. Starting from the Rescorla-Wagner rule

v_{t+1} = v_t + α x_t (λ_t − v_t),    (7)

where x_t ∈ {0, 1} is an indicator function marking the presence of a stimulus [Rescorla, 1972], we substitute x_t by twice the soft-indicator function of (11), which activates whenever v_t either over- or underestimates the target value λ_t, depending on the sign of the risk-sensitivity parameter β. Using the substitutions appropriate for RL, we obtain the risk-sensitive TD(0)-rule

V(s_t) ← V(s_t) + α_t · 2σ_β(δ_t) · δ_t,    (8)

where δ_t = r_t + γ V(s_{t+1}) − V(s_t) is the standard temporal-difference error. We show the following surprising result: in the special case when the target is Gaussian, the fixed point of this rule is precisely equal to the free energy.
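In tabular form, the rule amounts to a one-line change to TD(0): the TD error is re-weighted by twice the soft indicator. The sketch below is a hypothetical one-state task with Gaussian rewards (names, seeds, and constants are illustrative choices of ours); the learned values should order as risk-averse < risk-neutral < risk-seeking:

```python
import math
import random

def soft_indicator(beta, delta):
    # sigma_beta(delta) = 1 / (1 + exp(-beta * delta))
    return 1.0 / (1.0 + math.exp(-beta * delta))

def risk_sensitive_td0(beta, n=300_000, alpha=0.005, seed=0):
    # One-step episodes: a single state emits r ~ N(1, 1) and terminates,
    # so V estimates a certainty-equivalent of the reward distribution.
    rng = random.Random(seed)
    V = 0.0
    for _ in range(n):
        delta = rng.gauss(1.0, 1.0) - V            # TD error (terminal successor)
        V += alpha * 2.0 * soft_indicator(beta, delta) * delta
    return V

v_averse, v_neutral, v_seeking = (risk_sensitive_td0(b) for b in (-1.0, 0.0, 1.0))
```

With β = 0 the soft indicator equals 1/2 and the rule reduces exactly to the risk-neutral TD(0) update.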

The learning rule is trivial to implement, works as stated for tabular RL, and is easily adapted to the objective functions of deep RL methods [Mnih et al., 2015]. The learning rule is also consistent with findings in computational neuroscience [Niv et al., 2012], e.g. predicting asymmetric updates that are stronger for negative prediction errors in the risk-averse case [Gershman, 2015].

2 Analysis of the Learning Rule

Let N(x; μ, ρ⁻¹) denote the Gaussian pdf with mean μ and precision ρ. Given a sequence x₁, x₂, … of i.i.d. samples drawn from N(x; μ, ρ⁻¹) with unknown μ and ρ, consider the problem of estimating the free energy for a given inverse temperature β, that is,

F_β := (1/β) log E[exp(β x)] = μ + β/(2ρ).    (9)

We show that (9) can be estimated using the following stochastic approximation rule. If v_t is the current estimate and a new sample x_t arrives, update v_t according to

v_{t+1} = v_t + α_t · 2σ_β(x_t − v_t) · (x_t − v_t),    (10)

where α_t > 0 is a learning rate and σ_β is the scaled logistic sigmoid

σ_β(z) := 1 / (1 + exp(−β z)).    (11)
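The rule can be exercised directly on synthetic data. The following sketch (helper name, seed, and constants are illustrative choices of ours) runs the update on i.i.d. Gaussian samples and compares a tail-averaged estimate against the closed form μ + βσ²/2 from (9):

```python
import math
import random

def estimate_free_energy(mu, sigma, beta, n=200_000, alpha=0.02, seed=1):
    # learning rule (10): v <- v + alpha * 2 * sigma_beta(x - v) * (x - v)
    rng = random.Random(seed)
    v, tail = 0.0, 0.0
    for t in range(n):
        x = rng.gauss(mu, sigma)
        v += alpha * 2.0 / (1.0 + math.exp(-beta * (x - v))) * (x - v)
        if t >= n // 2:
            tail += v
    return tail / (n - n // 2)   # tail average smooths residual fluctuations

beta = 1.0
estimate = estimate_free_energy(mu=0.0, sigma=1.0, beta=beta)
closed_form = 0.0 + beta * 1.0 ** 2 / 2.0   # mu + beta * sigma^2 / 2
```

Averaging the second half of the trajectory is only a variance-reduction convenience for a constant learning rate; a Robbins-Monro schedule would converge without it.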
The next lemma shows that the unique and stable fixed point of the learning rule (10) is equal to the desired free energy value F_β.

Lemma 1.

If x₁, x₂, … are i.i.d. samples from N(x; μ, ρ⁻¹), then the expected update of the learning rule (10) is twice differentiable and such that

E[Δv_t] > 0 if v_t < F_β,    E[Δv_t] = 0 if v_t = F_β,    E[Δv_t] < 0 if v_t > F_β,

where Δv_t := v_{t+1} − v_t.

Proof.

The expected update of v is

E[Δv] = α ∫ 2σ_β(x − v) (x − v) N(x; μ, ρ⁻¹) dx,    (12)

where we have dropped the time subscript t for simplicity. Using the Leibniz integral rule it is easily seen that this function is twice differentiable w.r.t. v, because the integrand is a product of twice differentiable functions.

Figure 1: Update rule and its error function. a) The update 2σ_β(δ)·δ to the estimate v caused by the arrival of a sample x, weighted by its probability density. The expected update is determined by comparing the integrals of the positive and negative lobes. b) Illustration of weighted update functions for different values of the current estimate v. The positive lobes are either larger, equal, or smaller than the negative lobes for a v that is either smaller, equal, or larger than the free energy F_β, respectively. c) Error function implied by the update rule. For a risk-neutral (β → 0) estimator the error function is equal to the quadratic error ℓ₀(δ) = δ²/2. For a risk-averse estimator (β < 0), the error function is lopsided, penalizing over-estimates more strongly than under-estimates. Furthermore, ℓ_β(δ) is even under a joint sign change of β and δ, i.e. ℓ_β(δ) = ℓ_{−β}(−δ).

The resulting update direction will be positive if the integral over the positive contributions outweighs the negative contributions, and vice versa. The integrand of (12) has a symmetry property: splitting the domain of integration into (−∞, v] and [v, ∞), using the change of variable δ = x − v, and recombining the two integrals into one gives

E[Δv] = α ∫₀^∞ [2σ_β(δ) δ N(v + δ; μ, ρ⁻¹) − 2σ_β(−δ) δ N(v − δ; μ, ρ⁻¹)] dδ.    (13)

We will show that the integrand of (13) is either negative, zero, or positive, depending on the value of v. Define the weighted update u(δ) as

u(δ) := 2σ_β(δ) δ N(v + δ; μ, ρ⁻¹).

This function is illustrated in Figure 1a. We are interested in the ratio

u(δ) / (−u(−δ)) = [N(v + δ; μ, ρ⁻¹) / N(v − δ; μ, ρ⁻¹)] · [σ_β(δ) / σ_β(−δ)],    (14)

which compares the positive against the negative contributions to the integrand in (13). The first fraction on the r.h.s. of (14) is equal to

N(v + δ; μ, ρ⁻¹) / N(v − δ; μ, ρ⁻¹) = exp(2ρδ(μ − v)).

Using the symmetry property

σ_β(z) = exp(βz) σ_β(−z)

of the logistic sigmoid function, the second fraction can be shown to be equal to

σ_β(δ) / σ_β(−δ) = exp(βδ).

Substituting the above back into (14) results in

u(δ) / (−u(−δ)) = exp(δ(2ρ(μ − v) + β)) = exp(2ρδ(F_β − v)),

also illustrated in Figure 1b. Therefore, the integrand in (13) is either positive (v < F_β), zero (v = F_β), or negative (v > F_β) for all δ > 0, allowing us to conclude the claim of the lemma. ∎
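The ratio identity at the heart of the proof can be checked numerically. In the sketch below (helper names and parameter values are illustrative choices of ours), the ratio of the positive to the negative lobe is compared against exp(2ρδ(F_β − v)):

```python
import math

def gauss_pdf(x, mean, rho):
    # Gaussian density with precision rho (variance 1/rho)
    return math.sqrt(rho / (2 * math.pi)) * math.exp(-0.5 * rho * (x - mean) ** 2)

def weighted_update(delta, v, mu, rho, beta):
    # u(delta) = 2 * sigma_beta(delta) * delta * N(v + delta; mu, 1/rho)
    return 2.0 / (1.0 + math.exp(-beta * delta)) * delta * gauss_pdf(v + delta, mu, rho)

mu, rho, beta, v = 0.3, 2.0, -0.8, 1.1
F = mu + beta / (2 * rho)   # Gaussian free energy, eq. (9)

def lobe_ratio(delta):
    return weighted_update(delta, v, mu, rho, beta) / -weighted_update(-delta, v, mu, rho, beta)
```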

3 Additional Properties

We discuss additional properties in order to strengthen the intuition and to clarify the significance of the learning rule; some practical implementation advice is given at the end.

Associated free energy functional.

The Gaussian free energy in (9) is formally related to the valuation of risk-sensitive portfolios used in finance [Markowitz, 1952]. It is well-known that the free energy is the extremum of the free energy functional, defined as the Kullback-Leibler-regularized expectation of x:

J_β[p] := E_{x∼p}[x] − (1/β) KL(p ‖ q),    (15)

where q(x) = N(x; μ, ρ⁻¹) is the reference density. This functional is convex in p for β < 0 and concave for β > 0. Taking either the minimum (for β < 0) or the maximum (for β > 0) w.r.t. p yields

extr_p J_β[p] = F_β = μ + (β/2) ρ⁻¹,    (16)

that is, the Gaussian free energy is a linear function of β, where the intercept and the slope are equal to the expectation and half of the variance of x respectively. The extremizer is the Gaussian

p*(x) = q(x) exp(βx) / E_q[exp(βx)] = N(x; μ + β/ρ, ρ⁻¹).    (17)
The above gives a precise meaning to the free energy as a certainty-equivalent. The choice of a non-zero inverse temperature β reflects a distrust in the reference probability density q as a reliable model for x. Specifically, the magnitude of β quantifies the degree of distrust, and the sign of β indicates whether q is deemed to over- or underestimate the value. This distrust results in using the extremizer (17) as a robust substitute for the original reference model q.
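These identities are easy to verify in closed form for Gaussians sharing a variance, using KL(N(m, σ²) ‖ N(μ, σ²)) = (m − μ)²/(2σ²). The sketch below (helper names and constants are illustrative choices of ours) confirms that the functional evaluated at the extremizer's mean recovers μ + βσ²/2:

```python
def functional_J(m, mu, sigma, beta):
    # J(p) = E_p[x] - (1/beta) * KL(p || q) for p = N(m, sigma^2), q = N(mu, sigma^2)
    kl = (m - mu) ** 2 / (2.0 * sigma ** 2)
    return m - kl / beta

mu, sigma, beta = 0.5, 2.0, -0.7            # risk-averse example
extremizer_mean = mu + beta * sigma ** 2    # mean of the extremizer (17)
free_energy = mu + beta * sigma ** 2 / 2.0  # eq. (16)
```

For β < 0 the functional is convex, so perturbing the mean away from the extremizer can only increase its value.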

Game-theoretic interpretation.

In addition to the above, previous work [Ortega and Lee, 2014, Eysenbach and Levine, 2019, Husain et al., 2021] has shown that the free energy functional has an interpretation as a two-player game which characterizes its robustness properties. Following Ortega and Lee [2014], computing the Legendre-Fenchel dual of the KL regularizer yields an equivalent adversarial re-statement of the free energy functional (15), which for β > 0 is given by

F_β = max_p min_ν { E_{x∼p}[x + ν(x)] + (1/β) E_{x∼q}[exp(−β ν(x)) − 1] },    (18)

where the perturbations ν(x) are chosen by an adversary (Note: for the case β < 0 one obtains a Minimax problem over p and ν rather than a Maximin). From this dual interpretation, one sees that the distribution p is chosen as if it were maximizing the expected value of x̃ := x + ν(x), the adversarially perturbed version of x. In turn, the adversary attempts to minimize E_p[x̃], but at the cost of an exponential penalty for ν. More precisely, given the distribution p, the adversarial best-response (ignoring constants) is

ν*(x) = −(1/β) log(p(x)/q(x))    (a)
      = (1/β) [(ρ̂/2)(x − m)² − (ρ/2)(x − μ)²]    (b)
      = −x    (c),    (19)

where the equality (a) is true for any choice of p; (b) holds if p(x) = N(x; m, ρ̂⁻¹) for some mean m and precision ρ̂; and (c) holds (up to a constant) if p is the extremizer (17). Here we see that the adversarial perturbations can be arbitrarily bad if p is not chosen cautiously: for instance, for the (Gaussian) Dirac delta p(x) = lim_{ρ̂→∞} N(x; x₀, ρ̂⁻¹), the expected perturbed value degenerates to

E_p[x + ν*(x)] = x₀ − (1/β) KL(p ‖ q) → −∞.    (20)
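One robustness property implied by this reading can be verified directly: assuming the adversarial best response has the exponential-tilt form ν*(x) = −(1/β) log(p(x)/q(x)) (an assumption consistent with the KL duality discussed above) and p is the Gaussian extremizer, the perturbed value x + ν*(x) collapses to the constant F_β. A numeric sketch (helper names and constants are ours):

```python
import math

def log_gauss(x, mean, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)

mu, sigma, beta = 0.0, 1.0, 2.0
free_energy = mu + beta * sigma ** 2 / 2.0
p_star_mean = mu + beta * sigma ** 2        # mean of the extremizer (17)

def nu_star(x):
    # assumed adversarial best response: -(1/beta) * log(p*(x) / q(x))
    return -(log_gauss(x, p_star_mean, sigma) - log_gauss(x, mu, sigma)) / beta
```

Because the perturbed value is constant, the agent playing the extremizer is fully hedged against the adversary.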
Error function.

Let δ := x − v be the instantaneous difference between the sample and the estimate. If the update rule (10) corresponds to a stochastic gradient descent step, then what is the error function? That is, if

Δv = −α ∂ℓ_β(δ)/∂v = α · 2σ_β(δ) · δ,

then what is ℓ_β? Integrating the gradient with respect to δ gives

ℓ_β(δ) = (2/β) δ ζ(βδ) + (2/β²) Li₂(−exp(βδ)) + π²/(6β²),    (21)

where ζ(z) := log(1 + exp(z)) is the softplus function [Dugas et al., 2001] and Li₂ is Spence's function (or dilogarithm) defined as

Li₂(z) := −∫₀^z (log(1 − u)/u) du,

and where the constant of integration was chosen so that ℓ_β(0) = 0 for all β. This error function is illustrated in Figure 1c for a handful of values of β. In the limit β → 0, the error function becomes

lim_{β→0} ℓ_β(δ) = δ²/2,

thus establishing a connection between the quadratic error and the proposed learning rule.
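Since the dilogarithm is not in the Python standard library, the sketch below instead recovers ℓ_β numerically by integrating its gradient 2σ_β(u)·u with the trapezoidal rule (helper names are ours), which suffices to check the quadratic limit and the sign symmetry ℓ_β(δ) = ℓ_{−β}(−δ):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(beta, delta, steps=20_000):
    # l_beta(delta) = integral_0^delta 2 * sigma(beta * u) * u du (trapezoidal rule)
    h = delta / steps
    total = 0.0
    for i in range(steps + 1):
        u = i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * 2.0 * sigmoid(beta * u) * u
    return total * h
```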

Practical considerations.

The free energy learning rule (10) can be implemented as stated, for instance either using a constant learning rate α_t = α or an adaptive learning rate fulfilling the Robbins-Monro conditions Σ_t α_t = ∞ and Σ_t α_t² < ∞.

A problem arises when most of the data falls within the near-zero saturated region of the sigmoid, which can occur due to an unfortunate initialization of the estimate v₀. Since then σ_β(x_t − v_t) ≈ 0 for most samples x_t, learning can be very slow. This problem can be mitigated using an affine transformation of the sigmoid that guarantees a minimal rate ε > 0, such as

σ̃_β(z) := ε + (1 − 2ε) σ_β(z),    (22)

which re-scales the sigmoid within the interval [ε, 1 − ε]. We have found this adjustment to work well for small values of ε, especially when it is only used during the first few iterations.
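A minimal sketch of the adjustment (the ε value is an illustrative choice of ours):

```python
import math

def sigma_beta(beta, z):
    return 1.0 / (1.0 + math.exp(-beta * z))

def sigma_rescaled(beta, z, eps=0.05):
    # affine transformation (22): keeps the update weight within [eps, 1 - eps]
    return eps + (1.0 - 2.0 * eps) * sigma_beta(beta, z)
```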

If one wishes to use the learning rule in combination with gradient-based optimization (as is typical in a deep learning architecture), we do not recommend using the error function (21) directly. Rather, we suggest absorbing the factor 2σ_β(δ_t) directly into the learning rate (where as before, δ_t = r_t + γ V(s_{t+1}) − V(s_t)). A simple way to achieve this consists in scaling the squared estimation error by said factor using a stop-gradient, that is,

ℓ_t := (1/2) sg[2σ_β(δ_t)] δ_t²,    (23)

since then the error gradient with respect to the model parameters θ will be

∇_θ ℓ_t = −2σ_β(δ_t) δ_t ∇_θ V_θ(s_t),    (24)

where sg[·] denotes the stop-gradient operator and the bootstrap target r_t + γ V(s_{t+1}) is treated as a constant.
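The effect of the stop-gradient can be checked without any deep learning framework: freeze the sigmoidal factor, differentiate the scaled squared error, and compare against the analytic gradient. A sketch for a linear value function V_θ(s) = θ·φ(s) (all names and constants are illustrative assumptions of ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta, gamma = -1.0, 0.9
phi, r, next_value = 2.0, 1.0, 3.0   # feature, reward, fixed bootstrap value

def loss(theta, factor):
    # 0.5 * sg[2 sigma_beta(delta)] * delta^2 with the factor held constant
    delta = r + gamma * next_value - theta * phi
    return 0.5 * factor * delta ** 2

theta = 0.5
delta = r + gamma * next_value - theta * phi
factor = 2.0 * sigmoid(beta * delta)        # the stop-gradient freezes this term
analytic_grad = -factor * delta * phi       # gradient w.r.t. theta
h = 1e-6
numeric_grad = (loss(theta + h, factor) - loss(theta - h, factor)) / (2 * h)
```

Because the factor is frozen, the gradient is exactly the risk-neutral TD gradient re-weighted by the soft indicator.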
Finally, a large |β| chooses a target free energy within a tail of the distribution, leading to slower convergence. If one wishes to approximate a free energy that sits at k standard deviations from the mean, i.e. F_β = μ + kσ, then β should be chosen as

β = 2k/σ = 2k√ρ.    (25)

However, since β is not scale invariant and the scale is unknown, a good choice of β must in practice be determined empirically.
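For instance, to place the target k standard deviations from the mean (illustrative numbers of ours):

```python
def beta_for_k_std(k, sigma):
    # beta = 2k / sigma places the Gaussian free energy at mu + k * sigma
    return 2.0 * k / sigma

mu, sigma, k = 1.0, 2.0, -1.5
beta = beta_for_k_std(k, sigma)
target = mu + beta * sigma ** 2 / 2.0   # equals mu + k * sigma
```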

4 Experiments


Estimation.

Our first experiment is a simple sanity check. We estimated the free energy in an online manner using the learning rule (10) from data generated by two i.i.d. sources: a standard Gaussian, and a uniform distribution over a fixed interval. Five different inverse temperatures β, ranging from risk-averse to risk-seeking, were used. For each condition, we ran ten estimation processes from 4000 random samples using the same starting point. The learning rate was constant.

Figure 2: Estimation of the free energy from Gaussian (left panel) and uniform samples (right panel). Each plot shows 10 estimation processes (9 in pink, 1 in red) per choice of the inverse temperature β. The true free energies are shown in black. The estimation of the free energy is accurate for Gaussian data but biased for uniform data.

The results are shown in Figure 2. In the Gaussian case, the estimation processes successfully stabilize around the true free energies, with processes having larger |β| converging more slowly but fluctuating less. In the uniform case, the estimation processes do not settle around the correct free energy values for β ≠ 0; however, the found solutions still increase monotonically with β. These results validate the estimation method for Gaussian data only, as expected.

Reinforcement learning.

Next we applied the risk-sensitive learning rule to RL in a simple grid-world. The goal was to qualitatively investigate the types of policies that result from different risk-sensitivities. Shown in Figure 3a, the objective of the agent is to navigate to a terminal state containing a reward pill within no more than 25 time steps while avoiding the water. The reward pill delivers one reward point upon collection, whereas standing in the water penalizes the agent with minus one reward point per time step. In addition, there is a very strong wind: with 50% chance in each step, the wind pushes the agent one block in a randomly chosen cardinal direction.

Figure 3: Comparison of risk-sensitive RL agents. a) The task consists in picking up a reward located at the terminal state while avoiding stepping into the water. A strong wind pushes the agent in a random direction 50% of the time. b) Bar plots showing the average return (blue) and the percentage of violations (red) for each policy, ordered from lowest to highest β. c) State visitation frequencies for each policy, plus the optimal (deterministic) policy when there is no wind (black paths).

We trained R2D2 [Kapturowski et al., 2018] agents with the risk-sensitive cost function (23) using five uniformly spaced inverse temperatures ranging from risk-averse to risk-seeking. The architecture of our agents consisted of a first convolutional layer with 128 channels, a dense layer, and a logit layer for the four possible actions (i.e. walking directions). The discount factor was held fixed across agents. Each agent was trained for 500K iterations with a batch size of 64, using the Adam optimizer [Kingma and Ba, 2014] with a constant learning rate. The target network was updated every 400 steps. The inputs to the network were observation tensors of binary features representing the 2D board. Note that these agents did not use any recurrent cells, and therefore no backpropagation through time was used. To train all the agents in this experiment we used 154 CPU core hours at 2.4 GHz and 22.5 GPU hours.

To analyze the resulting policies, we computed the episodic returns and the percentage of time the agents spent in the water (i.e. the "violations") from 1000 roll-outs. The results, shown in Figure 3b, reveal that the risk-neutral policy (β = 0) has the highest average return. However, the percentage of violations increases monotonically with β. Figure 3c shows the state-visitation probabilities estimated from the same roll-outs. There are essentially three types of policies: risk-averse, taking the longest path away from the water; risk-neutral, taking the middle path; and risk-seeking, taking the shortest route right next to the water. These are revealed even more crisply when the wind is de-activated. Interestingly, the most risk-averse policy does not always reach the goal, which explains why its return is slightly lower in spite of committing fewer violations.


Bandits.

In the last experiment we wanted to observe the premiums that risk-sensitive agents are willing to pay when confronted with a choice between a certain and a risky option. To do so, we used a two-arm bandit setup, where one arm ("certain") delivered a fixed reward and the other arm ("risky") a stochastic one, drawn from a Gaussian distribution with fixed precision. Both the fixed payoff and the mean of the risky arm were drawn from a standard Gaussian distribution at the beginning of each episode, which lasted twenty rounds. To build agents that can trade off exploration versus exploitation, we used memory-based meta-learning [Wang et al., 2016, Santoro et al., 2016], which is known to produce near-optimal bandit players [Ortega et al., 2019, Mikulik et al., 2020].

We meta-trained five R2D2 agents with different risk-sensitivities β on the two-armed bandit task distribution (also randomizing the certain/risky arm positions), using a fixed discount factor. The network architecture and training parameters were as in the previous RL experiment, with the difference that the initial convolutional layer was replaced with a dense layer and an LSTM layer with 128 memory cells [Hochreiter and Schmidhuber, 1997]. We used backpropagation through time for computing the episode gradients. The input to the network consisted of the action taken and the reward obtained in the previous step. This setup allows agents to adapt their choices to past interactions throughout an episode. To train all the agents in this experiment we used 88 CPU core hours at 2.4 GHz and 10 GPU hours.

Figure 4: Two-armed bandit policy profiles for different risk-sensitivities β. The certain arm 1 pays a deterministic reward, while the risky arm 2 pays a stochastic reward drawn from a Gaussian with fixed precision. The agents were meta-trained on bandits where the payoffs (i.e. arm 1's payoff and arm 2's mean) were drawn from a standard Gaussian distribution. The plots show the marginal probability of choosing the certain arm (blue) over the risky arm (red) after twenty interactions for every payoff combination. Each point in the uniform grid was estimated from 30 seeds. Note the deviations from the true risk-neutral indifference curve (black diagonal).

Figure 4 shows the agents' choice profile in the last time step of an episode (t = 20). A true risk-neutral agent does not distinguish between a certain and a risky option that have the same expected payoff (black diagonal). The main finding is that the indifference region (i.e. close to a 50% choice, in white) shifts significantly with increasing β, implying that agents with different risk attitudes are indeed willing to pay different risk premia (measured as the vertical distance of the indifference region from the diagonal). We observe two effects. The most salient effect is that the indifference region mostly moves from beneath (risk-averse) to above (risk-seeking) the true risk-neutral indifference curve as β increases. The second effect is that the most risk-averse policies contain a large region of a stochastic choice profile that appears to depend only on the risky arm's payoff parameter. We do not have a clear explanation for this effect. Our hypothesis is that risk-averse policies assume adversarial environments, which require playing mixed strategies with precise probabilities. Finally, the nominally risk-neutral agent (β = 0) appears to be slightly risk-averse. We believe that this effect arises due to the noisy exploration policy employed during training.

5 Discussion

Summary of contributions.

In this work we have introduced a learning rule for the online estimation of the Gaussian free energy with unknown mean and precision/variance. The learning rule (10) is obtained by reinterpreting the stimulus-presence indicator component of the Rescorla-Wagner rule [Rescorla, 1972] as a (soft) indicator function for the event of either over- or underestimating the target value. In Lemma 1 we have shown that the free energy is the unique and stable fixed point of the expected learning dynamics. This is the main contribution.

Furthermore, we have shown how to use the learning rule for risk-sensitive RL. Since the free energy implements certainty-equivalents that range from risk-averse to risk-seeking, we were able to formulate a risk-sensitive, model-free update in the spirit of TD(0) [Sutton and Barto, 1990], thereby addressing a longstanding problem [Mihatsch and Neuneier, 2002] for the special case of the Gaussian distribution. Due to its simplicity, the rule is easy to incorporate into existing deep RL algorithms, for instance by modifying the error using a stop-gradient as shown in (23). In Section 3 we also elaborated on the role of the free energy within decision-making, pointing out its robustness properties and adversarial interpretation.

We also demonstrated the learning rule in experiments. Firstly, we empirically confirmed that the online estimates stabilize around the correct Gaussian free energies (Section 4–Estimation). Secondly, we showed how incorporating risk-attitudes into deep RL can lead to agents implementing qualitatively different policies which intuitively make sense (Section 4–RL). Lastly, we inspected the premia risk-sensitive agents are willing to pay for choosing a risky over a certain option, finding that agents have choice patterns that are more complex than we had anticipated (Section 4–Bandits).


Limitations.

As shown empirically in Section 4–Estimation, an important limitation of the learning rule is that its fixed point equals the free energy only when the samples are Gaussian (or approximately Gaussian, as justified by the central limit theorem). Nevertheless, agents using the risk-sensitive TD(0) update (8) still display risk attitudes monotonic in β, with β = 0 reducing to the familiar risk-neutral case.

While Lemma 1 establishes the stable equilibrium of the expected update, it only guarantees convergence of the idealized continuous-time dynamics. To show convergence using discrete-time point samples, a stronger result is required. In particular, we conjecture that the expected update f(v) := E[2σ_β(x − v)(x − v)] satisfies

|f(v) − f(v′)| ≤ 2 |v − v′|  for all v, v′.    (26)

If (26) is true, meaning that f is 2-Lipschitz, then this could be combined with a result in stochastic approximation theory akin to Theorem 1 in Jaakkola et al. [1994] to prove convergence.

A shortcoming of our experiments using R2D2 agents is that they deterministically pick actions that maximize the Q-value. However, risk-averse agents see their environments as adversarial, and adversarial environments in turn require stochastic policies in order to achieve optimal performance.


Because it is impossible to anticipate the many ways in which a dynamically-changing environment will violate prior assumptions, requiring the robustness of ML algorithms is of vital importance for their deployment in real-world applications. Unforeseen events can render their decisions unreliable—and in some cases even unsafe.

Our work makes a small but nonetheless significant contribution to risk-sensitivity in ML. In essence, it suggests a minor modification to existing algorithms, biasing valuation estimates in a risk-sensitive manner. In particular, we expect the risk-sensitive TD(0)-learning rule to become an integral part of future deep RL algorithms.


  • Amodei et al. [2016] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
  • Bellman [1957] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
  • Bertsekas and Tsitsiklis [1995] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming: an overview. In Proceedings of 1995 34th IEEE conference on decision and control, volume 1, pages 560–564. IEEE, 1995.
  • Cassel et al. [2018] A. Cassel, S. Mannor, and A. Zeevi. A general approach to multi-armed bandits under risk criteria. In Conference On Learning Theory, pages 1295–1306. PMLR, 2018.
  • Coraluppi [1997] S. P. Coraluppi. Optimal control of Markov decision processes for performance and robustness. University of Maryland, College Park, 1997.
  • Dugas et al. [2001] C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating second-order functional knowledge for better option pricing. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.
  • Eysenbach and Levine [2019] B. Eysenbach and S. Levine. If MaxEnt RL is the answer, what is the question?, 2019.
  • Galichet et al. [2013] N. Galichet, M. Sebag, and O. Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260. PMLR, 2013.
  • García and Fernández [2015] J. García and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • Gershman [2015] S. J. Gershman. Do learning rates adapt to the distribution of rewards? Psychonomic Bulletin & Review, 22(5):1320–1327, 2015.
  • Grau-Moya et al. [2016] J. Grau-Moya, F. Leibfried, T. Genewein, and D. A. Braun. Planning with information-processing constraints and model uncertainty in Markov decision processes. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 475–491. Springer, 2016.
  • Hochreiter and Schmidhuber [1997] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Howard and Matheson [1972] R. A. Howard and J. E. Matheson. Risk-sensitive Markov decision processes. Management science, 18(7):356–369, 1972.
  • Husain et al. [2021] H. Husain, K. Ciosek, and R. Tomioka. Regularized policies are reward robust. In International Conference on Artificial Intelligence and Statistics, pages 64–72. PMLR, 2021.
  • Jaakkola et al. [1994] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural computation, 6(6):1185–1201, 1994.
  • Kappen [2005] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of statistical mechanics: theory and experiment, 2005(11):P11011, 2005.
  • Kappen et al. [2012] H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
  • Kapturowski et al. [2018] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney. Recurrent experience replay in distributed reinforcement learning. In International conference on learning representations, 2018.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Leike et al. [2017] J. Leike, M. Martic, V. Krakovna, P. A. Ortega, T. Everitt, A. Lefrancq, L. Orseau, and S. Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017.
  • Markowitz [1952] H. Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, 1952.
  • Mihatsch and Neuneier [2002] O. Mihatsch and R. Neuneier. Risk-sensitive reinforcement learning. Machine learning, 49(2):267–290, 2002.
  • Mikulik et al. [2020] V. Mikulik, G. Delétang, T. McGrath, T. Genewein, M. Martic, S. Legg, and P. A. Ortega. Meta-trained agents implement bayes-optimal agents. arXiv preprint arXiv:2010.11223, 2020.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Nilim and El Ghaoui [2005] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
  • Niv et al. [2012] Y. Niv, J. A. Edlund, P. Dayan, and J. P. O’Doherty. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. Journal of Neuroscience, 32(2):551–562, 2012.
  • Ortega and Braun [2011] D. A. Ortega and P. A. Braun. Information, utility and bounded rationality. In International Conference on Artificial General Intelligence, pages 269–274. Springer, 2011.
  • Ortega and Braun [2013] P. A. Ortega and D. A. Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153):20120683, 2013.
  • Ortega and Lee [2014] P. A. Ortega and D. Lee. An adversarial interpretation of information-theoretic bounded rationality. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014.
  • Ortega et al. [2019] P. A. Ortega, J. X. Wang, M. Rowland, T. Genewein, Z. Kurth-Nelson, R. Pascanu, N. Heess, J. Veness, A. Pritzel, P. Sprechmann, et al. Meta-learning of sequential strategies. arXiv preprint arXiv:1905.03030, 2019.
  • Rescorla [1972] R. A. Rescorla. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Current research and theory, pages 64–99, 1972.
  • Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Russell et al. [2015] S. Russell, D. Dewey, and M. Tegmark. Research priorities for robust and beneficial artificial intelligence. Ai Magazine, 36(4):105–114, 2015.
  • Santoro et al. [2016] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850. PMLR, 2016.
  • Sutton and Barto [1990] R. S. Sutton and A. G. Barto. Time-derivative models of Pavlovian reinforcement. In Learning and Computational Neuroscience: Foundations of Adaptive Networks, pages 497–537. MIT Press, 1990.
  • Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • Tamar et al. [2014] A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189. PMLR, 2014.
  • Theodorou et al. [2010] E. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137–3181, 2010.
  • Tishby and Polani [2011] N. Tishby and D. Polani. Information theory of decisions and actions. In Perception-action cycle, pages 601–636. Springer, 2011.
  • Todorov [2007] E. Todorov. Linearly-solvable Markov decision problems. In Advances in neural information processing systems, pages 1369–1376, 2007.
  • Toussaint [2009] M. Toussaint. Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pages 1049–1056, 2009.
  • Wang et al. [2016] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • Ziebart et al. [2008] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.