Making Sense of Reinforcement Learning and Probabilistic Inference

01/03/2020 ∙ by Brendan O'Donoghue, et al. ∙ Google

Reinforcement learning (RL) combines a control problem with statistical estimation: the system dynamics are not known to the agent, but can be learned through experience. A recent line of research casts ‘RL as inference’ and suggests a particular framework to generalize the RL problem as probabilistic inference. Our paper surfaces a key shortcoming in that approach, and clarifies the sense in which RL can be coherently cast as an inference problem. In particular, an RL agent must consider the effects of its actions upon future rewards and observations: the exploration-exploitation tradeoff. In all but the simplest settings, the resulting inference is computationally intractable, so practical RL algorithms must resort to approximation. We demonstrate that the popular ‘RL as inference’ approximation can perform poorly in even very basic problems. However, we show that with a small modification the framework does yield algorithms that can provably perform well, and we show that the resulting algorithm is equivalent to the recently proposed K-learning, which we further connect with Thompson sampling.




1 Introduction

Probabilistic inference is a procedure for making sense of uncertain data using Bayes’ rule. The optimal control problem is to take actions in a known system in order to maximize the cumulative rewards through time. Probabilistic graphical models (PGMs) offer a coherent and flexible language to specify causal relationships, for which a rich literature of learning and inference techniques has developed (Koller and Friedman, 2009). Although control dynamics might also be encoded as a PGM, the relationship between action planning and probabilistic inference is not immediately clear. For inference, it is typically enough to specify the system and pose the question, and the objectives for learning emerge automatically. In control, the system and objectives are known, but the question of how to approach a solution may remain extremely complex (Bertsekas, 2005).

Perhaps surprisingly, there is a deep sense in which inference and control can represent a dual view of the same problem. This relationship is most clearly stated in the case of linear quadratic systems, where the Riccati equations express the optimal control policy in terms of the system dynamics (Welch et al., 1995). In fact, this connection extends to a wide range of systems, where control tasks can be related to a dual inference problem through rewards as exponentiated probabilities in a distinct, but coupled, PGM (Todorov, 2007, 2008). A great benefit of this connection is that it can allow the tools of inference to make progress in control problems, and vice-versa. In both cases the connections provide new insights, inspire new algorithms and enrich our understanding (Toussaint and Storkey, 2006; Ziebart et al., 2008; Kappen et al., 2012).

Reinforcement learning (RL) is the problem of learning to control an unknown system (Sutton and Barto, 2018). Like the control setting, an RL agent should take actions to maximize its cumulative rewards through time. Like the inference problem, the agent is initially uncertain of the system dynamics, but can learn through the transitions it observes. This leads to a fundamental tradeoff: the agent may be able to improve its understanding through exploring poorly-understood states and actions, but it may be able to attain higher immediate reward through exploiting its existing knowledge (Kearns and Singh, 2002). In many ways, RL combines control and inference into a general framework for decision making under uncertainty. Although there has been ongoing research in this area for many decades, there has been a recent explosion of interest as RL techniques have made high-profile breakthroughs in grand challenges of artificial intelligence research (Mnih et al., 2013; Silver et al., 2016).

A popular line of research has sought to cast ‘RL as inference’, mirroring the dual relationship for control in known systems. This approach is most clearly stated in the tutorial and review of Levine (2018), which provides a key reference for research in this field. It suggests that a generalization of the RL problem can be cast as probabilistic inference over exponentiated rewards, in a continuation of previous work in optimal control (Todorov, 2009). This perspective promises several benefits: a probabilistic perspective on rewards, the ability to apply powerful inference algorithms to solve RL problems, and a natural exploration strategy. In this paper we will outline an important way in which this perspective is incomplete. This shortcoming ultimately results in algorithms that can perform poorly in even very simple decision problems. Importantly, these are not merely technical issues that show up in some edge cases, but fundamental failures of the approach that arise in even the simplest decision problems.

In this paper we revisit an alternative framing of ‘RL as inference’. In fact, we show that the original RL problem was already an inference problem all along (note that, unlike control, connecting RL with inference will not involve a separate ‘dual’ problem). Importantly, this inference problem includes inference over the agent’s future actions and observations. Of course, this perspective is not new, and has long been known as simply the Bayes-optimal solution; see, e.g., Ghavamzadeh et al. (2015). The problem is that, due to the exponential lookahead, this inference problem is fundamentally intractable for all but the simplest problems (Gittins, 1979). For this reason, RL research focuses on computationally efficient approaches that maintain a level of statistical efficiency (Furmston and Barber, 2010; Osband et al., 2017).

We provide a review of the RL problem in Section 2, together with a simple and coherent framing of RL as probabilistic inference. In Section 3 we present three approximations to the intractable Bayes-optimal policy. We begin with the celebrated Thompson sampling algorithm, then we review the popular ‘RL as inference’ framing, as presented by Levine (2018), and highlight a clear and simple shortcoming in this approach. Finally, we review K-learning (O’Donoghue, 2018), which we re-interpret as a modification to the RL as inference framework that provides a principled approach to the statistical inference problem, and we present its relationship with Thompson sampling. In Section 4 we present computational studies that support our claims.

2 Reinforcement learning

We consider the problem of an agent taking actions in an unknown environment in order to maximize cumulative rewards through time. For simplicity, this paper will model the environment as a finite horizon, discrete Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, R, P, H, \rho)$ (this choice is for clarity; continuous, infinite horizon, or partially-observed environments do not alter our narrative). Here $\mathcal{S} = \{1, \ldots, S\}$ is the state space, $\mathcal{A} = \{1, \ldots, A\}$ is the action space and each episode is of fixed length $H \in \mathbb{N}$. Each episode $\ell \in \mathbb{N}$ begins with state $s_0 \sim \rho$, then for timesteps $h = 0, \ldots, H-1$ the agent selects action $a_h$, observes transition $s_{h+1}$ with probability $P(s_{h+1} \mid s_h, a_h)$ and receives reward $r_{h+1} \sim R(s_h, a_h)$, where we denote by $\mu(s_h, a_h) = \mathbb{E}[r_{h+1}]$ the mean reward. We define a policy $\pi$ to be a mapping from $\mathcal{S}$ to probability distributions over $\mathcal{A}$ and write $\Pi$ for the space of all policies. For any timestep $t = (\ell, h)$, we define $\mathcal{F}_t$ to be the sequence of observations made before time $t$. An RL algorithm maps histories $\mathcal{F}_t$ to policies $\pi_t \in \Pi$.

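Our goal in the design of RL algorithms is to obtain good performance (cumulative rewards) for an unknown $M \in \mathcal{M}$, where $\mathcal{M}$ is some family of possible environments. Note that this is a different problem from typical ‘optimal control’, which seeks to optimize performance for one particular known MDP $M$; although you might still fruitfully apply an RL algorithm to solve problems of that type. For any environment $M$ and any policy $\pi$ we can define the action-value function,

$$Q^{M,\pi}_h(s, a) = \mathbb{E}_{\pi, M}\left[\sum_{j=h+1}^{H} r_j \,\middle|\, s_h = s,\ a_h = a\right], \tag{1}$$

where the expectation in (1) is taken with respect to the action selection $a_j$ for $j > h$ from the policy $\pi$ and evolution of the fixed MDP $M$. We define the value function $V^{M,\pi}_h(s) = \mathbb{E}_{a \sim \pi} Q^{M,\pi}_h(s, a)$ and write $Q^{M,\star}_h(s, a) = \max_{\pi \in \Pi} Q^{M,\pi}_h(s, a)$ for the optimal Q-values over policies, and the optimal value function is given by $V^{M,\star}_h(s) = \max_{a} Q^{M,\star}_h(s, a)$.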
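For a known tabular MDP, the optimal Q-values can be computed by backward induction over the horizon. The following is a minimal sketch under assumptions: the function name `optimal_q` and the nested-list layout of the transition probabilities `P[s][a][s2]` and mean rewards `mu[s][a]` are illustrative choices, not notation from the paper, and states and actions are 0-indexed for convenience.

```python
def optimal_q(P, mu, H):
    """Backward induction for a finite-horizon tabular MDP.

    P[s][a][s2] : probability of moving from state s to s2 under action a
    mu[s][a]    : mean reward for taking action a in state s
    H           : episode length
    Returns Q with Q[h][s][a] for h = 0, ..., H-1.
    """
    n_states, n_actions = len(mu), len(mu[0])
    v_next = [0.0] * n_states  # no reward is collected after the final step
    Q = [None] * H
    for h in reversed(range(H)):
        Q[h] = [
            [
                mu[s][a] + sum(P[s][a][s2] * v_next[s2] for s2 in range(n_states))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        # Optimal value at step h is the greedy maximum over actions.
        v_next = [max(Q[h][s]) for s in range(n_states)]
    return Q


# Toy two-state example: state 1 pays reward 1 for action 0.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
mu = [[0.0, 0.0], [1.0, 0.0]]
Q = optimal_q(P, mu, H=2)
```

In the toy example, moving to state 1 is worth more at the first step than staying put, because it unlocks the reward available at the second step.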

In order to compare algorithm performance across different environments, it is natural to normalize in terms of the regret, or shortfall in cumulative rewards relative to the optimal value,

$$\mathrm{Regret}(M, \mathrm{alg}, L) = \mathbb{E}_{M, \mathrm{alg}}\left[\sum_{\ell=1}^{L} \left(V^{M,\star}_0(s^\ell_0) - V^{M,\pi_\ell}_0(s^\ell_0)\right)\right]. \tag{2}$$

This quantity depends on the unknown MDP $M$, which is fixed from the start and kept the same throughout, but the expectations are taken with respect to the dynamics of $M$ and the learning algorithm $\mathrm{alg}$. For any particular MDP $M$, the optimal regret of zero can be attained by the non-learning algorithm $\mathrm{alg}_M$ that returns the optimal policy for $M$.

In order to assess the quality of a reinforcement learning algorithm, which is designed to work across some family of $M \in \mathcal{M}$, we need some method to condense performance over a set to a single number. There are two main approaches to this:

$$\mathrm{BayesRegret}(\phi, \mathrm{alg}, L) = \mathbb{E}_{M \sim \phi}\, \mathrm{Regret}(M, \mathrm{alg}, L), \tag{3}$$

$$\mathrm{WorstCaseRegret}(\mathcal{M}, \mathrm{alg}, L) = \max_{M \in \mathcal{M}} \mathrm{Regret}(M, \mathrm{alg}, L), \tag{4}$$

where $\phi$ is a prior over the family $\mathcal{M}$. These differing objectives are often framed as Bayesian (average-case) (3) and frequentist (worst-case) (4) RL (technically, some frequentist results are high-probability bounds on the worst case rather than true worst-case bounds, but this distinction is not important for our purposes). Although these two settings are typically studied in isolation, it should be clear that they are intimately related through the choice of $\mathcal{M}$ and $\phi$. Our next section will investigate what it would mean to ‘solve’ the RL problem. Importantly, we show that both frequentist and Bayesian perspectives already amount to a problem in probabilistic inference, without the need for additional re-interpretation.

2.1 Solving the RL problem through probabilistic inference

If you want to ‘solve’ the RL problem, then formally the objective is clear: find the RL algorithm that minimizes your chosen objective, (3) or (4). To anchor our discussion, we introduce a simple decision problem designed to highlight some key aspects of reinforcement learning. We will revisit this problem setting as we discuss approximations to the optimal policy.

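Problem 1 (One unknown action).

Fix $N \in \mathbb{N}_{\geq 3}$, $\epsilon > 0$ and define $\mathcal{M}_{N,\epsilon} = \{M^+_{N,\epsilon}, M^-_{N,\epsilon}\}$. Both $M^+$ and $M^-$ share $\mathcal{S} = \{1\}$ and $\mathcal{A} = \{1, \ldots, N\}$; they only differ through their rewards:

$$R^+(1) = 1, \quad R^+(2) = +2, \quad R^+(a) = 1 - \epsilon \ \text{ for } a = 3, \ldots, N,$$
$$R^-(1) = 1, \quad R^-(2) = -2, \quad R^-(a) = 1 - \epsilon \ \text{ for } a = 3, \ldots, N,$$

where $R(a) = x$ is a shorthand for deterministic reward of $x$ when choosing action $a$.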

Problem 1 is extremely simple: it involves no generalization and no long-term consequences. It is an independent bandit problem with only one unknown action. For known $M^+$, $M^-$ the optimal policy is trivial: choose $a_t = 2$ in $M^+$ and $a_t = 1$ in $M^-$ for all $t$. An RL agent faced with unknown $M \in \mathcal{M}$ should attempt to optimize the RL objectives (3) or (4). Unusually, and only because Problem 1 is so simple, we can actually compute the optimal solutions to both in terms of $L$ (the total number of episodes) and $\phi = (p^+, p^-)$ where $p^+ = \mathbb{P}(M = M^+)$, the probability of being in $M^+$.

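For $L > 3$ an optimal minimax RL algorithm is to first choose $a_0 = 2$ and observe $r_1$. If $r_1 = 2$ then you know you are in $M^+$, so pick $a_t = 2$ for all $t = 1, 2, \ldots$, for $\mathrm{Regret}(L) = 0$. If $r_1 = -2$ then you know you are in $M^-$, so pick $a_t = 1$ for all $t = 1, 2, \ldots$, for $\mathrm{Regret}(L) = 3$. The minimax regret of this algorithm is 3, which cannot be bested by any algorithm.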

Actually, the same RL algorithm is also Bayes-optimal for any $\phi = (p^+, p^-)$ provided $p^+ L > 3$. This relationship is not a coincidence. All admissible solutions to the minimax problem (4) are given by solutions to the average-case (3) for some ‘worst-case’ prior $\tilde{\phi}$ (Wald, 1950). As such, for ease of exposition, our discussion will focus on the Bayesian (or average-case) setting. However, readers should understand that the same arguments apply to the minimax objective.
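The probe-then-commit strategy described for Problem 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumed reward values: action 1 pays 1, action 2 pays +2 in the ‘plus’ environment and -2 in the ‘minus’ environment, and every other action pays 1 - eps. All function names are hypothetical.

```python
def problem1_rewards(sign, n_actions, eps):
    """Deterministic rewards, actions indexed 1..n_actions.

    sign = +1 selects the 'plus' environment, sign = -1 the 'minus' one.
    The numeric reward values are assumptions of this sketch.
    """
    rewards = {1: 1.0, 2: 2.0 * sign}
    for a in range(3, n_actions + 1):
        rewards[a] = 1.0 - eps
    return rewards


def probe_then_commit_regret(sign, n_actions, eps, n_episodes):
    """Regret of: try action 2 once, then commit to the better of actions 1 and 2."""
    rewards = problem1_rewards(sign, n_actions, eps)
    best = max(rewards.values())
    observed = rewards[2]                 # first episode: probe the unknown action
    regret = best - observed
    commit = 2 if observed > rewards[1] else 1  # reward of action 1 is known a priori
    regret += (n_episodes - 1) * (best - rewards[commit])
    return regret


def bayes_regret(p_plus, n_actions, eps, n_episodes):
    """Average-case objective: expectation of the regret under the prior."""
    return (p_plus * probe_then_commit_regret(+1, n_actions, eps, n_episodes)
            + (1.0 - p_plus) * probe_then_commit_regret(-1, n_actions, eps, n_episodes))


def worst_case_regret(n_actions, eps, n_episodes):
    """Worst-case objective: maximum of the regret over the two environments."""
    return max(probe_then_commit_regret(s, n_actions, eps, n_episodes) for s in (+1, -1))
```

Under these assumed rewards the regret does not depend on the number of episodes or on the number of distractor actions: the only cost is the single probe of action 2 when the environment turns out to be the ‘minus’ one.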

In Problem 1, the key probabilistic inference the agent must consider is the effects of its own actions upon the future rewards, i.e., whether it has chosen action $a = 2$. Slightly more generally, where actions are independent and episode length $H = 1$, the optimal RL algorithm can be computed via Gittins indices, but these problems are very much the exception (Gittins, 1979). In problems with generalization or long-term consequences, computing the Bayes-optimal solution is computationally intractable. One example of an algorithm that converges to the Bayes-optimal solution in the limit of infinite computation is given by Bayes-adaptive Monte-Carlo Planning (Guez et al., 2012). The problem is that, even for very simple problems, the lookahead tree of interactions between actions, observations and algorithmic updates grows exponentially in the search depth (Strehl et al., 2006). Worse still, direct computational approximations to the Bayes-optimal solution can fail exponentially badly should they fall short of the required computation (Munos, 2014). As a result, research in reinforcement learning amounts to trying to find computationally tractable approximations to the Bayes-optimal policy that maintain some degree of statistical efficiency.

3 Approximations for computational and statistical efficiency

The exponential explosion of future actions and observations means solving for the Bayes-optimal solution is computationally intractable. To counter this, most computationally efficient approaches to RL simplify the problem at time $t$ to only consider inference over the data $\mathcal{F}_t$ that has been gathered prior to time $t$. The most common family of these algorithms are ‘certainty equivalent’ (under an identity utility): they take a point estimate $\hat{M}$ for their best guess of the environment, and try to optimize their control given this estimate. Typically, these algorithms are used in conjunction with some dithering scheme for random action selection (e.g., epsilon-greedy), to mitigate premature and suboptimal convergence (Watkins, 1989). However, since these algorithms do not prioritize their exploration, they may take exponentially long to find the optimal policy (Osband et al., 2014).

In order for an RL algorithm to be statistically efficient, it must consider the value of information. To do this, an agent must first maintain some notion of epistemic uncertainty, so that it can direct its exploration towards states and actions that it does not understand well (O’Donoghue et al., 2018). Here again, probabilistic inference finds a natural home in RL: we should build up posterior estimates for the unknown problem parameters, and use this distribution to drive efficient exploration. (For the purposes of this paper, we will focus on optimistic approaches to exploration, although more sophisticated information-seeking approaches merit investigation in future work (Russo and Van Roy, 2014).)

3.1 Thompson sampling

One of the oldest heuristics for balancing exploration with exploitation is given by Thompson sampling, or probability matching (Thompson, 1933). Each episode, Thompson sampling (TS) randomly selects a policy according to the probability it is the optimal policy, conditioned upon the data seen prior to that episode. Thompson sampling is a simple and effective method that successfully balances exploration with exploitation (Russo et al., 2018).

Implementing Thompson sampling amounts to an inference problem at each episode. For each $(s, a, h)$ define the binary random variable $\mathcal{O}_h(s, a)$, where $\mathcal{O}_h(s, a) = 1$ denotes the event that action $a$ is optimal for state $s$ in timestep $h$ (for the problem definition in Section 2 there is always a deterministic optimal policy for $M$). The TS policy for episode $\ell$ is thus given by the inference problem,

$$\pi^{\mathrm{TS}}_\ell \sim \mathbb{P}(\mathcal{O} \mid \mathcal{F}_\ell), \tag{5}$$

where $\mathbb{P}(\mathcal{O} \mid \mathcal{F}_\ell)$ is the joint probability over all the binary optimality variables (hereafter we shall suppress the dependence on $\mathcal{F}_\ell$). To understand how Thompson sampling guides exploration let us consider its performance in Problem 1 when implemented with a uniform prior $\phi = (\frac{1}{2}, \frac{1}{2})$. In the first timestep the agent samples $M_0 \sim \phi$. If it samples $M^+$ it will choose action $a_0 = 2$ and learn the true system dynamics, choosing the optimal arm thereafter. If it samples $M^-$ it will choose action $a_0 = 1$ and repeat the identical decision in the next timestep. Note that this procedure achieves BayesRegret 2.5 according to $\phi$, but also minimax regret 3, which matches the optimal minimax performance despite its uniform prior.

Recent interest in TS was kindled by strong empirical performance in bandit tasks (Chapelle and Li, 2011). Following work has shown that this algorithm satisfies strong Bayesian regret bounds close to the known lower bounds for MDPs, under certain assumptions (Osband and Van Roy, 2017, 2016). However, although much simpler than the Bayes-optimal solution, the inference problem in (5) can still be prohibitively expensive. Table 1 describes one approach to performing the sampling required in (5) implicitly, by maintaining an explicit model over MDP parameters. This algorithm can be computationally intractable as the MDP becomes large and so attempts to scale Thompson sampling to complex systems have focused on approximate posterior samples via randomized value functions, but it is not yet clear under which settings these approximations should be expected to perform well (Osband et al., 2017). As we look for practical, scalable approaches to posterior inference one promising (and popular) approach is known commonly as ‘RL as inference’.

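Before episode $\ell$: sample $M_\ell \sim \phi(\cdot \mid \mathcal{F}_\ell)$
Bellman equation: $Q^{M_\ell}_h(s, a) = \mu^{M_\ell}(s, a) + \sum_{s'} P^{M_\ell}(s' \mid s, a) \max_{a'} Q^{M_\ell}_{h+1}(s', a')$
Table 1: Model-based Thompson sampling.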
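To make the sample-then-act-greedily pattern concrete, here is a minimal sketch of Thompson sampling in the simplest possible setting: a Bernoulli bandit with independent Beta(1, 1) priors. This is a swapped-in textbook special case rather than the MDP algorithm of Table 1; the function name `thompson_bernoulli` and the choice of prior are assumptions made for illustration.

```python
import random


def thompson_bernoulli(true_means, n_rounds, seed=0):
    """Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors.

    Each round: draw one plausible mean per arm from its posterior,
    act greedily with respect to the draw, then update the posterior.
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta 'alpha' parameters
    failures = [1] * n_arms   # Beta 'beta' parameters
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Posterior sample: one candidate environment per round.
        sampled = [rng.betavariate(successes[a], failures[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=sampled.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate Beta-Bernoulli update for the arm that was pulled.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_bernoulli([0.1, 0.9], n_rounds=2000)
```

Because an arm is chosen only when its sampled mean is the largest, the chance of pulling an arm tracks the posterior probability that it is the best arm, which is the probability-matching behaviour described above.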

3.2 The ‘RL as inference’ framework and its limitations

The computational challenges of Thompson sampling suggest an approximate algorithm that replaces (5) with a parametric distribution suitable for expedient computation. It is possible to view the algorithms of the ‘RL as inference’ approach in this light (Rawlik et al., 2013; Todorov, 2009; Toussaint, 2009; Deisenroth et al., 2013; Fellows et al., 2019); see Levine (2018) for a recent survey. These algorithms choose to model the probability of optimality according to,

$$\tilde{\mathbb{P}}(\mathcal{O}_h(s) \mid \tau_h(s, a)) \propto \exp\left(\beta \sum_{(s', a') \in \tau_h(s, a)} \mathbb{E}^\ell \mu(s', a')\right), \tag{6}$$

for some $\beta > 0$, where $\tau_h(s, a)$ is a trajectory (a sequence of state-action pairs) starting from $(s, a)$ at timestep $h$, and where $\mathbb{E}^\ell$ denotes the expectation under the posterior at episode $\ell$. With this potential in place one can perform Bayesian inference over the unobserved ‘optimality’ variables, obtaining posteriors over the policy or other variables of interest. This presentation of the RL as inference framework is slightly closer to the one in Deisenroth et al. (2013, §…) than to Levine (2018), but ultimately it produces the same family of algorithms. We provide such a derivation in the appendix for completeness.

Applying inference procedures to (6) leads naturally to RL algorithms with some ‘soft’ Bellman updates, and added entropy regularization. We describe the general structure of these algorithms in Table 2. These algorithmic connections can help reveal connections to policy gradient, actor-critic, and maximum entropy RL methods (Mnih et al., 2016; O’Donoghue et al., 2017; Haarnoja et al., 2017, 2018; Eysenbach et al., 2018). The problem is that this resultant ‘posterior’ derived using (6) does not generally bear any close relationship to the agent’s epistemic probability that $(s, a, h)$ is optimal.

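Bellman equation: $Q_h(s, a) = \mathbb{E}^\ell \mu(s, a) + \sum_{s'} \mathbb{E}^\ell P(s' \mid s, a)\, \frac{1}{\beta} \log \sum_{a'} \exp\left(\beta Q_{h+1}(s', a')\right)$
Table 2: Soft Q-learning.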

To understand how ‘RL as inference’ guides decision making, let us consider its performance in Problem 1. Practical implementations of ‘RL as inference’ estimate $\mathbb{E}^\ell \mu$ through observations. For $N$ large, and without prior guidance, the agent is then extremely unlikely to select action $a_t = 2$ and so resolve its epistemic uncertainty. Even for an informed prior $\phi = (\frac{1}{2}, \frac{1}{2})$, action selection according to the exploration strategy of Boltzmann dithering is unlikely to sample action $a_t = 2$, for which $\mathbb{E}^\ell \mu(2) = 0$ (Levine, 2018; Cesa-Bianchi et al., 2017). This is because the ‘distractor’ actions with $\mathbb{E}^\ell \mu \geq 1 - \epsilon$ are much more probable under the Boltzmann policy.
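The effect of Boltzmann dithering on Problem 1 can be illustrated numerically. The sketch below assumes posterior-mean rewards of 1 for action 1, 0 for the unknown action 2 (the average of +2 and -2 under a uniform prior), and 1 - eps for each remaining action; these values and the function names are assumptions of the example.

```python
import math


def boltzmann(values, beta):
    """Softmax policy with inverse temperature beta (numerically stabilised)."""
    top = max(values)
    weights = [math.exp(beta * (v - top)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def prob_unknown_action(n_actions, eps, beta):
    """Probability that a Boltzmann policy over posterior-mean rewards picks action 2."""
    # Assumed posterior means: action 1 -> 1, action 2 -> 0, distractors -> 1 - eps.
    means = [1.0, 0.0] + [1.0 - eps] * (n_actions - 2)
    return boltzmann(means, beta)[1]


p_uniform = prob_unknown_action(10, 0.01, beta=0.0)
p_mild = prob_unknown_action(10, 0.01, beta=1.0)
p_sharp = prob_unknown_action(10, 0.01, beta=5.0)
```

With `beta = 0` the policy is uniform, so the informative action is tried about once every `n_actions` steps; any larger `beta` makes it rarer still, because the policy only sees the mean reward of action 2 and not the uncertainty around it.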

This is the same problem that afflicts most dithering approaches to exploration. ‘RL as inference’ as a framework does not incorporate an agent’s epistemic uncertainty, and so can lead to poor policies for even simple problems. While (6) allows the construction of a dual ‘posterior distribution’, this distribution does not generally bear any relation to the typical posterior an agent should compute conditioned upon the data it has gathered, e.g., equation (5). Despite this shortcoming, RL as inference has inspired many interesting and novel techniques, as well as delivered algorithms with good performance on problems where exploration is not the bottleneck (Eysenbach et al., 2018). However, due to the use of language about ‘optimality’ and ‘posterior inference’, it may come as a surprise to some that this framework does not truly tackle the Bayesian RL problem. Indeed, algorithms using ‘RL as inference’ can perform very poorly on problems where accurate uncertainty quantification is crucial to performance. We hope that this paper sheds some light on the topic.

3.3 Making sense of ‘RL as Inference’ via K-learning

In this section we suggest a subtle alteration to the ‘RL as inference’ framework that develops a coherent notion of optimality. The K-learning algorithm was originally introduced through a risk-seeking exponential utility (O’Donoghue, 2018). In this paper we re-derive this algorithm as a principled approximate inference procedure with clear connections to Thompson sampling, and we highlight its similarities to the ‘RL as inference’ framework. We believe that this may offer a road towards combining the respective strengths of Thompson sampling and the ‘RL as inference’ frameworks. First, consider the following approximate conditional optimality probability at $(s, a, h)$:

$$\tilde{\mathbb{P}}(\mathcal{O}_h(s, a) \mid Q^{M,\star}_h(s, a)) \propto \exp\left(\beta Q^{M,\star}_h(s, a)\right), \tag{7}$$

for some $\beta > 0$, and note that this is conditioned on the random variable $Q^{M,\star}_h(s, a)$. We can marginalize over possible Q-values yielding

$$\tilde{\mathbb{P}}(\mathcal{O}_h(s, a)) \propto \mathbb{E}^\ell \exp\left(\beta Q^{M,\star}_h(s, a)\right) = \exp G^Q_h(s, a, \beta), \tag{8}$$

where $G^Q_h(s, a, \cdot)$ denotes the cumulant generating function of the random variable $Q^{M,\star}_h(s, a)$ (Kendall, 1946). Clearly K-learning and the ‘RL as inference’ framework are similar, since equations (6) and (7) are closely linked, but there is a crucial difference. Notice that the integral performed in (8) is with respect to the posterior over $Q^{M,\star}_h(s, a)$, which includes the epistemic uncertainty explicitly.

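Before episode $\ell$: calculate $\beta_\ell = \beta \sqrt{\ell}$
Bellman equation: $K_h(s, a) = \frac{1}{\beta_\ell} G^\mu_\ell(s, a, \beta_\ell) + \sum_{s'} \mathbb{E}^\ell P(s' \mid s, a)\, \frac{1}{\beta_\ell} \log \sum_{a'} \exp\left(\beta_\ell K_{h+1}(s', a')\right)$
Table 3: K-learning.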

Given the approximation to the posterior probability of optimality in (8) we could sample actions from it as our policy, as done by Thompson sampling (5). However, that requires computation of the cumulant generating function $G^Q_h(s, a, \beta)$, which is non-trivial. It was shown in O’Donoghue (2018) that an upper bound to the cumulant generating function could be computed by solving a particular ‘soft’ Bellman equation. The resulting K-values, denoted $K_h(s, a)$ at $(s, a, h)$, are also optimistic for the expected optimal Q-values. Specifically, for a sequence $\{\beta_\ell\}$ the following holds

$$\mathbb{E}^\ell Q^{M,\star}_h(s, a) \leq \frac{1}{\beta_\ell} G^Q_h(s, a, \beta_\ell) \leq K_h(s, a). \tag{9}$$

Following a Boltzmann policy over these K-values satisfies a Bayesian regret bound which matches the current best bound for Thompson sampling up to logarithmic factors under the same set of assumptions. We summarize the K-learning algorithm in Table 3, where $\beta > 0$ is a constant and $G^\mu_\ell(s, a, \cdot)$ denotes the cumulant generating function of $\mu(s, a)$ under the posterior.
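The claim that a scaled cumulant generating function is optimistic for the mean follows from Jensen's inequality. As a short sketch, write $X$ for the (random) optimal Q-value at a fixed state, action and timestep, and let $\beta > 0$ be the inverse temperature; the shorthand $X$ is introduced here purely for illustration.

```latex
\begin{align*}
\beta\,\mathbb{E}[X]
  &= \mathbb{E}\left[\log \exp(\beta X)\right] \\
  &\le \log \mathbb{E}\left[\exp(\beta X)\right]
     && \text{(Jensen's inequality, since $\log$ is concave)} \\
\Longrightarrow\quad
\mathbb{E}[X]
  &\le \frac{1}{\beta} \log \mathbb{E}\left[\exp(\beta X)\right].
\end{align*}
```

Equality holds only when $X$ is almost surely constant, so the gap between the two sides grows with the agent's uncertainty about $X$. This is the sense in which the quantity acts as an exploration bonus.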

Comparing Tables 2 and 3 it is clear that soft Q-learning and K-learning share some similarities: they both solve a ‘soft’ value function and use Boltzmann policies. However, the differences are important. Firstly, K-learning has an explicit schedule for the inverse temperature parameter $\beta_\ell$, and secondly it replaces the expected reward (with respect to the posterior) with a quantity that is optimistic for the expected reward under the posterior. These two relatively small changes make K-learning a principled exploration and inference strategy.

To understand how K-learning drives exploration, consider its performance on Problem 1. Since this is a bandit problem we can compute the cumulant generating functions for each arm and then use the policy given by (8). For any non-trivial prior and choice of $\beta > 0$ the cumulant generating function is optimistic for arm 2, which results in the policy selecting arm 2 more frequently, thereby resolving its epistemic uncertainty. As $\beta \to \infty$ K-learning converges on pulling arm 2 with probability one. This is in contrast to soft Q-learning, where arm 2 is exponentially unlikely to be selected as the exploration parameter $\beta$ grows.
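For a bandit, the policy obtained from per-arm cumulant generating functions can be written down directly. The following sketch assumes the same illustrative reward structure as before (arm 1 pays 1, arm 2 pays +2 with prior probability `p_plus` and -2 otherwise, every other arm pays 1 - eps); the function name and these values are assumptions, and the policy is taken to be proportional to the exponentiated cumulant generating function of each arm.

```python
import math


def cgf_policy(n_actions, eps, beta, p_plus=0.5):
    """Policy proportional to exp(G_a(beta)), with G_a(beta) = log E[exp(beta * mu(a))]."""
    # Arms with known (deterministic) reward have a linear cumulant generating function.
    cgf = [beta * 1.0]
    # Arm 2 is uncertain: a two-point prior on its reward, +2 or -2 (assumed values).
    cgf.append(math.log(p_plus * math.exp(2.0 * beta)
                        + (1.0 - p_plus) * math.exp(-2.0 * beta)))
    cgf += [beta * (1.0 - eps)] * (n_actions - 2)
    top = max(cgf)
    weights = [math.exp(g - top) for g in cgf]
    total = sum(weights)
    return [w / total for w in weights]


p_flat = cgf_policy(10, 0.01, beta=0.0)[1]
p_mid = cgf_policy(10, 0.01, beta=5.0)[1]
p_high = cgf_policy(10, 0.01, beta=20.0)[1]
```

Arm 2 has posterior mean zero, yet its cumulant generating function is dominated by the optimistic outcome, so the probability of pulling it rises towards one as `beta` grows. A Boltzmann policy over posterior means behaves in the opposite way, making the same arm ever less likely.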

3.3.1 Connections between K-learning and Thompson sampling

Since K-learning can be viewed as approximating the posterior probability of optimality of each action it is natural to ask how close an approximation it is. A natural way to measure this similarity is the Kullback–Leibler (KL) divergence between the distributions,

$$D_{\mathrm{KL}}\left(\mathbb{P}(\mathcal{O}_h(s)) \,\|\, \pi^K_h(s)\right) = \sum_{a} \mathbb{P}(\mathcal{O}_h(s, a)) \log\left(\frac{\mathbb{P}(\mathcal{O}_h(s, a))}{\pi^K_h(s, a)}\right),$$

where we are using the notation $\mathcal{O}_h(s) = \mathcal{O}_h(s, \cdot)$ and $\pi^K_h(s) = \pi^K_h(s, \cdot)$. This is different to the usual notion of distance that is taken in variational Bayesian methods, which would typically reverse the order of the arguments in the KL divergence (Blundell et al., 2015). However, in RL that ‘direction’ is not appropriate: a distribution minimizing $D_{\mathrm{KL}}(\pi_h(s) \,\|\, \mathbb{P}(\mathcal{O}_h(s)))$ may put zero probability on regions of support of $\mathbb{P}(\mathcal{O}_h(s))$. This means an action with non-zero probability of being optimal might never be taken. On the other hand a policy minimizing $D_{\mathrm{KL}}(\mathbb{P}(\mathcal{O}_h(s)) \,\|\, \pi_h(s))$ must assign a non-zero probability to every action that has a non-zero probability of being optimal, or incur an infinite KL divergence penalty. With this characterization in mind, and noting that the Thompson sampling policy satisfies $\mathbb{E}^\ell \pi^{\mathrm{TS}}_h(s) = \mathbb{P}(\mathcal{O}_h(s))$, our next result links the policies of K-learning to Thompson sampling.
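The asymmetry between the two directions of the KL divergence is easy to see on a two-action example. In this minimal sketch the probabilities are made up for illustration: `p_optimal` plays the role of the probability that each action is optimal, and `greedy` is a policy that never takes the second action.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue            # terms with p_i = 0 contribute nothing
        if q_i == 0.0:
            return math.inf     # q misses an outcome that p considers possible
        total += p_i * math.log(p_i / q_i)
    return total


p_optimal = [0.7, 0.3]   # hypothetical probabilities that each action is optimal
greedy = [1.0, 0.0]      # a policy that never tries the second action

forward = kl_divergence(p_optimal, greedy)   # penalises ignoring a possibly-optimal action
reverse = kl_divergence(greedy, p_optimal)   # stays finite despite ignoring it
```

Here `forward` is infinite while `reverse` equals `log(1 / 0.7)`, so only the first direction forces the policy to keep trying every action that might be optimal.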

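Theorem 1.

The K-learning value function $V^K$ and policy $\pi^K$ defined in Table 3 satisfy the following bound at every state $s \in \mathcal{S}$ and $h = 0, \ldots, H$:

$$D_{\mathrm{KL}}\left(\mathbb{P}(\mathcal{O}_h(s)) \,\|\, \pi^K_h(s)\right) \leq \beta\left(V^K_h(s) - \mathbb{E}^\ell V^{M,\star}_h(s)\right). \tag{10}$$

We defer the proof to Appendix 5.2. This theorem tells us that the distance between the true probability of optimality and the K-learning policy is bounded for any choice of $\beta < \infty$. In other words, if there is an action that might be optimal then K-learning will eventually take that action.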

3.4 Why is ‘RL as Inference’ so popular?

The sections above outline some surprising ways that the ‘RL as inference’ framework can drive suboptimal behaviour in even simple domains. The question remains, why do so many popular and effective algorithms lie within this class? The first, and most important point, is that these algorithms can perform extremely well in domains where efficient exploration is not a bottleneck. Furthermore, they are often easy to implement and amenable to function approximation (Peters et al., 2010; Kober and Peters, 2009; Abdolmaleki et al., 2018). Our discussion of K-learning in Section 3.3 shows that a relatively simple fix to this problem formulation can result in a framing of RL as inference that maintains a coherent notion of optimality. Computational results show that, in tabular domains, K-learning can be competitive with, or even outperform Thompson sampling strategies, but extending these results to large-scale domains with generalization is an open question (O’Donoghue, 2018; Osband et al., 2017).

The other observation is that the ‘RL as inference’ framework can provide useful insights into the structure of particular algorithms for RL. It is valid to note that, under certain conditions, following policy gradient is equivalent to a dual inference problem where the ‘probabilities’ play the role of dummy variables, but are not supposed to represent the probability of optimality in the RL problem. In this light, Levine (2018) presents the inference framework as a way to generalize a wide range of state of the art RL algorithms. However, when taking this view, you should remember that this inference duality is limited to certain RL algorithms, and without some modifications (e.g. Section 3.3) this perspective is in danger of overlooking important aspects of the RL problem.

4 Computational experiments

4.1 One unknown action (Problem 1)

Consider the environment of Problem 1 with uniform prior $\phi = (\frac{1}{2}, \frac{1}{2})$. We fix $\epsilon$ and consider how the Bayesian regret varies with $N$. Figure 1 compares how the regret scales for Bayes-optimal (1.5), Thompson sampling (2.5), K-learning and soft Q-learning (which grows linearly in $N$ for the optimal $\beta \to 0$, but would typically grow exponentially for $\beta > 0$). This highlights that, even in a simple problem, there can be great value in considering the value of information.

Figure 1: Regret scaling on Problem 1. Soft Q-learning does not scale gracefully with $N$.
Figure 2: DeepSea exploration: a simple example where deep exploration is critical.

4.2 ‘DeepSea’ exploration

Our next set of experiments considers the ‘DeepSea’ MDPs introduced by Osband et al. (2017). At a high level this problem represents a ‘needle in a haystack’, designed to require efficient exploration, the complexity of which grows with the problem size $N$. DeepSea (Figure 2) is a scalable variant of the ‘chain MDPs’ popular in exploration research (Jaksch et al., 2010). (The DeepSea figure is taken with permission from ‘bsuite’, Osband et al. (2019).)

The agent begins each episode in the top-left state in an $N \times N$ grid. At each timestep the agent can move left or right one column, and falls one row. There is a small negative reward for heading right, and zero reward for left. There is only one rewarding state, at the bottom right cell. The only way the agent can receive positive reward is to choose to go right in each timestep. Algorithms that do not perform deep exploration will take an exponential number of episodes to learn the optimal policy, but those that prioritize informative states and actions can learn much faster.
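The structure of the environment can be captured in a few lines. The class below is a simplified sketch written for this description and is not the bsuite implementation: the class name, the `move_cost` default and the rule that the treasure is paid out after exactly `size` moves to the right are all assumptions.

```python
class DeepSea:
    """Simplified tabular DeepSea on a size x size grid, with size steps per episode."""

    def __init__(self, size, move_cost=0.01):
        self.size = size
        self.step_cost = move_cost / size  # small penalty per 'right' move (assumed value)
        self.reset()

    def reset(self):
        self.row, self.col, self.rights = 0, 0, 0
        return (self.row, self.col)

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (state, reward, done)."""
        reward = 0.0
        if action == 1:
            reward -= self.step_cost
            self.rights += 1
            self.col = min(self.col + 1, self.size - 1)
        else:
            self.col = max(self.col - 1, 0)
        self.row += 1  # the agent always falls one row
        done = self.row == self.size
        if done and self.rights == self.size:
            reward += 1.0  # treasure: requires moving right at every single step
        return (min(self.row, self.size - 1), self.col), reward, done


def run_episode(env, policy):
    """Roll out one episode and return the total reward."""
    state, total, done = env.reset(), 0.0, False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
    return total


env = DeepSea(size=5)
always_right = run_episode(env, lambda state: 1)
always_left = run_episode(env, lambda state: 0)
one_slip = run_episode(env, lambda state: 0 if state[0] == 2 else 1)
```

In this sketch a policy that picks left or right uniformly at random collects the treasure with probability `2 ** -size` per episode, which is why undirected dithering needs exponentially many episodes, and the penalty on ‘right’ moves actively discourages a greedy learner from the rewarding path.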

Figure 3(a) shows the ‘time to learn’ for tabular implementations of K-learning (Section 3.3), soft Q-learning (Section 3.2) and Thompson sampling (Section 3.1). We implement each of the algorithms with a $N(0, 1)$ prior for rewards and a $\mathrm{Dirichlet}(1/N)$ prior for transitions. Since these problems are small and tabular, we can use conjugate prior updates and exact MDP planning via value iteration. As expected, Thompson sampling and K-learning scale gracefully to large domains but soft Q-learning does not.

4.3 Behaviour Suite for Reinforcement Learning

So far our experiments have been confined to the tabular setting, but the main focus of ‘RL as inference’ is for scalable algorithms that work with generalization. In this section we show that the same insights we built in the tabular setting extend to the setting of deep RL. To do this we implement variants of Deep Q-Networks with a single layer, 50-unit MLP (Mnih et al., 2013). To adapt K-learning and Thompson sampling to this deep RL setting we use an ensemble of size 20 with randomized prior functions to approximate the posterior distribution over neural network Q-values (Osband et al., 2018) (full experimental details are included in Appendix 5.4). We then evaluate all of the algorithms on bsuite: a suite of benchmark tasks designed to highlight key issues in RL (Osband et al., 2019).

In particular, bsuite includes an evaluation on the DeepSea problems but with a one-hot pixel representation of the agent position. In Figure 3(b) we see that the results for these deep RL implementations closely match the observed scaling for the tabular setting. In particular, the algorithms motivated by Thompson sampling and K-learning both scale gracefully to large problem sizes, whereas soft Q-learning is unable to drive deep exploration. Our bsuite evaluation includes many more experiments than can be fit into this paper, but we provide a link to the complete results. In general, the results for Thompson sampling and K-learning are similar, with soft Q-learning performing significantly worse on ‘exploration’ tasks. We defer a summary of these results to Appendix 6.

(a) Tabular state representation.
(b) One-hot pixel representation into neural net.
Figure 3: Learning times for DeepSea experiments. Dashed line represents $2^N$.

5 Conclusion

This paper aims to make sense of reinforcement learning and probabilistic inference. We review the reinforcement learning problem and show that this problem of optimal learning already combines the problems of control and inference. As we highlight this connection, we also clarify some potentially confusing details in the popular ‘RL as inference’ framework. We show that, since this problem formulation ignores the role of epistemic uncertainty, algorithms derived from that framework can perform poorly on even simple tasks. Importantly, we also offer a way forward, to reconcile the views of RL and inference in a way that maintains the best pieces of both. In particular, we show that a simple variant of the RL as inference framework (K-learning) can incorporate uncertainty estimates to drive efficient exploration. We support our claims with a series of simple didactic experiments. We leave the crucial questions of how to scale these insights up to large complex domains for future work.


  • A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. M. N. Heess, and M. Riedmiller (2018) Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), Cited by: §3.4.
  • S. Asmussen and P. W. Glynn (2007) Stochastic simulation: Algorithms and analysis. Vol. 57, Springer Science & Business Media. Cited by: §5.2.
  • D. P. Bertsekas (2005) Dynamic programming and optimal control. Vol. 1, Athena Scientific. Cited by: §1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §3.3.1.
  • N. Cesa-Bianchi, C. Gentile, G. Neu, and G. Lugosi (2017) Boltzmann exploration done right. In Advances in Neural Information Processing Systems, pp. 6287–6296. Cited by: §3.2.
  • O. Chapelle and L. Li (2011) An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pp. 2249–2257. Cited by: §3.1.
  • M. P. Deisenroth, G. Neumann, J. Peters, et al. (2013) A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §3.2.
  • B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2018) Diversity is all you need: learning skills without a reward function. arXiv preprint arXiv:1802.06070. Cited by: §3.2, §3.2.
  • M. Fellows, A. Mahajan, T. G. Rudner, and S. Whiteson (2019) Virel: a variational inference framework for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 7120–7134. Cited by: §3.2.
  • T. Furmston and D. Barber (2010) Variational methods for reinforcement learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 241–248. Cited by: §1.
  • M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar (2015) Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning 8 (5-6), pp. 359–483. Cited by: §1.
  • J. C. Gittins (1979) Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), pp. 148–177. Cited by: §1, §2.1.
  • A. Guez, D. Silver, and P. Dayan (2012) Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pp. 1025–1033. Cited by: §2.1.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning (ICML), Cited by: §3.2.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §3.2.
  • T. Jaksch, R. Ortner, and P. Auer (2010) Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 (Apr), pp. 1563–1600. Cited by: §4.2.
  • H. J. Kappen, V. Gómez, and M. Opper (2012) Optimal control as a graphical model inference problem. Machine learning 87 (2), pp. 159–182. Cited by: §1.
  • M. Kearns and S. Singh (2002) Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2-3), pp. 209–232. Cited by: §1.
  • M. G. Kendall (1946) The advanced theory of statistics. Charles Griffin and Co., Ltd., London. Cited by: §3.3.
  • D. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.4.
  • J. Kober and J. R. Peters (2009) Policy search for motor primitives in robotics. In Advances in neural information processing systems, pp. 849–856. Cited by: §3.4.
  • D. Koller and N. Friedman (2009) Probabilistic graphical models: principles and techniques. MIT press. Cited by: §1.
  • S. Levine (2018) Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: §1, §1, §3.2, §3.2, §3.4, §5.1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928–1937. Cited by: §3.2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, Cited by: §1, §4.3, §6.1.
  • R. Munos (2014) From bandits to monte-carlo tree search: the optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning 7 (1), pp. 1–129. Cited by: §2.1.
  • B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih (2017) Combining policy gradient and Q-learning. In International Conference on Learning Representations (ICLR), Cited by: §3.2, 3rd item.
  • B. O’Donoghue, I. Osband, R. Munos, and V. Mnih (2018) The uncertainty Bellman equation and exploration. In Proceedings of the 35th International Conference on Machine Learning (ICML), Cited by: §3.
  • B. O’Donoghue (2018) Variational Bayesian reinforcement learning with regret bounds. arXiv preprint arXiv:1807.09647. Cited by: §1, §3.3, §3.3, §3.4, 2nd item, §5.2, §5.3.
  • I. Osband, J. Aslanides, and A. Cassirer (2018) Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8617–8629. Cited by: §4.3, 1st item, 2nd item, §5.4.
  • I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped DQN. In Advances In Neural Information Processing Systems, pp. 4026–4034. Cited by: 1st item.
  • I. Osband, Y. Doron, M. Hessel, J. Aslanides, E. Sezener, A. Saraiva, K. McKinney, T. Lattimore, C. Szepezvari, S. Singh, B. Van Roy, R. Sutton, D. Silver, and H. Van Hasselt (2019) Behaviour suite for reinforcement learning. Cited by: §4.3, §6, footnote 6.
  • I. Osband, D. Russo, Z. Wen, and B. Van Roy (2017) Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608. Cited by: §1, §3.1, §3.4, §4.2.
  • I. Osband, B. Van Roy, and Z. Wen (2014) Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635. Cited by: §3.
  • I. Osband and B. Van Roy (2016) On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732. Cited by: §3.1.
  • I. Osband and B. Van Roy (2017) Why is posterior sampling better than optimism for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), Cited by: §3.1.
  • J. Peters, K. Mülling, and Y. Altun (2010) Relative entropy policy search. In AAAI, Cited by: §3.4.
  • K. Rawlik, M. Toussaint, and S. Vijayakumar (2013) On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence, Cited by: §3.2.
  • D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, et al. (2018) A tutorial on thompson sampling. Foundations and Trends® in Machine Learning 11 (1), pp. 1–96. Cited by: §3.1.
  • D. Russo and B. Van Roy (2014) Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pp. 1583–1591. Cited by: footnote 4.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
  • A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman (2006) PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pp. 881–888. Cited by: §2.1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1.
  • W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §3.1.
  • E. Todorov (2007) Linearly-solvable markov decision problems. In Advances in neural information processing systems, pp. 1369–1376. Cited by: §1.
  • E. Todorov (2008) General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. Cited by: §1.
  • E. Todorov (2009) Efficient computation of optimal actions. Proceedings of the national academy of sciences 106 (28), pp. 11478–11483. Cited by: §1, §3.2.
  • M. Toussaint and A. Storkey (2006) Probabilistic inference for solving discrete and continuous state markov decision processes. In Proceedings of the 23rd international conference on Machine learning, pp. 945–952. Cited by: §1.
  • M. Toussaint (2009) Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049–1056. Cited by: §3.2.
  • A. Wald (1950) Statistical decision functions. Cited by: §2.1.
  • C. J. C. H. Watkins (1989) Learning from delayed rewards. Ph.D. Thesis, University of Cambridge England. Cited by: §3.
  • G. Welch, G. Bishop, et al. (1995) An introduction to the Kalman filter. Cited by: §1.
  • B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. Cited by: §1.


5.1 Soft Q-learning derivation

We present a derivation of soft Q-learning from the ‘RL as inference’ parametric approximation to the probability of optimality. Although our presentation is slightly different from that of Levine (2018), we show here that the resulting algorithms are essentially identical. Recall from equation (6) that the parametric approximation to optimality we consider is given by

where is a trajectory starting from at time and is a hyper-parameter. Now we must marginalize out the possible trajectories using the (unknown) system dynamics. Since this is a certainty-equivalent algorithm we shall use the expected value of the transition probabilities, under the posterior at episode , which means we can write

and we make the additional assumption that the ‘prior’ is uniform across all actions for each (this assumption is standard in this framework, see Levine (2018)). In this case we obtain

Now with this we can rewrite

where is the normalization constant for state , since for any , and using Jensen’s inequality we have the following bound

now if we introduce the soft Q-values that satisfy the soft Bellman equation


and we have the soft Q-learning algorithm (the approximation comes from the fact we used Jensen’s inequality to provide a bound).
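The backup behind this derivation can be sketched numerically. The block below is a minimal illustration, not the implementation used in the experiments. It assumes the standard form of the soft Bellman equation, in which the soft value of a state is the log-sum-exp of its Q-values scaled by the temperature, and it drops any constant offset contributed by the uniform action prior. The function names, the tabular layout of `rewards` and `transitions`, and the parameter `beta` are hypothetical choices made for this sketch.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_q_backup(rewards, transitions, horizon, beta):
    """Finite-horizon soft Q-values by backward induction (assumed standard form).

    rewards[s][a] is the expected reward and transitions[s][a][s2] is the
    expected transition probability, i.e. the certainty-equivalent model.
    Returns the Q-values with `horizon` steps remaining.
    """
    n_states = len(rewards)
    n_actions = len(rewards[0])
    soft_v = [0.0] * n_states  # soft value once the episode has ended
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(horizon):
        q = [
            [
                rewards[s][a]
                + sum(p * soft_v[s2] for s2, p in enumerate(transitions[s][a]))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        # Soft maximum over actions replaces the hard max of Q-learning.
        soft_v = [logsumexp([beta * x for x in q[s]]) / beta for s in range(n_states)]
    return q


def boltzmann_policy(q_row, beta):
    """Action probabilities proportional to exp(beta * Q)."""
    z = logsumexp([beta * x for x in q_row])
    return [math.exp(beta * x - z) for x in q_row]
```

Note that nothing in this backup depends on how uncertain the agent is about `rewards` or `transitions`: only their expected values enter, which is the certainty-equivalence discussed above.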

5.2 Proof of Theorem 1

Theorem 1. The K-learning value function and policy defined in Table 3 satisfy the following bound at every state and :


Proof. Fix some particular state , and let the joint posterior over value and optimality be denoted by


where we use to denote the conditional distribution over Q-values conditioned on optimality. Recall that from equation (7) we have approximated the conditional posterior probability of optimality as

for some , which when yields

From Bayes’ rule this implies the following approximation to the conditional distribution


This is known as the exponential tilt of the posterior distribution and has a myriad of applications in statistics (Asmussen and Glynn, 2007). From this we could derive an approximation to the joint posterior (11); however, the K-learning policy does not follow (8), since computing the cumulant generating function is non-trivial. Instead we compute the K-values, which solve a Bellman equation and provide a guaranteed upper bound on the cumulant generating function, and the K-learning policy is thus

where we have (O’Donoghue, 2018)


With that in mind we take our approximation to the joint posterior (11) to be

Now consider the KL-divergence between the true joint posterior and our approximate one; a quick calculation yields


for timestep and state . Considering the terms on the right hand side of (14) separately we have

where denotes the entropy, and using (12)

Now we sum these two terms, using (13) and the following identities


since , we obtain

The theorem follows from this and the fact that the K-learning value function is defined as

as well as the fact that

from equation (14). ∎
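The exponential tilt and the cumulant generating function that appear in this proof can be illustrated on a small discrete distribution. The sketch below is only an illustration under stated assumptions: the posterior is a finite list of values with probabilities, and the names `cgf` and `exponential_tilt` are hypothetical helpers introduced here.

```python
import math


def cgf(values, probs, beta):
    """Cumulant generating function log E[exp(beta * X)] of a discrete X."""
    m = max(beta * v for v in values)
    return m + math.log(sum(p * math.exp(beta * v - m) for v, p in zip(values, probs)))


def exponential_tilt(values, probs, beta):
    """Reweight each outcome by exp(beta * value), then renormalize.

    The normalizing constant is exp(cgf(values, probs, beta)).
    """
    g = cgf(values, probs, beta)
    return [p * math.exp(beta * v - g) for v, p in zip(values, probs)]
```

For a positive tilt parameter the tilted distribution shifts mass toward larger values, and its mean equals the derivative of the cumulant generating function with respect to the tilt parameter, which is a standard property of exponential families.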

5.3 Problem 1 K-learning details

For a bandit problem the K-learning policy is given by

which requires the cumulant generating function of the posterior over each arm. For arm and the distractor arms there is no uncertainty, in which case the cumulant generating function is given by

In the case of arm the cumulant generating function is

O’Donoghue (2018) showed that the optimal choice of is given by

which requires solving a convex optimization problem in variable . In the case of problem 1 the optimal choice of , which yields . Then, once arm has been pulled once and the true reward of arm has been revealed, its cumulant generating function has the same form as the others, and then the optimal choice of is simply

at which point K-learning is greedy with respect to the optimal arm.
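To make the bandit case concrete, the following sketch computes a K-learning style policy from discrete posteriors over each arm. It rests on assumptions that are not spelled out in the text above: the policy is taken to be proportional to the exponential of each arm's cumulant generating function evaluated at a fixed parameter `beta`, and `beta` is supplied by the caller rather than obtained from the convex optimization problem described above. The function names and the representation of posteriors as `(values, probs)` pairs are hypothetical.

```python
import math


def cgf(values, probs, beta):
    """Cumulant generating function log E[exp(beta * X)] of a discrete X."""
    m = max(beta * v for v in values)
    return m + math.log(sum(p * math.exp(beta * v - m) for v, p in zip(values, probs)))


def k_learning_bandit_policy(posteriors, beta):
    """Policy proportional to exp(G_a(beta)) for each arm a (assumed form).

    posteriors is a list of (values, probs) pairs, one per arm. An arm with no
    uncertainty is a point mass, for which G_a(beta) = beta * reward.
    """
    k_values = [cgf(values, probs, beta) for values, probs in posteriors]
    m = max(k_values)
    weights = [math.exp(k - m) for k in k_values]
    total = sum(weights)
    return [w / total for w in weights]
```

With two arms of equal posterior mean, one known exactly and one uncertain, this policy puts more weight on the uncertain arm. Once the uncertain arm's reward is revealed, its posterior collapses to a point mass and the policy reduces to a Boltzmann policy over the known rewards.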

5.4 Implementation details for bsuite evaluation

All three algorithms used the same neural network architecture, consisting of an MLP (multilayer perceptron) with a single hidden layer with 50 hidden units. All three algorithms used a replay buffer of the most recent transitions to allow re-use of data. For all three the Adam optimizer (Kingma and Ba, 2014) was used with learning rate and batch-size , and learning was performed at every time-step. For both K-learning and soft Q-learning the temperature was set at . For Bootstrap DQN we chose an ensemble of size , and used the randomized prior functions (Osband et al., 2018) with scale . For K-learning, in order to estimate the cumulant generating function of the reward, we used an ensemble of neural networks predicting the reward for each state and action, and used these to calculate the empirical cumulant generating function over them. Each of these was a single hidden layer MLP with hidden units. Finally, we noted that training a small ensemble of K-networks performed better than a single network; we used an ensemble of size for this purpose, together with randomized priors with scale to encourage diversity between the elements of the ensemble. The K-learning policy was the Boltzmann policy over all the ensemble K-values at each state.
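The empirical cumulant generating function over an ensemble can be sketched as follows. This is an assumption-laden illustration rather than the code used in the experiments: it treats the ensemble members' reward predictions for a single state and action as equally weighted samples from the posterior, and the function name `empirical_cgf` and the parameter `beta` are hypothetical.

```python
import math


def empirical_cgf(predictions, beta):
    """Estimate log E[exp(beta * r)] from ensemble reward predictions.

    Each prediction is treated as an equally likely sample, so the expectation
    is a plain average over ensemble members.
    """
    scaled = [beta * r for r in predictions]
    m = max(scaled)
    return m + math.log(sum(math.exp(x - m) for x in scaled) / len(scaled))
```

When the ensemble members agree, the estimate equals `beta` times the common prediction. When they disagree and `beta` is positive, the estimate lies between `beta` times the mean prediction and `beta` times the largest prediction, so disagreement within the ensemble acts as an optimistic bonus.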


6 bsuite report: Making sense of RL and Inference

The Behaviour Suite for Reinforcement Learning, or bsuite for short, is a collection of carefully-designed experiments that investigate core capabilities of a reinforcement learning (RL) agent. The aim of the bsuite project is to collect clear, informative and scalable problems that capture key issues in the design of efficient and general learning algorithms and study agent behaviour through their performance on these shared benchmarks. This report provides a snapshot of agent performance on bsuite2019, obtained by running the experiments from Osband et al. (2019).

6.1 Agent definition

All agents were run with the same network architecture (a single-layer MLP with 50 hidden units and a ReLU activation), adapting DQN (Mnih et al., 2013). Full hyperparameters are given in the appendix.


  • boot_dqn: bootstrapped DQN with prior networks (Osband et al., 2016, 2018).

  • k_learn: K-learning via ensemble with prior networks (O’Donoghue, 2018; Osband et al., 2018).

  • soft_q: soft Q-learning with temperature (O’Donoghue et al., 2017).

6.2 Summary scores

Each bsuite experiment outputs a summary score in [0,1]. We aggregate these scores according to key experiment type, following the standard analysis notebook. A detailed analysis of each of these experiments may be found in a notebook hosted on Colaboratory.

Figure 4: Snapshot of agent behaviour.

Figure 5: Score for each bsuite experiment.

6.3 Results commentary

Overall, we see that the algorithms K-learning and Bootstrapped DQN perform extremely similarly across bsuite evaluations. However, there is a clear signal that soft Q-learning performs markedly worse on the tasks requiring efficient exploration. This observation is consistent with the hypothesis that algorithms motivated by ‘RL as Inference’ fail to account for the value of exploratory actions.

Beyond this major difference in exploration score, we see that Bootstrapped DQN outperforms the other algorithms on problems varying ‘Scale’. This too is not surprising, since both soft Q-learning and K-learning rely on a temperature tuning that will be problem-scale dependent. Finally, we note that soft Q-learning also performs worse on some ‘basic’ tasks, notably ‘bandit’ and ‘mnist’. We believe that the relatively high temperature (tuned for best performance on DeepSea) leads to poor performance on these tasks with larger action spaces, due to too many random actions.
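The effect of a fixed temperature on larger action spaces can be seen in a toy calculation. The sketch below is a hypothetical illustration, not one of the bsuite experiments: it assumes a single best action whose value exceeds all others by `gap`, and measures how much probability a Boltzmann policy with inverse temperature `beta` places on the remaining actions.

```python
import math


def boltzmann_policy(q_values, beta):
    """Action probabilities proportional to exp(beta * Q)."""
    m = max(beta * q for q in q_values)
    weights = [math.exp(beta * q - m) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def non_greedy_mass(n_actions, beta, gap=1.0):
    """Probability of picking any action other than the single best one."""
    q_values = [gap] + [0.0] * (n_actions - 1)
    return 1.0 - boltzmann_policy(q_values, beta)[0]
```

For a fixed `beta`, the probability assigned to sub-optimal actions grows with the number of actions, which is consistent with the explanation offered above for the weaker scores on tasks with larger action spaces.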