1 Introduction
Probabilistic inference is the procedure of making sense of uncertain data using Bayes’ rule. The optimal control problem is to take actions in a known system in order to maximize the cumulative rewards through time. Probabilistic graphical models (PGMs) offer a coherent and flexible language to specify causal relationships, for which a rich literature of learning and inference techniques has developed (Koller and Friedman, 2009). Although control dynamics might also be encoded as a PGM, the relationship between action planning and probabilistic inference is not immediately clear. For inference, it is typically enough to specify the system and pose the question, and the objectives for learning emerge automatically. In control, the system and objectives are known, but the question of how to approach a solution may remain extremely complex (Bertsekas, 2005).
Perhaps surprisingly, there is a deep sense in which inference and control can represent a dual view of the same problem. This relationship is most clearly stated in the case of linear quadratic systems, where the Riccati equations express the optimal control policy in terms of the system dynamics (Welch et al., 1995). In fact, this connection extends to a wide range of systems, where control tasks can be related to a dual inference problem through rewards as exponentiated probabilities in a distinct, but coupled, PGM (Todorov, 2007, 2008). A great benefit of this connection is that it can allow the tools of inference to make progress in control problems, and vice versa. In both cases the connections provide new insights, inspire new algorithms and enrich our understanding (Toussaint and Storkey, 2006; Ziebart et al., 2008; Kappen et al., 2012).
Reinforcement learning (RL) is the problem of learning to control an unknown system (Sutton and Barto, 2018). Like the control setting, an RL agent should take actions to maximize its cumulative rewards through time. Like the inference problem, the agent is initially uncertain of the system dynamics, but can learn through the transitions it observes. This leads to a fundamental tradeoff: the agent may be able to improve its understanding through exploring poorly-understood states and actions, but it may be able to attain higher immediate reward through exploiting its existing knowledge (Kearns and Singh, 2002). In many ways, RL combines control and inference into a general framework for decision making under uncertainty. Although there has been ongoing research in this area for many decades, there has been a recent explosion of interest as RL techniques have made high-profile breakthroughs in grand challenges of artificial intelligence research (Mnih et al., 2013; Silver et al., 2016).
A popular line of research has sought to cast ‘RL as inference’, mirroring the dual relationship for control in known systems. This approach is most clearly stated in the tutorial and review of Levine (2018), which provides a key reference for research in this field. It suggests that a generalization of the RL problem can be cast as probabilistic inference through inference over exponentiated rewards, in a continuation of previous work in optimal control (Todorov, 2009). This perspective promises several benefits: a probabilistic perspective on rewards, the ability to apply powerful inference algorithms to solve RL problems and a natural exploration strategy. In this paper we will outline an important way in which this perspective is incomplete. This shortcoming ultimately results in algorithms that can perform poorly in even very simple decision problems. Importantly, these are not merely technical issues confined to edge cases, but fundamental failures of the approach.
In this paper we revisit an alternative framing of ‘RL as inference’. In fact, we show that the original RL problem was already an inference problem all along. (Footnote 1: Note that, unlike control, connecting RL with inference will not involve a separate ‘dual’ problem.) Importantly, this inference problem includes inference over the agent’s future actions and observations. Of course, this perspective is not new, and has long been known as simply the Bayes-optimal solution; see, e.g., Ghavamzadeh et al. (2015). The problem is that, due to the exponential lookahead, this inference problem is fundamentally intractable for all but the simplest problems (Gittins, 1979). For this reason, RL research focuses on computationally efficient approaches that maintain a level of statistical efficiency (Furmston and Barber, 2010; Osband et al., 2017).
We provide a review of the RL problem in Section 2, together with a simple and coherent framing of RL as probabilistic inference. In Section 3 we present three approximations to the intractable Bayes-optimal policy. We begin with the celebrated Thompson sampling algorithm, then we review the popular ‘RL as inference’ framing, as presented by Levine (2018), and highlight a clear and simple shortcoming in this approach. Finally, we review K-learning (O’Donoghue, 2018), which we reinterpret as a modification to the RL as inference framework that provides a principled approach to the statistical inference problem, as well as presenting a relationship with Thompson sampling. In Section 4 we present computational studies that support our claims.
2 Reinforcement learning
We consider the problem of an agent taking actions in an unknown environment in order to maximize cumulative rewards through time. For simplicity, this paper will model the environment as a finite horizon, discrete Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, R, P, H, \rho)$. (Footnote 2: This choice is for clarity; continuous, infinite horizon, or partially-observed environments do not alter our narrative.) Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space and each episode is of fixed length $H$. Each episode $\ell$ begins with state $s_0 \sim \rho$, then for timesteps $h = 0, \ldots, H-1$ the agent selects action $a_h$, observes transition $s_{h+1}$ with probability $P(s_{h+1} \mid s_h, a_h)$ and receives reward $r_{h+1} \sim R(s_h, a_h)$, where we denote by $\mu(s_h, a_h) = \mathbb{E}[r_{h+1}]$ the mean reward. We define a policy $\pi$ to be a mapping from $\mathcal{S}$ to probability distributions over $\mathcal{A}$ and write $\Pi$ for the space of all policies. For any timestep $t = (\ell, h)$, we define $\mathcal{F}_t$ to be the sequence of observations made before time $t$. An RL algorithm $\mathrm{alg}$ maps histories $\mathcal{F}_t$ to policies $\pi_t \in \Pi$.
Our goal in the design of RL algorithms is to obtain good performance (cumulative rewards) for an unknown $M \in \mathcal{M}$, where $\mathcal{M}$ is some family of possible environments. Note that this is a different problem from typical ‘optimal control’, which seeks to optimize performance for one particular known MDP $M$, although you might still fruitfully apply an RL algorithm to solve problems of that type. For any environment $M$ and any policy $\pi$ we can define the action-value function,
$$Q^{M,\pi}_h(s,a) = \mathbb{E}_{\pi, M}\left[\sum_{j=h+1}^{H} r_j \,\middle|\, s_h = s,\ a_h = a\right], \tag{1}$$
where the expectation in (1) is taken with respect to the action selection $a_j$ for $j > h$ from the policy $\pi$ and evolution of the fixed MDP $M$. We define the value function $V^{M,\pi}_h(s) = \mathbb{E}_{a \sim \pi} Q^{M,\pi}_h(s,a)$ and write $Q^{M,\star}_h(s,a) = \max_{\pi \in \Pi} Q^{M,\pi}_h(s,a)$ for the optimal Q-values over policies, and the optimal value function is given by $V^{M,\star}_h(s) = \max_a Q^{M,\star}_h(s,a)$.
In order to compare algorithm performance across different environments, it is natural to normalize in terms of the regret, or shortfall in cumulative rewards relative to the optimal value, over $L$ episodes,
$$\mathrm{Regret}(M, \mathrm{alg}, L) = \mathbb{E}_{M, \mathrm{alg}}\left[\sum_{\ell=1}^{L}\left(V^{M,\star}_0(s^\ell_0) - \sum_{h=1}^{H} r^\ell_h\right)\right]. \tag{2}$$
This quantity depends on the unknown MDP $M$, which is fixed from the start and kept the same throughout, but the expectations are taken with respect to the dynamics of $M$ and the learning algorithm $\mathrm{alg}$. For any particular MDP $M$, the optimal regret of zero can be attained by the non-learning algorithm that returns the optimal policy for $M$.
In order to assess the quality of a reinforcement learning algorithm, which is designed to work across some family of MDPs $\mathcal{M}$, we need some method to condense performance over a set to a single number. There are two main approaches to this:
$$\mathrm{BayesRegret}(\phi, \mathrm{alg}, L) = \mathbb{E}_{M \sim \phi}\, \mathrm{Regret}(M, \mathrm{alg}, L), \tag{3}$$
$$\mathrm{WorstCaseRegret}(\mathcal{M}, \mathrm{alg}, L) = \max_{M \in \mathcal{M}} \mathrm{Regret}(M, \mathrm{alg}, L), \tag{4}$$
where $\phi$ is a prior over the family $\mathcal{M}$. These differing objectives are often framed as Bayesian (average-case) (3) and frequentist (worst-case) (4) RL. (Footnote 3: Technically, some frequentist results are high-probability bounds on the worst case rather than true worst-case bounds, but this distinction is not important for our purposes.) Although these two settings are typically studied in isolation, it should be clear that they are intimately related through the choice of $\mathcal{M}$ and $\phi$. Our next section will investigate what it would mean to ‘solve’ the RL problem. Importantly, we show that both frequentist and Bayesian perspectives already amount to a problem in probabilistic inference, without the need for additional reinterpretation.
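As a minimal illustration of how the average-case and worst-case objectives condense per-environment regret into a single number, the sketch below assumes a hypothetical table of regrets for two made-up algorithms on a two-environment family. The algorithm names, environment names, prior and regret values are all invented for illustration.

```python
def bayes_regret(regret_by_env, prior):
    """Average-case objective: prior-weighted mean of per-environment regret."""
    return sum(prior[env] * regret for env, regret in regret_by_env.items())


def minimax_regret(regret_by_env):
    """Worst-case objective: largest regret over the family of environments."""
    return max(regret_by_env.values())


# Hypothetical regrets of two algorithms on a two-environment family.
regrets = {
    "cautious": {"M_plus": 3.0, "M_minus": 3.0},
    "gambler": {"M_plus": 0.0, "M_minus": 10.0},
}
# Hypothetical prior over the family.
prior = {"M_plus": 0.9, "M_minus": 0.1}

for name, table in regrets.items():
    print(name, bayes_regret(table, prior), minimax_regret(table))
```

Under this assumed prior the "gambler" has the lower average-case regret while the "cautious" algorithm has the lower worst-case regret, which is the sense in which the two objectives are linked through the choice of family and prior.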
2.1 Solving the RL problem through probabilistic inference
If you want to ‘solve’ the RL problem, then formally the objective is clear: find the RL algorithm that minimizes your chosen objective, (3) or (4). To anchor our discussion, we introduce a simple decision problem designed to highlight some key aspects of reinforcement learning. We will revisit this problem setting as we discuss approximations to the optimal policy.
Problem 1 (One unknown action).
Fix and define . Both and share and ; they only differ through their rewards:
Where is a shorthand for deterministic reward of when choosing action .
Problem 1 is extremely simple: it involves no generalization and no long-term consequences, and is an independent bandit problem with only one unknown action. For known the optimal policy is trivial: choose in and in for all . An RL agent faced with unknown should attempt to optimize the RL objectives (3) or (4). Unusually, and only because Problem 1 is so simple, we can actually compute the optimal solutions to both in terms of (the total number of episodes) and where , the probability of being in .
For an optimal minimax RL algorithm is to first choose and observe . If then you know you are in so pick for all , for . If then you know you are in so pick for all , for . The minimax regret of this algorithm is , which cannot be bested by any algorithm.
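The explore-first strategy described above can be sketched on a stand-in bandit. The reward values below (an unknown arm paying +2 or -2, and one known arm paying 1) are hypothetical and are not the values of Problem 1; the function names are likewise illustrative.

```python
def explore_then_commit(env_rewards, unknown_arm, num_episodes):
    """Pull the unknown arm once, then commit to the best arm given what was seen.

    env_rewards maps each arm to its deterministic reward in the true environment.
    """
    observed = env_rewards[unknown_arm]  # first episode resolves the uncertainty
    # Every other arm's reward is known in advance, so compare directly.
    best_known = max(r for arm, r in env_rewards.items() if arm != unknown_arm)
    per_episode = max(observed, best_known)
    return observed + per_episode * (num_episodes - 1)


def regret(env_rewards, unknown_arm, num_episodes):
    """Shortfall relative to always playing the best arm of the true environment."""
    optimal = max(env_rewards.values()) * num_episodes
    return optimal - explore_then_commit(env_rewards, unknown_arm, num_episodes)


# Hypothetical environments: arm 0 is the unknown arm, arm 1 is known.
M_PLUS = {0: 2.0, 1: 1.0}
M_MINUS = {0: -2.0, 1: 1.0}

for horizon in (10, 1000):
    print(horizon, regret(M_PLUS, 0, horizon), regret(M_MINUS, 0, horizon))
```

In this sketch the regret is paid once, in the first episode, and does not grow with the number of episodes, because a single pull of the unknown arm identifies the environment.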
Actually, the same RL algorithm is also Bayes-optimal for any provided . This relationship is not a coincidence. All admissible solutions to the minimax problem (4) are given by solutions to the average-case (3) for some ‘worst-case’ prior (Wald, 1950). As such, for ease of exposition, our discussion will focus on the Bayesian (or average-case) setting. However, readers should understand that the same arguments apply to the minimax objective.
In Problem 1, the key probabilistic inference the agent must consider is the effects of its own actions upon the future rewards, i.e., whether it has chosen action . Slightly more generally, where actions are independent and episode length , the optimal RL algorithm can be computed via Gittins indices, but these problems are very much the exception (Gittins, 1979). In problems with generalization or long-term consequences, computing the Bayes-optimal solution is computationally intractable. One example of an algorithm that converges to the Bayes-optimal solution in the limit of infinite computation is given by Bayes-adaptive Monte-Carlo Planning (Guez et al., 2012). The problem is that, even for very simple problems, the lookahead tree of interactions between actions, observations and algorithmic updates grows exponentially in the search depth (Strehl et al., 2006). Worse still, direct computational approximations to the Bayes-optimal solution can fail exponentially badly should they fall short of the required computation (Munos, 2014). As a result, research in reinforcement learning amounts to trying to find computationally tractable approximations to the Bayes-optimal policy that maintain some degree of statistical efficiency.
3 Approximations for computational and statistical efficiency
The exponential explosion of future actions and observations means solving for the Bayes-optimal solution is computationally intractable. To counter this, most computationally efficient approaches to RL simplify the problem at time to only consider inference over the data that has been gathered prior to time . The most common family of these algorithms are ‘certainty equivalent’ (under an identity utility): they take a point estimate for their best guess of the environment , and try to optimize their control given these estimates . Typically, these algorithms are used in conjunction with some dithering scheme for random action selection (e.g., epsilon-greedy), to mitigate premature and suboptimal convergence (Watkins, 1989). However, since these algorithms do not prioritize their exploration, they may take exponentially long to find the optimal policy (Osband et al., 2014).
In order for an RL algorithm to be statistically efficient, it must consider the value of information. To do this, an agent must first maintain some notion of epistemic uncertainty, so that it can direct its exploration towards states and actions that it does not understand well (O’Donoghue et al., 2018). Here again, probabilistic inference finds a natural home in RL: we should build up posterior estimates for the unknown problem parameters, and use this distribution to drive efficient exploration. (Footnote 4: For the purposes of this paper, we will focus on optimistic approaches to exploration, although more sophisticated information-seeking approaches merit investigation in future work (Russo and Van Roy, 2014).)
3.1 Thompson sampling
One of the oldest heuristics for balancing exploration with exploitation is given by Thompson sampling, or probability matching (Thompson, 1933). Each episode, Thompson sampling (TS) randomly selects a policy according to the probability it is the optimal policy, conditioned upon the data seen prior to that episode. Thompson sampling is a simple and effective method that successfully balances exploration with exploitation (Russo et al., 2018).
Implementing Thompson sampling amounts to an inference problem at each episode. For each define the binary random variable
where denotes the event that action is optimal for state in timestep . (Footnote 5: For the problem definition in Section 2 there is always a deterministic optimal policy for .) The TS policy for episode is thus given by the inference problem,
(5)
where is the joint probability over all the binary optimality variables (hereafter we shall suppress the dependence on ). To understand how Thompson sampling guides exploration let us consider its performance in Problem 1 when implemented with a uniform prior . In the first timestep the agent samples . If it samples it will choose action and learn the true system dynamics, choosing the optimal arm thereafter. If it samples it will choose action and repeat the identical decision in the next timestep. Note that this procedure achieves BayesRegret 2.5 according to , but also minimax regret , which matches the optimal minimax performance despite its uniform prior.
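The sample-then-act loop described above can be sketched for a family with two candidate environments. The rewards used here (+2 or -2 for the unknown arm, 1 for a known arm) are hypothetical stand-ins rather than the values of Problem 1, and the function names are illustrative.

```python
import random

# Hypothetical two-candidate family: arm 0 is the only arm whose reward differs.
ENVS = {
    "plus": {0: 2.0, 1: 1.0},
    "minus": {0: -2.0, 1: 1.0},
}


def thompson_sampling(true_env, prior_plus, num_episodes, rng):
    """Each episode: sample an environment from the posterior, act optimally for it."""
    p_plus = prior_plus  # posterior probability of the "plus" environment
    total = 0.0
    for _ in range(num_episodes):
        sampled = "plus" if rng.random() < p_plus else "minus"
        arm = max(ENVS[sampled], key=ENVS[sampled].get)
        total += ENVS[true_env][arm]
        if arm == 0:
            # Rewards are deterministic, so one pull of arm 0 identifies the truth.
            p_plus = 1.0 if true_env == "plus" else 0.0
        # Pulling arm 1 is uninformative, so the posterior is left unchanged.
    return total


def average_regret(true_env, prior_plus, num_episodes, num_runs, seed=0):
    """Monte Carlo estimate of regret in one fixed environment."""
    rng = random.Random(seed)
    best = max(ENVS[true_env].values()) * num_episodes
    totals = [
        thompson_sampling(true_env, prior_plus, num_episodes, rng)
        for _ in range(num_runs)
    ]
    return best - sum(totals) / num_runs


print(average_regret("plus", 0.5, 50, 1000), average_regret("minus", 0.5, 50, 1000))
```

Because the probability of trying the unknown arm equals the posterior probability that it is optimal, the sampler keeps returning to it until the uncertainty is resolved, after which it acts optimally.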
Recent interest in TS was kindled by strong empirical performance in bandit tasks (Chapelle and Li, 2011). Subsequent work has shown that this algorithm satisfies strong Bayesian regret bounds close to the known lower bounds for MDPs, under certain assumptions (Osband and Van Roy, 2017, 2016). However, although much simpler than the Bayes-optimal solution, the inference problem in (5) can still be prohibitively expensive. Table 1 describes one approach to performing the sampling required in (5) implicitly, by maintaining an explicit model over MDP parameters. This algorithm can be computationally intractable as the MDP becomes large, and so attempts to scale Thompson sampling to complex systems have focused on approximate posterior samples via randomized value functions, but it is not yet clear under which settings these approximations should be expected to perform well (Osband et al., 2017). As we look for practical, scalable approaches to posterior inference, one promising (and popular) approach is known commonly as ‘RL as inference’.
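Table 1: Model-based Thompson sampling.
Before episode: sample an MDP from the posterior over MDP parameters.
Bellman equation: solve for the optimal Q-values and value function of the sampled MDP.
Policy: act greedily with respect to the sampled optimal Q-values.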
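One way the steps summarized in Table 1 might look for a small tabular MDP is sketched below. It assumes a Dirichlet posterior over transitions and an independent Gaussian posterior over mean rewards; these modelling choices, and every function and variable name, are illustrative assumptions rather than the authors' implementation.

```python
import random


def sample_dirichlet(counts, rng):
    """Draw a probability vector from Dirichlet(counts) via normalized gammas."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mdp(trans_counts, reward_mean, reward_std, rng):
    """Before the episode: draw one MDP from the (assumed) posterior."""
    S, A = len(trans_counts), len(trans_counts[0])
    P = [[sample_dirichlet(trans_counts[s][a], rng) for a in range(A)] for s in range(S)]
    mu = [[rng.gauss(reward_mean[s][a], reward_std[s][a]) for a in range(A)]
          for s in range(S)]
    return P, mu


def plan(P, mu, horizon):
    """Finite-horizon value iteration on the sampled MDP.

    Returns (policy, V) where policy[h][s] is the greedy action at step h and
    V[s] is the optimal value at the first step.
    """
    S, A = len(mu), len(mu[0])
    V = [0.0] * S
    policy = []
    for _ in range(horizon):
        Q = [[mu[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        policy.append([max(range(A), key=lambda a: Q[s][a]) for s in range(S)])
        V = [max(Q[s]) for s in range(S)]
    policy.reverse()  # computed from the last step backwards
    return policy, V


# Usage with a uniform prior over a 2-state, 2-action MDP (hypothetical numbers).
rng = random.Random(0)
counts = [[[1.0, 1.0] for _ in range(2)] for _ in range(2)]
means = [[0.0, 0.0], [0.0, 0.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]
P_sample, mu_sample = sample_mdp(counts, means, stds, rng)
episode_policy, _ = plan(P_sample, mu_sample, horizon=3)
print(episode_policy)
```

The cost of the planning step grows with the number of states and actions, which is the scaling concern raised above for large MDPs.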
3.2 The ‘RL as inference’ framework and its limitations
The computational challenges of Thompson sampling suggest an approximate algorithm that replaces (5) with a parametric distribution suitable for expedient computation. It is possible to view the algorithms of the ‘RL as inference’ approach in this light (Rawlik et al., 2013; Todorov, 2009; Toussaint, 2009; Deisenroth et al., 2013; Fellows et al., 2019); see Levine (2018) for a recent survey. These algorithms choose to model the probability of optimality according to,
(6)
for some , where is a trajectory (a sequence of state-action pairs) starting from at timestep , and where denotes the expectation under the posterior at episode . With this potential in place one can perform Bayesian inference over the unobserved ‘optimality’ variables, obtaining posteriors over the policy or other variables of interest. This presentation of the RL as inference framework is slightly closer to the one in Deisenroth et al. (2013, §2.4.2.2) than to Levine (2018), but ultimately it produces the same family of algorithms. We provide such a derivation in the appendix for completeness.
Applying inference procedures to (6) leads naturally to RL algorithms with some ‘soft’ Bellman updates, and added entropy regularization. We describe the general structure of these algorithms in Table 2. These algorithmic connections can help reveal connections to policy gradient, actor-critic, and maximum entropy RL methods (Mnih et al., 2016; O’Donoghue et al., 2017; Haarnoja et al., 2017, 2018; Eysenbach et al., 2018). The problem is that this resultant ‘posterior’ derived using (6) does not generally bear any close relationship to the agent’s epistemic probability that is optimal.
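Table 2: Soft Q-learning.
Bellman equation: a ‘soft’ Bellman update that uses the expected reward under the posterior.
Policy: a Boltzmann policy over the resulting soft Q-values.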
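The ‘soft’ Bellman update and Boltzmann policy referred to above are commonly written with a log-sum-exp in place of the max. The sketch below uses that standard form; the symbol `beta` for the inverse temperature and all function names are assumptions made for illustration.

```python
import math


def soft_value(q_values, beta):
    """Soft maximum (1 / beta) * log(sum_a exp(beta * q_a)), computed stably."""
    m = max(q_values)
    return m + math.log(sum(math.exp(beta * (q - m)) for q in q_values)) / beta


def boltzmann_policy(q_values, beta):
    """Action probabilities proportional to exp(beta * q_a)."""
    v = soft_value(q_values, beta)
    return [math.exp(beta * (q - v)) for q in q_values]


def soft_backup(mean_reward, mean_transition, next_q, beta):
    """One soft Bellman update for a single state-action pair.

    mean_reward and mean_transition are point estimates (e.g. posterior means);
    next_q[s2] lists the soft Q-values of each action in successor state s2.
    """
    next_v = [soft_value(q_row, beta) for q_row in next_q]
    return mean_reward + sum(p * v for p, v in zip(mean_transition, next_v))


# Hypothetical numbers: two successor states with two actions each.
q = soft_backup(0.5, [0.25, 0.75], [[1.0, 0.0], [0.0, 0.0]], beta=2.0)
print(q, boltzmann_policy([1.0, 0.0], beta=2.0))
```

Only point estimates of the reward and transitions enter this update, so two actions with the same estimated value receive the same probability however uncertain either estimate is.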
To understand how ‘RL as inference’ guides decision making, let us consider its performance in Problem 1. Practical implementations of ‘RL as inference’ estimate through observations. For large, and without prior guidance, the agent is then extremely unlikely to select action and so resolve its epistemic uncertainty. Even for an informed prior , action selection according to the exploration strategy of Boltzmann dithering is unlikely to sample action for which (Levine, 2018; Cesa-Bianchi et al., 2017). This is because the ‘distractor’ actions with are much more probable under the Boltzmann policy.
This is the same problem that afflicts most dithering approaches to exploration. ‘RL as inference’ as a framework does not incorporate an agent’s epistemic uncertainty, and so can lead to poor policies for even simple problems. While (6) allows the construction of a dual ‘posterior distribution’, this distribution does not generally bear any relation to the typical posterior an agent should compute conditioned upon the data it has gathered, e.g., equation (5). Despite this shortcoming, RL as inference has inspired many interesting and novel techniques, as well as delivered algorithms with good performance on problems where exploration is not the bottleneck (Eysenbach et al., 2018). However, due to the use of language about ‘optimality’ and ‘posterior inference’, it may come as a surprise to some that this framework does not truly tackle the Bayesian RL problem. Indeed, algorithms using ‘RL as inference’ can perform very poorly on problems where accurate uncertainty quantification is crucial to performance. We hope that this paper sheds some light on the topic.
3.3 Making sense of ‘RL as Inference’ via K-learning
In this section we suggest a subtle alteration to the ‘RL as inference’ framework that develops a coherent notion of optimality. The K-learning algorithm was originally introduced through a risk-seeking exponential utility (O’Donoghue, 2018). In this paper we rederive this algorithm as a principled approximate inference procedure with clear connections to Thompson sampling, and we highlight its similarities to the ‘RL as inference’ framework. We believe that this may offer a road towards combining the respective strengths of Thompson sampling and the ‘RL as inference’ frameworks. First, consider the following approximate conditional optimality probability at :
(7) 
for some , and note that this is conditioned on the random variable . We can marginalize over possible Q-values yielding
(8) 
where denotes the cumulant generating function of the random variable (Kendall, 1946). Clearly K-learning and the ‘RL as inference’ framework are similar, since equations (6) and (7) are closely linked, but there is a crucial difference. Notice that the integral performed in (8) is with respect to the posterior over , which includes the epistemic uncertainty explicitly.
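As a worked illustration of how marginalizing over the posterior brings uncertainty into the optimality probability, assume (purely for illustration; nothing above requires it) that the posterior over the optimal Q-value of an action is Gaussian with mean $m$ and variance $\sigma^2$, and write $\beta$ for the inverse temperature (a symbol assumed here). The cumulant generating function then has a closed form:

```latex
G(\beta) = \log \mathbb{E}\left[e^{\beta Q}\right]
         = \beta m + \tfrac{1}{2}\beta^{2}\sigma^{2},
\qquad
\mathbb{P}(\text{action is optimal}) \propto
  \exp\left(\beta m + \tfrac{1}{2}\beta^{2}\sigma^{2}\right).
```

The second term acts as a bonus that grows with the posterior variance, so of two actions with equal posterior mean, the one with more epistemic uncertainty receives more probability. Replacing the posterior by a point mass at $m$ removes the bonus and leaves a Boltzmann weighting of the mean alone.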
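Table 3: K-learning.
Before episode: calculate the inverse temperature for the episode from its schedule.
Bellman equation: a ‘soft’ Bellman update in which the expected reward is replaced by an optimistic quantity given by the cumulant generating function of the reward under the posterior.
Policy: a Boltzmann policy over the resulting K-values.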
Given the approximation to the posterior probability of optimality in (8) we could sample actions from it as our policy, as done by Thompson sampling (5). However, that requires computation of the cumulant generating function , which is non-trivial. O’Donoghue (2018) showed that an upper bound to the cumulant generating function could be computed by solving a particular ‘soft’ Bellman equation. The resulting K-values, denoted at , are also optimistic for the expected optimal Q-values. Specifically, for a sequence the following holds
(9)
Following a Boltzmann policy over these K-values satisfies a Bayesian regret bound which matches the current best bound for Thompson sampling up to logarithmic factors under the same set of assumptions. We summarize the K-learning algorithm in Table 3, where is a constant and denotes the cumulant generating function of under the posterior.
Comparing Tables 2 and 3 it is clear that soft Q-learning and K-learning share some similarities: they both solve a ‘soft’ value function and use Boltzmann policies. However, the differences are important. Firstly, K-learning has an explicit schedule for the inverse temperature parameter , and secondly it replaces the expected reward (with respect to the posterior) with a quantity that is optimistic for the expected reward under the posterior. These two relatively small changes make K-learning a principled exploration and inference strategy.
To understand how K-learning drives exploration, consider its performance on Problem 1. Since this is a bandit problem we can compute the cumulant generating functions for each arm and then use the policy given by (8). For any non-trivial prior and choice of the cumulant generating function is optimistic for arm which results in the policy selecting arm more frequently, thereby resolving its epistemic uncertainty. As K-learning converges on pulling arm with probability one. This is in contrast to soft Q-learning where arm is exponentially unlikely to be selected as the exploration parameter grows.
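The contrast drawn above can be checked numerically on a stand-in bandit. The sketch assumes a hypothetical instance: arm 0 pays +2 or -2 with equal posterior probability, and nine known ‘distractor’ arms pay 0.9. These numbers, the symbol `beta` and the function names are illustrative and are not taken from Problem 1.

```python
import math


def softmax(scores):
    """Normalized exponential weights, computed stably."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def cgf(values, probs, beta):
    """Cumulant generating function log E[exp(beta * X)] of a discrete X."""
    m = max(beta * v for v in values)
    return m + math.log(sum(p * math.exp(beta * v - m) for v, p in zip(values, probs)))


# Hypothetical bandit: arm 0 is uncertain, arms 1..N-1 are known distractors.
N, DISTRACTOR = 10, 0.9
UNKNOWN_VALUES, UNKNOWN_PROBS = [2.0, -2.0], [0.5, 0.5]


def mean_based_prob_unknown(beta):
    """Boltzmann policy on posterior means (the soft Q-learning style weighting)."""
    mean_unknown = sum(v * p for v, p in zip(UNKNOWN_VALUES, UNKNOWN_PROBS))
    return softmax([beta * mean_unknown] + [beta * DISTRACTOR] * (N - 1))[0]


def cgf_based_prob_unknown(beta):
    """Policy proportional to exp(cumulant generating function) of each arm."""
    g_unknown = cgf(UNKNOWN_VALUES, UNKNOWN_PROBS, beta)
    return softmax([g_unknown] + [beta * DISTRACTOR] * (N - 1))[0]


for beta in (1.0, 5.0, 10.0):
    print(beta, mean_based_prob_unknown(beta), cgf_based_prob_unknown(beta))
```

In this instance the mean-based weighting sends the probability of the uncertain arm towards zero as `beta` grows, while the weighting based on the cumulant generating function sends it towards one, because the latter is dominated by the best outcome that the posterior still considers possible.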
3.3.1 Connections between K-learning and Thompson sampling
Since K-learning can be viewed as approximating the posterior probability of optimality of each action it is natural to ask how close an approximation it is. A natural way to measure this similarity is the Kullback–Leibler (KL) divergence between the distributions,
where we are using the notation and . This is different from the usual notion of distance that is taken in variational Bayesian methods, which would typically reverse the order of the arguments in the KL divergence (Blundell et al., 2015). However, in RL that ‘direction’ is not appropriate: a distribution minimizing may put zero probability on regions of support of . This means an action with non-zero probability of being optimal might never be taken. On the other hand a policy minimizing must assign a non-zero probability to every action that has a non-zero probability of being optimal, or incur an infinite KL divergence penalty. With this characterization in mind, and noting that the Thompson sampling policy satisfies , our next result links the policies of K-learning to Thompson sampling.
Theorem 1.
The K-learning value function and policy defined in Table 3 satisfy the following bound at every state and :
(10) 
We defer the proof to Appendix 5.2. This theorem tells us that the distance between the true probability of optimality and the K-learning policy is bounded for any choice of . In other words, if there is an action that might be optimal then K-learning will eventually take that action.
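The asymmetry between the two directions of the KL divergence discussed in this section can be seen on a two-action example. The probabilities below are hypothetical and the function name is illustrative.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if q misses support of p."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with zero mass under p contribute nothing
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical probabilities that each of two actions is optimal.
prob_optimal = [0.7, 0.3]
# A greedy policy that never tries the second action.
greedy_policy = [1.0, 0.0]

print(kl(prob_optimal, greedy_policy))  # infinite penalty for ignoring action 2
print(kl(greedy_policy, prob_optimal))  # finite, so this direction tolerates it
```

Measured with the probability of optimality as the first argument, a policy that drops a possibly optimal action is penalized without bound, whereas the reversed direction assigns the same policy a finite divergence.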
3.4 Why is ‘RL as Inference’ so popular?
The sections above outline some surprising ways that the ‘RL as inference’ framework can drive suboptimal behaviour in even simple domains. The question remains, why do so many popular and effective algorithms lie within this class? The first, and most important point, is that these algorithms can perform extremely well in domains where efficient exploration is not a bottleneck. Furthermore, they are often easy to implement and amenable to function approximation (Peters et al., 2010; Kober and Peters, 2009; Abdolmaleki et al., 2018). Our discussion of K-learning in Section 3.3 shows that a relatively simple fix to this problem formulation can result in a framing of RL as inference that maintains a coherent notion of optimality. Computational results show that, in tabular domains, K-learning can be competitive with, or even outperform, Thompson sampling strategies, but extending these results to large-scale domains with generalization is an open question (O’Donoghue, 2018; Osband et al., 2017).
The other observation is that the ‘RL as inference’ framing can provide useful insights into the structure of particular algorithms for RL. It is valid to note that, under certain conditions, following policy gradient is equivalent to a dual inference problem where the ‘probabilities’ play the role of dummy variables, but are not supposed to represent the probability of optimality in the RL problem. In this light, Levine (2018) presents the inference framework as a way to generalize a wide range of state of the art RL algorithms. However, when taking this view, you should remember that this inference duality is limited to certain RL algorithms, and without some modifications (e.g. Section 3.3) this perspective is in danger of overlooking important aspects of the RL problem.
4 Computational experiments
4.1 One unknown action (Problem 1)
Consider the environment of Problem 1 with uniform prior . We fix and consider how the Bayesian regret varies with . Figure 1 compares how the regret scales for Bayes-optimal (), Thompson sampling (), K-learning () and soft Q-learning (which grows linearly in for the optimal , but would typically grow exponentially for ). This highlights that, even in a simple problem, there can be great value in considering the value of information.
4.2 ‘DeepSea’ exploration
Our next set of experiments considers the ‘DeepSea’ MDPs introduced by Osband et al. (2017). At a high level this problem represents a ‘needle in a haystack’, designed to require efficient exploration, the complexity of which grows with the problem size . DeepSea (Figure 2) is a scalable variant of the ‘chain MDPs’ popular in exploration research (Jaksch et al., 2010). (Footnote 6: DeepSea figure taken with permission from the ‘bsuite’, Osband et al. (2019).)
The agent begins each episode in the top-left state in an grid. At each timestep the agent can move left or right one column, and falls one row. There is a small negative reward for heading right, and zero reward for left. There is only one rewarding state, at the bottom right cell. The only way the agent can receive positive reward is to choose to go right in each timestep. Algorithms that do not perform deep exploration will take an exponential number of episodes to learn the optimal policy, but those that prioritize informative states and actions can learn much faster.
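The dynamics just described can be sketched as a small environment class. This is a simplified stand-in written for illustration, not the bsuite implementation: the class name, the per-step cost of `0.01 / size` for moving right and the treasure value of `1.0` are all assumptions.

```python
class DeepSeaSketch:
    """Simplified size x size DeepSea: start top-left, fall one row per step."""

    def __init__(self, size, move_cost=0.01, treasure=1.0):
        self.size = size
        self.move_cost = move_cost / size  # assumed small cost for each 'right'
        self.treasure = treasure           # assumed reward at the bottom-right
        self.reset()

    def reset(self):
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action):
        """action: 0 = left, 1 = right. Returns (state, reward, done)."""
        reward = 0.0
        if action == 1:
            reward -= self.move_cost
            # The treasure is collected by heading right from the bottom-right cell.
            if self.row == self.size - 1 and self.col == self.size - 1:
                reward += self.treasure
            self.col = min(self.col + 1, self.size - 1)
        else:
            self.col = max(self.col - 1, 0)
        self.row += 1
        done = self.row == self.size
        return (self.row, self.col), reward, done


def run(env, actions):
    """Play one episode with a fixed action sequence and return the total reward."""
    env.reset()
    total = 0.0
    for action in actions:
        _, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


env = DeepSeaSketch(4)
print(run(env, [1, 1, 1, 1]), run(env, [0, 1, 1, 1]))
```

In this sketch only the all-right action sequence reaches the treasure, so a policy that picks actions uniformly at random succeeds with probability `2 ** -size` per episode, while every exploratory step to the right is immediately penalized.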
Figure 2(a) shows the ‘time to learn’ for tabular implementations of K-learning (Section 3.3), soft Q-learning (Section 3.2) and Thompson sampling (Section 3.1). We implement each of the algorithms with a prior for rewards and prior for transitions. Since these problems are small and tabular, we can use conjugate prior updates and exact MDP planning via value iteration. As expected, Thompson sampling and K-learning scale gracefully to large domains but soft Q-learning does not.
4.3 Behaviour Suite for Reinforcement Learning
So far our experiments have been confined to the tabular setting, but the main focus of ‘RL as inference’ is for scalable algorithms that work with generalization. In this section we show that the same insights we built in the tabular setting extend to the setting of deep RL. To do this we implement variants of Deep Q-Networks with a single-layer, 50-unit MLP (Mnih et al., 2013). To adapt K-learning and Thompson sampling to this deep RL setting we use an ensemble of size 20 with randomized prior functions to approximate the posterior distribution over neural network Q-values (Osband et al., 2018) (full experimental details are included in Appendix 5.4). We then evaluate all of the algorithms on bsuite: a suite of benchmark tasks designed to highlight key issues in RL (Osband et al., 2019).
In particular, bsuite includes an evaluation on the DeepSea problems but with a one-hot pixel representation of the agent position. In Figure 2(b) we see that the results for these deep RL implementations closely match the observed scaling for the tabular setting. In particular, the algorithms motivated by Thompson sampling and K-learning both scale gracefully to large problem sizes, where soft Q-learning is unable to drive deep exploration. Our bsuite evaluation includes many more experiments than can be fit into this paper, but we provide a link to the complete results at bit.ly/rlinferencebsuite. In general, the results for Thompson sampling and K-learning are similar, with soft Q-learning performing significantly worse on ‘exploration’ tasks. We defer a summary of these results to Appendix 6.
5 Conclusion
This paper aims to make sense of reinforcement learning and probabilistic inference. We review the reinforcement learning problem and show that this problem of optimal learning already combines the problems of control and inference. As we highlight this connection, we also clarify some potentially confusing details in the popular ‘RL as inference’ framework. We show that, since this problem formulation ignores the role of epistemic uncertainty, algorithms derived from that framework can perform poorly on even simple tasks. Importantly, we also offer a way forward, to reconcile the views of RL and inference in a way that maintains the best pieces of both. In particular, we show that a simple variant to the RL as inference framework (K-learning) can incorporate uncertainty estimates to drive efficient exploration. We support our claims with a series of simple didactic experiments. We leave the crucial questions of how to scale these insights up to large complex domains for future work.
References
 Maximum a posteriori policy optimisation. In International Conference on Learning Representations (ICLR), Cited by: §3.4.
 Stochastic simulation: Algorithms and analysis. Vol. 57, Springer Science & Business Media. Cited by: §5.2.
 Dynamic programming and optimal control. Vol. 1, Athena Scientific. Cited by: §1.
 Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §3.3.1.
 Boltzmann exploration done right. In Advances in Neural Information Processing Systems, pp. 6287–6296. Cited by: §3.2.
 An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pp. 2249–2257. Cited by: §3.1.
 A survey on policy search for robotics. Foundations and Trends® in Robotics 2 (1–2), pp. 1–142. Cited by: §3.2.
 Diversity is all you need: learning skills without a reward function. arXiv preprint arXiv:1802.06070. Cited by: §3.2, §3.2.
 Virel: a variational inference framework for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 7120–7134. Cited by: §3.2.
 Variational methods for reinforcement learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 241–248. Cited by: §1.

 Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning 8 (5–6), pp. 359–483. Cited by: §1.
 Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological), pp. 148–177. Cited by: §1, §2.1.
 Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pp. 1025–1033. Cited by: §2.1.
 Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning (ICML). Cited by: §3.2.
 Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §3.2.
 Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 (Apr), pp. 1563–1600. Cited by: §4.2.
 Optimal control as a graphical model inference problem. Machine Learning 87 (2), pp. 159–182. Cited by: §1.
 Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2–3), pp. 209–232. Cited by: §1.
 The advanced theory of statistics. Charles Griffin and Co., Ltd., London. Cited by: §3.3.
 Adam: A method for stochastic optimization. Note: arXiv preprint arXiv:1412.6980 Cited by: §5.4.
 Policy search for motor primitives in robotics. In Advances in neural information processing systems, pp. 849–856. Cited by: §3.4.
 Probabilistic graphical models: principles and techniques. MIT press. Cited by: §1.
 Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909. Cited by: §1, §1, §3.2, §3.2, §3.4, §5.1.
 Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 1928–1937. Cited by: §3.2.

 Playing Atari with deep reinforcement learning. In NIPS Deep Learning Workshop. Cited by: §1, §4.3, §6.1.
 From bandits to Monte-Carlo tree search: the optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning 7 (1), pp. 1–129. Cited by: §2.1.
 Combining policy gradient and Q-learning. In International Conference on Learning Representations (ICLR). Cited by: §3.2, 3rd item.
 The uncertainty Bellman equation and exploration. In Proceedings of the 35th International Conference on Machine Learning (ICML), Cited by: §3.
 Variational Bayesian reinforcement learning with regret bounds. arXiv preprint arXiv:1807.09647. Cited by: §1, §3.3, §3.3, §3.4, 2nd item, §5.2, §5.3.
 Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8617–8629. Cited by: §4.3, 1st item, 2nd item, §5.4.
 Deep exploration via bootstrapped DQN. In Advances In Neural Information Processing Systems, pp. 4026–4034. Cited by: 1st item.
 Behaviour suite for reinforcement learning. Cited by: §4.3, §6, footnote 6.
 Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608. Cited by: §1, §3.1, §3.4, §4.2.
 Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635. Cited by: §3.
 On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732. Cited by: §3.1.
 Why is posterior sampling better than optimism for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), Cited by: §3.1.
 Relative entropy policy search. In AAAI. Cited by: §3.4.
 On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence. Cited by: §3.2.
 A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning 11 (1), pp. 1–96. Cited by: §3.1.
 Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pp. 1583–1591. Cited by: footnote 4.
 Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
 PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888. Cited by: §2.1.
 Reinforcement learning: an introduction. MIT press. Cited by: §1.
 On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. Cited by: §3.1.
 Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems, pp. 1369–1376. Cited by: §1.
 General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. Cited by: §1.
 Efficient computation of optimal actions. Proceedings of the National Academy of Sciences 106 (28), pp. 11478–11483. Cited by: §1, §3.2.
 Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, pp. 945–952. Cited by: §1.
 Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, pp. 1049–1056. Cited by: §3.2.
 Statistical decision functions. Cited by: §2.1.
 Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, England. Cited by: §3.
 An introduction to the Kalman filter. Cited by: §1.
 Maximum entropy inverse reinforcement learning. Cited by: §1.
Appendix
5.1 Soft Q-learning derivation
We present a derivation of soft Q-learning from the RL as inference parametric approximation to the probability of optimality. Although our presentation is slightly different from that of Levine (2018), we show here that the resulting algorithms are essentially identical. Recall from equation (6) that the parametric approximation to optimality we consider is given by
where is a trajectory starting from at time and is a hyperparameter. Now we must marginalize out the possible trajectories using the (unknown) system dynamics. Since this is a certainty-equivalent algorithm we shall use the expected value of the transition probabilities, under the posterior at episode , which means we can write
and we make the additional assumption that the ‘prior’ is uniform across all actions for each (this assumption is standard in this framework, see Levine (2018)). In this case we obtain
Now with this we can rewrite
where is the normalization constant for state , since for any , and using Jensen’s inequality we have the following bound
now if we introduce the soft Q-values that satisfy the soft Bellman equation
then
and we have the soft Q-learning algorithm (the approximation comes from the fact that we used Jensen’s inequality to provide a bound).
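As a rough illustration of the soft Bellman recursion described above, the following Python sketch computes soft Q-values by backward induction in a small finite-horizon tabular MDP. It is a minimal sketch under stated assumptions rather than the exact algorithm of this appendix: the inverse temperature `beta`, the array layout, and the function names are chosen for this example, the backup uses one common log-sum-exp form of the soft Bellman equation, and `mean_reward` and `mean_transition` stand in for the posterior expectations used by a certainty-equivalent method.

```python
import numpy as np

def soft_value(q, beta):
    """Soft maximum over the last axis: (1/beta) * log sum_a exp(beta * q[..., a])."""
    m = q.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    lse = np.log(np.exp(beta * (q - m)).sum(axis=-1, keepdims=True)) / beta
    return (m + lse).squeeze(-1)

def soft_q_iteration(mean_reward, mean_transition, horizon, beta):
    """Backward induction for soft Q-values (assumed form of the backup).

    mean_reward:     array of shape (S, A), expected rewards.
    mean_transition: array of shape (S, A, S), expected transition probabilities.
    Returns an array of shape (horizon, S, A).
    """
    n_states, n_actions = mean_reward.shape
    q = np.zeros((horizon, n_states, n_actions))
    v_next = np.zeros(n_states)  # no value after the final timestep
    for h in reversed(range(horizon)):
        q[h] = mean_reward + mean_transition @ v_next
        v_next = soft_value(q[h], beta)
    return q

def boltzmann_policy(q, beta):
    """Policy proportional to exp(beta * q) over the last axis."""
    logits = beta * (q - q.max(axis=-1, keepdims=True))
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)
```

The soft value always lies between the hard maximum and the hard maximum plus log(A)/beta, so larger values of `beta` bring the recursion closer to standard value iteration.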
5.2 Proof of Theorem 1
Theorem.
The K-learning value function and policy defined in Table 3 satisfy the following bound at every state and :
Proof.
Fix some particular state , and let the joint posterior over value and optimality be denoted by
(11) 
where we use to denote the conditional distribution over Q-values conditioned on optimality. Recall that from equation (7) we have approximated the conditional posterior probability of optimality as
for some , which when yields
From Bayes’ rule this implies the following approximation to the conditional distribution
(12) 
This is known as the exponential tilt of the posterior distribution and has a myriad of applications in statistics (Asmussen and Glynn, 2007). From this we could derive an approximation to the joint posterior (11); however, the K-learning policy does not follow (8), since computing the cumulant generating function is nontrivial. Instead we compute the K-values, which are the solution to a Bellman equation that provides a guaranteed upper bound on the cumulant generating function, and the K-learning policy is thus
where we have (O’Donoghue, 2018)
(13) 
With that in mind we take our approximation to the joint posterior (11) to be
Now consider the KL-divergence between the true joint posterior and our approximate one. A quick calculation yields
(14) 
for timestep and state . Considering the terms on the right-hand side of (14) separately we have
where denotes the entropy, and using (12)
Now we sum these two terms, using (13) and the following identities
and
since , we obtain
The theorem follows from this and the fact that the K-learning value function is defined as
as well as the fact that
from equation (14). ∎
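The exponential tilt used in the proof can be illustrated numerically. For a discrete random variable with probabilities p(x), the tilted distribution is proportional to p(x) exp(beta x), and its log-normalizer is the cumulant generating function log E[exp(beta X)]. The sketch below is only an illustration: the support, the probabilities, and the value of `beta` are hypothetical, and the function names are not taken from the paper.

```python
import numpy as np

def cgf(values, probs, beta):
    """Cumulant generating function log E[exp(beta * X)] of a discrete X."""
    z = beta * np.asarray(values, dtype=float)
    m = z.max()  # subtract the max for numerical stability
    return m + np.log(np.sum(np.asarray(probs, dtype=float) * np.exp(z - m)))

def exponential_tilt(values, probs, beta):
    """Tilted probabilities p(x) * exp(beta * x - G(beta))."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return probs * np.exp(beta * values - cgf(values, probs, beta))

# Hypothetical discrete posterior over a value (illustrative numbers only).
values = np.array([-1.0, 0.0, 2.0])
probs = np.array([0.3, 0.5, 0.2])
tilted = exponential_tilt(values, probs, beta=1.0)
```

For positive `beta` the tilt shifts probability mass toward larger values, so the tilted mean is never below the original mean.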
5.3 Problem 1 K-learning details
For a bandit problem the K-learning policy is given by
which requires the cumulant generating function of the posterior over each arm. For arm and the distractor arms there is no uncertainty, in which case the cumulant generating function is given by
In the case of arm the cumulant generating function is
In (O’Donoghue, 2018) it was shown that the optimal choice of is given by
which requires solving a convex optimization problem in variable . In the case of problem 1 the optimal choice of , which yields . Then, once arm has been pulled once and the true reward of arm has been revealed, its cumulant generating function has the same form as the others, and then the optimal choice of is simply
at which point K-learning is greedy with respect to the optimal arm.
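The behaviour described above can be sketched on a small hypothetical bandit. The code below makes several assumptions that are not taken from this appendix: the policy is taken to be the Boltzmann form proportional to exp(G_a(beta)), where G_a is the cumulant generating function of the posterior for arm a; the parameter `beta` is chosen to minimize the upper bound (1/beta) log sum_a exp(G_a(beta)) on the expected maximum reward; and a simple grid search replaces a convex solver. The arm rewards are illustrative and are not the values used in Problem 1.

```python
import numpy as np

def cgf(values, probs, beta):
    """log E[exp(beta * X)] for a discrete posterior over an arm's reward."""
    z = beta * np.asarray(values, dtype=float)
    m = z.max()
    return m + np.log(np.sum(np.asarray(probs, dtype=float) * np.exp(z - m)))

def k_learning_bandit(posteriors, betas):
    """Grid search over beta; returns (bound, beta, policy) with the smallest bound.

    posteriors: list of (values, probs) pairs, one per arm (hypothetical format).
    """
    best = None
    for beta in betas:
        g = np.array([cgf(v, p, beta) for v, p in posteriors])
        m = g.max()
        bound = (m + np.log(np.exp(g - m).sum())) / beta
        if best is None or bound < best[0]:
            policy = np.exp(g - m)
            best = (bound, beta, policy / policy.sum())
    return best

betas = np.logspace(-2, 2, 200)
# Hypothetical arms: a known arm, an uncertain arm, and a known distractor.
uncertain = [([1.0], [1.0]), ([-2.0, 2.0], [0.5, 0.5]), ([0.9], [1.0])]
bound, beta, policy = k_learning_bandit(uncertain, betas)
# After the uncertain arm is revealed (here, as -2) every arm is deterministic.
revealed = [([1.0], [1.0]), ([-2.0], [1.0]), ([0.9], [1.0])]
bound2, beta2, policy2 = k_learning_bandit(revealed, betas)
```

In this sketch the uncertain arm receives most of the probability mass before it is pulled. Once all arms are deterministic the bound decreases in `beta`, so the search selects the largest value on the grid and the policy becomes nearly greedy.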
5.4 Implementation details for bsuite evaluation
All three algorithms use the same neural network architecture consisting of an MLP (multilayer perceptron) with a single hidden layer with hidden units. All three algorithms used a replay buffer of the most recent transitions to allow reuse of data. For all three the Adam optimizer (Kingma and Ba, 2014) was used with learning rate and batch size , and learning was performed at every timestep. For both K-learning and soft Q-learning the temperature was set at . For Bootstrap DQN we chose an ensemble of size , and used the randomized prior functions (Osband et al., 2018) with scale . For K-learning, in order to estimate the cumulant generating function of the reward, we used an ensemble of neural networks predicting the reward for each state and action and used these to calculate the empirical cumulant generating function over them. Each of these was a single hidden layer MLP with hidden units. Finally, we noted that training a small ensemble of K-networks performed better than training a single network, so we used an ensemble of size for this purpose, as well as randomized priors, to encourage diversity between the elements of the ensemble, with scale . The K-learning policy was the Boltzmann policy over all the ensemble K-values at each state.
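The empirical cumulant generating function over an ensemble can be sketched as follows. This is a minimal illustration, assuming each ensemble member outputs one reward prediction per action and that the empirical estimate is the log of the ensemble average of exp(beta * prediction). The function name, the array layout, and the numbers are hypothetical and do not reflect the actual implementation.

```python
import numpy as np

def empirical_cgf(predictions, beta):
    """Empirical CGF over an ensemble: log mean_i exp(beta * predictions[i]).

    predictions: array of shape (M, ...) with one slice per ensemble member.
    """
    z = beta * np.asarray(predictions, dtype=float)
    m = z.max(axis=0)  # subtract the max for numerical stability
    return m + np.log(np.mean(np.exp(z - m), axis=0))

# Hypothetical reward predictions from 3 ensemble members for 2 actions.
preds = np.array([[0.1, 1.0],
                  [0.3, 1.0],
                  [0.2, 1.0]])
g = empirical_cgf(preds, beta=2.0)
```

When the members agree (the second action here) the estimate reduces to `beta` times the shared prediction. When they disagree, Jensen's inequality makes the estimate at least `beta` times the ensemble mean, which is how ensemble disagreement raises the value assigned to uncertain actions.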
6 bsuite report: Making sense of RL and Inference
The Behaviour Suite for Reinforcement Learning, or bsuite for short, is a collection of carefully-designed experiments that investigate core capabilities of a reinforcement learning (RL) agent. The aim of the bsuite project is to collect clear, informative and scalable problems that capture key issues in the design of efficient and general learning algorithms and study agent behaviour through their performance on these shared benchmarks. This report provides a snapshot of agent performance on bsuite2019, obtained by running the experiments from github.com/deepmind/bsuite (Osband et al., 2019).
6.1 Agent definition
All agents were run with the same network architecture (a single-layer MLP with 50 hidden units and a ReLU activation), adapting DQN (Mnih et al., 2013). Full hyperparameters are given in Appendix 5.4.

soft_q: soft Q-learning with temperature (O’Donoghue et al., 2017).
6.2 Summary scores
Each bsuite experiment outputs a summary score in [0,1]. We aggregate these scores according to key experiment type, following the standard analysis notebook. A detailed analysis of each of these experiments may be found in a notebook hosted on Colaboratory: bit.ly/rlinferencebsuite.
6.3 Results commentary
Overall, we see that the algorithms K-learning and Bootstrapped DQN perform extremely similarly across bsuite evaluations. However, there is a clear signal that soft Q-learning performs markedly worse on the tasks requiring efficient exploration. This observation is consistent with the hypothesis that algorithms motivated by ‘RL as Inference’ fail to account for the value of exploratory actions.
Beyond this major difference in exploration score, we see that Bootstrapped DQN outperforms the other algorithms on problems varying ‘Scale’. This too is not surprising, since both soft Q-learning and K-learning rely on a temperature tuning that will be problem-scale dependent. Finally, we note that soft Q-learning also performs worse on some ‘basic’ tasks, notably ‘bandit’ and ‘mnist’. We believe that the relatively high temperature (tuned for best performance on Deep Sea) leads to poor performance on these tasks with larger action spaces, due to too many random actions.