## 1 Introduction

Many classification schemes exist for reinforcement learning (RL) algorithms. Algorithms can be classified as either model-based or model-free, depending on whether a model of the environment is utilised. Alternatively, RL algorithms can be classified as either policy-based or planning-based, on-policy or off-policy, and online or offline. These classification schemes help provide a unified perspective on RL, highlighting similarities and differences amongst approaches and aiding the development of novel algorithms.

In this work, we highlight a relatively uncharted classification scheme based on *iterative* and *amortised* inference.
Inspired by the control as inference (CAI) framework (dayan1997using; rawlik2010approximate; toussaint2006probabilistic; ziebart2010modeling; levine2018reinforcement; fellows2019virel), we cast the problem of reward maximization in terms of variational inference.
In this context, iterative inference approaches directly optimize the posterior distribution, while amortised methods learn a parameterised function (e.g., a policy, or amortised value function) which maps directly from states to the quantity of interest (such as actions or Q-values).

We demonstrate that this classification scheme provides a principled partitioning of a wide range of existing approaches to RL, including policy gradient methods, Q-learning, actor-critic methods, trajectory optimisation and stochastic planning, and that by doing so it provides a novel perspective which highlights algorithmic commonalities that may otherwise be overlooked. We find that existing implementations of iterative inference generally correspond to model-based planning, whereas implementations of amortised inference generally correspond to model-free policy optimisation. Importantly, this classification scheme highlights unexploited regions of algorithmic design space in iterative policies and amortised plans, and also in the combination of iterative and amortised methods. Exploring these new regions has the potential to inspire novel RL algorithms.

## 2 Control as Inference

We consider a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, where $a \in \mathcal{A}$ denotes actions and $s \in \mathcal{S}$ denotes states. State transitions are governed by $p(s_{t+1} \mid s_t, a_t)$, and the reward function is $r(s_t, a_t)$. $\gamma \in [0, 1]$ is a factor which discounts the sum of rewards $r(\tau) = \sum_{t=1}^{T} \gamma^{t} r(s_t, a_t)$, where $\tau$ denotes a trajectory $\tau = (s_1, a_1, \dots, s_T, a_T)$. RL aims to optimise a policy distribution $\pi(a_t \mid s_t)$. The probability of trajectories under this policy is given by

$$p_\pi(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t) \prod_{t=1}^{T-1} p(s_{t+1} \mid s_t, a_t).$$

In traditional RL, the objective is to maximise the expected sum of returns $\mathbb{E}_{p_\pi(\tau)}[r(\tau)]$ (throughout we assume $\gamma = 1$).

To reformulate this objective in terms of probabilistic inference, we construct a graphical model where the posterior distribution over actions $p(a_t \mid s_t, \mathcal{O}_{t:T})$ recovers the optimal policy. This requires the graphical model to incorporate some notion of reward, which is achieved by introducing an additional binary variable $\mathcal{O}_t \in \{0, 1\}$, which is referred to as an 'optimality' variable, as $\mathcal{O}_t = 1$ implies that time step $t$ was 'optimal'. Since we only ever desire optimality, we will drop the value from the notation: $p(\mathcal{O}_t) := p(\mathcal{O}_t = 1)$. The corresponding graphical model is shown in the Appendix.

The objective of CAI is to obtain the optimal posterior over a full trajectory, $p(\tau \mid \mathcal{O}_{1:T})$. There are many ways to approximate this desired posterior, including variational inference (wainwright2008graphical; beal2003variational), message passing algorithms (yedidia2011message; weiss2000correctness), and importance sampling (kappen2016adaptive; kutschireiter2020hitchhiker). Each of these approaches can be seen to correspond to a family of RL algorithms in the literature. There are also two approaches to optimising the posterior $p(\tau \mid \mathcal{O}_{1:T})$. One can either optimise it directly, in which case one infers a full sequence of actions, i.e. a *plan*, or one can infer a sequence of single-step action posteriors, which corresponds to sequentially inferring *policies*.

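One approach to approximating the true posterior, which underlies a large subset of common RL algorithms, is variational inference. Here, we introduce an approximate posterior $q(\tau)$ and use it to construct a variational bound on the true posterior. Let $p(\tau, \mathcal{O}_{1:T})$ denote an agent's generative model, which can be factorised as:

$$p(\tau, \mathcal{O}_{1:T}) = \prod_{t=1}^{T} p(\mathcal{O}_t \mid s_t, a_t)\, p(s_t \mid s_{t-1}, a_{t-1})\, p(a_t \mid s_t),$$

with the convention $p(s_1 \mid s_0, a_0) := p(s_1)$.

The likelihood of optimality is usually defined as $p(\mathcal{O}_t \mid s_t, a_t) \propto \exp\big(r(s_t, a_t)/\alpha\big)$, thereby maintaining consistency with traditional RL objectives. Here, $\alpha$ is a temperature parameter which scales the relative contribution of the reward and entropy terms. Traditional RL algorithms are recovered as $\alpha \to 0$. We additionally assume an uninformative action prior $p(a_t \mid s_t) = 1/|\mathcal{A}|$. Given these definitions, the variational bound $\mathcal{L}$ is defined as follows (see Appendix for a derivation and discussion of the assumptions required):

$$\mathcal{L} = \mathbb{E}_{q(\tau)}\big[\ln p(\tau, \mathcal{O}_{1:T}) - \ln q(\tau)\big] = \ln p(\mathcal{O}_{1:T}) - D_{\mathrm{KL}}\big(q(\tau) \,\|\, p(\tau \mid \mathcal{O}_{1:T})\big) \tag{1}$$

Therefore, maximising $\mathcal{L}$ provides a tractable method for minimising the divergence between the true and approximate posteriors. We can further simplify Eq. 1 by assuming $q(\tau) = \prod_{t=1}^{T} q(s_t \mid s_{t-1}, a_{t-1})\, q(a_t \mid s_t)$ and $q(s_t \mid s_{t-1}, a_{t-1}) = p(s_t \mid s_{t-1}, a_{t-1})$, which causes the variational and generative transition dynamics to cancel, allowing us to rewrite $\mathcal{L}$ (up to an additive constant arising from the uniform action prior) as:

$$\mathcal{L} = \mathbb{E}_{q(\tau)}\Big[\sum_{t=1}^{T} \ln p(\mathcal{O}_t \mid s_t, a_t) + \mathcal{H}\big[q(a_t \mid s_t)\big]\Big] \tag{2}$$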

where $\mathcal{H}[\cdot]$ is the Shannon entropy. Maximising $\mathcal{L}$ is thus equivalent to maximising both the expected log-likelihood of optimality and the entropy of $q(a_t \mid s_t)$. The inclusion of an entropy term over actions provides several benefits, such as providing a mechanism for offline learning (nachum2017bridging; levine2020offline), improving exploration, and increasing algorithmic stability and robustness. Empirically, algorithms derived from the control as inference framework often outperform their non-stochastic counterparts (haarnoja2018soft; hafner2018learning; hausman2018learning).
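As a quick numerical check of Eq. 2, the sketch below evaluates the bound for a single time step with three discrete actions. It assumes the optimality likelihood takes the form $\exp(r/\alpha)$; the reward values, the temperature `alpha`, and all function names are hypothetical choices made for illustration. Under these assumptions the bound is maximised by the softmax policy, and the maximum equals the log-partition function.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def entropy(q):
    # Shannon entropy, skipping zero-probability actions.
    return -sum(p * math.log(p) for p in q if p > 0.0)

def bound(q, rewards, alpha):
    """Single-step instance of Eq. 2: E_q[r / alpha] + H[q]."""
    expected_log_lik = sum(p * r / alpha for p, r in zip(q, rewards))
    return expected_log_lik + entropy(q)

rewards = [1.0, 0.5, -0.2]   # hypothetical rewards for three actions
alpha = 0.5                  # hypothetical temperature

q_star = softmax([r / alpha for r in rewards])   # entropy-regularised optimum
q_uniform = [1.0 / 3.0] * 3                      # maximum entropy, ignores reward
q_greedy = [1.0, 0.0, 0.0]                       # maximum reward, zero entropy
```

Comparing `bound(q_star, rewards, alpha)` with the values for `q_uniform` and `q_greedy` shows the trade-off the text describes: neither pure exploitation nor pure entropy maximisation attains the optimum.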

## 3 Iterative & Amortised Inference

In the wider literature on probabilistic inference, a key distinction is made between *iterative* and *amortised* approaches to inference.
Iterative methods directly optimise the parameters of the approximate posterior, a process which is carried out for each data-point. While this inference procedure could theoretically be single-step, in practice most algorithms are iterative, hence the name. Examples of this method include belief propagation (pearl2014probabilistic), variational message passing (winn2005variational), stochastic variational inference (hoffman2013stochastic), black box variational inference (ranganath2013black), and expectation-maximisation (dempster1977maximum).

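In contrast, amortised approaches to inference (marino2018iterative) learn a parameterised function $f_\phi(x)$ which maps directly from data $x$ to the parameters of the approximate posterior, $\theta = f_\phi(x)$. Amortised inference models are learned by optimising the parameters $\phi$ in order to maximise $\mathcal{L}$ over the available dataset $\mathcal{D}$. In practice, $f_\phi$ is often implemented as a neural network with weights $\phi$. Amortised inference forms the basis of variational autoencoders (kingma2013auto), one of the most popular tools for inference in machine learning. We use $q(\tau; \theta)$ to denote a variational posterior optimised through iterative inference with parameters $\theta$, and $q_\phi(\tau)$ to denote an *amortised* posterior with parameters (of the amortisation function) $\phi$. These two approaches optimise subtly different objectives, which we present below (for convenience, we showcase the distinction on the variational lower bound derived earlier):

$$
\begin{aligned}
\text{Iterative:} \quad & \theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{q(\tau;\theta)}\Big[\sum_{t=1}^{T} \ln p(\mathcal{O}_t \mid s_t, a_t) + \mathcal{H}\big[q(a_t \mid s_t; \theta)\big]\Big] \\
\text{Amortised:} \quad & \phi^{*} = \arg\max_{\phi}\; \mathbb{E}_{\mathcal{D}}\bigg[\mathbb{E}_{q_\phi(\tau)}\Big[\sum_{t=1}^{T} \ln p(\mathcal{O}_t \mid s_t, a_t) + \mathcal{H}\big[q_\phi(a_t \mid s_t)\big]\Big]\bigg]
\end{aligned}
\tag{3}
$$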
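The difference between the two objectives can be illustrated on a toy problem. The sketch below is a minimal example under stated assumptions, not the authors' implementation: the state is a scalar `x`, there are two actions with hypothetical log-likelihoods of optimality `x` and `-x`, the posterior is a softmax over logits, and the amortisation function is assumed to be linear, `f_phi(x) = phi * x`. Iterative inference optimises fresh logits for each state, while amortised inference trains shared weights over a dataset and then answers a new state with a single forward pass.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def log_lik(x):
    # Hypothetical two-action problem: action 0 earns x, action 1 earns -x.
    return [x, -x]

def bound_grad(logits, f):
    """Gradient of E_q[f] + H[q] w.r.t. the logits of a softmax posterior q."""
    q = softmax(logits)
    g = [fi - math.log(qi) for fi, qi in zip(f, q)]
    value = sum(qi * gi for qi, gi in zip(q, g))
    return [qi * (gi - value) for qi, gi in zip(q, g)]

def iterative_inference(x, steps=200, lr=0.5):
    """Optimise fresh posterior parameters theta for the single state x."""
    theta = [0.0, 0.0]
    for _ in range(steps):
        grad = bound_grad(theta, log_lik(x))
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)

def train_amortised(dataset, epochs=500, lr=0.1):
    """Learn shared weights phi of f_phi(x) = phi * x over the whole dataset."""
    phi = [0.0, 0.0]
    for _ in range(epochs):
        for x in dataset:
            grad = bound_grad([w * x for w in phi], log_lik(x))
            # Chain rule: d(logit_k) / d(phi_k) = x.
            phi = [w + lr * g * x for w, g in zip(phi, grad)]
    return phi

phi = train_amortised([-1.0, -0.5, 0.5, 1.0])
x_new = 0.8                                       # a state not in the dataset
q_exact = softmax(log_lik(x_new))                 # closed-form optimum
q_iterative = iterative_inference(x_new)          # many gradient steps for this state
q_amortised = softmax([w * x_new for w in phi])   # one forward pass
```

The linear amortisation function can represent the optimum exactly in this toy problem, so both routes agree; with a less expressive function the amortised answer would carry an approximation error that the iterative route does not.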

## 4 Classification Scheme

We propose a classification scheme which partitions RL algorithms along two orthogonal axes of variation – whether they optimize *plans* or *policies*, and whether they utilize *iterative* or *amortised* inference. Below, we classify a number of established RL algorithms in terms of our scheme.

### Policy Gradients

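By directly differentiating the variational lower bound, one can derive the policy-gradient class of algorithms (sutton2000policy; schulman2017equivalence). Typically, amortised methods are used and the optimisation is performed over the whole dataset. We can derive updates for $\phi$ by differentiating the amortised objective in Eq. 3 w.r.t. $\phi$:

$$
\begin{aligned}
\nabla_\phi \mathcal{L}(\phi) &= \nabla_\phi\, \mathbb{E}_{\mathcal{D}}\bigg[\mathbb{E}_{q_\phi(\tau)}\Big[\sum_{t=1}^{T} \ln p(\mathcal{O}_t \mid s_t, a_t) - \ln q_\phi(a_t \mid s_t)\Big]\bigg] \\
&= \mathbb{E}_{\mathcal{D}}\bigg[\mathbb{E}_{q_\phi(\tau)}\Big[\Big(\sum_{t=1}^{T} \nabla_\phi \ln q_\phi(a_t \mid s_t)\Big) \sum_{t'=1}^{T} \big(\ln p(\mathcal{O}_{t'} \mid s_{t'}, a_{t'}) - \ln q_\phi(a_{t'} \mid s_{t'})\big)\Big]\bigg]
\end{aligned}
\tag{4}
$$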

Equation 4 resembles the standard policy gradient objective, but with an additional entropy term over actions, which encourages exploration by rewarding entropic policies.
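A single-step version of this gradient can be checked numerically. The sketch below is illustrative only: it assumes a softmax policy over three actions with hypothetical logits and hypothetical values of $\ln p(\mathcal{O} \mid a)$, and compares a score-function (REINFORCE-style) Monte Carlo estimate, in which the usual return is augmented with the entropy bonus $-\ln q(a)$, against the closed-form gradient.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def exact_gradient(logits, f):
    """Closed-form gradient of E_q[f] + H[q] for a softmax policy q."""
    q = softmax(logits)
    g = [fi - math.log(qi) for fi, qi in zip(f, q)]
    value = sum(qi * gi for qi, gi in zip(q, g))
    return [qi * (gi - value) for qi, gi in zip(q, g)]

def score_function_gradient(logits, f, n_samples, rng):
    """Monte Carlo estimate: average of grad ln q(a) * (f(a) - ln q(a))."""
    q = softmax(logits)
    actions = rng.choices(range(len(q)), weights=q, k=n_samples)
    grad = [0.0] * len(q)
    for a in actions:
        signal = f[a] - math.log(q[a])   # log-likelihood of optimality plus entropy bonus
        for k in range(len(q)):
            score = (1.0 if k == a else 0.0) - q[k]   # d ln q(a) / d logit_k
            grad[k] += score * signal / n_samples
    return grad

logits = [0.2, -0.1, 0.0]   # hypothetical policy parameters
f = [1.0, 0.5, -0.2]        # hypothetical values of ln p(O | a), i.e. r / alpha
rng = random.Random(0)
estimate = score_function_gradient(logits, f, 100_000, rng)
exact = exact_gradient(logits, f)
```

Dropping the `- math.log(q[a])` term from `signal` recovers the standard policy-gradient estimator without the entropy bonus.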

### Q-Learning

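Instead of directly differentiating the variational lower bound, one can try to optimise the variational posterior directly through dynamic programming. The optimal single-step posterior is

$$q(a_t \mid s_t) = p(a_t \mid s_t, \mathcal{O}_{t:T}).$$

Now, using the fact that:

$$p(a_t \mid s_t, \mathcal{O}_{t:T}) = \frac{p(\mathcal{O}_{t:T} \mid s_t, a_t)\, p(a_t \mid s_t)}{p(\mathcal{O}_{t:T} \mid s_t)},$$

one can solve this problem recursively by passing backwards messages of the form $\beta(s_t, a_t) = p(\mathcal{O}_{t:T} \mid s_t, a_t)$ and $\beta(s_t) = p(\mathcal{O}_{t:T} \mid s_t)$. Intuitively, these messages correspond to the probability of acting optimally from the current state and action to the time horizon. They share a close mathematical and intuitive relationship with value functions and state-action value functions, or Q-functions. We define $Q(s_t, a_t) = \ln \beta(s_t, a_t)$ and $V(s_t) = \ln \beta(s_t)$. Armed with these definitions, we can write:

$$
\begin{aligned}
Q(s_t, a_t) &= \ln p(\mathcal{O}_t \mid s_t, a_t) + \ln \mathbb{E}_{p(s_{t+1} \mid s_t, a_t)}\big[\exp V(s_{t+1})\big] \\
V(s_t) &= \ln \mathbb{E}_{p(a_t \mid s_t)}\big[\exp Q(s_t, a_t)\big] \\
p(a_t \mid s_t, \mathcal{O}_{t:T}) &= p(a_t \mid s_t)\, \exp\big(Q(s_t, a_t) - V(s_t)\big)
\end{aligned}
$$

The $Q$ and $V$ functions can be computed recursively, which corresponds to an iterative message passing algorithm. In a tree-structured MDP, this algorithm corresponds exactly to belief propagation (yedidia2005constructing). Alternatively, we can amortise the computation over a dataset by learning a function $Q_\phi(s_t, a_t)$ which maps a state-action pair to a Q-value directly. This function can then be trained on a dataset $\mathcal{D}$ by bootstrapped gradient descent with learning rate $\eta$, where the expectation over next states is replaced by sampled transitions:

$$\phi \leftarrow \phi - \eta\, \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\Big[\nabla_\phi Q_\phi(s_t, a_t)\,\Big(Q_\phi(s_t, a_t) - \ln p(\mathcal{O}_t \mid s_t, a_t) - V_\phi(s_{t+1})\Big)\Big], \qquad V_\phi(s) = \ln \mathbb{E}_{p(a \mid s)}\big[\exp Q_\phi(s, a)\big].$$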

This update rule differs from the standard Q-learning update in two ways. First, the $Q$ and $V$ functions contain action-entropy terms. Secondly, we have a 'soft-max' instead of a 'hard-max' over the next-state value function. As $\alpha \to 0$, the effect of the entropy term and the soft-max will disappear and the standard Q-learning algorithm will be obtained. Moreover, CAI policy-gradients and Q-learning can be combined to yield the soft-actor-critic (SAC) (haarnoja2018soft; haarnoja2018applications), which is a simple, robust, and state-of-the-art model-free algorithm.
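The relationship between the soft and hard backups can be seen on a small tabular problem. The sketch below is a hedged illustration: it assumes a hypothetical deterministic MDP with two states and two actions (the tables `NEXT_STATE` and `REWARD` are invented for the example), an optimality likelihood of the form $\exp(r/\alpha)$, and a uniform action prior whose constant contribution is dropped. It runs the backward recursion in log space and compares the rescaled soft values with standard finite-horizon value iteration at a low temperature.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Hypothetical deterministic MDP with two states and two actions.
NEXT_STATE = [[0, 1], [0, 1]]        # NEXT_STATE[s][a]
REWARD = [[0.0, 1.0], [2.0, -1.0]]   # REWARD[s][a]

def soft_backward_pass(horizon, alpha):
    """Backward messages in log space: Q = ln beta(s, a), V = ln beta(s)."""
    v = [0.0, 0.0]   # beta = 1 beyond the horizon
    for _ in range(horizon):
        q = [[REWARD[s][a] / alpha + v[NEXT_STATE[s][a]] for a in range(2)]
             for s in range(2)]
        # The uniform action prior only adds a constant, which is dropped here.
        v = [logsumexp(q[s]) for s in range(2)]
    # First-step posterior policy p(a | s, O) = exp(Q - V).
    policy = [[math.exp(q[s][a] - v[s]) for a in range(2)] for s in range(2)]
    return v, policy

def hard_backward_pass(horizon):
    """Standard finite-horizon value iteration with a hard max."""
    v = [0.0, 0.0]
    for _ in range(horizon):
        q = [[REWARD[s][a] + v[NEXT_STATE[s][a]] for a in range(2)]
             for s in range(2)]
        v = [max(q[s]) for s in range(2)]
    return v

alpha = 0.01
v_hard = hard_backward_pass(horizon=3)
v_soft, policy = soft_backward_pass(horizon=3, alpha=alpha)
v_soft_rescaled = [alpha * x for x in v_soft]   # compare on the reward scale
```

Deterministic dynamics are used deliberately: with stochastic transitions the exact message recursion takes a log-expectation over next states, which does not reduce to the expected next-state value.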

### Trajectory Optimisation

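We can also consider directly inferring a *sequence* of actions $a_{t:T}$.
This can be achieved by maximising the variational objective in Eq. 2 using mirror descent (bubeck2014convex; okada2018acceleration; okada2019variational), leading to the following iterative update rule:

$$q^{(i+1)}(a_{t:T}) = \frac{q^{(i)}(a_{t:T})\, \mathcal{W}(a_{t:T})}{\mathbb{E}_{q^{(i)}(a_{t:T})}\big[\mathcal{W}(a_{t:T})\big]} \tag{5}$$

where $i$ denotes the current iteration and $\mathcal{W}(a_{t:T}) = \mathbb{E}_{p(s_{t+1:T} \mid s_t, a_{t:T})}\big[p(\mathcal{O}_{t:T} \mid \tau)\big]$. To infer $q(a_{t:T}; \theta)$, Eq. 5 is applied for each state, making this an iterative inference algorithm. Recent work has demonstrated that Eq. 5 generalises a number of stochastic optimisation methods used extensively in model-based planning (okada2019variational), including the cross-entropy method (CEM) (rubinstein2013cross) and model-predictive path-integral control (MPPI) (williams2017model). Given this generalisation, these methods differ only in their definition of the optimality likelihood $p(\mathcal{O}_{t:T} \mid \tau)$. For instance, CEM defines this quantity as $p(\mathcal{O}_{t:T} \mid \tau) \propto \mathbb{1}\big[r(\tau) > r_{\mathrm{thresh}}\big]$, where $r(\tau) = \sum_t r(s_t, a_t)$ is the trajectory return, $r_{\mathrm{thresh}}$ is an arbitrary threshold and $\mathbb{1}[\cdot]$ is the indicator function, while MPPI defines the optimality likelihood as $p(\mathcal{O}_{t:T} \mid \tau) \propto \exp\big(r(\tau)/\alpha\big)$. Control as inference thus provides a unified perspective on previously unrelated algorithms.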
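The claim that CEM and MPPI share one update and differ only in the weighting can be made concrete. The sketch below is a minimal illustration under stated assumptions, not a reference implementation of either method: the dynamics `s' = s + a`, the quadratic reward, the diagonal Gaussian posterior, and all hyperparameters are hypothetical. The planner samples action sequences, weights them with a pluggable optimality likelihood, and moment-matches the Gaussian to the reweighted samples.

```python
import math
import random

def rollout_return(s0, actions, goal=1.0):
    """Hypothetical known dynamics s' = s + a with reward -(s' - goal)^2."""
    s, total = s0, 0.0
    for a in actions:
        s = s + a
        total -= (s - goal) ** 2
    return total

def cem_weights(returns, elite_frac=0.1):
    """Indicator likelihood: weight 1 for returns at or above the elite threshold."""
    k = max(1, int(len(returns) * elite_frac))
    threshold = sorted(returns, reverse=True)[k - 1]
    return [1.0 if r >= threshold else 0.0 for r in returns]

def mppi_weights(returns, alpha=1.0):
    """Exponential likelihood exp(r / alpha), shifted by the max for stability."""
    m = max(returns)
    return [math.exp((r - m) / alpha) for r in returns]

def plan(s0, weight_fn, horizon=5, n_samples=300, n_iters=15, seed=0):
    """Iteratively refine a diagonal Gaussian over action sequences."""
    rng = random.Random(seed)
    mean, std = [0.0] * horizon, [1.0] * horizon
    for _ in range(n_iters):
        samples = [[rng.gauss(mean[t], std[t]) for t in range(horizon)]
                   for _ in range(n_samples)]
        weights = weight_fn([rollout_return(s0, a) for a in samples])
        z = sum(weights)
        # Moment-match the Gaussian to the reweighted samples.
        mean = [sum(w * a[t] for w, a in zip(weights, samples)) / z
                for t in range(horizon)]
        std = [
            max(1e-3, math.sqrt(sum(w * (a[t] - mean[t]) ** 2
                                    for w, a in zip(weights, samples)) / z))
            for t in range(horizon)
        ]
    return mean

plan_cem = plan(0.0, cem_weights)
plan_mppi = plan(0.0, mppi_weights)
```

Only `weight_fn` changes between the two planners; in a model-predictive control loop the first action of the returned plan would be executed and planning repeated from the next state.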

An alternative approach comes in the form of sequential Monte Carlo (SMC) methods, which provide an elegant approach to probabilistic planning. Here, we attempt to approximate the true posterior with a set of particles $\{\tau^{(n)}\}_{n=1}^{N}$ with weights $w^{(n)}$. Since the posterior is represented as a set of particles instead of a parametrised distribution, this approach is easily able to handle multimodal posteriors. We can derive the update laws for these particles as (see piche2018probabilistic for a full derivation):

$$
\begin{aligned}
w_t^{(n)} &\propto w_{t-1}^{(n)}\; \mathbb{E}_{p(s_{t+1} \mid s_t, a_t)}\Big[\exp A\big(s_t^{(n)}, a_t^{(n)}, s_{t+1}\big)\Big] \\
A(s_t, a_t, s_{t+1}) &= r(s_t, a_t) - \ln q(a_t \mid s_t) + V(s_{t+1}) - \ln \mathbb{E}_{p(s_t \mid s_{t-1}, a_{t-1})}\big[\exp V(s_t)\big]
\end{aligned}
\tag{6}
$$

where $q(a_t \mid s_t)$ is the proposal policy and $V$ is a learned value function. Our scheme views this as an iterative planning algorithm, and naturally suggests the idea of amortising the importance weights, with the variational lower bound over the dataset providing a principled loss function to optimise.
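The advantage of a particle representation for multimodal posteriors can be shown with a single importance-weighting and resampling step. The sketch below is deliberately simplified and rests on assumptions: it uses one time step, a hypothetical bimodal reward, a uniform proposal, and no learned value function, so it illustrates the particle idea rather than the full update of Eq. 6.

```python
import math
import random

def reward(a):
    # Hypothetical bimodal reward with two equally good optima at a = -1 and a = +1.
    return -min((a - 1.0) ** 2, (a + 1.0) ** 2)

def particle_posterior(n_particles=2000, alpha=0.05, seed=0):
    """One importance-weighting and resampling step over sampled actions."""
    rng = random.Random(seed)
    particles = [rng.uniform(-2.0, 2.0) for _ in range(n_particles)]   # proposal
    log_w = [reward(a) / alpha for a in particles]
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]   # unnormalised, numerically stable
    # Multinomial resampling returns equally weighted particles.
    return rng.choices(particles, weights=weights, k=n_particles)

particles = particle_posterior()
frac_positive = sum(a > 0 for a in particles) / len(particles)
mean_action = sum(particles) / len(particles)
mean_abs_action = sum(abs(a) for a in particles) / len(particles)
```

The resampled particles cluster around both optima, whereas a unimodal Gaussian fitted to them would be centred near `mean_action`, close to zero, which is a poor action under this reward.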

## 5 Identifying Novel Algorithms

By classifying RL algorithms in terms of amortised vs iterative inference and policy-based vs planning-based, it becomes evident that regions of the design space corresponding to *iterative policies* and *amortised plans* remain relatively unexplored. Next, we discuss the properties and potential implementation of these novel algorithm classes.

### Iterative Policies

The majority of model-free policy-based algorithms utilise amortised inference. However, it is possible to construct a policy-based algorithm that utilises iterative inference, i.e. one that optimises a specific policy or Q-function for each state. This could be achieved by applying policy gradients to simulated trajectories. While this is likely to be inefficient on its own, the two methods could be combined by initialising the iterative policy with the amortised policy, so that the iterative policy effectively fine-tunes the amortised policy with respect to the current state and can be used adaptively when the amortised estimate is known to be poor. Interestingly, since the iterative Q-values and policies would be specific to a single state and used for model-predictive control (MPC), they do not need to be globally accurate, allowing for more severe approximations than are possible with amortised models. iLQR methods (li2004iterative) sit within this quadrant. These methods iteratively infer a policy for each specific state by making a linear approximation to the dynamics and a quadratic approximation to the cost. It would be interesting to combine this intuition with RL by constructing locally approximate Q-values or policies.

### Amortised Plans

Planning algorithms generally utilise iterative inference for optimisation, such as by gradient or mirror descent on a variational bound (okada2020planet; srinivas2018universal). However, one could also construct amortised plans, which learn a global function mapping from states to sequences of actions. Policy gradients present a potential method for learning amortised plans, whereby an approximate posterior over action sequences is optimised using Eq. 4.

Amortising planning may substantially improve the computational efficiency of MPC algorithms, especially when adaptively combined with iterative planning, so that expensive online iterative planning is only used where absolutely necessary. Amortised plans correspond to the notion of fixed-action-patterns in the study of biological behaviour (lorenz2013foundations), and could enable agents to learn temporally extended 'macro-actions'.
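The adaptive combination described in this section amounts to a warm start. The sketch below is a toy illustration under assumptions: a two-action softmax posterior, hypothetical log-likelihoods `f` for the current state, and a hypothetical, slightly inaccurate amortised output `amortised_logits`. It counts how many iterative refinement steps are needed from a cold start compared with a start from the amortised estimate.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def bound_grad(logits, f):
    """Gradient of E_q[f] + H[q] w.r.t. the logits of a softmax posterior q."""
    q = softmax(logits)
    g = [fi - math.log(qi) for fi, qi in zip(f, q)]
    value = sum(qi * gi for qi, gi in zip(q, g))
    return [qi * (gi - value) for qi, gi in zip(q, g)]

def refine(theta, f, lr=0.5, tol=1e-6, max_steps=10_000):
    """Iterative inference from a given initialisation; returns (q, steps used)."""
    for step in range(max_steps):
        grad = bound_grad(theta, f)
        if max(abs(g) for g in grad) < tol:
            return softmax(theta), step
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta), max_steps

f = [0.8, -0.8]                  # hypothetical ln p(O | s, a) for the current state
amortised_logits = [0.6, -0.6]   # hypothetical, slightly inaccurate amortised output

q_cold, steps_cold = refine([0.0, 0.0], f)        # iterative inference from scratch
q_warm, steps_warm = refine(amortised_logits, f)  # amortised estimate as initialisation
```

Both runs reach the same posterior, but the warm start needs fewer steps; an adaptive scheme could skip refinement entirely whenever the initial gradient is already below `tol`.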

## 6 Conclusion

We have explored a novel classification scheme for RL algorithms, based on iterative and amortised variational inference within the mathematically principled control-as-inference framework. Our scheme informatively partitions a range of influential RL approaches, including policy gradients, Q-learning, actor-critic methods, and trajectory optimisation algorithms such as CEM, MPPI and SMC – highlighting relationships which would have otherwise remained obscure. We have shown how constructing algorithmic design-spaces based on fundamental distinctions can reveal unexplored design choices. Our work highlights the importance of identifying the common factors underlying disparate RL algorithms and the utility of building unifying conceptual frameworks through which to understand them.

Future work may explore still other classification schemes that can be derived from the perspective of control as inference. For instance, algorithms can be classified in terms of whether the approximate posterior is parametric or non-parametric (marinodesign), whether the action prior is learned or uniform (marinoinference), and whether variational, dynamic-programming, or importance-sampling inference methods are used. Ultimately, by fully quantifying and classifying the relevant axes of variation, we hope to develop a unified understanding of the design-space of RL algorithms, which would be instrumental in situating, clarifying, and inspiring future research.

Finally, our work illuminates the possibility of combining iterative and amortised inference (marino2018iterative). This approach has been explored in the context of unsupervised learning, where a hybrid approach to inference can help overcome the shortcomings of using either iterative or amortised inference alone (Tschantz et al 2020, in press). Given the correspondence between iterative & amortised inference and planning & policies, our scheme suggests a potential avenue towards combining the sample efficiency of model-based planning and the asymptotic performance of model-free policy optimisation in a mathematically principled manner.

## References

## Appendix A Bound derivations

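We wish to minimise the KL divergence between the approximate and true posteriors. Since this divergence is intractable (it contains the true posterior, which is itself intractable), we instead work with the divergence between the approximate posterior and the generative model, which is tractable. The two are related by

$$D_{\mathrm{KL}}\big(q(\tau) \,\|\, p(\tau \mid \mathcal{O}_{1:T})\big) = \mathbb{E}_{q(\tau)}\big[\ln q(\tau) - \ln p(\tau, \mathcal{O}_{1:T})\big] + \ln p(\mathcal{O}_{1:T}) = -\mathcal{L} + \ln p(\mathcal{O}_{1:T}).$$

Since $\ln p(\mathcal{O}_{1:T})$ does not depend on $q$ and the KL divergence is non-negative, $\mathcal{L}$ lower-bounds the log-evidence $\ln p(\mathcal{O}_{1:T})$. Thus, by maximising this lower bound, we bring the true and approximate posteriors closer together.

Now, if we substitute in the definitions of the approximate posterior, $q(\tau) = \prod_{t} q(s_t \mid s_{t-1}, a_{t-1})\, q(a_t \mid s_t)$, and the generative model from the main text, we obtain:

$$
\begin{aligned}
\mathcal{L} &= \mathbb{E}_{q(\tau)}\Big[\sum_{t=1}^{T} \ln p(\mathcal{O}_t \mid s_t, a_t) + \ln \frac{p(s_t \mid s_{t-1}, a_{t-1})}{q(s_t \mid s_{t-1}, a_{t-1})} + \ln \frac{p(a_t \mid s_t)}{q(a_t \mid s_t)}\Big] \\
&= \underbrace{\mathbb{E}_{q(\tau)}\Big[\sum_{t=1}^{T} \ln p(\mathcal{O}_t \mid s_t, a_t)\Big]}_{\text{expected reward}} - \underbrace{\sum_{t=1}^{T} \mathbb{E}_{q(\tau)}\Big[D_{\mathrm{KL}}\big(q(s_t \mid s_{t-1}, a_{t-1}) \,\|\, p(s_t \mid s_{t-1}, a_{t-1})\big)\Big]}_{\text{state complexity}} - \underbrace{\sum_{t=1}^{T} \mathbb{E}_{q(\tau)}\Big[D_{\mathrm{KL}}\big(q(a_t \mid s_t) \,\|\, p(a_t \mid s_t)\big)\Big]}_{\text{action complexity}}
\end{aligned}
$$

We see that the CAI objective breaks down into three separate terms. The first, the expected reward, quantifies the expected sum of rewards an agent is likely to obtain for a given trajectory. The state-complexity term penalises trajectories where the approximate and prior state trajectories differ, while the action-complexity term penalises the divergence between the agent's actions and some prior action distribution. If we assume, as is commonly done in the literature, that the approximate and generative dynamics are the same, $q(s_t \mid s_{t-1}, a_{t-1}) = p(s_t \mid s_{t-1}, a_{t-1})$, then the state-complexity term vanishes. This is a well-motivated assumption, since having separate approximate dynamics effectively means the agent thinks it has control over the environmental dynamics, and will thus tend towards risk-seeking policies if it does not actually have this degree of control over its environment.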

The CAI framework also often ignores the action prior $p(a_t \mid s_t)$. This does not necessarily lead to a lack of generality, since the action prior can always be subsumed into the reward. Nevertheless, it is often intuitively useful to think of utilising the action prior in some way. For instance, in many control tasks, action itself is costly. Consider the task of flying a rocket. Actions such as applying thrust deplete fuel, and thus have a cost associated with them which can be well-modelled with an action prior centred on zero (so that any non-zero action incurs a small penalty).

The action prior also provides a mathematically principled way to combine iterative and amortised inference. Suppose that we optimise the iterative bound for each datapoint, but we also have a trained amortised policy $q_\phi(a_t \mid s_t)$. We can then set the action prior to be the output of the amortised scheme and infer the iterative posterior using this prior.
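For a single time step with discrete actions, this combination has a closed form. The sketch below is a hedged illustration: `amortised_prior` stands in for the output of a trained amortised policy, and the rewards and temperature are hypothetical. With an optimality likelihood of the form $\exp(r/\alpha)$, the bound $\mathbb{E}_q[r/\alpha] - D_{\mathrm{KL}}(q \,\|\, p)$ is maximised by a posterior proportional to the prior times the likelihood, and the maximum equals the log-evidence.

```python
import math

def posterior_with_prior(prior, rewards, alpha):
    """Maximiser of E_q[r / alpha] - KL(q || prior): q(a) ~ prior(a) * exp(r(a) / alpha)."""
    log_post = [math.log(p) + r / alpha for p, r in zip(prior, rewards)]
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def bound_with_prior(q, prior, rewards, alpha):
    """Expected log-likelihood of optimality minus the action-complexity term."""
    expected = sum(qi * r / alpha for qi, r in zip(q, rewards))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0.0)
    return expected - kl

amortised_prior = [0.7, 0.2, 0.1]   # hypothetical output of a trained amortised policy
rewards = [0.0, 1.0, 0.5]           # hypothetical rewards for the current state
alpha = 1.0                         # hypothetical temperature

q_iterative = posterior_with_prior(amortised_prior, rewards, alpha)
log_evidence = math.log(sum(p * math.exp(r / alpha)
                            for p, r in zip(amortised_prior, rewards)))
```

The posterior shifts probability mass away from the amortised policy's preferred action towards the actions that the current state's rewards favour, while the KL term keeps it anchored to the amortised estimate.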

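If, as is commonly done, we ignore the action prior by assuming it is uniform, then the action-complexity term reduces to the action entropy (up to an additive constant), and we obtain for the bound:

$$\mathcal{L} = \mathbb{E}_{q(\tau)}\Big[\sum_{t=1}^{T} \ln p(\mathcal{O}_t \mid s_t, a_t) + \mathcal{H}\big[q(a_t \mid s_t)\big]\Big],$$

which corresponds exactly to the bound given in Equation 2 of the main text.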
