I Introduction
Controlling stochastic dynamic systems is a research activity that transcends all engineering disciplines. Particularly for systems that are high dimensional, nonlinear and continuous in the system state space, finding good controllers remains one of the most challenging problems faced by their respective research communities. Formally the goal is to find a policy, mapping system outputs to inputs, so that the closed-loop system is stable and, often, exhibits rich and complex behaviour or dynamic skills such as locomotion or manipulation [mordatch2012trajopt, toussaint2015lgp, posa2014trajopt]. A well-known paradigm to synthesize such policies is Optimal Control. The policy is determined such that, when applied to the system, it is expected to accumulate a minimal cost over a finite (or infinite) time horizon. By design of the cost it is possible to encode abstract behavioural and dynamic features into the policy.
Finding such optimal policies is an elegant and appealing theoretical concept but is met by significant difficulties in practice and explicit expressions for the optimal policy exist only for a handful of problem statements. In an effort to address these issues, some researchers have turned to probabilistic methods, attempting to rephrase the problem of deterministic optimization as one of probabilistic inference.
It is possible to view these endeavours as part of a larger movement referred to as probabilistic numerics [oates2019probnum, hennig2015probabilistic]. Algorithms for numerical problems, such as optimization, proceed iteratively and typically maintain a deterministic running estimate of the correct solution. With every iteration that estimate is improved based on new information. In contrast, probabilistic numerics pursues methods that, in place of such point estimates, update probability measures over the space of possible solutions. Note that the manifestation and interpretation of probability is here primarily epistemic and representative of missing information in a computation that is itself otherwise entirely deterministic.
When pursued in the context of (optimal) control, these attempts are usually referred to with the umbrella term Control as Inference (CaI) [levine2018reinforcement]. In part, they consolidate dualities and connections between control and estimation that have captivated researchers for decades and which have resurfaced in recent theoretical work contributed by the Reinforcement Learning (RL) community. From the practical point of view, they give rise to new theoretical paradigms that hopefully will bring forth original algorithms that can draw from the mature (approximate) inference machinery developed by the Machine Learning community.
IA Background
Since the conception of modern system theory, the history of control and estimation has been vividly intertwined, cultivating cautious aspirations to pursue dualities and equivalences early on. In the tradition of other summaries of this kind, reminiscing about the historical roots of optimal control and estimation, we honour the seminal work of Rudolf Kalman [kalman1960, ho1964bayesian] and the duality between the Linear Quadratic Regulator and the Kalman Filter. Unfortunately it proved difficult to generalise such duality beyond a linear setting, and not for a lack of trying [pearson1966duality, pavon1982duality, todorov2008general].
It is fairly safe to state that a serious system-theoretic interest in inferring the optimal policy traces back to the discovery of Linearly Solvable Optimal Control (LSOC) by Kappen [kappen2005linear], followed by that of Linearly Solvable Markov Decision Processes by Todorov [todorov2007linearly], later extended by [dvijotham2012linearly] and [lefebvre2020elsoc]. These frameworks refer to a peculiar subclass of stochastic optimal control where the optimal policy can be expressed as a path integral over future passive or uncontrolled trajectories, in the sense of Feynman and Kac [kac1951some, feynman2005space]. Put more explicitly, the optimal policy can be expressed in terms of a conditional expectation over the uncontrolled path distribution. Nowadays the framework is often referred to as Path Integral Control (PIC) and derived methods as PIC methods. Soon it was explored how these results could be leveraged to design actual policies by evaluating the integral [kappen2007path] or by solving an eigenvalue problem [todorov2009eigenfunction].
Following earlier work from about the same period as LSOC [toussaint2006probabilistic], in 2009 Toussaint [toussaint2009pros, toussaint2009robot] pioneered the idea to formulate a probabilistic graphical model whose most likely instance coincided approximately with the optimal one and used it as a theoretical basis to derive a gradient-based trajectory optimization algorithm based on the concept of message passing [bishop2006pattern]. Toussaint was therefore the first to exploit the similarity between the graphical models underlying the Bayesian estimation framework and optimal control. This approach is one instance of CaI, often denoted as Inference for Control or Inference to Control (I2C).
Another field that clearly showcases interest in both control and inference is Reinforcement Learning (RL). Nowadays the problem of RL is cast as a stochastic optimal control problem. The main ambition in RL is to learn the optimal policy simply by eliciting and observing system-environment interactions. As noted by Peters, the use of data as a means to learn or extract the optimal policy then relies on tricks rather than a factual built-in inference mechanism [janpetersyoutube]. Put differently, learning is possible only because the cost we try to minimize contains an expectation, but the policy itself is not necessarily expressed in terms of one (e.g. policy search). Until now LSOC is the only framework that expresses the policy itself as an expectation and that does not require any tricks to do so. In an attempt to address this principal shortcoming, the RL community (indirectly) proposed information-theoretic connections, regularizing stochastic optimal control problems with information-theoretic measures such as the relative and differential entropy. Peters [peters2010relative] explored regularizing the objective with a relative entropy term, inspired by the superior performance of natural over vanilla gradients in policy search methods [kakade2001natural, bagnell2003covariant]. The work on relative entropic RL culminated in the state-of-the-art trust region policy optimization method [schulman2015trust] and the closely related proximal policy optimization algorithm [schulman2017proximal].
Second, say that we try to infer a cost model from (human) demonstrations. The standard stochastic optimal control framework then makes for a poor model, as it cannot explain the (minor) suboptimalities that might be present in demonstrations. To address this issue, Ziebart [ziebart2008maximum] introduced a differential entropy term in the control objective and described the solution of the Entropy Regularized Linear Quadratic Gaussian Regulator. Interestingly enough, the variational solutions to these regularized problems are stochastic policies, or to be more precise, the optimal policy is given by a distribution rather than a deterministic function. It was quickly argued that these ideas might offer an answer to the problem of exploration in RL. This stimulated the development of Maximum Entropy or soft RL algorithms [levine2013guided, haarnoja2017reinforcement, haarnoja2018soft]. Meanwhile work was done attempting to unify the ideas of Toussaint with the relative entropy framework and LSOC [rawlik2013stochastic].
In parallel the work on LSOC and PIC also proceeded. In 2010, Theodorou devised the first RL policy search algorithm from the theory of LSOC [theodorou2010generalized, theodorou2010reinforcement]. The work of [stulp2012path] hinted at a deeper connection with stochastic optimization techniques, inspiring the first theoretical investigations on dualities between information theory and stochastic optimal control [theodorou2013information, theodorou2015nonlinear] and refuelling work on combining LSOC with the ideas from RL on entropy regularization [gomez2014policy, pan2015sample, kappen2016adaptive, thalmeier2020adaptive]. Finally, during the second part of the previous decade a series of model-based algorithms were proposed that make use of the fact that, in the context of LSOC, the policy can be expressed as a path integral to update the control based on sampled trajectories [williams2017model, williams2018information, rajamaki2016sampled, lefebvre2019path]. Common applications are trajectory optimization [summers2020lyceum, nagabandi2020deep], Model Predictive Control [williams2016agressive, kahn2021land, kahn2021badgr] and Reinforcement Learning [chebotar2017path]. More recently a generalised PIC problem was proposed that unified LSOC with the Relative Entropy framework [lefebvre2020elsoc]. Noteworthy is the underlying principle of Entropic Optimization and the first mention of the Entropic Optimal Control (EOC) framework. In [lefebvre2021deoc] this view was generalised, showing that for deterministic systems the EOC framework also gives rise to path integral expressions for the optimal policy, completing the unification between PIC and EOC.
Finally we mention a series of works that derive directly from the view on optimization pioneered by Toussaint [toussaint2009robot], here referred to as I2C. This idea was embraced in Reinforcement Learning to derive a series of RL approaches such as the work of [neumann2011variational, levine2013variational, levine2013guided] amongst others. The latest RL algorithm that is inspired by the inference for control paradigm is Maximum a posteriori Policy Optimization (MPO) [abdolmaleki2018maximum, lee2021beyond, liu2021motor]. We can also mention the work of Watson et al., who recently revived the pioneering work by Toussaint on I2C and message passing algorithms [imohiosen2020active, watson2021advtrajopt, watson2021stochastic] and attempted unifications with LSOC [watson2021control].
IB Contribution
By now the probabilistic inference view on optimization and trajectory optimization for motion planning in particular has been well accepted by the robotics community. Consider for example the papers by [kalakrishnan2011stomp, dong2016motion, mukadam2018continuous] and [mukadam2019steap] where Gaussian processes were used as a probabilistic motion prior.
We mention these papers not because they are essential to the work presented here but because they illustrate why probabilistic inference approaches to control and planning should be recognized as a promising research direction. They all aim to exploit the probabilistic view for its computational ease and flexibility. Despite these successes, one must recognize that the probabilistic model that is used to motivate theoretical derivations is only a proxy for the optimal control problem that truly motivates the work.
In this work we therefore explore the relation between the governing principles associated with CaI and the actual optimal control problems that lie underneath. To summarize, we first cast stochastic and risk-sensitive optimal control probabilistically. Second, we show that either problem can be decomposed and solved using principles from variational inference, which allows us to iterate for the deterministic solution of the associated optimal control problems. In conclusion we discuss in depth several directions to derive practical algorithms.
II Preliminaries
In this section we introduce the two optimal control problems (OCPs) that enjoy our interest. Second we introduce the probabilistic machinery that we will use to tackle these problems in section III. The presentation of the problems is in line with the principal motivation of this work, which is to lay bare the machinery underlying CaI.
IIA Optimal Control
The mathematical stage where the events of this work will unfold is set by controlled Markov models. To specify the associated probabilistic state-space model we characterise the following set of conditional probability distributions describing the probabilistic dynamics of the system: an initial state distribution $p(x_0)$, a transition probability $p(x_{t+1}\mid x_t,u_t)$ and a policy distribution $\pi_t(u_t\mid x_t)$. State and controls are defined as $x_t\in\mathbb{R}^{n_x}$ and $u_t\in\mathbb{R}^{n_u}$. The policy distributions determine the probability of applying $u_t$ given $x_t$ at some time $t$. Throughout we assume that the state can be measured exactly. For notational convenience we further define the variable tuple $\xi_t=(x_t,u_t)$ and the notation formats $\underline{\xi}_t$ and $\overline{\xi}_t$, which we use to refer concisely to a leading or trailing part of a sequence, with $t$ referring to the end or start of the corresponding subsequence respectively. We silently assume that any complete sequence starts at time $0$ and ends at time $T$. The distribution of some trajectory $\underline{\xi}_T$ conditioned on a sequence of policy distributions $\underline{\pi}_T$ is then defined below.
$$p(\underline{\xi}_T;\underline{\pi}_T)=p(x_0)\prod_{t=0}^{T-1}p(x_{t+1}\mid x_t,u_t)\,\pi_t(u_t\mid x_t)$$
IIA1 Stochastic Optimal Control
The first problem considered in this work is Stochastic Optimal Control (SOC), where we aim to render some cost function extreme by administering a sequence of carefully selected control distributions $\underline{\pi}_T\in\mathcal{P}$. Throughout we use the notation $\mathcal{P}$ to refer to any probability distribution space; which one is implied by the context. This mathematical problem formalises the setting for many methodologies tailored to control applications.
$$\min_{\underline{\pi}_T\in\mathcal{P}}\ \mathbb{E}_{p(\underline{\xi}_T;\underline{\pi}_T)}\left[C(\underline{\xi}_T)\right] \tag{1}$$
where
$$C(\underline{\xi}_T)=c_T(x_T)+\sum_{t=0}^{T-1}c_t(x_t,u_t)$$
The cost function $C$ denotes an accumulated cost or cost-to-go. The functions $c_t$ and $c_T$ represent the cost rate and final cost and can be used to encode abstract behavioural and dynamical features into the control problem.
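As a rough illustration of what the objective in (1) measures, the sketch below estimates the expected accumulated cost of a policy distribution by Monte Carlo rollouts. The scalar linear-Gaussian dynamics, the quadratic cost terms and the Gaussian feedback policy are all assumptions chosen for the example; none of them come from the text.

```python
import numpy as np

def rollout_cost(rng, policy_gain, policy_std, T=20):
    """Sample one trajectory of a hypothetical scalar system and return its accumulated cost."""
    x = rng.normal(1.0, 0.1)                           # x_0 ~ p(x_0)
    cost = 0.0
    for _ in range(T):
        u = rng.normal(-policy_gain * x, policy_std)   # u_t ~ pi_t(u_t | x_t)
        cost += x**2 + 0.1 * u**2                      # cost rate c_t(x_t, u_t)
        x = rng.normal(0.9 * x + 0.5 * u, 0.05)        # x_{t+1} ~ p(x_{t+1} | x_t, u_t)
    return cost + x**2                                 # final cost c_T(x_T)

def expected_cost(policy_gain, policy_std, n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected accumulated cost under the given policy."""
    rng = np.random.default_rng(seed)
    samples = [rollout_cost(rng, policy_gain, policy_std) for _ in range(n_samples)]
    return float(np.mean(samples))

J_passive = expected_cost(policy_gain=0.0, policy_std=0.0)   # no control at all
J_feedback = expected_cost(policy_gain=1.0, policy_std=0.0)  # deterministic feedback
J_noisy = expected_cost(policy_gain=1.0, policy_std=0.5)     # same feedback, randomised
```

In this toy setting the randomised policy accumulates more cost than its deterministic counterpart, which is in line with the intuition that the objective is not improved by gambling on the control.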
IIA2 Risk Sensitive Optimal Control
Following the observation that for affine system models the solutions to the deterministic and stochastic versions of this problem are equivalent, a risk-sensitive version has been proposed where, instead of minimizing the cumulative performance criterion $C$, the exponential of that objective, $\exp(-\sigma C)$, is maximized [jacobson1973optimal]. Thinking of the trajectory distribution as a collection of alternative histories, using such an exponential utility function puts less (or more) emphasis on the successful histories. We refer to the work of Whittle [whittle1981risk, whittle1996optimal, whittle2002risk] for an excellent exposition on risk-sensitive optimal control and the use of utility functions in decision making in general. This generalised problem is referred to as Risk Sensitive Optimal Control (RSOC). The solution is referred to as risk seeking or risk averse depending on the sign of $\sigma$.
$$\min_{\underline{\pi}_T\in\mathcal{P}}\ -\frac{1}{\sigma}\log\mathbb{E}_{p(\underline{\xi}_T;\underline{\pi}_T)}\left[\exp\left(-\sigma C(\underline{\xi}_T)\right)\right] \tag{2}$$
In this work we focus on the risk averse version but note that the risk seeking version can generally be treated in the same way as will be presented here. More important to the developments to come is the observation that in the limit $\sigma\to 0$, the RSOC problem collapses onto the SOC problem. For conciseness we further absorb $\sigma$ into $C$, which can be achieved by appropriate scaling. In conclusion we note that the logarithm is a monotonically increasing function, so that minimization of (2) is equivalent to maximization of the expectation without the negative log transform.
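The limit behaviour mentioned above can be checked numerically. The following sketch assumes the risk-sensitive objective takes the form $-\frac{1}{\sigma}\log\mathbb{E}[\exp(-\sigma C)]$ and evaluates it on hypothetical Gaussian cost samples, for which the closed-form value is $\mathbb{E}[C]-\frac{\sigma}{2}\mathrm{Var}[C]$. The sample distribution and the function name are illustrative assumptions.

```python
import numpy as np

def risk_sensitive_cost(costs, sigma):
    """Sample estimate of -(1/sigma) * log E[exp(-sigma * C)] (assumed form of the objective)."""
    z = -sigma * np.asarray(costs)
    z_max = np.max(z)
    log_mean_exp = z_max + np.log(np.mean(np.exp(z - z_max)))   # numerically stable
    return float(-log_mean_exp / sigma)

rng = np.random.default_rng(0)
costs = rng.normal(loc=5.0, scale=2.0, size=200_000)   # hypothetical accumulated costs

mean_cost = float(np.mean(costs))
near_zero = risk_sensitive_cost(costs, 1e-4)     # sigma -> 0 recovers the expected cost
optimistic = risk_sensitive_cost(costs, 0.2)     # roughly mean - 0.5 * 0.2 * variance
pessimistic = risk_sensitive_cost(costs, -0.2)   # roughly mean + 0.5 * 0.2 * variance
```

Under this assumed form, a positive $\sigma$ rewards variance in the cost whereas a negative $\sigma$ penalises it, and both effects vanish as $\sigma$ approaches zero.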
IIA3 Some remarks
Although we have introduced an expression for the trajectory distribution using some arbitrary control or policy distributions $\pi_t(u_t\mid x_t)$, we emphasize that neither problem is in fact rendered extreme by a control distribution sequence. In either case the solution is given by a sequence of optimal deterministic policy functions, $u_t=\pi_t^*(x_t)$. This property can be proven rigorously but also follows from the intuition that the objective function cannot be rendered extreme by gambling on the outcome of a control $u_t$ given a state $x_t$. We further note that both problems can nevertheless be treated probabilistically, in the sense that we substitute a policy distribution, with the optimal solution collapsing onto a Dirac delta distribution. More formally, the set of all deterministic policies is contained in the set of all policy distributions, so that we only extend the optimization space. This representation will prove useful to our purpose later on.
Second, both problems provide us with the same solution when the underlying dynamics are also deterministic. One easily verifies that the way of thinking of the trajectory distribution as a collection of alternative histories then collapses onto a single history, rendering the influence of the utility function irrelevant. In that same line of thought it follows that the solution of a deterministic optimal control problem is uniquely defined by the optimal control sequence $\{u_t^*\}$ and associated optimal state sequence $\{x_t^*\}$. As such the optimal trajectory is defined as an instantiation of the underlying optimal policy given the initial state, so that $u_t^*=\pi_t^*(x_t^*)$. For stochastic systems the solution cannot be a unique trajectory but must be a policy, given that the state transitions differently for every history, so that we should adapt our control progressively.
In conclusion we note that both problems can be treated analytically using the principle of Dynamic Programming, provided that the problem exhibits a so-called optimal substructure, which allows us to decompose it into a number of smaller optimization subproblems that depend on each other backward recursively. As such it is possible to construct backward recursive equations that govern the optimal policy. For the purpose of conciseness we have not included these governing equations here. They are included in Appendix A. We do note that this backward recursive sequence of optimization subproblems cannot be solved for general OCPs. Only a small subset of OCP formulations can be treated in this way. For (discrete-time) continuous state spaces the problem cannot be solved analytically except for the Linear-Quadratic Regulator (LQR). Again we refer to Appendix A. In order to treat nonlinear optimal control problems, a wide variety of iterative numerical procedures has been developed, better known as trajectory optimization algorithms. Amongst them are the iterative LQR (iLQR) [todorov2005ilqr], Differential Dynamic Programming (DDP) [mayne1966ddp, tassa2014control, howell2019ddp, theodorou2010stochastic], Direct Multiple Shooting (DMS) [bock1984multiple, diehl2006fast] and hybrids between DDP and DMS [giftthaler2018family, mastalli2020crocoddyl, mastalli2020direct]. At the heart of all of these algorithms lies the LQR solution. The dynamics are linearised (with the exception of the DDP algorithm, where second-order derivatives of the dynamics are also used) and a quadratic approximation of the cost is constructed about some reference trajectory. The iterative LQR policy is then calculated about the reference and a closed-loop simulation of the dynamics is performed to determine the next iterate trajectory.
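Since the LQR solution is the building block referred to above, a minimal sketch of the finite-horizon, discrete-time backward Riccati recursion may be helpful. The double-integrator model, the weights and the function name `lqr_backward` are assumptions made for illustration; the governing equations of the paper are in its Appendix A and are not reproduced here.

```python
import numpy as np

def lqr_backward(A, B, Q, R, Q_T, T):
    """Finite-horizon discrete-time LQR by backward recursion.

    Returns the feedback gains K_0..K_{T-1} (with u_t = -K_t x_t) and the
    initial cost-to-go matrix P_0.
    """
    P = Q_T
    gains = []
    for _ in range(T):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain at this stage
        P = Q + A.T @ P @ (A - B @ K)                        # Riccati update
        gains.append(K)
    return gains[::-1], P                                    # reorder gains forward in time

# Hypothetical double integrator with sampling time 0.1
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[0.1]])
gains, P0 = lqr_backward(A, B, Q, R, Q_T=np.eye(2), T=50)

# Closed-loop simulation of the noise-free dynamics
x = np.array([1.0, 0.0])
for K in gains:
    x = A @ x - B @ (K @ x)
```

An iterative scheme such as iLQR would repeat this recursion on a local linear-quadratic approximation about the current reference trajectory and then re-simulate the nonlinear dynamics.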
IIB A Primer on Probabilistic Inference
Probability theory offers a mathematical setting to model systems where the uncertainties of the system are taken into account [sarkka2013bayesian]. Probabilistic inference then refers to the process of reasoning with incomplete information according to rational principles [jaynes2003]. Thus whereas the theory of mathematical logic formalises reasoning with absolute truths, in this view probability theory formalises the practice of common sense. In this setting probability distributions can be used to quantify both aleatoric and epistemic uncertainty. In this section we review two principal inference techniques, known as Bayesian and Entropic Inference, that distinguish themselves in that they allow us to make precise revisions of our uncertainty in light of new evidence [bishop2006pattern]. More formally, prior distributions, modelling our previous state of belief, are updated into posterior distributions, modelling our new state of belief. (As we distance ourselves from the classic or frequentist interpretation of probability, we may refer to our state of uncertainty as belief and to probability distributions as belief functions.) In a third paragraph we review the main ideas behind Variational Inference, which will serve as our principal inference engine.
IIB1 Bayesian Inference
Generally speaking Bayesian Inference (BI) may refer to the practice of computing conditional probabilities according to the rules of probability. As such Bayes' rule is used to convert a prior probability, $p(Z)$, capturing our assumptions about the hidden random variable $Z$, into a posterior probability, $p(Z\mid X)$, by incorporating the evidence provided by the observations $X$.
$$p(Z\mid X)=\frac{p(X\mid Z)\,p(Z)}{p(X)}\propto p(X\mid Z)\,p(Z)$$
The first term in the rightmost expression is called the likelihood function, expressing the probability of $X$ for different values of $Z$.
A simplifying and therefore key feature is the existence of conditional independence, which often makes it possible to make inferences in arbitrarily complex probabilistic models. Applications in estimation are filtering and smoothing. We refer to [sarkka2013bayesian] for excellent coverage of the material. A summary of useful results is also given in Appendix B.
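A small discrete example may make the update concrete. The sketch below applies Bayes' rule sequentially to decide between two hypothetical coins; the prior, the likelihood values and the observation sequence are made up for illustration.

```python
import numpy as np

prior = np.array([0.5, 0.5])          # p(Z): Z = 0 is a fair coin, Z = 1 a biased coin
heads_prob = np.array([0.5, 0.8])     # p(heads | Z) for either hypothesis
observations = [1, 1, 0, 1, 1, 1]     # 1 = heads, 0 = tails

belief = prior.copy()
for obs in observations:
    likelihood = heads_prob if obs == 1 else 1.0 - heads_prob   # p(X | Z)
    belief = likelihood * belief                                 # numerator of Bayes' rule
    belief /= belief.sum()                                       # divide by the evidence p(X)
```

Because the observations are conditionally independent given $Z$, processing them one at a time gives the same posterior as a single batch update.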
IIB2 Entropic Inference
As was described in the previous paragraph, the BI procedure allows us to process information that is represented by data and in particular by the outcome of experiments. We can arguably refer to such information as empirical evidence.
Entropic Inference (EI) [caticha2011entropic, caticha2013entropic] is an information-theoretic concept that allows us to treat information represented by constraints that affect our hypothesis space, ergo our belief space. Correspondingly we can refer to such information as structural evidence. The principle that facilitates EI is that of minimum relative entropy or discrimination information [kullback1951information, kullback1997information], which in turn is an extension of Laplace's or Jaynes's principle of insufficient reason [jaynes1986background]. In words it states that the unique posterior, $\pi$, living in the constrained distribution space, $\mathcal{P}^*$, describing events $z$ in some space $\mathcal{Z}$, is the one that is hardest to discriminate from the prior, $\rho$. The measure used here to discriminate between two distributions is the relative entropy.
$$\mathbb{D}[\pi\,\|\,\rho]=\int_{\mathcal{Z}}\pi(z)\log\frac{\pi(z)}{\rho(z)}\,\mathrm{d}z$$
The posterior can then be determined by minimizing the relative entropy whilst constraining the search space to $\mathcal{P}^*$. In information theory this operation is sometimes referred to as the information or I-projection of the prior onto the constrained distribution space [thomas2006elements, murphy2012machine, nielsen2018information].
The I-projection of the prior $\rho$ onto the constrained distribution space $\mathcal{P}^*$ is defined by the following optimization problem
$$\pi^*=\arg\min_{\pi\in\mathcal{P}^*}\ \mathbb{D}[\pi\,\|\,\rho] \tag{3}$$
The relative entropy therefore provides a measure of the inefficiency of assuming that the distribution is $\rho$ when the true distribution is $\pi$. (A more elaborate way of viewing this is that the relative entropy quantifies the average additional amount of information required to decode the message $z$ using some optimal decoding scheme, assuming $z$ is distributed according to $\rho$ whilst it is instead distributed according to $\pi$ [thomas2006elements].) The relative entropy is asymmetric in its arguments, non-negative, and zero if and only if $\pi=\rho$. This last property implies that if $\rho\in\mathcal{P}^*$, the posterior is also given by $\rho$. When the prior is a uniform distribution, modelling a lack of a priori information, the concept collapses onto the Maximum Entropy principle [jaynes1982rationale].
A common constraint is the following
$$\mathbb{E}_{\pi}[f(z)]=\mu$$
and then the corresponding posterior, $\pi^*(z)\propto\rho(z)\exp(-\lambda f(z))$, is known as the Maximum-Entropy distribution. Note that the posterior is here proportional to the prior multiplied with a likelihood, similar to the Bayesian update. The parameter $\lambda$ is a Lagrangian multiplier whose value is such that the constraint is satisfied.
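To illustrate how the Lagrangian multiplier is determined in practice, the sketch below computes the I-projection of a discrete prior onto an expectation constraint by bisection. It assumes the exponential-tilting form $\pi^*\propto\rho\exp(-\lambda f)$; the die example, the bracketing interval and the function name `entropic_update` are illustrative choices.

```python
import numpy as np

def entropic_update(prior, f, target, lam_lo=-50.0, lam_hi=50.0, iters=100):
    """I-projection of `prior` onto {pi : E_pi[f] = target}; returns (posterior, lam)."""
    def tilt(lam):
        w = prior * np.exp(-lam * (f - f.min()))   # shifting f does not change the result
        return w / w.sum()
    for _ in range(iters):                          # E_pi[f] decreases monotonically in lam
        lam = 0.5 * (lam_lo + lam_hi)
        if tilt(lam) @ f > target:
            lam_lo = lam                            # expectation still too large
        else:
            lam_hi = lam
    return tilt(lam), lam

z = np.arange(1, 7).astype(float)     # faces of a die
prior = np.full(6, 1.0 / 6.0)         # uniform prior: the Maximum Entropy setting
posterior, lam = entropic_update(prior, f=z, target=4.5)
```

With a uniform prior and a target mean above 3.5 the multiplier comes out negative under this sign convention, and the posterior puts increasing mass on the higher faces.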
Further it can be shown that a consistent inference procedure follows only from the I-projection, not by flipping the arguments in the objective [caticha2011entropic, caticha2013entropic, caticha2015belief]. The reciprocal projection is known as the Moment or M-projection of the prior $\rho$ onto the distribution space $\mathcal{P}^*$ [murphy2012machine, nielsen2018information].
The reciprocal M-projection of $\rho$ onto the constrained distribution space $\mathcal{P}^*$ is defined by the following optimization problem
$$\pi^*=\arg\min_{\pi\in\mathcal{P}^*}\ \mathbb{D}[\rho\,\|\,\pi] \tag{4}$$
Since the relative entropy is asymmetric in its arguments, the I-projection and the M-projection exhibit different behaviour. For the I-projection, $\pi$ typically underestimates the support of $\rho$ and locks onto its principal modes, because $\pi$ should equal zero when $\rho$ does to ensure the relative entropy stays finite. For the M-projection, $\pi$ typically overestimates the support of $\rho$. That is because $\pi>0$ is required whenever $\rho>0$.
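The difference between the two projections can be seen numerically. The sketch below projects a hypothetical bimodal density onto the family of single Gaussians using a brute-force grid search over the mean and standard deviation; the mixture, the grids and the helper names are assumptions made for the example.

```python
import numpy as np

z = np.linspace(-10.0, 10.0, 2001)
dz = z[1] - z[0]

def gauss(mu, std):
    return np.exp(-0.5 * ((z - mu) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl(a, b):
    """Relative entropy D[a || b] on the grid; points where a vanishes contribute nothing."""
    mask = a > 1e-300
    log_ratio = np.log(a[mask]) - np.log(np.maximum(b[mask], 1e-300))
    return float(np.sum(a[mask] * log_ratio) * dz)

rho = 0.5 * gauss(-3.0, 0.7) + 0.5 * gauss(3.0, 0.7)    # bimodal density to be projected

candidates = [(mu, std)
              for mu in np.linspace(-4.0, 4.0, 41)
              for std in np.linspace(0.3, 4.0, 38)]
i_proj = min(candidates, key=lambda c: kl(gauss(*c), rho))   # minimises D[pi || rho]
m_proj = min(candidates, key=lambda c: kl(rho, gauss(*c)))   # minimises D[rho || pi]
```

The I-projection settles on one of the two modes with a narrow spread, whereas the M-projection matches the overall mean and variance and therefore covers both modes.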
IIB3 Variational Inference
In this final paragraph we cover the main ideas from Variational Inference (VI). In essence VI is a problem decomposition technique for finding Maximum Likelihood Estimates (MLE) of probabilistic models with latent variables [bishop2006pattern, murphy2012machine]. So as opposed to the inference techniques discussed so far VI is used to find a point estimate of probabilistic model parameters not so much a posterior belief.
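Consider a probabilistic model with latent variables $Z$ and observations $X$. Further suppose that the joint distribution $p(X,Z;\theta)$ is characterised by a set of parameters $\theta$. The goal of MLE is then to identify the probabilistic model that explains the given observations best by identifying probability distributions $p(X\mid Z;\theta)$ and $p(Z;\theta)$.
$$\max_{\theta}\ \log p(X;\theta) \tag{5}$$
where
$$\log p(X;\theta)=\log\int p(X,Z;\theta)\,\mathrm{d}Z \tag{6}$$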
It is well known that this objective is hard to optimize on account of the integral expression, given that $Z$ is unobserved and the distribution $p(Z\mid X;\theta)$ is unknown before attaining a value for the parameter $\theta$. To circumvent the intractable inference, an inference distribution $q(Z)$ is introduced. The inference distribution allows us to decompose $\log p(X;\theta)$ into a surrogate objective, often referred to as the evidence lower bound (ELBO), and a relative entropy error term.
$$\log p(X;\theta)=\underbrace{\mathbb{E}_{q(Z)}\left[\log\frac{p(X,Z;\theta)}{q(Z)}\right]}_{\text{ELBO}}+\underbrace{\mathbb{D}\left[q(Z)\,\|\,p(Z\mid X;\theta)\right]}_{\text{error}}$$
Note that since the relative entropy term is non-negative, the ELBO bounds $\log p(X;\theta)$ from below.
This problem decomposition now allows us to tackle the original optimization problem in two consecutive and iterated steps. This two-step procedure is, amongst others, at the basis of the Expectation-Maximization (EM) algorithm. First the error term is minimized to find the optimal inference distribution $q$, fixing the parameters $\theta$; second the surrogate objective is maximized to find the optimal parameter values, fixing the inference distribution; and so on. Suppose then that the current value of the parameter vector is $\theta_k$. It is easily seen that the E-step is solved by the posterior distribution $q(Z)=p(Z\mid X;\theta_k)$. Then in the M-step the inference distribution is fixed and the ELBO is maximized with respect to $\theta$, rendering some new value $\theta_{k+1}$. By definition, choosing $\theta_{k+1}$ to maximize the ELBO improves $\log p(X;\theta)$ at least as much as the ELBO improved. For many practical applications $p(Z\mid X;\theta_k)$ cannot be evaluated explicitly, so that the expectation is approximated using e.g. Monte Carlo sampling. The E-step then boils down to sampling the posterior so that the M-step can be evaluated. From a technical perspective the inference distribution is chosen so that it generates samples near the observed data $X$. In the M-step we then optimize the ELBO objective, which is now better behaved as a function of $\theta$.
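E-step: Minimize the error w.r.t. $q$ fixing $\theta=\theta_k$.
$$q_{k+1}=\arg\min_{q}\ \mathbb{D}\left[q(Z)\,\|\,p(Z\mid X;\theta_k)\right]$$
M-step: Maximize the ELBO w.r.t. $\theta$ fixing $q=q_{k+1}$.
$$\theta_{k+1}=\arg\max_{\theta}\ \mathbb{E}_{q_{k+1}(Z)}\left[\log p(X,Z;\theta)\right]$$
Note that the E-step can also be understood as maximizing the ELBO for fixed $\theta$, given the decomposition of the log-likelihood into the ELBO and the error term. As such the EM algorithm can also be viewed as an example of coordinate ascent. Furthermore we can recognize the E-step as performing an I-projection of $p(Z\mid X;\theta)$ onto the distribution space spanned by $q$, and the M-step as performing an M-projection of $q$ onto the space spanned by $p(X,Z;\theta)$.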
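The two steps can be sketched on a small example. The code below runs EM on a hypothetical two-component, unit-variance Gaussian mixture, where the E-step posterior (the responsibilities) and the M-step maximiser are both available in closed form. The data, the initial parameters and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def log_joint(X, w, mu):
    """log p(X_n, Z_n = k; theta) for unit-variance components, shape (N, 2)."""
    return np.log(w) - 0.5 * (X[:, None] - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_likelihood(X, w, mu):
    lj = log_joint(X, w, mu)
    m = lj.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))))

w, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])    # initial theta
history = [log_likelihood(X, w, mu)]
for _ in range(50):
    # E-step: q(Z) = p(Z | X; theta_k), which drives the error term to zero
    lj = log_joint(X, w, mu)
    q = np.exp(lj - lj.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    # M-step: maximise E_q[log p(X, Z; theta)] in closed form
    w = q.mean(axis=0)
    mu = (q * X[:, None]).sum(axis=0) / q.sum(axis=0)
    history.append(log_likelihood(X, w, mu))
```

The recorded log-likelihood values never decrease from one iteration to the next, which reflects the coordinate ascent interpretation of the algorithm.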
III Optimal Control as Variational Inference
In this section we demonstrate that either of the optimal control problems introduced in section IIA can be addressed using VI. We detail our approach in two separate subsections, addressing one problem at a time. In a third subsection we make a comparison between both results and discuss in detail various connections with other fields.
IIIA Regularized Stochastic Optimal Control
Let us first return to the SOC problem in (1). We start our analysis with the problem decomposition in (6). Although lacking any straightforward probabilistic motivation, this decomposition matches a VI structure.
The optimal control framework imposes the structure of a controlled trajectory distribution onto the inference distribution so that we do not need to adopt the conventional form prescribed by IIB3. For ease of reference we renamed the two subproblems and . Application of the EM algorithm then boils down to the following steps.

Minimize for keeping fixed.

Maximize for keeping fixed.
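Solution of the M-step is here of course trivial. With $\underline{\rho}_T$ denoting the prior policy sequence that is held fixed in the E-step, the distribution that renders the relative entropy minimal is $\underline{\rho}_T=\underline{\pi}_T$. As a side note we remark that this amounts to the M-projection of $p(\underline{\xi}_T;\underline{\pi}_T)$ onto the structure of $p(\underline{\xi}_T;\underline{\rho}_T)$. We must emphasize that although the solution of the M-step is trivial, exploitation of the EM algorithm does induce an iterative mechanism to treat the underlying optimal control problem. The real question here reduces to solving the E-step. Note that this is equivalent to the I-projection of the (unnormalised) distribution $p(\underline{\xi}_T;\underline{\rho}_T)\exp(-C(\underline{\xi}_T))$ onto the probability distribution space defined by controlled trajectory distributions
$$\min_{\underline{\pi}_T}\ \mathbb{E}_{p(\underline{\xi}_T;\underline{\pi}_T)}\left[C(\underline{\xi}_T)\right]+\mathbb{D}\left[p(\underline{\xi}_T;\underline{\pi}_T)\,\|\,p(\underline{\xi}_T;\underline{\rho}_T)\right] \tag{7}$$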
Versions of this particular problem have popped up and were treated in many earlier works, with varying probabilistic motivation depending on the associated reference. For later comparison it will prove convenient to include a derivation. Derivations and associated motivations can also be found in the works of [rawlik2013stochastic, levine2013variational] and [levine2018reinforcement], though we note that the iterative aspect has only been recognized by [rawlik2013stochastic].
IIIA1 Explicit solution of the E-step
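First verify that problem (7) can be recast as
$$\min_{\underline{\pi}_T}\ \mathbb{E}_{p(\underline{\xi}_T;\underline{\pi}_T)}\left[c_T(x_T)+\sum_{t=0}^{T-1}\left(c_t(\xi_t)+\log\frac{\pi_t(u_t\mid x_t)}{\rho_t(u_t\mid x_t)}\right)\right]$$
We can treat this problem as a variational OCP as it too exhibits an optimal substructure which we may exploit to solve it. Let us therefore define the following function
$$V_t(x_t)=\min_{\overline{\pi}_t}\ \mathbb{E}\left[c_T(x_T)+\sum_{\tau=t}^{T-1}\left(c_\tau(\xi_\tau)+\log\frac{\pi_\tau(u_\tau\mid x_\tau)}{\rho_\tau(u_\tau\mid x_\tau)}\right)\,\middle|\,x_t\right]$$
which satisfies the recursion
$$V_t(x_t)=\min_{\pi_t}\ \mathbb{E}_{\pi_t(u_t\mid x_t)}\left[\log\frac{\pi_t(u_t\mid x_t)}{\rho_t(u_t\mid x_t)}+Q_t(\xi_t)\right]$$
where
$$Q_t(\xi_t)=c_t(\xi_t)+\mathbb{E}_{p(x_{t+1}\mid\xi_t)}\left[V_{t+1}(x_{t+1})\right]$$
Variational optimization of the problem definition of $V_t$ then yields
$$\pi_t(u_t\mid x_t)=\rho_t(u_t\mid x_t)\,\frac{\exp(-Q_t(\xi_t))}{\exp(-V_t(x_t))}$$
and
$$V_t(x_t)=-\log\mathbb{E}_{\rho_t(u_t\mid x_t)}\left[\exp(-Q_t(\xi_t))\right]$$
It is easily verified that these expressions establish a backward recursive procedure to derive a sequence of policy distributions, initialising with $V_T(x_T)=c_T(x_T)$.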
Explicit evaluation of the backward recursion is however restricted to a limited class of well-behaved optimal control problems. We come back to this in section IIIA3. Nevertheless it is already remarkable that, as opposed to the underlying regular optimal control problem, the regularized version can be solved explicitly, replacing the minimization operators with transformed expectations. See Appendix A.
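For finite state and control spaces the backward recursion can be evaluated exactly, which makes for a compact illustration of the iterative mechanism. The sketch below assumes the regularised problem takes the standard form of an expected cost plus a relative entropy with respect to a prior policy, so that each stage is solved by a posterior policy proportional to the prior times $\exp(-Q_t)$ with a log-sum-exp value function. The random Markov decision process and the names `e_step` and `expected_cost` are hypothetical.

```python
import numpy as np

def e_step(P, c, c_T, rho, T):
    """Backward recursion of the (assumed) entropy-regularised problem on a finite MDP.

    P[x, u, y] : transition probability p(y | x, u)
    c[x, u]    : cost rate, c_T[x] : final cost
    rho[t, x, u] : prior policy, e.g. the previous iterate
    """
    n_x, n_u = c.shape
    V = c_T.copy()
    pi = np.zeros((T, n_x, n_u))
    for t in reversed(range(T)):
        Q = c + P @ V                                   # Q_t = c_t + E[V_{t+1}]
        Q_min = Q.min(axis=1, keepdims=True)
        w = rho[t] * np.exp(-(Q - Q_min))               # prior times exp(-Q_t), shifted
        V = Q_min[:, 0] - np.log(w.sum(axis=1))         # V_t = -log E_rho[exp(-Q_t)]
        pi[t] = w / w.sum(axis=1, keepdims=True)        # posterior policy
    return pi

def expected_cost(P, c, c_T, pi, p0, T):
    """Exact expected accumulated cost of a stochastic policy."""
    J = c_T.copy()
    for t in reversed(range(T)):
        J = np.sum(pi[t] * (c + P @ J), axis=1)
    return float(p0 @ J)

rng = np.random.default_rng(0)
n_x, n_u, T = 5, 3, 10
P = rng.dirichlet(np.ones(n_x), size=(n_x, n_u))        # random transition model
c, c_T = rng.uniform(0, 1, (n_x, n_u)), rng.uniform(0, 1, n_x)
p0 = np.full(n_x, 1.0 / n_x)

rho = np.full((T, n_x, n_u), 1.0 / n_u)                 # uniform prior policy
costs = [expected_cost(P, c, c_T, rho, p0, T)]
for _ in range(300):
    rho = e_step(P, c, c_T, rho, T)                     # E-step, then M-step: rho <- pi
    costs.append(expected_cost(P, c, c_T, rho, p0, T))
```

In this toy problem the expected cost of the iterates decreases monotonically and approaches the value obtained by ordinary dynamic programming, while the policies gradually sharpen towards a deterministic one.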
IIIA2 Alternative motivation
We can arrive at the same result from the previous paragraph using a different line of thought. It serves our development right to include such an alternative argument as it provides a principled probabilistic motivation for problem (7) where we lacked one earlier on. Furthermore it attributes new significance to the principle of EI as a technical instrument.
Consider therefore the problem of minimizing the function $f(x)$, which we assume to possess a single global minimum for ease of exposition. We initiate our argument with the observation that classic numerical optimization methods iterate a point estimate, say $x_k$, of the solution $x^*$. Then with every iteration new information is retrieved upon which the point estimate is updated. Now instead of focussing on a point estimate of the isolated solution, we could also model this iterative search procedure with a sequence of beliefs $\{\pi_k(x)\}$ (over iterations, not time), with every $\pi_k$ expressing our uncertainty about any possible solution. This is in agreement with the ideas from probabilistic numerics. The more information that is gathered over the iterations, the more certain we should get about the true solution. It can be anticipated that, to have any use in optimization, the expected cost must decrease with every iteration, ergo $\mathbb{E}_{\pi_{k+1}}[f] < \mathbb{E}_{\pi_k}[f]$. To establish such a sequence, we require a consistent inference procedure that facilitates an update of the form $\pi_k\rightarrow\pi_{k+1}$. To construct such an inference procedure we will make use of the principle of minimum relative entropy. Since a priori little or no information about the possible solutions is available, this situation can be represented mathematically by encoding uncertainty about the solution in a prior probability density $\pi_0$ whose support covers the feasible solution space but which attributes equal probability to each solution. Then in order to gradually decrease the uncertainty over iterations, a posterior belief is desired that discriminates least from the prior belief but so that the expected objective function value produces a lower estimate on the expected cost. Then following the interpretation of the relative entropy as a measure of the inefficiency of assuming that the distribution is $\pi_k$ when the true distribution is $\pi_{k+1}$ [thomas2006elements], we could use the I-projection to project $\pi_k$ onto the distribution space defined by the constraint $\mathbb{E}_{\pi}[f]=\mathbb{E}_{\pi_k}[f]-\Delta$ with $\Delta>0$. The problem that we solve is thus defined as $\pi_{k+1}=\arg\min_{\pi}\mathbb{D}[\pi\,\|\,\pi_k]$ subject to $\mathbb{E}_{\pi}[f]=\mathbb{E}_{\pi_k}[f]-\Delta$.
Using the method of Lagrangian multipliers it is easily verified that where must be determined so that the constraint is satisfied. It can be shown that so that we can choose instead of and attain the same goal [luo2019minima, lefebvre2020elsoc]. In other words, we obtain a belief space containing distributions that perform slightly better on than the prior. By definition this update mechanism produces a monotonically decreasing sequence implying that any derived algorithm is potentially also monotonically decreasing. Note that in practice we cannot keep on subtracting an amount from the expected cost every iteration, but this is an issue that we treat later on by parametrising the problem differently. In any case it follows that when the sequence converges, that is when it reaches the minimal expected cost attainable, it must coincide with the Dirac delta centered at the true solution . The exact same argument can now be repeated for the regular OCP (1).
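The role of the multiplier in the scalar minimization argument can be illustrated numerically. The following is a minimal sketch, assuming that the update takes the exponentially tilted form in which the posterior is proportional to the prior times exp(-lam * f), which is what a Lagrangian treatment of a relative entropy objective with an expectation constraint yields. The names `f`, `prior`, `delta` and `lam`, the quadratic objective and the grid discretisation are illustrative assumptions, not notation from the text. Bisection is used to find the multiplier for which the posterior expected cost equals the prior expected cost minus a fixed amount.

```python
import numpy as np

def entropic_update(prior, f, lam):
    """Return the density proportional to prior * exp(-lam * f) on a grid."""
    logw = np.log(prior) - lam * f
    w = np.exp(logw - logw.max())      # shift for numerical stability
    return w / w.sum()

def solve_multiplier(prior, f, delta, lam_max=1e3, n_bisect=100):
    """Bisect for lam such that E_post[f] = E_prior[f] - delta."""
    target = prior @ f - delta
    lo, hi = 0.0, lam_max
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        # The posterior expected cost decreases monotonically in lam.
        if entropic_update(prior, f, mid) @ f > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

x = np.linspace(-3.0, 3.0, 601)           # grid over the feasible set (assumption)
f = (x - 1.0) ** 2                        # hypothetical objective, minimum at x = 1
prior = np.full(x.size, 1.0 / x.size)     # uniform prior belief
delta = 0.5                               # required decrease in expected cost

lam = solve_multiplier(prior, f, delta)
post = entropic_update(prior, f, lam)
print(f"lam = {lam:.4f}, prior cost = {prior @ f:.4f}, posterior cost = {post @ f:.4f}")
```

The sketch presumes that the requested decrease is attainable, that is, that the target expected cost lies above the minimum of the objective on the grid.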
As such we need to solve the following variational optimization problem (footnote 4: Note that this is an alternative formulation of the relative entropy constrained problem defined by, for example, [peters2010relative] and [schulman2015trust] amongst others. However, here we flip the objective and constraint so as to reveal the connection with Entropic Inference, which we argue is the underlying principle. On account of the method of Lagrangian multipliers, either problem formulation ultimately amounts to the same mathematical problem after appropriate scaling of the Lagrangian multipliers or cost function.)
subject to
Invoking the method of Lagrangian multipliers and assuming appropriate scaling of and so that the Lagrangian multiplier has value we retrieve the familiar problem (7)
As long as the Lagrangian multiplier is larger than zero the posterior expected cost will be smaller than the prior expected cost, so that instead of choosing we can choose a value for the Lagrangian multiplier. For ease of exposition we chose . The larger we choose the value of the multiplier, the faster the implied sequence will converge to the underlying deterministic solution. This observation solves the issue that we cannot keep on subtracting an amount . Instead we can keep on iterating the distributions using the procedure described in section III-A1. The reduction in expected cost is then determined by the cost functional and the value of the multiplier and converges to zero for .
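The effect of the multiplier value on the speed of convergence can be checked on a toy problem. The sketch below is a simplified stand-in for the trajectory-level procedure: it iterates a belief over a scalar decision variable with a fixed multiplier, assuming a multiplicative update in which the new belief is proportional to the old belief times exp(-lam * f). The objective `f`, the grid `x` and the helper `iterate_beliefs` are hypothetical names introduced for illustration only.

```python
import numpy as np

def iterate_beliefs(f, lam, n_iter):
    """Iterate p_next proportional to p * exp(-lam * f) from a uniform belief."""
    p = np.full(f.size, 1.0 / f.size)
    costs = [p @ f]
    for _ in range(n_iter):
        # Subtracting f.min() only rescales the weights; it cancels on normalising.
        w = p * np.exp(-lam * (f - f.min()))
        p = w / w.sum()
        costs.append(p @ f)
    return p, np.array(costs)

x = np.linspace(-3.0, 3.0, 601)
f = (x - 1.0) ** 2                      # hypothetical objective, minimiser at x = 1

p_slow, cost_slow = iterate_beliefs(f, lam=0.5, n_iter=20)
p_fast, cost_fast = iterate_beliefs(f, lam=2.0, n_iter=20)
print("expected cost, lam = 0.5:", cost_slow[[0, 5, 20]])
print("expected cost, lam = 2.0:", cost_fast[[0, 5, 20]])
```

In both runs the expected cost decreases with every iteration and the belief concentrates around the minimiser; the run with the larger multiplier does so in fewer iterations.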
Based on the present argument, we refer to this treatment of stochastic optimal control as Entropic Optimal Control suggesting that a generalised view has been reached.
III-A3 Entropic LQR
As noted, we can solve the expression from the previous paragraph for a limited class of OCPs. Problems with a discrete statespace are one such class. The second class of problems for which we can solve the regularized OCP are Linear-Quadratic Regulators (LQR).
Consider the OCP defined by the linear Gaussian transition probability , cost rate and terminal cost . In the context of the LQR it is reasonable to require that the inference policy distribution is linear Gaussian . It can then be anticipated that both and will be quadratic in their arguments and that the solution is also linear Gaussian in the state.
Here indicates unspecified (or redundant) information and symmetric matrix entries. Since one verifies that
As for any including we have
(8)  
Finally we can evaluate the expression for the function
As is the case for the regular LQR, the stochasticity of the dynamics () only affects the constant terms of the quadratic value and functions (not given here) whilst the solution of interest is only affected by the linear and quadratic terms. It follows that the solution is equivalent for stochastic as well as for deterministic systems.
(9)  
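The remark that noise only affects the constant terms can be made concrete for the regular LQR, to which the text compares the entropic case. The following sketch implements the standard finite-horizon Riccati recursion, not the entropic recursion derived above. The system matrices, the cost convention x'Qx + u'Ru without a factor one half, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

def lqr_backward(A, B, Q, R, QT, Sigma, T):
    """Finite-horizon LQR for x_next = A x + B u + w with w ~ N(0, Sigma).

    The value function is V_t(x) = x' P_t x + c_t. Returns the feedback
    gains K_t (with u_t = K_t x_t) ordered forward in time, and c_0.
    """
    P, c, gains = QT, 0.0, []
    for _ in range(T):
        c = c + np.trace(P @ Sigma)                       # noise enters only here
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A + A.T @ P @ B @ K             # Riccati recursion
        gains.append(K)
    return gains[::-1], c

# Hypothetical double-integrator-like system.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R, QT = np.eye(2), np.array([[0.1]]), np.eye(2)

gains_det, c_det = lqr_backward(A, B, Q, R, QT, np.zeros((2, 2)), T=30)
gains_sto, c_sto = lqr_backward(A, B, Q, R, QT, 0.05 * np.eye(2), T=30)
print("first gain (deterministic):", gains_det[0])
print("first gain (stochastic):   ", gains_sto[0])
print("constant terms:", c_det, c_sto)
```

The noise covariance appears only in the accumulation of the constant term, so the gains coincide for the deterministic and the stochastic system while the constants differ.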
III-B Regularized Risk Sensitive Optimal Control
In this section we return to problem (2). Before we submit it to an analysis similar to the one conducted above, we seek out analogies with Bayesian estimation. For a primer on Bayesian estimation we refer to [sarkka2013bayesian], [barfoot2017state] and Appendix B.
III-B1 Equivalent Bayesian estimation problem
To establish a rigorous connection with the Bayesian estimation framework, the probabilistic statespace model introduced in section II-A can be extended with a probabilistic measurement model or so-called emission probability , see Figure 4.
The measurements are considered independent of any history of the trajectory apart from the present state and action. This probabilistic statespace model is known as a controlled Hidden Markov Model (HMM). The distribution of a measurement sequence conditioned on a trajectory is defined as
The joint distribution parametrised by a policy sequence is then defined as
The probabilistic model sketched here is well studied and can be used to treat complex inference problems, see Appendix B. Two important examples are the Bayesian filtering problem where we seek the distribution and the Bayesian smoothing problem where we seek the marginal a posteriori distribution . Both of these problems are well studied in the literature (see [sarkka2013bayesian] and [barfoot2017state] again). From these general concepts one can derive practical algorithms such as the Kalman Filter and the Rauch-Tung-Striebel (RTS) smoother that operate on linear Gaussian statespace models. The presence of a policy distribution is usually not considered; however, the governing equations are trivially extended to treat controlled HMMs.
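For readers less familiar with these algorithms, the filtering and smoothing recursions can be sketched for an uncontrolled linear Gaussian model. This is a textbook Kalman filter and RTS smoother, written under the assumption of a time-invariant model without inputs; the matrices `A`, `C`, `Q`, `R`, the prior `m0`, `P0` and the simulated data are hypothetical and serve only as an illustration.

```python
import numpy as np

def kalman_filter(ys, A, C, Q, R, m0, P0):
    """Filtering means and covariances of p(x_t | y_0, ..., y_t)."""
    ms, Ps = [], []
    m, P = m0, P0
    for t, y in enumerate(ys):
        if t > 0:                                  # prediction step
            m, P = A @ m, A @ P @ A.T + Q
        S = C @ P @ C.T + R                        # innovation covariance
        K = np.linalg.solve(S, C @ P).T            # gain K = P C' S^{-1}
        m = m + K @ (y - C @ m)                    # measurement update
        P = P - K @ S @ K.T
        ms.append(m)
        Ps.append(P)
    return np.array(ms), np.array(Ps)

def rts_smoother(ms, Ps, A, Q):
    """Smoothed means and covariances of p(x_t | y_0, ..., y_T)."""
    sm, sP = ms.copy(), Ps.copy()
    for t in range(len(ms) - 2, -1, -1):           # backward pass
        Pp = A @ Ps[t] @ A.T + Q                   # predicted covariance
        G = np.linalg.solve(Pp, A @ Ps[t]).T       # gain G = P_t A' Pp^{-1}
        sm[t] = ms[t] + G @ (sm[t + 1] - A @ ms[t])
        sP[t] = Ps[t] + G @ (sP[t + 1] - Pp) @ G.T
    return sm, sP

# Simulate a hypothetical position-velocity model with noisy position readings.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[0.1]])
m0, P0 = np.zeros(2), np.eye(2)

x = rng.multivariate_normal(m0, P0)
ys = []
for t in range(50):
    if t > 0:
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    ys.append(C @ x + rng.multivariate_normal(np.zeros(1), R))

ms, Ps = kalman_filter(ys, A, C, Q, R, m0, P0)
sm, sP = rts_smoother(ms, Ps, A, Q)
print("filtered vs smoothed variance of x_0:", np.trace(Ps[0]), np.trace(sP[0]))
```

At the final time step the smoothed and filtered estimates coincide, whereas at earlier time steps the smoother uses later measurements and its covariance is no larger than that of the filter.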
To establish a connection with control the following artificial emission model is proposed [toussaint2009robot, levine2018reinforcement, watson2021control]. We introduce an auxiliary variable that we assume to have adopted a value with probability proportional to
Now let us reconsider the MLE problem
Then using the substitutions proposed above one easily verifies that this problem is equivalent to the risk sensitive optimal control problem in (2)
This observation is fundamental in that it establishes that there is no technical difference between the Bayesian interpretation and the underlying RSOC problem. So we may treat this problem further by addressing it as a Bayesian estimation problem rather than an optimal control problem.
This proves to be a crucial insight that will allow us to treat the risk sensitive case in exactly the same way as we have treated the regular optimal control problem. We continue our analysis by decomposing the problem into two subproblems (9). As opposed to the strategy followed before, here we must introduce a generic inference distribution similar to that in section II-B3. Application of the EM algorithm then boils down to the following steps.
E-step: Minimize for keeping fixed.
M-step: Maximize for keeping fixed.
In this case the solution of the E-step is trivial. Specifically the error is rendered minimal for . Since we will iterate this solution it is better to replace with the more familiar expression where is substituted for the now old or previous control belief sequence. As a side note, remark that this amounts to the I-projection of onto the trajectory distribution space. The question here thus reduces to solving the M-step. The substitution proposed earlier makes it clearer that in the M-step we seek an updated version of that differs from the version contained in the expression for the inference distribution . Finally one observes that the emission probability can be removed from the objective without altering its solution. Therefore the solution of the M-step reduces to the M-projection of onto the space of controlled trajectory distributions
(10) 
As far as we are aware, this problem has not been treated by [toussaint2009robot], [levine2018reinforcement] or [watson2021control], nor did these authors draw a similar connection between the proposed probabilistic inference model, the underlying RSOC problem and the EM algorithm.
III-B2 Explicit solution of the M-step
First we show how to further reduce problem (10)
This decomposition illustrates that as opposed to problem (7), the optimization problem is not subject to an optimal substructure and can be treated independently for each separate policy distribution.
Further taking into account the normalization condition it follows that
So remarkably the solution of is equivalent to the probability of parametrised by the prior policy sequence and conditioned on the measurements . Before we venture into further technicalities, we can submit the distribution to a preliminary assessment.
Since we conditioned on the state , it follows that this distribution must be equivalent to rather than . This is a direct result of the Markov property since no more information about the input at time instant can be contained in the older measurements than is already contained by the state itself. Again we refer to Appendix B. Put differently this means that once we have arrived at some state , we can only hope to reproduce the measurements but can no longer affect any of the preceding measurements . So again some form of optimal substructure is present although here it is not a property of the optimization problem but rather a property of the distribution . As will be shown shortly the distribution therefore leads to very similar structures as the ones encountered whilst treating (1).
Using Bayes’ rule can be decomposed as
which reduces the problem to finding efficient expressions for the probabilities and . The latter is a generalisation of the backward filtering distribution, a lesser known concept in Bayesian estimation, to the controlled probabilistic statespace model assumed here. The former can be easily derived from there.
As its name suggests, the backward filtering distribution satisfies a backward recursive expression.
For fixed and we can recycle the definition of and define the functions
By definition then it follows that
whereas
At first sight the solutions of the problems (7) and (10) are equivalent. Again we have found a backward recursive procedure to derive a sequence of policy distributions which is apparently equivalent to that of . The subtle difference lies in the definition of the functions. We will discuss the difference in more detail later on. For now it suffices to remark that the difference can be motivated by the difference in the underlying optimal control problems.
Based on similarities with the treatment of stochastic optimal control problems we may refer to this inference procedure as another instance of Entropic Optimal Control tailored to the risk sensitive counterpart.
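The backward recursion can be verified on a small discrete controlled HMM. The sketch below assumes an emission probability equal to exp(-cost) for the auxiliary variable, a time-invariant model and a fixed prior policy; the arrays `P`, `rho`, `c`, `cT` and the names `q` and `beta` are hypothetical and chosen for illustration. It computes the posterior policy from the backward filtering quantities via Bayes' rule and compares it with a brute-force conditioning on all measurements, which also illustrates the Markov argument that preceding measurements carry no additional information once the state is given.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 3, 2, 3                                # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(A, S))       # P[u, x, x_next]: transition probability
rho = rng.dirichlet(np.ones(A), size=S)          # rho[x, u]: prior policy
c = rng.uniform(0.0, 1.0, size=(S, A))           # cost rate
cT = rng.uniform(0.0, 1.0, size=S)               # terminal cost
p0 = np.full(S, 1.0 / S)                         # initial state distribution

def backward_policies():
    """Posterior policies from the backward filtering recursion."""
    beta = np.exp(-cT)                           # p(z_T | x_T)
    pis = []
    for _ in range(T):
        # q[x, u] = p(z_t, ..., z_T | x_t, u_t)
        q = np.exp(-c) * np.einsum("uxy,y->xu", P, beta)
        beta = (rho * q).sum(axis=1)             # p(z_t, ..., z_T | x_t)
        pis.append(rho * q / beta[:, None])      # Bayes' rule
    return pis[::-1]                             # ordered t = 0, ..., T-1

def brute_force_policy(t):
    """p(u_t | x_t, all measurements) by enumerating every trajectory."""
    joint = np.zeros((S, A))
    for xs in itertools.product(range(S), repeat=T + 1):
        for us in itertools.product(range(A), repeat=T):
            w = p0[xs[0]] * np.exp(-cT[xs[T]])
            for k in range(T):
                w *= rho[xs[k], us[k]] * np.exp(-c[xs[k], us[k]]) * P[us[k], xs[k], xs[k + 1]]
            joint[xs[t], us[t]] += w
    return joint / joint.sum(axis=1, keepdims=True)

pis = backward_policies()
for t in range(T):
    print(t, np.abs(pis[t] - brute_force_policy(t)).max())
```

The enumeration conditions on the full measurement sequence, yet its result matches the recursion that only uses the measurements from the current time step onwards.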
III-B3 A note on conditioning
As a tangent to our investigation we may briefly elaborate on the exposed connection between Bayesian estimation and risk sensitive optimal control that follows from the technical discussion above. In essence there now exists no difference between computing the simulated marginal distribution and the conditional or smoothed marginal distribution (footnote 5: This is not entirely true since this would require determining the simulated distribution using the informed transition probability . However the effect on the policy is irrelevant given that afterwards we condition on the state. This nuance is rendered irrelevant altogether when we consider deterministic systems in which case .). Anticipating the substitution