# Entropy Regularised Deterministic Optimal Control: From Path Integral Solution to Sample-Based Trajectory Optimisation

Sample-based trajectory optimisers are a promising tool for the control of robotics with non-differentiable dynamics and cost functions. Contemporary approaches derive from a restricted subclass of stochastic optimal control where the optimal policy can be expressed in terms of an expectation over stochastic paths. By estimating the expectation with Monte Carlo sampling and reinterpreting the process as exploration noise, a stochastic search algorithm is obtained tailored to (deterministic) trajectory optimisation. For the purpose of future algorithmic development, it is essential to properly understand the underlying theoretical foundations that allow for a principled derivation of such methods. In this paper we make a connection between entropy regularisation in optimisation and deterministic optimal control. We then show that the optimal policy is given by a belief function rather than a deterministic function. The policy belief is governed by a Bayesian-type update where the likelihood can be expressed in terms of a conditional expectation over paths induced by a prior policy. Our theoretical investigation firmly roots sample based trajectory optimisation in the larger family of control as inference. It allows us to justify a number of heuristics that are common in the literature and motivate a number of new improvements that benefit convergence.

• 6 publications
• 7 publications
05/06/2022

10/07/2019

### Stochastic Optimal Control as Approximate Input Inference

Optimal control of stochastic nonlinear dynamical systems is a major cha...
09/13/2019

### HJB Optimal Feedback Control with Deep Differential Value Functions and Action Constraints

Learning optimal feedback control laws capable of executing optimal traj...
03/23/2021

### Smoothing-Averse Control: Covertness and Privacy from Smoothers

In this paper we investigate the problem of controlling a partially obse...
10/16/2020

### Direct Policy Optimization using Deterministic Sampling and Collocation

We present an approach for approximately solving discrete-time stochasti...
03/01/2019

### GuSTO: Guaranteed Sequential Trajectory Optimization via Sequential Convex Programming

Sequential Convex Programming (SCP) has recently seen a surge of interes...
03/29/2021

### Distributionally Robust Trajectory Optimization Under Uncertain Dynamics via Relative-Entropy Trust Regions

Trajectory optimization and model predictive control are essential techn...

## I Introduction

Trajectory optimisation is ubiquitous in robotics. It is used to synthesise complex dynamic behaviour [mordatch2012trajopt] as well as to compute real-time feedback control [dantec2021icra]. Currently there are two classes of algorithms that address trajectory optimisation: (1) gradient based algorithms deriving from iLQR [todorov2005ilqr] and DDP [mayne1966ddp] that rely on linear-quadratic approximations about a nominal trajectory, and (2) sample-based approaches akin to Model Predictive Path Integral (MPPI) control [williams2016agressive, williams2017model, williams2018information]. As opposed to the first category, the second class of methods relies on sampled trajectories to probe the local optimisation landscape and use approximate inference techniques to update the solution. These methods ought to be less prone to failure when the models are non-differentiable or many local minima exist [ha2016path]. In addition, they are highly parallelisable [williams2018information]. For this reason sample-based trajectory optimisation has gained the interest of many researchers in robotics[ha2018path, kahn2021badgr, nagabandi2020deep, bhardwaj2021fast].

MPPI derives from a restricted subclass of stochastic optimal control where inference emerges naturally; see [kappen2005linear] and sec. II. This class is known as Linearly Solvable Optimal Control (LSOC) or path integral control. Essentially, the optimal policy can be expressed as a conditional expectation of an exponential cost-to-go. Because the expectation is taken over the passive stochastic dynamics, it is possible to compute the control from passive sampled trajectories.

A body of work has proposed adaptations to MPPI by pursuing similarities with well-known gradient-based and stochastic optimisation algorithms[stulp2012path, bhardwaj2021fast, rajamaki2016sampled, ghandi2021ieeeral, lefebvre2019path]. These adjustments outperform the original algorithm but lack justification. The main objective of this article is to root the derivation of sample-based trajectory optimisers in a generic theoretical framework. In fact, MPPI belongs to a larger family of ideas collectively known as control as inference. The idea is to reformulate the optimal control problem as a probabilistic inference problem so one can draw from the computational machinery that addresses inference. There are a number of ways to do this, and there is still ongoing research to determine how the different approaches are connected to each other [rawlik2013stochastic, Watson2021cai, levine2018reinforcement]

. The main classes are path integral control, entropy regularised reinforcement learning

[ziebart2010modeling, rawlik2013stochastic] and message passing [toussaint2009robot, watson2020stochastic]. Message passing algorithms reformulate stochastic optimal control as Bayesian input estimation by reinterpreting the likelihood of an observation as the desirability of a state-action pair. This problem can be solved using the Bayesian filtering and smoothing equations. As far as we are aware of, no sample-based algorithms are related to this idea. Entropy regularisation in reinforcement learning addresses stochastic optimal control with additional entropic terms in the control objective. This leads to a so-called soft Bellman recursive equation, which can be solved for Linear Gaussian Quadratic regulators but not for arbitrary non-linear problems.

Regardless, entropy regularisation turns out to be a fruitful direction for our purpose. Recent research illustrates how the principle of entropic inference can be put forth as a principled motivation for entropy regularisation in deterministic optimisation [lefebvre2020elsoc, luo2019minima]. In this paper we make a direct connection between entropy regularised optimisation and deterministic optimal control. To this end, we answer two questions

1. Why does entropic regularisation work in stochastic optimisation, and what is the underlying principle?

2. Why do heuristic adjustments of MPPI, not justified by theory, work better than the original algorithm?

The first step towards answering these questions was taken in [lefebvre2020elsoc], where the underlying problem statement was still closely related to LSOC. As a result, the dynamics had to be invertible and the optimisation variable was a state transition distribution rather than a policy. To extract a locally linear feedback policy, the dynamics also needed to be control affine.

In this paper, we essentially show that we can drop these assumptions. We specifically demonstrate that entropic deterministic control gives rise to a path integral expression for optimal control and how a method akin to MPPI can be derived from this result. Therewith, we establish a theoretical foundation that allows the derivation of generic sample-based search algorithms tailored to deterministic trajectory optimisation.

### I-a Notation

Discrete time is denoted with subscript . The iteration number in algorithms is denoted with subscript . Entities in sample sets are labelled with subscript . We use bold font to denote sequences, e.g. . We use time subscript to denote a subsequence starting from time instant , e.g. . A sequence can also be expressed as . We rely on context to imply the range. We combine both notations so that e.g. refers to iterates of a sequence . Expression implies that the uncertain variable is distributed according to distribution , denotes the expected value of .

denotes the multivariate Gaussian or Normal distribution with mean

and positive definite covariance matrix . is often shortened to .

### I-B Problem description

We consider finite horizon discrete time deterministic optimal control. Variables and denote states and controls, respectively. We assume dynamics are governed by a non-linear time variant difference equation given by . State-action couples are denoted as , except for . A trajectory is defined as . A feasible trajectory satisfies the dynamic difference equation. Functions and denote the running and terminal costs, respectively. The cost-to-go from is defined as

 Rn(τn)=rN(sN)+∑N−1n′=nrn′(sn′,an′)

We consider the trajectory optimisation problem defined below. The problem solves for open-loop control sequence .

 minaR0(τ0) s.t. sn+1=fn(sn,an), s0=s∗0 (1)

The optimal cost-to-go or value function is defined as , subject to the dynamics and initial state . Relying on Bellman’s principle of optimality, (1) can be cast into a recursive problem. This gives rise to Bellman’s backward recursion equation with boundary condition . Sequence contains state-dependent closed-loop policy functions .

 Vn(s) =minarn(s,a)+Vn+1(fn(s,a)) (2) an(s) =argminarn(s,a)+Vn+1(fn(s,a))

## Ii Control as inference through LSOC

LSOC refers to an interesting but restrictive subclass of continuous-time stochastic optimal control problems. Here, we give a brief overview of the main ideas [kappen2005linear]. We consider control affine stochastic dynamics where denotes a Wiener process, so that

A state dependent cost-to-go is defined as

 C(t)=cT(s(T))+∫Ttc(τ,s(τ))dτ

We look for a continuous time optimal control . The cost is appended with an input-dependent cost. Note that the rate is inversely proportional to the noise covariance.

defines the continuous-time value function. The expectation is taken over the path probability

induced by the stochastic dynamics and conditioning on .

 V(t,s(t))=mina(t→T)EP(a|s(t))[C(t)+∫Ttλ2∥a(τ)∥2Σ−1dτ] (3)

Defining the desirability function , it can be shown that the solution of (3

) is governed by a linear partial differential equation. Remarkably, the solution can then be expressed as a path integral according to the Feynman–Kac formula. Note that the expectation is taken over

passive paths.

 Z(t,s(t))=EP(0|s(t)[exp(−1λC(t))]

Second it can also be shown that the optimal control satisfies

 a∗(t,s(t))=1Z(t,s(t))EP(0|s(t)[exp(−1λC(t))dξ]

On account of Girsanov’s theorem, the measure of the expectation can be changed to the system dynamics induced by any arbitrary control [ha2016path]. This basically amounts to importance sampling in continuous time. Here, represents the Radon-Nikodym derivative of with respect to .

 a∗(t,s(t))∝EP(ag|s(t)[dP(0)% dP(ag)exp(−1λC(t))dξ]

This summarises the main concepts from LSOC.

From this theory, one can derive a sample-based trajectory optimisation algorithm known as MPPI. Because this (and related) method(s) rely on the calculation of a path integral, they are also referred to as path integral control. The MPPI algorithm is summarised in Alg. 1. Due to Girsanov, sampled trajectories are obtained about a reference trajectory induced by with simulated control perturbations . An updated control is inferred according to the theory of LSOC.

A few other remarks are in place:

1. The algorithm solves stochastic optimal control problem (3). The policy itself is (thus) deterministic and inference is established by the inherent process noise.

2. The control penalty in (3) is quadratic, i.e., , with . Hence, when choosing two out of the three parameters , and , we fix the problem that we solve.

3. In practice, the covariance is often updated comparable to CMA-ES [stulp2012path, bhardwaj2021fast], .

4. Also, the theoretical running cost in line 8 is replaced by a generalised running cost, .

Although the two adjustments proposed above lead to better performance, they are not supported by any theory. Furthermore, the combination of remarks 2 and 3 implies that we are solving different problems with every update. By consequence, we argue they are also not properly understood.

## Iii Entropic optimisation

In this section we answer question 1) mentioned in the introduction. It serves as a stepping stone to question 2).

Algorithms for numerical problems, such as optimisation, proceed iteratively, with each iteration providing information that improves a running estimate of the correct solution. Probabilistic numerics [oates2019probnum] pursues methods that, in place of such estimates, update beliefs111The manifestation and interpretation of probability is epistemic and arises from missing information in a computation that is otherwise deterministic. over the solution space. In brief, in this section, we aim to rephrase the generic problem of optimisation as a problem of inference. By embedding deterministic optimisation into the framework of entropic inference, we provide a theoretical argument for the use of entropy regularisation in optimisation in addition to the vast empirical validation documented in previous work.

### Iii-a Entropic inference

In probability theory, inference refers to the rational processing of incomplete information

[jaynes2003]. In the present context, we make use of probabilities to encode our uncertainty about an underlying deterministic quantity and refer to them as beliefs. We seek a posterior, , which encodes new information into a prior, , which encodes information that we already have.

In Bayesian inference, new information is contained in data or experiments. To the contrary, we are interested in information that is contained in an expectation. The principle that allows us to address this type of information is that of minimum relative entropy or discrimination information

[kullback1951information, jaynes1986background, jaynes1982rationale]. The principle states that the unique posterior,

, living in the space of probability distributions,

, constrained by an expectation of the form , is the one that is hardest to discriminate from the prior, . Equivalently, it is the one that minimises their relative entropy. Mathematically, this gives rise to a variational optimisation problem of the following form

 minπ∈P D[π∥ρ]relative entropy s.t. μ=Eπ[g]new information (4)

where . The solution is a Boltzmann distribution with so that .

### Iii-B Entropic inference for optimisation

Mathematical optimisation addresses problems of the following form where represents the feasible subset.

 x∗=argminx∈Xq(x) (5)

Classic numerical optimisation strategies iterate a running estimate of the solution , assimilating new information with each iteration.

Instead of focussing on a single estimate, we wish to model this search with a belief sequence . The more information that is assimilated, the more certain we get about the solution. To suit the purpose of optimisation, such a sequence should exhibit the following property

 limg→∞Eπg[x]=argminx∈Xq=x∗ (6)

Second, we need an inference procedure that facilitates an update operation based on some form of new information

 πg+1←πg

Our approach to arrive at such an inference procedure tailored to optimisation is straightforward. A priori no information is available. We can represent this situation mathematically by encoding our uncertainty about the solution in a prior probability density function,

222If no information is available, we can choose .

. In order to gradually decrease our uncertainty, we then look for a posterior probability density function,

, that discriminates the least from our initial guess but has an expectation over the objective function that produces a lower estimate than the prior expectation.

 minπ∈P D[π∥ρ] (7) s.t. Eπ[q]≤Eρ[q]−Δ, Δ>0

Whether a solution exists depends on the parameter . If it exists, the solution is a Boltzmann distribution similar to (4). It can be shown that if we choose , there also exists some [luo2019minima]. For now, this is sufficient.

 π(x)∝ρ(x)⋅e−λq(x) (8)

Clearly, these expressions bear close correspondence with the classical Bayesian update, generating a posterior, , by multiplying a prior, , with an expression encoding a likelihood, which is the likelihood of optimality in this case. By applying substitutions and , we can establish the sequence . This sequence has exactly property (6) [luo2019minima].

This property basically implies that the sequence converges to the Dirac delta distribution centred at the optimum. Since the next idea is to emulate the behaviour of this distribution numerically, the property also implies that the sample set will become more and more localised. Although this is exactly what we want from a theoretical point of view, this is troublesome from an algorithmic perspective. Therefore, we desire to encode a second prior into the sequence that stimulates exploration. We introduce the augmented entropic optimisation problem. Here represents the uniform on and is a scaling factor that allows us to attribute more importance to either prior.

 minπ∈P αD[π∥πg]+(1−α)D[π∥UX],0<α<1 (9) s.t. Eπ[q]≤Eπg[q]−Δ, Δ>0

For some , the solution is given by333

For notational convenience, we absorb the uniform distribution

into the proportionality. Note that is therefore zero outside the set . [lefebvre2020elsoc]

 πg+1∝παg⋅U1−αX⋅e−λq∝παg⋅e−λq (10)

and it can be verified that

 limα→1limg→∞πg∝limα→1e−λ1−αq∝δ(x−x∗) (11)

By emulating this sequence numerically, we can construct a stochastic search algorithm tailored to the problem (5).

### Iii-C Stochastic search methods

The idea is to update a parametrised belief and match its empirical distribution features with those of the theoretical distribution . In practice, we use a parametric density function with (e.g., ) to approximate the entities in . We derive the associated parameter sequence, , from samples. Therefore, we project the theoretical entity onto the density space generated by , minimising their relative entropy. The objective is then manipulated into an expectation over , which we estimate by sampling .

 θg+1 =argminθ∈ΘD[πg+1∥πθ] ≈argmaxθ∈Θ^EDg[e−λ(q+(1−α)logπθg)logπθ]

where . For details, see appx. A.

A basic implementation is presented in Alg. 2. Clearly, the computation bears close correspondence with Alg. 1, though it remains unclear what underlying principle is mutual. For further exploration of similar ideas, we refer to [abdolmaleki2015model, abdolmaleki2017deriving].

## Iv Entropic deterministic optimal control

In this section we answer question 2). We provide an original derivation for a sample based trajectory optimiser that demonstrates similar heuristics as were described in sec. II. Contrary to MPPI contrary our derivation follows directly from of the ideas in sec. III applied to the problem in sec. I-B.

Although entropy regularisation is a well-known concept in reinforcement learning, it has only been applied to stochastic problems that are naturally embedded in a probabilistic framework. As we will show for deterministic optimal control, entropy regularisation gives rise to explicit expressions for the posterior policies in terms of a conditional expectation taken over trajectories induced by prior policies similar to the setting of LSOC. The crucial difference with Entropic Deterministic Optimal Control (EDOC) is that LSOC inherently solves a stochastic optimal control problem whilst EDOC still solves a deterministic optimal control problem. The inference in LSOC is facilitated by the input noise inherent to the stochastic problem. Contrarily, in EDOC, the inference is put in place intentionally. The major consequence of this difference is that it disentangles the inference from the control.

### Iv-a Entropic Bellman equation

We consider the following EDOC problem, where and .

 minπ∈P (12) s.t.

Starting from problem (1), we reason as follows. The optimisation variables are given by the control sequence . This problem is no different than problem (5), so we could address it with Alg. 2444 Note that in this case, the weights would not be time dependent.. However, we would instead like to exploit the dynamic structure of the problem. Although the solution of (1) is indeed a trajectory, from (2), we know that a sequence of optimal policies, , underpins this trajectory, which we only happen to know evaluated over deterministic dynamics. Therefore, we introduce the conditional policy beliefs, . These beliefs express our uncertainty about control given state at time for the -th iteration. Second, we define a trajectory probability density function conditioned on and .

 pn(πn) =p(τn|sn,πn) =∏N−1n=nδ(sn′+1−fn′(sn′,an′))πn(an|sn)

This definition allows us to cast (1) in a probabilistic manner.

 Vn(sn)=minanEpn(πn)[Rn(τn)]

In the case that the dynamics are deterministic, this probabilistic problem is completely equivalent to the original problem and solves for a Dirac distribution or deterministic policy. In turn, this probabilistic model motivates problem (12).

We will address the solution to this problem in the next paragraph. First, we further motivate our problem definition by demonstrating that we would have arrived at the same result if we had regularised the Bellman equation (2) instead. The following lemma certifies the consistency of the proposed regularisation. It also emphasises that we only regularise over the optimisation variables, , but not the dynamics. Finally, it introduces the entropic Bellman equation.

###### Lemma 1.

The solution of problem (12) is governed by the following entropic Bellman equation

 Vg+1,n =minπ∈PEπ[Qg+1,n]+D[π∥παg,n⋅U1−αA] (13) πg+1,n =argminπ∈PEπ[Qg+1,n]+D[π∥παg,n⋅U1−αA]

where .

###### Proof.

First, by introducing multiplier , we can recast (12). Note that the dynamics cancel out in the fractions.

 minπn∈PEpn(πn)[λRn+αlogπnπg,n+(1−α)logπnUA,n]

Then, writing out the objective, we retrieve an optimisation problem that adheres to Bellman’s principle of optimality

 minπn∈PEpn(πn)[λrN+…∑N−1n′=n(λrn′+αlogπn′πg,n′+(1−α)logπn′UA)]=minπn∈PEπn[λrn+αlogπnπg,n+(1−α)logπnUA+…minπn+1∈PEpn+1(πn+1)[…λRn+1+αlogπn+1πg,n+1+(1−α)πn+1UA,n+1]]=minπn∈PEπn[λrn+αlogπnπg,n+(1−α)logπnUA+Vg+1,n+1]

where we have defined

 Vg+1,n+1=minπn+1∈PEpn+1(πn+1)[λRn+1+logπn+1παg,n+1U1−αA,n+1]

The lemma follows. ∎

Henceforth, we will treat as a hyper-parameter.

### Iv-B Path integral solution

In this section, we address the tractability of the EDOC problems in (12) and (13). First, we establish a recurrence relation for the optimal policy belief sequence similar to that done for problem (9) with equation (10). Second, we illustrate that this solution gives rise to an explicit path integral expression for the optimal posterior policy belief.

The first result is summarised by the following theorem. For the proof, we refer to appendix B. This result is known as the soft Bellman equation and has been studied in combination with stochastic system dynamics, e.g., in [rawlik2013stochastic] for and in [levine2018reinforcement] for . The second, which is novel, is summarised in the corollary beneath it.

###### Theorem 1.

The solution of problem (13) is given by

 πg+1,n ∝παg,nexp(−Qg+1,n+Vg+1,n) Vg+1,n =−log∫παg,nexp(−Qg+1,n)da

Theorem 1 points out that, similarly to equation (10), the relationship between the posterior and prior optimal policy belief functions ( and , respectively) is governed by a Bayesian type recurrence relation. However, in this case, the recursion still depends on the functions and as a consequence of the problem’s sequential nature. This is in contrast to (10), which only depends on the objective .

Fortunately, the results can be further developed into an explicit expression for the posterior optimal policy belief function that depends solely on prior information contained in . This result is a direct consequence of the deterministic system dynamics and is summarised by the following corollary.

###### Corollary 1.

The posterior optimal policy belief function can be expressed as

 πg+1,n(a|s)=πg,n(a|s)e−rg,nZg+1,n+1(fn(s,a))Zg+1,n(s)

where

 rg,n =λrn+(1−α)logπg,n Zg+1,n =exp(−Vg+1,n)

and

 Zg+1,n(s) =Epn(πg,n)[exp(−Rg,n)] pn(πg+1,n) ∝pn(πg,n)exp(−Rg,n) Rg,n =λrN+∑N−1n′=nrg,n′
###### Proof.

First, we define the function as in the theorem. Substituting the transformation into the solution of Theorem 1 yields the following recurrence relation for (this is only because is evaluated deterministically in ).

 Zg+1,n =∫παg,ne−λrnZg+1,n+1da =Eπg,n[e−rg,nZg+1,n+1]

We can then easily develop this recursion into an explicit expression for as a conditional expectation over the path probability initialised at , that is, . Recall that this probability is conditioned on , so is indeed a function of it, which proves the expression for .

 Zg+1,n=Epn(πg,n)[exp(−Rg,n)]

Now we can substitute this result into the recursive expression for and recover the main result.

Finally, one can substitute the explicit expression for into the trajectory probability, .

 pn(πg+1,n)pn(πg,n)=Zg+1,n+1Zg+1,nZg+1,n+2Zg+1,n+1⋯Zg+1,N−1Zg+1,N−21Zg+1,N−1e−Rg,n

It follows that the quotients cancel out when we multiply over the entire trajectory expect for which in fact normalizes the trajectory density function. ∎

The corollary above implies that the function, , and hence the posterior policy belief, , can be quantified explicitly by evaluating conditional expectations over the prior path probabilities, . Although these results follow quite naturally from Theorem 1, they have important consequences in terms of the tractability and computability that are unique to the entropic regularisation of deterministic (rather than stochastic) optimal control. As noted, the framework collapses when aside from the purposeful uncertainty, stochasticity is introduced on account of the dynamics. In this case, an expectation over the stochastic dynamics emerges in the definition of , i.e., instead of . As a result, it is impossible to develop the recursion from Theorem 1 into explicit expressions because we cannot get rid of the expectation in the exponent.

Clearly, these expressions also bear a close correspondence with LSOC. We note that the Feynman-Kac formula establishes a relation between certain partial differential equations and stochastic processes. In particular, it expresses the solution of a partial differential equation as a conditional expectation. The association with stochastic processes emerges, as they constitute a framework where such conditional expectations arise naturally. In our work, the conditional expectation is not associated to any stochastic process, but rather to the Bayesian policy beliefs. Put differently we compute what we might refer to as a Bayesian path integral instead of a stochastic path integral. The difference lies in that the conditional expectation is taken over purposeful uncertainty introduced to construct a consistent inference procedure about the underlying deterministic optimal control problem; it is not inherent to the problem.

### Iv-C Entropic MPPI

From corollary (1), we can derive a sample-based trajectory optimiser akin to MPPI. The algorithm demonstrates similar heuristics to those discussed in sec. II, amongst novel attributes. Although the algorithm’s structure closely relates to Alg. 1, its derivation stems from an entirely different theoretical context. As a disclaimer, we note that our derivation seeks similarities with Alg. 1 intentionally and therefore uses a restricted class of parametric policy beliefs. However, we note that corollary (1) does not in fact exclude the use of any policy, so one could also consider using a Gaussian mixture to address certain problems, e.g., those with many local optima.

In particular, we approximate the posterior policy beliefs, , with a parametric distribution and infer the associated parameters from samples. In this case, we apply the prior policy belief sequence, , for a given and fixed initial state, , and collect data in the form of sample paths or trajectories, . This means that we will sample from the policies as if they were probability density functions rather than Bayesian belief functions. Averaging over these sample trajectories then allows us to evaluate the expressions in corollary 1. Given the clear resemblance with Alg. 1, we can reasonably refer to it as Entropic MPPI (EMPPI); Alg. 3.

#### Iv-C1 Policy belief parametrisation

Pursuing general tractability, we are interested in locally linear Gaussian policies with temporal distribution parameters, , where . With the exception of the covariance, this is similar to imposing a piecewise linear controller. Since the normal is unimodal, this approximation renders the stochastic search method locally similar to gradient-based approaches.

 πg,n(a|s)≈πn(a|s;θg)=N(a|kg,n+Kg,ns,Σg,n)

#### Iv-C2 Projection strategy

Since the beliefs are conditioned on , we extend the projection idea with an expectation over the available samples for that time instant, . This expression denotes the probability of state conditioned on the initial state and the prior policy belief sequence .

 θg,n =argmaxθ∑jwn,j∑jwn,jlogπ(an,j|sn,j,θ)

with

 −logwn,j=Rg,n,j=λRn,j+(1−α)∑N−1n′=nlogπg,n′,j (14)

Apart from the cost, , the weights also contain a second term . Given that if was less likely than , this auxiliary cost encourages unlikely whilst penalizing likely actions. Note that if we had not augmented the entropic optimisation problem with a second prior this term would vanish (

). This optimisation problem is solved most efficiently by calculating a likelihood weighted estimate of the joint Gaussian distribution,

. Parameters are then found by conditioning on . Further note that we solve this problem for every

 Kg+1,n =^Σas,g+1,n^Σ−1ss,g+1,n (15) kg+1,n =^μa,g+1,n−Kg+1,n^μs,g+1,n Σg+1,n =^Σaa,g+1,n−Kg+1,n^Σss,g+1,nK⊤g+1,n

where

 ^μτ,g+1,n =⟨τg,n,j⟩ ^Σττ,g+1,n =⟨(τg,n,j−^μτ,g+1,n)(τg,n,j−^μτ,g+1,n)⊤⟩

with .

Similar algorithms are described in references [williams2016agressive, williams2017model, williams2018information, bhardwaj2021fast, rajamaki2016sampled, ghandi2021ieeeral, lefebvre2019path]. Amongst these MPPI implementations, the presence of feedback, the presence of the term in the likelihood weights and the applicability to general deterministic optimal control problems are, as far as we are aware, unique to Alg. 3, and are now theoretically justified.

#### Iv-C3 Numerical implementation

To increase the overall numerical stability, we first use exponential smoothing

 θg+1,n ←βθg+1,n+(1−β)θg,n (16)

Second the use of Monte Carlo estimates implies that . To estimate the covariance matrix of the trajectory, , we need at least a multiple of

samples. It is clear that the updates are prone to high variance for finite

. To remedy this issue we project the time signals on a polynomial space of order spanned by the basis . Despite this measure the update for remained too unstable. Hence instead we substitute a fixed gain matrix for in the updates in (15). Finally, since , an auxiliary procedure is used that changes until .

#### Iv-C4 Relation to other algorithms

Standard gradient-based trajectory optimisation algorithms iterate between a forward and a backward pass to probe the local problem geometry about the iterate trajectory and to optimise it, respectively. Similarly, Alg. 1 and 3 iterate between a forward Monte Carlo step and an inference step. As opposed to the backward step in gradient-based algorithms, the inference step does not possess a causal structure and is thus amenable to parallelisation . In Alg. 1 and 3, the weights are time dependent, so in some way, the recursive nature of the problem emerges. This is opposed to Alg. 2 and other stochastic search algorithms such as CMA-ES[abdolmaleki2017deriving], which could be used to solve (1) as well. The use of conditional policies is not straightforward, nor are the weights time dependent. So taking into account the blue print architecture of Alg. 2, EMPPI can be understood as a temporal rollout of an Evolutionary Strategy or a stochastic implementation of gradient-based trajectory optimisation algorithms where gradient information is inferred from the sampled trajectories. Finally, we emphasise again that Alg. 1 solves specific stochastic optimal control problem (3) whilst Alg. 3 solves general deterministic optimal control problem (1).

### Iv-D Numerical example

We present results for numerical experiments with a 4 dimensional planar robot arm operating in an environment with a single obstacle. This is not only to demonstrate the practical implications of our theoretical investigation (corollary 1 specifically), but also to investigate the effect of the different MPPI algorithms and changing their parameter settings (Alg. 3) on the exploration versus exploitation behaviour of the resulting distributions.

#### Iv-D1 Environment

The environment consists of a planar pendulum with links of length . Masses are concentrated at the end of each link. We use torque inputs directly instead of generating kinematic trajectories and relying on low-level controllers. An OCP is formulated with horizon . We use a relatively coarse time discretisation . The cost rate function is defined as , penalising the energy consumption and aggressive moves. We intentionally do not encode information about the non-linearity of the dynamics nor the obstacles. The system is drawn towards a final end-effector configuration using the final cost term , where

represents the distance vector between the end-effector and goal configuration. We do not represent the obstacles in the cost function. Interactions with the obstacles are strictly through the dynamics; gradient-based algorithms are not in favour here. The contact dynamics are modelled through forces

, where is the Heaviside function, is the distance vector between the obstacle and the near contact point and is the Jacobian matrix computed at the nearest contact point. Described contact dynamics are inelastic.

#### Iv-D2 Experiments

We compare three versions of Alg. 3. In version A, we set , only update , set and . Version A is the most closely related to the MPPI implementation in [williams2018information] and therefore serves as a baseline. An important improvement of MPPI comes from updating the covariance (deviating from the theory of LSOC) [stulp2012path, bhardwaj2021fast]. As a result, it can be observed empirically that the covariance collapses prematurely. Therefore, in version B, we set but update . According to (11), the policy belief functions will converge to a Dirac delta and it is anticipated that the search will converge prematurely. Version C implements the full algorithm. Versions A, B and C are initialised with the feedforward and with covariances . Unless specified, we set and then run 200 generations.

#### Iv-D3 Results

The solution after generations is visualised in Fig. 1. Clearly, only C completes the tasks successfully. As anticipated, we can observe the presence of premature distribution collapse for version B. The entropy of the distribution evaporates and the search stalls. The performance of A is superior to that of B yet also it fails to execute the final reach of the complete manoeuvre within 200 generations. The main benefit of C over A is that it can automatically adapt the covariance of the policy. This allows the policy to discover interesting directions more rapidly. One can also observe that some of the alternative histories still collide with the obstacle; howevers the bulk effectively reaches the goal. These observations are confirmed by Fig. 3 and 3. In particular, for version C, Fig. 3 clearly illustrates how the covariance of the policy self-adapts to the progress made on the problem.

## V Conclusion

Sample- based trajectory optimisation is a promising tool for robotics with complex and non-smooth dynamics and cost functions, both to synthesise complex behaviour and to compute real-time feedback controls. In this contribution, we have proposed an alternative derivation of the popular MPPI algorithm. Our derivation is founded on the framework of EDOC, an entropy-regularised version of the standard deterministic optimal control problem, for which we have shown that the optimal policy can be given by a policy belief function and expressed as a Bayesian path integral similarly to LSOC. We argue that our derivation allows for more principled future algorithmic development of sample-based trajectory optimisers and will become incorporated into other challenges and applications.

## Acknowledgements

This research received funding from the “Flemisch Artificial Intelligence Research (FlAIR)” programme, Belgium.

## Appendix A Stochastic search algorithms

We return to the derivation in section III-C. Now, we wish to manipulate it into an expectation over the prior , which would allow us to estimate it using Monte Carlo sampling. To arrive at the second line, we substitute the definition for the relative entropy. Since we optimise for , we can further neglect the first term. Then, to arrive at the third line, we make use of the recurrence relation. This expectation is then approximated using a sample where .

 θg+1 =argminθ∈ΘD[πg+1∥πθ] =argmaxθ∈ΘEπg[π1−αge−λqlogπθ] ≈argmaxθ∈Θ^EDg[e−(λq+(1−α)logπθg)logπθ]

When we approximate beliefs using the Normal distribution ; this approach generates the updates shown in algorithm 2. We refer to [lefebvre2020elsoc] for further details.

## Appendix B Proof of Theorem 1

###### Proof.

Consider , where and are multipliers associated to the normalisation and inequality constraints.

 L =