1 Introduction
∗Equal contribution.

Model predictive control (MPC) [20] is an effective tool for control tasks involving dynamic environments, such as helicopter aerobatics [1] and aggressive driving [29]. One reason for its success is the pragmatic principle it adopts in choosing controls: rather than wasting computational power to optimize a complicated controller for the full-scale problem, which may be difficult to accurately model, MPC instead optimizes a simple controller (e.g., an open-loop control sequence) over a shorter planning horizon that is just sufficient to make a sensible decision at the current moment. By alternating between optimizing the simple controller and applying its corresponding control on the real system, MPC results in a closed-loop policy that can handle modeling errors and dynamic changes in the environment.
Various MPC algorithms have been proposed, using tools ranging from constrained optimization techniques [8, 20, 27] to sampling-based techniques [29]. In this paper, we show that, while these algorithms were originally designed based on different assumptions and heuristics, if we view them through the lens of online learning [16], many of them actually follow the same general update rule.
Online learning is an abstract theoretical framework for analyzing online decision making. Formally, it concerns iterative interactions between a learner and an environment over $T$ rounds. At round $t$, the learner makes a decision $\tilde{\theta}_t$ from some decision set $\Theta$. The environment then chooses a loss function $\ell_t$ based on the learner’s decision, and the learner suffers a cost $\ell_t(\tilde{\theta}_t)$. In addition to seeing the decision’s cost, the learner may be given additional information about the loss function (e.g., its gradient at $\tilde{\theta}_t$) to aid in choosing the next decision $\tilde{\theta}_{t+1}$. The learner’s goal is to minimize the accumulated costs $\sum_{t=1}^{T} \ell_t(\tilde{\theta}_t)$, e.g., by minimizing regret [16].
We find that the MPC process bears a strong similarity with online learning. At time $t$ (i.e., round $t$), an MPC algorithm optimizes a controller (i.e., the decision) over some cost function (i.e., the per-round loss). To do so, it observes the cost of the initial controller (i.e., $\ell_t(\tilde{\theta}_t)$), improves the controller (to $\theta_t$), and executes a control based on the improved controller in the environment to get to the next state (which in turn defines the next per-round loss).
In view of this connection, we propose a generic framework, DMD-MPC (Dynamic Mirror Descent Model Predictive Control), for synthesizing MPC algorithms. DMD-MPC is based on a first-order online learning algorithm called dynamic mirror descent (DMD) [14], a generalization of mirror descent [4] for dynamic comparators. We show that several existing MPC algorithms [6, 30] are special cases of DMD-MPC, given specific choices of step sizes, loss functions, and regularization. Furthermore, using DMD-MPC, we demonstrate how new MPC algorithms can be derived systematically, including for systems with discrete controls, with only mild assumptions on the regularity of the cost function. This allows us to work even with cost functions that are discontinuous, such as indicator functions. Thus, DMD-MPC offers a spectrum from which practitioners can easily customize new algorithms for their applications.
In the experiments, we apply DMD-MPC to design a range of MPC algorithms and study their empirical performance. Our results indicate that the extra design flexibility offered by DMD-MPC does make a difference in practice; by properly selecting additional hyperparameters, which are obscured in the previous approaches, we are able to improve the performance of existing algorithms. Finally, we apply DMD-MPC on a real-world AutoRally car platform [13] for autonomous driving tasks, matching state-of-the-art performance.
Notation: We summarize the convention used in the following sections. As our discussions will involve planning horizons, for clarity, we use lightface to denote variables that are meant for a single time step, and boldface to denote the variables congregated across the MPC planning horizon. For example, we use $\hat{u}_t$ to denote the planned control at time $t$ and $\hat{\mathbf{u}}_t \triangleq (\hat{u}_t, \dots, \hat{u}_{t+H-1})$ to denote an $H$-step planned control sequence starting from time $t$. Often we will use a subscript to extract elements from a congregated variable; in the above example, we use $\hat{u}_{t,h}$ to denote the $h$th element in $\hat{\mathbf{u}}_t$ (note the subscript index starts from zero). By definition, this coincides with $\hat{u}_{t+h}$. Thus, another way of writing the congregated variables is as $\hat{\mathbf{u}}_t = (\hat{u}_{t,0}, \dots, \hat{u}_{t,H-1})$. All the variables in this paper are finite-dimensional.
2 An Online Learning Perspective on MPC
2.1 The MPC Problem Setup
Let $n, m \in \mathbb{N}_+$ be finite. We consider the problem of controlling a discrete-time stochastic dynamical system
$$x_{t+1} \sim f(x_t, u_t) \tag{1}$$
for some stochastic transition map $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$. At time $t$, the system is in state $x_t \in \mathbb{R}^n$. Upon the execution of control $u_t \in \mathbb{R}^m$, the system randomly transitions to the next state $x_{t+1}$, and an instantaneous cost $c(x_t, u_t)$ is incurred. Our goal is to design a state-feedback control law (i.e., a rule of choosing $u_t$ based on $x_t$) such that the system exhibits good performance (e.g., accumulating low costs over $T$ time steps).
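In this paper, we adopt the MPC approach to choosing $u_t$: at state $x_t$, we imagine controlling a stochastic dynamics model $\hat{f}$ (which approximates our system $f$) for $H$ time steps into the future. Our planned controls come from a control distribution $\pi_{\boldsymbol{\theta}}$ that is parameterized by some vector $\boldsymbol{\theta} \in \boldsymbol{\Theta}$, where $\boldsymbol{\Theta}$ is the feasible parameter set. In each simulation (i.e., rollout), we sample a control sequence $\hat{\mathbf{u}}_t$ from the control distribution $\pi_{\boldsymbol{\theta}}$ (Footnote 1: This can be sampled in either an open-loop or closed-loop fashion when recursively computing the state trajectory.) and recursively apply it to $\hat{f}$ to generate a predicted state trajectory $\hat{\mathbf{x}}_t \triangleq (\hat{x}_{t,0}, \dots, \hat{x}_{t,H})$: we set $\hat{x}_{t,0} = x_t$ and $\hat{x}_{t,h+1} \sim \hat{f}(\hat{x}_{t,h}, \hat{u}_{t,h})$ for $h = 0, \dots, H-1$. More compactly, we write the above as
$$\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t) \tag{2}$$
in terms of some $\hat{\mathbf{f}}$ that is defined naturally according to the above recursion. Through these simulations, we desire to select a parameter $\boldsymbol{\theta} \in \boldsymbol{\Theta}$ that minimizes an MPC objective $\hat{J}(\pi_{\boldsymbol{\theta}}; x_t)$, which aims to predict the performance of the system if we were to apply the control distribution $\pi_{\boldsymbol{\theta}}$ starting from $x_t$. (Footnote 2: $\hat{J}$ can be seen as a surrogate for the long-term performance of our controller. Typically, we set the planning horizon to be much smaller than the amount of time the system is being controlled. This makes the controller easier to optimize and mitigates modeling errors.) In other words, we wish to find the $\boldsymbol{\theta}_t$ that solves
$$\boldsymbol{\theta}_t \leftarrow \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \hat{J}(\pi_{\boldsymbol{\theta}}; x_t). \tag{3}$$
Once $\boldsymbol{\theta}_t$ is decided, we then sample $\hat{\mathbf{u}}_t$ from $\pi_{\boldsymbol{\theta}_t}$, extract the first control $\hat{u}_t$, and apply it on the real dynamical system $f$ in (1) (i.e., set $u_t = \hat{u}_t$) to go to the next state $x_{t+1}$. (Footnote 3: This setup can also optimize deterministic policies. For instance, we can define $\pi_{\boldsymbol{\theta}}$ to be a Gaussian policy with the mean being the deterministic policy. After optimizing $\boldsymbol{\theta}$, we return the mean instead of sampling from $\pi_{\boldsymbol{\theta}}$. This technique is known as “smoothing” [16], which is preferred when the MPC objective or the model transition is non-differentiable.) Because $\boldsymbol{\theta}_t$ is determined based on $x_t$, MPC is effectively state-feedback.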
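The motivation behind MPC is to use the MPC objective $\hat{J}$ to reason about the controls required to achieve desirable long-term behaviors. Consider the statistic
$$C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \triangleq \sum_{h=0}^{H-1} c(\hat{x}_{t,h}, \hat{u}_{t,h}) + c_{\mathrm{end}}(\hat{x}_{t,H}) \tag{4}$$
where $c_{\mathrm{end}}$ is a terminal cost function. A popular MPC objective is $\hat{J}(\pi_{\boldsymbol{\theta}}; x_t) = \mathbb{E}[C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \mid x_t, \pi_{\boldsymbol{\theta}}, \hat{f}]$, which estimates the expected $H$-step future costs. Later in Section 3.1, we will discuss several MPC objectives and their properties.
Although the idea of MPC sounds intuitively promising, the optimization can only be approximated in practice, because (3) is often a stochastic program (like the example above) and the control command needs to be computed at a high frequency. In consideration of this imperfection, it is common to heuristically bootstrap the previous approximate solution as the initialization to the current problem. Specifically, let $\boldsymbol{\theta}_{t-1}$ be the approximate solution to the previous problem and $\tilde{\boldsymbol{\theta}}_t$ denote the initial condition of $\boldsymbol{\theta}$ in solving (3), and consider sampling $\hat{\mathbf{u}}_{t-1} \sim \pi_{\boldsymbol{\theta}_{t-1}}$ and $\hat{\mathbf{u}}_t \sim \pi_{\tilde{\boldsymbol{\theta}}_t}$. We set
$$\tilde{\boldsymbol{\theta}}_t = \Phi(\boldsymbol{\theta}_{t-1}) \tag{5}$$
by defining a shift operator $\Phi$ that outputs a new parameter in $\boldsymbol{\Theta}$. This $\Phi$ can be chosen to satisfy desired properties, one example being that when conditioned on $\hat{u}_{t-1}$ and $x_t$, the marginal distributions of $\hat{u}_t, \dots, \hat{u}_{t+H-2}$ are the same for both $\hat{\mathbf{u}}_{t-1}$ of $\pi_{\boldsymbol{\theta}_{t-1}}$ and $\hat{\mathbf{u}}_t$ of $\pi_{\tilde{\boldsymbol{\theta}}_t}$. A simple example of this property is shown in Fig. 1. Note that $\hat{\mathbf{u}}_t$ also involves a new control $\hat{u}_{t+H-1}$ that is not in $\hat{\mathbf{u}}_{t-1}$, so the choice of $\Phi$ is not unique but algorithm dependent; for example, we can set $\hat{u}_{t+H-1}$ of $\hat{\mathbf{u}}_t$ to follow the same distribution as $\hat{u}_{t+H-2}$ (cf. Section 3.2). Because the subproblems in (3) of two consecutive time steps share all control variables except for the first and the last ones, the “shifted” parameter $\Phi(\boldsymbol{\theta}_{t-1})$ to the current problem should be almost as good as the optimized parameter $\boldsymbol{\theta}_{t-1}$ is to the previous problem. In other words, setting $\tilde{\boldsymbol{\theta}}_t = \Phi(\boldsymbol{\theta}_{t-1})$ provides a warm start to (3) and amortizes the computational complexity of solving for $\boldsymbol{\theta}_t$.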
2.2 The Online Learning Perspective
As discussed, the iterative update process of MPC resembles the setup of online learning [16]. Here we provide the details to convert an MPC setup into an online learning problem. Recall from the introduction that online learning mainly consists of three components: the decision set, the learner’s strategy for updating decisions, and the environment’s strategy for updating per-round losses. We show the counterparts in MPC that correspond to each component below. Note that in this section we will overload the notation $\hat{J}(\boldsymbol{\theta}; x_t)$ to mean $\hat{J}(\pi_{\boldsymbol{\theta}}; x_t)$.
We use the concept of per-round loss in online learning as a mechanism to measure the decision uncertainty in MPC, and propose the following identification (shown in Fig. 2) for the MPC setup described in the previous section: we set the rounds in online learning to synchronize with the time steps in MPC, set the decision set $\boldsymbol{\Theta}$ as the space of feasible parameters of the control distribution $\pi_{\boldsymbol{\theta}}$, set the learner as the MPC algorithm which in round $t$ outputs the decision $\tilde{\boldsymbol{\theta}}_t \in \boldsymbol{\Theta}$ and side information $u_{t-1}$, and set the per-round loss of the environment as
$$\ell_t(\cdot) = \hat{J}(\cdot\,; x_t). \tag{6}$$
In other words, in round $t$ of this online learning setup, the learner plays a decision $\tilde{\boldsymbol{\theta}}_t$ along with side information $u_{t-1}$ (based on the optimized solution $\boldsymbol{\theta}_{t-1}$ and the shift operator in (5)), the environment selects the per-round loss $\ell_t(\cdot) = \hat{J}(\cdot\,; x_t)$ (by applying $u_{t-1}$ to the real dynamical system in (1) to transit the state to $x_t$), and finally the learner receives $\ell_t$ and incurs $\ell_t(\tilde{\boldsymbol{\theta}}_t)$ (which measures the sub-optimality of the future plan made by the MPC algorithm).
This online learning setup differs slightly from the standard setup in its separation of the decision $\tilde{\boldsymbol{\theta}}_t$ and the side information $u_{t-1}$; while our setup can be converted into a standard one that treats $\boldsymbol{\theta}_{t-1}$ as the sole decision played in round $t$, we adopt this explicit separation in order to emphasize that the variable part of the incurred cost $\ell_t(\tilde{\boldsymbol{\theta}}_t)$ pertains to only $\tilde{\boldsymbol{\theta}}_t$. That is, the learner cannot go back and revert the previous control $u_{t-1}$ already applied on the system, but only uses $\ell_t$ to update the current and future controls $u_t, \dots, u_{t+H-1}$.
The performance of the learner in online learning (which by our identification is the MPC algorithm) is measured in terms of the accumulated costs $\sum_{t=1}^{T} \ell_t(\tilde{\boldsymbol{\theta}}_t)$. For problems in non-stationary setups, a normalized way to describe the accumulated costs in the online learning literature is the concept of dynamic regret [14, 32], which is defined as
$$\text{D-Regret} = \sum_{t=1}^{T} \ell_t(\tilde{\boldsymbol{\theta}}_t) - \sum_{t=1}^{T} \ell_t(\boldsymbol{\theta}_t^\star) \tag{7}$$
where $\boldsymbol{\theta}_t^\star \in \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \ell_t(\boldsymbol{\theta})$. Dynamic regret quantifies how suboptimal the played decision $\tilde{\boldsymbol{\theta}}_t$ is on the current loss function $\ell_t$. In our proposed problem setup, the optimality concept associated with dynamic regret conveys a consistency criterion desirable for MPC: we would like to make a decision $\boldsymbol{\theta}_{t-1}$ at state $x_{t-1}$ such that, after applying control $u_{t-1}$ and entering the new state $x_t$, its shifted plan $\tilde{\boldsymbol{\theta}}_t$ remains close to optimal with respect to the new loss function $\ell_t$. If the dynamics model $\hat{f}$ is accurate and the MPC algorithm is ideally solving (3), we can expect that bootstrapping the previous solution $\boldsymbol{\theta}_{t-1}$ through (5) into $\tilde{\boldsymbol{\theta}}_t$ would result in a small instantaneous gap $\ell_t(\tilde{\boldsymbol{\theta}}_t) - \ell_t(\boldsymbol{\theta}_t^\star)$ which is solely due to unpredictable future information (such as the stochasticity in the underlying dynamical system). In other words, an online learning algorithm with small dynamic regret, if applied to our online learning setup, would produce a consistently optimal MPC algorithm with regard to the solution concept discussed above. However, we note that having small dynamic regret here does not directly imply good absolute performance on the control system, because the overall performance of the MPC algorithm is largely dependent on the choice of MPC objective $\hat{J}$. More precisely, small dynamic regret indicates that the plans produced by an MPC algorithm are consistent with the given MPC objective.
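As a small illustration of the dynamic regret described above, the sketch below sums the per-round gaps between the loss of the played decision and the loss of a per-round comparator. The function name `dynamic_regret` and the scalar quadratic losses are hypothetical choices made only for this example; they are not taken from the paper.

```python
from typing import Callable, Sequence


def dynamic_regret(
    losses: Sequence[Callable[[float], float]],
    played: Sequence[float],
    comparators: Sequence[float],
) -> float:
    """Sum over rounds of loss(played decision) minus loss(comparator)."""
    if not (len(losses) == len(played) == len(comparators)):
        raise ValueError("all inputs must cover the same number of rounds")
    return sum(
        loss(p) - loss(c) for loss, p, c in zip(losses, played, comparators)
    )


# Toy per-round losses (theta - center)^2 whose minimizer drifts over time.
centers = [1.0, 2.0, 3.0]
toy_losses = [lambda th, c=c: (th - c) ** 2 for c in centers]
# The learner always lags one unit behind the per-round minimizer.
lagging_decisions = [0.0, 1.0, 2.0]
regret = dynamic_regret(toy_losses, lagging_decisions, centers)
```

In this toy case every round contributes the same gap, so the regret grows linearly with the number of rounds, which is the behavior a good predictive (shift) model is meant to avoid.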
3 A Family of MPC Algorithms Based on Dynamic Mirror Descent
The online learning perspective on MPC suggests that good MPC algorithms can be designed from online learning algorithms that achieve small dynamic regret. This is indeed the case. We will show that a range of existing MPC algorithms are in essence applications of a classical online learning algorithm called dynamic mirror descent (DMD) [14]. DMD is a generalization of mirror descent [4] to problems involving dynamic comparators (in this case, the $\{\boldsymbol{\theta}_t^\star\}$ in dynamic regret in (7)). In round $t$, DMD applies the following update rule:
$$\boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \langle \gamma_t \mathbf{g}_t, \boldsymbol{\theta} \rangle + D_\psi(\boldsymbol{\theta} \,\|\, \tilde{\boldsymbol{\theta}}_t), \qquad \tilde{\boldsymbol{\theta}}_{t+1} = \Phi_t(\boldsymbol{\theta}_t) \tag{8}$$
where $\mathbf{g}_t = \nabla \ell_t(\tilde{\boldsymbol{\theta}}_t)$ is the update direction (which can also be replaced by unbiased sampling if the gradient is an expectation), $\Phi_t$ is called the predictive model (Footnote 4: In [14], $\Phi_t$ is called a dynamical model, but it is not the same as the dynamics of our control system. We therefore rename it to avoid confusion.), $\gamma_t > 0$ is the step size, and for $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \boldsymbol{\Theta}$, $D_\psi(\boldsymbol{\theta} \,\|\, \boldsymbol{\theta}') \triangleq \psi(\boldsymbol{\theta}) - \psi(\boldsymbol{\theta}') - \langle \nabla\psi(\boldsymbol{\theta}'), \boldsymbol{\theta} - \boldsymbol{\theta}' \rangle$ is the Bregman divergence generated by a strictly convex function $\psi$ on $\boldsymbol{\Theta}$.
The first step of DMD in (8) is reminiscent of the proximal update in the usual mirror descent algorithm. It can be thought of as an optimization step where the Bregman divergence acts as a regularization to keep $\boldsymbol{\theta}_t$ close to $\tilde{\boldsymbol{\theta}}_t$. Although a Bregman divergence is not necessarily a metric (since it may not be symmetric), it is still useful to view it as a distance between $\boldsymbol{\theta}$ and $\boldsymbol{\theta}'$. Indeed, familiar examples of the Bregman divergence include the squared Euclidean distance and KL divergence [3].
The second step of DMD in (8) uses the predictive model $\Phi_t$ to anticipate the optimal decision for the next round. In the context of MPC, a natural choice for the predictive model is the shift operator in (5) defined previously in Section 2.1 (hence the same notation), because the per-round losses in two consecutive rounds here concern problems with shifted time indices. Hall and Willett [14] show that the dynamic regret of DMD scales with how much the optimal decision sequence $\{\boldsymbol{\theta}_t^\star\}$ deviates from $\Phi_t$ (i.e., $\sum_t \|\boldsymbol{\theta}_{t+1}^\star - \Phi_t(\boldsymbol{\theta}_t^\star)\|$), which is proportional to the unpredictable elements of the problem as we discussed earlier in Section 2.2.
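The two familiar examples mentioned above can be checked numerically. The sketch below, which uses only the standard definition of a Bregman divergence, evaluates it for the half squared Euclidean norm and for the negative entropy; the helper names are hypothetical and chosen only for this illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def bregman_divergence(
    psi: Callable[[Vector], float],
    grad_psi: Callable[[Vector], Vector],
    p: Vector,
    q: Vector,
) -> float:
    """D_psi(p || q) = psi(p) - psi(q) - <grad psi(q), p - q>."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_psi(q), p, q))
    return psi(p) - psi(q) - inner


def half_squared_norm(v: Vector) -> float:
    return 0.5 * sum(x * x for x in v)


def identity(v: Vector) -> Vector:
    return list(v)


def negative_entropy(v: Vector) -> float:
    # Assumes strictly positive entries (interior of the simplex).
    return sum(x * math.log(x) for x in v)


def grad_negative_entropy(v: Vector) -> Vector:
    return [math.log(x) + 1.0 for x in v]


p, q = [0.2, 0.8], [0.5, 0.5]
euclidean = bregman_divergence(half_squared_norm, identity, p, q)
entropic = bregman_divergence(negative_entropy, grad_negative_entropy, p, q)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

For probability vectors, the negative-entropy divergence coincides with the KL divergence, while the half squared norm gives half the squared Euclidean distance.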
Applying DMD in (8) to the online learning problem described in Section 2.2 leads to an MPC algorithm shown in Algorithm 1, which we call DMD-MPC. More precisely, DMD-MPC represents a family of MPC algorithms in which a specific instance is defined by a choice of the MPC objective in (6), the form of the control distribution, and the Bregman divergence in (8).
Thus, we can use DMD-MPC as a generic strategy for synthesizing MPC algorithms. In the following, we use this recipe to recreate several existing MPC algorithms and demonstrate new MPC algorithms that naturally arise from this framework.
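To make the overall loop concrete, here is a minimal sketch of one possible instance, under assumptions that are not from the paper: a scalar toy system x' = x + u whose model is exact, a deterministic open-loop control sequence as the decision, a squared Euclidean Bregman divergence (so the first DMD step is a projected gradient step onto box limits), a finite-difference gradient, and a shift that repeats the last planned control. All function names and parameter values are hypothetical.

```python
from typing import Callable, List, Tuple


def planned_cost(x0: float, controls: List[float]) -> float:
    """H-step cost of a plan under the assumed model x' = x + u."""
    x, total = x0, 0.0
    for u in controls:
        total += x * x + 0.1 * u * u  # toy instantaneous cost
        x = x + u
    return total + x * x  # toy terminal cost


def numerical_gradient(
    loss: Callable[[List[float]], float], theta: List[float], eps: float = 1e-5
) -> List[float]:
    """Central finite differences, used here only to keep the sketch short."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((loss(up) - loss(down)) / (2.0 * eps))
    return grad


def dmd_mpc(
    x0: float,
    horizon: int = 5,
    rounds: int = 30,
    step_size: float = 0.02,
    u_max: float = 1.0,
) -> Tuple[float, List[float]]:
    x = x0
    theta = [0.0] * horizon  # initial plan
    applied = []
    for _ in range(rounds):
        # Per-round loss: the MPC objective evaluated at the current state.
        grad = numerical_gradient(lambda th: planned_cost(x, th), theta)
        # First DMD step with a squared Euclidean divergence:
        # a gradient step followed by projection onto the box [-u_max, u_max].
        theta = [
            min(u_max, max(-u_max, th - step_size * g))
            for th, g in zip(theta, grad)
        ]
        u = theta[0]
        applied.append(u)
        x = x + u  # real system; identical to the model in this toy example
        # Second DMD step: shift the plan and repeat the last control.
        theta = theta[1:] + [theta[-1]]
    return x, applied


final_state, applied_controls = dmd_mpc(2.0)
```

Only one gradient step is taken per time step, so the quality of the plan relies on the warm start provided by the shift, which mirrors the amortization argument in Section 2.1.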
3.1 Loss Functions
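We discuss several definitions of the per-round loss $\ell_t$, which all result from the formulation in (6) but with different $\hat{J}$. These loss functions are based on the statistic $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$ defined in (4) which measures the $H$-step accumulated cost of a given trajectory. For transparency of exposition, we will suppose henceforth that the control distribution $\pi_{\boldsymbol{\theta}}$ is open-loop (Footnote 5: Note again that even while using open-loop control distributions, the overall control law of MPC is state-feedback.); similar derivations follow naturally for closed-loop control distributions. For convenience of practitioners, we also provide expressions of their gradients in terms of the likelihood-ratio derivative [12]. (Footnote 6: We assume the control distribution is sufficiently regular with respect to its parameter so that the likelihood-ratio derivative rule holds.) All these gradients shall have the form, for some function $\phi_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$,
$$\nabla \ell_t(\tilde{\boldsymbol{\theta}}_t) = \mathbb{E}_{\hat{\mathbf{u}}_t \sim \pi_{\tilde{\boldsymbol{\theta}}_t}} \mathbb{E}_{\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t)} \big[ \phi_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \nabla_{\tilde{\boldsymbol{\theta}}_t} \log \pi_{\tilde{\boldsymbol{\theta}}_t}(\hat{\mathbf{u}}_t) \big]. \tag{9}$$
For compactness, we will write $\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}$ in place of $\mathbb{E}_{\hat{\mathbf{u}}_t \sim \pi_{\boldsymbol{\theta}}} \mathbb{E}_{\hat{\mathbf{x}}_t \sim \hat{\mathbf{f}}(x_t, \hat{\mathbf{u}}_t)}$. These gradients in practice are approximated by finite samples.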
3.1.1 Expected Cost
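The most commonly used MPC objective is the $H$-step expected accumulated cost function under model dynamics, because it directly estimates the expected long-term behavior when the dynamics model $\hat{f}$ is accurate and $H$ is large enough. Its per-round loss function is (Footnote 7: In our experiments, we subtract the empirical average of the sampled costs from $C$ in (11) to reduce the variance in estimating the gradient, at the cost of a small amount of bias.)
$$\ell_t(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \big] \tag{10}$$
$$\nabla \ell_t(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \big] \tag{11}$$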
3.1.2 Expected Utility
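Instead of optimizing for average cost, we may care to optimize for some preference related to the trajectory cost $C$, such as having the cost be below some threshold. This idea can be formulated as a utility that returns a normalized score related to the preference for a given trajectory cost $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$. Specifically, suppose that $C$ is lower bounded by zero (Footnote 8: If this is not the case, let $c_{\min} \triangleq \inf C(\cdot, \cdot)$, which we assume is finite. We can then replace $C$ with $C - c_{\min}$.) and at some round $t$ define the utility $U_t$ (i.e., $U_t : \mathbb{R}_+ \to [0, 1]$) to be a function with the following properties:

$U_t(0) = 1$,

$U_t$ is monotonically decreasing, and

$\lim_{z \to \infty} U_t(z) = 0$.
These are sensible properties since we attain maximum utility when we have zero cost, the utility never increases with the cost, and the utility approaches zero as the cost increases without bound. We then define the expected utility under control distribution $\pi_{\boldsymbol{\theta}}$ and dynamics model $\hat{f}$ as $\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}}[U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t))]$ and define the per-round loss as the negative logarithm of this expected utility:
$$\ell_t(\boldsymbol{\theta}) = -\log \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \big] \tag{12}$$
$$\nabla \ell_t(\boldsymbol{\theta}) = -\frac{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \big]}{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ U_t(C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)) \big]} \tag{13}$$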
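The gradient in (13) is particularly appealing when estimated with samples. Suppose we sample $N$ control sequences $\hat{\mathbf{u}}_t^1, \dots, \hat{\mathbf{u}}_t^N$ from $\pi_{\boldsymbol{\theta}}$ and (for the sake of compactness) sample one state trajectory from $\hat{\mathbf{f}}$ for each corresponding control sequence, resulting in $\hat{\mathbf{x}}_t^1, \dots, \hat{\mathbf{x}}_t^N$. Then the estimate of (13) is a convex combination of gradients:
$$\nabla \ell_t(\boldsymbol{\theta}) \approx -\sum_{i=1}^{N} w_i \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t^i), \qquad w_i = \frac{U_t(C_i)}{\sum_{j=1}^{N} U_t(C_j)}, \quad C_i = C(\hat{\mathbf{x}}_t^i, \hat{\mathbf{u}}_t^i).$$
We see that each weight $w_i$ is computed by considering the relative utility of its corresponding trajectory. A cost $C_i$ with high relative utility will push its corresponding weight $w_i$ closer to one, whereas a low relative utility will cause $w_i$ to be close to zero, effectively rejecting the corresponding sample.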
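The relative-utility weights described above can be sketched as follows. This is only an illustration: the function name `utility_weights` and the example utility are hypothetical, and the sampled costs are made-up numbers rather than results.

```python
from typing import Callable, List, Sequence


def utility_weights(
    costs: Sequence[float], utility: Callable[[float], float]
) -> List[float]:
    """Normalize the utilities of sampled trajectory costs into weights."""
    utilities = [utility(c) for c in costs]
    total = sum(utilities)
    if total <= 0.0:
        # Every sample was rejected; the estimate is undefined in this case.
        raise ValueError("all sampled trajectories have zero utility")
    return [u / total for u in utilities]


# Example utility satisfying the three properties: U(0) = 1, decreasing,
# and tending to zero as the cost grows.
def reciprocal_utility(cost: float) -> float:
    return 1.0 / (1.0 + cost)


sampled_costs = [0.0, 1.0, 3.0]
weights = utility_weights(sampled_costs, reciprocal_utility)
```

Because the weights are non-negative and sum to one, the resulting gradient estimate is a convex combination of the per-sample score functions, and lower-cost samples receive larger weights.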
We give two examples of utilities and their corresponding losses.
Probability of Low Cost
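For example, we may care about the system being below some cost threshold as often as possible. To encode this preference, we can use the threshold utility $U_t(C) \triangleq \mathbb{1}\{C \le C_{t,\max}\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function and $C_{t,\max}$ is a threshold parameter. Under this choice, the loss and its gradient become
$$\ell_t(\boldsymbol{\theta}) = -\log \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \big] = -\log \mathbb{P}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big( C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max} \big) \tag{14}$$
$$\nabla \ell_t(\boldsymbol{\theta}) = -\frac{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \big]}{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \big[ \mathbb{1}\{C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \le C_{t,\max}\} \big]} \tag{15}$$
As we can see, this loss function is the negative log-probability of achieving cost below some threshold. As a result (Fig. 2(a)), costs below $C_{t,\max}$ are treated the same in terms of the utility. This can potentially make optimization easier since we are trying to make good trajectories as likely as possible instead of finding the best trajectories as in (10).
However, if the threshold $C_{t,\max}$ is set too low and the gradient is estimated with samples, the gradient estimate may have high variance due to the large number of rejected samples. Because of this, in practice, the threshold is set adaptively, e.g., as the largest cost of the top elite fraction of the sampled trajectories with smallest costs [6]. This allows the controller to make the best sampled trajectories more likely and therefore improve the controller.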
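The adaptive threshold described above can be sketched as follows: sort the sampled costs, keep the lowest-cost elite fraction, and use the largest cost among them as the threshold. The function names and the handling of rounding (at least one elite sample is always kept) are assumptions made for this illustration.

```python
import math
from typing import List, Sequence


def elite_threshold(costs: Sequence[float], elite_fraction: float) -> float:
    """Largest cost among the lowest-cost elite fraction of the samples."""
    if not costs:
        raise ValueError("need at least one sampled cost")
    if not 0.0 < elite_fraction <= 1.0:
        raise ValueError("elite_fraction must be in (0, 1]")
    # Assumption: round up and always keep at least one elite sample.
    num_elites = max(1, math.ceil(elite_fraction * len(costs)))
    return sorted(costs)[num_elites - 1]


def threshold_weights(costs: Sequence[float], threshold: float) -> List[float]:
    """Uniform weights over samples whose cost does not exceed the threshold."""
    accepted = [1.0 if c <= threshold else 0.0 for c in costs]
    total = sum(accepted)
    if total == 0.0:
        raise ValueError("threshold rejects every sample")
    return [a / total for a in accepted]


sampled_costs = [5.0, 1.0, 4.0, 2.0]
threshold = elite_threshold(sampled_costs, elite_fraction=0.5)
weights = threshold_weights(sampled_costs, threshold)
```

Setting the threshold from the samples guarantees that at least one trajectory is accepted, which avoids the degenerate case where every sample is rejected.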
Exponential Utility
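We can also opt for a continuous surrogate of the indicator function, in this case the exponential utility $U_t(C) \triangleq \exp(-\frac{1}{\lambda} C)$, where $\lambda > 0$ is a scaling parameter. Unlike the indicator function, the exponential utility provides nonzero feedback for any given cost and allows us to discriminate between costs (i.e., if $C_1 > C_2$, then $U_t(C_1) < U_t(C_2)$), as shown in Fig. 2(b). Furthermore, $\lambda$ acts as a continuous alternative to $C_{t,\max}$ and dictates how quickly or slowly $U_t$ decays to zero, which in a soft way determines the cutoff point for rejecting given costs.
Under this choice, the loss and its gradient become
$$\ell_t(\boldsymbol{\theta}) = -\log \mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \Big[ \exp\big( -\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \big) \Big] \tag{16}$$
$$\nabla \ell_t(\boldsymbol{\theta}) = -\frac{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \Big[ \exp\big( -\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \big) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) \Big]}{\mathbb{E}_{\pi_{\boldsymbol{\theta}}, \hat{\mathbf{f}}} \Big[ \exp\big( -\tfrac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \big) \Big]} \tag{17}$$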
The loss function in (16) is also known as the risk-seeking objective in optimal control [7]; this classical interpretation is based on a Taylor expansion of (16) showing
$$\lambda\, \ell_t(\boldsymbol{\theta}) \approx \mathbb{E}[C] - \frac{1}{2\lambda} \mathbb{V}[C]$$
when $\lambda$ is large, where $\mathbb{V}[C]$ is the variance of $C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)$ when $\hat{\mathbf{u}}_t$ is sampled from $\pi_{\boldsymbol{\theta}}$ and $\hat{\mathbf{x}}_t$ is sampled from $\hat{\mathbf{f}}$. Here we derive (16) from a different perspective that treats it as a continuous approximation of (14) for maximizing the probability of low-cost trajectories. The use of exponential transformation to approximate indicator functions is a common technique (like the Chernoff bound [9]) in the machine learning literature.
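When the exponential utility is estimated from samples, the normalized weights form a softmin over the sampled costs. The sketch below subtracts the minimum cost before exponentiating, a standard numerical-stability trick that is an assumption of this example (it cancels in the normalization and so leaves the weights unchanged). The function name and the sample costs are hypothetical.

```python
import math
from typing import List, Sequence


def exponential_utility_weights(costs: Sequence[float], lam: float) -> List[float]:
    """Normalized weights proportional to exp(-cost / lam)."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    # Subtracting the minimum cost avoids underflow for large costs and
    # cancels out after normalization.
    c_min = min(costs)
    utilities = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(utilities)
    return [u / total for u in utilities]


# Large costs would underflow to zero without the shift by the minimum.
weights_large = exponential_utility_weights([1000.0, 1001.0], lam=1.0)
# A larger scaling parameter flattens the weights toward uniform.
weights_flat = exponential_utility_weights([1000.0, 1001.0], lam=100.0)
```

A small scaling parameter concentrates the weight on the lowest-cost samples, which resembles a hard threshold, while a large one moves the weights toward uniform.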
3.2 Algorithms
We instantiate DMD-MPC with different choices of loss function, control distribution, and Bregman divergence as concrete examples to showcase the flexibility of our framework. In particular, we are able to recover well-known MPC algorithms as special cases of Algorithm 1.
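Our discussions below are organized based on the class of Bregman divergences used in (8), and the following algorithms are derived assuming that the control distribution is a sequence of independent distributions. That is, we suppose $\pi_{\boldsymbol{\theta}}$ is a probability density/mass function that factorizes as
$$\pi_{\boldsymbol{\theta}}(\hat{\mathbf{u}}_t) = \prod_{h=0}^{H-1} \pi_{\theta_h}(\hat{u}_{t,h}) \tag{18}$$
and $\boldsymbol{\theta} = (\theta_0, \dots, \theta_{H-1})$ for some basic control distribution $\pi_\theta$ parameterized by $\theta \in \Theta$, where $\Theta$ denotes the set of feasible $\theta$ for the basic control distribution. For control distributions in the form of (18), the shift operator $\Phi$ in (5) sets $\tilde{\boldsymbol{\theta}}_t$ by identifying $\tilde{\theta}_{t,h} = \theta_{t-1,h+1}$, for $h = 0, \dots, H-2$, and initializes the final parameter as either $\tilde{\theta}_{t,H-1} = \tilde{\theta}_{t,H-2}$ or $\tilde{\theta}_{t,H-1} = \bar{\theta}$ for some default parameter $\bar{\theta}$.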
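The shift operator for factorized control distributions can be sketched as a simple list operation: drop the first per-step parameter, move the rest forward by one step, and fill the final slot either with a copy of the last parameter or with a default. The function name `shift` and the use of a plain list to hold the per-step parameters are assumptions made for this illustration.

```python
from typing import List, Optional, TypeVar

P = TypeVar("P")


def shift(theta_prev: List[P], default: Optional[P] = None) -> List[P]:
    """Shift an H-step parameter sequence forward by one time step."""
    if not theta_prev:
        raise ValueError("parameter sequence must be non-empty")
    # Drop the parameter of the control that has already been applied.
    shifted = list(theta_prev[1:])
    # Fill the new final slot: repeat the last parameter, or use the default.
    shifted.append(theta_prev[-1] if default is None else default)
    return shifted


repeated_last = shift([1.0, 2.0, 3.0])
with_default = shift([1.0, 2.0, 3.0], default=0.0)
```

Both variants keep the horizon length unchanged, so the shifted sequence can be used directly as the warm start for the next round.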
3.2.1 Quadratic Divergence
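We start with perhaps the most common Bregman divergence: the quadratic divergence. That is, we suppose the Bregman divergence in (8) has a quadratic form $D_\psi(\boldsymbol{\theta} \,\|\, \boldsymbol{\theta}') \triangleq \frac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}')^\top A (\boldsymbol{\theta} - \boldsymbol{\theta}')$ for some positive-definite matrix $A$. Below we discuss different choices of $A$ and their corresponding update rules.
Projected Gradient Descent
This basic update rule is a special case when $A$ is the identity matrix. Equivalently, with $\mathbf{g}_t$ the update direction and $\gamma_t$ the step size in (8), the update can be written as $\boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \| \boldsymbol{\theta} - (\tilde{\boldsymbol{\theta}}_t - \gamma_t \mathbf{g}_t) \|^2$.
Natural Gradient Descent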
Quadratic Problems
While the above two update rules are quite general, we can further specialize the Bregman divergence to achieve faster learning when the per-round loss function is quadratic. This happens, for instance, when the MPC problem in (3) is an LQR or LEQR problem [11]. (Footnote 9: The dynamics model is linear, the step cost is quadratic, the per-round loss is (10), and the basic control distribution is a Dirac-delta distribution.) That is, if $\ell_t(\boldsymbol{\theta}) = \frac{1}{2} \boldsymbol{\theta}^\top R_t \boldsymbol{\theta} + r_t^\top \boldsymbol{\theta} + \text{const.}$ for some constant vector $r_t$ and positive definite matrix $R_t$, we can set the matrix of the quadratic divergence to $A = R_t$ and the step size to $\gamma_t = 1$, making $\boldsymbol{\theta}_t$ given by the first step of (8) correspond to the optimal solution to $\ell_t$ (i.e., the solution of LQR/LEQR). The particular values of $R_t$ and $r_t$ for LQR and LEQR are derived in Appendix C.
3.2.2 KL Divergence and the Exponential Family
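We show that for control distributions in the exponential family [23], the Bregman divergence in (8) can be set to the KL divergence, which is a natural way to measure distances between distributions. Toward this end, we review the basics of the exponential family. We say a distribution $p_\eta$ with natural parameter $\eta$ belongs to the exponential family if its probability density/mass function satisfies
$$p_\eta(u) = \rho(u) \exp\big( \langle \eta, \phi(u) \rangle - A(\eta) \big) \tag{19}$$
where $\phi(u)$ is the sufficient statistics, $\rho(u)$ is the carrier measure, and $A(\eta)$ is the log-partition function. The distribution can also be described by its expectation parameter $\mu \triangleq \mathbb{E}_{p_\eta}[\phi(u)]$, and there is a duality relationship between the two parameterizations:
$$\mu = \nabla A(\eta) \quad \text{and} \quad \eta = \nabla A^*(\mu)$$
where $A^*$ is the Legendre transformation of $A$; in other words, $\nabla A^* = (\nabla A)^{-1}$. This duality relationship results in the property below.
Fact 1.
[23] $\mathrm{KL}(p_\eta \,\|\, p_{\eta'}) = D_A(\eta' \,\|\, \eta) = D_{A^*}(\mu \,\|\, \mu')$.
We can use Fact 1 to define the Bregman divergence in (8) in optimizing a control distribution $\pi_{\boldsymbol{\theta}}$ that belongs to the exponential family:

if $\boldsymbol{\theta}$ is an expectation parameter, we can set $D_\psi(\boldsymbol{\theta} \,\|\, \tilde{\boldsymbol{\theta}}_t) \triangleq \mathrm{KL}(\pi_{\boldsymbol{\theta}} \,\|\, \pi_{\tilde{\boldsymbol{\theta}}_t})$, or

if $\boldsymbol{\theta}$ is a natural parameter, we can set $D_\psi(\boldsymbol{\theta} \,\|\, \tilde{\boldsymbol{\theta}}_t) \triangleq \mathrm{KL}(\pi_{\tilde{\boldsymbol{\theta}}_t} \,\|\, \pi_{\boldsymbol{\theta}})$.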
We demonstrate some examples using this idea below.
Expectation Parameters and Categorical Distributions
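We first discuss the case where $\boldsymbol{\theta}$ is an expectation parameter and the first step in (8) is
$$\boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \langle \gamma_t \mathbf{g}_t, \boldsymbol{\theta} \rangle + \mathrm{KL}(\pi_{\boldsymbol{\theta}} \,\|\, \pi_{\tilde{\boldsymbol{\theta}}_t}). \tag{20}$$
To illustrate, we consider an MPC problem with a discrete control space and use the categorical distribution as the basic control distribution in (18), i.e., we set $\pi_{\theta_h} = \mathrm{Cat}(\theta_h)$, where $\theta_h \in \Delta$ is the vector of probabilities of choosing each control at the predicted time step $h$ and $\Delta$ denotes the probability simplex over the discrete controls. This parameterization choice makes $\boldsymbol{\theta}$ an expectation parameter of $\pi_{\boldsymbol{\theta}}$ that corresponds to sufficient statistics given by indicator functions. Using the structure of (9), we find the update direction is
$$g_{t,h} = \mathbb{E}_{\pi_{\tilde{\boldsymbol{\theta}}_t}, \hat{\mathbf{f}}} \big[ \phi_t(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)\, e_{\hat{u}_{t,h}} \oslash \tilde{\theta}_{t,h} \big]$$
where $\phi_t$ is the function in (9), $\tilde{\theta}_{t,h}$ and $g_{t,h}$ are the $h$th elements of $\tilde{\boldsymbol{\theta}}_t$ and $\mathbf{g}_t$, respectively, $e_{\hat{u}_{t,h}}$ has $0$ for each element except at index $\hat{u}_{t,h}$ where it is $1$, and $\oslash$ denotes elementwise division. Update (20) then becomes the exponentiated gradient algorithm [16]:
$$\theta_{t,h} = \frac{1}{Z_{t,h}}\, \tilde{\theta}_{t,h} \odot \exp(-\gamma_t g_{t,h}) \tag{21}$$
where $\theta_{t,h}$ is the $h$th element of $\boldsymbol{\theta}_t$, $Z_{t,h}$ is the normalizer for $\theta_{t,h}$, and $\odot$ denotes elementwise multiplication. That is, instead of applying an additive gradient step to the parameters as in usual gradient descent, the update in (20) exponentiates the gradient and performs elementwise multiplication. This does a better job of accounting for the geometry of the problem, and makes projection a simple operation of normalizing a distribution.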
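The multiplicative update described above can be sketched for a single time step of the horizon as follows. The function name and the example numbers are hypothetical; the sketch assumes the gradient for that time step is already available (e.g., from a sample estimate).

```python
import math
from typing import List, Sequence


def exponentiated_gradient_step(
    probs: Sequence[float], grad: Sequence[float], step_size: float
) -> List[float]:
    """Multiply each probability by exp(-step_size * gradient), then normalize."""
    unnormalized = [p * math.exp(-step_size * g) for p, g in zip(probs, grad)]
    normalizer = sum(unnormalized)
    return [v / normalizer for v in unnormalized]


# Two discrete controls, initially equally likely. The first control has a
# larger gradient entry, so its probability should shrink after the update.
updated = exponentiated_gradient_step(
    probs=[0.5, 0.5], grad=[1.0, 0.0], step_size=math.log(2.0)
)
```

The result stays on the probability simplex by construction, so no separate projection step is needed beyond the normalization.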
Natural Parameters and Gaussian Distributions
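Alternatively, we can set $\boldsymbol{\theta}$ as a natural parameter and use
$$\boldsymbol{\theta}_t = \arg\min_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} \langle \gamma_t \mathbf{g}_t, \boldsymbol{\theta} \rangle + \mathrm{KL}(\pi_{\tilde{\boldsymbol{\theta}}_t} \,\|\, \pi_{\boldsymbol{\theta}}) \tag{22}$$
as the first step in (8). Particularly, we show that, with (22), the structure of the likelihood-ratio derivative in (9) can be leveraged to design an efficient update scheme.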
The main idea follows from the observation that when the gradient is computed through (9) and is the natural parameter, we can write
(23) 
where we denote as the expectation parameter of and as the sufficient statistics of the control distribution. We combine the factorization in (23) with a property of the proximal update below (proved in Appendix A) to derive our algorithm.
Proposition 1.
Let be an update direction. Let be the image of under . If and , then . (Footnote 10: A similar proposition can be found for (20).)
We find that, under the assumption in Proposition 1 (Footnote 11: If is not in , the update in (22) needs to perform a projection, the form of which is algorithm dependent.), the update rule in (22) becomes
(24) 
Equivalently, we can write (24) as
(25) 
In other words, when , the update to the expectation parameter in (8) is simply a convex combination of the sufficient statistics and the previous expectation parameter .
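We provide a concrete example of an MPC algorithm that follows from (24). Let us consider a continuous control space and use the Gaussian distribution as the basic control distribution in (18), i.e., we set $\pi_{\theta_h} = \mathcal{N}(m_h, \Sigma_h)$ for some mean vector $m_h$ and covariance matrix $\Sigma_h$. For $\pi_{\theta_h}$, we can choose sufficient statistics $\phi(\hat{u}_{t,h}) = (\hat{u}_{t,h}, \hat{u}_{t,h} \hat{u}_{t,h}^\top)$, which results in the expectation parameter $\mu_h = (m_h, S_h)$ and the natural parameter $\eta_h = (\Sigma_h^{-1} m_h, -\frac{1}{2} \Sigma_h^{-1})$, where $S_h \triangleq \Sigma_h + m_h m_h^\top$. Let us set $\theta_h$ as the natural parameter. Then (22) is equivalent to the update rule for $h = 0, \dots, H-1$: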
(26) 
Several existing state-of-the-art algorithms are special cases of the update rule in (26).

Model predictive path integral (MPPI) [30]:
If we choose $U_t$ as the exponentiated cost, as in (16), and do not update the covariance, the update rule becomes
$$m_{t,h} = (1 - \gamma_t)\, \tilde{m}_{t,h} + \gamma_t\, \frac{\mathbb{E}_{\pi_{\tilde{\boldsymbol{\theta}}_t}, \hat{\mathbf{f}}} \big[ e^{-\frac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)}\, \hat{u}_{t,h} \big]}{\mathbb{E}_{\pi_{\tilde{\boldsymbol{\theta}}_t}, \hat{\mathbf{f}}} \big[ e^{-\frac{1}{\lambda} C(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)} \big]} \tag{27}$$
which reduces to the MPPI update rule [30] for $\gamma_t = 1$. This connection is also noted in [24].
Originally, Williams et al. [30] derived the MPPI update by estimating the mean of an intractable optimal control distribution, which minimizes the KL divergence between the Gaussian control distribution and the optimal control distribution. By contrast, our update rule in (27) results from optimizing an exponential utility. We argue that our derivation leading to (27) is more natural and direct: it starts from a continuous surrogate of the high-probability objective in (16) and follows from the application of a standard online learning algorithm, DMD. From this perspective, several design choices (e.g., $\gamma_t = 1$) in the original MPPI algorithm [30] are heuristic and not necessarily optimal. See Section 5 for more details.
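A sample-based version of the mean update discussed above can be sketched as follows: the new mean is a convex combination of the shifted previous mean and an exponential-utility-weighted average of the sampled control sequences, with the step size controlling the mix. This is a minimal sketch assuming scalar controls and fixed covariance; the function names and the numbers in the example are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def softmin_weights(costs: Sequence[float], lam: float) -> List[float]:
    """Normalized exponential-utility weights (shifted for stability)."""
    c_min = min(costs)
    utilities = [math.exp(-(c - c_min) / lam) for c in costs]
    total = sum(utilities)
    return [u / total for u in utilities]


def update_mean(
    mean_shifted: Sequence[float],
    samples: Sequence[Sequence[float]],
    costs: Sequence[float],
    lam: float,
    step_size: float,
) -> List[float]:
    """Blend the shifted mean with the weighted average of sampled controls."""
    weights = softmin_weights(costs, lam)
    horizon = len(mean_shifted)
    averaged = [
        sum(w * seq[h] for w, seq in zip(weights, samples))
        for h in range(horizon)
    ]
    return [
        (1.0 - step_size) * m + step_size * a
        for m, a in zip(mean_shifted, averaged)
    ]


samples = [[1.0, 0.0], [-1.0, 2.0]]  # two sampled control sequences, H = 2
costs = [0.0, 0.0]                   # equal costs give equal weights
previous = [2.0, 2.0]
full_step = update_mean(previous, samples, costs, lam=1.0, step_size=1.0)
half_step = update_mean(previous, samples, costs, lam=1.0, step_size=0.5)
```

With a step size of one the previous mean is discarded entirely, whereas smaller step sizes retain part of it, which can reduce the variance of the update when few samples are available.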
3.3 Extensions
In the previous sections, we discussed multiple instantiations of DMD-MPC, showing the flexibility of our framework. But they are by no means exhaustive. For example, the control distributions can be fairly general, in addition to the categorical and Gaussian distributions that we discussed, and constraints on the problem (e.g., control limits) can be directly incorporated through proper choices of control distributions, such as the beta distribution. Moreover, different integration techniques, such as Gaussian quadrature [5], can be adopted to replace the likelihood-ratio derivative in (9) for computing the required gradient direction. We also note that the independence assumption on the control distribution in (18) is not necessary; time-correlated control distributions and feedback policies are straightforward to consider in DMD-MPC.
4 Related Work
Recent work on MPC has studied sampling-based approaches, which are flexible in that they do not require differentiability of a cost function. One such algorithm which can be used with general cost functions and dynamics is MPPI, which was proposed by Williams et al. [30] as a generalization of the control-affine case [29]. The algorithm is derived by considering an optimal control distribution defined by the control problem. This optimal distribution is intractable to sample from, so the algorithm instead tries to bring a tractable distribution (in this case, Gaussian with fixed covariance) as close as possible in the sense of KL divergence. This ends up being the same as finding the mean of the optimal control distribution. The mean is then approximated as a weighted sum of sampled control trajectories, where the weight is determined by the exponentiated costs. Although this algorithm works well in practice, it is not clear that matching the mean of the distribution should guarantee good performance, such as in the case of a multimodal optimal distribution. A closely related approach is the cross-entropy method (CEM) [6], which also assumes a Gaussian sampling distribution but minimizes the KL divergence between the Gaussian distribution and a uniform distribution over low-cost samples. CEM has found applicability in reinforcement learning [19, 21, 26], path and motion planning [17, 18], and MPC [10, 31].
These sampling-based control algorithms can be considered special cases of general derivative-free optimization algorithms, such as covariance matrix adaptation evolutionary strategies (CMA-ES) [15] and natural evolutionary strategies (NES) [28]. CMA-ES samples points from a multivariate Gaussian, evaluates their fitness, and adapts the mean and covariance of the sampling distribution accordingly. On the other hand, NES optimizes the parameters of the sampling distribution to maximize the expected fitness through steepest ascent, where the direction is provided by the natural gradient. Akimoto et al. [2] showed that CMA-ES can also be interpreted as taking a natural gradient step on the parameters of the sampling distribution. As we showed in Section 3.2, natural gradient descent is a special case of the mirror descent framework. A similar observation was made by Okada and Taniguchi [24], in which they derived MPPI through the framework of mirror descent. However, their derivation only considers KL divergence as the regularizer and restricts the sampling distribution to be a Gaussian. In contrast, we do not tie ourselves to a specific Bregman divergence or sampling distribution, but instead consider a family of algorithms by varying these choices.
5 Experiments
We use experiments to validate the flexibility of DMD-MPC. We show that this framework can handle both continuous (Gaussian control distribution) and discrete (categorical control distribution) variations of control problems, and that MPC algorithms like MPPI and CEM can be generalized using different step sizes and control distributions, resulting in improved performance. Additional experimental details are included in Appendix B.
5.1 Cartpole
We first consider the classic cartpole problem where we seek to swing a pole upright and keep it balanced only using actuation on the attached cart. We consider both the continuous and discrete control variants; for the discrete case, we can either push the cart left or right with some unit force or apply no force. For the continuous case, we choose the Gaussian distribution as the control distribution and keep the covariance fixed. For the discrete case, we choose the categorical distribution and use update (21). In either case, we have access to a biased stochastic model (has a different pole length compared to the real cart).
We consider the interaction between the choice of loss, step size, and number of samples used to estimate (9), shown in Figs. 5 and 4. (Footnote 12: For our experiments, we vary the number of samples from and fix the number of samples from to ten. Furthermore, we use common random numbers when sampling from to reduce estimation variance.) For this environment, we can achieve low cost using the expected cost with a proper step size ( for continuous and discrete problems) while being fairly robust to the number of samples. When using either of the utilities, the number of samples is more crucial in the continuous domain, with more samples allowing for larger step sizes. In the discrete domain (Fig. 3(b)), performance is largely unaffected by the number of samples when the step size is below , excluding the threshold utility with 1000 samples.
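The common-random-numbers idea mentioned in the footnote can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the experiments: the toy one-dimensional model, the function names `noisy_rollout_cost` and `estimate_costs_crn`, and the noise level. The point is only that each candidate control sequence is evaluated against the same set of noise realizations, so differences between the cost estimates reflect the controls rather than the noise draws.

```python
import random


def noisy_rollout_cost(controls, rng, noise_std=0.5):
    """Toy 1-D model (hypothetical): the state integrates noisy controls,
    and the cost is the accumulated squared distance to the target 1.0."""
    state, cost = 0.0, 0.0
    for u in controls:
        state += u + rng.gauss(0.0, noise_std)
        cost += (state - 1.0) ** 2
    return cost


def estimate_costs_crn(candidates, num_noise_samples, base_seed=0):
    """Average cost of each candidate control sequence, reusing the same
    noise seeds for every candidate (common random numbers)."""
    estimates = []
    for controls in candidates:
        total = 0.0
        for k in range(num_noise_samples):
            # Seed k is shared by all candidates, so they see identical noise.
            rng = random.Random(base_seed + k)
            total += noisy_rollout_cost(controls, rng)
        estimates.append(total / num_noise_samples)
    return estimates


# Two identical candidates receive exactly the same estimate under shared noise.
estimates = estimate_costs_crn([[0.5, 0.5], [0.5, 0.5], [0.0, 0.0]], 10)
```

With independent noise per candidate, the first two estimates would generally differ even though the controls are identical; sharing the seeds removes that source of variance from comparisons.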
In Fig. 4(a), for a large range of utility parameters, we see that using step sizes above (the step size set in MPPI and CEM) gives significant performance gains. In Fig. 4(b), there is a more complicated interaction between the utility parameter and step size, with huge changes in cost when altering the utility parameter and keeping the step size fixed.
5.2 AutoRally Racing
5.2.1 Platform Description
We use the autonomous AutoRally platform [13] to run a high-speed dirt-track driving task. The robot (Fig. 6) is a 1:5 scale RC chassis capable of driving over ( ) and has a desktop-class Intel Core i7 CPU and Nvidia GTX 1050 Ti GPU. The platform also has an IMU and GPS used to measure the acceleration and infer the position of the vehicle. In both simulated and real-world experiments, the dynamics model is a neural network that has been fitted to data collected from human demonstrations. We note that the dynamics model is deterministic, so we don't need to estimate any expectations with respect to the dynamics.
5.2.2 Simulated Experiments
We first use the Gazebo simulator (Fig. 8) to perform a sweep of algorithm parameters, particularly the step size and number of samples, to evaluate how changing these parameters affects the performance of DMD-MPC. For all of the experiments, the control distribution is a Gaussian with fixed covariance, and we use update (27) (i.e., the loss is the exponential utility (16)) with . Lap times, average speeds, and maximum speeds are shown in Fig. 9. (Footnote 13: The large error bar for 64 samples and step size of 0.8 is due to one particular lap where the car stalled at a turn for about 60 seconds.) We see that although more samples do result in faster lap times, there are diminishing returns past 1920 samples per gradient. Indeed, with a proper step size, even as few as 192 samples can yield lap times within a couple of seconds of those obtained with 3840 samples and a step size of 1. We also observe that the curves converge as the step size decreases further, implying that only a certain number of samples are needed for a given step size. This is a particularly important advantage of DMD-MPC over methods like MPPI: by changing the step size, DMD-MPC can perform much more effectively with fewer samples, making it a good choice for embedded systems that can't produce many samples due to computational constraints.
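The role of the step size can be made concrete with a small sketch. It assumes that, for a Gaussian with fixed covariance and the exponential utility, the mean update is a convex combination of the previous mean and the exponentially weighted average of the sampled control sequences; under that assumed form, a step size of 1 gives an MPPI-style weighted average. The function name `dmd_mpc_mean_update` and the `temperature` parameter (standing in for the utility parameter) are hypothetical names chosen for this illustration.

```python
import math


def dmd_mpc_mean_update(mean, costs, samples, step_size, temperature):
    """Blend the current mean control sequence with the exponentially
    weighted average of sampled control sequences (assumed update form)."""
    # Subtracting the minimum cost avoids underflow in every weight at once
    # and does not change the normalized weights.
    c_min = min(costs)
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    horizon = len(mean)
    weighted_avg = [
        sum(w * u[t] for w, u in zip(weights, samples)) / total
        for t in range(horizon)
    ]
    # step_size = 1 keeps only the weighted average; smaller values retain
    # part of the previous mean, which damps noise from having few samples.
    return [
        (1.0 - step_size) * m + step_size * a
        for m, a in zip(mean, weighted_avg)
    ]


# Two sampled sequences with equal cost: the weighted average is their mean.
new_mean = dmd_mpc_mean_update(
    mean=[0.0, 0.0],
    costs=[5.0, 5.0],
    samples=[[1.0, 2.0], [3.0, 4.0]],
    step_size=0.5,
    temperature=1.0,
)
```

In this toy call the weighted average is [2.0, 3.0], so a step size of 0.5 moves the mean halfway there. With few samples the weighted average is a noisy estimate, which is one way to read why smaller step sizes help in the low-sample regime.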
We also qualitatively evaluate two particular extremes: few vs. many samples (64 vs. 3840) and small vs. large step size (0.5 vs. 1). Fig. 10 shows the speed of the car during the episode, and Fig. 11 shows the trajectory predicted by DMD-MPC for the mean control sequence at certain times. At small step sizes (Figs. 9(c) and 9(a)), the trajectories and speed profiles are rather similar. The corresponding rollout trajectories (Figs. 10(c) and 10(a)) are moderately long and imply reasonably low MPC cost. On the other hand, with few samples and a large step size (Fig. 9(b)), the car drives much more slowly and erratically, sometimes even stopping. The corresponding rollouts (Fig. 10(b)) have more variability, a consequence of the larger variance of the gradient. In the ideal scenario with many samples and a large step size, the car can achieve consistently high speed while driving smoothly (Fig. 9(d)). Indeed, the rollouts (Fig. 9(d)) are much longer while mostly staying near the center of the track.
We also experimented with instead optimizing the expected cost (10) and found performance was dramatically worse (Fig. 12), even when using 3840 samples per gradient. At best, the car would drive in the center of the track at speeds below (Fig. 11(c)), and at worst, the car would slowly drive along the track walls (Fig. 11(a)), or the controller would eventually produce controls that would prematurely end the experiment (Fig. 11(d)). The rollouts (Fig. 13) are likewise very conservative if not stationary. This poor performance is likely due to most samples in the estimate of (11) having very high cost (e.g., due to leaving the track) and contributing significantly to the gradient estimate. On the other hand, when estimating (17), as in the prior experiments, these high cost trajectories are assigned very low weights so that only low cost trajectories contribute to the gradient estimate.
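The contrast between the two estimators can be seen in a small numerical sketch. It makes a simplifying assumption: each sample's influence on the gradient estimate is measured only by the magnitude of its scalar weight (its cost for the expected-cost loss, and the exponentiated negative cost for the exponential utility), ignoring the score term that multiplies it. The function name `sample_weight_shares`, the `temperature` parameter, and the example costs are hypothetical, and costs are assumed nonnegative.

```python
import math


def sample_weight_shares(costs, temperature):
    """Return each sample's share of the total weight under
    (a) cost-proportional weighting and (b) exponential-utility weighting."""
    cost_total = sum(costs)
    expected_cost_shares = [c / cost_total for c in costs]
    # Shift by the minimum cost for numerical stability; shares are unchanged.
    c_min = min(costs)
    exp_weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    exp_total = sum(exp_weights)
    exp_utility_shares = [w / exp_total for w in exp_weights]
    return expected_cost_shares, exp_utility_shares


# Two low-cost rollouts and two rollouts that (say) left the track.
ec_shares, eu_shares = sample_weight_shares([1.0, 2.0, 1000.0, 1000.0], 1.0)
```

In this toy setting the two high-cost rollouts carry almost all of the weight under cost-proportional weighting and essentially none under the exponential weighting, which matches the qualitative explanation given above.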
5.2.3 Real-World Experiments
In the real-world setting (Fig. 8), we ran two sets of experiments: one where we varied the desired speed parameter of the cost function (more details in Section B.2), the step size, and the sampling distribution; and one where we varied the number of samples and the step size. We note that these experiments were performed under different environmental conditions, with the second set of experiments in particular done with poor track conditions that limited the max speed of the car. Therefore, the experimental results should not be compared across the two sets. For all experiments, we optimized only the mean of the control distribution and used the exponential utility (16) as the per-round loss with .
For the first set of experiments, we fixed the number of samples to 1920 and for each speed target ( and ) used the following experimental conditions: Gaussian distribution with step size , Gaussian distribution with step size , and Laplace distribution with step size . We use update (27) for the first two conditions and a natural gradient update for the third condition. We note, then, that the first condition corresponds to the MPPI algorithm. Lap times, max speeds, and average speeds are shown in Table 1. In both cases, we find that the second and third controllers perform comparably with MPPI, with the performance gap actually decreasing with increased target speed. These results are also in line with the lap times achieved in [31]. Car speeds and rollouts, shown in Figs. 17, 16, 15 and 14 in Section D.1, further demonstrate how similarly the three controllers perform.
For the second set of experiments, we fixed the speed target to and control distribution to Gaussian using update (27) and used the following experimental conditions: each of 1920 and 192 samples and each of step sizes , , and . (Footnote 14: Due to limited time for running the experiments, we did not test with 1920 samples and step size .) The results are shown in Table 2. Overall, there is a mild improvement in lap time with step size of less than , and performance degrades little with the number of samples. Car speeds and rollouts are shown in Figs. 21, 20, 19 and 18 in Section D.2.
Table 1: Lap time, average speed, and max speed at each of the two target speeds for the three controllers (Gaussian with each of two step sizes, and Laplace).
Table 2: Lap time, average speed, and max speed for each step size with 1920 samples and with 192 samples. One step size was not tested with 1920 samples.
6 Conclusion
We presented a connection between model predictive control and online learning. From this connection, we proposed an algorithm based on dynamic mirror descent that can work for a wide variety of settings and cost functions. We also discussed the choice of loss function within this online learning framework and the sort of preference each loss function imposes. From this general algorithm and assortment of loss functions, we showed that several well-known algorithms are special cases and presented a general update for members of the exponential family.
We empirically validated our algorithm on continuous and discrete simulated problems and on a real-world aggressive driving task. In the process, we also studied the parameter choices within the framework, finding, for example, that a smaller number of rollout samples can be compensated for by varying other parameters like the step size.
We hope that the online learning and stochastic optimization viewpoints of MPC presented in this paper open up new possibilities for using tools from these domains, such as alternative efficient sampling techniques [5] and accelerated optimization methods [22, 24], to derive new MPC algorithms that perform well in practice.
Acknowledgements
This material is based upon work supported by NSF NRI award 1637758, NSF CAREER award 1750483, an NSF Graduate Research Fellowship under award No. 2015207631, and a National Defense Science & Engineering Graduate Fellowship.
References
[1] Pieter Abbeel, Adam Coates, and Andrew Y Ng. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 29(13):1608–1639, 2010.
[2] Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. Theoretical foundation for CMA-ES from information geometry perspective. Algorithmica, 64(4):698–716, 2012.
[3] Arindam Banerjee, Srujana Merugu, Inderjit S Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749, 2005.
[4] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
[5] Richard Bellman and John Casti. Differential quadrature and long-term integration. Journal of Mathematical Analysis and Applications, 34(2):235–238, 1971.
[6] Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L’Ecuyer. The cross-entropy method for optimization. In Handbook of Statistics, volume 31, pages 35–59. Elsevier, 2013.
[7] Bart van den Broek, Wim Wiegerinck, and Hilbert Kappen. Risk sensitive path integral control. arXiv preprint arXiv:1203.3523, 2012.
[8] Eduardo F Camacho and Carlos Bordons Alba. Model predictive control. Springer Science & Business Media, 2013.
[9] Herman Chernoff et al. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
[10] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. arXiv preprint arXiv:1805.12114, 2018.
[11] Tyrone E Duncan. Linear-exponential-quadratic Gaussian control. IEEE Transactions on Automatic Control, 58(11):2910–2911, 2013.
[12] Peter W Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
[13] Brian Goldfain, Paul Drews, Changxi You, Matthew Barulic, Orlin Velev, Panagiotis Tsiotras, and James M Rehg. AutoRally: An open platform for aggressive autonomous driving. arXiv preprint arXiv:1806.00678, 2018.
[14] Eric C Hall and Rebecca M Willett. Dynamical models and tracking regret in online convex programming. arXiv preprint arXiv:1301.1254, 2013.
[15] Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.
[16] Elad Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
[17] Bjarne E. Helvik and Otto Wittner. Using the Cross-Entropy Method to Guide/Govern Mobile Agent’s Path Finding in Networks. International Workshop on Mobile Agents for Telecommunication Applications, 2164, 2002.
[18] Marin Kobilarov. Cross-Entropy Randomized Motion Planning. Robotics: Science and Systems, 2011.
[19] Shie Mannor, Reuven Rubinstein, and Yohai Gat. The Cross Entropy method for Fast Policy Search. Proceedings of the Twentieth International Conference on Machine Learning, 2003.
[20] David Q Mayne. Model predictive control: Recent developments and future promise. Automatica, 50(12):2967–2986, 2014.
[21] Ishai Menache, Shie Mannor, and Nahum Shimkin. Basis Function Adaptation in Temporal Difference Reinforcement Learning. Annals of Operations Research, 134:215–238, 2005.
[22] Megumi Miyashita, Shiro Yano, and Toshiyuki Kondo. Mirror descent search and its acceleration. Robotics and Autonomous Systems, 106:107–116, 2018.
[23] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
[24] Masashi Okada and Tadahiro Taniguchi. Acceleration of Gradient-based Path Integral Method for Efficient Optimal and Inverse Optimal Control. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3013–3020. IEEE, 2018.
[25] Magnus Rattray, David Saad, and Shun-ichi Amari. Natural gradient descent for online learning. Physical Review Letters, 81(24):5461, 1998.
[26] István Szita and András Lörincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006.
[27] Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-Limited Differential Dynamic Programming. International Conference on Robotics and Automation, 2004.
[28] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.
[29] Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433–1440. IEEE, 2016.
[30] Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information Theoretic MPC for Model-Based Reinforcement Learning. In International Conference on Robotics and Automation (ICRA), 2017.
[31] Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Information-Theoretic Model Predictive Control: Theory and Applications to Autonomous Driving. IEEE Transactions on Robotics, 34(6):1603–1622, 2018.
[32] Lijun Zhang, Shiyin Lu, and Zhi-Hua Zhou. Adaptive Online Learning in Dynamic Environments. In Advances in Neural Information Processing Systems, pages 1330–1340, 2018.
Appendix A Proofs
Proof of creftype 1.
We prove the first statement; the second one follows directly from the duality relationship. The statement follows from the derivations below; we can write
where the last equality is due to the assumption that . Then applying on both sides and using the relationship that , we have . ∎
Appendix B Experimental Setup
B.1 Cartpole
The state is , where is the cart position, is the pole’s angle, and are the corresponding velocities, and the control is the force applied to the cart. We define the instantaneous cost and terminal cost of the MPC problem as
where is some threshold. For our experiments, we set radians.
In our experiments, the pole is massless except for some weight at the end of the pole. The mass of the cart and pole weight are and , respectively. The true length of the pole is , whereas the length used in the model is . Each time step is modeled using an Euler discretization of seconds. Each episode of the problem lasts 500 time steps (i.e., 10 seconds) and has episode cost equal to the sum of encountered instantaneous costs. Both the true system and the model apply Gaussian additive noise to the commanded control with zero mean and a standard deviation of newtons. For the continuous system, the commanded control is clamped to newtons. For the discrete system, the controller can either command newtons to the left, newtons to the right, or newtons.
Both the discrete and continuous controllers use a planning horizon of 50 time steps (i.e., 1 second). For the continuous controller, we keep the standard deviation of the Gaussian distribution fixed at newtons for each time step in the planning horizon. When applying a control on the real cartpole, we choose the mode of rather than sample from the distribution.
All reported results were gathered using ten episodes per parameter setting.
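The simulation setup described in this subsection can be sketched as a single Euler step. This is an illustration under stated assumptions, not the code used for the experiments: the equations of motion are the standard ones for a cart carrying a massless pole with a point mass at its tip, the angle is measured from the downward vertical, the control is clamped before the noise is added, and every numerical value in the usage lines is hypothetical. The function name `cartpole_euler_step` and its parameter names are likewise chosen only for this sketch.

```python
import math
import random


def cartpole_euler_step(state, force, dt, cart_mass, tip_mass, pole_length,
                        force_limit, noise_std, rng, gravity=9.81):
    """One explicit Euler step for a cart with a massless pole and a point
    mass at the pole tip. The angle is measured from the downward vertical
    (an assumed convention)."""
    x, theta, x_dot, theta_dot = state
    # Clamp the commanded control, then add zero-mean Gaussian noise
    # (the ordering of these two operations is an assumption).
    f = max(-force_limit, min(force_limit, force)) + rng.gauss(0.0, noise_std)
    s, c = math.sin(theta), math.cos(theta)
    denom = cart_mass + tip_mass * s * s
    x_acc = (f + tip_mass * s * (pole_length * theta_dot ** 2 + gravity * c)) / denom
    theta_acc = (-f * c
                 - tip_mass * pole_length * theta_dot ** 2 * c * s
                 - (cart_mass + tip_mass) * gravity * s) / (pole_length * denom)
    # Explicit Euler: advance positions with current velocities and
    # velocities with current accelerations.
    return (x + dt * x_dot, theta + dt * theta_dot,
            x_dot + dt * x_acc, theta_dot + dt * theta_acc)


# Hypothetical values: the "model" differs from the "true" system only in
# the pole length, mirroring the biased-model setup described in the text.
rng = random.Random(0)
start = (0.0, 0.1, 0.0, 0.0)
true_next = cartpole_euler_step(start, 2.0, 0.02, 1.0, 0.1, 1.0, 10.0, 0.0, rng)
model_next = cartpole_euler_step(start, 2.0, 0.02, 1.0, 0.1, 0.5, 10.0, 0.0, rng)
```

Because the pole length enters only the angular acceleration in this formulation, the two calls agree on the next positions and cart velocity and differ in the next angular velocity, which is the kind of model bias an MPC controller has to tolerate.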
B.2 AutoRally
The state of the vehicle is