An Online Learning Approach to Model Predictive Control

02/24/2019 · Nolan Wagener, et al. · Georgia Institute of Technology

Model predictive control (MPC) is a powerful technique for solving dynamic control tasks. In this paper, we show that there exists a close connection between MPC and online learning, an abstract theoretical framework for analyzing online decision making in the optimization literature. This new perspective provides a foundation for leveraging powerful online learning algorithms to design MPC algorithms. Specifically, we propose a new algorithm based on dynamic mirror descent (DMD), an online learning algorithm that is designed for non-stationary setups. Our algorithm, Dynamic Mirror Descent Model Predictive Control (DMD-MPC), represents a general family of MPC algorithms that includes many existing techniques as special instances. DMD-MPC also provides a fresh perspective on previous heuristics used in MPC and suggests a principled way to design new MPC algorithms. In the experimental section of this paper, we demonstrate the flexibility of DMD-MPC, presenting a set of new MPC algorithms on a simple simulated cartpole and a simulated and real-world aggressive driving task.


1 Introduction

Equal contribution.

Model predictive control (MPC) [20] is an effective tool for control tasks involving dynamic environments, such as helicopter aerobatics [1] and aggressive driving [29]. One reason for its success is the pragmatic principle it adopts in choosing controls: rather than wasting computational power to optimize a complicated controller for the full-scale problem, which may be difficult to accurately model, MPC instead optimizes a simple controller (e.g., an open-loop control sequence) over a shorter planning horizon that is just sufficient to make a sensible decision at the current moment. By alternating between optimizing the simple controller and applying its corresponding control on the real system, MPC results in a closed-loop policy that can handle modeling errors and dynamic changes in the environment.

Various MPC algorithms have been proposed, using tools ranging from constrained optimization techniques [8, 20, 27] to sampling-based techniques [29]. In this paper, we show that, while these algorithms were originally designed based on different assumptions and heuristics, if we view them through the lens of online learning [16], many of them actually follow the same general update rule.

Online learning is an abstract theoretical framework for analyzing online decision making. Formally, it concerns iterative interactions between a learner and an environment over T rounds. At round t, the learner makes a decision θ_t from some decision set Θ. The environment then chooses a loss function ℓ_t based on the learner's decision, and the learner suffers a cost ℓ_t(θ_t). In addition to seeing the decision's cost, the learner may be given additional information about the loss function (e.g., its gradient at θ_t) to aid in choosing the next decision θ_{t+1}. The learner's goal is to minimize the accumulated costs Σ_{t=1}^{T} ℓ_t(θ_t), e.g., by minimizing regret [16].
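To make the protocol concrete, the interaction can be summarized by the loop below; the learner and environment interfaces here are hypothetical placeholders used only for illustration, not part of any specific algorithm.

```python
# A minimal sketch of the online learning protocol (hypothetical interfaces).
def run_online_learning(learner, environment, num_rounds):
    total_cost = 0.0
    for t in range(num_rounds):
        decision = learner.decide()                   # learner plays a decision
        loss_fn = environment.reveal_loss(decision)   # environment picks the per-round loss
        total_cost += loss_fn(decision)               # learner suffers the cost
        learner.update(loss_fn)                       # e.g., use the loss gradient at the decision
    return total_cost
```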

We find that the MPC process bears a strong similarity to online learning. At time t (i.e., round t), an MPC algorithm optimizes a controller with parameter θ_t (i.e., the decision) over some cost function ℓ_t (i.e., the per-round loss). To do so, it observes the cost of the initial controller (i.e., ℓ_t(θ_t)), improves the controller (to θ̃_t), and executes a control based on the improved controller in the environment to get to the next state (which in turn defines the next per-round loss).

In view of this connection, we propose a generic framework, DMD-MPC (Dynamic Mirror Descent Model Predictive Control), for synthesizing MPC algorithms. DMD-MPC is based on a first-order online learning algorithm called dynamic mirror descent (DMD) [14], a generalization of mirror descent [4] for dynamic comparators. We show that several existing MPC algorithms [6, 30] are special cases of DMD-MPC, given specific choices of step sizes, loss functions, and regularization. Furthermore, using DMD-MPC, we demonstrate how new MPC algorithms can be derived systematically, including for systems with discrete controls, with only mild assumptions on the regularity of the cost function. This allows us to even work with cost functions that are discontinuous, such as indicator functions. Thus, DMD-MPC offers a spectrum from which practitioners can easily customize new algorithms for their applications.

In the experiments, we apply DMD-MPC to design a range of MPC algorithms and study their empirical performance. Our results indicate that the extra design flexibility offered by DMD-MPC does make a difference in practice; by properly selecting additional hyperparameters, which are obscured in the previous approaches, we are able to improve the performance of existing algorithms. Finally, we apply DMD-MPC on a real-world AutoRally car platform [13] for autonomous driving tasks, matching state-of-the-art performance.

Notation: We summarize the convention used in the following sections. As our discussions will involve planning horizons, for clarity, we use lightface to denote variables that are meant for a single time step, and boldface to denote the variables congregated across the MPC planning horizon. For example, we use u_t to denote the planned control at time t and û_t = (u_t, u_{t+1}, …, u_{t+H−1}) to denote an H-step planned control sequence starting from time t, where H is the planning horizon. Often we will use a subscript to extract elements from a congregated variable; in the above example, we use û_{t,h} to denote the h-th element in û_t (note the subscript index starts from zero). By definition, this coincides with u_{t+h}. Thus, another way of writing the congregated variable is û_t = (û_{t,0}, …, û_{t,H−1}). All the variables in this paper are finite-dimensional.

2 An Online Learning Perspective on MPC

2.1 The MPC Problem Setup

Let T be finite. We consider the problem of controlling a discrete-time stochastic dynamical system

    x_{t+1} ∼ f(x_t, u_t)    (1)

for some stochastic transition map f. At time t, the system is in state x_t. Upon the execution of control u_t, the system randomly transitions to the next state x_{t+1}, and an instantaneous cost c(x_t, u_t) is incurred. Our goal is to design a state-feedback control law (i.e., a rule for choosing u_t based on x_t) such that the system exhibits good performance (e.g., accumulating low costs over the T time steps).

In this paper, we adopt the MPC approach to choosing u_t: at state x_t, we imagine controlling a stochastic dynamics model f̂ (which approximates our system f) for H time steps into the future. Our planned controls come from a control distribution π_θ that is parameterized by some vector θ ∈ Θ, where Θ is the feasible parameter set. In each simulation (i.e., rollout), we sample a control sequence û_t from the control distribution (this can be sampled in either an open-loop or closed-loop fashion when recursively computing the state trajectory) and recursively apply it to f̂ to generate a predicted state trajectory x̂_t.

More compactly, we write the above as

(2)

in terms of a trajectory distribution that is defined naturally according to the above recursion. Through these simulations, we desire to select a parameter θ that minimizes an MPC objective J(θ; x_t), which aims to predict the performance of the system if we were to apply the control distribution π_θ starting from x_t. (The objective J can be seen as a surrogate for the long-term performance of our controller. Typically, we set the planning horizon H to be much smaller than the amount of time the system is being controlled. This makes the controller easier to optimize and mitigates modeling errors.) In other words, we wish to find the θ that solves

    min_{θ ∈ Θ} J(θ; x_t)    (3)

Once θ is decided, we then sample û_t from π_θ, extract the first control, and apply it on the real dynamical system in (1) (i.e., set u_t = û_{t,0}) to go to the next state x_{t+1}. (This setup can also optimize deterministic policies. For instance, we can define π_θ to be a Gaussian policy with the mean being the deterministic policy. After optimizing θ, we return the mean instead of sampling from π_θ. This technique is known as "smoothing" [16], which is preferred when the MPC objective or the model transition is non-differentiable.) Because θ is determined based on x_t, MPC is effectively state-feedback.

The motivation behind MPC is to use the MPC objective J to reason about the controls required to achieve desirable long-term behaviors. Consider the statistic

    C(x̂_t, û_t) = Σ_{h=0}^{H−1} c(x̂_{t+h}, u_{t+h}) + c_end(x̂_{t+H})    (4)

where c_end is a terminal cost function. A popular MPC objective is J(θ; x_t) = E_{π_θ, f̂}[ C(x̂_t, û_t) ], which estimates the expected H-step future costs. Later in Section 3.1, we will discuss several MPC objectives and their properties.
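As an illustration, this kind of expected-cost objective can be approximated by Monte Carlo rollouts of the model. The sketch below is a minimal example under assumed interfaces; sample_controls, model_step, cost, and terminal_cost are hypothetical stand-ins for the control distribution, dynamics model, and cost functions.

```python
import numpy as np

def estimate_expected_cost(x0, sample_controls, model_step, cost, terminal_cost,
                           horizon, num_rollouts):
    """Monte Carlo estimate of the expected H-step cost of a control distribution.

    sample_controls() -> array of shape (horizon, control_dim) drawn from the control distribution
    model_step(x, u)  -> next state sampled from the dynamics model
    """
    totals = np.zeros(num_rollouts)
    for i in range(num_rollouts):
        x = x0
        controls = sample_controls()
        for h in range(horizon):
            totals[i] += cost(x, controls[h])   # accumulate instantaneous costs
            x = model_step(x, controls[h])      # roll the model forward
        totals[i] += terminal_cost(x)           # add the terminal cost
    return totals.mean()
```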

Although the idea of MPC sounds intuitively promising, the optimization can only be approximated in practice, because (3) is often a stochastic program (like the example above) and the control command needs to be computed at a high frequency. In consideration of this imperfection, it is common to heuristically bootstrap the previous approximate solution as the initialization to the current problem. Specifically, let θ̃_{t−1} denote the approximate solution to the previous problem and θ_t denote the initial condition used in solving (3) at the current round, and consider sampling û_{t−1} ∼ π_{θ̃_{t−1}} and û_t ∼ π_{θ_t}. We set

    θ_t = Φ_t(θ̃_{t−1})    (5)

by defining a shift operator Φ_t that outputs a new parameter in Θ. The shift operator can be chosen to satisfy desired properties, one example being that the marginal distributions of the planned controls at the overlapping time steps are the same under both π_{θ̃_{t−1}} and π_{θ_t}. A simple example of this property is shown in Fig. 1. Note that Φ_t also involves a new control (the final planned control) that is not in û_{t−1}, so the choice of Φ_t is not unique but algorithm dependent; for example, we can set the final planned control of π_{θ_t} to follow the same distribution as the final planned control of π_{θ̃_{t−1}} (cf. Section 3.2). Because the subproblems in (3) of two consecutive time steps share all control variables except for the first and the last ones, the "shifted" parameter θ_t should be almost as good for the current problem as the optimized parameter θ̃_{t−1} was for the previous problem. In other words, setting θ_t = Φ_t(θ̃_{t−1}) provides a warm start to (3) and amortizes the computational complexity of solving for the new solution θ̃_t.
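As a concrete illustration of the shift operator for the setting in Fig. 1 (a sequence of independent Gaussians), a minimal sketch is given below; the array shapes and default parameters are assumptions made for illustration.

```python
import numpy as np

def shift_gaussian_params(means, covs, default_mean, default_cov):
    """Shift operator for a sequence of independent Gaussians: drop the first
    time step, move the remaining parameters one step forward, and append
    default parameters for the newly added final time step.

    means: (H, d) array of per-step means; covs: (H, d, d) array of per-step covariances.
    """
    new_means = np.concatenate([means[1:], default_mean[None]], axis=0)
    new_covs = np.concatenate([covs[1:], default_cov[None]], axis=0)
    return new_means, new_covs
```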

Figure 1: A simple example of the shift operator. Here, the control distribution consists of a sequence of independent Gaussian distributions. The shift operator moves the parameters of the Gaussians one time step forward and replaces the parameters at the final time step with some default parameters.

2.2 The Online Learning Perspective

Figure 2: Diagram of the online learning perspective.

As discussed, the iterative update process of MPC resembles the setup of online learning [16]. Here we provide the details to convert an MPC setup into an online learning problem. Recall from the introduction that online learning mainly consists of three components: the decision set, the learner’s strategy for updating decisions, and the environment’s strategy for updating per-round losses. We show the counterparts in MPC that correspond to each component below. Note that in this section we will overload the notation to mean .

We use the concept of per-round loss in online learning as a mechanism to measure the decision uncertainty in MPC, and propose the following identification (shown in Fig. 2) for the MPC setup described in the previous section: we set the rounds in online learning to synchronize with the time steps in MPC, set the decision set as the space of feasible parameters of the control distribution , set the learner as the MPC algorithm which in round outputs the decision and side information , and set the per-round loss of the environment as

(6)

In other words, in round of this online learning setup, the learner plays a decision along with a side information (based on the optimized solution and the shift operator in (5)), the environment selects the per-round loss (by applying to the real dynamical system in (1) to transit the state to ), and finally the learner receives and incurs (which measures the sub-optimality of the future plan made by the MPC algorithm).

This online learning setup differs slightly from the standard setup in its separation of the decision and the side information ; while our setup can be converted into a standard one that treats as the sole decision played in round , we adopt this explicit separation in order to emphasize that the variable part of the incurred cost pertains to only . That is, the learner cannot go back and revert the previous control already applied on the system, but only uses to update the current and future controls .

The performance of the learner in online learning (which by our identification is the MPC algorithm) is measured in terms of the accumulated costs. For problems in non-stationary setups, a normalized way to describe the accumulated costs in the online learning literature is the concept of dynamic regret [14, 32], which is defined as

    Σ_{t=1}^{T} ℓ_t(θ_t) − Σ_{t=1}^{T} ℓ_t(θ_t^*)    (7)

where θ_t^* ∈ argmin_{θ ∈ Θ} ℓ_t(θ).
where . Dynamic regret quantifies how suboptimal the played decision is on the current loss function . In our proposed problem setup, the optimality concept associated with dynamic regret conveys a consistency criterion desirable for MPC: we would like to make a decision at state such that, after applying control and entering the new state , its shifted plan remains close to optimal with respect to the new loss function . If the dynamics model is accurate and the MPC algorithm is ideally solving (3), we can expect that bootstrapping the previous solution through (5) into would result in a small instantaneous gap which is solely due to unpredictable future information (such as the stochasticity in the underlying dynamical system). In other words, an online learning algorithm with small dynamic regret, if applied to our online learning setup, would produce a consistently optimal MPC algorithm with regard to the solution concept discussed above. However, we note that having small dynamic regret here does not directly imply good absolute performance on the control system, because the overall performance of the MPC algorithm is largely dependent on the choice of MPC objective . Small dynamic regret more precisely means whether the plan produced by an MPC algorithm is consistent with the given MPC objective.

3 A Family of MPC Algorithms Based on Dynamic Mirror Descent

The online learning perspective on MPC suggests that good MPC algorithms can be designed from online learning algorithms that achieve small dynamic regret. This is indeed the case. We will show that a range of existing MPC algorithms are in essence applications of a classical online learning algorithm called dynamic mirror descent (DMD) [14]. DMD is a generalization of mirror descent [4] to problems involving dynamic comparators (in this case, the θ_t^* in the dynamic regret in (7)). In round t, DMD applies the following update rule:

    θ̃_t = argmin_{θ ∈ Θ} ⟨γ_t g_t, θ⟩ + D_ψ(θ ‖ θ_t),    θ_{t+1} = Φ_{t+1}(θ̃_t)    (8)

where g_t = ∇ℓ_t(θ_t) is the update direction (which can also be replaced by unbiased sampling if the gradient is an expectation), Φ_{t+1} is called the predictive model (in [14], Φ_{t+1} is called a dynamical model, but it is not the same as the dynamics of our control system; we rename it to avoid confusion), γ_t > 0 is the step size, and, for θ, θ′ ∈ Θ, D_ψ(θ ‖ θ′) is the Bregman divergence generated by a strictly convex function ψ on Θ.

The first step of DMD in (8) is reminiscent of the proximal update in the usual mirror descent algorithm. It can be thought of as an optimization step where the Bregman divergence acts as a regularization to keep θ̃_t close to θ_t. Although a Bregman divergence is not necessarily a metric (since it may not be symmetric), it is still useful to view it as a distance between θ̃_t and θ_t. Indeed, familiar examples of the Bregman divergence include the squared Euclidean distance and the KL divergence [3].

The second step of DMD in (8) uses the predictive model Φ_{t+1} to anticipate the optimal decision for the next round. In the context of MPC, a natural choice for the predictive model is the shift operator in (5) defined previously in Section 2.1 (hence the same notation), because the per-round losses in two consecutive rounds here concern problems with shifted time indices. Hall and Willett [14] show that the dynamic regret of DMD scales with how much the optimal decision sequence deviates from the predictions of Φ_{t+1} (i.e., Σ_t ‖θ_{t+1}^* − Φ_{t+1}(θ_t^*)‖), which is proportional to the unpredictable elements of the problem, as we discussed earlier in Section 2.2.

for t = 1, 2, … do
      Compute the update direction g_t (e.g., a sampled estimate of ∇ℓ_t(θ_t)), take the first step of (8) to obtain θ̃_t, sample û_t ∼ π_{θ̃_t}, and execute the first control on the system.
      Set θ_{t+1} = Φ_{t+1}(θ̃_t).
      end for
Algorithm 1 Dynamic Mirror Descent MPC (DMD-MPC)

Applying DMD in (8) to the online learning problem described in Section 2.2 leads to an MPC algorithm shown in Algorithm 1, which we call DMD-MPC. More precisely, DMD-MPC represents a family of MPC algorithms in which a specific instance is defined by a choice of

  1. the MPC objective in (6),

  2. the form of the control distribution , and

  3. the Bregman divergence in (8).

Thus, we can use DMD-MPC as a generic strategy for synthesizing MPC algorithms. In the following, we use this recipe to recreate several existing MPC algorithms and demonstrate new MPC algorithms that naturally arise from this framework.
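To summarize the recipe in code, the skeleton below sketches one possible implementation of Algorithm 1; the gradient estimator, proximal step, shift operator, and environment interfaces are placeholders to be supplied by a specific instantiation (loss function, control distribution, and Bregman divergence), not a fixed implementation.

```python
def dmd_mpc(env, theta, estimate_gradient, proximal_step, shift, first_control, num_steps):
    """Skeleton of DMD-MPC: each round takes one mirror-descent step on the
    current per-round loss, executes the first control, and shifts the plan."""
    x = env.reset()
    for t in range(num_steps):
        g = estimate_gradient(theta, x)   # sampled gradient of the per-round loss at theta
        theta = proximal_step(theta, g)   # first step of the dynamic mirror descent update
        u = first_control(theta)          # extract (or sample) the first planned control
        x = env.step(u)                   # apply it on the real system
        theta = shift(theta)              # predictive model / shift operator: warm start
    return theta
```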

3.1 Loss Functions

We discuss several definitions of the per-round loss ℓ_t, which all result from the formulation in (6) but with different choices of MPC objective. These loss functions are based on the statistic C defined in (4), which measures the H-step accumulated cost of a given trajectory. For transparency of exposition, we will suppose henceforth that the control distribution is open-loop (note again that even while using open-loop control distributions, the overall control law of MPC is state-feedback); similar derivations follow naturally for closed-loop control distributions. For convenience of practitioners, we also provide expressions of their gradients in terms of the likelihood-ratio derivative [12] (we assume the control distribution is sufficiently regular with respect to its parameter so that the likelihood-ratio derivative rule holds). All these gradients have the form, for some function w_t,

    ∇_θ ℓ_t(θ) = E_{π_θ, f̂}[ w_t(x̂_t, û_t) ∇_θ log π_θ(û_t) ]    (9)

For compactness, we will write in place of . These gradients in practice are approximated by finite samples.
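As an illustration of such a sampled approximation, the sketch below estimates a gradient of the form in (9) with a plain Monte Carlo average; the sampling and scoring functions are hypothetical placeholders supplied by the model, control distribution, and chosen loss.

```python
def likelihood_ratio_gradient(sample_rollout, grad_log_prob, weight_fn, num_samples):
    """Estimate a gradient of the form E[ w(trajectory) * grad_theta log pi_theta(controls) ].

    sample_rollout()             -> (controls, states) sampled from the control distribution and model
    grad_log_prob(controls)      -> gradient of log pi_theta(controls) with respect to theta
    weight_fn(states, controls)  -> the scalar weight (e.g., a cost- or utility-based term)
    """
    grad = None
    for _ in range(num_samples):
        controls, states = sample_rollout()
        contrib = weight_fn(states, controls) * grad_log_prob(controls)
        grad = contrib if grad is None else grad + contrib
    return grad / num_samples
```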

3.1.1 Expected Cost

The most commonly used MPC objective is the H-step expected accumulated cost function under the model dynamics, because it directly estimates the expected long-term behavior when the dynamics model is accurate and H is large enough. Its per-round loss function and gradient are

    ℓ_t(θ) = E_{π_θ, f̂}[ C(x̂_t, û_t) ]    (10)
    ∇_θ ℓ_t(θ) = E_{π_θ, f̂}[ C(x̂_t, û_t) ∇_θ log π_θ(û_t) ]    (11)

(In our experiments, we subtract the empirical average of the sampled costs from C in (11) to reduce the variance in estimating the gradient, at the cost of a small amount of bias.)

3.1.2 Expected Utility

Instead of optimizing for the average cost, we may care to optimize for some preference related to the trajectory cost C, such as having the cost be below some threshold. This idea can be formulated as a utility U_t that returns a normalized score related to the preference for a given trajectory cost. Specifically, suppose that C is lower bounded by zero (if this is not the case, let C_min denote its infimum, which we assume is finite; we can then replace C with C − C_min) and at some round t define the utility U_t to be a function with the following properties:

  • U_t(0) = 1,

  • U_t is monotonically decreasing, and

  • U_t(c) → 0 as c → ∞.

These are sensible properties since we attain maximum utility when we have zero cost, the utility never increases with the cost, and the utility approaches zero as the cost increases without bound. We then define the expected utility under control distribution π_θ and dynamics model f̂ as E_{π_θ, f̂}[ U_t(C(x̂_t, û_t)) ] and define the per-round loss as the negative logarithm of this expected utility:

    ℓ_t(θ) = −log E_{π_θ, f̂}[ U_t(C(x̂_t, û_t)) ]    (12)
    ∇_θ ℓ_t(θ) = − E_{π_θ, f̂}[ U_t(C(x̂_t, û_t)) ∇_θ log π_θ(û_t) ] / E_{π_θ, f̂}[ U_t(C(x̂_t, û_t)) ]    (13)

The gradient in (13) is particularly appealing when estimated with samples. Suppose we sample N control sequences from π_θ and (for the sake of compactness) sample one state trajectory from f̂ for each corresponding control sequence, resulting in sampled trajectory costs C_1, …, C_N. Then the estimate of (13) is a convex combination of the sampled gradients ∇_θ log π_θ(û_t^i), with weights w_i = U_t(C_i) / Σ_{j=1}^{N} U_t(C_j).

We see that each weight w_i is computed by considering the relative utility of its corresponding trajectory. A cost with high relative utility will push its corresponding weight closer to one, whereas a low relative utility will cause the weight to be close to zero, effectively rejecting the corresponding sample.
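A sketch of this sampled estimate is shown below: each sampled trajectory's utility is normalized into a weight, and the estimate is the weighted combination of the per-sample score-function terms (the leading minus sign follows from the negative logarithm in (12)). The inputs are assumed to be precomputed from the sampled rollouts.

```python
import numpy as np

def utility_weighted_gradient(costs, grad_log_probs, utility):
    """costs: (N,) sampled trajectory costs; grad_log_probs: (N, dim_theta) array of
    per-sample gradients of log pi_theta; utility: scalar function of a cost."""
    utilities = np.array([utility(c) for c in costs])
    weights = utilities / utilities.sum()                   # convex combination weights
    return -(weights[:, None] * grad_log_probs).sum(axis=0)
```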

We give two examples of utilities and their corresponding losses.

Figure 3: Visualization of different utilities: (a) threshold utility; (b) exponential utility.
Probability of Low Cost

For example, we may care about the system being below some cost threshold as often as possible. To encode this preference, we can use the threshold utility U_t(C) = 1{C ≤ C_max}, where 1{·} is the indicator function and C_max is a threshold parameter. Under this choice, the loss and its gradient become

    ℓ_t(θ) = −log E_{π_θ, f̂}[ 1{C(x̂_t, û_t) ≤ C_max} ]    (14)
    ∇_θ ℓ_t(θ) = − E_{π_θ, f̂}[ 1{C(x̂_t, û_t) ≤ C_max} ∇_θ log π_θ(û_t) ] / E_{π_θ, f̂}[ 1{C(x̂_t, û_t) ≤ C_max} ]    (15)

As we can see, this loss function corresponds to the probability of achieving a cost below the threshold. As a result (Fig. 3(a)), costs below C_max are treated the same in terms of the utility. This can potentially make optimization easier, since we are trying to make good trajectories as likely as possible instead of finding the best trajectories as in (10).

However, if the threshold is set too low and the gradient is estimated with samples, the gradient estimate may have high variance due to the large number of rejected samples. Because of this, in practice, the threshold is set adaptively, e.g., as the largest cost of the top elite fraction of the sampled trajectories with smallest costs [6]. This allows the controller to make the best sampled trajectories more likely and therefore improve the controller.
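A common way to implement this adaptive threshold, sketched below under the stated elite-fraction heuristic, is to sort the sampled costs and take the largest cost within the elite fraction; the indicator weights then retain roughly the elite fraction of samples while rejecting the rest.

```python
import numpy as np

def elite_threshold(costs, elite_frac=0.1):
    """Largest cost among the elite fraction of lowest-cost samples."""
    num_elite = max(1, int(np.ceil(elite_frac * len(costs))))
    return np.sort(costs)[num_elite - 1]
```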

Exponential Utility

We can also opt for a continuous surrogate of the indicator function, in this case the exponential utility U_t(C) = exp(−C/λ), where λ > 0 is a scaling parameter. Unlike the indicator function, the exponential utility provides nonzero feedback for any given cost and allows us to discriminate between costs (i.e., if C_1 < C_2, then U_t(C_1) > U_t(C_2)), as shown in Fig. 3(b). Furthermore, λ acts as a continuous alternative to C_max and dictates how quickly or slowly the utility decays to zero, which in a soft way determines the cutoff point for rejecting given costs.

Under this choice, the loss and its gradient become

    ℓ_t(θ) = −log E_{π_θ, f̂}[ exp(−C(x̂_t, û_t)/λ) ]    (16)
    ∇_θ ℓ_t(θ) = − E_{π_θ, f̂}[ exp(−C(x̂_t, û_t)/λ) ∇_θ log π_θ(û_t) ] / E_{π_θ, f̂}[ exp(−C(x̂_t, û_t)/λ) ]    (17)

The loss function in (16) is also known as the risk-seeking objective in optimal control [7]; this classical interpretation is based on a Taylor expansion of (16) showing

    ℓ_t(θ) ≈ (1/λ) ( E_{π_θ, f̂}[C] − Var_{π_θ, f̂}[C] / (2λ) )

when λ is large, where Var_{π_θ, f̂}[C] is the variance of C when û_t is sampled from π_θ and x̂_t is sampled from f̂. Here we derive (16) from a different perspective that treats it as a continuous approximation of (14) for maximizing the probability of low-cost trajectories. The use of an exponential transformation to approximate indicator functions is a common technique (like the Chernoff bound [9]) in the machine learning literature.
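In a sampled implementation, the resulting weights form a softmax of negative costs; subtracting the minimum cost before exponentiating (a standard log-sum-exp trick) leaves the normalized weights unchanged while avoiding numerical underflow. A minimal sketch, assuming the costs are precomputed:

```python
import numpy as np

def exponential_utility_weights(costs, lam):
    """Normalized weights proportional to exp(-cost / lam), computed stably."""
    shifted = (np.asarray(costs) - np.min(costs)) / lam
    w = np.exp(-shifted)
    return w / w.sum()
```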

3.2 Algorithms

We instantiate DMD-MPC with different choices of loss function, control distribution, and Bregman divergence as concrete examples to showcase the flexibility of our framework. In particular, we are able to recover well-known MPC algorithms as special cases of Algorithm 1.

Our discussions below are organized based on the class of Bregman divergences used in (8), and the following algorithms are derived assuming that the control distribution is a sequence of independent distributions. That is, we suppose π_θ is a probability density/mass function that factorizes as

    π_θ(û_t) = Π_{h=0}^{H−1} π_{θ_h}(u_{t+h})    (18)

for some basic control distribution π_{θ_h} parameterized by θ_h, where θ = (θ_0, …, θ_{H−1}) and each θ_h lies in the feasible set of the basic control distribution. For control distributions in the form of (18), the shift operator in (5) sets the new parameter by identifying its θ_h with θ_{h+1} of the previous solution, for h = 0, …, H−2, and initializes the final parameter θ_{H−1} either as the previous solution's final parameter or as some default parameter.

3.2.1 Quadratic Divergence

We start with perhaps the most common Bregman divergence: the quadratic divergence. That is, we suppose the Bregman divergence in (8) has a quadratic form, D_ψ(θ ‖ θ_t) = ½ (θ − θ_t)ᵀ A_t (θ − θ_t), for some positive-definite matrix A_t. Below we discuss different choices of A_t and their corresponding update rules.

Projected Gradient Descent

This basic update rule is a special case in which A_t is the identity matrix. Equivalently, the first step of the update can be written as a projected gradient step, θ̃_t = Proj_Θ(θ_t − γ_t g_t).

Natural Gradient Descent

We can recover the natural gradient descent algorithm [2] by defining A_t in round t to be the Fisher information matrix of π_{θ_t}. This update rule takes advantage of the natural Riemannian metric of distributions to normalize the effects of different parameterizations of the same distribution [25].

Quadratic Problems

While the above two update rules are quite general, we can further specialize the Bregman divergence to achieve faster learning when the per-round loss function is quadratic. This happens, for instance, when the MPC problem in (3) is an LQR or LEQR problem [11] (i.e., the dynamics model is linear, the step cost is quadratic, the per-round loss is (10), and the basic control distribution is a Dirac-delta distribution). That is, if the per-round loss has the form ℓ_t(θ) = ½ θᵀ R_t θ + r_tᵀ θ for some constant vector r_t and positive-definite matrix R_t, we can set A_t = R_t and γ_t = 1, making the θ̃_t given by the first step of (8) correspond to the optimal solution of ℓ_t (i.e., the solution of LQR/LEQR). The particular values of R_t and r_t for LQR and LEQR are derived in Appendix C.

3.2.2 KL Divergence and the Exponential Family

We show that for control distributions in the exponential family [23], the Bregman divergence in (8) can be set to the KL divergence, which is a natural way to measure distances between distributions. Toward this end, we review the basics of the exponential family. We say a distribution with natural parameter η of random variable u belongs to the exponential family if its probability density/mass function satisfies

    p_η(u) = h(u) exp( ⟨φ(u), η⟩ − A(η) )    (19)

where φ is the sufficient statistics, h is the carrier measure, and A is the log-partition function. The distribution can also be described by its expectation parameter μ = E_{p_η}[φ(u)], and there is a duality relationship between the two parameterizations:

    μ = ∇A(η)  and  η = ∇A*(μ),

where A* is the Legendre transformation of A; in other words, ∇A* = (∇A)^{−1}. This duality relationship results in the property below.

Fact 1.

[23]   For natural parameters η, η′ with corresponding expectation parameters μ, μ′, KL(p_{η′} ‖ p_η) = D_A(η ‖ η′) = D_{A*}(μ′ ‖ μ).

We can use Fact 1 to define the Bregman divergence in (8) in optimizing a control distribution that belongs to the exponential family:

  • if θ is an expectation parameter, we can set ψ = A*, or

  • if θ is a natural parameter, we can set ψ = A.

We demonstrate some examples using this idea below.

Expectation Parameters and Categorical Distributions

We first discuss the case where θ is an expectation parameter and the first step in (8) is

    θ̃_t = argmin_{θ ∈ Θ} ⟨γ_t g_t, θ⟩ + D_{A*}(θ ‖ θ_t)    (20)

To illustrate, we consider an MPC problem with a discrete control space and use the categorical distribution as the basic control distribution in (18), i.e., we set θ_h to be the vector of probabilities of choosing each control in the finite control set at the h-th predicted time step, so that θ_h lies in the probability simplex. This parameterization choice makes θ an expectation parameter of π_θ that corresponds to sufficient statistics given by indicator functions. Using the structure of (9), we find the h-th element of the update direction is

    g_{t,h} = E_{π_{θ_t}, f̂}[ w_t(x̂_t, û_t) ( e_{u_{t+h}} ⊘ θ_{t,h} ) ]

where e_u is the vector that has zero for each element except at index u, where it is one, and ⊘ denotes elementwise division. Update (20) then becomes the exponentiated gradient algorithm [16]:

    θ̃_{t,h} = (1/Z_h) θ_{t,h} ⊙ exp(−γ_t g_{t,h}),    h = 0, …, H−1,    (21)

where Z_h is the normalizer for time step h and ⊙ denotes elementwise multiplication. That is, instead of applying an additive gradient step to the parameters as in usual gradient descent, the update in (20) exponentiates the gradient and performs elementwise multiplication. This does a better job of accounting for the geometry of the problem, and makes projection a simple operation of normalizing a distribution.
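A minimal sketch of this exponentiated-gradient step for a sequence of categorical distributions is given below; the array layout (one probability vector per planned time step) is an assumption made for illustration.

```python
import numpy as np

def exponentiated_gradient_step(probs, grad, step_size):
    """Mirror-descent step with a KL-divergence regularizer on the probability simplex:
    multiply elementwise by the exponentiated negative gradient and renormalize.

    probs: (H, K) categorical probabilities for each planned time step.
    grad:  (H, K) gradient of the per-round loss with respect to probs.
    """
    new_probs = probs * np.exp(-step_size * grad)
    return new_probs / new_probs.sum(axis=1, keepdims=True)
```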

Natural Parameters and Gaussian Distributions

Alternatively, we can set θ as a natural parameter and use

    θ̃_t = argmin_{θ ∈ Θ} ⟨γ_t g_t, θ⟩ + D_A(θ ‖ θ_t)    (22)

as the first step in (8). In particular, we show that, with (22), the structure of the likelihood-ratio derivative in (9) can be leveraged to design an efficient update scheme.

The main idea follows from the observation that when the gradient is computed through (9) and θ is the natural parameter, we can write

    g_t = E_{π_{θ_t}, f̂}[ w_t(x̂_t, û_t) ( φ(û_t) − μ_t ) ]    (23)

where we denote μ_t as the expectation parameter corresponding to θ_t and φ as the sufficient statistics of the control distribution. We combine the factorization in (23) with a property of the proximal update below (proved in Appendix A) to derive our algorithm.

Proposition 1.

Let g_t be an update direction of the form g_t = β(μ_t − μ̂_t) for some β > 0. Let M be the image of Θ under ∇A. If μ̂_t ∈ M and γ_t β ∈ [0, 1], then the first step of (8) with ψ = A satisfies ∇A(θ̃_t) = (1 − γ_t β) μ_t + γ_t β μ̂_t. (A similar proposition can be found for (20).)

We find that, under the assumption in Proposition 1 (if the resulting parameter is not in Θ, the update in (22) needs to perform a projection, the form of which is algorithm dependent), the update rule in (22) becomes

(24)

Equivalently, we can write (24) as

(25)

In other words, under the conditions of Proposition 1, the update to the expectation parameter in (8) is simply a convex combination of the sufficient statistics and the previous expectation parameter μ_t.

We provide a concrete example of an MPC algorithm that follows from (24). Let us consider a continuous control space and use the Gaussian distribution as the basic control distribution in (18), i.e., we set each basic control distribution to be Gaussian with some mean vector and covariance matrix. For the Gaussian, we can choose sufficient statistics φ(u) = (u, u uᵀ), which results in an expectation parameter consisting of the mean and second moment, with a corresponding natural parameter. Let us set θ as the natural parameter. Then (22) is equivalent to the following update rule for the mean and second moment:

(26)
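As a sketch of how this update looks in a sampled implementation when only the mean is adapted (covariance held fixed), the new mean is a blend of the previous mean and the weight-averaged sampled control sequences; with exponential-utility weights and a unit step size this recovers an MPPI-style update, and with threshold-utility weights it resembles CEM. The function below is illustrative, with assumed array shapes.

```python
import numpy as np

def gaussian_mean_update(mean, sampled_controls, weights, step_size):
    """Blend the previous mean control sequence with the weighted average of the samples.

    mean:             (H, control_dim) current mean control sequence
    sampled_controls: (N, H, control_dim) control sequences sampled from the control distribution
    weights:          (N,) normalized weights (e.g., threshold- or exponential-utility weights)
    """
    weighted_avg = np.einsum('n,nhc->hc', weights, sampled_controls)
    return (1.0 - step_size) * mean + step_size * weighted_avg
```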

Several existing state-of-the-art algorithms are special cases of the update rule in (26).

  • Cross-entropy method (CEM) [6]:
    In particular, if the per-round loss is set to (14) and the step size is one, the update rule in (26) becomes

    which resembles the update rule of the cross-entropy method for Gaussian distributions [6]. The only difference is that the second-order moment matrix is updated instead of the covariance matrix .

  • Model-predictive path integral (MPPI) [30]:

    If we choose the utility to be the exponentiated cost, as in (16), and do not update the covariance, the update rule becomes

    m̃_t = (1 − γ_t) m_t + γ_t E_{π_{θ_t}, f̂}[ exp(−C/λ) û_t ] / E_{π_{θ_t}, f̂}[ exp(−C/λ) ]    (27)

    where m_t denotes the mean of the Gaussian control distribution; this reduces to the MPPI update rule [30] for γ_t = 1. This connection is also noted in [24].

    Originally, Williams et al. [30] derived the MPPI update by estimating the mean of an intractable optimal control distribution, which minimizes the KL divergence between the Gaussian control distribution and the optimal control distribution. By contrast, our update rule in (27) results from optimizing an exponential utility. We argue that our derivation leading to (27) is more natural and direct: it starts from a continuous surrogate of the high-probability objective in (16) and follows from the application of a standard online learning algorithm, DMD. From this perspective, several design choices in the original MPPI algorithm [30] (e.g., the unit step size) are heuristics and not necessarily optimal. See Section 5 for more details.

3.3 Extensions

In the previous sections, we discussed multiple instantiations of DMD-MPC, showing the flexibility of our framework. But they are by no means exhaustive. For example, the control distributions can be fairly general, in addition to the categorical and Gaussian distributions that we discussed, and constraints on the problem (e.g., control limits) can be directly incorporated through proper choices of control distributions, such as the beta distribution. Moreover, different integration techniques, such as Gaussian quadrature [5], can be adopted to replace the likelihood-ratio derivative in (9) for computing the required gradient direction. We also note that the independence assumption on the control distribution in (18) is not necessary; time-correlated control distributions and feedback policies are straightforward to consider in DMD-MPC.

4 Related Work

Recent work on MPC has studied sampling-based approaches, which are flexible in that they do not require differentiability of the cost function. One such algorithm which can be used with general cost functions and dynamics is MPPI, which was proposed by Williams et al. [30] as a generalization of the control-affine case [29]. The algorithm is derived by considering an optimal control distribution defined by the control problem. This optimal distribution is intractable to sample from, so the algorithm instead tries to bring a tractable distribution (in this case, a Gaussian with fixed covariance) as close as possible in the sense of KL divergence. This ends up being the same as finding the mean of the optimal control distribution. The mean is then approximated as a weighted sum of sampled control trajectories, where the weights are determined by the exponentiated costs. Although this algorithm works well in practice, it is not clear that matching the mean of the distribution should guarantee good performance, such as in the case of a multimodal optimal distribution. A closely related approach is the cross-entropy method (CEM) [6], which also assumes a Gaussian sampling distribution but minimizes the KL divergence between the Gaussian distribution and a uniform distribution over low-cost samples. CEM has found applicability in reinforcement learning [19, 21, 26], path and motion planning [17, 18], and MPC [10, 31].

These sampling-based control algorithms can be considered special cases of general derivative-free optimization algorithms, such as covariance matrix adaptation evolutionary strategies (CMA-ES) [15] and natural evolutionary strategies (NES) [28]. CMA-ES samples points from a multivariate Gaussian, evaluates their fitness, and adapts the mean and covariance of the sampling distribution accordingly. On the other hand, NES optimizes the parameters of the sampling distribution to maximize the expected fitness through steepest ascent, where the direction is provided by the natural gradient. Akimoto et al. [2] showed that CMA-ES can also be interpreted as taking a natural gradient step on the parameters of the sampling distribution. As we showed in Section 3.2, natural gradient descent is a special case of the mirror descent framework. A similar observation was made by Okada and Taniguchi [24], in which they derived MPPI through the framework of mirror descent. However, their derivation only considers KL divergence as the regularizer and restricts the sampling distribution to be a Gaussian. In contrast, we do not tie ourselves to a specific Bregman divergence or sampling distribution, but instead consider a family of algorithms by varying these choices.

5 Experiments

We use experiments to validate the flexibility of DMD-MPC. We show that this framework can handle both continuous (Gaussian control distribution) and discrete (categorical control distribution) variants of control problems, and that MPC algorithms like MPPI and CEM can be generalized using different step sizes and control distributions, resulting in improved performance. Additional experimental details are included in Appendix B.

5.1 Cartpole

We first consider the classic cartpole problem, where we seek to swing a pole upright and keep it balanced using only actuation on the attached cart. We consider both the continuous and discrete control variants; for the discrete case, we can either push the cart left or right with some unit force or apply no force. For the continuous case, we choose the Gaussian distribution as the control distribution and keep the covariance fixed. For the discrete case, we choose the categorical distribution and use update (21). In either case, we have access to a biased stochastic model (it has a different pole length than the real cartpole).

We consider the interaction between the choice of loss, step size, and number of samples used to estimate (9), shown in Figs. 4 and 5. (For our experiments, we vary the number of samples drawn from the control distribution and fix the number of samples drawn from the dynamics model to ten. Furthermore, we use common random numbers when sampling from the model to reduce estimation variance.) For this environment, we can achieve low cost using the expected cost with a proper step size for both the continuous and discrete problems, while being fairly robust to the number of samples. When using either of the utilities, the number of samples is more crucial in the continuous domain, with more samples allowing for larger step sizes. In the discrete domain (Fig. 4(b)), performance is largely unaffected by the number of samples when the step size is sufficiently small, excluding the threshold utility with 1000 samples.

In Fig. 5(a), for a large range of utility parameters, we see that using step sizes above one (the step size implicitly used by MPPI and CEM) gives significant performance gains. In Fig. 5(b), there is a more complicated interaction between the utility parameter and step size, with large changes in cost when altering the utility parameter while keeping the step size fixed.

Figure 4: Step size and number of samples for (a) continuous controls and (b) discrete controls (same legends for (a) and (b); fixed elite fraction and utility scale; EC = expected cost (10), PLC = probability of low cost (14), EU = exponential utility (16)).
Figure 5: Loss parameter and step size with 1000 samples, for (a) continuous controls and (b) discrete controls.

5.2 AutoRally Racing

5.2.1 Platform Description

We use the autonomous AutoRally platform [13] to run a high-speed dirt-track driving task. The robot (Fig. 6) is a 1:5-scale RC chassis capable of high-speed driving and has a desktop-class Intel Core i7 CPU and Nvidia GTX 1050 Ti GPU. The platform also has an IMU and GPS used to measure the acceleration and infer the position of the vehicle. In both simulated and real-world experiments, the dynamics model is a neural network which has been fitted to data collected from human demonstrations. We note that the dynamics model is deterministic, so we don't need to estimate any expectations with respect to the dynamics.

Figure 6: Rally car driving during experiment.
Figure 7: Simulated AutoRally task
Figure 8: Real-world AutoRally task

5.2.2 Simulated Experiments

We first use the Gazebo simulator (Fig. 7) to perform a sweep of algorithm parameters, particularly the step size and number of samples, to evaluate how changing these parameters can affect the performance of DMD-MPC. For all of the experiments, the control distribution is a Gaussian with fixed covariance, and we use update (27) (i.e., the loss is the exponential utility (16)) with a fixed scaling parameter. Lap times, average speeds, and maximum speeds are shown in Fig. 9. (The large error bar for 64 samples and a step size of 0.8 is due to one particular lap where the car stalled at a turn for about 60 seconds.) We see that although more samples do result in faster lap times, there are diminishing returns past 1920 samples per gradient. Indeed, with a proper step size, even as few as 192 samples can yield lap times within a couple seconds of 3840 samples and a step size of 1. We also observe that the curves converge as the step size decreases further, implying that only a certain number of samples are needed for a given step size. This is a particularly important advantage of DMD-MPC over methods like MPPI: by changing the step size, DMD-MPC can perform much more effectively with fewer samples, making it a good choice for embedded systems which can't produce many samples due to computational constraints.

Figure 9: Simulated AutoRally performance with different step sizes and number of samples. Though many samples coupled with large steps can yield fast lap times and speeds, the performance gains are small after 1920 samples. With fewer samples, a lower step size helps recover some lost performance.

We also qualitatively evaluate two particular extremes: few vs. many samples (64 vs. 3840) and small vs. large step size (0.5 vs. 1). Fig. 10 shows the speed of the car during the episode, and Fig. 11 shows the predicted trajectory of DMD-MPC's mean control sequence at certain times. At small step sizes (Figs. 10(c) and 10(a)), the trajectories and speed profiles are rather similar. The corresponding rollout trajectories (Figs. 11(c) and 11(a)) are moderately long and imply reasonably low MPC cost. On the other hand, with few samples and a large step size (Fig. 10(b)), the car drives much more slowly and erratically, sometimes even stopping. The corresponding rollouts (Fig. 11(b)) have more variability, a consequence of the larger variance of the gradient. In the ideal scenario with many samples and a large step size, the car can achieve consistently high speed while driving smoothly (Fig. 10(d)). Indeed, the rollouts (Fig. 11(d)) are much longer while mostly staying near the center of the track.

(a) 64 samples, step size 0.5
(b) 64 samples, step size 1
(c) 3840 samples, step size 0.5
(d) 3840 samples, step size 1
Figure 10: Car speeds when optimizing (16). The speeds and trajectories are very similar at step size 0.5, irrespective of the number of samples. At step size 1, though, 64 samples result in capricious maneuvers and low speeds, whereas 3840 samples result in smooth driving at high speeds.
(a) 64 samples, step size 0.5
(b) 64 samples, step size 1
(c) 3840 samples, step size 0.5
(d) 3840 samples, step size 1
Figure 11: Rollouts of the mean control sequence when optimizing (16). At step size 0.5, the rollouts are moderately long and don't depend much on the number of samples. At step size 1, the rollouts are consistently longer when more samples are used, whereas the length of the predicted trajectories has more variability when fewer samples are used.
Figure 12: Car speeds when optimizing (10). All tested step sizes result in low speeds. At too low or too high a step size, the car will drive along the wall or crash into it.
Figure 13: Rollouts of mean control sequence when optimizing (10). All rollouts are conservative with some of them remaining at the same initial location.

We also experimented with instead optimizing the expected cost (10) and found performance was dramatically worse (Fig. 12), even when using 3840 samples per gradient. At best, the car would drive in the center of the track at low speeds (Fig. 12(c)), and at worst, the car would slowly drive along the track walls (Fig. 12(a)), or the controller would eventually produce controls that would prematurely end the experiment (Fig. 12(d)). The rollouts (Fig. 13) are likewise very conservative, if not stationary. This poor performance is likely due to most samples in the estimate of (11) having very high cost (e.g., due to leaving the track) and contributing significantly to the gradient estimate. On the other hand, when estimating (17), as in the prior experiments, these high-cost trajectories are assigned very low weights, so that only low-cost trajectories contribute to the gradient estimate.

5.2.3 Real-World Experiments

In the real-world setting (Fig. 8), we ran two sets of experiments: one where we varied the desired speed parameter of the cost function (more details in Section B.2), the step size, and the sampling distribution; and one where we varied the number of samples and the step size. We note that these experiments were performed under different environment conditions, with the second set of experiments in particular done with poor track conditions that limited the max speed of the car. Therefore, the experimental results should not be compared across the two sets. For all experiments, we optimized only the mean of the control distribution and used the exponential utility (16) as the per-round loss with a fixed scaling parameter.

For the first set of experiments, we fixed the number of samples to 1920 and, for each of the two speed targets, used the following experimental conditions: a Gaussian distribution with step size one, a Gaussian distribution with a different step size, and a Laplace distribution. We use update (27) for the first two conditions and a natural gradient update for the third condition. We note, then, that the first condition corresponds to the MPPI algorithm. Lap times, max speeds, and average speeds are shown in Table 1. In both cases, we find that the second and third controllers perform comparably with MPPI, with the performance gap actually decreasing with increased target speed. These results are also in line with the lap times achieved in [31]. Car speeds and rollouts are shown in Figs. 14, 15, 16, and 17 in Section D.1 that further demonstrate how similarly the three controllers perform.

For the second set of experiments, we fixed the speed target and the control distribution (Gaussian, using update (27)) and used the following experimental conditions: each of 1920 and 192 samples, and each of three step sizes. (Due to limited time for running the experiments, one combination of 1920 samples and step size was not tested.) The results are shown in Table 2. Overall, there is a mild improvement in lap time with the smaller step sizes, and performance degrades little with the number of samples. Car speeds and rollouts are shown in Figs. 18, 19, 20, and 21 in Section D.2.

Table 1: First set of real-world experiments (1920 samples). For each of the two target speeds, we report the lap time, average speed, and max speed achieved under each experimental condition (Gaussian control distribution with each of the two step sizes, and the Laplace control distribution).
Table 2: Second set of real-world experiments (fixed target speed). For each step size and each of 1920 and 192 samples, we report the lap time, average speed, and max speed; one combination of 1920 samples and step size was not tested.

6 Conclusion

We presented a connection between model predictive control and online learning. From this connection, we proposed an algorithm based on dynamic mirror descent that can work for a wide variety of settings and cost functions. We also discussed the choice of loss function within this online learning framework and the sort of preference each loss function imposes. From this general algorithm and assortment of loss functions, we showed that several well-known algorithms are special cases and presented a general update for members of the exponential family.

We empirically validated our algorithm on continuous and discrete simulated problems and on a real-world aggressive driving task. In the process, we also studied the parameter choices within the framework, finding, for example, that in our framework a smaller number of rollout samples can be compensated for by varying other parameters like the step size.

We hope that the online learning and stochastic optimization viewpoints of MPC presented in this paper open up new possibilities for using tools from these domains, such as alternative efficient sampling techniques [5] and accelerated optimization methods [22, 24], to derive new MPC algorithms that perform well in practice.

Acknowledgements

This material is based upon work supported by NSF NRI award 1637758, NSF CAREER award 1750483, an NSF Graduate Research Fellowship under award No. 2015207631, and a National Defense Science & Engineering Graduate Fellowship.

References

Appendix A Proofs

Proof of Proposition 1.

We prove the first statement; the second one follows directly from the duality relationship. The statement follows from the derivations below; we can write

where the last equality is due to the assumption that . Then applying on both sides and using the relationship that , we have . ∎

Appendix B Experimental Setup

B.1 Cartpole

The state is , where is the cart position, is the pole’s angle, and are the corresponding velocities, and the control is the force applied to the cart. We define the instantaneous cost and terminal cost of the MPC problem as

where is some threshold. For our experiments, we set radians.

In our experiments, the pole is massless except for some weight at the end of the pole. The mass of the cart and pole weight are and , respectively. The true length of the pole is , whereas the length used in the model is . Each time step is modeled using an Euler discretization of seconds. Each episode of the problem lasts 500 time steps (i.e., 10 seconds) and has episode cost equal to the sum of encountered instantaneous costs. Both the true system and the model apply Gaussian additive noise to the commanded control with zero mean and a standard deviation of newtons. For the continuous system, the commanded control is clamped to newtons. For the discrete system, the controller can either command newtons to the left, newtons to the right, or newtons.

Both the discrete and continuous controller use a planning horizon of 50 time steps (i.e., 1 second). For the continuous controller, we keep the standard deviation of the Gaussian distribution fixed at newtons for each time step in the planning horizon. When applying a control on the real cartpole, we choose the mode of rather than sample from the distribution.

All reported results were gathered using ten episodes per parameter setting.

B.2 AutoRally

The state of the vehicle is