Differentiable MPC for End-to-end Planning and Control

Brandon Amos et al. · October 31, 2018

We present foundations for using Model Predictive Control (MPC) as a differentiable policy class for reinforcement learning in continuous state and action spaces. This provides one way of leveraging and combining the advantages of model-free and model-based approaches. Specifically, we differentiate through MPC by using the KKT conditions of the convex approximation at a fixed point of the controller. Using this strategy, we are able to learn the cost and dynamics of a controller via end-to-end learning. Our experiments focus on imitation learning in the pendulum and cartpole domains, where we learn the cost and dynamics terms of an MPC policy class. We show that our MPC policies are significantly more data-efficient than a generic neural network and that our method is superior to traditional system identification in a setting where the expert is unrealizable.


1 Introduction

Model-free reinforcement learning has achieved state-of-the-art results in many challenging domains. However, these methods learn black-box control policies and typically suffer from poor sample complexity and generalization. Alternatively, model-based approaches seek to model the environment the agent is interacting in. Many model-based approaches utilize Model Predictive Control (MPC) to perform complex control tasks (González et al., 2011; Lenz et al., 2015; Liniger et al., 2014; Kamel et al., 2015; Erez et al., 2012; Alexis et al., 2011; Bouffard et al., 2012; Neunert et al., 2016). MPC leverages a predictive model of the controlled system and solves an optimization problem online in a receding horizon fashion to produce a sequence of control actions. Usually the first control action is applied to the system, after which the optimization problem is solved again for the next time step.

Formally, MPC requires that at each time step we solve the optimization problem

$$
\begin{aligned}
x_{1:T}^\star, u_{1:T}^\star = \operatorname*{argmin}_{x_{1:T} \in \mathcal{X},\, u_{1:T} \in \mathcal{U}} \;\; & \sum_{t=1}^{T} C_t(x_t, u_t) \\
\text{subject to} \;\; & x_1 = x_{\mathrm{init}}, \quad x_{t+1} = f(x_t, u_t),
\end{aligned}
\tag{1}
$$

where $x_t$ and $u_t$ are the state and control at time $t$, $\mathcal{X}$ and $\mathcal{U}$ are constraints on valid states and controls, $C_t$ is a (potentially time-varying) cost function, $f$ is a dynamics model, and $x_{\mathrm{init}}$ is the initial state of the system. The optimization problem in (1) can be efficiently solved in many ways, for example with the finite-horizon iterative Linear Quadratic Regulator (iLQR) algorithm (Li and Todorov, 2004). Although these techniques are widely used in control domains, much work in deep reinforcement learning or imitation learning opts instead for a much simpler policy class such as a linear function or neural network. The advantage of these policy classes is that they are differentiable, so the loss can be directly optimized with respect to them, whereas full end-to-end learning is typically not possible with model-based approaches.
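To make the receding-horizon structure concrete, the following minimal sketch runs an MPC loop on a toy linear system with a quadratic cost. The short-horizon subproblem is solved here by projected gradient descent over the controls as a simple stand-in for iLQR, and all of the system matrices and dimensions are illustrative placeholders rather than quantities from the paper.

```python
import torch

# Toy linear system and quadratic cost (illustrative placeholders).
n, m, T_mpc = 3, 2, 10                        # state dim, control dim, MPC horizon
A = torch.eye(n) + 0.05 * torch.randn(n, n)   # dynamics x' = A x + B u
B = torch.randn(n, m)
Q, R = torch.eye(n), 0.1 * torch.eye(m)       # stage costs
u_lim = 1.0                                   # box constraint |u| <= u_lim

def plan(x0, iters=50, lr=0.1):
    """Solve the short-horizon problem min_u sum_t x_t'Q x_t + u_t'R u_t by
    projected gradient descent over the control sequence (a stand-in for iLQR)."""
    u = torch.zeros(T_mpc, m, requires_grad=True)
    opt = torch.optim.SGD([u], lr=lr)
    for _ in range(iters):
        x, cost = x0, 0.0
        for t in range(T_mpc):
            cost = cost + x @ Q @ x + u[t] @ R @ u[t]
            x = A @ x + B @ u[t]
        opt.zero_grad()
        cost.backward()
        opt.step()
        with torch.no_grad():
            u.clamp_(-u_lim, u_lim)           # project onto the control box
    return u.detach()

# Receding-horizon execution: re-plan at every step and apply only the first action.
x = torch.randn(n)
for step in range(20):
    u_seq = plan(x)
    x = A @ x + B @ u_seq[0]                  # apply the first control to the system
```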

In this paper, we consider the task of learning MPC-based policies in an end-to-end fashion, illustrated in Figure 1. That is, we treat MPC as a generic policy class parameterized by some representations of the cost $C$ and dynamics model $f$. By differentiating through the optimization problem, we can learn the costs and dynamics model to perform a desired task. This is in contrast to regressing on collected dynamics or trajectory rollout data and learning each component in isolation, and comes with the typical advantages of end-to-end learning (the ability to train directly based upon the task loss of interest, the ability to “specialize” parameters for a given task, etc.).

Still, efficiently differentiating through a complex policy class like MPC is challenging. Previous work with similar aims has either simply unrolled and differentiated through a simple optimization procedure (Tamar et al., 2017) or has considered generic optimization solvers that do not scale to the size of MPC problems (Amos and Kolter, 2017). This paper makes the following two contributions to this space. First, we provide an efficient method for analytically differentiating through an iterative non-convex optimization procedure based upon a box-constrained iterative LQR solver (Tassa et al., 2014); in particular, we show that the analytical derivative can be computed using one additional backward pass of a modified iterative LQR solver. Second, we empirically show that in imitation learning scenarios we can recover the cost and dynamics from an MPC expert with a loss based only on the actions (and not states). In one notable experiment, we show that directly optimizing the imitation loss results in better performance than vanilla system identification.

Figure 1: Illustration of our contribution: A learnable MPC module that can be integrated into a larger end-to-end reinforcement learning pipeline. Our method allows the controller to be updated with gradient information directly from the task loss.

2 Background and Related Work

Pure model-free techniques for policy search have demonstrated promising results in many domains by learning reactive policies which directly map observations to actions (Mnih et al., 2013; Oh et al., 2016; Gu et al., 2016b; Lillicrap et al., 2015; Schulman et al., 2015, 2016; Gu et al., 2016a). Despite their success, model-free methods have many drawbacks and limitations, including a lack of interpretability, poor generalization, and high sample complexity. Model-based methods are known to be more sample-efficient than their model-free counterparts. These methods generally rely on learning a dynamics model directly from interactions with the real system and then integrating the learned model into the control policy (Schneider, 1997; Abbeel et al., 2006; Deisenroth and Rasmussen, 2011; Heess et al., 2015; Boedecker et al., 2014). More recent approaches use a deep network to learn low-dimensional latent state representations and associated dynamics models in this learned representation. They then apply standard trajectory optimization methods on these learned embeddings (Lenz et al., 2015; Watter et al., 2015; Levine et al., 2016). However, these methods still require a manually specified and hand-tuned cost function, which can become even more difficult to design in a latent representation. Moreover, there is no guarantee that the learned dynamics model can accurately capture the portions of the state space relevant for the task at hand.

To leverage the benefits of both approaches, there has been significant interest in combining the model-based and model-free paradigms. In particular, much attention has been dedicated to utilizing model-based priors to accelerate the model-free learning process. For instance, synthetic training data can be generated by model-based control algorithms to guide the policy search or prime a model-free policy (Sutton, 1990; Theodorou et al., 2010; Levine and Abbeel, 2014; Gu et al., 2016b; Venkatraman et al., 2016; Levine et al., 2016; Chebotar et al., 2017; Nagabandi et al., 2017; Sun et al., 2017). Bansal et al. (2017) learn a controller and then distill it into a neural network policy, which is then fine-tuned with model-free policy learning. However, this line of work usually keeps the model separate from the learned policy.

Alternatively, the policy can include an explicit planning module which leverages learned models of the system or environment, both of which are learned through model-free techniques. For example, the classic Dyna-Q algorithm (Sutton, 1990) simultaneously learns a model of the environment and uses it to plan. More recent work has explored incorporating such structure into deep networks and learning the policies in an end-to-end fashion. Tamar et al. (2016) uses a recurrent network to predict the value function by approximating the value iteration algorithm with convolutional layers. Karkus et al. (2017) connects a dynamics model to a planning algorithm and formulates the policy as a structured recurrent network. Silver et al. (2016) and Oh et al. (2017) perform multiple rollouts using an abstract dynamics model to predict the value function. A similar approach is taken by Weber et al. (2017) but directly predicts the next action and reward from rollouts of an explicit environment model. Farquhar et al. (2017) extends model-free approaches, such as DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016), by planning with a tree-structured neural network to predict the cost-to-go. While these approaches have demonstrated impressive results in discrete state and action spaces, they are not applicable to continuous control problems.

To tackle continuous state and action spaces, Pascanu et al. (2017) propose a neural architecture which uses an abstract environmental model to plan and is trained directly from an external task loss. Pong et al. (2018) learn goal-conditioned value functions and use them to plan single or multiple steps of actions in an MPC fashion. Similarly, Pathak et al. (2018) train a goal-conditioned policy to perform rollouts in an abstract feature space but ground the policy with a loss term which corresponds to true dynamics data. The aforementioned approaches can be interpreted as a distilled optimal controller which does not separate components for the cost and dynamics. Taking this analogy further, another strategy is to differentiate through an optimal control algorithm itself. Okada et al. (2017) and Pereira et al. (2018) present a means to differentiate through path integral optimal control (Williams et al., 2016, 2017) and learn a planning policy end-to-end. Srinivas et al. (2018) shows how to embed differentiable planning (unrolled gradient descent over actions) within a goal-directed policy. In a similar vein, Tamar et al. (2017) differentiates through an iterative LQR (iLQR) solver (Li and Todorov, 2004; Xie et al., 2017; Tassa et al., 2014) to learn a cost-shaping term offline. This shaping term enables a shorter horizon controller to approximate the behavior of a solver with a longer horizon to save computation during runtime.

Contributions of our paper. All of these methods require differentiating through planning procedures by explicitly “unrolling” the optimization algorithm itself. While this is a reasonable strategy, it is both memory- and computationally-expensive and challenging when unrolling through many iterations because the time- and space-complexity of the backward pass grows linearly with the forward pass. In contrast, we address this issue by showing how to analytically differentiate through the fixed point of a nonlinear MPC solver. Specifically, we compute the derivatives of an iLQR solver with a single LQR step in the backwards pass. This makes the learning process more computationally tractable while still allowing us to plan in continuous state and action spaces. Unlike model-free approaches, explicit cost and dynamics components can be extracted and analyzed on their own. Moreover, in contrast to pure model-based approaches, the dynamics model and cost function can be learned entirely end-to-end.

3 Differentiable LQR

Discrete-time finite-horizon LQR is a well-studied control method that optimizes a convex quadratic objective function with respect to affine state-transition dynamics from an initial system state $x_{\mathrm{init}}$. Specifically, LQR finds the optimal nominal trajectory $\tau_{1:T}^\star = \{x_t^\star, u_t^\star\}_{t=1}^{T}$ by solving the optimization problem

$$
\begin{aligned}
\tau_{1:T}^\star = \operatorname*{argmin}_{\tau_{1:T}} \;\; & \sum_{t=1}^{T} \tfrac{1}{2}\tau_t^\top C_t \tau_t + c_t^\top \tau_t \\
\text{subject to} \;\; & x_1 = x_{\mathrm{init}}, \quad x_{t+1} = F_t \tau_t + f_t,
\end{aligned}
\tag{2}
$$

where $\tau_t = [x_t,\, u_t]$ concatenates the state and control. From a policy learning perspective, this can be interpreted as a module with unknown parameters $\theta = \{C_t, c_t, F_t, f_t\}_{1:T}$, which can be integrated into a larger end-to-end learning system. The learning process involves taking derivatives of some loss function $\ell(\tau_{1:T}^\star)$, which are then used to update the parameters. Instead of directly computing each of the individual gradients, we present an efficient way of computing the derivatives of the loss function with respect to the parameters,

$$
\frac{\partial \ell}{\partial C_t}, \quad \frac{\partial \ell}{\partial c_t}, \quad \frac{\partial \ell}{\partial F_t}, \quad \frac{\partial \ell}{\partial f_t}, \quad \frac{\partial \ell}{\partial x_{\mathrm{init}}}.
\tag{3}
$$

By interpreting LQR from an optimization perspective (Boyd, 2008), we associate dual variables $\lambda_{0:T-1}$ with the state constraints. The Lagrangian of the optimization problem is then given by

$$
\mathcal{L}(\tau, \lambda) = \sum_{t=1}^{T} \left[ \tfrac{1}{2}\tau_t^\top C_t \tau_t + c_t^\top \tau_t \right] + \sum_{t=0}^{T-1} \lambda_t^\top \left( F_t \tau_t + f_t - x_{t+1} \right),
\tag{4}
$$

where the initial constraint is represented by setting $F_0 = 0$ and $f_0 = x_{\mathrm{init}}$. Differentiating (4) with respect to $\tau_t$ yields the stationarity conditions

$$
\nabla_{\tau_t} \mathcal{L} = C_t \tau_t + c_t + F_t^\top \lambda_t - \begin{bmatrix} \lambda_{t-1} \\ 0 \end{bmatrix} = 0, \qquad
\nabla_{\tau_T} \mathcal{L} = C_T \tau_T + c_T - \begin{bmatrix} \lambda_{T-1} \\ 0 \end{bmatrix} = 0,
\tag{5}
$$

for $t = 1, \ldots, T-1$.

Input: Initial state $x_{\mathrm{init}}$
Parameters: $\theta = \{C_t, c_t, F_t, f_t\}_{1:T}$

Forward Pass:

1: Solve (2) to obtain the nominal trajectory $\tau_{1:T}^\star$
2: Compute the dual variables $\lambda^\star$ with (7)

Backward Pass:

1: Solve (9) for $d_\tau$, reusing the factorizations from the forward pass
2: Compute $d_\lambda$ with (7)
3: Compute the derivatives of $\ell$ with respect to $C_t$, $c_t$, $F_t$, $f_t$, and $x_{\mathrm{init}}$ with (8)
Algorithm 1 Differentiable LQR Module (The LQR algorithm is defined in Appendix A)

Thus, the normal approach to solving LQR problems with the dynamic Riccati recursion can be viewed as an efficient way of solving the following KKT system

$$
K \begin{bmatrix} \tau_{1:T} \\ \lambda_{0:T-1} \end{bmatrix} = - \begin{bmatrix} c_{1:T} \\ f_{0:T-1} \end{bmatrix},
\tag{6}
$$

where $K$ is the sparse block-tridiagonal matrix obtained by stacking the stationarity conditions (5) together with the dynamics constraints. Given an optimal nominal trajectory $\tau_{1:T}^\star$, (5) shows how to compute the optimal dual variables with the backward recursion

$$
\lambda_{T-1}^\star = C_{T,x} \tau_T^\star + c_{T,x}, \qquad
\lambda_{t-1}^\star = C_{t,x} \tau_t^\star + c_{t,x} + F_{t,x}^\top \lambda_t^\star, \quad t = T-1, \ldots, 1,
\tag{7}
$$

where $C_{t,x}$, $c_{t,x}$, and $F_{t,x}^\top$ are the first block-rows of $C_t$, $c_t$, and $F_t^\top$, respectively, corresponding to the state $x_t$. Now that we have the optimal trajectory and dual variables, we can compute the gradients of the loss with respect to the parameters. Since LQR is a constrained convex quadratic program, the derivatives of the loss with respect to the LQR parameters can be obtained by implicitly differentiating the KKT conditions. Applying the approach from Section 3 of Amos and Kolter (2017), the derivatives are

$$
\frac{\partial \ell}{\partial C_t} = \tfrac{1}{2}\left( d_{\tau_t} \otimes \tau_t^\star + \tau_t^\star \otimes d_{\tau_t} \right), \quad
\frac{\partial \ell}{\partial c_t} = d_{\tau_t}, \quad
\frac{\partial \ell}{\partial F_t} = d_{\lambda_t} \otimes \tau_t^\star + \lambda_t^\star \otimes d_{\tau_t}, \quad
\frac{\partial \ell}{\partial f_t} = d_{\lambda_t}, \quad
\frac{\partial \ell}{\partial x_{\mathrm{init}}} = d_{\lambda_0},
\tag{8}
$$

where $\otimes$ is the outer product operator, and $d_\tau$ and $d_\lambda$ are obtained by solving the linear system

$$
K \begin{bmatrix} d_{\tau} \\ d_{\lambda} \end{bmatrix} = - \begin{bmatrix} \nabla_{\tau^\star} \ell \\ 0 \end{bmatrix}.
\tag{9}
$$

We observe that (9) is of the same form as the linear system in (6) for the LQR problem. Therefore, we can leverage this insight and solve (9) efficiently by solving another LQR problem that replaces $c_t$ with $\nabla_{\tau_t^\star}\ell$ and $f_t$ with $0$. Moreover, this approach enables us to reuse the factorization of $K$ from the forward pass instead of recomputing it. Algorithm 1 summarizes the forward and backward passes for a differentiable LQR module.
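The same recipe can be prototyped outside of the structured LQR solver. The sketch below differentiates through a generic dense equality-constrained QP (of which (2) is a special case) with a custom autograd function: the forward pass factorizes and solves the KKT system once, and the backward pass reuses that factorization with the incoming gradient on the right-hand side, mirroring the role of (9) in Algorithm 1. For simplicity it only returns gradients with respect to the linear cost term and the constraint right-hand side; the problem data are illustrative, and the torch.linalg.lu_factor/lu_solve calls assume a recent PyTorch. It is a minimal sketch of the idea, not the paper's LQR-structured implementation.

```python
import torch

class QPLayer(torch.autograd.Function):
    """Differentiable equality-constrained QP
        z*(q, b) = argmin_z 0.5 z'Hz + q'z   s.t.   Az = b,
    with H fixed PSD and A fixed with full row rank."""

    @staticmethod
    def forward(ctx, q, b, H, A):
        nz, neq = H.shape[0], A.shape[0]
        K = torch.zeros(nz + neq, nz + neq, dtype=H.dtype)
        K[:nz, :nz], K[:nz, nz:], K[nz:, :nz] = H, A.t(), A
        LU, piv = torch.linalg.lu_factor(K)            # factorize the KKT matrix once
        rhs = torch.cat([-q, b]).unsqueeze(-1)
        sol = torch.linalg.lu_solve(LU, piv, rhs).squeeze(-1)
        ctx.save_for_backward(LU, piv)
        ctx.nz = nz
        return sol[:nz]                                # primal solution z*

    @staticmethod
    def backward(ctx, grad_z):
        LU, piv = ctx.saved_tensors
        nz = ctx.nz
        # Solve the adjoint KKT system with the incoming gradient as the
        # right-hand side, reusing the forward factorization (K is symmetric).
        rhs = torch.cat([grad_z, grad_z.new_zeros(LU.shape[0] - nz)]).unsqueeze(-1)
        d = torch.linalg.lu_solve(LU, piv, rhs).squeeze(-1)
        d_z, d_lam = d[:nz], d[nz:]
        return -d_z, d_lam, None, None                 # dl/dq and dl/db

# Usage: gradients of a loss on z* flow back to q and b.
torch.manual_seed(0)
nz, neq = 5, 2
M = torch.randn(nz, nz)
H = M @ M.t() + torch.eye(nz)
A = torch.randn(neq, nz)
q = torch.randn(nz, requires_grad=True)
b = torch.randn(neq, requires_grad=True)
z = QPLayer.apply(q, b, H, A)
z.pow(2).sum().backward()
print(q.grad, b.grad)
```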

4 Differentiable MPC

While LQR is a powerful tool, it does not cover realistic control problems with non-linear dynamics and cost. Furthermore, most control problems have natural bounds on the control space that can often be expressed as box constraints. These highly non-convex problems, which we will refer to as model predictive control (MPC), are well-studied in the control literature and can be expressed in the general form

$$
\begin{aligned}
x_{1:T}^\star, u_{1:T}^\star = \operatorname*{argmin}_{x_{1:T}, u_{1:T}} \;\; & \sum_{t=1}^{T} C_\theta(x_t, u_t) \\
\text{subject to} \;\; & x_1 = x_{\mathrm{init}}, \quad x_{t+1} = f_\theta(x_t, u_t), \quad \underline{u} \le u_{1:T} \le \overline{u},
\end{aligned}
\tag{10}
$$

where the non-convex cost function $C_\theta$ and non-convex dynamics function $f_\theta$ are (potentially) parameterized by some $\theta$. We note that more generic constraints on the control and state space can be represented as penalties and barriers in the cost function. The standard way of solving the control problem (10) is by iteratively forming and optimizing a convex approximation

$$
\begin{aligned}
\tau_{1:T}^{i+1} = \operatorname*{argmin}_{\tau_{1:T}} \;\; & \sum_{t=1}^{T} \tilde{C}^i_{\theta,t}(\tau_t) \\
\text{subject to} \;\; & x_1 = x_{\mathrm{init}}, \quad x_{t+1} = \tilde{f}^i_{\theta,t}(\tau_t), \quad \underline{u} \le u_{1:T} \le \overline{u},
\end{aligned}
\tag{11}
$$

where we have defined the second-order Taylor approximation of the cost around the current trajectory iterate $\tau^i_t$ as

$$
\tilde{C}^i_{\theta,t}(\tau_t) = \tfrac{1}{2}\left(\tau_t - \tau^i_t\right)^\top H^i_t \left(\tau_t - \tau^i_t\right) + \left(g^i_t\right)^\top \left(\tau_t - \tau^i_t\right) + C_\theta(\tau^i_t),
\tag{12}
$$

with $H^i_t = \nabla^2_{\tau_t} C_\theta(\tau^i_t)$ and $g^i_t = \nabla_{\tau_t} C_\theta(\tau^i_t)$. We also have a first-order Taylor approximation of the dynamics around $\tau^i_t$ as

$$
\tilde{f}^i_{\theta,t}(\tau_t) = f_\theta(\tau^i_t) + D^i_t \left(\tau_t - \tau^i_t\right),
\tag{13}
$$

with $D^i_t = \nabla_{\tau_t} f_\theta(\tau^i_t)$. In practice, a fixed point of (11) is often reached, especially when the dynamics are smooth. As such, differentiating the non-convex problem (10) can be done exactly by using the final convex approximation. Without the box constraints, the fixed point in (11) could be differentiated with LQR as we show in Section 3. In the next section, we will show how to extend this to the case where we have box constraints on the controls as well.

4.1 Differentiating Box-Constrained QPs

First, we consider how to differentiate a more generic box-constrained convex QP of the form

$$
z^\star = \operatorname*{argmin}_{z} \; \tfrac{1}{2} z^\top H z + q^\top z \quad \text{subject to} \quad \underline{z} \le z \le \overline{z},
\tag{14}
$$

with $H \succeq 0$. Given the active inequality constraints at the solution, written in the form $G z^\star = h$, this problem turns into an equality-constrained optimization problem whose solution is given by the linear system

$$
\begin{bmatrix} H & G^\top \\ G & 0 \end{bmatrix} \begin{bmatrix} z^\star \\ \nu^\star \end{bmatrix} = \begin{bmatrix} -q \\ h \end{bmatrix}.
\tag{15}
$$

With some loss function $\ell$ that depends on $z^\star$, we can use the approach in Amos and Kolter (2017) to obtain the derivatives of $\ell$ with respect to $H$, $q$, $G$, and $h$ as

$$
\frac{\partial \ell}{\partial H} = \tfrac{1}{2}\left( d_z \otimes z^\star + z^\star \otimes d_z \right), \quad
\frac{\partial \ell}{\partial q} = d_z, \quad
\frac{\partial \ell}{\partial G} = d_\nu \otimes z^\star + \nu^\star \otimes d_z, \quad
\frac{\partial \ell}{\partial h} = -d_\nu,
\tag{16}
$$

where $d_z$ and $d_\nu$ are obtained by solving the linear system

$$
\begin{bmatrix} H & G^\top \\ G & 0 \end{bmatrix} \begin{bmatrix} d_z \\ d_\nu \end{bmatrix} = - \begin{bmatrix} \nabla_{z^\star}\ell \\ 0 \end{bmatrix}.
\tag{17}
$$

The constraint $G d_z = 0$ is equivalent to the constraint $[d_z]_i = 0$ if the $i$-th box constraint is active, i.e. if $z^\star_i \in \{\underline{z}_i, \overline{z}_i\}$. Thus solving the system in (17) is equivalent to solving the optimization problem

$$
d_z = \operatorname*{argmin}_{d} \; \tfrac{1}{2} d^\top H d + \left(\nabla_{z^\star}\ell\right)^\top d \quad \text{subject to} \quad d_i = 0 \;\; \text{if} \;\; z^\star_i \in \{\underline{z}_i, \overline{z}_i\}.
\tag{18}
$$
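The active-set reasoning above can be checked numerically on a small instance. In the sketch below (our own toy example; the projected-gradient solver and all problem data are illustrative), the box-constrained QP is solved, the clamped coordinates are dropped as in (18), and the resulting gradient of a loss with respect to the linear cost term is compared against finite differences.

```python
import torch

torch.manual_seed(0)
n = 6
M = torch.randn(n, n)
H = M @ M.t() / n + torch.eye(n)           # well-conditioned PSD quadratic term
q = torch.randn(n)
lb, ub = -0.3 * torch.ones(n), 0.3 * torch.ones(n)

def solve_box_qp(q, iters=2000, lr=0.1):
    """Projected gradient descent on 0.5 z'Hz + q'z subject to lb <= z <= ub."""
    z = torch.zeros(n)
    for _ in range(iters):
        z = torch.clamp(z - lr * (H @ z + q), lb, ub)
    return z

z_star = solve_box_qp(q)
loss = lambda z: (z ** 2).sum()            # an arbitrary scalar loss of the solution

# Analytic gradient of the loss w.r.t. q via the free/clamped split of (18):
free = (z_star > lb + 1e-6) & (z_star < ub - 1e-6)
grad_q = torch.zeros(n)                    # clamped coordinates receive zero gradient
g_free = 2 * z_star[free]                  # dloss/dz on the free coordinates
grad_q[free] = -torch.linalg.solve(H[free][:, free], g_free)

# Finite-difference check of the analytic gradient.
eps = 1e-4
fd = torch.zeros(n)
for i in range(n):
    e = torch.zeros(n)
    e[i] = eps
    fd[i] = (loss(solve_box_qp(q + e)) - loss(solve_box_qp(q - e))) / (2 * eps)
print(grad_q)
print(fd)
```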

4.2 Differentiating MPC with Box Constraints

Given: Initial state $x_{\mathrm{init}}$ and initial control sequence $u^0_{1:T}$
Parameters: $\theta$ of the objective $C_\theta$ and dynamics $f_\theta$

Forward Pass:

1: Solve (10)
2: The solver should reach the fixed point in (11), yielding the approximations (12) and (13) to the cost and dynamics
3: Compute the dual variables $\lambda^\star$ with (7)

Backward Pass:

1: $\tilde{\nabla}_{\tau^\star}\ell$ is $\nabla_{\tau^\star}\ell$ with the rows corresponding to the tight control constraints zeroed
2: Solve (9) for $d_\tau$, reusing the factorizations from the forward pass
3: Compute $d_\lambda$ with (7)
4: Differentiate $\ell$ with respect to the approximations of the cost and dynamics with (8)
5: Differentiate these approximations with respect to $\theta$ and use the chain rule to obtain $\partial \ell / \partial \theta$
Algorithm 2 Differentiable MPC Module

At a fixed point, we can use (16) to compute the derivatives of the MPC problem, where $d_\tau$ and $d_\lambda$ are found by solving the linear system in (9) with the additional constraint that $[d_{u_t}]_j = 0$ if the corresponding control is at its bound, i.e. if $[u^\star_t]_j \in \{\underline{u}_j, \overline{u}_j\}$. Solving this system can be equivalently written as a zero-constrained LQR problem of the form

$$
\begin{aligned}
d^\star_{\tau_{1:T}} = \operatorname*{argmin}_{d_{\tau_{1:T}}} \;\; & \sum_{t=1}^{T} \tfrac{1}{2} d_{\tau_t}^\top H^i_t d_{\tau_t} + \left(\nabla_{\tau^\star_t}\ell\right)^\top d_{\tau_t} \\
\text{subject to} \;\; & d_{x_1} = 0, \quad d_{x_{t+1}} = D^i_t d_{\tau_t}, \quad [d_{u_t}]_j = 0 \;\text{if}\; [u^\star_t]_j \in \{\underline{u}_j, \overline{u}_j\},
\end{aligned}
\tag{19}
$$

where $i$ is the iteration at which (11) reaches a fixed point, and $H^i_t$ and $D^i_t$ are the corresponding approximations to the objective and dynamics defined in (12) and (13). Algorithm 2 summarizes the proposed differentiable MPC module. To solve the MPC problem in (10) and reach the fixed point in (11), we use the box-DDP heuristic (Tassa et al., 2014). For the zero-constrained LQR problem in (19) used to compute the derivatives, we use an LQR solver that zeros the appropriate controls.

4.3 Drawbacks of Our Approach

Sometimes the controller does not run for long enough to reach a fixed point of (11), or a fixed point does not exist, which often happens when using neural networks to approximate the dynamics. When this happens, (19) cannot be used to differentiate through the controller, because it assumes a fixed point. Differentiating through the final iLQR iterate that is not a fixed point will usually give the wrong gradients. Treating the iLQR procedure as a compute graph and differentiating through the unrolled operations is a reasonable alternative in this scenario that obtains surrogate gradients to the control problem. However, as we empirically show in Section 5.1, the backwards pass of this method scales linearly with the number of iLQR iterations used in the forward pass. Instead, fixed-point differentiation is constant time and only requires a single iLQR solve.
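The trade-off can be reproduced on a much simpler fixed-point computation than MPC. In the sketch below (our own toy example, not the paper's solver), the forward pass iterates z ← tanh(Wz + x) to a fixed point; the unrolled variant backpropagates through every iteration, while the implicit variant differentiates the fixed-point condition directly by solving a single linear system, so its backward cost does not grow with the number of forward iterations.

```python
import torch

torch.manual_seed(0)
n = 50
W = 0.3 * torch.randn(n, n) / n ** 0.5        # small norm so the iteration contracts
x = torch.randn(n, requires_grad=True)

def fwd(x, iters=200):
    z = torch.zeros(n)
    for _ in range(iters):
        z = torch.tanh(W @ z + x)
    return z

# 1) Unrolled: keep the whole graph; backward time and memory grow with `iters`.
loss_unroll = fwd(x).sum()
g_unroll, = torch.autograd.grad(loss_unroll, x)

# 2) Implicit: iterate without a graph, then differentiate the fixed point
#    z* = tanh(W z* + x) by solving one linear system, independent of `iters`.
with torch.no_grad():
    z_star = fwd(x)
d = 1 - z_star ** 2                           # tanh'(W z* + x) at the fixed point
J = d.unsqueeze(1) * W                        # dg/dz = diag(d) W
v = torch.linalg.solve((torch.eye(n) - J).t(), torch.ones(n))  # (I - J)^{-T} dloss/dz
g_implicit = d * v                            # (dg/dx)^T v with dg/dx = diag(d)

print(torch.allclose(g_unroll, g_implicit, atol=1e-4))
```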

5 Experimental Results

In this section, we present several results that highlight the performance and capabilities of differentiable MPC in comparison to neural network policies and vanilla system identification (SysId). We show 1) superior runtime performance compared to an unrolled solver, 2) the ability of our method to recover the cost and dynamics of a controller with imitation, and 3) the benefit of directly optimizing the task loss over vanilla SysId.

We have released our differentiable MPC solver as a standalone open-source package, available at https://github.com/locuslab/mpc.pytorch, and our experimental code for this paper is also openly available at https://github.com/locuslab/differentiable-mpc. Our experiments are implemented with PyTorch (Paszke et al., 2017).
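As a pointer to the released package, the snippet below shows the typical call pattern for a box-constrained LQR problem. The argument names and tensor shapes follow our reading of the package's README and may differ from the current interface, so it should be treated as an approximate sketch rather than authoritative API documentation; the cost, dynamics, and bounds are illustrative placeholders.

```python
import torch
from mpc import mpc
from mpc.mpc import QuadCost, LinDx

torch.manual_seed(0)
n_batch, n_state, n_ctrl, T = 2, 3, 4, 5
n_sc = n_state + n_ctrl

# PSD quadratic cost, affine dynamics, and control bounds (illustrative).
C = torch.eye(n_sc).repeat(T, n_batch, 1, 1)
c = torch.randn(T, n_batch, n_sc)
F = torch.randn(T, n_batch, n_state, n_sc)
x_init = torch.randn(n_batch, n_state)
u_lower = -torch.ones(T, n_batch, n_ctrl)
u_upper = torch.ones(T, n_batch, n_ctrl)

ctrl = mpc.MPC(
    n_state=n_state,
    n_ctrl=n_ctrl,
    T=T,
    u_lower=u_lower,
    u_upper=u_upper,
    lqr_iter=20,
)
# The returned trajectory is differentiable w.r.t. the cost and dynamics tensors.
x_lqr, u_lqr, objs_lqr = ctrl(x_init, QuadCost(C, c), LinDx(F))
```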

Figure 2: Runtime comparison of fixed-point differentiation (FP) to unrolling the iLQR solver (Unroll), averaged over 10 trials.

Figure 3: Convergence of the LQR imitation learning experiments, showing the mean and standard deviation of four trials.

5.1 MPC Solver Performance

Figure 2 highlights the performance of our differentiable MPC solver. We compare to an alternative version where each box-constrained iLQR iteration is individually unrolled, and gradients are computed by differentiating through the entire unrolled chain. As illustrated in the figure, these unrolled operations incur a substantial extra cost. Our differentiable MPC solver 1) is slightly more computationally efficient even in the forward pass, as it does not need to create and maintain the backward-pass variables; 2) is more memory-efficient in the forward pass for this same reason (by a factor of the number of iLQR iterations); and 3) is significantly more efficient in the backwards pass, especially when a large number of iLQR iterations are needed. The backwards pass is essentially free, as it can reuse all of the factorizations from the forward pass and does not require multiple iterations.

5.2 Imitation Learning: Linear-Dynamics Quadratic-Cost (LQR)

In this section, we show results to validate the MPC solver and gradient-based learning approach for an imitation learning problem. The expert and learner are LQR controllers that share all information except for the linear system dynamics $F$. The controllers have the same quadratic cost (the identity), control bounds, horizon (5 timesteps), and 3-dimensional state and control spaces. Though the dynamics can also be recovered by fitting next-state transitions, we show that we can alternatively use imitation learning to recover the dynamics using only the controls.

Given an initial state $x_{\mathrm{init}}$, we can obtain nominal actions from the controllers as $u_{1:T} = \pi(x_{\mathrm{init}}; F)$, where $F$ is the controller's dynamics. We randomly initialize the learner's dynamics $\hat{F}$ and minimize the imitation loss

$$
\ell(\hat{F}) = \mathbb{E}_{x_{\mathrm{init}}} \left[ \big\| \pi(x_{\mathrm{init}}; \hat{F}) - \pi(x_{\mathrm{init}}; F) \big\|_2^2 \right],
$$

which we can uniquely do using only observed controls and no state observations. We do learning by differentiating $\ell$ with respect to $\hat{F}$ (using mini-batches with 32 examples) and taking gradient steps with RMSprop (Tieleman and Hinton, 2012). Figure 3 shows that the learner's trajectories match the expert's by minimizing the imitation loss $\ell$. Furthermore, the learner also recovers the expert's parameters, as shown by the decreasing model loss $\|\hat{F} - F\|$. We note that, despite the LQR problem being convex, optimizing some loss function with respect to the LQR's parameters is in general a (potentially difficult) non-convex optimization problem.
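This experiment can be prototyped compactly if the control bounds are dropped, since an unconstrained finite-horizon LQR is an equality-constrained QP whose KKT system can be solved with a single differentiable linear solve. The sketch below is a simplified stand-in for our solver under those assumptions (it learns only the state block of the dynamics, ignores the box constraints, and does not reuse factorizations); it illustrates how controls-only imitation drives the learned dynamics toward the expert's.

```python
import torch

def lqr_controls(A, B, Q, R, x_init, T):
    """Unconstrained finite-horizon LQR via its KKT system. The linear solve
    keeps the returned controls differentiable w.r.t. the dynamics (A, B)."""
    n, m = A.shape[0], B.shape[1]
    nz = T * n + (T - 1) * m                       # variables: x_1..x_T, u_1..u_{T-1}
    H = torch.block_diag(*([Q] * T + [R] * (T - 1)))
    n_eq = T * n                                   # initial state + T-1 dynamics rows
    G = torch.zeros(n_eq, nz, dtype=A.dtype)
    h = torch.zeros(n_eq, dtype=A.dtype)
    G[:n, :n] = torch.eye(n, dtype=A.dtype)        # x_1 = x_init
    h[:n] = x_init
    for t in range(T - 1):                         # x_{t+1} - A x_t - B u_t = 0
        r = n * (t + 1)
        G[r:r+n, (t+1)*n:(t+2)*n] = torch.eye(n, dtype=A.dtype)
        G[r:r+n, t*n:(t+1)*n] = -A
        G[r:r+n, T*n + t*m: T*n + (t+1)*m] = -B
    K = torch.zeros(nz + n_eq, nz + n_eq, dtype=A.dtype)
    K[:nz, :nz], K[:nz, nz:], K[nz:, :nz] = H, G.t(), G
    rhs = torch.cat([torch.zeros(nz, dtype=A.dtype), h])
    sol = torch.linalg.solve(K, rhs)
    return sol[T*n: T*n + (T-1)*m].reshape(T - 1, m)

torch.manual_seed(0)
n, m, T = 3, 3, 5
Q, R = torch.eye(n), torch.eye(m)
A_true = torch.eye(n) + 0.1 * torch.randn(n, n)
B = torch.randn(n, m)
A_hat = (A_true + 0.3 * torch.randn(n, n)).requires_grad_()

opt = torch.optim.RMSprop([A_hat], lr=1e-2)
for it in range(300):
    loss = 0.0
    for _ in range(8):                             # mini-batch of random initial states
        x0 = torch.randn(n)
        with torch.no_grad():
            u_expert = lqr_controls(A_true, B, Q, R, x0, T)
        u_learner = lqr_controls(A_hat, B, Q, R, x0, T)
        loss = loss + (u_learner - u_expert).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if it % 50 == 0:
        print(it, loss.item(), (A_hat - A_true).norm().item())
```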

5.3 Imitation Learning: Non-Convex Continuous Control

We next demonstrate the ability of our method to do imitation learning in the pendulum and cartpole benchmark domains. Despite being simple tasks, they are relatively challenging for a generic policy to learn quickly in the imitation learning setting. In our experiments we use MPC experts and learners that produce a nominal action sequence $u_{1:T} = \pi_\theta(x_{\mathrm{init}})$, where $\theta$ parameterizes the model that is being optimized. The goal of these experiments is to optimize the imitation loss, which again we can do uniquely using only observed controls and no state observations. We consider the following methods:

Baselines: nn is an LSTM that takes the state as input and predicts the nominal action sequence. In this setting we optimize the imitation loss directly. sysid assumes the cost of the controller is known and approximates the parameters of the dynamics by optimizing the next-state transitions.

Our Methods: mpc.dx assumes the cost of the controller is known and approximates the parameters of the dynamics by directly optimizing the imitation loss. mpc.cost assumes the dynamics of the controller are known and approximates the cost by directly optimizing the imitation loss. mpc.cost.dx approximates both the cost and the parameters of the dynamics of the controller by directly optimizing the imitation loss.

In all settings that involve learning the dynamics (sysid, mpc.dx, and mpc.cost.dx) we use a parameterized version of the true dynamics. In the pendulum domain, the parameters are the mass, length, and gravity; in the cartpole domain, the parameters are the cart's mass, the pole's mass, gravity, and length. For cost learning in mpc.cost and mpc.cost.dx we parameterize the cost of the controller as the weighted distance to a goal state, with weights $w$ and goal $x_{\mathrm{goal}}$. We have found that simultaneously learning the weights and the goal state is unstable, so in our experiments we alternate learning $w$ and $x_{\mathrm{goal}}$ independently every 10 epochs. We collected a dataset of trajectories from an expert controller and vary the number of trajectories our models are trained on. A single trial of our experiments takes 1-2 hours on a modern CPU. We optimize the nn setting with Adam (Kingma and Ba, 2014) and all other settings with RMSprop (Tieleman and Hinton, 2012).
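The alternation between the cost weights and the goal state can be expressed as two optimizers that are toggled every 10 epochs. In the sketch below, `policy` is a trivial differentiable stand-in for the MPC policy (it simply steers proportionally toward the goal under the weighted cost), and all names, dimensions, and hyperparameters are illustrative rather than the paper's.

```python
import torch

torch.manual_seed(0)
n = 3
w_true, g_true = torch.rand(n) + 0.5, torch.randn(n)

def policy(x, w, g):
    """Trivial differentiable stand-in for an MPC policy whose cost is a
    weighted distance to a goal state: steer proportionally toward the goal."""
    return -w * (x - g)

w = torch.rand(n, requires_grad=True)          # cost weights
g = torch.zeros(n, requires_grad=True)         # goal state
opt_w = torch.optim.RMSprop([w], lr=1e-2)
opt_g = torch.optim.RMSprop([g], lr=1e-2)

for epoch in range(200):
    # Alternate which parameter group is updated every 10 epochs.
    opt = opt_w if (epoch // 10) % 2 == 0 else opt_g
    for _ in range(32):
        x = torch.randn(n)
        with torch.no_grad():
            u_expert = policy(x, w_true, g_true)
        loss = (policy(x, w, g) - u_expert).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```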

Figure 4 shows that in nearly every case we are able to directly optimize the imitation loss with respect to the controller and we significantly outperform a general neural network policy trained on the same information. In many cases we are able to recover the true cost function and dynamics of the expert. More information about the training and validation losses is given in Appendix B. The comparison between our approach mpc.dx and SysId is notable, as we are able to recover equivalent performance to SysId with our models using only the control information and without using state information.

Again, while we emphasize that these are simple tasks, there are stark differences between the approaches. Unlike the generic network-based imitation learning, the MPC policy can exploit its inherent structure. Specifically, because the MPC policy contains a well-defined notion of the dynamics and cost, it is able to learn with much lower sample complexity than a typical network. But unlike pure system identification (which would be reasonable only for the case where the physical parameters are unknown but all other costs are known), the differentiable MPC policy can naturally be adapted to objectives besides simple state prediction, such as incorporating the additional cost learning portion.

Figure 4: Learning results on the (simple) pendulum and cartpole environments. We select the best validation loss observed during the training run and report the best test loss.

5.4 Imitation Learning: SysId with a non-realizable expert

All of our previous experiments that involve SysId and learning the dynamics are in the unrealistic case where the expert's dynamics are in the model class being learned. In this experiment we study a case where the expert's dynamics are outside of the model class being learned. In this setting we do imitation learning for the parameters of a dynamics function with vanilla SysId and by directly optimizing the imitation loss (the sysid and mpc.dx approaches from the previous section, respectively).

SysId often fits observations from a noisy environment to a simpler model. In our setting, we collect optimal trajectories from an expert in the pendulum environment that has an additional damping term and also has another force acting on the point-mass at the end (which can be interpreted as a “wind” force). We do learning with dynamics models that do not have these additional terms and therefore we cannot recover the expert’s parameters. Figure 5 shows that even though vanilla SysId is slightly better at optimizing the next-state transitions, it finds an inferior model for imitation compared to our approach that directly optimizes the imitation loss.

We argue that SysId is rarely a goal in isolation and almost always serves the purpose of performing a more sophisticated task such as imitation or policy learning. Typically SysId is merely a surrogate for optimizing the task, and we claim that the task's loss signal provides useful information to guide the dynamics learning. Our method provides one way of doing this by allowing the task's loss function to be directly differentiated with respect to the dynamics function being learned.

Figure 5: Convergence results in the non-realizable Pendulum task.

6 Conclusion

This paper lays the foundations for differentiating and learning MPC-based controllers within reinforcement learning and imitation learning. Our approach, in contrast to the more traditional strategy of “unrolling” a policy, has the benefit that it is much less computationally and memory intensive, with a backward pass that is essentially free given the number of iterations required for the iLQR optimizer to converge to a fixed point. We have demonstrated our approach in the context of imitation learning, and have highlighted the potential advantages that this approach brings over generic imitation learning and system identification.

We also emphasize that one of the primary contributions of this paper is to define and set up the framework for differentiating through MPC in general. Given the recent prominence of attempting to incorporate planning and control methods into the loop of deep network architectures, the techniques here offer a method for efficiently integrating MPC policies into such situations, allowing these architectures to make use of a very powerful function class that has proven extremely effective in practice. This has numerous additional applications, including tuning model parameters to task-specific goals, incorporating joint model-based and policy-based loss functions, and extensions into stochastic settings.

Acknowledgments

BA is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1252522.


Appendix A LQR and MPC Algorithms

The state space is $n$-dimensional and the control space is $m$-dimensional.
$T$ is the horizon length, the number of nominal timesteps to optimize over in the future.
$x_{\mathrm{init}}$ is the initial state.
$C_{1:T}$ are the quadratic cost terms; every $C_t$ must be PSD.
$c_{1:T}$ are the affine cost terms.

Backward Recursion
for t = T to 1 do
     Accumulate the quadratic value-function terms and compute the feedback gains $K_t$ and feedforward terms $k_t$ from the Riccati recursion.
end for
Forward Recursion
for t = 1 to T do
     Roll the affine dynamics forward with $u_t = K_t x_t + k_t$ and $x_{t+1} = F_t \tau_t + f_t$.
end for
return $\tau^\star_{1:T}$
Algorithm 3 Solves (2) as described in Levine [2017]

The state space is $n$-dimensional and the control space is $m$-dimensional.
$T$ is the horizon length, the number of nominal timesteps to optimize over in the future.
$\underline{u}$ and $\overline{u}$ are respectively the control lower- and upper-bounds.
$x_{\mathrm{init}}$ and $u^0_{1:T}$ are respectively the initial state and initial nominal control sequence.
$C_\theta$ is the non-convex and twice-differentiable cost function.
$f_\theta$ is the non-convex and once-differentiable dynamics function.

Initialize the nominal trajectory by rolling the initial control sequence $u^0_{1:T}$ through the true dynamics.
for i = 1 until converged do
     for t = 1 to T do
          Form the second-order Taylor expansion of the cost around the current trajectory iterate as in (12).
          Form the first-order Taylor expansion of the dynamics around the current trajectory iterate as in (13).
     end for
     Backward Recursion: over the linearized trajectory, compute the value-function terms and control updates; each box-constrained subproblem over the controls can be solved with a Projected-Newton method as described in Tassa et al. [2014], indexing the free and clamped control dimensions separately.
     Forward Recursion and Line Search: over the true cost and dynamics, roll the candidate controls forward, repeating with a decreased step size until the line-search criterion is satisfied, and take the result as the next trajectory iterate.
end for
Algorithm 4 Solves (10) as described in Tassa et al. [2014]

Appendix B Imitation learning experiment losses

Figure 6: Learning results on the (simple) pendulum and cartpole environments. We select the best validation loss observed during the training run and report the corresponding train and test loss. Every datapoint is averaged over four trials.