MPC-Net: A First Principles Guided Policy Search

09/11/2019 · Jan Carius et al., ETH Zurich

We present an Imitation Learning approach for the control of dynamical systems with a known model. Our policy search method is guided by solutions from Model Predictive Control (MPC). Contrary to approaches that minimize a distance metric between the guiding demonstrations and the learned policy, our loss function corresponds to the minimization of the control Hamiltonian, which derives from the principle of optimality. Our algorithm, therefore, directly attempts to solve the HJB optimality equation with a parameterized class of control laws. The loss function's explicit encoding of physical constraints manifests in an improved constraint satisfaction metric of the learned controller. We train a mixture-of-expert neural network architecture for controlling a quadrupedal robot and show that this policy structure is well suited for such multimodal systems. The learned policy can successfully stabilize different gaits on the real walking robot from less than 10 min of demonstration data.


I Introduction

The control of robotic systems with fast and unstable dynamics requires carefully designed feedback controllers. Hybrid, underactuated walking robots pose an especially challenging setting in this respect.

Recent successes in Reinforcement Learning (RL) demonstrate sophisticated walking robot control [1, 2, 3, 4, 5], yet a large number of policy rollouts need to be collected to reach the required performance level. It is, therefore, common practice to use physics simulators during training and subsequently attempt a sim-to-real transfer [1, 4].

Imitation Learning (IL) [6] appears to be a promising method that could reduce the sampling needs of learning-based approaches by guiding them with expert demonstrations. When good demonstrations are available, sampling efficiency can be drastically improved over classical RL [7].

An appealing way to automatically generate such demonstrations for modeled dynamical systems is Optimal Control (OC) and its real-time counterpart, MPC. They provide a formal framework for generating control commands that respect physical constraints and optimize a performance criterion. Knowledge of a system model and its gradients enables MPC to discover complex robot behaviors in a very sample-efficient way [8, 9, 10, 11, 12, 13]. Unfortunately, when deploying on a robot, the entire optimization problem has to be solved online because the resulting control policy is only valid around the current state. Moreover, the robustness against disturbances – both of intrinsic nature (e.g., modeling errors) as well as external effects – critically depends on the assumption that a new motion plan can be generated sufficiently fast. Even for moderately complex systems, the update frequency of MPC becomes a limiting factor when deploying on onboard computers.

Learning from OC solutions has proven a viable option for robot control that combines the advantages of both approaches [14, 15, 16, 17, 18, 19, 20, 21]. The benefit of using a solver as expert demonstrator over humans or animals is that there is no domain adaptation problem, and one can query demonstrations from arbitrary states. Additionally, one may request the solver to explicitly handle constraints instead of only presuming that demonstrations are constraint consistent.

Several methods take an inverse OC approach to IL: Multiple local approximations of the value function, computed by MPC runs, are aggregated into a single global approximation [22, 23, 24]. The learned value function and its induced optimal policy are in turn used to reduce the MPC horizon or speed up convergence. Alternatively, a Behavioral Cloning (BC) approach to IL attempts to directly learn a policy that reproduces the expert’s demonstrations without maintaining a value function explicitly. Accordingly, the original RL problem is transformed into a supervised learning problem since the demonstrator’s actions can be interpreted as labels.

Our proposed algorithm belongs to the family of such actor-only approaches: We introduce MPC-Net, a policy search method that is guided by an OC solver to find a neural network control policy. The method can be seen as a policy iteration scheme that draws data from a perfect critic (i.e., the MPC). Our key innovation is a theoretically motivated loss function, which is based on the minimization of the control Hamiltonian. The structure of the control Hamiltonian captures the system dynamics and constraints of the control problem. We show that this learning objective has favorable properties in terms of convergence and constraint satisfaction, which is particularly important for hybrid systems.

Closely related to our algorithm are policy search methods with a teacher-learner setup [17, 18, 19]. These works employ an OC solver as a teacher from which a policy is learned. Contrary to our work, however, the teacher adapts to the student. This assimilation is achieved by adding a penalty term to the OC cost function so that demonstrations are created that remain close to the student’s policy. Additionally, the student’s objective is usually the optimization of a distance metric between the student’s and the teacher’s policy outputs. However, minimizing such a distance may not correspond to good task performance; e.g., in constrained settings it is usually more important to satisfy constraints than to mimic the teacher accurately. In our approach, no such choice of a distance metric has to be made. Notably, our learner is never presented with the optimal control input. Additionally, since our demonstrator does not adapt to the current policy of the learner, all demonstration samples remain valid and can be re-used, thereby boosting sampling efficiency.

Imitating a demonstrator that is not adaptive to the learner induces the problem of distribution matching: Inevitable approximation errors between the learned and demonstrated policies make rollouts of the learned policy encounter a different distribution of states than the one from demonstration data. Ross et al. [25, 26] show that the resulting errors can compound quadratically in the time horizon. We use elements of their proposed solutions (probabilistic mixing and dataset augmentation) to ensure that the distributions match. Simply put, we bias the demonstrator’s query states towards the observations that our policy sees and thereby receive samples that match the learner’s distribution better.

While the idea of policy search through minimization of the control Hamiltonian applies to arbitrary parameterized policies such as neural networks, weighted motion primitives, or spline coefficients, we consider the very general class of mixture-of-experts neural network policies [27] in this work. Our choice caters to the fact that OC is an inverse problem with potentially multiple solutions for the same observation. The expert data may, therefore, exhibit such multimodal behavior. We show that this choice of network structure has favorable properties in terms of convergence and constraint satisfaction and is particularly suitable for controlling legged robots since these systems inherently exhibit multi-modal dynamics.

Statement of Contributions

The contributions of this work are as follows:

  • Derivation of a policy search algorithm whose loss function is derived from fundamental concepts of OC

  • Evidence that the explicit handling of constraints in our loss function achieves improved constraint satisfaction compared to minimization of a distance metric in terms of policy outputs

  • Demonstration of improved efficiency in terms of MPC calls by exploiting a local approximation of the OC value function

  • Evidence that a mixture-of-expert network architecture outperforms a general Multilayer Perceptron (MLP) for control of a walking robot

  • Validation of the trained control policies on robotic hardware. The learned controllers successfully stabilize two different gaits on a quadrupedal robot

II Method

The key steps of our method are listed in Alg. 1 and schematically shown in Fig. 1. Data is generated by running MPC from a feasible, random initial state. Samples from the resulting optimal trajectories are stored in a replay buffer. At each policy update step, we construct a loss function by drawing a batch of the stored samples and perform a stochastic gradient descent step in the policy parameter space.


In this section, we first explain the MPC problem and the structure of its solution. Subsequently, we present the theoretical properties of the OC problem and how they motivate our loss function. Finally, we show how a neural network policy is trained from MPC demonstrations.

Fig. 1: Schematic of the MPC-Net policy learning approach
Algorithm 1 MPC-Net Guided Policy Learning
1:  Given: replay buffer D, mpcSolver
2:  Given hyperparameters: maxIter, mpcDecimation, batchSize, learningRate, rolloutLength
3:  for iter in [1, maxIter] do
4:     if modulo(iter, mpcDecimation) = 0 then
5:        α ← iter / maxIter
6:        x ← sampleRandomStartingState()
7:        for t in [0, rolloutLength] do
8:           solution ← mpcSolver(t, x)
9:           x_s ← sampleInNeighborhood(x)
10:          ∂V/∂x ← valueFunctionDerivative(x_s)
11:          ν ← constraintLagrangian(x_s)
12:          Append sample (t, x_s, ∂V/∂x, ν) to D
13:          x ← stepSystem(x, solution, π_θ, α)
14:       end for
15:     end if
16:     batch ← drawRandomSampleBatch(D, batchSize)
17:     u_θ ← evaluatePolicyOnSamples(batch)
18:     loss ← computeLoss(batch, u_θ)
19:     θ ← stepOptimizer(loss, learningRate)
20:  end for
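To make the data flow concrete, the following Python sketch mirrors the outer loop of Alg. 1. All callables and their signatures (mpc_solve, dynamics, sample_start, sample_near, hamiltonian_loss, the solver's return object) are assumed interfaces chosen for illustration, not the authors' released implementation; the optimizer is assumed to be PyTorch-style.

```python
import random

def train_mpc_net(policy, optimizer, mpc_solve, dynamics, sample_start,
                  sample_near, hamiltonian_loss, max_iter=100_000,
                  mpc_decimation=500, rollout_steps=1200, dt=0.0025,
                  batch_size=32):
    """Sketch of Alg. 1: interleave MPC data generation with policy updates.

    Assumed interfaces: `mpc_solve(t, x)` returns an object exposing the nominal
    state `x_nominal`, the MPC feedback policy `u(x)`, the value-function
    derivative `dVdx(x)` and the constraint Lagrangian `nu(x)`;
    `hamiltonian_loss` returns a differentiable (PyTorch-style) scalar
    implementing the loss of Eq. (13).
    """
    replay_buffer = []                                   # tuples (t, x_s, dVdx, nu)

    for it in range(1, max_iter + 1):
        if it % mpc_decimation == 0:                     # emulate the real-time MPC loop
            alpha = it / max_iter                        # mixing parameter, grows 0 -> 1
            t, x = 0.0, sample_start()
            for _ in range(rollout_steps):
                sol = mpc_solve(t, x)
                x_s = sample_near(sol.x_nominal)                       # Eq. (9)
                replay_buffer.append((t, x_s, sol.dVdx(x_s), sol.nu(x_s)))
                u = (1 - alpha) * sol.u(x) + alpha * policy(x, t)      # Eq. (10)
                x = x + dt * dynamics(x, u, t)                         # Euler step of Eq. (11)
                t += dt

        if not replay_buffer:                            # no demonstrations collected yet
            continue
        batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
        loss = hamiltonian_loss(policy, batch)           # empirical loss, Eq. (13)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```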

II-A Model Predictive Control

We consider a continuous-time, finite horizon OC problem

$\min_{\mathbf{u}(\cdot)} \;\; \Phi\big(\mathbf{x}(T)\big) + \int_{t}^{T} L\big(\mathbf{x}(\tau), \mathbf{u}(\tau), \tau\big)\, d\tau$   (1)
subject to
$\dot{\mathbf{x}}(\tau) = \mathbf{f}\big(\mathbf{x}(\tau), \mathbf{u}(\tau), \tau\big), \quad \mathbf{x}(t) = \mathbf{x}_0, \quad \mathbf{g}\big(\mathbf{x}(\tau), \mathbf{u}(\tau), \tau\big) = \mathbf{0}, \quad \mathbf{h}\big(\mathbf{x}(\tau), \mathbf{u}(\tau), \tau\big) \geq \mathbf{0},$   (2)

where $T$ is the time horizon, $\mathbf{x}_0$ a given initial state, $\Phi$ the final cost, and $L$ the intermediate cost function. The system flow map $\mathbf{f}$ and the constraints $\mathbf{g}$ and $\mathbf{h}$ may be time-dependent, for example to represent a hybrid walking robot.

In principle, our method works with any OC solver that can handle the constraints (2) and that provides an approximation of the optimal value function. In this work, we employ a Differential Dynamic Programming (DDP)-like algorithm called Sequential Linear-Quadratic (SLQ) control [28], which is the continuous-time equivalent to the Iterative Linear-Quadratic Regulator (iLQR) [29]. This solver handles the inequality constraints through a barrier function [30] and explicitly computes optimal Lagrange multipliers for satisfaction of the state-input equality constraint [28]. The Lagrangian of the OC problem (1) is therefore given by

$\mathcal{L}(\mathbf{x}, \mathbf{u}, \boldsymbol{\nu}, t) = L(\mathbf{x}, \mathbf{u}, t) + \boldsymbol{\nu}^{\top} \mathbf{g}(\mathbf{x}, \mathbf{u}, t) + \rho\big(\mathbf{h}(\mathbf{x}, \mathbf{u}, t)\big),$   (3)

where $\boldsymbol{\nu}$ denotes the Lagrange multipliers of the equality constraint and $\rho(\cdot)$ the barrier penalty on the inequality constraints.

The solution of the variational problem (1) consists of nominal state and input trajectories $\mathbf{x}^*(t)$, $\mathbf{u}^*(t)$ as well as time-dependent linear feedback gains $\mathbf{K}(t)$ that define the optimal control policy

$\mathbf{u}^{\mathrm{MPC}}(\mathbf{x}, t) = \mathbf{u}^*(t) + \mathbf{K}(t)\big(\mathbf{x} - \mathbf{x}^*(t)\big).$   (4)

As a byproduct of the solver, we also have access to the state derivative of the value function, $\partial V(\mathbf{x}, t) / \partial \mathbf{x}$.

During our emulated real-time MPC loop, we let the solver compute the optimal policy, then store the values of $\big(t, \mathbf{x}, \partial V / \partial \mathbf{x}, \boldsymbol{\nu}\big)$ at the first time step of the solution in our replay memory. Next, we update the current state using the system dynamics and continue until the rollout length is reached.

II-B Policy Loss Function

The MPC internally computes the optimal value function (cost-to-go), which is defined as

$V(\mathbf{x}, t) = \min_{\mathbf{u}(\cdot)} \; \Phi\big(\mathbf{x}(T)\big) + \int_{t}^{T} L\big(\mathbf{x}(\tau), \mathbf{u}(\tau), \tau\big)\, d\tau.$   (5)

It is a known property of OC [31, p. 111] that the optimal input must satisfy

$\mathbf{u}^*(\mathbf{x}, t) = \arg\min_{\mathbf{u}} \; H(\mathbf{x}, \mathbf{u}, t),$   (6)
$H(\mathbf{x}, \mathbf{u}, t) = \mathcal{L}(\mathbf{x}, \mathbf{u}, \boldsymbol{\nu}, t) + \big(\partial V(\mathbf{x}, t) / \partial \mathbf{x}\big)^{\top} \mathbf{f}(\mathbf{x}, \mathbf{u}, t),$   (7)

where $H$ is the input-dependent part of the control Hamiltonian, which directly arises from the HJB equation. A globally optimal policy would have to satisfy (6) at any time $t$ and state $\mathbf{x}$. Therefore, the perfect policy search method would involve a very rich family of parameterized policies and minimize the control Hamiltonian in the entire time-state space. Such minimization is impossible because the optimal value function and Lagrange multipliers are not known a priori.
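For reference, a standard statement of the HJB equation in the notation above (textbook form, cf. [31], with the constrained Lagrangian $\mathcal{L}$ taking the place of the plain running cost) is

$-\dfrac{\partial V(\mathbf{x}, t)}{\partial t} = \min_{\mathbf{u}} \Big[ \mathcal{L}(\mathbf{x}, \mathbf{u}, \boldsymbol{\nu}, t) + \big(\partial V(\mathbf{x}, t) / \partial \mathbf{x}\big)^{\top} \mathbf{f}(\mathbf{x}, \mathbf{u}, t) \Big], \qquad V(\mathbf{x}, T) = \Phi(\mathbf{x}),$

so a policy that minimizes $H$ at every $(t, \mathbf{x})$ recovers the optimal input pointwise.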

To our benefit, however, MPC can compute the value function along trajectories in state space. For a sufficiently rich class of parameterized functions, one can expect to find parameters $\theta$ that make the policy $\pi_{\theta}(\mathbf{x}, t)$ reproduce the optimal inputs sufficiently closely. Our strategy to find these optimal parameters is, therefore, given by Eq. (6), where we insert the parameterized policy and minimize the expectation over the time-state distribution that results from the MPC trajectories:

$\theta^* = \arg\min_{\theta} \; \mathbb{E}_{(t, \mathbf{x})} \big[ H\big(\mathbf{x}, \pi_{\theta}(\mathbf{x}, t), t\big) \big].$   (8)

The quantity inside the expectation can be seen as a per-sample loss for policy training. It is essential to realize that the control Hamiltonian allows us to find the optimal control via this unconstrained, pointwise (i.e., per $(t, \mathbf{x})$ pair) minimization because the future cost and the constraint Lagrangian have already been included. It is, therefore, not necessary to perform Monte-Carlo-style rollouts to find the optimal control.

The MPC loop presented in Sec. II-A serves as a data generation mechanism for the policy search module. In general terms, the MPC fills a replay buffer with data points that correspond to the states it has encountered, and these tuples are sampled to compute the empirical expectation in (8). In our implementation, the samples for computing the policy gradient are drawn uniformly at random from the replay buffer, thereby breaking the temporal correlation of our samples [32].
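As an illustration of this buffer (our own minimal sketch, not the paper's implementation), uniform sampling from a bounded replay memory can look as follows:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (t, x, dV/dx, nu) tuples and draws uniform random batches,
    which breaks the temporal correlation of consecutive MPC samples [32]."""

    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)   # oldest samples are dropped first

    def append(self, sample):
        self.storage.append(sample)

    def sample_batch(self, batch_size=32):
        return random.sample(self.storage, min(batch_size, len(self.storage)))
```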

II-C Sampling from an MPC Solution

A favorable property of DDP solvers is that they compute a second-order approximation of the optimal value function in the vicinity of the nominal state trajectory. In turn, the control Hamiltonian can also be calculated in a region around the MPC solution. Given feasible, random starting points, the area where the value function is known corresponds to the subset of states that are visited by a (close-to) optimal policy.

This fact can be exploited to increase the informational content extracted from an MPC rollout. By sampling around the nominal state, our data automatically covers tubes in state space, which accelerates learning and makes the learned policy more robust. This procedure, denoted sampleInNeighborhood in Alg. 1, amounts to drawing states from a Gaussian distribution according to

$\mathbf{x}_s \sim \mathcal{N}\big(\mathbf{x}^*(t), \Sigma\big),$   (9)

where the covariance matrix $\Sigma$ has diagonal entries corresponding to the typical disturbance that the respective state component may encounter. The sampling idea is conceptually similar to fitting the tangent space of the demonstrator policy instead of just the nominal control command [16].
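A minimal sketch of this sampling step, assuming NumPy arrays and per-component standard deviations (the actual noise magnitudes used in the paper are not reproduced here):

```python
import numpy as np

def sample_in_neighborhood(x_nominal, sigma, rng=None):
    """Draw a state near the nominal MPC state, cf. Eq. (9).

    `sigma` holds one standard deviation per state component, chosen to match
    the typical disturbance that component may encounter.
    """
    rng = np.random.default_rng() if rng is None else rng
    return x_nominal + rng.normal(0.0, sigma, size=np.shape(x_nominal))
```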

Unfortunately, despite our efforts to extract samples from MPC that cover a large volume in state space, the state distribution remains biased towards states that are encountered by the optimal MPC policy. This distribution mismatch is a common problem in IL and stems from the fact that a learned controller inevitably produces different control inputs than the demonstrator (even when fully converged, unstable physical systems may amplify small differences), which will eventually drive the system into an area of the state space from which no data is available. To avoid this scenario, we use a behavioral policy to push the emulated MPC loop towards the states that will be seen by the learned policy. Taking inspiration from DAgger [26], the update rule for the next state (stepSystem method in Alg. 1) is given by

$\mathbf{u}_b(\mathbf{x}, t) = (1 - \alpha)\, \mathbf{u}^{\mathrm{MPC}}(\mathbf{x}, t) + \alpha\, \pi_{\theta}(\mathbf{x}, t),$   (10)
$\mathbf{x}_{k+1} = \mathbf{x}_k + \int_{t_k}^{t_{k+1}} \mathbf{f}\big(\mathbf{x}(\tau), \mathbf{u}_b(\mathbf{x}(\tau), \tau), \tau\big)\, d\tau,$   (11)

where the mixing parameter $\alpha$ is initially zero and linearly increases with the number of iterations until it has reached one in the final iteration. Through this process, the learned policy is gradually given more responsibility to decide where the OC algorithm should be applied. It is important to note that the OC solver is not influenced by the learned policy and produces optimal solutions independent of the value of $\alpha$.
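A sketch of the corresponding state update, using an explicit Euler discretization of (10)-(11) for illustration (the integrator choice and the callables u_mpc, policy, and dynamics are our assumptions):

```python
def step_system(x, t, dt, alpha, u_mpc, policy, dynamics):
    """Advance the emulated MPC loop by one time step under the behavioral
    policy: alpha = 0 follows the MPC demonstrator, alpha = 1 the learned
    policy. The MPC solution itself is never affected by alpha."""
    u_b = (1.0 - alpha) * u_mpc(x, t) + alpha * policy(x, t)   # mixing, Eq. (10)
    return x + dt * dynamics(x, u_b, t)                        # explicit Euler step of Eq. (11)
```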

II-D Policy Structure and Training

Now that the loss function and a way to populate our experience buffer are defined, we turn to the actual training procedure and the computation of stochastic gradients of our policy.

In this work, we use a mixture-of-experts architecture [27] for the control policy, shown in Fig. 2. Allowing multiple policies to compete naturally handles the non-uniqueness of the OC solution. For example, passing an obstacle around the left or right side may be an equally good choice that two different experts will try to imitate, but forcing a monolithic network to interpolate between these solutions can be catastrophic.

Fig. 2: Architecture of our mixture-of-experts network. The dimensions correspond to the instantiation for the ANYmal robot.

The final control output of the network is a convex combination of the outputs of the $E$ expert sub-policies

$\pi_{\theta}(\mathbf{x}, t) = \sum_{i=1}^{E} p_i(\mathbf{x}, t)\, \pi^{(i)}(\mathbf{x}, t).$   (12)

The mixing coefficients $p_i$ are the output of a gating network whose final activation ensures that all coefficients are positive and sum up to one. While a softmax layer achieves this constraint, we find that a sigmoid activation with subsequent normalization performs better in selecting a consistent number of experts for a given task across multiple training runs. We believe the reason for this observation is that the softmax activation is too sharp in selecting one specific expert, such that an unlucky initialization may lead to some experts never even being considered and therefore not receiving policy updates.
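For illustration, a PyTorch-style gating head with the sigmoid-plus-normalization activation described above; the layer sizes are placeholders rather than the exact architecture of Fig. 2:

```python
import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    """Maps the shared latent representation to E positive mixing coefficients
    that sum to one, without the winner-take-all sharpness of a softmax."""

    def __init__(self, latent_dim=32, num_experts=8):
        super().__init__()
        self.linear = nn.Linear(latent_dim, num_experts)

    def forward(self, z):
        p = torch.sigmoid(self.linear(z))        # each coefficient in (0, 1)
        return p / p.sum(dim=-1, keepdim=True)   # normalize so they sum to 1
```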

Both the expert sub-policies and the gating network share a common latent space representation. The overall policy (12) remains a feed-forward neural network and can, therefore, be trained with standard deep learning optimization techniques: At each policy iteration step, we draw a batch of $B$ tuples from the replay buffer and compute the empirical loss for this batch as

$\ell(\theta) = \frac{1}{B} \sum_{j=1}^{B} \sum_{i=1}^{E} p_i(\mathbf{x}_j, t_j)\, H\big(\mathbf{x}_j, \pi^{(i)}(\mathbf{x}_j, t_j), t_j\big).$   (13)

Note that we force each expert’s output to individually minimize the Hamiltonian to encourage specialization [27]. This procedure is slightly different from inserting (12) into (8), which would only encourage their combined output to be optimal. Training the optimal policy involves taking gradient steps in the parameter space $\theta$. The policy gradient of the loss function (13) for a given sample is equal to

$\nabla_{\theta}\, \ell = \sum_{i=1}^{E} \Big[ H\big(\mathbf{x}, \pi^{(i)}(\mathbf{x}, t), t\big)\, \nabla_{\theta}\, p_i(\mathbf{x}, t) + p_i(\mathbf{x}, t)\, \frac{\partial H}{\partial \mathbf{u}}\Big|_{\mathbf{u} = \pi^{(i)}}\, \nabla_{\theta}\, \pi^{(i)}(\mathbf{x}, t) \Big].$   (14)

The control derivative of the Hamiltonian $\partial H / \partial \mathbf{u}$ is computed as a byproduct of solving the MPC problem, whereas the gradients of $p_i$ and $\pi^{(i)}$ with respect to $\theta$ can be calculated by backpropagation.
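A hedged PyTorch sketch of one evaluation of the loss (13); the per-sample interface (a net_input tensor and a differentiable hamiltonian(u, sample) assembled from the stored solver data) is our own assumption about how the replay-buffer tuples are consumed:

```python
import torch

def mixture_loss(trunk, gating, experts, hamiltonian, batch):
    """Empirical loss of Eq. (13): each expert's output is pushed to minimize
    the Hamiltonian individually, weighted by its gating coefficient.

    Assumed interfaces (ours, not the released code): `sample.net_input` is the
    network input tensor built from (t, x), and `hamiltonian(u, sample)` returns
    a torch scalar, differentiable in u, built from the stored solver data.
    """
    loss = torch.zeros(())
    for sample in batch:
        z = trunk(sample.net_input)          # shared latent representation
        p = gating(z)                        # mixing coefficients p_i, shape (E,)
        for i, expert in enumerate(experts):
            u_i = expert(z)                  # expert i's control output
            loss = loss + p[i] * hamiltonian(u_i, sample)
    return loss / len(batch)

# One policy update (Alg. 1, lines 16-19):
#   loss = mixture_loss(trunk, gating, experts, hamiltonian, batch)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```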

III Results

We assess the policy structure and loss function of the MPC-Net algorithm separately to highlight the performance of our method and justify individual design choices.

III-A Experimental Setup

The results presented in this document are produced with the quadrupedal robot ANYmal (Fig. 3), which is an example of a hybrid system with time-varying flow map and constraints. The constraints encode zero contact forces for a foot in swing phase and zero velocity when in stance phase.

Fig. 3: The quadrupedal robot ANYmal. The floating base and three joints per leg amount to 18 DOF. Our kinodynamic model of this robot has 24 states and 24 inputs.

Our kinodynamic model amounts to 24 states (base pose, base twist, joint angles) and 24 control inputs (joint velocities, foot contact forces). The control commands from our policy are fed to a whole-body tracking controller that computes the final actuator torque commands. Instead of providing the absolute time to the network, it is more expedient to encode the phase of the gait cycle of the legged robot. By abuse of notation, we therefore define four ‘time’ variables, one per leg, each of which is zero during stance phases and describes half a period of a sine wave during the swing motion.
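One possible implementation of this phase encoding, based purely on the description above (the contact-schedule format is a hypothetical assumption, not the paper's interface):

```python
import math

def leg_phase_encoding(t, swing_intervals):
    """Per-leg 'time' input for the network: 0 during stance, and half a sine
    period (rising from 0 to 1 and back to 0) over the course of each swing.

    `swing_intervals` is a list with one entry per leg, each entry being a list
    of (t_start, t_end) swing windows.
    """
    phases = []
    for intervals in swing_intervals:               # one entry per leg
        value = 0.0                                 # default: stance
        for t_start, t_end in intervals:
            if t_start <= t < t_end:
                progress = (t - t_start) / (t_end - t_start)   # in [0, 1)
                value = math.sin(math.pi * progress)           # half sine period
                break
        phases.append(value)
    return phases
```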

We use a quadratic OC cost function (1) of the form

$L(\mathbf{x}, \mathbf{u}, t) = \tfrac{1}{2}\big(\mathbf{x} - \mathbf{x}_{\mathrm{ref}}\big)^{\top} Q\, \big(\mathbf{x} - \mathbf{x}_{\mathrm{ref}}\big) + \tfrac{1}{2}\,\mathbf{u}^{\top} R\, \mathbf{u},$   (15)
$\Phi(\mathbf{x}) = \tfrac{1}{2}\big(\mathbf{x} - \mathbf{x}_{\mathrm{ref}}\big)^{\top} Q_f\, \big(\mathbf{x} - \mathbf{x}_{\mathrm{ref}}\big).$   (16)

The reference states $\mathbf{x}_{\mathrm{ref}}$ encourage the system to return to the origin with a trotting or static walk gait and then maintain a nominal configuration. Our quadratic cost structure, together with the fact that our constraints and dynamics are input-affine, makes the Hamiltonian a quadratic function in $\mathbf{u}$.
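Concretely, this means the input-dependent Hamiltonian at a sample can be written in the generic quadratic form (a sketch in our notation, not the paper's exact expression)

$H(\mathbf{x}, \mathbf{u}, t) = \tfrac{1}{2}\,\mathbf{u}^{\top} \mathbf{R}_H(\mathbf{x}, t)\,\mathbf{u} + \mathbf{r}_H(\mathbf{x}, t)^{\top}\mathbf{u} + c_H(\mathbf{x}, t),$

so both the per-sample loss and its control derivative $\partial H / \partial \mathbf{u} = \mathbf{R}_H \mathbf{u} + \mathbf{r}_H$ are cheap to evaluate for any candidate policy output.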

Since our loss function (13) directly depends on the sampled data, it has a high variance and is not a suitable termination criterion for the training process. We monitor the performance of our policy by computing a rollout of the system dynamics with the learned policy from random initial points. A rollout lasts 3 s but is terminated early if the pitch or roll angle exceeds 30° or the height deviates more than 20 cm from the nominal value. This procedure can be seen as a test set for our learning approach. The resulting average rollout cost (1) and the survival time are good performance indicators for the policy.
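A sketch of this monitoring rollout, using the thresholds from the text (the accessors tilt_angles and height_error, as well as dynamics and running_cost, are assumed interfaces):

```python
import math

def evaluate_policy(policy, dynamics, running_cost, tilt_angles, height_error,
                    x0, dt=0.0025, max_time=3.0):
    """Roll out the learned policy from x0, accumulate the cost (1), and stop
    early on excessive tilt or height deviation.

    `tilt_angles(x)` returns (roll, pitch) in radians and `height_error(x)` the
    base-height deviation in meters.
    """
    x, t, cost = x0, 0.0, 0.0
    max_tilt = math.radians(30.0)
    while t < max_time:
        roll, pitch = tilt_angles(x)
        if abs(roll) > max_tilt or abs(pitch) > max_tilt or abs(height_error(x)) > 0.20:
            break
        u = policy(x, t)
        cost += dt * running_cost(x, u, t)
        x = x + dt * dynamics(x, u, t)   # assumes states support this arithmetic (e.g., numpy arrays)
        t += dt
    return t, cost                        # survival time [s] and accumulated rollout cost
```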

All hyper-parameters of our algorithm are summarized in Tab. I. The network weights are randomly initialized before training and optimized with the Adam optimizer [33]. We take the data from MPC as is, without any pruning of failed rollouts or outlier states. For the following comparisons, we execute five training runs for each configuration and average the results.

maxIter             100'000     mpcDecimation        500
rollout length      3 s         replay buffer size   100'000
time step           0.0025 s    number of experts E  8
learning rate       1e-3        batch size           32
TABLE I: Hyperparameters of MPC-Net

III-B Loss Function

The first experiment compares our proposed Hamiltonian (8) as a loss function with a simpler BC loss that encourages matching of the demonstrator’s control command,

$\ell_{\mathrm{BC}} = \big(\mathbf{u}^{\mathrm{MPC}} - \pi_{\theta}(\mathbf{x}, t)\big)^{\top} R\, \big(\mathbf{u}^{\mathrm{MPC}} - \pi_{\theta}(\mathbf{x}, t)\big).$   (17)

We use the control cost matrix $R$ here to normalize the different control dimensions. We see in Fig. 4 that the simpler loss (17) results in similar convergence to a stable control law, but the Hamiltonian loss consistently achieves a lower constraint violation value. When deployed in simulation, the policy trained on (17) tends to fall after a few footsteps as constraint violation errors accumulate.

We conjecture that the structure of the Hamiltonian, which includes constraint violation penalties explicitly, encourages the learning algorithm to respect constraints more carefully than in the case of only observing constraint-consistent demonstrations. Note that our loss would inform the learner about constraint violations even if the demonstrations violated them.

Fig. 4: Comparison between minimization of the control Hamiltonian and a simpler loss penalizing differences in policy outputs.

III-C Sample Efficiency

Next, we show in Fig. 5 how sampling around the nominal MPC trajectory influences the learning process for a quadruped walking motion. There is no noticeable effect on the loss function (i.e., the value of the Hamiltonian) throughout the process, which also suggests that this value is not a good indicator of the actual performance of the policy. Instead, a clear effect can be seen in the progression of the survival time. The plot suggests that the additionally sampled states provide valuable information for the training algorithm to learn faster and stabilize the system more consistently at the end of the training. More importantly, we observe that the policy that is trained only on nominal samples reacts overly aggressively to small deviations in the system’s state. These strong gains lead to oscillatory behavior when deployed on the real system, where sensors and the state estimator inevitably introduce noise. Consequently, only the policy that is trained with additional samples around the nominal MPC solution is robust and smooth enough to stabilize the system under noisy state estimates. Evidence of this result is shown in the video (https://youtu.be/i4CLPc7wxzw).

Finally, experiments show that the policies with sampling become usable on the robot at approximately 75% of the maximum number of iterations, indicating that sampling also increases the effective amount of information extracted from each MPC call and thereby reduces the number of MPC calls needed. Our algorithm, therefore, learns to stabilize a walking robot from an experience buffer that is equivalent to running the robot for eight minutes with an MPC controller. Notably, this time scale opens up the possibility of learning directly on a real system.

Fig. 5: Effect of collecting additional samples around the nominal MPC trajectory. The maximum duration of a policy rollout is 3 s. Five independent experiments are averaged for each setting.

III-D Mixture-of-Expert Architecture

In this experiment, we compare the performance of our mixture-of-experts architecture to a classical MLP network of the form

$\pi_{\theta}(\mathbf{x}, t) = \mathrm{MLP}_{\theta}(\mathbf{x}, t)$   (18)

with a latent space of the same size as that of the expert mixture. (We also tested deeper and wider MLP architectures but could not observe improved performance.) While both architectures achieve similar convergence to a stable controller, Fig. 6 shows that the expert mixture network reaches a significantly better constraint violation score.

We allow the expert mixture network to use 8 experts for training. Interestingly, the gating network decides to use fewer experts, and switching between these sub-policies happens precisely at the times when the contact configuration of the system changes. For a trotting gait, only three experts are needed (blue expert for the first pair of diagonal legs, a mixture of red and black for the other pair, and red for the final stance phase), while a static walk selects four experts, one per swing leg.

This result shows that the policy learns to select an appropriate expert in different domains of the state space. Moreover, a specialized expert that focuses only on a specific contact configuration learns to obey the constraints better than a single policy for all phases of the gait.

Fig. 6: The top graph shows a comparison of constraint violation during training between the expert mixture network and an MLP of equivalent size. The bottom two graphs display the output of the expert gating network for two different gaits (one color per expert). Switching times correspond exactly to changes in the contact configuration, and the pattern repeats periodically with the period of the gait.

III-E Robot Control

Finally, we test our algorithm on the physical ANYmal robot. We verify that both a trotting and a static walk gait can be learned from the MPC oracle using the same network structure and identical hyperparameters. Despite the seemingly more stable static walk, both gaits pose a comparable level of difficulty to the learning algorithm which manifests in similar convergence properties. The attached video shows the robot’s behavior under our learned policy.

We test the policy’s ability to return to the origin by starting the robot at a nonzero initial displacement and yaw rotation. In Fig. 7, we plot the resulting state trajectories of the x-y position as well as the yaw angle, confirming that the network succeeds in the regulation task without overshoot.

Fig. 7: Time evolution of ANYmal’s base position and yaw angle under the trained policy. All quantities return to zero with minimal overshoot.

IV Conclusion

In this work, we explored a variant of MPC-guided policy search to learn a feedback control law. Contrary to other imitation learning approaches, which try to mimic the control commands of a teacher, our formulation is based on minimizing the control Hamiltonian. The optimization corresponds to solving the OC problem with a restricted family of control laws. We show that our algorithm is capable of learning a feedback policy for two different gaits of a walking robot from less than 10 minutes of demonstration data.

By design, our method cannot outperform the MPC policy, because it optimizes the same cost function, and we cannot learn in areas where the OC algorithm does not find a solution. However, the improved speed in control evaluation may very well stabilize motions that were not possible before.

A related limitation is the lack of exploration, as our policy search method will fall into the same local minima that the OC optimizer found. Future research is necessary to investigate how policies could request new samples from the MPC to improve in areas where the optimal control is still uncertain.

References

  • [1] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” in Robotics: Science and Systems XIV, 2018.
  • [2] A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, and V. Vanhoucke, “Policies modulating trajectory generators,” in Conf. on Robot Learning (CoRL), 2018, pp. 916–926.
  • [3] T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine, “Learning to walk via deep reinforcement learning,” CoRR, vol. abs/1812.11103, 2018.
  • [4] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019.
  • [5] Z. Xie, P. Clary, J. Dao, P. Morais, J. W. Hurst, and M. van de Panne, “Iterative reinforcement learning based design of dynamic locomotion skills for cassie,” CoRR, vol. abs/1903.09537, 2019.
  • [6] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
  • [7] W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell, “Deeply aggrevated: Differentiable imitation learning for sequential prediction,” in Int. Conf. on Machine Learning ICML, 2017, pp. 3309–3318.
  • [8] H. Park, P. M. Wensing, and S. Kim, “Online planning for autonomous running jumps over obstacles in high-speed quadrupeds,” in Robotics: Science and Systems XI, 2015.
  • [9] M. Naveau, M. Kudruss, O. Stasse, C. Kirches, K. Mombaur, and P. Souères, “A reactive walking pattern generator based on nonlinear model predictive control,” IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 10–17, 2017.
  • [10] F. Farshidian, E. Jelavic, A. Satapathy, M. Giftthaler, and J. Buchli, “Real-time motion planning of legged robots: A model predictive control approach,” in IEEE-RAS Int. Conf. on Humanoid Robotics (Humanoids), Nov 2017, pp. 577–584.
  • [11] M. Neunert, M. Stäuble, M. Giftthaler, C. D. Bellicoso, J. Carius, C. Gehring, M. Hutter, and J. Buchli, “Whole-body nonlinear model predictive control through contacts for quadrupeds,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1458–1465, 2018.
  • [12] A. W. Winkler, C. D. Bellicoso, M. Hutter, and J. Buchli, “Gait and trajectory optimization for legged systems through phase-based end-effector parameterization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1560–1567, 2018.
  • [13] J. Carius, R. Ranftl, V. Koltun, and M. Hutter, “Trajectory optimization for legged robots with slipping motions,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3013–3020, 2019.
  • [14] N. D. Ratliff, D. M. Bradley, J. A. Bagnell, and J. E. Chestnutt, “Boosting structured prediction for imitation learning,” in Advances in Neural Information Processing Systems, 2006, pp. 1153–1160.
  • [15] P. Abbeel, A. Coates, and A. Y. Ng, “Autonomous helicopter aerobatics through apprenticeship learning,” Int. J. Robotics Res., vol. 29, no. 13, pp. 1608–1639, 2010.
  • [16] I. Mordatch and E. Todorov, “Combining the benefits of function approximation and trajectory optimization,” in Robotics: Science and Systems X, 2014.
  • [17] S. Levine and V. Koltun, “Guided policy search,” in Int. Conf. on Machine Learning ICML, 2013, pp. 1–9.
  • [18] ——, “Learning complex neural network policies with trajectory optimization,” in Int. Conf. on Machine Learning ICML, 2014, pp. 829–837.
  • [19] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, “PLATO: policy learning using adaptive trajectory optimization,” in IEEE Int. Conf. on Robotics and Automation ICRA, 2017, pp. 3342–3349.
  • [20] S. Choudhury, A. Kapoor, G. Ranade, S. Scherer, and D. Dey, “Adaptive information gathering via imitation learning,” in Robotics: Science and Systems XIII, 2017.
  • [21] Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani, “Data efficient reinforcement learning for legged robots,” CoRR, vol. abs/1907.03613, 2019.
  • [22] C. G. Atkeson and J. Morimoto, “Nonparametric representation of policies and value functions: A trajectory-based approach,” in Advances in Neural Information Processing Systems NIPS, 2002, pp. 1611–1618.
  • [23] M. Zhong, M. Johnson, Y. Tassa, T. Erez, and E. Todorov, “Value function approximation and model predictive control,” in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning ADPRL, 2013, pp. 100–107.
  • [24] N. Mansard, A. DelPrete, M. Geisert, S. Tonneau, and O. Stasse, “Using a memory of motion to efficiently warm-start a nonlinear predictive controller,” in IEEE Int. Conf. on Robotics and Automation ICRA, 2018, pp. 2986–2993.
  • [25] S. Ross and D. Bagnell, “Efficient reductions for imitation learning,” in Int. Conf. on Artificial Intelligence and Statistics AISTATS, 2010, pp. 661–668.
  • [26] S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Int. Conf. on Artificial Intelligence and Statistics AISTATS, 2011, pp. 627–635.
  • [27] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
  • [28] F. Farshidian, M. Neunert, A. W. Winkler, G. Rey, and J. Buchli, “An efficient optimal planning and control framework for quadrupedal locomotion,” in IEEE Int. Conf. on Robotics and Automation ICRA, May 2017, pp. 93–100.
  • [29] W. Li and E. Todorov, “Iterative linear quadratic regulator design for nonlinear biological movement systems,” in Int. Conf. on Informatics in Control, Automation and Robotics ICINCO, 2004, pp. 222–229.
  • [30] R. Grandia, F. Farshidian, R. Ranftl, and M. Hutter, “Feedback MPC for torque-controlled legged robots,” CoRR, vol. abs/1905.06144, 2019.
  • [31] D. P. Bertsekas, Dynamic programming and optimal control, 3rd Edition.   Athena Scientific, 2005.
  • [32] L. J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine Learning, vol. 8, pp. 293–321, 1992.
  • [33] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. on Learning Representations ICLR, 2015.