The control of robotic systems with fast and unstable dynamics requires carefully designed feedback controllers. Hybrid, underactuated walking robots pose an especially challenging setting in this respect.
Recent successes in Reinforcement Learning (RL) demonstrate sophisticated walking robot control [1, 2, 3, 4, 5], yet a large number of policy rollouts need to be collected to reach the required performance level. It is, therefore, common practice to use physics simulators during training and subsequently attempt a sim-to-real transfer [1, 4].
Imitation Learning (IL) appears to be a promising method that could reduce the sampling needs of learning-based approaches by guiding them with expert demonstrations. When good demonstrations are available, sampling efficiency can be drastically improved over classical RL.
An appealing way to automatically generate such demonstrations for modeled dynamical systems is Optimal Control (OC) and its real-time counterpart, Model Predictive Control (MPC). They provide a formal framework for generating control commands that respect physical constraints and optimize a performance criterion. Knowledge of a system model and its gradients enables MPC to discover complex robot behaviors in a very sample-efficient way [8, 9, 10, 11, 12, 13]. Unfortunately, when deploying on a robot, the entire optimization problem has to be solved online because the resulting control policy is only valid around the current state. Moreover, the robustness against disturbances, both intrinsic (e.g., modeling errors) and external, depends critically on the assumption that a new motion plan can be generated sufficiently fast. Even for moderately complex systems, the update frequency of MPC becomes a limiting factor when deploying on onboard computers.
Learning from OC solutions has proven a viable option for robot control that combines the advantages of both approaches [14, 15, 16, 17, 18, 19, 20, 21]. The benefit of using a solver as expert demonstrator over humans or animals is that there is no domain adaptation problem, and one can query demonstrations from arbitrary states. Additionally, one may request the solver to explicitly handle constraints instead of only presuming that demonstrations are constraint consistent.
Several methods take an inverse OC approach to IL: Multiple local approximations of the value function, computed by MPC runs, are aggregated into a single global approximation [22, 23, 24]. The learned value function and its induced optimal policy are in turn used to reduce the MPC horizon or speed up convergence. Alternatively, a Behavioral Cloning (BC) approach to IL attempts to directly learn a policy that reproduces the expert’s demonstrations without maintaining a value function explicitly. Accordingly, the original RL problem is transformed into a supervised learning problem since the demonstrator’s actions can be interpreted as labels.
Our proposed algorithm belongs to the family of such actor-only approaches: We introduce MPC-Net, a policy search method that is guided by an OC solver to find a neural network control policy. The method can be seen as a policy iteration scheme that draws data from a perfect critic (i.e., the MPC). Our key innovation is a theoretically motivated loss function, which is based on the minimization of the control Hamiltonian. The structure of the control Hamiltonian captures the system dynamics and constraints of the control problem. We show that this learning objective has favorable properties in terms of convergence and constraint satisfaction, which is particularly important for hybrid systems.
Closely related to our algorithm are policy search methods with a teacher-learner setup [17, 18, 19]. These works employ an OC solver as a teacher from which a policy is learned. Contrary to our work, however, the teacher adapts to the student. This assimilation is achieved by adding a penalty term to the OC cost function so that demonstrations are created that remain close to the student’s policy. Additionally, the student’s objective is usually the optimization of a distance metric between the student’s and the teacher’s policy outputs. However, minimizing a distance may not correspond to performance: in constrained settings, for example, it is usually more important to satisfy constraints than to mimic the teacher accurately. In our approach, no such choice of a distance metric has to be made. Notably, our learner is never presented with the optimal control input. Additionally, since our demonstrator does not adapt to the current policy of the learner, all demonstration samples remain valid and can be re-used, thereby boosting sampling efficiency.
Imitating a demonstrator that is not adaptive to the learner induces the problem of distribution matching: Inevitable approximation errors between the learned and demonstrated policies make rollouts of the learned policy encounter a different distribution of states than the one from demonstration data. Ross et al. [25, 26] show that the resulting errors can compound quadratically in the time horizon. We use elements of their proposed solutions (probabilistic mixing and dataset augmentation) to ensure that the distributions match. Simply put, we bias the demonstrator’s query states towards the observations that our policy sees and thereby receive samples that match the learner’s distribution better.
While the idea of policy search through minimization of the control Hamiltonian applies to arbitrary parameterized policies such as neural networks, weighted motion primitives, or spline coefficients, we consider the very general class of mixture-of-expert neural network policies in this work. Our choice caters for the fact that OC is an inverse problem with potentially multiple solutions for the same observation. The expert data may, therefore, exhibit such multimodal behavior. We show that this choice of network structure has favorable properties in terms of convergence and constraint satisfaction and is particularly suitable for controlling legged robots since these systems inherently exhibit multimodal dynamics.
Statement of Contributions
The contributions of this work are as follows:
- Derivation of a policy search algorithm with a loss function that is derived from fundamental concepts of OC.
- Evidence that the explicit handling of constraints in our loss function achieves improved constraint satisfaction compared to minimization of a distance metric in terms of policy outputs.
- Demonstration of improved efficiency in terms of MPC calls by exploiting a local approximation of the OC value function.
- Evidence that a mixture-of-expert network architecture outperforms a general Multilayer Perceptron (MLP) for control of a walking robot.
- Validation of the trained control policies on robotic hardware. The learned controllers successfully stabilize two different gaits on a quadrupedal robot.
from a feasible, random initial state. Samples from the resulting optimal trajectories are stored in a replay buffer. At each policy update step, we construct a loss function by drawing a batch of the stored samples and perform a stochastic gradient descent step in the policy parameter space.
In this section, we first explain the MPC problem and the structure of its solution. Subsequently, we present the theoretical properties of the OC problem and how they motivate our loss function. Finally, we show how a neural network policy is trained from MPC demonstrations.
II-A Model Predictive Control
We consider a continuous-time, finite horizon OC problem
where $T$ is the time horizon, $x_0$ a given initial state, $\Phi$ the final cost, and $l$ the intermediate cost function. The system flow map $f$ and the constraints $g$ and $h$ may be time-dependent, for example to represent a hybrid walking robot.
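For reference, a problem of this type is commonly written as shown below. This is a sketch in standard notation rather than a verbatim reproduction: the symbol names ($u$, $x$, $T$, $x_0$, $\Phi$, $l$, $f$, $g$, $h$) are assumed, with (1) taken to be the cost and (2) the constraints referred to in the text.

```latex
\begin{align}
\min_{u(\cdot)} \quad & \Phi\big(x(T)\big) + \int_0^T l\big(x(t), u(t), t\big)\, dt \tag{1} \\
\text{s.t.} \quad & x(0) = x_0, \quad \dot{x} = f(x, u, t), \quad
  g(x, u, t) = 0, \quad h(x, u, t) \geq 0 \tag{2}
\end{align}
```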
In principle, our method works with any OC solver that can handle the constraints (2) and that provides an approximation of the optimal value function. In this work, we employ a Differential Dynamic Programming (DDP)-like algorithm called Sequential Linear-Quadratic (SLQ) control, which is the continuous-time equivalent of the Iterative Linear-Quadratic Regulator (iLQR). This solver handles the inequality constraints through a barrier function and explicitly computes optimal Lagrange multipliers for satisfaction of the state-input equality constraint. The Lagrangian of the OC problem (1) is therefore given by
The solution of the variational problem (1) consists of nominal state and input trajectories as well as time-dependent linear feedback gains that define the optimal control policy
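A policy built from a nominal trajectory and linear feedback gains can be evaluated as in the minimal sketch below. The affine form `u_nom + K (x - x_nom)` is the usual convention for DDP-type solvers and is assumed here; the function name `mpc_policy` and all numerical values are hypothetical.

```python
import numpy as np

def mpc_policy(x, x_nom, u_nom, K):
    """Affine feedback law u = u_nom + K (x - x_nom) at one time instant (assumed form)."""
    return u_nom + K @ (x - x_nom)

# Hypothetical two-state, one-input example.
x_nom = np.array([0.0, 0.0])    # nominal state at the queried time
u_nom = np.array([1.0])         # nominal input at the queried time
K = np.array([[-2.0, -0.5]])    # linear feedback gain at the queried time

u_on_nominal = mpc_policy(x_nom, x_nom, u_nom, K)                  # equals u_nom
u_perturbed = mpc_policy(np.array([0.1, 0.0]), x_nom, u_nom, K)    # feedback correction applied
```

On the nominal trajectory the feedback term vanishes, so the policy returns the nominal input; away from it, the gain corrects the input in proportion to the state deviation.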
As a byproduct of the solver, we also have access to the state derivative of the value function.
During our emulated real-time MPC loop, we let the solver compute the optimal policy, then store the time, the state, the state derivative of the value function, and the Lagrange multipliers at the first time step of the solution in our replay memory. Next, we update the current state using the system dynamics and continue until the rollout length is reached.
II-B Policy Loss Function
The MPC internally computes the optimal value function (cost-to-go), which is defined as
It is a known property of OC [31, p. 111] that the optimal input must satisfy
where $H$ is the input-dependent part of the control Hamiltonian, which directly arises from the Hamilton-Jacobi-Bellman (HJB) equation. A globally optimal policy would have to satisfy (6) at any time $t$ and state $x$. Therefore, the perfect policy search method would involve a very rich family of parameterized policies and minimize the control Hamiltonian in the entire time-state space. Such minimization is impossible because the optimal value function and Lagrange multipliers are not known a priori.
To our benefit, however, MPC can compute the value function along trajectories in state space. For a sufficiently rich class of parameterized functions, one can expect to find parameters that make the policy reproduce the optimal inputs sufficiently closely. Our strategy to find these optimal parameters is, therefore, given by Eq. (6), where we insert the parameterized policy for the input and minimize the expectation over the time-state distribution that results from the MPC trajectories:
The quantity in the expectation can be seen as a per-sample loss for policy training. It is essential to realize that the control Hamiltonian allows us to find the optimal control via this unconstrained pointwise (i.e., per time-state pair) minimization because the future cost and constraint Lagrangian have already been included. It is, therefore, not necessary to perform Monte-Carlo-style rollouts to find the optimal control.
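The relations discussed in this subsection can be summarized as in the sketch below. It assumes the standard definition of the control Hamiltonian, with $L$ the Lagrangian of Sec. II-A, $V$ the optimal value function, $f$ the flow map, and $\pi(x, t; \theta)$ a policy with parameters $\theta$. The symbols and the equation numbers are assumptions chosen to match the references in the text.

```latex
\begin{align}
u^*(x, t) &= \arg\min_{u} \; H(x, u, t), \tag{6} \\
H(x, u, t) &= L(x, u, t) + \partial_x V(x, t)\, f(x, u, t), \tag{7} \\
\theta^* &= \arg\min_{\theta} \; \mathbb{E}_{(t, x)}
  \Big[ H\big(x, \pi(x, t; \theta), t\big) \Big]. \tag{8}
\end{align}
```

Because $\partial_x V$ already encodes the cost-to-go and $L$ already contains the constraint terms, each sample contributes a loss that depends only on the policy output at that single time-state pair.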
The MPC loop presented in Sec. II-A serves as a data generation mechanism for the policy search module. In general terms, the MPC fills a replay buffer with data points that correspond to the states that it has encountered, and these tuples are sampled to compute the empirical expectation in (8). In our implementation, the samples for computing the policy gradient are drawn uniformly at random from the replay buffer, thereby breaking the temporal correlation of our samples.
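A replay buffer with uniform sampling can be sketched as follows. This is an illustrative stand-in, not the implementation used in the experiments: the class name `ReplayBuffer`, the policy of discarding the oldest samples when full, and sampling without replacement within a batch are all assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of MPC samples with uniform random batch sampling."""

    def __init__(self, capacity):
        # Assumption: the oldest samples are discarded once capacity is reached.
        self.data = deque(maxlen=capacity)

    def push(self, sample):
        self.data.append(sample)

    def sample(self, batch_size, rng=random):
        # Uniform draw without replacement; consecutive samples of a rollout
        # are unlikely to land in the same batch, which decorrelates the batch.
        return rng.sample(list(self.data), batch_size)

# Usage with placeholder tuples (time, state); real entries would also hold
# the value-function gradient and the Lagrange multipliers.
buffer = ReplayBuffer(capacity=100_000)
for k in range(100):
    buffer.push((0.0025 * k, [float(k)]))
batch = buffer.sample(32)
```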
II-C Sampling from an MPC Solution
A favorable property of DDP solvers is that they compute a second-order approximation of the optimal value function in the vicinity of the nominal state trajectory. In turn, the control Hamiltonian can also be calculated in a region around the MPC solution. Given feasible, random starting points, the areas where the value function is known correspond to the subset of states that are visited by a (close-to) optimal policy.
This fact can be exploited to increase the information extracted from an MPC rollout. By sampling around the nominal state, our data automatically covers tubes in state space, which accelerates learning and makes the learned policy more robust. This procedure, denoted sampleInNeighborhood in Alg. 1, amounts to drawing states from a Gaussian distribution according to
where the covariance matrix $\Sigma$ has diagonal entries corresponding to the typical disturbance that the respective state component may encounter. The sampling idea is conceptually similar to fitting the tangent space of the demonstrator policy instead of just the nominal control command.
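The neighborhood sampling can be sketched as below, assuming a diagonal covariance whose entries are the squares of per-component disturbance scales. The function name `sample_in_neighborhood` mirrors sampleInNeighborhood from Alg. 1 but is a hypothetical stand-in, and the numerical values are illustrative only.

```python
import numpy as np

def sample_in_neighborhood(x_nom, sigma, num_samples, rng):
    """Draw states x ~ N(x_nom, diag(sigma**2)) around the nominal state."""
    noise = rng.standard_normal((num_samples, x_nom.shape[0]))
    return x_nom + noise * sigma   # scale each state component independently

rng = np.random.default_rng(0)
x_nom = np.array([0.5, 0.0, -0.2])    # nominal state from the MPC solution (made up)
sigma = np.array([0.01, 0.05, 0.1])   # assumed standard deviation per state component
samples = sample_in_neighborhood(x_nom, sigma, 1000, rng)
```

Each sampled state would then be paired with the solver's local approximation of the value function at that state, so that the Hamiltonian can be evaluated without an additional MPC call.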
Unfortunately, despite our efforts to extract samples from MPC that cover a large volume in state space, there is still a bias of the state distribution towards those states that are encountered by the optimal MPC policy. This distribution mismatch is a common problem in IL and stems from the fact that a learned controller inevitably produces different control inputs than the demonstrator (even when fully converged, unstable physical systems may amplify small differences), which will eventually drive the system into an area of the state space for which no data is available. To avoid this scenario, we use a behavioral policy to push the emulated MPC loop towards the states that will be seen by the learned policy. Taking inspiration from DAgger, the update rule for the next state (stepSystem method in Alg. 1) is given by
where the mixing parameter $\alpha$ is initially zero and linearly increases with the number of iterations until it has reached one in the final iteration. Through this process, the learned policy is gradually given more responsibility to decide where the OC algorithm should be applied. It is important to note that the OC solver is not influenced by the learned policy and produces optimal solutions independent of the value of $\alpha$.
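One way to realize such a behavioral policy is sketched below. It assumes that the inputs of the two policies are blended as a convex combination and that the state is advanced with a forward-Euler step; both are assumptions, and the exact update rule may differ. The names `mixing_parameter` and `step_system` and the scalar test system are hypothetical.

```python
import numpy as np

def mixing_parameter(iteration, num_iterations):
    """Linear schedule: 0 at the first iteration, 1 at the final one."""
    return iteration / (num_iterations - 1)

def step_system(x, t, dt, alpha, f, pi_mpc, pi_learned):
    """Advance the state by one forward-Euler step under the behavioral policy."""
    # alpha weights the learned policy, (1 - alpha) the MPC policy (assumed form).
    u = (1.0 - alpha) * pi_mpc(x, t) + alpha * pi_learned(x, t)
    return x + dt * f(x, u, t)

# Toy scalar system x_dot = -x + u with constant placeholder policies.
f = lambda x, u, t: -x + u
pi_mpc = lambda x, t: np.array([0.0])
pi_learned = lambda x, t: np.array([1.0])

alpha_mid = mixing_parameter(5, 11)   # halfway through training
x_next = step_system(np.array([1.0]), 0.0, 0.1, alpha_mid, f, pi_mpc, pi_learned)
```

Only the rollout state is affected by the blend; the MPC solution queried at the resulting state is still the unmodified optimum.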
II-D Policy Structure and Training
Now that the loss function and a way to populate our experience buffer are defined, we turn to the actual training procedure and the computation of stochastic gradients of our policy.
solution. For example, passing an obstacle around the left or right side may be an equally good choice that two different experts will try to imitate, but forcing a monolithic network to interpolate between these solutions can be catastrophic.
The final control output of the network is a convex combination of the outputs of different expert sub-policies
The mixing coefficients are the output of a gating network whose final activation ensures that all coefficients are positive and sum up to one. While a softmax layer achieves this constraint, we find that a sigmoid activation with subsequent normalization performs better in selecting a consistent number of experts for a given task across multiple training runs. We believe the reason for this observation is that the softmax activation is too sharp in selecting one specific expert such that an unlucky initialization may lead to some experts never even being considered and therefore not receiving policy updates.
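The two gating activations can be compared with a small numerical sketch. The function names, the gating logits, and the expert outputs are made up for illustration; only the two normalization schemes themselves come from the text.

```python
import numpy as np

def softmax_gate(z):
    """Softmax over gating logits (shifted by the maximum for numerical stability)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def normalized_sigmoid_gate(z):
    """Element-wise sigmoid followed by normalization to unit sum."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s / s.sum()

z = np.array([2.0, 1.0, -1.0])          # hypothetical gating logits for three experts
p_soft = softmax_gate(z)
p_sig = normalized_sigmoid_gate(z)

# Convex combination of (made-up) expert outputs, one row per expert.
expert_outputs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
u = p_sig @ expert_outputs
```

For the same logits, the normalized sigmoid spreads weight more evenly than the softmax, which is consistent with the observation that fewer experts are starved of updates early in training.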
Both the expert sub-policies and the gating network share a common latent space representation. To train the overall policy (12), we draw a batch of tuples from the replay buffer and compute the empirical loss for this batch as
Note that we force each expert’s output to individually minimize the Hamiltonian to encourage specialization. This procedure is slightly different from inserting (12) into (8), which would only encourage their combined output to be optimal. Training the optimal policy involves taking gradient steps in the parameter space. The policy gradient for the loss function (13) for a given sample is equal to
The control derivative of the Hamiltonian is computed as a byproduct of solving the MPC problem, whereas the gradients of the mixing coefficients and of the expert outputs with respect to the network parameters can be calculated by backpropagation.
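One form of the batch loss and its per-sample gradient that is consistent with this description is sketched below. It assumes $E$ experts with sub-policies $\pi_i$, mixing coefficients $p_i$, a batch of $N$ samples $(t_j, x_j)$, and shared parameters $\theta$. The symbol $\mathcal{J}$ for the loss is chosen here to avoid a clash with the Lagrangian, and the exact weighting is an assumption.

```latex
\begin{align}
\mathcal{J}(\theta) &= \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{E}
  p_i(x_j, t_j; \theta)\, H\big(x_j, \pi_i(x_j, t_j; \theta), t_j\big), \\
\frac{\partial \mathcal{J}_j}{\partial \theta} &= \sum_{i=1}^{E} \left[
  \frac{\partial p_i}{\partial \theta}\, H(x_j, \pi_i, t_j)
  + p_i \left. \frac{\partial H}{\partial u} \right|_{u = \pi_i}
    \frac{\partial \pi_i}{\partial \theta} \right].
\end{align}
```

The second line follows from the product rule and the chain rule: the first term adjusts the gating towards experts with a low Hamiltonian, and the second moves each expert's output along the descent direction of the Hamiltonian, weighted by how strongly that expert is selected.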
We assess the policy structure and loss function of the MPC-Net algorithm separately to highlight the performance of our method and justify individual design choices.
III-A Experimental Setup
The results presented in this document are produced with the quadrupedal robot ANYmal (Fig. 3), which is an example of a hybrid system with time-varying flow map and constraints. The constraints encode zero contact forces for a foot in swing phase and zero velocity when in stance phase.
Our kinodynamic model has 24 states (base pose, base twist, joint angles) and 24 control inputs (joint velocities, foot contact forces). The control commands from our policy are fed to a whole-body tracking controller that computes the final actuator torque commands. Instead of providing the absolute time to the network, it is more expedient to encode the phase of the gait cycle of the legged robot. By abuse of notation, we therefore define four ‘time’ variables, one per leg, each of which is zero during stance phases and describes half a period of a sine wave during the swing motion.
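A per-leg phase variable of this kind can be sketched as follows. The parameterization by a gait period, a swing start time, and a swing duration is an assumption made for illustration, as are the function name `leg_phase_variable` and the timing values.

```python
import math

def leg_phase_variable(t, period, swing_start, swing_duration):
    """Zero during stance; half a sine period (0 -> 1 -> 0) during swing."""
    tau = (t - swing_start) % period          # time since the last swing onset
    if tau < swing_duration:
        return math.sin(math.pi * tau / swing_duration)
    return 0.0

# Hypothetical gait: 1.0 s period, swing from 0.5 s to 0.9 s within each cycle.
mid_swing = leg_phase_variable(0.7, 1.0, 0.5, 0.4)    # peak of the half sine
in_stance = leg_phase_variable(0.2, 1.0, 0.5, 0.4)    # leg on the ground
next_cycle = leg_phase_variable(1.7, 1.0, 0.5, 0.4)   # same phase one period later
```

With one such variable per leg, the network input is periodic in the gait cycle and carries the contact schedule without depending on absolute time.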
We use a quadratic OC cost function (1) of the form
The reference states encourage the system to return to the origin with a trotting or static walk gait and then maintain a nominal configuration. Our quadratic cost structure, together with the fact that our constraints and dynamics are input-affine, makes the Hamiltonian a quadratic function in $u$.
Since our loss function (13) directly depends on the sampled data, it has a high variance and is not a suitable termination criterion for the training process. We monitor the performance of our policy by computing a rollout of the system dynamics with the learned policy from random initial points. A rollout lasts 3 s but is terminated early if the pitch or roll angle exceeds 30° or the height deviates by more than 20 cm from the nominal value. This procedure can be seen as a test set for our learning approach. The resulting average rollout cost (1) and the survival time are good performance indicators for the policy.
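The early-termination rule and the resulting survival time can be sketched as below. The thresholds are taken from the text; the function names, the trajectory format (roll, pitch, height per time step), and the nominal height are hypothetical.

```python
import math

MAX_TILT_RAD = math.radians(30.0)   # limit on roll and pitch
MAX_HEIGHT_DEVIATION_M = 0.20       # limit on deviation from the nominal height

def rollout_failed(roll, pitch, height, nominal_height):
    """True if the state violates the tilt or height limits."""
    return (abs(roll) > MAX_TILT_RAD
            or abs(pitch) > MAX_TILT_RAD
            or abs(height - nominal_height) > MAX_HEIGHT_DEVIATION_M)

def survival_time(trajectory, dt, nominal_height):
    """Time until the first failure, or the full rollout duration if none occurs."""
    for k, (roll, pitch, height) in enumerate(trajectory):
        if rollout_failed(roll, pitch, height, nominal_height):
            return k * dt
    return len(trajectory) * dt

# Made-up rollout: two healthy steps, then a pitch of 1 rad (about 57 degrees).
trajectory = [(0.0, 0.0, 0.5), (0.0, 0.1, 0.5), (0.0, 1.0, 0.5)]
t_survive = survival_time(trajectory, dt=0.0025, nominal_height=0.5)
```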
as is without any pruning of failed rollouts or outlier states. For the following comparisons, we execute five training runs for each configuration and average the results.
| rollout length | 3 s | replay buffer size | 100,000 |
| time step | 0.0025 s | | 8 |
| learning rate | 1e-3 | batch size | 32 |
III-B Loss Function
The first experiment compares our proposed Hamiltonian (8) as a loss function with a simpler BC loss that encourages matching of the demonstrator’s control command
We use the control cost matrix $R$ here to normalize the different control dimensions. We see in Fig. 4 that the simpler loss (17) results in similar convergence to a stable control law, but the Hamiltonian loss consistently achieves a lower constraint violation value. When deployed in simulation, the policy trained on (17) tends to fall after a few footsteps as constraint violation errors accumulate.
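A BC loss of this type can be sketched as a quadratic form of the input error weighted by the control cost matrix. The exact form of (17) is assumed here (for instance, it may include a scaling factor), and the function name `bc_loss` and the numbers are illustrative.

```python
import numpy as np

def bc_loss(u_policy, u_mpc, R):
    """Weighted squared distance (u_policy - u_mpc)^T R (u_policy - u_mpc)."""
    e = u_policy - u_mpc
    return float(e @ R @ e)

R = np.diag([1.0, 4.0])               # hypothetical control cost matrix
u_policy = np.array([1.0, 1.0])       # output of the learned policy
u_mpc = np.array([0.0, 0.5])          # demonstrator's control command
loss = bc_loss(u_policy, u_mpc, R)    # 1.0 * 1.0**2 + 4.0 * 0.5**2
```

Unlike the Hamiltonian loss, this expression contains no constraint terms, so an input error of a given weighted magnitude is penalized equally whether or not it violates a constraint.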
We conjecture that the structure of the Hamiltonian, which includes constraint violation penalties explicitly, encourages the learning algorithm to respect constraints more carefully than in the case of only observing constraint-consistent demonstrations. Note that our loss would inform the learner about constraint violations even if the demonstrations violated them.
III-C Sample Efficiency
Next, we show in Fig. 5 how sampling around the nominal MPC trajectory influences the learning process for a quadruped walking motion. There is no noticeable effect in the loss function (i.e., the value of the Hamiltonian) throughout the process, which also suggests that this value is not a good indicator for the actual performance of the policy. Instead, a clear effect can be seen in the progression of the survival time. The plot suggests that the additionally sampled states provide valuable information for the training algorithm to learn faster and stabilize the system more consistently at the end of the training. More importantly, we observe that the policy that is trained only on nominal samples reacts overly aggressively to small deviations in the system’s state. These strong gains lead to oscillatory behavior when deployed on the real system, where sensors and the state estimator inevitably introduce noise. Consequently, only the policy that is trained with additional samples around the nominal MPC solution is robust and smooth enough to stabilize the system under noisy state estimates. Evidence of this result is shown in the video (https://youtu.be/i4CLPc7wxzw).
Finally, experiments show that the policies trained with sampling become usable on the robot at approximately 75% of the maximum number of iterations. This indicates that sampling increases the amount of information extracted from each MPC solution and thereby reduces the number of MPC calls needed. Our algorithm, therefore, learns to stabilize a walking robot from an experience buffer that is equivalent to running the robot for eight minutes with an MPC controller. Notably, this time scale opens up the possibility of learning directly on a real system.
III-D Mixture-of-Expert Architecture
In this experiment we compare the performance of our mixture-of-expert architecture to a classical MLP network of the form
with a latent space of the same size as that of the expert mixture. (We also tested deeper and wider MLP architectures but could not observe improved performance.) While both architectures achieve similar convergence to a stable controller, Fig. 6 shows that the expert mixture network reaches a significantly better constraint violation score.
We allow the expert mixture network to use 8 experts for training. Interestingly, the gating network decides to use fewer experts, and switching between these sub-policies happens precisely at the times when the contact configuration of the system changes. For a trotting gait, only three experts are needed (blue expert for the first pair of diagonal legs, a mixture of red and black for the other pair, and red for the final stance phase), while a static walk selects four experts, one per swing leg.
This result shows that the policy learns to select an appropriate expert in different domains of the state space. Moreover, a specialized expert that focuses only on a specific contact configuration learns to obey the constraints better than a single policy for all phases of the gait.
III-E Robot Control
Finally, we test our algorithm on the physical ANYmal robot. We verify that both a trotting and a static walk gait can be learned from the MPC oracle using the same network structure and identical hyperparameters. Despite the seemingly more stable static walk, both gaits pose a comparable level of difficulty to the learning algorithm, which manifests itself in similar convergence properties. The attached video shows the robot’s behavior under our learned policy.
We test the policy’s ability to return to the origin by starting the robot at a nonzero initial displacement and yaw rotation. In Fig. 7, we plot the resulting state trajectories of the x-y position as well as the yaw angle, confirming that the network succeeds in the regularization task without overshoot.
In this work, we explored a variant of MPC-guided policy search to learn a feedback control law. Contrary to other imitation learning approaches, which try to mimic the control commands of a teacher, our formulation is based on minimizing the control Hamiltonian. The optimization corresponds to solving the OC problem with a restricted family of control laws. We show that our algorithm is capable of learning a feedback policy for two different gaits of a walking robot from less than 10 minutes of demonstration data.
By design, our method cannot outperform the MPC policy, because it optimizes the same cost function, and we cannot learn in areas where the OC algorithm does not find a solution. However, the improved speed in control evaluation may very well stabilize motions that were not possible before.
A related limitation is the lack of exploration, as our policy search method will fall into the same local minima that the OC optimizer found. Future research is necessary to investigate how policies could request new samples from the MPC to improve in areas where the optimal control is still uncertain.
-  J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” in Robotics: Science and Systems XIV, 2018.
-  A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, and V. Vanhoucke, “Policies modulating trajectory generators,” in Conf. on Robot Learning (CoRL), 2018, pp. 916–926.
-  T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine, “Learning to walk via deep reinforcement learning,” CoRR, vol. abs/1812.11103, 2018.
-  J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019.
-  Z. Xie, P. Clary, J. Dao, P. Morais, J. W. Hurst, and M. van de Panne, “Iterative reinforcement learning based design of dynamic locomotion skills for cassie,” CoRR, vol. abs/1903.09537, 2019.
-  T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1–179, 2018.
-  W. Sun, A. Venkatraman, G. J. Gordon, B. Boots, and J. A. Bagnell, “Deeply aggrevated: Differentiable imitation learning for sequential prediction,” in Int. Conf. on Machine Learning ICML, 2017, pp. 3309–3318.
-  H. Park, P. M. Wensing, and S. Kim, “Online planning for autonomous running jumps over obstacles in high-speed quadrupeds,” in Robotics: Science and Systems XI, 2015.
-  M. Naveau, M. Kudruss, O. Stasse, C. Kirches, K. Mombaur, and P. Souères, “A reactive walking pattern generator based on nonlinear model predictive control,” IEEE Robotics and Automation Letters, vol. 2, no. 1, pp. 10–17, 2017.
-  F. Farshidian, E. Jelavic, A. Satapathy, M. Giftthaler, and J. Buchli, “Real-time motion planning of legged robots: A model predictive control approach,” in IEEE-RAS Int. Conf. on Humanoid Robotics (Humanoids), Nov 2017, pp. 577–584.
-  M. Neunert, M. Stäuble, M. Giftthaler, C. D. Bellicoso, J. Carius, C. Gehring, M. Hutter, and J. Buchli, “Whole-body nonlinear model predictive control through contacts for quadrupeds,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1458–1465, 2018.
-  A. W. Winkler, C. D. Bellicoso, M. Hutter, and J. Buchli, “Gait and trajectory optimization for legged systems through phase-based end-effector parameterization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1560–1567, 2018.
-  J. Carius, R. Ranftl, V. Koltun, and M. Hutter, “Trajectory optimization for legged robots with slipping motions,” IEEE Robotics and Automation Letters, vol. 4, no. 3, pp. 3013–3020, 2019.
-  N. D. Ratliff, D. M. Bradley, J. A. Bagnell, and J. E. Chestnutt, “Boosting structured prediction for imitation learning,” in Advances in Neural Information Processing Systems, 2006, pp. 1153–1160.
-  P. Abbeel, A. Coates, and A. Y. Ng, “Autonomous helicopter aerobatics through apprenticeship learning,” Int. J. Robotics Res., vol. 29, no. 13, pp. 1608–1639, 2010.
-  I. Mordatch and E. Todorov, “Combining the benefits of function approximation and trajectory optimization,” in Robotics: Science and Systems X, 2014.
-  S. Levine and V. Koltun, “Guided policy search,” in Int. Conf. on Machine Learning ICML, 2013, pp. 1–9.
-  ——, “Learning complex neural network policies with trajectory optimization,” in Int. Conf. on Machine Learning ICML, 2014, pp. 829–837.
-  G. Kahn, T. Zhang, S. Levine, and P. Abbeel, “PLATO: policy learning using adaptive trajectory optimization,” in IEEE Int. Conf. on Robotics and Automation ICRA, 2017, pp. 3342–3349.
-  S. Choudhury, A. Kapoor, G. Ranade, S. Scherer, and D. Dey, “Adaptive information gathering via imitation learning,” in Robotics: Science and Systems XIII, 2017.
-  Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani, “Data efficient reinforcement learning for legged robots,” CoRR, vol. abs/1907.03613, 2019.
-  C. G. Atkeson and J. Morimoto, “Nonparametric representation of policies and value functions: A trajectory-based approach,” in Advances in Neural Information Processing Systems NIPS, 2002, pp. 1611–1618.
-  M. Zhong, M. Johnson, Y. Tassa, T. Erez, and E. Todorov, “Value function approximation and model predictive control,” in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning ADPRL, 2013, pp. 100–107.
-  N. Mansard, A. DelPrete, M. Geisert, S. Tonneau, and O. Stasse, “Using a memory of motion to efficiently warm-start a nonlinear predictive controller,” in IEEE Int. Conf. on Robotics and Automation ICRA, 2018, pp. 2986–2993.
-  S. Ross and D. Bagnell, “Efficient reductions for imitation learning,” in Int. Conf. on Artificial Intelligence and Statistics AISTATS, 2010, pp. 661–668.
-  S. Ross, G. J. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Int. Conf. on Artificial Intelligence and Statistics AISTATS, 2011, pp. 627–635.
-  R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87, 1991.
-  F. Farshidian, M. Neunert, A. W. Winkler, G. Rey, and J. Buchli, “An efficient optimal planning and control framework for quadrupedal locomotion,” in IEEE Int. Conf. on Robotics and Automation ICRA, May 2017, pp. 93–100.
-  W. Li and E. Todorov, “Iterative linear quadratic regulator design for nonlinear biological movement systems,” in Int. Conf. on Informatics in Control, Automation and Robotics ICINCO, 2004, pp. 222–229.
-  R. Grandia, F. Farshidian, R. Ranftl, and M. Hutter, “Feedback MPC for torque-controlled legged robots,” CoRR, vol. abs/1905.06144, 2019.
-  D. P. Bertsekas, Dynamic programming and optimal control, 3rd Edition. Athena Scientific, 2005.
-  L. J. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Machine Learning, vol. 8, pp. 293–321, 1992.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int. Conf. on Learning Representations ICLR, 2015.