Nonsmooth optimal value and policy functions for mechanical systems subject to unilateral constraints

State-of-the-art approaches to optimal control of contact-rich robot dynamics use smooth approximations of value and policy functions and gradient-based algorithms for improving approximator parameters. Unfortunately, the dynamics of mechanical systems subject to unilateral constraints--i.e. robot locomotion and manipulation--are generally nonsmooth. We show that value and policy functions generally inherit regularity properties like (non)smoothness from the underlying system's dynamics, and demonstrate this effect in a simple mechanical system. We conclude with a discussion of implications for the use of gradient-based algorithms for optimal control of contact-rich robot dynamics.


I Introduction

In the optimization approach to robot control, a policy is sought that extremizes a given performance criterion; the performance achieved by this optimal policy is the optimal value of the problem. Two widely applied frameworks for solving such problems are reinforcement learning and trajectory optimization. Although many algorithms are available in either framework, scalable algorithms in both leverage local approximations (gradients of values and policies) to iteratively improve toward optimality. In robotics applications like collision–free motion planning, these gradients are guaranteed to exist and can be readily computed or approximated. We show in this paper that these gradients can fail to exist for contact–rich robot dynamics, precluding application of state–of–the–art algorithms for optimal control.

We begin in Section II by modeling contact–rich robot dynamics using mechanical systems subject to unilateral constraints, and describe how nonsmoothness (discontinuity or piecewise–differentiability) manifests in trajectory outcomes and (hence) trajectory costs. Then in Section III we provide mathematical derivations that show nonsmoothness in trajectory outcomes and costs gives rise to nonsmoothness in optimal value and (hence) policy functions. Subsequently in Section IV we present numerical simulations that demonstrate discontinuous or merely piecewise–differentiable optimal value and policy functions in a mechanical system subject to unilateral constraints. Finally in Section V we discuss the prevalence of nonsmoothness and how the lack of classical differentiability prevents gradient–based algorithms from converging to optimality.

II Mechanical systems subject to unilateral constraints

Fig. 1: Piecewise–differentiable and discontinuous trajectory outcomes in a sagittal–plane biped. Panels: (a) touchdown maneuver illustration; (b) liftoff maneuver illustration; (c) touchdown trajectory outcomes; (d) liftoff trajectory outcomes; (e) touchdown value; (f) liftoff value. (a,b) Illustration of two maneuvers (touchdown and liftoff) performed under non–optimal policies that exert different forces depending on which feet are in contact with the ground. In the touchdown maneuver, feet are initially off the ground and trajectories terminate when the body height reaches nadir; in the liftoff maneuver, feet are initially on the ground and trajectories terminate when the body height reaches apex. (c,d) Trajectory outcomes (final body angle) as a function of initial body angle. (e,f) Performance of trajectories as measured by the cost functions in (20), (21). Dashed colored vertical lines on (c–f) indicate corresponding colored outcomes on (a,b).

Fig. 2: Contact modes for touchdown and liftoff maneuvers. Panels: (a) touchdown contact modes; (b) liftoff contact modes. The sagittal–plane biped illustrated in Figure 1(a,b) can be in one of four contact modes corresponding to which subset of the (two) limbs are in contact with the ground; each subset yields different dynamics in (1). (a,b) System contact mode at each time for a given initial body rotation; the body torque input is zero and the leg forces are different in the left and right modes than in the aerial or ground modes. Dashed colored horizontal lines indicate corresponding colored trajectories in Figure 1. The increase in force during the transition to modes left and right in (b) changes the ground reaction force discontinuously, delaying liftoff and causing discontinuous trajectory outcomes in Figure 1(d).

In this section, we formalize a class of models for contact–rich dynamics in robot locomotion and manipulation as mechanical systems subject to unilateral constraints and formulate an optimal control problem for these systems.

II-A Dynamics

Consider the dynamics of a mechanical system with finite–dimensional configuration coordinates subject to a finite number of unilateral constraints specified by a differentiable constraint function. We are primarily interested in systems with multiple constraints, whence we regard the constraint inequality as being enforced componentwise.

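Given any subset of the constraint indices, we consider the function obtained by selecting the component functions of the constraint function indexed by that subset (so its number of components equals the number of elements in the subset), and we regard the corresponding constraint equality as being enforced componentwise.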

It is well–known (cf. [1, Sec. 3] or [2, Sec. 2.4, 2.5]) that, with the contact mode given by the subset of active constraints, the system’s dynamics take the form


The quantities appearing in (1) are: the mass matrix for the mechanical system in the configuration coordinates; the effort map [1], which specifies the internal and applied forces; an external input; the Coriolis matrix determined by the mass matrix; the (Jacobian) derivative of the constraint function with respect to the coordinates; the reaction forces generated in the active contact mode to enforce the constraint; the collision restitution law that instantaneously resets velocities to ensure compatibility with the constraint; and the right– (resp. left–)handed limits of the velocity with respect to time. (The effort map is defined on the tangent bundle of the configuration space; an element of the tangent bundle can be regarded as a pair of generalized configurations and velocities.)

II-B Regularity of dynamics

The seemingly benign equations in (1) can yield dynamics with a range of regularity properties. This issue has been thoroughly investigated elsewhere [3, 4, 1]; here we focus specifically on how design choices in a robot’s mechanical and control systems affect regularity of its dynamics.

In what follows, we will frequently refer to the concept of a control system’s flow, so we briefly review the concept before proceeding. Given a control system (e.g. (1) or (2)) with a state space and an input space, a flow is a function of time, initial state, and input such that, for every initial state and input, the function of time obtained by holding that initial state and input fixed is a trajectory for the control system. Intuitively, the flow “bundles” all trajectories into a single function. Mathematically, the flow is useful for studying how trajectories vary with respect to initial states and inputs. So long as trajectories exist and are unique for every initial state and input, the flow is a well–defined function.
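As a minimal sketch of the flow concept, the following Python snippet writes down the flow of a toy smooth control system, dx/dt = -x + u with constant input u. This system, and the names `flow` and `trajectory`, are assumptions made purely for illustration; they are not the dynamics in (1). The closed form is used so that no numerical integration is needed.

```python
import math

def flow(t, x, u):
    """Flow of the toy control system dx/dt = -x + u with constant input u.

    Closed form: x(t) = u + (x - u) * exp(-t).
    """
    return u + (x - u) * math.exp(-t)

def trajectory(x0, u, times):
    """Trajectory from x0 under input u: the flow with (x0, u) held fixed."""
    return [flow(t, x0, u) for t in times]

# Holding the initial state and input fixed recovers a single trajectory.
times = [0.0, 0.5, 1.0, 2.0]
print(trajectory(1.0, 0.0, times))

# Holding the time fixed shows how the outcome varies with the initial state;
# for this smooth system the dependence is smooth (here, affine in x0).
print([flow(1.0, x0, 0.0) for x0 in (-1.0, 0.0, 1.0)])

# Flows compose in time: flowing for 0.3 and then 0.7 equals flowing for 1.0.
assert abs(flow(0.7, flow(0.3, 1.0, 0.5), 0.5) - flow(1.0, 1.0, 0.5)) < 1e-12
```

For the mechanical systems considered here the flow generally has no closed form, and the smooth dependence on the initial state seen in this toy example is exactly what can be lost.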

It is common to assume that the functions in (1) are continuously–differentiable; however, as illustrated by [1, Ex. 2], this assumption alone does not ensure existence or uniqueness of trajectories. This case contrasts starkly with that of classical control systems, where the equation


yields unique trajectories whose regularity matches the vector field’s: if the vector field is continuously differentiable, then there exists a flow for (2) that is continuously differentiable to the same order.

To ensure trajectories for (1) exist uniquely, restrictions must be imposed; we refer the interested reader to [1, Thm. 10] for a specific result and [2] for a general discussion of this issue. Since we are chiefly concerned with how properties of the dynamics in (1) affect properties of optimal value and policy functions, we will assume in what follows that conditions have been imposed to ensure (1) has a flow for states, inputs, and time horizons of interest.

Assuming that a flow exists for (1) does not provide any regularity properties on the flow; these properties are determined by the design of a robot’s mechanical and control systems and their closed–loop interaction with the environment. For instance: when limbs are inertially coupled (e.g. by rigid struts and joints), so that one limb’s constraint activation instantaneously changes another’s velocity, the flow is discontinuous at configurations where these two limbs activate constraints simultaneously [5, Table 3], [6]; when limbs are force coupled (e.g. by nonlinear damping), so that one limb’s constraint (de)activation instantaneously changes the force on another, the flow can be piecewise–differentiable at configurations where these two limbs (de)activate constraints simultaneously [3, Fig. 1]. In both instances, mechanical design choices lead to nonsmooth dynamics; Figure 1 provides examples where control design choices lead to nonsmooth dynamics (piecewise–differentiable in Figure 1(a,c,e), discontinuous in Figure 1(b,d,f)). Other nonsmooth phenomena can arise, e.g. grazing trajectories (where a constraint function decreases to and then increases from zero without activating the constraint) and Zeno trajectories (where a constraint is activated an infinite number of times on a finite time horizon); in what follows we will focus on the case of simultaneous constraint (de)activations due to its prevalence in robot gaits and maneuvers (see Section V-A for a discussion of when this phenomenon prevails).

II-C Regularity of optimal value and policy functions

A broad class of optimal control problems for the dynamics in (1) can be formulated in terms of final and running costs:


where the trajectory in (3) is the unique one obtained from the given initial state when the given input is applied; in terms of the flow, the trajectory is obtained by evaluating the flow at that initial state and input for all times in the horizon. To expose the dependence of the cost in (3) on the flow, we transcribe the problem in (3) to a simpler form using a standard state augmentation technique (cf. [7, Ch. 4.1.2]):


As discussed in Section II-B, the continuity and differentiability properties of the flow are partly determined by a robot’s design: it is possible for the flow and hence the cost to be discontinuous, continuously–differentiable, or piecewise–differentiable and not continuously–differentiable, depending on the properties of the robot’s mechanical and control systems. In the next section, we study how these continuity and differentiability properties affect the corresponding properties of solutions to (4).

III Continuity and differentiability of optimal value and policy functions

Consider minimization of the cost function with respect to an input :


so long as the state and input domains are compact and the cost is continuous, the function indicated in (5), termed the optimal value function, is well–defined. An optimal policy for (5) is a function that assigns a minimizing input to each state, i.e.


or, equivalently,


In this section we study how regularity properties (continuity, differentiability) of the cost function relate to regularity properties of optimal value and policy functions.

III-A Discontinuous cost functions

If the cost is discontinuous with respect to its first argument, then the optimal policy and value are generally discontinuous as well. This observation is clear in the trivial case that the cost only depends on its first argument, but manifests more generally.
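The following Python sketch illustrates this observation on a toy cost that depends on both arguments. The cost, the input grid, and the function names are hypothetical choices made for illustration and are unrelated to the biped model; the minimization is a brute–force search over a grid of inputs.

```python
def cost(x, u):
    """Toy cost that jumps across x = 0 (hypothetical, for illustration only)."""
    if x < 0.0:
        return (u - 1.0) ** 2
    return (u + 1.0) ** 2 + 1.0

# Candidate inputs on a uniform grid over [-2, 2] with spacing 0.01.
INPUTS = [i / 100.0 for i in range(-200, 201)]

def optimal_policy(x):
    """Input on the grid that minimizes the cost at state x."""
    return min(INPUTS, key=lambda u: cost(x, u))

def optimal_value(x):
    """Minimum cost over the grid at state x."""
    return cost(x, optimal_policy(x))

# Both the policy and the value jump as the state crosses x = 0.
for x in (-0.1, -0.001, 0.0, 0.001, 0.1):
    print(x, optimal_policy(x), optimal_value(x))
```

In this example the discontinuity in the first argument of the cost shifts both the location and the height of the minimum, so neither the policy nor the value is continuous at the origin.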

III-B Continuously–differentiable cost functions

This section contains straightforward calculations based on standard results in classical (smooth) calculus and nonlinear programming; it is provided primarily as a rehearsal for the more general setting considered in the subsequent section.

If the cost is continuously–differentiable, then the first–order necessary condition for optimality [7, Ch. 1.1.1] must hold:


If the cost is two times continuously–differentiable and the second–order sufficient condition [7, Ch. 1.1.2] for strict local optimality for (5) is satisfied at the optimal input,


then the Implicit Function Theorem (IFT) [8, Thm. C.40] can be applied to (7) to choose the optimal input as a continuously–differentiable function of the state near the point of interest. Note that the IFT specifically requires the invertibility tacit in (9):


If (8) and (9) are satisfied, then applying the Chain Rule [8, Prop. C.3] to (8) yields


and applying the Chain Rule to (7) yields


whence we obtain derivatives of the optimal value and policy functions in terms of derivatives of the cost function.

We conclude that if the cost function is two times continuously–differentiable and the first–order necessary (8) and second–order sufficient (9), (10) conditions for optimality and stability of solutions to (5) are satisfied at a given state, then the optimal policy and value functions are continuously–differentiable at that state and their derivatives there can be computed using (11), (12).

Proposition 1

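If the two times continuously–differentiable cost satisfies (8), (9), and (10) at a given state and input, then there exist a neighborhood of that state, a neighborhood of that input, and a continuously–differentiable policy mapping the former into the latter such that the policy returns the given input at the given state and, for every state in the neighborhood, returns the unique minimizer of the cost over inputs in the input neighborhood; the derivative of the policy is given by (11), and the derivative of the value is given by (12).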
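A numerical sanity check of this calculation can be sketched as follows. The expressions used are the standard ones that the Implicit Function Theorem and Chain Rule yield for an unconstrained scalar problem, written in notation assumed here (c for the cost, x for the state, u for the input): the policy derivative is minus the mixed second derivative of c divided by its second derivative in u, and the value derivative is the partial derivative of c in x evaluated at the optimal input. The toy cost and all function names are hypothetical.

```python
import math

def cost(x, u):
    """Smooth toy cost (hypothetical); its minimizer is u = sin(x) - x / 4."""
    return (u - math.sin(x)) ** 2 + 0.5 * x * u

# Central finite differences for the partial derivatives of the cost.
def d_u(x, u, h=1e-5):
    return (cost(x, u + h) - cost(x, u - h)) / (2 * h)

def d_x(x, u, h=1e-5):
    return (cost(x + h, u) - cost(x - h, u)) / (2 * h)

def d_uu(x, u, h=1e-4):
    return (cost(x, u + h) - 2 * cost(x, u) + cost(x, u - h)) / h ** 2

def d_ux(x, u, h=1e-4):
    return (d_u(x + h, u) - d_u(x - h, u)) / (2 * h)

def policy(x, u=0.0, iters=20):
    """Newton iteration on the first-order condition d_u cost(x, u) = 0."""
    for _ in range(iters):
        u -= d_u(x, u) / d_uu(x, u)
    return u

def value(x):
    return cost(x, policy(x))

x0 = 0.7
u0 = policy(x0)

# Derivatives predicted by the implicit-function and chain-rule calculation.
dpolicy_formula = -d_ux(x0, u0) / d_uu(x0, u0)
dvalue_formula = d_x(x0, u0)

# Direct finite differences of the computed policy and value, for comparison.
h = 1e-4
dpolicy_fd = (policy(x0 + h) - policy(x0 - h)) / (2 * h)
dvalue_fd = (value(x0 + h) - value(x0 - h)) / (2 * h)

print(dpolicy_formula, dpolicy_fd, math.cos(x0) - 0.25)
print(dvalue_formula, dvalue_fd)
```

The two estimates of each derivative should agree to within finite–difference error, which is the behavior the proposition predicts when the cost is smooth.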

III-C Piecewise–differentiable cost functions

If the cost is piecewise–differentiable (we use the notion of piecewise–differentiability from [9, Ch. 4.1]: a function is piecewise–differentiable if it is everywhere locally a continuous selection of a finite number of continuously–differentiable functions), then a first–order necessary condition for optimality must hold:


Here and below, the derivative denotes a continuous and piecewise–linear first–order approximation termed the Bouligand (or B–)derivative [9, Ch. 3], which exists by virtue of the cost being piecewise–differentiable [9, Lem. 4.1.3]; the B–derivative at a point is itself a function, which is evaluated along a given direction.

If the cost is two times piecewise–differentiable, and if a sufficient condition [10, Thm. 1] for strict local optimality for (5) is satisfied at the optimal input,


and if the piecewise–linear function


then an Implicit Function Theorem can be applied to choose the optimal input as a piecewise–differentiable function of the state near the point of interest [11, Cor. 3.4]. (This Implicit Function Theorem requires that the function be strongly B–differentiable; the costs considered here are not generally strongly B–differentiable, but they are generally equivalent, in the sense of [12, Thm. 3.1], to strongly B–differentiable functions, whence [11, Cor. 3.4] can be applied indirectly.) Applying the Chain Rule [9, Thm. 3.1.1] to (14) yields (cf. [11, § 3])


and applying the Chain Rule to (7) yields


whence we obtain B–derivatives of the optimal value and policy functions in terms of B–derivatives of the cost.

We conclude that if the cost function is two times piecewise–differentiable and the first–order necessary (14) and second–order sufficient (15), (16) conditions for optimality and stability of solutions to (5) are satisfied at a given state, then the optimal policy and value functions are piecewise–differentiable at that state and their B–derivatives there can be computed using (17), (18).

Proposition 2

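If the two times piecewise–differentiable cost satisfies (14), (15), and (16) at a given state and input, then there exist a neighborhood of that state, a neighborhood of that input, and a piecewise–differentiable policy mapping the former into the latter such that the policy returns the given input at the given state and, for every state in the neighborhood, returns the unique minimizer of the cost over inputs in the input neighborhood; the B–derivative of the policy is given by (17), and the B–derivative of the value is given by (18).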

III-D Conclusions regarding regularity of optimal value and policy functions

The results in Sections III-A, III-B, and III-C suggest that we should generally expect regularity of optimal value and policy functions to match that of the cost function: they should be discontinuous when the cost is discontinuous, or piecewise–differentiable when the cost is piecewise–differentiable. In Section IV we provide instances of the class of models described in Section II that exhibit these effects.

IV Optimal value and policy functions for a mechanical system subject to unilateral constraints

We showed in the previous section that optimal value and policy functions for contact–rich robot dynamics inherit nonsmoothness from the underlying dynamics. To instantiate this result, we crafted the simplest mechanical system subject to unilateral constraints that exhibits the nonsmooth phenomena of interest (piecewise–differentiable or discontinuous trajectory outcomes), yielding the touchdown and liftoff maneuvers shown in Figure 1(a,b). For the touchdown maneuver, we seek the optimal (constant) force to exert in the left leg when the left foot is in contact and the right foot is not; similarly, we seek the optimal choice of force in the right leg when the right foot is in contact and the left foot is not: with input penalty parameters for the two leg forces,


For the liftoff maneuver, we seek the optimal (constant) torque to apply to the body while both feet are in contact: with an input penalty parameter for the torque,


We implemented numerical simulations of these models (using the modeling framework in [2] and the simulation algorithm in [13]) and applied a scalar minimization algorithm (minimize_scalar from SciPy v0.19.0) to compute optimal policies as a function of initial body rotation. We plan to release the software used to generate these results as an environment in OpenAI Gym [14].
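The overall procedure (sweep over initial body rotations, and solve one scalar minimization per initial condition) can be sketched as below. This is not the software used for the results reported here: `simulate_outcome` is a hypothetical stand–in for a simulation of the biped, the cost is a toy, and a pure–Python golden–section search is swapped in for SciPy's minimize_scalar so that the sketch has no dependencies.

```python
import math

def golden_section_minimize(f, lo, hi, tol=1e-9):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimizer lies in [a, d]
        else:
            a = c  # minimizer lies in [c, b]
    return 0.5 * (a + b)

def simulate_outcome(theta0, u):
    """Hypothetical stand-in for a simulation: returns a final body angle.

    The input acts with a different gain on each side of theta0 = 0, loosely
    mimicking a force that depends on which foot touches down first.
    """
    gain = 1.0 if theta0 < 0.0 else 2.0
    return theta0 + gain * u

def cost(theta0, u, penalty=1.0):
    """Squared final angle plus an input penalty (toy cost)."""
    return simulate_outcome(theta0, u) ** 2 + penalty * u ** 2

def optimal_policy(theta0, lo=-2.0, hi=2.0):
    """Optimal scalar input for one initial body rotation."""
    return golden_section_minimize(lambda u: cost(theta0, u), lo, hi)

# Sweep over initial body rotations and record the optimal policy and value.
thetas = [i / 10.0 for i in range(-5, 6)]
policies = [optimal_policy(th) for th in thetas]
values = [cost(th, u) for th, u in zip(thetas, policies)]
for th, u, v in zip(thetas, policies, values):
    print(f"{th:+.1f}  {u:+.4f}  {v:.4f}")
```

Even in this toy the computed policy has a different slope on each side of zero initial rotation, so a sweep of this kind is enough to expose a kink without any derivative information.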

As expected, the optimal value and policy functions computed for the touchdown and liftoff maneuvers are nonsmooth (Figure 3(c,d,e,f)). This result does not depend sensitively on the problem data; nonsmoothness is preserved after altering parameters of the model and/or cost function. We emphasize that the nonsmoothness in Figure 3 arises from the nonsmoothness in the underlying system dynamics (1); the functions in (20) and (21) are smooth.

Fig. 3: Optimal trajectories, values and policies for touchdown and liftoff maneuvers. Panels: (a) optimal touchdown trajectory outcomes; (b) optimal liftoff trajectory outcomes; (c) optimal touchdown policy; (d) optimal liftoff policy; (e) optimal touchdown value; (f) optimal liftoff value. Optimizing (20), (21) for the biped in Figure 1 yields trajectory outcomes (a,b), policies (c,d), and values (e,f) that are nonsmooth (piecewise–differentiable or discontinuous). Asymmetries in trajectory outcomes are due to unequal input penalty parameters in (a) and unequal leg forces in (b).

V Discussion

We conclude by discussing how often we expect to encounter the nonsmooth phenomena described above in models of robot behaviors (Section V-A) and what our results imply about the use of smooth tools in this nonsmooth setting (Section V-B).

V-A Prevalence of nonsmooth phenomena

In Section IV, we presented two simple optimal control problems where the dynamics of a mechanical system subject to unilateral constraints gave rise to a nonsmooth cost: one where the cost was piecewise–differentiable, and another where it was discontinuous. The reader may have noticed that the nonsmoothness occurred along trajectories that underwent simultaneous constraint (de)activation. This peculiarity was not accidental: the cost is generally continuously–differentiable along trajectories that (de)activate constraints at distinct instants in time. (This follows from [15, Eqn. 2.3] so long as the constraint (de)activations are admissible [3, Def. 3, Lem. 1].)

If the constraint surfaces intersect transversely [8, Ch. 6], then the nonsmoothness presented in Section IV is confined to a subset of the state space with zero Lebesgue measure. In light of this observation, intuition may lead one to ignore these states in practice. However, we believe this intuition will lead the practitioner astray as the complexity of considered behaviors increases. Indeed, since the number of contact mode sequences increases factorially with the number of constraints and exponentially with the number of constraint (de)activations, the region where the cost function is continuously–differentiable is “carved up” into a rapidly increasing number of disjoint “pieces” as behavioral complexity (as measured by the number of constraints and/or constraint (de)activations) increases.

Although we cannot at present comment in general on how these smooth pieces fit together, we note that some important behaviors will reside near a large number of pieces. For instance, periodic behaviors with (near–)simultaneous (de)activation of constraints as in [16] could yield a number of pieces that grows exponentially with the number of periods. The combinatorics are similar for tasks that involve intermittently activating (a subset of) the constraints repeatedly as in [17]. Since the dimension of the state space is independent of the number of constraints and the number of periods or activations, these pieces must be increasingly tightly packed as these numbers increase.
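The growth described above can be made tangible with a short enumeration. The counting model below is an assumption adopted for illustration: each period of a gait is taken to involve one near–simultaneous activation of all constraints, which can be perturbed into any ordering of individual activations, and orderings in different periods are taken to be chosen independently. The constraint labels and function names are hypothetical.

```python
from itertools import permutations, product
from math import factorial

def activation_orderings(constraints):
    """All orders in which the given constraints could activate one at a time."""
    return list(permutations(constraints))

def mode_sequences(constraints, periods):
    """All sequences formed by choosing one activation ordering per period."""
    return list(product(activation_orderings(constraints), repeat=periods))

feet = ("left", "right")
print(len(activation_orderings(feet)))   # 2 orderings for two feet
print(len(mode_sequences(feet, 3)))      # 2**3 = 8 sequences over 3 periods

# With four limbs, explicit enumeration quickly becomes unwieldy, so only the
# counts are printed under the same assumed model.
legs = ("LF", "RF", "LH", "RH")
for periods in (1, 2, 3):
    print(periods, factorial(len(legs)) ** periods)
```

Under this model the count is factorial in the number of constraints and exponential in the number of periods, consistent with the growth rates stated in the text, while the dimension of the state space stays fixed.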

V-B Justifying the use of gradient–based algorithms

Suppose a (possibly non–optimal) policy has an associated value. If this value admits a first–order approximation with respect to the policy, then it is natural to improve the policy using steepest descent: with a stepsize parameter,


The update in (22) is a direct policy gradient–based algorithm [18, 19], and can be interpreted as a natural [20] or trust region [21] algorithm depending on the norm chosen. In practice, the derivative of the value with respect to the policy is not generally available and must be estimated, e.g. using function approximation [22, 23] or sampling [19, 24]. This practice is justified for smooth control systems whose value functions are smooth; it is not generally justified for the mechanical systems subject to unilateral constraints considered here since the value of (optimal or non–optimal) policies can be nonsmooth. To see how nonsmoothness can prevent a gradient–based algorithm from converging to an optimal policy, consider the result of applying one step of the policy gradient algorithm in (22) to the optimal policies in Figure 3(c,d) when the value is merely piecewise–differentiable. Since the policy is optimal, the first–order necessary condition for optimality (14) implies that the derivative in (22) evaluates to zero, and therefore the optimal policy is a fixed point of the update in (22) when the true (Bouligand) derivative is available. However, an estimate of the derivative obtained via sampling or function approximation would be nonzero near the optimal policy, causing one step of the policy gradient algorithm in (22) to diverge from the optimal policy.
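A scalar toy problem shows the mechanism. In the sketch below, the value function, the central–difference estimator (standing in for a sampled gradient), and the stepsize are all hypothetical choices; the decision variable is a single number rather than a policy. At the minimizer every directional derivative is nonnegative, so no descent direction exists, yet the two–sided estimate averages the slopes on either side of the kink and returns a nonzero number.

```python
def value(u):
    """Toy nonsmooth value with its minimum at u = 0 (hypothetical)."""
    return max(2.0 * u, -u)

def sampled_gradient(f, u, h=1e-3):
    """Central-difference estimate, standing in for a sampled gradient."""
    return (f(u + h) - f(u - h)) / (2.0 * h)

def descent_step(u, stepsize=0.1):
    """One steepest-descent step using the estimated gradient."""
    return u - stepsize * sampled_gradient(value, u)

u_opt = 0.0
u_next = descent_step(u_opt)

print("estimated gradient at optimum:", sampled_gradient(value, u_opt))
print("after one step:", u_next, "value:", value(u_next))

# Starting exactly at the optimum, the step moves away and increases the value.
assert value(u_next) > value(u_opt)
```

Shrinking the finite–difference width does not help in this example, because the estimate at the kink remains the average of the two one–sided slopes regardless of the width.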

Recent work employs smooth approximations of the contact–rich robot dynamics in (1) to enable application of gradient–based learning [25, 26, 27] and optimization [28, 17, 29] algorithms. This approach leverages established scalable algorithms, but does not ensure that policies optimized for the smoothed dynamics are (near–)optimal when applied to the original system’s nonsmooth dynamics, since the dynamics of the smooth system being optimized differ from those of the original system. As an alternative approach, the framework we introduced in [4] provides design conditions that ensure trajectories of (1) depend continuously–differentiably on initial conditions. Thus in future work it may be possible to justify applying established algorithms for optimal control directly on some mechanical systems subject to unilateral constraints.


This material is based upon work supported by the U. S. Army Research Laboratory and the U. S. Army Research Office under contract/grant number W911NF-16-1-0158.


  • [1] P. Ballard, “The dynamics of discrete mechanical systems with perfect unilateral constraints,” Archive for Rational Mechanics and Analysis, vol. 154, no. 3, pp. 199–274, 2000.
  • [2] A. M. Johnson, S. A. Burden, and D. E. Koditschek, “A hybrid systems model for simple manipulation and self-manipulation systems,” The International Journal of Robotics Research, vol. 35, no. 11, pp. 1354–1392, 1 Sept. 2016.
  • [3] A. M. Pace and S. A. Burden, “Piecewise–differentiable trajectory outcomes in mechanical systems subject to unilateral constraints,” in Proceedings of Hybrid Systems: Computation and Control (HSCC), 2017.
  • [4] ——, “Decoupled limbs yield differentiable trajectory outcomes through intermittent contact in locomotion and manipulation,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2017.
  • [5] C. D. Remy, K. Buffinton, and R. Siegwart, “Stability analysis of passive dynamic walking of quadrupeds,” The International Journal of Robotics Research, vol. 29, no. 9, pp. 1173–1185, 2010.
  • [6] Y. Hürmüzlü and D. B. Marghitu, “Rigid body collisions of planar kinematic chains with multiple contact points,” The International Journal of Robotics Research, vol. 13, no. 1, pp. 82–92, 1994.
  • [7] E. Polak, Optimization: Algorithms and Consistent Approximations.   Springer–Verlag, 1997.
  • [8] J. M. Lee, Introduction to Smooth Manifolds, 2nd ed., ser. Graduate texts in mathematics.   New York ; London: Springer, 2012.
  • [9] S. Scholtes, Introduction to Piecewise Differentiable Equations.   Springer–Verlag, 2012.
  • [10] R. W. Chaney, “Second-Order sufficient conditions in nonsmooth optimization,” Mathematics of Operations Research, vol. 13, no. 4, pp. 660–673, 1 Nov. 1988.
  • [11] S. M. Robinson, “An Implicit-Function theorem for a class of nonsmooth functions,” Mathematics of Operations Research, vol. 16, no. 2, pp. 292–309, 1991.
  • [12] L. Kuntz and S. Scholtes, “Structural analysis of nonsmooth mappings, inverse functions, and metric projections,” Journal of Mathematical Analysis and Applications, vol. 188, no. 2, pp. 346–386, 1994.
  • [13] S. A. Burden, H. Gonzalez, R. Vasudevan, R. Bajcsy, and S. S. Sastry, “Metrization and Simulation of Controlled Hybrid Systems,” IEEE Transactions on Automatic Control, vol. 60, no. 9, pp. 2307–2320, 2015.
  • [14] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI gym,” 5 June 2016.
  • [15] M. A. Aizerman and F. R. Gantmacher, “Determination of stability by linear approximation of a periodic solution of a system of differential equations with discontinuous Right–Hand sides,” The Quarterly Journal of Mechanics and Applied Mathematics, vol. 11, no. 4, pp. 385–398, 1958.
  • [16] R. M. Alexander, “The gaits of bipedal and quadrupedal animals,” The International Journal of Robotics Research, vol. 3, no. 2, pp. 49–59, 1984.
  • [17] I. Mordatch, E. Todorov, and Z. Popović, “Discovery of complex behaviors through contact-invariant optimization,” ACM Transactions on Graphics, vol. 31, no. 4, pp. 43:1–43:8, July 2012.
  • [18] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063, 2000.
  • [19] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” The Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
  • [20] S. Kakade, “A natural policy gradient,” Advances in Neural Information Processing Systems, vol. 14, pp. 1531–1538, 2001.
  • [21] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, abs/1502.05477, 2015.
  • [22] K. Doya, “Reinforcement learning in continuous time and space,” Neural Computation, vol. 12, no. 1, pp. 219–245, Jan. 2000.
  • [23] V. R. Konda and J. N. Tsitsiklis, “On Actor-Critic algorithms,” SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, Jan. 2003.
  • [24] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic Policy Gradient Algorithms,” in ICML, Beijing, China, June 2014.
  • [25] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2014, pp. 1071–1079.
  • [26] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End–to–end training of deep visuomotor policies,” Journal of Machine Learning Research: JMLR, vol. 17, no. 1, pp. 1334–1373, 2016.
  • [27] V. Kumar, E. Todorov, and S. Levine, “Optimal control with learned local models: Application to dexterous manipulation,” in IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 378–383.
  • [28] T. Erez and E. Todorov, “Trajectory optimization for domains with contacts using inverse dynamics,” in IEEE International Conference on Intelligent Robots and Systems, Oct. 2012, pp. 4914–4919.
  • [29] I. Mordatch, K. Lowrey, and E. Todorov, “Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, 2015, pp. 5307–5314.