In the optimization approach to robot control, a policy is sought that extremizes a given performance criterion; the performance achieved by this optimal policy is the optimal value of the problem. Two widely applied frameworks for solving such problems are reinforcement learning and trajectory optimization. Although many algorithms are available in either framework, scalable algorithms in both leverage local approximations (gradients of values and policies) to iteratively improve toward optimality. In robotics applications like collision–free motion planning, these gradients are guaranteed to exist and can be readily computed or approximated. We show in this paper that these gradients can fail to exist for contact–rich robot dynamics, precluding application of state–of–the–art algorithms for optimal control.
We begin in Section II by modeling contact–rich robot dynamics using mechanical systems subject to unilateral constraints, and describe how nonsmoothness (discontinuity or piecewise–differentiability) manifests in trajectory outcomes and (hence) trajectory costs. Then in Section III we provide mathematical derivations that show nonsmoothness in trajectory outcomes and costs gives rise to nonsmoothness in optimal value and (hence) policy functions. Subsequently in Section IV we present numerical simulations that demonstrate discontinuous or merely piecewise–differentiable optimal value and policy functions in a mechanical system subject to unilateral constraints. Finally in Section V we discuss the prevalence of nonsmoothness and how the lack of classical differentiability prevents gradient–based algorithms from converging to optimality.
II Mechanical systems subject to unilateral constraints
In this section, we formalize a class of models for contact–rich dynamics in robot locomotion and manipulation as mechanical systems subject to unilateral constraints and formulate an optimal control problem for these systems.
Consider the dynamics of a mechanical system with configuration coordinates subject to unilateral constraints specified by a differentiable function where are finite. We are primarily interested in systems with constraints, whence we regard the inequality as being enforced componentwise.
Given any , and letting denote the number of elements in the set , we let denote the function obtained by selecting the component functions of indexed by , and we regard the equality as being enforced componentwise.
where specifies the mass matrix for the mechanical system in the coordinates, is termed the effort map  and specifies the internal and applied forces, is an external input, denotes the Coriolis matrix determined by , denotes the (Jacobian) derivative of the constraint function with respect to the coordinates, denotes the reaction forces generated in contact mode to enforce , specifies the collision restitution law that instantaneously resets velocities to ensure compatibility with the constraint , and (resp. ) denotes the right– (resp. left–)handed limits of the velocity with respect to time. (Regarding the effort map: we let denote the tangent bundle of the configuration space ; an element can be regarded as a pair of generalized configurations and velocities ; we write .)
II-B Regularity of dynamics
The seemingly benign equations in (1) can yield dynamics with a range of regularity properties. This issue has been thoroughly investigated elsewhere [3, 4, 1]; here we focus specifically on how design choices in a robot’s mechanical and control systems affect regularity of its dynamics.
In what follows, we will frequently refer to the concept of a control system’s flow, so we briefly review the concept before proceeding. Given a control system (e.g. (1) or (2)) with state space and input space , a flow is a function such that for all initial states and inputs , the function defined for all by is a trajectory for the control system. Intuitively, the flow “bundles” all trajectories into a single function. Mathematically, the flow is useful for studying how trajectories vary with respect to initial states and inputs. So long as trajectories exist and are unique for every and , the flow is a well–defined function.
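As a minimal illustration of the flow concept, the sketch below assumes a double integrator driven by a constant input. This is a hypothetical smooth example chosen because its flow is available in closed form; it is not the constrained dynamics in (1), and the names `flow`, `x0`, and `u0` are illustrative. The sketch builds a trajectory by holding the initial state and input fixed, then checks that flowing for one duration and then another agrees with flowing for their sum.

```python
def flow(t, x, u):
    """Closed-form flow of the double integrator q'' = u with constant input u.

    x = (q, dq) is the initial state; the return value is the state at time t.
    This toy system is an assumption made for illustration only.
    """
    q, dq = x
    return (q + dq * t + 0.5 * u * t ** 2, dq + u * t)


# A trajectory is the flow with the initial state and input held fixed.
x0, u0 = (1.0, -2.0), 3.0
trajectory = [flow(0.1 * k, x0, u0) for k in range(11)]

# Composition property: flowing for t1 and then t2 equals flowing for t1 + t2.
t1, t2 = 0.4, 0.6
two_step = flow(t2, flow(t1, x0, u0), u0)
one_step = flow(t1 + t2, x0, u0)
```

Because the flow is an ordinary function of time, initial state, and input, its dependence on the initial state and input can be probed directly, for instance by finite differences.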
It is common to assume that the functions in (1) are continuously–differentiable (); however, as illustrated by [1, Ex. 2], this assumption alone does not ensure existence or uniqueness of trajectories. This case contrasts starkly with that of classical control systems, where the equation
yields unique trajectories whose regularity matches the vector field’s: if the vector field is continuously differentiable to a given order, then there exists a flow for (2) that is continuously differentiable to the same order.
To ensure trajectories for (1) exist uniquely, restrictions must be imposed; we refer the interested reader to [1, Thm. 10] for a specific result and  for a general discussion of this issue. Since we are chiefly concerned with how properties of the dynamics in (1) affect properties of optimal value and policy functions, we will assume in what follows that conditions have been imposed to ensure (1) has a flow for states, inputs, and time horizons of interest.
Assuming that a flow exists for (1) does not guarantee any regularity properties of the flow; these properties are determined by the design of a robot’s mechanical and control systems and their closed–loop interaction with the environment. For instance: when limbs are inertially coupled (e.g. by rigid struts and joints), so that one limb’s constraint activation instantaneously changes another’s velocity, the flow is discontinuous at configurations where these two limbs activate constraints simultaneously [5, Table 3]; when limbs are force coupled (e.g. by nonlinear damping), so that one limb’s constraint (de)activation instantaneously changes the force on another, the flow can be piecewise–differentiable at configurations where these two limbs (de)activate constraints simultaneously [3, Fig. 1]. In both instances, mechanical design choices lead to nonsmooth dynamics; Figure 1 provides examples where control design choices lead to nonsmooth dynamics (piecewise–differentiable in Figure 1(a,c,e), discontinuous in Figure 1(b,d,f)). Other nonsmooth phenomena can arise, e.g. grazing (where a constraint function decreases to and then increases from zero without activating the constraint) and Zeno (where a constraint is activated an infinite number of times on a finite time horizon) trajectories; in what follows we will focus on the case of simultaneous constraint (de)activations due to its prevalence in robot gaits and maneuvers (see Section V-A for a discussion of when this phenomenon prevails).
II-C Regularity of optimal value and policy functions
A broad class of optimal control problems for the dynamics in (1) can be formulated in terms of final () and running () costs:
where denotes the unique trajectory obtained from initial state when input is applied; in terms of the flow, for all . To expose the dependence of the cost in (3) on the flow , we transcribe the problem in (3) to a simpler form using a standard state augmentation technique (cf. [7, Ch. 4.1.2]):
As discussed in Section II-B, the continuity and differentiability properties of are partly determined by a robot’s design: it is possible for and hence to be discontinuous (), continuously–differentiable (), or piecewise–differentiable and not continuously–differentiable (), depending on the properties of the robot’s mechanical and control systems. In the next section, we study how continuity and differentiability properties of affect the corresponding properties of in (4).
III Continuity and differentiability of optimal value and policy functions
Consider minimization of the cost function with respect to an input :
In this section we study how regularity properties (continuity, differentiability) of the cost function () relate to regularity properties of optimal value () and policy () functions.
III-A Discontinuous cost functions
If the cost () is discontinuous with respect to its first argument, then the optimal policy () and value () are generally discontinuous as well. This observation is clear in the trivial case that the cost only depends on its first argument, but manifests more generally.
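The following sketch illustrates the non-trivial case with a toy cost that depends on both arguments. The cost, the `step` function standing in for a discontinuous trajectory outcome, and the golden-section minimizer are all assumptions made for illustration; they are not the models or solver used elsewhere in this paper. The jump in the cost with respect to its first argument reappears as a jump in both the minimizer and the minimum.

```python
import math


def golden_section_min(f, lo, hi, tol=1e-10):
    """Minimize a unimodal scalar function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimizer lies in [a, d]
        else:
            a = c  # minimizer lies in [c, b]
    return 0.5 * (a + b)


def step(x):
    """Hypothetical discontinuous outcome: jumps from 0 to 1 at x = 0."""
    return 0.0 if x < 0.0 else 1.0


def cost(x, u):
    """Toy cost: discontinuous in x, smooth (quadratic) in u."""
    return (u - step(x)) ** 2 + step(x)


def policy(x):
    """Numerically computed minimizer of the cost over the input u."""
    return golden_section_min(lambda u: cost(x, u), -2.0, 3.0)


def value(x):
    """Cost achieved by the minimizing input."""
    return cost(x, policy(x))


# Compare the optimal policy and value on either side of the discontinuity.
left, right = -1e-3, 1e-3
jump_in_policy = policy(right) - policy(left)
jump_in_value = value(right) - value(left)
```

In this toy problem both jumps are approximately one no matter how close `left` and `right` are taken to zero, so neither function is continuous there.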
III-B Continuously–differentiable cost functions
This section contains straightforward calculations based on standard results in classical (smooth) Calculus and nonlinear programming; it is provided primarily as a rehearsal for the more general setting considered in the subsequent section.
If is continuously–differentiable, denoted or simply , then necessarily [7, Ch. 1.1.1]
and applying the Chain Rule to (7) yields
whence we obtain derivatives of the optimal value and policy functions in terms of derivatives of the cost function.
We conclude that if the cost function is two times continuously–differentiable () and first–order necessary (8) and second–order sufficient (9), (10) conditions for optimality and stability of solutions to (5) are satisfied at , then the optimal policy and value functions are continuously–differentiable at () and their derivatives at can be computed using (11), (12).
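As a numerical sanity check of this conclusion, the sketch below uses a toy twice continuously–differentiable cost whose minimizer is known in closed form (the cost and all function names are assumptions made for illustration). It compares the standard sensitivity formulas obtained from the Implicit Function Theorem and the Chain Rule, namely that the policy derivative equals the negative inverse of the second derivative of the cost in the input times the mixed second derivative, and that the value derivative equals the partial derivative of the cost in the state evaluated at the optimal input, against central finite differences.

```python
import math


def cost(x, u):
    """Toy C^2 cost: strictly convex in u, so the minimizer is unique."""
    return 0.5 * u ** 2 - u * math.sin(x) + x ** 2


def policy(x):
    """Minimizer in closed form, from D_u c = u - sin(x) = 0."""
    return math.sin(x)


def value(x):
    """Cost achieved by the optimal input."""
    return cost(x, policy(x))


def policy_derivative(x):
    """Implicit Function Theorem: D pi = -(D_uu c)^{-1} D_ux c."""
    d_uu = 1.0
    d_ux = -math.cos(x)
    return -d_ux / d_uu


def value_derivative(x):
    """Chain Rule with D_u c = 0 at the optimum: D nu = D_x c(x, pi(x))."""
    u = policy(x)
    return -u * math.cos(x) + 2.0 * x


def central_difference(f, x, h=1e-6):
    """Second-order accurate finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 0.7
fd_policy = central_difference(policy, x0)
fd_value = central_difference(value, x0)
```

The finite differences agree with the analytical expressions up to discretization error, which is the behavior a gradient–based algorithm relies upon.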
III-C Piecewise–differentiable cost functions
If is piecewise–differentiable (in the sense of [9, Ch. 4.1]: a function is piecewise–differentiable if it is everywhere locally a continuous selection of a finite number of continuously–differentiable functions), denoted or simply , then necessarily
Here and below, denotes a continuous and piecewise–linear first–order approximation termed the Bouligand (or B–)derivative [9, Ch. 3] that exists by virtue of the cost being [9, Lem. 4.1.3]; denotes the evaluation of at .
and if the piecewise–linear function
then an Implicit Function Theorem can be applied to choose near [11, Cor. 3.4]. (This Implicit Function Theorem requires strong B–differentiability; the costs considered here are not generally strongly B–differentiable, but they are generally –equivalent to strongly B–differentiable functions [12, Thm. 3.1], whence [11, Cor. 3.4] can be applied indirectly.) Applying the Chain Rule [9, Thm. 3.1.1] to (14) yields (cf. [11, § 3])
and applying the Chain Rule to (7) yields
whence we obtain B–derivatives of the optimal value and policy functions in terms of B–derivatives of the cost.
We conclude that if the cost function is two times piecewise–differentiable () and first–order necessary (14) and second–order sufficient (15), (16) conditions for optimality and stability of solutions to (5) are satisfied at , then the optimal policy and value functions are piecewise–differentiable at () and their B–derivatives at can be computed using (17), (18).
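To make the piecewise–differentiable case concrete, the sketch below uses a toy cost with a kink in its first argument (the cost and function names are assumptions made for illustration, not the models considered in this paper). The optimal policy and value are continuous but not classically differentiable at the kink; one-sided difference quotients approximate their B–derivatives there, which depend on the direction in a piecewise–linear rather than linear way.

```python
def cost(x, u):
    """Toy piecewise-differentiable cost: kinks in x at 0, quadratic in u."""
    return 0.5 * (u - abs(x)) ** 2 + max(x, 0.0)


def policy(x):
    """Minimizer in closed form: u = |x| (continuous, with a kink at x = 0)."""
    return abs(x)


def value(x):
    """Optimal value: equals max(x, 0) (continuous, with a kink at x = 0)."""
    return cost(x, policy(x))


def directional_derivative(f, x, v, h=1e-7):
    """One-sided difference quotient approximating the B-derivative Df(x; v)."""
    return (f(x + h * v) - f(x)) / h


# Evaluate the B-derivatives at the kink in the two unit directions.
d_policy = {v: directional_derivative(policy, 0.0, v) for v in (-1.0, 1.0)}
d_value = {v: directional_derivative(value, 0.0, v) for v in (-1.0, 1.0)}
```

For this toy problem the policy's directional derivative at the kink is approximately one in both directions, and the value's is approximately zero to the left and one to the right. A linear map would change sign when the direction is reversed, so no single gradient reproduces either function to first order at the kink.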
III-D Conclusions regarding regularity of optimal value and policy functions
The results in Sections III-A, III-B, and III-C suggest that we should generally expect regularity of optimal value and policy functions to match that of the cost function: they should be discontinuous when the cost is discontinuous, or piecewise–differentiable when the cost is piecewise–differentiable. In Section IV we provide instances of the class of models described in Section II that exhibit these effects.
IV Optimal value and policy functions for a mechanical system subject to unilateral constraints
We showed in the previous section that optimal value and policy functions for contact–rich robot dynamics inherit nonsmoothness from the underlying dynamics. To instantiate this result, we crafted the simplest mechanical system subject to unilateral constraints that exhibits the nonsmooth phenomena of interest (piecewise–differentiable or discontinuous trajectory outcomes), yielding the touchdown and liftoff maneuvers shown in Figure 1(a,b). For the touchdown maneuver, we seek the optimal (constant) force to exert in the left leg () when the left foot is in contact and the right foot is not; similarly, we seek the optimal choice of force in the right leg () when the right foot is in contact and the left foot is not: with as input penalty parameters,
For the liftoff maneuver, we seek the optimal (constant) torque () to apply to the body while both feet are in contact: with as an input penalty parameter,
We implemented numerical simulations of these models (using the modeling framework in [2] and simulation algorithm in [13]) and applied a scalar minimization algorithm (SciPy v0.19.0 minimize_scalar) to compute optimal policies as a function of initial body rotation. We plan to release the software used to generate these results as an environment in OpenAI Gym [14].
As expected, the optimal value and policy functions computed for the touchdown and liftoff maneuvers are nonsmooth (Figure 3(c,d,e,f)). This result does not depend sensitively on the problem data; nonsmoothness is preserved after altering parameters of the model and/or cost function. We emphasize that the nonsmoothness in Figure 3 arises from the nonsmoothness in the underlying system dynamics (1); the functions in (20) and (21) are smooth.
V Discussion
We conclude by discussing how often we expect to encounter the nonsmooth phenomena described above in models of robot behaviors (Section V-A) and what our results imply about the use of smooth tools in this nonsmooth setting (Section V-B).
V-A Prevalence of nonsmooth phenomena
In Section IV, we presented two simple optimal control problems where the dynamics of a mechanical system subject to unilateral constraints gave rise to a nonsmooth cost: one where the cost was piecewise–differentiable, and another where it was discontinuous. The reader may have noticed that the nonsmoothness occurred along trajectories that underwent simultaneous constraint (de)activation. This peculiarity was not accidental: the cost is generally continuously–differentiable along trajectories that (de)activate constraints at distinct instants in time (this follows from [15, Eqn. 2.3] so long as the constraint (de)activations are admissible [3, Def. 3, Lem. 1]).
If the constraint surfaces intersect transversely [8, Ch. 6], then the nonsmoothness presented in Section IV is confined to a subset of the state space with zero Lebesgue measure. In light of this observation, intuition may lead one to ignore these states in practice. However, we believe this intuition will lead the practitioner astray as the complexity of considered behaviors increases. Indeed, since the number of contact mode sequences increases factorially with the number of constraints and exponentially with the number of constraint (de)activations, the region where the cost function is continuously–differentiable is “carved up” into a rapidly increasing number of disjoint “pieces” as behavioral complexity (as measured by the number of constraints and/or constraint (de)activations) increases.
Although we cannot at present comment in general on how these smooth pieces fit together, we note that some important behaviors will reside near a large number of pieces. For instance, periodic behaviors with (near–)simultaneous (de)activation of constraints as in [16] could yield up to pieces after periods. The combinatorics are similar for tasks that involve intermittently activating (a subset of) constraints times as in [17]. Since the dimension of the state space is independent of and , these pieces must be increasingly tightly packed as and/or increase.
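The growth described above can be tabulated with a short counting sketch. It rests on an illustrative assumption that is ours for the purpose of this example: each (near–)simultaneous activation event involving a given number of constraints may occur in any order, independently of the other events, so that the count is the factorial of the number of constraints raised to the number of events. The true number of pieces for a particular system may be smaller, and the names `activation_orders` and `sequence_count` are hypothetical.

```python
import math
from itertools import permutations


def activation_orders(n):
    """All orders in which n constraints could activate near-simultaneously."""
    return list(permutations(range(n)))


def sequence_count(n, k):
    """Upper bound on distinct contact mode sequences after k events,
    assuming every event independently admits any of the n! orders."""
    return math.factorial(n) ** k


# Tabulate the bound for a biped-like (n = 2) and quadruped-like (n = 4) case.
counts = {(n, k): sequence_count(n, k) for n in (2, 4) for k in (1, 2, 3)}
```

Under this assumption, four constraints and three events already give 13824 candidate sequences, while the dimension of the state space is unchanged.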
V-B Justifying the use of gradient–based algorithms
Suppose a (possibly non–optimal) policy has an associated value . If this value admits a first–order approximation with respect to , then it is natural to improve the policy using steepest descent: with as a stepsize parameter,
The update in (22) is a direct policy gradient–based algorithm [18, 19], and can be interpreted as a natural [20] or trust region [21] algorithm depending on the norm chosen. In practice, the derivative
is not generally available and must be estimated, e.g. using function approximation [22, 23] or sampling [19, 24]. This practice is justified for smooth control systems whose value functions are smooth; it is not generally justified for the mechanical systems subject to unilateral constraints considered here since the value of (optimal or non–optimal) policies can be nonsmooth. To see how nonsmoothness can prevent a gradient–based algorithm from converging to an optimal policy, consider the result of applying one step of the policy gradient algorithm in (22) to the optimal policies in Figure 3(c,d) when is merely piecewise–differentiable. Since the policy is optimal, the first–order necessary condition for optimality (14) implies that the derivative in (22) evaluates to zero, and therefore the optimal policy is a fixed point of the update in (22) when the true (Bouligand) derivative is available. However, an estimate of this derivative obtained via sampling or function approximation would be nonzero near , causing one step of the policy gradient algorithm in (22) to diverge from the optimal policy.
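The mechanism can be reproduced in a toy setting. The sketch below assumes a quadratic cost whose optimal input tracks a target function of the state, and it models a sampling–based estimate as a uniform average of the input derivative over evenly spaced nearby states. Both choices are assumptions made for illustration and do not correspond to the estimators in the cited work or to the maneuvers in Section IV. When the target is differentiable the pooled estimate vanishes at the optimum; when the target has a kink the pooled estimate is biased away from zero, and one descent step moves the input off the optimum.

```python
def d_cost_du(x, u, target):
    """Derivative in u of the toy cost 0.5 * (u - target(x)) ** 2."""
    return u - target(x)


def sampled_gradient(x0, u, target, radius=0.1, samples=101):
    """Average D_u c over states evenly spaced in [x0 - radius, x0 + radius].

    Stands in for an estimate that pools information from states near x0.
    """
    xs = [x0 - radius + 2.0 * radius * i / (samples - 1) for i in range(samples)]
    return sum(d_cost_du(x, u, target) for x in xs) / samples


def smooth_target(x):
    """Optimal policy that is differentiable at x = 0."""
    return x


def kinked_target(x):
    """Optimal policy with a kink at x = 0."""
    return abs(x)


u_opt = 0.0      # optimal input at x0 = 0 for both targets
stepsize = 1.0   # illustrative stepsize parameter

g_smooth = sampled_gradient(0.0, u_opt, smooth_target)
g_kinked = sampled_gradient(0.0, u_opt, kinked_target)

# One descent step on the input applied at x0 = 0.
u_smooth = u_opt - stepsize * g_smooth
u_kinked = u_opt - stepsize * g_kinked
```

In the smooth case the contributions from either side of the sampled state cancel, so the optimal input remains (numerically) a fixed point. In the kinked case they add, so the updated input is strictly positive and the cost at the sampled state increases.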
Recent work employs smooth approximations of the contact–rich robot dynamics in (1) to enable application of gradient–based learning [25, 26, 27] and optimization [28, 17, 29] algorithms. This approach leverages established scalable algorithms, but does not ensure that policies optimized for the smoothed dynamics are (near–)optimal when applied to the original system’s nonsmooth dynamics, since the dynamics of the smooth system being optimized differ from those of the original system. As an alternative approach, the framework we introduced in [4] provides design conditions that ensure trajectories of (1) depend continuously–differentiably on initial conditions. Thus in future work it may be possible to justify applying established algorithms for optimal control directly on some mechanical systems subject to unilateral constraints.
This material is based upon work supported by the U. S. Army Research Laboratory and the U. S. Army Research Office under contract/grant number W911NF-16-1-0158.
- [1] P. Ballard, “The dynamics of discrete mechanical systems with perfect unilateral constraints,” Archive for Rational Mechanics and Analysis, vol. 154, no. 3, pp. 199–274, 2000.
- [2] A. M. Johnson, S. A. Burden, and D. E. Koditschek, “A hybrid systems model for simple manipulation and self-manipulation systems,” The International Journal of Robotics Research, vol. 35, no. 11, pp. 1354–1392, 1 Sept. 2016.
- [3] A. M. Pace and S. A. Burden, “Piecewise–differentiable trajectory outcomes in mechanical systems subject to unilateral constraints,” in Proceedings of Hybrid Systems: Computation and Control (HSCC), 2017.
- [4] ——, “Decoupled limbs yield differentiable trajectory outcomes through intermittent contact in locomotion and manipulation,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2017.
- [5] C. D. Remy, K. Buffinton, and R. Siegwart, “Stability analysis of passive dynamic walking of quadrupeds,” The International Journal of Robotics Research, vol. 29, no. 9, pp. 1173–1185, 2010.
- [6] Y. Hürmüzlü and D. B. Marghitu, “Rigid body collisions of planar kinematic chains with multiple contact points,” The International Journal of Robotics Research, vol. 13, no. 1, pp. 82–92, 1994.
- [7] E. Polak, Optimization: Algorithms and Consistent Approximations. Springer–Verlag, 1997.
- [8] J. M. Lee, Introduction to Smooth Manifolds, 2nd ed., ser. Graduate texts in mathematics. New York; London: Springer, 2012.
- [9] S. Scholtes, Introduction to Piecewise Differentiable Equations. Springer–Verlag, 2012.
- [10] R. W. Chaney, “Second-order sufficient conditions in nonsmooth optimization,” Mathematics of Operations Research, vol. 13, no. 4, pp. 660–673, 1 Nov. 1988.
- [11] S. M. Robinson, “An implicit-function theorem for a class of nonsmooth functions,” Mathematics of Operations Research, vol. 16, no. 2, pp. 292–309, 1991.
- [12] L. Kuntz and S. Scholtes, “Structural analysis of nonsmooth mappings, inverse functions, and metric projections,” Journal of Mathematical Analysis and Applications, vol. 188, no. 2, pp. 346–386, 1994.
- [13] S. A. Burden, H. Gonzalez, R. Vasudevan, R. Bajcsy, and S. S. Sastry, “Metrization and simulation of controlled hybrid systems,” IEEE Transactions on Automatic Control, vol. 60, no. 9, pp. 2307–2320, 2015.
- [14] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI gym,” 5 June 2016.
- [15] M. A. Aizerman and F. R. Gantmacher, “Determination of stability by linear approximation of a periodic solution of a system of differential equations with discontinuous right–hand sides,” The Quarterly Journal of Mechanics and Applied Mathematics, vol. 11, no. 4, pp. 385–398, 1958.
- [16] R. M. Alexander, “The gaits of bipedal and quadrupedal animals,” The International Journal of Robotics Research, vol. 3, no. 2, pp. 49–59, 1984.
- [17] I. Mordatch, E. Todorov, and Z. Popović, “Discovery of complex behaviors through contact-invariant optimization,” ACM Transactions on Graphics, vol. 31, no. 4, pp. 43:1–43:8, July 2012.
- [18] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063, 2000.
- [19] J. Baxter and P. L. Bartlett, “Infinite-horizon policy-gradient estimation,” The Journal of Artificial Intelligence Research, vol. 15, pp. 319–350, 2001.
- [20] S. Kakade, “A natural policy gradient,” Advances in Neural Information Processing Systems, vol. 14, pp. 1531–1538, 2001.
- [21] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust region policy optimization,” CoRR, abs/1502.05477, 2015.
- [22] K. Doya, “Reinforcement learning in continuous time and space,” Neural Computation, vol. 12, no. 1, pp. 219–245, Jan. 2000.
- [23] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, Jan. 2003.
- [24] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, Beijing, China, June 2014.
- [25] S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 1071–1079.
- [26] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End–to–end training of deep visuomotor policies,” Journal of Machine Learning Research: JMLR, vol. 17, no. 1, pp. 1334–1373, 2016.
- [27] V. Kumar, E. Todorov, and S. Levine, “Optimal control with learned local models: Application to dexterous manipulation,” in IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 378–383.
- [28] T. Erez and E. Todorov, “Trajectory optimization for domains with contacts using inverse dynamics,” in IEEE International Conference on Intelligent Robots and Systems, Oct. 2012, pp. 4914–4919.
- [29] I. Mordatch, K. Lowrey, and E. Todorov, “Ensemble-CIO: Full-body dynamic motion planning that transfers to physical humanoids,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, 2015, pp. 5307–5314.