I Introduction
Deployment of legged systems into real-world scenarios requires autonomous locomotion planning strategies. A key challenge in legged locomotion is the need to reason about contact schedules during planning. We are interested in efficiently automating contact schedule selection based on high-level descriptors of a motion task.
The literature provides a wide spectrum of approaches to locomotion planning problems. The separation into simplified models for footstep planning together with robust whole-body control for tracking has found widespread application [1, 2]. In [3], footstep plans were optimized using a mixed-integer formulation incorporating a reachability heuristic. In [4], the stochastic Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [5] optimized footholds on a pre-labeled terrain cost map. Sampling-based methods employing Rapidly-exploring Random Trees (RRT) have solved complex motion planning problems while being guided by heuristics and requiring post-processing [6, 7, 8]. The strict separation of footstep planning and whole-body tracking requires a conservative approach to the former task to ensure feasibility of the latter.
In dynamic locomotion tasks, inconsistent contact planning can entail performance degradation or even instability. The motion planning should thus reason about the robot dynamics and the contact states concurrently. A continuous-time differential dynamic programming (DDP) based method was provided in [9], while [10, 11] proposed direct collocation-based approaches that optimize over the contact schedule while planning dynamically consistent motions. These algorithms optimize over the contact durations and can implicitly alter the number of contact phases by setting step durations to zero. In practice, however, durations rarely vanish exactly for numerical reasons. Furthermore, including the contact schedule in the optimization variables increases both the computational burden and the risk of not converging to a feasible solution. Similar numerical concerns arise in contact-invariant optimization [12, 13], where contact interaction is implicitly defined by complementarity constraints. This raises the question of whether contact schedule selection can be decoupled from trajectory optimization without sacrificing model consistency in favor of heuristics.
Learning contact schedule selection to guide the trajectory optimization presents itself as a viable option. Techniques combining optimal control with learning have found widespread interest in the robotics community. Optimization is efficient at finding good local solutions, while data-driven techniques offer the ability to model complex relations. In [14], controller adaptation to model inaccuracies was framed as a reinforcement learning problem. In [15] and [16], the robustness of neural network policies was demonstrated for locomotion tasks. In [17], a classifier predicts contact mode sequences for a manipulation task under linearized dynamics to simplify the quadratic program solved online. The strongly nonlinear locomotion tasks considered here make the optimization prone to becoming locally trapped or failing to converge at all. Efficient data acquisition should therefore avoid extensive sampling in regions likely to produce infeasible problems. This exploration-exploitation trade-off can be achieved by combining Bayesian Optimization (BO) with Gaussian Process (GP) models, as discussed in [18, 19] and applied to contextual bandit problems in [20]. Promising results with these approaches were previously demonstrated for applications such as automatic locomotion controller optimization [21, 22].

In this paper, we propose an efficient method to automate contact schedule selection in optimization-based legged locomotion. The planning problem is cast as a bilevel optimization. At the lower level (LL), a direct collocation method solves the motion planning problem for the state and input trajectories under a fixed contact schedule. At the upper level (UL), BO is used to optimize a GP model which maps the contact scheduling policy to the trajectory optimization performance. The temporal component of gait planning is thereby decoupled from optimizing continuous state and input trajectories. Employing a learning-based method at the UL allows for sample-efficient offline training that avoids predictably infeasible actions, which can slow down computation significantly [23]. Unlike [24], we furthermore avoid taking gradients with respect to the contact schedule, reducing the risk of getting trapped in local optima. We evaluate our approach by planning trajectories for the single-legged hopping robot shown in Fig. 1.
II Methods
Optimization-based locomotion planning attempts to compute optimal state and input trajectories, as well as the underlying contact sequence and durations. This proceeds in accordance with the system dynamics and the environment, as contact imposes constraints on the achievable motions. In general, operating on these continuous and discrete variables defines a mixed-logic optimization problem [25], which is computationally expensive to solve. Here, we employ a bilevel optimization that separates gait planning (discrete) from trajectory planning (continuous) [24]. At the LL, we define a nonlinear program (NLP) to find optimal state and input trajectories under fixed gaits. At the UL, we leverage a gradient-free method capable of treating the NLP score as a black-box function. The overall problem takes the form
(1) $\min_{a,\,\xi}\; L(\xi, a)$
s.t. $g(\xi, a) = 0,\quad h(\xi, a) \le 0,$
where $a$ are the UL decision variables, $\xi$ are the LL decision variables, $L$ denotes the cost, $g$ are the equality constraints, and $h$ are the inequality constraints, with $g \in \mathbb{R}^{n_g}$ and $h \in \mathbb{R}^{n_h}$.
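To make the structure of (1) concrete, the following sketch mimics the bilevel split: the UL treats the LL score as a black box and compares candidate gaits by their merit. The toy LL problem, the gait encoding, and the weights are purely illustrative stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of the bilevel structure in (1): the upper level (UL)
# enumerates candidate gaits, the lower level (LL) solves a trajectory
# optimization for each fixed gait and reports cost and constraint violation.

def solve_ll(gait):
    """Toy stand-in for the direct-collocation NLP: returns
    (cost, eq_violation, ineq_violation) for a fixed gait."""
    # Pretend longer flight phases are costlier but violate fewer constraints.
    n_flight = sum(gait[1::2])           # flight-phase node counts
    cost = 1.0 + 0.1 * n_flight
    eq_viol = max(0.0, 2.0 - n_flight)   # too-short flights "violate" dynamics
    return cost, eq_viol, 0.0

def merit(cost, eq_viol, ineq_viol, w_eq=10.0, w_ineq=10.0):
    """Merit score in the spirit of the paper: cost plus weighted violations."""
    return cost + w_eq * eq_viol + w_ineq * ineq_viol

# UL: pick the gait with the best merit score (gradient-free, black-box LL).
candidates = [(5, 0, 0, 0, 0), (5, 2, 5, 0, 0), (5, 3, 4, 3, 5)]
best = min(candidates, key=lambda g: merit(*solve_ll(g)))
```

In the paper the UL search is performed by BO over a GP model rather than by enumeration; the enumeration above only illustrates the interface between the two levels.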
LL optimization
At the LL we consider the gait to be given and fix the UL decision variables $a$. We then determine the optimal state and input trajectories for the resulting locomotion task. To this end, we formulate an optimization problem using a direct collocation method [10]. The resulting NLP is written in its general form as
(2) $\min_{\xi}\; L(\xi)$
s.t. $g(\xi) = 0,\quad h(\xi) \le 0,$
where $L$ is the LL objective function. The LL decision variable vector $\xi = [\xi_1^\top, \dots, \xi_N^\top]^\top$ stacks the individual node vectors $\xi_k$, which consist of generalized positions $q_k$, generalized velocities $\dot{q}_k$, and input torques $\tau_k$. The constraints include the system dynamics, ground contact, friction, and joint limits. The NLP in (2) returns the optimized trajectory nodes together with the associated cost and constraint values $L$, $g$, and $h$.
Merit function
In order to quantify the quality of the solutions provided by the LL optimization, we define a merit function $m$. The merit function weighs trajectory cost against constraint violation according to
(3) $m = w_L\, L + w_g\, \lVert g \rVert^2 + w_h\, \lVert \max(0, h) \rVert^2,$
where the first term penalizes the trajectory cost, while the second and third terms penalize violation of the equality and inequality constraints, respectively. In general, the weight ratios should ensure a separation between feasible and infeasible runs, while avoiding steep gradients in the merit function.
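A possible implementation of such a merit function could look as follows; the squared-penalty form and the weights are illustrative assumptions, as the paper does not spell out the exact terms. The sigmoid range limiter introduced under merit refinement in Sec. III is included as `squashed_merit`:

```python
import math

# Illustrative merit function: weighted trajectory cost plus equality and
# inequality constraint violations. Weights and the sigmoid scale are
# hypothetical choices, not the paper's values.

def merit(cost, g, h, w_l=1.0, w_eq=100.0, w_ineq=100.0):
    eq_viol = sum(gi * gi for gi in g)               # ||g||^2
    ineq_viol = sum(max(0.0, hi) ** 2 for hi in h)   # only violated h_i > 0
    return w_l * cost + w_eq * eq_viol + w_ineq * ineq_viol

def squashed_merit(cost, g, h, scale=50.0):
    """Soft range limiter: a monotone sigmoid bounds outlier scores from
    failed runs while preserving the ordering of merit values."""
    return 1.0 / (1.0 + math.exp(-merit(cost, g, h) / scale))
```

The ordering-preserving property matters because the GP only needs relative quality of samples, not their raw magnitudes.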
UL optimization
At the UL we consider the gait to be variable and select $a$ such that the merit $m$ is minimal. Here, we use the merit function as the UL objective to convey that an optimal gait should lead to both minimal cost and minimal constraint violation. The UL optimization is constrained by the LL solution, so the merit can only be evaluated by solving the NLP. We therefore choose a gradient-free method from BO to solve it. To this end, we model the merit function as a GP according to
(4) $m(p) \sim \mathcal{GP}\left(\mu_0(p),\; k(p, p')\right),$
where $\mu_0$ is a mean function and $k$ is a kernel function. The kernel function serves as a similarity measure and relates distance in pairing space to distance in merit space. We denote the set of observed samples by $D = \{P, y\}$, where $P$ is the matrix of observed UL decision variables and $y$ is the vector of corresponding merit scores, and we denote a new query by $p_*$. The GP model then uses the observations to infer predictions for new queries, while providing estimates of the associated uncertainty. It does so via the predicted mean $\mu(p_*)$ and predicted standard deviation $\sigma(p_*)$, defined as
(5) $\mu(p_*) = \mu_0(p_*) + k(p_*, P)\left[K + \sigma_n^2 I\right]^{-1}\left(y - \mu_0(P)\right),$
(6) $\sigma^2(p_*) = k(p_*, p_*) - k(p_*, P)\left[K + \sigma_n^2 I\right]^{-1} k(P, p_*),$
where $K = k(P, P)$ is the resulting kernel matrix and $\sigma_n^2$ encodes the noise level in the observations.
Context and actions
Following the nomenclature in [20], we encode planning tasks using a context $c$. The context here includes the distance to the goal location together with the terrain heightmap. The available gaits are referred to as actions $a$, which encode a contact sequence and the associated contact durations. We define a context-action pair as $p = (c, a)$. These pairs parameterize the UL optimization: subject to the observed context $c$, we try to select the action $a$ such that the merit $m(p)$ is minimal.
III Bayesian Optimization Algorithm
The concepts introduced in the previous section are combined into a single BO algorithm. In the following, we introduce the GP regression formulation with the associated termination criterion and further specify both the kernel function and the merit function employed. The corresponding pseudocode is provided in Algorithm 1.
GP regression
We employ GP regression to sequentially solve the UL optimization in (1). Specifically, we use a variation of the contextual Gaussian Process upper confidence bound algorithm (CGP-UCB) presented in [20]. At iteration $t$, a context $c_t$ is sampled uniformly at random from the context space $C$ and the posterior distribution is evaluated over the entire action space $A$. The action $a_t$ is then selected using upper confidence bound (UCB) sampling according to the acquisition function
(7) $a_t = \arg\min_{a \in A}\; \mu_{t-1}(c_t, a) - \beta_t^{1/2}\, \sigma_{t-1}(c_t, a),$
where $\mu_{t-1}$ and $\sigma_{t-1}$ denote the posterior mean and standard deviation of the previous iteration, and $\beta_t$ is an exploration-exploitation trade-off variable that encourages ongoing exploration. Based on the resulting pair $p_t = (c_t, a_t)$, we solve the NLP and compute a merit score from the optimized trajectory. These are then added to the set of observed samples and the GP posterior belief is updated. This process continues until a termination criterion is satisfied.
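A minimal sketch of the action-selection step: since the merit is minimized, the acquisition picks the action with the lowest optimistic bound, so a high-uncertainty action can win despite a worse mean. The toy posterior values below are illustrative stand-ins for a trained GP:

```python
import numpy as np

def select_action(context, actions, posterior, beta=2.0):
    """UCB-style selection for a minimization problem, cf. (7):
    choose the action minimizing mu - sqrt(beta) * sigma."""
    mu, sigma = posterior(context, actions)            # arrays over actions
    return actions[int(np.argmin(mu - np.sqrt(beta) * sigma))]

# Toy posterior: action 0 is well-explored, action 1 is uncertain.
def posterior(context, actions):
    mu = np.array([1.0, 1.2])       # uncertain action has the worse mean...
    sigma = np.array([0.01, 0.5])   # ...but high uncertainty, so it is tried
    return mu, sigma

a = select_action(0.5, np.array([0, 1]), posterior, beta=2.0)
```

With `beta=0` the exploration bonus vanishes and the selection degenerates to pure exploitation of the posterior mean.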
Termination criterion
We assert convergence of the regression by monitoring a discounted version of the relative prediction error. Specifically, we use a first-order low-pass filter on the squared relative residuals (fSRR). At iteration $t$, the expectation of the fSRR is computed as
(8) $\mathbb{E}[\mathrm{fSRR}]_t = (1 - \alpha)\, \mathbb{E}[\mathrm{fSRR}]_{t-1} + \alpha \left( e_t / y_t \right)^2,$
where $t$ indicates the current iteration index, $y_t$ is the observed merit score, and $e_t$ is the corresponding difference between the observation and the prediction. The scalar $\alpha \in (0, 1]$ defines how much emphasis is placed on the more recent predictions in computing the fSRR. Data acquisition is terminated once the fSRR falls below a predefined threshold $\epsilon$, provided as
(9) $\mathbb{E}[\mathrm{fSRR}]_t < \epsilon.$
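The filter in (8)-(9) can be sketched as a small stateful object; the filter constant and threshold below are illustrative values, not the paper's settings:

```python
# Sketch of the fSRR termination criterion: a first-order low-pass filter
# over squared relative prediction residuals. alpha and eps are hypothetical.

class FSRR:
    def __init__(self, alpha=0.1, eps=1e-3):
        self.alpha, self.eps, self.value = alpha, eps, None

    def update(self, y_obs, y_pred):
        """Blend the new squared relative residual (e_t / y_t)^2 into the
        running estimate; the first sample initializes the filter."""
        srr = ((y_obs - y_pred) / y_obs) ** 2
        self.value = srr if self.value is None else \
            (1.0 - self.alpha) * self.value + self.alpha * srr
        return self.value

    def converged(self):
        """Terminate data acquisition once the filtered error is small."""
        return self.value is not None and self.value < self.eps
```

A single accurate prediction does not trigger termination; the discounting requires the residuals to stay small over a run of iterations.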
Kernel function
The GP model requires a kernel function as a similarity metric. We follow [20] and define one kernel over context space and another kernel over action space. Our overall kernel is then a multiplicative combination of the two, encoding that samples are only similar if both their context and their action are similar:
(10) $k(p, p') = k_c(c, c')\; k_a(a, a'),$
with kernel hyperparameters $\ell_c$ and $\ell_a$. For the underlying kernel functions, we choose ARD Matérn-3/2 kernels, as these have limited smoothness and therefore some capability of accommodating potential steps in the sample data. Such discontinuities can arise during transitions from one contact schedule to another, or when a contact schedule suddenly becomes infeasible.
Merit refinement
Failed trajectories can lead to large constraint penalties in the merit function. Such outliers should be avoided, as GP models presume certain smoothness characteristics of the functions they approximate. We therefore pass the merit function $m$ in (3) through a sigmoid function to arrive at the new merit function $\tilde{m}$. The sigmoid function acts as a soft range limiter and ensures that the merit scores have an upper bound, while the transform remains continuous and preserves the ordering of the merit scores.
IV Modelling
We evaluate the performance of the proposed algorithm on a simulated robot. In the following, the robot model and the environmental model are discussed.
IV-A System
The agent considered is a single-legged hopper, modelled in the sagittal plane (Fig. 1). It consists of a base (B), a thigh, a shank, and actuated joints at the hip (H) and knee (K). The generalized coordinates and joint torques are
(11) $q = [x_B,\; z_B,\; \theta_H,\; \theta_K]^\top,$
(12) $\tau = [\tau_H,\; \tau_K]^\top,$
where $x_B$ is a horizontal position, $z_B$ is a vertical position, $\theta_i$ is a joint angle, and $\tau_i$ is the corresponding joint torque. The associated equations of motion (EoM) are written as
(13) $M(q)\, \ddot{q} + b(q, \dot{q}) + g(q) = S^\top \tau + J_c^\top \lambda,$
where $M$ denotes the inertia matrix, $b$ groups Coriolis and centrifugal effects, $g$ is the gravitational force vector, $S$ is the selection matrix, $\lambda$ refers to the ground reaction force, and $J_c$ is the corresponding Jacobian.
The system dynamics are represented as a hybrid system by projecting the rigid-body equations into the null space of the contact constraints. During flight, no external force is present and equation (13) can be solved directly for the generalized accelerations. During stance, friction-limited ground contact occurs at the point foot. Differentiating the associated contact constraint yields
(14) $J_c\, \ddot{q} + \dot{J}_c\, \dot{q} = 0,$
which is used to solve (13) simultaneously for the accelerations and the contact force. The stance dynamics are
(15) $\begin{bmatrix} M & -J_c^\top \\ J_c & 0 \end{bmatrix} \begin{bmatrix} \ddot{q} \\ \lambda \end{bmatrix} = \begin{bmatrix} S^\top \tau - b - g \\ -\dot{J}_c\, \dot{q} \end{bmatrix}.$
The resulting contact forces are furthermore constrained in our trajectory optimization (TO) to lie in the friction cone defined by the local terrain normal.
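Assuming the block structure of the stance-phase system described above, the simultaneous solve for accelerations and contact forces can be written compactly. The numerical values in the test are a toy point-mass example, not the hopper model:

```python
import numpy as np

# Sketch of the stance-phase solve: the EoM (13) and the twice-differentiated
# contact constraint (14) are stacked into one linear system, cf. (15).

def stance_dynamics(M, b, g, S, tau, Jc, jdot_qd):
    """Solve [[M, -Jc^T], [Jc, 0]] [qdd; lam] = [S^T tau - b - g; -Jdot*qd]
    for the generalized accelerations qdd and contact forces lam."""
    nq, nc = M.shape[0], Jc.shape[0]
    A = np.block([[M, -Jc.T], [Jc, np.zeros((nc, nc))]])
    rhs = np.concatenate([S.T @ tau - b - g, -jdot_qd])
    sol = np.linalg.solve(A, rhs)
    return sol[:nq], sol[nq:]
```

For a point mass resting on the ground with no applied torque, the vertical acceleration must vanish and the contact force must exactly cancel gravity, which makes a convenient sanity check.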
IV-B Environment
The environment is encoded via a heightmap representation. Terrain height is sampled along the horizontal coordinate with a fixed discretization $\Delta x$ on a bounded interval. For a total of $N$ samples, the resulting heightmap is
(16) $H = \{\, h_i = h(x_i) \mid x_i = x_0 + i\, \Delta x,\; i = 0, \dots, N-1 \,\}.$
A $C^1$-continuous representation of the terrain is recovered by leveraging shape-preserving piecewise cubic interpolation. Terrain height and gradient at intermediate locations are then computed from the interpolation polynomials.
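Such a shape-preserving reconstruction corresponds to PCHIP interpolation, available for instance in SciPy; the grid spacing and height values below are illustrative, not the experimental terrain:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Sketch of the heightmap in (16) and its C^1 reconstruction: PCHIP gives
# terrain height and gradient at arbitrary x without overshooting samples.

x = np.arange(6) * 0.1                          # sample locations x_i
h = np.array([0.0, 0.0, 0.1, 0.1, 0.05, 0.0])   # heightmap samples h_i

terrain = PchipInterpolator(x, h)   # shape-preserving piecewise cubic
slope = terrain.derivative()        # terrain gradient, e.g. for contact normals

height_at = terrain(0.25)
grad_at = slope(0.25)
```

Unlike a plain cubic spline, the shape-preserving variant never overshoots the samples, so interpolated terrain stays within the sampled height range, which matters when the gradient feeds the friction-cone constraints.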
V Experiments
Our main interest is in whether the GP model can learn reasonable contact schedule transitions for a given task. We are furthermore interested in whether the resulting predictor can perform competitively against a heuristic baseline. In the following, we introduce the general experimental setup and provide results for planning motions on both flat ground and uneven terrain.
V-A Preliminaries
The following describes the general setup used for both the flat ground scenario and the uneven terrain scenario. First, the representation of context and action is provided. Then, the algorithm used for assessing model performance is outlined.
V-A1 Context-action space
The input space of the regression is spanned by the context space $C$ and the action space $A$. A context consists of a goal distance and an $n$-dimensional vector of terrain features, denoting varying heightmap values. We consider goal distances and terrain height magnitudes up to fixed maxima $d_{\max}$ and $h_{\max}$, such that
(17) $C = [0,\, d_{\max}] \times [-h_{\max},\, h_{\max}]^n.$
An action consists of a vector of alternating contact- and flight-phase durations. As we employ collocation nodes of constant time length, phase durations are encoded by the number of nodes. We consider up to two jumps and phase node counts within a bounded range, as these encourage dynamic motions with lower jerk. We then have
(18) $a = [s_1,\; f_1,\; s_2,\; f_2,\; s_3]^\top \in A,$
where $s_i$ and $f_i$ denote the node counts of the stance and flight phases, respectively.
The UL optimization determines the desired gait by setting certain phases to zero. For example, a single stance phase is encoded by an action in which only the first stance duration is nonzero, a single jump by an action with one nonzero flight phase between two stance phases, and a double jump by an action in which all phases are nonzero.
V-A2 Performance metric
Performance of a trained GP model is assessed on a set of contexts sampled uniformly at random. For each context, the model predicts an optimal action and the merit score is determined by running the optimization. The procedure is outlined in Algorithm 2.
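In the spirit of Algorithm 2 (whose pseudocode is not reproduced here), the evaluation loop might be sketched as follows, with stand-ins for the trained model and the trajectory optimizer:

```python
import random

# Hedged sketch of the evaluation procedure: sample contexts uniformly,
# let the model pick its predicted-best action, run the optimization,
# and record the merit score. model and run_nlp below are toy stand-ins.

def evaluate(model, run_nlp, sample_context, n_runs=100, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        c = sample_context(rng)        # context drawn uniformly at random
        a = model(c)                   # predicted optimal action
        scores.append(run_nlp(c, a))   # merit score from the optimization
    return scores

scores = evaluate(model=lambda c: 0 if c < 0.5 else 1,
                  run_nlp=lambda c, a: abs(c - 0.5 * a),
                  sample_context=lambda rng: rng.random(),
                  n_runs=50)
```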
V-B Scenario 1: flat ground
The scenario of hopping over flat ground is considered first. We refer to this flat terrain as FT; its terrain features are identically zero, i.e., the terrain is not part of the context for this experiment.
V-B1 Learning contact transitions
We trained a GP model on flat terrain via Algorithm 1 and refer to it as GP-FT. Fig. 2 provides the fSRR termination criterion over the number of iterations. The error metric decreases steadily, which means that the predicted merit scores become more accurate over time. Discontinuities in the fSRR indicate larger updates in the posterior belief, which can happen when exploring new parts of the context-action space. Fig. 3 provides an overview of the learned contact sequence transitions over different goal distances, while Fig. 4 displays a selection of learned contact schedules. From Fig. 3 it is evident that the GP model is capable of learning natural contact sequence transitions that progress from stance through a single jump to a double jump. Referring to the contact schedules in Fig. 4, the model furthermore learned to vary contact durations within the contact sequences. The overall trend of adapting the phase durations or adding steps to the contact schedule as the goal distance increases matches what human intuition would suggest.
V-B2 Performance
We compare the performance of our learned model to a heuristic baseline. The baseline runs the trajectory optimization for a fixed set of contact schedules and selects the best action based on the resulting merit scores. The individual schedules were selected before model training based on intuition. Initial and final stance phases were chosen sufficiently long to initiate and terminate dynamic hopping. Intermediate stance phases are shorter to keep motions fluid. Flight phases are short to limit energy expenditure. The action set available to the baseline is
(19) 
The performance of the GP-FT model was evaluated against the baseline according to Algorithm 2. The results are provided in Fig. 5, which illustrates the merit scores at different goal distances for the baseline and GP-FT approaches. In general, the trained GP-FT model performs better than the baseline method. Near the phase transitions the performance of the two converges, with the baseline having a slight edge at the stance-to-single-jump and the single-jump-to-double-jump transitions.
V-C Scenario 2: uneven terrain
The scenario of hopping over uneven terrain is considered next. Fig. 6 shows some of the randomly generated terrains used for training the GP. Here, we vary a subset of the terrain nodes to create obstacles. Thus, in the following, we use the corresponding heightmap values as terrain context features to describe the terrain.
V-C1 Learning contact transitions
We trained a GP model on randomly generated rough terrain via Algorithm 1 and refer to it as GP-RT. Training proceeded in two stages. First, the model was trained on flat ground until convergence to an intermediate fSRR value. This initializes the model with unperturbed samples of the underlying system dynamics. Then, the model was trained on rough terrain until convergence to the final fSRR value. At each iteration, a terrain was randomly sampled from a Gaussian distribution.
V-C2 Performance
The performance of the GP-RT model was evaluated against the GP-FT model according to Algorithm 2. The results are provided in Fig. 7. The merit scores were smoothed using a moving-average filter for interpretability. The GP-RT model incurs lower merit scores than the GP-FT model on average. This can in part be attributed to the GP-RT model selecting actions that fail less frequently than those selected by the GP-FT model: GP-FT-generated motions fail in a notably larger fraction of the evaluated runs than GP-RT motions. Note that we restrict our analysis to a maximum of two jumps, so a randomly sampled context may yield a truly infeasible objective (e.g., Fig. 8 bottom at the maximum goal distance).
In general, the GP-FT model acts aggressively and selects very dynamic motions, i.e., short contact schedules, at the risk of failing. The GP-RT model acts more conservatively and adapts its actions to the observed terrain to avoid failing. For this reason, we observed the GP-RT actions to perform slightly worse than the GP-FT actions on terrains with limited roughness. However, GP-RT motions demonstrate an increased level of robustness on terrains with more severe roughness. This behavior can be attributed to the underlying probabilistic model used for modeling the context (i.e., the terrain heightmap). In the GP-RT case, the optimized gait policy always assumes some level of uncertainty over the observed terrain. To avoid failure, the GP-RT policy chooses gait sequences that are less prone to issues such as premature TO convergence or falling into a local minimum.
An example of this robust behavior is provided in Fig. 8. Here, the GP-RT model selects a double-jump schedule, while the GP-FT model selects a single-jump schedule. On the smaller obstacle (top), GP-RT performs slightly worse than GP-FT. On the higher obstacle (bottom), GP-RT performs significantly better than GP-FT, because the lower-level NLP for the latter does not converge. Learning the terrain features is therefore advantageous for selecting contact schedules competitively.
VI Conclusion
In this paper, we proposed a method that learns to select contact schedules based on high-level task descriptors. The problem is cast as a bilevel optimization, where contact planning proceeds at the UL and constrained TO at the LL. The performance of a trajectory resulting from a specific contact schedule is quantified using a merit function. The merit function is modeled as a GP, and contact schedule selection is learned via a BO approach.
It was shown that the GP model is capable of learning contact schedule transitions that align with human intuition. The trained model outperforms a heuristic baseline in predicting task-appropriate contact schedules. During training, the GP model learns about both the underlying system dynamics and the interaction with the specific terrain. In doing so, it does not necessarily find the global optimum of the strongly nonconvex problem, but it learns to select contact schedules with which the NLP solver performs well.
It was demonstrated that a model trained on rough terrain outperforms a model trained only on flat terrain, highlighting that the method can incorporate terrain information into its decisions. However, the model trained on flat terrain comes surprisingly close to the rough-terrain performance, which shows that our bilevel formulation provides a certain robustness: even when the UL does not adapt its decision to the terrain, the LL optimizes with full terrain information and manages to find reasonable solutions. The main difference is seen in scenarios with extreme terrain features, where the TO cannot converge with the flat-terrain gait while the rough-terrain gait succeeds in the task.
Future work will include the extension to longer gait sequences over more complex terrains. Moreover, we wish to apply the method in a 3D setting as well. For both aspects, the scalability of GPs to the larger input space has to be addressed, as exact GP regression has $O(n^3)$ complexity in the number of samples. A promising strategy would be to employ a deep neural network to imitate the GP model and thus achieve better scalability at the expense of longer training sessions.
References
 [1] S. Kajita, F. Kanehiro, K. Kaneko, K. Yokoi, and H. Hirukawa, “The 3D linear inverted pendulum mode: a simple modeling for a biped walking pattern generation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 1, 2001, pp. 239–246.
 [2] J. Pratt, J. Carff, S. Drakunov, and A. Goswami, “Capture point: A step toward humanoid push recovery,” in 6th IEEE-RAS International Conference on Humanoid Robots (Humanoids). IEEE, 2006, pp. 200–207.
 [3] R. Deits and R. Tedrake, “Footstep planning on uneven terrain with mixed-integer convex optimization,” in IEEE-RAS International Conference on Humanoid Robots (Humanoids). IEEE, 2014, pp. 279–286.
 [4] C. Mastalli, M. Focchi, I. Havoutis, A. Radulescu, S. Calinon, J. Buchli, D. G. Caldwell, and C. Semini, “Trajectory and foothold optimization using low-dimensional models for rough terrain locomotion,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1096–1103.

 [5] N. Hansen, “The CMA evolution strategy: a comparing review,” in Towards a New Evolutionary Computation. Springer, 2006, pp. 75–102.
 [6] J. J. Kuffner, S. Kagami, K. Nishiwaki, M. Inaba, and H. Inoue, “Dynamically-stable motion planning for humanoid robots,” Autonomous Robots, vol. 12, no. 1, pp. 105–118, 2002.
 [7] C. Lau and K. Byl, “Smooth RRT-Connect: An extension of RRT-Connect for practical use in robots,” in IEEE International Conference on Technologies for Practical Robot Applications (TePRA). IEEE, 2015, pp. 1–7.
 [8] S. Bartoszyk, P. Kasprzak, and D. Belter, “Terrain-aware motion planning for a walking robot,” in 11th International Workshop on Robot Motion and Control (RoMoCo). IEEE, 2017, pp. 29–34.
 [9] F. Farshidian, M. Neunert, A. W. Winkler, G. Rey, and J. Buchli, “An efficient optimal planning and control framework for quadrupedal locomotion,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 93–100.
 [10] D. Pardo, M. Neunert, A. Winkler, R. Grandia, and J. Buchli, “Hybrid direct collocation and control in the constraint-consistent subspace for dynamic legged robot locomotion,” in Robotics: Science and Systems, 2017.
 [11] A. W. Winkler, C. D. Bellicoso, M. Hutter, and J. Buchli, “Gait and trajectory optimization for legged systems through phase-based end-effector parameterization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1560–1567, 2018.
 [12] I. Mordatch, E. Todorov, and Z. Popović, “Discovery of complex behaviors through contact-invariant optimization,” ACM Transactions on Graphics (TOG), vol. 31, no. 4, p. 43, 2012.
 [13] M. Posa, C. Cantu, and R. Tedrake, “A direct method for trajectory optimization of rigid bodies through contact,” The International Journal of Robotics Research, vol. 33, no. 1, pp. 69–81, 2014.
 [14] F. Farshidian, M. Neunert, and J. Buchli, “Learning of closed-loop motion control,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2014, pp. 1441–1446.
 [15] J. Viereck, J. Kozolinsky, A. Herzog, and L. Righetti, “Learning a structured neural network policy for a hopping task,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4092–4099, 2018.

 [16] S. Levine and V. Koltun, “Guided policy search,” in International Conference on Machine Learning, 2013, pp. 1–9.
 [17] F. R. Hogan, E. R. Grau, and A. Rodriguez, “Reactive planar manipulation with convex hybrid MPC,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 247–253.
 [18] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning, ser. Adaptative computation and machine learning series. University Press Group Limited, 2006.
 [19] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” in Proceedings of the 27th International Conference on International Conference on Machine Learning, ser. ICML. USA: Omnipress, 2010, pp. 1015–1022.
 [20] A. Krause and C. S. Ong, “Contextual Gaussian process bandit optimization,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 2447–2455.
 [21] A. Rai, R. Antonova, S. Song, W. Martin, H. Geyer, and C. Atkeson, “Bayesian optimization using domain knowledge on the ATRIAS biped,” in IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1771–1778.

 [22] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth, “Bayesian optimization for learning gaits under uncertainty,” Annals of Mathematics and Artificial Intelligence, vol. 76, no. 1-2, pp. 5–23, 2016.
 [23] M. Toussaint, K. R. Allen, K. A. Smith, and J. B. Tenenbaum, “Differentiable physics and stable modes for tool-use and manipulation planning,” in Proceedings of Robotics: Science and Systems, 2018.
 [24] F. Farshidian, M. Kamgarpour, D. Pardo, and J. Buchli, “Sequential linear quadratic optimal control for nonlinear switched systems,” in International Federation of Automatic Control (IFAC). Elsevier, 2017, pp. 1463–1469.
 [25] A. Bemporad and M. Morari, “Control of systems integrating logic, dynamics, and constraints,” Automatica, vol. 35, no. 3, pp. 407–427, 1999.