Learning walk and trot from the same objective using different types of exploration

by Zinan Liu, et al.

In quadruped gait learning, policy search methods that scale to high-dimensional continuous action spaces are commonly used. Most approaches need prior knowledge of the gaits to limit the highly non-convex search space of the policies. In this work, we propose a new approach that encodes the symmetry properties of the desired gaits in the initial covariance of the Gaussian search distribution, allowing for strategic exploration. Using episode-based likelihood ratio policy gradient and relative entropy policy search, we learned the gaits walk and trot on a simulated quadruped. Comparing these gaits to random gaits learned from a diagonal initial covariance matrix, we show that the performance can be significantly enhanced.




1 Introduction

In nature, animals have developed a wide range of gaits to adapt to different terrains and situations, such as a horse galloping for speed or a lizard trotting for stable locomotion. In recent years, quadrupedal gait learning has attracted research interest in robotics. Quadruped gaits offer a wide range of different movement patterns. As the cyclic movements of all four legs are similar, the gaits can be categorized mainly by the timing and order of the footfalls, which can be represented as phase gaps among the trajectories of the legs.

In the presented work we learn open-loop control policies for various gaits, focusing on walk and trot. In walk, the leg trajectories are separated by quarter-phase gaps, resulting in equidistant footfalls, whereas in trot, diagonal pairs of legs move synchronously and the two pairs are separated by a half-phase gap. Other gaits that can be learned using the described approach are bound and pace. We show how these symmetry properties can be encoded in the parameter space of the chosen policy representation in order to enhance the initial exploration and reliably learn the chosen gaits.
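The phase-gap description of walk and trot can be made concrete with a small lookup table. This is a minimal sketch: the leg names, the `GAITS` table, and both helper functions are illustrative assumptions rather than part of the described method. The walk order follows the snapshot description in Section 3.3 (front right, hind left, front left, hind right), and which diagonal pair leads in trot is chosen arbitrarily.

```python
# Phase offsets per leg, as fractions of one gait cycle (illustrative values).
GAITS = {
    # Walk: quarter-phase gaps, one leg after another.
    "walk": {"front_right": 0.0, "hind_left": 0.25,
             "front_left": 0.5, "hind_right": 0.75},
    # Trot: diagonal pairs in sync, the two pairs half a cycle apart.
    "trot": {"front_left": 0.0, "hind_right": 0.0,
             "front_right": 0.5, "hind_left": 0.5},
}


def footfall_groups(gait):
    """Group legs that share a phase offset, ordered by increasing offset."""
    groups = {}
    for leg, offset in GAITS[gait].items():
        groups.setdefault(offset, []).append(leg)
    return [sorted(groups[offset]) for offset in sorted(groups)]


def phase_gap(gait, leg_a, leg_b):
    """Phase gap from leg_a to leg_b, wrapped into [0, 1)."""
    offsets = GAITS[gait]
    return (offsets[leg_b] - offsets[leg_a]) % 1.0
```

Calling `footfall_groups("trot")` returns the two diagonal pairs, while `footfall_groups("walk")` returns four single-leg groups, which is the distinction the covariance initialization later exploits.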

We neither fully define the gait in the policy representation, as in [3, 7, 8, 12], nor learn random gaits [6], which could lead to a highly non-convex problem. We only initialize the exploration. This way we can learn different gaits as a set of skills sharing a common representation, which allows choosing the gait best suited to the current terrain.

Traditional reinforcement learning methods typically estimate the expected long-term reward at each state $s$ and each time step $t$ to evaluate the quality of executing an action $a$ in state $s$. This is referred to as the value function $V(s)$. Given a certain state and all possible actions, the action-value function $Q(s, a)$ is computed, and an optimal action is then selected to maximize it [13]. However, this approach is problematic in high-dimensional state-action spaces, since the whole space would need to be covered exhaustively. In addition, gait learning of a quadruped requires continuous states and actions, meaning we would have to resort to value function approximation.

On the other hand, policy search methods scale to high-dimensional continuous action spaces by reducing the problem to a search over possible policies. This is done by parameterizing the search distribution $\pi(\theta \mid \omega)$ and operating and updating directly in the parameter space $\Theta$ [4]. In this work, we learn a Gaussian search distribution that maximizes the expected reward, using Likelihood Ratio Policy Gradient [15] and Relative Entropy Policy Search [10]. The trajectories of the quadruped are computed as a linear combination of von Mises basis functions $\phi$ and their corresponding weights $w$, i.e., $\phi^\top w$. The generated trajectories are fed to a PD controller to deterministically compute the motor torques.

2 Related Works

This section first presents the policy representation used in the experiments, followed by the episode-based policy search algorithms applied to it.

2.1 Policy Representation

The explored policies are central pattern generators using a linear representation $q^{\text{des}}_t = \phi(z_t)^\top w$, with $\phi$ being the von Mises basis functions,

$$\phi_i(z_t) = \exp\left(\frac{\cos\left(2\pi (z_t - c_i)\right)}{h}\right), \tag{1}$$

where $h$ is the width, $c_i$ are the equidistant centers of the basis functions, and $z_t = \alpha t$ is the phase, which is linear in time with a temporal scaling factor $\alpha$ describing the speed of executing the trajectory [7, 9]. The policy parameters to be learned then consist of the weights $w$ and the temporal scaling factor $\alpha$.
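A minimal sketch of this representation, written in plain Python to stay dependency-free. The function names, the default width of 0.5, the centers at `i / n_basis`, and the phase `z = alpha * t` with one cycle spanning `[0, 1)` are assumptions made for illustration; the basis is left unnormalized. The `shift` helper shows why cyclic basis functions are convenient: cyclically shifting the weights delays the trajectory by a multiple of `1 / n_basis` of the cycle.

```python
import math


def von_mises_features(z, n_basis=4, width=0.5):
    """Von Mises basis activations at phase z (one cycle spans [0, 1))."""
    centers = [i / n_basis for i in range(n_basis)]  # equidistant centers
    return [math.exp(math.cos(2.0 * math.pi * (z - c)) / width)
            for c in centers]


def desired_position(t, weights, alpha=1.0, width=0.5):
    """Desired joint position at time t: weighted sum of basis activations."""
    z = alpha * t  # the phase is linear in time
    phi = von_mises_features(z, n_basis=len(weights), width=width)
    return sum(p * w for p, w in zip(phi, weights))


def shift(weights, k):
    """Cyclically shift a weight list by k positions."""
    k %= len(weights)
    return weights[-k:] + weights[:-k]
```

With four basis functions, `desired_position(t + 0.25, shift(w, 1))` matches `desired_position(t, w)` up to rounding, so one shift of the weights corresponds to a quarter-phase gap.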

To simplify the learning problem, we hand-tuned the initial standing posture of the quadruped to prevent it from falling. Each rollout then begins from this standing initialization.

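The trajectories defined by the learned policy are then fed to a PD controller to deterministically compute the motor torques $u_t$,

$$u_t = K_P\,\bigl(q^{\text{des}}_t - q_t\bigr) + K_D\,\bigl(\dot{q}^{\text{des}}_t - \dot{q}_t\bigr), \tag{2}$$

where $K_P$ and $K_D$ are the controller gains, the desired velocities $\dot{q}^{\text{des}}_t$ are approximated using forward differences of the desired joint positions $q^{\text{des}}_t$, and $q_t$, $\dot{q}_t$ are the observed state variables.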
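The controller step can be sketched for a single joint as follows. The gains `kp` and `kd`, the function names, and the handling of the last sample (it reuses the previous difference, since no later sample exists) are assumptions for illustration, not values or choices taken from the experiments.

```python
def forward_difference(positions, dt):
    """Approximate desired velocities from desired positions.

    The last sample has no successor, so it reuses the previous difference.
    """
    velocities = [(positions[k + 1] - positions[k]) / dt
                  for k in range(len(positions) - 1)]
    velocities.append(velocities[-1] if velocities else 0.0)
    return velocities


def pd_torque(q_des, qd_des, q, qd, kp=10.0, kd=1.0):
    """PD control law for one joint: position error plus velocity error."""
    return kp * (q_des - q) + kd * (qd_des - qd)
```

Because the torque is a deterministic function of the desired trajectory and the observed state, all exploration stays in the policy parameters rather than in the actions.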

2.2 Episode-based Policy Search Algorithms

Episode-based policy search algorithms typically maximize the expected reward over a Gaussian search distribution $\pi(\theta \mid \omega) = \mathcal{N}(\theta \mid \mu, \Sigma)$ throughout the episode, w.r.t. the distribution mean $\mu$ [4]. Independent of the policy search algorithm, the parameter vector $\theta^*$ used for execution after the learning process has terminated is chosen as the final update of $\mu$.

2.2.1 Likelihood Ratio Policy Gradient

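Policy gradient methods [11] use gradient ascent to maximize the expected return $J(\omega) = \mathbb{E}_{\pi(\theta \mid \omega)}[R(\theta)]$ over the search distribution w.r.t. its parameters $\omega$,

$$\omega_{k+1} = \omega_k + \beta\, \nabla_\omega J(\omega_k), \tag{3}$$

where $\beta$ is the learning rate. In the following we estimate the gradient according to the Likelihood Ratio Policy Gradient method (LRPG) [15] with the batch mean reward as baseline $b$, sampling $\theta_i$, $i = 1, \dots, N$, from a Gaussian search distribution $\pi(\theta \mid \omega) = \mathcal{N}(\theta \mid \mu, \Sigma)$. A derivation can be found in Section 5.1.

$$\nabla_\omega J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\omega \log \pi(\theta_i \mid \omega)\,\bigl(R(\theta_i) - b\bigr), \qquad b = \frac{1}{N} \sum_{i=1}^{N} R(\theta_i). \tag{4}$$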
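The likelihood-ratio estimate with a batch-mean baseline can be sketched as below. To keep the example short it makes two simplifying assumptions that are not claims about the experiments: the covariance is diagonal and only the mean is updated. For a Gaussian, the gradient of the log-density w.r.t. the mean is the deviation of the sample from the mean divided by the variance. The function names, the learning rate, and the toy reward in the usage note are hypothetical.

```python
import random


def lrpg_mean_gradient(samples, rewards, mean, variances):
    """Likelihood-ratio gradient w.r.t. the mean of a diagonal Gaussian."""
    n = len(samples)
    baseline = sum(rewards) / n  # batch mean reward as baseline
    grad = [0.0] * len(mean)
    for theta, reward in zip(samples, rewards):
        advantage = reward - baseline
        for j, (th, mu, var) in enumerate(zip(theta, mean, variances)):
            # d/d(mu) of log N(th | mu, var) is (th - mu) / var
            grad[j] += (th - mu) / var * advantage / n
    return grad


def lrpg_step(reward_fn, mean, variances, batch_size=50,
              learning_rate=0.05, rng=random):
    """One policy update: sample a batch, estimate the gradient, ascend."""
    samples = [[rng.gauss(mu, var ** 0.5) for mu, var in zip(mean, variances)]
               for _ in range(batch_size)]
    rewards = [reward_fn(theta) for theta in samples]
    grad = lrpg_mean_gradient(samples, rewards, mean, variances)
    return [mu + learning_rate * g for mu, g in zip(mean, grad)]
```

On a toy reward such as `lambda th: -(th[0] - 2.0) ** 2`, repeated calls to `lrpg_step` move the mean toward 2. The fixed step size is one reason such updates can be slow or unstable compared to a KL-bounded update.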


2.2.2 Relative Entropy Policy Search

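Similar to the Expectation Maximization (EM) algorithm [5], Relative Entropy Policy Search (REPS) also updates the mean and covariance of the search distribution to maximize the expected reward. In addition, REPS bounds the Kullback-Leibler (KL) divergence between successive search distributions, forcing the parameter update to stay close to the sampled data. Considering the old distribution $q(\theta)$ and the newly estimated distribution $\pi(\theta)$, the KL divergence is bounded by $\epsilon$, where $\epsilon > 0$. This results in a constrained optimization problem

$$\max_{\pi} \int \pi(\theta)\, R(\theta)\, d\theta \quad \text{s.t.} \quad \mathrm{KL}\bigl(\pi(\theta)\,\|\,q(\theta)\bigr) \le \epsilon, \quad \int \pi(\theta)\, d\theta = 1. \tag{5}$$

The optimization can be solved using the method of Lagrangian multipliers, resulting in a closed-form solution for the new distribution

$$\pi(\theta) \propto q(\theta)\, \exp\left(\frac{R(\theta)}{\eta}\right), \tag{6}$$

where $\eta$ is obtained by minimizing the dual function $g(\eta)$, approximated by samples,

$$g(\eta) = \eta\,\epsilon + \eta \log \left( \frac{1}{N} \sum_{i=1}^{N} \exp\left(\frac{R(\theta_i)}{\eta}\right) \right). \tag{7}$$

The new distribution is then estimated by fitting a parametric distribution to the samples $\theta_i$ and the corresponding rewards $R(\theta_i)$. The parametric distribution is estimated using weighted maximum likelihood, where the weightings are $d_i = \exp\left(R(\theta_i)/\eta\right)$. Given the Gaussian distribution of our setup, $\pi(\theta) = \mathcal{N}(\theta \mid \mu, \Sigma)$, the weighted maximum likelihood solution for updating the distribution, namely $\mu$ and $\Sigma$, is given by [5]

$$\mu = \frac{\sum_{i} d_i\, \theta_i}{\sum_{i} d_i}, \tag{8}$$

$$\Sigma = \frac{\sum_{i} d_i\, (\theta_i - \mu)(\theta_i - \mu)^\top}{\sum_{i} d_i}. \tag{9}$$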
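A minimal sketch of one episode-based REPS update in its standard sample-based form: minimize the dual over the temperature, turn rewards into exponential weights, then fit a Gaussian by weighted maximum likelihood. The function names, the golden-section search, and its bounds are assumptions for illustration; a practical implementation would more likely use a numerical optimizer and matrix routines.

```python
import math


def reps_dual(eta, rewards, epsilon):
    """Sample-based dual: eta*eps + eta*log(mean(exp(R/eta)))."""
    r_max = max(rewards)  # subtract the maximum for numerical stability
    mean_exp = sum(math.exp((r - r_max) / eta) for r in rewards) / len(rewards)
    return eta * epsilon + r_max + eta * math.log(mean_exp)


def minimize_dual(rewards, epsilon, lo=1e-6, hi=1e3, iters=200):
    """Golden-section search for the eta minimizing the convex dual."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if reps_dual(c, rewards, epsilon) < reps_dual(d, rewards, epsilon):
            b = d
        else:
            a = c
    return 0.5 * (a + b)


def reps_update(samples, rewards, epsilon):
    """Return the weighted maximum likelihood mean, covariance, and eta."""
    eta = minimize_dual(rewards, epsilon)
    r_max = max(rewards)
    weights = [math.exp((r - r_max) / eta) for r in rewards]
    total = sum(weights)
    dim = len(samples[0])
    mean = [sum(w * s[j] for w, s in zip(weights, samples)) / total
            for j in range(dim)]
    cov = [[sum(w * (s[j] - mean[j]) * (s[k] - mean[k])
                for w, s in zip(weights, samples)) / total
            for k in range(dim)] for j in range(dim)]
    return mean, cov, eta
```

A smaller `epsilon` yields a larger `eta` and flatter weights, so the new distribution stays closer to the sampled data. Since the covariance is refit from weighted samples at every update, any structure in the initial covariance persists only as long as the samples keep reflecting it.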



3 Evaluations

In this section, we evaluate the performance of the Likelihood Ratio Policy Gradient and the Relative Entropy Policy Search algorithms given different initializations of the covariance of the Gaussian search distribution, which cause specific desired gaits to be learned.

3.1 Simulated Experimental Setup

We used the OpenAI Gym toolkit [2] with the MuJoCo simulator [14], and adapted the predefined ant environment to a quadruped with eight parallel joints. The quadruped is a completely symmetric robot with four hip joints and four knee joints, each driven by an actuator, for a total of eight degrees of freedom. Four basis functions are used for each joint, allowing for quarter-phase gaps while reducing the number of weights to learn. In total, the parameters to be learned consist of 32 weights $w$ and the temporal scaling factor $\alpha$.

Throughout the simulation, a batch size of 50 is used, and each learning episode consists of 200 time frames, equivalent to 2 seconds given a time step of 0.01 s. The number of policy updates is fixed to 60. Ten trials are carried out for each configuration to evaluate the robustness of the algorithms.

3.2 Reward Function and Algorithm Comparison

The reward function combines the forward translation with three penalties: the cost of the robot falling over, the contact costs between the feet and the ground, and the control costs for the motors. The translation term is computed from the total distances traveled in the x and y directions in one episode and from the duration of the episode. Since each learning process is initialized in a standing position, only one walking direction is rewarded, namely the forward direction.

To compare the learning behavior of REPS and LRPG, both are evaluated with hand-tuned hyperparameters (a learning rate for LRPG and a KL bound for REPS) and diagonal covariance initializations. The learning curves of the expected reward and the corresponding standard deviation are shown in Figure 1, where REPS learns relatively fast in the initial phase before converging to a stable solution. This is due to the applied KL bound, which forces the update to stay close to the old distribution [10]. However, REPS also suffers from premature convergence, since the updates of the covariance reduce further exploration [1]. LRPG shows poor convergence properties, yet its exploration sometimes finds policies that outperform those found with REPS.

Figure 1: Typical learning curves of Likelihood Ratio Policy Gradient (LRPG) and Relative Entropy Policy Search (REPS) with diagonal covariance initialization, averaged over 50 rollouts per update.

3.3 Different Initializations

Aside from a diagonal initialization of the search distribution's covariance matrix, a customized covariance matrix can be used to incorporate prior knowledge about the symmetry properties of a specific gait. It is composed of the variance along each dimension, a symmetric matrix that encodes which parameters are coupled, and a scalar degree of coupling. This way the exploration is limited to certain search directions of the parameter space, promoting the learning of movements with the desired symmetry properties.
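The exact formula for the customized covariance is not reproduced in this text. One plausible form, assumed here purely for illustration, blends the identity with the symmetric coupling structure: the degree of coupling `kappa` interpolates between a diagonal covariance (`kappa = 0`) and the pure structure (`kappa = 1`), and `sigma2` scales the result. The function name and all parameter names are hypothetical.

```python
def coupled_covariance(structure, sigma2=1.0, kappa=0.5):
    """Assumed form: sigma2 * ((1 - kappa) * I + kappa * structure)."""
    n = len(structure)
    return [[sigma2 * ((1.0 - kappa) * (1.0 if i == j else 0.0)
                       + kappa * structure[i][j])
             for j in range(n)] for i in range(n)]


# Two parameters that should move together (for example, the same weight
# of two synchronized legs) share a fully coupled 2 x 2 structure.
synchronized = [[1.0, 1.0], [1.0, 1.0]]
half_coupled = coupled_covariance(synchronized, sigma2=2.0, kappa=0.5)
# half_coupled == [[2.0, 1.0], [1.0, 2.0]]
```

Under this assumed form, full coupling makes the covariance rank-deficient, so samples are confined to the subspace in which the coupled parameters stay identical, while intermediate values still allow some exploration outside it.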

(a) Learning curves averaged over 10 trials of Likelihood Ratio Policy Gradient, each using specific covariance initializations at a batch-size of 50
(b) Learning curves averaged over 10 trials of Relative Entropy Policy Search, each using specific covariance initializations at a batch-size of 50
Figure 2: Learning curves of LRPG and REPS with different covariance initializations.

In Figures 2(a) and 2(b) three different initializations are applied: a diagonal matrix and two non-diagonal matrices specific to walk and trot, with a given degree of coupling. Each learning curve is averaged over 10 trials to evaluate the robustness of the policy updates, and the colored regions indicate the standard deviation of the expected reward over the 10 trials. In Figure 2(b) we can see that, with the non-diagonal initializations, both the slope of the learning curves and the expected reward of the learned policy increase compared to the diagonal initialization. This indicates that the KL-bounded updates of REPS are able to exploit the prior knowledge, even though REPS changes the covariance matrix in each update. It also shows that for this robot the trot gait is inherently faster than the walk gait, which is the usual case for most quadrupeds [16]. In Figure 2(a), however, LRPG tends to show decreased performance for non-diagonal initializations.

(a) Snapshots of the walking gait
(b) Snapshots of the trotting gait
Figure 3: Snapshots of different gaits learned via full covariance initialization.

With LRPG as well as REPS, movements matching the desired gaits are reliably learned given a sufficient degree of coupling. Figure 3 shows snapshots of the learned walk and trot gaits. The gaits are learned using REPS, initialized with the corresponding non-diagonal covariance matrices specific to each gait. It can be seen that the learned gaits exhibit the expected symmetry properties. Figure 3(a) shows one cycle of the walk gait. The legs lift individually in the order front right, hind left, front left, and hind right. Figure 3(b) shows one cycle of the trot gait. The front left and hind right legs lift synchronously, as do the front right and hind left legs. In between, a standing posture can be observed.

3.4 Concrete Initializations

In this section we discuss how to create a template covariance matrix given the phase gaps specific to a desired gait. Note that this template only indicates the non-zero elements of the scaled covariance matrix defined in (11) (Section 3.3), excluding the part corresponding to the temporal scaling, which is always diagonal. We assume maximal coupling, which reduces the exploration to a lower-dimensional subspace of the parameter space. In this case the coupling of the movement is identical to the coupling of the exploration, as no parameters outside of the subspace are sampled.

In Figure 4 the initializations of the covariance matrix are displayed as heat maps, divided into 8x8 blocks, each corresponding to the coupling between the 8 weights describing one leg trajectory and those of another. Each block can be further divided into 4x4 sub-blocks corresponding to the joint-wise coupling. In our setup the knees are not coupled with the hips, therefore only the 4x4 sub-blocks on the diagonal of each 8x8 block are non-zero. Using 4 identically distributed cyclic basis functions per joint, any multiple of a quarter-phase gap can be expressed as a permutation of the weights. All 4x4 sub-blocks are identical to the permutation matrix that applies the desired multiple of quarter-phase gaps, which in our case is a cyclic shift of the weights. Since the phase gaps of the knee trajectories and hip trajectories are the same for each pair of legs, every 8x8 block consists of two identical non-zero 4x4 sub-blocks. Thus, as only four different multiples of a quarter-phase gap can be defined, there can only be four different 8x8 blocks.
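The block structure described above can be sketched by building the template directly from the quarter-phase gap of each leg relative to leg 1. The function names and the weight ordering (per leg: four hip weights, then four knee weights) are assumptions for illustration. The trot gaps `[0, 0, 2, 2]` follow the leg numbering used for Figure 4(b), where legs 1 and 2 move in sync; the assignment of legs to gaps in walk is hypothetical.

```python
def cyclic_shift_matrix(k, n=4):
    """n x n permutation matrix that cyclically shifts weights by k."""
    return [[1 if (i - j) % n == k % n else 0 for j in range(n)]
            for i in range(n)]


def covariance_template(quarter_gaps, n_basis=4, joints_per_leg=2):
    """Template for legs lagging leg 1 by the given quarter-phase gaps."""
    leg = n_basis * joints_per_leg
    size = leg * len(quarter_gaps)
    template = [[0] * size for _ in range(size)]
    for a, gap_a in enumerate(quarter_gaps):
        for b, gap_b in enumerate(quarter_gaps):
            # The coupling between two legs depends only on their relative gap.
            block = cyclic_shift_matrix(gap_a - gap_b, n_basis)
            for joint in range(joints_per_leg):  # hips and knees not coupled
                row0 = a * leg + joint * n_basis
                col0 = b * leg + joint * n_basis
                for i in range(n_basis):
                    for j in range(n_basis):
                        template[row0 + i][col0 + j] = block[i][j]
    return template


walk_template = covariance_template([0, 1, 2, 3])
trot_template = covariance_template([0, 0, 2, 2])
```

Both templates are 32 x 32 and symmetric. The walk template contains all four possible 8x8 blocks, whereas the trot template contains only two: the identity-like block for synchronized legs and the half-phase shift.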

(a) Walk covariance initialization
(b) Trot covariance initialization
Figure 4: Initial covariance matrices displayed as heatmap.
(a) Walk covariance matrix after the 3rd update
(b) Walk covariance matrix after the 4th update
Figure 5: Evolution of the covariance during the learning process of the walk gait with Relative Entropy Policy Search.

In Figure 4(b), it can be observed that leg 1 and leg 2 are coupled to move synchronously, with a half-phase gap to leg 3 and leg 4. In Figure 4(a), by contrast, all four different 8x8 blocks can be observed, with the individual leg movements delayed by quarter-phase gaps.

3.5 Change of Exploration in REPS

Since each iteration of REPS considerably changes the covariance of the search distribution, it diverges from its initialization within a few policy updates. As seen in Figure 5(a), the structure of the initialization for the walk gait is preserved up until the third update; however, as shown in Figure 5(b), the structure is lost at the fourth update. Nevertheless, the custom exploration during the first few policy evaluations still strongly impacts the subsequent learning process.

4 Conclusion

In this paper we presented a new approach to introduce prior knowledge in the context of policy search reinforcement learning, using a policy representation based on von Mises basis functions. Specifically, we initialized the covariance of a Gaussian search distribution and applied it to the problem of quadruped gait learning. We showed that, using REPS, significant improvements in performance can be gained by learning specific types of gaits, i.e., walk and trot, compared to random gaits initialized by a diagonal covariance matrix. The different types of gaits can be learned while sharing the same linear basis function representation. Furthermore, REPS has been shown to have a more stable learning process, albeit converging prematurely, whereas LRPG shows a slower and unstable learning process.

5 Appendix

5.1 Derivation of the Likelihood Ratio Policy Gradient

The expectation is approximated by sampling. For open-loop control, the individual trajectories are sufficiently described by the policy parameters. To reduce variance, the mean reward of each batch is used as the baseline.
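The steps of the derivation are not preserved in this text. A standard likelihood-ratio argument consistent with the description above runs as follows; the symbols $J$, $\omega$, $R$, $b$, and $N$ are notational assumptions, and $b$ is treated as a constant.

```latex
\begin{align}
\nabla_\omega J(\omega)
  &= \nabla_\omega \int \pi(\theta \mid \omega)\, R(\theta)\, d\theta \\
  &= \int \pi(\theta \mid \omega)\,
     \nabla_\omega \log \pi(\theta \mid \omega)\, R(\theta)\, d\theta \\
  &= \mathbb{E}_{\pi(\theta \mid \omega)}\!\left[
     \nabla_\omega \log \pi(\theta \mid \omega)\,
     \bigl(R(\theta) - b\bigr) \right] \\
  &\approx \frac{1}{N} \sum_{i=1}^{N}
     \nabla_\omega \log \pi(\theta_i \mid \omega)\,
     \bigl(R(\theta_i) - b\bigr),
  \qquad \theta_i \sim \pi(\theta \mid \omega).
\end{align}
```

The second line uses the identity $\nabla_\omega \pi = \pi\, \nabla_\omega \log \pi$. The third line holds for any constant $b$ because $\mathbb{E}_{\pi}[\nabla_\omega \log \pi(\theta \mid \omega)] = 0$, so subtracting a baseline changes the variance of the estimate but not its expectation. For a Gaussian search distribution, the gradient w.r.t. the mean is $\nabla_\mu \log \mathcal{N}(\theta \mid \mu, \Sigma) = \Sigma^{-1}(\theta - \mu)$.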

5.2 Estimation of covariance templates by sampling

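The submatrix corresponding to the weights in the linear function representation is estimated by sampling. Three permutation matrices $P_2$, $P_3$, and $P_4$ are devised, which define the coupling of the weight vectors assigned to each leg w.r.t. leg 1,

$$w = \begin{bmatrix} w_1^\top & (P_2 w_1)^\top & (P_3 w_1)^\top & (P_4 w_1)^\top \end{bmatrix}^\top.$$

When sampling the weights of the first leg $w_1$ from a Gaussian distribution with zero mean and identity covariance, the covariance template can be calculated as

$$\Sigma_{\text{template}} = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} w^{(n)} {w^{(n)}}^\top,$$

with $w_1^{(n)} \sim \mathcal{N}(0, I)$. The limit can be calculated for sufficiently large $N$ by rounding.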
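The sampling procedure can be sketched as follows. To keep the example small, each leg is reduced to a single joint with four weights, and the permutations are taken to be cyclic shifts by the quarter-phase gap of each leg; both choices, the function names, the sample count, and the seed are assumptions for illustration.

```python
import random


def shift(weights, k):
    """Cyclically shift a weight list by k positions."""
    k %= len(weights)
    return weights[-k:] + weights[:-k]


def sample_template(quarter_gaps, n_weights=4, n_samples=2000, seed=0):
    """Estimate the covariance template by averaging outer products."""
    rng = random.Random(seed)
    size = n_weights * len(quarter_gaps)
    acc = [[0.0] * size for _ in range(size)]
    for _ in range(n_samples):
        # Weights of leg 1: zero mean, identity covariance.
        w1 = [rng.gauss(0.0, 1.0) for _ in range(n_weights)]
        # Every other leg is a permuted copy of leg 1.
        w = [x for gap in quarter_gaps for x in shift(w1, gap)]
        for i in range(size):
            for j in range(size):
                acc[i][j] += w[i] * w[j]
    # Rounding the sample average recovers the 0/1 template.
    return [[round(v / n_samples) for v in row] for row in acc]


trot = sample_template([0, 0, 2, 2])
```

Entries that couple identical copies of a weight average to about one and all others to about zero, which is why rounding a finite-sample estimate is enough to recover the template.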


  • [1] A. Abdolmaleki, R. Lioutikov, J. R. Peters, N. Lau, L. P. Reis, and G. Neumann. Model-based relative entropy stochastic search. In Advances in Neural Information Processing Systems, pages 3537–3545, 2015.
  • [2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
  • [3] S. Chernova and M. Veloso. An evolutionary approach to gait learning for four-legged robots. In Intelligent Robots and Systems, 2004 (IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, volume 3, pages 2562–2567. IEEE, 2004.
  • [4] M. P. Deisenroth, G. Neumann, J. Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
  • [5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
  • [6] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver. Emergence of locomotion behaviours in rich environments, 2017.
  • [7] G. Kniewasser. Reinforcement learning with dynamic movement primitives-dmps-technical report. target, 6(8):10.
  • [8] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, volume 3, pages 2619–2624. IEEE, 2004.
  • [9] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann. Probabilistic movement primitives. In Advances in Neural Information Processing Systems 26.
  • [10] J. Peters and K. Mülling. Relative entropy policy search. 2010.
  • [11] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
  • [12] M. Saggar, T. D’Silva, N. Kohl, and P. Stone. Autonomous learning of stable quadruped locomotion. In Robot Soccer World Cup, pages 98–109. Springer, 2006.
  • [13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
  • [14] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
  • [15] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [16] W. Xi, Y. Yesilevskiy, and C. D. Remy. Selecting gaits for economical locomotion of legged robots. The International Journal of Robotics Research, 35(9):1140–1154, 2016.