In nature, animals have developed a wide variety of gaits to adapt to different terrain and situations, such as a horse galloping for higher speed or a lizard trotting for stable locomotion. In recent years, quadrupedal gait learning has attracted research interest in robotics. Quadruped gaits offer a wide range of movement patterns. As the cyclic movements of all four legs are similar, gaits can be categorized mainly by the timing and order of the footfalls, which can be represented as phase gaps between the trajectories of the individual legs.
In this work we learn open-loop control policies for various gaits, focusing on walk and trot. In walk, the leg trajectories are separated by quarter-phase gaps, resulting in equidistant footfalls, whereas in trot, diagonal pairs of legs move synchronously and the two pairs are separated by a half-phase gap. Other gaits that can be learned using the described approach are bound and pace. We show how these symmetry properties can be encoded in the parameter space of the chosen policy representation in order to enhance the initial exploration and reliably learn the chosen gaits.
We neither fully define the gait in the policy representation, as in [7, 8, 12, 3], nor learn random gaits, which could lead to a highly non-convex problem. Instead, we only initialize the exploration. This way we can learn different gaits as a set of skills sharing a common representation, which allows choosing a gait suitable for the current terrain.
Traditional reinforcement learning methods, such as TD-learning, typically estimate the expected long-term reward at each state and each time step to evaluate the quality of executing an action in that state. This estimate is referred to as the value function. Given a state and all possible actions, the action-value function is computed, and the action that maximizes it is selected. However, this approach is problematic in high-dimensional state-action spaces, since the whole space would need to be covered exhaustively. In addition, gait learning for a quadruped requires continuous states and actions, meaning we would have to resort to value function approximation.
Policy search methods, on the other hand, make it possible to reduce a high-dimensional continuous action space to a smaller search space of possible policies. This is done by parameterizing the search distribution and operating and updating directly in its parameter space. In this work, we learn a Gaussian search distribution that maximizes the expected reward, using Likelihood Ratio Policy Gradient and Relative Entropy Policy Search. The trajectories of the quadruped are computed as a linear combination of von Mises basis functions and their corresponding weights. The generated trajectories are fed to a PD controller to deterministically compute the motor torques.
2 Related Works
This section first presents the policy representation used in the experiments, followed by the episode-based policy search algorithms that are applied to it.
2.1 Policy Representation
The explored policies are central pattern generators using a linear representation $q_d(z) = \boldsymbol{\phi}(z)^\top \mathbf{w}$, with $\phi_i$ being the von Mises basis functions,
$$\phi_i(z) = \exp\left(\frac{\cos\left(2\pi (z - c_i)\right)}{h}\right),$$
where $h$ is the width, $c_i$ are the equidistant centers of the basis functions, and $z$ is the phase, which is linear in time with a temporal scaling factor $\tau$ describing the speed of executing the trajectory [7, 9]. The policy parameters to be learned then consist of the weights $\mathbf{w}$ and the temporal scaling factor $\tau$.
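As a minimal sketch of this representation, the snippet below evaluates periodic von Mises features and a joint trajectory as their weighted sum. It assumes the unnormalized form exp(cos(2π(z − c_i))/h) known from rhythmic probabilistic movement primitives [9] and a phase z = τ·t; the function names, the default width, and the phase convention are illustrative assumptions rather than the exact setup used here.

```python
import numpy as np


def von_mises_basis(z, n_basis=4, width=0.5):
    """Evaluate n_basis periodic von Mises features at phase(s) z.

    Returns an array of shape (len(z), n_basis). The period is 1 in z.
    """
    z = np.atleast_1d(np.asarray(z, dtype=float))
    centers = np.arange(n_basis) / n_basis  # equidistant centers on [0, 1)
    return np.exp(np.cos(2.0 * np.pi * (z[:, None] - centers[None, :])) / width)


def desired_trajectory(t, weights, tau=1.0, width=0.5):
    """Desired joint angle at time(s) t as a weighted sum of the features."""
    weights = np.asarray(weights, dtype=float)
    z = tau * np.asarray(t, dtype=float)  # assumed convention: phase = tau * t
    return von_mises_basis(z, len(weights), width) @ weights
```

With equidistant centers, cyclically shifting the four weights of a joint by one position (`np.roll(weights, 1)`) delays the trajectory by a quarter of a period, which is the property that gait-specific phase gaps rely on.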
To simplify the learning problem, we hand-tuned the initial standing posture of the quadruped to prevent it from falling. Each rollout then begins from this standing initialization.
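The trajectories $\mathbf{q}_d$ defined by the learned policy are then fed to a PD controller to deterministically compute the motor torques $\mathbf{u}$,
$$\mathbf{u} = K_P \left(\mathbf{q}_d - \mathbf{q}\right) + K_D \left(\dot{\mathbf{q}}_d - \dot{\mathbf{q}}\right),$$
where $K_P$ and $K_D$ are the controller gains, the desired velocities $\dot{\mathbf{q}}_d$ are approximated using forward differences of $\mathbf{q}_d$, and $\mathbf{q}$, $\dot{\mathbf{q}}$ are the observed state variables.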
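The control law can be sketched as follows for a single time step. The gains `kp` and `kd` and the function name are hypothetical placeholders (the gain values are not taken from this work); the default time step of 0.01 s matches the simulation setup in Section 3.1.

```python
import numpy as np


def pd_torques(q_des_now, q_des_next, q, q_dot, dt=0.01, kp=1.0, kd=0.1):
    """Motor torques from a PD law on desired vs. observed joint states."""
    q_des_now = np.asarray(q_des_now, dtype=float)
    q_des_next = np.asarray(q_des_next, dtype=float)
    # A forward difference approximates the desired joint velocity.
    q_des_dot = (q_des_next - q_des_now) / dt
    position_error = q_des_now - np.asarray(q, dtype=float)
    velocity_error = q_des_dot - np.asarray(q_dot, dtype=float)
    return kp * position_error + kd * velocity_error
```

Because the torques depend only on the desired trajectory and the observed state, the controller adds no randomness of its own; all exploration comes from sampling the policy parameters.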
2.2 Episode-based Policy Search Algorithms
Episode-based policy search algorithms typically maximize the expected reward of an episode over a Gaussian search distribution $\pi(\boldsymbol{\theta}; \boldsymbol{\omega}) = \mathcal{N}(\boldsymbol{\theta} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ w.r.t. the distribution mean $\boldsymbol{\mu}$. Independent of the policy search algorithm, the parameter vector used for execution after the learning process has terminated is chosen as the final update of $\boldsymbol{\mu}$.
2.2.1 Likelihood Ratio Policy Gradient
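Policy gradient methods use gradient ascent to maximize the expected return $J(\boldsymbol{\omega})$ over the search distribution w.r.t. its parameters $\boldsymbol{\omega}$,
$$\boldsymbol{\omega}_{k+1} = \boldsymbol{\omega}_k + \alpha \nabla_{\boldsymbol{\omega}} J(\boldsymbol{\omega}_k),$$
where $\alpha$ is the learning rate. In the following we estimate the gradient according to the Likelihood Ratio Policy Gradient (LRPG) method, with the batch mean reward as baseline, sampling from a Gaussian search distribution. A derivation can be found in Section 5.1.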
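A minimal sketch of one such update for the mean of a Gaussian search distribution is given below. For brevity it keeps the covariance fixed and only follows the gradient w.r.t. the mean, which is a simplification; `reward_fn`, the learning rate, and the function names are hypothetical, while the default batch size of 50 matches the simulation setup.

```python
import numpy as np


def lrpg_mean_gradient(thetas, rewards, mu, sigma):
    """Likelihood-ratio gradient estimate w.r.t. the Gaussian mean."""
    thetas = np.asarray(thetas, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()  # batch mean reward as baseline
    # d/dmu log N(theta | mu, Sigma) = Sigma^{-1} (theta - mu)
    scores = np.linalg.solve(sigma, (thetas - mu).T).T
    return (scores * (rewards - baseline)[:, None]).mean(axis=0)


def lrpg_step(mu, sigma, reward_fn, rng, batch_size=50, learning_rate=0.1):
    """Sample a batch of parameters, evaluate it, and ascend the gradient."""
    thetas = rng.multivariate_normal(mu, sigma, size=batch_size)
    rewards = np.array([reward_fn(theta) for theta in thetas])
    grad = lrpg_mean_gradient(thetas, rewards, mu, sigma)
    return mu + learning_rate * grad
```

In a gait-learning setting, `reward_fn` would run one open-loop rollout with the sampled parameters and return the episode reward.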
2.2.2 Relative Entropy Policy Search
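Similar to the Expectation Maximization (EM) algorithm, Relative Entropy Policy Search (REPS) also updates the mean and covariance of the search distribution to maximize the expected reward. In addition, REPS uses the Kullback-Leibler (KL) divergence to upper-bound the update of the search distribution, thus forcing the new distribution to stay close to the sampled data. Considering the old distribution $q(\boldsymbol{\theta})$ and the newly estimated distribution $\pi(\boldsymbol{\theta})$, the KL divergence is bounded by $\epsilon$, where $\mathrm{KL}(\pi \,\|\, q) = \int \pi(\boldsymbol{\theta}) \log \frac{\pi(\boldsymbol{\theta})}{q(\boldsymbol{\theta})} \, d\boldsymbol{\theta}$. This results in the constrained optimization problem
$$\max_{\pi} \int \pi(\boldsymbol{\theta}) R(\boldsymbol{\theta}) \, d\boldsymbol{\theta} \quad \text{s.t.} \quad \mathrm{KL}(\pi \,\|\, q) \le \epsilon, \quad \int \pi(\boldsymbol{\theta}) \, d\boldsymbol{\theta} = 1.$$
The optimization can be solved using the method of Lagrange multipliers, resulting in a closed-form solution for the new distribution,
$$\pi(\boldsymbol{\theta}) \propto q(\boldsymbol{\theta}) \exp\left(\frac{R(\boldsymbol{\theta})}{\eta}\right),$$
where the Lagrange multiplier $\eta$ is obtained by minimizing the dual function $g(\eta)$, approximated by $N$ samples,
$$g(\eta) = \eta \epsilon + \eta \log \frac{1}{N} \sum_{i=1}^{N} \exp\left(\frac{R(\boldsymbol{\theta}_i)}{\eta}\right).$$
The new distribution is then estimated by fitting a parametric distribution to the samples $\boldsymbol{\theta}_i$ and the corresponding rewards $R(\boldsymbol{\theta}_i)$. The parametric distribution is estimated using weighted maximum likelihood, with the weightings
$$d_i = \exp\left(\frac{R(\boldsymbol{\theta}_i)}{\eta}\right).$$
Given the Gaussian search distribution of our setup, the weighted maximum likelihood solution for updating the distribution, namely $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, is given by
$$\boldsymbol{\mu} = \frac{\sum_{i} d_i \boldsymbol{\theta}_i}{\sum_{i} d_i}, \qquad \boldsymbol{\Sigma} = \frac{\sum_{i} d_i (\boldsymbol{\theta}_i - \boldsymbol{\mu})(\boldsymbol{\theta}_i - \boldsymbol{\mu})^\top}{\sum_{i} d_i}.$$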
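The REPS update can be sketched with NumPy as follows. The dual is minimized here by a plain grid search over η, which is a stand-in chosen for brevity (any one-dimensional convex optimizer would do), and the grid range is an assumption that must cover the reward scale; the function names are hypothetical. Subtracting the maximum reward inside the exponentials only improves numerical stability and cancels in the normalized weights.

```python
import numpy as np


def reps_dual(eta, rewards, epsilon):
    """Sample-based dual g(eta), evaluated in a numerically stable way."""
    r_max = rewards.max()
    log_mean_exp = np.log(np.mean(np.exp((rewards - r_max) / eta)))
    return eta * epsilon + r_max + eta * log_mean_exp


def reps_weights(rewards, epsilon, eta_grid=None):
    """Return unnormalized sample weights and the minimizing eta."""
    rewards = np.asarray(rewards, dtype=float)
    if eta_grid is None:
        eta_grid = np.logspace(-3, 3, 601)  # assumed search range for eta
    duals = [reps_dual(eta, rewards, epsilon) for eta in eta_grid]
    eta = eta_grid[int(np.argmin(duals))]
    weights = np.exp((rewards - rewards.max()) / eta)
    return weights, eta


def weighted_ml_gaussian(thetas, weights):
    """Weighted maximum likelihood estimate of a Gaussian mean and covariance."""
    thetas = np.asarray(thetas, dtype=float)
    p = weights / weights.sum()
    mu = p @ thetas
    diff = thetas - mu
    sigma = (p[:, None] * diff).T @ diff
    return mu, sigma
```

At the minimizer of the sample-based dual, the KL divergence between the reweighted and the uniform sample distribution equals the bound ε, so a smaller ε yields flatter weights and a more conservative update.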
In this section, we evaluate the performance of the Likelihood Ratio Policy Gradient and Relative Entropy Policy Search algorithms given different initializations of the covariance of the Gaussian search distribution, which promote the learning of specific desired gaits.
3.1 Simulated Experimental Setup
with eight parallel joints. The quadruped is a completely symmetric robot with four hip joints and four knee joints, each driven by an actuator, yielding eight degrees of freedom in total. Four basis functions are used for each joint, allowing for quarter-phase gaps while keeping the number of weights to learn small. In total, the parameters to be learned consist of 32 weights and the temporal scaling factor.
Throughout the simulation, a batch size of 50 is used, and each learning episode consists of 200 time frames, equivalent to 2 seconds given that the time step is 0.01s. The number of policy updates is fixed to 60. 10 trials are carried out for each configuration to evaluate the robustness of the algorithms.
3.2 Reward Function and Algorithm Comparison
We devise a reward function consisting of the forward translation, the cost of the robot falling over, the contact costs between the feet and the ground, and the control costs for the motors, where $\Delta x$ and $\Delta y$ denote the total distance traveled in the x and y directions in one episode, and $T$ is the duration of one episode. Since each learning process is initialized with a standing position, only one walking direction is rewarded, which is the forward direction.
To compare the learning behavior of REPS and LRPG, both are evaluated with hand-tuned hyperparameters and diagonal covariance initializations . A learning rate of is used for LRPG, and a KL bound of for REPS. The learning curves of the expected reward and the corresponding standard deviation are shown in Figure 1, where REPS learns relatively fast in the initial phase before converging to a stable solution. This is due to the applied KL bound, which forces each update to stay close to the old distribution. However, REPS also suffers from premature convergence, since the updates of the covariance reduce further exploration. LRPG shows poor convergence properties, yet its exploration sometimes finds policies that outperform those found with REPS.
3.3 Different Initializations
Aside from a diagonal initialization of the search distribution's covariance matrix, a customized covariance matrix can be used to incorporate prior knowledge about the symmetry properties of a specific gait,
is the variance along each dimension, is a symmetric matrix with elements , and is the degree of coupling. This way the exploration is limited to certain search directions of the parameter space, promoting the learning of movements with the desired symmetry properties.
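How such a coupled initialization shapes the exploration can be illustrated with a toy example. The blend σ²((1 − κ)I + κM) used below is only one plausible way to combine a variance σ², a symmetric 0/1 template M, and a coupling degree κ; it is an assumption for illustration and not necessarily the exact form used in this work, and the names `coupled_covariance`, `sigma2`, and `kappa` are hypothetical.

```python
import numpy as np


def coupled_covariance(template, sigma2, kappa):
    """Blend an isotropic covariance with a symmetric coupling template.

    kappa = 0 gives the diagonal initialization; kappa = 1 restricts the
    exploration to the column space of the template.
    """
    template = np.asarray(template, dtype=float)
    if not np.allclose(template, template.T):
        raise ValueError("template must be symmetric")
    identity = np.eye(template.shape[0])
    return sigma2 * ((1.0 - kappa) * identity + kappa * template)


# Toy case: two parameters that should always move together.
template = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
rng = np.random.default_rng(0)
cov = coupled_covariance(template, sigma2=0.5, kappa=1.0)
samples = rng.multivariate_normal(np.zeros(2), cov, size=5)
```

With full coupling, every sampled pair has (numerically) identical entries, whereas intermediate values of `kappa` only bias the exploration towards such pairs.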
In Figures 2(a) and 2(b), three different initializations are applied: a diagonal and two non-diagonal matrices specific to walk and trot, where the degree of coupling is . Each learning curve is averaged over 10 trials to evaluate the robustness of the policy updates, and the colored regions indicate the standard deviation of the expected reward over the 10 trials. In Figure 2(b) we can see that, with the non-diagonal initializations, both the slope of the learning curves and the expected reward of the learned policy increase compared to the diagonal initialization. This indicates that the KL-bounded updates of REPS are able to exploit the prior knowledge, even though REPS changes the covariance matrix in each update. In addition, it shows that for this robot the trot gait is inherently faster than the walk gait, which is the usual case for most quadrupeds. In Figure 2(a), however, LRPG tends to show decreased performance for non-diagonal initializations.
Using LRPG as well as REPS, movements matching the desired gaits are reliably learned given a degree of coupling for . In Figure 3, snapshots of the learned walk and trot gaits are shown. The gaits are learned using REPS, initialized with the corresponding non-diagonal covariance matrices specific to each gait. It can be seen that the learned gaits exhibit the expected symmetry properties. In Figure 3(a), one cycle of the walk gait is demonstrated. The legs lift individually in the order front right, hind left, front left, and hind right. In Figure 3(b), one cycle of the trot gait is displayed. The front left and hind right legs lift synchronously, as do the front right and hind left legs. In between, a standing posture can be observed.
3.4 Concrete Initializations
In this section we discuss how to create a template covariance matrix given the phase gaps specific to a desired gait. Note that this template only indicates the non-zero elements of the scaled covariance matrix defined in (11), excluding the parts corresponding to the temporal scaling which is always diagonal. We assume a maximal coupling of , reducing the exploration to a subspace of . In this case the coupling of the movement is identical to the coupling of the exploration, as no parameters outside of the subspace are sampled.
In Figure 4 the initializations of the covariance matrix are displayed as heat maps, divided into 8x8 blocks, each corresponding to the coupling of the 8 weights describing one leg trajectory to those of another. Each of the blocks can further be divided into 4x4 sub-blocks corresponding to the joint-wise coupling. In our setup the knees are not coupled with the hips, therefore only the 4x4 sub-blocks on the diagonal of each 8x8 block are non-zero. Using 4 identically distributed cyclic basis functions per joint, any multiple of a quarter-phase gap can be expressed as a permutation of the weights. All 4x4 sub-blocks are identical to the permutation matrix that applies the desired multiple of quarter-phase gaps, which in our case are cyclic shifts of the weights. Since the phase gaps of the knee trajectories and hip trajectories are the same for each pair of legs, every 8x8 block consists of two identical non-zero 4x4 sub-blocks. Thus, as only four different quarter-phase gaps can be defined, there can only be four different 8x8 blocks.
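The block structure described above can be written down directly. The sketch below assumes that the parameters are ordered leg by leg (four hip weights followed by four knee weights) and that the legs are listed in the order front right, hind left, front left, hind right; this ordering, the shift assignments, and the function names are illustrative assumptions.

```python
import numpy as np

N_BASIS = 4   # basis functions per joint
N_JOINTS = 2  # hip and knee per leg


def shift_matrix(k, n=N_BASIS):
    """Permutation matrix that cyclically shifts n weights by k positions."""
    return np.roll(np.eye(n), k, axis=0)


def leg_block(k):
    """8x8 block: the same 4x4 shift for hip and knee, no hip-knee coupling."""
    return np.kron(np.eye(N_JOINTS), shift_matrix(k))


def covariance_template(leg_shifts):
    """32x32 template whose (i, j) block couples the weights of leg i and leg j."""
    rows = [[leg_block(ki - kj) for kj in leg_shifts] for ki in leg_shifts]
    return np.block(rows)


# Quarter-phase shifts per leg (front right, hind left, front left, hind right).
WALK_SHIFTS = [0, 1, 2, 3]  # equidistant footfalls
TROT_SHIFTS = [0, 0, 2, 2]  # diagonal pairs in sync, half-phase gap between pairs
```

Each template has rank 8, reflecting that under maximal coupling only the eight weights of a single leg are explored freely while the remaining legs follow by permutation.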
3.5 Change of Exploration in REPS
Since each iteration of REPS considerably changes the covariance of the search distribution, it diverges from its initialization within a few policy updates. As seen in Figure 5(a), the structure of the initialization for the walk gait is preserved up until the third update; however, as shown in Figure 5(b), the structure is lost at the fourth update. Nevertheless, the custom exploration during the first few policy updates still strongly impacts the subsequent learning process.
In this paper we presented a new approach to introducing prior knowledge in the context of policy search reinforcement learning, using a policy representation based on von Mises basis functions. Specifically, we initialized the covariance of a Gaussian search distribution and applied it to the problem of quadruped gait learning. We showed that, using REPS, significant improvements in performance can be gained by learning specific types of gaits, i.e., walk and trot, compared to random gaits initialized by a diagonal covariance matrix. The different types of gaits can be learned while sharing the same linear basis function representation. Furthermore, REPS has been shown to have a more stable learning process, albeit converging prematurely, whereas LRPG shows a slower and unstable learning process.
5.1 Derivation of the Likelihood Ratio Policy Gradient
The expectation is approximated by sampling. Under open-loop control, the individual trajectories are fully described by the policy parameters. To reduce variance, the mean reward of each batch is used as the baseline.
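For reference, the standard likelihood-ratio derivation can be written as follows, using $\pi(\boldsymbol{\theta}; \boldsymbol{\omega})$ for the search distribution, $R(\boldsymbol{\theta})$ for the episode reward, $N$ for the batch size, and $b$ for the baseline; this notation is an assumption chosen for illustration.

```latex
\begin{align}
\nabla_{\boldsymbol{\omega}} J(\boldsymbol{\omega})
  &= \nabla_{\boldsymbol{\omega}} \int \pi(\boldsymbol{\theta}; \boldsymbol{\omega}) \, R(\boldsymbol{\theta}) \, d\boldsymbol{\theta} \\
  &= \int \pi(\boldsymbol{\theta}; \boldsymbol{\omega}) \, \nabla_{\boldsymbol{\omega}} \log \pi(\boldsymbol{\theta}; \boldsymbol{\omega}) \, R(\boldsymbol{\theta}) \, d\boldsymbol{\theta} \\
  &= \mathbb{E}_{\pi(\boldsymbol{\theta}; \boldsymbol{\omega})} \left[ \nabla_{\boldsymbol{\omega}} \log \pi(\boldsymbol{\theta}; \boldsymbol{\omega}) \, R(\boldsymbol{\theta}) \right] \\
  &\approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\omega}} \log \pi(\boldsymbol{\theta}_i; \boldsymbol{\omega}) \left( R(\boldsymbol{\theta}_i) - b \right),
  \qquad b = \frac{1}{N} \sum_{j=1}^{N} R(\boldsymbol{\theta}_j).
\end{align}
```

The second line uses the identity $\nabla_{\boldsymbol{\omega}} \pi = \pi \, \nabla_{\boldsymbol{\omega}} \log \pi$. Subtracting a constant baseline leaves the expectation unchanged because $\mathbb{E}_{\pi}\left[\nabla_{\boldsymbol{\omega}} \log \pi(\boldsymbol{\theta}; \boldsymbol{\omega})\right] = \nabla_{\boldsymbol{\omega}} \int \pi(\boldsymbol{\theta}; \boldsymbol{\omega}) \, d\boldsymbol{\theta} = 0$.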
5.2 Estimation of covariance templates by sampling
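The sub-matrix corresponding to the weights of the linear function representation is estimated by sampling. Three permutation matrices $\mathbf{P}_2$, $\mathbf{P}_3$, and $\mathbf{P}_4$ are devised, which define the coupling of the weight vectors assigned to each leg w.r.t. leg 1,
$$\mathbf{w}_j = \mathbf{P}_j \mathbf{w}_1, \quad j \in \{2, 3, 4\}.$$
When sampling the weights of the first leg from a Gaussian distribution with zero mean and identity covariance, $\mathbf{w}_1^{[n]} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, the covariance template can be calculated as
$$\mathbf{M} = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \mathbf{w}^{[n]} \mathbf{w}^{[n]\top},$$
with $\mathbf{w}^{[n]} = \left[ \mathbf{w}_1^{[n]\top}, (\mathbf{P}_2 \mathbf{w}_1^{[n]})^\top, (\mathbf{P}_3 \mathbf{w}_1^{[n]})^\top, (\mathbf{P}_4 \mathbf{w}_1^{[n]})^\top \right]^\top$. The limit can be calculated for sufficiently large $N$ by rounding.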
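The sampling procedure can be sketched as follows. Cyclic shifts of the per-joint weights stand in for the three permutation matrices, and the leg order, the shift values, the sample count, and the function name are illustrative assumptions.

```python
import numpy as np


def estimate_template(leg_shifts, n_samples=5000, n_basis=4, n_joints=2, seed=0):
    """Estimate the coupling template from samples of the first leg's weights."""
    rng = np.random.default_rng(seed)
    # Weights of leg 1: zero mean, identity covariance.
    w1 = rng.standard_normal((n_samples, n_joints, n_basis))
    # Every leg is a cyclic shift of leg 1, applied to hip and knee alike.
    legs = [np.roll(w1, k, axis=2).reshape(n_samples, -1) for k in leg_shifts]
    w = np.hstack(legs)  # shape (n_samples, 4 * n_joints * n_basis)
    # Zero-mean second moment, rounded to recover the limiting 0/1 pattern.
    return np.round(w.T @ w / n_samples)


walk_template = estimate_template([0, 1, 2, 3])  # quarter-phase gaps
trot_template = estimate_template([0, 0, 2, 2])  # diagonal pairs in sync
```

Rounding works because every entry of the limiting matrix is either 0 or 1, so a few thousand samples are enough to bring each empirical entry well within 0.5 of its limit.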
References
[1] A. Abdolmaleki, R. Lioutikov, J. R. Peters, N. Lau, L. P. Reis, and G. Neumann. Model-based relative entropy stochastic search. In Advances in Neural Information Processing Systems, pages 3537–3545, 2015.
[2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym, 2016.
[3] S. Chernova and M. Veloso. An evolutionary approach to gait learning for four-legged robots. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2004), volume 3, pages 2562–2567. IEEE, 2004.
[4] M. P. Deisenroth, G. Neumann, J. Peters, et al. A survey on policy search for robotics. Foundations and Trends® in Robotics, 2(1–2):1–142, 2013.
[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
[6] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. M. A. Eslami, M. Riedmiller, and D. Silver. Emergence of locomotion behaviours in rich environments, 2017.
[7] G. Kniewasser. Reinforcement learning with dynamic movement primitives-dmps-technical report. target, 6(8):10.
[8] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA'04), volume 3, pages 2619–2624. IEEE, 2004.
[9] A. Paraschos, C. Daniel, J. R. Peters, and G. Neumann. Probabilistic movement primitives. In Advances in Neural Information Processing Systems 26.
[10] J. Peters and K. Mülling. Relative entropy policy search. 2010.
[11] J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008.
[12] M. Saggar, T. D'Silva, N. Kohl, and P. Stone. Autonomous learning of stable quadruped locomotion. In Robot Soccer World Cup, pages 98–109. Springer, 2006.
[13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[14] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026–5033. IEEE, 2012.
[15] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[16] W. Xi, Y. Yesilevskiy, and C. D. Remy. Selecting gaits for economical locomotion of legged robots. The International Journal of Robotics Research, 35(9):1140–1154, 2016.