Realizing Learned Quadruped Locomotion Behaviors through Kinematic Motion Primitives

10/09/2018 ∙ by Abhik Singla, et al. ∙ Indian Institute of Science

Humans and animals are believed to use a very small set of basic trajectories to perform a wide variety of tasks, including walking. Our main objective in this paper is two-fold: 1) obtain an effective tool, called kinematic motion primitives (kMPs), to realize these basic motion patterns for quadrupedal walking, via trajectories learned from deep reinforcement learning (D-RL), and 2) realize a set of behaviors, namely trot, walk, gallop and bound, from these kinematic motion primitives on our custom four-legged robot, the 'Stoch'. D-RL is a data-driven approach, which has been shown to be very effective for realizing all kinds of robust locomotion behaviors, both in simulation and in experiment. kMPs, on the other hand, are known to capture the underlying structure of walking and yield a set of derived behaviors. We first generate walking gaits from D-RL, which uses policy-gradient based approaches. We then analyze the resulting walking using principal component analysis (PCA). We observe that the kMPs extracted via PCA follow a similar pattern irrespective of the type of gait generated. Leveraging this underlying structure, we then realize walking in Stoch by a straightforward reconstruction of joint trajectories from kMPs. This methodology improves the transferability of these gaits to real hardware, lowers the on-board computational overhead, and avoids multiple training iterations by generating a set of derived behaviors from a single learned gait.




I Introduction

Legged locomotion has been well studied for more than six decades, starting from the inception of the GE walking truck [1]. A variety of approaches have been explored, from inverted pendulum models [2], zero moment points [3], passivity based control [4, 5], capture points [6], and hybrid zero dynamics [7], to machine learning based approaches [8, 9]. A more recent trend has been the use of deep reinforcement learning (D-RL) methods [10, 11] to determine optimal policies for efficient and robust walking behaviors in both bipeds and quadrupeds. D-RL was successfully used to achieve walking in the quadrupedal robot Minitaur [10], where the control policy was trained via one of the well known policy gradient based approaches, proximal policy optimization (PPO) [12].

While all the approaches described above are applicable to many kinds of legged robots, there was also an independent group of researchers focused entirely on capturing symmetries in quadrupedal walking, and then utilizing these symmetries to obtain a low dimensional representation for walking [13]. In essence, the focus here was more on trajectory reconstruction for each of the joints of the robot from a minimal number of basis functions, called the kinematic motion primitives (kMPs) [14]. It was shown in [15] that only four to five primitives are enough to represent various locomotion patterns like walking, trotting, and canter. Comparisons were even made with kMPs derived from horses [13], showing a similarity of the kMPs irrespective of the gait used. This type of minimal representation of walking points to the fact that the joint angle trajectories do not contain "high frequency" modulations. Alternative representations of these motion primitives by canonical walking functions [16], and low-order Bezier polynomials [7], also indicate this low dimensional behavior of walking. In fact, [16] showed that mean human walking data from the joints also had a high correlation with time responses of linear spring-mass-damper systems. All these observations strongly indicate that a finite number of basic trajectories (or basis functions) are sufficient to realize the periodic trajectories in the individual joints of the robot that result in walking.

Fig. 1: Figure showing our custom built quadrupedal robot Stoch, and its simulation model shown in Pybullet simulator.
Fig. 2: Figure showing a graphical representation of our methodology for realizing quadrupedal walking in Stoch. We use the D-RL based controller for obtaining stable walking gaits in simulation, and then translate them into the quadruped Stoch, experimentally via kinematic motion primitives (kMPs).

The walking community is currently divided on which approaches are best for achieving legged locomotion. Some use formal methods, which are model based: they determine the underlying principles of walking, and then develop controllers that yield stable locomotion [4, 5, 7]. Others use model-free methods like RL (and also D-RL) to derive robust locomotion behaviors in both bipeds and quadrupeds [8, 9, 10, 17]. The latter methodology not only eliminates the difficult and often unpleasant task of identifying and modeling the robot, but also enables the computer to continuously adapt to changing environments. However, it is important to note that while the learned controller is model-free, the training is normally done in simulation, which is often not realistic. This reality gap is a well known problem also faced by traditional model based optimization tools for walking [18, 19, 20].

One of the main goals of this paper is to approach this problem of the reality gap using the tools of formal methods. Formal methods usually employ optimization tools only for generating the reference gaits (trajectories); these trajectories are then tracked at the joint level by model-free methods (e.g., PD control) [18, 21, 22]. Therefore, we would like to view D-RL based training as a trajectory generator (optimization tool) rather than as the on-board compute platform for walking. This is similar to the method followed by [17], where RL was used to find the optimal set of parameters of the foot trajectory. We will obtain a gait from D-RL, extract kMPs from this gait, and then realize a set of derived behaviors, namely trot, walk, gallop and bound, in Stoch experimentally. Trot and gallop were achieved in [10] by separate instances of training, while we were able to realize more than four behaviors from one instance of training. A pictorial representation of our approach is given in Fig. 2. This hybrid approach allows for a mechanistic way of generating gaits in complex legged locomotion: it combines data-driven D-RL based gait synthesis in simulation with the extraction of kMPs, which are then used for gait tracking in hardware with minimal computational overhead.

The paper is organized as follows: Section II introduces the deep reinforcement learning (D-RL) framework for walking, and briefly discusses the simulator and training framework used. Section III gives a detailed description of kinematic motion primitives (kMPs) and the associated kMP analysis of D-RL based gaits; we also show that the kMPs thus obtained can be used to generate other types of gaits like walk, gallop and bound. Finally, Section IV describes the experimental results on our custom quadruped Stoch.

II Reinforcement Learning Framework for Quadrupedal Walking

In this section, we will study the deep reinforcement learning (D-RL) framework used for realizing quadrupedal walking in simulation. We will also discuss the challenges associated with the learning methods, and the transferability issue from simulation to real hardware.

II-A Definitions

We formulate the problem of locomotion as a Markov Decision Process (MDP). An MDP is represented by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. Here $\mathcal{S}$ is the set of robot states, referred to as the state space, and $\mathcal{A}$ is the set of feasible actions, referred to as the action space. $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability function that models the evolution of states based on actions, and $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reinforcement or the reward obtained for the given state and action. $\gamma$ is called the discount factor, defined in the range $[0, 1)$.

The goal in RL is to achieve a policy that maximizes the expected cumulative reward over time. A policy, denoted as $\pi(a|s)$, is the probability distribution of the actions over the states. When dealing with huge state and action spaces, a parametrized policy is used, denoted as $\pi_\theta(a|s)$, with the parameters $\theta$. The optimal parameters and/or policy yields the maximum sum of the cumulative rewards. We use the following to obtain the optimal policy $\pi^*$:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim p_\pi(\tau)} \left[ \sum_{t=0}^{T} \gamma^t \, r(s_t, a_t) \right],$$

where $p_\pi(\tau)$ is the marginalization of the state-action pairs over the set of all possible trajectories $\tau = (s_0, a_0, s_1, a_1, \dots)$ following policy $\pi$. The optimal parameters are evaluated iteratively by taking gradient ascent steps in the direction of increasing cumulative reward. More details on the algorithm and the associated optimization technique are provided in Section II-C.
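The discounted cumulative reward being maximized can be computed recursively from a reward sequence. This is an illustrative snippet, not from the paper; the function name and default discount are ours:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards discounted by gamma^t, the quantity the policy maximizes.
    Computed backwards: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```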

II-B State and Action Space

We will briefly describe the quadrupedal model of Stoch. The robot has four identical legs and eight actuators, distributed equally among its joints. Each leg has a hip and a knee joint, and the actuators provide flexion-extension in each joint. Fig. 1 shows both the experimental and the simulated robot model. Details on the experimental model are in Section IV.

II-B1 State Space

The state for training is represented by the angles, velocities and torques of the active joints. The body orientation (in quaternions) is also included in the state space. The combined representation yields a 28-dimensional state space. Note that motor torques are typically treated as states in the RL framework.

II-B2 Action Space

Reference [23] demonstrated the effectiveness of choosing actuator joint angles as the action space. The planar four-bar linkage configuration of the legs, however, leads to self-colliding regions in the motor space. This creates a non-convex search space and degrades the learning efficiency. Hence, we chose to learn the legs' end-point positions in polar coordinates, represented as $(r_i, \alpha_i)$, where $i \in \{1, 2, 3, 4\}$ indexes the legs. The polar coordinates for the four legs collectively provide an eight-dimensional action space. For each $(r_i, \alpha_i)$, we can obtain the joint angles (both hip and knee of leg $i$) via an inverse kinematics solver. We will denote the angles obtained from the inverse kinematics solver as the leg's commanded joint angles. The polar coordinates can be restricted to a bounding box, thereby indirectly imposing angle limits on the joint angles.
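For a planar two-link leg, the inverse kinematics map from a polar end-point command to hip and knee angles can be sketched as follows. This is only an illustrative serial-chain approximation: Stoch's actual legs use a four-bar linkage, and the link lengths below are placeholders (loosely based on the 120 mm leg segment length in Table II):

```python
import math

def leg_ik(r, alpha, l1=0.12, l2=0.12):
    """Map a foot position in leg-frame polar coordinates (r, alpha) to hip
    and knee angles for a planar two-link leg with link lengths l1, l2.
    Serial-chain approximation only; Stoch's real leg is a four-bar linkage."""
    # Law of cosines for the knee bend (elbow-down solution).
    c = (r**2 - l1**2 - l2**2) / (2 * l1 * l2)
    c = max(-1.0, min(1.0, c))  # clamp numerical noise outside [-1, 1]
    knee = math.acos(c)
    # Hip angle: leg direction minus the offset introduced by the bent knee.
    hip = alpha - math.atan2(l2 * math.sin(knee), l1 + l2 * math.cos(knee))
    return hip, knee
```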

II-C Network and Learning Algorithm

Since the states and actions are continuous, we seek algorithms that construct optimal policies yielding action values in the continuous space. Proximal Policy Optimization (PPO) [12] has proved to be very successful for obtaining optimal continuous action values [24]. It is an on-policy, model-free algorithm based on the actor-critic learning framework. An actor (also called the policy) decides an action given a state, and a critic aims to improve the policy by evaluating the action value function. Both actor and critic networks are parameterized by neural networks with parameters $\theta$ and $\phi$ respectively.

The learning algorithm samples episodes of maximum length $T$ using the current policy. The experience tuples consisting of $(s_t, a_t, r_t, s_{t+1})$, with $t \in \{0, \dots, T\}$, are stored and used for on-policy updates of the parameters. The policy gradient method [25] gives the objective function to optimize the policy parameters:

$$J_{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t | s_t) \, \hat{A}_t \right],$$

where $\hat{A}_t$ is called the estimated advantage function. It is given by $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, where $V(s)$ is the value function at any state. Furthermore, in order to account for the disparity between the behavioral policy $\pi_{\theta_{old}}$ and the current updating policy $\pi_\theta$, the advantage estimate is multiplied by the importance sampling ratio, giving

$$J_{IS}(\theta) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \, \hat{A}_t \right].$$

Lastly, in order to avoid large and sudden policy variations during training, PPO penalizes high KL-divergence and solves an unconstrained optimization. Hence the final objective for the PPO algorithm is

$$J_{PPO}(\theta) = \mathbb{E}_t \left[ \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} \, \hat{A}_t - \beta \, \mathrm{KL}\big(\pi_{\theta_{old}}(\cdot | s_t) \,\|\, \pi_\theta(\cdot | s_t)\big) \right],$$

where $\beta$ is an adaptive penalty coefficient. Given a target KL-divergence $\delta$, $\beta$ for the next policy update is calculated as:

  • $\beta \leftarrow \beta / 2$, if $\mathbb{E}_t[\mathrm{KL}(\pi_{\theta_{old}} \| \pi_\theta)] < \delta / 1.5$

  • $\beta \leftarrow 2 \beta$, if $\mathbb{E}_t[\mathrm{KL}(\pi_{\theta_{old}} \| \pi_\theta)] > 1.5 \, \delta$
We used the open source implementation of PPO by TensorFlow Agents [26], which creates a symbolic representation of the computation graph. The implementation is highly parallelized and performs full-batch gradient ascent updates, using the Adam [27] optimizer, on the batch of data collected from multiple environment instances.
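The adaptive penalty-coefficient rule can be sketched directly. A minimal sketch; the 1.5x thresholds and 2x factors follow the PPO paper's recommended defaults, which are not explicitly stated here:

```python
def update_kl_coeff(beta, kl, kl_target):
    """Adaptive KL penalty update for KL-penalized PPO: shrink beta when the
    measured KL is well below target, grow it when well above. The 1.5x
    thresholds and 2x factors are the defaults suggested by Schulman et al."""
    if kl < kl_target / 1.5:
        beta /= 2.0
    elif kl > kl_target * 1.5:
        beta *= 2.0
    return beta
```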

II-D Simulation Framework

We used the Pybullet [28] simulator, built on top of the Bullet3 physics engine, for a realistic model simulation. A three-dimensional computer-aided design (CAD) model was developed using SolidWorks [29] to capture the kinematics and inertial properties of the robot. This model is transferred to Pybullet using a Universal Robot Description Format (URDF) [30] exporter. In addition, the actual masses of all the links, actuator force limits, joint limits, and kinematic-loop constraints of the flexural joints are measured and manually updated in the URDF file for a more realistic simulation (refer Fig. 1).

II-E Reward Function

We designed a reward function that gives positive reinforcement for increasing the robot's base speed while simultaneously penalizing high energy consumption. The agent receives a scalar reward after each action step according to the reward function
$$r_t = w_{vel} \, \Delta x_t - w_E \, E_t.$$

Here $\Delta x_t$ is the difference between the current and the previous base position along the forward axis, $E_t$ is the energy spent by the actuators for the current step, and $w_{vel}$ and $w_E$ are weights corresponding to each term. $E_t$ is computed as

$$E_t = \sum_{i=1}^{4} \big( |\tau_{hip,i} \, \omega_{hip,i}| + |\tau_{knee,i} \, \omega_{knee,i}| \big) \, \Delta t,$$

where $\tau$ are the motor torques and $\omega$ are the motor velocities of leg $i$ respectively.
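As an illustration, the per-step reward computation can be sketched as follows; the weight values here are placeholders, not the paper's tuned ones:

```python
def step_reward(dx, torques, velocities, dt, w_vel=1.0, w_energy=0.05):
    """Scalar reward: (weighted forward progress) - (weighted actuator energy).
    `torques` and `velocities` are per-motor sequences; energy is the sum of
    |tau * omega| over all motors, times the step duration dt."""
    energy = sum(abs(tau * omega) for tau, omega in zip(torques, velocities)) * dt
    return w_vel * dx - w_energy * energy
```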

II-F Gait Generation and Simulation Results

The actor and critic networks in the learning algorithm consist of two fully connected hidden layers, with the first and second layers consisting of 200 and 100 nodes respectively. Both hidden layers use ReLU activations; the critic's output layer is linear. Other hyper-parameters are listed in Table I. The Pybullet simulator is configured to spawn multiple agents for faster collection of samples; in our case, 30 agents were used in parallel threads.
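A minimal sketch of such an actor network (28-D state in, 8-D action out, two ReLU hidden layers of 200 and 100 units) in plain NumPy, assuming a linear output layer and untrained random weights for illustration:

```python
import numpy as np

def make_actor(rng, obs_dim=28, act_dim=8, hidden=(200, 100)):
    """Build the forward pass of a small MLP actor: 28 -> 200 -> 100 -> 8,
    ReLU on hidden layers only. Weights are random (untrained); the linear
    output activation is our simplifying assumption."""
    sizes = (obs_dim,) + hidden + (act_dim,)
    params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(s):
        x = np.asarray(s, dtype=float)
        for i, (W, b) in enumerate(params):
            x = x @ W + b
            if i < len(params) - 1:  # ReLU on hidden layers only
                x = np.maximum(x, 0.0)
        return x

    return forward
```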

Quadrupedal animals typically have eight types of gaits [31, 32]. The work in [11] used reference trajectories and [10] used an open-loop signal to learn a specific gait. In this work, however, we learned everything from scratch. To learn a specific gait type, we leveraged gait symmetries and put hard constraints on the foot positions. For example, in trot, diagonally opposite legs are in sync; hence, we provide the same end-point command $(r_i, \alpha_i)$ to these legs. This also reduces the dimension of the action space from eight to four.

The proposed approach yields walking gaits that are efficient, fast and also robust in simulation. The learning algorithm runs for a few million steps, and training takes a few hours on a machine with an Intel Core i7 @ 3.5 GHz and 32 GB RAM.

Entity | Value
Discount factor ($\gamma$) |
Learning rate (actor, critic) |
Training steps per epoch |
KL target ($\delta$) |
End-point position limit ($\alpha$) |
End-point position limit ($r$) | cm
TABLE I: Hyper-parameter values of the learning algorithm

II-G Challenges

There are two main challenges in realizing D-RL based gaits in quadrupeds:

II-G1 Training for model and environment updates

The training framework has a large set of hyper-parameters (a small subset of them are shown in Table I) that require tuning in order to get desirable gaits. These gaits are, indeed, robust and efficient, but any update in the model or change in constraints in the gait requires retuning of these hyper-parameters in order to obtain desirable gaits. Retuning is not intuitive and requires a series of training iterations. This is a common problem even in nonlinear model based walking optimizations [33], where small variations in constraints often lead to infeasible solutions.

II-G2 Transferability on real hardware

Transferring these learned gaits to real hardware faces numerous challenges. Some of them arise from unaccounted compliances, imperfect sensors, and actuator saturation. In addition, the inertial and electrical properties of the actuators vary from one joint to another. We observed that imposing very conservative constraints, like tighter velocity and angle limits, resulted in longer training times, which often converged to very unrealistic gaits.

In order to put these challenges in context, it is necessary to review the methodology followed in [10]. They proposed to develop an accurate robot model, identify the actuator parameters, and study the latencies of the system. This procedure indeed resulted in more realistic gaits, but required multiple instances of training and is highly sensitive to any update in the hardware. In addition, DNNs are computationally intensive (an NVIDIA Jetson TX2 processor was used on board). Therefore, as a deviation from this approach, we primarily focus on a methodology that is more robust to hardware changes, and also computationally less intensive. As we add more tasks and objectives for a diverse set of environments, it is important to focus on using a compressed set of walking parameters with minimal training. We discuss this in detail in the next section.

III Realizing Behaviors through kMPs

The kinematic motion primitives (kMPs) are invariant waveforms, sufficient to explain a wide variety of complex coordinated motions, including discrete (e.g., reaching for a target with one hand) and periodic behaviors like walking and running [15]. This concept of motion primitives arose from biological observations [15, 34, 13]. In essence, it was observed that biological motions can be described to a good extent with the combination of a small number of lower dimensional behaviors [14, 35].

The process of determining kMPs is primarily conducted by Principal Component Analysis (PCA) [36, 34, 13, 37, 38]. PCA is useful when there is extensive data from walking demonstrations, which can be processed to extract the kMPs. These kMPs can then be adapted/transformed to realize walking in robots. In essence, principal components yield vectors in the joint space with the highest variance for the given data. A linear transformation of the joint space to these principal components results in dimensionality reduction. A more detailed exploration of this analysis is given in [36] and [39].

We will first describe the methods used for the data analysis, and then the associated principal components. kMPs for the different gaits, trot, pace and bound, will be extracted and compared. Towards the end of this section, we will also provide a statistical comparison between the different kMPs.

III-A Extraction of kMPs

Having obtained walking in simulation based on a trained policy, we record the joint angles and body orientation of the robot over a large number of steps. The data contains a set of joint trajectories for each trial. The recorded data is then processed as follows. First, we remove the front and rear portions of the data set to avoid transient behavior. Second, we divide the data into segments by peak detection. All these data segments are then shifted and scaled w.r.t. time in order to align their end points. By polynomial interpolation, we reconstruct trajectories in between the data points. With this reconstruction, we can directly obtain the mean data (the average of the data at every time instant).
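The segmentation-and-averaging step can be sketched as follows, assuming peak indices have already been detected. Linear interpolation is used here instead of the paper's polynomial interpolation, for brevity:

```python
import numpy as np

def mean_cycle(signal, peaks, n_samples=100):
    """Average a periodic joint trajectory over its cycles: slice the signal
    at the given peak indices, resample each segment to a common length by
    linear interpolation, and average pointwise."""
    segments = []
    for a, b in zip(peaks[:-1], peaks[1:]):
        seg = np.asarray(signal[a:b], dtype=float)
        t_old = np.linspace(0.0, 1.0, len(seg))
        t_new = np.linspace(0.0, 1.0, n_samples)
        segments.append(np.interp(t_new, t_old, seg))
    return np.mean(segments, axis=0)
```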

Given the mean data, let $N$ be the total number of points for each joint. Denote this data by $\theta_j \in \mathbb{R}^N$, where $j \in \{1, \dots, n\}$ and $n$ is the number of actuated joints ($n = 8$ for Stoch). We get a matrix of joint angle values as $\Theta = [\theta_1, \theta_2, \dots, \theta_n]$. Let $\bar{\Theta}$ be the zero mean data obtained from $\Theta$. We compute the covariance as

$$C = \frac{1}{N-1} \, \bar{\Theta}^T \bar{\Theta}.$$

The principal components are precisely the eigenvectors, $e_1, e_2, \dots, e_n$, of $C$. Hence, for $n$ joint angles, we have $n$ principal components. Given the principal components, we have the kMPs defined as

$$\mathrm{kMP}_i = \frac{\bar{\Theta} e_i}{\|\bar{\Theta} e_i\|_\infty}, \quad i \in \{1, \dots, n\},$$

where $\|\cdot\|_\infty$ (infinity norm) is used to normalize the magnitude. We typically use fewer components ($k = 4$ for Stoch) for reconstructing the joint angle trajectories from the kMPs, since they account for over 99% of the cumulative percentage variance (see the table in Fig. 3).

Having obtained the kMPs, we reconstruct the original joint trajectories as follows. Denote the matrix of kMPs as $\mathrm{kMP} = [\mathrm{kMP}_1, \dots, \mathrm{kMP}_k]$, where $k$ is typically less than $n$ (for Stoch, $k = 4$). We have the final reconstruction of the joint angles as

$$\Theta_r = \mathrm{kMP} \cdot S + \Theta_{\mathrm{offset}}, \qquad (9)$$

where $\Theta_r$ is the matrix of desired joint angle values, $S$ is the coefficient matrix, also called the kMP synergy map, that maps the kMPs to the joint space, and $\Theta_{\mathrm{offset}}$ is the mean offset matrix (derived from $\Theta$), which is added back to the joint angles.
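The extraction pipeline, i.e. zero-mean the data, eigen-decompose the covariance, project onto the top components, and infinity-norm normalize, can be sketched in NumPy (an illustrative sketch; variable names are ours):

```python
import numpy as np

def extract_kmps(theta, k=4):
    """Extract k kinematic motion primitives from an (N x n) matrix of joint
    trajectories via PCA: eigen-decompose the covariance of the zero-mean
    data, project onto the top-k eigenvectors, and normalize each projected
    waveform by its infinity norm. Returns the kMPs and the mean offset."""
    theta = np.asarray(theta, dtype=float)
    mean = theta.mean(axis=0)
    centered = theta - mean
    cov = centered.T @ centered / (len(theta) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :k]            # top-k principal components
    kmps = centered @ top                    # (N x k) time waveforms
    kmps /= np.abs(kmps).max(axis=0)         # infinity-norm normalization
    return kmps, mean
```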

kMPs used: 1 | 2 | 3 | 4
Cumulative variance: 88.9% | 95.8% | 98.4% | 99.9%
Fig. 3: The figure on the left shows end-point trajectories of one of the feet after reconstruction from different numbers of kMPs. The X-axis corresponds to the forward direction, and the Y-axis to the vertical direction. The table on the right shows the cumulative percentage variance as each component is added.

III-B Discussion & Comparison

Fig. 3 shows the result of reconstruction from the kMPs. A table showing the cumulative percentage variance of the original trajectory with reconstruction from each kMP is also provided. It can be verified that only four kMPs were sufficient for the final reconstruction. (It is important to note that this observation is true for any continuous trajectory that does not have "high-frequency" oscillations or even large slopes. This property was heavily utilized to reconstruct walking trajectories using ellipses in [17] and simple basis functions in [8].) Fig. 4 shows the comparison between the end-foot trajectories obtained from the simulation and from the reconstruction. It must be noted that the kMPs yield trajectories that are as good as the time-averaged trajectories of the original data (which is eight dimensional). In addition, with kMPs, there is existing literature on obtaining derived behaviors like walk, trot and gallop [15, 37], which will be explained further in the next section.

Fig. 4: Figures showing the end point trajectories of all the four feet of the quadruped. The plots in blue correspond to the end point trajectories from simulation data. The plots in red are the reconstruction from the kMPs.

It was shown in [13] that biological quadrupeds (like horses) inherently contain some basic patterns that are consistent across different gaits and models. Based on this result, we compared the trot data obtained from our simulation with the trot data obtained from the horse; the results are in Fig. 5. We observed that there was a high correlation not only between the kMPs obtained from the learned gait and the horse gait (Fig. 5-A), but also between the kMPs obtained from different learned gaits (Fig. 5-B). Table III provides more statistical details; the cross-covariance w.r.t. the first kMP is at least 0.81 across all comparisons. These observations point to the fact that even the kMPs derived from the RL based simulation follow very similar patterns, irrespective of the type of gait and the model used.

Fig. 5: Here we make two types of comparisons. The top four figures (A) show the comparison between kMPs for the trot gait: hardware trot, simulation trot, and horse trot. The bottom four figures (B) show the comparison between kMPs obtained from the learned gaits in simulation: trot, pace and bound. The kMPs (of hardware, simulation or horse) seem to follow a pattern: V, N, M, W. A statistical comparison of these patterns is provided in TABLE III.

IV Experimental Results

In this section, we discuss the experiments conducted on the quadrupedal robot Stoch. We first describe the hardware, and then the methodologies used to realize walking, along with the results.

IV-A Hardware Description of Stoch

Stoch is a quadrupedal robot (see Fig. 1) designed and developed in-house at IISc, Bangalore. Model and actuator specifications are summarized in Table II. The legs are assembled with lightweight carbon fiber (CF) tubes for the linkages, and 3D-printed poly-lactic acid (PLA) parts for the joints. The servo motors (Kondo HV and Robokits Ultra Torque RKI series) are mounted in a PLA housing. For measuring the body orientation, we used a Sparkfun MPU IMU. For measuring the joint angles, we used magnetic rotary (MR) encoders (Bourns EMS series). PWM signals to the servo motors are sent via an Adafruit multi-channel PWM driver (PCA series). The main computation platform is a Raspberry Pi B-series board. Electric power is supplied from Li-Po batteries (a 3s pack for the hip motors, and a 2s pack for the knee motors).

total leg length: 230 mm | min./max. hip joint: -45°/45°
leg segment length: 120 mm | min./max. knee joint: -70°/70°
total mass: 3.0 kg | max hip speed: 461°/s
max hip torque: 29 kg-cm | max knee speed: 461°/s
max knee torque: 32 kg-cm | motor power (servo): 16 W
TABLE II: Hardware specifications of Stoch.
              Cross-covariance           Delay
           1st   2nd   3rd   4th    1st    2nd    3rd    4th
A. RT-ET   0.97  0.74  0.93  0.79   0     -0.02   0.02   0.04
   RT-HT   0.99  0.89  0.86  0.82   0     -0.02  -0.05  -0.04
   ET-HT   0.99  0.77  0.88  0.87   0      0.15  -0.08   0.03
B. RT-RP   0.87  0.94  0.91  0.70  -0.03  -0.07  -0.08   0.16
   RT-RB   0.81  0.84  0.72  0.88   0     -0.02  -0.05  -0.04
   RP-RB   0.96  0.92  0.85  0.72  -0.01   0.0   -0.08   0.18
TABLE III: RT, RB, RP are the trot, bound and pace gaits obtained from D-RL respectively. ET is the trot gait obtained from experiment, and HT is the trot gait obtained from the horse. A. (resp. B.) shows the comparison between RT, ET, HT (resp. RT, RB, RP).

IV-B Synergy Matrices

The general procedure to obtain the walking gaits for Stoch is as follows. We first reconstruct the trot gait from the kMPs by using the synergy matrix $S$. If $\bar{\Theta}_{trot}$ is the zero mean trot data obtained from the simulation, then we obtain $S$ as

$$S = \mathrm{kMP}^{+} \, \bar{\Theta}_{trot},$$

where $\mathrm{kMP}^{+}$ is the pseudo-inverse of the matrix of kMPs. This is computed offline. Synergy matrices for other types of gaits are obtained by modifying $S$. For example, to obtain the bound gait, we modify $S$ (by time shifting) in such a way that the front legs are in phase (and the hind legs, respectively). We have also reshaped the end-point trajectories by small amounts and obtained the corresponding synergy matrices. Since we are manually modifying the data, a better approach would be to use optimization, which will be explored in future work.
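The offline computation of the synergy map and the subsequent joint-angle reconstruction can be sketched with a least-squares pseudo-inverse (an illustrative sketch using NumPy; function names are ours):

```python
import numpy as np

def synergy_matrix(kmps, theta_centered):
    """Least-squares fit of the synergy map S such that kmps @ S approximates
    the zero-mean joint trajectories: S = pinv(kMP) @ Theta."""
    return np.linalg.pinv(kmps) @ theta_centered

def reconstruct(kmps, S, offset):
    """Joint-angle reconstruction: Theta = kMP @ S + offset."""
    return kmps @ S + offset
```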

IV-C Results

Having obtained the kMPs and the synergy matrices for the trot, walk, gallop and bound gaits, we reconstruct the joint angle trajectories by using (9) in hardware. These joint angle trajectories are then used as command positions for each of the actuators. Results are shown in the accompanying video. Due to space constraints, we only show the experimental data for the trot and its derived gaits. We have also extracted kMPs directly from the experimental trajectory; the results are in Fig. 5-A and Table III. It can be verified that the experimental kMPs also have a high correlation with the horse and RL kMPs. Table IV compares the speeds of the various gaits obtained.

Gait type: Trot | Walk | Gallop | Bound | Modified Trot 1 | Modified Trot 2
Speed:     0.60 | 0.51 | 0.51   | 0.55  | 0.62            | 0.59
TABLE IV: Walking speed of Stoch obtained during different gait experiments. The last two columns are the result of reshaping the end-point trajectories of the trot. All the gaits are shown in the video.

In order to put these observations in perspective, it is important to discuss the kMPs derived from horse data in more detail. We cannot realize walking by using horse kMPs alone. Based on (9), we need to determine the right synergy matrix $S$ that suits the given set of kMPs. It is also important to note that the kMPs derived from horses may not be the optimal choice for our quadruped (there are infinitely many valid kMPs). On the other hand, since we already have (locally) optimal trajectories obtained from training, we focus on computing the synergy matrix from the simulation kMPs. With these kMPs we were able to demonstrate not only trotting, but also other derived behaviors like walking, galloping and bounding.

V Conclusion

We demonstrated walk, trot, gallop and bound gaits experimentally on the quadruped robot Stoch. The key contribution is the methodology for transferring walking gaits, obtained via D-RL, from simulation to real hardware. Our approach was mainly motivated by [13] and [15], which pointed to the fact that motion primitives have a very similar underlying structure that is independent of the type of gait and the model. Small variations in gaits and model do not require retraining via D-RL from scratch; kMPs are sufficient for such scenarios. On the other hand, large variations in models and environments require a thorough search, thereby requiring learning based approaches. Since we mainly focused on data-driven kMPs in this paper, future work will involve obtaining these motion primitives from continuous functions like polynomials, canonical walking functions, or other basis functions that are equally suitable for generating walking gaits.


We acknowledge Ashish Joglekar, Balachandra Hegde, Ajay G and Abhimanyu for the PCB design and software development.


  • [1] “GE Walking Truck.”
  • [2] M. H. Raibert, “Legged robots,” Communications of the ACM, vol. 29, no. 6, pp. 499–514, 1986.
  • [3] M. Vukobratović and B. Borovac, “Zero-moment point—thirty five years of its life,” International journal of humanoid robotics, vol. 1, no. 01, pp. 157–173, 2004.
  • [4] T. McGeer, “Passive dynamic walking,” The International Journal of Robotics Research, vol. 9, no. 2, pp. 62–82, 1990. [Online]. Available:
  • [5] S. H. Collins, M. Wisse, and A. Ruina, “A three-dimensional passive-dynamic walking robot with two legs and knees,” The International Journal of Robotics Research, vol. 20, no. 7, pp. 607–615, 2001.
  • [6] J. Pratt, J. Carff, S. Drakunov, and A. Goswami, “Capture point: A step toward humanoid push recovery,” in 2006 6th IEEE-RAS International Conference on Humanoid Robots, Dec 2006, pp. 200–207.
  • [7] E. R. Westervelt, J. W. Grizzle, and D. E. Koditschek, “Hybrid zero dynamics of planar biped walkers,” IEEE Transactions on Automatic Control, vol. 48, no. 1, pp. 42–56, Jan 2003.
  • [8] R. Tedrake, T. W. Zhang, and H. S. Seung, “Stochastic policy gradient reinforcement learning on a simple 3d biped,” in Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, vol. 3.   IEEE, pp. 2849–2854.
  • [9] S. Wang, J. Braaksma, R. Babuska, and D. Hobbelen, “Reinforcement learning control for biped robot walking on uneven surfaces,” in The 2006 IEEE International Joint Conference on Neural Network Proceedings, July 2006, pp. 4173–4178.
  • [10] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018. [Online]. Available:
  • [11] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne, “Feedback control for cassie with deep reinforcement learning,” CoRR, vol. abs/1803.05580, 2018. [Online]. Available:
  • [12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available:
  • [13] F. L. Moro, A. Spröwitz, A. Tuleu, M. Vespignani, N. G. Tsagarakis, A. J. Ijspeert, and D. G. Caldwell, “Horse-like walking, trotting, and galloping derived from kinematic motion primitives (kmps) and their application to walk/trot transitions in a compliant quadruped robot,” Biological cybernetics, vol. 107, no. 3, pp. 309–320, 2013.
  • [14] T. D. Sanger, “Human arm movements described by a low-dimensional superposition of principal components,” Journal of Neuroscience, vol. 20, no. 3, pp. 1066–1072, 2000.
  • [15] F. L. Moro, N. G. Tsagarakis, and D. G. Caldwell, “On the kinematic motion primitives (kmps)-theory and application,” Frontiers in neurorobotics, vol. 6, p. 10, 2012.
  • [16] A. D. Ames, “Human-inspired control of bipedal walking robots,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1115–1130, May 2014.
  • [17] N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, vol. 3.   IEEE, pp. 2619–2624.
  • [18] J. Reher, E. A. Cousineau, A. Hereid, C. M. Hubicki, and A. D. Ames, “Realizing dynamic and efficient bipedal locomotion on the humanoid robot durus,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 1794–1801.
  • [19] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake, “Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot,” Autonomous Robots, vol. 40, no. 3, pp. 429–455, Mar. 2016.
  • [20] A. Hereid and A. D. Ames, “Frost: Fast robot optimization and simulation toolkit,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   Vancouver, BC, Canada: IEEE/RSJ, Sep. 2017.
  • [21] K. Sreenath, H.-W. Park, I. Poulakakis, and J. W. Grizzle, “A compliant hybrid zero dynamics controller for stable, efficient and fast bipedal walking on MABEL,” The International Journal of Robotics Research, vol. 30, no. 9, pp. 1170–1193, 2011.
  • [22] S. N. Yadukumar, M. Pasupuleti, and A. D. Ames, “From formal methods to algorithmic implementation of human inspired control on bipedal robots,” in Algorithmic Foundations of Robotics X, E. Frazzoli, T. Lozano-Perez, N. Roy, and D. Rus, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 511–526.
  • [23] X. B. Peng and M. van de Panne, “Learning locomotion skills using DeepRL: Does the choice of action space matter?” CoRR, vol. abs/1611.01055, 2016.
  • [24] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2–7, 2018.
  • [25] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems 12.   MIT Press, 2000, pp. 1057–1063.
  • [26] D. Hafner, J. Davidson, and V. Vanhoucke, “Tensorflow agents: Efficient batched reinforcement learning in tensorflow,” arXiv preprint arXiv:1709.02878, 2017.
  • [27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
  • [28] PyBullet physics simulation engine.
  • [29] SolidWorks 3D CAD software.
  • [30] URDF: Unified Robot Description Format.
  • [31] D. Owaki and A. Ishiguro, “A quadruped robot exhibiting spontaneous gait transitions from walking to trotting to galloping,” Scientific reports, vol. 7, no. 1, p. 277, 2017.
  • [32] Y. Fukuoka, Y. Habu, and T. Fukui, “A simple rule for quadrupedal gait generation determined by leg loading feedback: a modeling study,” Scientific Reports, vol. 5, p. 8169, Feb. 2015.
  • [33] J. P. Reher, A. Hereid, S. Kolathaya, C. M. Hubicki, and A. D. Ames, “Algorithmic foundations of realizing multi-contact locomotion on the humanoid robot DURUS,” in The International Workshop on the Algorithmic Foundations of Robotics (WAFR), 2016.
  • [34] F. L. Moro, N. G. Tsagarakis, and D. G. Caldwell, “A human-like walking for the compliant humanoid COMAN based on CoM trajectory reconstruction from kinematic motion primitives,” in Proceedings of the 2011 11th IEEE-RAS International Conference on Humanoid Robots (Humanoids).   IEEE, 2011, pp. 364–370.
  • [35] G. J. Stephens, B. Johnson-Kerner, W. Bialek, and W. S. Ryu, “Dimensionality and dynamics in the behavior of C. elegans,” PLoS Computational Biology, vol. 4, no. 4, p. e1000028, 2008.
  • [36] B. Lim, S. Ra, and F. C. Park, “Movement primitives, principal component analysis, and the efficient generation of natural motions,” in Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2005, pp. 4630–4635.
  • [37] A. T. Spröwitz, A. Tuleu, A. J. Ijspeert et al., “Kinematic primitives for walking and trotting gaits of a quadruped robot with compliant legs,” Frontiers in Computational Neuroscience, vol. 8, p. 27, 2014.
  • [38] I. Jolliffe, “Principal component analysis,” in International encyclopedia of statistical science.   Springer, 2011, pp. 1094–1096.
  • [39] S. Kim and F. C. Park, “Fast robot motion generation using principal components: Framework and algorithms,” IEEE Transactions on Industrial Electronics, vol. 55, no. 6, pp. 2506–2516, 2008.