, zero moment points , passivity based control [4, 5], capture points , hybrid zero dynamics , and even machine learning based approaches [8, 9] have been explored. A more recent trend has been the use of deep reinforcement learning (D-RL) methods [10, 11] to determine optimal policies for efficient and robust walking behaviors in both bipeds and quadrupeds. D-RL was successfully used to achieve walking in the quadrupedal robot Minitaur , where the control policy was trained via the well known policy gradient based algorithm proximal policy optimization (PPO) .
While all the approaches described above are applicable to many kinds of legged robots, there has also been an independent group of researchers focusing entirely on capturing symmetries in quadrupedal walking, and then utilizing these symmetries to obtain a low dimensional representation for walking . In essence, the focus here was more on trajectory reconstruction for each of the joints of the robot from a minimal number of basis functions, called the kinematic motion primitives (kMPs) . It was shown in  that only four to five primitives are enough to represent various locomotion patterns like walking, trotting, and canter. Comparisons were even made with kMPs derived from horses , showing a similarity of the kMPs irrespective of the gait used. This type of minimal representation of walking points to the fact that the joint angle trajectories do not contain “high frequency” modulations. Alternative representations of these motion primitives by canonical walking functions  and Bézier polynomials  also indicate this low dimensional behavior of walking. In fact,  showed that mean human walking data from the joints also has a high correlation with the time responses of linear spring-mass-damper systems. All these observations strongly indicate that a finite number of basic trajectories (or basis functions) are sufficient to realize the periodic trajectories in the individual joints of the robot that result in walking.
The walking community is currently divided on which approaches are best for achieving legged locomotion. Some use formal methods, which are model based: they determine the underlying principles of walking, and then develop controllers that yield stable locomotion [4, 5, 7]. Others use model-free methods like RL (and also D-RL) to derive robust locomotion behaviors in both bipeds and quadrupeds [8, 9, 10, 17]. The latter methodology not only eliminates the difficult and often unpleasant task of identification and modeling of the robot, but also enables the computer to continuously adapt to changing environments. However, it is important to note that while the learned controller is model-free, the training is normally done in simulation, which is often not realistic. This reality gap is also a well known problem faced by traditional model based optimization tools for walking [18, 19, 20].
One of the main goals of this paper is to address this reality gap by using the tools used in formal methods. Formal methods usually employ optimization tools only for generating the reference gaits (trajectories), and these trajectories are then tracked at the joint level by model free methods (e.g., PD control) [18, 21, 22]. Therefore, we would like to view D-RL based training as a trajectory generator (optimization tool) rather than as the on-board compute platform for walking. This is similar to the method followed by , where RL was used to find the optimal set of parameters of the foot trajectory. We will obtain a gait from D-RL, extract kMPs from this gait, and then experimentally realize a set of derived behaviors, namely trot, walk, gallop, and bound, in Stoch. Trot and gallop were achieved in  by separate instances of training, while we were able to realize more than four behaviors with one instance of training. A pictorial representation of our approach is given in Fig. 2. This hybrid approach allows for a mechanistic way of generating gaits in complex legged locomotion: data driven D-RL based gait synthesis in simulation, followed by extraction of kMPs, which are then used for gait tracking in hardware with minimal computational overhead.
The paper is organized as follows: Section II introduces the deep reinforcement learning (D-RL) framework for walking. We also briefly discuss the simulator and training framework used. Section III has a detailed description of kinematic motion primitives (kMPs) and the associated kMP analysis of D-RL based gaits. We also show that the kMPs thus obtained can be used to generate other types of gaits like walk, gallop, and bound. Finally, Section IV describes the experimental results on our custom quadruped Stoch.
II Reinforcement Learning Framework for Quadrupedal Walking
In this section, we will study the deep reinforcement learning (D-RL) framework used for realizing quadrupedal walking in simulation. We will also discuss the associated challenges with the learning methods, and the transferability issue from simulation to real hardware.
We formulate the problem of locomotion as a Markov Decision Process (MDP). An MDP is represented by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. Here $\mathcal{S}$ is the set of robot states referred to as the state space, and $\mathcal{A}$ is the set of feasible actions referred to as the action space. $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition probability function that models the evolution of states based on actions, and $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reinforcement or the reward obtained for the given state and action. $\gamma$ is called the discount factor, defined in the range $(0, 1)$.
The goal in RL is to achieve a policy that maximizes the expected cumulative reward over time. A policy, denoted as $\pi(a|s)$, is the probability distribution of the actions over the states. When dealing with huge state and action spaces, a parametrized policy is used, denoted as $\pi_\theta(a|s)$, with the parameters $\theta$. The optimal parameters and/or policy yield the maximum sum of the cumulative rewards. We use the following to obtain the optimal policy $\pi^*$:

$$\pi^* = \arg\max_{\pi_\theta} \; \mathbb{E}_{\tau \sim p(\tau|\pi_\theta)} \left[ \sum_{t} \gamma^t \, r(s_t, a_t) \right],$$

where $p(\tau|\pi_\theta)$ is the marginalization of the state-action pairs over the set of all possible trajectories $\tau$ following policy $\pi_\theta$. The optimal parameters are evaluated iteratively by taking gradient ascent steps in the direction of increasing cumulative reward. More details on the algorithm and the associated optimization technique are provided in Section II-C.
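To make the objective concrete, the following is a minimal Python sketch of a Monte Carlo estimate of this expected discounted return; the function names (`discounted_return`, `expected_return`) are ours for illustration and are not part of the paper's framework:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward sum_t gamma^t * r_t for one episode."""
    g, total = 1.0, 0.0
    for r in rewards:
        total += g * r
        g *= gamma
    return total

def expected_return(episodes, gamma=0.99):
    """Monte Carlo estimate of the RL objective: the mean discounted
    return over a batch of sampled trajectories."""
    return float(np.mean([discounted_return(ep, gamma) for ep in episodes]))
```

In practice the gradient ascent steps of Section II-C maximize exactly this quantity, estimated from batches of simulated episodes.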
II-B State and Action Space
We will briefly describe the quadrupedal model of Stoch. There are four identical legs and eight actuators in the robot which are distributed equally among its joints. Each leg has a hip and a knee joint, and the actuators provide flexion-extension in each joint. Fig. 1 shows both the experimental and the simulated robot model. Details on the experimental model are in Section IV.
II-B1 State Space
The state for the training is represented by the angles, velocities, and torques of the active joints. Body orientation (in quaternions) is also included in the state space. The combined representation yields a 28D state space. Note that the motor torques are typically treated as states in the RL framework.
II-B2 Action Space
Reference  demonstrated the effectiveness of choosing actuator joint angles as the action space. The planar four bar linkage configuration of the legs, however, leads to self-colliding regions in the motor space. This creates a non-convex search space and degrades the learning efficiency. Hence, we chose to learn the legs’ end-point positions in polar coordinates, represented as $\{r_i, \alpha_i\}$, where $i \in \{1, 2, 3, 4\}$ is the leg index. The polar coordinates for the four legs collectively provide an 8-dimensional action space. For each $\{r_i, \alpha_i\}$, we can obtain the joint angles (both hip and knee of leg $i$) via an inverse kinematics solver. We will denote the angles obtained from the inverse kinematics solver as $\{q_i^{hip}, q_i^{knee}\}$. The polar coordinates can be restricted to a bounding box, thereby indirectly imposing angle limits on the joint angles.
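For intuition, below is a minimal sketch of such an inverse kinematics map for an idealized serial two-link planar leg. The actual Stoch legs use a four bar linkage, so this closed form is only an approximation; the equal link lengths (taken as the 120 mm segment length of Table II) and the function name `leg_ik` are our assumptions:

```python
import math

L1 = L2 = 0.12  # link lengths in meters (Table II segment length; simplification)

def leg_ik(r, alpha):
    """Map a foot position in leg-frame polar coordinates (r, alpha) to
    (hip, knee) joint angles of a planar 2R leg. Angles in radians."""
    # law of cosines for the knee angle (elbow-down branch)
    c = (r * r - L1 * L1 - L2 * L2) / (2.0 * L1 * L2)
    c = max(-1.0, min(1.0, c))          # clamp against round-off outside [-1, 1]
    knee = math.acos(c)
    # hip angle: polar angle minus the angle of the shank relative to the thigh
    hip = alpha - math.atan2(L2 * math.sin(knee), L1 + L2 * math.cos(knee))
    return hip, knee
```

Restricting $(r, \alpha)$ to a bounding box before calling such a solver is what indirectly imposes the joint angle limits mentioned above.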
II-C Network and Learning Algorithm
Since the states and actions are continuous, we seek algorithms that enable us to construct optimal policies that yield action values in the continuous space. Proximal Policy Optimization (PPO)  has proved to be very successful for obtaining optimal continuous action values . It is an on-policy, model-free algorithm based on the actor-critic learning framework. An actor (also called the policy) decides an action given a state, and a critic aims to improve the policy by evaluating the action value function. The actor and critic networks are parameterized by neural networks with parameters $\theta$ and $\phi$ respectively.
The learning algorithm samples episodes up to a maximum length using the current policy. The experience tuples, consisting of $(s_t, a_t, r_t, s_{t+1})$, are stored and used for on-policy updates of the parameters. The policy gradient method  gives the objective function to optimize the policy parameters:

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t \left[ \log \pi_\theta(a_t | s_t) \, \hat{A}_t \right],$$
where $\hat{A}_t$ is called the estimated advantage function. It is given by $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, where $V(s)$ is the value function at state $s$. Furthermore, in order to account for the disparity between the behavioral policy $\pi_{\theta_{old}}$ and the current updating policy $\pi_\theta$, the advantage estimate is multiplied by the importance sampling ratio

$$\rho_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}.$$
Lastly, in order to avoid large and sudden policy variations during training, PPO penalizes high KL-divergence and solves an unconstrained optimization. Hence the final objective for the PPO algorithm is

$$L^{KL}(\theta) = \hat{\mathbb{E}}_t \left[ \rho_t(\theta) \hat{A}_t - \beta \, \mathrm{KL}\!\left[ \pi_{\theta_{old}}(\cdot | s_t), \pi_\theta(\cdot | s_t) \right] \right],$$

where $\beta$ is an adaptive penalty coefficient. Given a target KL-divergence $\delta$, the $\beta$ for the next policy update is calculated as: if the measured KL exceeds $1.5\,\delta$, then $\beta \leftarrow 2\beta$; if it falls below $\delta/1.5$, then $\beta \leftarrow \beta/2$.
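The adaptive rule can be stated in a few lines of Python; the function name `update_kl_penalty` is ours, while the factors 1.5 and 2 follow the PPO paper's heuristic:

```python
def update_kl_penalty(beta, kl, kl_target):
    """Adaptive KL-penalty coefficient update from the PPO paper:
    grow beta when the measured KL overshoots the target, shrink it
    when the KL undershoots, otherwise leave it unchanged."""
    if kl > 1.5 * kl_target:
        beta *= 2.0
    elif kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```

This keeps successive policies close in distribution without requiring a hard trust-region constraint.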
We used the open source implementation of PPO by TensorFlow Agents , which creates a symbolic representation of the computation graph. The implementation is highly parallelized and performs full-batch gradient ascent updates, using the Adam optimizer , on the batch of data collected from multiple environment instances.
II-D Simulation Framework
We used the Pybullet  simulator, built on top of the Bullet3 physics engine, for realistic model simulation. A three-dimensional computer-aided-design (CAD) model was developed using SolidWorks  to capture the kinematics and inertial properties of the robot. This model was transferred to Pybullet using a Universal Robot Description Format (URDF)  exporter. In addition, the actual masses of all the links, actuator force limits, joint limits, and kinematic-loop constraints of the flexural joints were measured and manually updated in the URDF file for a more realistic simulation (see Fig. 1).
II-E Reward Function
We designed a reward function that gives positive reinforcement for increasing the robot’s base speed and simultaneously penalizes high energy consumption. The agent receives a scalar reward after each action step according to the reward function

$$r = w_{\Delta x} \, \Delta x - w_E \, E.$$

Here $\Delta x$ is the difference between the current and the previous base position along the $x$-axis, $E$ is the energy spent by the actuators for the current step, and $w_{\Delta x}$ and $w_E$ are the weights corresponding to each term. $E$ is computed as

$$E = \sum_{i=1}^{4} \left( \left| \tau_i^{hip} \, \omega_i^{hip} \right| + \left| \tau_i^{knee} \, \omega_i^{knee} \right| \right) \Delta t,$$

where $\tau_i$ are the motor torques and $\omega_i$ are the motor velocities of leg $i$ respectively.
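A minimal sketch of this reward computation is given below; the function name `step_reward` and the default weight values are placeholders for illustration, not the trained values:

```python
import numpy as np

def step_reward(delta_x, torques, velocities, dt, w_vel=1.0, w_energy=0.05):
    """Reward = weighted forward progress minus weighted actuator energy.
    torques and velocities are per-actuator arrays for the current step;
    the weights here are illustrative placeholders."""
    # mechanical energy over the step: sum of |torque * velocity| * dt
    energy = float(np.sum(np.abs(np.asarray(torques) * np.asarray(velocities))) * dt)
    reward = w_vel * delta_x - w_energy * energy
    return reward, energy
```

Tuning the two weights trades off gait speed against efficiency, which is the balance the reward function above is designed to strike.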
II-F Gait Generation and Simulation Results
The actor and critic networks in the learning algorithm consist of two fully connected layers, with the first and second layers consisting of 200 and 100 nodes respectively. The activation units are {ReLU, ReLU} in the actor network, and {ReLU, ReLU, linear} in the critic network. Other hyper-parameters are listed in Table I. The Pybullet simulator is configured to spawn multiple agents for faster collection of samples; in our case, 30 agents were used in parallel threads.
Quadrupedal animals typically have eight types of gaits [31, 32]. Work done in  used reference trajectories, and  used open-loop signals, to learn a specific gait. In this work, however, we learned everything from scratch. To learn a specific gait type, we leveraged gait symmetries and put hard constraints on the foot positions. For example, in trot, diagonally opposite legs are in sync; hence, we provide the same $\{r_i, \alpha_i\}$ to these legs. This also reduces the dimension of the action space from 8 to 4.
The proposed approach yields walking gaits that are efficient, fast, and also robust in simulation. The learning algorithm runs for a maximum of  million steps, and the observed training time is  hours on an Intel Core i7 (3.5 GHz) machine with 32 GB RAM.
|Learning rate (actor, critic)| |
|Training steps per epoch| |
|KL target ($\delta$)| |
|End-point position limit ($\alpha$)| |
|End-point position limit ($r$)| cm,  cm|
There are two main challenges in realizing D-RL based gaits in quadrupeds:
II-G1 Training for model and environment updates
The training framework has a large set of hyper-parameters (a small subset of them is shown in Table I) that require tuning in order to get desirable gaits. These gaits are, indeed, robust and efficient, but any update in the model or change in the gait constraints requires retuning these hyper-parameters in order to obtain desirable gaits. Retuning is not intuitive and requires a series of training iterations. This is a common problem even in nonlinear model based walking optimizations , where small variations in constraints often lead to infeasible solutions.
II-G2 Transferability to real hardware
Transferring these learned gaits to real hardware faces numerous challenges, mainly due to uncertainties from unaccounted compliances, imperfect sensors, and actuator saturation. In addition, the inertial and electrical properties of the actuators vary from one joint to another. We observed that imposing very conservative constraints, like tighter velocity and angle limits, resulted in longer training times, which often converged to very unrealistic gaits.
In order to put these challenges in context, it is necessary to review the methodology followed in . They proposed developing an accurate robot model, identifying the actuator parameters, and studying the latencies of the system. This procedure, indeed, resulted in more realistic gaits, but required multiple instances of training and is highly sensitive to any update in hardware. In addition, DNNs are computationally intensive (an NVIDIA Jetson TX2 processor was used). Therefore, as a deviation from this approach, we primarily focus on a methodology that is more robust to hardware changes and also computationally less intensive. As we add more tasks and objectives for a diverse set of environments, it is important to focus on using a compressed set of walking parameters with minimal training. We will discuss this in detail in the next section.
III Realizing Behaviors through kMPs
The kinematic motion primitives (kMPs) are invariant waveforms, sufficient to explain a wide variety of complex coordinated motions, including discrete (e.g., reaching for a target with one hand) and periodic behaviors like walking and running . This concept of motion primitives arose from biological observations [15, 34, 13]. In essence, it was observed that biological motions can be described to a good extent with the combination of a small number of lower dimensional behaviors [14, 35].
kMPs are extracted via principal component analysis (PCA) . PCA is useful when there is extensive data from walking demonstrations, which can be processed to extract the kMPs. These kMPs can then be adapted/transformed to realize walking in robots. In essence, the principal components yield the vectors in the joint space with the highest variance for the given data. A linear transformation of the joint space to these principal components results in dimensionality reduction. A more detailed exploration of this analysis is given in  and .
We will first describe the methods used for the data analysis, and then the associated principal components. kMPs for the different gaits (trot, pace, and bound) will be extracted and compared. Toward the end of this section, we will also provide a statistical comparison between the different kMPs.
III-A Extraction of kMPs
Having obtained walking in simulation based on a trained policy, we record the joint angles and body orientation of the robot for  steps. The data contains a set of joint trajectories for each trial. The recorded data is then processed as follows. First, we remove the front and rear % of the data set to avoid transients. Second, we divide the data into segments by peak detection. All these data segments are then shifted and scaled w.r.t. time in order to align their end points. By polynomial interpolation, we reconstruct trajectories in between the data points. With this reconstruction, we can directly obtain the mean data (the average of the data at every time instant).
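The segmentation and time-normalization steps can be sketched as follows; the helper names are ours, and a production pipeline would use a robust peak detector such as `scipy.signal.find_peaks` and the polynomial interpolation mentioned above instead of the linear interpolation used here:

```python
import numpy as np

def segment_by_peaks(x, min_separation=10):
    """Split a 1-D periodic signal at its local maxima (naive peak
    detection) and return the list of between-peak segments."""
    peaks = [i for i in range(1, len(x) - 1)
             if x[i] > x[i - 1] and x[i] >= x[i + 1]]
    kept = []
    for p in peaks:                      # enforce a minimum peak spacing
        if not kept or p - kept[-1] >= min_separation:
            kept.append(p)
    return [x[a:b] for a, b in zip(kept, kept[1:])]

def resample_segment(seg, m=100):
    """Time-normalize one segment to m samples so that segments of
    different durations can be averaged point-by-point."""
    t_old = np.linspace(0.0, 1.0, len(seg))
    t_new = np.linspace(0.0, 1.0, m)
    return np.interp(t_new, t_old, seg)
```

Averaging the resampled segments sample-by-sample then yields the mean data used in the kMP extraction below.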
Given the mean data, let $N$ be the total number of points for each joint. Denote this data by $Q \in \mathbb{R}^{N \times n}$, where $n$ is the number of actuated joints ($n = 8$ for Stoch). Let $\bar{Q}$ be the zero mean data obtained from $Q$. We compute the covariance matrix as

$$\Sigma = \frac{1}{N} \bar{Q}^\top \bar{Q}.$$

The principal components are precisely the eigenvectors, $e_1, e_2, \ldots, e_n$, of $\Sigma$. Hence, for $n$ joint angles, we have $n$ principal components. Given the principal components, we define the kMPs as

$$\mathrm{kMP}_j = \frac{\bar{Q} e_j}{\| \bar{Q} e_j \|_\infty}, \quad j = 1, 2, \ldots, k,$$

where the infinity norm $\| \cdot \|_\infty$ is used to normalize the magnitude. We typically use fewer components ($k = 4$ for Stoch) for reconstructing the joint angle trajectories from the kMPs, since they account for over 99% of the cumulative percentage variance (see the table in Fig. 3).
Having obtained the kMPs, we reconstruct the original joint trajectories as follows. Denote the matrix of kMPs as $K = [\mathrm{kMP}_1, \mathrm{kMP}_2, \ldots, \mathrm{kMP}_k] \in \mathbb{R}^{N \times k}$, where $k$ is typically less than $n$ (for Stoch $k = 4$). We have the final reconstruction of the joint angles as

$$Q_d = K S + Q_0, \qquad (9)$$

where $Q_d \in \mathbb{R}^{N \times n}$ is the matrix of desired joint angle values, $S \in \mathbb{R}^{k \times n}$ is the coefficient matrix, also called the kMP synergy map, that maps the kMPs to the joint space, and $Q_0$ is the zero mean offset matrix (derived from $Q$), which is added back to the joint angles.
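The extraction and reconstruction steps above can be sketched with NumPy as follows; `extract_kmps` is our illustrative name, and the least-squares fit via pseudo-inverse stands in for the synergy map computation:

```python
import numpy as np

def extract_kmps(Q, k=4):
    """PCA-based kMP extraction from an (N x n) joint-angle matrix Q.
    Returns K (N x k): normalized kMP time series, S (k x n): synergy map,
    and the column-mean offset, so that Q is approximated by K @ S + offset."""
    offset = Q.mean(axis=0)
    Qbar = Q - offset                                # zero-mean data
    cov = Qbar.T @ Qbar / Q.shape[0]                 # covariance matrix
    _, eigvecs = np.linalg.eigh(cov)                 # eigenvalues ascending
    E = eigvecs[:, ::-1][:, :k]                      # top-k principal axes
    K = Qbar @ E                                     # project onto the axes
    K = K / (np.abs(K).max(axis=0) + 1e-12)          # infinity-norm scaling
    S = np.linalg.pinv(K) @ Qbar                     # least-squares synergy map
    return K, S, offset
```

For low-dimensional gait data the residual of `K @ S + offset` is negligible, which mirrors the over-99% cumulative variance captured by four components.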
III-B Discussion & Comparison
Fig. 3 shows the result of the reconstruction from the kMPs, along with a table showing the cumulative percent variance of the original trajectory reconstructed from each kMP. It can be verified that only four kMPs were sufficient for the final reconstruction. (It is important to note that this observation is true for any continuous trajectory that does not have “high-frequency” oscillations or even large slopes. This property was utilized to reconstruct walking trajectories by using ellipses in  and simple basis functions in .) Fig. 4 shows the comparison between the end foot trajectories obtained from the simulation and those obtained from the reconstruction. It must be noted that the kMPs yield trajectories that are as good as the time averaged trajectories of the original data (which is eight dimensional). In addition, with kMPs, there is existing literature on obtaining derived behaviors like walk, trot, and gallop [15, 37], which will be explained in the next section.
It was shown in  that biological quadrupeds (like horses) inherently contain some basic patterns that were consistent across different gaits and models. Based on this result, we compared the trot data obtained from our simulation with the trot data obtained from the horse, and the results are in Fig. 5. We observed that there was a high correlation not only between the kMPs obtained from the learned and horse gait (Fig. 5-A), but also between the kMPs obtained from different learned gaits (Fig. 5-B). Table III provides more statistical details, where it is shown that cross-covariance w.r.t. the first two kMPs is more than . These observations point to the fact that even the kMPs derived from the RL based simulation followed very similar patterns, irrespective of the type of gaits and the model used.
IV Experimental Results
In this section we discuss the experiments conducted on the quadrupedal robot Stoch. We first describe the hardware, and then the methodologies used to realize walking, along with the results.
IV-A Hardware Description of Stoch
Stoch is a quadrupedal robot (see Fig. 1) designed and developed in-house at IISc, Bangalore. Model and actuator specifications are summarized in Table II. The legs are assembled with lightweight carbon fiber (CF) tubes for the linkages and 3D printed poly-lactic acid (PLA) parts for the joints. The servo motors (Kondo HV, Robokits Ultra Torque RKI-) are mounted in a PLA housing. For measuring the body orientation, we use a -axis IMU (Sparkfun MPU). For measuring the joint angles, we use magnetic rotary (MR) encoders (Bourns EMSa). PWM signals to the servo motors are sent via an Adafruit -channel -bit PWM driver (PCA). The main computation platform is a Raspberry Pi -B. Electric power is supplied from Li-Po batteries (a 3s pack for the hip motors and a 2s pack for the knee motors).
|total leg length|230 mm|min./max. hip joint|-45°/45°|
|leg segment length|120 mm|min./max. knee joint|-70°/70°|
|total mass|3.0 kg|max hip speed|461°/s|
|max hip torque|29 kg-cm|max knee speed|461°/s|
|max knee torque|32 kg-cm|motor power (servo)|16 W|
IV-B Synergy Matrices
The general procedure to obtain the walking gaits for Stoch is as follows. We first reconstruct the trot gait from the kMPs by using the synergy matrix $S$. If $\bar{Q}_{trot}$ is the (zero mean) trot data obtained from the simulation, then we obtain $S$ as

$$S = K^{\dagger} \bar{Q}_{trot},$$

where $K^{\dagger}$ is the pseudo-inverse of $K$, the matrix of kMPs. This is computed offline. Synergy matrices for other types of gaits are obtained by modifying $S$. For example, to obtain the bound gait, we modify $S$ (by time shifting) in such a way that the front legs are in phase (resp. the hind legs). We have also reshaped the end point trajectories by small amounts and obtained the corresponding $S$. Since we are manually modifying the data, a better approach would have been to use optimization, which will be explored in future work.
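A sketch of deriving a bound-like gait from trot data by re-phasing per-leg joint columns is given below; the leg naming, the column mapping, and the half-period shift are our illustrative choices, not the paper's exact procedure:

```python
import numpy as np

def bound_from_trot(Q_trot, leg_cols):
    """Illustrative gait derivation: re-phase the per-leg joint columns of
    a periodic (N x n) trot trajectory so that the two front legs (and the
    two hind legs) move in phase, as in a bound. leg_cols maps a leg name
    (FL, FR, RL, RR) to its column indices in Q_trot."""
    N = Q_trot.shape[0]
    Q = Q_trot.copy()
    # In trot, FL/RR are in phase and FR/RL are in phase. Shifting FR and
    # RL by half a period puts the front pair and the hind pair in phase.
    for leg in ("FR", "RL"):
        for c in leg_cols[leg]:
            Q[:, c] = np.roll(Q_trot[:, c], N // 2)
    return Q
```

The synergy matrix for the derived gait is then recomputed from this modified data via the pseudo-inverse, exactly as for the trot.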
Having obtained the kMPs and the synergy matrices for the trot, walk, gallop, and bound gaits, we reconstruct the joint angle trajectories in hardware by using (9). These joint angle trajectories are then used as command positions for each of the actuators. Video results are available at https://youtu.be/swcStUm0Nuk
Due to space constraints, we only show the experimental data for the trot and its derived gaits. We have also provided kMPs extracted directly from the experimental trajectories, with the results in Fig. 5-A and Table III. It can be verified that the experimental kMPs also have a high correlation with the horse and RL kMPs. Table IV compares the speeds of the various gaits obtained.
|Gait type|Trot|Walk|Gallop|Bound|Modified Trot 1|Modified Trot 2|
|Speed| | | | | | |
In order to put these observations in perspective, it is important to further discuss the kMPs derived from the horse data. We cannot realize walking by using the horse kMPs alone: based on (9), we need to determine the right synergy matrix $S$ that suits the given set of kMPs. It is also important to note that the kMPs derived from horses may not be the optimal choice for our quadruped (there are infinitely many kMPs). On the other hand, since we already have (locally) optimal trajectories obtained from training, we focus on computing the synergy matrix from the simulation kMPs. With these kMPs we were able to demonstrate not only trotting, but also other derived behaviors like walking, galloping, and bounding.
V Conclusion

We demonstrated the walk, trot, gallop, and bound gaits experimentally on the quadrupedal robot Stoch. The key contribution is the methodology followed in transferring the walking gaits, obtained via D-RL based training, from simulation to real hardware. Our approach was mainly motivated by  and , which pointed to the fact that motion primitives have a very similar underlying structure that is independent of the type of gait and the model. Small variations in gaits and model do not require retraining via D-RL methods from scratch; kMPs are sufficient for such scenarios. On the other hand, large variations in models and environments require a thorough search, thereby requiring learning based approaches. Since we mainly focused on data driven kMPs in this paper, future work will involve obtaining these motion primitives from continuous functions like polynomials, canonical walking functions, or other basis functions that are equally suitable for generating walking gaits.
We acknowledge Ashish Joglekar, Balachandra Hegde, Ajay G and Abhimanyu for the PCB design and software development.
-  “GE Walking Truck,” https://youtu.be/ZMGCFLEYakM.
-  M. H. Raibert, “Legged robots,” Communications of the ACM, vol. 29, no. 6, pp. 499–514, 1986.
-  M. Vukobratović and B. Borovac, “Zero-moment point—thirty five years of its life,” International journal of humanoid robotics, vol. 1, no. 01, pp. 157–173, 2004.
-  T. McGeer, “Passive dynamic walking,” The International Journal of Robotics Research, vol. 9, no. 2, pp. 62–82, 1990. [Online]. Available: https://doi.org/10.1177/027836499000900206
-  S. H. Collins, M. Wisse, and A. Ruina, “A three-dimensional passive-dynamic walking robot with two legs and knees,” The International Journal of Robotics Research, vol. 20, no. 7, pp. 607–615, 2001.
-  J. Pratt, J. Carff, S. Drakunov, and A. Goswami, “Capture point: A step toward humanoid push recovery,” in 2006 6th IEEE-RAS International Conference on Humanoid Robots, Dec 2006, pp. 200–207.
-  E. R. Westervelt, J. W. Grizzle, and D. E. Koditschek, “Hybrid zero dynamics of planar biped walkers,” IEEE Transactions on Automatic Control, vol. 48, no. 1, pp. 42–56, Jan 2003.
-  R. Tedrake, T. W. Zhang, and H. S. Seung, “Stochastic policy gradient reinforcement learning on a simple 3d biped,” in Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, vol. 3. IEEE, pp. 2849–2854.
-  S. Wang, J. Braaksma, R. Babuska, and D. Hobbelen, “Reinforcement learning control for biped robot walking on uneven surfaces,” in The 2006 IEEE International Joint Conference on Neural Network Proceedings, July 2006, pp. 4173–4178.
-  J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” CoRR, vol. abs/1804.10332, 2018. [Online]. Available: http://arxiv.org/abs/1804.10332
-  Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne, “Feedback control for cassie with deep reinforcement learning,” CoRR, vol. abs/1803.05580, 2018. [Online]. Available: http://arxiv.org/abs/1803.05580
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
-  F. L. Moro, A. Spröwitz, A. Tuleu, M. Vespignani, N. G. Tsagarakis, A. J. Ijspeert, and D. G. Caldwell, “Horse-like walking, trotting, and galloping derived from kinematic motion primitives (kmps) and their application to walk/trot transitions in a compliant quadruped robot,” Biological cybernetics, vol. 107, no. 3, pp. 309–320, 2013.
-  T. D. Sanger, “Human arm movements described by a low-dimensional superposition of principal components,” Journal of Neuroscience, vol. 20, no. 3, pp. 1066–1072, 2000.
-  F. L. Moro, N. G. Tsagarakis, and D. G. Caldwell, “On the kinematic motion primitives (kmps)-theory and application,” Frontiers in neurorobotics, vol. 6, p. 10, 2012.
-  A. D. Ames, “Human-inspired control of bipedal walking robots,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1115–1130, May 2014.
-  N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, vol. 3. IEEE, pp. 2619–2624.
-  J. Reher, E. A. Cousineau, A. Hereid, C. M. Hubicki, and A. D. Ames, “Realizing dynamic and efficient bipedal locomotion on the humanoid robot durus,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 1794–1801.
-  S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake, “Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot,” Autonomous Robots, vol. 40, no. 3, pp. 429–455, Mar 2016. [Online]. Available: https://doi.org/10.1007/s10514-015-9479-3
-  A. Hereid and A. D. Ames, “Frost: Fast robot optimization and simulation toolkit,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vancouver, BC, Canada: IEEE/RSJ, Sep. 2017.
-  K. Sreenath, H.-W. Park, I. Poulakakis, and J. W. Grizzle, “A compliant hybrid zero dynamics controller for stable, efficient and fast bipedal walking on mabel,” The International Journal of Robotics Research, vol. 30, no. 9, pp. 1170–1193, 2011. [Online]. Available: https://doi.org/10.1177/0278364910379882
-  S. N. Yadukumar, M. Pasupuleti, and A. D. Ames, “From formal methods to algorithmic implementation of human inspired control on bipedal robots,” in Algorithmic Foundations of Robotics X, E. Frazzoli, T. Lozano-Perez, N. Roy, and D. Rus, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 511–526.
-  X. B. Peng and M. van de Panne, “Learning locomotion skills using deeprl: Does the choice of action space matter?” CoRR, vol. abs/1611.01055, 2016. [Online]. Available: http://arxiv.org/abs/1611.01055
-  P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. [Online]. Available: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16669
-  R. S. Sutton, D. Mcallester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in In Advances in Neural Information Processing Systems 12. MIT Press, 2000, pp. 1057–1063.
-  D. Hafner, J. Davidson, and V. Vanhoucke, “Tensorflow agents: Efficient batched reinforcement learning in tensorflow,” arXiv preprint arXiv:1709.02878, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
-  “Pybullet,” https://pybullet.org/wordpress/.
-  “Solidworks,” https://www.solidworks.com.
-  “URDF,” https://wiki.ros.org/urdf.
-  D. Owaki and A. Ishiguro, “A quadruped robot exhibiting spontaneous gait transitions from walking to trotting to galloping,” Scientific reports, vol. 7, no. 1, p. 277, 2017.
-  Y. Fukuoka, Y. Habu, and T. Fukui, “A simple rule for quadrupedal gait generation determined by leg loading feedback: a modeling study,” Scientific Reports, vol. 5, p. 8169, Feb 2015, article. [Online]. Available: http://dx.doi.org/10.1038/srep08169
-  J. P. Reher, A. Hereid, S. Kolathaya, C. M. Hubicki, and A. D. Ames, “Algorithmic foundations of realizing multi-contact locomotion on the humanoid robot DURUS,” in The International Workshop on the Algorithmic Foundations of Robotics (WAFR), 2016. [Online]. Available: http://ames.caltech.edu/WAFR_2016_multicontact.pdf
-  F. L. Moro, N. G. Tsagarakis, and D. G. Caldwell, “A human-like walking for the compliant humanoid coman based on com trajectory reconstruction from kinematic motion primitives,” in Humanoid Robots (Humanoids), 2011 11th IEEE-RAS International Conference on. IEEE, 2011, pp. 364–370.
-  G. J. Stephens, B. Johnson-Kerner, W. Bialek, and W. S. Ryu, “Dimensionality and dynamics in the behavior of c. elegans,” PLoS computational biology, vol. 4, no. 4, p. e1000028, 2008.
-  B. Lim, S. Ra, and F. C. Park, “Movement primitives, principal component analysis, and the efficient generation of natural motions,” in Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on. IEEE, 2005, pp. 4630–4635.
-  A. T. Sprowitz, A. Tuleu, A. J. Ijspeert et al., “Kinematic primitives for walking and trotting gaits of a quadruped robot with compliant legs,” Frontiers in computational neuroscience, vol. 8, p. 27, 2014.
-  I. Jolliffe, “Principal component analysis,” in International encyclopedia of statistical science. Springer, 2011, pp. 1094–1096.
-  S. Kim and F. C. Park, “Fast robot motion generation using principal components: Framework and algorithms,” IEEE Trans. Industrial Electronics, vol. 55, no. 6, pp. 2506–2516, 2008.