Learning Stable Manoeuvres in Quadruped Robots from Expert Demonstrations

07/28/2020 ∙ by Sashank Tirumala, et al. ∙ indian institute of science 2

With the research into development of quadruped robots picking up pace, learning based techniques are being explored for developing locomotion controllers for such robots. A key problem is to generate leg trajectories for continuously varying target linear and angular velocities, in a stable manner. In this paper, we propose a two pronged approach to address this problem. First, multiple simpler policies are trained to generate trajectories for a discrete set of target velocities and turning radius. These policies are then augmented using a higher level neural network for handling the transition between the learned trajectories. Specifically, we develop a neural network-based filter that takes in target velocity, radius and transforms them into new commands that enable smooth transitions to the new trajectory. This transformation is achieved by learning from expert demonstrations. An application of this is the transformation of a novice user's input into an expert user's input, thereby ensuring stable manoeuvres regardless of the user's experience. Training our proposed architecture requires much less expert demonstrations compared to standard neural network architectures. Finally, we demonstrate experimentally these results in the in-house quadruped Stoch 2.



There are no comments yet.


page 1

page 6

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

The domain of quadrupedal research has reached industrial markets today with quite a few research labs/companies successfully commercializing their quadruped robots [7], [4], [6]. Similar to driving a car, controlling a quadruped robot has a steep learning curve that a novice user must struggle through. Assuming you are given an interface to control the velocity of the center of mass of a quadruped robot, rapid changes in said velocity will cause the robot to topple. This instability is not safe and could permanently damage the robot. However an expert user will be capable of performing the desired rapid velocity changes through experience that he gained through practice. Is it possible then to augment the novice user’s commands such that it represents the expert user’s manoeuvres?

Fig. 1: Figure showing the custom built quadruped robot, Stoch 2. Simulated version is shown on the left, and the actual hardware is shown on the right.

Quadruped Locomotion is a challenging research problem that has been studied for decades starting from the GE walking truck in 1960’s [5] . A slew of techniques have been used—inverted pendulum model based controllers [15]

, zero-moment point based controllers

[20], hybrid zero dynamics [19], model predictive controllers [10], deep reinforcement learning [8]—to name a few. These techniques provide an interface to control the center of mass velocity over a rough terrain [10], [8]. Relatively little work has been done on handling rapid changes in the desired center of mass velocity. We propose to tackle this as a behaviour cloning problem. In particular, we have access to expert input that is capable of performing rapid changes in a stable and safe manner. We also have access to a novice user’s input that is unsafe for the robot. We aim to train a neural network so that it takes a novice user’s input and transforms it to an expert user’s input. We validate our neural network by demonstrating rapid changes in linear ( to ) and angular velocity ( to ) in our in-house quadruped robot Stoch .

I-a Related Work

An omni-directional quadruped robot controller requires two parts: stable leg trajectories for different motions such as walking forward, turning etc., and techniques to transition between these trajectories when required. In the literature there has been more focus on the first problem of generating stable leg trajectories. [11] first used policy gradient algorithms to learn optimal end-foot trajectories for the Sony AIBO-Quadruped Robot. [3] used particle swarm optimisation to decide the parameters for a few different trajectories including turning on the Oncilla Quadruped Robot. Our current work borrows from the turning strategies described in the above paper. [17] first demonstrated techniques to learn a quadruped controller in simulation and deploy it on the actual robot. [12] modified the reinforcement learning algorithm to learn the parameterized leg trajectories quickly while [18] constrained the network architecture to speed up learning. In this paper we combine [12] with a sample efficient learning algorithm, augmented random search [14], to learn a hundred trajectories in simulation and deploy the learnt trajectories on our robot.

Transitioning between stable trajectories for quadrupeds was first studied by [2], in which analytical equations were used to stitch Hermite splines. [13] developed a similar analytical framework to transition between a number of different trajectories for slow static walking. These methods, despite being effective, were limiting the number and speed of allowable transitions. The MIT Cheetah [1] proposed a PD controller to track varying center of mass velocities combined with Raibert’s controller [15] to ensure stability of the robot. In general such controllers require expensive series elastic actuators or direct-drives with torque control. [9] trained a reinforcement learning agent capable of tracking a specific desired velocity profile while [8] trained a reinforcement learning controller capable of tracking any combination of linear and angular velocities. Our work is more adjacent to the above works, where we would like to replace unstable commands given by a novice user with stable commands given by an expert user. Our controller can sit on top of the above proposed controllers and does not have any specific actuator requirements. This controller will be based on a neural network based filter, which has the ability to generalize well. We will also show that the types of transitions executed are nonlinear, and linear filters do not yield the same result (see IV-B ahead).

I-B Contributions

In this paper we extend the work on trajectory generation by using the analytical equations in [3], [12], to constrain the action space of our learning agent, thereby learning trajectories in hours on an Intel Core i7 processor. Our learnt trajectories demonstrate omni-directional motion in our robot Stoch

. To transition between different trajectories, we exploit the knowledge that a human expert has in tele-operating the quadruped robot. We collect expert demonstrations and train a neural network to convert novice user input to expert user input. Our neural network has a novel architecture that consists of two non-trainable layers. It also consists of a classification network that takes in the user input and outputs the probability of choosing a particular filter at every time-step. This probability is used to calculate a weighted average of the filters, which is the final output of the network. By restricting our network to only output weighted averages of different filters at every time-step, we require much less training data to generalize, compared to fully connected dense neural networks and convolutional neural networks. This solves the bottleneck of collecting expert data that is prohibitive on real robot systems. Our final network is capable of generalizing with five expert demonstrations. We validate our network by testing it on our quadruped robot.

Ii Robot model and control

Fig. 2: A five-bar mechanism is used as the leg. This mechanism is actuated by the motors located at the main torso of the robot. The trajectory followed by the foot is parameterized by a rational Beziér curve with control points as shown.

In this section, we will discuss our custom built quadruped robot Stoch 2. Specifically, we will provide details about the hardware, the associated model, and the trajectory tracking control framework used for the various gaits of the robot.

Ii-a Kinematic Description

Stoch is a quadruped robot designed and developed in-house at the Indian Institute of Science (IISc), Bengaluru, India. In this robot, each leg comprises of a parallel five-bar linkage mechanism where two of the links are actuated and the end effector (i.e., the foot) is capable of moving safely (without encountering singular configurations) in a trapezoidal work-space as shown in Fig. 2. The calculation of the leg work-space and it’s inverse kinematic details can be found in [4]. In this paper, we focus on realizing trajectories of the feet in Cartesian coordinates on a fixed plane. The plane is chosen such that it passes through the center of the hip joint and is constrained by the three dimensional trapezoidal prism work-space of each foot.

Ii-B Control Framework

Our control framework takes in the turning radius and the heading velocity as input (mapped to joystick analog values), and then outputs the end-effector (end-foot) trajectory. This end-foot trajectory is sent to our inverse kinematics engine, that calculates the desired motor commands for each of the servo motors in our robot. Stable end-foot trajectories are found in simulation. We chose to parameterize the end-foot trajectory of each foot by a rational-Beziér curve. Rational Beziér Curves have attractive properties over other alternatives like cubic splines as they do not have self-intersecting curves, and are guaranteed to lie within the convex hull of the control points. This ensures that the trajectories are always in the work-space of the end-effector. The control points of the rational Beziér in 3D space are chosen analytically and lie on the boundaries of the robot leg-workspace as shown in Fig. 2. The weights of each of the control points are the parameters that we aim to find through our learning framework in simulation. We chose a order Beziér curve for the swing phase of our leg, and a order Beziér curve (straight line) for the stance phase motion of our leg. This choice was made to represent optimal half-boat shaped trajectories for mechanical walkers as described in [16]. Given control points, denoted by , , and the weights, denoted by , we have the Beziér curve given by

where , and . In practise, computing is computationally expensive, so we use a recursive implementation of the De-Casteljau Algorithm. The analytical equations that determine the control points of our Beziér curve is based upon a simplified model for our robot locomotion that we formulated through experimentation. Our experimentation showed that turning is more effective if abduction and step-size for each leg are dynamically varied. This result corroborates well with the analysis in [3]. In particular, each end-foot trajectory is restricted to a plane tilted by an angle about the axis in such a way that

where , the radius of curvature, is related to the desired linear velocity and angular velocity as . () stands for the width of the chassis of the robot and () stands the length of the chassis of the robot. stands for signum function, is the angle made by the leg with the vertical axis and is a constant that takes the following values for the four legs:

where and stand for the front-left, front-right, back-left and back-right legs respectively. Similarly, each leg must also have a step length equal to

where stands for the desired average speed of the robot for second. Note that can be both positive and negative (for left-right turn commands). Given the step-lengths and the plane angles of each leg, we can determine the control points of the Beziér curve that each leg follows. As shown in Fig. 2 above we place six control points such that they lie on the boundaries of our trapezoidal work-space, and lie on the plane that makes an angle about the z-axis. The first and last control points are chosen such that they are symmetric about the center of the work-space and the distance in-between them is equal to the step-length of the respective leg.

Ii-C Trajectory Generation Framework

The weights of the Beziér curve decide the overall shape while the control points limit the search space of all possible shapes. We formulate this trajectory generation as an optimization problem where we aim to find the weights such that a certain cost function is minimized. For a particular linear and angular velocity our weights are held constant. We chose to define our cost function as

where is the energy consumed per step, are the penalties related to rolling and pitching of the robot, and is the number of control time-steps for a single simulation. is given by

Here is angular velocity of motor, is torque of motor and is the total number of motors. These values are calculated by the simulation software during the training process. Here , where and are roll and pitch angles about the x-z axes of the center of mass respectively. To optimize for , many different algorithms can be used. We chose to use Augmented Random Search [14], a well-known training algorithm for linear policies.

Iii Trajectory Transition Framework

Having defined the model and the control methodology, we are now ready to discuss the trajectory transitioning framework of our robot. This part is divided into two sections where we describe the problem formulation, the neural network architecture and the training process.

Iii-a Problem Formulation

We are interested in aggressive manoeuvres where a naive approach will cause the robot to lose balance and fall. The user can input linear and angular velocity () through joystick commands. In our experimentation, it was easier to control the linear velocity and radius of curvature . We normalized the joystick values to the range . Positive indicates moving rightwards, negative indicates moving leftwards, while positive indicates moving forward and negative indicates moving backward.

To demonstrate our trajectory transition framework, we consider three complex manoeuvres that can potentially damage the robot. These manoeuvres are: - rapidly reaching maximum linear velocity and curvature radius, - abruptly coming to a halt from maximum linear velocity and curvature radius, - rapidly changing radius of curvature direction. Then for each of the above manoeuvres we collected novice and expert joystick data as shown in Fig. 3. We observed that there exists a complex relationship between the radius and velocity inputs that one dimensional filters cannot reproduce. In particular, manoeuvre requires to drop whenever sharply changes, which cannot be replicated with a simple filter. A simple filter would cause to gradually reduce from to - and not affect . In manoeuvre and both and change at different rates. In manoeuvre , changes instantly while v moves gradually, while in , reduces gradually while reduces exponentially. Without knowledge of the expert trajectories, choosing an appropriate filter for each manoeuvre is not straightforward. In an analytical approach, as the number of manoeuvres increase, more effort is required to design filters, while some manoeuvres like cannot be recreated with linear filters. A neural network bypasses these difficulties and can scale to as many expert demonstrations as required with no additional effort. Hence, our goal now is to train a neural network that is capable of taking the novice joystick-data, as shown in Fig. 3, and converting it to the expert joystick-data.

Fig. 3: Green dotted line stands for the radius of curvature while the red solid line is the heading velocity. Both values are normalized between and . Joystick data for novice is at the top, and expert data is at the middle for the three described manoeuvres. Neural network output after training is at the bottom.

Iii-B Neural Network Architecture and Training

Fig. 4:

Visualization of the neural network. It splits into two parts, the top layer being non trainable that acts like the low pass filters and the bottom layer being trainable. The bottom layer is a combination of a deep convolutional net (10 layers) followed by 2 fully connected layers for each column of the input. The final output of the bottom layer is multiplied by the final output of the top layer (filter bank) to give the output of the neural network. The product operator is a dot product between two vectors of size


The input to our neural network is the past joystick values from each analog stick and thus . These values are from a duration of seconds. The output of our neural network is the linear velocity and radius of curvature for the current time-step. The first two layers of our neural network consist of non trainable layers that we call “filter-banks”. The weights of these layers are such that they act as simple filters for the joystick inputs. In particular the output of our non-trainable layers is a set of values that is a combination of low pass filtered versions of the input and scaled versions of the input. The low pass filters are moving average filters of window size and scaling filters multiply the input by a constant which varies from .

In parallel, we have a 1D convolutional neural network (CNN) that takes as input and outputs a value . Our CNN is layers deep with convolutional layers and max pooling layers with Re-Lu activation. The output of CNN is sent to two sets of fully connected dense layers with softmax activation function that outputs values that correspond to the probability of choosing a particular filter from the filter bank. The output of these dense layers is used to compute the weighted average of the filter banks which leads to the final output of the network. This is similar to the attention mechanism of neural networks. The visualization of this network is shown in Fig. 4

and we have released the code of this network implemented in tensorflow (provided in

V). We collect a single demonstration of novice and expert input for each desired manoeuvre (three shown in Fig. 3), and train the neural network upon these demonstrations. The learnt network is capable of generalising unseen data of similar shape but of different amplitudes and frequency unlike a fully connected neural network. The hyper-parameters used in our training process are shown in Table 1. The training was stopped once validation loss fell below .

Learning Rate
Learning Rate 0.0002
L2-Regularization 0.000001
Dropouts % 50
Activation Function ReLu + Softmax(last layer)
Batch-size 64
Optimizer Adam
TABLE I: Hyperparameters of neural network

Iv Experimental Results

In this section we provide results to show: improvement of our trajectory generation framework to existing techniques, comparison with standard non-neural network based filters and the improvements of our proposed neural network architecture to standard neural network architectures.

Iv-a Trajectory Generation Framework Experiment

The main goal of our trajectory generation framework is to find stable trajectories for our robot quickly. Stability of the trajectory is measured using the pitch and roll angles of our robot’s body in simulation. A more stable trajectory will have lesser amplitude of oscillation. Compared with a default elliptic trajectory our learnt agent had approximately and lower amplitude of oscillation of roll and pitch angles respectively as seen in Fig. 5. This was expected as the pitch and roll penalties were explicitly added in our cost function during the training process. Generating a single trajectory takes a few minutes on an Intel i7 core PC, and generating the entire set of trajectories for all velocities and radius of curvature took about hours.

Fig. 5: Variation of roll (left) and pitch (right) angles versus time in the robot. Green is the learnt trajectory while red is the default elliptical trajectory.

Iv-B Comparison with Standard Filters

Standard convolutional 2D filters suffer from a number of problems that make them unsuitable to the current application. They are unable to copy complex transitions like transition 3 as shown in Fig. 6. Often the output of such a filter is a crude approximation of the actual expert output. Further, these filters have linear properties such as output superposition and scaling, which our expert output does not follow. To illustrate, consider the change of radius from to and to as shown in Fig. 6. Both of these transitions require the same dip in velocity as the transitions are symmetric (left turn right turn). However a linear filter trained on transition will cause a peak in velocity for the opposite direction, which actively destabilizes the system. Similarly at lower velocities, a linear filter will output a scaled version of the strategy used at higher velocities while in reality an expert user will often pursue a very different strategy. Thus nonlinear alternatives, specifically neural network based alternatives are pursued.

Fig. 6: Comparison between actual expert data and a standard 2D convolutional filter. As we can see a standard 2D filter struggles to copy the expert and exhibits properties of linearity that are undesirable

Iv-C Neural Network Architecture

A standard issue with techniques that imitate experts is that the collection of expert data tends to be time consuming and costly. Thus we designed a network that requires as little expert demonstrations as possible. Here we aim to measure the generalizability of our neural network compared to standard neural network architectures. Generalizability is a broad term and since we are doing only supervised learning we cannot expect our network to truly imitate an expert in completely unforeseen situations. Instead by generalizability we mean two major properties of our expert data: first, our data is a time invariant system. By this we mean that if a novice input is delayed by

seconds then our expert output should also be delayed by seconds. No other changes to the output is necessary. The second is our data is approximately scale invariant. This means that if we multiply the novice input by a constant factor , then our expert output is also approximately multiplied by a constant factor . This need not be strictly followed as experts tend to follow different strategies at different speeds, but broadly our network should be capable of handling scale invariance within a limit.

To measure the capabilities of our network, we measure the validation loss of our network compared to standard neural network architectures. We do so by first collecting a number of expert trajectories. Then we augment the data by temporally delaying it, and scaling it to about different trajectories. We compare our proposed architecture with two common architectures: a) A neural network with fully connected dense layers, and b) a neural network with fully connected convolutional layers. We use , , and demonstrations respectively as our training data-set and measure the loss over the entire training and validation data-set of demonstrations. The results are shown in the Fig. 7. As we can see from the above results, the validation loss of a fully connected or a CNN barely decreases with increase in size of the data-set. This shows that these architectures cannot generalize well. However our proposed neural network architecture has validation loss with demonstrations, making it more sample efficient than standard neural networks.

Fig. 7: Variation of validation-loss with training data. All networks reach very low training error however our network generalizes better to the validation data. Dataset A has demonstrations for training, while B, C have and demonstrations respectively. Note that losses for B, C with the proposed network are very small to be noticed.

V Conclusion and Future Works

We have presented a trajectory generation and transition framework that is easy to use and applicable for omni-directional motion of quadrupedal robots. Trajectory transitions are learnt from demonstrations of an expert user. Trajectory transitions are achieved by a neural network that uses a unique architecture, which is much more suited to mimicking filters than fully connected or convolutional neural networks. As an application, we show how a novice user’s joystick command is converted to safe commands. In future, we would like to generate trajectories for more complex terrains such as stair climbing and uphill slopes. Automating the generation of trajectories given a robot model to quickly generate controllers for different quadruped robots is also an exciting research direction. The video submission accompanying this paper is shown here: https://youtu.be/LRbHetp0dcg, and the code is released here: https://github.com/sashank-tirumala/stoch_robot_test2.


  • [1] G. Bledt, M. Powell, B. Katz, J. Carlo, P. Wensing, and S. Kim (2018-10) MIT cheetah 3: design and control of a robust, dynamic quadruped robot. pp. . External Links: Document Cited by: §I-A.
  • [2] J. Bruce, S. Lenser, and M. Veloso (2002-06) Fast parametric transitions for smooth quadrupedal motion. pp. . External Links: Document Cited by: §I-A.
  • [3] J. Degrave, M. Burm, T. Waegeman, F. Wyffels, and B. Schrauwen (2013-12) Comparing trotting and turning strategies on the quadrupedal oncilla robot. In 2013 IEEE International Conference on Robotics and Biomimetics (ROBIO), Vol. , pp. 228–233. External Links: Document, ISSN null Cited by: §I-A, §I-B, §II-B.
  • [4] D. Dholakiya, S. Bhattacharya, A. Gunalan, A. Singla, S. Bhatnagar, B. Amrutur, A. Ghosal, and S. Kolathaya (2019-04) Design, development and experimental realization of a quadrupedal research platform: stoch. In 2019 5th International Conference on Control, Automation and Robotics (ICCAR), Vol. , pp. 229–234. External Links: Document, ISSN 2251-2446 Cited by: §I, §II-A.
  • [5] Cited by: §I.
  • [6] F. Grimminger, A. Meduri, M. Khadiv, J. Viereck, M. Wüthrich, M. Naveau, V. Berenz, S. Heim, F. Widmaier, J. Fiene, A. Badri-Spröwitz, and L. Righetti (2019) An open torque-controlled modular robot architecture for legged locomotion research. External Links: 1910.00093 Cited by: §I.
  • [7] M. Hutter, C. Gehring, D. Jud, A. Lauber, C. D. Bellicoso, V. Tsounis, J. Hwangbo, K. Bodie, P. Fankhauser, M. Bloesch, R. Diethelm, S. Bachmann, A. Melzer, and M. Hoepflinger (2016) ANYmal - a highly mobile and dynamic quadrupedal robot. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vol. , pp. 38–44. Cited by: §I.
  • [8] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26). External Links: Document, Link, https://robotics.sciencemag.org/content/4/26/eaau5872.full.pdf Cited by: §I-A, §I.
  • [9] A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, and V. Vanhoucke (2018) Policies modulating trajectory generators. In Conference on Robot Learning, pp. 916–926. Cited by: §I-A.
  • [10] D. Kim, J. D. Carlo, B. Katz, G. Bledt, and S. Kim (2019) Highly dynamic quadruped locomotion via whole-body impulse control and model predictive control. ArXiv abs/1909.06586. Cited by: §I.
  • [11] N. Kohl and P. Stone Policy gradient reinforcement learning for fast quadrupedal locomotion. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, Vol. 3, pp. 2619–2624. Cited by: §I-A.
  • [12] S. Kolathaya, A. Joglekar, S. Shetty, D. Dholakiya, Abhimanyu, A. Sagi, S. Bhattacharya, A. Singla, S. Bhatnagar, A. Ghosal, and B. Amrutur (2019-10) Trajectory based deep policy search for quadrupedal walking. In 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Vol. , pp. 1–6. External Links: Document, ISSN 1944-9445 Cited by: §I-A, §I-B.
  • [13] S. Ma, T. Tomiyama, and H. Wada (2005-05) Omnidirectional static walking of a quadruped robot. Robotics, IEEE Transactions on 21, pp. 152 – 161. External Links: Document Cited by: §I-A.
  • [14] H. Mania, A. Guy, and B. Recht (2018) Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1800–1809. Cited by: §I-A, §II-C.
  • [15] M. H. Raibert (1986) Legged robots. Communications of the ACM 29 (6), pp. 499–514. Cited by: §I-A, §I.
  • [16] J. E. Shigley (1960) The mechanics of walking vehicles. Technical report ARMY TANK-AUTOMOTIVE CENTER WARREN MI. Cited by: §II-B.
  • [17] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. CoRR abs/1804.10332. External Links: Link, 1804.10332 Cited by: §I-A.
  • [18] S. Tirumala, A. Sagi, K. Paigwar, A. Joglekar, S. Bhatnagar, A. Ghosal, B. Amrutur, and S. Kolathaya (2019) Gait library synthesis for quadruped robots via augmented random search. External Links: 1912.12907 Cited by: §I-A.
  • [19] E. R. Westervelt, J. W. Grizzle, and D. E. Koditschek (2003-01) Hybrid zero dynamics of planar biped walkers. IEEE Transactions on Automatic Control 48 (1), pp. 42–56. External Links: Document, ISSN 0018-9286 Cited by: §I.
  • [20] A. W. Winkler, F. Farshidian, D. Pardo, M. Neunert, and J. Buchli (2017) Fast Trajectory Optimization for Legged Robots Using Vertex-Based ZMP Constraints. IEEE Robotics and Automation Letters 2 (4), pp. 2201–2208. External Links: Document, Link Cited by: §I.