Locomotion for legged robots is a challenging control problem that requires high-speed control of actuators as well as precise coordination between multiple legs based on various types of sensor data. In addition to basic locomotion, different terrains, tasks or environmental conditions might require specific primitive behaviors.
Recent research shows promising results on learning-based systems for locomotion tasks in simulation and on real hardware [hwangbo2019learning, iscen2018policies, yu2018policy]. Various techniques can be used to discover policies for such tasks. In this work, we focus on Reinforcement Learning (RL) to obtain robust policies.
Robot locomotion is an excellent match for hierarchical control architectures. Indeed, the separation of low-level control of the legs and high-level decision making based on the environment and task at hand provides multiple advantages such as reuse of the learned low-level skills across tasks, and interpretability of the high-level decisions.
Given a complex task, manually defining a suitable hierarchy is typically a tedious task that requires engineering of the state and action spaces as well as reward functions for each primitive. To overcome this, we introduce a hierarchical framework to automatically decompose complex locomotion tasks. A high-level policy issues commands to a low-level policy and decides for how long to execute the low-level policy at a time. The low-level policy acts according to commands from the high-level policy and on-board sensors. Our approach allows separation of the state variables that are used for low-level control, from state variables only required for higher-level control. Our architecture naturally allows the high-level to operate at a slower timescale than the low-level.
We test our framework on a path-following task for a dynamic quadruped robot. The task requires walking in different directions to complete the track while keeping balance. Using our architecture, we train both levels of the hierarchical policy end-to-end. We show that steering behavior automatically emerges in the latent command space between the high-level and low-level policies, which allows reuse of the learned low-level behaviors. We show transfer of the low-level policy to a different track to achieve fast adaptation to a new task. Lastly, we deploy our policies to hardware to validate the learned behaviors on a real robot.
II Related Work
Hierarchical Reinforcement Learning (HRL) methods focus on decomposing complex tasks into simpler sub-tasks. Not only does this help simplify a single difficult problem, it can also help in adapting the solution faster to a new problem if the sub-tasks are general enough. The framework based on pre-defined options [sutton1999between], or temporally extended actions, is one of the first popular methods in this direction. More recently, considerable research attention has been given to the problem of automatically discovering options through experience.
In methods like HRL with hindsight [levy2018hierarchical] and data-efficient HRL [nachum2018data], hierarchy is introduced using universal value functions (value functions that are parameterized by ’goal’). Actions of a higher-level policy, running at a fixed slower timescale, act as goals for a lower-level. A goal is explicitly defined as a point in observation space and the low-level is rewarded for reaching that point. This allows both levels to be trained through their respective reward signals. However, this goal specification is not suitable in all situations. If the observation space is high dimensional, then the high-level task of selecting a goal becomes very difficult. Also, determining when the goal is achieved requires task-specific domain knowledge.
Latent space policies for HRL [haarnoja2018latent] use a different approach to parameterize the low-level. The high-level outputs a set of latent variables as goal for the lower level that are learned through maximum entropy reinforcement learning. Both levels are then trained to maximize the main task reward. This, however, prevents the low-level from being reused for any other task.
Along similar lines, Osa et al. [osa2019hierarchical] recently proposed a method based on information maximization to learn latent variables of a hierarchical policy.
In their paper on meta learning shared hierarchies [frans2017meta], Frans et al. propose an HRL framework that is learned on multiple related tasks. The low-level skills are reused across tasks while the meta-controller is task-specific. Instead of parameterizing a single low-level policy, the meta-controller selects a different low-level policy from a set for each sub-task. In order for general low-level policies to emerge, the framework needs to be trained on a number of related tasks.
In our method, we use a latent goal representation to remove the need to hand-design low-level rewards or to decide on the number of low-level policies. We also use different state representations for the two levels to ensure that reusable low-level skills are learned even when trained on a single task. Moreover, in our method, the high-level policy runs at a variable timescale, easing processing requirements for higher-level state information.
The task of robot navigation lends itself to a hierarchical solution with path-planning at the high-level and point-to-point locomotion at the low-level. In this context, many methods [bischoff2013hierarchical, heess2016learning, faust2018prm] solve these two tasks separately. Heess et al. [heess2016learning] propose a hierarchical framework for locomotion based on modulated locomotor controllers. A low-level spinal network learns primitive locomotion by training on simple tasks. A high-level cortical network drives behavior by modulating the inputs to the pre-trained spinal network. HRL with pre-trained primitives has also been applied to the task of robot locomotion on rough terrain [peng2017deeploco, peng2016terrain]. In the DeepLoco paper [peng2017deeploco], low-level controllers achieve robust walking gaits that satisfy a stepping target. High-level controllers then invoke desired step targets for the low-level controller.
We apply our hierarchical learning method to the robot locomotion task of following a path in 2D. Our method does not need specification of timescales for the two levels nor a low-level reward signal. Our end-to-end hierarchical learning framework automatically discovers steering behaviors at the low-level which can transfer to a real quadruped robot.
III-A Hierarchical Policy Structure and Execution
Our hierarchical policy is structured as shown in Fig. 2. The high-level policy (HL) receives higher-level observations from the environment and issues commands in a latent space to a low-level policy (LL). The high-level also decides the duration for which the low-level is executed before the next high-level evaluation. The low-level receives low-level observations from on-board sensors and the current latent command from the high-level. It outputs actions to execute on the hardware. At the end of the duration set by the high-level, the high-level is invoked again and the process repeats (Fig. 3). Both high-level and low-level policies in this architecture are neural networks. Algorithm 1 shows how an episode is executed using a hierarchical policy in which the high-level and low-level have weights $\theta_h$ and $\theta_l$, respectively.
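The execution loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's Algorithm 1: `ToyEnv`, `run_episode`, and the two lambda policies are hypothetical stand-ins, and the duration is assumed to be an integer count of low-level steps.

```python
class ToyEnv:
    """Hypothetical 1-D stand-in for the simulator; the action is a forward step."""

    def reset(self):
        self.x = 0.0
        return (self.x,), (0.0,)  # (high-level obs, low-level obs)

    def step(self, action):
        self.x += action
        reward = action  # reward equals forward progress in this toy setting
        return (self.x,), (0.0,), reward, False


def run_episode(env, high_level, low_level, max_steps):
    """Roll out one episode; returns (total reward, number of high-level calls)."""
    hl_obs, ll_obs = env.reset()
    total_reward, steps_left, latent, hl_calls = 0.0, 0, None, 0
    for _ in range(max_steps):
        if steps_left == 0:
            # The high-level picks a latent command and how long to hold it.
            latent, duration = high_level(hl_obs)
            steps_left = max(1, int(duration))
            hl_calls += 1
        # The low-level runs at every step, conditioned on the held command.
        action = low_level(ll_obs, latent)
        hl_obs, ll_obs, reward, done = env.step(action)
        total_reward += reward
        steps_left -= 1
        if done:
            break
    return total_reward, hl_calls


total, calls = run_episode(
    ToyEnv(),
    high_level=lambda obs: (1.0, 5),     # constant command, held for 5 steps
    low_level=lambda obs, z: 0.1 * z,    # action proportional to the command
    max_steps=20,
)
```

In this toy run the high-level is queried only once every five low-level steps, which is the timescale separation the architecture relies on.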
III-B Learning Parameters of a Hierarchical Policy
To jointly learn the parameters of the high-level and low-level neural networks, we optimize a standard reinforcement learning objective. Consider a state space $\mathcal{S}$ and an action space $\mathcal{A}$. A sequential decision-making or control problem can be modeled as a Markov Decision Process (MDP). An MDP is defined by a transition function $P$ and a reward function $r$. A policy $\pi_\theta$, parameterized by a weight vector $\theta$, maps states $s \in \mathcal{S}$ to actions $a \in \mathcal{A}$. For a hierarchical policy, $\theta$ is the collection of parameters from all levels ($\theta = [\theta_h, \theta_l]$), and the subsets of state variables observable by the high-level and low-level are denoted $s_h$ and $s_l$, respectively. The policy interacts with the MDP for an episode of $T$ timesteps at a time. The reinforcement learning objective is to maximize the expected total reward at the end of the episode:
$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right], \qquad a_t = \pi_\theta(s_t).$$
We use a simple derivative-free optimization algorithm called Augmented Random Search (ARS) [mania2018simple] to maximize this objective. The algorithm proceeds by choosing a number of directions uniformly at random on a sphere in policy parameter space, then evaluating the policy along these directions, and finally updating the parameters along the top-performing directions.
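To make the update concrete, the following is a minimal sketch of one basic random-search step in the spirit of ARS. It is not the authors' implementation: it omits observation normalization, draws Gaussian directions rather than normalizing them onto a sphere, and uses a hypothetical quadratic `toy_objective` in place of episode returns. The default hyper-parameter values are illustrative assumptions.

```python
import random
import statistics


def ars_step(theta, evaluate, rng, n_dirs=8, top_b=4, step_size=0.02, noise=0.03):
    """One basic random-search update on the parameter vector theta."""
    rollouts = []
    for _ in range(n_dirs):
        delta = [rng.gauss(0.0, 1.0) for _ in theta]
        r_plus = evaluate([t + noise * d for t, d in zip(theta, delta)])
        r_minus = evaluate([t - noise * d for t, d in zip(theta, delta)])
        rollouts.append((r_plus, r_minus, delta))
    # Keep the directions whose better perturbation scored highest.
    rollouts.sort(key=lambda r: max(r[0], r[1]), reverse=True)
    best = rollouts[:top_b]
    # Scale the step by the spread of the rewards that are actually used.
    used_rewards = [r for r_plus, r_minus, _ in best for r in (r_plus, r_minus)]
    sigma = statistics.pstdev(used_rewards) or 1.0
    scale = step_size / (len(best) * sigma)
    new_theta = list(theta)
    for r_plus, r_minus, delta in best:
        for i, d in enumerate(delta):
            new_theta[i] += scale * (r_plus - r_minus) * d
    return new_theta


def toy_objective(theta, target=(1.0, -2.0, 0.5)):
    """Hypothetical stand-in for an episode return; maximal at `target`."""
    return -sum((t - c) ** 2 for t, c in zip(theta, target))


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
initial = toy_objective(theta)
for _ in range(300):
    theta = ars_step(theta, toy_objective, rng)
final = toy_objective(theta)
```

Because only function evaluations are needed, the same update applies unchanged when `theta` is the concatenation of high-level and low-level weights.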
III-C Transferring Low-Level Policies
An interesting aspect of our hierarchical method is that after learning a policy on one task, the low-level policy can be transferred to a new task from a similar domain. This allows sharing of primitive skills across related problems and is faster than learning from scratch on each task. The low-level policy can be transferred by keeping the low-level parameters fixed after learning on the original task and re-initializing the high-level parameters. Then, during training, only the high-level parameters are updated by ARS.
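One simple way to realize this transfer is to wrap the full objective so that the optimizer only ever sees the high-level weights. The sketch below assumes this wrapping approach; `make_transfer_objective` and `toy_return` are hypothetical names, and the numeric values are placeholders.

```python
def make_transfer_objective(evaluate_full, frozen_low_level):
    """Return an objective over high-level weights only; low-level weights stay fixed."""
    frozen = tuple(frozen_low_level)  # immutable copy, cannot be updated by accident

    def objective(theta_high):
        return evaluate_full(list(theta_high), list(frozen))

    return objective


def toy_return(theta_high, theta_low):
    """Hypothetical stand-in for an episode return on the new task."""
    return -(theta_high[0] - 2.0) ** 2 + sum(theta_low)


theta_low_trained = [0.5, 0.25]   # pretend these were learned on the original task
theta_high = [0.0]                # re-initialized for the new task
objective = make_transfer_objective(toy_return, theta_low_trained)
score_before = objective(theta_high)
score_at_optimum = objective([2.0])
```

Any derivative-free optimizer can then be run on `objective` alone, so the search space shrinks to the high-level parameters.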
IV-A Task Details
We apply our method to a path-following task for a quadruped robot. For this, we use the Minitaur quadruped robot from Ghost Robotics (ghostrobotics.io). The Minitaur robot has 8 degrees of freedom (2 per leg). The swing and extension of each of the legs are controlled using a PD position controller provided with the robot. We train our policies in simulation using PyBullet [pybulletcoumans, tan2018sim].
For the locomotion task, we tackle the problem of following a curved path in 2D while staying within the allowed region. The robot is rewarded for moving towards the end of the path. The task requires the robot to steer left and right at different angles. The optimal trajectory for the center of mass of the robot is not defined and depends on the robot’s anatomy and learned low-level behaviors. Steering poses additional challenges because the legs of the robot can only move in the sagittal plane. The reward function is defined in terms of the Euclidean distance between the position of the robot and the final position of the path. We terminate an episode as soon as the robot moves out of the path.
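As an illustration of a distance-based reward of this kind, the sketch below rewards the per-step decrease in Euclidean distance to the end of the path. This specific form is an assumption made for the example, not the formula used in the paper, and `progress_reward` is a hypothetical name.

```python
import math


def progress_reward(prev_pos, pos, goal):
    """Assumed form: how much closer the robot got to the goal during one step."""
    return math.dist(prev_pos, goal) - math.dist(pos, goal)


goal = (3.0, 0.0)
toward = progress_reward((0.0, 0.0), (1.0, 0.0), goal)   # moved toward the goal
away = progress_reward((1.0, 0.0), (0.0, 0.0), goal)     # moved away from the goal
```

Under this assumed form, the rewards of an episode sum to the total reduction in distance to the goal, so moving away from the end of the path is penalized.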
To learn locomotion, we use the recent Policies Modulating Trajectory Generators (PMTG) architecture, which has shown success at learning forward locomotion on quadruped robots [iscen2018policies]. The PMTG architecture takes advantage of the cyclic characteristic of locomotion and of leg movement primitives by using trajectory generators. Trajectory generators serve as parameterized functions that provide circular leg positions. The policy is responsible for modulating the generator and adjusting leg trajectories with a residual as needed. A more detailed explanation of the architecture can be found in the paper [iscen2018policies]. Our hierarchical policy is responsible for controlling the PMTG architecture, which issues motor position commands.
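The interplay between a trajectory generator and a policy residual can be sketched as follows. This is a deliberately simplified, hypothetical generator (a single phase variable driving sinusoidal swing and extension), not the generator from the PMTG paper; the parameter names `frequency` and `amplitude` are assumptions.

```python
import math


class SimpleTrajectoryGenerator:
    """Hypothetical periodic generator for one leg (swing, extension)."""

    def __init__(self, phase=0.0):
        self.phase = phase

    def step(self, frequency, amplitude, dt):
        # The policy modulates frequency and amplitude; the phase is internal state.
        self.phase = (self.phase + 2.0 * math.pi * frequency * dt) % (2.0 * math.pi)
        swing = amplitude * math.sin(self.phase)
        extension = amplitude * math.cos(self.phase)
        return swing, extension


def combine(tg_output, residual):
    """Final leg command: generator output plus the policy's residual correction."""
    return tuple(t + r for t, r in zip(tg_output, residual))


tg = SimpleTrajectoryGenerator()
tg_out = tg.step(frequency=1.0, amplitude=0.5, dt=0.25)   # a quarter of a cycle
command = combine(tg_out, residual=(0.1, -0.1))
```

Because the phase persists between calls, the generator carries memory even when the policy that drives it is a stateless linear map.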
IV-B Hierarchical Architecture
As demonstrated in previous work [iscen2018policies], a well-trained linear neural network policy in combination with PMTG can produce locomotion. Therefore, we use linear neural networks for the high-level and the low-level policies. However, we clip the latent command space to a bounded range, which allows us to study the latent space more easily. The number of dimensions of the latent command is a hyper-parameter. Note that while the policy networks are linear, PMTG introduces recurrence and non-linearities [iscen2018policies].
We separate the state information into two parts. We feed only the robot’s position and orientation (yaw direction) into the high-level policy (4-dimensional). The high-level policy outputs the latent command and a duration.
The low-level policy network observes the 8-dimensional PMTG state (we use 4 trajectory generators, one per leg), 4-dimensional IMU sensor data (roll, pitch, roll rate, pitch rate), and the latent command from the high-level policy. The outputs of the low-level network are motor positions and PMTG parameters.
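The split of observations between the two linear policies can be sketched as below. The input sizes (4 for the high-level, 8 + 4 + latent for the low-level) follow the text; the latent dimension, the action dimension, the clipping range of [-1, 1], and all weight values are hypothetical choices made only for this example.

```python
def linear(weights, x):
    """Matrix-vector product for a bias-free linear policy."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def clip(values, lo=-1.0, hi=1.0):
    """Clip each value to [lo, hi]; the range itself is an assumption."""
    return [min(max(v, lo), hi) for v in values]


def high_level_policy(w_high, hl_obs):
    out = clip(linear(w_high, hl_obs))
    return out[:-1], out[-1]  # (latent command, clipped raw duration)


def low_level_policy(w_low, pmtg_state, imu, latent):
    # The low-level never sees position or yaw, only proprioception and the command.
    return linear(w_low, list(pmtg_state) + list(imu) + list(latent))


LATENT_DIM = 2    # hypothetical
ACTION_DIM = 11   # hypothetical output size
w_high = [[0.5] * 4 for _ in range(LATENT_DIM + 1)]
w_low = [[0.1] * (8 + 4 + LATENT_DIM) for _ in range(ACTION_DIM)]

latent, raw_duration = high_level_policy(w_high, [1.0, 2.0, 0.0, 1.0])
action = low_level_policy(w_low, [0.0] * 8, [0.0] * 4, latent)
```

Keeping global position out of the low-level input is what makes its behaviors independent of where the robot is on a particular path.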
We update the low-level’s output at a fixed timestep. The high-level is executed after a number of low-level steps equal to the duration calculated during the previous high-level cycle. In practice, the duration is rescaled from its clipped value to a range of low-level step counts, so the time between high-level evaluations varies between a minimum and a maximum multiple of the low-level timestep. Because the high-level runs much less often than the low-level, estimating the position and direction of the robot is greatly simplified.
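The rescaling of the duration output can be sketched as a linear map from the clipped value to a number of low-level steps. All numeric values here (the clipping range, the step bounds, and the timestep in the usage lines) are hypothetical placeholders, since they are not recoverable from the text.

```python
def duration_to_steps(raw, min_steps=10, max_steps=100, lo=-1.0, hi=1.0):
    """Map a clipped duration output to an integer count of low-level steps."""
    clipped = min(max(raw, lo), hi)
    fraction = (clipped - lo) / (hi - lo)          # 0.0 at lo, 1.0 at hi
    return int(round(min_steps + fraction * (max_steps - min_steps)))


LOW_LEVEL_DT = 0.005  # hypothetical low-level timestep in seconds
shortest = duration_to_steps(-1.0) * LOW_LEVEL_DT
longest = duration_to_steps(1.0) * LOW_LEVEL_DT
```

With these placeholder values the high-level would be re-evaluated somewhere between `shortest` and `longest` seconds after its previous call.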
IV-C Transfer of Low-Level Policies to New Tasks
We show that our architecture can adapt to the different paths shown in Figure 4. We first train the architecture on the first path (left side of Figure 4). The low-level policy only has access to proprioceptive sensor data, which forces it to learn generic steering primitives that can be reused across different paths. We test this property of our hierarchical architecture by reusing the trained low-level policy from the first path when training on the second path.
For comparison, we train flat policies on these tasks. The input to the flat policies is the same as the high-level’s observations concatenated with the low-level’s in the hierarchical setup (except, trivially, for the latent commands) and the output is the same as the low-level actions. The flat policy also uses the same PMTG architecture for a fair comparison.
Second, we implement an expert hierarchical policy for additional comparison. We pre-train the low-level policy for this baseline using a carefully designed and tuned reward function to follow a target steering angle. The high-level policy computes the running duration for the pre-trained low-level policy and also outputs a steering angle (a scalar ranging from far left to far right) instead of the latent command. The input for the expert policy’s high-level and low-level is exactly the same as in the HRL case.
As in the HRL case, the baseline policies are trained by directly optimizing the RL objective using Augmented Random Search (ARS) [mania2018simple]. We perform evaluation across different search directions in parallel. We train each method with a set of hyper-parameters (number of directions to search in ARS, number of top directions for updating parameters, and number of latent command dimensions in the case of our hierarchical method). Finally, we pick the best hyper-parameters for each method and compare the average performance over 5 random training runs with those hyper-parameter settings.
In Fig. 5 we show learning curves for three policies: a flat policy, a hierarchical policy with an expert-designed, pre-trained low-level, and a hierarchical policy with a latent command space (our method). The policies are trained on different paths. All three methods succeed in solving the task of following the first path (Fig. 4(a)). For the second path, our method is able to solve the task significantly faster than the other policies (Fig. 4(b)). On the second path, the flat policy has to learn the parameters from scratch. The expert policy’s high-level learns to use the same low-level policy used on the first path. This low-level policy was pre-trained (see Appendix). Therefore, the expert policy needs extra training time to learn both levels separately. On the other hand, both levels of our latent-command-based hierarchical policy are trained from scratch on the first path. The best-performing policy uses a latent space whose dimensionality was selected as a hyper-parameter. We can see that this policy can still reuse the same low-level and latent commands to adapt quickly to a new task.
Fig. 4 shows how the robot trained with a hierarchical policy behaves in simulation. It successfully follows the path using steering behaviors. Complete trajectories can be seen in Fig. 3(b). Markers along the trajectory show points at which the high-level becomes active and computes the next latent command and duration. The low-level policy was only trained on the first path and is reused for the second path.
Fig. 5: Learning curves of a flat policy, a hierarchical policy with latent commands, and an expert hierarchical policy. We plot the average of 5 statistical runs, with the shaded area representing the standard error.
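For reference, the mean and standard error across runs can be computed per training iteration as sketched below. The function name and the toy curves are hypothetical; the standard error is taken as the sample standard deviation divided by the square root of the number of runs.

```python
import math
import statistics


def mean_and_stderr(runs):
    """runs: equally long learning curves, one list of returns per random seed."""
    n = len(runs)
    means, errors = [], []
    for values in zip(*runs):  # values at the same iteration across all runs
        means.append(statistics.fmean(values))
        errors.append(statistics.stdev(values) / math.sqrt(n))
    return means, errors


toy_runs = [[1.0, 2.0], [3.0, 4.0]]   # two hypothetical runs, two iterations each
means, errors = mean_and_stderr(toy_runs)
```

The shaded band in such a plot would then span `means[i] - errors[i]` to `means[i] + errors[i]` at each iteration.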
To simplify the analysis, we study a low-dimensional latent command space learned by our method in Fig. 7. We evaluated the low-level for different points in the latent space. In Fig. 6(a) we show the movement direction of the robot when giving different points in latent space as commands to the low-level and executing the low-level for a fixed number of steps. The length of each arrow is proportional to the distance covered. Corresponding color-coded robot trajectories are shown in Fig. 6(b). We can observe that for the path-following task, robot steering behaviors of varying velocities emerge automatically as low-level behaviors. The high-level uses these steering behaviors to navigate different parts of the path, as shown in Fig. 6(b). Moreover, the high-level also decides a variable duration for each latent command (see Fig. 6(b)). We can observe that for straighter parts of the path, the high-level selects a longer duration to go forward, while for curved parts, it switches latent commands more frequently.
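The probing procedure can be sketched as a sweep over a grid of latent commands, recording the heading and distance of the resulting displacement. The two-dimensional grid, the step count, and `toy_rollout` (which replaces running the trained low-level in simulation) are all hypothetical.

```python
import math


def probe_latent_space(rollout, grid_points, n_steps):
    """Return (latent, heading in radians, distance covered) for each command."""
    results = []
    for latent in grid_points:
        dx, dy = rollout(latent, n_steps)
        results.append((latent, math.atan2(dy, dx), math.hypot(dx, dy)))
    return results


def toy_rollout(latent, n_steps):
    """Hypothetical stand-in for executing the low-level policy for n_steps."""
    speed, turn = latent
    return n_steps * 0.01 * (1.0 + speed), n_steps * 0.01 * turn


grid = [(a, b) for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
results = probe_latent_space(toy_rollout, grid, n_steps=100)
```

Each result corresponds to one arrow in a plot of this kind: the heading gives its direction and the distance gives its length.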
IV-E Hardware Validation
Finally, we validate our results by transferring an HRL policy to a real robot and recording the resulting trajectories. We use a motion capture system (PhaseSpace Impulse X2E) to estimate the robot’s current position and heading, which is then fed into the high-level policy. Since our architecture allows execution of the different levels at different frequencies, it is sufficient to transmit motion capture data to the high-level policy at a much lower rate compared to low-level sensor data such as IMU readings.
To work around the limited motion capture area, we recorded shorter robot trajectories starting at the origin. We then virtually moved the robot down the path by adding an offset to the motion capture’s position estimate and recorded another set of trajectories. Note the significant variance in the real trajectories at the start of the path due to slippage of the legs during dynamic turning gaits.
We presented a hierarchical control approach particularly suited for legged robots. By separating the architecture into two parts, a high-level and a low-level policy network, and jointly training them, we obtained a number of advantages over previous algorithms.
First, the architecture is agnostic to the task: we do not need to manually pick or pretrain the behaviors (primitives) of the low-level policy. As a consequence we also remove the need to design individual reward functions for each behavior. In fact, our algorithm outperforms a similar setup in which the low-level behaviors are predefined.
Secondly, our method can be used to bootstrap when training on a new task by transferring the trained low-level policy.
Finally, the high-level and low-level policies operate at different timescales and can use different state representations. This is of particular practical importance, since for safety and stability reasons the low-level policy must compute motor commands within milliseconds. High-level signals such as rewards or position estimates are often updated at much lower frequencies and might have to be transmitted via a wireless connection. Our approach provides a natural way to decouple these timescales.
The task at hand allowed us to study the results in detail in both simulation and hardware to validate our approach and implementation. We show that given the path following task, the steering behaviors automatically emerge in a latent space, and the robot can easily adapt to a new path with low-level transfer. We also deployed these policies to hardware to validate the learned hierarchical policy.
In future work, we plan to apply this algorithm to tasks requiring a high level of agility in more complex environments. As an example, if the robot has to jump over an obstacle or climb stairs, manually defining a set of low-level behaviors will become even more cumbersome. We believe that the latent command space will allow us to tackle these challenges through automatic discovery of the complex primitives required to solve the task. In addition, we are planning to incorporate more complex sensors such as cameras, which naturally operate at different timescales and require significant computational power. In this case our approach would allow for distributed processing without compromising performance.
As part of the baselines, a low-level expert steering policy is trained separately. This policy is controlled by a scalar steering input from the high-level, which determines the target direction. We train the policy using the ARS algorithm by rewarding the magnitude of the average steering angle over a window of past timesteps. This reward is capped by the steering input. A second, weighted component is then added to the reward for moving forward, which is capped by a fixed value.
We would like to thank Jie Tan, Tingnan Zhang, Erwin Coumans, Sehoon Ha (Robotics at Google), Honglak Lee, Ofir Nachum (Google Brain), and Arun Ahuja (DeepMind) for insightful discussions.