Reinforcement Learning (RL) can help robots generalize to unseen scenarios, and achieve novel tasks. In locomotion, there has been recent success in using RL to learn to walk in simulation [peng2017deeploco, frans2017meta, chua2018deep], but examples of RL on locomotion hardware are rare. This is due to multiple reasons, such as sample inefficiency of RL methods, lack of robust locomotion platforms, challenging dynamics, and high-dimensional robots. However, locomotion skills are important for autonomous agents to accomplish tasks outside of their workspace, such as clean a room, pick a far-away object. For navigating uneven terrains, stairs, etc. legged platforms become important.
In this work, we address two of the main challenges facing learning for legged locomotion research - sample efficiency and generalization. Typical examples of RL on locomotion platforms involve learning conservative policies in simulation and deploying them on hardware [tan2018sim, li2019using, hwangbo2019learning]. However, the learned policy might not be efficient on hardware, and might frequently fail [li2019using]. This motivates learning directly on hardware. [haarnoja2018learning] were successful in learning to walk on a Minitaur robot, but training on a higher degree of freedom robot can be very expensive, and most locomotion platforms cannot withstand such extended use. Moreover, [haarnoja2018learning] do not generalize to targets other than walking forward.
In fact, most works on RL for locomotion try to learn to walk forward, but, any realistic task for locomotion would involve reaching a particular goal in space, in the shortest amount of time or with minimum energy. There is a surprising lack of learning literature that address the problem of reaching arbitrary goals, while there are multiple optimal control papers that address this [feng2016online, mason2018mpc]. This is because generalizing to unseen, arbitrary goals often requires a dynamics model. However, many optimal control algorithms are also sensitive to modeling inaccuracies in dynamics and their performance can suffer with poor models [feng2016online].
Model-free learning methods like [andrychowicz2017hindsight] can generalize to goals in the space explored during learning, but do not generalize well to arbitrary goals. Model-based RL holds the promise of generalizing to new goals, but it is largely validated in simulation [Yang2018Learning]. [bechtle2019curious] point out that learning dynamics models in the model-based RL loop is challenging and might need specialized exploration. As a result, there is little to no evidence of learning to reach arbitrary goals in locomotion literature on hardware.
In this work, we improve sample-efficiency of RL on locomotion by using a cyclic parametrization of walking policies, similar to [crespi2008controlling, owaki2017quadruped, Yang2018Learning]. We learn the parameters of these policies using a model-free RL algorithm, Soft Actor Critic [haarnoja2018soft], from scratch on a 18 degree of freedom hexapod robot. This cyclic structure is capable of achieving many different locomotion behaviors and gaits without expert intervention, as demonstrated in our experiments.
Further, we improve generalization to multiple goals by proposing an efficient hierarchical structure. We divide the problem of goal-oriented locomotion into two sub-problems: first we learn temporally extended action primitives that can achieve simple goals such as turning and walking straight, using model-free RL. Next, we build ‘coarse’ dynamics models of these primitives and use them for planning using model predictive control. Coarse dynamics models are fit over transitions over one cycle of primitive actions. This allows us to build dynamics models with very small amount of hardware data, and plan efficiently in primitive space. An overview of our algorithm is shown in Figure 2.
Our main contribution is a hierarchical framework that combines model-free learning with model-based planning to improves generalization of locomotion skills to new goals. Our approach is easy to train, and robust to hardware noise. We demonstrate our results on a Daisy hexapod (Figure 1) over multiple targets up to away, starting with training on very short episodes. To the best of our knowledge, this is the first demonstration of such a hybrid model-free learning with model-based planning framework on a locomotion robot hardware. Our results show that such a combination of the two approaches can greatly improve the sample-efficiency and generalization abilities of RL methods for locomotion.
Ii Background and Related Work
Here, we present a brief overview of model-based and model-free optimization methods from literature and previous works that are closely related to our work.
Ii-a Model-based and model-free optimization
We consider a Markov Decision Process with actions, state and dynamics governed by transition function .Starting from an initial state , and sampling an action at state according to policy , the agent gets a reward and transition to the next state , generating a trajectory .
In planning and control, the objective is to maximize the cumulative reward , where, denotes the trajectory distribution generated by policy . This can be done in a model-free manner [sutton2018reinforcement] or using information from the dynamics in a model-based way.
Ii-A1 Model-free reinforcement learning
Model-free RL optimizes a policy by directly maximizing the long-term reward, without reasoning about the dynamics. Model-free methods are typically less sample-efficient that model-based but achieve better asymptotic performance. Our model-free learning algorithm of choice is Soft Actor Critic (SAC) [haarnoja2018soft], which is a maximum entropy RL algorithm that maximizes both the long-term reward and the entropy of the policy. For a finite horizon MDP, the SAC objective function is: where, is the temperature that trades off between reward and entropy.
SAC is an off-policy algorithm that allows using past data to update current policy, improving sample-efficiency. It has been demonstrated to work on locomotion hardware from scratch in [haarnoja2018learning], and hence we decided to use it.
Ii-A2 Model Predictive Control (MPC)
An alternative to model-free RL is to utilize the dynamics (if known) to maximize the long term reward of a trajectory. One such popular approach is MPC, also known as receding horizon control. MPC solves for an action sequence that maximizes the long-term reward subject to the dynamics at each instant[mayne2000constrained]. The first action is applied on the system, and the process repeats.
MPC has been widely used for control of dynamical systems, [di2018dynamic, park2015online, herdt2010online]. [mason2018mpc, koenemann2015whole] use MPC for controlling a humanoid robot’s center of mass dynamics. However, these works assume a known dynamics model and are sensitive to dynamics modeling errors. As a result, they are hard to generalize to new tasks or robots.
Ii-A3 Hierarchical RL (HRL) with primitives
Using a hierarchical structure that decomposes complex task control into easier sub-tasks control can speed up learning [sutton1999between]. Previous work studied learning the different levels of the hierarchy together [daniel2016probabilistic, bacon2016optioncritic, stulp2011hierarchical]. An alternative is to divide the task into learning primitives, followed by planning in primitive space, while fine-tuning primitives [stulp2011hierarchical, daniel2013learning, 2017-TOG-deepLoco, frans2017meta]. However, most HRL literature is model-free and hence sample inefficient. For example, [peng2017deeploco] needs over a million samples to learn high-level control.
We combine model-based planning and model-free learning, by using model-free RL for learning action primitives, and sequencing them using model-based planning. By incorporating dynamics models in HRL, we can improve sample-efficiency as well as generalization.
Ii-B Learning for locomotion
Using Deep RL in locomotion system has been wildly studied in simulation. [peng2017deeploco, MCPPeng19] used hierarchical RL to achieve challenging locomotion tasks in simulation such as moving a soccer ball and carrying an object to a goal. [heess2017emergence] used deep RL to train locomotion systems in different training environments, and found new emergent gaits. [2016-TOG-deepRL] show that robust locomotion can even be achieved from high-dimensional inputs, such as images. However, since these methods take millions of samples, they are not usable on hardware without modifications.
[Yang2018Learning, antonova2019bayesian, andre2015adapting, cully2015robots] use a cyclic controller structure similar to ours, and use model-free policy search approaches to learn locomotion skills. These methods effectively transfer information between different tasks, or from simulation and hardware. However, they are limited to a relatively low-dimensional parametric controller, and can be hard to generalize to new robots. On the other hand, [tan2018sim, li2019using]
used Deep RL to train unstructured neural network policies in simulation and transfer them to hardware successfully. However, policies that perform well in simulation do not perform well on hardware due to differences between simulation and the real world, such as contact models.
Instead of training in simulation and transferring policies to the real-world, [haarnoja2018learning, yang2019data] directly trained policies in the real world on a Minitaur robot. [haarnoja2018learning] used SAC with automatic entropy adjustment to train a Minitaur robot to walk forward with 0.1 million samples. [yang2019data] used model-based RL with trajectory generators to train Minitaur robot to walk in 45,000 samples. Minitaur has 8 motors that control its longitudinal motion, and no control for lateral movements. In comparison, our hexapod (Daisy) has omni-directional movements and 18 motors. This makes the problem of controlling Daisy especially challenging, and would require significantly longer training. Moreover, previous work only learns to walk forward, and needs additional training to achieve new goals. Our approach can learn to control Daisy and achieve arbitrary goals, using only 35,000 samples on hardware. Such reduction in hardware samples is important as most locomotion robots get damaged from wear and tear when operated for long. For example, in the course of our experiments, we had to replace two motors.
Iii Learning Generalizable Locomotion Skills
We now describe our proposed approach in detail. Figure 2 shows an overview of the hierarchical control structure proposed in this work. In a nutshell, our approach builds a library of primitives that encode low-level controllers for micro-actions turn left, turn right, move forward, stand still. These primitives are learned via model-free reinforcement learning. On a higher level, our approach depends on a model that predicts the dynamics of applying a cycle of the primitive. A model-predictive planner utilizes this model to optimize for the next optimal action sequence to achieve a goal. In the following we start by introducing notation and our experimental platform, we then propose two representations for the action primitives and how to learn them and finally describe the high-level planner.
Iii-a Daisy - Hexapod
Our test platform in this paper is the Daisy Hexapod (Figure 1). Daisy is a six-legged robot with three motors on each leg - base, shoulder, and elbow. Practically, the robot is omni-directional, and the center of mass can follow any given trajectory, but the mass of the motors limits the leg velocity. A Vive tracking system is used to measure robot’s position in the global frame.
The robot has 18 motors that we control by sending desired motor velocities as actions . The state used in the high-level planner is the center of mass position and orientation. The low-level policies output 18 desired joint angles , which are then converted into desired motor velocities in a feedback loop: . Here are the current joint angles; is a feedforward velocity. and are hand-tuned feedback gains.
Iii-B Action Primitive Representations
We take inspiration from biological gaits in locomotion and use two cyclic parametrizations for our action primitives . Our primitives take as input a phase variable and predicts the next desired joint configuration as an action, . At the beginning of every cycle, the phase variable is initialized to , and then grows linearly to , with the length of the cycle designed by the expert. This also allows us to change the speed of our primitive, for example when training we use a slower primitive for the safety of our robot, but when testing, we increase the frequency for better performance.
This idea of periodic gaits was also used in [fukuoka2013analysis, kimura2006biologically, owaki2017quadruped], but these works designed the primitives manually. Instead, here we consider parametric policies, and learn the parameters using a modified SAC, described in Section III-C. We consider 2 types of parametrizations for our primitives:
Iii-B1 Neural Network Policy
Iii-B2 Sinusoidal Policy
A structured parametric controller, which consists of sine waves in each joint: . Each motor has an independent phase , offset and amplitude for the primitive, leading to a total 54 dimensional controller . The parameters of this controller are also learned using modified SAC.
Iii-C Soft Actor-Critic with KL Constraint
While maximum entropy in SAC makes the learning on hardware robust and sample-efficient, sometimes it leads to aggressive policy updates that might harm our robot. Hence, we add an additional practical constraint. We introduce a KL divergence constraint from the previous policy, similar to Trust region policy optimization (TRPO) [schulman2015trust]. Now the objective function for updating policy is expressed as: . This cost encourages entropy-based exploration, while keeping the updated policy close to the last policy, leading to more conservative policy updates that still explore.
Iii-D High-Level Control: Model-based Control
We use a model-based high-level planner that plans the best primitive sequence for our horizon using MPC. The dynamics used in this planning are learned over the whole primitive cycle, rather than the instantaneous dynamics, i.e, is the next state after executing the primitive for time steps, starting from . This leads to a ‘coarse’ transition model learned over extended action sequences, rather than per time step transitions. Moreover, the planning is in a much reduced space of primitive actions instead of the whole action space.
Starting from the current center of mass position and orientation , our high level planner does an exhaustive search over the possible sequences of actions to find the globally optimal sequence for our horizon . Moreover, to further simplify the dynamics, we learn a delta dynamics model , which reasons about the change in the state after the execution of the primitive. This makes the dynamics learning much more efficient, and generalize to unseen states.
Iv Experimental Results
In the following we present evaluations on the Daisy hexapod both in simulation and hardware. Our experiments are structured as follows:
Learning of primitives: We train two primitive actions on Daisy : walk forward, and turn. During training, the total steps in a cycle is 100, and we sample 10 cycles for each iteration, hence 1000 samples per iteration.
High-level control: For experiments with the high-level control, we use MPC for planning in the space of trained primitives. We set targets far away from the training region of the primitives, and reach them using the hierarchy.
Iv-a Simulation Experiments
We simulate the Daisy robot in PyBullet [pybullet]. We start by describing our experimental setup for learning the low-level primitives in simulation.
Iv-A1 Learning Primitives
We decompose locomotion behaviors into four elementary motions: move forward, turn left, turn right, and stand-still. Since turning right can be achieved by mirroring the control of turning left, we do not need to train a new policy; for standing still the desired joint state is the current joint state.
We train the move forward and turn right primitives in sequence, starting with move forward. The parameters of the move forward policy are initialized randomly, and the training data is used to initialize training of turn right policy. In simulation both primitives are trained for 50 iterations using the algorithm described in Section III-C.
For training the move forward policy, we used the reward function
where the first term gives reward for moving forward and penalty for lateral and backward movements, the second term tries to minimize deviation in orientation, and the third penalizes for high joint velocities.
After training the move forward policy, we switch to training the turn right policy. We reuse the data collected in this the phase to initialize the parameters of the turn right policy. Since SAC is an off-policy method, we can just re-evaluate the reward of each transition on the turn right reward function and restart training. The reward function to train the turn policy was
This reward function penalizes the movement of the center of mass in any direction. For each primitive cycle, we assign a desired orientation for the robot. Lastly, we penalize high joint velocity for the safety of our robots. Intuitively, this reward functions encodes that the optimal turning behavior is to turn on the spot at a constant speed. The parameters for reward functions for training are shown in Table I.
Our simulation training results for the neural network and sinusoidal controller are shown in Figure 2(a), 2(b). In simulation, for the forward task, the neural network learns faster than the sinusoidal controller, and the reward is also higher than the sinusoidal controller. This is because the ground contact models in the simulation are very inaccurate, and with the neural network controller, the optimization quickly learns to exploit it by sliding. Since turning is a more controlled task with a target orientation, it is harder to exploit the simulation and both controllers learn at a comparable rate, with the sinusoidal controller having a more stable learning curve.
Iv-A2 High-level control
Once the primitive actions are trained, we can move to the high-level control. We start by training a dynamics model for each primitive by simply building a look-up table for . The look-up table is trained by sampling 50 cycles of randomly selected primitives and averaging the resultant displacement, as described in Algorithm 1.
Once the look up table has been created, we utilize the model within MPC to optimize the sequence of actions that minimizes the cost over a horizon of . We apply the first action from this optimized sequence on the robot, and replan. The reward for the high-level control is episodic, the final distance between the robot and goal at the end of the horizon .
In simulation, we compare the high-level control against PETS [chua2018deep], a state-of-the-art model-based RL baseline. We compare against two versions of PETS:
PETS : We train the full dynamics model of the robot while trying to achieve a goal, in the standard PETS loop. Then we do MPC with cross-entropy method (CEM) using the trained dynamics to achieve other goals, far away from the goal for which the dynamics was trained.
PETS with SAC data : We train the full dynamics model on data that was used for training the forward and turning controllers. This dynamics includes turning and walking data, but for a very small part of the robot’s space. The goals are set quite far away from the training set, and MPC+CEM is used to optimize the action.
We note that the dynamics trained for PETS comparison are on the full state of the robot (18 joint angles), and the action is an optimized sequence of 18 desired joint velocities. As compared to our hierarchical framework, this is a much higher dimensional optimization problem.
In simulation, we test two experimental settings:
Different goals: The goals are at , , , , starting from (Figure 2(c)). Both the neural network and sinusoidal controllers can achieve all targets using our approach. Baselines PETS and PETS trained on SAC data fail to achieve goals other than the one they were trained on.
Waypoint goals: The robot has to achieve targets in a square at sequentially, starting from (Figure 2(d)). Both the neural network and sinusoidal controllers can achieve all targets using our hierarchical control structure. This setting is similar to waypoint goals, where the robot sequentially moves between targets.
In both these experiments (Figure 2(c), 2(d)), the hierarchical control performs well, and the robot is able to reach the targets, despite slipping in simulation. In comparison, while PETS is able to reach the goal that the dynamics were learned on efficiently, it does not generalize to other goals in the environment. Since PETS with SAC data is only trained on very short episodes, it is also unable to achieve far away goals. Hence, the hierarchy helps improve generalization to new goals, when trained with the same amount of data as PETS, a model-based RL approach.
Iv-B Hardware Experiments
Simulation experiments allowed us to test the validity of our approach, but did not have an accurate contact model with the ground. The neural network controllers trained in simulation performed very poorly on hardware because of this mismatch, making it essential to train directly on hardware.
Iv-B1 Learning Primitives
For hardware experiments, we used the same formulation of reward as in simulation but with slightly different weights in rewards, as summarized in Table I. The parameters of the move forward policy are initialized randomly, and the training data is used to initialize training of turn right policy. We trained forward and turning policies on hardware, and their learning curves are shown in Figure 4. We used 20000 samples to train the forward controller which took approximately an hour and 15000 samples to train the turning controller which took about 45 minutes. Although in simulation the neural network trains faster than the sinusoidal controller, we were not successful in training a neural network policy from scratch on hardware, possibly due to noise in reward generation. Since the sinusoidal controller is restricted in space of controllers, in our experience, it was more robust to noise in reward signals, as compared to the neural network controller. The trained sinusoidal forward controller can walk straight and the turning controller can turn left with a small turning radius.
Iv-B2 High-Level Controller
On hardware, we add an orientation term to the high level reward, because the position sensing tends to drift over time, and the robot fails to reach the global goal without orientation guidance.
Here, the first term is the distance term, same as in simulation, and the second term measures the deviation of the center of mass orientation from the goal angle.
For our hardware experiments with high-level control, we start by building the dynamics models of the each primitive in the primitive library . Each dynamics is trained for 50 samples on hardware, leading to a total of 200 hardware samples.
For testing our algorithm on the Daisy robot, we designed a similar experimental setup as simulation, where Daisy was commanded to go to goals up to 12m away from its start point. While our method can generalize to arbitrarily far away locations, currently our hardware setup is limited by the sensing of Vive tracking system for global position of Daisy; our goals are limited to be in the region covered by the base stations. Despite this, sometimes the robot loses tracking during the experiments, and the high-level action is hard-coded to stand still until the tracking is recovered. We test two experimental settings on hardware:
Different goals: The robot has to move to goals starting from (Figure 3(c)). The sinusoidal controller can reliably achieve all targets despite slipping on the ground and noise.
Waypoint goals: The robot is sequentially asked to move to a series of goals. Similar to the simulation setup, the robot has to reach corners in a square sequentially, starting from (Figure 3(d)). Our approach easily generalizes to this setting. In the future, these hand-designed goals can be replaced by a separate controller that specifies waypoints for the robot to move to.
Our hardware experiments show that our proposed hierarchical controller can achieve far away goals using very small amount of training data on hardware. It generalizes to different scenarios, as well as different experimental settings like different flooring, sensing noise, etc. Though we could not train the neural network policy successfully on hardware, we achieved reliable success with the sinusoidal policy at reaching far away goals. However, the performance of the action primitives can be improved on hardware, for example the forward primitive moves at about 0.15m/s. An online updating scheme that fine-tunes the primitives and their dynamics models for a new setting can improve performance on new floors. Moreover, discovering new primitives, such as to go upstairs is a challenging problem. We assume that the size of the primitive library is predefined, but in the future, we would like to explore methods similar to [frans2017meta] to discover new primitives online, while maintaining the generalizability and sample-efficiency of our current approach.
In this work, we proposed a hierarchical control structure for controlling locomotion robots. We decomposed the problem of learning locomotion skills into two sub-problems – learning low-level locomotion skills, followed by sequencing them in a model-based way. Our experiments on the Daisy robot show that such a decomposition can lead to very sample-efficient learning of generalizable locomotion skills. Using our approach, Daisy can reach goals up to 12m away from its start location, and follow waypoints defined by a user. In the future, these waypoints can be generated by a separate controller that takes the environment state as input, for example with an image.
Our work is a step towards building generalizable locomotion skills that can reach arbitrary goals in unknown environments. However, there are many avenues for improvement over our current performance. The low-level primitives, when trained and tested on different environments can have very different performance. For example, they might slip on a slippery floor, or walk too conservatively. While the high-level control helps achieve targets despite these disturbances, performance can improved by updating the primitives locally for different environments. Additionally, there might be a need to discover new primitives for new settings. For example, if a leg breaks, or in the presence of stairs, the current library of primitives might not be enough to achieve a goal. In such cases, one could try to incrementally learn new primitives for achieving new targets, and store them in the library for future reusability. We leave this to future work.