Learning to Walk in the Real World with Minimal Human Effort

Reliable and stable locomotion has been one of the most fundamental challenges for legged robots. Deep reinforcement learning (deep RL) has emerged as a promising method for developing such control policies autonomously. In this paper, we develop a system for learning legged locomotion policies with deep RL in the real world with minimal human effort. The key difficulties for on-robot learning systems are automatic data collection and safety. We overcome these two challenges by developing a multi-task learning procedure, an automatic reset controller, and a safety-constrained RL framework. We tested our system on the task of learning to walk on three different terrains: flat ground, a soft mattress, and a doormat with crevices. Our system can automatically and efficiently learn locomotion skills on a Minitaur robot with little human intervention.





I Introduction

Reliable and stable locomotion has been one of the most fundamental challenges in the field of robotics. Traditional hand-engineered controllers often require expertise and manual effort to design. While this can be effective for a small range of environments, it is hard to scale to the large variety of situations that the robot may encounter in the real world. In contrast, deep reinforcement learning (deep RL) can learn control policies automatically, without any prior knowledge about the robot or the environment. In principle, each time the robot walks on a new terrain, the same learning process can be applied to acquire an optimal controller for that environment.

However, despite the recent successes of deep reinforcement learning, these algorithms are often exclusively evaluated in simulation. Building fast and accurate simulations to model the robot and the rich environments that the robot may operate in is extremely difficult. For this reason, we aim to develop a deep RL system that can learn to walk autonomously in the real world. There are many challenges in designing such a system. In addition to finding a stable and efficient deep RL algorithm, we need to address the challenges associated with the safety and the automation of the learning process. During training, the robot may fall and damage itself, or leave the training area, which requires labor-intensive human intervention. Because of this, prior work that studied learning locomotion in the real world has focused on statically stable robots [26, 39] or relied on tedious manual resets between roll-outs [28, 16].


Fig. 1: Our autonomous on-robot learning system allows us to learn locomotion policies with minimal human intervention on different terrains, such as flat ground (Top), a soft mattress (Middle), and a doormat with crevices (Bottom).

Minimizing human intervention is the key to a scalable reinforcement learning system. In this paper, we focus on solving two bottlenecks in this problem: automation and safety. During training, the robot needs to automatically and safely retry the locomotion task hundreds or thousands of times. This requires the robot to stay within the workspace bounds, to minimize the number of dangerous falls, and to automate the resets between episodes. We accomplish all of this via a multi-task learning procedure, a safety-constrained learner, and several carefully designed hardware and software components. By simultaneously learning to walk in different directions, the robot stays within the workspace. By automatically adjusting the balance between reward and safety, the robot falls dramatically less often. By building hardware infrastructure and designing a stand-up controller, the robot automatically resets its state between episodes, enabling continuous data collection.

Our main contribution is an autonomous real-world reinforcement learning system for robotic locomotion, which allows a quadrupedal robot to learn multiple locomotion skills on a variety of surfaces with minimal human intervention. We test our system on flat ground, a soft mattress, and a doormat with crevices (Figure 1). Our system can learn to walk on these terrains in just a few hours, with minimal human effort, and acquires distinct and specialized gaits for each one. In contrast to the prior work [28], in which approximately a hundred manual resets are required even in the simple case of walking on flat ground, our system requires zero manual resets in this case. We also show that our system can train four policies simultaneously (walking forward, walking backward, turning left, and turning right), which form a complete skill set for navigation and can be composed into an interactive directional walking controller at test time.

II Related Work

Control of legged robots is typically decomposed into a modular design with multiple layers, such as state estimation, foot-step planning, trajectory optimization, and model-predictive control. For instance, researchers have demonstrated agile locomotion with quadrupedal robots using a state machine [11], impulse scaling [42], and convex model predictive control [34]. The ANYmal robot [30] plans footsteps based on the inverted pendulum model [44], which is further modulated by a vision component [51]. Similarly, bipedal robots can be controlled by fast online trajectory optimization [7] or whole-body control [35]. This approach has been used for many locomotion tasks, from stable walking to highly dynamic running, but often requires considerable prior knowledge of the target robotic platform and task. Instead, we aim to develop an end-to-end on-robot training system that automatically learns locomotion skills from real-world experience, which requires no prior knowledge about the dynamics of the robot.

Recently, deep reinforcement learning has drawn attention as a general framework for acquiring control policies. It has been successful in finding effective policies for various robotic applications, including autonomous driving [5, 41], navigation [15], and manipulation [33, 36, 2]. Deep RL has also been used to learn locomotion control policies [29, 52, 43, 9, 56], mostly in simulated environments. Despite its effectiveness, one of the biggest challenges is to transfer the trained policies to the real world, which often incurs significant performance degradation due to the discrepancy between the simulated and real environments. Although researchers have developed principled approaches for mitigating the sim-to-real issue, including system identification [14, 31], domain randomization [45, 50, 47], and meta-learning [24, 53, 55], it remains an open problem.

Researchers have investigated applying RL to real robotic systems directly, which is intrinsically free from the sim-to-real gap [48, 40]. The approach of learning on real robots has achieved state-of-the-art performance on manipulation and grasping tasks, by collecting a large amount of interaction data on real robots [37, 33, 57]. However, applying the same method to underactuated legged robots is challenging. One major challenge is the need to reset the robot to the proper initial states after each episode of data collection, for hundreds or even thousands of roll-outs. Researchers tackle this issue by developing external resetting devices for lightweight robots, such as a simple 1-DoF system [26] or an articulated robotic arm [39]. Otherwise, the learning process requires a large number of manual resets between roll-outs [28, 16, 54], which limits the scalability of the learning system.

Another challenge is to guarantee the safety of the robot during the entire learning process. Safety in RL can be formulated as a constrained Markov Decision Process (cMDP) [4], which is often solved by the Lagrangian relaxation procedure [10]. Achiam et al. [1] discussed a theoretical foundation of cMDP problems that guarantees improvement of both rewards and safety constraints. Many researchers have proposed extensions of existing deep RL algorithms to address safety, such as learning an additional safety layer that projects raw actions to a safe feasibility set [3, 25, 20, 17], learning the boundary of the safe states with a classifier [38, 49, 23], expanding the identified safe region progressively [8], or training a reset policy alongside a task policy [22].

In this paper, we focus on developing an autonomous and safe learning system for legged robots, which can learn locomotion policies with minimal human intervention. Closest to our work is the prior paper by Haarnoja et al. [28], which also uses soft actor-critic (SAC) to train walking policies in the real world. In contrast to this work, our focus is on eliminating the need for human intervention, while the prior method requires a person to intervene hundreds of times during training. We further demonstrate that we can learn locomotion on challenging terrains, and simultaneously learn multiple policies, which can be composed into an interactive directional walking controller at test time.

III Overview

Fig. 2: An autonomous learning system for real legged robots must address two challenges: (Left) a robot leaving the workspace and (Right) a robot falling to the ground.


Fig. 3: Overview of our learning system. We address three main automation challenges (leaving the workspace, falling, and resetting) with multi-task learning, a safety-constrained SAC algorithm, and an automatic reset controller.

Our goal is to develop an automated system for learning locomotion gaits in the real world with minimal human intervention (Figure 3). Aside from incorporating a stable and efficient deep RL algorithm, our system needs to address the following challenges. First, the robot must remain within the training area (workspace), which is difficult if the system only learns a single policy that walks in one direction. We utilize a multi-task learning framework, which simultaneously learns multiple locomotion tasks for walking in different directions, such as walking forward, walking backward, turning left, and turning right. The multi-task learner selects the task to learn according to the relative position of the robot in the workspace. For example, if the robot is about to leave the workspace, the selected task-to-learn would be walking backward. Using this simple state-machine scheduler, the robot can remain within the workspace during the entire training process. Second, the system needs to minimize the number of falls, because falling can result in substantially longer training times due to the overhead of experiment reset after each fall. Even worse, the robot can be damaged by repeated falls. We augment the Soft Actor-Critic (SAC) formulation with a safety constraint that limits the roll and the pitch of the robot’s torso. Solving this cMDP greatly reduces the number of falls during training. Third, in cases where falling is inevitable, the robot needs to stand up and reset its pose. We designed a stand-up controller that allows the robot to stand up from a wide variety of fallen configurations. Combining all these components, our system effectively reduces the number of human interventions to zero in most of the training runs.

IV Preliminary: Reinforcement Learning

We formulate the task of learning to walk in the setting of reinforcement learning [46]. The problem is represented as a Markov Decision Process (MDP), which is defined by the state space $\mathcal{S}$, the action space $\mathcal{A}$, the stochastic transition function $p(\mathbf{s}_{t+1} | \mathbf{s}_t, \mathbf{a}_t)$, the reward function $r(\mathbf{s}_t, \mathbf{a}_t)$, and the distribution of initial states $p(\mathbf{s}_0)$. By executing a policy $\pi(\mathbf{a}_t | \mathbf{s}_t)$, we can generate a trajectory of states and actions $\tau = (\mathbf{s}_0, \mathbf{a}_0, \mathbf{s}_1, \mathbf{a}_1, \ldots)$. We denote the trajectory distribution induced by $\pi$ by $\rho_\pi(\tau)$. Our goal is to find the optimal policy that maximizes the sum of expected returns:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \rho_\pi(\tau)} \Big[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \Big]. \tag{1}$$

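As a minimal illustration of this objective, the return of a single roll-out can be computed as follows (the optional discount factor is a common practical variant, not part of the finite-horizon sum above):

```python
def trajectory_return(rewards, gamma=1.0):
    """Sum of (optionally discounted) rewards along one roll-out.
    With gamma = 1.0 this is the plain finite-horizon return whose
    expectation over trajectories the policy seeks to maximize."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```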
V Automated Learning in the Real World

V-A Multi-Task Learning


Fig. 4: Task scheduling scheme in multi-task learning. To drive the robot toward the center of the workspace, we divide the space into two or four robot-centric regions and select the task whose corresponding subdivision contains the center of the workspace.
Task | Task weight w
Walking forward | (1, 0, 0)
Walking backward | (−1, 0, 0)
Turning left | (0, 0, 1)
Turning right | (0, 0, −1)
TABLE I: Tasks and their weights.

An important cause of human intervention is the need to move the robot back to its initial position after each episode; otherwise, the robot would quickly leave the limited-size workspace within a few roll-outs. Carrying a heavy legged robot back and forth hundreds of times is labor-intensive. We develop a multi-task learning method with a simple state-machine scheduler that generates an interleaved schedule of multi-directional locomotion tasks, in which the robot automatically learns to walk towards the center of the workspace.

In our formulation, a task is defined by the desired direction of walking with respect to the robot’s initial position and orientation at the beginning of each roll-out. More specifically, the task reward $r$ is parameterized by a three-dimensional task vector $\mathbf{w}$:

$$r(\mathbf{s}_t) = \mathbf{w} \cdot \big[ \mathbf{R}_0^{\top} (\mathbf{x}_t - \mathbf{x}_{t-1}), \; \theta_t - \theta_{t-1} \big] - c \, |\ddot{\mathbf{a}}_t|^2, \tag{2}$$

where $\mathbf{R}_0$ is the rotation matrix of the base at the beginning of the episode, $\mathbf{x}_t$ and $\theta_t$ are the position and yaw angle of the base in the horizontal plane at time $t$, and the last term, weighted by a small constant $c$, measures the smoothness of actions, which is the desired motor acceleration in our case. The task vector defines the desired direction of walking. For example, walking forward is $\mathbf{w} = (1, 0, 0)$ and turning left is $\mathbf{w} = (0, 0, 1)$. Note that the tasks are locally defined and invariant to the selection of the starting position and orientation of each episode.
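As a concrete sketch of this task reward (the function name, argument layout, and the smoothness weight `c` are illustrative assumptions, not the paper's code):

```python
import math

def task_reward(w, pos, prev_pos, yaw, prev_yaw, yaw0, accel, c=0.001):
    """Task reward r = w . [R0^T (x_t - x_{t-1}); theta_t - theta_{t-1}]
    minus a smoothness penalty on the desired motor accelerations.
    Rotating the displacement by -yaw0 (the base yaw at the start of the
    episode) makes the reward invariant to the episode's starting pose."""
    dx = pos[0] - prev_pos[0]
    dy = pos[1] - prev_pos[1]
    # Apply R0^T to the world-frame displacement (2D rotation by -yaw0).
    local_dx = math.cos(yaw0) * dx + math.sin(yaw0) * dy
    local_dy = -math.sin(yaw0) * dx + math.cos(yaw0) * dy
    dtheta = yaw - prev_yaw
    smoothness = sum(a * a for a in accel)
    return w[0] * local_dx + w[1] * local_dy + w[2] * dtheta - c * smoothness
```

For example, with w = (1, 0, 0) the reward reduces to the forward displacement in the starting frame, less the smoothness penalty.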

At the beginning of each episode, the scheduler determines the next task to learn from the set of predefined tasks based on the relative position of the center of the workspace in the robot’s coordinate frame. Refer to Table I for the complete set of tasks. In effect, our scheduler selects the task whose desired walking direction points towards the center. This is done by dividing the workspace in the robot’s coordinate frame with fixed angles and selecting the task whose corresponding subdivision contains the center (Figure 4). Assuming we have two tasks, forward and backward walking, the scheduler will select the forward task if the workspace center is in front of the robot, and the backward task otherwise. Note that a simple round-robin scheduler would not work, because the tasks may not be learned at the same rate.
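A minimal sketch of this state-machine scheduler (the task names and exact sector angles are illustrative; a two-task setup would use only the front/back split):

```python
import math

def select_task(robot_xy, robot_yaw, center_xy=(0.0, 0.0)):
    """Pick the task whose desired walking direction points toward the
    workspace center, by checking which robot-centric angular sector
    the center falls into."""
    dx = center_xy[0] - robot_xy[0]
    dy = center_xy[1] - robot_xy[1]
    # Angle of the workspace center in the robot's coordinate frame.
    angle = math.atan2(dy, dx) - robot_yaw
    angle = (angle + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    # Four fixed 90-degree sectors around the robot.
    if -math.pi / 4 <= angle < math.pi / 4:
        return "forward"      # center is in front
    elif angle >= 3 * math.pi / 4 or angle < -3 * math.pi / 4:
        return "backward"     # center is behind
    elif angle >= math.pi / 4:
        return "turn_left"    # center is to the robot's left
    else:
        return "turn_right"   # center is to the robot's right
```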

Our multi-task learning method is based on two assumptions. First, we assume that even a partially trained policy can still move the robot in the desired direction, even by a small amount, most of the time. In practice, we find this to be true: the initial policy is not likely to move the robot far away from the center, and as the policy improves, the robot quickly begins to move in the desired direction, even if it does so slowly and unreliably at first. Usually, after an initial set of roll-outs, the robot starts to push its base a small distance in the desired direction. The second assumption is that, for each task in the set, there is a counter-task that moves in the opposite direction, for example, walking forward versus backward, or turning left versus right. Therefore, if one policy drives the robot to the boundary of the workspace, its counter-policy can bring it back. In our experience, both assumptions hold in most scenarios, unless the robot accidentally gets stuck at the corners of the workspace.

We train a policy for each task with a separate instance of the learning algorithm, without sharing actors, critics, or the replay buffer. We made this design decision because we did not observe a clear performance gain when we experimented with weight and data sharing, probably because the set of tasks is discrete and the experience from one task may not be helpful for the others.

Although multi-task learning reduces the number of out-of-bound failures by scheduling a proper task-to-learn between episodes, the robot can still occasionally leave the workspace if it travels a long distance within a single episode. We prevent this failure by triggering early termination (ET) when the robot is near the boundary and continues moving towards it. In contrast to falling, this early termination requires special treatment in the return calculation: since the robot does not fall and could continue executing the task were it not near the boundary, we take future rewards into consideration when computing the target values of the Q functions.
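This treatment of target values can be sketched as follows, a standard timeout-aware bootstrapping rule (the function and flag names are illustrative):

```python
def q_target(reward, next_q, discount, done):
    """Target value for the critic update.
    `done` is True only for true terminal states (falls), where no
    future reward is added. Boundary early termination passes
    done=False, so the target bootstraps from the next-state value
    exactly as for an ordinary non-terminal transition."""
    if done:
        return reward
    return reward + discount * next_q
```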

V-B RL with Safety Constraints

Repeated falls may not only damage the robot, but also significantly slow down training, since the robot must stand up after each fall. To mitigate this issue, we formulate a constrained MDP to find the optimal policy that maximizes the sum of rewards while satisfying the given safety constraint $f_s$:

$$\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\tau \sim \rho_\pi(\tau)} \Big[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \Big] \quad \text{s.t.} \quad f_s(\mathbf{s}_t) \ge 0 \;\; \text{for all } t. \tag{3}$$

In our implementation, we design the safety constraint to prevent the forward or backward falls that can easily damage the servo motors:

$$f_s(\mathbf{s}) = \min\big( \hat{p} - |p|, \; \hat{r} - |r| \big), \tag{4}$$

where $p$ and $r$ are the pitch and roll angles of the robot’s torso, and $\hat{p}$ and $\hat{r}$ are the maximum allowable tilt angles, which are fixed across all experiments.
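One plausible implementation of this tilt constraint (the min-of-margins form is an assumption consistent with the text; the threshold values in the usage below are placeholders):

```python
def safety_margin(pitch, roll, max_pitch, max_roll):
    """Safety term f_s(s) = min(p_max - |pitch|, r_max - |roll|):
    positive while the torso tilt is within both bounds, and negative
    as soon as either the pitch or the roll limit is exceeded."""
    return min(max_pitch - abs(pitch), max_roll - abs(roll))
```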

We can rewrite the constrained optimization by introducing a Lagrangian multiplier $\lambda$:

$$\mathcal{L}(\pi, \lambda) = \mathbb{E}_{\tau \sim \rho_\pi(\tau)} \Big[ \sum_{t} r(\mathbf{s}_t, \mathbf{a}_t) \Big] + \lambda \, \mathbb{E}_{\tau \sim \rho_\pi(\tau)} \Big[ \sum_{t} f_s(\mathbf{s}_t) \Big]. \tag{5}$$

We optimize this objective using the dual gradient descent method [12], which alternates between the optimization of the policy $\pi$ and of the Lagrangian multiplier $\lambda$. First, we train Q functions for both the regular reward $r$ and the safety term $f_s$, denoted $Q_r$ and $Q_s$ and parameterized by $\theta_r$ and $\theta_s$ respectively, using a formulation similar to that of Haarnoja et al. [28]. Then we can obtain the following loss for the actor parameters $\phi$:

$$J_\pi(\phi) = \mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}, \, \mathbf{a}_t \sim \pi_\phi} \big[ \alpha \log \pi_\phi(\mathbf{a}_t | \mathbf{s}_t) - Q_r(\mathbf{s}_t, \mathbf{a}_t) - \lambda Q_s(\mathbf{s}_t, \mathbf{a}_t) \big], \tag{6}$$

where $\mathcal{D}$ is the replay buffer. Finally, we can learn the Lagrangian multiplier $\lambda$ by minimizing the loss $J(\lambda) = \mathbb{E}_{\mathbf{s}_t \sim \mathcal{D}} \big[ \lambda f_s(\mathbf{s}_t) \big]$.
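The multiplier half of the alternating update can be sketched as follows (names and the learning rate are illustrative): minimizing a loss of the form λ·E[f_s] raises λ when the safety margin is negative on average and decays it otherwise, with λ clipped to remain non-negative.

```python
def update_lagrange_multiplier(lam, safety_values, lr=0.01):
    """One dual gradient descent step on J(lambda) = lambda * E[f_s(s)].
    `safety_values` stands in for sampled estimates of the safety
    margin f_s along recent experience. dJ/dlambda = E[f_s], so a
    violated constraint (negative mean margin) increases lambda,
    penalizing unsafe behavior more strongly in the actor loss."""
    grad = sum(safety_values) / len(safety_values)  # dJ/dlambda
    lam = lam - lr * grad                           # minimize J w.r.t. lambda
    return max(lam, 0.0)                            # keep lambda non-negative
```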

Fig. 5: Illustration of Hardware. (Left) A Minitaur robot with a safety box. (Right) A 1-DoF cable management system.

V-C Additional System Designs for Autonomous Training

We developed additional hardware and software features to facilitate a safe and autonomous real-world training environment.

First, while the safety constraints significantly reduce the number of falls, some falls are inevitable because most RL algorithms rely on failure experience to learn effective control policies. To eliminate manual resets after the robot falls, we develop an automated stand-up controller that can recover the robot from a wide range of failure configurations. Our stand-up controller is manually engineered as a simple state machine, which pushes the leg on the fallen side or adjusts the leg angles to roll back to the initial orientation. One challenge in designing such a controller is that the robot may not have enough space and torque to move its legs underneath its body in certain fallen configurations, due to the weak direct-drive motors. For this reason, we attach a cardboard box underneath the robot (Figure 5, Left). When the robot falls, this small box gives it additional space for moving its legs, which prevents the legs from getting stuck and the servos from overheating. Please refer to the supplemental video for the stand-up motion.

Second, we find that the robot often gets tangled in its tethering cables when it walks and turns. We develop a cable management system so that all the tethering cables, including power, communication, and motion capture, are hung above the robot (Figure 5, Right). We wire the cables through a rod mounted above the workspace and adjust the cables' length to maintain proper slackness. Since our workspace has an elongated shape, we connect one end of the rod to a hinge joint at the midpoint of the long side of the workspace, which allows the other end of the rod to follow the robot passively.

Third, to reduce the wear and tear of the motors caused by the jerky random exploration of the RL algorithm, we post-process the action commands with a first-order low-pass Butterworth filter [13] with a low cutoff frequency.
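The filtering step can be sketched as a minimal first-order low-pass Butterworth filter derived with the bilinear transform (the cutoff and sampling rates used in the test are placeholders, not the paper's values):

```python
import math

def first_order_lowpass_coeffs(cutoff_hz, sample_hz):
    """Coefficients of a first-order low-pass Butterworth filter,
    discretized with the bilinear transform (with frequency prewarping):
        y[n] = b * (x[n] + x[n-1]) - a * y[n-1]."""
    wc = math.tan(math.pi * cutoff_hz / sample_hz)  # prewarped cutoff
    b = wc / (1.0 + wc)
    a = (wc - 1.0) / (1.0 + wc)
    return b, a

class ActionFilter:
    """Streams raw policy actions through the low-pass filter to smooth
    jerky exploration before the commands reach the motors."""
    def __init__(self, cutoff_hz, sample_hz):
        self.b, self.a = first_order_lowpass_coeffs(cutoff_hz, sample_hz)
        self.prev_x = 0.0
        self.prev_y = 0.0

    def step(self, x):
        y = self.b * (x + self.prev_x) - self.a * self.prev_y
        self.prev_x, self.prev_y = x, y
        return y
```

For a constant input the output converges to that input (unit DC gain), while high-frequency chatter in the raw actions is attenuated.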

VI Experiments

We design experiments to validate that the proposed system can learn locomotion controllers in the real world with minimal human intervention. In particular, we would like to answer the following questions:

  1. Can our system learn locomotion policies in the real world with minimal human intervention?

  2. Can our system learn multiple policies simultaneously?

  3. Can our constrained MDP formulation reduce the number of failures?

VI-A Experiment Details

We test our framework on the Minitaur, a quadruped robot [21] (Figure 5, Left), which is approximately  kg with a  m body length. The robot has eight direct-drive servo motors that move its legs in the sagittal plane. Each leg is constructed as a four-bar linkage and is not symmetric: the end-effector (one of the bars) is longer and points in the forward direction. The robot is equipped with motor encoders that read the joint positions and an IMU sensor that measures the torso's orientation and angular velocities in the roll and pitch axes. In our MDP formulation, the state consists of the motor angles, IMU readings, and previous actions over the last six time steps. The robot is directly controlled from a non-realtime Linux workstation (Xeon E5-1650 V4 CPU, 3.5 GHz) at about  Hz. At each time step, we send the action, the target motor angles, to the robot with relatively low PD gains of 0.5 and 0.005. While collecting data from the real world, we train all the neural networks by taking two gradient steps per control step. We use Equation 2 as the reward function, which is parameterized by the task weights $\mathbf{w}$.

We solve the safety-constrained MDP (Equations 3 and 4) using an off-policy learning algorithm, Soft Actor-Critic (SAC) [27], which has an additional inequality constraint for the policy entropy. Therefore, we optimize two Lagrangian multipliers, $\alpha$ for the entropy constraint and $\lambda$ for the safety constraint, by applying dual gradient descent to both variables. We represent the policy and the value functions with fully connected feed-forward neural networks with two hidden layers and ReLU activation functions. The network weights are randomly initialized and learned with the Adam optimizer.

VI-B Learning on a Flat Terrain

Fig. 6: Learning curves on the flat terrain with two tasks (Left) and four tasks (Right). In both cases, our framework trains successful multi-directional walking policies with minimal human intervention.
Fig. 7: Learned policies on the flat terrain. (Top) The forward policy moves the legs on the same side synchronously, resembling a pacing gait. (Bottom) The backward gait is similar to a bounding gait, with left-right symmetric motions.


Fig. 8: Plot of the starting location (points) and desired walking direction (arrows) of each episode. The color represents the index of the episode and the marker represents the task (circle: forward; square: backward). The plot indicates that our multi-task learning method finds proper tasks to attract the robot to the center of the workspace, resulting in zero human interventions during this training run.

First, we test our system on a flat-ground area using the two-task configuration (walking forward or backward). Our framework successfully trains both policies from scratch on the flat surface, with zero human intervention. In addition, our system requires far less data than that of Haarnoja et al. [28]. While the prior work [28] needs 2 hours, or 160k steps, to learn a single policy, the total time of training two policies using our system is about 1.5 hours, which is approximately 60k steps or 135 episodes. Figure 6 (Left) shows the learning curves for both tasks. This greatly improved efficiency is because our system updates the policy at every single step of real-world execution, which turns out to be much more sample-efficient than the episodic, asynchronous update scheme of Haarnoja et al. [28].

We observe that the robot's asymmetric leg structure leads to different gaits when walking forward and backward. Forward walking requires higher foot clearance because otherwise the front-facing feet can easily stab the ground, causing large friction forces and leading to falls. In contrast, the robot can drag its feet when walking backward, even if they slide on the ground. As a consequence, the learned forward locomotion resembles a regular pacing gait with high ground clearance, and the backward locomotion can be roughly classified as a high-frequency bounding gait. Please refer to Figure 7 and the supplemental video for more detailed motions.

Although a small number of manual resets were still required occasionally when the robot got stuck at a corner, the majority of the training runs finished with zero human interventions. This is significantly less than the work of Haarnoja et al. [28], which requires hundreds of manual resets. In the training session shown in Figure 6 (Left), our framework automatically recovered from falls and flips by repeatedly invoking the stand-up controller; without the multi-task learner, the robot would have left the workspace many times. Figure 8 visualizes the starting position of every episode during the entire training run, with the desired walking directions marked as arrows. The figure indicates that our multi-task learning method successfully steers the robot to the center, although the robot may deviate slightly due to the immature policy and the stochastic environment.


Fig. 9: We train locomotion policies in four different directions, walking forward, walking backward, turning left, and turning right, which allow us to interactively control the robot with a game controller.

Furthermore, we test our system in the four-task configuration: walking forward, walking backward, turning left, and turning right. These four tasks form a complete set of high-level navigation commands. The learning curve, aggregated over all four tasks, is plotted on the right of Figure 6. While the forward and backward policies are similar to those of the previous experiment, we also obtain effective turning policies that can finish a complete in-place turn within ten seconds. Note that turning in place with the planar leg structure of the Minitaur would be difficult to hand-design, but our system discovers it automatically. We integrated these learned policies with a remote controller, which enables us to steer and move the robot interactively in real time.

VI-C Learning on Challenging Surfaces

Fig. 10: Learned locomotion gaits on the challenging terrains. (Row 1) The learned forward gait on the mattress tends to lift the front and back limbs to secure larger ground clearance. (Row 2) The learned backward gait on the mattress takes large steps. (Row 3) The learned forward policy walks on the doormat while pulling the leg from a crevice. (Row 4) The learned backward gait on the doormat resembles an energetic pacing gait.
Fig. 11: Learning curves on the soft mattress (Left) and the doormat with crevices (Right). On both challenging surfaces, our framework trains successful policies.

We also deploy our autonomous learning system on two more challenging surfaces: a soft mattress and a doormat with crevices. The soft mattress is made of gel memory foam, and the doormat is made of rubber; we combine eight doormats to obtain the workspace. Both surfaces are "challenging" because they are extremely difficult to model. Additionally, a policy trained on flat ground cannot walk on either surface, even though maximum-entropy RL is known to learn robust policies.

Our framework successfully learns to walk forward and backward on both challenging surfaces (Figure 11). Training on these surfaces requires more samples than on the flat surface, both for the mattress and for the doormat. In both cases, learning to walk backward was slightly easier than the forward locomotion task. On the soft mattress, the learned policies find gaits with larger ground clearance than on flat ground by tilting the torso more. In particular, the learned forward policy is not homogeneous over time and alternates among pacing, galloping, and pronking gaits, although each segment does not perfectly coincide with the definition of any particular gait (Figure 10, Top). Locomotion policies on the doormat are generally more energetic than the flat-terrain policies (Figure 10, Bottom). We observe that the forward policy often shakes a leg when it is stuck within a crevice, a key difference from the policy trained on the flat surface.

Although our framework greatly reduces the number of failures, it still requires a few manual resets when training locomotion on the challenging surfaces. The increased number of manual resets compared to the flat ground is largely due to the reduced size of the workspace. In addition, the automated stand-up controller sometimes fails when a foot is stuck firmly in a crevice.

VI-D Analysis


Fig. 12: The number of out-of-bound failures with a multi-task learning method (ours) and a standard learning method (baseline). For all sizes of workspace, multi-task learning dramatically reduces the number of manual resets.
Fig. 13: Comparison of the fixed weight and adaptive Lagrangian multipliers (ours) for a safety constraint. Our method (blue) reduces the number of falls significantly compared to the baseline without any safety constraint (orange). Its performance is close to the best case of the baseline if the weight is carefully tuned (weight = 1.0, purple). However, it is often infeasible to tune hyper-parameters for training on real robots.

In this section, we evaluate two main components of our framework, multi-task learning and the safety-constrained SAC, using a large number of training runs in PyBullet simulation [19]. Simulation allows us to easily collect a large quantity of data and compare the statistics of the performance without damaging the robot. We want to emphasize that these experiments in simulation are only for analysis, and we do not use any simulation data for the experiments in the previous sections. For all statistics in this section, we run experiments with five different random seeds.

First, we compare the number of out-of-bound failures for our multi-task learning method against the baseline, single-task learning. Because the number of failures is related to the size of the workspace, we select three sizes ( m x  m,  m x  m, and  m x  m) that are similar to the real-world settings of the mattress, the doormat, and the flat ground, respectively. Figure 12 shows that our method learns policies with far fewer failures, only  % to  % of the baseline. The trend is the same for all three sizes, although smaller workspaces require more manual resets, as expected. The number of out-of-bound failures for multi-task learning in the large workspace ( m x  m) also matches our empirical results from the real-world experiments.
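To make the multi-task scheduling idea concrete, the following is a minimal sketch of one plausible scheduler: among candidate tasks, pick the one whose motion direction points the robot back toward the workspace center, so training rarely drives the robot out of bounds. The task set, displacement model, and function names are illustrative assumptions, not the paper's implementation.

```python
import math

# Hypothetical task set: each task moves the robot in a direction offset
# from its current heading (forward = 0 rad, backward = pi rad). Turning
# tasks are omitted here for brevity.
TASKS = {
    "forward": 0.0,
    "backward": math.pi,
}

def select_task(x, y, yaw):
    """Return the task whose expected displacement best points toward (0, 0)."""
    # Direction of the unit vector from the robot back to the workspace center.
    to_center = math.atan2(-y, -x)
    best_task, best_alignment = None, -2.0
    for task, offset in TASKS.items():
        heading = yaw + offset  # direction this task would move the robot
        alignment = math.cos(heading - to_center)
        if alignment > best_alignment:
            best_task, best_alignment = task, alignment
    return best_task
```

For example, a robot at (1, 0) facing away from the center (yaw = 0) would be assigned the backward-walking task, which carries it back toward the middle of the workspace.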

Figure 13 compares different ways to enforce safety: a fixed weighted penalty term in the reward, or a constraint (our method) in which the penalty weight is learned automatically. We measure both the reward (Figure 13, Top) and the number of falls (Figure 13, Bottom) during a fixed learning duration ( k samples) to verify whether our method can reduce the number of failures. The results indicate that our approach (blue) trains a successful policy with approximately  falls, significantly fewer than learning without a safety term (orange). The number of falls is larger than with the optimal fixed-weight setting (purple), which falls  times on average. In practice, however, safety (minimizing failures) and optimality (maximizing reward) are often contradictory, as observed with an excessive weight (brown). Finding a good trade-off may require extensive hyper-parameter tuning, which is infeasible when training on real robots. Our method significantly reduces the number of falls without any hyper-parameter tuning.
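The adaptive Lagrangian multiplier can be sketched as a dual gradient-ascent update: the multiplier grows whenever the observed safety cost exceeds the allowed limit and shrinks (but stays non-negative) otherwise, so the penalty weight tunes itself during training. The limit, step size, and cost signal below are illustrative assumptions, not the paper's exact hyper-parameters.

```python
def update_lagrangian(lmbda, safety_cost, limit=0.05, step_size=0.1):
    """One dual-ascent step: lambda <- max(0, lambda + eta * (J_safety - d))."""
    return max(0.0, lmbda + step_size * (safety_cost - limit))

# The policy is then trained on the penalized objective
#   reward - lambda * safety_cost,
# so the penalty weight grows automatically while the policy is unsafe
# and relaxes once the fall rate drops below the limit.
lam = 0.0
for cost in [0.2, 0.2, 0.05, 0.0, 0.0]:  # simulated per-batch fall rates
    lam = update_lagrangian(lam, cost)
```

This removes the need to hand-tune a fixed penalty weight: an over-large fixed weight sacrifices reward, while the adaptive multiplier backs off once the constraint is satisfied.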

VII Conclusion and Future Work

We presented an autonomous system for learning legged locomotion policies in the real world with minimal human intervention. We focus on resolving key challenges in automation and safety, which are the bottlenecks of robotic learning and are complementary to improving existing deep RL algorithms. First, we develop a multi-task learning system that prevents the robot from leaving the training area by scheduling multiple tasks with various target directions. Second, we solve a safety-constrained MDP by automatically adjusting the Lagrangian multiplier, which minimizes the number of falls on the real system without additional hyper-parameter tuning. Third, we build hardware infrastructure and a stand-up controller that enable continuous data collection. Our experiments show that such an autonomous learning system is sufficient to tackle challenging locomotion problems: it reduces the number of manual resets by more than an order of magnitude compared to the state-of-the-art on-robot training system of Haarnoja et al. [28]. Furthermore, the system allows us to train successful gaits on challenging surfaces, such as a soft mattress and a doormat with crevices, for which acquiring an accurate simulation model is expensive. Finally, we show that we can obtain a complete set of locomotion policies (walking forward, walking backward, turning left, and turning right) from a single learning session.

One requirement of our system is a robust stand-up controller that works in a variety of situations, which is designed manually in the current version. Although developing an effective stand-up controller was not difficult thanks to the simple morphology of the Minitaur robot, it would be ideal if we could learn it automatically using RL. In the near future, we intend to train a recovery policy from real-world experience using the proposed framework itself, by learning locomotion and recovery policies simultaneously [22].

Our system treats the learning of multiple tasks as fully separate deep RL instances without sharing any experience. We intentionally made this design decision because the experience of one discrete task is not informative to the others. However, it may be possible to improve sample efficiency with a shared replay buffer when learning locomotion tasks parameterized by target linear and angular velocities. Such sharing could potentially be achieved with Hindsight Experience Replay [6] or model-based RL [18, 32, 54].
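The shared-replay-buffer idea can be illustrated with a small hindsight-relabeling sketch: a transition collected under one target velocity is reused for another target by recomputing its reward from the velocity the robot actually achieved. The reward form and transition fields below are illustrative assumptions, not the paper's implementation.

```python
def velocity_reward(achieved_velocity, target_velocity):
    # Penalize deviation of the achieved forward velocity from the command.
    return -abs(achieved_velocity - target_velocity)

def relabel(transition, new_target):
    """Reuse a transition for a different target velocity (hindsight relabeling)."""
    relabeled = dict(transition)
    relabeled["target"] = new_target
    relabeled["reward"] = velocity_reward(transition["achieved"], new_target)
    return relabeled

# A transition collected while commanded 0.5 m/s can also populate the
# replay buffer of the 0.3 m/s task, with its reward recomputed.
t = {"obs": [0.0], "action": [0.1], "achieved": 0.31, "target": 0.5,
     "reward": velocity_reward(0.31, 0.5)}
t2 = relabel(t, 0.3)
```

Because the relabeled transition nearly satisfies the 0.3 m/s command, it provides a high-reward sample for that task at no extra data-collection cost.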


  • [1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In International Conference on Machine Learning (ICML). Cited by: §II.
  • [2] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. (2019) Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: §II.
  • [3] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018) Safe reinforcement learning via shielding. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §II.
  • [4] E. Altman (1999) Constrained markov decision processes. Vol. 7, CRC Press. Cited by: §II.
  • [5] A. Amini, G. Rosman, S. Karaman, and D. Rus (2019) Variational end-to-end navigation and localization. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8958–8964. Cited by: §II.
  • [6] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5048–5058. Cited by: §VII.
  • [7] T. Apgar, P. Clary, K. Green, A. Fern, and J. W. Hurst (2018) Fast online trajectory optimization for the bipedal robot cassie.. In Robotics: Science and Systems, Cited by: §II.
  • [8] F. Berkenkamp, M. Turchetta, A. Schoellig, and A. Krause (2017) Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pp. 908–918. Cited by: §II.
  • [9] G. Berseth, C. Xie, P. Cernek, and M. Van de Panne (2018) Progressive reinforcement learning with distillation for multi-skilled motion control. International Conference on Learning Representations (ICLR). Cited by: §II.
  • [10] D. P. Bertsekas (1997) Nonlinear programming. Journal of the Operational Research Society 48 (3), pp. 334–334. Cited by: §II.
  • [11] G. Bledt, M. J. Powell, B. Katz, J. D. Carlo, P. M. Wensing, and S. Kim (2018) MIT cheetah 3: design and control of a robust, dynamic quadruped robot. In International Conference on Intelligent Robots and Systems (IROS), Cited by: §II.
  • [12] S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge university press. Cited by: §V-B.
  • [13] S. Butterworth et al. (1930) On the theory of filter amplifiers. Wireless Engineer 7 (6), pp. 536–541. Cited by: §V-C.
  • [14] K. Chen, S. Ha, and K. Yamane (2018) Learning hardware dynamics model from experiments for locomotion optimization. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3807–3814. Cited by: §II.
  • [15] H. L. Chiang, A. Faust, M. Fiser, and A. Francis (2018) Learning navigation behaviors end to end. CoRR, abs/1809.10124. Cited by: §II.
  • [16] S. Choi and J. Kim (2019) Trajectory-based probabilistic policy gradient for learning locomotion behaviors. In 2019 International Conference on Robotics and Automation (ICRA), pp. 1–7. Cited by: §I, §II.
  • [17] Y. Chow, O. Nachum, A. Faust, M. Ghavamzadeh, and E. Duenez-Guzman (2019) Lyapunov-based safe policy optimization for continuous control. arXiv preprint arXiv:1901.10031. Cited by: §II.
  • [18] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. Cited by: §VII.
  • [19] E. Coumans (2013) Bullet physics library. Cited by: §VI-D.
  • [20] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa (2018) Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757. Cited by: §II.
  • [21] A. De (2017) Modular hopping and running via parallel composition. Cited by: §VI-A.
  • [22] B. Eysenbach, S. Gu, J. Ibarz, and S. Levine (2018) Leave no trace: learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations (ICLR), Cited by: §II, §VII.
  • [23] J. Fan and W. Li (2019) Safety-guided deep reinforcement learning via online gaussian process estimation. arXiv preprint arXiv:1903.02526. Cited by: §II.
  • [24] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §II.
  • [25] J. F. Fisac, A. K. Akametalu, M. N. Zeilinger, S. Kaynama, J. Gillula, and C. J. Tomlin (2018) A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control 64 (7), pp. 2737–2752. Cited by: §II.
  • [26] S. Ha, J. Kim, and K. Yamane (2018) Automated deep reinforcement learning environment for hardware of a modular legged robot. In International Conference on Ubiquitous Robots (UR), Cited by: §I, §II.
  • [27] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), Cited by: §VI-A.
  • [28] T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine (2018) Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103. Cited by: §I, §I, §II, §II, §V-B, §VI-B, §VI-B, §VII.
  • [29] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller, et al. (2017) Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286. Cited by: §II.
  • [30] M. Hutter, C. Gehring, D. Jud, A. Lauber, C. D. Bellicoso, V. Tsounis, J. Hwangbo, K. Bodie, P. Fankhauser, M. Bloesch, et al. (2016) Anymal-a highly mobile and dynamic quadrupedal robot. In International Conference on Intelligent Robots and Systems (IROS), pp. 38–44. Cited by: §II.
  • [31] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter (2019) Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26). External Links: Document Cited by: §II.
  • [32] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. In Advances in Neural Information Processing Systems, pp. 12498–12509. Cited by: §VII.
  • [33] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §II, §II.
  • [34] B. Katz, J. Di Carlo, and S. Kim (2019) Mini cheetah: a platform for pushing the limits of dynamic quadruped control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6295–6301. Cited by: §II.
  • [35] D. Kim, Y. Zhao, G. Thomas, B. R. Fernandez, and L. Sentis (2016) Stabilizing series-elastic point-foot bipeds using whole-body operational space control. IEEE Transactions on Robotics 32 (6), pp. 1362–1379. Cited by: §II.
  • [36] M. A. Lee, Y. Zhu, K. Srinivasan, P. Shah, S. Savarese, L. Fei-Fei, A. Garg, and J. Bohg (2019) Making sense of vision and touch: self-supervised learning of multimodal representations for contact-rich tasks. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8943–8950. Cited by: §II.
  • [37] S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17 (39), pp. 1–40. Cited by: §II.
  • [38] Z. C. Lipton, K. Azizzadenesheli, A. Kumar, L. Li, J. Gao, and L. Deng (2016) Combating reinforcement learning’s sisyphean curse with intrinsic fear. arXiv preprint arXiv:1611.01211. Cited by: §II.
  • [39] K. S. Luck, J. Campbell, M. A. Jansen, D. M. Aukes, and H. B. Amor (2017) From the lab to the desert: fast prototyping and learning of robot locomotion. arXiv preprint arXiv:1706.01977. Cited by: §I, §II.
  • [40] J. Morimoto, C. G. Atkeson, G. Endo, and G. Cheng (2007) Improving humanoid locomotive performance with learnt approximated dynamics via gaussian processes for regression. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4234–4240. Cited by: §II.
  • [41] X. Pan, Y. You, Z. Wang, and C. Lu (2017) Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952. Cited by: §II.
  • [42] H. Park, P. M. Wensing, and S. Kim (2017) High-speed bounding with the mit cheetah 2: control design and experiments. The International Journal of Robotics Research 36 (2), pp. 167–192. Cited by: §II.
  • [43] X. B. Peng, G. Berseth, and M. Van de Panne (2016) Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (TOG) 35 (4), pp. 81. Cited by: §II.
  • [44] M. H. Raibert (1986) Legged robots that balance. MIT press. Cited by: §II.
  • [45] F. Sadeghi and S. Levine (2017) (CAD)²RL: real single-image flight without a single real image. In Robotics: Science and Systems (RSS). Cited by: §II.
  • [46] R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §IV.
  • [47] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. In Robotics: Science and Systems (RSS), Cited by: §II.
  • [48] R. Tedrake, T. W. Zhang, and H. S. Seung (2005) Learning to walk in 20 minutes. In Proceedings of the Fourteenth Yale Workshop on Adaptive and Learning Systems, Vol. 95585, pp. 1939–1412. Cited by: §II.
  • [49] P. Tigas, A. Filos, R. McAllister, N. Rhinehart, S. Levine, and Y. Gal (2019) Robust imitative planning: planning from demonstrations under uncertainty. arXiv preprint arXiv:1907.01475. Cited by: §II.
  • [50] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: §II.
  • [51] V. Tsounis, M. Alge, J. Lee, F. Farshidian, and M. Hutter (2019) DeepGait: planning and control of quadrupedal gaits using deep reinforcement learning. arXiv preprint arXiv:1909.08399. Cited by: §II.
  • [52] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne (2018) Feedback control for cassie with deep reinforcement learning. arXiv preprint arXiv:1803.05580. Cited by: §II.
  • [53] Y. Yang, K. Caluwaerts, A. Iscen, J. Tan, and C. Finn (2019) Norml: no-reward meta learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 323–331. Cited by: §II.
  • [54] Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani (2019) Data efficient reinforcement learning for legged robots. arXiv preprint arXiv:1907.03613. Cited by: §II, §VII.
  • [55] W. Yu, J. Tan, Y. Bai, E. Coumans, and S. Ha (2019) Learning fast adaptation with meta strategy optimization. arXiv preprint arXiv:1909.12995. Cited by: §II.
  • [56] W. Yu, G. Turk, and C. K. Liu (2018) Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (TOG) 37 (4), pp. 1–12. Cited by: §II.
  • [57] A. Zeng, S. Song, J. Lee, A. Rodriguez, and T. Funkhouser (2019) Tossingbot: learning to throw arbitrary objects with residual physics. arXiv preprint arXiv:1903.11239. Cited by: §II.