TF2 Implementation of the Soft Actor-Critic Algorithm
Deep reinforcement learning suggests the promise of fully automated learning of robotic control policies that directly map sensory inputs to low-level actions. However, applying deep reinforcement learning methods on real-world robots is exceptionally difficult, due both to the sample complexity and, just as importantly, the sensitivity of such methods to hyperparameters. While hyperparameter tuning can be performed in parallel in simulated domains, it is usually impractical to tune hyperparameters directly on real-world robotic platforms, especially legged platforms like quadrupedal robots that can be damaged through extensive trial-and-error learning. In this paper, we develop a stable variant of the soft actor-critic deep reinforcement learning algorithm that requires minimal hyperparameter tuning, while also requiring only a modest number of trials to learn multilayer neural network policies. This algorithm is based on the framework of maximum entropy reinforcement learning, and automatically trades off exploration against exploitation by dynamically and automatically tuning a temperature parameter that determines the stochasticity of the policy. We show that this method achieves state-of-the-art performance on four standard benchmark environments. We then demonstrate that it can be used to learn quadrupedal locomotion gaits on a real-world Minitaur robot, learning to walk from scratch directly in the real world in two hours of training.READ FULL TEXT VIEW PDF
TF2 Implementation of the Soft Actor-Critic Algorithm
My solution to the NeurIPS challenge Learn to Move: Walk Around
Designing locomotion controllers for legged robots is a long-standing research challenge. Current state-of-the-art methods typically employ a pipelined approach, consisting of components such as state estimation, contact scheduling, trajectory optimization, foot placement planning, model-predictive control, and operational space control[13, 23, 5, 1]. Designing these components requires expertise and often an accurate dynamics model of the robot that can be difficult to acquire. In contrast, end-to-end deep reinforcement learning (deep RL) does not assume any prior knowledge of the gait or the robot’s dynamics, and can in principle be applied to robotic systems without explicit system identification or manual engineering. If successfully applied, deep RL can automate the controller design, completely removing the need for system identification, and resulting in gaits that are directly optimized for a particular robot and environment. However, applying deep RL to learning gaits in the real world is challenging, since current algorithms often require a large number of samples—on the order of tens of thousands of trials . Moreover, such algorithms are often highly sensitive to hyperparameter settings and require considerable tuning , further increasing the overall sample complexity. For this reason, many prior methods have studied learning of locomotion gaits in simulation [20, 50, 34, 4], requiring accurate system identification and modeling.
In this paper, we aim to address these challenges by developing a deep RL algorithm that is both sample efficient and robust to the choice of hyperparameters, thus allowing us to learn locomotion gaits directly in the real world, without prior modeling. In particular, we extend the framework of maximum entropy RL. Methods of this type, such as soft actor-critic  and soft Q-learning , can achieve state-of-the-art sample efficiency  and have been successfully deployed in real-world manipulation tasks [16, 31], where they exhibit a high degree of robustness due to entropy maximization . However, maximum entropy RL algorithms are sensitive to the choice of the temperature parameter, which determines the trade-off between exploration (maximizing the entropy) and exploitation (maximizing the reward). In practice, this temperature is considered as a hyperparameter that must be tuned manually for each task.
We propose an extension to the soft actor-critic algorithm  that removes the need for manually tuning of the temperature parameter. Our method employs gradient-based optimization of the temperature towards the targeted expected entropy over the visited states. In contrast to standard RL, our method controls only the expected entropy over the states, while the per-state entropy can still vary—a desirable property that allows the policy to automatically reduce entropy for states where acting deterministically is preferred, while still acting stochastically in other states. Consequently, our approach virtually eliminates the need for per-task hyperparameter tuning, making it practical for us to apply this algorithm to learn quadrupedal locomotion gaits directly on a real-world robotic system.
The principal contribution of our paper is an end-to-end RL framework for legged locomotion on physical robots, which includes a data efficient learning algorithm based on maximum entropy RL and an asynchronous learning system. We demonstrate the framework by training a Minitaur robot  (Figure 1) to walk. While we train the robot on flat terrain, the learned policy can generalize to unseen terrains and is moderately robust to perturbations. The training requires about 400 rollouts, equating to about two hours of real-world time. In addition to the robot experiments, we evaluate our algorithm on simulated benchmark tasks and show that it can achieve state-of-the-art performance and, unlike prior works based on maximum entropy RL, can use exactly the same hyperparameters for all tasks.
Current state-of-the-art locomotion controllers typically adopt a pipelined control scheme. For example, the MIT Cheetah  uses a state machine over contact conditions, generates simple reference trajectories, performs model predictive control  to plan for desired contact forces, and then uses Jacobian transpose control to realize them. The ANYmal robot  plans footholds based on the inverted pendulum model , applies CMA-ES  to optimize a parameterized controller [12, 13], and solves a hierarchical operational space control problem  to produce joint torques, contact forces, and body motion. While these methods can provide effective gaits, they require considerable prior knowledge of the locomotion task and, more importantly, of the robot’s dynamics. In contrast, our method aims to control the robot without prior knowledge of either the gait or the dynamics. We do not assume access to any trajectory design, foothold planner, or a dynamics model of the robot, since all learning is done entirely through real-world interaction. The only requirement is knowledge of the dimension and bounds of the state and action space, which in our implementation correspond to joint angles, IMU readings, and desired motor positions. While in practice, access to additional prior knowledge could be used to accelerate learning (see, e.g., ), end-to-end methods that make minimal prior assumptions are broadly applicable, and developing such techniques will make acquisition of gaits for diverse robots in diverse conditions scalable.
Deep RL has been used extensively to learn locomotion policies in simulation [20, 50, 34, 4] and even transfer them to real-world robots [44, 24], but this inevitably incurs a loss of performance due to discrepancies in the simulation, and requires accurate system identification. Using such algorithms directly in the real world has proven challenging. Real-world applications typically make use of simple and inherently stable robots  or low-dimensional gait parameterizations [29, 8, 36], or both . In contrast, we show that we can acquire locomotion skills directly in the real world using neural-net policies.
Our algorithm is based on maximum entropy RL, which maximizes the weighted sum of the the expected return and the policy’s expected entropy. This framework has been used in many contexts, from inverse RL  to optimal control [48, 49, 38]. One advantage of maximum entropy RL is that it produces relatively robust policies, since injection of structured noise during training causes the policy to explore the state space more broadly and improves the robustness of the policy 
. However, the weight on the entropy term (“temperature”) is typically chosen heuristically[33, 15, 32, 40, 17]. In our observation, this parameter is very sensitive and manual tuning can make real-world application of the maximum entropy framework difficult. Instead, we propose to constrain the expected entropy of the policy and adjust the temperature automatically to satisfy the constraint. Our formulation is an instance of constrained MDP, which has been studied recently in [6, 3, 47]. These works consider constraints that depend on the policy only via the sampling distribution, whereas in our case the constraint depends the policy explicitly. Our approach is also closely related to KL-divergence constraints that limit the policy change between iterations [35, 39, 2] but is applied directly to the current policy’s entropy. We find that this simple modification drastically reduces the effort of parameter tuning on both simulated benchmarks and our robotic locomotion task.
In this section, we will first describe our asynchronous robotic RL system, which we will use to evaluate real-world RL for robotic locomotion. The system, shown in Figure 2, consists of three components: a data collection job that collects robot experience, a motion capture job that computes the reward signal based on robot’s position measured by a motion capture system, and a training job that updates the neural networks. These subsystems run asynchronously on different machines. When the learning system starts, the subsystems are synchronized to a common clock and use timestamps to sync the future data streams.
The data collection job runs on an on-board computer and executes the latest policy produced by the training job. For each control step , it collects observations , performs neural network policy inference, and executes an action . The entire observed trajectory, or rollout, is recorded into tuples and sent to the training job. The motion capture system measures the position of the robot and provides the reward signal
. It periodically pulls data from the robot and the motion capture system, evaluates the reward function, and appends it to a replay buffer. The training subsystem runs on a workstation. At each iteration of training, the training job randomly samples a batch of data from this buffer and uses stochastic gradient descent to update the value network, the policy network, and the temperature parameter, as we will discuss inSection V. Once training is started, minimal human intervention is needed, except for the need to reset the robot if it falls or runs out of free space.
The asynchronous design allows us to pause or restart any subsystem without affecting the other subsystems. In practice, we found this particularly useful because we often encounter hardware and communication errors, in which case we can safely restart any of the subsystems without impacting the entire learning process. In addition, our system can be easily scaled to multiple robots by simply increasing the number of data collection jobs. In the following sections, we describe our proposed reinforcement learning method in detail.
Reinforcement learning aims to learn a policy that maximizes the expected sum of rewards 
. We consider Markov decision processes where the state spaceand action space are continuous. An agent starts at an initial state , samples an action from a policy , receives a bounded reward , and transitions to a new state according to the dynamics . This generates a trajectory of states and actions . We denote the trajectory distribution induced by by , and we overload the notation and use and to denote the corresponding state-action and state marginals, respectively.
Maximum entropy RL optimizes both the expected return and the entropy of the policy. For finite-horizon MDPs, the corresponding objective can be expressed as
which incentivizes the policy to explore more widely improving its robustness against perturbations . The temperature parameter determines the relative importance of the entropy term against the reward, and thus controls the stochasticity of the optimal policy. The maximum entropy objective differs from the standard maximum expected reward objective used in conventional reinforcement learning, though the conventional objective can be recovered in the limit as . In the finite horizon case, the policy is time dependent, and we write and to denote the set of all policies or temperatures . We can extend the objective to infinite horizon problems by introducing a discount factor to ensure that the sum of expected rewards and entropies is finite , in which case we overload the notation and denote a stationary policy and temperature as and .
One of the central challenges with the objective in (1) is that the trade-off between maximizing the return, or exploitation, versus the entropy, or exploration, is directly affected by the scale of the reward function111Reward scale is the reciprocal of temperature. We will use these two terms interchangeably throughout this paper.. Unlike in conventional RL, where the optimal policy is independent of scaling of the reward function, in maximum entropy RL the scaling factor has to be tuned per environment, and a sub-optimal scale can drastically degrade the performance .
Learning robotic tasks in the real world requires an algorithm that is sample efficient, robust, and insensitive to the choice of the hyperparameters. Maximum entropy RL is both sample efficient and robust, making it a good candidate for real-world robot learning . However, one of the major challenges of maximum entropy RL is its sensitivity to the temperature parameter, which typically needs to be tuned for each task separately. In this section, we propose an algorithm that enables automated temperature adjustment at training time, substantially reducing the effort of hyperparameter tuning and making deep RL a viable solution for real-world robotic problems.
The magnitude of the reward differs not only across tasks, but it also depends on the policy, which improves over time during training. Since the optimal entropy depends on this magnitude, choosing the ideal temperature is particularly difficult: the entropy can vary unpredictably both across tasks and during training as the policy becomes better. Instead of requiring the user to set the temperature manually, we can automate this process by formulating a modified RL objective, where the entropy is treated as a constraint. Simply forcing the entropy to a fixed value is a poor solution, since the policy should be free to explore more in regions where the optimal action is uncertain, but remain more deterministic in states with a clear distinction between good and bad actions. Therefore, we constrain the expected entropy of the policy, while the entropy at different states can still vary. We show that the Lagrangian relaxation of this problem leads to the maximum entropy objective with respect to the policy, where the dual variable takes the role of the temperature.
In particular, our aim is to find a stochastic policy with maximal expected return that satisfies a minimum expected entropy constraint. Formally, we want to solve the constrained optimization problem
where is the desired minimum expected entropy. Note that, for fully observed MDPs, the policy that optimizes the expected return is deterministic, so we expect this constraint to usually be tight and do not need to impose an upper bound on the entropy.
We optimize this objective using the dual gradient method. Note that for a fixed dual variable, the Lagrangian is exactly equal to the maximum entropy objective in (1) minus an additive constant () per time step, and can thus be optimized with any off-the-shelf maximum entropy RL algorithm. Specifically, we resort to approximate dynamic programming, which turn out to correspond to the soft actor-critic algorithm . We first define the soft Q-function and use it to bootstrap the algorithm. The optimal soft Q-function is defined as
and denotes the optimal policy for time . We have omitted the dependency of the soft Q-function on the dual variable of the future time steps for brevity. We also abuse the notation slightly, and write to denote , which are equal only if is a set of all policies (and not for example the set of Gaussian policies). We initialize the iteration by setting . Assuming we have evaluated for some , we can substitute it to the Lagrangian. We can now solve for the optimal policy at time for all by noting that the optimal policy at time is independent of the policy at the previous time steps:
The partition function does not depend on , so we can ignore it for optimizing . This is exactly the soft policy improvement step introduced by , with an additional temperature parameter . In contrast to , which shows that this update leads to an improvement in the infinite horizon case, we derived it starting from the finite horizon objective. By traversing backwards in time, we can optimize the Lagrangian with respect to the policy.
After solving for the policy for a fixed dual variable, we improve the dual in order to satisfy the entropy constraint. We can optimize the temperature by moving it in the direction of the negative gradient of (3):
where is the learning rate222We also need to make sure remains non-negative. In practice, we thus parameterize and optimize instead.. The equations (4), (6), and (7) constitute the core of our algorithm. However, solving these equations exactly is not practical for continuous state and actions, and in practice, we cannot compute the expectations, but instead have access to unbiased samples. Therefore, for a practical algorithm, we need to resort to function approximators and stochastic gradient descent as well as other standard tricks to stabilize training, as discussed in the next section.
In practice, we parameterize a Gaussian policy with parameters , and learn them using stochastic gradient descent for the discounted, infinite horizon problem. We additionally use two parameterized Q-functions, with parameters and , as suggested in . We learn the Q-function parameters as a regression problem by minimizing the following loss :
using minibatches from a replay buffer . The value function is implicitly defined through the Q-functions and the policy as . We learn a Gaussian policy by minimizing
To learn , we need to minimize the dual objective, which can be done by approximating dual gradient descent. Instead of optimizing with respect to the primal variables to convergence, we use a truncated version that performs incomplete optimization and alternates between taking a single gradient step on each objective. While convergence to the global optimum is not guaranteed, we found this approach to work well in practice. Thus, we compute gradients for with the following objective:
The proposed algorithm alternates between a data collection phase and an optimization phase. In the optimization phase, the algorithm optimizes all objectives in (8) – (10) jointly. We also incorporate delayed target Q-function networks as is standard in prior work. Algorithm 1 summarizes the full algorithm, where denotes stochastic gradients.
Before evaluating on real-world locomotion, we conduct a comprehensive evaluation in simulation to validate our algorithm. Our goal is to answer following four questions:
Does our method achieve the state-of-the-art data efficiency?
How sensitive is our method to the hyperparameter?
Is our method effectively regulating the entropy and dynamically adjusting the temperature during learning?
Can the learned policy generalize to unseen situations?
We first evaluate our algorithm on four standard benchmark environments for continuous locomotion tasks in OpenAI Gym benchmark suite . We compare our method to soft actor-critic (SAC)  with a fixed temperature parameter that is tuned for each environment. We also compare to deep deterministic policy gradient (DDPG) , proximal policy optimization (PPO) , and twin delayed deep deterministic policy gradient algorithm (TD3) . All of the algorithms use the same network architecture: all of the function approximators (policy and Q-functions for SAC) are parameterized with a two-layer neural network with 256 hidden units on each layer, and we use ADAM  with the same learning rate of 0.0003 to train all the networks and temperature parameter . For standard SAC, we tune the reward scale per environment using grid search. Poorly chosen reward scales can degrade performance drastically (see (a)). For our method, we simply set the target entropy to be -1 per action dimension (i.e., HalfCheetah has target entropy -6, while Humanoid uses -17).
Figure 3 shows a comparison of the algorithms. The solid line denotes the average performance over five random seeds, and the shaded region corresponds to the best and worst performing seeds. The results indicate that our method (blue) achieves practically identical or better performance compared to standard SAC (orange), which is tuned per environment for all environments. Overall, our method performs better or comparably to the other baselines, standard SAC, DDPG, TD3, and PPO.
We compare the sensitivity to the hyperparameter between our method (target entropy) and the standard SAC (reward scale). Both maximum entropy RL algorithms  and standard RL algorithms  can be very sensitive to the scale of the reward function. In the case of maximum entropy RL, this scale directly affects the trade-off between reward maximization and entropy maximization . We first validate the sensitivity of standard SAC by running experiments on the HalfCheetah, Walker, Ant, and the simulated Minitaur robot (See Section VI-B for more details). (a) shows the returns for a range of reward scale values that are normalized to the maximum reward of the given task. All benchmark environments achieve good performance for about the same range of values, between 1 to 10. On the other hand, the simulated Minitaur requires roughly two orders of magnitude larger reward scale to work properly. This result indicates that, while standard benchmarks offer high variability in terms of task dimensionality, they are homogeneous in terms of other characteristics, and testing only on the benchmarks might not generalize well to seemingly similar tasks designed for different purposes. This suggests that the good performance of our method, with the same hyperparameters, on both the benchmark tasks and the Minitaur task accurately reflects its generality and robustness. (b) compares the sensitivity of our method to the target entropy on the same tasks. In this case, the range of good target entropy values is essentially the same for all environments, making hyperparameter tuning substantially less laborious. It is also worth noting that this large range indicates that our algorithm is relatively insensitive to the choice of this hyperparameter.
Next, we compared how the entropy and temperature evolve during training. (a)
compares the entropy (estimated as an expected negative log probability over a minibatch) on HalfCheetah for SAC with fixed temperature (orange) and our method (blue), which uses a target entropy of -13. The figure clearly indicates that our algorithm is able to match the target entropy in a relatively small number of steps. On the other hand, regular SAC has a fixed temperature parameter and thus the entropy slowly decreases as the Q-function increases.(b) compares the temperature parameter of the two methods. Our method (blue) actively adjusts the temperature, particularly in the beginning of training when the Q-values are small and the entropy term dominates in the objective. The temperature is quickly pulled down so as to make the entropy to match the target. For other simulated environments, we observed similar entropy and temperature curves throughout the learning.
Next, we evaluate our method on a simulated Minitaur locomotion task (Figure 6). Simulation allows us to quantify perturbation robustness, measure states that are not accessible on the robot, and more importantly, gather more data to evaluate our algorithm. To prevent bias of our conclusion, we have also conducted a careful system identification, following Tan et al. , such that our simulated robot moderately represents the real system. However, we emphasize that we do not transfer any simulated policy to the real world —all real-world experiments use only real-world training, without access to any simulator.
Figure 6b compares the learning curve of our method to the state-of-the-art deep reinforcement learning algorithms. Our method is the most data efficient. Note that in order to obtain the result of SAC (fixed temperature) in the plot, we had to sweep though a set of candidate temperatures and choose the best one. This mandatory hyperparameter tuning is equivalent to collecting an order of magnitude more samples, which is not shown in Figure 6b. While the number of steps is a common measure of data efficiency in the learning community, the number of episodes can be another important indicator for robotics because the number of episodes determines the number of experiment reset, which typically is time-consuming and require human intervention. Figure 6c indicates that our method takes fewer numbers of episodes for training a good policy. In the experiments, our algorithm effectively escapes a local minimum of “diving forward,” which is a common cause of falling and early episode termination, by maintaining the policy entropy at higher values (Figure 6d).
The final learned policy in simulation qualitatively resembles the gait learned directly on the robots. We tested its robustness by applying lateral perturbations to its base for seconds with various magnitudes. Even though no perturbation is injected during training, the simulated robot can withstand up to N lateral pushes and subsequently recover to normal walking. This is significantly larger than the maximum N of the best PPO-trained policy that is picked out of learning trials. We suspect that this robustness emerges automatically from the SAC method due to entropy maximization at training time.
In this section, we describe the real-world learning experiments on the Minitaur robot. We aim to answer the following questions:
Can our method efficiently train a policy on hardware without hyperparameter tuning?
Can the learned policy generalize to unseen situations?
Quadrupedal locomotion presents substantial challenges for real-world reinforcement learning. The robot is underactuated, and must therefore delicately balance contact forces on the legs to make forward progress. A suboptimal policy can cause it to lose balance and fall, which will quickly damage the hardware, making sample-efficient learning essential. In this section, we test our learning method and system on a quadrupedal robot in the real world settings. We use the Minitaur robot, a small-scale quadruped with eight direct-drive actuators . Each leg is controlled by two actuators that allow it to move in the sagittal plane. The Minitaur is equipped with motor encoders that measure the motor angles and an IMU that measures the orientation and angular velocity of Minitaur’s base.
In our MDP formulation, the observation includes eight motor angles, roll and pitch angles, and their angular velocities. We choose to exclude the yaw measurement because it drifts quickly. The action space includes the swing angle and the extension of each leg, which are then mapped to desired motor positions and tracked with a PD controller . For safety, we choose low PD gains and to ensure compliant motion. We find that the latencies in the hardware and the partial observation make the system non-Markovian, which significantly degrades the learning performance. We therefore augment an observation space to include a history of the last five observations and actions which results in a dimensional observation space. The reward is defined as:
The function encourages longer walking distance (), which is measured using the motion capture system, and penalizes large joint accelerations (), computed via finite differences using the last three actions. We also find it necessary to penalize a large roll angle of the base () and the joint angles when the front legs () are folded under the robot, which are the common failure cases. The weights are set to respectively and the maximum angle threshold is set to radians.
Our method successfully learns to walk from 160k control steps, or approximately 400 rollouts. Each rollout has the maximum length of 500 steps (equivalent to 10 seconds) and can terminate early if the robot falls. The whole training process takes about two hours. Figure 7 shows the learning curve, The performance is slightly less than the simulation, potentially due to fewer collected samples. Please refer to the supplemental video to see the learning process, the final policy, and more evaluations on different terrains.
The trained robot is able to walk forward at a speed of 0.32m/s (0.8 body length per second). The learned gait swings the front legs once per cycle, while pushing against the ground multiple times with the rear legs (Figure 8a and b). Note that the learned gait is periodic and synchronized, though no explicit trajectory generator, symmetry constraint, or periodicity constraint is encoded into the system. Comparing to the default controller (trotting gait) provided by the manufacturer that walks at a similar speed, the learned gait has similar frequencies (2Hz) and swing amplitudes (0.7 Rad), but has substantially different joint angle trajectories and foot placement. The learned gait has a much wider stance and a lower standing height. We evaluated the robustness of the trained policy against external perturbations by pushing the base of the robot backward (Figure 8c) for approximately one second, or side for around half second. Although the policy has never been trained with such perturbations, it successfully recovered and returned to a periodic gait for all repeated tests.
In the real world, the utility of a locomotion policy hinges critically on its ability to generalize to different terrains and obstacles. Although we trained our policy only on flat terrain (Figure 9, first row), we tested it on varied terrains and obstacles (other rows). Because the SAC method learns robust policies due to entropy maximization at training time, the policy can readily generalize to these perturbations without any additional tweaking. The robot is able to walk up and down a slope (second row), ram through an obstacle made of wooden blocks (third row), and step down stairs (fourth row) without difficulty, despite not being trained in these settings. We repeated these tests for times, and the robot succeeds on all cases.
We presented a complete end-to-end learning system for locomotion with legged robots. The core algorithm, which is based on a dual formulation of an entropy-constrained reinforcement learning objective, can automatically adjust the temperature hyperparameter during training, resulting in a sample-efficient and stable algorithm with respect to hyperparameter settings. It enables end-to-end learning of quadrupedal locomotion controllers from scratch on a real-world robot. In our experiments, a walking gait emerged automatically in two hours of training without the need of prior knowledge about the locomotion tasks or the robot’s dynamic model. A further discussion of this method and results on other platforms can be found in an extended technical report . Compared to sim-to-real approaches  that require careful system identification, learning directly on hardware can be more practical for systems where acquiring an accurate model is hard and expensive, such as for walking on diverse terrains or manipulation of deformable objects.
To the best of our knowledge, our experiment is the first example of an application of deep reinforcement learning to quadrupedal locomotion directly on a robot in the real world without any pretraining in the simulation, and it is the first step towards a new paradigm of fully autonomous real-world training of robot policies. Two of the most critical remaining challenges of our current system are the heavy dependency on manual resets between episodes and the lack of a safety layer that would enable learning on bigger robots, such as ANYmal  or HyQ. In the future work, we plan to address these issues by developing a framework for learning policies that are safety aware and can be trained to automatically reset themselves [10, 24].
The authors gratefully thank Ken Caluwaerts, Tingnan Zhang, Julian Ibarz, Vincent Vanhoucke, and the anonymous reviewers for valuable discussion and suggestions.
International Conference on Machine Learning (ICML), 2017.
Towards a new evolutionary computation, pages 75–102. Springer, 2006.