1 Introduction
Robust and agile locomotion of legged robots based on classical control stacks typically requires accurate dynamics models, human expertise, and tedious manual tuning [1, 2, 3, 4]. As a potential alternative, modelfree reinforcement learning (RL) algorithms optimize the target policy directly and do not assume prior knowledge of environmental dynamics. Recently, they have enabled automation of the design process for locomotion controllers [2, 5, 6, 7]. Yet, all too often, progress with modelfree methods is only demonstrated in simulated environments [8, 9], due to the amount of data required to learn meaningful gaits. Attempting to take these methods to physical legged robots presents major challenges: Namely, how to mitigate the laborious and timeconsuming data collection process [10], and how to minimize hardware wear and tear during exploration? Additionally, what the robot learns is often a taskspecific policy. As a result, adapting to new tasks typically involves finetuning based on new rounds of robot experiments [11].
In this paper, we propose a modelbased learning framework that significantly improves sample efficiency and task generalization compared to modelfree methods. The key idea is to learn a dynamics model from data and consequently plan for action sequences according to the learned model. While modelbased learning is commonly considered a more sampleefficient alternative to modelfree methods, its successful application to legged locomotion has been limited [12]. Our main challenges are threefold. First, the learned model needs to be sufficiently accurate for longhorizon planning, as an inaccurate model can dramatically degrade the performance of the final controller. This is particularly evident for locomotion because of frequent and abrupt contact events. The predicted and real trajectories can quickly diverge after a contact event, even if the singlestep model error is small. The second challenge is realtime action planning at a high control frequency. To maintain balance, locomotion controllers often run at a frequency of hundreds or even thousands of times per second. Therefore, even a short latency in action planning can significantly affect the performance of controller. Finally, safe data collection for model learning is nontrivial. To ensure sufficient exploration, RL algorithms typically drive the actuators using random noise. However, such random actuation patterns can impose a lot of stress on the actuators and cause mechanical failures, especially during the initial stages of training.
Our proposed algorithm addresses the above challenges. During model learning, we use multistep loss to prevent accumulation of errors in longhorizon prediction. To achieve realtime planning, we parallelize a samplingbased planning algorithm on a GPU. Additionally, we plan actions based on a predicted future state using the learned dynamics model to compensate for planning latency. We develop safe exploration strategies using a trajectory generator [13], which ensures that the planned action trajectories are smooth and do not damage the actuators. Combining these three improvements with modelbased learning, stable locomotion gaits can be learned efficiently on a real robot.
The main contribution of our paper is a highly efficient learning framework for legged locomotion. With our framework, a Minitaur robot can successfully learn to walk from scratch after 36 rollouts, which corresponds to 4.5 minutes of data (45,000 control steps) or approximately 10 minutes of robot experimentation time (accounting for the overhead of experiment reset and data collection times). To the best of our knowledge, this is at least an order of magnitude more sample efficient than the stateoftheart onrobot learning method using the same hardware platform [10]. More importantly, we show that the learned model can generalize to new tasks without additional data collection or fine tuning.
2 Modellearning and ModelPredictive Control Loop
We formulate the locomotion problem as a Markov Decision Process (MDP) defined by a state space
, an action space , a state transition distribution , an initial state distribution and a reward function . We apply modelbased RL to solve this MDP, which learns a deterministic dynamics model to approximateby fitting to collected trajectories. The learned model estimates the next state given the current stateaction pair, and is used by an action planner to optimize the cumulative reward. To account for model inaccuracies, we use a model predictive control (MPC) framework that periodically replans based on the latest robot observation.
Using learned models for action planning raises extra challenges for model accuracy. Although a learned dynamics model can generally remain accurate around trajectories in the training data, its performance for unseen stateactions is not guaranteed. As a result, the planner might exploit such imperfections in the model and optimize for actions that are actually suboptimal on the robot. To compensate for this distribution mismatch between training and testing data, we keep track of all collected trajectories in a replay buffer and periodically retrain the model using all trajectories [14]. The updated model is then used to collect more trajectories from the robot, which are added to the replay buffer for future model training (Fig. 1). By interleaving model training and data collection, we improve the model’s accuracy on parts of the state space where the planner is more likely to visit, which in turn increases the quality of the plan.
3 Modelbased Learning for Locomotion
3.1 Accurate Dynamics Modeling with Multistep Loss
We model the difference between consecutive states as a function , where
is a feedforward neural network with parameters
. Given a set of state transitions , a standard way to train the model is by directly minimizing the prediction error:(1) 
Note that although Eq. 1 ensures the model’s accuracy for one time step, it does not prevent the accumulation of errors over longer planning horizons. In previous works, ensembles of models have been exploited to reduce uncertainty and improve the model’s longterm accuracy [15, 16]. However, predicting using ensemble of models requires extra computation that increases planning time. Instead, we introduce a multistep loss function to train the dynamics model, similar to [17, 18]. By propagating the model for steps into the future, and summing over the prediction error at each timestep (Fig. 2), we define:
(2) 
where is the system state estimated by the dynamics model and now contains sequences of state transitions (). When , Eq. 2 reduces to the singlestep loss. As increases, the loss focuses the accuracy of the model over multiple planning steps, making it suitable for longhorizon planning. We empirically validate the effect of multistep loss in Section 5.4.1.
3.2 Efficient Planning of Smooth Actions
We use a model predictive control (MPC) framework to plan for optimal actions. Instead of optimizing for the entire episode offline, MPC replans periodically using the most recent robot state, so that the controller is less sensitive to model inaccuracies. Since replanning happens simultaneously with robot execution, the speed of the planning algorithm is critical to the performance of MPC.
With handcrafted models, a number of efficient planning algorithms have been tested for robot locomotion [19, 20, 21]. However, they either assume a linear dynamics model, or compute model gradients for linear approximations, which is costly to evaluate for neural networks. Instead, we use the Cross Entropy Method (CEM) to plan for optimal actions [22]. CEM is an efficient, derivativefree optimization method that is easily parallelizable and less prone to local minima. It has demonstrated good performance in optimizing neural network functions [15, 23]
and can handle nonsmooth reward functions. To compute an action plan, CEM samples a population of action sequences at each iteration, fits a normal distribution to the best samples, and samples the next population from the updated distribution in the next iteration.
Sampling each action independently in an step action sequence is unlikely to generate high quality plans. While good plans often consist of actions that are smooth and periodic, timeindependent samples are more likely to produce jerky motions, making it difficult for CEM to select smooth actions. Instead, we apply a filter to smooth out the noises added to the mean action. Given a filter coefficient , we first generate timecorrelated samples from i.i.d. normally distributed samples :
(3)  
(4) 
Given the mean and standard deviation of the action sequences
, the sampled actions are computed as . Note that each sampled action still follows the desired normal distribution . However, the exploration noise in consecutive samples is now timecorrelated, favoring smooth actions that are less likely to damage the actuators.3.3 Online Replanning in Presence of Latency
In classic MPC, replanning happens at every timestep, and only the first action of a planned action sequence is executed. As a result, planning speed directly affects control frequency, and thus largely limits the capabilities of the controller. To minimize the effect of planning time on action execution, we parallelize action planning and execution. As the planning thread optimizes for actions in the background, the execution thread applies planned actions at the desired control frequency.
We carefully handle the synchronization between the planning and execution threads to minimize the effect of planning latency. Assume planning takes timesteps to finish. Naively, the planner reads the current state and plans optimal actions . However, by the time planning has completed, timesteps will have passed, and the actions are no longer optimal for the new system state (Fig. 2(a)). To mitigate this, we propagate the initial state forward by timesteps using previously planned actions and the learned dynamics model, and feed the estimated state, to the planner (Fig. 2(b)). The planned actions, , are then executed since time . This technique, which we call “asynchronous control”, provides the planner with a more accurate estimation of the starting state, and significantly increases the plan quality in the presence of system latency.
4 Safe Exploration with Trajectory Generators
Whereas formulating the action space using desired motor angles is easier to learn [24], controlling the motors in position control mode can result in abrupt changes in desired motor angles, which may cause large torque output that could potentially damage the robot and its surroundings. Instead, we use trajectory generators (TGs) to encourage smooth trajectories. Similar to [13], TGs output periodic trajectories in the extension space of each leg (Fig. 3(a)), and can be modulated by the planner for more complex behaviors.
We use four independent trajectory generators to control all four legs of the robot. Each TG maintains an internal phase and controls the leg extension following a periodic function:
(5) 
Here are parameters for the TG. As the phase evolves, the TG alternates between the stance mode () and lift mode (). We choose a different amplitude for each mode of the TG, and rescale the original phase to so that the resulting leg extension is a continuous function. Note that the TGs do not control the leg swing angles. As a result, our planner starts with an openloop TG that generates an inplace stepping gait (Fig. 3(b)).
We augment the state and action space of the environment so that the planner can interact with TGs (Fig. 3(c)). Our new action space is 12 dimensional. The first 8 dimensions correspond to the swing and extension residual of each leg, which is added to the TG outputs before the command is sent to the robot. The residuals allow the planner to complement the TG outputs for more complex behaviors. The remaining 4 dimensions specify the phase scales for each TG at time , so that the phase of each TG can be propagated independently . This gives the controller additional freedom to synchronize arbitrary pairs of legs and coordinate for varied gait patterns. Finally, we augment the state space with the phase of each TG to make the state of TGs fully observable.
5 Experiments
5.1 Experimental Setup
We use the Minitaur robot from Ghost Robotics [25] as the hardware platform for our experiments. We run our controller with a timestep of 6ms. Similar to [5], the controller outputs desired swing and extension of each leg, which is converted to desired motor positions and tracked by a Proportional Derivative (PD) controller.
We include base linear velocity, IMU readings (roll, pitch, and yaw), and motor positions in the state space of the robot, where the readings come from motion capture (PhaseSpace Inc. Impulse X2E) and onboard sensors. The state space is 18dimensional (TG state and sensors). Similar to [10], we concatenate a history of the past four observations as the input to our dynamics model to account for hardware latency and partial observability of the system. The dynamics are modeled as a feedforward neural network with one hidden layer of 256 units and tanh activation. We choose as the number of steps to propagate the model and compute the loss.
For MPC, we run CEM for 5 iterations with 400 samples per iteration and a planning horizon of 75 control steps (450 ms). We implement our algorithm in JAX [26] for compiled execution and run the algorithm on a Nvidia GTX 1080 GPU. With software and hardware acceleration, our CEM implementation executes in less than 60ms. We replan every 72ms to handle model inaccuracies.
In all experiments where we collect data to train the model, the robot’s task is to walk forward following a desired speed profile over an episode of 7.5 seconds. The desired speed starts at zero and increases linearly to a top speed of 0.66 m/s within the first 3 seconds, and remains constant for the rest of the episode. The reward we use is , where and are the current and desired walking speed, and are the roll, pitch, and yaw of the base. The second term encourages walking forward, and the last term stabilizes the base during walking.
5.2 Learning on Hardware
Our method successfully learns a dynamics model based on data from a real robot and optimizes a forward walking gait in only 36 episodes (45,000 control steps), which corresponds to approximately 10 minutes of robot time, including rollouts, data collection, and experiment resets (Fig. 4(a) top). We update the dynamics model every 3 episodes. The robot tracks a desired speed of 0.66 m/s (Fig. 4(b)), or 1.6 body lengths per second, which is twice the fastest speed achieved by [10]. The entire learning process, including data collection and offline model training, takes less than one hour to complete. Please refer to the attached video for the learning process.
It is important to interleave data collection and model training, and update the dynamics model using new data (Fig. 4(b)). Initially, when trained only on random trajectories, the model cannot predict the robot dynamics accurately, and MPC only achieves a slow forward velocity. As more data is collected, the model becomes more accurate in the part of the state space which the planner is likely to utilize, leading to better planning performance.
Periodic and distinctive gait patterns develop as the training proceeds (Fig. 4(c)). With TGs providing the underlying trajectory, MPC swings the legs forward in the lift phase and backward in the stance phase, leading to a periodic forwardwalking behavior. Note that TGs affect leg extensions only, and the leg swing angles are controlled exclusively by MPC. Additionally, the ability for MPC to control the phase of each TG allows individual legs to be coordinated. In the learned gait, MPC swings the four legs in succession, resulting in a walking pattern.
We also test the robustness of MPC on an unseen terrain. We place a slope in the robot’s path (Fig. 4(a) bottom). Although the robot has not trained on the slope, it still can maintain a periodic gait using MPC (Fig. 4(d)). The robot’s pitch angle shows slight perturbations while walking uphill and downhill, but the robot remains upright most of the time.
5.3 Generalization to Unseen Tasks
We test the ability of our learned dynamics model to generalize to unseen tasks. We take the dynamics model learned in Section 5.2, which is trained for walking forward, and perform MPC on new tasks with unseen reward functions. For example, to make the robot turn left, we change the reward function to for a desired turning rate , where is approximated by finite difference.
Even though we only train our dynamics model on the task of walking forward, the model is sufficiently accurate to allow MPC to plan for new tasks, including walking backwards and turning (Fig. 5(a), 5(b)). By learning the dynamics instead of the policy, our algorithm achieves zeroshot generalization to related tasks.
5.4 Ablation Study
To evaluate the importance of the key components of our algorithm, including multistep loss, asynchronous control and trajectory generators, we perform an ablation study in a highly accurate, opensource simulation of Minitaur [5]
. Simulations help us collect a larger amount of data and reduce the variance of analysis due to algorithmic and environmental stochasticity.
5.4.1 Dynamics Modeling with Multistep Loss
We find that the number of steps ( in eq. 2) to compute the model loss
is an important hyperparameter that affects the model accuracy. Without multistep loss, the model cannot accurately track the robot dynamics over a long horizon and the MPC controller does not achieve a high reward (Fig.
6(a)). We further validate this by training models using multiple values of and testing their performances on an unseen trajectory (Fig. 6(b)). With more timesteps propagated in computing the model loss, the trained model tracks the ground truth trajectory increasingly better. Note that the plotted state is the velocity of the robot, for which the planner directly optimizes. Inaccurate estimation of the robot velocity is likely to result in suboptimal planning. We choose as a tradeoff between model accuracy and training time.5.4.2 Asynchronous CEM Controller
Planning with asynchronous control is important for fast locomotion (Fig. 6(c)). Without asynchronous control, the MPC controller could only track the desired speed up to approximately 0.4m/s. As the robot moves faster, the robot states can change rapidly even within a few timesteps. Therefore, it is important to perform planning with respect to an accurate state. This is also illustrated in Fig. 6(a), where the system struggles to achieve a good final reward without asynchronous control.


We identify additional important hyperparameters for CEM in Table. 1: CEM requires at least 5 iterations for optimal performance. While smoothing out sampled actions can significantly improve the plan quality, excessive smoothing can make the legs overly compliant for dynamic behaviors. It is important to plan over a sufficiently long horizon to optimize for longterm return. On the other hand, planning for too long makes the planner susceptible to imperfections of the model.
5.4.3 Role of Trajectory Generators
The TGs play an important role in regulating planned actions and ensuring periodicity of leg motion. In Fig. 6(d), we compare the final behavior of MPC using models trained with and without TG. While both rollouts achieve similar total reward, planning with TG smooths the motor actions and makes the leg behavior periodic. The learning process is also less stable without TG (Fig. 6(a)). We attempted to learn a model without TG on the real robot. The motors overheated quickly and the jerky motions damaged the motor mounts, forcing us to stop the experiment early.
5.4.4 Comparison with Modelfree Algorithms
We compare the sample efficiency of our modelbased learning method with modelfree ones (Fig. 8). We obtain the implementations of modelfree algorithms from TFAgents [27]. As a stateoftheart onpolicy algorithm, Proximal Policy Optimization (PPO) [28] achieves a similar reward but requires nearly 1000 times more samples, making it difficult to run on the real robot. While the offpolicy method, Soft Actor Critic (SAC) [29], significantly improved sample efficiency and has been demonstrated to learn walking on a Minitaur robot [10], it still requires an order of magnitude more samples compared to our method, with a less stable learning curve.
6 Perspectives and Future Work
The combination of accurate longhorizon dynamics learning with multistep loss functions, careful handling of realtime requirements by compensating for planning latency, and embedding periodicity priors into MPC walking policies, yields an approach that requires only 4.5 minutes of realworld data collection to induce robust and fast gaits on a quadruped robot. Such learning efficiency is more than an order of magnitude superior to modelfree methods. The learnt dynamics model can then be reused to induce new locomotion behaviors.
Yet, many questions remain to be answered in future work: How can rigidbody dynamics be best combined with function approximators for even greater sample efficiency, how should predictive controllers be made aware of model misspecification, and how should predictive uncertainty be best captured and exploited for improved exploration and realtime online adaptation to enable more agile and complex behaviors? Interfacing vision, contact sensing and other perceptual modules with an endtoend model learning and realtime planning stack is also critical for greater autonomy.
References
 Kuindersma et al. [2016] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake. Optimizationbased locomotion planning, estimation, and control design for the atlas humanoid robot. Autonomous Robots, 40(3):429–455, 2016.
 Hutter et al. [2017] M. Hutter, C. Gehring, A. Lauber, F. Gunther, C. D. Bellicoso, V. Tsounis, P. Fankhauser, R. Diethelm, S. Bachmann, M. Blösch, et al. Anymaltoward legged robots for harsh environments. Advanced Robotics, 31(17):918–931, 2017.
 Bledt et al. [2018] G. Bledt, M. J. Powell, B. Katz, J. Di Carlo, P. M. Wensing, and S. Kim. MIT cheetah 3: Design and control of a robust, dynamic quadruped robot. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2245–2252, Oct 2018.
 Kim et al. [2017] S. Kim, P. M. Wensing, et al. Design of dynamic legged robots. Foundations and Trends® in Robotics, 5(2):117–190, 2017.
 Tan et al. [2018] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Simtoreal: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018.
 Xie et al. [2018] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne. Feedback control for cassie with deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1241–1246. IEEE, 2018.
 Hwangbo et al. [2019] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
 Heess et al. [2017] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, M. Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Peng et al. [2017] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4):41, 2017.
 Haarnoja et al. [2019] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine. Learning to walk via deep reinforcement learning. In Proceedings of Robotics: Science and Systems, FreiburgimBreisgau, Germany, June 2019.

Finn et al. [2017]
C. Finn, P. Abbeel, and S. Levine.
Modelagnostic metalearning for fast adaptation of deep networks.
In
Proceedings of the 34th International Conference on Machine LearningVolume 70
, pages 1126–1135. JMLR. org, 2017.  Nagabandi et al. [2018] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine. Neural network dynamics for modelbased deep reinforcement learning with modelfree finetuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 Iscen et al. [2018] A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, and V. Vanhoucke. Policies modulating trajectory generators. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 916–926. PMLR, 29–31 Oct 2018.

Ross et al. [2011]
S. Ross, G. Gordon, and D. Bagnell.
A reduction of imitation learning and structured prediction to noregret online learning.
InProceedings of the fourteenth international conference on artificial intelligence and statistics
, pages 627–635, 2011.  Chua et al. [2018] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4754–4765, 2018.
 Clavera et al. [2018] I. Clavera, J. Rothfuss, J. Schulman, Y. Fujita, T. Asfour, and P. Abbeel. Modelbased reinforcement learning via metapolicy optimization. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 617–629. PMLR, 29–31 Oct 2018.
 Talvitie [2017] E. Talvitie. Selfcorrecting models for modelbased reinforcement learning. In ThirtyFirst AAAI Conference on Artificial Intelligence, 2017.
 Talvitie [2014] E. Talvitie. Model regularization for stable sample rollouts. In UAI, pages 780–789, 2014.
 Neunert et al. [2018] M. Neunert, M. Stäuble, M. Giftthaler, C. D. Bellicoso, J. Carius, C. Gehring, M. Hutter, and J. Buchli. Wholebody nonlinear model predictive control through contacts for quadrupeds. IEEE Robotics and Automation Letters, 3(3):1458–1465, 2018.
 Di Carlo et al. [2018] J. Di Carlo, P. M. Wensing, B. Katz, G. Bledt, and S. Kim. Dynamic locomotion in the mit cheetah 3 through convex modelpredictive control. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–9. IEEE, 2018.
 Apgar et al. [2018] T. Apgar, P. Clary, K. Green, A. Fern, and J. W. Hurst. Fast online trajectory optimization for the bipedal robot cassie. In Robotics: Science and Systems, 2018.
 Rubinstein and Kroese [2004] R. Y. Rubinstein and D. P. Kroese. The crossentropy method: A unified approach to monte carlo simulation, randomized optimization and machine learning. Information Science & Statistics, Springer Verlag, NY, 2004.
 Kalashnikov et al. [2018] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine. Scalable deep reinforcement learning for visionbased robotic manipulation. In Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 651–673. PMLR, 29–31 Oct 2018.
 Peng and van de Panne [2017] X. B. Peng and M. van de Panne. Learning locomotion skills using deeprl: Does the choice of action space matter? In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, page 12. ACM, 2017.
 Kenneally et al. [2016] G. Kenneally, A. De, and D. E. Koditschek. Design principles for a family of directdrive legged robots. IEEE Robotics and Automation Letters, 1(2):900–907, 2016.
 Frostig et al. [2018] R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, and S. WandermanMilne. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro,
Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Neal Wu, Chris Harris,
Vincent Vanhoucke, Eugene Brevdo [2018]
Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan
Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Neal Wu, Chris Harris, Vincent
Vanhoucke, Eugene Brevdo.
TFAgents: A library for reinforcement learning in tensorflow.
https://github.com/tensorflow/agents, 2018. URL https://github.com/tensorflow/agents. [Online; accessed 25June2019].  Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018.