Humanoid robots are advantageous for mobility in tight spaces. However, fast bipedal locomotion requires precise control of the contact transition process. There are many successful studies addressing agile legged locomotion. Model-free approaches, such as the Policy Gradient (PG) method used in Deep Reinforcement Learning (DRL), rely on data and function approximation via neural networks. Model-based approaches employ the differential dynamics of robots to synthesize locomotion controllers. Our work leverages the advantages of data-driven methods and model-based approaches in a safe and efficient manner.
Recent work has shown the possibility of robust and agile locomotion control through model-free learning. In [1], "locomotors" were trained for various environments and were able to achieve robust behaviors. In [2], a policy is trained on a joint space trajectory generated by motion capture data from humans [3]. The work in [4] learns local models of the robot for locomotion, while the work in [5] penalizes asymmetric motions to achieve energy-efficient motions. However, model-free learning approaches are limited by data inefficiency, unsafe policy exploration, and jerky motions.
On the other hand, model-based approaches decouple the problem into two sub-problems: 1) reduce the complexity of full-body dynamics via simplified models such as the Inverted Pendulum [6, 7, 8, 9] or the Centroidal Model [10, 11, 12], and then 2) compute a feedback joint torque command that makes the robot track the behavior of the simplified model. In our recent studies [13, 14], we achieved unsupported passive-ankle dynamic locomotion via two computational elements: 1) a high-level footstep planner, dubbed the Time-to-Velocity-Reversal (TVR) planner, based on the Linear Inverted Pendulum Model (LIPM), and 2) a low-level Whole Body Controller (WBC) that tracks the desired trajectories. However, because the LIPM only approximates the full robot dynamics, the WBC exhibits significant footstep tracking errors when following trajectories generated by the TVR planner.
In this paper, we devise a Markov Decision Process (MDP) for locomotion and employ Control Barrier Functions (CBFs) for safe learning. In contrast to model-free approaches, whose MDP is characterized by sensor data and joint torques at every control loop, our formulation augments the walking pattern generator with a model-free approach. More precisely, at the moment the walking pattern is computed, we define actions related to footstep locations and use them for learning. Our objective is to find an optimal policy for the desired foot locations. We continuously update the foot location policy using the PG method and DRL. The policy is designed based on three components: the TVR planner, the parametric neural network stochastic policy, and the safety controller. Here, the TVR planner provides a good initial offset for the parametric neural network policy, which helps efficient learning. The parametric neural network takes arbitrary actions, explores the state space of the robot, and optimizes its parameters so that the long-term reward is maximized. The safety controller corrects the policy so that the robot is prevented from being steered into unsafe regions of the state space. To design safe actions, we learn the discrepancies between the LIPM and the simulated robot using a Gaussian Process (GP).
The proposed MDP formulation and learning framework have the following advantages: 1) the learned policy compensates for tracking errors that the simplified model cannot capture, for example the effects of limb dynamics and angular momentum; 2) it provides data efficiency and safe exploration during the learning process, and the policies for both forward walking and turning converge after relatively few iterations; 3) since the LIPM approximates biped robots in general and the WBC is a task-oriented feedback controller, the proposed algorithm is scalable to many types of biped robots.
The remainder of this paper is organized as follows: Section II describes a model-based approach for biped locomotion and DRL with safety guarantees. Section III proposes an MDP formulation, and Section IV shows how we compose and update the policy effectively and safely. Section V evaluates the proposed framework in simulation for forward walking on a 10 Degree-of-Freedom (DoF) biped, dubbed DRACO, and for a turning behavior of the 23-DoF humanoid robot ATLAS. Finally, Section VI concludes the paper.
$\mathbb{R}$ denotes the real numbers, and $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_{\leq 0}$ are the sets of non-negative and non-positive real numbers. $\mathbb{N}$ is used for natural numbers. Given $a, b \in \mathbb{N}$, where $a < b$, the set of natural numbers in the interval $[a, b]$ is denoted by $\mathbb{N}_{[a,b]}$. The sets of $n$-dimensional real vectors and $n \times m$ real matrices are denoted by $\mathbb{R}^{n}$ and $\mathbb{R}^{n \times m}$, respectively. Given $x \in \mathbb{R}^{n}$ and $y \in \mathbb{R}^{m}$, $[x; y] \in \mathbb{R}^{n+m}$ represents their concatenation. The $n \times m$ matrix whose elements are all one is denoted by $\mathbf{1}_{n \times m}$, and the identity matrix is represented as $I_{n}$. The general Euclidean norm is denoted by $\|\cdot\|$, and the inner product in the vector space is denoted by $\langle \cdot, \cdot \rangle$. $\mathbb{E}[\cdot]$ represents the probabilistic expectation operator.
II-B A Model-based Approach to Locomotion
In this subsection, we summarize how locomotion behaviors are represented and achieved by the Walking Pattern Generator (WPG) and the WBC. Locomotion behaviors are manifested as stabilizing leg contact changes (coordinated by a state machine) triggered by either pre-defined temporal specifications or foot contact detection sensors. Here, we define a Locomotion State and a state machine with simple structures to represent locomotion behaviors.
(Locomotion State) A locomotion state is defined as a tuple of a behavior label and a time duration.
The behavior label represents a semantic expression of locomotion behaviors: double support, lifting a leg, or landing a leg.
Subscripts describe locomotion states for double support, lifting the right/left leg, and landing the right/left leg.
The time duration specifies how long the Locomotion State lasts.
(State Machine) We define a state machine as a sequence of Locomotion States.
Each Locomotion State is terminated after its time duration elapses and is switched to the next state.
A Locomotion State could be terminated before its time duration elapses when contact is detected between the swing foot and the ground.
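As a rough illustration, the state machine above can be sketched in a few lines; the state names and durations below are placeholders, not the values used for DRACO or ATLAS:

```python
from enum import Enum

class LocomotionState(Enum):
    DS = "double_support"   # both feet on the ground
    LIFT_R = "lift_right"   # lifting the right leg
    LIFT_L = "lift_left"    # lifting the left leg
    LAND_R = "land_right"   # landing the right leg
    LAND_L = "land_left"    # landing the left leg

# Cyclic sequence of Locomotion States with hypothetical durations (seconds).
SEQUENCE = [
    (LocomotionState.DS, 0.05),
    (LocomotionState.LIFT_R, 0.15),
    (LocomotionState.LAND_R, 0.15),
    (LocomotionState.DS, 0.05),
    (LocomotionState.LIFT_L, 0.15),
    (LocomotionState.LAND_L, 0.15),
]

def advance(index, elapsed, contact_detected=False):
    """Return the next state index: switch when the duration elapses,
    or early for a landing state when swing-foot contact is detected."""
    state, duration = SEQUENCE[index]
    early = contact_detected and state in (
        LocomotionState.LAND_R, LocomotionState.LAND_L)
    if elapsed >= duration or early:
        return (index + 1) % len(SEQUENCE)
    return index
```

The early-termination branch mirrors the contact-triggered switching described above.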
Based on the state machine, we further define an Apex Moment and a Switching Moment.
(Apex Moment and Switching Moment) Given the state machine, an Apex Moment is defined as an instant when a lifting state is switched to a landing state. A Switching Moment is defined as an instant in the middle of the double support state.
Let us consider the LIPM as our simplified model. The state of the LIPM is defined as the position and velocity of the Center of Mass (CoM) of the robot on a constant-height surface and is denoted by $x := [p; \dot{p}] \in \mathcal{X}$, where $\mathcal{X}$ represents a manifold embedded in $\mathbb{R}^{4}$ with the LIPM dynamics. The stance of the LIPM is defined as the location of the pivot and denoted by $s \in \mathbb{R}^{2}$. The input of the LIPM is defined as the desired location of the next stance and denoted by $u \in \mathbb{R}^{2}$. These nomenclatures are used with a subscript $k$ to represent properties in the $k$th step, e.g., $x_k$, $s_k$, and $u_k$. When the LIPM is regulated by the state machine, we further use subscripts $a$ and $s$ to denote the properties of the robot at the Apex Moment and the Switching Moment in the $k$th step. For example, $x_{k,a}$ and $x_{k,s}$ denote the states of the LIPM at the Apex Moment and the Switching Moment in the $k$th step. Since the stance and input of the LIPM are invariant within a step, $s_{k,a}$ and $u_{k,a}$ are interchangeable with $s_k$ and $u_k$. Beyond the simplified model, properties of the actual robot can be represented with the same subscripts. For instance, $\theta_{k,a}$ and $\dot{\theta}_{k,a}$ represent the orientation and angular velocity of the base link of the robot with respect to the world frame at the Apex Moment in the $k$th step, respectively. Fig. 1 illustrates the state machine and the abstraction of the locomotion behavior with the LIPM.
Given the state machine and the nomenclature above, the goal of the WPG is to generate the desired location of the next stance and the CoM trajectory based on the state and stance of the LIPM at the Apex Moment of each step. From the walking pattern, the WBC computes sensor-based feedback control loops and the torque command for the robot to track the desired location of the next stance and the CoM trajectory. Note that the WPG designs the pattern once at the Apex Moment of each step, whereas the WBC computes the feedback torque command at every control loop.
II-C TVR Planner
As a WPG, the TVR planner decides the desired location of the next stance based on the LIPM. The differential equation of the LIPM is represented as follows:

$\ddot{p} = \frac{g}{h}\,(p - s), \qquad (1)$
where $g$ is the gravitational constant and $h$ is the constant height of the CoM of the point mass.
This subsection considers the $k$th stepping motion and shows how the TVR planner designs the desired location of the next stance. Given an initial condition $(p(0), \dot{p}(0))$ and a stance position $s$, the solution of Eq. (1) yields a state transition map with expression

$p(t) = (p(0) - s)\cosh(\omega t) + \frac{\dot{p}(0)}{\omega}\sinh(\omega t) + s, \qquad (2)$
$\dot{p}(t) = \omega\,(p(0) - s)\sinh(\omega t) + \dot{p}(0)\cosh(\omega t),$

where $\omega := \sqrt{g/h}$ is the natural frequency of the LIPM.
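Under the standard LIPM assumptions, this closed-form transition can be sketched in a few lines; the CoM height of 1.0 m is an illustrative value, not a parameter of DRACO or ATLAS:

```python
import numpy as np

def lipm_transition(p0, v0, s, t, g=9.81, h=1.0):
    """Closed-form LIPM state transition for one horizontal axis:
    returns (p(t), v(t)) from initial CoM position p0, velocity v0,
    and stance location s."""
    w = np.sqrt(g / h)  # natural frequency of the pendulum
    p = (p0 - s) * np.cosh(w * t) + (v0 / w) * np.sinh(w * t) + s
    v = (p0 - s) * w * np.sinh(w * t) + v0 * np.cosh(w * t)
    return p, v
```

For a CoM ahead of the stance with zero velocity, the state diverges away from the pivot over time, as expected of an inverted pendulum.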
Since the TVR planner decides the desired location of the next stance at the Apex Moment, we set the initial condition of Eq. (2) to the state at the Apex Moment. With the pre-specified time duration until the Switching Moment, we compute the state at the Switching Moment by evaluating the transition map of Eq. (2).
From the state at the Switching Moment, the TVR planner computes the desired location of the next stance such that the sagittal velocity (lateral velocity, respectively) of the CoM is driven to zero at time $t'$ ($t''$, respectively) after the LIPM switches to the new stance. The constraints are expressed as

$\dot{p}^{x}(t') = 0, \qquad \dot{p}^{y}(t'') = 0, \qquad (4)$
where $t'$ and $t''$ are pre-specified temporal parameters. From Eq. (4), the desired location of the next stance is computed with an additional bias term proportional to the error between the CoM position and a desired CoM position, giving Eq. (5). Note that Eq. (5) is a simple proportional-derivative controller whose gain parameters keep the CoM converging to the desired position. A more detailed derivation of the LIPM is described in our previous work [13, 14].
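A one-axis sketch of the velocity-reversal rule, derived by setting the velocity in Eq. (2) to zero at time t' under the new stance; the proportional gain `kappa` and all numeric values are placeholders:

```python
import numpy as np

def tvr_footstep(p_sw, v_sw, t_prime, p_des, kappa=0.1, g=9.81, h=1.0):
    """Desired next stance location (one axis) that drives the CoM
    velocity to zero t_prime seconds after switching, plus a
    proportional bias toward the desired CoM position p_des."""
    w = np.sqrt(g / h)
    # Velocity-reversal condition: v(t_prime) = 0 under the new stance u.
    u = p_sw + v_sw * np.cosh(w * t_prime) / (w * np.sinh(w * t_prime))
    # Proportional bias keeps the CoM converging toward p_des.
    u += kappa * (p_sw - p_des)
    return u
```

Substituting the returned stance back into the LIPM solution confirms that the CoM velocity crosses zero at t'.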
II-D Reinforcement Learning with Safety Guarantees
Consider an infinite-horizon discounted MDP with control-affine, deterministic dynamics defined by the tuple $(\mathcal{S}, \mathcal{A}, f, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $f$ is the deterministic dynamics, which is affine in the control, $r$ is the reward function, $\rho_0$ is the distribution of the initial state, and $\gamma \in (0, 1)$ is the discount factor. The control-affine dynamics are written as

$s_{t+1} = f(s_t) + g(s_t)\,a_t + d(s_t), \qquad (6)$

where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ denote the state and input, $f$ and $g$ are the nominal under-actuated and actuated dynamics, and $d$ is the unknown part of the system dynamics. Moreover, let $\pi_{\theta}(a \mid s)$ denote a stochastic control policy parameterized by a vector $\theta$ that maps states to distributions over actions, and let $J(\pi_{\theta})$ denote the policy's expected discounted reward:

$J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \qquad (7)$

where $\tau = (s_0, a_0, s_1, \ldots)$ is a trajectory drawn from the policy $\pi_{\theta}$.
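The objective in Eq. (7) is estimated in practice from sampled trajectories; a minimal sketch of the discounted return of one trajectory:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return sum_t gamma^t * r_t of a single trajectory,
    a Monte-Carlo sample of the objective J(pi)."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g
```

Averaging this quantity over many rollouts gives the Monte-Carlo estimate of the expectation over trajectories.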
To achieve safe exploration in the learning process under the uncertain dynamics, the work in [16] employed a Gaussian Process (GP) to approximate the unknown part of the dynamics $d(\cdot)$ from a dataset by learning a mean estimate $\mu_d(\cdot)$ and an uncertainty $\sigma_d^2(\cdot)$, yielding the high-probability bound

$d(s) \in \left[\mu_d(s) - k_{\delta}\,\sigma_d(s),\; \mu_d(s) + k_{\delta}\,\sigma_d(s)\right], \qquad (8)$

where $k_{\delta}$ is a design parameter for confidence (e.g. $k_{\delta} = 2$ for 95% confidence). Then, the control input is computed so that the following state stays within a given invariant safe set by solving a CBF-based program (Eq. (9)).
III MDP Formulation
In this section, we define MDP components for data efficient and safe learning. Our MDP formulation augments the TVR planner with a model-free approach. We define a set of states and a set of actions associated with the Apex Moment in each step:
where the number of steps can be set to infinity when considering locomotion with an infinite number of steps.
Recall from the nomenclature in Section II-B that the state, the stance, and the input of the LIPM are defined at the Apex Moment in the $k$th step, and that the stance and input at the Apex Moment are interchangeable with those of the whole step. Moreover, the orientation and the angular velocity of the base link are taken at the same moment.
Based on Eq. (2), we define the transition function in the MDP as
The unknown term in Eq. (10) represents the part of the dynamics fitted via Eq. (8). (We use a squared-exponential kernel for the GP prior in the implementation.) The uncertainty is attributed to discrepancies between the simplified model and the simulated robot. Note that the dynamics of the lower part of the states, i.e., the base orientation and angular velocity, cannot be expressed in closed form. Therefore, we optimize our policy in a model-free sense, but utilize the CoM dynamics to provide safe exploration and data efficiency in the learning process.
To improve the locomotion behavior, we define the following reward function:
Given the CoM state and the Euler ZYX representation of the base orientation, the reward consists of an alive bonus, a penalty on roll and pitch deviations to keep the body upright, a penalty for diverging from the desired CoM position and heading of the robot, a term for steering the robot toward a desired velocity, and a penalty on excessive control input.
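One plausible shape for such a reward; the weights below are hypothetical placeholders, as the paper's scaling constants are given in Table I:

```python
import numpy as np

def reward(roll, pitch, com_err, yaw_err, vel, vel_des, u_delta,
           w_alive=1.0, w_upright=0.5, w_track=0.5, w_vel=0.5, w_u=0.01):
    """Per-step reward: alive bonus minus penalties for body tilt,
    CoM/heading tracking error, velocity mismatch, and control effort.
    All weights are illustrative, not the values used in the paper."""
    r = w_alive
    r -= w_upright * (roll ** 2 + pitch ** 2)           # stay upright
    r -= w_track * (np.dot(com_err, com_err) + yaw_err ** 2)
    r -= w_vel * (vel - vel_des) ** 2                    # velocity steering
    r -= w_u * np.dot(u_delta, u_delta)                  # control penalty
    return r
```

At a perfectly tracked state the reward reduces to the alive bonus, and each deviation strictly decreases it.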
IV Policy Search
Our goal is to learn an optimal policy for desired foot locations. We use Proximal Policy Optimization (PPO) [18] to optimize the policy iteratively. PPO defines an advantage function $A^{\pi}(s, a) := Q^{\pi}(s, a) - V^{\pi}(s)$, where $Q^{\pi}(s, a)$ is the state-action value function that evaluates the return of taking action $a$ at state $s$ and following the policy $\pi$ thereafter. The policy is updated by maximizing the modified objective function

$L(\theta) = \mathbb{E}_{t}\!\left[\min\!\left(w_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(w_t(\theta), 1 - \epsilon, 1 + \epsilon\right)\hat{A}_t\right)\right],$

where $w_t(\theta) = \pi_{\theta}(a_t \mid s_t)\,/\,\pi_{\theta_{old}}(a_t \mid s_t)$ is the importance re-sampling term that allows us to use the dataset collected under the old policy $\pi_{\theta_{old}}$ to estimate the objective for the current policy $\pi_{\theta}$, and $\hat{A}_t$ is a short notation for the advantage estimate at time step $t$. The $\mathrm{clip}$ and $\min$ operators ensure that the policy $\pi_{\theta}$ does not change too much from the old policy $\pi_{\theta_{old}}$.
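The clipped surrogate can be sketched directly from its definition; `eps=0.2` is PPO's commonly used default, and in practice the objective is maximized by stochastic gradient ascent on the policy parameters:

```python
import numpy as np

def ppo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped PPO objective (to be maximized), averaged over samples."""
    ratio = np.exp(logp_new - logp_old)            # importance weights w_t
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When a positive-advantage action becomes much more likely, the clip caps its contribution, which is what keeps the new policy close to the old one.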
IV-A Safe Set Approximation
In this subsection, we compute a safe set and a CBF to design a safe policy. The work in [19] introduced the instantaneous capture point, which enables the LIPM to come to a stop if it were to instantaneously place and maintain its stance there. Here, we consider the one-step capture region for the LIPM at the Apex Moment of the $k$th step:
where the maximum step length that the LIPM can reach, together with the associated kinematic bounds, is obtained from the kinematics of the robot, and a pre-defined temporal parameter represents the time to land the swing foot. We conservatively approximate the ellipsoid of Eq. (12) with a polytope and define the safe set as
The safe set in Eq. (13) denotes the set of LIPM state and stance pairs that can be stabilized, without falling, by taking one step. In other words, if the LIPM state and stance pair is inside the safe set, there always exists a location for the next stance that stabilizes the LIPM. The projection onto the horizontal plane of the actual one-step capture region and its approximation is shown in Fig. 2(b).
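With a polytopic inner approximation, safe-set membership reduces to checking linear inequalities. A minimal sketch; the box bounds below (a hypothetical position/velocity box rather than the paper's polytope) are purely illustrative:

```python
import numpy as np

def in_safe_set(z, A, b):
    """Check membership of the LIPM state z (relative to the stance)
    in a polytopic approximation {z : A z <= b} of the one-step
    capture region."""
    return bool(np.all(A @ z <= b + 1e-9))

# Hypothetical box approximation: |p| <= 0.3 m, |v| <= 0.8 m/s.
A_box = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b_box = np.array([0.3, 0.3, 0.8, 0.8])
```

Linear membership tests like this are what make the safety constraint cheap to embed in the per-step policy correction.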
IV-B Safety-Guaranteed Policy Design
For data-efficient and safe learning, we design our control input at the $k$th step as the sum of three components:

$u_k = u_k^{tvr} + u_k^{\theta} + u_k^{safe}, \qquad (14)$
where the first component is computed by the TVR planner and the second is drawn from a stochastic Gaussian policy $\mathcal{N}(\mu_{\theta}(s), \Sigma_{\theta}(s))$, where $\mu_{\theta}$ and $\Sigma_{\theta}$ denote the mean vector and the covariance matrix parameterized by $\theta$. (In the implementation, we choose a fully connected network with two hidden layers.)
Given the TVR and Gaussian policy components, the safety component ensures that the next LIPM state and stance pair, steered by the final control input, stays inside the safe set of Eq. (13). In our problem, Eq. (9) is modified as
where a slack variable relaxes the safety constraint and a large penalty constant discourages safety violation. The first segment of the inequality constraints represents the safety condition, and the last two represent the input constraints. The design of the safety-guaranteed policy is illustrated in Fig. 2(a). Based on the MDP formulation and the policy design, the overall algorithm for efficient and safe learning of locomotion behaviors is summarized in Alg. 1.
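In the paper the safety correction comes from a CBF-style optimization (Eq. (15)). As a much-simplified sketch, if the safe set were an axis-aligned box on the CoM position relative to the new stance, the minimum-norm correction would reduce to a clip; everything here, including the box bounds, is a hypothetical simplification of that QP:

```python
import numpy as np

def safe_action(u_tvr, u_policy, p_next, lo, hi):
    """Compose the final footstep command: TVR feedforward plus the
    learned offset, then a minimum-norm safety correction so that the
    CoM position relative to the new stance, p_next - u, stays within
    the box [lo, hi] (a stand-in for the polytopic safe set)."""
    u = u_tvr + u_policy          # "feedforward exploration"
    rel = p_next - u              # CoM relative to the candidate stance
    rel_safe = np.clip(rel, lo, hi)
    u_safe = rel - rel_safe       # "safety compensation" term
    return u + u_safe
```

When the feedforward action already keeps the relative state inside the box, the correction vanishes and the learned policy acts unimpeded, matching the intent of the slack-penalized constraint above.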
IV-C Further Details
It is worth taking a look at each of the components in the final control input described by Eq. (14). The TVR term together with the learned Gaussian term provides a "feedforward exploration" in the state space, where the parameterized Gaussian policy explores around the TVR planner policy and optimizes the long-term reward. The safety term projects the feedforward action onto the safe set of policies and furnishes "safety compensation".
In particular, the TVR term in the "feedforward exploration" provides a model-based initial guess of the offset and resolves two major issues caused by the safety projection: 1) inactive exploration and 2) the credit assignment problem. For example, let us consider two cases with different "feedforward explorations", as illustrated in Fig. 3, whose final control policies are: (a) the Gaussian policy with safety compensation but without the TVR term, and (b) the full policy including the TVR term.
In case (a) (and (b), respectively), the cyan area represents the "feedforward exploration", which is the corresponding Gaussian distribution, and the green dots are its samples. The pink arrows represent the "safety compensation". The black striped regions are the distributions of the final policies, and the yellow dots are their samples.
In (a), there is no intersection between the set of safe actions and the possible "feedforward exploration", since in most cases we initialize the Gaussian policy with a zero mean vector. Then, all explorations are projected onto the safe action set. The projection does not preserve the volume in the action space, which hinders active exploration during learning. In contrast, (b) leverages the TVR planner as a near-optimal initial policy and retains the volume in the action space to explore over.
When it comes to computing a gradient of the long-term reward, the projected actions make it difficult to evaluate the resulting trajectories and assign credit in the action space. In other words, in (a), the three compensated samples (yellow dots) do not roll out different trajectories, which hampers the policy gradient estimation and results in a local optimum.
V Simulation Results
Our proposed MDP formulation and policy design can be applied to many kinds of humanoids to achieve versatile locomotion behaviors. In this section, we evaluate our framework via forward walking with the 10-DoF DRACO biped [14] and turning with the 23-DoF Boston Dynamics ATLAS humanoid in the DART simulator [20]. Parameters such as the robot's kinematics, the time durations of the Locomotion States, gains, the number of nodes in the two hidden layers, the reward scaling constants, and the behavior steering factors are summarized in Table I.
V-A DRACO Forward Walking
DRACO is a 10-DoF biped designed for agile locomotion that has small, lightweight feet without ankle roll actuation. Due to the absence of ankle roll actuation and the size of the feet, we design the WBC to control the position of the feet, the roll and pitch of the torso, and the height of the CoM of the robot. To achieve a forward walking behavior, we move a target frame, representing the desired position and heading of the robot, at a fixed desired velocity.
Fig. 4 summarizes the results of the forward walking simulation. In (a) and (b), the forward walking behavior is regulated by the state machine, the WBC, and the learned footstep decision-making policy. (c) illustrates the sagittal LIPM states relative to the stances and shows that all explorations stay inside the safe set. (d) illustrates the data efficiency of our proposed MDP formulation in policy learning compared to a conventional MDP formulation used in model-free approaches. For the comparison, we trained a policy to achieve forward walking with a similar velocity using the same PG method but the conventional MDP formulation. The learning curve for the proposed MDP formulation converges in far fewer iterations than the conventional one.
In (e), we show the 2-norm of the Zero-Moment-Point (ZMP) over the dataset during the learning process and argue that policy learning on the desired location of the next stance enhances the locomotion capability. The ZMP has been a significant indicator of dynamic balance and a widely used concept in the control of walking robots [21]. For example, when the ZMP moves outside of the support polygon, the robot loses its balance. In the box plot, the inter-quartile range decreases as the learning process proceeds. This indicates that less torque at the ankle is used for balancing, which results in less shaky locomotion. To evaluate the learned GP model, we perform 4-fold cross-validation and compute the mean coefficient of determination.
V-B ATLAS Turning
In the second simulation, we adapt the proposed MDP formulation to accomplish a different type of locomotion behavior: turning. Here, we use the full humanoid robot ATLAS. To achieve turning with this higher-DoF robot, the WBC is designed to stabilize the position and orientation of the feet, pelvis, and torso. All joints are commanded to maintain nominal positions at the lowest hierarchy.
We incrementally rotate a target frame at a fixed angular velocity. The policy learns to correct the desired location of the next stance for the turning behavior, which cannot be represented with the LIPM. Our algorithm is scalable regardless of the complexity of the robot, and the learning curve converges at a similar number of iterations as in the first simulation. Fig. 4(f) and (e) show the results of the ATLAS turning behavior.
VI Concluding Remarks
In this letter, we have described an MDP formulation for data-efficient and safe learning of locomotion. Our formulation is built upon our previous work [13, 14], which makes footstep decisions using the LIPM and stabilizes the robot with a WBC. Based on footstep decisions, we define the states and actions of our MDP while the WBC stabilizes the robot to step over the desired locations. At the same time, we learn the transition function of the MDP using a GP based on the LIPM, such that we compensate for behaviors outside of the LIPM. We design our policy as a combination of the TVR policy, a parametric stochastic policy, and safety guarantees via CBFs. We evaluate our framework's efficiency and safe exploration during the learning process through simulations of DRACO walking forward and ATLAS turning.
In the future, we plan to implement this framework on real bipedal hardware, in particular our liquid-cooled viscoelastic biped, DRACO. In the past, we have observed many behaviors that the LIPM could not capture, and cumbersome tuning procedures have been needed. We expect that the policy learning technique presented here will automatically find the gap between model and reality and adjust the policy accordingly.
The authors would like to thank the members of the Human Centered Robotics Laboratory at The University of Texas at Austin for their great help and support. This work was supported by the Office of Naval Research, ONR Grant #N000141512507 and the National Science Foundation, NSF Grant #1724360.
-  N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, M. Riedmiller et al., “Emergence of locomotion behaviours in rich environments,” arXiv preprint arXiv:1707.02286, 2017.
-  X. B. Peng, G. Berseth, K. Yin, and M. van de Panne, “Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Transactions on Graphics (Proc. SIGGRAPH 2017), vol. 36, no. 4, 2017.
-  N. Ratliff, J. A. Bagnell, and S. S. Srinivasa, “Imitation learning for locomotion and manipulation,” in 2007 7th IEEE-RAS International Conference on Humanoid Robots, Nov 2007, pp. 392–397.
-  M. Deisenroth and C. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning, ICML 2011. Omnipress, 2011, pp. 465–472.
-  W. Yu, G. Turk, and C. K. Liu, “Learning symmetric and low-energy locomotion,” ACM Trans. Graph., vol. 37, no. 4, pp. 144:1–144:12, Jul. 2018. [Online]. Available: http://doi.acm.org/10.1145/3197517.3201397
-  S. Kuindersma et al., “Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot,” Autonomous Robots, vol. 40, no. 3, pp. 429–455, Mar 2016.
-  S. Rezazadeh, C. Hubicki, M. Jones, A. Peekema, J. Van Why, A. Abate, and J. Hurst, “Spring-Mass Walking With ATRIAS in 3D: Robust Gait Control Spanning Zero to 4.3 KPH on a Heavily Underactuated Bipedal Robot,” in ASME 2015 Dynamic Systems and Control Conference. Columbus: ASME, Oct. 2015, p. V001T04A003.
-  S. Caron, A. Kheddar, and O. Tempier, “Stair climbing stabilization of the HRP-4 humanoid robot using whole-body admittance control,” in IEEE International Conference on Robotics and Automation, May 2019. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01875387
-  S. Kajita, F. Kanehiro, K. Kaneko, K. Fujiwara, K. Harada, K. Yokoi, and H. Hirukawa, “Biped walking pattern generation by using preview control of zero-moment point,” in 2003 IEEE International Conference on Robotics and Automation, vol. 2, Sep. 2003, pp. 1620–1626 vol.2.
-  J. Carpentier and N. Mansard, “Multicontact locomotion of legged robots,” IEEE Transactions on Robotics, vol. 34, no. 6, pp. 1441–1460, Dec 2018.
-  A. Herzog, S. Schaal, and L. Righetti, “Structured contact force optimization for kino-dynamic motion generation,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 2703–2710.
-  D. E. Orin, A. Goswami, and S.-H. Lee, “Centroidal dynamics of a humanoid robot,” Autonomous Robots, vol. 35, no. 2, pp. 161–176, Oct 2013. [Online]. Available: https://doi.org/10.1007/s10514-013-9341-4
-  D. Kim, S. J. Jorgensen, J. Lee, J. Ahn, J. Luo, and L. Sentis, “Dynamic locomotion for passive-ankle biped robots and humanoids using whole-body locomotion control,” arXiv preprint arXiv:1901.08100, 2019.
-  J. Ahn, D. Kim, S. Bang, and L. Sentis, “Control of A High Performance Bipedal Robot using Liquid Cooled Viscoelastic Actuators,” in preparation, 2019.
-  J. Ahn, O. Campbell, D. Kim, and L. Sentis, “Fast kinodynamic bipedal locomotion planning with moving obstacles,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 177–184.
-  R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, “End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks,” arXiv preprint arXiv:1903.08792, 2019.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
-  T. Koolen, T. de Boer, J. Rebula, A. Goswami, and J. Pratt, “Capturability-based analysis and control of legged locomotion, part 1: Theory and application to three simple gait models,” The International Journal of Robotics Research, vol. 31, no. 9, pp. 1094–1113, 2012.
-  J. Lee, M. X. Grey, S. Ha, T. Kunz, S. Jain, Y. Ye, S. S. Srinivasa, M. Stilman, and C. K. Liu, “Dart: Dynamic animation and robotics toolkit,” The Journal of Open Source Software, vol. 3, no. 22, p. 500, 2018.
-  M. Vukobratović and B. Borovac, “Zero-moment point — thirty five years of its life,” International Journal of Humanoid Robotics, vol. 01, no. 01, pp. 157–173, 2004.