I Introduction
Humanoid robots are advantageous for mobility in tight spaces. However, fast bipedal locomotion requires precise control of the contact transition process. There are many successful studies addressing agile legged locomotion. Model-free approaches, such as the Policy Gradient (PG) methods used in Deep Reinforcement Learning (DRL), rely on data and function approximation via neural networks. Model-based approaches employ the differential dynamics of robots to synthesize locomotion controllers. Our work leverages the advantages of data-driven and model-based approaches in a safe and efficient manner.
Recent work has shown the possibility of robust and agile locomotion control through model-free learning. In [1], "locomotors" were trained for various environments and were able to achieve robust behaviors. In [2], a policy is trained on a joint-space trajectory generated by motion capture data from humans [3]. The work in [4] learns local models of the robot for locomotion, while the work in [5] penalizes asymmetric motions to achieve energy-efficient motions. However, model-free learning approaches are limited by data inefficiency, unsafe policy exploration, and jerky motions.
On the other hand, model-based approaches decouple the problem into two subproblems: 1) reduce the complexity of full-body dynamics via simplified models such as the Inverted Pendulum [6, 7, 8, 9] or the Centroidal Model [10, 11, 12], and then 2) compute a feedback joint torque command that makes the robot track the behavior of the simplified model. In our recent studies [13, 14], we achieved unsupported passive-ankle dynamic locomotion via two computational elements: 1) a high-level footstep planner, dubbed the Time-to-Velocity-Reversal (TVR) planner, based on the Linear Inverted Pendulum Model (LIPM), and 2) a low-level Whole Body Controller (WBC) that tracks the desired trajectories. However, because the TVR planner relies on the LIPM, the WBC exhibits significant footstep tracking errors when following the trajectories the TVR planner provides.
In this paper, we devise a Markov Decision Process (MDP) for locomotion and employ Control Barrier Functions (CBF) for safe learning. In contrast to model-free approaches, whose MDP is characterized by sensor data and joint torques at every control loop, our formulation augments the walking pattern generator with a model-free approach. More precisely, at the moment the walking pattern is computed, we define actions related to footstep locations and use them for learning. Our objective is to find an optimal policy for the desired foot locations. We continuously update the foot location policy using the PG method and DRL. The policy is designed from three components: the TVR planner, a parametric neural network stochastic policy, and a safety controller. Here, the TVR planner provides a good initial offset for the parametric neural network policy, which helps efficient learning. The parametric neural network takes arbitrary actions, explores the state space of the robot, and optimizes its parameters so that the long-term reward is maximized. The safety controller corrects the policy so that the robot is not steered into unsafe regions of the state space. To design safe actions, we learn the discrepancies between the LIPM and the simulated robot using a Gaussian Process (GP).
The proposed MDP formulation and the learning framework have the following advantages: 1) The learned policy compensates for inaccurate tracking; for example, the policy compensates for the effects of limb dynamics and angular momentum. 2) They provide data efficiency and safe exploration during the learning process: the policies for both forward walking and turning converge after only a small number of iterations. 3) Since the LIPM approximates biped robots and the WBC is a task-oriented feedback controller, the proposed algorithm is scalable to many types of biped robots.
The remainder of this paper is organized as follows: Section II describes a model-based approach for biped locomotion and DRL with safety guarantees. Section III proposes an MDP formulation, and Section IV shows how we compose and update the policy effectively and safely. Section V evaluates the proposed framework in simulation for forward walking on a 10 Degree-of-Freedom (DoF) biped, dubbed DRACO, and for a turning behavior of the 23-DoF humanoid robot, ATLAS. Finally, Section VI concludes the paper.

II Preliminaries
II-A Notation
$\mathbb{R}$ denotes the real numbers, and $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_{\leq 0}$ are the sets of non-negative and non-positive real numbers. $\mathbb{N}$ is used for natural numbers. Given $a, b \in \mathbb{N}$, where $a < b$, the set of natural numbers in the interval $[a, b]$ is denoted by $\mathbb{N}_{[a,b]}$. The sets of $n$-dimensional real vectors and $n \times m$ real matrices are denoted by $\mathbb{R}^n$ and $\mathbb{R}^{n \times m}$, respectively. Given $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$, $[x;\, y] \in \mathbb{R}^{n+m}$ represents their concatenation. The $n \times m$ matrix whose elements are all one is denoted by $\mathbf{1}_{n \times m}$, and the identity matrix is represented as $\mathbf{I}_n$. The general Euclidean norm is denoted as $\|\cdot\|$, and the inner product in the vector space is denoted by $\langle \cdot, \cdot \rangle$. $\mathbb{E}[\cdot]$ represents the probabilistic expectation operator.

II-B A Model-based Approach to Locomotion
In this subsection, we summarize how locomotion behaviors are represented and achieved by a Walking Pattern Generator (WPG) and the WBC. Locomotion behaviors are manifested as stabilizing leg contact changes (coordinated by a state machine) triggered either by predefined temporal specifications or by foot contact detection sensors. Here, we define a Locomotion State and a state machine with simple structures to represent locomotion behaviors.
Definition 1.
(Locomotion State) A locomotion state is defined as a tuple of a semantic expression of the locomotion behavior and its time duration.

The semantic expression takes one of five values, with subscripts describing the locomotion states for double support, lifting the right/left leg, and landing the right/left leg.

The time duration specifies how long the locomotion state is held.
Definition 2.
(State Machine) We define a state machine as a sequence of Locomotion States.

Each Locomotion State is terminated after its time duration elapses and is switched to the next state.

A Locomotion State can be terminated before its time duration elapses when contact is detected between the swing foot and the ground.
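The state machine of Definitions 1–2 can be sketched as a cyclic sequence of (state, duration) pairs with early termination on touchdown. The state names, durations, and the restriction of early termination to landing states below are illustrative assumptions, not the paper's exact specification.

```python
from enum import Enum, auto

class LS(Enum):
    """Illustrative locomotion-state labels (Definition 1)."""
    DOUBLE_SUPPORT = auto()
    LIFT_RIGHT = auto()
    LAND_RIGHT = auto()
    LIFT_LEFT = auto()
    LAND_LEFT = auto()

# State machine (Definition 2) as a cyclic sequence of
# (state, nominal duration [s]) tuples; durations are made up.
SEQUENCE = [(LS.DOUBLE_SUPPORT, 0.05), (LS.LIFT_RIGHT, 0.3), (LS.LAND_RIGHT, 0.3),
            (LS.DOUBLE_SUPPORT, 0.05), (LS.LIFT_LEFT, 0.3), (LS.LAND_LEFT, 0.3)]

def step_state(idx, elapsed, contact_detected):
    """Advance to the next state when the nominal duration elapses, or
    early when the swing foot touches down during a landing state."""
    state, duration = SEQUENCE[idx]
    early = contact_detected and state in (LS.LAND_RIGHT, LS.LAND_LEFT)
    if elapsed >= duration or early:
        return (idx + 1) % len(SEQUENCE), 0.0
    return idx, elapsed
```

A usage example: calling `step_state` once per control tick with the elapsed time in the current state and the contact flag yields the contact schedule that the WPG and WBC coordinate against.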
Based on the state machine, we further define an Apex Moment and a Switching Moment.
Definition 3.
(Apex Moment and Switching Moment) Given the state machine, an Apex Moment is defined as the instant when a lifting state switches to the corresponding landing state. A Switching Moment is defined as an instant in the middle of the landing state, when the LIPM switches to the new stance.
Let us consider the LIPM as our simplified model. The state of the LIPM is defined as the position and velocity of the Center of Mass (CoM) of the robot on a constant-height surface, denoted by $x$, which lies on a manifold embedded in the Euclidean space with the LIPM dynamics. The stance of the LIPM is defined as the location of the pivot and denoted by $p$. The input of the LIPM is defined as the desired location of the next stance and denoted by $u$. These nomenclatures are used with a subscript $k$ to represent properties in the $k$-th step, e.g., $x_k$, $p_k$, and $u_k$. When the LIPM is regulated by the state machine, we further use subscripts $a$ and $s$ to denote the properties of the robot at the Apex Moment and the Switching Moment in the $k$-th step; for example, $x_{a,k}$ and $x_{s,k}$ denote the state of the LIPM at the Apex Moment and the Switching Moment in the $k$-th step. Since the stance and input of the LIPM are invariant within a step, $p_{a,k}$ and $u_{a,k}$ are interchangeable with $p_k$ and $u_k$. Beyond the simplified model, properties of the actual robot can be represented with the same subscripts. For instance, $\Theta_{a,k}$ and $\dot{\Theta}_{a,k}$ represent the orientation and angular velocity of the base link of the robot with respect to the world frame at the Apex Moment in the $k$-th step. Fig. 1 illustrates the state machine and the abstraction of the locomotion behavior with the LIPM.
Given the state machine and the nomenclature above, the goal of the WPG is to generate the desired location of the next stance $u_k$ and the CoM trajectory based on $x_{a,k}$ and $p_k$ at the Apex Moment in the $k$-th step. From the walking pattern, the WBC computes sensor-based feedback control loops and the torque command for the robot to track the desired location of the next stance and the CoM trajectory. Note that the WPG designs the pattern at the Apex Moment of each step, while the WBC computes the feedback torque command in every control loop.
II-C TVR Planner
As a WPG, the TVR planner decides the desired location of the next stance based on the LIPM. The differential equation of the LIPM is represented as follows:

$$\ddot{c} = \frac{g}{h}\,(c - p), \qquad (1)$$

where $c$ is the CoM position, $p$ is the stance position, $g$ is the gravitational constant, and $h$ is the constant height of the CoM of the point mass.
This subsection considers the $k$-th stepping motion and shows how the TVR planner designs the desired location of the next stance. Given an initial condition $x_0 = (c_0, \dot{c}_0)$ and a stance position $p$, the solution of Eq. (1) yields a state transition map

$$x(t) = \Phi(t)\,(x_0 - \bar{p}) + \bar{p}, \qquad (2)$$

where $\omega = \sqrt{g/h}$, $\bar{p} = (p, 0)$, and

$$\Phi(t) = \begin{bmatrix} \cosh(\omega t) & \frac{1}{\omega}\sinh(\omega t) \\ \omega \sinh(\omega t) & \cosh(\omega t) \end{bmatrix}.$$
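The closed-form transition map of Eq. (2) can be sketched directly in code; the parameter values below (CoM height, gravity) are illustrative assumptions, not the paper's settings.

```python
import math

def lipm_transition(c0, cdot0, p, t, g=9.81, h=1.0):
    """Closed-form LIPM state transition (Eq. 2): propagate the CoM state
    (c0, cdot0) under a fixed stance p for time t."""
    w = math.sqrt(g / h)                       # LIPM natural frequency
    ch, sh = math.cosh(w * t), math.sinh(w * t)
    c = ch * (c0 - p) + (sh / w) * cdot0 + p   # position row of Phi(t)
    cdot = w * sh * (c0 - p) + ch * cdot0      # velocity row of Phi(t)
    return c, cdot
```

Because the map is exact, it agrees with a fine numerical integration of Eq. (1), which is a convenient sanity check when implementing the planner.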
Since the TVR planner decides the desired location of the next stance at the Apex Moment, we set the initial condition to the apex state $x_{a,k}$. With the pre-specified time duration $t_s$ to the Switching Moment, we compute the state at the Switching Moment as

$$x_{s,k} = \Phi(t_s)\,(x_{a,k} - \bar{p}_k) + \bar{p}_k. \qquad (3)$$

From $x_{s,k} = (c_s, \dot{c}_s)$, the TVR planner computes $u_k$ such that the sagittal velocity (respectively, the lateral velocity) of the CoM is driven to zero a time $t'$ after the LIPM switches to the new stance. The constraint is expressed as

$$\omega \sinh(\omega t')\,(c_s - u_k) + \cosh(\omega t')\,\dot{c}_s = 0, \qquad (4)$$

where $t'$ is chosen separately for the sagittal and lateral directions. From Eq. (4), $u_k$ is computed with an additional bias term as

$$u_k = c_s + \kappa_d\,\frac{\cosh(\omega t')}{\omega \sinh(\omega t')}\,\dot{c}_s + \kappa_p\,(c_s - c^{des}), \qquad (5)$$

where $c^{des}$ denotes a desired position for the CoM of the robot. Note that Eq. (5) is a simple proportional-derivative controller, and $\kappa_p$ and $\kappa_d$ are the gain parameters that keep the CoM converging to the desired position. A more detailed derivation of the LIPM is described in [15].
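The TVR footstep rule of Eq. (5) can be sketched as follows. The gain values and names (`kp`, `kd`) and the default model parameters are illustrative assumptions; this is not the paper's implementation.

```python
import math

def tvr_next_stance(c_s, cdot_s, t_prime, c_des,
                    kp=0.1, kd=1.0, g=9.81, h=1.0):
    """Sketch of the TVR rule (Eq. 5): place the next stance so the CoM
    velocity is driven toward zero t_prime seconds after switching, plus
    a proportional bias toward the desired CoM position c_des."""
    w = math.sqrt(g / h)
    u = (c_s
         + kd * (math.cosh(w * t_prime) / (w * math.sinh(w * t_prime))) * cdot_s
         + kp * (c_s - c_des))
    return u
```

With `kd = 1` and `kp = 0`, the returned stance exactly satisfies the velocity-reversal constraint of Eq. (4), which is the design intent of the planner.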
II-D Reinforcement Learning with Safety Guarantees
Consider an infinite-horizon discounted MDP with control-affine, deterministic dynamics defined by the tuple $(\mathcal{X}, \mathcal{U}, f, r, \rho_0, \gamma)$, where $\mathcal{X}$ is a set of states, $\mathcal{U}$ is a set of actions, $f$ is the deterministic dynamics, which is affine in the control, $r$ is the reward function, $\rho_0$ is the distribution of the initial state, and $\gamma$ is the discount factor. The control-affine dynamics are written as

$$x_{t+1} = f(x_t) + g(x_t)\,u_t + d(x_t), \qquad (6)$$

where $x_t$ and $u_t$ denote the state and input, $f$ and $g$ are the nominal underactuated and actuated dynamics, and $d$ is the unknown system dynamics. Moreover, let $\pi_\theta$ denote a stochastic control policy parameterized by a vector $\theta$ that maps states to distributions over actions, and let $J(\pi_\theta)$ denote the policy's expected discounted reward,

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(x_t, u_t)\Big], \qquad (7)$$

where $\tau = (x_0, u_0, x_1, \dots)$ is a trajectory drawn from the policy.
To achieve safe exploration in the learning process under uncertain dynamics, [16] employed a Gaussian Process (GP) to approximate the unknown part of the dynamics from data, learning a mean estimate $\mu_d(x)$ and an uncertainty $\sigma_d^2(x)$ in tandem with the policy update process, with high-probability confidence intervals on the estimate,

$$d(x) \in \big[\mu_d(x) - k_\delta\,\sigma_d(x),\; \mu_d(x) + k_\delta\,\sigma_d(x)\big], \qquad (8)$$

where $k_\delta$ is a design parameter for the confidence level (e.g., $k_\delta = 2$ for roughly $95\%$ confidence). Then, the control input is computed so that the following state stays within a given invariant safe set $\mathcal{C}$ by solving

$$u^{safe}_t = \arg\min_{u}\; \|u\|^2 \quad \text{s.t.}\quad f(x_t) + g(x_t)\,(\bar{u}_t + u) + \mu_d(x_t) \pm k_\delta\,\sigma_d(x_t) \in \mathcal{C}, \qquad (9)$$

where $\bar{u}_t$ denotes the nominal control input before compensation.
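As a concrete illustration of Eq. (8), the sketch below (not the paper's implementation) fits a GP with a squared-exponential (RBF) kernel to samples of a stand-in discrepancy function and forms the confidence band. The function `d_true`, the dataset sizes, and `k_delta = 2` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 1))
d_true = lambda x: 0.3 * np.sin(3.0 * x)          # stand-in discrepancy d(x)
y = d_true(X).ravel() + 0.01 * rng.standard_normal(60)

# GP with squared-exponential prior plus a noise term
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

Xq = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
mu, sigma = gp.predict(Xq, return_std=True)        # mean and uncertainty
k_delta = 2.0                                      # ~95% confidence (Eq. 8)
lower, upper = mu - k_delta * sigma, mu + k_delta * sigma
```

In the safe-learning loop, `mu` enters the dynamics constraint of Eq. (9) while `k_delta * sigma` widens the constraint to account for model uncertainty.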
III MDP Formulation
In this section, we define the MDP components for data-efficient and safe learning. Our MDP formulation augments the TVR planner with a model-free approach. We define a set of states and a set of actions associated with the Apex Moment in each step, where the number of steps can be taken to infinity when considering locomotion over an unbounded horizon.
Recall from the nomenclature in Section II-B that $x_{a,k}$, $p_k$, and $u_k$ denote the state, stance, and input of the LIPM at the Apex Moment in the $k$-th step, and that $p_{a,k}$ and $u_{a,k}$ are interchangeable with $p_k$ and $u_k$. Moreover, $\Theta_{a,k}$ and $\dot{\Theta}_{a,k}$ represent the orientation and angular velocity of the base link at the same moment.
Based on Eq. (2), we define the transition function of the CoM portion of the state in the MDP as

$$x_{a,k+1} = \Phi(\Delta t)\,(x_{a,k} - \bar{u}_k) + \bar{u}_k + d(x_{a,k}, u_k), \qquad (10)$$

where $\Delta t$ is the time between consecutive Apex Moments and $d(\cdot)$ in Eq. (10) represents the unknown part of the dynamics fitted via Eq. (8).¹ The uncertainty is attributed to discrepancies between the simplified model and the simulated robot. Note that the dynamics of the lower part of the states, i.e., the base orientation and angular velocity, cannot be expressed in closed form. Therefore, we optimize our policy in a model-free sense, but utilize the CoM dynamics to provide safe exploration and data efficiency in the learning process.

¹ We use a squared exponential kernel for the GP prior in the implementation.
To improve the locomotion behavior, we define the following reward function as a sum of five terms:

$$r(x_k, u_k) = r_{alive} + r_{orient} + r_{pos} + r_{vel} + r_{input}. \qquad (11)$$

Given the Euler ZYX representation of the base orientation, $r_{alive}$ is an alive bonus, $r_{orient}$ penalizes roll and pitch deviations to keep the body upright, $r_{pos}$ penalizes divergence from the desired CoM position and heading of the robot, $r_{vel}$ steers the robot toward a desired velocity, and $r_{input}$ penalizes excessive control input.
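A minimal sketch of a reward of the form of Eq. (11) is given below. The term names, weights, and quadratic penalty forms are assumptions for illustration; the paper's exact scaling constants are listed in Table I.

```python
import numpy as np

def reward(roll, pitch, com, com_des, yaw, yaw_des, vel, vel_des, u,
           w=(1.0, 0.5, 0.5, 0.5, 0.01)):
    """Illustrative reward in the spirit of Eq. (11); weights w are made up."""
    r_alive = w[0]                                    # alive bonus
    r_orient = -w[1] * (roll**2 + pitch**2)           # keep the body upright
    r_pos = -w[2] * (np.sum((com - com_des)**2)       # track desired CoM...
                     + (yaw - yaw_des)**2)            # ...and heading
    r_vel = -w[3] * np.sum((vel - vel_des)**2)        # desired velocity
    r_input = -w[4] * np.sum(u**2)                    # penalize large input
    return r_alive + r_orient + r_pos + r_vel + r_input
```

Under this form, perfect tracking yields exactly the alive bonus, and any deviation strictly reduces the reward, which is the shaping behavior the five terms are meant to encode.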
IV Policy Search
Our goal is to learn an optimal policy for desired foot locations. We use Proximal Policy Optimization (PPO) [18] to optimize the policy iteratively. PPO defines an advantage function $A^{\pi}(x, u) = Q^{\pi}(x, u) - V^{\pi}(x)$, where $Q^{\pi}$ is the state-action value function that evaluates the return of taking action $u$ at state $x$ and following the policy $\pi$ thereafter, and $V^{\pi}$ is the state value function. The policy is updated by maximizing the modified objective function

$$L(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],$$

where $\rho_t(\theta) = \pi_\theta(u_t \mid x_t) / \pi_{\theta_{old}}(u_t \mid x_t)$ is the importance resampling term that allows us to use the dataset collected under the old policy $\pi_{\theta_{old}}$ to estimate the objective for the current policy $\pi_\theta$, and $\hat{A}_t$ is a short notation for the advantage estimate. The $\mathrm{clip}$ and $\min$ operators ensure that the policy does not change too much from the old policy $\pi_{\theta_{old}}$.
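The clipped surrogate objective can be sketched in a few lines; this is the standard PPO objective [18] rather than the paper's full training loop, and the clipping parameter `eps = 0.2` is an illustrative default.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective (to be maximized) over a batch of
    log-probabilities under the new/old policies and advantage estimates."""
    rho = np.exp(logp_new - logp_old)                 # importance ratio
    unclipped = rho * adv
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))    # pessimistic bound
```

When the new policy equals the old one, the objective reduces to the mean advantage; large ratio changes are capped at $(1 \pm \epsilon)$ times the advantage, which is exactly how the update is kept close to $\pi_{\theta_{old}}$.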
IV-A Safe Set Approximation
In this subsection, we compute a safe set and a CBF to design a safe policy. The work in [19] introduced the instantaneous capture point, i.e., the point at which the LIPM comes to a stop if it instantaneously places and maintains its stance there. Here, we consider the one-step capture region for the LIPM at the Apex Moment of the $k$-th step:

$$\mathcal{C}_k = \Big\{ (x_{a,k},\, p_k) \;\Big|\; \big\| e^{\omega t_{land}} \big(c_{a,k} - p_k + \tfrac{\dot{c}_{a,k}}{\omega}\big) \big\| \leq \ell_{max} \Big\}, \qquad (12)$$

where $\ell_{max}$ is the maximum step length that the LIPM can reach, obtained from the kinematics of the robot, and $t_{land}$ is a predefined temporal parameter that represents the time to land the swing foot. We conservatively approximate the ellipsoid of Eq. (12) with a polytope and define the safe set as

$$\mathcal{C} = \{ (x, p) \mid A\,[x;\, p] \leq b \}, \qquad (13)$$

where $A$ and $b$ form the half-space representation of the polytope. The safe set in Eq. (13) denotes the set of LIPM state and stance pairs that can be stabilized, without falling, by taking one step. In other words, if the LIPM state and stance pair is inside the safe set, there always exists a location for the next stance that stabilizes the LIPM. The projection onto the CoM position and velocity plane of the actual one-step capture region and its approximation is represented in Fig. 2(b).
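To make the polytopic approximation of Eq. (13) concrete, the sketch below builds a conservative inner approximation of a disk-shaped capture region (an inscribed regular polygon is always contained in the disk) and checks membership via the half-space form. The function names and the number of polygon sides are illustrative assumptions.

```python
import numpy as np

def disk_polytope(r, n_sides=8):
    """Half-space form A z <= b of a regular n-gon inscribed in a disk of
    radius r: conservative, since the polygon lies inside the disk."""
    ang = 2 * np.pi * np.arange(n_sides) / n_sides
    A = np.stack([np.cos(ang), np.sin(ang)], axis=1)   # outward face normals
    b = np.full(n_sides, r * np.cos(np.pi / n_sides))  # face offsets
    return A, b

def in_safe_set(z, A, b):
    """Membership test for the polytopic safe set of Eq. (13)."""
    return bool(np.all(A @ z <= b + 1e-12))
```

A simple design note: increasing `n_sides` tightens the approximation toward the original region at the cost of more inequality rows in the safety QP.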
IV-B Safety-Guaranteed Policy Design
For data-efficient and safe learning, we design our control input at step $k$ with three components:

$$u_k = u^{TVR}_k + u^{\theta}_k + u^{safe}_k, \qquad (14)$$

where $u^{TVR}_k$ is computed by the TVR planner and $u^{\theta}_k$ is drawn from a stochastic Gaussian policy, $u^{\theta}_k \sim \mathcal{N}(\mu_\theta, \Sigma_\theta)$, where $\mu_\theta$ and $\Sigma_\theta$ denote the mean vector and the covariance matrix parameterized by $\theta$.²

² In the implementation, we choose a fully connected network with two hidden layers.
Given $u^{TVR}_k$ and $u^{\theta}_k$, $u^{safe}_k$ ensures that the following LIPM state and stance pair $(x_{a,k+1}, u_k)$, steered by the final control input $u_k$, stays inside the safe set $\mathcal{C}$. In our problem, Eq. (9) is modified as

$$u^{safe}_k = \arg\min_{u}\; \|u\|^2 \quad \text{s.t.}\quad (x_{a,k+1},\, u_k) \in \mathcal{C}. \qquad (15)$$

Substituting Eq. (8), Eq. (10), and Eq. (13) into Eq. (15), the optimization problem reduces to the following Quadratic Program (QP), which is efficiently solved for the safety compensation:

$$\min_{u,\,\epsilon}\; \|u\|^2 + K\epsilon \quad \text{s.t.}\quad A'\,u \leq b' + \epsilon\mathbf{1}, \quad u_{min} \leq u \leq u_{max}, \qquad (16)$$

where $\epsilon$ is a slack variable in the safety constraint, $K$ is a large constant to penalize safety violations, and $A'$ and $b'$ collect the terms resulting from the substitution. The first segment of the inequality represents the safety constraint, and the last two represent the input constraints. The design of the safety-guaranteed policy is illustrated in Fig. 2(a). Based on the MDP formulation and the policy design, the overall algorithm for efficient and safe learning of locomotion behaviors is summarized in Alg. 1.
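A minimal sketch of the safety-compensation QP of Eq. (16) is shown below, using `scipy.optimize.minimize` with SLSQP as a stand-in for a dedicated QP solver. The matrix names, the penalty `K`, and the omission of the box input constraints are simplifying assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def safety_compensation(u_nom, A, b, K=1e3):
    """Find the smallest correction u so that A (u_nom + u) <= b, with a
    non-negative slack eps on the safety constraint penalized by K."""
    n = u_nom.size

    def cost(z):                      # decision vector z = [u, eps]
        return float(z[:n] @ z[:n] + K * z[n])

    cons = [
        {"type": "ineq",              # b - A(u_nom + u) + eps >= 0
         "fun": lambda z: b - A @ (u_nom + z[:n]) + z[n]},
        {"type": "ineq",              # eps >= 0
         "fun": lambda z: z[n]},
    ]
    res = minimize(cost, np.zeros(n + 1), constraints=cons, method="SLSQP")
    return res.x[:n], res.x[n]
```

For example, with a single constraint `u_total <= 0.5` and a nominal input of `1.0`, the minimal correction is approximately `-0.5` with zero slack; a nominal input already inside the safe set yields approximately zero correction.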
IV-C Further Details
It is worth taking a look at each of the components of the final control input described by Eq. (14). The sum $u^{TVR}_k + u^{\theta}_k$ provides a "feedforward exploration" in the state space, where the parameterized Gaussian policy explores around the TVR planner policy and optimizes the long-term reward. $u^{safe}_k$ projects the action onto the safe set and furnishes a "safety compensation".

In particular, $u^{TVR}_k$ in the "feedforward exploration" provides a model-based initial guess for the offset and resolves two major issues caused by the safety projection: 1) inactive exploration and 2) the credit assignment problem. For example, let us consider two cases with different "feedforward explorations," as illustrated in Fig. 3, whose final control policies are: (a) $u^{\theta}_k + u^{safe}_k$ and (b) $u^{TVR}_k + u^{\theta}_k + u^{safe}_k$.
In case (a) (respectively, (b)), the cyan area represents the "feedforward exploration," i.e., the Gaussian distribution of $u^{\theta}_k$ (respectively, $u^{TVR}_k + u^{\theta}_k$), and the green dots are its samples. The pink arrows represent the "safety compensation" $u^{safe}_k$. The black striped regions are the distributions of the final policies, and the yellow dots are their samples. In (a), there is no intersection between the set of safe actions and the possible "feedforward exploration," since in most cases we initialize the Gaussian policy with a zero mean vector. Then, all explorations are projected onto the safe action set. The projection does not preserve volume in the action space, which hinders active exploration during learning. In contrast, (b) leverages the TVR planner as a near-optimal policy and retains volume in the action space to explore over.

When it comes to computing a gradient of the long-term reward, the projected actions make it difficult to evaluate the resulting trajectories and assign credit in the action space. In other words, in (a), the three compensated samples (yellow dots) do not roll out different trajectories, which prevents gradient descent from making progress and results in a local optimum.
V Simulation Results
Our proposed MDP formulation and policy design can be applied to many kinds of humanoids to achieve versatile locomotion behaviors. In this section, we evaluate our framework via forward walking with the 10-DoF DRACO biped [14] and turning with the 23-DoF Boston Dynamics ATLAS humanoid in the DART simulator [20]. Parameters such as the robot kinematics, the time duration of each Locomotion State, gains, the number of nodes in the two hidden layers, the reward scaling constants, and the behavior steering factors are summarized in Table I.
V-A DRACO Forward Walking
DRACO is a 10-DoF biped designed for agile locomotion, with lightweight, small feet and no ankle roll actuation. Due to the absence of ankle roll actuation and the size of the feet, we design the WBC to control the position of the feet, the roll and pitch of the torso, and the height of the CoM of the robot. To achieve a forward walking behavior, we move a target frame that represents the desired position and heading of the robot forward at a constant velocity.
Fig. 4 summarizes the results of the forward walking simulation. In (a) and (b), the forward walking behavior is regulated by the state machine, the WBC, and the learned footstep decision-making policy. (c) illustrates the sagittal LIPM states relative to the stances and shows that the explorations all stay inside the safe set. (d) illustrates the data efficiency of our proposed MDP formulation in policy learning compared with a conventional MDP formulation used in model-free approaches. For the comparison, we trained a policy to achieve forward walking at a similar velocity using the same PG method but the conventional MDP formulation. The learning curve for the proposed MDP formulation converges in far fewer iterations than the conventional one.
In (e), we show the 2-norm of the Zero-Moment Point (ZMP) over the dataset during the learning process and argue that the policy learning on the desired location of the next stance enhances the locomotion capability. The ZMP has been a significant indicator of dynamic balancing and a widely used concept in the control of walking robots [21]. For example, when the ZMP moves outside of the supporting polygon, the robot loses its balance. In the box plot, the interquartile range decreases as the learning process proceeds. This indicates that less torque at the ankle is used for balancing, which results in less shaky locomotion. To evaluate the learned GP model, we perform 4-fold cross-validation on the coefficient of determination.
V-B ATLAS Turning
In the second simulation, we apply the proposed MDP formulation to accomplish a different type of locomotion behavior, turning, using the full humanoid robot ATLAS. To achieve the turning behavior on this higher-DoF robot, the WBC is designed to stabilize the position and orientation of the feet, pelvis, and torso. All the joints are commanded to maintain nominal positions at the lowest hierarchy.
We incrementally rotate a target frame at a constant angular velocity. The policy learns to correct the desired location of the next stance for the turning behavior, which cannot be represented with the LIPM. Our algorithm is scalable regardless of the complexity of the robot, and the learning curve converges at a similar number of iterations to the first simulation. Fig. 4(f) and (e) show the results of the ATLAS turning behavior.
VI Concluding Remarks
In this letter, we have described an MDP formulation for data-efficient and safe learning of locomotion. Our formulation builds upon our previous work [13, 14], which makes footstep decisions using the LIPM and stabilizes the robot with a WBC. Based on footstep decisions, we define the states and actions of our MDP, while the WBC stabilizes the robot to step on the desired locations. At the same time, we learn the transition function of the MDP using a GP based on the LIPM, so that we compensate for behaviors outside of the LIPM. We design our policy as a combination of the TVR policy, a parametric stochastic policy, and a safety guarantee via CBF. We evaluate our framework's data efficiency and safe exploration during the learning process through simulations of DRACO walking forward and ATLAS turning.
In the future, we plan to implement this framework on real bipedal hardware, in particular our liquid-cooled viscoelastic biped, DRACO. In the past, we have seen many behaviors that the LIPM could not capture and have needed cumbersome tuning procedures. We expect the policy learning technique presented here to automatically find the gap between model and reality and adjust the policy accordingly.
Table I: Parameters of the LIPM, the reward function, the behavior steering factors, and the network layers.
Acknowledgment
The authors would like to thank the members of the Human Centered Robotics Laboratory at The University of Texas at Austin for their great help and support. This work was supported by the Office of Naval Research, ONR Grant #N000141512507 and the National Science Foundation, NSF Grant #1724360.
References
 [1] N. Heess, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S. Eslami, M. Riedmiller et al., “Emergence of locomotion behaviours in rich environments,” arXiv preprint arXiv:1707.02286, 2017.
 [2] X. B. Peng, G. Berseth, K. Yin, and M. van de Panne, “Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning,” ACM Transactions on Graphics (Proc. SIGGRAPH 2017), vol. 36, no. 4, 2017.

 [3] N. Ratliff, J. A. Bagnell, and S. S. Srinivasa, "Imitation learning for locomotion and manipulation," in 2007 7th IEEE-RAS International Conference on Humanoid Robots, Nov 2007, pp. 392–397.
 [4] M. Deisenroth and C. Rasmussen, "PILCO: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning, ICML 2011. Omnipress, 2011, pp. 465–472.
 [5] W. Yu, G. Turk, and C. K. Liu, "Learning symmetric and low-energy locomotion," ACM Trans. Graph., vol. 37, no. 4, pp. 144:1–144:12, Jul. 2018. [Online]. Available: http://doi.acm.org/10.1145/3197517.3201397
 [6] Kuindersma et al., "Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot," Autonomous Robots, vol. 40, no. 3, pp. 429–455, Mar 2016.
 [7] S. Rezazadeh, C. Hubicki, M. Jones, A. Peekema, J. Van Why, A. Abate, and J. Hurst, "Spring-Mass Walking With ATRIAS in 3D: Robust Gait Control Spanning Zero to 4.3 KPH on a Heavily Underactuated Bipedal Robot," in ASME 2015 Dynamic Systems and Control Conference. Columbus: ASME, Oct. 2015, p. V001T04A003.
 [8] S. Caron, A. Kheddar, and O. Tempier, "Stair climbing stabilization of the HRP-4 humanoid robot using whole-body admittance control," in IEEE International Conference on Robotics and Automation, May 2019. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01875387
 [9] S. Kajita, F. Kanehiro, K. Kaneko, K. Fujiwara, K. Harada, K. Yokoi, and H. Hirukawa, "Biped walking pattern generation by using preview control of zero-moment point," in 2003 IEEE International Conference on Robotics and Automation, vol. 2, Sep. 2003, pp. 1620–1626.
 [10] J. Carpentier and N. Mansard, “Multicontact locomotion of legged robots,” IEEE Transactions on Robotics, vol. 34, no. 6, pp. 1441–1460, Dec 2018.
 [11] A. Herzog, S. Schaal, and L. Righetti, “Structured contact force optimization for kinodynamic motion generation,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 2703–2710.
 [12] D. E. Orin, A. Goswami, and S.H. Lee, “Centroidal dynamics of a humanoid robot,” Autonomous Robots, vol. 35, no. 2, pp. 161–176, Oct 2013. [Online]. Available: https://doi.org/10.1007/s1051401393414
 [13] D. Kim, S. J. Jorgensen, J. Lee, J. Ahn, J. Luo, and L. Sentis, “Dynamic locomotion for passiveankle biped robots and humanoids using wholebody locomotion control,” arXiv preprint arXiv:1901.08100, 2019.
 [14] J. Ahn, D. Kim, S. Bang, and L. Sentis, “Control of A High Performance Bipedal Robot using Liquid Cooled Viscoelastic Actuators,” in preparation, 2019.
 [15] J. Ahn, O. Campbell, D. Kim, and L. Sentis, “Fast kinodynamic bipedal locomotion planning with moving obstacles,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 177–184.
 [16] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, "End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks," arXiv preprint arXiv:1903.08792, 2019.
 [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [18] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [19] T. Koolen, T. de Boer, J. Rebula, A. Goswami, and J. Pratt, "Capturability-based analysis and control of legged locomotion, part 1: Theory and application to three simple gait models," The International Journal of Robotics Research, vol. 31, no. 9, pp. 1094–1113, 2012.
 [20] J. Lee, M. X. Grey, S. Ha, T. Kunz, S. Jain, Y. Ye, S. S. Srinivasa, M. Stilman, and C. K. Liu, “Dart: Dynamic animation and robotics toolkit,” The Journal of Open Source Software, vol. 3, no. 22, p. 500, 2018.
 [21] M. Vukobratović and B. Borovac, "Zero-moment point — thirty five years of its life," International Journal of Humanoid Robotics, vol. 01, no. 01, pp. 157–173, 2004.