Data Efficient and Safe Learning for Locomotion via Simplified Model

06/10/2019 ∙ by Junhyeok Ahn, et al. ∙ The University of Texas at Austin

In this letter, we formulate a novel Markov Decision Process (MDP) for data efficient and safe learning for locomotion via a simplified model. In our previous studies on biped locomotion, we relied on a low-dimensional robot model, e.g., the Linear Inverted Pendulum Model (LIPM), commonly used in Walking Pattern Generators (WPG). However, the low-level controller cannot precisely track desired footstep locations due to the discrepancies between the real system and the simplified model. In this work, we propose an approach for mitigating this problem by complementing model-based policies with machine learning. We formulate an MDP incorporating dynamic properties of robots, desired walking directions, and footstep features. We iteratively update the policy that determines footstep locations based on this MDP, aided by deep reinforcement learning. The policy of the proposed approach consists of a WPG and a parameterized stochastic policy. In addition, a Control Barrier Function (CBF) process applies corrections to the above policy to prevent exploration of unsafe regions during learning. Our contributions include: 1) reduction of footstep tracking errors resulting from employing the LIPM; 2) efficient exploration in the data-driven process; and 3) scalability of the procedure to any humanoid robot.




I Introduction

Humanoid robots are advantageous for mobility in tight spaces. However, fast bipedal locomotion requires precise control of the contact transition process. There are many successful studies addressing agile legged locomotion. Model-free approaches, such as the Policy Gradient (PG) method used in Deep Reinforcement Learning (DRL), rely on data and function approximation via neural networks. Model-based approaches employ differential dynamics of robots to synthesize locomotion controllers. Our work leverages the advantages of data-driven methods and model-based approaches in a safe and efficient manner.

Recent work has shown the possibility of robust and agile locomotion control through model-free learning. In [1], “locomotors” were trained for various environments and were able to achieve robust behaviors. In [2], a policy is trained on a joint space trajectory generated by motion capture data from humans [3]. The work in [4] learns local models of the robot for locomotion, while the work in [5] penalizes asymmetric motions to achieve energy efficient motions. However, model-free learning approaches are limited due to data inefficiency, unsafe policy exploration and jerky motions.

On the other hand, model-based approaches decouple the problem into two sub-problems: 1) reduce the complexity of full-body dynamics via simplified models such as the Inverted Pendulum [6, 7, 8, 9] or the Centroidal Model [10, 11, 12], and then 2) compute a feedback joint torque command that makes the robot track the behavior of the simplified model. In our recent studies [13, 14], we achieved unsupported passive ankle dynamic locomotion via two computational elements: 1) a high-level footstep planner, dubbed Time-to-Velocity-Reversal (TVR) planner, based on the Linear Inverted Pendulum Model (LIPM) and 2) a low-level Whole Body Controller (WBC) that tracks the desired trajectories. However, because the LIPM only approximates the robot, the WBC exhibits significant footstep tracking errors when following the trajectories given by the TVR planner.

In this paper, we devise a Markov Decision Process (MDP) for locomotion and employ CBF for safe learning. In contrast to model-free approaches, whose MDP is characterized by sensor data and joint torque at every control loop, our formulation augments the walking pattern generator with a model-free approach. More precisely, the moment the walking pattern is computed, we define actions related to footstep locations and use them for learning. Our objective is to find an optimal policy for the desired foot locations. We continuously update the foot location policy using the PG method and DRL. The policy is designed based on three components: the TVR planner, the parametric neural network stochastic policy, and the safety controller. Here, the TVR planner provides a good initial offset for the parametric neural network policy, which helps efficient learning. The parametric neural network takes arbitrary actions, explores the state space of the robot, and optimizes the parameters so that the long term reward is maximized. The safety controller corrects the policy so that the robot is not steered to unsafe regions of the state space. To design safe actions, we learn the discrepancies between the LIPM and the simulated robot using a Gaussian Process (GP).

The proposed MDP formulation and the learning framework have the following advantages: 1) The learned policy compensates for tracking errors. For example, the policy compensates for the effects of limb dynamics and angular momentum; 2) It provides data efficiency and safe exploration during the learning process. The policies for both forward walking and turning converge after iterations approximately; 3) Since the LIPM approximates biped robots and since WBC is a task-oriented feedback controller, the proposed algorithm is scalable to many types of biped robots.

The remainder of this paper is organized as follows: Section II describes a model-based approach for biped locomotion and DRL with safety guarantees. Section III proposes an MDP formulation and Section IV shows how we compose and update the policy effectively and safely. Section V evaluates the proposed framework in simulation for forward walking on a 10 Degree-of-Freedom (DoF) biped, dubbed DRACO, and includes a turning behavior of the 23 DoF humanoid robot, ATLAS. Finally, Section VI concludes the paper.

II Preliminaries

II-A Notation

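$\mathbb{R}$ denotes the real numbers, and $\mathbb{R}_{\geq 0}$ and $\mathbb{R}_{\leq 0}$ are the sets of non-negative and non-positive real numbers. $\mathbb{N}$ is used for natural numbers. Given $a \in \mathbb{N}$ and $b \in \mathbb{N}$, where $a < b$, the set of natural numbers in the interval $[a, b]$ is denoted by $\mathbb{N}_{[a, b]}$. The sets of $n$-dimensional real vectors and $n \times m$ real matrices are denoted by $\mathbb{R}^{n}$ and $\mathbb{R}^{n \times m}$, respectively. Given $x \in \mathbb{R}^{n}$ and $y \in \mathbb{R}^{m}$, $(x, y) := [x^{\top}, y^{\top}]^{\top}$ represents their concatenation. The $n \times m$ dimensional matrix whose elements are all one is denoted by $\mathbf{1}_{n \times m}$, and the $n \times n$ identity matrix is represented as $I_{n \times n}$. General Euclidean norm is denoted as $\|\cdot\|$. Inner product in the vector space is denoted by $\langle \cdot, \cdot \rangle$. $\mathbb{E}$ represents the probabilistic expectation operator.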

II-B A Model-based Approach to Locomotion

Fig. 1: (a) shows the state machine for locomotion behaviors. The blue and pink stars represent the $k$th and $(k+1)$th Apex Moments. (b) shows the abstraction of the walking motion with the LIPM. The three scenes show the $k$th Apex Moment, $k$th Switching Moment and $(k+1)$th Apex Moment, respectively.

In this subsection, we summarize how locomotion behaviors are represented and achieved by WPG and WBC. Locomotion behaviors are manifested as stabilizing leg contact changes (coordinated by a state machine) triggered by either pre-defined temporal specifications or foot contact detection sensors. Here, we define a Locomotion State and a state machine with simple structures to represent locomotion behaviors.

Definition 1.

(Locomotion State) A locomotion state is defined as a tuple, .

  • represents a semantic expression of locomotion behaviors: .

  • Subscripts , , and describe locomotion states for double support, lifting the right/left leg, and landing the right/left leg.

  • is a time duration for .

Definition 2.

(State Machine) We define a state machine as a sequence of Locomotion States:

  • Each Locomotion State is terminated and switched to the next state.

  • Locomotion state could be terminated before when contact is detected between the swing foot and the ground.

Based on the , we further define an Apex Moment and a Switching Moment.

Definition 3.

(Apex Moment and Switching Moment) Given the , an Apex Moment is defined as an instance when the state is switched to the state and labeled as . A Switching Moment is defined as an instance in the middle of and labeled as .

Let us consider the LIPM for our simplified model. The state of the LIPM is defined as the position and velocity of the Center of Mass (CoM) of the robot on a constant height surface and is denoted by , where represents a manifold embedded in with the LIPM dynamics. The stance of the LIPM is defined as the location of the pivot and denoted by . The input of the LIPM is defined as the desired location of the next stance and denoted by . The nomenclatures are used with a subscript to represent properties in th step, e.g., , and . When the LIPM is regulated by , we further use subscripts and to denote the properties of the robot at the Apex Moment and the Switching Moment in th step. For example,

denote the state of the LIPM at the Apex Moment and the Switching Moment in th step. Since the stance and input of the LIPM are invariant in the step, and are inter-changeable with and . Beyond the simplified model, properties of the actual robot could be represented with the subscript. For instance, and represent the orientation and angular velocity of the base link of the robot with respect to the world frame at the Apex Moment in th step, respectively. Fig. 1 illustrates and the abstraction of the locomotion behavior with the LIPM.

Given and the nomenclatures, the goal of WPG is to generate and the CoM trajectory based on and at the Apex Moment in th step. From the walking pattern, WBC provides the computation of sensor-based feedback control loops and torque command for the robot to track the desired location of the next stance and the CoM trajectory. Note that the WPG designs the pattern at the Apex Moment in each step, while the WBC computes the feedback torque command in a control loop.

II-C TVR Planner

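As a WPG, the TVR planner decides the desired location of the next stance based on the LIPM. The differential equation of the LIPM is represented as follows:

$$\ddot{x} = \frac{g}{h}\,(x - p) \qquad (1)$$

where $x$ is the horizontal position of the CoM, $p$ is the stance location, $g$ is the gravitational constant and $h$ is the constant height of the CoM of the point mass.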

This subsection considers th stepping motion and shows how the TVR planner designs the desired location of the next stance. Given an initial condition and a stance position , the solution of Eq. (1) yields a state transition map , with expression



, , and , and , respectively.
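As an illustration of such a state transition map, the following minimal sketch evaluates the standard closed-form LIPM solution along a single horizontal axis. It assumes the usual dynamics in which the CoM acceleration equals (g/h) times the offset between the CoM and the stance; the function name, argument names, and default values are hypothetical and not taken from the paper.

```python
import math


def lipm_transition(x0, v0, p, t, g=9.81, h=1.0):
    """Closed-form LIPM state after time t along one horizontal axis.

    x0, v0: initial CoM position and velocity (assumed names).
    p: stance (pivot) location, held constant during the step.
    g, h: gravitational constant and constant CoM height (illustrative defaults).
    """
    omega = math.sqrt(g / h)  # natural frequency of the pendulum
    c = math.cosh(omega * t)
    s = math.sinh(omega * t)
    x = p + (x0 - p) * c + (v0 / omega) * s
    v = omega * (x0 - p) * s + v0 * c
    return x, v


def orbital_energy(x, v, p, g=9.81, h=1.0):
    """Quantity conserved by the LIPM dynamics for a fixed stance."""
    omega = math.sqrt(g / h)
    return v * v - (omega * (x - p)) ** 2
```

Because the stance is fixed during a step, the orbital energy before and after the transition should agree up to rounding, which gives a quick sanity check of the map.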

Since the TVR planner decides the desired location of the next stance at the Apex Moment (i.e. ), we set the initial condition as . With pre-specified time duration , we compute the state at the Switching Moment as


From , the TVR planner computes , such that the sagittal velocity (lateral velocity ) of the CoM is driven to zero at ( times, respectively) after the LIPM switches to the new stance. The constraints are expressed as


where and . From Eq. (4), is computed with an additional bias term and as



and denotes a desired position for the CoM of the robot. Note that Eq. (5) is a simple proportional-derivative controller and and are the gain parameters to keep the CoM converging to the desired position. A more detailed derivation of the LIPM is described in [15].
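The velocity-reversal idea can be sketched for one axis as follows. Under the standard LIPM solution, the stance that drives the CoM velocity to zero a time `t_prime` after the switch has a closed form; the extra proportional bias toward a desired CoM position, the gain name `kappa`, and all default values are assumptions made for illustration and may differ from the gains used in Eq. (5).

```python
import math


def tvr_foot_placement(x_s, v_s, t_prime, x_des=0.0, kappa=0.0, g=9.81, h=1.0):
    """Next stance location for one axis from a velocity-reversal constraint.

    x_s, v_s: CoM position and velocity at the Switching Moment (assumed names).
    t_prime: time after the switch at which the CoM velocity should vanish.
    x_des, kappa: hypothetical desired CoM position and proportional bias gain.
    """
    omega = math.sqrt(g / h)
    coth = math.cosh(omega * t_prime) / math.sinh(omega * t_prime)
    p_reversal = x_s + (v_s / omega) * coth  # velocity is exactly zero at t_prime
    return p_reversal + kappa * (x_s - x_des)  # bias pulls the CoM toward x_des


def lipm_velocity(x0, v0, p, t, g=9.81, h=1.0):
    """CoM velocity after time t for a fixed stance p (standard LIPM solution)."""
    omega = math.sqrt(g / h)
    return omega * (x0 - p) * math.sinh(omega * t) + v0 * math.cosh(omega * t)
```

With `kappa = 0`, propagating the LIPM from the Switching Moment with the returned stance gives zero velocity at `t_prime`; a positive `kappa` places the foot farther out when the CoM is ahead of its desired position.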

II-D Reinforcement Learning with Safety Guarantees

Consider an infinite-horizon discounted MDP with control-affine, deterministic dynamics defined by the tuple , where is a set of states, is a set of actions, is the deterministic dynamics, which is affine in control, is the reward function, is the distribution of the initial state, and is the discount factor. The control affine dynamics are written as


where , denotes a state and input, , are the nominal under-actuated and actuated dynamics, and is the unknown system dynamics. Moreover, let denote a stochastic control policy parameterized with a vector that maps states to distributions over actions, and denote the policy’s expected discounted reward with expression


where is a trajectory drawn from the policy (e.g. ).

To achieve safe exploration in the learning process under the uncertain dynamics, [16] employed a Gaussian Process (GP) to approximate the unknown part of the dynamics from the dataset by learning a mean estimate and an uncertainty in tandem with a policy update process with probability confidence intervals on the estimation,


where is a design parameter for confidence (e.g. for confidence). Then, the control input is computed so that the following state stays within a given invariant safe set by computing


where .

Augmenting the safe controller, PG methods, such as Deep Deterministic Policy Gradients (DDPG) [17] or Proximal Policy Optimization (PPO) [18], estimate the gradient of the expected reward with respect to the stochastic policy based on sampled trajectories.

III MDP Formulation

In this section, we define MDP components for data efficient and safe learning. Our MDP formulation augments the TVR planner with a model-free approach. We define a set of states and a set of actions associated with the Apex Moment in each step:

where can be set as when considering the infinite steps of the locomotion.

Recall from the nomenclatures in Section II-B that and denote the state, stance, and the input of the LIPM at the Apex Moment in th step. Note that and are inter-changeable with and . Moreover, and represent the orientation and the angular velocity of the base link at the same moment.

Based on Eq. (2), we define the transition function in the MDP as



The unknown part of the dynamics in Eq. (10) is fitted via Eq. (8), using a squared exponential kernel for the GP prior in our implementation. The uncertainty is attributed to discrepancies between the simplified model and the simulated robot. Note that the dynamics of the lower part of the state cannot be expressed in closed form. Therefore, we optimize our policy in a model-free sense, but utilize CoM dynamics to provide safe exploration and data efficiency in the learning process.

To improve the locomotion behavior, we define the following reward function:


Given and the Euler ZYX representation of , is an alive bonus, is penalizing the roll and pitch deviation to maintain the body upright, is a penalty for diverging from the desired CoM positions and the heading of the robot, is for steering the robot with a desired velocity, and penalizes excessive control input.

IV Policy Search

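Our goal is to learn an optimal policy for desired foot locations. We use PPO to optimize the policy iteratively. PPO defines an advantage function $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$, where $Q^{\pi}$ is the state-action value function that evaluates the return of taking action $a$ at state $s$ and following the policy thereafter, and $V^{\pi}$ is the state value function. The policy is updated by maximizing a modified objective function

$$L^{PPO}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) A_t\right)\right],$$

where $r_t(\theta) = \pi_{\theta}(a_t | s_t) / \pi_{\theta_{old}}(a_t | s_t)$ is the importance re-sampling term that allows us to use the dataset under the old policy $\pi_{\theta_{old}}$ to estimate for the current policy $\pi_{\theta}$. $A_t$ is a short notation for $A^{\pi}(s_t, a_t)$. The $\min$ and $\mathrm{clip}$ operators ensure that the policy does not change too much from the old policy $\pi_{\theta_{old}}$.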
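To make the role of the clipping concrete, here is a minimal sketch of the standard PPO clipped surrogate evaluated on a batch of samples. It assumes log-probabilities and advantage estimates are already available as plain lists; the function name and the default clipping range of 0.2 are illustrative choices, not values reported in this letter.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped surrogate objective over a batch of samples.

    logp_new, logp_old: log-probabilities of the sampled actions under the
        current and the old policy (assumed inputs).
    advantages: advantage estimates for the same samples.
    eps: clipping range (illustrative default).
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio between the two policies
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside the clipping range in the direction that raises the objective.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

When the two policies coincide, every ratio equals one and the objective reduces to the mean advantage; once a ratio leaves the clipping range in the favorable direction, its contribution stops growing.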

IV-A Safe Set Approximation

In this subsection, we compute a safe set and a CBF to design a safe policy. The work in [19] introduced an instantaneous capture point which enables the LIPM to come to a stop if it were to instantaneously place and maintain its stance there. Here, we consider one-step capture regions for the LIPM at the Apex Moment for the th step:


where , , and . is the maximum step length that the LIPM can reach. Both and are achieved from the kinematics of the robot. is a pre-defined temporal parameter that represents time to land the swing foot. We conservatively approximate the ellipsoid of Eq. (12) with a polytope and define the safe set as



The safe set in Eq. (13) denotes the set of the LIPM state and stance pairs that could be stabilized without falling by taking one step. In other words, if the LIPM state and stance are inside the safe set, there is always a location for the next stance that stabilizes the LIPM. The projections onto the and plane of the actual one-step capture regions and their approximation are shown in Fig. 2(b).
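A one-axis version of this capturability test can be sketched from the standard instantaneous capture point, which for the LIPM is the CoM position plus the CoM velocity divided by the natural frequency. The check below is a simplified stand-in for the capture region of Eq. (12), not a reproduction of it: the function names, the scalar reach limit `l_max`, and the landing time `t_s` are assumptions for illustration.

```python
import math


def capture_point(x, v, g=9.81, h=1.0):
    """Instantaneous capture point of the LIPM along one axis."""
    return x + v / math.sqrt(g / h)


def one_step_capturable(x, v, p, l_max, t_s, g=9.81, h=1.0):
    """Return True if one step landing after t_s can bring the LIPM to rest.

    With a fixed stance p, the capture point diverges from p exponentially
    at rate omega, so its offset after t_s must still be within reach l_max.
    """
    omega = math.sqrt(g / h)
    offset_now = capture_point(x, v, g, h) - p
    offset_at_landing = offset_now * math.exp(omega * t_s)
    return abs(offset_at_landing) <= l_max
```

A slow CoM close to the stance passes the test, while a fast CoM fails it because the capture point escapes the reachable region before the swing foot can land.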

IV-B Safety Guaranteed Policy Design

Fig. 2: (a) illustrates the design of the safety guaranteed policy, . (b) shows the projection onto the and plane of the actual one-step capture regions of the LIPM and its inner approximation.

For data efficient and safe learning, we design our control input at time step with three components:


where is computed by the TVR planner and is drawn from a stochastic Gaussian policy, , where and denote the mean vector and the covariance matrix parameterized by $\theta$. In the implementation, we choose two fully connected hidden layers with activation function.

Given and , ensures the following LIPM state and stance () steered by the final control input () stays inside the safe set . In our problem, Eq. (9) is modified as


Substituting Eq. (8), Eq. (10) and Eq. (13) into Eq. (15), the optimization problem is summarized into the following Quadratic Programming (QP) and efficiently solved for the safety compensation as

Data: Number of episode , Number of data
Initialize , , data array ;
for  do
       for  do
             ,   Eq. (5),   Eq. (16) ;
             Eq. (14) ;
             Eq. (11) ;
             WBC stabilizes the robot and brings it to the next Apex Moment;
             store in ;
       end for
       Optimize with w.r.t ;
       Update GP model with ;
       clear ;
end for
Algorithm 1 Policy Learning Process

where is a slack variable in the safety constraint, and is a large constant to penalize safety violation. Here,


The first segment of the inequality represents the safety constraint and the last two represent the input constraints. The design of the safety guaranteed policy is illustrated in Fig. 2(a). Based on the MDP formulation and the policy design, the overall algorithm for efficient and safe learning of locomotion behaviors is summarized in Alg. 1.

IV-C Further Details

It is worth taking a look at each of the components in the final control input described by Eq. (14). provides a “feedforward exploration” in the state space, where the parameterized Gaussian policy explores around the TVR planner policy and optimizes the long term reward. projects onto the safe set of policies and furnishes “safety compensation”.

Particularly, in the “feedforward exploration” provides a model based initial guess on the offset and resolves two major issues caused by the safety projection: 1) inactive exploration and 2) the credit assignment problem. For example, let us consider two cases with different “feedforward explorations” as illustrated in Fig. 3, whose final control policies are: (a) and (b) .

In the case of (a) (and (b), respectively), the cyan area represents “feedforward exploration”, which is the Gaussian distribution (and , respectively) and the green dots are its samples. The pink arrow represents the “safety compensation” (and , respectively). The black striped regions are the final policy distributions and the yellow dots are their samples.

Fig. 3: The figure illustrates the “safety compensation” process. denotes an optimal control input and the orange area represents a set of safe action that ensures the state at the next time step stays inside the safe set . (a) and (b) represent two different instances of “feedforward exploration”.
Fig. 4: (a) illustrates the snapshots of forward walking regulated by , WBC, and the learned policy on the desired location of the next stance. Corresponding Locomotion States are illustrated below. (b) shows the sagittal plane phase plot of the CoM. (c) plots the sagittal directional LIPM states relative to the stance for , steps i.e., for . (d) shows the learning curves for our MDP formulation and for the conventional MDP formulation used in model-free approaches to achieve similar forward walking behavior. Note that the average return is re-scaled for the comparison. (e) illustrates box plots of the 2-norm of ZMP in the dataset during the learning process. The left and right edges of the box represent the first and the third quartile, respectively. The orange line and the notch represent the median and 95 percent confidence interval around the median. On the other side of the axis, the inter-quartile range is plotted. (f) illustrates ATLAS turning regulated by , WBC, and the learned policy. (e) shows the heading of the yaw angle of the pelvis with respect to the world frame.

In (a), there is no intersection between the set of safe actions and the possible “feedforward exploration” since, in most cases, we initialize the Gaussian policy with a zero mean vector. Then, all explorations are projected onto the safe action set. The projection does not preserve the volume in the action space, which hinders active exploration during learning. In contrast, (b) leverages the TVR planner as a near-optimal policy and retains the volume in action space to explore over.

When it comes to computing a gradient of the long term reward, the projected actions make it difficult to evaluate the resulting trajectories and assign the credits in the space. In other words, in (a), three compensated samples (yellow dots) do not roll out different trajectories, which impedes gradient descent and leads to a local optimum.
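The loss of exploration volume described above can be reproduced with a one-dimensional toy example. In one dimension, the smallest correction that brings an action into an interval of safe actions is a simple clamp, which is used here as a stand-in for the QP-based safety compensation; the interval bounds, offsets, standard deviation, and function names are all hypothetical.

```python
import random


def project_to_interval(u, lo, hi):
    """Minimal-correction projection of a scalar action onto [lo, hi]."""
    return min(max(u, lo), hi)


def explore(offset, sigma, safe_lo, safe_hi, n=5, seed=0):
    """Sample Gaussian exploration around an offset, then apply the projection."""
    rng = random.Random(seed)
    raw = [offset + rng.gauss(0.0, sigma) for _ in range(n)]
    return [project_to_interval(u, safe_lo, safe_hi) for u in raw]


# Case (a): zero-mean exploration far away from the safe actions.
case_a = explore(offset=0.0, sigma=0.02, safe_lo=0.30, safe_hi=0.40)
# Case (b): exploration centred on a model-based offset inside the safe set.
case_b = explore(offset=0.35, sigma=0.02, safe_lo=0.30, safe_hi=0.40)

print("distinct actions in (a):", len(set(case_a)))
print("distinct actions in (b):", len(set(case_b)))
```

In case (a) every sample collapses onto the same boundary of the safe interval, so the rollouts carry no information about which exploration direction was better; in case (b) the samples remain distinct after projection.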

V Simulation Results

Our proposed MDP formulation and the policy design can be applied to any kind of humanoid to achieve versatile locomotion behavior. In this section, we evaluate our framework via forward walking with the 10-DoF DRACO biped [14] and turning with Boston Dynamics' 23-DoF ATLAS humanoid in the DART simulator [20]. Parameters such as the robot’s kinematics, time duration in , gains, the number of nodes in two hidden layers, reward scaling constant and the behavior steering factors are summarized in Table I.

V-A DRACO Forward Walking

DRACO is a 10-DoF biped designed for agile locomotion, with lightweight, small feet and no ankle roll actuation. Due to the absence of ankle roll actuation and the size of the feet, we design the WBC to control the position of the feet, the roll and pitch of the torso, and the height of the CoM of the robot. We move a target frame that represents the desired position and heading of the robot with velocity of to achieve a forward walking behavior.

Fig. 4 summarizes the results of the forward walking simulation. In (a) and (b), the forward walking behavior is regulated by , the WBC, and the learned footstep decision making policy. (c) illustrates the sagittal directional LIPM states relative to stances and shows that the explorations all stay inside the safe set. (d) illustrates the data efficiency of our proposed MDP formulation in policy learning compared to the other conventional MDP formulations in model-free approaches. For the comparison, we have trained the policy to achieve forward walking with similar velocity using the same PG methods but different MDP formulation. The learning curve for the proposed MDP formulation is converged with iterations, while the other one requires more than updates.

In (e), we show the 2-norm of Zero-Moment-Points (ZMP) in the dataset in the learning process and argue that the policy learning on the desired location of the next stance is enhancing the locomotion capability. The ZMP has been a significant indicator for dynamic balancing and is a widely used concept in the control of walking robots [21]. For example, when the ZMP moves outside of the support polygon, the robot loses its balance. In the box plot, the inter-quartile range decreases as the learning process proceeds. This indicates that less ankle torque is used for balancing, which results in less shaky locomotion. To evaluate the learned GP model, we perform 4-fold cross-validation. The mean of the coefficient of determination is .

V-B ATLAS Turning

In the second simulation, we adapt the proposed MDP formulation and accomplish a different type of locomotion behavior, which is turning. Here, we use the full humanoid robot, ATLAS. To achieve turning behavior in the higher DoF robot, WBC is designed to stabilize the position and orientation of the feet, pelvis, and torso. All the joints are commanded to maintain nominal positions at the lowest hierarchy.

We incrementally rotate a target frame with angular velocity . The policy learns to correct the desired location of the next stance for turning behavior which cannot be represented with the LIPM. Our algorithm is scalable regardless of the complexity of the robot and the learning curve converges at a similar number of iterations to the first simulation. Fig. 4(f) and (e) show the results of ATLAS turning behavior.

VI Concluding Remarks

In this letter, we have described an MDP formulation for data efficient and safe learning for locomotion. Our formulation is built upon our previous work [13, 14] that makes footstep decisions using the LIPM and stabilizes the robot with WBC. Based on footstep decisions, we define states and actions in our MDP while WBC stabilizes the robot to step over the desired locations. At the same time, we learn the transition function of the MDP using a GP based on the LIPM, such that we compensate for behaviors outside of the LIPM. We design our policy as a combination of the TVR policy, a parametric stochastic policy, and safety guarantees via CBF. We evaluate our framework’s efficiency and safe exploration during the learning process through simulations of DRACO walking forward and ATLAS turning.

In the future, we plan to implement this framework on real bipedal hardware, in particular our liquid cooled viscoelastic biped, DRACO. In the past, we have seen many behaviors that the LIPM could not capture and that required cumbersome tuning procedures. We expect the policy learning technique presented here will automatically find the gap between model and reality and adjust the policy accordingly.

TABLE I: Simulation Parameters (parameter groups: LIPM, Reward, Behavior).


The authors would like to thank the members of the Human Centered Robotics Laboratory at The University of Texas at Austin for their great help and support. This work was supported by the Office of Naval Research, ONR Grant #N000141512507 and the National Science Foundation, NSF Grant #1724360.