Using Simulation to Improve Sample-Efficiency of Bayesian Optimization for Bipedal Robots

05/07/2018 · by Akshara Rai, et al. · KTH Royal Institute of Technology · Carnegie Mellon University

Learning for control can acquire controllers for novel robotic tasks, paving the path for autonomous agents. Such controllers can be expert-designed policies, which typically require tuning of parameters for each task scenario. In this context, Bayesian optimization (BO) has emerged as a promising approach for automatically tuning controllers. However, when performing BO on hardware for high-dimensional policies, sample-efficiency can be an issue. Here, we develop an approach that utilizes simulation to map the original parameter space into a domain-informed space. During BO, similarity between controllers is now calculated in this transformed space. Experiments on the ATRIAS robot hardware and another bipedal robot simulation show that our approach succeeds at sample-efficiently learning controllers for multiple robots. Another question arises: What if the simulation significantly differs from hardware? To answer this, we create increasingly approximate simulators and study the effect of increasing simulation-hardware mismatch on the performance of Bayesian optimization. We also compare our approach to other approaches from literature, and find it to be more reliable, especially in cases of high mismatch. Our experiments show that our approach succeeds across different controller types, bipedal robot models and simulator fidelity levels, making it applicable to a wide range of bipedal locomotion problems.

1 Introduction

Machine learning can provide methods for learning controllers for robotic tasks. Yet, even with recent advances in this field, automatically designing and learning controllers for robots, especially bipedal robots, remains difficult. Some of the core challenges of learning for control can be summarized as follows: it is expensive to run learning experiments that require a large number of samples on physical robots. In particular, legged robots are not robust to falls and failures, and are time-consuming to work with and repair. Furthermore, commonly used cost functions for optimizing controllers are noisy to evaluate, non-convex and non-differentiable. In order to find learning approaches that can be used on real robots, it is thus important to keep these considerations in mind.

Deep reinforcement learning approaches can deal with noise, discontinuities and non-convexity of the objective, but they are not data-efficient. These approaches can take on the order of a million samples to learn locomotion controllers (Peng et al., 2016), which would be infeasible on a real robot. For example, on the ATRIAS robot, a million samples would take days of continuous operation, in theory. In practice, the robot also needs to be "reset" between trials and repaired in case of damage. Using structured expert-designed policies can help minimize damage to the robot and make the search for successful controllers feasible. However, the problem remains black-box, non-convex and discontinuous. This eliminates approaches like PI² (Theodorou et al., 2010), which makes assumptions about the dynamics of the system, and PILCO (Deisenroth and Rasmussen, 2011), which assumes a continuous cost landscape. Evolutionary approaches like CMA-ES (Hansen, 2006) can still be prohibitively expensive, needing thousands of samples (Song and Geyer, 2015).

Figure 1: ATRIAS robot.

In comparison, Bayesian optimization (BO) is a sample-efficient optimization technique that is robust to non-convexity, noise and even discontinuities. It has recently been used in a range of robotics problems, such as in Calandra et al. (2016b), Marco et al. (2017) and Cully et al. (2015). However, the sample-efficiency of conventional BO degrades in high dimensions, even for dimensionalities commonly encountered in locomotion controllers. Because of this, hardware-only optimization becomes intractable for flexible controllers and complex robots. One way of addressing this issue is to utilize simulation to optimize controller parameters. However, simulation-only optimization is vulnerable to learning policies that exploit the simulation and perform well in simulation but poorly on the actual robot. This motivates the development of approaches that incorporate simulation-based information into the learning method, then optimize with few samples on hardware.

Towards this goal, our previous work in Antonova et al. (2016), Antonova et al. (2017) and Rai et al. (2017) presents a framework that uses information from high-fidelity simulators to learn sample-efficiently on hardware. We use simulation to build informed feature transforms that are used to measure similarity during BO. Thus, the similarity between controller parameters during optimization on hardware is informed by how they perform in simulation. With this, it becomes possible to quickly infer which regions of the input space are likely to perform well on hardware. This method has been tested on the ATRIAS biped robot (Figure 1) and shows considerable improvement in sample-efficiency over traditional BO.

In this article, we present in-depth explanations and empirical analysis of our previous work. Furthermore, for the first time, we present a procedure for systematically evaluating the robustness of such approaches to simulation-hardware mismatch. We extend our previous work incorporating mismatch estimates (Rai et al., 2017) to this setting. We also conduct extensive comparisons with competitive baselines from related work, such as Cully et al. (2015).

The rest of this article is organized as follows: Section 2 provides background for BO, then gives an overview of related work on optimizing locomotion controllers. Section 3.1 describes the idea of incorporating simulation-based transforms into BO; Section 3.2 explains how we handle simulation-hardware mismatch. Sections 4.1-4.5 describe the robot and controllers we use for our experiments; Section 4.6 explains the motivation and construction of simulators with various levels of fidelity. Section 5 gives a summary of hardware experiments conducted on the ATRIAS robot. Section 5.2 shows generalization to a different robot model in simulation. Section 5.3 shows empirical analysis of the impact of simulator fidelity on the performance of the proposed algorithms and alternative approaches.

2 Background and Related Work

This section gives a brief overview of Bayesian optimization (BO), the state-of-the-art research on optimizing locomotion controllers, and utilizing simulation information in BO.

2.1 Background on Bayesian Optimization

Bayesian optimization (BO) is a framework for online, black-box, gradient-free global search (Shahriari et al. (2016) and Brochu et al. (2010) provide a comprehensive introduction). The problem of optimizing controllers can be interpreted as finding controller parameters $\boldsymbol{x}$ that optimize some cost function $f(\boldsymbol{x})$. Here $\boldsymbol{x}$ contains the parameters of a pre-structured policy; the cost $f$ is a function of the trajectory induced by the controller parameters $\boldsymbol{x}$. For brevity, we will refer to 'controller parameters $\boldsymbol{x}$' as 'controller $\boldsymbol{x}$'. We use BO to find a controller $\boldsymbol{x}^*$, such that: $\boldsymbol{x}^* = \arg\min_{\boldsymbol{x}} f(\boldsymbol{x})$.

BO is initialized with a prior that expresses the a priori uncertainty over the value of $f(\boldsymbol{x})$ for each $\boldsymbol{x}$ in the domain. Then, at each step of optimization, based on the data seen so far, BO optimizes an auxiliary function (called the acquisition function) to select the next $\boldsymbol{x}$ to evaluate. The acquisition function balances exploration vs exploitation: it selects points for which the posterior estimate of the objective is promising, taking into account both mean and covariance of the posterior. A widely used representation for the cost function is a Gaussian process (GP):

$$f(\boldsymbol{x}) \sim \mathcal{GP}\big(\mu(\boldsymbol{x}),\, k(\boldsymbol{x}_i, \boldsymbol{x}_j)\big)$$

The prior mean function $\mu(\cdot)$ is set to $0$ when no domain-specific knowledge is provided, or can be informative when such knowledge is available. The kernel function $k(\boldsymbol{x}_i, \boldsymbol{x}_j)$ encodes similarity between inputs. If $k(\boldsymbol{x}_i, \boldsymbol{x}_j)$ is large for inputs $\boldsymbol{x}_i, \boldsymbol{x}_j$, then $f(\boldsymbol{x}_i)$ strongly influences $f(\boldsymbol{x}_j)$. One of the most widely used kernel functions is the Squared Exponential (SE):

$$k_{SE}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2}(\boldsymbol{x}_i - \boldsymbol{x}_j)^{\top}\, \mathrm{diag}(\boldsymbol{\ell})^{-2}\, (\boldsymbol{x}_i - \boldsymbol{x}_j)\Big),$$

where $\sigma_k^2$ and $\boldsymbol{\ell}$ are the signal variance and a vector of length scales, respectively. $\sigma_k^2$ and $\boldsymbol{\ell}$ are referred to as 'hyperparameters' in the literature.
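To make these ingredients concrete, the following minimal numpy sketch implements the SE kernel and the resulting GP posterior. The function and variable names are ours, chosen for illustration; a practical BO implementation would follow Shahriari et al. (2016).

```python
import numpy as np

def se_kernel(X1, X2, signal_var=1.0, lengthscales=None):
    """Squared Exponential kernel with per-dimension length scales."""
    d = X1.shape[1]
    ell = np.ones(d) if lengthscales is None else np.asarray(lengthscales)
    A, B = X1 / ell, X2 / ell
    # Pairwise scaled squared distances between rows of X1 and X2.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0))

def gp_posterior(X, y, Xq, noise_var=0.01, **kernel_args):
    """GP posterior mean and variance at query points Xq, given data (X, y)."""
    K = se_kernel(X, X, **kernel_args) + noise_var * np.eye(len(X))
    Ks = se_kernel(X, Xq, **kernel_args)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = se_kernel(Xq, Xq, **kernel_args).diagonal() - np.sum(v**2, axis=0)
    return mean, var
```

At each BO iteration, the acquisition function is evaluated on this posterior to select the next controller to test (see Appendix A for the acquisition function we use).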

2.2 Optimizing Locomotion Controllers

Parametric locomotion controllers can be represented as $\boldsymbol{u} = \pi_{\boldsymbol{x}}(\boldsymbol{s})$, where $\pi$ is a policy structure that depends on parameters $\boldsymbol{x}$. For example, $\pi$ can be parameterized by feedback gains on the center of mass (CoM), reference joint trajectories, etc. Vector $\boldsymbol{s}$ is the state of the robot, such as joint angles and velocities, used in closed-loop controllers. Vector $\boldsymbol{u}$ represents the desired control action, for example: torques, angular velocities or positions for each joint on the robot. The sequence of control actions yields a sequence of state transitions, which form the overall 'trajectory' $\xi$. This trajectory is used in the cost function $f$ to judge the quality of the controller $\boldsymbol{x}$. In our work, we use structured controllers designed by experts. State-of-the-art research on walking robots featuring such controllers includes Feng et al. (2015) and Kuindersma et al. (2016). The overall optimization then includes manually tuning the parameters $\boldsymbol{x}$. An alternative to manual tuning is to use evolutionary approaches, like CMA-ES, as in Song and Geyer (2015). However, these require a large number of samples and can usually be conducted only in simulation. Optimization in simulation can produce controllers that perform well in simulation, but not on hardware. In comparison, BO is a sample-efficient technique that has become popular for direct optimization on hardware. Recent successes include manipulation (Englert and Toussaint, 2016) and locomotion (Calandra et al., 2016b).

BO for locomotion has been previously explored for several types of mobile robots, including snake robots (Tesch et al., 2011), AIBO quadrupeds (Lizotte et al., 2007), and hexapods (Cully et al., 2015). Tesch et al. (2011) optimize a 3-dimensional controller for a snake robot in 10-40 trials. Lizotte et al. (2007) use BO to optimize gait parameters for an AIBO robot in 100-150 trials. Cully et al. (2015) learn 36 controller parameters for a hexapod; even with hardware damage, they obtain successful controllers in 12-15 trials.

Hexapods, quadrupeds and snakes spend a large portion of their gaits being statically stable. In contrast, bipedal walking can be highly dynamic, especially for point-feet robots like ATRIAS. ATRIAS can only be statically stable in double-stance, and like most bipeds, spends a significant portion of its gait being "unstable", or dynamically stable. In our hardware experiments, ATRIAS walks at a range of target speeds. All of this leads to a challenging optimization setting and a discontinuous cost function landscape. Calandra et al. (2016b) use BO for optimizing gaits of a dynamic biped on a boom, needing 30-40 samples to find walking gaits for a 4-dimensional controller. While this is promising, optimizing the higher-dimensional controllers needed for complex robots is even more challenging. If a significant number of samples lead to unstable gaits and falls, they can damage the robot. Hence, it is important to develop methods that learn complex controllers quickly, without damaging the robot.

2.3 Incorporating Simulation Information into Bayesian Optimization

The idea of using simulation to speed up BO on hardware has been explored before. Marco et al. (2017) use simulation as a second source of noisy data. Information from simulation can also be added as a prior to the GP used in BO, such as in Cully et al. (2015). While these methods can be successful, one needs to carefully tune the influence of simulation points over hardware points, especially when simulation is significantly different from hardware.

Recently, several approaches proposed incorporating Neural Networks (NNs) into Gaussian process (GP) kernels (Wilson et al. (2016), Calandra et al. (2016a)). The strength of these approaches is that they can jointly update the GP and the NN. Calandra et al. (2016a) demonstrated how this added flexibility can handle discontinuities in the cost function landscape. However, these approaches do not directly address the problem of incorporating a large amount of data from simulation into hardware BO experiments.

Wilson et al. (2014) explored enhancing the GP kernel with trajectory information. Their Behavior Based Kernel (BBK) computes an estimate of a symmetric variant of the KL divergence between the trajectories induced by two controllers, and uses this as a distance metric in the kernel. However, obtaining this estimate requires samples from both controllers whenever a kernel value is needed. This can be impractical, as it involves an evaluation of every controller considered. The authors suggest combining BBK with a model-based approach to overcome this issue by learning a model. But building a reliable model might be an expensive process in itself.

Cully et al. (2015) utilize simulation by defining a behavior metric and collecting best performing points in simulation. This behavior metric then guides BO to quickly find controllers on hardware, and can even compensate for damage to the robot. The search on hardware is conducted in behavior space, and limited to pre-selected “successful” points from simulation. This helps make their search faster and safer on hardware. However, if an optimal point was not pre-selected, BO cannot sample it during optimization.

In our work we develop two alternative strategies that use trajectories from simulation to build feature transforms that can be incorporated into the GP kernel used for BO. Our approaches incorporate trajectory/behavior information, while ensuring that the kernel can be computed efficiently during BO. They bias the search towards regions that look promising, but are able to 'recover' and search in other parts of the space if simulation-hardware mismatch becomes apparent.

3 Proposed Approach: Bayesian Optimization with Informed Kernels

In this section, we offer an in-depth explanation of the approaches from our work in Antonova et al. (2016), Antonova et al. (2017) and Rai et al. (2017). This work proposes incorporating domain knowledge into BO with the help of simulation. We evaluate locomotion controllers in simulation and collect their induced trajectories, which are then used to build an informed transform. This can be achieved by using a domain-specific feature transform (Section 3.1.1) or by learning to reconstruct short trajectory summaries (Section 3.1.2). The feature transform is used to construct an informed distance metric for BO, and helps BO discover promising regions faster. An overview can be found in Figure 2. In Section 3.2 we discuss how to incorporate simulation-hardware mismatch into the transform, ensuring that BO can benefit from inaccurate simulations as well.

3.1 Constructing Flexible Kernels using Simulation-based Transforms

Figure 2: Overview of our proposed approach. Here, $\pi$ is the policy (Section 2.2); $\boldsymbol{x}$ is a vector of controller parameters; $\boldsymbol{s}$ is the state of the robot; $\xi$ is the trajectory observed in simulation for $\boldsymbol{x}$; $\phi(\boldsymbol{x})$ is the transform built using $\xi$; $f(\boldsymbol{x})$ is the cost of $\boldsymbol{x}$ evaluated on hardware. BO uses $\phi$ and the evaluated costs to propose the next promising controller $\boldsymbol{x}$.

High-dimensional problems with discontinuous cost functions are very common for legged robots, where slight changes to some parameters can make the robot unstable. Both of these factors can adversely affect BO's performance, but informed feature transforms can help BO sample high-performing controllers even in such scenarios.

In this section, we demonstrate how to construct such a transform $\phi(\boldsymbol{x})$ utilizing simulations for a given controller $\boldsymbol{x}$. We then use $\phi$ to create an informed kernel for BO on hardware:

$$k_{\phi}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2}\big(\phi(\boldsymbol{x}_i) - \phi(\boldsymbol{x}_j)\big)^{\top}\, \mathrm{diag}(\boldsymbol{\ell})^{-2}\, \big(\phi(\boldsymbol{x}_i) - \phi(\boldsymbol{x}_j)\big)\Big) \qquad (1)$$

Note that the functional form above is the same as that of the Squared Exponential kernel, if considered from the point of view of the transformed space, with $\phi(\boldsymbol{x})$ as input. While this kernel is stationary as a function of $\phi(\boldsymbol{x})$, it is non-stationary in $\boldsymbol{x}$. $\phi$ can bring closer related parts of the space that would otherwise be far apart in the original space. BO can then operate in the space of $\phi(\boldsymbol{x})$, which is 'informed' by simulation.
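In code, the change relative to standard BO is small: the kernel simply evaluates the SE form on $\phi(\boldsymbol{x})$ instead of on $\boldsymbol{x}$. A minimal sketch, reusing `se_kernel` from the Section 2.1 sketch and assuming `phi` maps an (n, d) array of controller parameters to an (n, m) array of simulation-based features:

```python
def informed_kernel(X1, X2, phi, signal_var=1.0, lengthscales=None):
    """k_phi of Equation 1: an SE kernel on simulation-based features.

    `phi` is the feature transform, e.g. the DoG score of Section 3.1.1
    or the NN-reconstructed trajectory summaries of Section 3.1.2."""
    return se_kernel(phi(X1), phi(X2), signal_var, lengthscales)
```

Since $\phi$ is precomputed from simulation (or given by a fast neural network), evaluating this kernel during BO costs no additional simulation time.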

3.1.1 The Determinants of Gait Transform

We propose a feature transform for bipedal locomotion derived from physiological features of human walking called the Determinants of Gait (DoG) (Inman et al., 1953). $\phi_{DoG}$ was originally developed for human-like robots and controllers (Antonova et al., 2016), and then generalized to apply to a wider range of bipedal locomotion controllers and robot morphologies (Rai et al., 2017). It is based on the features in Table 1.

$F_1$ (Swing leg retraction) – If the maximum ground clearance of the swing foot is more than a threshold, $F_1 = 1$ (0 otherwise); ensures swing leg retraction.
$F_2$ (Center of mass height) – If the CoM height stays about the same at the start and end of a step, $F_2 = 1$ (0 otherwise); checks that the robot is not falling.
$F_3$ (Trunk lean) – If the average trunk lean is the same at the start and end of a step, $F_3 = 1$ (0 otherwise); ensures that the trunk is not changing orientation.
$F_4$ (Average walking speed) – Average forward speed of a controller per step, $F_4 = \bar{v}$; helps distinguish controllers that perform similarly on $F_1$–$F_3$.
Table 1: Illustration of the features used to construct the DoG transform.

$\phi_{DoG}$ combines the features per step and scales them by the normalized simulation time to obtain the DoG score of controller $\boldsymbol{x}$:

$$\phi_{DoG}(\boldsymbol{x}) = \frac{t_{sim}}{T} \sum_{s=1}^{N} \big(F_1^{(s)} + F_2^{(s)} + F_3^{(s)} + F_4^{(s)}\big) \qquad (2)$$

Here $N$ is the number of steps taken in simulation, $t_{sim}$ is the time at which the simulation terminated (possibly due to a fall), and $T$ is the total time allotted for simulation. Since a larger number of steps leads to a higher DoG score, some controllers that chatter (step very fast before falling) could get misleadingly high scores; we scale the scores by $t_{sim}/T$ to prevent that. $\phi_{DoG}(\boldsymbol{x})$ for controller parameters $\boldsymbol{x}$ is then the computed DoG score of the resulting trajectories when $\boldsymbol{x}$ is simulated. $\phi_{DoG}$ essentially aids in (soft) clustering of controllers based on their behaviour in simulation. High-scoring controllers are more likely to walk than low-scoring ones. Since $F_1$–$F_4$ are based on intuitive gait features, they are more likely to transfer between simulation and hardware, as compared to the direct cost. The thresholds in $F_1$–$F_3$ are chosen according to values observed in nominal human walking from Winter and Yack (1987).
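A sketch of the score computation for one simulated rollout follows. The per-step feature values are assumed to be extracted from the simulated trajectory as in Table 1, and the summation form mirrors our reconstruction of Equation 2:

```python
def dog_score(step_features, t_sim, t_total):
    """DoG score of one rollout. `step_features` holds one (F1, F2, F3, F4)
    tuple per step: F1-F3 are the 0/1 checks from Table 1, and F4 is the
    average forward speed of that step."""
    per_step_sum = sum(f1 + f2 + f3 + f4 for f1, f2, f3, f4 in step_features)
    # Scaling by normalized simulation time penalizes chattering controllers
    # that take many quick steps and then fall.
    return (t_sim / t_total) * per_step_sum
```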

3.1.2 Learning Feature Transform with a Neural Network

While domain-specific feature transforms can be extremely useful and robust, they might be difficult to generate when a domain expert is not present. This motivates directly learning such feature transforms from trajectory data. In this section we describe our approach of training neural networks to reconstruct trajectory summaries (Antonova et al., 2017), which achieves this goal of minimizing expert involvement.

Trajectory summaries are a convenient choice for reparametrizing controllers into an easy-to-optimize space. For example, controllers that fall would automatically be far away from controllers that walk. If these trajectories can be extracted from a high-fidelity simulator, we would not have to evaluate each controller on hardware. However, conventional implementations of BO evaluate the kernel function for a large number of points per iteration, which would require thousands of simulations each iteration. To avoid this, a Neural Network (NN) can be trained to reconstruct trajectory summaries from a large set of pre-sampled data points. The NN provides flexible interpolation, as well as fast evaluation (controller parameters → trajectory summary). Furthermore, trajectories are agnostic to the specific cost used during BO. Thus the data collection can be done offline, and there is no need to re-run simulations if the definition of the cost is modified.

We use the term 'trajectory' in a general sense, referring to several sensory states recorded during a simulation. To create trajectory summaries for the case of locomotion, we include measurements of: walking time (time before falling), energy used during walking, position of the center of mass and angle of the torso. With this, we construct a dataset for the NN to fit: a large Sobol grid of controller parameters $\boldsymbol{x}_i$ along with trajectory summaries $\xi_{\boldsymbol{x}_i}$ from simulation (see Table 2 for dataset sizes). The NN is trained using a mean squared loss:

NN input: $\boldsymbol{x}$ – a vector of controller parameters

NN output: $\hat{\xi}_{\boldsymbol{x}}$ – reconstructed trajectory summary

NN loss: $\mathcal{L} = \sum_i \big\| \hat{\xi}_{\boldsymbol{x}_i} - \xi_{\boldsymbol{x}_i} \big\|^2$

The outputs $\hat{\xi}_{\boldsymbol{x}} = NN(\boldsymbol{x})$ are then used in the kernel for BO:

$$k_{NN}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2}\big(NN(\boldsymbol{x}_i) - NN(\boldsymbol{x}_j)\big)^{\top}\, \mathrm{diag}(\boldsymbol{\ell})^{-2}\, \big(NN(\boldsymbol{x}_i) - NN(\boldsymbol{x}_j)\big)\Big) \qquad (3)$$

We did not carefully select the sensory traces used in the trajectory summaries. Instead, we used the most obvious states, aiming for an approach that could be easily adapted to other domains. To apply this approach to a new setting, one could simply include information that is customarily tracked, or used in costs. For example, for a manipulator, the coordinates of the end effector(s) could be recorded at relevant points. Force-torque measurements could be included, if available.
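A minimal training sketch in PyTorch is shown below. The network width and depth, dataset sizes and summary dimension are illustrative placeholders, not the exact settings used in our experiments (see Table 2 in Appendix A for the actual dataset sizes):

```python
import torch
import torch.nn as nn

# Hypothetical pre-collected dataset: X holds Sobol-sampled controller
# parameters, S holds the corresponding trajectory summaries from simulation.
X = torch.rand(100_000, 9)   # controller parameters (9D controller)
S = torch.rand(100_000, 4)   # summaries, e.g. (t_walked, x_CoM, z_CoM, theta)

net = nn.Sequential(nn.Linear(9, 128), nn.ReLU(),
                    nn.Linear(128, 128), nn.ReLU(),
                    nn.Linear(128, 4))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # mean squared reconstruction loss

for epoch in range(20):
    for i in range(0, len(X), 1024):
        xb, sb = X[i:i+1024], S[i:i+1024]
        opt.zero_grad()
        loss = loss_fn(net(xb), sb)
        loss.backward()
        opt.step()

# After training, net(x) plays the role of NN(x) in the kernel of Equation 3.
```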

3.2 Kernel Adjustment for Handling Simulation-Hardware Mismatch

The approaches described in the previous sections can improve BO when a high-fidelity simulator is used for kernel construction. In Rai et al. (2017) we presented promising results of experimental evaluation on hardware. However, it is unclear how the performance changes when simulation-hardware mismatch becomes apparent.

In Rai et al. (2017), we also proposed a way to incorporate information about simulation-hardware mismatch into the kernel, using the samples evaluated so far. We augment the simulation-based kernel with this additional information by expanding the original kernel with an extra dimension that contains the predicted mismatch for each controller $\boldsymbol{x}$.

A separate Gaussian process is used to model the mismatch experienced on hardware, starting from an initial prior mismatch of 0: $g(\boldsymbol{x}) \sim \mathcal{GP}(0, k_{SE}(\boldsymbol{x}_i, \boldsymbol{x}_j))$. For any evaluated controller $\boldsymbol{x}$, we can compute the difference between $\phi(\boldsymbol{x})$ in simulation and on hardware: $\delta(\boldsymbol{x}) = \phi_{hw}(\boldsymbol{x}) - \phi_{sim}(\boldsymbol{x})$. We can now use the mismatch data $\{\boldsymbol{x}_i, \delta(\boldsymbol{x}_i)\}$ to construct a model $g(\boldsymbol{x})$ for the expected mismatch. In the case of a GP-based model, $g(\boldsymbol{x})$ denotes the posterior mean. With this, we can predict the simulation-hardware mismatch in the original space of controller parameters for unevaluated controllers. Combining this with kernel $k_{\phi}$, we obtain an adjusted kernel:

$$k_{adj}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2}\, \boldsymbol{\psi}_{ij}^{\top}\, \mathrm{diag}(\boldsymbol{\ell}, \boldsymbol{\ell}_g)^{-2}\, \boldsymbol{\psi}_{ij}\Big), \quad \boldsymbol{\psi}_{ij} = \begin{bmatrix} \phi(\boldsymbol{x}_i) - \phi(\boldsymbol{x}_j) \\ g(\boldsymbol{x}_i) - g(\boldsymbol{x}_j) \end{bmatrix} \qquad (4)$$

The similarity between points is now dictated by two components: representation in $\phi$ space and expected mismatch. This construction has an intuitive explanation. Suppose controller $\boldsymbol{x}_a$ results in walking when simulated, but falls during hardware evaluation. $g$ would register a high mismatch for $\boldsymbol{x}_a$. Controllers would be deemed similar to $\boldsymbol{x}_a$ only if they have both a similar simulation-based $\phi$ and a similar estimated mismatch. Points with a similar simulation-based $\phi$ but low predicted mismatch would still be 'far away' from the failed $\boldsymbol{x}_a$. This helps BO sample points that still have high chances of walking in simulation, but lie in a different region of the original parameter space. In the next section, we present a more mathematically rigorous interpretation of $k_{adj}$.
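The kernel computation then looks as follows. This is a sketch: `mismatch_model` stands for whatever data-efficient regressor is fitted to the observed mismatches (its `predict` method returning the posterior mean is our assumed interface, not a specific library API), and `se_kernel` is the function from the Section 2.1 sketch.

```python
import numpy as np

def adjusted_features(X, phi, mismatch_model):
    """Augment phi(x) with the predicted simulation-hardware mismatch g(x)."""
    g = mismatch_model.predict(X)   # assumed interface: posterior mean of g
    return np.column_stack([phi(X), g])

def adjusted_kernel(X1, X2, phi, mismatch_model,
                    signal_var=1.0, lengthscales=None):
    """k_adj of Equation 4: an SE kernel on the augmented features."""
    Z1 = adjusted_features(X1, phi, mismatch_model)
    Z2 = adjusted_features(X2, phi, mismatch_model)
    return se_kernel(Z1, Z2, signal_var, lengthscales)
```

Before any hardware data is seen, $g(\boldsymbol{x})$ is 0 everywhere (by the prior), so $k_{adj}$ reduces to $k_{\phi}$.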

3.2.1 Interpretation of Kernel with Mismatch Modeling

Let us consider a controller $\boldsymbol{x}^*$ evaluated on hardware. The difference between the simulation-based and hardware-based feature transforms for $\boldsymbol{x}^*$ is $\delta(\boldsymbol{x}^*) = \phi_{hw}(\boldsymbol{x}^*) - \phi(\boldsymbol{x}^*)$. The 'true' hardware feature transform for $\boldsymbol{x}^*$ is thus $\phi_{hw}(\boldsymbol{x}^*) = \phi(\boldsymbol{x}^*) + \delta(\boldsymbol{x}^*)$. After $n$ evaluations on hardware, $\{\boldsymbol{x}_i, \delta(\boldsymbol{x}_i)\}_{i=1}^{n}$ can serve as data for modeling the simulation-hardware mismatch. In principle, any data-efficient model can be used, such as a GP (a multi-output GP in case $\phi$ is multi-dimensional). With this, we can obtain an adjusted transform $\phi_{adj}(\boldsymbol{x}) = \phi(\boldsymbol{x}) + g(\boldsymbol{x})$, where $g(\boldsymbol{x})$ is the output of the model fitted using $\{\boldsymbol{x}_i, \delta(\boldsymbol{x}_i)\}$.

Suppose $\boldsymbol{x}$ has not been evaluated on hardware. We can use $\phi_{adj}(\boldsymbol{x})$ as the adjusted estimate of what the output of $\phi_{hw}(\boldsymbol{x})$ should be, taking into account what we have learned so far about the simulation-hardware mismatch.

Let’s construct kernel that uses these hardware-adjusted estimates directly:

Using , we have:

If we now observe that we get:

Compare this to from Equation 4:

(5)

Now we see that $\hat{k}$ and $k_{adj}$ have a similar form. The hyperparameters $\boldsymbol{\ell}_g$ provide flexibility in $k_{adj}$, as compared to having only the vector $\boldsymbol{\ell}$ in $\hat{k}$. They can be adjusted manually or with Automatic Relevance Determination. For $\hat{k}$, the role of the signal variance is affected by the input-dependent cross term $\exp(-\Delta\phi_{ij}^{\top}\, \mathrm{diag}(\boldsymbol{\ell})^{-2}\, \Delta g_{ij})$. This makes the kernel non-stationary in the transformed space. Since $\phi$ is already non-stationary in $\boldsymbol{x}$, it is unclear whether the non-stationarity of $\hat{k}$ in the transformed space has any advantages.

The above discussion shows that $k_{adj}$, proposed in Rai et al. (2017), is motivated both intuitively and mathematically. It uses a transform that accounts for the hardware mismatch, without adding extra non-stationarity in the transformed space.

4 Robots, Simulators and Controllers Used

In this section we give a concise description of the robots, controllers and simulators used in experiments with BO for bipedal locomotion. We aim for our approach to be applicable to a wide range of bipedal robot morphologies and controllers, including state-of-the-art controllers (Feng et al., 2015). This ensures that our experimental results are relevant to current research for bipedal locomotion and are transferable to other systems.

We work with two different types of controllers – a reactive stepping controller and a human-inspired neuromuscular controller (NMC). The reactive stepping controller is model-based: it uses inverse-dynamics models of the robot to compute desired motor torques. In contrast, the NMC is model-free: it computes desired torques using hand-designed policies, created with biped locomotion dynamics in mind. These controllers exemplify two different and widely used ways of controlling bipedal robots. In addition, we show results on two different robot morphologies – the parallel bipedal robot ATRIAS, and a serial 7-link biped model. Our hardware experiments are conducted on ATRIAS; the 7-link biped is only used in simulation. Our success on both robots shows that the approaches developed in this paper are applicable to a wide range of bipedal robots and controllers.

4.1 ATRIAS Robot

Our hardware platform is the ATRIAS robot (Figure 1). ATRIAS is a parallel bipedal robot with most of its mass concentrated around the torso. The legs are 4-segment linkages actuated by 2 Series Elastic Actuators (SEAs) in the sagittal plane and a DC motor in the lateral plane. Details can be found in Hubicki et al. (2016). In this work we focus on planar motion around a boom. ATRIAS is a highly dynamic system due to its point feet, with static stability only in double stance on the boom.

4.2 Planar 7-link Biped

Figure 3: 7-link biped

The second robot used in our experiments is a 7-link biped (Figure 3). It has a trunk and segmented legs with ankles. Unlike ATRIAS, this is a serial robot with actuators at the hips, knees and ankles. The inertial properties of its links are similar to those of an average human (Winter and Yack, 1987). The simulator code is modified from Thatte and Geyer (2016). The 7-link model is a canonical simulator for testing bipedal walking algorithms, for example in Song and Geyer (2015), and is a simplified two-dimensional stand-in for a large range of humanoid robots, like Atlas (Feng et al., 2015). The purpose of using this simulator is to study the generalizability of our proposed approaches to systems different from ATRIAS.

4.3 Feedback Based Reactive Stepping Policy

We design a parametrized controller that regulates the CoM height, torso angle and swing leg by commanding desired ground reaction forces and a swing-foot landing location.

Here, $F_x$ is the desired horizontal ground reaction force (GRF); $K_p^{\theta}$ and $K_d^{\theta}$ are the proportional and derivative feedback gains on the torso angle $\theta$ and velocity $\dot{\theta}$. $F_z$ is the desired vertical GRF; $K_p^{z}$ and $K_d^{z}$ are the proportional and derivative gains on the CoM height $z$ and vertical velocity $\dot{z}$. $z_{des}$ and $\theta_{des}$ are the desired CoM height and torso lean. $x_f$ is the desired foot landing location for the end of swing; $\dot{x}$ is the horizontal CoM velocity, and $k$ is the feedback gain that regulates $\dot{x}$ towards the target velocity $\dot{x}_{tgt}$. $C$ is a constant, $d$ is the distance between the stance leg and the CoM, and $T_{sw}$ is the swing time.

The desired GRFs are sent to an ATRIAS inverse-dynamics model that generates the desired motor torques. Details can be found in Rai et al. (2017).

This controller assumes no double-stance, and the swing leg takes off as soon as stance is detected. This leads to a highly dynamic gait, as the contact polygon for ATRIAS in single stance is a point, posing a challenging optimization problem.
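To illustrate the structure of this policy (not the exact equations or gains used on ATRIAS, which are given in Rai et al. (2017)), the following sketch computes the commanded quantities from PD feedback laws and a Raibert-style foot placement heuristic; `state` and `p` are assumed to be simple containers for the quantities defined above:

```python
def reactive_stepping_policy(state, p):
    """Illustrative sketch: desired GRFs and swing-foot landing location."""
    # Desired horizontal GRF from PD feedback on the torso angle.
    F_x = p.kp_t * (p.theta_des - state.theta) - p.kd_t * state.theta_dot
    # Desired vertical GRF from PD feedback on the CoM height.
    F_z = p.kp_z * (p.z_des - state.z) - p.kd_z * state.z_dot
    # Swing-foot landing location: offset from the CoM plus velocity feedback
    # regulating the horizontal CoM velocity towards its target.
    x_foot = (p.C * state.d + 0.5 * state.x_dot * p.T_swing
              + p.k * (state.x_dot - p.x_dot_target))
    return F_x, F_z, x_foot  # GRFs are sent to the inverse-dynamics model
```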

To investigate the effects of increasing dimensionality on our optimization, we construct two controllers with different numbers of free parameters:

  • 5-dimensional controller: optimizing 5 parameters of the high-level policy (the remaining parameters are hand-tuned)

  • 9-dimensional controller: optimizing all 9 parameters of the high-level policy

4.4 16-dimensional Neuromuscular Controller

We use neuromuscular model policies, as introduced in Geyer and Herr (2010), as our controller for the 7-link planar human-like biped model. These policies use approximate models of muscle dynamics and human-inspired reflex pathways to generate joint torques, producing gaits that are similar to human walking.

Each leg is actuated by 7 muscles, which together produce torques about the hip, knee and ankle. Most of the muscle reflexes are length or force feedbacks on the muscle state, aimed at generating a compliant leg, keeping the knee from hyperextending, and maintaining torso orientation in stance. The swing control has three main components – target leg angle, leg clearance, and hip control due to reaction torques. Together with the stance control, this leads to a total of 16 controller parameters, described in detail in Antonova et al. (2016).

Though originally developed to explain human neural control pathways, this controller has recently been applied to prosthetics and bipeds, for example in Thatte and Geyer (2016) and Van der Noot et al. (2015). As demonstrated in Song and Geyer (2015), this controller is capable of generating a variety of locomotion behaviours for a humanoid model – walking on rough ground, turning, running, and walking upstairs – making it a very versatile controller. It is model-free, in contrast to the model-based reactive stepping controller of Section 4.3.

4.5 50-dimensional Virtual Neuromuscular Controller

Another model-free controller we use on ATRIAS is a virtual neuromuscular controller (VNMC), a modified version of Batts et al. (2015). The VNMC maps a neuromuscular model, similar to the one described in Section 4.4, to the ATRIAS robot's topology and emulates it to generate desired motor torques. The robot's states are mapped to the states of a virtual 5-link bipedal robot. This virtual robot is then used by the VNMC to generate knee and hip torques, which are mapped back to robot torques in swing and stance. We adapt the VNMC by removing some biological components while preserving its basic functionalities. First, the new VNMC directly uses joint angle and angular velocity data instead of estimating them from physiological sensory data, such as muscle fiber length and velocity. Second, most of the neural transmission delays are removed, except those utilized by the controller. The final version of the controller consists of 50 parameters, including low-level control parameters, such as feedback gains, as well as high-level parameters, such as desired step length and desired torso lean. When optimized using CMA-ES, it can control ATRIAS to walk on rough terrain with height changes of 20 cm in planar simulation (Batts et al., 2015).

4.6 Simulators with Different Levels of Fidelity

To compare the performance of different methods for transferring information from simulation to hardware, we create a series of increasingly approximate simulators. These simulators emulate increasing mismatch between simulation and hardware and its effect on the information transfer. In this setting, the high-fidelity ATRIAS simulator (Martin et al., 2015), which was used in all the previous simulation experiments, becomes the simulated "hardware". Next, we apply to the original simulator dynamics approximations that are commonly used to decrease fidelity and increase simulation speed. For example, the complex dynamics of the harmonic drives are approximated as a torque multiplication, and the boom is removed from the simulation, leading to a two-dimensional simulator. These approximate simulators become the simulated "simulators". As the approximations in these simulators increase, we expect the performance of methods that utilize simulation for optimization on hardware to deteriorate.

The details of the approximate simulators are described in the two paragraphs below:

1. Simulation with simplified gear dynamics: The ATRIAS robot has geared DC motors attached to leaf springs on the legs. Their high gear ratio of 50 is achieved through a harmonic drive. In the original simulator, this drive is modelled using gear constraints in the MATLAB SimScape Multibody simulation environment. These constraints require significant computation time, as the constraint equations have to be solved at every time instant, but they lead to a very good match between the robot and the simulation. We replace this model with a commonly used approximation for geared systems – multiplying the rotor torque by the gear ratio. This reduces the simulation time to about a third of the original simulator, but leads to an approximate gear dynamics model.

2. Simulation with no boom and simplified gear dynamics: The ATRIAS robot walks around a boom in our hardware experiments. The boom exerts lateral torques on the robot, with vertical and horizontal force components that need to be considered in a realistic simulation of the robot. In our second approximation, we remove the boom from the original simulator and constrain the motion of the robot to a 2-dimensional plane, making a truly two-dimensional simulation of ATRIAS. This is a common approximation for two-dimensional robots. Since this approximation has both simplified gear dynamics and no boom, it is further from the original simulator than the first approximation.

The advantage of such an arrangement is that we can extensively test the effect of un-modelled and wrongly modelled dynamics on information transfer between simulation and hardware. Even in our high-fidelity original simulator, there are several un-modelled components of the actual hardware, for example the non-rigidness of the robot parts, misaligned motors and relative play between joints. In our experiments, we find that the 50-dimensional VNMC is a sensitive controller, with little hope of directly transferring from simulation to hardware. Anticipating this, we can now test several methods of compensating for this mismatch using our increasingly approximate simulators. In the future, we would like to take these approximations further and study when there is useful information even in over-simplified simulations of legged systems.

5 Experiments

We will now present our experiments on optimizing controllers that are 5, 9, 16 and 50 dimensional. We split our experiments into three categories: hardware experiments on the ATRIAS robot, simulation experiments on the 7-link biped and experiments using simulators with different levels of fidelity. We demonstrate that our proposed approach is able to generalize to different controllers and robot structures and is also robust to simulation inaccuracies.

5.1 Hardware Experiments on the ATRIAS Robot

In this section we describe experiments conducted on the ATRIAS robot, described in Section 4.1. These experiments were conducted around a boom. The cost function used in our experiments is a slight modification of the cost used in Song and Geyer (2015):

(6)

where $d$ is the distance covered before falling, $\bar{v}$ is the average speed per step, and $v_{tgt}$ is the target velocity profile, which can be variable. This cost function heavily penalizes falls and encourages walking controllers to track the target velocity.
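As an illustration of this flavor of cost (the exact constants and functional form are those of Song and Geyer (2015) as modified in Rai et al. (2017); the values below are placeholders):

```python
def walking_cost(d_walked, step_speeds, target_speeds,
                 course_length=None, fall_penalty=100.0):
    """Hypothetical cost: penalize falling early, track a velocity profile."""
    cost = 0.0
    if course_length is not None and d_walked < course_length:
        # Falling earlier in the course is penalized more heavily.
        cost += fall_penalty * (1.0 - d_walked / course_length)
    # Per-step velocity-tracking error against a (possibly variable) profile.
    errors = [abs(v - vt) for v, vt in zip(step_speeds, target_speeds)]
    cost += sum(errors) / max(len(errors), 1)
    return cost
```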

Figure 4: ATRIAS during BO with DoG-based kernel (video: https://youtu.be/hpXNFREgaRA)

We do multiple runs of each algorithm on the robot; each run typically consists of 10 experiments (trials) on the robot. Hence, 3 runs for one algorithm involve 30 robot trials. Each robot trial is designed to last up to a minute, and the robot needs to be reset to its "home" position between trials. While this might not appear very time-consuming, parts of the robot often malfunction between trials and need to be repaired, especially when sampling unstable controllers. We try our best to keep the robot performance consistent across the different algorithms being compared.

We present two sets of hardware experiments in the following sections. First, we present experiments with the DoG-based kernel on the 5- and 9-dimensional controllers introduced in Section 4.3. In these experiments, from our work in Rai et al. (2017), the inertial measurement unit (IMU) of the robot had been damaged, and we replaced it with external boom sensors. While these sensors provide all the required information, they have lower resolution than the IMU, leading to noisier readings and larger time delays. This makes these experiments especially challenging. In our second set of experiments, we optimize a 9-dimensional controller using a Neural Network based kernel on hardware. By this new set of experiments the IMU had been repaired, leading to better state estimation on the robot. As a result, the behavior of the robot changed slightly, and we re-ran the baseline experiments for this setting. The baseline performed slightly better than in the first set of experiments, as can be expected from the improved sensing on the robot.

5.1.1 Experiments with a 5-dimensional controller and DoG-based kernel

In our first set of experiments on the robot, we investigated optimizing the 5-dimensional controller from Section 4.3. For these experiments we picked a challenging variable target speed profile. The controller was stopped after the robot took 50 steps.

To evaluate the difficulty of this setting, we sampled 100 random points on hardware; 10% of these were found to walk. In contrast, in simulation the success rate of random sampling was 27.5%. This indicates that the simulation setting was easier, which could be detrimental to algorithms that rely heavily on simulation, because a large portion of controllers that walk in simulation fall on hardware. Nevertheless, using the DoG-based kernel offered significant improvements over a standard SE kernel, as shown in Figure 5(a).

We conducted 5 runs each of BO with the DoG-based kernel and BO with SE – 10 trials per run for the DoG-based kernel, and 20 for the SE kernel. In total, this led to 150 experiments on the robot (excluding the 100 random samples). BO with the DoG-based kernel found walking points in 100% of runs within 3 trials. In comparison, BO with SE found walking points within 10 trials in 60% of runs, and within 20 trials in 80% of runs (Figure 5(a)).

(a) BO for the 5D controller. BO with SE finds walking points in 4/5 runs within 20 trials. BO with the DoG-based kernel finds walking points in 5/5 runs within 3 trials.
(b) BO for the 9D controller. BO with SE doesn't find walking points in 3 runs. BO with the DoG-based kernel finds walking points in 3/3 runs within 5 trials.
Figure 5: BO for the 5D and 9D controllers on ATRIAS robot hardware. Plots show the mean best cost so far; the shaded region shows one standard deviation. Re-created from Rai et al. (2017).

5.1.2 Experiments with a 9-dimensional controller and DoG-based kernel

Our next set of experiments optimized the 9-dimensional controller from Section 4.3. First, we sampled 100 random points for the variable speed profile described above, but this led to no walking points. To ensure a reasonable baseline, we simplified the speed profile for this setting to a constant target speed for 30 steps. We evaluated 100 random points on hardware, and 3 walked for this easier speed profile. In comparison, the success rate in simulation was 8% for the tougher variable-speed profile, implying an even greater mismatch between hardware and simulation than for the 5-dimensional controller. Part of the mismatch can be attributed to the lack of an IMU in these experiments. In the 9-dimensional controller, the desired CoM height as well as the feedback gains on this height are optimized. Without the IMU, our system does not have a good estimate of the vertical height of the CoM, except through kinematics, leading to poor control authority. However, the IMU on ATRIAS is a very expensive fiber-optic IMU that is not commonly used on humanoid robots; most robots use simpler state estimation methods. So this is a common setting for humanoid robots, even if it presents a challenge for the optimization methods.

We conducted 3 runs each of BO with the DoG-based kernel and BO with SE, with 10 trials per run. In total, this led to 60 experiments on the hardware (excluding the random sampling). BO with the DoG-based kernel found walking points within 5 trials in 3/3 runs. BO with SE did not find any walking points within 10 trials in all 3 runs. These results are shown in Figure 5(b).

Based on these results, we concluded that BO with DoG-based kernel was indeed able to extract useful information from simulation and speed up learning on hardware, even when there was mismatch between simulation and hardware.

5.1.3 Experiments with a 9-dimensional controller and NN-based kernel

Figure 6: BO for 9D controller on ATRIAS robot hardware.

In the next set of experiments, we evaluated performance of the NN-based kernel described in Section 3.1.2. We optimize the 9-dimensional controller from Section 4.3.

The target of the hardware experiments was to walk for 30 steps at the constant target speed, as in Section 5.1.2. However, by the time of these experiments the IMU had been re-installed on the robot.

We observed that the SE baseline performance improved, even though it started from the same random samples, hyper-parameter settings and speed profile. We attribute this change to better estimation and control of the vertical CoM height.

Figure 6 shows a comparison of BO with the NN-based and SE kernels. We conducted 5 runs of both algorithms with 10 trials in each run, leading to a total of 100 robot trials. BO with the NN-based kernel found walking points in all 5 runs within 6 trials, while BO with the SE kernel only found walking points in 2 of 5 runs within 10 trials. Hence, even without explicit hand-designed domain knowledge, like that in the DoG-based kernel, the NN-based kernel is able to extract useful information from simulation and successfully guide hardware experiments.

5.2 Simulation Experiments on a 7-link Biped

In this section, we discuss simulation experiments with a 16-dimensional Neuromuscular controller (Section 4.4) on a 7-link biped model. These experiments, first reported in  Antonova et al. (2017), also demonstrate the cost-agnostic nature of our approach by optimizing two very different costs.

Figure 7 shows BO with the DoG-based kernel, the NN-based kernel and the SE kernel for two different costs from prior literature. The first cost promotes walking further and longer before falling, while penalizing deviations from the target speed (Antonova et al., 2016):

(7)

where $t$ is the number of seconds walked, $x_{final}$ is the final CoM position, $v$ is the speed and $v_{des}$ is the desired walking speed. The second cost function is similar to the cost used in Section 5.1. It penalizes falls explicitly, and encourages walking at the desired speed with a lower cost of transport:

(8)

where $d$ is the distance covered before falling, $\bar{v}$ is the average walking speed, $v_{tgt}$ is the target velocity, and $c_{tr}$ captures the cost of transport. The changed constant accounts for the longer simulation time.

Figure 7(a) shows that the NN-based and DoG-based kernels offer a significant improvement in sample efficiency over BO with the SE kernel on the smooth cost, with more than 90% of runs achieving walking after 25 trials; BO with the SE kernel takes 90 trials to reach a 90% success rate. Figure 7(b) shows that similar performance by the two proposed approaches is observed on the non-smooth cost. With the NN-based kernel, 70% of the runs find walking solutions after 100 trials, similar to the DoG-based kernel. However, optimizing the non-smooth cost is very challenging for BO with the SE kernel: a walking solution is found in only 1 out of 50 runs after 100 trials.

(a) Using smooth cost from Equation 7.
(b) Using non-smooth cost from Equation 8.
Figure 7: BO for the Neuromuscular controller. trajNN and DoG kernels were constructed with undisturbed model on flat ground. BO is run with mass/inertia disturbances on different rough ground profiles. Plots show means over 50 runs, 95% CIs. Re-created from Antonova et al. (2017).

We attribute the difference in the SE kernel's performance on the two costs to the nature of the costs. If a point walks some distance, the cost in Equation 7 falls off sharply, while the cost in Equation 8 decreases only gradually with distance. The sharper drop in the first cost causes BO to exploit around points that walk some distance, eventually finding points that walk forever. BO with the second cost continues to explore, as the signal is too weak. However, the success of both the NN-based and DoG-based kernels on both costs shows that the same kernel can indeed be used for optimizing multiple costs robustly, without any further tuning. This is important because the cost often has to be changed based on the outcome of the optimization, and it would be impractical to recreate the kernel for each new cost.

5.3 Experiments with Increasing Simulation-Hardware Mismatch

Figure 8: BO for 50d controller on original ATRIAS simulation (Rai et al., 2017).

In this section, we describe our experiments with increasing simulation-hardware mismatch and its effect on approaches that use information from simulation during hardware optimization. The quality of information transfer between simulation and hardware depends not only on the mismatch between the two, but also on the controller used. For a robust controller, small dynamics errors do not cause a significant deterioration in performance, while for a sensitive controller they can be much more detrimental. There is still an advantage to studying such sensitive controllers, as they can be much more energy-efficient and versatile. In our experiments, the 50-dimensional VNMC described in Section 4.5 is capable of generating very efficient gaits but is sensitive to modelling errors. Figure 8 shows the performance of the DoG-based and adjusted DoG-based kernels on the original high-fidelity simulator. While both methods find walking points within 20 trials, the adjusted DoG-based kernel performs better. For this controller, there is mismatch even between short and long simulations, which the adjusted kernel compensates for.

In the rest of this section, we provide experimental analysis of settings with increasing simulated mismatch and their effect on optimization of the 50-dimensional VNMC. We compare several approaches that improve sample-efficiency of BO and investigate if the improvement they offer is robust to mismatch between the simulated setting used for constructing kernel/prior and the setting on which BO is run.

(a) Informed kernels generated using the simulator with simplified gear dynamics.
(b) Informed kernels generated using the simulator with simplified gear dynamics and without the boom model.
Figure 9: BO is run on the original simulator. Informed kernels perform well despite significant mismatch when generated using the simulator with simplified gear dynamics (left). In the case of severe mismatch, when the boom model is also removed, informed kernels still improve over the SE baseline (right). Plots show the mean best cost over 50 runs for each algorithm, with 95% CIs.

First, we examine the performance of our proposed approaches with informed kernels: the DoG-based kernel, the adjusted DoG-based kernel, and the NN-based (trajNN) kernel. Figure 9(a) shows the case when the informed kernels are generated using the simulator with simplified gear dynamics, while BO is run on the original simulator. After 50 trials, all runs with informed kernels find walking solutions, while a much smaller fraction of the SE runs do.

Next, Figure 9(b) shows the performance of the three informed kernels when they are constructed using the simulator with simplified gear dynamics and without the boom. In this case the mismatch with the original simulator is larger than before, and we see the advantage of the adjustment to the DoG-based kernel: the adjusted DoG-based kernel finds walking points in all runs after 35 trials. The DoG-based kernel also achieves this, but only after 50 trials. The NN-based kernel finds walking points in a majority of the runs after 50 trials. The performance of SE stays the same, as it uses no prior information from any simulator.

This illustrates that while the original DoG-based kernel can recover from slight simulation-hardware mismatch, the adjusted DoG-based kernel is needed if one expects higher mismatch. The NN-based kernel also seems to recover from the mismatch, but might benefit from an adjusted version. We leave this to future work.

5.3.1 Comparisons of Prior-based and Kernel-based Approaches

We classify approaches that use simulation information in hardware optimization as prior-based or kernel-based. Prior-based approaches use costs from the simulation in the prior of the GP used in BO. This can help BO substantially if the costs transfer between simulation and hardware, and the cost function is fixed. However, in the presence of large mismatch, points that perform well in simulation might fail on hardware. A prior-based method can be biased towards sampling points that look promising in simulation, resulting in even poorer performance than methods with no prior. Kernel-based approaches incorporate the information from simulation into the kernel of the GP. These can be less sample-efficient than prior-based methods, but they are less likely to be biased towards unpromising regions in the presence of mismatch. They also easily generalize to multiple costs, so there is no additional computation if the cost is changed. This is important because many of these approaches can take several days of computation to generate the informed kernel. For example, Cully et al. (2015) report taking 2 weeks on a 16-core computer to generate their map.

It is also possible to combine prior-based and kernel-based methods, as in Cully et al. (2015). We classify these as 'prior-based' methods, since in our experiments the prior outweighs the kernel effects in such cases. In our comparison with Cully et al. (2015), we implement versions with and without the prior points. We do not add a cost prior to BO with the DoG-based kernel, as this would tie us to a particular cost and to high-fidelity simulators. Since both of these can be major obstacles in real robot experiments, we refrain from doing so.

Figure 10(a) shows the performance when using the simulation cost as the prior during BO. BO with a cost prior created using the original version of the simulator illustrates the best-case scenario, as optimization is merely a look-up here. When the simulator with simplified gear dynamics is used for constructing the prior, we observe significant improvements over the uninformed BO baseline. However, when the prior is constructed from the setting with simplified gear dynamics and no boom, the approach performs slightly worse than uninformed BO. This shows that while an informed prior can be very helpful when created from a simulator close to hardware, it can hurt performance if the simulator is significantly different from hardware.

(a) BO with cost prior: a straightforward approach that is useful for low-to-medium mismatch, but gives no improvement if the mismatch is severe.
(b) Performance of the IT&E algorithm (our implementation of Cully et al. (2015), adapted to bipedal locomotion).
Figure 10: BO using prior-based approaches. Mean over 50 runs for each algorithm, 95% CIs.

Next, we discuss experiments with our implementation of the Intelligent Trial and Error (IT&E) algorithm from Cully et al. (2015). This algorithm combines adding a cost prior from simulated evaluations with adding simulation information into the kernel. IT&E defines a behavior metric and tabulates the best-performing points from simulation according to their behavior score. The behavior metric used in our experiments is the duty factor of each leg, which ranges from 0 to 1.0. We discretize each leg's duty factor into 21 cells of 0.05 increments, leading to a 21×21 grid. We collect the 5 highest-performing controllers for each cell in the behavior grid, creating a 21×21×5 map. Next, we generate 50 random 21×21 maps, selecting 1 of the 5 best controllers per grid cell; a sketch of this map construction is given after this paragraph. Care was taken to ensure that all 5 controllers had comparable costs in the simulator used for creating the map. The cost of each selected controller is added to the prior, and BO is performed in the behavior space, as in Cully et al. (2015).
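The data layout and function signature below are ours, but the discretization follows the description above:

```python
def build_behavior_map(controllers, costs, duty_factors, n_bins=21, top_k=5):
    """Bin controllers by per-leg duty factor (0.05 increments) and keep the
    top_k lowest-cost controllers per (left, right) cell of the 21x21 grid."""
    best = {}  # (i, j) -> list of (cost, controller), lowest cost first
    for x, c, (df_left, df_right) in zip(controllers, costs, duty_factors):
        cell = (min(int(df_left / 0.05), n_bins - 1),
                min(int(df_right / 0.05), n_bins - 1))
        entries = best.setdefault(cell, [])
        entries.append((c, x))
        entries.sort(key=lambda e: e[0])
        del entries[top_k:]   # keep only the top_k best controllers
    return best
```

A random map then keeps one of the top_k entries per cell; its costs seed the GP prior, and BO searches over the cells of the grid.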

Figure 10(b) shows BO with IT&E maps constructed using different versions of the simulator. IT&E constructed using the simplified gear dynamics simulator is slightly less sample-efficient than the straightforward 'cost prior' approach. When constructed with the simulator with no boom, IT&E is able to improve over uninformed BO. However, in this case it only finds walking points in 77% of the runs within 50 trials, as some of the generated maps contained no controllers that could walk on the 'hardware'. This is a shortcoming of the IT&E algorithm: it eliminates a very large part of the search space, and if the pre-selected space does not contain a walking point, no walking controllers can be sampled by BO. This problem could possibly be avoided by using a finer grid, or a different behavior metric. However, tuning such hyper-parameters can be expensive in both computation and hardware experiment time.

(a) BO using our implementation of IT&E (from Cully et al. (2015)) without the cost prior.
(b) BO using the adjusted DoG-based kernel constructed from simulators with various levels of mismatch.
Figure 11: BO using kernel-based approaches. Mean over 50 runs for each algorithm, 95% CIs.

To separate the effects of using simulation information in the prior mean vs. the kernel, we evaluated a kernel-only version of the IT&E algorithm. Figure 11(a) shows these results: the cost prior is crucial for the success of IT&E, and performance deteriorates without it. Hence, it is not practical to use IT&E on a cost different from the one it was generated for.

In contrast, Figure 9 showed that BO with the adjusted DoG-based kernel is able to handle both moderate and severe mismatch with kernel-only information; these results are collected in Figure 11(b).

In summary, we created two simulators with increasing modelling approximations and studied the effect of using them to aid optimization on the original simulator. We found that while methods that use the cost in the prior of BO can be very sample-efficient under low mismatch, their performance worsens as mismatch increases. IT&E, introduced in Cully et al. (2015), uses simulation information in both the prior mean and the kernel, and is very sample-efficient in cases of low mismatch. Even with high mismatch it performs better than purely prior-based BO, but it does not find walking controllers reliably. In comparison, the adjusted DoG-based kernel performed well in all tested scenarios. This shows that the adjusted DoG-based kernel can reliably improve the sample-efficiency of BO even when the mismatch between simulation and hardware is high. We would like to continue working in this direction and explore the usefulness of even simpler simulators in the future.

6 Conclusion

In this paper, we presented and analyzed in detail our work from Antonova et al. (2016), Antonova et al. (2017) and Rai et al. (2017). These works introduce domain-specific feature transforms that can be used to optimize locomotion controllers on hardware efficiently. The feature transforms project the original controller space into a space where BO can discover promising regions quickly. We described a transform for bipedal locomotion designed with knowledge of human walking, and a neural-network-based transform that uses more general information from simulated trajectories. Our experiments demonstrate success at optimizing controllers on the ATRIAS robot, and further simulation-based experiments indicate potential for other bipedal robots. For optimizing sensitive high-dimensional controllers, we proposed an approach that adjusts the simulation-based kernel using data seen on hardware. To study its performance, and to compare our approach to other methods, we created a series of increasingly approximate simulators. Our experiments show that while several methods from prior literature perform well with low simulation-hardware mismatch (sometimes even better than our proposed approach), they suffer when this mismatch increases. In such cases, our proposed kernels with hardware adjustment yield reliable performance across different costs, simulators and robots.

This research was supported in part by National Science Foundation grant IIS-1563807, the Max-Planck-Society, and the Knut and Alice Wallenberg Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.

Appendix A: Implementation Details

In this Appendix, we provide a summary of data collection and implementation details. Our implementation of BO was based on the framework of Gardner et al. (2014). We used the Expected Improvement (EI) acquisition function (Mockus et al., 1978). We also experimented with the Upper Confidence Bound (UCB) (Srinivas et al., 2010), but found that performance was not sensitive to the choice of acquisition function. Hyper-parameters for BO were initialized to default values: 0 for the mean offset, 1.0 for the kernel length scales and signal variance, and 0.1 for the noise parameter. Hyper-parameters were optimized using the marginal likelihood (Shahriari et al. (2016), Section V-A). For all algorithms, we optimized hyper-parameters only after a low-cost controller was found (to save compute resources and avoid premature hyper-parameter optimization).
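As a concrete illustration of this setup, here is a minimal BO loop with the EI acquisition and the default hyper-parameters listed above. This is a hedged sketch using scikit-learn's GP implementation rather than the Gardner et al. (2014) framework; `evaluate_controller` is a hypothetical stand-in for a hardware or simulation rollout:

```python
# Minimal sketch of the BO loop described above; not the authors' exact code.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expected_improvement(X_cand, gp, y_best):
    """EI for minimization: E[max(y_best - f(x), 0)]."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)                 # guard against zero std
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_minimize(evaluate_controller, bounds, n_init=3, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    dim = bounds.shape[0]
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, dim))
    y = np.array([evaluate_controller(x) for x in X])
    # Signal variance 1.0, length scales 1.0, noise 0.1 (entering as the
    # variance alpha). scikit-learn refits these by maximizing the marginal
    # likelihood on every fit, whereas the paper defers this optimization
    # until a low-cost controller has been found.
    kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(dim))
    gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1**2, normalize_y=True)
    for _ in range(n_iters):
        gp.fit(X, y)
        # Maximize EI over dense random candidates (adequate in low dimension).
        X_cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(5000, dim))
        x_next = X_cand[np.argmax(expected_improvement(X_cand, gp, y.min()))]
        X = np.vstack([X, x_next])
        y = np.append(y, evaluate_controller(x_next))
    best = np.argmin(y)
    return X[best], y[best]
```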

Kernel type | Controller dim | # Sim points | Sim duration | Kernel dim | Features in kernel
DoG-based | 5 | 20K | 3.5s | 1 | DoG score
DoG-based | 9 | 100K | 5s | 1 | DoG score
DoG-based | 50 | 200K | 5s | 1 | DoG score
Trajectory-based (NN) | 9 | 100K | 5s | 4 | subset of the scalar trajectory features
Trajectory-based (NN) | 16 | 100K | 5s | 8 | all eight scalar trajectory features
Trajectory-based (NN) | 50 | 200K | 5s | 13 | scalar features plus per-second mean-CoM vectors
Table 2: Simulation Data Collection Details. The DoG score was described in Section 3.1.1. The scalar trajectory features are: the time walked in simulation before falling; the x and y positions of the Center of Mass (CoM) at the end of the short simulation; the torso angle and the torso velocity; the CoM speed (horizontal and vertical components); and the squared sum of applied torques. The 13-dimensional kernel additionally uses vectors of mean CoM measurements taken every second of the simulation.
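For concreteness, here is a sketch of how the scalar trajectory features could be extracted from one short rollout; the attribute names on `traj` are our assumptions about a simulator interface, not the paper's API:

```python
# Hedged sketch of extracting the Table 2 trajectory features from a rollout;
# attribute names (t_fall, com_xy, ...) are illustrative assumptions.
import numpy as np

def scalar_trajectory_features(traj):
    """The eight scalar features: walk time, final CoM position, torso state,
    CoM velocity components, and total squared torque."""
    x_end, y_end = traj.com_xy[-1]        # CoM position at end of rollout
    vx, vy = traj.com_vel[-1]             # horizontal / vertical CoM speed
    return np.array([
        traj.t_fall,                      # time walked before falling
        x_end, y_end,
        traj.torso_angle[-1],             # torso angle
        traj.torso_vel[-1],               # torso velocity
        vx, vy,
        np.sum(traj.torques ** 2),        # squared sum of applied torques
    ])
```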

Our choice of the SE kernel as the baseline for BO was due to its widespread use. The SE kernel belongs to the broader class of Matérn kernels. In some applications, carefully choosing the parameters of a Matérn kernel can improve the performance of BO. However, Matérn kernels are stationary: k(x, x′) depends only on the difference x − x′ for all inputs x, x′. Our approach seeks to build kernels that remove this limitation in a manner informed by simulation.
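To make the contrast concrete, here is a minimal sketch (our own illustration) of the non-stationary construction: an SE kernel evaluated on simulation-derived features phi(x) instead of raw controller parameters. The transform `phi` is a placeholder for any simulation-informed feature map, e.g. a DoG score or a neural-network embedding of simulated trajectories:

```python
# Sketch of a simulation-informed, non-stationary kernel: a standard SE kernel
# applied to transformed features phi(x) rather than raw parameters x.
import numpy as np

def se_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Stationary squared-exponential kernel: depends only on A[i] - B[j]."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)

def transformed_kernel(Xa, Xb, phi, length_scale=1.0, signal_var=1.0):
    """k(x, x') = k_SE(phi(x), phi(x')). Since phi is nonlinear in x, this
    kernel is no longer stationary in x: similarity is judged in the
    simulation-informed feature space, not by the raw difference x - x'."""
    Fa = np.array([np.atleast_1d(phi(x)) for x in Xa])
    Fb = np.array([np.atleast_1d(phi(x)) for x in Xb])
    return se_kernel(Fa, Fb, length_scale, signal_var)
```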

To create the cost prior for the experiments in Section 5.3, we collected 50,000 evaluations of 30s trials over a range of controller parameters. We then conducted 50 runs, using random subsets of 35,000 evaluations to construct the prior. These numbers were chosen so that this approach used a similar amount of computation as our kernel-based approaches. To accommodate a GP prior with a large number of points, we used the sparse GP construction provided by Rasmussen and Nickisch (2010).
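The following sketch illustrates, under our own simplifying assumptions, how such a pre-computed cost prior could be assembled and used; we substitute a nearest-neighbor lookup for the sparse GP of Rasmussen and Nickisch (2010), so this approximates the setup rather than reproducing it:

```python
# Sketch of assembling a cost prior from pre-computed simulation evaluations;
# a nearest-neighbor lookup stands in for the sparse GP. Names illustrative.
import numpy as np
from scipy.interpolate import NearestNDInterpolator

def make_cost_prior(X_sim, y_sim, n_subset=35_000, seed=0):
    """Prior-mean function built from a random subset of simulated costs."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_sim), size=min(n_subset, len(X_sim)), replace=False)
    return NearestNDInterpolator(X_sim[idx], y_sim[idx])

# Usage: model hardware residuals with a zero-mean GP and add the prior back
# inside the acquisition function, e.g.:
#   prior = make_cost_prior(X_sim, y_sim)
#   residuals = y_hardware - prior(X_hardware)
```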


References

  • Antonova et al. (2016) Rika Antonova, Akshara Rai, and Christopher G Atkeson. Sample efficient optimization for learning controllers for bipedal locomotion. In Humanoid Robots (Humanoids), 2016 IEEE-RAS 16th International Conference on, pages 22–28. IEEE, 2016.
  • Antonova et al. (2017) Rika Antonova, Akshara Rai, and Christopher G Atkeson. Deep kernels for optimizing locomotion controllers. In Conference on Robot Learning, pages 47–56, 2017.
  • Batts et al. (2015) Zachary Batts, Seungmoon Song, and Hartmut Geyer. Toward a virtual neuromuscular control for robust walking in bipedal robots. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 6318–6323. IEEE, 2015.
  • Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv preprint arXiv:1012.2599, 2010.
  • Calandra et al. (2016a) Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold gaussian processes for regression. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 3338–3345. IEEE, 2016a.
  • Calandra et al. (2016b) Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian Optimization for Learning Gaits Under Uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1-2):5–23, 2016b.
  • Cully et al. (2015) Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.
  • Deisenroth and Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
  • Englert and Toussaint (2016) Peter Englert and Marc Toussaint. Combined Optimization and Reinforcement Learning for Manipulation Skills. In Robotics: Science and Systems, 2016.
  • Feng et al. (2015) Siyuan Feng, Eric Whitman, X Xinjilefu, and Christopher G Atkeson. Optimization-based full body control for the darpa robotics challenge. Journal of Field Robotics, 32(2):293–312, 2015.
  • Gardner et al. (2014) Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John Cunningham. Bayesian Optimization with Inequality Constraints. In ICML, pages 937–945, 2014.
  • Geyer and Herr (2010) Hartmut Geyer and Hugh Herr. A Muscle-reflex Model that Encodes Principles of Legged Mechanics Produces Human Walking Dynamics and Muscle Activities. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 18(3):263–273, 2010.
  • Hansen (2006) Nikolaus Hansen. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation, pages 75–102. Springer, 2006.
  • Hubicki et al. (2016) Christian Hubicki, Jesse Grimes, Mikhail Jones, Daniel Renjewski, Alexander Spröwitz, Andy Abate, and Jonathan Hurst. Atrias: Design and validation of a tether-free 3d-capable spring-mass bipedal robot. The International Journal of Robotics Research, 35(12):1497–1521, 2016.
  • Inman et al. (1953) Verne T Inman, Howard D Eberhart, et al. The major determinants in normal and pathological gait. JBJS, 35(3):543–558, 1953.
  • Kuindersma et al. (2016) Scott Kuindersma, Robin Deits, Maurice Fallon, Andrés Valenzuela, Hongkai Dai, Frank Permenter, Twan Koolen, Pat Marion, and Russ Tedrake. Optimization-based locomotion planning, estimation, and control design for the atlas humanoid robot. Autonomous Robots, 40(3):429–455, 2016.
  • Lizotte et al. (2007) Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.
  • Marco et al. (2017) Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with bayesian optimization. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1557–1563. IEEE, 2017.
  • Martin et al. (2015) William C Martin, Albert Wu, and Hartmut Geyer. Robust spring mass model running for a physical bipedal robot. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 6307–6312. IEEE, 2015.
  • Mockus et al. (1978) J Mockus, V Tiesis, and A Zilinskas. The Application of Bayesian Methods for Seeking the Extremum. In Toward Global Optimization, volume 2, 1978.
  • Peng et al. (2016) Xue Bin Peng, Glen Berseth, and Michiel van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (TOG), 35(4):81, 2016.
  • Rai et al. (2017) Akshara Rai, Rika Antonova, Seungmoon Song, William Martin, Hartmut Geyer, and Christopher G Atkeson. Bayesian Optimization Using Domain Knowledge on the ATRIAS Biped. 2017.
  • Rasmussen and Nickisch (2010) Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (gpml) toolbox. J. Mach. Learn. Res., 11:3011–3015, December 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1953029.
  • Shahriari et al. (2016) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  • Song and Geyer (2015) Seungmoon Song and Hartmut Geyer. A Neural Circuitry that Emphasizes Spinal Feedback Generates Diverse Behaviours of Human Locomotion. The Journal of Physiology, 593(16):3493–3511, 2015.
  • Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 1015–1022. Omnipress, 2010.
  • Tesch et al. (2011) Matthew Tesch, Jeff Schneider, and Howie Choset. Using response surfaces and expected improvement to optimize snake robot gait parameters. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 1069–1074. IEEE, 2011.
  • Thatte and Geyer (2016) Nitish Thatte and Hartmut Geyer. Toward Balance Recovery with Leg Prostheses Using Neuromuscular Model Control. IEEE Transactions on Biomedical Engineering, 63(5):904–913, 2016.
  • Theodorou et al. (2010) Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.
  • Van der Noot et al. (2015) Nicolas Van der Noot, Luca Colasanto, Allan Barrea, Jesse van den Kieboom, Renaud Ronsse, and Auke J Ijspeert. Experimental validation of a bio-inspired controller for dynamic walking with a humanoid robot. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 393–400. IEEE, 2015.
  • Wilson et al. (2014) Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning. The Journal of Machine Learning Research, 15(1):253–282, 2014.
  • Wilson et al. (2016) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
  • Winter and Yack (1987) DA Winter and HJ Yack. EMG profiles during normal human walking: stride-to-stride and inter-subject variability. Electroencephalography and Clinical Neurophysiology, 67(5):402–411, 1987.