1 Introduction
Machine learning can provide methods for learning controllers for robotic tasks. Yet, even with recent advances in this field, automatically designing and learning controllers for robots, especially bipedal robots, remains difficult. Some of the core challenges of learning for control scenarios can be summarized as follows: It is expensive to do learning experiments that require a large number of samples with physical robots. Specifically, legged robots are not robust to falls and failures, and are time-consuming to work with and repair. Furthermore, commonly used cost functions for optimizing controllers are noisy to evaluate, non-convex and non-differentiable. In order to find learning approaches that can be used on real robots, it is thus important to keep these considerations in mind.
Deep reinforcement learning approaches can deal with noise, discontinuities and non-convexity of the objective, but they are not data-efficient. These approaches could take on the order of a million samples to learn locomotion controllers (Peng et al., 2016), which would be infeasible on a real robot. For example, on the ATRIAS robot, samples would take days, in theory. But practically, the robot needs to be “reset” between trials and repaired in case of damage. Using structured expert-designed policies can help minimize damage to the robot and make the search for successful controllers feasible. However, the problem is black-box, non-convex and discontinuous. This eliminates approaches like PI² (Theodorou et al., 2010), which make assumptions about the dynamics of the system, and PILCO (Deisenroth and Rasmussen, 2011), which assumes a continuous cost landscape. Evolutionary approaches like CMA-ES (Hansen, 2006) can still be prohibitively expensive, needing thousands of samples (Song and Geyer, 2015).
In comparison, Bayesian optimization (BO) is a sample-efficient optimization technique that is robust to non-convexity, noise and even discontinuities. It has recently been used in a range of robotics problems, such as Calandra et al. (2016b), Marco et al. (2017), Cully et al. (2015). However, the sample-efficiency of conventional BO degrades in high dimensions, even for dimensionalities commonly encountered in locomotion controllers. Because of this, hardware-only optimization becomes intractable for flexible controllers and complex robots. One way of addressing this issue is to utilize simulation to optimize controller parameters. However, simulation-only optimization is vulnerable to learning policies that exploit the simulation and perform well in simulation but poorly on the actual robot. This motivates the development of approaches that can incorporate simulation-based information into the learning method, then optimize with few samples on hardware.
Towards this goal, our previous work in Antonova, Rai, and Atkeson (2016), Antonova et al. (2017), Rai et al. (2017) presents a framework that uses information from high-fidelity simulators to learn sample-efficiently on hardware. We use simulation to build informed feature transforms that are used to measure similarity during BO. Thus, the similarity between controller parameters, during optimization on hardware, is informed by how they perform in simulation. With this, it becomes possible to quickly infer which regions of the input space are likely to perform well on hardware. This method has been tested on the ATRIAS biped robot (Figure 1) and shows considerable improvement in sample-efficiency over traditional BO.
In this article, we present in-depth explanations and empirical analysis of our previous work. Furthermore, for the first time, we present a procedure for systematically evaluating robustness of such approaches to simulation-hardware mismatch. We extend our previous work incorporating mismatch estimates (Rai et al., 2017) to this setting. We also conduct extensive comparisons with competitive baselines from related work, such as Cully et al. (2015).
The rest of this article is organized as follows: Section 2 provides background for BO, then gives an overview of related work on optimizing locomotion controllers. Section 3.1 describes the idea of incorporating simulation-based transforms into BO; Section 3.2 explains how we handle simulation-hardware mismatch. Sections 4.1-4.5 describe the robot and controllers we use for our experiments; Section 4.6 explains the motivation and construction of simulators with various levels of fidelity. Section 5.1 gives a summary of hardware experiments conducted on the ATRIAS robot. Section 5.2 shows generalization to a different robot model in simulation. Section 5.3 shows empirical analysis of the impact of simulator fidelity on the performance of the proposed algorithms and alternative approaches.
2 Background and Related Work
This section gives a brief overview of Bayesian optimization (BO), the state-of-the-art research on optimizing locomotion controllers, and utilizing simulation information in BO.
2.1 Background on Bayesian Optimization
Bayesian optimization (BO) is a framework for online, black-box, gradient-free global search (Shahriari et al. (2016) and Brochu et al. (2010) provide a comprehensive introduction). The problem of optimizing controllers can be interpreted as finding controller parameters $\mathbf{x}^*$ that optimize some cost function $f(\mathbf{x})$. Here $\mathbf{x}$ contains parameters of a pre-structured policy; the cost $f(\mathbf{x})$ is a function of the trajectory induced by controller parameters $\mathbf{x}$. For brevity, we will refer to ‘controller parameters $\mathbf{x}$’ as ‘controller $\mathbf{x}$’. We use BO to find controller $\mathbf{x}^*$, such that: $f(\mathbf{x}^*) = \min_{\mathbf{x}} f(\mathbf{x})$.
BO is initialized with a prior that expresses the a priori uncertainty over the value of $f(\mathbf{x})$ for each $\mathbf{x}$ in the domain. Then, at each step of optimization, based on data seen so far, BO optimizes an auxiliary function (called the acquisition function) to select the next $\mathbf{x}$ to evaluate. The acquisition function balances exploration vs exploitation. It selects points for which the posterior estimate of the objective $f$ is promising, taking into account both mean and covariance of the posterior. A widely used representation for the cost function $f$ is a Gaussian process (GP):
$$f(\mathbf{x}) \sim \mathcal{GP}\big(\mu(\mathbf{x}),\, k(\mathbf{x}_i, \mathbf{x}_j)\big)$$
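The prior mean function $\mu(\cdot)$ is set to $0$ when no domain-specific knowledge is provided, or can be informative in the presence of information. The kernel function $k(\cdot,\cdot)$ encodes similarity between inputs. If $k(\mathbf{x}_i, \mathbf{x}_j)$ is large for inputs $\mathbf{x}_i, \mathbf{x}_j$, then $f(\mathbf{x}_i)$ strongly influences $f(\mathbf{x}_j)$. One of the most widely used kernel functions is the Squared Exponential (SE):
$$k_{SE}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2} (\mathbf{x}_i - \mathbf{x}_j)^T \operatorname{diag}(\boldsymbol{\ell})^{-2} (\mathbf{x}_i - \mathbf{x}_j)\Big),$$
where $\sigma_k^2$ and $\boldsymbol{\ell}$ are the signal variance and a vector of length scales, respectively. $\sigma_k^2, \boldsymbol{\ell}$ are referred to as ‘hyperparameters’ in the literature.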
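As a minimal sketch of the loop described in this section, the following pure-Python code combines a Squared Exponential kernel, the posterior of a zero-mean GP, and a lower-confidence-bound acquisition over a finite candidate set. The acquisition rule, the helper names (`se_kernel`, `gp_posterior`, `bayes_opt`), the noise value and the toy cost are illustrative assumptions, not the implementation used in our experiments.

```python
import math

def se_kernel(xi, xj, signal_var=1.0, length_scales=None):
    """Squared Exponential kernel with one length scale per input dimension."""
    ls = length_scales or [1.0] * len(xi)
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(xi, xj, ls))
    return signal_var * math.exp(-0.5 * sq)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def gp_posterior(X, y, x_new, kernel, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    K = [[kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k_star = [kernel(a, x_new) for a in X]
    mean = sum(k * w for k, w in zip(k_star, solve(K, y)))
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)

def bayes_opt(cost, candidates, kernel, init, n_iter=8, kappa=2.0):
    """Minimize `cost` over `candidates`, starting from the points in `init`."""
    X = list(init)
    y = [cost(x) for x in X]
    for _ in range(n_iter):
        def lcb(x):
            # Low mean (exploitation) or high variance (exploration) is attractive.
            m, v = gp_posterior(X, y, x, kernel)
            return m - kappa * math.sqrt(v)
        x_next = min((c for c in candidates if c not in X), key=lcb)
        X.append(x_next)
        y.append(cost(x_next))
    best = min(range(len(y)), key=y.__getitem__)
    return X[best], y[best]

# Toy usage: a 1-D "controller" whose (assumed) cost is lowest at x = 0.3.
candidates = [(i / 20.0,) for i in range(21)]
kernel = lambda a, b: se_kernel(a, b, 1.0, [0.2])
toy_cost = lambda x: (x[0] - 0.3) ** 2
best_x, best_y = bayes_opt(toy_cost, candidates, kernel, init=[(0.0,), (1.0,)])
```

Restricting the acquisition to a finite candidate set keeps the sketch short; a practical implementation would typically optimize the acquisition function over the continuous domain and fit the hyperparameters to data.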
2.2 Optimizing Locomotion Controllers
Parametric locomotion controllers can be represented as $\mathbf{u} = \pi_{\mathbf{x}}(\mathbf{s})$, where $\pi$ is a policy structure that depends on parameters $\mathbf{x}$. For example, $\pi$ can be parameterized by feedback gains on the center of mass (CoM), reference joint trajectories, etc. Vector $\mathbf{s}$ is the state of the robot, such as joint angles and velocities; it is used in closed-loop controllers. Vector $\mathbf{u}$ represents the desired control action, for example: torques, angular velocities or positions for each joint on the robot. The sequence of control actions yields a sequence of state transitions, which form the overall ‘trajectory’ $\boldsymbol{\xi}(\mathbf{x})$. This trajectory is used in the cost function to judge the quality of the controller $\mathbf{x}$. In our work, we use structured controllers designed by experts. State-of-the-art research on walking robots featuring such controllers includes Feng et al. (2015), Kuindersma et al. (2016). The overall optimization then includes manually tuning the parameters $\mathbf{x}$. An alternative to manual tuning is to use evolutionary approaches, like CMA-ES, as in Song and Geyer (2015). However, these require a large number of samples and can usually be conducted only in simulation. Optimization in simulation can produce controllers that perform well in simulation, but not on hardware. In comparison, BO is a sample-efficient technique which has become popular for direct optimization on hardware. Recent successes include manipulation (Englert and Toussaint, 2016) and locomotion (Calandra et al., 2016b).
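To make the notation concrete, the sketch below treats a controller as a parametrized policy applied in closed loop, records the induced trajectory, and scores it with a cost. The 1-D point-mass system, the PD policy and the cost are hypothetical stand-ins chosen for brevity; they are not the robot, controllers or cost used in this article.

```python
def pd_policy(params, state):
    """Policy u = pi_x(s): a PD law with parameters x = (kp, kd)."""
    kp, kd = params
    pos, vel = state
    return -kp * pos - kd * vel

def rollout(params, state=(1.0, 0.0), dt=0.01, steps=500):
    """Apply the policy in closed loop and record the induced trajectory."""
    pos, vel = state
    trajectory = [(pos, vel)]
    for _ in range(steps):
        u = pd_policy(params, (pos, vel))
        vel += u * dt          # unit mass: acceleration equals the commanded force
        pos += vel * dt
        trajectory.append((pos, vel))
    return trajectory

def cost(params):
    """Cost f(x): judge controller x by the trajectory it induces
    (here, the average distance from the origin)."""
    traj = rollout(params)
    return sum(abs(p) for p, _ in traj) / len(traj)
```

In this toy setting, `cost((10.0, 5.0))` is lower than `cost((0.0, 0.0))`, since the second controller never moves the mass towards the origin.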
BO for locomotion has been previously explored for several types of mobile robots. These include: snake robots (Tesch et al., 2011), AIBO quadrupeds (Lizotte et al., 2007), and hexapods (Cully et al., 2015). Tesch et al. (2011) optimize a 3-dimensional controller for a snake robot in 10-40 trials (for speeds up to ). Lizotte et al. (2007) use BO to optimize gait parameters for an AIBO robot in 100-150 trials. Cully et al. (2015) learn 36 controller parameters for a hexapod. Even with hardware damage, they can obtain successful controllers for speeds up to in 12-15 trials.
Hexapods, quadrupeds and snakes spend a large portion of their gaits being statically stable. In contrast, bipedal walking can be highly dynamic, especially for point-feet robots like ATRIAS. ATRIAS can only be statically stable in double-stance and, like most bipeds, spends a significant portion of its gait being “unstable”, or dynamically stable. In our experiments on hardware, ATRIAS goes up to speeds of . All of this leads to a challenging optimization setting and a discontinuous cost function landscape. Calandra et al. (2016b) use BO for optimizing gaits of a dynamic biped on a boom, needing 30-40 samples for finding walking gaits for a 4-dimensional controller. While this is promising, optimizing a higher-dimensional controller needed for complex robots would be even more challenging. If a significant number of samples lead to unstable gaits and falls, they could damage the robot. Hence, it is important to develop methods that can learn complex controllers fast, without damaging the robot.
2.3 Incorporating Simulation Information into Bayesian Optimization
The idea of using simulation to speed up BO on hardware has been explored before. Marco et al. (2017) use simulation as a second source of noisy data. Information from simulation can also be added as a prior to the GP used in BO, such as in Cully et al. (2015). While these methods can be successful, one needs to carefully tune the influence of simulation points over hardware points, especially when simulation is significantly different from hardware.
Recently, several approaches proposed incorporating Neural Networks (NNs) into the Gaussian process (GP) kernels (Wilson et al., 2016; Calandra et al., 2016a). The strength of these approaches is that they can jointly update the GP and the NN. Calandra et al. (2016a) demonstrated how this added flexibility can handle discontinuities in the cost function landscape. However, these approaches do not directly address the problem of incorporating a large amount of data from simulation in hardware BO experiments.
Wilson et al. (2014) explored enhancing the GP kernel with trajectories. Their Behavior Based Kernel (BBK) computes an estimate of a symmetric variant of the KL divergence between trajectories induced by two controllers, and uses this as a distance metric in the kernel. However, obtaining this estimate requires samples for each controller $\mathbf{x}_i, \mathbf{x}_j$ whenever $k(\mathbf{x}_i, \mathbf{x}_j)$ is needed. This can be impractical, as it involves an evaluation of every controller considered. The authors suggest combining BBK with a model-based approach to overcome this issue by learning a model. But building a reliable model might be an expensive process in itself.
Cully et al. (2015) utilize simulation by defining a behavior metric and collecting best performing points in simulation. This behavior metric then guides BO to quickly find controllers on hardware, and can even compensate for damage to the robot. The search on hardware is conducted in behavior space, and limited to pre-selected “successful” points from simulation. This helps make their search faster and safer on hardware. However, if an optimal point was not pre-selected, BO cannot sample it during optimization.
In our work we develop two alternative strategies that utilize trajectories from simulation to build feature transforms that can be incorporated in the GP kernel used for BO. Our approaches incorporate trajectory/behavior information, but ensure that $k(\mathbf{x}_i, \mathbf{x}_j)$ is also computed efficiently during BO. They bias the search towards regions that look promising, but are able to ‘recover’ and search in other parts of the space if simulation-hardware mismatch becomes apparent.
3 Proposed Approach: Bayesian Optimization with Informed Kernels
In this section, we offer an in-depth explanation of approaches from our work in Antonova, Rai, and Atkeson (2016), Antonova et al. (2017), and Rai et al. (2017). This work proposes incorporating domain knowledge into BO with the help of simulation. We evaluate locomotion controllers in simulation, and collect their induced trajectories, which are then used to build an informed transform. This can be achieved by using a domain-specific feature transform (Section 3.1.1) or by learning to reconstruct short trajectory summaries (Section 3.1.2). This feature transform is used to construct an informed distance metric for BO, and helps BO discover promising regions faster. An overview can be found in Figure 2. In Section 3.2 we discuss how to incorporate simulation-hardware mismatch into the transform, ensuring that BO can benefit from inaccurate simulations as well.
3.1 Constructing Flexible Kernels using Simulation-based Transforms
High-dimensional problems with discontinuous cost functions are very common with legged robots, where slight changes to some parameters can make the robot unstable. Both of these factors can adversely affect BO’s performance, but informed feature transforms can help BO sample high-performing controllers even in such scenarios.
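In this section, we demonstrate how to construct such transforms $\phi(\mathbf{x})$ utilizing simulations for a given controller $\mathbf{x}$. We then use $\phi(\mathbf{x})$ to create an informed kernel $k_\phi(\mathbf{x}_i, \mathbf{x}_j)$ for BO on hardware:
$$k_\phi(\mathbf{x}_i, \mathbf{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2} \big(\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\big)^T \operatorname{diag}(\boldsymbol{\ell})^{-2} \big(\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\big)\Big) \qquad (1)$$
Note that the functional form above is the same as that of the Squared Exponential kernel, if considered from the point of view of the transformed space, with $\phi(\mathbf{x})$ as input. While this kernel is stationary as a function of $\phi$, it is non-stationary in $\mathbf{x}$. $\phi$ can bring closer related parts of the space that would be otherwise far apart in the original space. BO can then operate in the space of $\phi(\mathbf{x})$, which is ‘informed’ by simulation.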
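The effect of measuring distance in a transformed space, rather than in the space of controller parameters, can be sketched as follows. The transform `toy_phi` is a hypothetical stand-in for a simulation-based transform: it simply labels a 1-D controller as walking or falling. The function names are assumptions made for illustration.

```python
import math

def informed_kernel(xi, xj, phi, signal_var=1.0, length_scales=None):
    """SE kernel evaluated on transformed inputs phi(x) instead of x."""
    fi, fj = phi(xi), phi(xj)
    ls = length_scales or [1.0] * len(fi)
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(fi, fj, ls))
    return signal_var * math.exp(-0.5 * sq)

def toy_phi(x):
    """Hypothetical transform: controllers in either of two parameter bands
    walk in simulation (feature 1.0); all others fall (feature 0.0)."""
    walks = 0.1 <= x[0] <= 0.2 or 0.8 <= x[0] <= 0.9
    return [1.0 if walks else 0.0]

a, b, c = (0.15,), (0.85,), (0.25,)
k_far_but_similar = informed_kernel(a, b, toy_phi)   # both walk in simulation
k_near_but_different = informed_kernel(a, c, toy_phi)  # one walks, one falls
```

Here `a` and `b` are far apart in parameter space yet maximally similar under the informed kernel, while the nearby `c` is treated as dissimilar. In practice the transform would be precomputed from simulation so that kernel evaluations stay cheap.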
3.1.1 The Determinants of Gait Transform
We propose a feature transform for bipedal locomotion derived from physiological features of human walking called Determinants of Gaits (DoG) (Inman et al., 1953). $\phi_{DoG}$ was originally developed for human-like robots and controllers (Antonova, Rai, and Atkeson, 2016), and then generalized to be applicable to a wider range of bipedal locomotion controllers and robot morphologies (Rai et al., 2017). It is based on the features in Table 1.
$M_1$ (Swing leg retraction) – If the maximum ground clearance of the swing foot is more than a threshold, $M_1 = 1$ (0 otherwise); ensures swing leg retraction.
$M_2$ (Center of mass height) – If CoM height stays about the same at the start and end of a step, $M_2 = 1$ (0 otherwise); checks that the robot is not falling.
$M_3$ (Trunk lean) – If the average trunk lean is the same at the start and end of a step, $M_3 = 1$ (0 otherwise); ensures that the trunk is not changing orientation.
$M_4$ (Average walking speed) – Average forward speed of a controller per step, $M_4 = v_{avg}$; helps distinguish controllers that perform similarly on $M_{1-3}$.
$\phi_{DoG}(\mathbf{x})$ combines features $M_{1-4}$ per step and scales them by the normalized simulation time to obtain the DoG score of controller $\mathbf{x}$:
$$score_{DoG} = \frac{t_{sim}}{t_{max}} \cdot \sum_{s=1}^{N} \sum_{i=1}^{4} M_{is} \qquad (2)$$
Here $M_{is}$ is feature $M_i$ evaluated on step $s$, $N$ is the number of steps taken in simulation, $t_{sim}$ is the time at which simulation terminated (possibly due to a fall), and $t_{max}$ is the total time allotted for simulation. Since a larger number of steps leads to a higher DoG score, some controllers that chatter (step very fast before falling) could get misleadingly high scores; we scale the scores by $t_{sim}/t_{max}$ to prevent that. $\phi_{DoG}(\mathbf{x})$ for controller parameters $\mathbf{x}$ now becomes the computed $score_{DoG}$ of the resulting trajectories when $\mathbf{x}$ is simulated. $\phi_{DoG}$ essentially aids in (soft) clustering of controllers based on their behaviour in simulation. High scoring controllers are more likely to walk than low scoring ones. Since $M_{1-4}$ are based on intuitive gait features, they are more likely to transfer between simulation and hardware, as compared to direct cost. The thresholds in $M_{1-3}$ are chosen according to values observed in nominal human walking from Winter and Yack (1987).
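A minimal sketch of the DoG score computation is given below, assuming that the per-step features are summed over all steps and then scaled by the normalized simulation time, as described above. The threshold values, the dictionary keys and the function name `dog_score` are hypothetical; the thresholds used in our work come from human walking data and are not reproduced here.

```python
def dog_score(steps, t_sim, t_max,
              clearance_min=0.05, com_tol=0.05, lean_tol=0.1):
    """DoG score of one simulated rollout.

    `steps` is a list of per-step records; all threshold defaults are
    placeholder values chosen for illustration only.
    """
    total = 0.0
    for s in steps:
        # M1: swing leg retraction (enough ground clearance of the swing foot)
        m1 = 1.0 if s["max_clearance"] > clearance_min else 0.0
        # M2: CoM height about the same at the start and end of the step
        m2 = 1.0 if abs(s["com_end"] - s["com_start"]) < com_tol else 0.0
        # M3: trunk lean about the same at the start and end of the step
        m3 = 1.0 if abs(s["lean_end"] - s["lean_start"]) < lean_tol else 0.0
        # M4: average forward speed during the step
        m4 = s["avg_speed"]
        total += m1 + m2 + m3 + m4
    # Scaling by normalized simulation time penalizes early falls.
    return (t_sim / t_max) * total

good_step = {"max_clearance": 0.1, "com_start": 0.9, "com_end": 0.9,
             "lean_start": 0.0, "lean_end": 0.0, "avg_speed": 0.5}
bad_step = {"max_clearance": 0.01, "com_start": 0.9, "com_end": 0.7,
            "lean_start": 0.0, "lean_end": 0.3, "avg_speed": 0.5}

walker = dog_score([good_step] * 2, t_sim=30.0, t_max=30.0)
chatterer = dog_score([bad_step] * 6, t_sim=3.0, t_max=30.0)
```

The chattering rollout takes three times as many steps as the walking one, yet receives a much lower score because it falls early and fails the binary features.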
3.1.2 Learning Feature Transform with a Neural Network
While domain-specific feature transforms can be extremely useful and robust, they might be difficult to generate when a domain expert is not present. This motivates directly learning such feature transforms from trajectory data. In this section we describe our approach to train neural networks to reconstruct trajectory summaries (Antonova et al., 2017) that achieves this goal of minimizing expert involvement.
Trajectory summaries are a convenient choice for reparametrizing controllers into an easy-to-optimize space. For example, controllers that fall would automatically be far away from controllers that walk. If these trajectories can be extracted from a high-fidelity simulator, we would not have to evaluate each controller on hardware. However, conventional implementations of BO evaluate the kernel function for a large number of points per iteration, requiring thousands of simulations each iteration. To avoid this, a Neural Network (NN) can be trained to reconstruct trajectory summaries from a large set of pre-sampled data points. The NN provides flexible interpolation, as well as fast evaluation (controller → trajectory summary). Furthermore, trajectories are agnostic to the specific cost used during BO. Thus the data collection can be done offline, and there is no need to re-run simulations in case the definition of the cost is modified.
We use the term ‘trajectory’ in a general sense, referring to several sensory states recorded during a simulation. To create trajectory summaries for the case of locomotion, we include measurements of: walking time (time before falling), energy used during walking, position of the center of mass and angle of the torso. With this, we construct a dataset for the NN to fit: a Sobol grid of controller parameters (, million) along with trajectory summaries from simulation. The NN is trained using mean squared loss:
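NN input: $\mathbf{x}$ – a set of controller parameters
NN output: $\phi_{trajNN}(\mathbf{x})$ – reconstructed trajectory summary
NN loss: $\frac{1}{2} \sum_{i=1}^{N} \|\boldsymbol{\xi}_{\mathbf{x}_i} - \phi_{trajNN}(\mathbf{x}_i)\|^2$, where $\boldsymbol{\xi}_{\mathbf{x}_i}$ is the trajectory summary obtained from simulating $\mathbf{x}_i$
The outputs are then used in the kernel for BO:
$$k_{trajNN}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2} \mathbf{t}_{ij}^T \operatorname{diag}(\boldsymbol{\ell})^{-2} \mathbf{t}_{ij}\Big), \quad \mathbf{t}_{ij} = \phi_{trajNN}(\mathbf{x}_i) - \phi_{trajNN}(\mathbf{x}_j) \qquad (3)$$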
We did not carefully select the sensory traces used in the trajectory summaries. Instead, we used the most obvious states, aiming for an approach that could be easily adapted to other domains. To apply this approach to a new setting, one could simply include information that is customarily tracked, or used in costs. For example, for a manipulator, the coordinates of the end effector(s) could be recorded at relevant points. Force-torque measurements could be included, if available.
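The offline procedure of this section (collect controller and trajectory-summary pairs in simulation, then fit a network that reconstructs the summaries) can be sketched in pure Python as below. The simulator stand-in `simulate_summary`, the tiny network size, the regular grid (used here in place of a Sobol grid) and the training settings are all assumptions made to keep the example self-contained; they do not reflect the architecture or data used in our experiments.

```python
import math
import random

def simulate_summary(x):
    """Hypothetical stand-in for a simulator: returns a 2-D trajectory
    summary (normalized walking time, final CoM position) for a 1-D controller."""
    walk_time = 1.0 if abs(x[0] - 0.5) < 0.2 else 0.2
    return [walk_time, walk_time * x[0]]

class TinyMLP:
    """One hidden tanh layer, trained by SGD on the squared reconstruction error."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def hidden(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(self.W1, self.b1)]

    def __call__(self, x):
        h = self.hidden(x)
        return [sum(w * v for w, v in zip(row, h)) + b
                for row, b in zip(self.W2, self.b2)]

    def sgd_step(self, x, target, lr=0.05):
        h = self.hidden(x)
        out = [sum(w * v for w, v in zip(row, h)) + b
               for row, b in zip(self.W2, self.b2)]
        d_out = [o - t for o, t in zip(out, target)]   # gradient w.r.t. outputs
        # Backpropagate through the output layer and tanh before updating W2.
        d_h = [(1.0 - h[j] ** 2) * sum(self.W2[k][j] * d_out[k] for k in range(len(d_out)))
               for j in range(len(h))]
        for k, d in enumerate(d_out):
            self.b2[k] -= lr * d
            for j in range(len(h)):
                self.W2[k][j] -= lr * d * h[j]
        for j, d in enumerate(d_h):
            self.b1[j] -= lr * d
            for i in range(len(x)):
                self.W1[j][i] -= lr * d * x[i]

def total_loss(net, data):
    """Half the summed squared reconstruction error over the dataset."""
    return 0.5 * sum(sum((o - t) ** 2 for o, t in zip(net(x), y)) for x, y in data)

# Offline data collection, done once and independent of the cost used in BO.
data = [((i / 40.0,), simulate_summary((i / 40.0,))) for i in range(41)]
phi_traj_nn = TinyMLP(n_in=1, n_hidden=8, n_out=2)
loss_before = total_loss(phi_traj_nn, data)
for _ in range(200):
    for x, y in data:
        phi_traj_nn.sgd_step(x, y)
loss_after = total_loss(phi_traj_nn, data)
```

After training, `phi_traj_nn(x)` can be evaluated for any controller without running the simulator, which is what makes it usable inside a kernel that BO evaluates many times per iteration.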
3.2 Kernel Adjustment for Handling Simulation-Hardware Mismatch
Approaches described in previous sections could provide improvement for BO when a high-fidelity simulator is used in kernel construction. In Rai et al. (2017) we presented promising results of experimental evaluation on hardware. However, it is unclear how the performance changes when simulation-hardware mismatch becomes apparent.
In Rai et al. (2017), we also proposed a way to incorporate information about simulation-hardware mismatch into the kernel from the samples evaluated so far. We augment the simulation-based kernel to include this additional information about mismatch, by expanding the original kernel by an extra dimension that contains the predicted mismatch for each controller $\mathbf{x}$.
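A separate Gaussian process is used to model the mismatch experienced on hardware, starting from an initial prior mismatch of 0: $g(\mathbf{x}) \sim \mathcal{GP}(0, k_{SE}(\mathbf{x}_i, \mathbf{x}_j))$. For any evaluated controller $\mathbf{x}_i$, we can compute the difference between $\phi(\mathbf{x}_i)$ in simulation and on hardware: $d_{\mathbf{x}_i} = \phi_{sim}(\mathbf{x}_i) - \phi_{hw}(\mathbf{x}_i)$. We can now use mismatch data $\{d_{\mathbf{x}_i} \mid i = 1, \dots, n\}$ to construct a model for the expected mismatch: $g(\mathbf{x})$. In the case of using a GP-based model, $g(\mathbf{x})$ would denote the posterior mean. With this, we can predict simulation-hardware mismatch in the original space of controller parameters for unevaluated controllers. Combining this with kernel $k_\phi$ we obtain an adjusted kernel:
$$\phi_{adj}(\mathbf{x}) = \begin{bmatrix} \phi_{sim}(\mathbf{x}) \\ g(\mathbf{x}) \end{bmatrix}, \qquad \mathbf{t}^{adj}_{ij} = \phi_{adj}(\mathbf{x}_i) - \phi_{adj}(\mathbf{x}_j),$$
$$k_{\phi_{adj}}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2} (\mathbf{t}^{adj}_{ij})^T \operatorname{diag}\big([\boldsymbol{\ell}_1; \boldsymbol{\ell}_2]\big)^{-2}\, \mathbf{t}^{adj}_{ij}\Big) \qquad (4)$$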
The similarity between points $\mathbf{x}_i, \mathbf{x}_j$ is now dictated by two components: representation in $\phi$ space and expected mismatch. This construction has an intuitive explanation: Suppose controller $\mathbf{x}_i$ results in walking when simulated, but falls during hardware evaluation. The adjusted kernel $k_{\phi_{adj}}$ would register a high mismatch for $\mathbf{x}_i$. Controllers would be deemed similar to $\mathbf{x}_i$ only if they have both similar simulation-based $\phi(\cdot)$ and similar estimated mismatch. Points with similar simulation-based $\phi(\cdot)$ and low predicted mismatch would still be ‘far away’ from the failed $\mathbf{x}_i$. This would help BO sample points that still have high chances of walking in simulation, but are in a different region of the original parameter space. In the next section, we present a more mathematically rigorous interpretation for $k_{\phi_{adj}}$.
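The intuition above can be illustrated with a small sketch in which the simulation-based feature is augmented by one extra dimension holding the predicted mismatch. For brevity, the mismatch model here is a kernel-weighted average that reverts to 0 away from data; it is a simple stand-in for the GP posterior mean, not the model used in our experiments. The scalar mismatch, the function names, bandwidth and length scales are likewise assumptions.

```python
import math

def predicted_mismatch(x, observed, bandwidth=0.1):
    """Stand-in for the posterior mean of the mismatch model: a kernel-weighted
    average of observed mismatches that decays to the prior (0) far from data."""
    weights = [math.exp(-0.5 * sum(((a - b) / bandwidth) ** 2 for a, b in zip(x, xi)))
               for xi, _ in observed]
    weighted = sum(w * d for w, (_, d) in zip(weights, observed))
    return weighted / max(sum(weights), 1.0)

def adjusted_kernel(xi, xj, phi_sim, observed,
                    ell_phi=1.0, ell_g=0.5, signal_var=1.0):
    """SE kernel on the augmented feature [phi_sim(x), predicted mismatch]."""
    t_phi = [a - b for a, b in zip(phi_sim(xi), phi_sim(xj))]
    t_g = predicted_mismatch(xi, observed) - predicted_mismatch(xj, observed)
    sq = sum((t / ell_phi) ** 2 for t in t_phi) + (t_g / ell_g) ** 2
    return signal_var * math.exp(-0.5 * sq)

# Every controller walks in this toy simulation, so phi_sim alone cannot
# separate them. On "hardware", x = 0.2 fell (mismatch 1), x = 0.8 walked.
phi_sim = lambda x: [1.0]
observed = [((0.2,), 1.0), ((0.8,), 0.0)]

k_failed_vs_ok = adjusted_kernel((0.2,), (0.8,), phi_sim, observed)
k_failed_vs_neighbor = adjusted_kernel((0.2,), (0.22,), phi_sim, observed)
```

The failed controller remains similar to its neighbor in parameter space, which inherits a high predicted mismatch, but becomes dissimilar to the controller that matched simulation, even though both have identical simulation-based features.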
3.2.1 Interpretation of Kernel with Mismatch Modeling
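Let us consider a controller $\mathbf{x}_i$ evaluated on hardware. The difference between the simulation-based and hardware-based feature transform for $\mathbf{x}_i$ is $d_{\mathbf{x}_i} = \phi_{sim}(\mathbf{x}_i) - \phi_{hw}(\mathbf{x}_i)$. The ‘true’ hardware feature transform for $\mathbf{x}_i$ is $\phi_{hw}(\mathbf{x}_i) = \phi_{sim}(\mathbf{x}_i) - d_{\mathbf{x}_i}$. After $n$ evaluations on hardware, $\{d_{\mathbf{x}_i} \mid i = 1, \dots, n\}$ can serve as data for modeling simulation-hardware mismatch. In principle, any data-efficient model $g(\cdot)$ can be used, such as a GP (a multi-output GP in case the mismatch is multi-dimensional). With this, we can obtain an adjusted transform: $\hat{\phi}_{hw}(\mathbf{x}) = \phi_{sim}(\mathbf{x}) - g(\mathbf{x})$, where $g(\cdot)$ is the output of the model fitted using $\{d_{\mathbf{x}_i} \mid i = 1, \dots, n\}$.
Suppose $\mathbf{x}_{new}$ has not been evaluated on hardware. We can use $\hat{\phi}_{hw}(\mathbf{x}_{new}) = \phi_{sim}(\mathbf{x}_{new}) - g(\mathbf{x}_{new})$ as the adjusted estimate of what the output of $\phi$ should be, taking into account what we have learned so far about simulation-hardware mismatch.
Let’s construct kernel $k_{v2}(\mathbf{x}_i, \mathbf{x}_j)$ that uses these hardware-adjusted estimates directly:
$$k_{v2}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2} \big(\hat{\phi}_{hw}(\mathbf{x}_i) - \hat{\phi}_{hw}(\mathbf{x}_j)\big)^T \operatorname{diag}(\boldsymbol{\ell})^{-2} \big(\hat{\phi}_{hw}(\mathbf{x}_i) - \hat{\phi}_{hw}(\mathbf{x}_j)\big)\Big)$$
Using $\hat{\phi}_{hw}(\mathbf{x}) = \phi_{sim}(\mathbf{x}) - g(\mathbf{x})$, we have:
$$\hat{\phi}_{hw}(\mathbf{x}_i) - \hat{\phi}_{hw}(\mathbf{x}_j) = \underbrace{\big(\phi_{sim}(\mathbf{x}_i) - \phi_{sim}(\mathbf{x}_j)\big)}_{\mathbf{v}^{\phi}_{ij}} - \underbrace{\big(g(\mathbf{x}_i) - g(\mathbf{x}_j)\big)}_{\mathbf{v}^{g}_{ij}}$$
If we now observe that $(\mathbf{v}^{\phi}_{ij} - \mathbf{v}^{g}_{ij})^T D\, (\mathbf{v}^{\phi}_{ij} - \mathbf{v}^{g}_{ij}) = (\mathbf{v}^{\phi}_{ij})^T D\, \mathbf{v}^{\phi}_{ij} - 2 (\mathbf{v}^{\phi}_{ij})^T D\, \mathbf{v}^{g}_{ij} + (\mathbf{v}^{g}_{ij})^T D\, \mathbf{v}^{g}_{ij}$ with $D = \operatorname{diag}(\boldsymbol{\ell})^{-2}$, we get:
$$k_{v2}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_k^2 \exp\Big((\mathbf{v}^{\phi}_{ij})^T D\, \mathbf{v}^{g}_{ij}\Big) \exp\Big(-\tfrac{1}{2} (\mathbf{v}^{\phi}_{ij})^T D\, \mathbf{v}^{\phi}_{ij}\Big) \exp\Big(-\tfrac{1}{2} (\mathbf{v}^{g}_{ij})^T D\, \mathbf{v}^{g}_{ij}\Big)$$
Compare this to the adjusted kernel $k_{\phi_{adj}}$ from Equation 4:
$$k_{\phi_{adj}}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_k^2 \exp\Big(-\tfrac{1}{2} (\mathbf{v}^{\phi}_{ij})^T \operatorname{diag}(\boldsymbol{\ell}_1)^{-2}\, \mathbf{v}^{\phi}_{ij}\Big) \exp\Big(-\tfrac{1}{2} (\mathbf{v}^{g}_{ij})^T \operatorname{diag}(\boldsymbol{\ell}_2)^{-2}\, \mathbf{v}^{g}_{ij}\Big) \qquad (5)$$
Now we see that $k_{\phi_{adj}}$ and $k_{v2}$ have a similar form. Hyperparameters $\boldsymbol{\ell}_1, \boldsymbol{\ell}_2$ provide flexibility in $k_{\phi_{adj}}$ as compared to having only vector $\boldsymbol{\ell}$ in $k_{v2}$. They can be adjusted manually or with Automatic Relevance Determination. For $k_{v2}$, the role of signal variance is captured by $\sigma_k^2 \exp\big((\mathbf{v}^{\phi}_{ij})^T D\, \mathbf{v}^{g}_{ij}\big)$. This makes the kernel non-stationary in the transformed $\phi$ space. Since $k_{\phi_{adj}}$ is already non-stationary in $\mathbf{x}$, it is unclear whether non-stationarity of $k_{v2}$ in the transformed space has any advantages.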
The above discussion shows that the adjusted kernel $k_{\phi_{adj}}$ proposed in Rai et al. (2017) is motivated both intuitively and mathematically. It aims to use a transform that accounts for the hardware mismatch, without adding extra non-stationarity in the transformed space.
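The interpretation in this section rests on an algebraic identity: an SE kernel evaluated on hardware-adjusted features (simulation-based feature minus predicted mismatch) factors into an SE term on the simulation-based features, an SE term on the predicted mismatch, and a cross term that acts like an input-dependent signal variance. The sketch below checks this numerically for arbitrary vectors. The function names and test values are illustrative assumptions.

```python
import math

def quad(u, v, ell):
    """Bilinear form u^T diag(ell)^{-2} v."""
    return sum(a * b / (l * l) for a, b, l in zip(u, v, ell))

def k_direct(phi_i, phi_j, g_i, g_j, ell, signal_var=1.0):
    """SE kernel on adjusted features (phi - g), computed directly."""
    diff = [(pi - gi) - (pj - gj)
            for pi, pj, gi, gj in zip(phi_i, phi_j, g_i, g_j)]
    return signal_var * math.exp(-0.5 * quad(diff, diff, ell))

def k_factored(phi_i, phi_j, g_i, g_j, ell, signal_var=1.0):
    """The same kernel written as a product of three exponentials."""
    v_phi = [a - b for a, b in zip(phi_i, phi_j)]
    v_g = [a - b for a, b in zip(g_i, g_j)]
    cross = math.exp(quad(v_phi, v_g, ell))          # acts like a signal variance
    se_phi = math.exp(-0.5 * quad(v_phi, v_phi, ell))
    se_g = math.exp(-0.5 * quad(v_g, v_g, ell))
    return signal_var * cross * se_phi * se_g

phi_i, phi_j = [1.0, 0.3], [0.2, 0.9]
g_i, g_j = [0.5, -0.1], [0.0, 0.4]
ell = [0.7, 1.3]
```

Because the cross term depends on both the feature difference and the mismatch difference, it cannot be absorbed into a constant signal variance, which is the source of the extra non-stationarity discussed above.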
4 Robots, Simulators and Controllers Used
In this section we give a concise description of the robots, controllers and simulators used in experiments with BO for bipedal locomotion. We aim for our approach to be applicable to a wide range of bipedal robot morphologies and controllers, including state-of-the-art controllers (Feng et al., 2015). This ensures that our experimental results are relevant to current research for bipedal locomotion and are transferable to other systems.
We work with two different types of controllers – a reactively stepping controller and a human-inspired neuromuscular controller (NMC). The reactively stepping controller is model-based: it uses inverse-dynamics models of the robot to compute desired motor torques. In contrast, NMC is model-free: it computes desired torques using hand-designed policies, created with biped locomotion dynamics in mind. These controllers exemplify two different and widely used ways of controlling bipedal robots. In addition to this, we show results on two different robot morphologies – a parallel bipedal robot ATRIAS, and a serial 7-link biped model. Our hardware experiments are conducted on ATRIAS; the 7-link biped is only used in simulation. Our success on both robots shows that the approaches developed in this paper are widely applicable to a range of bipedal robots and controllers.
4.1 ATRIAS Robot
Our hardware platform is an ATRIAS robot (Figure 1). ATRIAS is a parallel bipedal robot with most of its mass concentrated around the torso, weighing . The legs are 4-segment linkages actuated by 2 Series Elastic Actuators (SEAs) in the sagittal plane and a DC motor in the lateral plane. Details can be found in Hubicki et al. (2016). In this work we focus on planar motion around a boom. ATRIAS is a highly dynamic system due to its point feet, with static stability only in double stance on the boom.
4.2 Planar 7-link Biped
The second robot used in our experiments is a 7-link biped (Figure 3). It has a trunk and segmented legs with ankles. Unlike ATRIAS, this is a series robot with actuators in the hip, knees and ankles. The inertial properties of its links are similar to an average human (Winter and Yack, 1987). The simulator code is modified from Thatte and Geyer (2016). The 7-link model is a canonical simulator for testing bipedal walking algorithms, for example in Song and Geyer (2015). It is a simplified two-dimensional simulator for a large range of humanoid robots, like Atlas (Feng et al., 2015). The purpose of using this simulator is to study the generalizability of our proposed approaches to systems different from ATRIAS.
4.3 Feedback Based Reactive Stepping Policy
We design a parametrized controller for controlling the CoM height, torso angle and the swing leg by commanding desired ground reaction forces and swing foot landing location.
Here, $F_x$ is the desired horizontal ground reaction force (GRF), and $K_{pt}$ and $K_{dt}$ are the proportional and derivative feedback gains on the torso angle $\theta$ and velocity $\dot{\theta}$. $F_z$ is the desired vertical GRF, and $K_{pz}$ and $K_{dz}$ are the proportional and derivative gains on the CoM height $z$ and vertical velocity $\dot{z}$. $z_{des}$ and $\theta_{des}$ are the desired CoM height and torso lean. $x_p$ is the desired foot landing location for the end of swing; $v$ is the horizontal CoM velocity, and $k$ is the feedback gain that regulates $v$ towards the target velocity $v_{tgt}$. $C$ is a constant and $d$ is the distance between the stance leg and the CoM; $T$ is the swing time.
The desired GRFs are sent to the ATRIAS inverse dynamics model, which generates the desired motor torques. Details can be found in Rai et al. (2017).
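A high-level policy of this kind can be sketched as two PD laws for the desired ground reaction forces plus a velocity-feedback rule for foot placement. The exact functional form below, including the signs, the `0.5 * v * t_swing` term and the absence of any feed-forward term, is an assumption made for illustration, as are all names; the precise controller is given in Rai et al. (2017).

```python
from dataclasses import dataclass

@dataclass
class SteppingParams:
    """Nine parameters of the (assumed) high-level policy."""
    k_pt: float       # proportional gain on torso angle
    k_dt: float       # derivative gain on torso angular velocity
    k_pz: float       # proportional gain on CoM height
    k_dz: float       # derivative gain on CoM vertical velocity
    theta_des: float  # desired torso lean
    z_des: float      # desired CoM height
    k_v: float        # velocity feedback gain for foot placement
    c: float          # constant multiplying the stance-leg-to-CoM distance
    t_swing: float    # swing time

def reactive_stepping(p, theta, theta_dot, z, z_dot, v, v_tgt, d):
    """Return desired horizontal GRF, vertical GRF and foot landing location."""
    f_x = p.k_pt * (p.theta_des - theta) - p.k_dt * theta_dot
    f_z = p.k_pz * (p.z_des - z) - p.k_dz * z_dot
    x_p = p.k_v * (v - v_tgt) + p.c * d + 0.5 * v * p.t_swing
    return f_x, f_z, x_p

params = SteppingParams(k_pt=100.0, k_dt=10.0, k_pz=500.0, k_dz=50.0,
                        theta_des=0.0, z_des=0.9, k_v=0.1, c=0.3, t_swing=0.4)
```

With this parametrization, a reduced controller can be obtained by fixing some fields by hand and exposing only the remaining ones to the optimizer.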
This controller assumes no double-stance, and the swing leg takes off as soon as stance is detected. This leads to a highly dynamic gait, as the contact polygon for ATRIAS in single stance is a point, posing a challenging optimization problem.
To investigate the effects of increasing dimensionality on our optimization, we construct two controllers with different numbers of free parameters:

5-dimensional controller: optimizing 5 parameters
(, and the feedback on are hand tuned) 
9-dimensional controller: optimizing all 9 parameters of the high-level policy
4.4 16-dimensional Neuromuscular Controller
We use neuromuscular model policies, as introduced in Geyer and Herr (2010), as our controller for the 7-link planar human-like biped model. These policies use approximate models of muscle dynamics and human-inspired reflex pathways to generate joint torques, producing gaits that are similar to human walking.
Each leg is actuated by 7 muscles, which together produce torques about the hip, knee and ankle. Most of the muscle reflexes are length or force feedbacks on the muscle state aimed at generating a compliant leg, keeping the knee from hyperextending and maintaining torso orientation in stance. The swing control has three main components – target leg angle, leg clearance and hip control due to reaction torques. Together with the stance control, this leads to a total of 16 controller parameters, described in detail in Antonova et al. (2016).
Though originally developed for explaining human neural control pathways, this controller has recently been applied to prosthetics and bipeds, for example Thatte and Geyer (2016) and Van der Noot et al. (2015). As demonstrated in Song and Geyer (2015), this controller is capable of generating a variety of locomotion behaviours for a humanoid model – walking on rough ground, turning, running, and walking upstairs, making it a very versatile controller. This is a model-free controller as compared to the reactive-stepping controller, which was model-based.
4.5 50-dimensional Virtual Neuromuscular Controller
Another model-free controller we use on ATRIAS is the virtual neuromuscular controller (VNMC), a modified version of the controller in Batts et al. (2015). VNMC maps a neuromuscular model, similar to the one described in Section 4.4, to the ATRIAS robot’s topology and emulates it to generate desired motor torques. The robot’s states are mapped to the states of a virtual 5-link bipedal robot. This virtual robot is then used by VNMC to generate knee and hip torques, which are then mapped back to the robot torques, in swing and stance. We adapt VNMC by removing some biological components while preserving its basic functionalities. First, the new VNMC directly uses joint angles and angular velocity data instead of estimating it from physiological sensory data, such as muscle fiber length and velocity. Second, most of the neural transmission delays are removed, except those utilized by the controller. The final version of the controller consists of 50 parameters including low-level control parameters, such as feedback gains, as well as high-level parameters, such as desired step length and desired torso lean. When optimized using CMA-ES, it can control ATRIAS to walk on rough terrain with height changes of 20 cm in planar simulation (Batts et al., 2015).
4.6 Simulators with Different Levels of Fidelity
To compare the performance of different methods that can be used to transfer information from simulation to hardware, we create a series of increasingly approximate simulators. These simulators emulate increasing mismatch between simulation and hardware and its effect on the information transfer. In this setting, the high-fidelity ATRIAS simulator (Martin et al., 2015), which was used in all the previous simulation experiments, becomes the simulated “hardware”. Next we make dynamics approximations to the original simulator, of the kind commonly used in simulators to decrease fidelity and increase simulation speed. For example, the complex dynamics of harmonic drives are approximated as a torque multiplication, and the boom is removed from the simulation, leading to a two-dimensional simulator. These approximate simulators now become the simulated “simulators”. As the approximations in these simulators are increased, we expect the performance of methods that utilize simulation for optimization on hardware to deteriorate.
The details of the approximate simulators are described in the two paragraphs below:
1. Simulation with simplified gear dynamics: The ATRIAS robot has geared DC motors attached to leaf springs on the legs. Their high gear ratio of 50 is achieved through a harmonic drive. In the original simulator, this drive is modelled using gear constraints in the MATLAB Simscape Multibody simulation environment. These require significant computation time as the constraint equations have to be solved at every time instant, but lead to a very good match between the robot and simulation. We replace this model with a commonly used approximation for geared systems – multiplying the rotor torque by the gear ratio. This reduces the simulation time to about a third of the original simulator, but leads to an approximate gear dynamics model.
2. Simulation with no boom and simplified gear dynamics: The ATRIAS robot walks on a boom in our hardware experiments. The boom leads to lateral torques on the robot, which have vertical and horizontal force components that need to be considered in a realistic simulation of the robot. In our second approximation, we remove the boom from the original simulator and constrain the motion of the robot to a 2-dimensional plane, making a truly two-dimensional simulation of ATRIAS. This is a common approximation for two-dimensional robots. Since this approximation has both simplified gear dynamics and no boom, it is further from the original simulator than the first approximation.
The advantage of such an arrangement is that we can extensively test the effect of unmodelled and wrongly modelled dynamics on information transfer between simulation and hardware. Even in our high-fidelity original simulator, there are several unmodelled components of the actual hardware, for example the non-rigidness of the robot parts, misaligned motors and relative play between joints. In our experiments, we find that the 50-dimensional VNMC is a sensitive controller, with little hope of directly transferring from simulation to hardware. Anticipating this, we can now test several methods of compensating for this mismatch using our increasingly approximate simulators. In the future, we would like to take these approximations further and study when there is useful information even in oversimplified simulations of legged systems.
5 Experiments
We will now present our experiments on optimizing controllers that are 5-, 9-, 16- and 50-dimensional. We split our experiments into three categories: hardware experiments on the ATRIAS robot, simulation experiments on the 7-link biped and experiments using simulators with different levels of fidelity. We demonstrate that our proposed approach is able to generalize to different controllers and robot structures and is also robust to simulation inaccuracies.
5.1 Hardware Experiments on the ATRIAS Robot
In this section we describe experiments conducted on the ATRIAS robot, described in Section 4.1. These experiments were conducted around a boom. The cost function used in our experiments is a slight modification of the cost used in (Song and Geyer, 2015):
(6) 
where is distance covered before falling, is average speed per step and contains target velocity profile, which can be variable. This cost function heavily penalizes falls, and encourages walking controllers to track target velocity.
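The piecewise structure described here (a large penalty for falling that decreases with distance covered, and a speed-tracking error for controllers that walk) can be sketched in code. This is only an illustration of that structure, not the exact form of Equation 6: the constant `fall_penalty`, the use of a Euclidean norm for the speed error, and all function and argument names are assumptions.

```python
import math
from typing import Sequence


def walking_cost(
    fell: bool,
    distance_before_fall_m: float,
    avg_speed_per_step: Sequence[float],
    target_speed_per_step: Sequence[float],
    fall_penalty: float = 100.0,  # hypothetical constant, chosen for illustration
) -> float:
    """Piecewise cost: falls are penalized heavily, walking is scored by speed tracking."""
    if fell:
        # Falling later (covering more distance) is slightly better than falling early.
        return fall_penalty - distance_before_fall_m
    if len(avg_speed_per_step) != len(target_speed_per_step):
        raise ValueError("speed profiles must have the same number of steps")
    # Euclidean norm of the per-step speed tracking error (assumed norm).
    return math.sqrt(
        sum((v - t) ** 2 for v, t in zip(avg_speed_per_step, target_speed_per_step))
    )
```

With such a shape, any controller that falls receives a cost far above that of any controller that walks, so the cost surface is discontinuous at the boundary between falling and walking.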
We do multiple runs of each algorithm on the robot. Each run typically consists of 10 experiments on the robot. Hence 3 runs for one algorithm involve 30 robot trials. Each robot trial is designed to be between to a minute long and the robot needs to be reset to its “home” position between trials. While this might not appear to be very time consuming, often parts of the robot malfunction between trials and repairs need to be done, especially when sampling unstable controllers. We try our best to keep the robot performance consistent across the different algorithms being compared.
We will present two sets of hardware experiments in the following sections. First we present experiments with the DoG-based kernel on the 5- and 9-dimensional controllers introduced in Section 4.3. In these experiments from our work in Rai et al. (2017), the inertial measurement unit (IMU) of the robot had been damaged, and we replaced it with external boom sensors. While these sensors give all the required information, they have lower resolution than the IMU, leading to noisier readings and larger time delays. This makes these experiments especially challenging. In our second set of experiments, we optimize a 9-dimensional controller using a neural network based kernel on hardware. In this new set of experiments the IMU had been fixed, leading to better state estimation on the robot. As a result, the behavior of the robot changed slightly, and we re-ran the baseline experiments for this setting. The baseline performed slightly better than in the first set of experiments, as can be expected from the improved sensing on the robot.
5.1.1 Experiments with a 5-dimensional controller and DoG-based kernel
In our first set of experiments on the robot, we investigated optimizing the 5dimensional controller from Section 4.3. For these experiments we picked a challenging variable target speed profile: . The controller was stopped after the robot took 50 steps.
To evaluate the difficulty of this setting, we sampled 100 random points on hardware. 10% of these were found to walk. In contrast, in simulation the success rate of random sampling was 27.5%. This indicates that the simulation was easier, which could be detrimental to algorithms that rely heavily on simulation, because a large portion of controllers that walk in simulation fall on hardware. Nevertheless, using a DoG-based kernel offered significant improvements over a standard SE kernel, as shown in Figure 4(a).
We conducted 5 runs each of BO with the DoG-based kernel and BO with the SE kernel, with 10 trials per run for the DoG-based kernel and 20 trials per run for the SE kernel. In total, this led to 150 experiments on the robot (excluding the 100 random samples). BO with the DoG-based kernel found walking points in 100% of runs within 3 trials. In comparison, BO with SE found walking points within 10 trials in 60% of runs, and within 20 trials in 80% of runs (Figure 4(a)).
one standard deviation. Recreated from Rai et al. (2017).
5.1.2 Experiments with a 9-dimensional controller and DoG-based kernel
Our next set of experiments optimized the 9dimensional controller from Section 4.3. First we sampled 100 random points for the variable speed profile described above, but this led to no walking points. To ensure that we have a reasonable baseline we decided to simplify the speed profile for this setting: for steps. We evaluated 100 random points on hardware, and 3 walked for the easier speed profile. In comparison, the success rate in simulation is 8% for the tougher variablespeed profile, implying an even greater mismatch between hardware and simulation than the 5dimensional controller. Part of the mismatch can be attributed to the lack of IMU in these experiments. In the 9dimensional controller, the desired CoM height as well as the feedback gains for this height are optimized. Without the IMU, our system does not have a good estimation of vertical height of the CoM, except through kinematics, leading to poor control authority. However, the IMU on ATRIAS is a very expensive fiberoptic IMU that is not commonly used on humanoid robots, and most robots use simple state estimation methods. So, this is a common setting for humanoid robots, even if it presents a challenge for the optimization methods.
We conducted 3 runs each of BO with the DoG-based kernel and BO with the SE kernel, with 10 trials per run for both. In total, this led to 60 experiments on the hardware (excluding the random sampling). BO with the DoG-based kernel found walking points within 5 trials in 3/3 runs. BO with SE did not find any walking points within 10 trials in any of the 3 runs. These results are shown in Figure 4(b).
Based on these results, we concluded that BO with the DoG-based kernel was indeed able to extract useful information from simulation and speed up learning on hardware, even when there was mismatch between simulation and hardware.
5.1.3 Experiments with a 9-dimensional controller and NN-based kernel
In the next set of experiments, we evaluated the performance of the NN-based kernel described in Section 3.1.2. We optimize the 9-dimensional controller from Section 4.3.
The target of hardware experiments was to walk for 30 steps at , similar to Section 5.1.2. However, by these experiments the IMU had been reinstalled on the robot.
We observed that the SE performance improved, even though starting from the same random samples, hyperparameter setting and speed profile. We attribute this change to a better estimation and control of the CoM vertical height.
Figure 6 shows a comparison of BO with the NN-based kernel and BO with the SE kernel. We conducted 5 runs of both algorithms with 10 trials in each run, leading to a total of 100 robot trials. BO with the NN-based kernel found walking points in all 5 runs within 6 trials, while BO with the SE kernel only found walking points in 2 of 5 runs within 10 trials. Hence, even without the explicit hand-designed domain knowledge used by the DoG-based kernel, the NN-based kernel is able to extract useful information from simulation and successfully guide hardware experiments.
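The success statistics reported in this section (the fraction of runs that found a walking point within a given number of trials) can be computed with a small helper. The sketch below assumes that a trial counts as walking when its cost falls below a threshold; the threshold, the function names and the toy cost values are all hypothetical and do not come from the robot experiments.

```python
from typing import Optional, Sequence


def first_walking_trial(costs: Sequence[float], walk_threshold: float) -> Optional[int]:
    """Return the 1-based index of the first trial whose cost indicates walking."""
    for trial, cost in enumerate(costs, start=1):
        if cost < walk_threshold:
            return trial
    return None


def fraction_walking_within(
    runs: Sequence[Sequence[float]], walk_threshold: float, max_trials: int
) -> float:
    """Fraction of runs that found at least one walking point in the first max_trials."""
    hits = sum(
        1
        for costs in runs
        if first_walking_trial(costs[:max_trials], walk_threshold) is not None
    )
    return hits / len(runs)


# Toy data: three runs of three trials each (made-up costs, falls have high cost).
toy_runs = [
    [95.0, 0.4, 0.3],
    [97.0, 96.0, 0.5],
    [98.0, 97.0, 96.0],
]
```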
5.2 Simulation Experiments on a 7-link Biped
In this section, we discuss simulation experiments with a 16-dimensional Neuromuscular controller (Section 4.4) on a 7-link biped model. These experiments, first reported in Antonova et al. (2017), also demonstrate the cost-agnostic nature of our approach by optimizing two very different costs.
Figure 7 shows BO with the DoG-based kernel, NN-based kernel and SE kernel for two different costs from prior literature. The first cost promotes walking further and longer before falling, while penalizing deviations from the target speed (Antonova et al., 2016):
(7) 
where is seconds walked, is the final CoM position, is speed and is the desired walking speed ( in our case). The second cost function is similar to the cost used in Section 5. It penalizes falls explicitly, and encourages walking at desired speed and with lower cost of transport:
(8) 
where is the distance covered before falling, is the average speed of walking, is the target velocity, and captures the cost of transport. The changed constant is to account for a longer simulation time.
Figure 7(a) shows that the NN-based kernel and the DoG-based kernel offer a significant improvement in sample efficiency over BO with the SE kernel when using the smooth cost (Equation 7), with more than 90% of runs achieving walking after 25 trials. BO with the SE kernel takes 90 trials to reach a 90% success rate. Figure 7(b) shows that the two proposed approaches perform similarly on the non-smooth cost (Equation 8). With the NN-based kernel, 70% of the runs find walking solutions after 100 trials, similar to the DoG-based kernel. However, optimizing the non-smooth cost is very challenging for BO with the SE kernel: a walking solution is found in only 1 out of 50 runs after 100 trials.
We attribute the difference in performance of the SE kernel on the two costs to the nature of the costs. If a point walks some distance , Equation 7 reduces in terms of and Equation 8 reduces by . A sharper fall in the first cost causes BO to exploit around points that walk at some distance, finding points that walk forever. BO with the second cost continues to explore, as the signal is too weak. However the success of both NNbased and DoGbased kernels on both costs shows that the same kernel can indeed be used for optimizing multiple costs robustly, without any further tuning needed. This is important because often the cost has to be changed based on the outcome of the optimization, and it would be impractical to recreate the kernel for each of these costs.
5.3 Experiments with Increasing Simulation-Hardware Mismatch
In this section, we describe our experiments with increasing simulation-hardware mismatch and its effect on approaches that use information from simulation during hardware optimization. The quality of information transfer between simulation and hardware depends not only on the mismatch between the two, but also on the controller used. For a robust controller, small dynamics errors would not cause a significant deterioration in performance, while for a sensitive controller they might be much more detrimental. There is still an advantage to studying such a sensitive controller, as it might be much more energy efficient and versatile. In our experiments, the 50-dimensional VNMC described in Section 4.5 is capable of generating very efficient gaits but is sensitive to modelling errors. Figure 8 shows the performance of the DoG-based and adjusted DoG-based kernels on the original high-fidelity simulator. While both methods find walking points in 20 trials, adjusted-DoG performs better. There is mismatch even between short and long simulations for this controller. This mismatch is compensated by the adjusted-DoG kernel.
In the rest of this section, we provide an experimental analysis of settings with increasing simulated mismatch and their effect on optimization of the 50-dimensional VNMC. We compare several approaches that improve the sample-efficiency of BO and investigate whether the improvement they offer is robust to mismatch between the simulated setting used for constructing the kernel or prior and the setting on which BO is run.
First, we examine the performance of our proposed approaches with informed kernels: , and . Figure 8(a) shows the case when informed kernels are generated using the simulator with simplified gear dynamics while BO is run on the original simulator. After 50 trials, all runs with informed kernels find walking solutions, while for SE only have walking solutions.
Next, Figure 8(b) shows performance of , and when the kernels are constructed using a simulator with simplified dynamics and without the boom. In this case the mismatch with the original simulator is larger than before and we see the advantage of using adjustment for the DoG-based kernel: finds walking points in all runs after 35 trials. also achieves this, but after 50 trials. finds walking points in of the runs after 50 trials. The performance of SE stays the same, as it uses no prior information from any simulator.
This illustrates that while the original DoGbased kernel can recover from slight simulationhardware mismatch, the adjusted DoGkernel is required if one expects higher mismatch. seems to recover from the mismatch, but might benefit from an adjusted version. We leave this to future work.
5.3.1 Comparisons of Prior-based and Kernel-based Approaches
We will classify approaches that use simulation information in hardware optimization as prior-based or kernel-based. Prior-based approaches use costs from the simulation in the prior of the GP used in BO. This can help BO considerably if costs transfer between simulation and hardware and the cost function is fixed. However, in the presence of large mismatch, points that perform well in simulation might fail on hardware. A prior-based method can then be biased towards sampling points that looked promising in simulation, resulting in even poorer performance than methods with no prior. Kernel-based approaches incorporate the information from simulation into the kernel of the GP. These can be less sample-efficient than prior-based methods, but are less likely to be biased towards unpromising regions in the presence of mismatch. They also generalize easily to multiple costs, so no additional computation is needed if the cost is changed. This is important because many of these approaches can take several days of computation to generate the informed kernel. For example, Cully et al. (2015) report taking 2 weeks on a 16-core computer to generate their map.
It is also possible to combine prior-based and kernel-based methods, as in Cully et al. (2015). We will classify these as ‘prior-based’ methods, since in our experiments the prior outweighs the kernel effects in such cases. In our comparison with Cully et al. (2015), we will implement a version with and without the prior points. We do not add a cost prior to BO using the DoG-based kernel, as this limits us to a particular cost and to high-fidelity simulators. Since both of these can be major obstacles in real robot experiments, we refrain from doing so.
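The distinction between the two families can be made concrete with a minimal GP regression sketch. It is written in plain Python with an SE kernel and is not the implementation used in the experiments: `prior_mean` stands in for a simulation-derived cost prior, `features` stands in for a simulation-derived transform of controller parameters, and both names as well as the hyperparameter values are assumptions.

```python
import math


def se_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel between two vectors."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return signal_var * math.exp(-0.5 * sq / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_posterior_mean(x_query, X, y, prior_mean=lambda x: 0.0,
                      features=lambda x: x, noise_var=0.01):
    """Posterior mean of a GP with an optional prior mean and feature transform.

    Prior-based: pass a simulation cost as prior_mean.
    Kernel-based: pass a simulation-derived transform as features.
    """
    Z = [features(x) for x in X]
    zq = features(x_query)
    K = [[se_kernel(zi, zj) + (noise_var if i == j else 0.0)
          for j, zj in enumerate(Z)] for i, zi in enumerate(Z)]
    residuals = [yi - prior_mean(xi) for xi, yi in zip(X, y)]
    alpha = solve(K, residuals)
    k_star = [se_kernel(zq, zi) for zi in Z]
    return prior_mean(x_query) + sum(k * a for k, a in zip(k_star, alpha))
```

Far from any hardware observation, the prior-based posterior falls back to the simulated cost, which is helpful when simulation is accurate and misleading when it is not. The kernel-based variant only changes which points are treated as similar, so it does not depend on the cost values seen in simulation.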
Figure 9(a) shows the performance when using simulation cost in the prior during BO. BO with a cost prior created using the original version of the simulator illustrates what would happen in the best case scenario, as optimization is merely a look-up here. When the simulator with simplified gear dynamics is used for constructing the prior, we observe significant improvements over BO with an uninformed prior. However, when the prior is constructed from the setting with simplified gear dynamics and no boom, the approach performs slightly worse than uninformed BO. This shows that while an informed prior can be very helpful when created from a simulator close to hardware, it can hurt performance if the simulator is significantly different from hardware.
Next, we discuss experiments with our implementation of the Intelligent Trial and Error (IT&E) algorithm from Cully et al. (2015). This algorithm combines adding a cost prior from simulated evaluations with adding simulation information into the kernel. IT&E defines a behavior metric and tabulates the best performing points from simulation by their corresponding behavior score. The behavior metric used in our experiments is the duty factor of each leg, which can go from 0 to 1.0. We discretize the duty factor into 21 cells of 0.05 increments, leading to a 21 × 21 grid. We collect the 5 highest performing controllers for each square in the behavior grid, creating a 5 × 21 × 21 grid. Next, we generate 50 random combinations of a 21 × 21 grid, selecting 1 out of the 5 best controllers per grid cell. Care was taken to ensure that all 5 controllers had comparable costs in the simulator used for creating the map. The cost of each selected controller is added to the prior and BO is performed in the behavior space, as in Cully et al. (2015).
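The map construction described above can be sketched as follows. This is a simplified illustration rather than the IT&E implementation: it assumes two legs, treats lower cost as better, and uses hypothetical names (`duty_cell`, `build_behavior_map`, `sample_map`) and a hypothetical tuple layout for the simulated evaluations.

```python
import random

N_CELLS = 21       # duty factor from 0 to 1.0 in steps of 0.05
CELL_WIDTH = 0.05
TOP_K = 5          # controllers kept per behavior cell


def duty_cell(duty_factor):
    """Map a duty factor in [0, 1] to one of the 21 cell indices."""
    idx = int(round(duty_factor / CELL_WIDTH))
    return min(max(idx, 0), N_CELLS - 1)


def build_behavior_map(evaluations):
    """Keep the TOP_K lowest-cost controllers per (left, right) duty-factor cell.

    evaluations: iterable of (controller_params, duty_left, duty_right, sim_cost).
    """
    grid = {}
    for params, duty_left, duty_right, cost in evaluations:
        cell = (duty_cell(duty_left), duty_cell(duty_right))
        grid.setdefault(cell, []).append((cost, params))
    return {cell: sorted(entries, key=lambda e: e[0])[:TOP_K]
            for cell, entries in grid.items()}


def sample_map(behavior_map, rng):
    """Pick one of the stored controllers per cell, giving one random map."""
    return {cell: rng.choice(entries) for cell, entries in behavior_map.items()}
```

Calling `sample_map` repeatedly with different random generators would yield different maps from the same simulated evaluations, in the spirit of the random combinations described above. Cells that no simulated controller reached are simply absent from the map.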
Figure 9(b) shows BO with IT&E constructed using different versions of the simulator. IT&E constructed using the simplified gear dynamics simulator is slightly less sample-efficient than the straightforward ‘cost prior’ approach. When constructed with the simulator with no boom, IT&E is able to improve over uninformed BO. However, it only finds walking points in 77% of the runs in 50 trials in this case, as some of the generated maps contained no controllers that could walk on the ‘hardware’. This is a shortcoming of the IT&E algorithm: it eliminates a very large part of the search space, and if the pre-selected space does not contain a walking point, no walking controllers can be sampled with BO. This problem could possibly be avoided by using a finer grid or a different behavior metric. However, tuning such hyperparameters can turn out to be expensive, both in computation and in hardware experiment time.
To separate the effects of using simulation information in the prior mean versus the kernel, we evaluated a kernel-only version of the IT&E algorithm. Figure 10(a) shows these results. The cost prior is crucial for the success of IT&E, and performance deteriorates without it. Hence, it is not practical to use IT&E on a cost different from the one it was generated for.
Nonetheless, Figure 9 showed that BO with the adjusted DoG kernel is able to handle both moderate and severe mismatch using kernel-only information; these results are collected in Figure 10(b).
In summary, we created two simulators with increasing modelling approximations, and studied the effect of using these to aid optimization on the original simulator. We found that while methods that use cost in the prior of BO can be very sample-efficient under low mismatch, their performance worsens as mismatch increases. IT&E, introduced in Cully et al. (2015), uses simulation information in both the prior mean and the kernel, and is very sample-efficient in cases of low mismatch. Even with high mismatch, it performs better than purely prior-based BO, but it does not find walking controllers reliably. In comparison, the adjusted DoG-based kernel performed well in all the tested scenarios. This shows that the adjusted DoG-based kernel can reliably improve the sample-efficiency of BO even when the mismatch between simulation and hardware is high. We would like to continue working in this direction and explore the usefulness of even simpler simulators in the future.
6 Conclusion
In this paper, we presented and analyzed in detail our work from Antonova et al. (2016), Antonova et al. (2017) and Rai et al. (2017). These works introduce domain-specific feature transforms that can be used to optimize locomotion controllers on hardware efficiently. The feature transforms project the original controller space into a space where BO can discover promising regions quickly. We described a transform for bipedal locomotion designed with the knowledge of human walking and a neural network based transform that uses more general information from simulated trajectories. Our experiments demonstrate success at optimizing controllers on the ATRIAS robot. Further simulation-based experiments also indicate potential for other bipedal robots. For optimizing sensitive high-dimensional controllers, we proposed an approach to adjust the simulation-based kernel using data seen on hardware. To study the performance of this approach, as well as to compare it to other methods, we created a series of increasingly approximate simulators. Our experiments show that while several methods from prior literature can perform well with low simulation-hardware mismatch (sometimes even better than our proposed approach), they suffer when this mismatch increases. In such cases, our proposed kernels with hardware adjustment can yield reliable performance across different costs, simulators and robots.
This research was supported in part by National Science Foundation grant IIS-1563807, the Max-Planck-Society, and the Knut and Alice Wallenberg Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.
Appendix A: Implementation Details
In this Appendix we provide a summary of data collection and implementation details. Our implementation of BO was based on the framework in Gardner et al. (2014). We used Expected Improvement (EI) acquisition function (Mockus et al., 1978). We also experimented with Upper Confidence Bound (UCB) (Srinivas et al., 2010), but found that performance was not sensitive to the choice of acquisition function. Hyperparameters for BO were initialized to default values: 0 for mean offset, 1.0 for kernel length scales and signal variance, 0.1 for (noise parameter). Hyperparameters were optimized using the marginal likelihood (Shahriari et al. (2016), Section VA). For all algorithms, we optimized hyperparameters after a lowcost controller was found (to save compute resources and avoid premature hyperparameter optimization).
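For reference, the Expected Improvement acquisition has a closed form when the GP posterior at a candidate point is Gaussian. The sketch below states that standard formula for cost minimization; it is not taken from the framework used in the experiments, and the function and argument names are assumptions.

```python
import math


def expected_improvement(mu: float, sigma: float, best_cost: float) -> float:
    """Expected Improvement for minimization.

    mu, sigma: GP posterior mean and standard deviation at the candidate point.
    best_cost: lowest cost observed so far.
    """
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best_cost - mu, 0.0)
    z = (best_cost - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_cost - mu) * cdf + sigma * pdf
```

The first term rewards candidates whose predicted cost is below the best cost seen so far, and the second rewards candidates with high posterior uncertainty. BO evaluates the candidate that maximizes this quantity.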
Kernel type  Controller dim  # Sim points  Sim duration  Kernel dim  Features in kernel 
5  20K  3.5s  1  
9  100K  5s  1  
50  200K  5s  1  
9  100K  5s  4  , , ,  
16  100K  5s  8  , , ,  
, , ,  
50  200K  5s  13  , , , , 
Our choice of SE kernel as the baseline for BO was due to its widespread use. The SE kernel belongs to a broader class of Matérn kernels. In some applications, carefully choosing the parameters of Matérn kernel could improve performance of BO. However, Matérn kernels are stationary: depend only on for all . Our approach seeks to build kernels that remove this limitation in a manner informed by simulation.
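The stationarity limitation can be checked numerically. In the sketch below, the SE kernel gives the same value for a pair of points and for the same pair shifted by a constant, whereas an SE kernel applied to transformed inputs does not. The transform `toy_phi` is a hypothetical stand-in for a simulation-derived feature transform and is chosen only to make the effect visible.

```python
import math


def se_kernel(x, y, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel: depends only on the distance between x and y."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return signal_var * math.exp(-0.5 * sq / length_scale ** 2)


def toy_phi(x):
    """Hypothetical feature transform (stand-in for a simulation-based score)."""
    return [sum(xi * xi for xi in x)]


def informed_kernel(x, y, phi=toy_phi):
    """SE kernel evaluated in feature space instead of raw parameter space."""
    return se_kernel(phi(x), phi(y))


x, y = [0.0], [1.0]
x_shifted, y_shifted = [3.0], [4.0]  # same pair, shifted by a constant

stationary_gap = abs(se_kernel(x, y) - se_kernel(x_shifted, y_shifted))
informed_gap = abs(informed_kernel(x, y) - informed_kernel(x_shifted, y_shifted))
```

Here `stationary_gap` is zero up to rounding, while `informed_gap` is large: under the transformed kernel, two controllers that are equally far apart in parameter space can be treated as very similar in one region and very different in another.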
To create the cost prior for the experiments in Section 5.3, we collected 50,000 evaluations of 30s trials for a range of controller parameters. Then we conducted 50 runs, using random subsets of 35,000 evaluations to construct the prior. The numbers were chosen such that this approach used a similar amount of computation as our kernel-based approaches. To accommodate a GP prior with a large number of points, we used a sparse GP construction provided by Rasmussen and Nickisch (2010).
References
 Antonova et al. (2016) Rika Antonova, Akshara Rai, and Christopher G Atkeson. Sample efficient optimization for learning controllers for bipedal locomotion. In Humanoid Robots (Humanoids), 2016 IEEE-RAS 16th International Conference on, pages 22–28. IEEE, 2016.
 Antonova et al. (2017) Rika Antonova, Akshara Rai, and Christopher G Atkeson. Deep kernels for optimizing locomotion controllers. In Conference on Robot Learning, pages 47–56, 2017.
 Batts et al. (2015) Zachary Batts, Seungmoon Song, and Hartmut Geyer. Toward a virtual neuromuscular control for robust walking in bipedal robots. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 6318–6323. IEEE, 2015.
 Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv preprint arXiv:1012.2599, 2010.
 Calandra et al. (2016a) Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold gaussian processes for regression. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 3338–3345. IEEE, 2016a.

 Calandra et al. (2016b) Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian Optimization for Learning Gaits Under Uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1-2):5–23, 2016b.
 Cully et al. (2015) Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.
 Deisenroth and Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
 Englert and Toussaint (2016) Peter Englert and Marc Toussaint. Combined Optimization and Reinforcement Learning for Manipulation Skills. In Robotics: Science and Systems, 2016.
 Feng et al. (2015) Siyuan Feng, Eric Whitman, X Xinjilefu, and Christopher G Atkeson. Optimization-based full body control for the DARPA robotics challenge. Journal of Field Robotics, 32(2):293–312, 2015.
 Gardner et al. (2014) Jacob R Gardner, Matt J Kusner, Zhixiang Eddie Xu, Kilian Q Weinberger, and John Cunningham. Bayesian Optimization with Inequality Constraints. In ICML, pages 937–945, 2014.
 Geyer and Herr (2010) Hartmut Geyer and Hugh Herr. A Muscle-reflex Model that Encodes Principles of Legged Mechanics Produces Human Walking Dynamics and Muscle Activities. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 18(3):263–273, 2010.

 Hansen (2006) Nikolaus Hansen. The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation, pages 75–102. Springer, 2006.
 Hubicki et al. (2016) Christian Hubicki, Jesse Grimes, Mikhail Jones, Daniel Renjewski, Alexander Spröwitz, Andy Abate, and Jonathan Hurst. ATRIAS: Design and validation of a tether-free 3D-capable spring-mass bipedal robot. The International Journal of Robotics Research, 35(12):1497–1521, 2016.
 Inman et al. (1953) Verne T Inman, Howard D Eberhart, et al. The major determinants in normal and pathological gait. JBJS, 35(3):543–558, 1953.
 Kuindersma et al. (2016) Scott Kuindersma, Robin Deits, Maurice Fallon, Andrés Valenzuela, Hongkai Dai, Frank Permenter, Twan Koolen, Pat Marion, and Russ Tedrake. Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot. Autonomous Robots, 40(3):429–455, 2016.
 Lizotte et al. (2007) Daniel J Lizotte, Tao Wang, Michael H Bowling, and Dale Schuurmans. Automatic gait optimization with gaussian process regression. In IJCAI, volume 7, pages 944–949, 2007.
 Marco et al. (2017) Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with bayesian optimization. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1557–1563. IEEE, 2017.
 Martin et al. (2015) William C Martin, Albert Wu, and Hartmut Geyer. Robust spring mass model running for a physical bipedal robot. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 6307–6312. IEEE, 2015.
 Mockus et al. (1978) J Mockus, V Tiesis, and A Zilinskas. Toward Global Optimization, Volume 2, Chapter Bayesian Methods for Seeking the Extremum. 1978.
 Peng et al. (2016) Xue Bin Peng, Glen Berseth, and Michiel van de Panne. Terrain-adaptive locomotion skills using deep reinforcement learning. ACM Transactions on Graphics (TOG), 35(4):81, 2016.
 Rai et al. (2017) Akshara Rai, Rika Antonova, Seungmoon Song, William Martin, Hartmut Geyer, and Christopher G Atkeson. Bayesian Optimization Using Domain Knowledge on the ATRIAS Biped. 2017.
 Rasmussen and Nickisch (2010) Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res., 11:3011–3015, December 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1953029.
 Shahriari et al. (2016) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando de Freitas. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 Song and Geyer (2015) Seungmoon Song and Hartmut Geyer. A Neural Circuitry that Emphasizes Spinal Feedback Generates Diverse Behaviours of Human Locomotion. The Journal of Physiology, 593(16):3493–3511, 2015.
 Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 1015–1022. Omnipress, 2010.
 Tesch et al. (2011) Matthew Tesch, Jeff Schneider, and Howie Choset. Using response surfaces and expected improvement to optimize snake robot gait parameters. In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 1069–1074. IEEE, 2011.
 Thatte and Geyer (2016) Nitish Thatte and Hartmut Geyer. Toward Balance Recovery with Leg Prostheses Using Neuromuscular Model Control. IEEE Transactions on Biomedical Engineering, 63(5):904–913, 2016.
 Theodorou et al. (2010) Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.
 Van der Noot et al. (2015) Nicolas Van der Noot, Luca Colasanto, Allan Barrea, Jesse van den Kieboom, Renaud Ronsse, and Auke J Ijspeert. Experimental validation of a bio-inspired controller for dynamic walking with a humanoid robot. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 393–400. IEEE, 2015.
 Wilson et al. (2014) Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning. The Journal of Machine Learning Research, 15(1):253–282, 2014.
 Wilson et al. (2016) Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.

 Winter and Yack (1987) DA Winter and HJ Yack. EMG profiles during normal human walking: stride-to-stride and inter-subject variability. Electroencephalography and Clinical Neurophysiology, 67(5):402–411, 1987.