1 Introduction
Reinforcement learning (RL) is becoming popular in robotics because in some cases it can deal with real-world challenges, such as noise in control and measurements, and non-convexity and discontinuities in objectives. However, most flexible RL methods require thousands to millions of data samples, which can make direct application to real-world robotics infeasible. For example, 10,000 trials/episodes of 30s each on a real robot would require on the order of 100 hours of operation. Most full-scale platforms, especially in locomotion, cannot operate this long without maintenance. Commercially available arms can now operate for longer, but sophisticated anthropomorphic hands and advanced grippers are still highly prone to breakage after even a handful of trials [1]. Hence the need for algorithms that can learn in very few trials, without causing significant wear and tear to the hardware.
In this work we focus on cases with a budget of only 10-20 trials. In such settings, using approaches like Bayesian optimization (BO) to adjust parameters of structured controllers can help improve data efficiency. However, success of BO on hardware has been demonstrated either with low-dimensional controllers or with simulation-based kernels that required hand-designed features. We propose learning simulation-based kernels in an unsupervised way with a sequential variational autoencoder (SVAE). Our approach embeds simulated trajectories into a space of latent paths, and jointly learns the probability distribution that a controller with given parameters induces over the space of latent paths. Our work is inspired by the initial success of trajectory-based BO kernels [2]; however, that success was demonstrated for BO in low dimensions (2-4D). Our results show that performance of a kernel based on raw trajectories deteriorates quickly for higher-dimensional problems. In contrast, a kernel based on latent paths can still offer gains even for 48-dimensional controllers.
Global optimization in latent space can still suffer from sampling unsuccessful controllers, especially in the absence of dense rewards. One solution is to add domain-specific constraints that point optimization in the right direction. While these can be hard to define in controller parameter space, frequently they can be easily expressed in observation/state space. For example, high velocities might be undesirable if they result in hard impacts. However, formulating this as constrained optimization could result in overly conservative controllers. Instead, we incorporate controller desirability into BO by reducing exploration in the part of the trajectory space that leads to undesirable behavior. We compress the search space during BO dynamically by scaling the distance between controllers based on their desirability, initially inferred from simulation. BO can then quickly reject the undesirable parts of the search space, allowing for more exploration in the desirable parts. Figure 1 gives an overview of the proposed approach.
We test our approach (SVAE-DC: informed kernel with Dynamic Compression) on a Daisy hexapod and an ABB Yumi manipulator on hardware (video demonstrating hardware experiments: https://youtu.be/2SvdwGZNrvY). We also conduct further simulation-based analysis on Daisy and two manipulators. On Daisy, our method consistently learns to walk in less than 10 hardware trials, outperforming uninformed BO. We also demonstrate significant gains on a non-prehensile manipulation task on Yumi. All latent components of our kernel can be adjusted online (by optimizing marginal likelihood, as is done for BO hyperparameters). We anticipate that such adjustment could be useful in future work on settings with a medium budget of trials (100+). Our code builds on the recently released BoTorch library [3] that supports highly scalable BO on GPUs. We open source our code for simulation environments, training and BO (SVAE-DC and BO code: https://github.com/contactrika/bosvaedc).
2 Background and Related Work
For learning with a small number of trials we turn to Bayesian Optimization (BO). BO can be thought of as a data-efficient RL method that obtains a reward only at the end of each trial/episode. For higher-dimensional robotics problems BO can benefit significantly from using simulation-based kernels. However, previous work required defining domain-specific features to be extracted from large-scale simulation data (see Section 2.1). Variational Autoencoders (VAEs) [4] provide an unsupervised alternative for embedding high-dimensional observations into a lower-dimensional space. For example, [5] recently used a VAE in a Gaussian Process (GP) kernel to optimize chemical molecules. In robotics, VAEs have been used to process visual and tactile data (see [6] for a survey). We are interested in encoding trajectory data, so a sequential VAE (SVAE) could be applicable. [7, 8] show SVAEs learning latent dynamics. However, their physics simulations are low-dimensional (e.g. position of a 2D ball), sequences have a length of 20-30 steps, and the focus is on visual reconstruction. We aim to develop an SVAE architecture that can easily handle simulations from full-scale robotics systems (state spaces 27D+) and much longer sequences (lengths 500-1000).
Our original motivation for embedding trajectory data into the kernel was the Behavior Based Kernel (BBK) [2]. On low-dimensional problems it matched the performance of PILCO [9], which is a popular data-efficient model-based RL algorithm for small domains. BBK is directly applicable only to stochastic policies, but we adapted it to our setting as the BBK-KL baseline. We randomize simulator parameters when collecting trajectories. Hence even though the simulator and controllers are deterministic, each controller still induces a probability distribution over the trajectories. As proposed for BBK, for kernel distances we used symmetrized KL between trajectory distributions induced by the controllers. The generation and reconstruction parts of SVAE were used to estimate this KL. Since this baseline uses a neural network in the kernel, there is some relation to methods like [10, 11] (though these focused on GP regression, and did not incorporate trajectories).
2.1 BO for Locomotion and Manipulation
Locomotion controllers most commonly used for real systems are structured and parametric [12, 13, 14]. BO has been used to optimize their parameters, e.g. [15, 16, 17]. Typically, these methods take 40 trials for low-dimensional controllers (3-5D). For high-dimensional controllers further domain information is needed. For example, [18] use simulation and user-defined features to transform the space of a 36-dimensional controller into 6D, making the search for walking controllers of a hexapod much more data-efficient. [19] employ bipedal locomotion features to build informed kernels.
In manipulation, active learning and BO have been used, for example, for grasping [20, 21]. These works did not incorporate simulation into the kernel, so their performance would be similar to BO with an uninformed/standard kernel. [22] showed advantages of a simulation-based kernel, but needed grasping-specific features. Somewhat related are works in sim-to-real transfer, like [1], though many have visuomotor control as the focus (not considered here) and usually do not adapt online. [23] do adjust simulation parameters to match reality, so it would be interesting to combine this with BO in the future for global optimality (their work employs PPO, which is locally optimal). Due to uncertainty over friction and contact forces, sim-to-real is challenging for non-prehensile problems. However, such motions can be useful to make solutions feasible (e.g. pushing when the object is too large/heavy to lift or the goal is out of reach). [24, 25, 26] report success in transfer/adaptation on a push-to-goal task, showing the task is challenging but feasible. In our experiments we consider a ‘stable push’ task: push two tall objects across a table without tipping them over. Further challenges come from interactions between the objects and the inability to recover once they tip over.
2.2 Challenges of Real-world Locomotion: the Need for Ultra Data-efficient Optimization
Learning for legged locomotion can be a daunting task, since a robot needs to perfectly balance its interaction forces with the ground to move forward. For a hexapod robot, this means coordinating the movements of six legs, as well as the forces being applied on each leg. While it is easy to find a walking gait, it is extremely difficult to find a gait that can move forward at a reasonable speed.
Recently, [27, 28] showed that RL can be used for locomotion on hardware. However, they learn conservative controllers in simulation and help transfer via system identification of actuator dynamics [27] and a user-designed structured controller [28]. While these methods can help, they do not guarantee that a controller learned in simulation will perform well on hardware. [29] showed learning to walk on a Minitaur quadruped in only two hours. The Minitaur robot has 8 motors that control its longitudinal motion, and no actuation for lateral movements. In comparison, our hexapod (Daisy) has 18 motors and can move omnidirectionally. This makes the problem of controlling Daisy especially challenging, and would require significantly longer training. However, most present-day locomotion robots get damaged from wear and tear when operated for long periods. For example, in the course of our experiments, we had to replace two motors, and fix issues such as faulty wiring and broken parts multiple times. With these considerations, we develop an ultra data-efficient approach that can learn controllers on Daisy in less than 10 hardware trials.
3 SVAE-DC: Learning Informed Trajectory-based Embeddings
We model our setting as a joint Variational Inference problem: learning to compress/reconstruct trajectories while at the same time learning to associate controllers with their corresponding probability distributions over the latent paths. For this we develop a version of sequential VAE (SVAE). The training is guided by ELBO (Evidence Lower Bound) derived for our setting directly from the modeling assumptions and doesn’t require any auxiliary objectives. First, we define notation:

$\pi_x$: policy/controller with parameters $x$; policies can be either deterministic or stochastic; for brevity we will refer to $\pi_x$ simply as ‘controller $x$’
$\xi_{1:T}$: original trajectory for $T$ time steps containing high-frequency sensor readings
$\tau_{1:K}$: latent space ‘path’ (embedding of a trajectory)
$p(\xi_{1:T} | x)$: a conditional probability distribution over the trajectories induced by controller $x$; the relationship between the controller and trajectories could be probabilistic either because the controller is stochastic, or because the simulator environment is stochastic, or both
$p(\tau_{1:K} | x)$: a conditional probability distribution over latent space paths induced by controller $x$
$G(\xi_t) \in \{0, 1\}$: a map denoting whether an observation $\xi_t$ is within an undesirable region
$\mathit{bad}_x$: fraction of time $\pi_x$ spends in undesirable regions; $\widetilde{\mathit{bad}}_x$ captures the analogous notion in latent space
Our goal is to learn $p(\tau_{1:K} | x)$, which is analogous to $p(\xi_{1:T} | x)$, only the paths are encoded in a lower-dimensional latent space. This is useful for constructing kernels for efficient BO on hardware.
As a measure of trajectory ‘quality’ we can keep track of how long each trajectory spends in undesirable regions ($\mathit{bad}_x$). For the latent paths we learn the analogous notion ($\widetilde{\mathit{bad}}_x$). We will not impose hard constraints during optimization, so the map $G$ used to compute $\mathit{bad}_x$ can be specified roughly with approximate guesses. Our framework also supports $G(\xi_t) \in [0, 1]$, but for users it is frequently easier to make a rough thresholded estimate rather than providing smooth estimates or probabilities. The graphical model we construct for this setting is shown in Figure 2. Not all independencies are captured by the illustration. So explicitly, the generative model is:
.
Approximate posterior is modeled by: .
We collect trajectories by simulating controllers with parameters for time steps.
We derive ELBO for this setting to maximize . Using ‘’ over the variables to indicate samples from the current variational approximation, we get:
(1) 
Some aspects of this model resemble a setup from [30]; see derivation details in Appendix A.
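The trajectory ‘quality’ measure used in this section can be illustrated with a short sketch. It assumes a thresholded map that flags each observation as undesirable or not, and computes the fraction of flagged time steps. The function names, the observation layout and the velocity limit are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence

Observation = Sequence[float]


def fraction_undesirable(
    trajectory: Sequence[Observation],
    is_undesirable: Callable[[Observation], bool],
) -> float:
    """Fraction of time steps whose observation is flagged as undesirable."""
    if not trajectory:
        raise ValueError("trajectory must contain at least one time step")
    flagged = sum(1 for obs in trajectory if is_undesirable(obs))
    return flagged / len(trajectory)


def too_fast(obs: Observation, limit: float = 10.0) -> bool:
    """Hypothetical rough threshold: any velocity magnitude above `limit`."""
    return any(abs(v) > limit for v in obs)


# Toy trajectory with 4 time steps and 2 velocity readings per step.
toy_trajectory = [(1.0, -2.0), (12.0, 0.5), (3.0, 3.0), (-11.0, 0.0)]
bad_fraction = fraction_undesirable(toy_trajectory, too_fast)
print(bad_fraction)  # 2 of the 4 steps exceed the limit
```

Because the map only needs to be a rough threshold, a user can combine several such checks with a logical OR without worrying about smoothness.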
4 Bayesian Optimization with Dynamic Compression
In Bayesian Optimization (BO), the problem of optimizing controllers is viewed as finding controller parameters $x^*$ that optimize some objective function $f(x)$: $x^* = \arg\max_x f(x)$. At each optimization trial BO optimizes an auxiliary function to select the next promising $x$ to evaluate. $f$ is commonly modeled with a Gaussian process (GP): $f(x) \sim \mathcal{GP}(m(x), k(x_i, x_j))$.
The key object is the kernel function $k(\cdot, \cdot)$, which encodes similarity between inputs. If $k(x_i, x_j)$ is large for inputs $x_i, x_j$, then $f(x_i)$ strongly influences $f(x_j)$. One of the most widely used kernel functions is the Squared Exponential (SE) kernel: $k_{SE}(r) = \sigma_k^2 \exp\big(-\tfrac{1}{2} r^T \mathrm{diag}(\ell)^{-2} r\big)$ with $r \equiv x_i - x_j$, where $\sigma_k^2, \ell$ are signal variance and a vector of length scales respectively. $\sigma_k^2, \ell$ are called ‘hyperparameters’ and are optimized automatically by maximizing marginal likelihood ([31], Section V-A). SE belongs to a broader class of Matérn kernels. One common parameter choice yields Matérn 5/2: $k_{5/2}(r) = \sigma_k^2 \big(1 + \sqrt{5} d + \tfrac{5}{3} d^2\big) \exp(-\sqrt{5} d)$ with $d = \sqrt{r^T \mathrm{diag}(\ell)^{-2} r}$. SE and Matérn kernels are stationary, since they depend on $r \equiv x_i - x_j$, and not on individual $x_i, x_j$. Section 2.1 discussed recent work that showed how to effectively remove non-stationarity by using informed feature transforms for kernel computations. But these required extracting domain-specific features manually, or learning to fit a pre-defined set of features using a deterministic NN in a supervised way.
We propose to use the distribution over latent paths $p(\tau_{1:K} | x)$ learned by SVAE-DC. [2] showed that a ‘symmetrization’ of KL divergence can be used to define a KL-based kernel for trajectories in the original space:
$k_{KL}(x_i, x_j) = \exp\big(-\alpha D(x_i, x_j)\big), \quad D(x_i, x_j) = KL\big(p(\xi | x_i) \,\|\, p(\xi | x_j)\big) + KL\big(p(\xi | x_j) \,\|\, p(\xi | x_i)\big)$ (2)
We could use this to define an analogous kernel in the latent space, replacing the trajectory distributions $p(\xi | x)$ with the latent path distributions $p(\tau | x)$.
In theory, this would be a natural way to define a path-based kernel in the latent space. However, it is widely known that Variational Inference tends to underestimate variances in theory [32, 33] and in practice [34, 35]. This underestimation could negatively impact the practical performance of such a kernel. Since we indeed observed variance underestimation, we implemented a version of the kernel that works with the latent means directly. We define our kernel function with:
$\phi_x = (1 - \widetilde{\mathit{bad}}_x) \cdot \bar{\tau}_x$ (3)
$k_{SVAE\text{-}DC}(x_i, x_j) = \sigma_k^2 \exp\big(-\tfrac{1}{2} (\phi_{x_i} - \phi_{x_j})^T \mathrm{diag}(\ell)^{-2} (\phi_{x_i} - \phi_{x_j})\big)$ (4)
Here $\bar{\tau}_x$ denotes the mean of the latent path distribution induced by controller $x$, and $\widetilde{\mathit{bad}}_x$ is the latent estimate of the fraction of time spent in undesirable regions.
This formulation is convenient in practice, since the form of Equation 4 allows us to apply existing machinery for optimizing kernel hyperparameters $\sigma_k^2, \ell$. We can also define an SVAE-DC-Matérn version of the kernel by changing the form of Equation 4 to the Matérn function. Scaling latent representations by $(1 - \widetilde{\mathit{bad}}_x)$ yields dynamic compression: latent representations that correspond to controllers frequently visiting undesirable parts of the space are scaled down. Hence ‘bad’ controllers are brought closer together. This allows BO to reduce the number of samples from the ‘bad’ parts of the space. ‘Dynamic compression’ here means this search space transformation is applied after SVAE training, in addition to the compression obtained by SVAE’s dimensionality reduction. The scaling can also be made nonlinear. This can help achieve aggressive compression in settings like ours with an extremely small budget of trials. The additional parameters this introduces can be optimized online in the same way as BO hyperparameters.
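The effect of dynamic compression can be illustrated with a minimal NumPy sketch: latent means are scaled toward the origin according to how undesirable the corresponding controllers are, and an SE kernel is then evaluated on the scaled representations. The linear scaling by one minus the undesirable fraction, the toy latent means and the hyperparameter values are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np


def se_kernel(a, b, signal_var=1.0, length_scales=1.0):
    """Squared exponential kernel between two vectors."""
    r = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) / length_scales
    return signal_var * np.exp(-0.5 * np.dot(r, r))


def compressed_features(latent_mean, bad_fraction):
    """Scale a latent mean toward the origin; bad_fraction lies in [0, 1]."""
    return (1.0 - bad_fraction) * np.asarray(latent_mean, dtype=float)


# Two hypothetical controllers whose latent means are far apart.
mean_a = np.array([2.0, 0.0, 1.0])
mean_b = np.array([-1.0, 2.0, 0.0])

k_plain = se_kernel(mean_a, mean_b)
# Both controllers spend 90% of the time in undesirable regions.
k_compressed = se_kernel(
    compressed_features(mean_a, bad_fraction=0.9),
    compressed_features(mean_b, bad_fraction=0.9),
)
# Controllers that never visit undesirable regions are left untouched.
k_good = se_kernel(
    compressed_features(mean_a, bad_fraction=0.0),
    compressed_features(mean_b, bad_fraction=0.0),
)
print(k_plain, k_compressed, k_good)
```

After scaling, the two ‘bad’ controllers look nearly identical to the GP, so a single poor evaluation lowers the posterior for the whole compressed region.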
Overall, SVAE-DC and the resulting kernel described above give us a fully automatic way of learning latent trajectory embeddings in an unsupervised way. For domains where the undesirability map $G$ is given we can also achieve dynamic compression of the latent space, making BO ultra data-efficient. All the components used during BO can be optimized online via the same methods already implemented for automatically adjusting BO hyperparameters.
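The sensitivity of a KL-based latent kernel to underestimated variances, which motivated working with latent means directly, can be illustrated for Gaussians with diagonal covariance. The exponentiated form of the kernel, the scale parameter `alpha` and the toy means and variances are assumptions of this sketch.

```python
import math
from typing import Sequence


def kl_diag_gauss(
    mu_p: Sequence[float], var_p: Sequence[float],
    mu_q: Sequence[float], var_q: Sequence[float],
) -> float:
    """KL(p || q) for Gaussians with diagonal covariances."""
    return sum(
        0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def symmetric_kl_kernel(mu_i, var_i, mu_j, var_j, alpha: float = 1.0) -> float:
    """exp(-alpha * D), with D the sum of the two directed KL divergences."""
    d = kl_diag_gauss(mu_i, var_i, mu_j, var_j) + kl_diag_gauss(mu_j, var_j, mu_i, var_i)
    return math.exp(-alpha * d)


mu_a, mu_b = [0.0, 0.0, 0.0], [1.0, 0.0, -1.0]
# Same means, but the second call uses variances that are 100x smaller.
k_calibrated = symmetric_kl_kernel(mu_a, [1.0] * 3, mu_b, [1.0] * 3)
k_underestimated = symmetric_kl_kernel(mu_a, [0.01] * 3, mu_b, [0.01] * 3)
print(k_calibrated, k_underestimated)
```

With underestimated variances the divergence grows so large that the kernel treats the two controllers as unrelated, even though their latent means are unchanged.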
5 SVAE-DC: NN Architectures and Training
Guided by prior literature we experimented with RNNs, LSTMs, and sequence-to-sequence RNNs. Learning was slow and frequently unsuccessful, despite trying adaptive learning rates, manual tuning, and weighting various parts of the ELBO. Using MLPs instead did not improve performance either. [36] notes that CNNs can succeed on sequence data, but one recent alternative (Quasi-RNNs [37]) did not yield a notable improvement for us. Instead, an effective idea was to view the dimensions of each trajectory step as different channels. Then we could feed the trajectory to 1D convolutional layers to learn the encoder, and use deconvolutional layers for the decoder. With that, for all our experiments (all different robot and controller architectures) we were able to use the same network parameters: 3-layer 1D convolutions with [32, 64, 128] channels (reverse order for deconvolutions; kernel size 4, stride 2) followed by an MLP layer for the outputs. We were also able to use the same latent space sizes: 3-dimensional latent states and latent sequence length 3. This yielded a small 9D optimization space for BO, which is highly desirable for optimization with few trials. Notably, this NN architecture also retained good reconstruction accuracy, not far from results with larger latent spaces and hidden sizes (256-1024). We also interpreted the controller parameters as a sequence of length 1 and used a deconvolutional architecture for the distribution over latent paths given the controller. It had 4 layers with [512, 256, 128, 128] channels, since this distribution was one of the key parts for BO (though a smaller CNN or MLP could have sufficed). For predicting the latent notion of undesirability we used a 2-layer MLP (hidden size 64). Training took 30-180 minutes on 1 GPU, using a learning rate that was halved after each 5K gradient updates until it reached a minimum value. See Appendix B for reconstruction/generation visualizations.
6 Locomotion on the Daisy Hexapod
For our locomotion experiments we used the Daisy robot (Figure 3) from HEBI Robotics [38]. It has six legs, each with 3 motors: base, shoulder and elbow. The robot is practically omnidirectional; however, the motors are velocity-limited, so the robot is unable to achieve very high velocities. A Vive tracking system was used to measure the robot’s position in the global frame for rewards.
In general, locomotion is a hard learning problem, and complex high degree-of-freedom robots complicate it further. While in simulation all 6 legs of Daisy are identical, each motor has a slightly different behavior on hardware. This also depends on the environment, and makes it extremely hard to predict the robot’s behavior from simulation. For example, one of our successful straight-walking controllers from simulation turns left when executed on a carpet floor, but turns right on a wooden floor. This raises the need for learning approaches that can transfer information from simulation to hardware, without suffering too much from the mismatch between the two. We simulated the Daisy robot in PyBullet [39]. The simulator was fast, but did not have an accurate contact model with the ground. While free-space motion of individual joints transferred to hardware, the overall behavior of the robot when interacting with the ground was very different between simulation and hardware. As a result, rewards obtained by controllers in simulation could be significantly different on hardware.
Daisy Controllers: We used Central Pattern Generators (CPGs) from [40]. These are capable of generating a large number of locomotion gaits by changing the frequency, amplitude, and offset of each joint, as well as the relative phase differences between joints. Different CPG parameters can be restricted to obtain controllers with various dimensionalities. We experimented with an 11D controller on hardware and a 27D controller in simulation. For hardware, we assume that all joints have the same amplitude, frequency and offset (3 parameters), all base motors have independent phases (6 parameters), and all shoulders and elbows have the same phase difference w.r.t. the base (2 parameters). This assumption implies that all joints are treated identically, which doesn’t always hold, since each motor has slightly different tracking and bandwidth. In the future, we would like to use alternatives that allow each motor to learn independently. For simulation, base, shoulder and elbow joints were allowed to have independent amplitudes, frequencies and offsets, shared across the six legs (9 parameters); each of the 18 joints was allowed to have an independent phase (18 parameters).
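The 11D hardware parameterization described above can be made concrete with a small sketch that expands an 11D vector into settings for all 18 joints. The parameter ordering is hypothetical, and the open-loop sinusoid is a simplified stand-in used only to show how the shared and per-leg parameters are distributed; it is not the CPG model of [40].

```python
import math
from typing import Dict, List, Sequence

NUM_LEGS = 6
JOINTS_PER_LEG = 3  # base, shoulder, elbow


def expand_11d(params: Sequence[float]) -> List[Dict[str, float]]:
    """Map an 11D parameter vector to per-joint oscillator settings.

    Assumed (hypothetical) ordering: amplitude, frequency, offset,
    six base phases, shoulder phase difference, elbow phase difference.
    """
    if len(params) != 11:
        raise ValueError("expected 11 parameters")
    amplitude, frequency, offset = params[0:3]
    base_phases = params[3:9]
    # Phase differences are relative to the base joint of the same leg.
    phase_diffs = (0.0, params[9], params[10])
    joints = []
    for leg in range(NUM_LEGS):
        for joint in range(JOINTS_PER_LEG):
            joints.append({
                "amplitude": amplitude,
                "frequency": frequency,
                "offset": offset,
                "phase": base_phases[leg] + phase_diffs[joint],
            })
    return joints


def joint_targets(joints: List[Dict[str, float]], t: float) -> List[float]:
    """Sinusoidal target for every joint at time t (seconds)."""
    return [
        j["offset"] + j["amplitude"] * math.sin(2.0 * math.pi * j["frequency"] * t + j["phase"])
        for j in joints
    ]


# Toy setting: alternate legs are half a cycle apart.
example = [0.3, 1.0, 0.1] + [0.0, math.pi] * 3 + [math.pi / 2, math.pi / 2]
joints = expand_11d(example)
targets = joint_targets(joints, t=0.0)
print(len(joints), targets[:3])
```

The 27D simulation variant would follow the same pattern with 9 shared amplitude, frequency and offset values (one set per joint type) and 18 independent phases.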
6.1 Daisy Experiments
For SVAE-DC training we sampled controllers randomly in simulation and collected the corresponding trajectories for a fixed number of time steps. For dynamic compression the states were marked as undesirable if they had: high joint velocities (more than 10 rad/sec); robot base tilting by more than 60° in roll and pitch; elbows hitting the ground; height of the base outside of [0.1, 0.7]cm from the ground. These aimed to reduce the chance of the robot breaking: controllers with high joint velocities can harm the motors on impact with the ground; tilting the torso can cause the robot to fall on its back; scraping the ground or lifting off and then falling can cause further damage. Since our BO trials were in a narrow walkway, we also marked as undesirable states that deviated from the starting coordinate of the base by more than a fixed distance. The objective function for BO combined the final coordinate of the robot (how much the robot walked forward) with the number of timesteps with velocities exceeding 10 rad/sec. All BO experiments used the UCB acquisition function.
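For readers less familiar with UCB, the selection step of one BO trial can be sketched with a plain NumPy GP. This is a generic illustration with an SE kernel on a 1D toy problem; the exploration weight `beta`, the length scale, the noise level and the data are hypothetical, and the sketch does not reproduce the BoTorch-based implementation used in the experiments.

```python
import numpy as np


def se_kernel_matrix(xa, xb, signal_var=1.0, length_scale=0.3):
    """Pairwise SE kernel between rows of xa (n, d) and xb (m, d)."""
    diff = xa[:, None, :] - xb[None, :, :]
    sq_dist = np.sum((diff / length_scale) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist)


def gp_posterior(x_train, y_train, x_query, noise_var=1e-4):
    """Zero-mean GP posterior mean and standard deviation at x_query."""
    k_tt = se_kernel_matrix(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_qt = se_kernel_matrix(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    prior_var = np.diag(se_kernel_matrix(x_query, x_query))
    var = np.clip(prior_var - np.sum(v ** 2, axis=0), 0.0, None)
    return mean, np.sqrt(var)


def ucb(mean, std, beta=2.0):
    """Upper confidence bound; larger beta favors exploration."""
    return mean + beta * std


rng = np.random.default_rng(0)
x_train = np.array([[0.1], [0.4], [0.9]])       # controllers evaluated so far
y_train = np.array([0.2, 1.0, -0.5])            # rewards observed for them
x_query = rng.uniform(0.0, 1.0, size=(200, 1))  # random candidate controllers

mean, std = gp_posterior(x_train, y_train, x_query)
next_x = x_query[np.argmax(ucb(mean, std))]
print(next_x)
```

Swapping `se_kernel_matrix` for a kernel evaluated on latent representations changes which candidates look similar to past trials, while the acquisition step stays the same.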
We completed 5 runs of BO on the Daisy robot hardware, initializing with 2 random samples, followed by 10 trials of BO (Figure 4). BO with SE kernel used the same initialization as BO with SVAE-DC kernel. For the Daisy robot on hardware, a controller was considered acceptable if it walked forward by more than a required distance during a trial of 25 seconds. For comparison to random search we sampled 60 controllers at random. Of these, only 2 were able to walk forward the required distance. So the problem was challenging, as the chance of randomly sampling a successful controller was about 3% (2 out of 60). BO with SVAE-DC kernel found walking controllers reliably in all 5/5 runs within fewer than 10 trials. In contrast, BO with SE found forward walking controllers only in 2/5 runs.
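To put the random-search comparison in perspective, the following back-of-the-envelope sketch estimates how often 10 purely random trials would contain at least one successful controller. It assumes independent samples and takes the empirical rate of 2 successes in 60 samples at face value, which is only a rough estimate.

```python
successes, samples = 2, 60
p_success = successes / samples  # empirical success rate of a random controller


def prob_at_least_one(p: float, trials: int) -> float:
    """Chance that at least one of `trials` independent samples succeeds."""
    return 1.0 - (1.0 - p) ** trials


p_10 = prob_at_least_one(p_success, 10)
print(f"{p_success:.3f} per sample, {p_10:.3f} within 10 samples")
```

Under these assumptions random search would find a walking controller within 10 trials in fewer than 30% of runs, which gives a reference point for the 5/5 and 2/5 success counts reported above.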
For simulation experiments, we created an artificial ‘sim-to-real’ gap, allowing us to gauge the potential of simulation-based kernels without running all the experiments on hardware. For each BO run we randomly sampled ground restitution parameters, and kept them fixed for all trials within a run. Hence simulation-based kernels did not have full information about the exact properties of the environment used during BO (even though the range of parameters was the same as for data collection). Kernels were informed about performance on a range of parameters, but could have caused negative transfer by failing to identify controllers that perform best in a particular setting (not only well on average across settings).
Figure 5 shows BO with a 27D controller. BO with SVAE-DC outperformed all baselines. The BBK-KL kernel obtained smaller improvements over the SE and Random baselines. This indicated that a trajectory-based kernel was useful even when optimizing a high-dimensional controller, although BBK-KL benefits were greatly diminished compared to BBK results for 2-4 dimensional controllers reported in prior work. In these experiments, SVAE without dynamic compression was very similar to SE (omitted from the plot for clarity, since it was overlapping with SE). This showed that dimensionality reduction alone does not guarantee improvement (even when the latent space contains information needed to decode back into the space of original trajectories).
7 Manipulation Experiments
Our manipulation task was to push two objects from one side of the table to another without tipping them over. For the Yumi environment the objects had mass and inertial properties similar to paper towel rolls (mass of 150g, 22cm height, 5cm radius); for Franka these had properties similar to wooden rolls (2kg, 22cm height, 8cm radius). Compared to the ‘push-to-target’ task, our task had two different challenges. The objects were likely to come into contact with each other (not only the robot arm). Moreover, they could easily tip over, especially if forces were applied above an object’s center of mass. Reward was given only at the end of the task: the distance each upright object moved in the desired direction minus a penalty for objects that tipped over (the reward formula is expressed in terms of the table width).
Controllers: We tested our approach on two types of controllers: 1) a joint velocity controller suitable for robots like ABB Yumi and 2) a torque controller suitable for robots like Franka Emika. The first was parameterized by 6 joint velocity “waypoints”, each with one target velocity for each joint of the robot arm (so $6 \times 7 = 42$ parameters for a 7DoF arm). Each “waypoint” also had a duration parameter that specified the fraction of time to be spent attaining the desired joint velocities. Overall this yielded a 48-dimensional parametric controller. The second controller type was aimed to be safe to use on robots with torque control that are more powerful than ABB Yumi. Instead of exploring randomly in torque space, we designed a parametric controller with desired waypoints in end-effector space. Each of the 6 waypoints had 6 parameters for the pose (3D position, 3D orientation) and 2 parameters for controller proportional and derivative gains. Overall this yielded a 48-dimensional parametric controller ($6 \times (6 + 2) = 48$). This controller interpolated between the waypoints using a 5th-order minimum jerk trajectory for positions, and used linear interpolation for orientations. The end-effector Jacobian for the corresponding robot model was used to convert to joint torques.
7.1 Experimental Setup and Results
For training SVAE-DC we collected 500,000 simulated trajectories for both the Yumi and Franka robots. These contained joint angles of the robot and object poses at each time step (1000 steps for Yumi and 500 steps for Franka, simulated with PyBullet at 500Hz). A step on the trajectory was marked as undesirable when: any object tipped over or was pushed beyond the table; the robot collided with the table; or the end effector was outside of the main workspace (not over the table area). Mass, friction and restitution of the objects were randomized at the start of each episode/trajectory. Randomization ranges were set to roughly resemble the variability in how real-world objects behave.
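The per-episode randomization of object properties can be sketched as follows. The ranges below are hypothetical placeholders chosen only to make the example concrete, and the uniform sampling is an assumption; the sketch shows the bookkeeping, not the simulator calls.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectProperties:
    mass_kg: float
    friction: float
    restitution: float


# Hypothetical ranges, not the ones used in the experiments.
RANGES = {
    "mass_kg": (0.1, 0.2),
    "friction": (0.3, 0.9),
    "restitution": (0.0, 0.5),
}


def sample_properties(rng: random.Random) -> ObjectProperties:
    """Draw one set of object properties uniformly from the ranges above."""
    return ObjectProperties(
        **{name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    )


rng = random.Random(0)
# Data collection: a fresh draw at the start of every simulated episode.
episode_properties = [sample_properties(rng) for _ in range(5)]
for props in episode_properties:
    print(props)
```

Because every trajectory is generated under a different draw, even a deterministic controller induces a distribution over trajectories, which is what the learned latent path distribution has to capture.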
The ABB Yumi robot available to us could operate effectively only at low velocities (a fraction of the simulation maximum). High-velocity trajectories successful in simulation yielded different results on hardware. To prevent Yumi from shutting down due to high load we stopped execution if the robot’s arm extended too far outside the main workspace, and also if it was about to collide with the table (a fixed reward was given in such cases). These factors caused a large sim-real gap. Nonetheless, BO with SVAE-DC kernel was still able to significantly outperform BO with SE (Figure 7). Even when controllers successful in simulation yielded very different outcomes on hardware, SVAE-DC kernel was still able to find well-performing alternatives (more conservative, yet successful on hardware).
For simulation experiments with manipulators we emulated the ‘sim-to-real’ gap as in the Daisy simulation: we sampled different object properties (mass, friction, restitution) at the start of each BO run. Results in Figure 8 show that BO on Yumi with SVAE-DC kernel yielded substantial improvement over all baselines. BO in the latent space of SVAE (without dynamic compression) was also able to substantially outperform all baselines, matching SVAE-DC gains after 15 trials.
Figure 9 shows BO results on Franka Emika simulation (left). Furthermore, we analyze how increasing the size of SVAE latent space and NNs impacts performance (middle). The larger latent space has more dimensions than the 9D space used in other experiments, and the hidden layer size of NNs is increased from 128 to 256. A larger latent space implies a larger search space for BO, which could impair data efficiency. Indeed, we see that BO with SVAE kernel outperforms BBK-KL and SE kernels not as early as before. However, BO with SVAE-DC is able to keep the gains and even decrease the variance between runs (well-performing points are found more reliably). This indicates that dynamic compression could counterbalance an increase in kernel dimensionality. Finally, we experimented with the Matérn kernel (right plot in Figure 9), but it did not show benefits over using the SE kernel. We attempted changing the hyperparameter prior and restricting hyperparameter ranges, but it did not consistently outperform random search (the same held for SE in high dimensions). The performance of BO with SVAE kernel using Matérn as the outer kernel function showed modest improvement over baselines. In contrast, BO with SVAE-DC kernel kept most improvements.
8 Conclusion
In this work we employed BO to optimize robot controllers with a small budget of trials. Previously, the success of BO has been either limited to low-dimensional controllers or required simulation-based kernels with domain-specific features. We proposed an unsupervised alternative based on a sequential variational autoencoder. We used it to embed simulated trajectories into a latent space, and to jointly learn the relation between controllers and the latent space paths they induce. Furthermore, we provided a mechanism for dynamic compression, helping BO reject undesirable regions quickly and explore more in other regions. Our approach yielded ultra data-efficient BO in hardware experiments with hexapod locomotion and a manipulation task, using the same SVAE-DC architecture, training and BO parameters.
This research was supported in part by the Knut and Alice Wallenberg Foundation.
References
 OpenAI [2018] OpenAI. Learning dexterous inhand manipulation. arXiv:1808.00177, 2018.

Wilson et al. [2014]
A. Wilson, A. Fern, and P. Tadepalli.
Using Trajectory Data to Improve Bayesian Optimization for
Reinforcement Learning.
Journal of Machine Learning Research
, 15(1):253–282, 2014.  [3] Max Balandat, Brian Karrer, Daniel Jiang, Ben Letham, Sam Daulton, Andrew Wilson, Eytan Bakshy. BoTorch. https://botorch.org/. Accessed: 201905.
 Kingma and Welling [2013] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv:1312.6114, 2013.
 GómezBombarelli et al. [2018] R. GómezBombarelli, J. N. Wei, D. Duvenaud, J. M. HernándezLobato, B. SánchezLengeling, D. Sheberla, J. AguileraIparraguirre, T. D. Hirzel, R. P. Adams, and A. AspuruGuzik. Automatic chemical design using a datadriven continuous representation of molecules. ACS central science, 4(2):268–276, 2018.
 Lesort et al. [2018] T. Lesort, N. DíazRodríguez, J.F. Goudou, and D. Filliat. State representation learning for control: An overview. Neural Networks, 2018.
 Yingzhen and Mandt [2018] L. Yingzhen and S. Mandt. Disentangled sequential autoencoder. In International Conference on Machine Learning, pages 5656–5665, 2018.

Fraccaro et al. [2017]
M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther.
A disentangled recognition and nonlinear dynamics model for unsupervised learning.
In Advances in Neural Information Processing Systems, pages 3601–3610, 2017.  Deisenroth and Rasmussen [2011] M. Deisenroth and C. E. Rasmussen. Pilco: A modelbased and dataefficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML11), pages 465–472, 2011.
 Calandra et al. [2016] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 3338–3345. IEEE, 2016.
 Wilson et al. [2016] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. In Artificial Intelligence and Statistics, pages 370–378, 2016.
 Thatte et al. [2018] N. Thatte, H. Duan, and H. Geyer. A method for online optimization of lower limb assistive devices with high dimensional parameter spaces. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–6. IEEE, 2018.
 Feng et al. [2015] S. Feng, E. Whitman, X. Xinjilefu, and C. G. Atkeson. Optimizationbased Full Body Control for the DARPA Robotics Challenge. Journal of Field Robotics, 32(2):293–312, 2015.
 Gong et al. [2018] Y. Gong, R. Hartley, X. Da, A. Hereid, O. Harib, J.K. Huang, and J. Grizzle. Feedback control of a cassie bipedal robot: Walking, standing, and riding a segway. arXiv:1809.07279, 2018.
 Calandra [2017] R. Calandra. Bayesian Modeling for Optimization and Control in Robotics. PhD thesis, Darmstadt University of Technology, Germany, 2017.
 Lizotte et al. [2007] D. J. Lizotte, T. Wang, M. H. Bowling, and D. Schuurmans. Automatic Gait Optimization with Gaussian Process Regression. In International Joint Conference on Artificial Intelligence (IJCAI), volume 7, pages 944–949, 2007.
 Tesch et al. [2011] M. Tesch, J. Schneider, and H. Choset. Using response surfaces and expected improvement to optimize snake robot gait parameters. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1069–1074. IEEE, 2011.
 Cully et al. [2015] A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret. Robots that can adapt like animals. Nature, 521(7553):503–507, 2015.
 Rai et al. [2019] A. Rai, R. Antonova, F. Meier, and C. G. Atkeson. Using simulation to improve sample-efficiency of Bayesian optimization for bipedal robots. Journal of Machine Learning Research, 20(49):1–24, 2019.
 Kroemer et al. [2010] O. Kroemer, R. Detry, J. Piater, and J. Peters. Combining active learning and reactive control for robot grasping. Robotics and Autonomous systems, 58(9):1105–1116, 2010.

 Montesano and Lopes [2012] L. Montesano and M. Lopes. Active learning of visual descriptors for grasping using nonparametric smoothed beta distributions. Robotics and Autonomous Systems, 60(3):452–462, 2012.
 Antonova et al. [2018] R. Antonova, M. Kokic, J. A. Stork, and D. Kragic. Global search with Bernoulli alternation kernel for task-oriented grasping informed by simulation. In Conference on Robot Learning, pages 641–650, 2018.
 Chebotar et al. [2018] Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. Ratliff, and D. Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. arXiv:1810.05687, 2018.
 Peng et al. [2018] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.
 He et al. [2018] Z. He, R. Julian, E. Heiden, H. Zhang, S. Schaal, J. Lim, G. Sukhatme, and K. Hausman. Zero-shot skill composition and simulation-to-real transfer by learning task representations. arXiv:1810.02422, 2018.
 Arnekvist et al. [2019] I. Arnekvist, D. Kragic, and J. A. Stork. VPE: Variational Policy Embedding for Transfer Reinforcement Learning. In 2019 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2019.
 Tan et al. [2018] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. arXiv:1804.10332, 2018.
 Li et al. [2018] T. Li, A. Rai, H. Geyer, and C. G. Atkeson. Using deep reinforcement learning to learn high-level policies on the ATRIAS biped. arXiv:1809.10811, 2018.
 Haarnoja et al. [2019] T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, and S. Levine. Learning to walk via deep reinforcement learning. In Robotics: Science and Systems (RSS), 2019.
 Louizos et al. [2016] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. International Conference on Learning Representations, 2016.
 Shahriari et al. [2016] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
 Minka [2005] T. Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.
 Bishop [2006] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
 Riquelme et al. [2018] C. Riquelme, M. Johnson, and M. Hoffman. Failure modes of variational inference for decision making. Prediction and Generative Modeling in RL Workshop (AAMAS, ICML, IJCAI), 2018.
 Tschiatschek et al. [2018] S. Tschiatschek, K. Arulkumaran, J. Stühmer, and K. Hofmann. Variational inference for data-efficient model learning in POMDPs. arXiv:1805.09281, 2018.
 Bai et al. [2018] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271, 2018.

 Bradbury et al. [2017] J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. International Conference on Learning Representations, 2017.
 [38] Hebi Robotics. http://docs.hebi.us. Accessed: 2019-06.
 [39] Pybullet simulator. https://github.com/bulletphysics/bullet3. Accessed: 2019-06.
 Crespi and Ijspeert [2008] A. Crespi and A. J. Ijspeert. Online optimization of swimming and crawling in an amphibious snake robot. IEEE Transactions on Robotics, 24(1):75–87, 2008.
 Kingma et al. [2014] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
Appendix A: SVAE-DC Modeling Details
The backbone of our model is inspired by hierarchical constructions, like those developed in [30, 41]. However, these works considered supervised and semi-supervised settings, where a discrete label was associated with each high-dimensional data point (e.g. a label for an image). We are dealing with sequential trajectory data instead, so the internal structure of our data is different. But for the moment let us think about each trajectory as a point in some high-dimensional space. Our idea is to interpret controllers as continuous ‘labels’ for trajectories. Then, on a high level, for the random variables involved (controller, trajectory and latent path) we can extend the standard ELBO bound by conditioning on the controller (Equation 5).
In Equation 5, the subscripts on the variational approximation and on the generative part of the model denote their respective parameters. In our work, both sets of parameters are weights of deep neural networks. It is customary to drop the subscripts indicating NN weight parameters as a shorthand notation.
The derivation of Equation 5 is similar to [30]. We can also recognize its similarity to a standard ELBO for a simplified model without the controller variable.
In our case, the controller is observed (we know which controller is executed when we obtain a trajectory), so there is no further uncertainty about it. Also, from the independencies in the model we see that the rest of the variables are independent of the controller given the latent path. Hence conditioning the distribution over latent paths on the controller is the only modification that appears in the bound. This is why terms like the log-prior of the controller do not appear in our ELBO, but they would have been included if we also had a nontrivial prior over controllers. Our construction treats controllers and trajectories as observed data, which is in fact what we have available from simulating trajectories. For data collection we can sample controllers at random, since we assume access to a relatively inexpensive simulator (in the sense that it is viable to simulate 100K+ trajectories for training). So we do not need a sophisticated prior over controllers.
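As a sketch of the structure described above, the two bounds can be written side by side. The symbols are assumptions introduced here for illustration: x for controller parameters, \xi for a simulated trajectory, \tau for a latent path, q for the variational approximation and p for the generative model (NN weight subscripts dropped).

```latex
% Standard ELBO for the simplified model without the controller variable
\log p(\xi) \;\ge\;
  \mathbb{E}_{q(\tau \mid \xi)}\big[ \log p(\xi \mid \tau) + \log p(\tau) - \log q(\tau \mid \xi) \big]

% Conditional version: x is observed, and \xi is independent of x given \tau
\log p(\xi \mid x) \;\ge\;
  \mathbb{E}_{q(\tau \mid \xi)}\big[ \log p(\xi \mid \tau) + \log p(\tau \mid x) - \log q(\tau \mid \xi) \big]
```

Under these assumptions the only difference between the two bounds is that the fixed prior over latent paths is replaced by a learned conditional, and no prior term over controllers is needed.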
To derive the full SVAE-DC ELBO we can use the decomposition assumptions of the generative model and the approximate posterior:
Generative model:  
Approximate posterior: 
(6) 
The 4 terms inside the expectation in Equation 6 above correspond to the 4 neural networks whose parameters are optimized by gradient ascent to maximize the ELBO. The choices for their architectures are described in the main paper. For ease of implementation we treat the outputs of the approximate posterior network as a single latent code, and separate it into components only when needed (e.g. to feed only one part of the code into a particular NN).
One advantage of our formulation is that it is agnostic to whether policies/controllers are stochastic or deterministic, and to whether the simulators used to collect samples are stochastic or deterministic. This is especially convenient, since in robotics deterministic controllers are used widely, while the Reinforcement Learning community frequently considers stochastic policies and environments. With our model, the stochasticity of either the environment or the controllers (or both) will be encoded in the distribution that controllers induce over latent paths. Even in the case of deterministic controllers and environment (a deterministic relationship between controller and trajectory) the model remains meaningful because of the latent bottleneck and the randomness coming from sampling of controllers during data collection.
One fair question would be: why learn an embedding into a lower-dimensional space of latent paths jointly with learning the distribution that controllers induce over this space, instead of decomposing the problem into separate dimensionality reduction and modeling stages? For an arbitrary space of paths (either low-dimensional, or even the original high-dimensional space of trajectories) the relationship between the controller parameters and the corresponding probability distribution over the paths would be challenging to model. This is because it involves the controller properties and the dynamics of the physical environment, both of which are usually highly nontrivial. In the joint model we propose, the term that models latent paths given the controller can be seen as a ‘regularization’ part of the ELBO. It keeps the latent representation well suited for modeling the relationship between controllers and latent paths. The terms pertaining to ‘reconstructing’ original trajectories are the encoder and decoder. Learning progress for these is fast if there is sufficient capacity in the latent bottleneck. However, if these make fast progress but learn a space of latent paths that is not easy to relate to the space of controllers, then the controller-conditioned term will drop. Hence the ELBO will be lower for this ‘inconvenient’ latent representation, encouraging alternatives. Consequently, our joint representation allows not only ‘compressing’ the space of trajectories, but also finding a compression scheme that simplifies the problem of modeling the distribution over latent paths induced by a controller.
As is customary with VAEs, at first we expressed the variational approximate posterior and the generative model components as multivariate Gaussians with diagonal covariance. Later we found that using Laplace distributions yielded more consistent training results. The reconstruction and generation quality for successful training runs was comparable. However, some runs using Gaussians collapsed to the mean instead of learning useful latent representations. So we kept Laplace as the default choice.
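To make the Gaussian versus Laplace choice concrete, the following is a minimal NumPy sketch of the two ingredients a diagonal Laplace component needs: a reparameterized sample and a log-density. The function names are hypothetical and this is not the code attached to the submission; it only illustrates the kind of swap described above.

```python
import numpy as np

def laplace_sample(loc, scale, rng):
    """Reparameterized sample loc + scale * eps, with eps ~ Laplace(0, 1)."""
    u = rng.uniform(-0.5, 0.5, size=np.shape(loc))
    # Keep u strictly inside (-0.5, 0.5) so the logarithm below stays finite.
    u = np.clip(u, -0.5 + 1e-12, 0.5 - 1e-12)
    eps = -np.sign(u) * np.log1p(-2.0 * np.abs(u))  # inverse-CDF transform
    return loc + scale * eps

def laplace_log_prob(z, loc, scale):
    """Log-density of independent Laplace components, summed over the last axis."""
    return np.sum(-np.log(2.0 * scale) - np.abs(z - loc) / scale, axis=-1)

def gaussian_log_prob(z, loc, scale):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * scale**2)
                  - 0.5 * ((z - loc) / scale) ** 2, axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    loc, scale = np.zeros(4), np.ones(4)
    z = laplace_sample(loc, scale, rng)
    print("sample:", z)
    print("Laplace log-prob:", laplace_log_prob(z, loc, scale))
    print("Gaussian log-prob:", gaussian_log_prob(z, loc, scale))
```

The Laplace density penalizes deviations linearly rather than quadratically, so points far from the mean are penalized less severely than under a Gaussian with the same scale parameter. Whether this explains the more consistent training runs reported above is not something this sketch can establish.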
Appendix B: SVAE-DC Training Visualizations
Below we include visualizations of the training progress. We developed an easy-to-use training pipeline that generates training statistics and visualization videos in TensorBoard. We included our code and a detailed README with instructions on how to install and use the codebase. We took care to comment our implementation, so all further details about our implementation, parameter choices and training procedure should be easy to infer from the code attached to this submission.
Appendix C: Parametric vs Intrinsic Dimensionality
When optimizing higher-dimensional controllers with few trials, one could question whether BO with SE kernel should be among the baselines. If our reward functions came from an arbitrary distribution, then for BO in a 30D space, for example, we would expect to need at least 60 trials to start seeing the benefits. However, our reward landscapes come from real-world problems, not from purely analytic constructions. While robotics problems have a clear parametric dimensionality, their intrinsic dimensionality is usually unknown. The vision community is familiar with this concept: they frequently refer to a ‘lower-dimensional manifold of real-world images’. The intrinsic dimensionality of vision problems could be orders of magnitude lower than their parametric dimensionality expressed in pixel space.
In the context of BO, consider a 30D quadratic in which every dimension contributes. Even on this simple quadratic, BO with SE kernel gives only modest gains for the first 60 trials. Now consider a 30D quadratic in which a large number of dimensions do not contribute significantly. Figure 13 shows that in this case BO with SE kernel succeeds with few trials (as long as hyperparameters do not force BO to over-explore). So the SE baseline is a reasonable check for adequate performance in such settings.
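The comparison above can be explored with a small self-contained sketch: a GP with SE kernel, a lower-confidence-bound acquisition evaluated on random candidates, and a weighted quadratic cost. Everything specific here is an assumption made for illustration (the weights, lengthscale, acquisition rule, candidate count and the name run_bo); it is not the setup used to produce Figure 13.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared exponential kernel between rows of A (n x d) and B (m x d)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return signal_var * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def weighted_quadratic(X, weights):
    """Cost f(x) = sum_i w_i * x_i^2 (lower is better), with minimum 0 at the origin."""
    return np.sum(weights * X**2, axis=1)

def gp_posterior(X, y, Xs, lengthscale, noise=1e-4):
    """Zero-mean GP posterior mean and std at Xs, given observations (X, y)."""
    K = se_kernel(X, X, lengthscale) + noise * np.eye(len(X))
    Ks = se_kernel(X, Xs, lengthscale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = np.maximum(1.0 - np.sum(v**2, axis=0), 1e-12)  # prior variance is 1
    return mean, np.sqrt(var)

def run_bo(weights, n_trials=20, n_candidates=2000, lengthscale=2.0, beta=2.0, seed=0):
    """Minimize the weighted quadratic on [-1, 1]^d; return best cost after each trial."""
    rng = np.random.default_rng(seed)
    d = len(weights)
    X = rng.uniform(-1.0, 1.0, size=(2, d))  # two random initial trials
    y = weighted_quadratic(X, weights)
    while len(y) < n_trials:
        cand = rng.uniform(-1.0, 1.0, size=(n_candidates, d))
        y_norm = (y - y.mean()) / (y.std() + 1e-9)  # standardize observed costs
        mean, std = gp_posterior(X, y_norm, cand, lengthscale)
        x_next = cand[np.argmin(mean - beta * std)]  # lower confidence bound
        X = np.vstack([X, x_next])
        y = np.append(y, weighted_quadratic(x_next[None, :], weights))
    return np.minimum.accumulate(y)

if __name__ == "__main__":
    all_dims = np.ones(30)  # every dimension contributes
    few_dims = np.concatenate([np.ones(3), 1e-3 * np.ones(27)])  # 3 dominant dimensions
    print("all dims, best cost per trial:", run_bo(all_dims))
    print("few dims, best cost per trial:", run_bo(few_dims))
```

Running the two cases side by side gives a feel for how the effective number of contributing dimensions, rather than the parametric dimensionality, can shape early BO progress. The exact numbers depend strongly on the assumed hyperparameters and on the random candidate search.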
In most cases it is difficult to estimate (or even approximately guess) the intrinsic dimensionality of a real-world problem. A parametric representation does not even give an upper bound on complexity. In robotics, the intrinsic dimensionality of a problem could be higher than the parametric dimensionality of its most commonly used representation. For example, the effects of friction are sometimes abstracted away as a few parameters of a simplified friction model. When such problems are declared ‘low-dimensional’, it creates a misconception that they are ‘easy’. In fact, they remain hard for cases where friction matters for success and a crude model is inadequate. It is also not easy to gauge the complexity of a problem by applying algorithms like PCA on the whole optimization space. A ‘simple’ structure might be characteristic of only a part of the space. For example, failing controllers could exhibit near-chaotic behavior, making the problem have high intrinsic dimensionality when considering the whole space. But this space might contain subregions where the relationship between reward and change in controller parameters is gradual. If domain knowledge or informed kernels can help find a few points close to a successful region, the gains for BO could be substantial. BO could quickly focus exploration on the promising regions, without being restricted to a particular model or simulator structure that originally helped to point to a promising part of the space.