I Introduction
Soft microrobots are mobile robotic devices at the submillimeter scale that are made of soft, stimuliresponsive materials, which can act as sensors and/or actuators [1, 2, 3, 4]. Indeed, as at the micro scale no robotic components are available, basic robotic functionalities need to be implemented in microrobots by the very materials they are made of. Recently, soft microrobots consisting of photoresponsive liquidcrystal elastomers (LCEs) have been presented [5]. The materialbased actuation of these soft continuum robots is powered and controlled by structured light fields (see Fig. 1
), resulting in microrobots with many internal degrees of freedom (DOFs)
[6]. We define a gait, in the context of the microrobotic system, as any pattern of periodic deformation that results in locomotion. The deformation and, thus, the gait of these soft microrobots can be flexibly controlled and tuned by the driving light field, enabling the adaptation of the gait to different environments [7]. Nonetheless, finding good locomotion performance by handtuning the light field parameters is extremely inefficient and timeconsuming, especially considering that: (i) often there is no accurate model of the microrobot locomotion; (ii) each microrobot is to some extent different because of their manual fabrication process; and (iii) the response of the microrobot changes over time as the material slowly degrades. Therefore, learning effective gaits from data is an attractive approach.In this paper, we propose an automatic learning procedure to find the optimal gait for lightdriven soft microrobots in a given environment. The optimal controller’s parameters, i.e. the parameters of the driving light field, are progressively evaluated by Bayesian Optimization (BO). Learning in this scenario presents a few peculiar challenges. First, the developed learning scheme should be highly data efficient, making use of prior information and leveraging empirical data to approach optimal locomotion performance in a limited number of experiments. At the same time, it should be robust against differences among microrobots, performing well independently of the specific microrobot used. This work represents the first attempt to learn a controller from data for submillimeter robots. First, we define a method to efficiently evaluate the locomotion speed from the microrobot’s tracking images. We then report a comparison of a number of BO settings (i.e. acquisition functions, kernels, hyperparameters) on a semisynthetic dataset, which is based on actual data from microrobot experiments and augmented to model intermicrorobot variability. Finally, we demonstrate successful learning in realrobot experiments, with a locomotion performance improvement of
with respect to an informed initial guess. The experimental validation was performed using a previously untested microrobot. The paper is organized as follows. The microrobotic system is described in Sec. II, and the proposed gait learning approach is introduced in Sec. III. Section IV presents the microrobotic controller and the extraction of prior information from previously published experiments. The benchmarking of a number of BO settings on a semisynthetic dataset is described in Sec. V, whereas Sec. VI reports successful learning in microrobot experiments.Ii Microrobotic System
In this work, we learn optimal gaits for the soft microrobots developed in [5]. These consist of monolithic LCEs that actively and continuously deform in response to light. As the deformation occurs locally, desired motion patterns can be obtained by exciting the microrobots with proper light fields. This approach results in microrobots having many internal DOFs [6]. In other words, the lightbased control allows these submillimeter monolithic structures to behave as if they contained many independent, wirelessly controllable actuators. The driving light fields are generated by modulating the intensity of a laser beam and projected onto the microrobots workspace by a microscope objective [5]. The microrobot is observed through the same objective by a camera that images the workspace at 10 frames per second and at a resolution of 12801024 pixel (). The laser beam (532 nm – Verdi G10, Coherent) is modulated in space and time by means of a computercontrolled digital micromirror device (DMD) module (V7000, ViaLUX). The DMD consists of an array of 1024768 micromirrors, each representing a pixel in the projected light field. The micromirrors modulate the light intensity in a binary fashion, yet grayscale levels control can be achieved by highfrequency pulsewidth modulation (PWM). Therefore, the microrobotic system has, in principle, almost 10 control parameters that can each assume 2 values. This represents a huge parameter space for gait optimization. Learning controller parameters directly in such a highdimensional parameter space would require a prohibitive amount of experimental data. Therefore, at this stage, we propose learning gaits by employing an effective controller structure as in [5] and tuning its critical free parameters with BO.
Iii Gait Learning Approach
In this section, we formulate the problem of gait learning for lightdriven soft microrobots as a costminimization problem over a suitably parametrized openloop controller. The gait learning algorithm is schematically represented in Fig. 2. Following the approach in [5], we use periodic light patterns with only few open parameters. The optimization then takes place in the parameter space of the light pattern. We also provide here a brief overview of BO with Gaussian Processes (GPs) [8], which represents the core of he gait learning method presented herein. BO with GPs has successfully been used, for example, for gait learning with bipedal walkers [9], quadrupedal robots [10], and in cmscale hexapodal robots [11]. It has also been proposed for automatic feedback controller tuning [12, 13, 14]
. A major strength of GPs is that they allow one to include existing information about the system in the form of a probabilistic prior. This reduces the size of the data set needed to learn a good approximation of the cost function, while at the same time retaining the flexibility of a nonparametric model.
Iiia Controller Learning
We formulate the problem of gait learning in soft microrobots as a parametric controller tuning problem. This work builds on the automatic controller tuning methods proposed in [12], where an unknown controller cost function was learned in a dataefficient way using BO with GPs. The microrobot is considered an uncertain, nonlinear dynamic system. The dynamics of a specific microrobot in a given environment are unknown and are evaluated by carrying out experiments. The experimental observations are affected by uncertainty and noise. The general state space model of the microrobot is
(1) 
where
is the system’s state (which, we assume, can be fully observed or estimated),
the input, and is the process noise. The input signal is generated by a controller , which is a function of a set of bounded parameters and of the desired movement reference , namely,(2) 
For gait learning, we need to evaluate how each input sequence influences the unknown system dynamics (1). For this, a cost function is used to map the controller parameters to a scalar cost value given the experimental data. Learning an optimal gait for the given objective corresponds to finding an optimal that minimizes :
(3) 
Due to the process noise in (1), however, we never directly observe but a noisy version of it, , where
is assumed zeromean Gaussian noise with variance
.IiiB Gaussian Process
To account for uncertainty in measurements and state estimation, as well as to formulate prior knowledge about the cost function, if such knowledge is available, the cost function is modeled as a GP. A GP can be viewed as a distribution over a space of functions. It is a nonparametric model, defined by its mean and covariance functions, and , where is the kernel function
(4) 
Given a set of samples, , the posterior at is
(5)  
where with , , and is the covariance matrix of the training points plus the noise variance , where is the Kronecker delta. While the prior mean is often assumed to be constant, the choice of the kernel function encodes our assumptions and prior knowledge about the cost function. The kernel function usually depends on a set of hyperparameters. Typical hyperparameters are the length scales for each dimension of the parameter space and the signal variance . The kernel’s length scales are related to the rate of variation of the underlying function, where short length scales correspond to quicklyvarying functions and long length scales to slowvarying ones. The signal variance describes the width of the distribution and relates to the magnitude of variation around the mean. In principle, kernel hyperparameters may be obtained from problem knowledge, or estimated from data, e.g., by maximizing the marginal likelihood of the observation with respect to the hyperparameters. Here, we follow a hybrid approach by setting a prior belief over the hyperparameters (hyperprior
), and using maximum a posteriori (MAP) estimation for the likelihood, i.e., maximizing marginal likelihood times hyperprior. We can thus optimize hyperparameters of the GP after each experimental evaluation. Encoding knowledge in the form of hyperpriors facilitates hyperparameter learning when the data set is small, as is the case for the problem herein, while still retaining flexibility by adjusting hyperparameters based on observed data. In this work the hyperparameters are estimated from previous experimental results, as reported in Sec.
IVD.IiiC Bayesian Optimization
As the system dynamics are unknown, the gait performance, captured by the cost function (3), can only be optimized by sampling, i.e. by conducting experiments on the physical system. BO is a derivativefree global optimization algorithm, which is particularly suited for problems with costly and uncertain function evaluations [15]. Given the current belief about the cost function (i.e. the posterior GP (5)), BO suggests the parameters for the next evaluation by maximizing an acquisition function ,
(6) 
. The acquisition function is formulated to capture the utility of knowing the outcome of the experiment at the suggested point, for example, the probability of improvement or some expected information gain. The concrete acquisition functions used for the gait learning problem are described in Sec.
VC. After evaluating , the new data point is added to , the GP posterior (5) updated, and BO enters the next iteration.Iv Learning Gait for 1d Locomotion
In the previous section, we presented gait learning for microrobots as an optimization problem, where parameters of an openloop controller are optimized from experimental trials via BO. In this section, we consider a concrete instance of a gait learning problem, namely learning the onedimensional (1D) movement of cylindrical microrobots submerged in a liquid (see Fig. (a)a). In particular, the microrobots are placed on the bottom of a Petri dish (which is coated with Polydimethylsiloxane (PDMS) to avoid adhesion) and submerged in silicone oil, as in [6]. The locomotion of the microrobots occurs as a result of their periodic deformation, which is powered and controlled by the projected light fields. This section continues by describing the light patterns used to control the microrobot and their defining parameters. We then define the movement objective and the corresponding cost function used to learn the 1D gait. Lastly, we estimate the GP hyperparameters (cf. Sec. IIIB).
Iva Light Pattern
The gait objective for this learning task is locomotion along one dimension, for which linear traveling waves of deformation are generated in the microrobot’s body. The deformation is controlled by the light field, which consists of a linear stripe pattern (see Fig. 1). The controller generates the light field with intensity , defined as a rectangular wave in the 2D workspace of the microrobot. This light field represents the controller output from (2). As the microrobot is aligned along the axis, the light pattern can be expressed as
(7) 
The pattern is parametrized by the duty cycle , the spatial wavelength , and the time domain frequency . In the reported experiment, we set the frequency to , whereas the parameters are to be optimized.
IvB Linear Movement
Swimming at the microscale is governed by the Stokes equations, where inertial forces play a negligible role with respect to viscous forces. In other words, a microrobot generating a constant propulsive force will swim at a constant speed , and its acceleration to this speed will be instantaneous. Therefore, the acceleration of the microrobot is neglected and its state space consists of two dimensions, namely the microrobot position and its speed . However, because of the control scheme described in Sec. IVA, the position of the microrobot’s center of mass will have some oscillation at the frequency of the traveling wave deformation (see Fig. (b)b). We are interested in optimizing the linear component of the propulsion, regardless of the oscillations caused by the exciting light field. Therefore we consider only the microrobots constant speed . As the cost in (3), we use the deviation of this constant speed component from the desired, . The estimation of from the microrobot’s position tracking is explained in the next section.
IvC Velocity Estimation
We estimate the velocity based on the experimental setup described in Sec. II. The position
of the microrobot is calculated by binarizing the acquired images and calculating the center of mass of the area corresponding to the microrobot (see Fig.
(a)a). One possible approach to obtain the underlying constant speed from the observed state trajectory consists in estimating the instantaneous velocity , by differentiating the position , and then averaging over an integer number of periods . As derivatives are highly susceptible to noise and the sampling frequency is close to the oscillation frequency, we chose to use a different approach. We estimate the average speed directly from the position by fitting the phenomenological model(8) 
This model captures the two types of movement of the soft microrobots, namely the linear movement at constant speed and the oscillations at frequency . The oscillation term is defined by the known frequency , and the unknown amplitude and phase shift . The linear term includes the offset , which accounts for changes in the position that occur during the transient (cf. Fig. (b)b). In previous experiment we observed the presence of a transient, which is not due to acceleration but to the initial response time of the material with respect to the start of the light projection. As we have no phenomenological model of the transient, but previous experiments have shown that it decayed after about two seconds, we only fit the model starting at . Fig. (b)b shows the measurement for one experiment as well as the fitted movement model. The constant speed is the slope of the linear part represented by the solid line. The speed estimation procedure also allows for quantifying the uncertainty of the velocity estimate, which we did not use, however, for the experiments herein.
IvD Hyperparameter Estimation
Estimates of the hyperparameters are obtained from the experimental data reported in [6]. For this, the raw data from these experiments are reprocessed to estimate the speed according to the model (8). These data are then used to estimate the boundaries of the parameter space and the prior for the cost function hyperparameters, namely, the constant mean , the signal variance for the kernel, and the noise variance . The kernel length scales are more difficult to estimate with the given data and are instead set to one fourth of the total range for each parameter. To account for variability in the microrobot size, the estimated speed is normalized on the bodylength, as common in microrobotics, and is expressed in bodylength percent per second (). The fastest speed observed in these experiments, according to the velocity estimation described in Sec. IVC is , obtained with a duty cycle and a wavelength
. From the measurements, we estimate the signal standard deviation
to be and the noise standard deviation to be . Physically reasonable ranges for the parameter space are a duty cycle comprised between and and a wavelength ranging from to pixels in the camera frame (corresponding to to). As prior distributions for each hyperparameter, we use Gaussian distributions with the corresponding estimated value as mean and one fourth of the estimate as standard deviation.
V SemiSynthetic Benchmark
For BO to work well in practice, we need to appropriately choose some settings, in particular, the kernel and the acquisition function. Appropriate choices for these should allow for generalization to different microrobots, while retaining dataefficiency. To determine a good setup in a systematic way, we consider a semisynthetic cost function and benchmark different settings. In this section, we first describe how the semisynthetic cost function is generated, then we motivate the considered design choices for the GP and BO, and finally we present the benchmarking results.
Va Creating a SemiSynthetic Cost Function
As the systems dynamics (1
) are unknown, we cannot properly benchmark the aforementioned BO settings on a pure simulation. Furthermore, a good performance on a fully synthetic optimization test function will not necessarily translate to realworld performance gains. We thus propose to sample from the real system on a coarse grid, and then interpolate the data at unobserved locations to create a semisynthetic cost function, which has similar properties to the real one, while being cheaper to evaluate. To collect the necessary data, we defined a grid on the parameter space and used BO with a standard kernel to sample from this grid. Specifically, we collected 56 cost function evaluations by running BO with a squared exponential (SE) kernel and parameters defined in
IVD. The observations were conducted with the microrobotic system described in Sec. II and with the experimental conditions reported in Sec. IV. These are the same conditions as used in [6], from which we estimated the prior information. We use a different microrobot which results in a slightly different locomotion behavior. For each BO run, we generate different semisynthetic cost functions by considering the intrinsic noise of the measurements and resampling each data point. This assures that the cost function, including the location of the optimum, is different for each run, while retaining the overall ‘shape’ to some extent. As not all points on the grid were actually evaluated in experiments, the missing data points were generated by linear interpolation. The resulting data was smoothed with a 33 spatial mean filter. To obtain the final continuous cost function, the discrete data set was further interpolated using cubic splines. Figure 6 shows one sample from the possible benchmark functions.VB Gaussian Process Priors
In this section, we describe the choice of the prior mean, signal variance, and kernel function, to model the semisynthetic cost function. We distinguish between two kinds of priors: optimistic and pessimistic. An optimistic prior assumes good performance (i.e. low cost values) at unobserved locations, while a pessimistic prior assumes the opposite. Since we terminate BO after a small number of experiments, the choice of the prior has an impact on the search: An optimistic prior encourages an extensive exploration, while a pessimistic prior explores slower.
VB1 Mean
As prior mean function we use a constant nonzero mean. For a cost function that is close to zero at its minimum, a mean of zero would be optimistic. If the real cost function has in fact a higher mean, a zero mean prior would make the optimizer believe that it has initially evaluated bad regions of the parameter space and therefore keep exploring new regions. This leads to an undesirably high amount of exploration at the corners and edges of the feasible parameter space. With a more pessimistic prior mean (e.g. at approximately the standard deviation of the signal, ) the belief is that most controllers perform badly. We found that this yields a much better explorationexploitation tradeoff. All settings presented herein use a pessimistic mean of as prior mean.
VB2 Signal Variance
With a pessimistic mean , we can readjust the estimated value for the signal variance from the initial estimate used to find such that the lowest possible cost, which is by definition, is either inside or outside the confidence interval of the prior. We call the first case an optimistic setting and the second a pessimistic one. During the benchmark we compared an optimistic signal variance, such that , to a pessimistic one, with , to evaluate the effect of this setting on the BO performance.
VB3 Kernel
The kernel is the main object in the GP that encodes prior knowledge. We compare four standard kernels and one composite kernel. The SE kernel is one of the most frequently used kernels for GP regression [8]. We use SE with automatic relevance determination (ARD), i.e. we have different length scales for each dimension. Another common option is the Rational Quadratic (RQ) kernel with ARD [8]. This kernel can be seen as a weighted sum over SE kernels of all lengthscales. The SE and RQ kernels both assume very smooth functions [8]. While the smoothness assumption of the SE and RQ kernels is too restrictive for most practical optimization problems (e.g. physical systems), the Matern kernel is known to be a much better choice [8, 16]. We compare two versions of the Matern (M) kernel with ARD, (where ) and (where ). The property of the Matern kernel determines the smoothness of modeled functions, where a higher yields smoother functions. We also adopt a composite kernel, which we term , having one component that captures changes of the cost function at a short length scale, and one smoother component for variations at a longer scale:
(9) 
The sum of two kernels is again a valid kernel, [8]. Combining simple kernels, e.g., with different length scales, is a powerful way to build more expressive models.
VC Acquisition Functions
As mentioned in Sec. IIIC, the adopted acquisition function determines the behavior and performance of BO. An acquisition function evaluates the utility of a candidate for sampling. Here a candidate is a parameter for the controller in the feasible parameter space. BO then chooses as sampling location the candidate with the maximum utility. Here we test the following acquisition functions. Probability of Improvement (PI) [17] is the probability that the candidates cost value is below some threshold . Here is the smallest cost as predicted by the posterior mean and a constant factor used to trade off exploration vs. exploitation. We set to an optimistic value of , expecting that the posterior is pessimistic, as its prior is a pessimistic mean. Expected Improvement (EI) [18] is the expected value of improvement over with an optimistic of . In contrast to PI, EI takes into account not only the probability of an improvement in the cost function at a candidate position, but also the expected magnitude of such improvement. Entropy Search (ES) [19] evaluates the expected information gain (change in entropy) about the location of the minimum at the candidate. As such, ES is not directly concerned about sampling at the location of the minimum, but about samples that maximize the information gain about the minimum location.
VD Results

EI  EI  PI  ES  

fixed Hyp.  learned Hyp.  fixed Hyp.  learned Hyp.  fixed Hyp.  learned Hyp.  fixed Hyp.  
SE  11.8 (54.3)  10.7 (43.8)  11.4 (53.2)  14.9 (59.7)  9.9 (47.5)  13.8 (49.4)  11.7 (51.1) 
RQ  5.5 (35.3)  4.7 (27.6)  7.5 (37.8)  7.3 (46.2)  7.3 (42.8)  10.1 (66.1)  6.0 (46.1) 
M32  5.8 (37.7)  3.3 (33.2)  4.1 (33.6)  4.4 (36.2)  5.8 (51.2)  7.4 (66.2)  7.5 (43.9) 
M52  3.5 (29.2)  4.2 (33.9)  20.0 (62.4)  4.5 (37.6)  13.5 (71.8)  8.0 (53.3)  7.9 (47.7) 
2Mat  3.4 (23.0)  3.5 (26.7)  5.4 (43.8)  7.0 (38.6)  6.8 (41.1)  5.6 (33.4)   
median (95percentile) over 200 runs of the normalized regret (%) after 20 iterations.
The results for BO on the semisynthetic cost functions serve to evaluate the relative performances of different design choices. While we do not necessarily expect to achieve the exact same absolute performance on a real microrobot, the comparison gives a good indicator on which setting to prefer, when learning microrobot gaits. All experiments start at the same location, i.e. the initial controller found in Sec. IVD (, ). For EI we tried both an optimistic and a pessimistic signal variance, respectively and (see above), whereas PI and ES were tested only with the optimistic signal variance . Each design choice of the BO was run on 200 different semisynthetic cost function. Each BO run consisted of 20 iterations on one cost function, representing 20 experimental observations on the same system (the experimental budget). For each iteration, we calculated the normalized regret after each iteration defined as
(10) 
where is the best parameter set as predicted by the GPs posterior mean, and is the parameter set at the true optimum. The results are summarized in Table I, which shows the median over the 200 runs of the normalized regret after 20 iterations. The number in brackets represent the 95percentile over the 200 runs of the normalized regret after 20 iterations; that is, the value that contains the best of cases (in other words, of the cases performed better than this value and only worse). This metric, which quantifies the robustness of the specific learning scheme, is arguably the most important in practice, as in real experiments, the gait shall usually only be learned once for each microrobot. Numbers in bold represent the best performances achieved for the given design choices with and without hyperparameter optimization. The settings plotted in Fig. 9 are filled grey.
One can see from the results that the SE kernel is (with one exception) outperformed by all other kernels. The Matern kernel and the sum of two Matern kernels of different length scales always perform reasonably well. It can also be observed that MAP hyperparameter learning during BO does not lead to consistently better results. Hyperparameter optimization can lead to inferior performance when directly compared to the same settings with fixed hyperparameters. Optimizing the hyperparameter does however lead to better performance in cases where the set of fixed hyperparameters are performing relatively poorly. Comparing design choices where the hyperparameter were seemingly off (M52, EI, ), learning the parameter helps significantly. This is especially relevant for the performance in terms of the 95percentile. The best performing acquisition function was EI. We did, however, not test ES with hyperparameter optimization. Figure (a)a shows the comparison of a selection of design choices. Examining the regret of the settings with the worst performance after 20 iterations, one can see it decreases cost faster and then flattens out after six to eight iterations. We have observed that, after finding a local optimum, exploration stopped until the experimental budget was exhausted. Hyperparameter optimization keeps the steep decrease in cost during the first iterations, but does not get stuck in a local optimum. The histogram in Fig. (b)b shows the distribution of cost for the blue and the yellow curves from Fig. (a)a. Not only is the median improved, but the spread is also reduced significantly. The above comparison among different BO design choices allows us to evaluate the best setting to optimize the gait of a new, potentially different microrobot. As discussed in Sec. I, the learning scheme of choice has to be robust with respect to differences in microrobots and be highly data efficient, such that a good controller is found after 20 iterations. As it can be observed in Table I, the 2Mat kernel with EI and the optimistic signal variance is expected to lead to substantial improvement in the locomotion performance, and to do so consistently. As learning hyperparameters generally makes the learning scheme more robust, we selected the 2Mat EI learned Hyp. configuration for gait learning.
Vi Experimental Results
A new microrobot is used for the experimental validation of the developed learning control framework. As each microrobot differs slightly, the optimal gait parameters are also expected to be different. Without learning the specific cost function of the new microrobot the best known controller parameters are the ones identified in Sec. IVD, namely and . Starting from this initial controller, we performed BO with an experimental budget of 20 samples. The learned cost function over the parameter space is shown in Fig. 10. The initial performance was then compared to the performance at the parameter sets with lowest posterior mean () and with the lowest observed cost (). The lowestposteriormean parameters were and , whereas the lowestobservedcost ones were and . Fig. 11 shows the obtained values of the cost function . Notably, we were able to reduce the cost by and in the lowestposteriormean and lowestobservedcost cases, respectively. Correspondigly, the locomotion performance, i.e. the average speed , was increased by () and by () with respect to the initial controller parameters obtained by previous experiments ().
Vii Conclusion
This paper is the first to investigate and demonstrate the potential of probabilistic learning of optimal gaits for submillimeter robots. In particular, the proposed learning scheme based on BO with GPs proved to be highly dataefficient and robust against differences among microrobots. These features are essential for lightdriven soft microrobots, which can have a relatively high variability and a limited lifetime under continuous excitation. The proposed BO learning approach is general and could be applied to find optimal controller parameters for other microrobotic systems. The method we used to develop the learning scheme for gait optimization is based on the definition of a semisynthetic benchmark function. This approach allowed us to assess the robustness of different learning schemes against the intermicrorobot variability, and to select the scheme with the highest chance to provide substantial improvement in the locomotion performance of a slightly different microrobot. This was verified in an actual microrobot experiment. By using a different microrobot for the semisynthetic benchmark and the validation experiment, we show that our method can handle differences between microrobots. Learning control is an important step toward building a complete robotic system based on lightdriven soft microrobots. Still, there are many more interesting challenges ahead. Future research directions are the design of problemspecific kernels, as well as the learning of more complex gaits in higher dimensional parameter spaces to take advantage of the microrobot’s many DOFs [6]. This is relevant, for example, for planar locomotion using diskshaped microrobots [5]. In addition, extending BO to learn timevarying cost functions is also particularly interesting, as it will enable microrobots to adapt to degrading material properties and, more importantly, to changes in their environment.
Acknowledgment
The authors thank Dr. Hao Zeng, Dr. Camilla Parmeggiani, Dr. Daniele Martella, and Prof. Diederik S. Wiersma for providing the LCE samples for the microrobots.
References
 [1] A. Mourran, H. Zhang, R. Vinokur, and M. Möller, “Soft microrobots employing nonequilibrium actuation via plasmonic heating,” Advanced Materials, vol. 29, no. 2, 2017.
 [2] H.W. Huang, M. S. Sakar, A. J. Petruska, S. Pane, and B. J. Nelson, “Soft micromachines with programmable motility and morphology,” Nature Communications, vol. 7, 2016.
 [3] J. C. Breger, C. Yoon, R. Xiao, H. R. Kwag, M. O. Wang, J. P. Fisher, T. D. Nguyen, and D. H. Gracias, “Selffolding thermomagnetically responsive soft microgrippers,” ACS Applied Materials & Interfaces, vol. 7, no. 5, pp. 3398–3405, 2015.
 [4] H. Zeng, P. Wasylczyk, C. Parmeggiani, D. Martella, M. Burresi, and D. S. Wiersma, “Lightfueled microscopic walkers,” Advanced Materials, vol. 27, no. 26, pp. 3883–3887, 2015.
 [5] S. Palagi, A. G. Mark, S. Y. Reigh, K. Melde, T. Qiu, H. Zeng, C. Parmeggiani, D. Martella, A. SanchezCastillo, N. Kapernaum, F. Giesselmann, D. S. Wiersma, E. Lauga, and P. Fischer, “Structured light enables biomimetic swimming and versatile locomotion of photoresponsive soft microrobots,” Nature Materials, vol. 15, no. 6, pp. 647–653, 2016.
 [6] S. Palagi, A. G. Mark, K. Melde, H. Zeng, C. Parmeggiani, D. Martella, D. S. Wiersma, and P. Fischer, “Soft continuous microrobots with multiple intrinsic degrees of freedom,” in Int. Conf. on Manipulation, Automation and Robotics at Small Scales, July 2016, pp. 1–5.
 [7] S. Palagi, A. G. Mark, K. Melde, T. Qiu, H. Zeng, C. Parmeggiani, D. Martella, D. S. Wiersma, and P. Fischer, “Locomotion of lightdriven soft microrobots through a hydrogel via local melting,” in Int. Conf. on Manipulation, Automation and Robotics at Small Scales, July 2017, pp. 1–5.

[8]
C. Rasmussen and C. Williams,
Gaussian Processes for Machine Learning
. MIT Press, 2006. 
[9]
R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth, “Bayesian
optimization for learning gaits under uncertainty,”
Annals of Mathematics and Artificial Intelligence
, vol. 76, no. 1, pp. 5–23, Feb 2016.  [10] D. J. Lizotte, T. Wang, M. H. Bowling, and D. Schuurmans, “Automatic gait optimization with Gaussian process regression,” in Int. Joint Conf. on Artifical intelligence, Jan. 2007, pp. 944–949.
 [11] B. Yang, G. Wang, R. Calandra, D. Contreras, S. Levine, and K. S. J. Pister, “Learning flexible and reusable locomotion primitives for a microrobot,” IEEE Robotics and Automation Letters, 2018.
 [12] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe, “Automatic LQR tuning based on Gaussian process global optimization,” in IEEE Int. Conf. on Robotics and Automation, May 2016, pp. 270–277.
 [13] F. Berkenkamp, A. P. Schoellig, and A. Krause, “Safe controller optimization for quadrotors with gaussian processes,” in IEEE Int. Conf. on Robotics and Automation, May 2016, pp. 491–496.

[14]
A. Marco, F. Berkenkamp, P. Hennig, A. P. Schoellig, A. Krause, S. Schaal, and S. Trimpe, “Virtual vs. Real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization,” in
IEEE Int. Conf. on Robotics and Automation, May 2017, pp. 1557–1563.  [15] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proc. of the IEEE, vol. 104, no. 1, pp. 148–175, Jan. 2016.
 [16] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
 [17] D. J. Lizotte, “Practical Bayesian optimization,” Ph.D. dissertation, University of Alberta, Edmonton, Canada, 2008.
 [18] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global optimization of expensive blackbox functions,” Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998.
 [19] P. Hennig and C. J. Schuler, “Entropy search for informationefficient global optimization,” Journal of Machine Learning Research, vol. 13, pp. 1809–1837, June 2012.