Contextual policy search (CPS) is a popular means for multi-task reinforcement learning in robotic control. CPS learns a hierarchical policy, in which the lower-level policy is often a domain-specific behavior representation such as dynamical movement primitives (DMPs) [12]. Learning takes place on the upper-level policy, which is a conditional probability density $\pi(\theta \mid s)$ that defines a distribution over the parameter vectors $\theta$ of the lower-level policy for a given context $s$. The context encodes properties of the environment or task, such as a desired walking speed for a locomotion behavior or a desired target position for a ball-throw behavior. The objective of CPS is to learn an upper-level policy which maximizes the expected return of the lower-level policy for a given context distribution.
CPS is typically based on local-search approaches such as cost-regularized kernel regression [14] and contextual relative entropy search (C-REPS) [17, 21]. From the field of black-box optimization, it is well known that local-search approaches are well suited for problems with moderate dimensionality and no gradient information. However, for the special case of relatively low-dimensional search spaces combined with an expensive cost function, which limits the number of evaluations of the cost function, global search approaches like Bayesian optimization [2] are often superior, for instance for selecting hyperparameters [25]. Combining contextual policy search with pre-trained movement primitives (DMPs can be pre-trained for fixed contexts in simulation or via some kind of imitation learning) can also fall into this category, as evaluating the cost function requires an execution of the behavior on the robot while only a small set of hyperparameters might have to be adapted. Bayesian optimization has been used for non-contextual policy search on locomotion tasks [5, 18] and robot grasping [16], and for (passive) contextual policy search on a simulated robotic ball-throwing task [20].
In this work, we focus on problems where the agent can select the task (context) in which it will perform the next trial during learning. This facilitates active learning, which is considered to be a prerequisite for lifelong learning. A core challenge in active multi-task robot control learning is the incommensurability of performance in different tasks, i.e., how a learning system can account for the relative (unknown) difficulty of a task: for instance, if a relatively small reward is obtained when executing a specific low-level policy in a task, is it because the low-level policy is not well adapted to the task or because the task is inherently more difficult than other tasks? Fabisch et al. [8] presented an approach for estimating the task difficulty explicitly, which allows defining heuristic intrinsic reward functions based on which a discounted multi-armed bandit actively selects the next task.
In this work, we follow a different approach: rather than explicitly addressing the incommensurability of rewards, we propose ACES, an information-theoretic approach for active task selection which selects tasks not based on rewards directly but rather based on the expected reduction in uncertainty about the optimal parameters for the contexts. ACES allows selecting task and parameters jointly without requiring a heuristic definition of a task selection criterion. ACES is motivated by entropy search [10], which has been extended in a similar fashion to non-contextual settings, and by multi-task Bayesian optimization [27], which focuses on problems with discrete context spaces.
Contextual Policy Search (CPS) denotes a model-free approach to reinforcement learning, in which the (low-level) policy is parametrized by a vector $\theta$. The choice of $\theta$ is governed by an upper-level policy $\pi$. For generalizing learned policies to multiple tasks, the task is characterized by a context vector $s$ and the upper-level policy $\pi(\theta \mid s)$ is conditioned on the respective context. The objective of CPS is to learn $\pi$ such that the expected return $J(\pi)$ over all contexts is maximized, with $J(\pi) = \int_s p(s) \int_\theta \pi(\theta \mid s)\, R(\theta, s)\, d\theta\, ds$. Here, $p(s)$ is the distribution over contexts and $R(\theta, s)$ is the expected return when executing the low-level policy with parameter $\theta$ in context $s$. We refer to Deisenroth et al. [6] for a recent overview of (contextual) policy search approaches in robotics.
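The objective described above, the expected return under the context distribution and the upper-level policy, can be estimated by plain Monte Carlo sampling. The following is a minimal sketch and not the method used in this work: the function names, the toy context distribution, the toy return and both example policies are hypothetical. It only illustrates that an upper-level policy which adapts to the context obtains a higher expected return than one which ignores it.

```python
import random


def estimate_expected_return(sample_context, sample_parameters, rollout_return,
                             n_samples=1000, seed=0):
    """Monte Carlo estimate of the expected return over contexts and parameters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        s = sample_context(rng)            # draw a context from the context distribution
        theta = sample_parameters(s, rng)  # draw parameters from the upper-level policy
        total += rollout_return(theta, s)  # return of one rollout
    return total / n_samples


# Hypothetical toy problem: the best parameter equals the (scalar) context.
def toy_context(rng):
    return rng.uniform(-1.0, 1.0)


def toy_return(theta, s):
    return -(theta - s) ** 2


def adaptive_policy(s, rng):
    return s + rng.gauss(0.0, 0.1)   # parameters follow the context


def context_blind_policy(s, rng):
    return rng.gauss(0.0, 0.1)       # parameters ignore the context


j_adaptive = estimate_expected_return(toy_context, adaptive_policy, toy_return)
j_blind = estimate_expected_return(toy_context, context_blind_policy, toy_return)
```

In this toy setting the adaptive policy only pays for its own sampling noise, whereas the context-blind policy additionally pays for the spread of the contexts.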
Bayesian optimization for contextual policy search (BO-CPS) is based on applying ideas from Bayesian optimization to contextual policy search [20]. BO-CPS internally learns a model of the expected return $R(\theta, s)$ of a parameter vector $\theta$ in a context $s$. The model is based on Gaussian process (GP) regression [22]. It learns from sample returns obtained in rollouts at query points consisting of a context determined by the environment and a parameter vector selected by BO-CPS. By learning a joint GP model over the context-parameter space, experience collected in one context is naturally generalized to similar contexts.
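How a joint model over the context-parameter space generalizes between contexts can be illustrated with a single observation, for which the GP posterior mean has a closed form that needs no matrix inversion. The sketch below makes several assumptions that are not taken from this work: a Matérn kernel with smoothness 5/2, a zero prior mean, unit length scales, and hypothetical helper names.

```python
import math


def matern52(x, y, length_scales):
    """Anisotropic Matern kernel; the smoothness nu = 5/2 is an assumption."""
    r = math.sqrt(sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, length_scales)))
    c = math.sqrt(5.0) * r
    return (1.0 + c + c * c / 3.0) * math.exp(-c)


def joint_point(context, theta):
    """Concatenate context and parameter vector into one GP input."""
    return tuple(context) + tuple(theta)


def posterior_mean(query, observed_point, observed_return, length_scales, noise=1e-6):
    """Posterior mean of a zero-mean GP after one observed return."""
    k_star = matern52(query, observed_point, length_scales)
    k_obs = matern52(observed_point, observed_point, length_scales) + noise
    return k_star / k_obs * observed_return


length_scales = (1.0, 1.0, 1.0, 1.0)  # two context and two parameter dimensions
theta = (0.5, 0.2)
observed = joint_point((1.0, 1.0), theta)

# Same parameters, evaluated in a similar and in a distant context.
mean_near = posterior_mean(joint_point((1.1, 1.0), theta), observed, 2.0, length_scales)
mean_far = posterior_mean(joint_point((4.0, 4.0), theta), observed, 2.0, length_scales)
```

The prediction in the similar context stays close to the observed return, while the prediction in the distant context falls back to the prior mean.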
The GP model provides both an estimate of the expected return $\mu_{GP}[R(\theta, s)]$ and its standard deviation $\sigma_{GP}[R(\theta, s)]$. Based on this information, the parameter vector for the given context is selected by maximizing an acquisition function, which allows controlling the trade-off between exploitation (selecting parameters with maximal estimated return) and exploration (selecting parameters with high uncertainty). Common acquisition functions used in Bayesian optimization such as the probability of improvement (PI) and the expected improvement (EI) [2] are not easily generalized to BO-CPS [20]. In contrast, the acquisition function GP-UCB [26], which defines the acquisition value of a parameter vector $\theta$ in a context $s$ as $\text{GP-UCB}(\theta, s) = \mu_{GP}[R(\theta, s)] + \kappa\, \sigma_{GP}[R(\theta, s)]$, where $\kappa$ controls the exploration-exploitation trade-off, can be applied to BO-CPS straightforwardly, resulting in an approach similar to CGP-UCB [15]. BO-CPS selects parameters for a given fixed context by performing an optimization over the parameter space, using the global maximizer DIRECT [13] to find the approximate global maximum, followed by L-BFGS [4] to refine it.
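The GP-UCB rule for a fixed context (posterior mean plus a weighted posterior standard deviation) can be sketched as follows. This is an illustration under stated assumptions: a plain search over a candidate list stands in for the DIRECT and L-BFGS optimizers, and `toy_predict` is a hypothetical surrogate whose uncertainty grows towards the boundary of the parameter space, not a fitted GP.

```python
def gp_ucb(mean, std, kappa):
    """GP-UCB acquisition value: posterior mean plus kappa times posterior std."""
    return mean + kappa * std


def select_parameters(context, candidates, predict, kappa):
    """Return the candidate parameter with the largest GP-UCB value in `context`."""
    best_theta, best_value = None, float("-inf")
    for theta in candidates:
        mean, std = predict(context, theta)
        value = gp_ucb(mean, std, kappa)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta


def toy_predict(context, theta):
    """Hypothetical surrogate: best return at theta == context, more
    uncertainty far away from the centre of the parameter space."""
    return -(theta - context) ** 2, abs(theta)


candidates = [i / 10 for i in range(-10, 11)]
theta_exploit = select_parameters(0.3, candidates, toy_predict, kappa=0.0)
theta_explore = select_parameters(0.3, candidates, toy_predict, kappa=5.0)
```

With a weight of zero the rule picks the parameter with the best predicted return, whereas a large weight pushes the selection to the boundary, where this toy surrogate is most uncertain.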
Entropy search (ES) is a recently proposed approach to probabilistic global optimization that mainly differs from Bayesian optimization in the choice of the acquisition function [10]. While typical acquisition functions used for Bayesian optimization select query points where they expect the optimum, ES selects query points where it expects to learn most about the optimum. More specifically, ES explicitly represents the probability that the global optimum (maximum or minimum, depending on the problem) of the unknown function lies at a given point. ES estimates this probability at finitely many points on a non-uniform grid that are selected heuristically, and approximates it at these points based on expectation propagation or Monte Carlo integration. To select a query point, ES predicts the change of the GP when drawing a sample at the query point, assuming different outcomes sampled from the GP’s predictive distribution at that point. Thereupon, ES selects the query point which minimizes the average loss, i.e., which maximizes the relative entropy between the probability distribution of the global optimum after the assumed query and a uniform measure.
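The two quantities at the heart of ES, the probability that the optimum lies at each of finitely many points and the relative entropy of that distribution with respect to a uniform measure, can be sketched with Monte Carlo integration. The example is a simplification under stated assumptions: function values at the four points are drawn independently, whereas samples from a real GP posterior are correlated, and all names and numbers are hypothetical.

```python
import math
import random


def estimate_p_opt(function_samples):
    """Fraction of sampled functions whose maximum lies at each point."""
    n_points = len(function_samples[0])
    counts = [0] * n_points
    for values in function_samples:
        counts[max(range(n_points), key=values.__getitem__)] += 1
    return [c / len(function_samples) for c in counts]


def relative_entropy_to_uniform(p):
    """Relative entropy KL(p || u), with u uniform over the same points."""
    n = len(p)
    return sum(p_i * math.log(p_i * n) for p_i in p if p_i > 0.0)


rng = random.Random(0)
means = [0.0, 0.1, 0.2, 0.1]  # hypothetical posterior means at four points


def draw_samples(std, n_samples=2000):
    # Independent draws per point: a simplification of sampling from a GP.
    return [[rng.gauss(m, std) for m in means] for _ in range(n_samples)]


# Large predictive uncertainty: the location of the maximum is almost unknown.
entropy_before = relative_entropy_to_uniform(estimate_p_opt(draw_samples(1.0)))
# Small uncertainty (as after informative queries): the maximum is localized.
entropy_after = relative_entropy_to_uniform(estimate_p_opt(draw_samples(0.05)))
information_gain = entropy_after - entropy_before
```

A query is valuable in the ES sense when it is expected to move the distribution over the optimum away from uniform, which shows up here as a positive `information_gain`.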
3 Active Contextual Entropy Search
In this section, we present active contextual entropy search (ACES), an extension of ES to CPS which allows selecting both the parameters and the context of the next trial. Let $p^{s}_{\max}(\theta)$ denote the conditional probability distribution of the maximum expected return given context $s$, and let the loss $L^{s}(s_q, \theta_q)$ denote the expected change of relative entropy in context $s$ after performing a trial in context $s_q$ with parameter $\theta_q$. A straightforward extension of ES to active learning in BO-CPS would be selecting $(s_q, \theta_q) = \arg\max_{(s_q, \theta_q)} L^{s_q}(s_q, \theta_q)$, i.e., selecting the context in which the maximum increase of relative entropy is expected. This, however, would not account for information gained about the optima in contexts $s \neq s_q$ by a query at $(s_q, \theta_q)$.
ACES instead averages over the expected change in relative entropy at different points in the context space: $\mathrm{ACES}(s_q, \theta_q) = \sum_{s \in S} L^{s}(s_q, \theta_q)$, where $L^{s}(s_q, \theta_q)$ is the expected change of relative entropy in context $s$ after a query at $(s_q, \theta_q)$ and $S$ is a set of contexts drawn uniformly at random from the context space. Unfortunately, each evaluation of $L^{s}$ is computationally expensive and thus $|S|$ would have to be chosen small. On the other hand, GPs have an intrinsic length scale for many choices of the kernel and thus, a query in context $s_q$ will only affect the distribution over the optimum in context $s$ when $s$ is “similar” to $s_q$. We define similarity between contexts based on the Mahalanobis distance, with $\Lambda$ being a diagonal matrix with the (anisotropic) length scales of the GP on the diagonal. Based on this, we can approximate the sum over $S$ by a sum over $\mathrm{NN}_k(S, s_q)$, with NN returning the $k$ nearest neighbors of $s_q$ in $S$ according to the Mahalanobis distance. A larger value of $k$ corresponds to a better approximation of the full sum at a linearly increased computational cost.
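The nearest-neighbor approximation of the sum over sampled contexts can be sketched as follows. Several parts are assumptions rather than details from this work: the distance divides each coordinate difference by the corresponding length scale (one common way of writing a Mahalanobis distance with a diagonal matrix), `expected_loss` is a hypothetical callable standing in for the expensive entropy-change computation, and `toy_loss` is a made-up loss that decays with distance.

```python
import math


def scaled_distance(s, s_query, length_scales):
    """Distance between contexts with each axis scaled by its length scale
    (the exact form of the Mahalanobis distance is an assumption)."""
    return math.sqrt(sum(((a - b) / l) ** 2
                         for a, b, l in zip(s, s_query, length_scales)))


def nearest_contexts(contexts, s_query, length_scales, k):
    """The k contexts closest to the query context."""
    return sorted(contexts,
                  key=lambda s: scaled_distance(s, s_query, length_scales))[:k]


def aces_value(s_query, theta_query, contexts, length_scales, k, expected_loss):
    """Sum the expected change of relative entropy over the k nearest contexts."""
    neighbors = nearest_contexts(contexts, s_query, length_scales, k)
    return sum(expected_loss(s, s_query, theta_query) for s in neighbors)


def toy_loss(s, s_query, theta_query):
    """Hypothetical loss: a query is only informative about nearby contexts."""
    return math.exp(-scaled_distance(s, s_query, (1.0, 1.0)) ** 2)


contexts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
full_sum = aces_value((0.2, 0.1), (0.5, 0.5), contexts, (1.0, 1.0), 4, toy_loss)
approx_sum = aces_value((0.2, 0.1), (0.5, 0.5), contexts, (1.0, 1.0), 3, toy_loss)
```

Because the distant context contributes almost nothing to the sum in this toy example, dropping it changes the value only marginally while saving one loss evaluation.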
are selected by performing Thompson sampling on randomly chosen with . The number of trial contexts is set to and we compare empirically and . The quantity is approximated using Monte Carlo integration based on drawing 1000 samples from the GP posterior. The number of samples from the GP’s predictive distribution at for approximating the average loss for a query point is set to . Since there is noise in the Monte Carlo estimates of , we use CMA-ES [9] as optimizer rather than DIRECT.
We present results in a simulated robotic control task, in which the robot arm COMPI [1] is used to throw a ball at a target on the ground encoded in a two-dimensional context vector $s$. The robot arm is mounted at the origin of the coordinate system in which the target area is defined. Thus, contexts can be chosen from a two-dimensional context space. The low-level policy is a joint-space DMP with preselected start and goal angle for each joint and all DMP weights set to 0. This DMP results in throwing a ball such that it hits the ground close to the center of the target area. Adaptation to different target positions is achieved by modifying a two-dimensional parameter vector $\theta$: the first component of $\theta$ corresponds to the execution time of the DMP, which determines how far the ball is thrown, and the second component to the final angle of the first joint, which determines the rotation of the arm around the z-axis.
The upper-level policy is a deterministic policy which selects parameters based on an affine function of the context $s$. (The upper-level policy could in principle be defined directly on the surrogate model. This would, however, require a computationally expensive maximization over the parameter space for each evaluation of the policy.) This policy is trained on the training data using the C-REPS policy update. Both components of the parameter vector are restricted to bounded intervals. All approaches use an anisotropic Matérn kernel for the GP surrogate model. Since we focus on a “pure exploration” scenario [3], GP-UCB’s exploration parameter is set to a constant value. The reward is defined in terms of the goal position, the position hit by the ball, and a penalty term on the sum of squared joint velocities during DMP execution. The maximum achievable reward differs between contexts, as targets further away from the origin (where the arm is mounted) require larger joint velocities and thus incur a larger penalty. Thus, rewards in different contexts are incommensurable.
Figure 1 summarizes the main results of the empirical evaluation. The left graph shows the mean offline performance of the upper-level policy at 16 test contexts on a grid over the context space. Sampling contexts and parameters randomly during learning (“Random”) is shown as a baseline and indicates that generalizing experience using a GP model alone does not suffice for quick learning in this task. Rather, a non-random way of exploration is required. BO-CPS with random context selection and UCB for parameter selection improves considerably over random parameter selection. Using ES for parameter selection further improves the learning speed. Closer inspection (not shown) indicates that ES improves over UCB mainly because UCB samples often at the boundaries of the parameter space since the uncertainty is typically large there. ES samples more often in the inner regions of the parameter space since those regions promise a larger information gain globally.
Active context selection using ACES further improves over BOCPS-ES, in particular when the sum over the context space is approximated using several sampled contexts (20 in the case of ACES_20) rather than a single one (ACES_1). The right graph shows the contexts selected by different variants of ACES. It can be seen that ACES_20 avoids selecting targets close to the boundary of the context space: boundary points are far away from most other regions of the context space, so queries there typically reveal less global information about the context-dependent optima. We attribute the improved learning performance to this way of selecting targets during learning. In contrast, ACES_1 samples more often close to the boundaries, as it only considers the local information gain and thus has no reason to prefer inner over boundary contexts.
Figure 1: (Left) Learning curves on the simulated robot arm COMPI: the offline performance is evaluated every 10 episodes on 16 test contexts distributed on a grid over the target area. Shown are the mean and its standard error over 20 independent runs. (Right) Scatter plot showing the sampled contexts for all runs (blue) and a single representative run (red).
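The evaluation protocol in the caption (16 test contexts on a grid, mean and standard error over independent runs) can be sketched as follows. The grid bounds, the inclusion of the boundary in the grid, and the function names are assumptions made for illustration, as the caption does not specify them.

```python
import math


def make_grid_contexts(lower, upper, points_per_axis=4):
    """Evenly spaced two-dimensional grid; 4 x 4 yields 16 test contexts.
    Including the boundary of the area in the grid is an assumption."""
    def axis(lo, hi):
        step = (hi - lo) / (points_per_axis - 1)
        return [lo + i * step for i in range(points_per_axis)]
    return [(x, y) for x in axis(lower[0], upper[0]) for y in axis(lower[1], upper[1])]


def mean_and_standard_error(values):
    """Sample mean and standard error of the mean over independent runs."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased estimate
    return mean, math.sqrt(variance / n)


# Hypothetical bounds of the target area.
grid_contexts = make_grid_contexts((0.0, 0.0), (3.0, 3.0))
# Hypothetical offline performance of four runs at one evaluation step.
mean, standard_error = mean_and_standard_error([1.0, 2.0, 3.0, 4.0])
```

Each point of a learning curve would then be the mean over runs of the policy's average return on `grid_contexts`, with the standard error giving the width of the error band.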
5 Discussion and Conclusion
We have presented an active learning approach for contextual policy search based on entropy search. First experimental results indicate that the proposed approach considerably speeds up the learning of movement primitives compared to random task selection. A comparison with other active task selection approaches remains future work. Moreover, it would be interesting to investigate and enhance the scalability to higher-dimensional problems, potentially in combination with random embedding approaches such as REMBO [28], and to combine active task selection with predictive entropy search [11] or portfolio-based approaches [24].
This work was performed as part of the project BesMan (more information is available at http://robotik.dfki-bremen.de/en/research/projects/besman.html) and supported through two grants of the German Federal Ministry of Economics and Technology (BMWi, FKZ 50 RA 1216 and FKZ 50 RA 1217).
- [1] V. Bargsten and J. de Gea. COMPI: Development of a 6-DOF compliant robot arm for human-robot cooperation. In Proceedings of the 8th International Workshop on Human-Friendly Robotics (HFR-2015). Technische Universität München (TUM), Oct. 2015.
- [2] E. Brochu, V. M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report.
- [3] S. Bubeck, R. Munos, and G. Stoltz. Pure Exploration in Multi-armed Bandits Problems. In Algorithmic Learning Theory, number 5809 in Lecture Notes in Computer Science, pages 23–37. Springer Berlin Heidelberg, Oct. 2009.
- [4] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A Limited-Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific Computing, 16:1190–1208, 1995.
- [5] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian Optimization for Learning Gaits under Uncertainty. Annals of Mathematics and Artificial Intelligence (AMAI), 2015.
- [6] M. P. Deisenroth, G. Neumann, and J. Peters. A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.
- [7] A. Fabisch and J. H. Metzen. Active contextual policy search. Journal of Machine Learning Research, 15:3371–3399, 2014.
- [8] A. Fabisch, J. H. Metzen, M. M. Krell, and F. Kirchner. Accounting for Task-Difficulty in Active Multi-Task Robot Control Learning. KI - Künstliche Intelligenz, pages 1–9, May 2015.
- [9] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9:159–195, 2001.
- [10] P. Hennig and C. J. Schuler. Entropy Search for Information-Efficient Global Optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
- [11] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In Advances in Neural Information Processing Systems 27, pages 918–926, 2014.
- [12] A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25:1–46, 2013.
- [13] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, Oct. 1993.
- [14] J. Kober, A. Wilhelm, E. Oztop, and J. Peters. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, 33(4):361–379, 2012.
- [15] A. Krause and C. S. Ong. Contextual Gaussian Process Bandit Optimization. In Advances in Neural Information Processing Systems 24, pages 2447–2455, 2011.
- [16] O. B. Kroemer, R. Detry, J. Piater, and J. Peters. Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116, Sept. 2010.
- [17] A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann. Data-Efficient Generalization of Robot Skills with Contextual Policy Search. In 27th AAAI Conference on Artificial Intelligence, June 2013.
- [18] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gait optimization with Gaussian process regression. pages 944–949, 2007.
- [19] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe. Automatic LQR Tuning based on Gaussian Process Optimization: Early Experimental Results. In Proceedings of the Second Machine Learning in Planning and Control of Robot Motion Workshop, Hamburg, 2015. IROS.
- [20] J. H. Metzen, A. Fabisch, and J. Hansen. Bayesian Optimization for Contextual Policy Search. In Proceedings of the Second Machine Learning in Planning and Control of Robot Motion Workshop, Hamburg, 2015. IROS.
- [21] J. Peters, K. Mülling, and Y. Altun. Relative Entropy Policy Search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, Georgia, USA, 2010. AAAI Press.
- [22] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
- [23] P. Ruvolo and E. Eaton. Active Task Selection for Lifelong Machine Learning. In Twenty-Seventh AAAI Conference on Artificial Intelligence, June 2013.
- [24] B. Shahriari, Z. Wang, M. W. Hoffman, A. Bouchard-Côté, and N. de Freitas. An Entropy Search Portfolio for Bayesian Optimization. In NIPS workshop on Bayesian optimization, 2014.
- [25] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.
- [26] N. Srinivas, A. Krause, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
- [27] K. Swersky, J. Snoek, and R. P. Adams. Multi-Task Bayesian Optimization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2004–2012. Curran Associates, Inc., 2013.
- [28] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian Optimization in High Dimensions via Random Embeddings. In International Joint Conferences on Artificial Intelligence (IJCAI), 2013.