1 Introduction
Contextual policy search (CPS) is a popular means for multi-task reinforcement learning in robotic control [6]. CPS learns a hierarchical policy, in which the lower-level policy is often a domain-specific behavior representation such as dynamical movement primitives (DMPs) [12]. Learning takes place on the upper-level policy, which is a conditional probability density $\pi(\theta \mid s)$ that defines a distribution over the parameter vectors $\theta$ of the lower-level policy for a given context $s$. The context encodes properties of the environment or task, such as a desired walking speed for a locomotion behavior or a desired target position for a ball-throw behavior. The objective of CPS is to learn an upper-level policy which maximizes the expected return of the lower-level policy for a given context distribution.

CPS is typically based on local search approaches such as cost-regularized kernel regression [14] and contextual relative entropy policy search (C-REPS) [17, 21]. From the field of black-box optimization, it is well known that local search approaches are well suited for problems with moderate dimensionality and no gradient information. However, for the special case of relatively low-dimensional search spaces combined with an expensive cost function, which limits the number of evaluations of the cost function, global search approaches like Bayesian optimization [2] are often superior, for instance for selecting hyperparameters [25]. Combining contextual policy search with pretrained movement primitives (DMPs can be pretrained for fixed contexts in simulation or via some kind of imitation learning) can also fall into this category, as evaluating the cost function requires an execution of the behavior on the robot while only a small set of hyperparameters might have to be adapted. Bayesian optimization has been used for non-contextual policy search on locomotion tasks [5, 18] and robot grasping [16], and for (passive) contextual policy search on a simulated robotic ball-throwing task [20].

In this work, we focus on problems where the agent can select the task (context) in which it will perform the next trial during learning. This facilitates active learning, which is considered to be a prerequisite for lifelong learning [23]. A core challenge in active multi-task robot control learning is the incommensurability of performance in different tasks, i.e., how a learning system can account for the relative (unknown) difficulty of a task: for instance, if a relatively small reward is obtained when executing a specific low-level policy in a task, is it because the low-level policy is not well adapted to the task or because the task is inherently more difficult than other tasks? Fabisch et al. [8] presented an approach for estimating the task difficulty explicitly, which allows defining heuristic intrinsic reward functions, based on which a discounted multi-armed bandit selects the next task actively [7].

In this work, we follow a different approach: rather than explicitly addressing the incommensurability of rewards, we propose ACES, an information-theoretic approach for active task selection which selects tasks not based on rewards directly but rather based on the expected reduction in uncertainty about the optimal parameters for the contexts. ACES allows selecting task and parameters jointly without requiring a heuristic definition of a task selection criterion. ACES is motivated by entropy search [10], which has been extended in a similar fashion to non-contextual settings [19], and multi-task Bayesian optimization [27], which focuses on problems with discrete context spaces.
2 Background
Contextual Policy Search (CPS) denotes a model-free approach to reinforcement learning, in which the (low-level) policy is parametrized by a vector $\theta$. The choice of $\theta$ is governed by an upper-level policy $\pi$. For generalizing learned policies to multiple tasks, the task is characterized by a context vector $s$ and the upper-level policy $\pi(\theta \mid s)$ is conditioned on the respective context. The objective of CPS is to learn $\pi$ such that the expected return $J$ over all contexts is maximized, with $J = \int_s p(s) \int_\theta \pi(\theta \mid s)\, R(s, \theta)\, d\theta\, ds$. Here, $p(s)$ is the distribution over contexts and $R(s, \theta)$ is the expected return when executing the low-level policy with parameter $\theta$ in context $s$. We refer to Deisenroth et al. [6] for a recent overview of (contextual) policy search approaches in robotics.
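The expected return over contexts can be illustrated with a simple Monte Carlo estimate. The following is a minimal sketch, writing $\theta$ for the parameter vector and $s$ for the context; the return function `toy_return`, the uniform context distribution, and the Gaussian upper-level policy with linear mean are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def toy_return(theta, s):
    """Hypothetical return: the best parameter for context s is theta = 2 * s."""
    return -np.sum((theta - 2.0 * s) ** 2, axis=-1)

def sample_upper_level_policy(s, W, sigma, rng):
    """Assumed Gaussian upper-level policy: theta ~ N(W s, sigma^2 I)."""
    mean = s @ W.T
    return mean + sigma * rng.standard_normal(mean.shape)

def estimate_expected_return(W, sigma, n_samples=10_000, seed=0):
    """Monte Carlo estimate of J = E_{s ~ p(s)} E_{theta ~ pi(.|s)} [R(s, theta)]."""
    rng = np.random.default_rng(seed)
    s = rng.uniform(-1.0, 1.0, size=(n_samples, 1))  # assumed uniform p(s)
    theta = sample_upper_level_policy(s, W, sigma, rng)
    return toy_return(theta, s).mean()

# A policy whose mean matches the optimum obtains a higher expected return
# than one that ignores the context.
J_good = estimate_expected_return(W=np.array([[2.0]]), sigma=0.1)
J_bad = estimate_expected_return(W=np.array([[0.0]]), sigma=0.1)
```

In this toy setting, the context-aware policy loses return only through its exploration noise, whereas the context-independent policy additionally pays for its mismatch with the context-dependent optimum.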
Bayesian optimization for contextual policy search (BO-CPS) is based on applying ideas from Bayesian optimization to contextual policy search [20]. BO-CPS internally learns a model of the expected return $R(s, \theta)$ of a parameter vector $\theta$ in a context $s$. The model is based on Gaussian process (GP) regression [22]. It learns from sample returns obtained in rollouts at query points $(s, \theta)$ consisting of a context $s$ determined by the environment and a parameter vector $\theta$ selected by BO-CPS. By learning a joint GP model over the context-parameter space, experience collected in one context is naturally generalized to similar contexts.
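How a joint GP model over contexts and parameters generalizes between contexts can be sketched as follows. This is a minimal illustration, not the paper's implementation: each input row concatenates a context and a parameter, the squared-exponential kernel is an assumption chosen for brevity, and the data and length scales are made up.

```python
import numpy as np

def se_kernel(A, B, length_scales):
    """Anisotropic squared-exponential kernel (assumed here for simplicity)."""
    A = A / length_scales
    B = B / length_scales
    sq_dist = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0))

def gp_posterior(X, y, X_query, length_scales, noise_var=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at X_query."""
    K = se_kernel(X, X, length_scales) + noise_var * np.eye(len(X))
    K_q = se_kernel(X, X_query, length_scales)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_q.T @ alpha
    v = np.linalg.solve(L, K_q)
    var = 1.0 - (v**2).sum(axis=0)  # the prior variance of this kernel is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

# Rows are (context, parameter) pairs with observed returns (made-up values).
X = np.array([[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]])
y = np.array([1.0, 0.5, -0.2])
length_scales = np.array([0.5, 1.0])

# Query an observed point, the same parameter in a nearby context,
# and a point far away from all data.
X_query = np.array([[0.5, 1.0], [0.6, 1.0], [10.0, 10.0]])
mean, std = gp_posterior(X, y, X_query, length_scales)
```

The predictive uncertainty stays low for the nearby context even though no rollout was performed there, and reverts to the prior far from the data.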
The GP model provides both an estimate of the expected return $\mu(s, \theta)$ and its standard deviation $\sigma(s, \theta)$. Based on this information, the parameter vector for the given context is selected by maximizing an acquisition function, which allows controlling the trade-off between exploitation (selecting parameters with maximal estimated return) and exploration (selecting parameters with high uncertainty). Common acquisition functions used in Bayesian optimization such as the probability of improvement (PI) and the expected improvement (EI) [2] are not easily generalized to BO-CPS [20]. In contrast, the acquisition function GP-UCB [26], which defines the acquisition value of a parameter vector $\theta$ in a context $s$ as $\text{GP-UCB}(s, \theta) = \mu(s, \theta) + \kappa\,\sigma(s, \theta)$, where $\kappa$ controls the exploration-exploitation trade-off, can be applied to BO-CPS straightforwardly, resulting in an approach similar to CGP-UCB [15]. BO-CPS selects parameters for a given fixed context by performing an optimization over the parameter space using the global maximizer DIRECT [13] to find the approximate global maximum, followed by L-BFGS [4] to refine it.

Entropy search (ES) is a recently proposed approach to probabilistic global optimization that mainly differs from Bayesian optimization in the choice of the acquisition function [10]. While typical acquisition functions used for Bayesian optimization select query points where they expect the optimum, ES selects query points where it expects to learn most about the optimum. More specifically, ES explicitly represents $p_{\text{opt}}(x)$, the probability that the global optimum (maximum or minimum, depending on the problem) of the unknown function $f$ is at $x$. ES estimates $p_{\text{opt}}$ at finitely many points on a non-uniform grid that are selected heuristically. Moreover, it approximates $p_{\text{opt}}$ at these points based on expectation propagation or Monte Carlo integration. To select a query point, ES predicts the change of the GP when drawing a sample at the query point $x_q$ and assuming different outcomes sampled from the GP's predictive distribution at $x_q$.
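The Monte Carlo approximation of the optimum distribution on a finite grid can be sketched as follows. This is a simplified illustration under stated assumptions: the GP posterior mean and covariance on the grid are taken as given (the values below are made up), the optimum is a maximum, and the helper names are hypothetical. The second function computes the relative entropy to a uniform measure, which is the information measure ES works with.

```python
import numpy as np

def estimate_p_opt(mean, cov, n_samples=1000, seed=0):
    """Fraction of joint posterior samples in which each grid point is the maximizer."""
    rng = np.random.default_rng(seed)
    f = rng.multivariate_normal(mean, cov, size=n_samples)
    winners = np.argmax(f, axis=1)
    return np.bincount(winners, minlength=len(mean)) / n_samples

def relative_entropy_to_uniform(p):
    """KL(p || u) for a uniform u on the same grid; empty cells contribute 0."""
    nonzero = p > 0
    return float(np.sum(p[nonzero] * np.log(p[nonzero] * len(p))))

# Made-up posterior on a grid of three points: the third is clearly the best.
posterior_mean = np.array([0.0, 0.0, 3.0])
posterior_cov = 0.1 * np.eye(3)
p_opt = estimate_p_opt(posterior_mean, posterior_cov)
information = relative_entropy_to_uniform(p_opt)
```

A posterior that is sure about the location of the optimum yields a peaked distribution and a large relative entropy, while complete ignorance (a uniform distribution) yields zero.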
Thereupon, ES selects the query point $x_q$ which minimizes the average loss $L(x_q)$, i.e., which maximizes the relative entropy between $p_{\text{opt}}^{x_q}$ and a uniform measure $u$, where $p_{\text{opt}}^{x_q}$ denotes the probability distribution of the global optimum after an assumed query at $x_q$.

3 Active Contextual Entropy Search
In this section, we present active contextual entropy search (ACES), an extension of ES to CPS which allows selecting both parameters and context of the next trial. Let $p_{\text{opt}}^{s}$ denote the conditional probability distribution of the maximum expected return given context $s$, and let the loss $L^{s}(s_q, \theta_q)$ denote the expected change of relative entropy in context $s$ after performing a trial in context $s_q$ with parameter $\theta_q$. A straightforward extension of ES to active learning in BO-CPS would be selecting $(s_q, \theta_q) = \arg\max_{(s, \theta)} L^{s}(s, \theta)$, i.e., select the context in which the maximum increase of relative entropy is expected. This, however, would not account for information gained about the optima in contexts $s \neq s_q$ by a query at $(s_q, \theta_q)$.

ACES instead averages over the expected change in relative entropy at different points in the context space: $\text{ACES}(s_q, \theta_q) = \sum_{s \in S} L^{s}(s_q, \theta_q)$, where $S$ is a set of contexts drawn uniformly at random from the context space. Unfortunately, each evaluation of $L^{s}$ is computationally expensive and thus $S$ would have to be chosen small. On the other hand, GPs have an intrinsic length scale for many choices of the kernel and thus a query in context $s_q$ will only affect $p_{\text{opt}}^{s}$ when $s$ is "similar" to $s_q$. We define similarity between contexts based on the Mahalanobis distance $d_M(s, s_q)$, with $M$ being a diagonal matrix with the (anisotropic) length scales of the GP on the diagonal. Based on this, we can approximate $\text{ACES}(s_q, \theta_q)$ by $\text{ACES}_N(s_q, \theta_q) = \sum_{s \in \text{NN}_N(s_q, S)} L^{s}(s_q, \theta_q)$, with $\text{NN}_N$ returning the $N$ nearest neighbors of $s_q$ in $S$ according to the Mahalanobis distance. A larger value of $N$ corresponds to a better approximation of ACES at the cost of a linearly increased computational cost.
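The nearest-neighbor approximation can be sketched as follows. This is a minimal illustration under stated assumptions: the distance divides each context dimension by the corresponding GP length scale (one common reading of a Mahalanobis distance with a diagonal matrix), and `loss_fn` is a hypothetical placeholder for the expensive per-context change in relative entropy, which is not implemented here.

```python
import numpy as np

def nearest_contexts(S, s_q, length_scales, n_neighbors):
    """Return the n_neighbors rows of S closest to s_q under the scaled distance."""
    diff = (S - s_q) / length_scales  # assumption: scale each dimension by its length scale
    dist = np.sqrt(np.sum(diff**2, axis=1))
    return S[np.argsort(dist)[:n_neighbors]]

def aces_n(loss_fn, S, s_q, theta_q, length_scales, n_neighbors):
    """Sum the per-context information gain over the nearest trial contexts only."""
    neighbors = nearest_contexts(S, s_q, length_scales, n_neighbors)
    return sum(loss_fn(s, s_q, theta_q) for s in neighbors)

# Trial contexts (made-up values) and a query context.
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
s_q = np.array([0.1, 0.0])

# A short length scale in a dimension makes differences along it count more.
near_a = nearest_contexts(S, s_q, np.array([1.0, 0.1]), n_neighbors=2)
near_b = nearest_contexts(S, s_q, np.array([0.1, 1.0]), n_neighbors=2)

# Placeholder loss that returns 1 for every context, so the sum equals N.
value = aces_n(lambda s, sq, th: 1.0, S, s_q, np.zeros(2), np.array([1.0, 1.0]), 2)
```

Because the number of loss evaluations equals the number of neighbors, the cost of one acquisition evaluation grows linearly with that number, independent of the size of the full context set.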
Candidate points are selected by performing Thompson sampling on randomly chosen with . The number of trial contexts is set to and we compare empirically $N = 1$ and $N = 20$ nearest neighbors. The quantity $p_{\text{opt}}^{s}$ is approximated using Monte Carlo integration based on drawing 1000 samples from the GP posterior. The number of samples from the GP's predictive distribution at $(s_q, \theta_q)$ for approximating the average loss for a query point is set to . Since there is noise in the Monte Carlo estimates of , we use CMA-ES [9] as optimizer rather than DIRECT.

4 Evaluation
We present results in a simulated robotic control task, in which the robot arm COMPI [1] is used to throw a ball at a target on the ground encoded in a two-dimensional context vector. The target area is and the robot arm is mounted at the origin of this coordinate system. Thus, contexts can be chosen from a two-dimensional context space: . The low-level policy is a joint-space DMP with preselected start and goal angle for each joint and all DMP weights set to 0. This DMP results in throwing a ball such that it hits the ground close to the center of the target area. Adaptation to different target positions is achieved by modifying a two-dimensional vector $\theta$: the first component of $\theta$ corresponds to the execution time of the DMP, which determines how far the ball is thrown, and the second component to the final angle of the first joint, which determines the rotation of the arm around the z-axis.
The upper-level policy is a deterministic policy which selects parameters based on an affine function of the context $s$. (The upper-level policy could in principle be defined directly on the surrogate model. This would, however, require a computationally expensive maximization over the parameter space for each evaluation of the policy.) This policy is trained on the training data using the C-REPS policy update. The limits on the parameter space are and . All approaches use an anisotropic Matérn kernel for the GP surrogate model. Since we focus on a "pure exploration" scenario [3], GP-UCB's exploration parameter $\kappa$ is set to a constant value of . The reward is defined as , where denotes the goal position, denotes the position hit by the ball, and denotes a penalty term on the sum of squared joint velocities during DMP execution. The maximum achievable reward differs between contexts $s$, as values of $s$ further away from the origin (where the arm is mounted) require larger joint velocities and thus incur a larger penalty. Thus, rewards in different contexts are incommensurable.
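A deterministic affine upper-level policy can be fitted to collected trials by weighted least squares. The sketch below assumes that per-sample weights are already available (in C-REPS they result from the policy update, which is not shown here); the function names, data, and parameter limits are hypothetical.

```python
import numpy as np

def fit_affine_policy(contexts, thetas, weights):
    """Weighted least-squares fit of theta = W^T [1, s]."""
    Phi = np.hstack([np.ones((len(contexts), 1)), contexts])  # features [1, s]
    weighted_Phi = Phi * weights[:, None]
    return np.linalg.solve(Phi.T @ weighted_Phi, weighted_Phi.T @ thetas)

def affine_policy(W, s, lower, upper):
    """Evaluate the affine policy and clip the result to the parameter limits."""
    theta = np.concatenate(([1.0], s)) @ W
    return np.clip(theta, lower, upper)

# Made-up training data generated from a known affine map.
rng = np.random.default_rng(0)
contexts = rng.uniform(-1.0, 1.0, size=(50, 2))
true_W = np.array([[0.5, -0.2], [1.0, 0.0], [0.0, 2.0]])  # rows: bias, s_1, s_2
thetas = np.hstack([np.ones((50, 1)), contexts]) @ true_W
weights = rng.uniform(0.1, 1.0, size=50)  # stand-in for weights from the policy update

W = fit_affine_policy(contexts, thetas, weights)
theta_at_origin = affine_policy(W, np.zeros(2), lower=-10.0, upper=10.0)
```

Evaluating such a policy requires only a matrix-vector product per context, in contrast to maximizing the surrogate model over the parameter space for every query.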
Figure 1 summarizes the main results of the empirical evaluation. The left graph shows the mean offline performance of the upper-level policy at 16 test contexts on a grid over the context space. Sampling contexts and parameters randomly during learning ("Random") is shown as a baseline and indicates that generalizing experience using a GP model alone does not suffice for quick learning in this task. Rather, a non-random way of exploration is required. BO-CPS with random context selection and UCB for parameter selection improves considerably over random parameter selection. Using ES for parameter selection further improves the learning speed. Closer inspection (not shown) indicates that ES improves over UCB mainly because UCB often samples at the boundaries of the parameter space, since the uncertainty is typically large there. ES samples more often in the inner regions of the parameter space, since those regions promise a larger information gain globally.

Active context selection using ACES further improves over BO-CPS with ES-based parameter selection, in particular when the sum over the context space is approximated using several samples ($N = 20$ in the case of ACES_20) rather than a single sample ($N = 1$ for ACES_1). The right graph shows the contexts selected by different variants of ACES. It can be seen that ACES_20 avoids selecting targets close to the boundary of the context space, as those typically reveal less global information about the context-dependent optima because boundary points are far away from most other regions of the context space. We attribute the improved learning performance to this way of selecting targets during learning. In contrast, ACES_1 samples more often close to the boundaries, as it only considers the local information gain and thus has no reason to prefer inner over boundary contexts.
Figure 1: (Left) Learning curves on the simulated robot arm COMPI: the offline performance is evaluated every 10 episodes on 16 test contexts distributed on a grid over the target area. Shown are the mean and its standard error over 20 independent runs. (Right) Scatter plot showing the sampled contexts for all runs (blue) and a single representative run (red).
5 Discussion and Conclusion
We have presented an active learning approach for contextual policy search based on entropy search. First experimental results indicate that the proposed active learning approach considerably speeds up the learning of movement primitives compared to random task selection. A comparison with other active task selection approaches [7] remains future work. Moreover, it would be interesting to investigate and enhance the scalability to higher-dimensional problems, potentially in combination with random embedding approaches such as REMBO [28], and to combine active task selection with predictive entropy search [11] or portfolio-based approaches [24].
Acknowledgments
This work was performed as part of the project BesMan (more information is available at http://robotik.dfkibremen.de/en/research/projects/besman.html) and supported through two grants of the German Federal Ministry of Economics and Technology (BMWi, FKZ 50 RA 1216 and FKZ 50 RA 1217).
References
[1] V. Bargsten and J. de Gea. COMPI: Development of a 6-DOF compliant robot arm for human-robot cooperation. In Proceedings of the 8th International Workshop on Human-Friendly Robotics (HFR2015). Technische Universität München (TUM), Oct. 2015.
[2] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report.
[3] S. Bubeck, R. Munos, and G. Stoltz. Pure Exploration in Multi-armed Bandits Problems. In Algorithmic Learning Theory, number 5809 in Lecture Notes in Computer Science, pages 23–37. Springer Berlin Heidelberg, Oct. 2009.
[4] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific Computing, 16:1190–1208, 1995.
[5] R. Calandra, A. Seyfarth, J. Peters, and M. P. Deisenroth. Bayesian Optimization for Learning Gaits under Uncertainty. Annals of Mathematics and Artificial Intelligence (AMAI), 2015.
[6] M. P. Deisenroth, G. Neumann, and J. Peters. A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
[7] A. Fabisch and J. H. Metzen. Active contextual policy search. Journal of Machine Learning Research, 15:3371–3399, 2014.
[8] A. Fabisch, J. H. Metzen, M. M. Krell, and F. Kirchner. Accounting for Task-Difficulty in Active Multi-Task Robot Control Learning. KI - Künstliche Intelligenz, pages 1–9, May 2015.
[9] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9:159–195, 2001.
[10] P. Hennig and C. J. Schuler. Entropy Search for Information-Efficient Global Optimization. Journal of Machine Learning Research, 13:1809–1837, 2012.
[11] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In Advances in Neural Information Processing Systems 27, pages 918–926, 2014.
[12] A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Computation, 25:1–46, 2013.
[13] D. R. Jones, C. D. Perttunen, and B. E. Stuckman. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157–181, Oct. 1993.
[14] J. Kober, A. Wilhelm, E. Oztop, and J. Peters. Reinforcement learning to adjust parametrized motor primitives to new situations. Autonomous Robots, 33(4):361–379, 2012.
[15] A. Krause and C. S. Ong. Contextual Gaussian Process Bandit Optimization. In Advances in Neural Information Processing Systems 24, pages 2447–2455, 2011.
[16] O. B. Kroemer, R. Detry, J. Piater, and J. Peters. Combining active learning and reactive control for robot grasping. Robotics and Autonomous Systems, 58(9):1105–1116, Sept. 2010.
[17] A. G. Kupcsik, M. P. Deisenroth, J. Peters, and G. Neumann. Data-Efficient Generalization of Robot Skills with Contextual Policy Search. In 27th AAAI Conference on Artificial Intelligence, June 2013.
[18] D. Lizotte, T. Wang, M. Bowling, and D. Schuurmans. Automatic gait optimization with Gaussian process regression. Pages 944–949, 2007.
[19] A. Marco, P. Hennig, J. Bohg, S. Schaal, and S. Trimpe. Automatic LQR Tuning based on Gaussian Process Optimization: Early Experimental Results. In Proceedings of the Second Machine Learning in Planning and Control of Robot Motion Workshop, Hamburg, 2015. IROS.
[20] J. H. Metzen, A. Fabisch, and J. Hansen. Bayesian Optimization for Contextual Policy Search. In Proceedings of the Second Machine Learning in Planning and Control of Robot Motion Workshop, Hamburg, 2015. IROS.
[21] J. Peters, K. Mülling, and Y. Altun. Relative Entropy Policy Search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, Atlanta, Georgia, USA, 2010. AAAI Press.
[22] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[23] P. Ruvolo and E. Eaton. Active Task Selection for Lifelong Machine Learning. In Twenty-Seventh AAAI Conference on Artificial Intelligence, June 2013.
[24] B. Shahriari, Z. Wang, M. W. Hoffman, A. Bouchard-Côté, and N. de Freitas. An Entropy Search Portfolio for Bayesian Optimization. In NIPS workshop on Bayesian optimization, 2014.
[25] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems 25, pages 2951–2959, 2012.
[26] N. Srinivas, A. Krause, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
[27] K. Swersky, J. Snoek, and R. P. Adams. Multi-Task Bayesian Optimization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2004–2012. Curran Associates, Inc., 2013.
[28] Z. Wang, M. Zoghi, F. Hutter, D. Matheson, and N. de Freitas. Bayesian Optimization in High Dimensions via Random Embeddings. In International Joint Conferences on Artificial Intelligence (IJCAI), 2013.