1 Introduction
Exploration versus exploitation (E/E) dilemmas arise in many subfields of science, and in related fields such as artificial intelligence, finance, medicine and engineering. In its simplest version, the multi-armed bandit problem formalizes this dilemma as follows
[1]: a gambler has $T$ coins, and at each step he may choose one among $K$ slot machines (or arms) in which to insert one of these coins; he then earns some money (his reward) depending on the response of the machine he selected. Each arm's response is characterized by an unknown probability distribution that is constant over time. The goal of the gambler is to collect the largest cumulated reward once he has exhausted his coins (i.e., after $T$ plays). A rational (and risk-neutral) gambler knowing the reward distributions of the arms would play at every stage an arm with maximal expected reward, so as to maximize his expected cumulative reward (irrespective of the number of arms $K$, his number of coins $T$, and the variances of the reward distributions). When the reward distributions are unknown, it is less trivial to decide how to play optimally, since two contradictory goals compete:
exploration consists in trying an arm to acquire knowledge about its expected reward, while exploitation consists in using the current knowledge to decide which arm to play. How to balance the effort towards these two goals is the essence of the E/E dilemma, which is especially difficult when the number of playing opportunities is finite. Most theoretical works about the multi-armed bandit problem have focused on the design of generic E/E strategies that are provably optimal in asymptotic conditions (large $T$), while assuming only very unrestrictive conditions on the reward distributions (e.g., bounded support). Among these, some strategies work by computing at every play a quantity called an "upper confidence index" for each arm, which depends on the rewards collected so far by this arm, and by selecting for the next play (or round of plays) the arm with the highest index. Such E/E strategies are called index-based policies and were initially introduced by [2], where the indices were difficult to compute. Indices that are easier to compute were proposed later on [3, 4, 5].
Index-based policies typically involve hyper-parameters whose values impact their relative performances. Usually, when reporting simulation results, authors manually tuned these values on problems that share similarities with their test problems (e.g., the same type of distributions for generating the rewards) by running trial-and-error simulations [4, 6]. By doing so, they actually used prior information on the problems to select the hyper-parameters.
Starting from these observations, we elaborated an approach for learning, in a reproducible way, good policies for playing multi-armed bandit problems over finite horizons. This approach explicitly models and then exploits the prior information on the target set of multi-armed bandit problems. We assume that this prior knowledge is represented as a distribution over multi-armed bandit problems, from which we can draw any number of training problems. Given this distribution, meta-learning consists in searching, in a chosen set of candidate E/E strategies, one that yields optimal expected performance. This approach makes it possible to automatically tune the hyper-parameters of existing index-based policies. More importantly, it opens the door to searching, within much broader classes of E/E strategies, one that is optimal for a given set of problems compliant with the prior information. We propose two such hypothesis spaces composed of index-based policies: in the first one, the index function is a linear function of history features whose meta-learned parameters are real numbers, while in the second one it is a function generated by a grammar of symbolic formulas.
We empirically show, in the case of Bernoulli arms, that when the number of arms and the playing horizon are fully specified a priori, learning yields policies that significantly outperform a wide range of previously proposed generic policies (UCB1, UCB1-Tuned, UCB2, UCB-V, KL-UCB and Greedy), even after careful tuning. We also evaluate the robustness of the learned policies with respect to erroneous prior assumptions, by testing the E/E strategies learned for Bernoulli arms on bandits whose rewards follow a truncated Gaussian distribution.
The ideas presented in this paper take their roots in two previously published papers. The idea of learning multi-armed bandit policies using global optimization and numerically parameterized index-based policies was first proposed in [7]. Searching for good multi-armed bandit policies in a formula space was first proposed in [8]. Compared to this previous work, we adopt here a unifying perspective, namely the learning of E/E strategies from prior knowledge. We also introduce an improved optimization procedure for formula search, based on the identification of equivalence classes and on a pure-exploration multi-armed bandit formalization.
This paper is structured as follows. We first formally define the multi-armed bandit problem and introduce index-based policies in Section 2. Section 3 formally states the E/E strategy learning problem. Sections 4 and 5 present the numerical and symbolic instantiations of our learning approach, respectively. Section 6 reports on experimental results. Finally, we conclude and present future research directions in Section 7.
2 Multi-armed bandit problem and policies
We now formally describe the (discrete) multi-armed bandit problem and the class of index-based policies.
2.1 The multi-armed bandit problem
We denote by $a_1, \ldots, a_K$ the $K$ ($K \ge 2$) arms of the bandit problem, by $\nu_k$ the reward distribution of arm $a_k$, and by $\mu_k$ its expected value; $a_t$ is the arm played at round $t$, and $r_t \sim \nu_{a_t}$ is the obtained reward.
$H_t = [a_1, r_1, a_2, r_2, \ldots, a_t, r_t]$ is a vector that gathers the history over the first $t$ plays, and we denote by $\mathcal{H}$ the set of all possible histories of any length $t$. An E/E strategy (or policy) $\pi : \mathcal{H} \to \{a_1, \ldots, a_K\}$ is an algorithm that processes at play $t$ the vector $H_{t-1}$ to select the arm $a_t$: $a_t = \pi(H_{t-1})$. The regret of the policy $\pi$ after $T$ plays is defined by $R_T^\pi = T \mu^* - \sum_{t=1}^{T} r_t$, where $\mu^* = \max_k \mu_k$ refers to the expected reward of the optimal arm. The expected value of the regret represents the expected loss due to the fact that the policy does not always play the best machine. It can be written as:

$$\mathbb{E}[R_T^\pi] = \sum_{k=1}^{K} \mathbb{E}[T_k(T)] \, (\mu^* - \mu_k) \qquad (1)$$

where $T_k(T)$ denotes the number of times the policy has drawn arm $a_k$ during the first $T$ rounds.
The multi-armed bandit problem aims at finding a policy $\pi^*$ that, for a given $K$, minimizes the expected regret (or, in other words, maximizes the expected reward), ideally for any $T$ and any reward distributions $\{\nu_k\}$.
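As an illustration (our own sketch, not part of the original formalization), the pseudo-regret decomposition of Eq. (1) can be checked with a small simulation; the policy and Bernoulli means below are arbitrary choices:

```python
import random

def play_bandit(policy, mus, T, rng):
    """Run one episode of `policy` on a Bernoulli bandit with means `mus`
    and return how many times each arm was drawn."""
    K = len(mus)
    history = []          # H_t as a list of (arm, reward) pairs
    counts = [0] * K      # T_k(T) for each arm
    for t in range(T):
        k = policy(history, K, t)
        r = 1.0 if rng.random() < mus[k] else 0.0
        history.append((k, r))
        counts[k] += 1
    return counts

def pseudo_regret(counts, mus):
    """Eq. (1) with the observed counts standing in for E[T_k(T)]."""
    mu_star = max(mus)
    return sum(n * (mu_star - mu) for n, mu in zip(counts, mus))

# A trivial (and poor) E/E strategy: round-robin over the arms.
round_robin = lambda history, K, t: t % K
```

With two arms of means 0.9 and 0.5, round-robin pulls the suboptimal arm half of the time, so its pseudo-regret after 100 plays is $50 \times 0.4 = 20$.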
2.2 Index-based bandit policies
Index-based bandit policies rely on a ranking index that computes for each arm $a_k$ a numerical value based on the sub-history of responses $H_{t-1}^k$ of that arm gathered up to time $t$. These policies are sketched in Algorithm 1 and work as follows. During the first $K$ plays, they play each machine once for initialization. In all subsequent plays, they compute for every machine $a_k$ the score $\text{index}(H_{t-1}^k, t)$, which depends on the observed sub-history of arm $a_k$ and possibly on $t$. At each step $t$, the arm with the largest score is selected (ties are broken at random).
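A minimal sketch of this scheme (our own illustration of Algorithm 1, with deterministic tie-breaking for simplicity):

```python
def index_based_policy(index_fn, history, K, t):
    """Sketch of Algorithm 1: play each arm once, then maximize the index.

    `history` is a list of (arm, reward) pairs; `index_fn(rewards_k, t)`
    scores an arm from the rewards it has produced so far.
    """
    if t < K:
        return t  # initialization: play each of the K arms once
    scores = []
    for k in range(K):
        rewards_k = [r for (a, r) in history if a == k]
        scores.append(index_fn(rewards_k, t))
    # Ties are broken at random in the paper; here, the lowest arm index wins.
    return scores.index(max(scores))
```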
Here are some examples of popular index functions:
$$\text{index}^{\text{UCB1}}(H_{t-1}^k, t) = \bar{r}_k + \sqrt{\frac{C \ln t}{t_k}} \qquad (2)$$

$$\text{index}^{\text{UCB1-Tuned}}(H_{t-1}^k, t) = \bar{r}_k + \sqrt{\frac{\ln t}{t_k} \min\left(\frac{1}{4},\; \bar{\sigma}_k^2 + \sqrt{\frac{2 \ln t}{t_k}}\right)} \qquad (3)$$

$$\text{index}^{\text{UCB1-Normal}}(H_{t-1}^k, t) = \bar{r}_k + \sqrt{16\, \bar{\sigma}_k^2\, \frac{\ln (t-1)}{t_k}} \qquad (4)$$

$$\text{index}^{\text{UCB-V}}(H_{t-1}^k, t) = \bar{r}_k + \sqrt{\frac{2\, \bar{\sigma}_k^2\, \zeta \ln t}{t_k}} + c\, \frac{3\, \zeta \ln t}{t_k} \qquad (5)$$
where $\bar{r}_k$ and $\bar{\sigma}_k$ are the mean and standard deviation of the rewards so far obtained from arm $a_k$, and $t_k$ is the number of times it has been played. Policies UCB1, UCB1-Tuned and UCB1-Normal have been proposed by [4] (note that UCB1-Normal does not strictly fit inside Algorithm 1, as it uses an additional condition to play arms that have not been played for a long time). UCB1 has one parameter $C$ whose typical value is 2. Policy UCB-V has been proposed by [5] and has two parameters $\zeta$ and $c$. We refer the reader to [4, 5] for detailed explanations of these parameters. Note that these index functions are the sum of an exploitation term giving preference to arms with a high mean reward ($\bar{r}_k$) and an exploration term that aims at playing arms to gather more information on their underlying reward distribution (typically an upper confidence term).
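For concreteness, here is a hedged sketch of the UCB1 and UCB1-Tuned indices of Eqs. (2) and (3), following [4]; the variance estimate and constants below are our reading of that reference:

```python
import math

def ucb1_index(rewards_k, t, C=2.0):
    """UCB1 (Eq. 2): empirical mean plus an upper-confidence bonus."""
    tk = len(rewards_k)
    mean = sum(rewards_k) / tk
    return mean + math.sqrt(C * math.log(t) / tk)

def ucb1_tuned_index(rewards_k, t):
    """UCB1-Tuned (Eq. 3): the bonus uses an empirical variance bound."""
    tk = len(rewards_k)
    mean = sum(rewards_k) / tk
    var = sum(r * r for r in rewards_k) / tk - mean * mean
    v = var + math.sqrt(2.0 * math.log(t) / tk)
    return mean + math.sqrt(math.log(t) / tk * min(0.25, v))
```

In both cases, the exploration bonus shrinks as the arm accumulates pulls ($t_k$ grows) and grows slowly with the global time step $t$.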
3 Learning exploration/exploitation strategies
Instead of relying on a fixed E/E strategy to solve a given class of problems, we propose a systematic approach to exploit prior knowledge by learning E/E strategies in a problemdriven way. We now state our learning approach in abstract terms.
Prior knowledge is represented as a distribution $\mathcal{D}_P$ over bandit problems $P$. From this distribution, we can sample as many training problems as desired. In order to learn E/E strategies exploiting this knowledge, we rely on a parametric family of candidate strategies $\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\}$, whose members are policies $\pi_\theta$ that are fully defined given the parameters $\theta$. Given $\Pi_\Theta$, the learning problem aims at solving:
$$\theta^* = \underset{\theta \in \Theta}{\arg\min}\; \mathbb{E}_{P \sim \mathcal{D}_P}\big[\mathbb{E}[R_T^{\pi_\theta}(P)]\big] \qquad (6)$$
where $\mathbb{E}[R_T^{\pi_\theta}(P)]$ is the expected cumulative regret of $\pi_\theta$ on problem $P$ and where $T$ is the (a priori given) playing horizon. Solving this minimization problem is non-trivial since it involves an expectation over an infinite number of problems. Furthermore, given a problem $P$, computing $\mathbb{E}[R_T^{\pi_\theta}(P)]$ relies on the expected values $\mathbb{E}[T_k(T)]$, which we cannot compute exactly in the general case. Therefore, we propose to approximate the expected cumulative regret by the empirical mean regret over a finite set of training problems $P^{(1)}, \ldots, P^{(N)}$ drawn from $\mathcal{D}_P$:
$$\Delta(\theta) = \frac{1}{N} \sum_{i=1}^{N} R_T^{\pi_\theta}(P^{(i)}) \qquad (7)$$
where the $R_T^{\pi_\theta}(P^{(i)})$ values are estimated by performing a single trajectory of $\pi_\theta$ on problem $P^{(i)}$. Note that the number of training problems $N$ will typically be large in order to make the variance of this estimate reasonably small. In order to instantiate this approach, two components have to be provided: the hypothesis space $\Pi_\Theta$ and the optimization algorithm to solve Eq. (7). The next two sections describe different instantiations of these components.
4 Numeric parameterization
We now instantiate our meta-learning approach by considering E/E strategies that have numerical parameters.
4.1 Policy search space
To define the parametric family of candidate policies $\Pi_\Theta$, we use index functions expressed as linear combinations of history features. These index functions rely on a history feature function $\phi(H_{t-1}, k) \in \mathbb{R}^d$ that describes the history with respect to a given arm $a_k$ as a vector of scalar features. Given the function $\phi$, index functions are defined by $\text{index}(H_{t-1}^k, t) = \langle \theta, \phi(H_{t-1}, k) \rangle$,
where $\theta \in \mathbb{R}^d$ are parameters and $\langle \cdot, \cdot \rangle$ is the classical dot product. The set of candidate policies $\Pi_\Theta$ is composed of all index-based policies obtained with such index functions, given parameters $\theta \in \mathbb{R}^d$.
History features may describe any aspect of the history, including empirical reward moments, the current time step, arm play counts, or combinations of these variables. The set of such features should not be too large, to avoid parameter estimation difficulties, but it should be large enough to provide support for a rich set of E/E strategies. We here propose one possibility for defining the history feature function that can be applied to any multi-armed bandit problem and that is shown to perform well in Section 6.
To compute $\phi(H_{t-1}, k)$, we first compute the following four variables: $v_1 = \sqrt{\ln t}$, $v_2 = 1/\sqrt{t_k}$, $v_3 = \bar{r}_k$ and $v_4 = \bar{\sigma}_k$, i.e. the square root of the logarithm of the current time step, the inverse square root of the number of times arm $a_k$ has been played, and the empirical mean and standard deviation of the rewards obtained so far by arm $a_k$.
Then, these variables are multiplied in different ways to produce features. The number of these combinations is controlled by a degree parameter $P$ whose default value is $1$. Given $P$, there is one feature per possible combination of exponents $(p_1, p_2, p_3, p_4) \in \{0, \ldots, P\}^4$, which is defined as follows: $v_1^{p_1} v_2^{p_2} v_3^{p_3} v_4^{p_4}$.
In other terms, there is one feature per possible monomial of degree at most $P$ in each of the variables $v_1, \ldots, v_4$. In the following, we denote by Power1 (resp., Power2) the policy learned using the function $\phi$ with parameter $P = 1$ (resp., $P = 2$). The index function that underlies these policies can be written as follows:

$$\text{index}(H_{t-1}^k, t) = \sum_{p_1, \ldots, p_4 = 0}^{P} \theta_{p_1 p_2 p_3 p_4} \left(\sqrt{\ln t}\right)^{p_1} \left(\frac{1}{\sqrt{t_k}}\right)^{p_2} \bar{r}_k^{\,p_3}\, \bar{\sigma}_k^{\,p_4} \qquad (8)$$

where the $\theta_{p_1 p_2 p_3 p_4}$ are the learned parameters. The Power1 policy has $2^4 = 16$ such parameters and the Power2 policy has $3^4 = 81$ parameters.
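Under this reading of Eq. (8) (the variable definitions are our reconstruction, but they match the parameter counts of 16 and 81 reported in the experiments), the Power-$P$ index could be sketched as:

```python
import math
from itertools import product

def power_index(rewards_k, t, theta, P):
    """Power-P index (Eq. 8): a linear combination of every product
    v1^p1 * v2^p2 * v3^p3 * v4^p4 with exponents in {0, ..., P}."""
    tk = len(rewards_k)
    mean = sum(rewards_k) / tk
    var = max(0.0, sum(r * r for r in rewards_k) / tk - mean * mean)
    v = (math.sqrt(math.log(t)),   # v1 = sqrt(ln t)
         1.0 / math.sqrt(tk),      # v2 = 1 / sqrt(t_k)
         mean,                     # v3 = empirical mean reward
         math.sqrt(var))           # v4 = empirical std deviation
    features = [v[0] ** p1 * v[1] ** p2 * v[2] ** p3 * v[3] ** p4
                for (p1, p2, p3, p4) in product(range(P + 1), repeat=4)]
    return sum(w * f for w, f in zip(theta, features))
```

For $P = 1$ there are $2^4 = 16$ features and for $P = 2$ there are $3^4 = 81$, one weight per feature.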
4.2 Optimization algorithm
We now discuss the optimization of Eq. (7) in the case of our numerical parameterization. Note that the objective function we want to optimize, in addition to being stochastic, has a complex relation with the parameters $\theta$: a slight change in the parameter vector may lead to significantly different bandit episodes and expected regret values. Local optimization approaches may thus not be appropriate here. Instead, we suggest the use of derivative-free global optimization algorithms.
In this work, we use a powerful, yet simple, class of global optimization algorithms related to the cross-entropy method and also known as Estimation of Distribution Algorithms (EDAs) [9]. EDAs rely on a probabilistic model to describe promising regions of the search space and to sample good candidate solutions. This is performed by repeating iterations that first sample a population of candidates using the current probabilistic model and then fit a new probabilistic model given the best candidates.
Any kind of probabilistic model may be used inside an EDA. The simplest form of EDA uses one marginal distribution per variable to optimize and is known as the univariate marginal distribution algorithm [10]. We have adopted this approach by using one Gaussian distribution $\mathcal{N}(\mu_j, \sigma_j^2)$ for each parameter $\theta_j$. Although this approach is simple, it proved to be quite effective experimentally at solving Eq. (7). The full details of our EDA-based policy learning procedure are given in Algorithm 2. The initial distributions are standard Gaussians $\mathcal{N}(0, 1)$. The policy that is returned corresponds to the parameters that led to the lowest observed value of $\Delta(\theta)$.
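A compact sketch of such a univariate-Gaussian EDA follows; the population size, iteration count, and variance floor are illustrative choices of ours, not the paper's settings:

```python
import math
import random

def eda_minimize(objective, d, iterations=20, pop=50, elite=10, rng=None):
    """Univariate marginal EDA: sample candidates from one Gaussian per
    parameter, keep the elite, refit the Gaussians, and repeat."""
    rng = rng or random.Random(0)
    mu, sigma = [0.0] * d, [1.0] * d      # initial standard Gaussians N(0, 1)
    best_theta, best_value = None, float("inf")
    for _ in range(iterations):
        candidates = [[rng.gauss(mu[j], sigma[j]) for j in range(d)]
                      for _ in range(pop)]
        candidates.sort(key=objective)
        value = objective(candidates[0])
        if value < best_value:
            best_theta, best_value = list(candidates[0]), value
        top = candidates[:elite]
        for j in range(d):                 # refit the marginal Gaussians
            vals = [c[j] for c in top]
            mu[j] = sum(vals) / elite
            var = sum((x - mu[j]) ** 2 for x in vals) / elite
            sigma[j] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
    return best_theta, best_value
```

On a smooth deterministic objective this converges quickly; on the stochastic objective of Eq. (7), each `objective` call would run bandit episodes, and larger populations are needed.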
5 Symbolic parameterization
The index functions from the literature depend on the current time step $t$ and on three statistics extracted from the sub-history $H_{t-1}^k$: $\bar{r}_k$, $\bar{\sigma}_k$ and $t_k$. We now propose a second parameterization of our learning approach, in which we consider all index functions that can be constructed using small formulas built upon these four variables.
5.1 Policy search space
We consider index functions that are given in the form of small, closed-form formulas. Closed-form formulas have several advantages: they can be easily computed, they can be formally analyzed, and they are easily interpretable.
Let us first make explicit the set of formulas $\mathbb{F}$ that we consider in this paper. A formula $F \in \mathbb{F}$ is:

either a binary expression $F = B(F', F'')$, where $B$ belongs to a set of binary operators $\mathcal{B}$ and $F'$ and $F''$ are also formulas from $\mathbb{F}$,

or a unary expression $F = U(F')$, where $U$ belongs to a set of unary operators $\mathcal{U}$ and $F' \in \mathbb{F}$,

or an atomic variable $F = v$, where $v$ belongs to a set of variables $\mathcal{V}$,

or a constant $F = c$, where $c$ belongs to a set of constants $\mathcal{C}$.
In the following, we consider sets of operators and constants that provide a good compromise between high expressiveness and low cardinality of $\mathbb{F}$. The set of binary operators $\mathcal{B}$ considered in this paper includes the four elementary mathematical operations and the minimum and maximum operators: $\mathcal{B} = \{+, -, \times, \div, \min, \max\}$. The set of unary operators $\mathcal{U}$ contains the square root, the logarithm, the absolute value, the opposite and the inverse: $\mathcal{U} = \{\sqrt{\cdot}, \ln(\cdot), |\cdot|, -(\cdot), 1/(\cdot)\}$. The set of variables is $\mathcal{V} = \{\bar{r}_k, \bar{\sigma}_k, t_k, t\}$. The set of constants $\mathcal{C}$ has been chosen to maximize the number of different numbers representable by small formulas.
Figure 1 summarizes our grammar of formulas and gives two examples of index functions. The length of a formula is the number of symbols occurring in it. Let $L$ be a given maximal length. $\mathbb{F}_L$ is the subset of formulas whose length is no more than $L$, and $\Pi_{\mathbb{F}_L}$ is the set of index-based policies whose index functions are defined by formulas $F \in \mathbb{F}_L$.
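One natural way to implement such a grammar (our illustration, not the paper's code) is to represent formulas as expression trees and evaluate them recursively; invalid evaluations raise exceptions and can then be discarded:

```python
import math

# A formula is a nested tuple; leaves are variable names or numeric constants.
BINARY = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
          '*': lambda a, b: a * b, '/': lambda a, b: a / b,
          'min': min, 'max': max}
UNARY = {'sqrt': math.sqrt, 'ln': math.log, 'abs': abs,
         'neg': lambda a: -a, 'inv': lambda a: 1.0 / a}

def evaluate(formula, variables):
    """Recursively evaluate a formula tree on a dict of variable values.
    Raises ValueError/ZeroDivisionError on invalid formulas."""
    if isinstance(formula, (int, float)):
        return float(formula)
    if isinstance(formula, str):
        return variables[formula]
    op, *args = formula
    if op in BINARY:
        return BINARY[op](evaluate(args[0], variables),
                          evaluate(args[1], variables))
    return UNARY[op](evaluate(args[0], variables))

# Example: a UCB1-like index  r_mean + sqrt(2 * ln(t) / t_k)
ucb1_like = ('+', 'r_mean', ('sqrt', ('/', ('*', 2, ('ln', 't')), 't_k')))
```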
5.2 Optimization algorithm
We now discuss the optimization of Eq. (7) in the case of our symbolic parameterization. First, notice that several different formulas can lead to the same policy. For example, any increasing function of $\bar{r}_k$ defines the greedy policy, which always selects the arm that is believed to be the best; examples of such functions in our formula search space include $\bar{r}_k$ itself, $2 \times \bar{r}_k$ and $\sqrt{\bar{r}_k}$.
Since it is useless to evaluate equivalent policies multiple times, we propose the following two-step approach. First, the set $\mathbb{F}_L$ is partitioned into equivalence classes, two formulas being equivalent if and only if they lead to the same policy. Then, Eq. (7) is solved over the set of equivalence classes (which is typically one or two orders of magnitude smaller than the initial set $\mathbb{F}_L$).
Partitioning $\mathbb{F}_L$. This task is far from trivial: given a formula, equivalent formulas can be obtained through commutativity, associativity, operator-specific rules, and any increasing transformation. Performing this step exactly would involve advanced static analysis of the formulas, which we believe would be very difficult to implement. Instead, we propose a simple approximate solution, which consists in discriminating formulas by comparing how they rank (in terms of the values returned by the formula) a set of random samples of the variables $\bar{r}_k$, $\bar{\sigma}_k$, $t_k$ and $t$. More formally, the procedure is the following:

we first build $\mathbb{F}_L$, the space of all formulas $F$ such that $\text{length}(F) \le L$;

for $i = 1, \ldots, d$, we uniformly draw (within their respective domains) a random realization of the variables $\bar{r}_k$, $\bar{\sigma}_k$, $t_k$ and $t$, which we concatenate into a vector $\Theta_i$;

we cluster all formulas from $\mathbb{F}_L$ according to the following rule: two formulas $F$ and $F'$ belong to the same cluster if and only if they rank all the points $\Theta_1, \ldots, \Theta_d$ in the same order, i.e., for all pairs $(i, j)$, $F(\Theta_i) \le F(\Theta_j) \Leftrightarrow F'(\Theta_i) \le F'(\Theta_j)$. Formulas leading to invalid index functions (caused for instance by division by zero or the logarithm of a negative value) are discarded;

among each cluster, we select one formula of minimal length;

we gather all the selected minimal-length formulas into an approximate reduced set of formulas $\tilde{\mathbb{F}}_L$.
In the following, we denote by $M$ the cardinality of the approximate set of formulas $\tilde{\mathbb{F}}_L = \{F_1, \ldots, F_M\}$.
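The clustering step can be sketched as follows (our illustration; the sample points and the formula representation are placeholders):

```python
from collections import defaultdict

def rank_signature(formula_fn, points):
    """Clustering key: all pairwise <= comparisons of the formula's values on
    a fixed set of random points. Returns None for invalid formulas."""
    try:
        values = [formula_fn(p) for p in points]
    except (ValueError, ZeroDivisionError, OverflowError):
        return None  # e.g. division by zero or log of a negative value
    n = len(values)
    return tuple(values[i] <= values[j] for i in range(n) for j in range(n))

def reduce_formulas(formulas, points):
    """Keep one shortest representative per (approximate) equivalence class.
    `formulas` is a list of (length, callable) pairs."""
    clusters = defaultdict(list)
    for length, fn in formulas:
        sig = rank_signature(fn, points)
        if sig is not None:
            clusters[sig].append((length, fn))
    return [min(group, key=lambda lf: lf[0])[1] for group in clusters.values()]
```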
Optimization algorithm. A naive approach to finding the best formula would be to estimate $\Delta(F)$ for each formula $F \in \tilde{\mathbb{F}}_L$ and simply return the best one. While extremely simple to implement, such an approach could reveal itself to be time-inefficient for sets $\tilde{\mathbb{F}}_L$ of large cardinality.
Preliminary experiments have shown us that $\tilde{\mathbb{F}}_L$ contains a majority of formulas leading to relatively badly performing index-based policies. It turns out that relatively few regret samples are sufficient to reject these badly performing formulas with high confidence. In order to exploit this observation, a natural idea is to formalize the search for the best formula as another multi-armed bandit problem. To each formula $F \in \tilde{\mathbb{F}}_L$, we associate an arm. Pulling the arm $F$ consists in selecting a training problem and in running one episode with the index-based policy whose index formula is $F$. This leads to a reward associated to arm $F$ whose value is the opposite of the regret observed during the episode. The purpose of multi-armed bandit algorithms is here to process the sequence of observed rewards to select in a smart way the next formula to be tried, so that when the budget of pulls has been exhausted, one (or several) high-quality formula(s) can be identified.
In this formalization of Eq. (7) as a multi-armed bandit problem, only the quality of the finally suggested arm matters. How to select arms so as to identify the best one within a finite budget is known as the pure-exploration multi-armed bandit problem [11]. It has been shown that index-based policies based on upper confidence bounds are good policies for solving pure-exploration bandit problems. Our optimization procedure thus works as follows: we run a bandit algorithm such as UCB1-Tuned during a given number of steps and then return the policy that corresponds to the formula with the highest empirical mean reward. The training problem used at each pull of an arm depends on the number of times that arm has been played so far, so that each formula cycles through the training problems $P^{(1)}, \ldots, P^{(N)}$.
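This pure-exploration loop can be sketched as follows (our illustration; `candidate_regrets[m]` stands for running one episode of formula $F_m$ on the next training problem in its cycle):

```python
import math

def best_formula_search(candidate_regrets, budget):
    """UCB1-style pure-exploration sketch over formula arms: pull arms by an
    upper-confidence score on negated regrets, then return the arm with the
    lowest empirical mean regret."""
    M = len(candidate_regrets)
    sums = [0.0] * M   # accumulated regrets per formula
    counts = [0] * M   # pulls per formula
    for t in range(budget):
        if t < M:
            m = t                       # initialization: pull each arm once
        else:
            m = max(range(M),
                    key=lambda j: -sums[j] / counts[j]
                                  + math.sqrt(2.0 * math.log(t) / counts[j]))
        sums[m] += candidate_regrets[m](counts[m])  # one (noisy) regret sample
        counts[m] += 1
    return min(range(M), key=lambda j: sums[j] / counts[j])
```

Badly performing formulas quickly accumulate low upper-confidence scores and are pulled rarely, which is exactly the effect exploited here.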
In our experiments, we estimate that our multi-armed bandit approach is one hundred to one thousand times faster than the naive Monte Carlo optimization procedure, which clearly demonstrates the benefits of this approach. Note that this idea could also be relevant to our numerical case. The main difference is that the corresponding multi-armed bandit problem would rely on a continuous arm space. Although some algorithms have already been proposed to solve such multi-armed bandit problems [12], how to scale these techniques to problems with hundreds or thousands of parameters is still an open research question. Progress in this field could directly benefit our numerical learning approach.
6 Numerical experiments
We now illustrate the two instances of our learning approach by comparing learned policies against a number of previously proposed generic policies, in a setting where prior knowledge is available about the target problems. We show that, in both cases, learning yields exploration/exploitation strategies that significantly outperform all tested generic policies.
6.1 Experimental protocol
We compare learned policies against generic policies, distinguishing between untuned and tuned generic policies. The former are either parameter-free policies or policies used with the default parameters suggested in the literature, while the latter are generic policies whose hyper-parameters were tuned using Algorithm 2.
Training and testing. To illustrate our approach, we consider the scenario where the number of arms $K$, the playing horizon $T$ and the kind of reward distributions $\nu_k$ are known a priori, and where the parameters of these distributions are the missing information. Since we are learning policies, care should be taken with generalization issues. As usual in supervised machine learning, we use a training set which is distinct from the testing set. The training set is composed of $N$ bandit problems sampled from a given distribution $\mathcal{D}_P$ over bandit problems, whereas the testing set contains another 10000 problems drawn from this distribution. To study the robustness of our policies w.r.t. wrong prior information, we also report their performance on a set of problems drawn from another distribution $\mathcal{D}_P'$ with a different kind of reward distributions. When computing $\Delta(\theta)$, we estimate the regret on each of these problems by averaging results over multiple runs; one evaluation thus involves simulating a large number of bandit episodes during both training and testing.

Problem distributions. The distribution $\mathcal{D}_P$ is composed of two-armed bandit problems with Bernoulli distributions whose expectations are uniformly drawn from $[0, 1]$. Hence, in order to sample a bandit problem from $\mathcal{D}_P$, we draw the expectations $p_1$ and $p_2$ uniformly from $[0, 1]$ and return the bandit problem with two Bernoulli arms that have expectations $p_1$ and $p_2$, respectively. In the second distribution $\mathcal{D}_P'$, the reward distributions are replaced by Gaussian distributions truncated to the interval $[0, 1]$. In order to sample one problem from $\mathcal{D}_P'$, we select a mean and a standard deviation for each arm uniformly in $[0, 1]$. Rewards are then sampled using a rejection sampling approach: samples are drawn from the corresponding Gaussian distribution until obtaining a value that belongs to the interval $[0, 1]$.

Generic policies. We consider the following generic policies: the Greedy policy as described in [4], the policies introduced by [4] (UCB1, UCB1-Tuned, UCB1-Normal and UCB2), the policy KL-UCB introduced in [13] and the policy UCB-V proposed by [5]. Except for Greedy, all these policies belong to the family of index-based policies discussed previously. UCB1-Tuned and UCB1-Normal are parameter-free policies designed for bandit problems with Bernoulli distributions and with Gaussian distributions, respectively. All the other policies have hyper-parameters that can be tuned to improve their quality: Greedy has two parameters, UCB2 has one parameter, KL-UCB has one parameter and UCB-V has two parameters. We refer the reader to [4, 5, 13] for detailed explanations of these parameters.
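Sampling from these two problem distributions can be sketched as follows (illustrative helper names; the truncation interval is $[0, 1]$ as described above):

```python
import random

def sample_bernoulli_problem(rng):
    """Draw a problem from D_P: two Bernoulli arms, means uniform in [0, 1]."""
    return [rng.random(), rng.random()]

def sample_truncated_gaussian_problem(rng):
    """Draw a problem from D_P': a (mean, std) pair per arm, uniform in [0, 1]."""
    return [(rng.random(), rng.random()) for _ in range(2)]

def truncated_gaussian_reward(mean, std, rng):
    """Rejection sampling: redraw from N(mean, std^2) until the value lies in [0, 1]."""
    while True:
        x = rng.gauss(mean, std)
        if 0.0 <= x <= 1.0:
            return x
```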
Learning numerical policies. We learn policies using the two parameterizations Power1 and Power2 described in Section 4.1. Note that tuning generic policies is a particular case of learning with numerical parameters, and that both learned policies and tuned generic policies make use of the same prior knowledge. To make the comparison between these two kinds of policies fair, we always use the same training procedure, namely Algorithm 2 with a fixed number of iterations and with population and elite sizes that grow linearly with the number $d$ of parameters to optimize; a linear dependency between the population size and $d$ is a classical choice when using EDAs [14]. In most cases the optimization converges within a few, or a few tens of, iterations, and our simulations have shown that these settings leave the optimization enough time to properly converge. For the baseline policies for which default values are advocated, we use these values as the initial expectations of the EDA Gaussians; otherwise, the initial Gaussians are centered on zero. Nothing is done to enforce the constraints on the parameters (e.g., the constraints on the two parameters of Greedy): in practice, the EDA automatically identifies interesting regions of the search space that respect these constraints.
Learning symbolic policies. We apply our symbolic learning approach with a bounded maximal formula length $L$, which leads to a set $\mathbb{F}_L$ containing several million formulas. We have applied the approximate partitioning approach described in Section 5.2 to these formulas, using random samples of the variables to discriminate among strategies. This has resulted in the rejection of a large number of invalid formulas and in a much smaller set of distinct candidate E/E strategies (i.e., distinct formula equivalence classes). To identify the best of those distinct strategies, we apply the UCB1-Tuned algorithm for a large number of steps. In our experiments, we report the two best policies found, which we denote Formula1 and Formula2.
6.2 Performance comparison
Table 1: Expected regret of untuned generic, tuned generic and learned policies on Bernoulli and truncated Gaussian test problems.

Policy             Training  |        Bernoulli         |         Gaussian
                   horizon   | T=10    T=100   T=1000   | T=10    T=100   T=1000
Untuned generic policies
UCB1                  -      | 1.07    5.57    20.1     | 1.37    10.6    66.7
UCB1-Tuned            -      | 0.75    2.28    5.43     | 1.09    6.62    37.0
UCB1-Normal           -      | 1.71    13.1    31.7     | 1.65    13.4    58.8
UCB2                  -      | 0.97    3.13    7.26     | 1.28    7.90    40.1
UCB-V                 -      | 1.45    8.59    25.5     | 1.55    12.3    63.4
KL-UCB                -      | 0.76    2.47    6.61     | 1.14    7.66    43.8
KL-UCB                -      | 0.82    3.29    9.81     | 1.21    8.90    53.0
Greedy                -      | 1.07    3.21    11.5     | 1.20    6.24    41.4
Tuned generic policies
UCB1                T=10     | 0.74    2.05    4.85     | 1.05    6.05    32.1
                    T=100    | 0.74    2.05    4.84     | 1.05    6.06    32.3
                    T=1000   | 0.74    2.08    4.91     | 1.05    6.17    33.0
UCB2                T=10     | 0.97    3.15    7.39     | 1.28    7.91    40.5
                    T=100    | 0.97    3.12    7.26     | 1.33    8.14    40.4
                    T=1000   | 0.97    3.13    7.25     | 1.28    7.89    40.0
UCB-V               T=10     | 0.75    2.36    5.15     | 1.01    5.75    26.8
                    T=100    | 0.75    2.28    7.07     | 1.01    5.30    27.4
                    T=1000   | 0.77    2.43    5.14     | 1.13    5.99    27.5
KL-UCB              T=10     | 0.73    2.14    5.28     | 1.12    7.00    38.9
                    T=100    | 0.73    2.10    5.12     | 1.09    6.48    36.1
                    T=1000   | 0.73    2.10    5.12     | 1.08    6.34    35.4
Greedy              T=10     | 0.79    3.86    32.5     | 1.01    7.31    67.6
                    T=100    | 0.95    3.19    14.8     | 1.12    6.38    46.6
                    T=1000   | 1.23    3.48    9.93     | 1.32    6.28    37.7
Learned numerical policies
Power1              T=10     | 0.72    2.29    14.0     | 0.97    5.94    49.7
(16 parameters)     T=100    | 0.77    1.84    5.64     | 1.04    5.13    27.7
                    T=1000   | 0.88    2.09    4.04     | 1.17    5.95    28.2
Power2              T=10     | 0.72    2.37    15.7     | 0.97    6.16    55.5
(81 parameters)     T=100    | 0.76    1.82    5.81     | 1.05    5.03    29.6
                    T=1000   | 0.83    2.07    3.95     | 1.12    5.61    27.3
Learned symbolic policies
Formula1            T=10     | 0.72    2.37    14.7     | 0.96    5.14    30.4
                    T=100    | 0.76    1.85    8.46     | 1.12    5.07    29.8
                    T=1000   | 0.80    2.31    4.16     | 1.23    6.49    26.4
Formula2            T=10     | 0.72    2.88    22.8     | 1.02    7.15    66.2
                    T=100    | 0.78    1.92    6.83     | 1.17    5.22    29.1
                    T=1000   | 1.10    2.62    4.29     | 1.38    6.29    26.1
Table 2: Percentage of wins against UCB1-Tuned (policies trained with the test horizon).

Generic policies                        Learned policies
Policy   T=10    T=100   T=1000        Policy     T=10    T=100   T=1000
UCB1     48.1 %  78.1 %  83.1 %        Power1     54.6 %  82.3 %  91.3 %
UCB2     12.7 %  6.8 %   6.8 %         Power2     54.2 %  84.6 %  90.3 %
UCB-V    38.3 %  57.2 %  49.6 %        Formula1   61.7 %  76.8 %  88.1 %
KL-UCB   50.5 %  65.0 %  67.0 %        Formula2   61.0 %  80.0 %  73.1 %
Greedy   37.5 %  14.1 %  10.7 %
Table 1 reports the results we obtain for untuned generic policies, tuned generic policies and learned policies on distributions $\mathcal{D}_P$ (Bernoulli) and $\mathcal{D}_P'$ (truncated Gaussian) with horizons $T \in \{10, 100, 1000\}$. For both tuned and learned policies, we consider three different training horizons, to show the effect of a mismatch between the training and the testing horizon.
Generic policies. As already pointed out in [4], UCB1-Tuned is particularly well fitted to bandit problems with Bernoulli distributions. It also proves effective on bandit problems with Gaussian distributions, making it nearly always outperform the other untuned policies. By tuning UCB1, we outperform the UCB1-Tuned policy (e.g., on Bernoulli problems with $T = 1000$, see Table 1). This also sometimes happens with UCB-V. However, even with our careful tuning procedure, UCB2 and Greedy never outperform UCB1-Tuned.
Learned policies. We observe that when the training horizon is the same as the testing horizon $T$, the learned policies (Power1, Power2, Formula1 and Formula2) systematically outperform all generic policies. The overall best results are obtained with the Power2 policies. Note that, due to their numerical nature and their large number of parameters, these policies are extremely hard to interpret and to understand. The results related to symbolic policies show that there exist very simple policies that perform nearly as well as these black-box policies. This clearly shows the benefit of our two hypothesis spaces: numerical policies reach very high performance, while symbolic policies provide interpretable strategies whose behavior can be more easily analyzed. This interpretability/performance trade-off is common in machine learning and was identified several decades ago in the field of supervised learning. It is worth mentioning that, among the formula equivalence classes, a surprisingly large number of strategies outperforming the generic policies were found: depending on the horizon, we obtain about 50 to 80 different such symbolic policies.

Robustness w.r.t. the horizon $T$. As expected, the learned policies give their best performance when the training and testing horizons are equal. Policies learned with a large training horizon prove to work well also on smaller horizons. However, when the testing horizon is larger than the training horizon, the quality of the policy may quickly degrade (e.g., when evaluating Power1 trained with a short horizon on the horizon $T = 1000$, see Table 1).
Robustness w.r.t. the kind of distribution. Although truncated Gaussian distributions are significantly different from Bernoulli distributions, the learned policies most of the time generalize well to this new setting and still outperform all the other generic policies.
A word on the learned symbolic policies. It is worth noticing that the best index-based policies (Formula1) found for the two largest horizons work in a similar way to the UCB-type policies reported earlier in the literature: they also associate to an arm an index which is the sum of $\bar{r}_k$ and of a positive (optimistic) term that decreases with $t_k$. However, for the shortest time horizon, the policy found is totally different from UCB-type policies. With such a policy, only the arms whose empirical mean reward is higher than a given threshold (0.5) have positive index scores and are candidates for selection, i.e., making the scores negative has the effect of killing bad arms. If the $\bar{r}_k$ of an arm is above the threshold, then the index associated with this arm increases with the number of times it is played, rather than decreasing as is the case for UCB policies. If all empirical means are below the threshold, then, for equal reward means, arms that have been played less are preferred. This finding is remarkable, since it suggests that the optimistic paradigm upon which UCB policies are based may in fact not be adapted at all to a context where the horizon is small.
Percentage of wins against UCB1-Tuned. Table 2 gives, for each policy trained with the same horizon as the test horizon, its percentage of wins against UCB1-Tuned. To compute this percentage of wins, we evaluate the expected regret on each of the 10000 testing problems and count the number of problems for which the tested policy outperforms UCB1-Tuned. We observe that by minimizing the expected regret, our learned policies also reach a high percentage of wins: 84.6 % (Power2, $T = 100$) and 91.3 % (Power1, $T = 1000$). Note that, in our approach, it is easy to change the objective function: if the real applicative aim were to maximize the percentage of wins against UCB1-Tuned, this criterion could have been used directly in the policy optimization stage to reach even better scores.
6.3 Computational time
We used a C++-based implementation to perform our experiments. In the numerical case, with cores at , performing the whole learning of Power1 took one hour for and ten hours for . In the symbolic case, using a single core at , performing the whole learning took 22 minutes for and a bit less than three hours for . The fact that symbolic learning is much faster can be explained by two reasons. First, we tuned the EDA algorithm very carefully to be sure to find a high-quality solution; we observe that by using only a fraction of this learning time, we already obtain close-to-optimal strategies. The second factor is that our symbolic learning algorithm saves a lot of CPU time by being able to rapidly reject bad strategies, thanks to the multi-armed bandit formulation upon which it relies.
7 Conclusions
The approach proposed in this paper for exploiting prior knowledge to learn exploration/exploitation policies has been tested on two-armed bandit problems with Bernoulli reward distributions and a known time horizon. The learned policies were found to significantly outperform policies previously published in the literature, such as UCB1, UCB2, UCB-V, KL-UCB and Greedy. The robustness of the learned policies with respect to wrong prior information was also highlighted, by evaluating them on two-armed bandits with truncated Gaussian reward distributions.
There are, in our opinion, several research directions that could be investigated to further improve the policy-learning algorithm proposed in this paper. For example, we found that problems similar to the overfitting met in supervised learning can occur when considering a too-large set of candidate policies. This naturally calls for studying whether our learning approach could be combined with regularization techniques. Along the same lines, more sophisticated optimizers could also be considered for identifying, in the set of candidate policies, the one predicted to perform best.
The UCB1, UCB2, UCB-V, KL-UCB and Greedy policies used for comparison were shown (under certain conditions) to have interesting bounds on their expected regret in asymptotic conditions (very large ), while we did not aim at providing such bounds for our learned policies. It would certainly be relevant to investigate whether similar bounds could be derived for our learned policies or, alternatively, to see how the approach could be adapted so as to target policies offering such theoretical performance guarantees in asymptotic conditions. For example, better bounds on the expected regret could perhaps be obtained by identifying, in a set of candidate policies, the one that gives the smallest maximal value of the expected regret over this set, rather than the one that gives the best average performance.
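The two selection criteria contrasted above (best average regret versus smallest worst-case regret) can be sketched side by side. This is an illustrative sketch, not the paper's implementation; `regret` is assumed to map each candidate policy to its per-problem regret estimates over the problem set.

```python
def select_min_average(candidates, regret):
    """Criterion used in the paper: pick the candidate policy
    with the lowest *average* regret over the problem set."""
    return min(candidates,
               key=lambda p: sum(regret[p]) / len(regret[p]))

def select_minimax(candidates, regret):
    """Alternative suggested above: pick the policy whose
    *worst-case* regret over the problem set is smallest,
    which may be more amenable to regret bounds."""
    return min(candidates, key=lambda p: max(regret[p]))
```

The two criteria can disagree: a policy with the best average may still exhibit a large regret on a few adversarial problems, which the minimax criterion penalizes.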
Finally, while our paper has provided simulation results in the context of the simplest multi-armed bandit setting, our exploration/exploitation policy meta-learning scheme can in principle also be applied to any other exploration-exploitation problem. In this line of research, the extension of this investigation to (finite) Markov Decision Processes studied in [15] already suggests that our approach to meta-learning E/E strategies can be successful in much more complex settings.
References
[1] Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58 (1952) 527–536
[2] Lai, T., Robbins, H.: Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1985) 4–22
[3] Agrawal, R.: Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (1995) 1054–1078
[4] Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2002) 235–256
[5] Audibert, J., Munos, R., Szepesvári, C.: Tuning bandit algorithms in stochastic environments. Algorithmic Learning Theory (ALT) (2007) 150–165
[6] Audibert, J., Munos, R., Szepesvári, C.: Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science (2008)
[7] Maes, F., Wehenkel, L., Ernst, D.: Learning to play K-armed bandit problems. In: Proc. of the 4th International Conference on Agents and Artificial Intelligence. (2012)
[8] Maes, F., Wehenkel, L., Ernst, D.: Automatic discovery of ranking formulas for playing with multi-armed bandits. In: Proc. of the 9th European Workshop on Reinforcement Learning. (2011)
[9] Gonzalez, C., Lozano, J., Larrañaga, P.: In: Estimation of Distribution Algorithms. A New Tool for Evolutionary Computation. Kluwer Academic Publishers (2002)
[10] Pelikan, M., Mühlenbein, H.: Marginal distributions in evolutionary algorithms. In: Proceedings of the 4th International Conference on Genetic Algorithms. (1998)
[11] Bubeck, S., Munos, R., Stoltz, G.: Pure exploration in multi-armed bandits problems. In: Algorithmic Learning Theory. (2009) 23–37
[12] Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C.: X-armed bandits. Journal of Machine Learning Research 12 (2011) 1655–1695
[13] Garivier, A., Cappé, O.: The KL-UCB algorithm for bounded stochastic bandits and beyond. CoRR abs/1102.2490 (2011)
[14] Rubinstein, R., Kroese, D.: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer, New York (2004)
[15] Castronovo, M., Maes, F., Fonteneau, R., Ernst, D.: Learning exploration/exploitation strategies for single trajectory reinforcement learning. In: Proc. of the 10th European Workshop on Reinforcement Learning. (2012)