1 Introduction
Recently Bayesian optimization has received much attention in the machine learning community Shahriari et al. (2016). This literature studies the problem of maximizing an unknown black-box objective function by collecting noisy measurements of the function at carefully chosen sample points. At first a prior belief over the objective function is prescribed, and then the statistical model is refined sequentially as data are observed. Expected improvement (EI) Jones et al. (1998) is one of the most widely used Bayesian optimization algorithms. It is a greedy improvement-based heuristic that samples the point offering greatest expected improvement over the current best sampled point. EI is simple and readily implementable, and it offers reasonable performance in practice.
Although EI is reasonably effective, it is too greedy, focusing nearly all sampling effort near the estimated optimum and gathering too little information about other regions in the domain. This phenomenon is most transparent in the simplest setting of Bayesian optimization where the function’s domain is a finite grid of points. This is the problem of best-arm identification (BAI) Audibert et al. (2010) in a multi-armed bandit. The player sequentially selects arms to measure and observes noisy reward samples with the hope that a small number of measurements enable a confident identification of the best arm. Recently Ryzhov (2016) studied the performance of EI in this setting. His work focuses on a link between EI and another algorithm known as the optimal computing budget allocation Chen et al. (2000), but his analysis reveals EI allocates a vanishing proportion of samples to suboptimal arms as the total number of samples grows. Any method with this property will be far from optimal in BAI problems Audibert et al. (2010).
In this paper, we improve the EI algorithm dramatically through a simple modification. The resulting algorithm, which we call top-two expected improvement (TTEI), combines the top-two sampling idea of Russo (2016) with a careful change to the improvement measure used by EI. We show that this simple variant of EI achieves strong asymptotic optimality properties in the BAI problem, and benchmark the algorithm in simulation experiments.
Our main theoretical contribution is a complete characterization of the asymptotic proportion of samples TTEI allocates to each arm as a function of the true (unknown) arm means. These particular sampling proportions have been shown to be optimal from several perspectives Chernoff (1959); Jennison et al. (1982); Glynn and Juneja (2004); Russo (2016); Garivier and Kaufmann (2016), and this enables us to establish two different optimality results for TTEI. The first concerns the rate at which the algorithm gains confidence about the identity of the optimal arm as the total number of samples collected grows. Next we study the so-called fixed confidence setting, where the algorithm is able to stop at any point and return an estimate of the optimal arm. We show that when applied with the stopping rule of Garivier and Kaufmann (2016), TTEI essentially minimizes the expected number of samples required among all rules obeying a constraint on the probability of incorrect selection.
One undesirable feature of our algorithm is its dependence on a tuning parameter. Our theoretical results precisely show the impact of this parameter, and reveal a surprising degree of robustness to its value. It is also easy to design methods that adapt this parameter over time to the optimal value, and we explore one such method in simulation. Still, removing this tuning parameter is an interesting direction for future research.
Further related literature.
Despite the popularity of EI, its theoretical properties are not well studied. A notable exception is the work of Bull (2011), who studies a global optimization problem and provides a convergence rate for EI’s expected loss. However, that analysis assumes the observations are noiseless. Our work also relates to a large number of recent machine learning papers that try to characterize the sample complexity of the best-arm identification problem Even-Dar et al. (2002); Mannor et al. (2004); Audibert et al. (2010); Gabillon et al. (2012); Karnin et al. (2013); Jamieson et al. (2014); Jamieson and Nowak (2014); Kaufmann and Kalyanakrishnan (2013); Kaufmann et al. (2014, 2016). Despite substantial progress, matching asymptotic upper and lower bounds remained elusive in this line of work. Building on older work in statistics Chernoff (1959); Jennison et al. (1982) and simulation optimization Glynn and Juneja (2004), recent work of Garivier and Kaufmann (2016) and Russo (2016) characterized the optimal sampling proportions. Two notions of asymptotic optimality are established: sample complexity in the fixed confidence setting and rate of posterior convergence. Garivier and Kaufmann (2016) developed two sampling rules designed to closely track the asymptotically optimal proportions and showed that, when combined with a stopping rule motivated by Chernoff (1959), these sampling rules minimize the expected number of samples required to guarantee that a vanishing threshold on the probability of incorrect selection is satisfied. Russo (2016) independently proposed three simple Bayesian algorithms, and proved that each algorithm attains the optimal rate of posterior convergence. The TTEI algorithm proposed in this paper is conceptually most similar to the top-two value sampling of Russo (2016), but it is more computationally efficient.
1.1 Main Contributions
As discussed below, our work makes both theoretical and algorithmic contributions.
 Theoretical:

Our main theoretical contribution is Theorem 1, which establishes that TTEI, a simple modification to a popular Bayesian heuristic, converges to the known optimal asymptotic sampling proportions. It is worth emphasizing that, unlike recent results for other top-two sampling algorithms Russo (2016), this theorem establishes that the expected time to converge to the optimal proportions is finite, which we need to establish optimality in the fixed confidence setting. Proving this result required substantial technical innovations. Theorems 2 and 3 are additional theoretical contributions. These mirror results in Russo (2016) and Garivier and Kaufmann (2016), but we extract minimal conditions on sampling rules that are sufficient to guarantee the two notions of optimality studied in these papers.
 Algorithmic:

On the algorithmic side, we substantially improve a widely used algorithm. TTEI can be easily implemented by modifying existing EI code, but, as shown in our experiments, can offer an order of magnitude improvement. A more subtle point involves the advantages of TTEI over algorithms that are designed to directly target convergence on the asymptotically optimal proportions. In the experiments, we show that TTEI substantially outperforms an oracle sampling rule whose sampling proportions directly track the asymptotically optimal proportions. This phenomenon should be explored further in future work, but suggests that by carefully reasoning about the value of information TTEI accounts for important factors that are washed out in asymptotic analysis. Finally, as discussed in the conclusion, although we focus on uncorrelated priors we believe our method can be easily extended to more complicated problems like that of best-arm identification in linear bandits Soare et al. (2014).
2 Problem Formulation
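Let $A = \{1, \ldots, k\}$ be the set of arms. The reward of arm $i \in A$ at time $n \in \mathbb{N}$ follows a normal distribution $N(\mu_i, \sigma^2)$ with common known variance $\sigma^2$, but unknown mean $\mu_i$. At each time $n = 1, 2, \ldots$, an arm $I_n \in A$ is measured, and the corresponding noisy reward $Y_{n,I_n}$ is observed. The objective is to allocate measurement effort wisely in order to confidently identify the arm with highest mean using a small number of measurements. We assume that $\mu_1 > \mu_2 > \cdots > \mu_k$, i.e., the arm means are unique and arm 1 is the best arm. Our analysis takes place in a frequentist setting, in which the true means $(\mu_1, \ldots, \mu_k)$ are fixed but unknown. The algorithms we study, however, are Bayesian, in the sense that they begin with a prior over the arm means and update the belief to form a posterior distribution as evidence is gathered.
Prior and Posterior Distributions.
The sampling rules studied in this paper begin with a normally distributed prior over the true mean of each arm $i \in A$ denoted by $N(\mu_{1,i}, \sigma_{1,i}^2)$, and update this to form a posterior distribution as observations are gathered. By conjugacy, the posterior distribution after observing the sequence $(I_1, Y_{1,I_1}, \ldots, I_{n-1}, Y_{n-1,I_{n-1}})$ is also a normal distribution denoted by $N(\mu_{n,i}, \sigma_{n,i}^2)$. The posterior mean and variance can be calculated using the following recursive equations:
$$\mu_{n+1,i} = \begin{cases} \dfrac{\sigma_{n,i}^{-2}\mu_{n,i} + \sigma^{-2} Y_{n,I_n}}{\sigma_{n,i}^{-2} + \sigma^{-2}} & \text{if } I_n = i, \\ \mu_{n,i} & \text{if } I_n \neq i, \end{cases}$$
and
$$\sigma_{n+1,i}^2 = \begin{cases} \dfrac{1}{\sigma_{n,i}^{-2} + \sigma^{-2}} & \text{if } I_n = i, \\ \sigma_{n,i}^2 & \text{if } I_n \neq i. \end{cases}$$
We denote the posterior distribution over the vector of arm means by
$$\Pi_n = N(\mu_{n,1}, \sigma_{n,1}^2) \otimes N(\mu_{n,2}, \sigma_{n,2}^2) \otimes \cdots \otimes N(\mu_{n,k}, \sigma_{n,k}^2)$$
and let $\theta = (\theta_1, \ldots, \theta_k)$. For example, with this notation
$$E_{\theta \sim \Pi_n}\Big[\sum_{i \in A} \theta_i\Big] = \sum_{i \in A} \mu_{n,i}.$$
The posterior probability assigned to the event that arm $i$ is optimal is
$$\alpha_{n,i} \triangleq P_{\theta \sim \Pi_n}\Big(\theta_i > \max_{j \neq i} \theta_j\Big). \tag{1}$$
To avoid confusion, we use $\theta = (\theta_1, \ldots, \theta_k)$ to denote a random vector of arm means drawn from the algorithm’s posterior $\Pi_n$, and $\mu = (\mu_1, \ldots, \mu_k)$ to denote the vector of true arm means.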
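The conjugate update and the optimality probability in (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the standard known-variance normal update (the posterior precision of the measured arm grows by $1/\sigma^2$, and all other arms are unchanged). The function names `update_posterior` and `prob_optimal`, the Monte Carlo estimator, and its default sample size are hypothetical choices made here.

```python
import random


def update_posterior(mean, var, arm, reward, noise_var):
    """Conjugate normal update for the measured arm; other arms are unchanged."""
    mean, var = list(mean), list(var)
    precision = 1.0 / var[arm] + 1.0 / noise_var
    mean[arm] = (mean[arm] / var[arm] + reward / noise_var) / precision
    var[arm] = 1.0 / precision
    return mean, var


def prob_optimal(mean, var, num_draws=20000, seed=0):
    """Monte Carlo estimate of the posterior probability that each arm is best."""
    rng = random.Random(seed)
    wins = [0] * len(mean)
    for _ in range(num_draws):
        # Draw one vector of arm means from the (independent) posterior.
        draw = [rng.gauss(m, v ** 0.5) for m, v in zip(mean, var)]
        wins[draw.index(max(draw))] += 1
    return [w / num_draws for w in wins]
```

A Monte Carlo estimate is used only for simplicity; with independent normal posteriors the same probability could instead be computed by one-dimensional numerical integration.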
Two notions of asymptotic optimality.
Our first notion of optimality relates to the rate of posterior convergence. As the number of observations grows, one hopes that the posterior distribution definitively identifies the true best arm, in the sense that the posterior probability $1 - \alpha_{n,1}$ assigned to the event that a different arm is optimal tends to zero. By sampling the arms intelligently, we hope this probability can be driven to zero as rapidly as possible. We will see that under TTEI the posterior probability tends to zero at an exponential rate, and so following Russo (2016), we aim to maximize the exponent governing the rate of decay, effectively solving the optimization problem of maximizing
$$\liminf_{n \to \infty} \; -\frac{1}{n} \log\left(1 - \alpha_{n,1}\right).$$
The second setting we consider is often called the “fixed confidence” setting. Here, the agent is allowed at any point to stop gathering samples and return an estimate of the identity of the optimal arm. In addition to the sampling rule TTEI, we require a stopping rule that selects a time $\tau$ at which to stop, and a decision rule that returns an estimate $\hat{I}_\tau$ of the optimal arm based on the first $\tau$ observations. We consider minimizing the average number of observations $E[\tau_\delta]$ required by an algorithm guaranteeing a vanishing probability $\delta$ of incorrect identification, i.e., $P(\hat{I}_{\tau_\delta} \neq 1) \leq \delta$. Following Garivier and Kaufmann (2016), the number of samples required scales with $\log(1/\delta)$, and so we aim to minimize
$$\limsup_{\delta \to 0} \frac{E[\tau_\delta]}{\log(1/\delta)}$$
among algorithms with probability of error no more than $\delta$. In this setting, we study the performance of TTEI when combined with the stopping rule studied by Chernoff (1959) and Garivier and Kaufmann (2016).
3 Sampling Rules
In this section, we first introduce the expected improvement algorithm, and point out its weakness. Then a simple variant of the expected improvement algorithm is proposed. Both algorithms make calculations using the function $f(x) = x\Phi(x) + \phi(x)$ where $\Phi(\cdot)$ and $\phi(\cdot)$ are the CDF and PDF of the standard normal distribution. One can show that as $x \to \infty$, $\log f(-x) \sim -x^2/2$, and so $f(-x) \approx e^{-x^2/2}$ for very large $x$. One can also show that $f$ is an increasing function.
Expected Improvement.
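Expected improvement Jones et al. (1998) is a simple improvement-based sampling rule. The EI algorithm favors the arm that offers the largest amount of improvement upon a target. The EI algorithm measures the arm $I_n = \arg\max_{i \in A} v_{n,i}$ where $v_{n,i}$ is the EI value of arm $i$ at time $n$. Let $I_n^* = \arg\max_{i \in A} \mu_{n,i}$ denote the arm with largest posterior mean at time $n$. The EI value of arm $i$ at time $n$ is defined as
$$v_{n,i} \triangleq E_{\theta \sim \Pi_n}\left[\left(\theta_i - \mu_{n,I_n^*}\right)^+\right]$$
where $x^+ = \max\{x, 0\}$. The above expectation can be computed analytically as follows,
$$v_{n,i} = \sigma_{n,i}\, f\!\left(\frac{\mu_{n,i} - \mu_{n,I_n^*}}{\sigma_{n,i}}\right).$$
The EI value $v_{n,i}$ measures the potential of arm $i$ to improve upon the largest posterior mean $\mu_{n,I_n^*}$ at time $n$. Because $f$ is an increasing function, $v_{n,i}$ is increasing in both the posterior mean $\mu_{n,i}$ and posterior standard deviation $\sigma_{n,i}$.
Top-Two Expected Improvement.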
The EI algorithm can have very poor performance for selecting the best arm. Once it finds a particular arm with reasonably high probability to be the best, it allocates nearly all future samples to this arm at the expense of measuring other arms. Recently Ryzhov (2016) showed that EI only allocates $O(\log n)$ samples to suboptimal arms asymptotically. This is a severe shortcoming, as it means $n$ must be extremely large before the algorithm has enough samples from suboptimal arms to reach a confident conclusion.
To improve the EI algorithm, we build on the top-two sampling idea in Russo (2016). The idea is to identify in each period the two “most promising” arms based on current observations, and randomize to choose which to sample. A tuning parameter $\beta \in (0,1)$ controls the probability assigned to the “top” arm. A naive top-two variant of EI would identify the two arms with largest EI value, and flip a $\beta$-weighted coin to decide which to measure. However, one can prove that this algorithm is not optimal for any choice of $\beta$. Instead, what we call the top-two expected improvement algorithm uses a novel modified EI criterion which more carefully accounts for the decision-maker’s uncertainty when deciding which arm to sample.
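For $i, j \in A$, define $v_{n,i,j} \triangleq E_{\theta \sim \Pi_n}\left[(\theta_i - \theta_j)^+\right]$. This measures the expected magnitude of improvement arm $i$ offers over arm $j$, but unlike the typical EI criterion, this expectation integrates over the uncertain quality of both arms. For $i \neq j$, this measure can be computed analytically as
$$v_{n,i,j} = \sqrt{\sigma_{n,i}^2 + \sigma_{n,j}^2}\; f\!\left(\frac{\mu_{n,i} - \mu_{n,j}}{\sqrt{\sigma_{n,i}^2 + \sigma_{n,j}^2}}\right).$$
TTEI depends on a tuning parameter $\beta$, set to $1/2$ by default. With probability $\beta$, TTEI measures the arm $I_n^{(1)}$ by optimizing the EI criterion, and otherwise it measures an alternative $I_n^{(2)}$ that offers the largest expected improvement on the arm $I_n^{(1)}$. Formally, TTEI measures the arm
$$I_n = \begin{cases} I_n^{(1)} = \arg\max_{i \in A} v_{n,i} & \text{with probability } \beta, \\ I_n^{(2)} = \arg\max_{i \in A} v_{n,i,I_n^{(1)}} & \text{with probability } 1 - \beta. \end{cases}$$
Note that $v_{n,i,i} = 0$, which implies $I_n^{(2)} \neq I_n^{(1)}$.
We notice that TTEI with $\beta = 1$ is the standard EI algorithm. Compared to the EI algorithm, TTEI with $\beta < 1$ allocates much more measurement effort to suboptimal arms. We will see that TTEI allocates a $\beta$ proportion of samples to the best arm asymptotically, and it uses the remaining $1 - \beta$ fraction of samples for gathering evidence against each suboptimal arm.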
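The two selection steps described in this section can be sketched as follows. The code is an illustration under stated assumptions rather than a reference implementation. It takes the helper function to be $f(x) = x\Phi(x) + \phi(x)$, the EI value of arm $i$ to be its posterior standard deviation times $f$ evaluated at the standardized gap to the largest posterior mean, and the pairwise criterion to be the same expression with the two arms' variances added. All function names, and the use of plain lists for the posterior means and variances, are hypothetical.

```python
import math
import random


def f(x):
    """f(x) = x * Phi(x) + phi(x), with Phi and phi the standard normal CDF and PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return x * cdf + pdf


def ei_values(mean, var):
    """Standard EI value of every arm, relative to the largest posterior mean."""
    best = max(mean)
    return [math.sqrt(v) * f((m - best) / math.sqrt(v)) for m, v in zip(mean, var)]


def pairwise_ei(mean, var, i, j):
    """Expected improvement of arm i over arm j, integrating over both arms."""
    if i == j:
        return 0.0
    s = math.sqrt(var[i] + var[j])
    return s * f((mean[i] - mean[j]) / s)


def ttei_select(mean, var, beta=0.5, rng=random):
    """Return the index of the arm that TTEI measures next."""
    arms = range(len(mean))
    v = ei_values(mean, var)
    first = max(arms, key=lambda i: v[i])
    if rng.random() < beta:
        return first  # top arm under the usual EI criterion
    # Otherwise measure the challenger with the largest improvement over `first`.
    return max(arms, key=lambda i: pairwise_ei(mean, var, i, first))
```

Setting `beta=1.0` recovers plain EI in this sketch, because `rng.random()` always returns a value strictly below 1.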
4 Convergence to Asymptotically Optimal Proportions
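For all $i \in A$ and $n \in \mathbb{N}$, we define $T_{n,i} \triangleq \sum_{\ell=1}^{n-1} \mathbf{1}(I_\ell = i)$ to be the number of samples of arm $i$ before time $n$. We will show that under TTEI with parameter $\beta$, $\lim_{n \to \infty} T_{n,1}/n = \beta$. That is, the algorithm asymptotically allocates a $\beta$ proportion of the samples to the true best arm. Dropping for the moment questions regarding the impact of this tuning parameter, let us consider the optimal asymptotic proportion of effort to allocate to each of the $k - 1$ remaining arms. It is known that the optimal proportions are given by the unique vector $(w_2^\beta, \ldots, w_k^\beta)$ satisfying $\sum_{i=2}^{k} w_i^\beta = 1 - \beta$ and
$$\frac{(\mu_2 - \mu_1)^2}{1/w_2^\beta + 1/\beta} = \cdots = \frac{(\mu_k - \mu_1)^2}{1/w_k^\beta + 1/\beta}. \tag{2}$$
We set $w_1^\beta = \beta$, so $w^\beta = (w_1^\beta, \ldots, w_k^\beta)$ encodes the sampling proportions of each arm.
To understand the source of equation (2), imagine that over the first $n$ periods each arm $i$ is sampled exactly $w_i^\beta n$ times, and let $\hat{\mu}_{n,i} \sim N\big(\mu_i, \sigma^2/(w_i^\beta n)\big)$ denote the empirical mean of arm $i$. Then
$$\hat{\mu}_{n,1} - \hat{\mu}_{n,i} \sim N\big(\mu_1 - \mu_i, \tilde{\sigma}_i^2\big) \quad \text{where} \quad \tilde{\sigma}_i^2 = \frac{\sigma^2}{n}\left(\frac{1}{w_i^\beta} + \frac{1}{\beta}\right).$$
The probability that $\hat{\mu}_{n,1} - \hat{\mu}_{n,i} \leq 0$, leading to an incorrect estimate of the arm with highest mean, is $\Phi\big((\mu_i - \mu_1)/\tilde{\sigma}_i\big)$ where $\Phi$ is the CDF of the standard normal distribution. Equation (2) is equivalent to requiring that $(\mu_1 - \mu_i)/\tilde{\sigma}_i$ is equal for all arms $i \neq 1$, so the probability of falsely declaring $\mu_i \geq \mu_1$ is equal for all $i \neq 1$. In a sense, these sampling frequencies equalize the evidence against each suboptimal arm. These proportions appeared first in the machine learning literature in Russo (2016); Garivier and Kaufmann (2016), but appeared much earlier in the statistics literature in Jennison et al. (1982), and separately in the simulation optimization literature in Glynn and Juneja (2004). As we will see in the next section, convergence to this allocation is a necessary condition for both notions of optimality considered in this paper.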
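The proportions defined by equation (2) have no closed form in general, but they are easy to compute numerically. The sketch below assumes that equation (2) requires $(\mu_i - \mu_1)^2 / (1/w_i^\beta + 1/\beta)$ to take a common value $c$ across the suboptimal arms, with the suboptimal proportions summing to $1 - \beta$. It inverts that condition for each arm and bisects on $c$. The name `optimal_proportions`, the bisection scheme, and the convention that `mu[0]` is the best arm are illustrative choices, not taken from the paper.

```python
def optimal_proportions(mu, beta, iters=200):
    """Solve equation (2) for w^beta by bisection; mu[0] must be the largest mean."""
    gaps_sq = [(m - mu[0]) ** 2 for m in mu[1:]]

    def weights(c):
        # Invert gap^2 / (1/w + 1/beta) = c for each suboptimal arm.
        return [1.0 / (g / c - 1.0 / beta) for g in gaps_sq]

    # The common value c lies strictly between 0 and beta * (smallest squared gap),
    # and the total suboptimal weight is increasing in c on that interval.
    lo, hi = 0.0, beta * min(gaps_sq)
    for _ in range(iters):
        c = 0.5 * (lo + hi)
        if sum(weights(c)) < 1.0 - beta:
            lo = c
        else:
            hi = c
    return [beta] + weights(0.5 * (lo + hi))
```

For example, with means `[1.0, 0.0, 0.0]` and `beta = 0.5` the two suboptimal arms are symmetric, so the remaining half of the budget is split evenly between them.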
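Our main theoretical contribution is the following theorem, which establishes that under TTEI sampling proportions converge to the proportions $w^\beta$ derived above. Therefore, while the sampling proportion of the optimal arm is controlled by the tuning parameter $\beta$, the remaining $1 - \beta$ fraction of measurement is optimally distributed among the remaining $k - 1$ arms. One of our results requires more than convergence to $w^\beta$ with probability 1: it requires a sense in which the expected time until convergence is finite. To make this precise, we introduce a time after which for each arm, both its empirical mean and empirical proportion are accurate. Specifically, given $\beta \in (0,1)$ and $\epsilon > 0$, we define
$$M_\beta^\epsilon \triangleq \inf\left\{ N \in \mathbb{N} : \max_{i \in A} \left\{ |\mu_{n,i} - \mu_i|, \; |T_{n,i}/n - w_i^\beta| \right\} \leq \epsilon \quad \forall n \geq N \right\}. \tag{3}$$
If $T_{n,i}/n \to w_i^\beta$ with probability 1 for each $i \in A$, then by the law of large numbers $M_\beta^\epsilon < \infty$ for every $\epsilon > 0$. Such a result was established for other top-two sampling algorithms in Russo (2016). To establish optimality in the “fixed confidence setting”, we need to prove in addition that $E[M_\beta^\epsilon] < \infty$ for all $\epsilon > 0$, which requires substantial new technical innovations.
Theorem 1.
If TTEI is applied with parameter $\beta \in (0,1)$, $E[M_\beta^\epsilon] < \infty$ for any $\epsilon > 0$. Therefore,
$$\lim_{n \to \infty} \frac{T_{n,i}}{n} = w_i^\beta \quad \forall i \in A.$$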
4.1 Problem Complexity Measure
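Given $\beta \in (0,1)$, define the problem complexity measure
$$\Gamma_\beta^* \triangleq \frac{(\mu_2 - \mu_1)^2}{2\sigma^2\left(1/w_2^\beta + 1/\beta\right)} = \cdots = \frac{(\mu_k - \mu_1)^2}{2\sigma^2\left(1/w_k^\beta + 1/\beta\right)},$$
which is a function of the true arm means and variances. This will be the exponent governing the rate of posterior convergence, and also characterizing the average number of samples in the fixed confidence setting. The optimal exponent comes from maximizing $\Gamma_\beta^*$ over $\beta$. Let us define $\Gamma^* = \max_{\beta \in (0,1)} \Gamma_\beta^*$ and $\beta^* = \arg\max_{\beta \in (0,1)} \Gamma_\beta^*$ and set
$$w^* = w^{\beta^*} = \left(\beta^*, w_2^{\beta^*}, \ldots, w_k^{\beta^*}\right).$$
Russo (2016) has proved that for $\beta \in (0,1)$, $\Gamma_\beta^* \geq \Gamma^* / \max\left\{\frac{\beta^*}{\beta}, \frac{1 - \beta^*}{1 - \beta}\right\}$, and therefore $\Gamma_{1/2}^* \geq \Gamma^*/2$. This demonstrates a surprising degree of robustness to $\beta$. In particular, $\Gamma_\beta^*$ is close to $\Gamma^*$ if $\beta$ is adjusted to be close to $\beta^*$, and the choice of $\beta = 1/2$ always yields a 2-approximation to $\Gamma^*$.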
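The robustness claim can be checked numerically on small instances. The sketch below is an illustration under an explicit assumption: it takes the complexity measure for a given $\beta$ to be the common value in equation (2) divided by $2\sigma^2$, finds that common value by bisection, and then locates the best $\beta$ by a simple grid search. The names `gamma_beta` and `best_beta`, the grid resolution, and the convention that `mu[0]` is the best arm are hypothetical.

```python
def gamma_beta(mu, sigma, beta, iters=200):
    """Complexity measure for a given beta; mu[0] must be the largest mean."""
    gaps_sq = [(m - mu[0]) ** 2 for m in mu[1:]]
    lo, hi = 0.0, beta * min(gaps_sq)
    for _ in range(iters):
        c = 0.5 * (lo + hi)
        # Total weight on suboptimal arms when the common value in (2) equals c.
        total = sum(1.0 / (g / c - 1.0 / beta) for g in gaps_sq)
        if total < 1.0 - beta:
            lo = c
        else:
            hi = c
    return 0.5 * (lo + hi) / (2.0 * sigma ** 2)


def best_beta(mu, sigma, grid_size=999):
    """Grid search over (0, 1); returns the best beta and its complexity value."""
    grid = [(j + 1) / (grid_size + 1) for j in range(grid_size)]
    values = [gamma_beta(mu, sigma, b) for b in grid]
    j = max(range(grid_size), key=lambda t: values[t])
    return grid[j], values[j]
```

With two arms the suboptimal arm receives the whole remaining budget, so the measure reduces to $\beta(1-\beta)(\mu_1 - \mu_2)^2 / (2\sigma^2)$ and the grid search returns $\beta = 1/2$, which gives a quick sanity check of the code.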
5 Implied Optimality Results
This section establishes formal optimality guarantees for TTEI. Both results, in fact, hold for any algorithm satisfying the conclusions of Theorem 1, and are therefore of broader interest.
5.1 Optimal Rate of Posterior Convergence
We first provide upper and lower bounds on the exponent governing the rate of posterior convergence. The same result has been proved in Russo (2016) for bounded correlated priors. We use different proof techniques to prove the following result for uncorrelated Gaussian priors.
This theorem shows that no algorithm can attain a rate of posterior convergence faster than $e^{-\Gamma^* n}$ and that this is attained by any algorithm that, like TTEI with optimal tuning parameter $\beta^*$, has asymptotic sampling ratios $(w_1^*, \ldots, w_k^*)$. The second part implies TTEI with parameter $\beta$ attains convergence rate $e^{-n\Gamma_\beta^*}$ and that it is optimal among sampling rules that allocate a $\beta$-fraction of samples to the optimal arm. Recall that, without loss of generality, we have assumed arm 1 is the arm with true highest mean $\mu_1 = \max_{i \in A} \mu_i$. We will study the posterior mass $1 - \alpha_{n,1}$ assigned to the event that some other arm has the highest mean.
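Theorem 2 (Posterior Convergence - Sufficient Condition for Optimality).
The following properties hold with probability 1:
1. Under any allocation rule satisfying $T_{n,i}/n \to w_i^*$ for each $i \in A$,
$$\lim_{n \to \infty} -\frac{1}{n} \log\left(1 - \alpha_{n,1}\right) = \Gamma^*.$$
Under any sampling rule,
$$\limsup_{n \to \infty} -\frac{1}{n} \log\left(1 - \alpha_{n,1}\right) \leq \Gamma^*.$$
2. For $\beta \in (0,1)$, under any allocation rule satisfying $T_{n,i}/n \to w_i^\beta$ for each $i \in A$,
$$\lim_{n \to \infty} -\frac{1}{n} \log\left(1 - \alpha_{n,1}\right) = \Gamma_\beta^*.$$
Under any sampling rule satisfying $T_{n,1}/n \to \beta$,
$$\limsup_{n \to \infty} -\frac{1}{n} \log\left(1 - \alpha_{n,1}\right) \leq \Gamma_\beta^*.$$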
This result reveals that when the tuning parameter $\beta$ is set optimally to $\beta^*$, TTEI attains the optimal rate of posterior convergence. Since $\Gamma_{1/2}^* \geq \Gamma^*/2$, when $\beta$ is set to the default value $1/2$, the exponent governing the convergence rate of TTEI is at least half of the optimal one.
5.2 Optimal Average Sample Size
Chernoff’s Stopping Rule.
In the fixed confidence setting, besides an efficient sampling rule, a player also needs to design an intelligent stopping rule. This section introduces a stopping rule proposed by Chernoff (1959) and studied recently by Garivier and Kaufmann (2016). This stopping rule makes use of the Generalized Likelihood Ratio statistic, which depends on the current maximum likelihood estimates of all unknown means. For each arm , the maximum likelihood estimate of its unknown mean at time is its empirical mean . If , we set . For arms , if , the Generalized Likelihood Ratio statistic has the following explicit expression for Gaussian noise distributions:
where is the KL-divergence between two normal distributions and , and is a weighted average of the empirical means of arms defined as
On the other hand, if , then is well-defined as above, and (if , we let ). Given a target confidence , to ensure that one arm is better than the others with probability at least , we use the stopping time
where is an appropriate threshold. By definition, we know that is nonnegative if and only if for all . Hence, whenever is unique, .
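A stopping check of this kind can be sketched in code. The sketch makes several assumptions that should be read as such. It uses the standard Gaussian form of the pairwise statistic, $(\hat{\mu}_i - \hat{\mu}_j)^2 / \big(2\sigma^2 (1/T_i + 1/T_j)\big)$, which is what the weighted-average KL expression reduces to for equal known variances, with a negative sign when arm $i$ has the smaller empirical mean. It stops once the largest, over arms $i$, of the smallest pairwise statistic exceeds the threshold. The threshold is passed in as a plain number, the treatment of unsampled arms (statistic set to zero) is a convention chosen here, and the function names are hypothetical.

```python
def glr_statistic(mean_i, count_i, mean_j, count_j, sigma):
    """Signed Gaussian GLR statistic for arm i against arm j."""
    if count_i == 0 or count_j == 0:
        return 0.0  # assumption: no evidence either way before both arms are sampled
    z = (mean_i - mean_j) ** 2 / (2.0 * sigma ** 2 * (1.0 / count_i + 1.0 / count_j))
    return z if mean_i >= mean_j else -z


def should_stop(means, counts, sigma, threshold):
    """Stop once the best arm's weakest pairwise evidence exceeds the threshold."""
    k = len(means)
    z_n = max(
        min(
            glr_statistic(means[i], counts[i], means[j], counts[j], sigma)
            for j in range(k)
            if j != i
        )
        for i in range(k)
    )
    return z_n > threshold
```

In use, `means` and `counts` would hold the empirical means and sample counts at the current round, and `threshold` would come from an exploration rate chosen to control the error probability.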
Next we introduce the exploration rate for normal bandit models that ensures the best arm is identified with probability at least . We use the following result given in Garivier and Kaufmann (2016).
Proposition 1 (Garivier and Kaufmann (2016), Proposition 12).
Let and . For any normal bandit model, there exists a constant such that under any possible sampling rule, using Chernoff’s stopping rule with the threshold guarantees
Sample Complexity.
Garivier and Kaufmann (2016) recently provided a general lower bound on the number of samples required in the fixed confidence setting. In particular, they show that for any normal bandit model, under any sampling rule and stopping time that guarantees a probability of error less than ,
Recall that , defined in (3), is the first time after which the empirical means and empirical proportions are within of their asymptotic limits. The next result provides a condition in terms of that is sufficient to guarantee optimality in the fixed confidence setting.
Theorem 3 (Fixed Confidence - Sufficient Condition for Optimality).
Let . Consider any sampling rule which, if applied with no stopping rule, satisfies for all . Fix any . Then if this sampling rule is applied with Chernoff’s stopping rule with the threshold , we have
Since can be chosen to be arbitrarily close to 1, when the general lower bound on sample complexity of is essentially matched. In addition, when is set to the default value and is taken to be arbitrarily close to 1, the sample complexity of TTEI combined with the Chernoff’s stopping rule is at most twice the optimal sample complexity since .
6 Numerical Experiments
To test the empirical performance of TTEI, we conduct several numerical experiments. The first experiment compares the performance of TTEI with $\beta = 1/2$ and EI. The second experiment compares the performance of different versions of TTEI, top-two Thompson sampling (TTTS) Russo (2016), knowledge gradient (KG) Frazier et al. (2008) and oracle algorithms that know the optimal proportions a priori. Each algorithm plays each arm exactly once at the beginning, and then prescribes a prior for unknown arm-mean where is the observation from . In both experiments, we fix the common known variance and the number of arms . We consider three instances and . The optimal parameter $\beta^*$ equals 0.48, 0.45 and 0.35, respectively.
Recall that $\alpha_{n,i}$, defined in (1), denotes the posterior probability that arm $i$ is optimal. Table 1 shows the average number of measurements required for the largest posterior probability of being the best to reach a given confidence level $c$, i.e., $\max_{i} \alpha_{n,i} \geq c$. The results in Table 1 are averaged over 100 trials. We see that TTEI with $\beta = 1/2$ outperforms standard EI by an order of magnitude.
Table 1 (one row per instance, in the order the instances are listed above):
Instance   TTEI-1/2   EI
1          14.60      238.50
2          16.72      384.73
3          24.39      1525.42
The second experiment compares the performance of different versions of TTEI, TTTS, KG, random sampling oracle (RSO) and tracking oracle (TO). The random sampling oracle draws a random arm in each round from the distribution encoding the asymptotically optimal proportions. The tracking oracle tracks the optimal proportions at each round. Specifically, the tracking oracle samples the arm with the largest ratio between its optimal and empirical proportions. Two tracking algorithms proposed by Garivier and Kaufmann (2016) are similar to this tracking oracle. TTEI with adaptive $\beta$ (aTTEI) works as follows: it starts from an initial value of $\beta$ and, every 10 rounds, resets $\beta$ to $\hat{\beta}^*$, where $\hat{\beta}^*$ is the maximizer of equation (2) based on plug-in estimators for the unknown arm means. Table 2 shows the average number of measurements required for the largest posterior probability of being the best to reach the given confidence level. The results in Table 2 are averaged over 200 trials. We see that TTEI with adaptive $\beta$ and TTEI with $\beta^*$ perform better than all other algorithms. We note that TTEI with adaptive $\beta$ substantially outperforms the tracking oracle.
Table 2 (one row per instance, in the order the instances are listed above):
Instance   TTEI-1/2   aTTEI   TTEI-β*   TTTS    RSO      TO      KG
1          61.97      61.98   61.59     62.86   97.04    77.76   75.55
2          66.56      65.54   65.55     66.53   103.43   88.02   81.49
3          76.21      72.94   71.62     73.02   101.97   96.90   86.98
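The two oracle baselines described above are simple enough to sketch directly. The code below is an illustration, not the experimental code behind Table 2: the function names are hypothetical, the optimal proportions are assumed to be supplied by the caller, and giving priority to arms that have not yet been sampled is a convention chosen here to avoid dividing by zero.

```python
import random


def tracking_oracle(counts, optimal_props):
    """Pick the arm whose optimal-to-empirical proportion ratio is largest."""
    total = sum(counts)

    def ratio(i):
        if counts[i] == 0:
            return float("inf")  # assumption: unsampled arms take priority
        return optimal_props[i] / (counts[i] / total)

    return max(range(len(counts)), key=ratio)


def random_sampling_oracle(optimal_props, rng=random):
    """Draw an arm at random from the optimal proportions."""
    return rng.choices(range(len(optimal_props)), weights=optimal_props)[0]
```

Repeatedly calling `tracking_oracle` and incrementing the chosen arm's count drives the empirical proportions toward `optimal_props` deterministically, whereas `random_sampling_oracle` only matches them on average.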
7 Conclusion and Extensions to Correlated Arms
We conclude by noting that while this paper thoroughly studies TTEI in the case of uncorrelated priors, we believe the algorithm is also ideally suited to problems with complex correlated priors and large sets of arms. In fact, the modified information measure $v_{n,i,j}$ was designed with an eye toward dealing with correlation in a sophisticated way. In the case of a correlated normal distribution $N(\mu, \Sigma)$, one has
$$v_{n,i,j} = E_{\theta \sim N(\mu, \Sigma)}\left[(\theta_i - \theta_j)^+\right] = \sqrt{\Sigma_{ii} + \Sigma_{jj} - 2\Sigma_{ij}}\; f\!\left(\frac{\mu_i - \mu_j}{\sqrt{\Sigma_{ii} + \Sigma_{jj} - 2\Sigma_{ij}}}\right).$$
This closed form accommodates efficient computation. Here the term $\Sigma_{ij}$ accounts for the correlation or similarity between arms $i$ and $j$. Therefore $v_{n,i,I_n^{(1)}}$ is large for arms $i$ that offer large potential improvement over $I_n^{(1)}$, i.e. those that (1) have large posterior mean, (2) have large posterior variance, and (3) are not highly correlated with arm $I_n^{(1)}$. As $I_n^{(1)}$ concentrates near the estimated optimum, we expect the third factor will force the algorithm to experiment in promising regions of the domain that are “far” away from the current estimated optimum, and are under-explored under standard EI.
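The effect of correlation on the pairwise criterion can be illustrated with a short sketch. It assumes the closed form that follows from the difference of two jointly normal arm means being normal with variance equal to the two variances minus twice their covariance, and it reuses the helper $f(x) = x\Phi(x) + \phi(x)$. The function name `correlated_pairwise_ei`, the nested-list covariance representation, and the handling of a degenerate zero-variance difference are choices made here for illustration.

```python
import math


def f(x):
    """f(x) = x * Phi(x) + phi(x), with Phi and phi the standard normal CDF and PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return x * cdf + pdf


def correlated_pairwise_ei(mu, cov, i, j):
    """E[(theta_i - theta_j)^+] when theta is jointly normal with mean mu, covariance cov."""
    var_diff = cov[i][i] + cov[j][j] - 2.0 * cov[i][j]
    if var_diff <= 0.0:
        # Degenerate case: the difference is deterministic.
        return max(mu[i] - mu[j], 0.0)
    s = math.sqrt(var_diff)
    return s * f((mu[i] - mu[j]) / s)
```

Holding the means and marginal variances fixed, raising the covariance between two arms shrinks the value returned, which matches the intuition that a highly correlated challenger offers little new information.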
References
 Abbasi-Yadkori et al. (2012) Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In AISTATS, volume 22, pages 1–9, 2012.
 Audibert et al. (2010) Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 41–53, 2010.
 Bull (2011) Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879–2904, 2011. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr12.html#Bull11.
 Chen et al. (2000) Chun-Hung Chen, Jianwu Lin, Enver Yücesan, and Stephen E Chick. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems, 10(3):251–270, 2000.
 Chernoff (1959) Herman Chernoff. Sequential design of experiments. Ann. Math. Statist., 30(3):755–770, 09 1959. doi: 10.1214/aoms/1177706205. URL http://dx.doi.org/10.1214/aoms/1177706205.
 Even-Dar et al. (2002) Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Fifteenth Annual Conference on Computational Learning Theory (COLT), pages 255–270, 2002.
 Frazier et al. (2008) Peter I Frazier, Warren B Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
 Gabillon et al. (2012) Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 3212–3220. Curran Associates, Inc., 2012.
 Garivier and Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, pages 998–1027, 2016.
 Glynn and Juneja (2004) P. Glynn and S. Juneja. A large deviations perspective on ordinal optimization. In Simulation Conference, 2004. Proceedings of the 2004 Winter, volume 1. IEEE, 2004.
 Jamieson et al. (2014) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ UCB: An optimal exploration algorithm for multi-armed bandits. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 423–439, Barcelona, Spain, 13–15 Jun 2014. PMLR. URL http://proceedings.mlr.press/v35/jamieson14.html.
 Jamieson and Nowak (2014) Kevin G. Jamieson and Robert D. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 48th Annual Conference on Information Sciences and Systems, CISS 2014, Princeton, NJ, USA, March 19-21, 2014, pages 1–6, 2014.
 Jennison et al. (1982) C. Jennison, I. M. Johnstone, and B. W. Turnbull. Asymptotically optimal procedures for sequential adaptive selection of the best of several normal means. Statistical decision theory and related topics III, 2:55–86, 1982.
 Jones et al. (1998) Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13(4):455–492, 1998. ISSN 1573-2916. doi: 10.1023/A:1008306431147. URL http://dx.doi.org/10.1023/A:1008306431147.
 Karnin et al. (2013) Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1238–1246, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/karnin13.html.
 Kaufmann and Kalyanakrishnan (2013) Emilie Kaufmann and Shivaram Kalyanakrishnan. Information complexity in bandit subset selection. In Shai Shalev-Shwartz and Ingo Steinwart, editors, Proceedings of the 26th Annual Conference on Learning Theory, volume 30 of Proceedings of Machine Learning Research, pages 228–251, Princeton, NJ, USA, 12–14 Jun 2013. PMLR. URL http://proceedings.mlr.press/v30/Kaufmann13.html.
 Kaufmann et al. (2014) Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of A/B testing. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 461–481, Barcelona, Spain, 13–15 Jun 2014. PMLR. URL http://proceedings.mlr.press/v35/kaufmann14.html.
 Kaufmann et al. (2016) Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016. URL http://jmlr.org/papers/v17/kaufman16a.html.
 Mannor et al. (2004) Shie Mannor, John N. Tsitsiklis, Kristin Bennett, and Nicolò Cesa-Bianchi. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:2004, 2004.
 Peña et al. (2008) Victor H Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.
 Russo (2016) Daniel Russo. Simple Bayesian algorithms for best arm identification. In 29th Annual Conference on Learning Theory, pages 1417–1418, 2016.
 Ryzhov (2016) Ilya O. Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528, 2016. doi: 10.1287/opre.2016.1494. URL http://dx.doi.org/10.1287/opre.2016.1494.
 Shahriari et al. (2016) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016. doi: 10.1109/JPROC.2015.2494218. URL http://dx.doi.org/10.1109/JPROC.2015.2494218.
 Soare et al. (2014) Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014.
Appendix A Outline
The appendix is organized as follows.
Appendix B Notation
For notational convenience, we assume that sampling rules begin with an improper prior for each arm with and . Consequently, if , and , and if ,
so the posterior parameters are identical to the frequentist sample mean and variance under the observations collected so far.
We introduce some further notation. We define
Since the arm means are unique, we have . In addition, we define
Note that for , .
We introduce the filtration where
is the sigma algebra generated by observations up to time . For all and , define
Note that for all , . Both and measure the effort allocated to arm up to period .
Finally, rather than use the notation and introduced in Section 3 for the expected-improvement measures, it is more convenient to work with the notation defined here. Set
to be the expected improvement used in identifying the first among in the top-two, and
to be the second expected improvement measure where is the arm optimizing the first expected improvement measure.
Appendix C Proof of Theorem 2
To prove Theorem 2, we first need to introduce the so-called Gaussian tail inequality.
Lemma 1.
Let and , then we have
Proof.
We first prove the upper bound.
Next we prove the lower bound.
∎
Proof of Theorem 2.
We let and . Note that contains arms that are sampled only finitely many times. First, suppose that is nonempty. For each , we define
Recall that for each , an improper prior with and is prescribed. Then if , and , and if .
Hence, for , and , while for , . We let
and for each , we define
For is nonempty, we have since . This implies and so
Now suppose is empty. By definition, , so , and then we have
(4) 
where the second inequality uses the union bound.
To simplify the presentation, we need to introduce the following asymptotic notation. We say two real-valued sequences and are logarithmically equivalent if . We denote this by . Using equation (4), we conclude
Next we want to show that for , . Note that at time , and . Since every arm is sampled infinitely often, when is large, , and then using Lemma 1, we have
which implies
Note that when ,