Motivated by the problem of finding the optimal move in minimax tree search with noisy leaf evaluations, we introduce best arm identification problems with structured payoffs and micro-observables. In these problems, the learner’s goal is to find the best arm when the payoff of each arm is a fixed and known function of a set of unknown values. In each round, the learner can choose one of the micro-observables to make a noisy measurement (i.e., the learner can obtain a “micro-observation”). We study these problems in the so-called fixed confidence setting.
A special case of this problem is standard best arm identification, which has seen a flurry of activity during the last decade, e.g., (Even-Dar et al., 2006; Audibert and Bubeck, 2010; Gabillon et al., 2012; Kalyanakrishnan et al., 2012; Karnin et al., 2013; Jamieson et al., 2014; Chen and Li, 2015). Recently, Garivier et al. (2016a) considered the motivating problem mentioned above. However, they treated only the simplest (non-trivial) instance, in which two players alternate for a single round. One of their main observations is that such two-move problems can be solved more efficiently than by treating them as instances of a nested best arm identification problem. They proposed two algorithms, one for the fixed confidence setting and the other for the (asymptotic) vanishing confidence setting, and provided upper bounds. An implicit (optimization-based) lower bound was also briefly sketched, together with a plan to derive an algorithm that matches it in the vanishing confidence setting.
Our main interest in this paper is to see whether the ideas of Garivier et al. (2016a) extend to more general settings, such as when the depth can be non-uniform and is in particular not limited to two, or when the move histories can lead to shared states (that is, in the language of adversarial search we allow “transpositions”). While considering these extensions, we found it cleaner to introduce the abstract setting mentioned below (Section 2). The motivation here is to clearly delineate the crucial properties of the problem that our results use. For the general structured setting, in Section 3 we prove an instance dependent lower bound along the lines of Auer et al. (2002) or Garivier et al. (2016b) (a mild novelty is the way our proof deals with the technical issue that best arm identification algorithms ideally stop and hence their behavior is undefined after the random stopping time). This is then specialized to the minimax game search setting (Section 4), where we show the crucial role of what we call proof sets, which are somewhat reminiscent of the so-called conspiracy sets from adversarial search (McAllester, 1988). Our lower bound matches that of Garivier et al. (2016a) in the case of two-move alternating problems. Considering again the abstract setting, we propose a new algorithm, which we call LUCB-micro (Section 5), and which can be considered as a natural generalization of Maximin-LUCB of Garivier et al. (2016a)
(with some minor differences). Under a regularity assumption on the payoff maps, we prove that the algorithm meets the risk requirement. We also provide a high-probability, instance-dependent upper bound on the algorithm’s sample complexity (i.e., on the number of observations the algorithm takes). As we discuss, while this bound meets the general characteristics of existing bounds, it fails to reproduce the corresponding result of Garivier et al. (2016a). To the best of the authors’ knowledge, the only comparable algorithm studying best arm identification in a full-length minimax tree search setting (which was the motivating example of our work) is FindTopWinner by Teraoka et al. (2014). This is a round-based elimination algorithm with additional pruning steps that come from the tree structure. When we specialize our framework to the minimax game scenario (and implement the other changes necessary to put our work into their PAC setting), our upper bound is a strict improvement of theirs, e.g., in the number of samples related to the near-optimal micro-observables (leaves of the minimax game tree). Next, we consider the minimax setting (Section 6). First, we show that the regularity assumptions made for the abstract setting are met in this case. We also show how to efficiently compute the choices that LUCB-micro makes using a “min-max” algorithm. Finally, we strengthen our previous result so that it reproduces the mentioned result of Garivier et al. (2016a).
We use ℕ to denote the set of positive integers, while ℝ denotes the set of reals. For a positive integer n, we let [n] = {1, …, n}. For a vector v, we denote its i-th element by v_i; though occasionally we will also use v(i) for the same purpose, i.e., we identify v_i and v(i) in the obvious way. For two vectors u and v, we define u ≤ v if and only if u_i ≤ v_i for all i. Further, we write u < v when u ≤ v and u ≠ v. For S ⊆ [n], we write v_S to denote the |S|-dimensional vector obtained by restricting v to components with index in S: v_S = (v_i)_{i∈S}. We use 1 to denote the all-one vector whose dimension will be clear from the context; for a nonempty set S, we also use 1_S to denote the |S|-dimensional all-one vector. We let S^c denote the complement of S (when S^c is used, the base set that the complement is taken with respect to should be clear from the context). The indicator function will be denoted by 𝕀. For A ⊆ ℝ^d, cl(A) denotes its topological closure, while int(A) denotes its interior.
2 Problem setup
Fix two positive integers, and . A problem instance of structured -armed best arm identification with micro-observations is defined by a tuple , where and is an -tuple of distributions over the reals. We let denote the mean of distribution . We shall denote the component functions of by : . The value is interpreted as the payoff of arm and we call the reward map. The goal of the learner is to identify the arm with the highest payoff; it is assumed that this arm is unique. The learner knows the reward map, but is unaware of the distributions and, in particular, of their means. To gain information about them, the learner can query the distributions in discrete rounds indexed by , in a sequential fashion. The learner is also given , a risk parameter (also known as a confidence parameter). The goal of the learner is to identify the arm with the highest payoff using as few observations as possible while keeping the probability of making a mistake below the specified risk level.
A learner is admissible for a given set of problem instances if (i) for any instance from the set, the probability of the learner misidentifying the optimal arm in the instance is below the given fixed risk level ; and (ii) the learner stops with probability one on any instance from the set. The interaction of a learner and a problem instance is shown in Fig. 1.
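To make the protocol concrete, here is a minimal, self-contained Python sketch of the interaction loop; the names (UniformLearner, run_interaction) and the toy uniform-sampling learner are ours, for illustration only, and the actual algorithms of the paper appear in later sections.

```python
class UniformLearner:
    """Toy learner: it samples every micro-observable n times, then stops
    and recommends the arm with the highest empirical payoff.  Here the
    reward map is the identity, i.e., arm i's payoff is the mean of
    micro-observable i."""

    def __init__(self, K, n):
        self.K, self.n = K, n
        self.sums = [0.0] * K
        self.counts = [0] * K

    def choose(self, t):
        return (t - 1) % self.K          # round-robin choice of observable

    def update(self, m, x):
        self.sums[m] += x
        self.counts[m] += 1

    def stop(self, delta):
        return all(c >= self.n for c in self.counts)

    def recommend(self):
        means = [s / c for s, c in zip(self.sums, self.counts)]
        return max(range(self.K), key=lambda i: means[i])


def run_interaction(learner, sample, delta=0.05):
    """Fixed-confidence protocol: in each round the learner picks a
    micro-observable, receives a noisy observation of it, and may stop."""
    t = 1
    while True:
        m = learner.choose(t)
        learner.update(m, sample(m))
        if learner.stop(delta):
            return learner.recommend()   # declared best arm
        t += 1
```

The toy learner is of course not admissible for any nontrivial risk level; it only illustrates the choose/observe/stop structure of the protocol.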
As a motivating example, consider the problem of finding the optimal move for the first player in a finite two-player minimax game. The game is finite in the sense that it finishes in finitely many steps (by reaching one of the possible terminating states). The first player has moves. The value of each move is a function of the values of the possible terminating states.
Formally, such a minimax game is described by , where is a non-empty finite set of possible moves, is a finite set of (feasible) histories of moves, the function determines, for each feasible history, the identity of the player on turn, and is a surjection that maps a subset of histories, the set of maximal histories in , to (in particular, note that may map multiple maximal histories to the same terminating state). An element of is maximal in if it is not the prefix of any other history , or, in other words, if it has no continuation in . The set has the property that if then every prefix of with positive length is also in . The first player’s moves are given by the histories in that have unit length. To minimize clutter, without loss of generality (WLOG), we identify this set with .
The function underlying gives the payoffs of the first player. To define it, we use the auxiliary function that evaluates any particular history given the values assigned to terminal states. Given , we define for any . It remains to define : For , . For any other feasible history , , where is the set of immediate successors of in . Thus, when , is the maximum of the values associated with the immediate successors of , while when , is the minimum of these values. We define as the move defining the optimal immediate successor of given . Note that many of the defined functions depend on , but the dependence is suppressed, as we will keep fixed. One natural problem that fits our setting is a (small) game where the payoffs at the terminating states are themselves randomized (e.g., at the end of a game some random hidden information, such as face-down cards, can decide the value of the final state). As explained by Garivier et al. (2016a), the setting may also shed light on how to design better Monte-Carlo Tree Search (MCTS) algorithms, a relatively novel class of search algorithms that has proved highly successful in recent years (e.g., Gelly et al., 2012; Silver et al., 2016).
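The recursive evaluation just described can be sketched as follows; histories are represented as tuples of moves, and the function names (children, player, terminal_value) are our own illustrative choices.

```python
def minimax_value(h, children, player, terminal_value):
    """Evaluate a feasible history h given the terminal-state values.
    children(h) lists the immediate successors of h in the history set
    (empty iff h is maximal); player(h) is 1 at maximizing turns and 2 at
    minimizing turns; terminal_value(h) is the value assigned to the
    terminating state reached by a maximal history h."""
    succ = children(h)
    if not succ:
        return terminal_value(h)
    vals = [minimax_value(c, children, player, terminal_value) for c in succ]
    return max(vals) if player(h) == 1 else min(vals)
```

The value of each first-player move is then minimax_value applied to the corresponding unit-length history, and the optimal move is the one maximizing this value.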
3 Lower bound: General setting
In this section we prove a lower bound for the case of a fixed map. Our results can easily be extended to other sufficiently rich families of distributions.
For the next result, assume without loss of generality that . Fix a learner (policy) , which maps histories to actions. For simplicity, we assume that is deterministic (the extension to randomized algorithms is standard). Let be the set of (infinite) sequences of observable-index and observation pairs so that for any , , and . We equip with the associated Lebesgue -algebra . For an infinite sequence , we let be the round index when the algorithm stops (we let if the algorithm never stops on ). Thus, . Similarly, define to be the choice of the algorithm when it stops, where we define in case .
The interaction of a problem instance (uniquely determined by ) and the learner (uniquely determined by the associated policy ) induces a unique distribution over the measurable space , where we agree that in rounds with index , the algorithm chooses arm , while the observation distributions are modified so that the observation is deterministically set to zero. We will also use to denote the expectation operator corresponding to .
To appease the prudent reader, let us note that our statements will always be concerned with events that are subsets of the event and as such they are not affected by how we specify the “choices” of the algorithm and the “responses” of the environment for . Take, as an example, the expected number of steps that takes in an environment , , which we bound below. Since we bound this only in the case when the algorithm is admissible, which implies that , we have , which shows that the behavior of outside of is immaterial for this statement. The choices we made for (for the algorithm and the environment) will however be significant in that they simplify a key technical result.
To state our result, we need to introduce the set of significant departures, , from . This set contains all vectors such that the best arm under is not arm . Formally,
[Lower bound] Fix a risk parameter . Assume that is admissible over the instance set at the risk level . Define
Then, . The proof can be shown to reproduce the result of Garivier and Kaufmann (2016) (see page 6 of their paper) when the setting is best arm identification. The proof uses standard steps (e.g., Auer et al., 2002; Kaufmann et al., 2016) and one of its main merits is its simplicity. In particular, it relies on two information-theoretic results: a high-probability Pinsker inequality (Lemma 2.6 from Tsybakov, 2008) and a standard decomposition of divergences. The proof is given in Appendix B (all proofs omitted from the main body can be found in the appendix).
[Minimal significant departures ()] From the set of significant departures one can remove all vectors that componentwise dominate, in absolute value, some other significant departure without affecting the lower bound. To see this, write the lower bound as , where . Then, if are such that then . Hence, where .
4 Lower bound for minimax games
In this section we prove a corollary of the general lower bound of the previous section in the context of minimax games; the question being what role the structure of a game plays in the lower bound. For this section fix a minimax game structure (cf. Section 2). We first need some definitions:
Definition 1 (Proof sets)
Take a minimax game structure with first moves and terminal states. Take . A set is said to be sufficient for proving upper bounds on the value of move if for any and , implies . Symmetrically, a set is said to be sufficient for proving lower bounds on the value of move if for any and , implies .
We will call the sets satisfying the above definition upper (resp., lower) proof sets, denoted by (resp., ). Proof sets are closely related to conspiracy sets (McAllester, 1988), which form the basis of proof-number search (Allis, 1994; Kishimoto et al., 2012). In a minimax game tree, a conspiracy set of a node is the set of leaves that must change their evaluation value to cause a change in the minimax value of that node. Proof sets are also related to cuts in alpha–beta search (Russell and Norvig, 2010).
One can obtain minimal upper proof sets that belong to in the following way: Let denote the set of histories that start with move . Consider a non-empty that satisfies the following properties: (i) if and (minimizing turn) then ; (ii) if and (maximizing turn) then . Call the set of that can be obtained this way . From the construction of we immediately get the following proposition:
Take any as above. Then, .
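The construction above can be sketched in code, under our reading of properties (i) and (ii): to certify an upper bound on a node's value, every successor of a maximizing node is needed, while a single successor of a minimizing node suffices. The function and parameter names are ours.

```python
def upper_proof_leaves(h, children, player, pick=lambda succ: succ[0]):
    """Collect the leaves of one minimal upper proof set for the subtree
    rooted at history h.  At a maximizing turn all successors are kept; at
    a minimizing turn a single successor suffices, chosen by `pick`.
    A lower proof set is obtained symmetrically (swap the players' roles)."""
    succ = children(h)
    if not succ:
        return {h}
    if player(h) == 1:        # maximizing turn: every successor is needed
        leaves = set()
        for c in succ:
            leaves |= upper_proof_leaves(c, children, player, pick)
        return leaves
    return upper_proof_leaves(pick(succ), children, player, pick)
```

Different choices of `pick` enumerate the different minimal upper proof sets obtainable this way.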
A similar construction and statement applies in the case of , resulting in the set . Our next result will imply that the lower bound is achieved by considering departures of a special form, related to proof sets:
Proposition (Minimal significant departures for minimax games)
Assume WLOG that . Let
Note that the second inclusion shows that replacing by in the definition of would only decrease the value of , while the first inclusion shows that the value actually does not change. The following lemma, characterizing minimal departures, is essential for the proof of the proposition above:
Take any , and assume WLOG that . Then, there exist , and such that
if ; if ;
, either or .
The preceding results imply the following:
Let be a valuation and assume WLOG that . Let . Then,
Hence, for any algorithm admissible over the instance set at the risk level , is at least as large as the right-hand side of the above display.
5 Upper bound
In this section we propose an algorithm generalizing the LUCB algorithm of Kalyanakrishnan et al. (2012) and prove a theoretical guarantee for the proposed algorithm’s sample complexity under some (mild) assumptions on the structure of the reward mapping . Our result is inspired by and extends the results of Garivier et al. (2016a) (who also started from the LUCB algorithm) to the general setting proposed in this paper. In Section 6 we give a version of the algorithm presented here that is specialized to minimax games and refine the upper bound of this section, highlighting the advantages of the extra structure of minimax games.
In this section we shall assume that the distributions are subgaussian with a common parameter, which we take to be one for simplicity:
Assumption 1 (1-Subgaussian observations)
For any , ,
We will need a result on anytime confidence intervals for martingales with subgaussian increments. To state this result, let be a filtration over the probability space holding our random variables and introduce . This result appears as (essentially) Theorem 8 in the paper by Kaufmann et al. (2016), who also cite precursors:
Lemma (Anytime subgaussian concentration)
Let be an -adapted -subgaussian martingale difference sequence (i.e., for any , is -measurable, , and ). For , let , while for and we let
Then, for any : (Note that is also defined for . The value used there is arbitrary: it plays no role in the current result. The reason we define for is that it simplifies some subsequent definitions.)
For a fixed , let denote the number of observations taken from up to time . Define the confidence interval for as follows: We let
the empirical mean of observations from to be the center of the interval (when , we define ) and
where is as defined above (note that when , the confidence interval is ). Let be the index of the round when the soon-to-be-proposed algorithm stops (or if it does not stop). Let be the “good” event that, before the algorithm stops, the proposed intervals all contain for all . One can easily verify that, regardless of the choice of the algorithm (i.e., of the stopping time ),
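As an illustration, the confidence interval just described can be computed as follows; the exploration rate beta used here is a generic choice of the usual logarithmic form, and its constants are not the paper's.

```python
import math

def conf_interval(obs_sum, count, t, delta):
    """Anytime confidence interval for the mean of a 1-subgaussian
    micro-observable: `count` observations so far summing to `obs_sum`,
    current round t, risk level delta.  The exploration rate beta below is
    an illustrative log(2 t^2 / delta) choice, not the paper's exact one."""
    if count == 0:
        return (-math.inf, math.inf)     # no data yet: vacuous interval
    mean = obs_sum / count
    beta = math.log(2.0 * t * t / delta)
    radius = math.sqrt(2.0 * beta / count)
    return (mean - radius, mean + radius)
```

The interval is centered at the empirical mean and shrinks at the usual rate as the observation count grows.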
For define . With this definition, if we let then for any , holds for any on . Thus, is a valid, -level confidence set for . For general , these sets may have a complicated structure. Hence, we will adopt the following simplifying assumption:
Assumption 2 (Regular reward maps)
The following hold:
The mapping is monotone with respect to the partial order of vectors: for any , implies ;
For any , , , the set is non-empty.
We will also let . Note that the assumption is met when is the reward map underlying minimax games (see the next section). The second assumption could be replaced by the following weaker assumption without essentially changing our result: with some , , for any , , for some . The point of this assumption is that by guaranteeing that all intervals on the micro-observables shrink, the interval on the arm-rewards will also shrink at the same rate. We expect that other ways of weakening this assumption are also possible, perhaps at the price of slightly changing the algorithm (e.g., by allowing it to use even more micro-observations per round).
At time , let
( stands for the candidate “best” arm, for the best “contender” arm). Based on the above assumption, we can now propose our algorithm, LUCB-micro (cf. Algorithm 1). Following the idea of LUCB, LUCB-micro chooses and in an effort to separate the highest lower bound from the best competing upper bound. (Using a lower bound departs from the choice of LUCB, which would use to define . The reason for this departure is that we found it easier to work with a lower bound; we expect the two versions, the original and our choice, to behave similarly.) To decrease the width of the confidence intervals, both for and , a micro-observable is chosen with the help of Assumption 2(ii). This can be seen as a generalization of the choice made in Maximin-LUCB by Garivier et al. (2016a). Here, we found that the specific way Maximin-LUCB’s choice is made considerably obscures the idea behind it, which one can perhaps attribute to the fact that the two-move setting makes it possible to write the choice in a more-or-less direct fashion.
It remains to specify the ‘Stop()’ function used by our algorithm. For this, we propose the standard choice (as in LUCB):
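Putting the pieces together, one round of LUCB-micro's arm selection and the LUCB-style stopping test can be sketched as follows, for a monotone reward map per Assumption 2(i); function and variable names are ours.

```python
def lucb_micro_step(f, lows, highs, K):
    """One round of arm selection in the spirit of LUCB-micro (a sketch).
    f(i, v) evaluates arm i's reward map at a vector v of micro-observable
    values; lows/highs are the current elementwise confidence bounds on the
    micro-observables.  By monotonicity (Assumption 2(i)), arm i's payoff
    lies in [f(i, lows), f(i, highs)]."""
    L = [f(i, lows) for i in range(K)]
    U = [f(i, highs) for i in range(K)]
    b = max(range(K), key=lambda i: L[i])        # candidate best arm
    rest = [i for i in range(K) if i != b]
    c = max(rest, key=lambda i: U[i])            # best contender arm
    stop = L[b] >= U[c]                          # LUCB-style stopping test
    return b, c, stop
```

When the stopping test fails, the algorithm samples a micro-observable for each of the two selected arms (via Assumption 2(ii)) and the loop repeats.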
All statements in this section assume that the assumptions stated so far in this section hold.
The following proposition is immediate from the definition of the algorithm.
On the event , LUCB-micro returns correctly: .
Let denote the round index when LUCB-micro stops (the number of observations, or number of rounds as per Fig. 1, taken by LUCB-micro until it stops is ), and define and , where we assumed that . The main result of this section is a high-probability bound on , which we present next. The following lemma is the key to the proof:
Let . Then, on , there exists such that and .
The proof follows standard steps (e.g., Garivier et al. 2016a). In particular, the above lemma implies that if then for , and . This in turn implies that for , cannot be too large.
[LUCB-micro upper bound] Let
Then, for , on the event , the stopping time of LUCB-micro satisfies . Note that and thus is well-defined. Furthermore, letting , for sufficiently small and sufficiently large, elementary calculations give
The constant acts as a hardness measure of the problem. The result above can be applied to the best arm identification problem in the multi-armed bandit setting, as that is a special case of our problem setup. Compared to state-of-the-art results available for this setting, our bound is looser in several ways: we lose on the constant factor multiplying (Kalyanakrishnan et al., 2012; Jamieson and Nowak, 2014; Jamieson et al., 2014; Kaufmann et al., 2016), we lose an additive term of (Chen and Li, 2015), and we lose terms on the suboptimal arms (Simchowitz et al., 2017). Comparing with the only result available in the two-move minimax tree setting, due to Garivier et al. (2016a), our bound is looser than their Theorem 1. This motivates the refinement of this result to the minimax setting, which is done in the next section, where we recover the mentioned result of Garivier et al. (2016a). On the positive side, our result is more generally applicable than any of the mentioned results. It remains an interesting challenge to prove an upper bound for this or some other algorithm that matches the mentioned state-of-the-art results when the general setting is specialized.
6 Best move identification in minimax games
In this section we will show upper bounds on the number of observations LUCB-micro takes in the case of minimax game problems. We still assume that the micro-observations are subgaussian (Assumption 1) and the optimal arm is unique. To apply our result, this leaves us with showing that the payoff function in the minimax game satisfies the regularity assumption (Assumption 2).
Fix a minimax game structure . We first show that Property (i) of Assumption 2 holds. This follows easily from the following lemma, which can be proven by induction based on “distance from the terminating states”.
For any and such that , .
From this result we immediately get the following corollary:
For , , , per Property (ii) of Assumption 2, we need to show that the sets are nonempty. For a history and , we denote its length- prefix by . We give an algorithmic demonstration, which also shows how to efficiently pick an element of these sets. The resulting algorithm is called (cf. Algorithm 2). We define it in a recursive fashion: for each nonmaximal history, the algorithm extends the history by adding the move that is optimal for at minimizing moves, and by adding the move that is optimal for at maximizing moves, and then calls itself with the new history. The algorithm returns when its input is a maximal history. To show that , we have the following result:
Fix , , and . Let and in particular let . Then, for all ,
where is the length-k prefix of .
We immediately get that is an element of :
For as in the previous result, setting , , hence .
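The recursive descent described above can be sketched as follows. Which bound is propagated (the upper or the lower confidence bounds of the leaves) depends on whether the arm's upper or lower payoff bound is being tightened; this reading, and all names, are ours.

```python
def find_leaf(h, children, player, leaf_bound):
    """Return a terminal history attaining the minimax value of the subtree
    rooted at h when leaves are scored by leaf_bound (e.g., the current
    upper confidence bounds of the leaves, when tightening an arm's upper
    payoff bound).  A sketch of the recursive descent."""
    def value(g):
        succ = children(g)
        if not succ:
            return leaf_bound(g)
        vals = [value(c) for c in succ]
        return max(vals) if player(g) == 1 else min(vals)

    succ = children(h)
    if not succ:
        return h                       # maximal history reached
    pick = max if player(h) == 1 else min
    return find_leaf(pick(succ, key=value), children, player, leaf_bound)
```

As written, the subtree values are recomputed at every level; a practical implementation would memoize `value` or maintain the bounds incrementally.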
Let , be as in Section 5. If is the stopping time of LUCBMinMax running on a minimax game search problem then .
When applied to a minimax game, as defined in Section 2, the upper bound above is loose and can be further improved, as shown in the result below. To state this result we need some further notation. Given a set of reals , define the “span” of as . For a path that connects some move in and some move in : with some , and . Finally, for such that there is a unique path satisfying , define . Let be the empty set if there are multiple such that .
[LUCBMinMax on MinMax Trees] Let
Then, on , the stopping time of LUCBMinMax satisfies . Note that this result recovers Theorem 1 of Garivier et al. (2016a). To see this, note that for every leaf (as numbered in their paper), . Also note that , thus . Therefore, .
7 Discussion and Conclusions
The gap between the lower bound and the upper bound
There is a substantial gap between the lower and the upper bound. Besides the discrepancies that already exist in the multi-armed bandit setting, which have been mentioned before, a further gap is specific to our setting: in particular, it is not hard to show that in regular minimax game trees with a fixed branching factor of and depth , the upper bound scales with while the lower bound scales with . One potential remedy is to improve the lower bound so as to consider adversarial perturbations of the values assigned to the leaf nodes: that is, after the algorithm is fixed, an adversary can perturb the values of to maximize the lower bound. Simchowitz et al. (2017) introduce an interesting technique for proving lower bounds of this form and demonstrate nontrivial improvements in the multi-armed bandit setting.
Does the algorithm need to explore all leaves?
The hardness measure is rooted in a uniform bound that suggests that all the leaves must potentially be pulled, which may not be necessary for particular structures. In such cases, the algorithm may be able to benefit from the specific structure of , saving exploration on some leaves. We present one example where is a minimax game tree, as in Figure 2. Assume that for and . A reasonable algorithm would sample each arm once, then discover that the values of the other arms are much smaller than that of the sampled leaf under arm 1. The algorithm would then continue to explore the other leaves of arm 1, and decide that arm 1 is the best arm. This behavior is also in agreement with our lower bound, where the resulting constraints are:
which implies and for if is large enough. As we can see from this example, (out of ) leaves need no exploration at all. On the other hand, although we do not have a tight upper bound, in practice our algorithm explores only the remaining leaves under arm 1 in the next rounds, and then makes the right decision.
In general, we would expect a problem with a feedforward neural network structure to be easier than one with a tree structure, as the sharing of leaves provides more information and thus saves exploration. This is illustrated in Fig. 3, where an optimal arm can be identified solely based on the network structure, so the algorithm requires zero samples for all possible . Note that our lower bound does not fail, as we have here.
Appendix A The Uniqueness Assumption
Recall that throughout the paper, by definition, we have the following assumption:
The instance is such that is unique.
We state this assumption explicitly here, so that we can refer to it easily throughout the appendix.
Appendix B Proofs for Section 3
We start with the two information-theoretic results mentioned in the main body of the text. To state these results, let denote the Kullback–Leibler (KL) divergence of two distributions and . Recall that this is when is absolutely continuous with respect to and is infinite otherwise. For the next result, let denote the number of times the micro-observable with index is observed before time .
Lemma (Divergence decomposition)
For any it holds that
Note that on the right-hand side is the KL divergence between normal distributions with means and , both having unit variance. The result naturally holds for other distributions as well. This is the result that relies strongly on the fact that we forced the same observations and observation-choices for ; in particular, this is what makes the left-hand side of (7) finite! The proof is standard and hence is omitted.
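For unit-variance Gaussian observations, the decomposition takes the familiar form below, writing N_m for the number of observations of micro-observable m up to the stopping time and mu, mu' for the two mean vectors (the notation here is ours):

```latex
\mathrm{KL}\left(\mathbb{P}_{\mu,\pi},\, \mathbb{P}_{\mu',\pi}\right)
  \;=\; \sum_{m} \mathbb{E}_{\mu,\pi}\!\left[N_m\right]\,
        \frac{(\mu_m - \mu'_m)^2}{2}\,,
```

where the per-observable factor is the KL divergence between the two unit-variance normal distributions.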
Lemma (High probability Pinsker, e.g., Lemma 2.6 from (Tsybakov, 2008))
Let and be probability measures on the same measurable space and let be an arbitrary event. Then,
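For reference, the inequality in question can be stated as follows: for probability measures P_1, P_2 on the same measurable space and any event A,

```latex
P_1(A) + P_2(A^{c}) \;\ge\; \tfrac{1}{2}\exp\bigl(-\mathrm{KL}(P_1, P_2)\bigr).
```

This is the form given in Lemma 2.6 of Tsybakov (2008), sometimes attributed to Bretagnolle and Huber.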
[Proof of the lower bound of Section 3] WLOG, we may assume that is non-empty. Pick any and let . Let . Since is admissible, . Further, since , is not an optimal arm in . Hence, again by the admissibility of , . Therefore, by the two lemmas above,
The result follows by continuity, after noting that and that was arbitrary.
Appendix C Proofs for Section 4
We start with the following lemma:
Pick any , . Then,