Thompson Sampling Guided Stochastic Searching on the Line for Deceptive Environments with Applications to Root-Finding Problems

by   Sondre Glimsdal, et al.
Universitetet Agder

The multi-armed bandit problem forms the foundation for solving a wide range of on-line stochastic optimization problems through a simple, yet effective mechanism. One simply casts the problem as a gambler that repeatedly pulls one out of N slot machine arms, eliciting random rewards. Learning of reward probabilities is then combined with reward maximization, by carefully balancing reward exploration against reward exploitation. In this paper, we address a particularly intriguing variant of the multi-armed bandit problem, referred to as the Stochastic Point Location (SPL) Problem. The gambler is here only told whether the optimal arm (point) lies to the "left" or to the "right" of the arm pulled, with the feedback being erroneous with probability 1-π. This formulation thus captures optimization in continuous action spaces with both informative and deceptive feedback. To tackle this class of problems, we formulate a compact and scalable Bayesian representation of the solution space that simultaneously captures both the location of the optimal arm as well as the probability of receiving correct feedback. We further introduce the accompanying Thompson Sampling guided Stochastic Point Location (TS-SPL) scheme for balancing exploration against exploitation. By learning π, TS-SPL also supports deceptive environments that are lying about the direction of the optimal arm. This, in turn, allows us to solve the fundamental Stochastic Root Finding (SRF) Problem. Empirical results demonstrate that our scheme deals with both deceptive and informative environments, significantly outperforming competing algorithms both for SRF and SPL.




1 Introduction

Research on the Stochastic Point Location (SPL) problem [1] has delivered increasingly efficient schemes for locating the optimal point on a line. In brief, the optimal point must be found by iteratively proposing candidate points, with each candidate revealing whether the optimal point lies to the candidate’s left or to its right. The provided directions can be erroneous, and the goal is to locate the optimal point with as few non-optimal candidate proposals as possible. The SPL problem can also be cast as an agent that moves on a line, attempting to locate a particular unknown location. The agent communicates with a teacher that notifies the agent whether its current location is greater or lower than the location sought. However, the teacher is of a stochastic nature and, with probability 1-π, feeds the agent erroneous feedback.

Despite the simplicity of the SPL problem, SPL schemes have provided novel solutions for a wide range of problems. Intriguing applications include estimation of non-stationary binomial distributions [2], communication network routing [3], and meta-optimization [4]. Furthermore, recent research that addresses the related Stochastic Root-Finding (SRF) problem provides promising solutions for parameter estimation, transportation system optimization, as well as supply chain optimization [5, 6].

State-of-the-art. Adaptive Step Searching (ASS) [7] is currently the leading approach to solving SPL problems, although it is outperformed by Hierarchical Stochastic Searching on the Line (HSSL) [8] in highly volatile non-stationary environments [7]. Optimal Computing Budget Allocation (OCBA) has also been applied to SPL [9] and provides stable solutions while converging slightly slower than ASS. Unfortunately, these state-of-the-art schemes fail once the noise grows to the point where the majority of the obtained directions mislead rather than guide. Indeed, by naively following the directions provided under such circumstances, one is systematically led away from the optimal point. We refer to such problem environments as deceptive environments, as opposed to informative ones, to be further clarified below.

To the best of the authors’ knowledge, the pioneering CPL-AdS [10] scheme was the first known approach handling deceptive SPL environments. CPL-AdS relies on two consecutive phases. In the first phase, a sequence of intelligently selected questions is used to classify the environment as either informative or deceptive. By spending a sufficient amount of time in this phase, the classification can be made arbitrarily accurate. In the second phase, a regular SPL scheme is applied, except that the directions obtained are reversed if the problem environment was classified as deceptive in the first phase. This means that the scheme may have to remain in the first phase for an extensive amount of time to ensure that the problem environment is correctly classified; otherwise, one risks being systematically misled in the second phase. These properties largely render CPL-AdS inappropriate for on-line or any-time problem solving.

Recently, HSSL has been extended by Zhang et al. to cover both informative and deceptive environments, using a Symmetric HSSL (SHSSL) [11]. This scheme essentially runs two HSSL schemes in conjunction: a regular one that handles informative environments, and one where all feedback from the environment is inverted to handle deceptive environments. The hierarchy navigation capabilities of HSSL are then exploited to allow SHSSL to switch between the two HSSLs, depending on the nature of the environment. However, a significant limitation of HSSL, namely, that π must be larger than the conjugate of the golden ratio, carries over to SHSSL. Indeed, SHSSL fails to converge for part of the feasible range of π. This is in contrast to the approach we propose in this paper, as well as to CPL-AdS [10], since both of these schemes operate along the whole range of π (apart from π = 0.5).

To cast further light on the challenges lined out above, we here introduce the N-Door Puzzle as a framework for modeling deception. We further propose an accompanying novel solution scheme — Thompson Sampling guided Stochastic Point Location (TS-SPL). The TS-SPL scheme handles both SPL and SRF problems, and is capable of simultaneously solving the problem as well as determining whether we are dealing with an informative or a deceptive environment. As we shall see, not only does this scheme handle an arbitrary level of noise, but it also outperforms current state-of-the-art techniques in both informative and deceptive environments.

The N-Door Puzzle. In the book ”To Mock a Mockingbird” [12] the following puzzle is formulated: ”Someone was sentenced to death, but since the king loves riddles, he threw this guy into a room with two doors. One leading to death, one leading to freedom. There are two guards, each one guarding one door. One of the guards is a perfect liar, the other one will always tell the truth. The man is allowed to ask one guard a single yes-no question and then has to decide, which door to take. What single question can he ask to guarantee his freedom?” To avoid spoiling the puzzle for the reader, we omit the solution here and note that asking a double negative question will often be the correct course of action for these types of puzzles.

The above puzzle can be generalized by increasing the number of doors. Instead of deciding between merely two doors, the prisoner now faces N doors, with a guard posted between each pair of doors. Only a single door leads to freedom; the remaining doors lead to death. At sunrise each day the prisoner is allowed to ask one of the guards whether the door leading to freedom is to the guard’s left or to the guard’s right. However, only a fixed proportion of the guards answer truthfully; the rest are compulsive liars. Further, the guards are randomly assigned a position each sunrise, and thus, knowing who lies and who tells the truth is impossible. As an additional complication, depending on the mood of the king, the prisoner may be ordered to walk through one door of his choosing on an arbitrary day. Therefore, to save his life, it is imperative that the prisoner determines as quickly as possible which door leads to freedom.

Specifically, let π be the fraction of truthful guards. Since the guards are randomly assigned a position each day, the probability of obtaining a truthful answer is governed by π. If π < 0.5, then the majority of the guards are compulsive liars, and the guards as an entity can be characterized as being deceptive. Conversely, if π > 0.5, then the majority of the guards are truthful and the guards can be seen as informative. For completeness, we mention that the puzzle is unsolvable for the case where π is exactly equal to 0.5, since it then becomes impossible to gain information on either the nature of the doors or the guards.

Thompson Sampling. The Thompson Sampling (TS) principle was introduced by Thompson as early as 1933 [13] and now forms the basis for several state-of-the-art approaches to the Multi-Armed Bandit (MAB) problem, a fundamental sequential resource allocation problem that has challenged researchers for decades. At each time step in MAB, one is offered to pull one out of several bandit arms, which in turn triggers a stochastic reward. Each arm has an underlying probability of providing a reward; however, these probabilities are unknown to the decision maker. The challenge is thus to decide which of the arms to pull at every time step, so as to maximize the expected total number of rewards obtained [14].

In all brevity, TS seeks to achieve the above goal by quickly shifting from exploring reward probabilities to maximizing the number of rewards obtained. This is achieved by recursively estimating the underlying reward probability of each arm, using Bayesian filtering of the rewards obtained thus far. TS then simply selects the next arm to pull based on the Bayesian estimates of the reward probabilities (one reward probability density function per arm).

The arm selection strategy of TS is rather straightforward, yet surprisingly efficient. To determine which arm to pull, a single candidate reward probability is sampled from the probability density function of each arm. The arm whose sampled value is the highest is the one pulled next. The outcome of pulling that arm is in turn used to perform the next Bayesian update of the arm’s reward probability estimate. It is this simple scheme that makes TS select arms with frequency proportional to the posterior probability of being optimal, leading to quick convergence towards always selecting the optimal arm.
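The arm selection loop described above can be sketched for Bernoulli rewards as follows. This is a minimal illustration rather than the implementation evaluated in the paper: the Beta(1, 1) prior, the function name `thompson_bernoulli`, and the reward probabilities in the usage note are all assumptions made for the example.

```python
import numpy as np

def thompson_bernoulli(true_probs, n_steps, seed=0):
    """Run Thompson Sampling on Bernoulli arms and return the pull count per arm."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    wins = np.ones(k)    # Beta posterior parameter alpha (assumed Beta(1, 1) prior)
    losses = np.ones(k)  # Beta posterior parameter beta
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_steps):
        # Sample one candidate reward probability per arm from its posterior.
        samples = rng.beta(wins, losses)
        # Pull the arm whose sampled value is the highest.
        arm = int(np.argmax(samples))
        reward = int(rng.random() < true_probs[arm])
        # Bayesian update of the pulled arm only.
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

Calling, for instance, `thompson_bernoulli([0.2, 0.5, 0.8], 2000)` returns the number of pulls per arm; over time, most pulls should concentrate on the arm with the highest reward probability.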

TS has turned out to be among the top performers for traditional MAB problems [15, 16], supported by theoretical regret bounds [17, 18]. It has also been successfully applied to contextual MAB problems [19], Constrained Gaussian Process optimization [20], Distributed Quality of Service Control in Wireless Networks [21], and Cognitive Radio Optimization [22], as well as serving as a foundation for solving the Maximum a Posteriori Estimation problem.


Pure Exploration Bandits. Throughout this paper we assume that each SPL problem potentially takes part in a larger system consisting of multiple SPL problems, rather than necessarily operating in isolation. From existing applications in the literature, such as web crawler balancing [24], it is clear that the value of an SPL scheme hinges upon its ability to cooperate and interact with other decision makers. Such cooperation demands predictable behaviour from the individual decision makers, as well as coordinated balancing of exploring new solution candidates against maintaining good solution candidates. Without such an ability, the system as a whole will not be able to systematically move towards the more promising areas of the search space, gradually focusing in on an optimal configuration. Therefore, in this paper we omit a direct comparison with schemes that rely on a "fixed sampling then decide" approach, such as Unimodal Bandits [25]. For the same reason, we will not investigate pure exploration bandits [26, 27, 28, 29, 30], that is, bandits that have a pre-defined finite time horizon and whose performance is only measured at the end of that horizon. Consequently, such algorithms are free to explore without any negative impact. These types of algorithms have been shown to outperform traditional exploration-exploitation bandits such as TS and UCB in scenarios where exploitation is not required.

Footnote 1: There also exists a wide spectrum of techniques and schemes in the literature on the topic of searching with noise. See for instance [31] for a comprehensive survey. These are unable to handle unknown and deceptive environments with stochastic directional feedback, and are therefore not directly comparable to SPL solution schemes. We have therefore not included this class of techniques in the present paper.

Paper Contributions. We introduce a novel scheme for solving the SPL problem, namely Thompson Sampling guided Stochastic Point Location (TS-SPL), built on a compact and scalable Bayesian representation of the solution space. This representation simultaneously captures both the location of the optimal point (bandit arm) and the probability of receiving correct feedback. By learning π, TS-SPL also supports deceptive environments that lie about the direction of the optimal arm, which in turn allows us to solve the fundamental Stochastic Root Finding (SRF) problem. More specifically, the contributions of the paper can be summarized as follows:

  1. We introduce the novel TS-SPL scheme that represents the solution space of N-Door Puzzles, and thus SPL problems, in terms of a Bayesian model. As opposed to competing solutions that merely maintain and refine a single candidate solution, our Bayesian model encompasses the complete space of candidate solutions at every time instant. This Bayesian representation of the problem opens up for efficient exploration and exploitation of the solution space with Thompson Sampling.

  2. We formulate a compact and scalable Bayesian representation of the solution space that simultaneously captures both the location of the optimal point (arm), as well as the probability of receiving correct feedback.

  3. We link TS-SPL to so-called Stochastic Bisection Search, and unify accompanying methods under the umbrella of Thompson Sampling.

  4. Similarly, we enhance Soft Generalized Binary Search (SGBS), Probabilistic Bisection Search (PBS) and Burnashev-Zigangirov Algorithm (BZ) by introducing novel parameter free solutions that take advantage of our Bayesian model of the N-Door Puzzle/SPL problem. This approach eliminates previous reliance on prior knowledge of the degree of noise affecting the system to be optimized.

  5. We finally demonstrate the empirical performance of TS-SPL for both SPL and SRF problems. TS-SPL outperforms state-of-the-art algorithms in both informative and deceptive environments, except that it is beaten by the SGBS and BZ schemes with correctly specified observation noise.

Paper Outline. The paper is organized as follows. In Section 2, we present our scheme for Thompson Sampling guided Stochastic Point Location (TS-SPL). We first introduce the Bayesian model of the N-Door Puzzle. Based on the Bayesian model, we then formulate a TS based scheme that balances solution space exploration against reward maximization. We further extend selected state-of-the-art solution schemes with the Bayesian model that TS-SPL employs. This extension removes the need for prior information on observation noise. Then, in Section 3, we provide extensive empirical results comparing TS-SPL with state-of-the-art schemes for both SPL and SRF. We conclude in Section 4 and point to promising avenues for further work.

2 Thompson Sampling guided Stochastic Point Location (TS-SPL)

In this section, we introduce the Thompson Sampling guided Stochastic Point Location (TS-SPL) scheme. At the core of TS-SPL we find a Bayesian model of the N-Door Puzzle (introduced in Section 1).

Formally, we represent an N-Door Puzzle instance as a tuple consisting of the set of doors and the truthfulness of the guards. A novel aspect of TS-SPL is that instead of maintaining a single candidate solution, or a limited set of them, we maintain a posterior distribution over the whole solution space, that is, over all possible puzzle instances. This distribution is conditioned on the feedback obtained up to the current time step, allowing us to home in on the particular N-Door Puzzle instance faced as the number of time steps increases, ultimately converging to it.

Assuming no prior information, we assign a uniform distribution over the solution space, i.e., all puzzle instances are equally probable. By gradually refining the posterior distribution, we can select guards to question in a goal-directed manner. In brief, we sample a solution candidate from the posterior, selecting the guard to the left or to the right of the sampled door. The answer of the selected guard is then used to update our posterior distribution. By repeating this procedure, the expected posterior probability of the underlying N-Door Puzzle instance increases monotonically, reducing the probability of other puzzle instances. In effect, given enough iterations, TS-SPL will correctly identify the door leading to freedom as the posterior probability of the true instance approaches unity.

2.1 Bayesian Model of the N-Door Puzzle

The main purpose of the Bayesian model is to facilitate efficient calculation of a posterior distribution over the possible N-Door Puzzle instances. Since the prisoner does not initially know which problem instance he is facing, and since the observations are stochastic, we cast the door leading to freedom and the truthfulness of the guards as two random variables. We further assume that these two random variables are independent of each other. Furthermore, the information we obtain from questioning the guards is represented as a set of random variables, with each random variable representing the answer to one question. Finally, we assume that the outcomes of the individual questions are independent when conditioned on the door leading to freedom and the truthfulness of the guards. For each question, we can then compute the probability of the answer ("left" or "right") that we received from the guard, as summarized in Table 1.

Guard position                               P(answer = "left")   P(answer = "right")
Guard to the left of door to freedom         1-π                  π
Guard to the right of door to freedom        π                    1-π

Table 1: Conditional door probabilities

As an example, assume that the truthfulness of the guards is π. Let us further solicit the guard to the left of some door, with the guard replying that the door leading to freedom lies to his left. The answer is then correct with probability π. We can thus infer that each door to the guard's left is consistent with the answer with likelihood π, and that each door to his right is consistent with the answer with likelihood 1-π.

Applying Bayes' theorem to the conditional probabilities defined in Table 1, we are able to derive closed-form expressions for the posterior distributions of both the door leading to freedom and the truthfulness of the guards. The derivation for the door follows (the derivation for the truthfulness is analogous, and is left out here for the sake of brevity):


Above, (2) follows directly from Bayes' theorem. We obtain (3) by marginalizing out the truthfulness of the guards. Eq. (4) is a result of the independence of the door leading to freedom and the truthfulness, and (5) of the independence between the questions. This leads us to the following two equations for updating our knowledge surrounding both the door probabilities (Eq. 6) and the truthfulness of the guards (Eq. 7).
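The steps described above can be written out as follows. The notation is an assumption introduced for this sketch and may differ from the authors' own: X denotes the door leading to freedom, Π the truthfulness of the guards (taken over a discrete grid of values π), and Q = {q_1, ..., q_t} the answers received so far. The four lines apply, in order, Bayes' theorem, marginalization over Π, independence of X and Π, and conditional independence of the answers.

```latex
\begin{align*}
P(X = x \mid Q)
  &= \frac{P(Q \mid X = x)\, P(X = x)}{P(Q)} \\
  &= \frac{\sum_{\pi} P(Q \mid X = x, \Pi = \pi)\, P(\Pi = \pi \mid X = x)\, P(X = x)}{P(Q)} \\
  &= \frac{P(X = x) \sum_{\pi} P(\Pi = \pi)\, P(Q \mid X = x, \Pi = \pi)}{P(Q)} \\
  &= \frac{P(X = x) \sum_{\pi} P(\Pi = \pi) \prod_{i=1}^{t} P(q_i \mid X = x, \Pi = \pi)}{P(Q)}
\end{align*}
```

Under the same assumptions, the two resulting update rules take a symmetric form, with each factor P(q_i | X = x, Π = π) read off Table 1:

```latex
\begin{align*}
P(X = x \mid Q) &\propto P(X = x) \sum_{\pi} P(\Pi = \pi) \prod_{i=1}^{t} P(q_i \mid X = x, \Pi = \pi) \\
P(\Pi = \pi \mid Q) &\propto P(\Pi = \pi) \sum_{x} P(X = x) \prod_{i=1}^{t} P(q_i \mid X = x, \Pi = \pi)
\end{align*}
```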


2.2 Guard Selection

We have now formally established how we can turn information from the guards into a probability distribution over which door leads to freedom. However, as mentioned previously, we here face a trade-off between exploring different doors and zeroing in on the best door found so far. To handle this trade-off we model the door selection as a so-called Global Information MAB (GI-MAB).

To decide on what door to select at each iteration, we solve the GI-MAB by utilizing the principle of TS. Here, the selection process is simply to select a random door with probability proportional to the posterior probability that the door is the one leading to freedom. Once the door has been selected, we need to decide which of the guards to query: the guard to the left or to the right of the selected door. We do this by, again, selecting one of the guards randomly, with probability proportional to the sum of the probabilities of the doors next to each guard. Assume for instance that we have three doors, each with its own posterior probability of leading to freedom. Then, according to the TS principle, these are also the probabilities we use to sample a particular door. Note that since the answer obtained from each guard queried affects the complete probability distribution over the doors (the probability associated with every door is updated), we have a GI-MAB as opposed to a traditional MAB.
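The selection and update steps of this section can be sketched as below. This is an illustrative reading of the text under stated assumptions, not the authors' implementation: doors are indexed 0 to N-1, guard g stands between doors g and g+1, the truthfulness grid uses bin midpoints, "the doors next to each guard" is taken to mean the two doors adjacent to that guard, and all function names are hypothetical.

```python
import numpy as np

def make_prior(n_doors, n_pi):
    """Flat joint prior over (door, truthfulness); the pi grid uses bin midpoints."""
    pis = (np.arange(n_pi) + 0.5) / n_pi
    joint = np.full((n_doors, n_pi), 1.0 / (n_doors * n_pi))
    return pis, joint

def answer_likelihood(n_doors, pis, guard, says_left):
    """P(answer | door, pi) for the guard standing between doors guard and guard+1."""
    freedom_is_left = (np.arange(n_doors) <= guard)[:, None]
    # A truthful guard (probability pi) points towards the door to freedom.
    p_left = np.where(freedom_is_left, pis[None, :], 1.0 - pis[None, :])
    return p_left if says_left else 1.0 - p_left

def update(joint, pis, guard, says_left):
    """Bayesian update of the joint posterior after one answer."""
    post = joint * answer_likelihood(joint.shape[0], pis, guard, says_left)
    return post / post.sum()

def choose_guard(joint, rng):
    """Sample a door from its marginal posterior, then one of its adjacent guards."""
    door_probs = joint.sum(axis=1)
    n_doors = len(door_probs)
    door = rng.choice(n_doors, p=door_probs)
    guards = [g for g in (door - 1, door) if 0 <= g <= n_doors - 2]
    # Assumed reading: weight each guard by the mass of its two adjacent doors.
    weights = np.array([door_probs[g] + door_probs[g + 1] for g in guards])
    return int(rng.choice(guards, p=weights / weights.sum()))

def run(true_door, true_pi, n_doors=16, n_pi=20, n_steps=600, seed=0):
    """Simulate TS-SPL against a stochastic environment with truthfulness true_pi."""
    rng = np.random.default_rng(seed)
    pis, joint = make_prior(n_doors, n_pi)
    for _ in range(n_steps):
        guard = choose_guard(joint, rng)
        truth_left = true_door <= guard
        says_left = truth_left if rng.random() < true_pi else not truth_left
        joint = update(joint, pis, guard, says_left)
    return pis, joint
```

After `pis, joint = run(true_door=5, true_pi=0.1)`, the door marginal is `joint.sum(axis=1)` and the truthfulness marginal is `joint.sum(axis=0)`. Note that in this model a target at either end of the line cannot be distinguished from the opposite end with complementary truthfulness, so the example uses an interior door.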

2.3 Improving State-of-the-Art Schemes with Bayesian Model

A main advantage of TS-SPL compared to similar schemes is the utilization of the Bayesian model that enables TS-SPL to operate without prior problem parameters. Due to TS-SPL’s close connection to the Probabilistic Bisection Search (PBS) [33], Noisy Generalized Binary Search (NGBS) [34] and the BZ algorithm [35], we will here utilize our Bayesian TS-SPL model to also make these other schemes parameter free.

Probabilistic Bisection Search

The goal of Probabilistic Bisection Search (PBS) is to locate an unknown point on a line. To acquire information on the location of this point, one queries an Oracle about the relation between a chosen query point and the unknown point. The Oracle responds by stating whether the unknown point lies to the left or to the right of the query point. If we assume that the Oracle always tells the truth, then the well-known deterministic Bisection Search, which halves the search space with each query, can be employed to efficiently find the unknown point. However, in PBS we assume that the Oracle provides correct answers only with a certain probability, and erroneous ones otherwise.

Footnote 2: In this context this scheme also covers the Stochastic Binary Search [36, 34].

The origin of PBS can be traced to Horstein [33]. In PBS a probability distribution is mapped over the search space and is gradually updated using a Bayesian methodology under the assumption that the environment noise is known a-priori. The search space is then continuously explored using the median of the posterior distribution as the point of interest. It has been shown that PBS has a geometric rate of convergence under the latter assumptions [36].

As the noise is assumed given, one can simply invoke Eq. 8 to calculate the posterior distribution.


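Here, the likelihood term is the conditional probability of obtaining the observed answer. That is, for every query location to the left of the unknown point, the Oracle directs the decision maker to the right with the probability of a correct answer, and to the left otherwise. The converse holds for query locations to the right of the unknown point.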

To explicitly represent PBS’ dependence on knowing the noise level beforehand, we can cast Eq. 8 in terms of Eqs. 6 and 7. The resulting model then becomes identical to TS-SPL, with the major difference that PBS employs the median to explore the search space. We denote this new and improved scheme PBS-M.
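For comparison with the sampling-based selection of TS-SPL, the median-based exploration of classical PBS with a known noise level can be sketched as follows. This is a minimal discrete-grid illustration under assumptions: the names `median_index` and `pbs_update` are hypothetical, `p` is the known probability of a correct answer, and a query at index q is treated as lying between grid cells q and q+1.

```python
import numpy as np

def median_index(pmf):
    """Smallest index whose cumulative probability mass reaches one half."""
    return int(np.searchsorted(np.cumsum(pmf), 0.5))

def pbs_update(pmf, query, says_right, p):
    """Posterior after the Oracle reports the target right (True) or left (False) of query."""
    on_right = np.arange(len(pmf)) > query
    # Cells that agree with the answer are weighted by p, the others by 1 - p.
    like = np.where(on_right == says_right, p, 1.0 - p)
    post = pmf * like
    return post / post.sum()
```

A search loop would repeatedly call `median_index` on the current posterior, query the Oracle at that index, and pass the answer to `pbs_update`. PBS-M would replace the fixed `p` with a posterior over the noise level, as in the joint model above.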

Generalized Binary Search

The Generalized Binary Search (GBS) problem can be formulated as follows [37, 34]. Consider a collection of unique binary-valued functions defined on a common domain, each mapping a point of the domain to a binary label. Assume that there exists an optimal function in the collection that produces the correct binary labeling for each point of the domain. For each queried point, the value of the optimal function is observed, possibly corrupted by independent binary noise. The objective is then to determine the optimal function using as few queries as possible. In this paper we restrict the collection to the class of threshold binary functions, with the effect of turning GBS into the informative N-Door Puzzle.

If the feedback is noiseless then the problem boils down to the combinatorial problem of finding an optimal decision tree over the collection of functions, a problem that Hyafil and Rivest showed to be NP-complete [38, 37].

The Soft-Decision Generalized Binary Search (SDGB-Search) [34, 37] is the state-of-the-art algorithm for finding the optimal function when the probability of binary noise is less than 0.5, i.e., for informative environments.

Similarly to TS-SPL, SDGB-Search employs a probabilistic model that, at each time step, assigns a probability to each candidate function. However, at each time step, it decides which point to query next based on a deterministic heuristic:


SDGB uses the following equation to determine and update at each time step:


Where and is the response from . Simplifying Eq. 10 we observe that represents an AND operator that takes on the value 1 if is equal to and -1 otherwise. Furthermore, we note that since , then one of and will have to take the value , while the other takes the value .
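For orientation, the query rule and the update rule of soft-decision GBS are commonly written as below in the noisy generalized binary search literature. The notation is an assumption and may differ from the one used in this paper: h ranges over the collection H of functions mapping the domain X to {-1, +1}, p_t(h) is the probability assigned to h at time step t, y_t is the observed response to query x_t, and β is the assumed noise level.

```latex
\begin{align*}
x_t &= \arg\min_{x \in \mathcal{X}} \left| \sum_{h \in \mathcal{H}} p_t(h)\, h(x) \right| \\
p_{t+1}(h) &\propto p_t(h)\, \beta^{(1 - z_t(h))/2}\, (1 - \beta)^{(1 + z_t(h))/2},
\qquad z_t(h) = h(x_t)\, y_t
\end{align*}
```

Because z_t(h) is either +1 or -1, exactly one of the two exponents equals 1 and the other equals 0. A function that agrees with the response is therefore weighted by 1 − β, and one that disagrees by β.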

By applying the transformation we can rewrite Eq. 10 as:



This update scheme is identical to the one found in PBS and thus suffers from the same limitation (noise probability is assumed known a priori). In the same manner as we enhanced PBS to utilize a prior over the noise, we can enhance SDGB (using Eq. 6,7) to become a parameter free scheme, again employing our Bayesian TS-SPL scheme. In the following, we will denote this improved version of SDGB as SDGB-M.

Burnashev-Zigangirov Algorithm

The Burnashev-Zigangirov (BZ) Algorithm [35] is one of the most widely used algorithms for solving the discrete PBS problem and has in particular been employed in the context of Active Learning [39, 40]. In BZ, we search for a point that is located on a line. This line is discretized into bins and we are only allowed to query the borders of the bins for the direction of the point sought. The BZ algorithm suffers from the same practical limitation as PBS and SDGB, namely a dependency on knowing the noise level beforehand.

We will now show how BZ can be improved in a similar fashion as PBS and SDGB, leveraging our Bayesian model. At each time step, BZ maintains the probability of the sought point residing in each bin. Together, these probabilities form a probability mass function (pmf) over the bins, with an associated cumulative distribution function (cdf).

To decide which point to investigate next, BZ selects one of the two bin borders closest to the median of the pmf. A binary response is then observed at this point, which is correct with a given probability and erroneous otherwise.

To update the probability distribution over we define as the probability for noise and let and .

For we have

and for

To change BZ into a parameter free scheme we first notice that for any given noise : , , and . After some simple algebraic manipulations, it turns out that the updating scheme of the BZ algorithm is identical to PBS except that:

  1. BZ calculates the normalizing factor as a part of the updating rule instead of using the likelihood value, and then later normalizes as PBS does.

  2. BZ samples on the interval edges while PBS samples the midpoints of each interval.

To obtain an enhanced parameter free version of BZ, we simply replace as a pre-determined constant with a prior distribution that we marginalize out using Eq. 6,7. We denote the resulting scheme BZ-M.

3 Empirical Results

In this section we evaluate the performance of TS-SPL empirically, compared to competing schemes. We investigate both the effect that the various parameter settings have on behavior, as well as the capability of TS-SPL to handle different applications, including Stochastic Point Location and Stochastic Root Finding problems. Unless otherwise noted, the empirical results report the average of 10 000 independent trials.

For some of the applications we investigate here, we do not find any existing scheme that handles deceptive environments. Instead, the schemes we have identified assume that feedback is informative on average. To render comparison fair, we thus introduce TS-SPL-INF, configured with a prior that the feedback is informative. This also serves to exemplify the power of our Bayesian approach, because we can leverage from a prior tailored for the task at hand. Note that this informed prior is equivalent to the priors used for the other probability theory based schemes, PBS-M and SDGB-M.

Further note that we apply a fixed set of parameter values across the whole suite of experiments, set to optimize overall performance. For SHSSL [11] and HSSL [8] we used a tree branching factor of , and for ASS [7] we set and . For OCBA [11] we set and . The priors used for TS-SPL are uniform over the unit interval and are discretized as , and . For the informative schemes TS-SPL-INF, PBS-M, SDGB-M, and BZ-M, we utilize the same prior for the doors as for TS-SPL; however, we use a uniform prior over the interval for truthfulness, with .

We will in the following subsections investigate (1) the effect of different priors on TS-SPL; (2) TS-SPL’s ability to identify the nature of the underlying stochastic environment; (3) the ability to solve the Stochastic Point Location Problem; and (4) performance on Stochastic root-finding problems - a particularly intriguing class of deceptive environments that arises naturally as a result of the properties of stochastic root finding.

3.1 Sensitivity to Discretization and Distribution of Prior

Although TS-SPL is a parameter free scheme, it depends on defining the set of all possible N-Door Puzzles and then formulating a prior distribution over this space. Since TS-SPL is a discrete scheme, an important question is how TS-SPL fares under various levels of discretization, that is, how its performance is affected by the cardinality of this set.

We define TS-SPL as having converged to an interval when 95% of the posterior probability mass is contained in that interval. The measure of interest is then the number of time steps that pass before convergence.
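The convergence criterion above amounts to a simple check on the door marginal of the posterior. The following is a minimal sketch; the function name `has_converged` and the inclusive index bounds `lo` and `hi` are assumptions made for illustration.

```python
import numpy as np

def has_converged(door_probs, lo, hi, mass=0.95):
    """True when doors lo..hi (inclusive) hold at least `mass` of the posterior."""
    return float(np.sum(door_probs[lo:hi + 1])) >= mass
```

Counting the number of queries issued before this check first succeeds gives the convergence-steps measure reported in the tables of this section.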

From Table 2 we see that the number of doors does, in fact, affect the performance of TS-SPL. As the number of doors increases, so does the time it takes before TS-SPL converges. However, Table 2 also shows that the relationship between convergence time and the number of doors is not linear. Indeed, the convergence time increases only from 40.2 to 40.9 steps when doubling from 3200 to 6400 possible doors, suggesting a logarithmic relation between the number of doors and convergence time.

Number of doors:     100    200    400    800    1600   3200   6400
Convergence steps:   31.4   36.0   38.9   39.4   39.3   40.2   40.9

Table 2: Convergence steps for TS-SPL solving the N-Door Puzzle for an increasing number of doors.

To see how the cardinality of the set of truthfulness values affects performance, we gradually refine the discretization of the truthfulness interval. We fix the number of doors to 100. Observing Table 3, it is clear that a finer discretization of the truthfulness does not significantly affect performance.

Truthfulness levels: 50     100    200    400    800    1600   3200
Convergence steps:   51.6   50.8   48.4   52.1   51.0   52.4   52.1

Table 3: Convergence steps for TS-SPL solving the N-Door Puzzle for an increasingly fine discretization of the truthfulness.

Another advantage of our Bayesian scheme is the ability to incorporate prior information to guide the algorithm. On the other hand, specifying an incorrect prior can deteriorate performance instead of enhancing it. In Table 4 we give the results for informed priors over the doors and over the truthfulness. Given the correct underlying values, we specify three types of priors: Correct, Incorrect and Flat (all solutions equally probable), denoted C, I and F respectively. Table 4 shows the effect of the different priors. In brief, having a correct prior over the doors contributes more to convergence time than having a correct prior over the truthfulness of the guards. The disadvantage of setting an incorrectly biased prior is also evident, as the flat prior performs better than any combination involving an incorrect prior.

Door  Truthfulness  Convergence  |  Door  Truthfulness  Convergence  |  Door  Truthfulness  Convergence
F     F             36.4         |  C     F             30.2         |  I     F             46.4
F     C             35.7         |  C     C             30.0         |  I     C             45.2
F     I             41.2         |  C     I             40.5         |  I     I             113.1
Table 4: Convergence steps for TS-SPL solving the N-Door Puzzle with different priors: C - Correct prior, F - Flat prior, I - Incorrect prior.

3.2 Tracking the Truthfulness of the Environment

An interesting property of TS-SPL is its ability to provide a distribution over the truthfulness for the problem instance at hand. This is a significant advantage in practical applications, as it presents the end-user with a better view into the underlying environment. In particular, this can be leveraged in repeated trials, where the information from previous trials can be used as a prior in subsequent trials, greatly increasing the speed of convergence as seen in Section 3.1. Figure 1 shows the probability of each level of noise as TS-SPL progresses in a highly deceptive environment. As seen, TS-SPL is capable of quickly estimating the correct truthfulness value.

Figure 1: TS-SPL maintains a posterior distribution over the truthfulness of the environment. The figure shows this posterior distribution after various numbers of iterations during a single run of TS-SPL. As evident from the figure, TS-SPL obtains in this case a sharply peaked posterior after only 20 iterations.
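To make the idea of tracking truthfulness concrete, the following is a minimal sketch of a joint posterior over door position and truthfulness, updated by Bayes' rule and queried by Thompson Sampling. It is an illustration under stated assumptions, not the compact representation used by TS-SPL: the joint distribution is stored as a naive grid, feedback is coded as +1 ("right") or -1 ("left"), a query exactly at the door counts as "right", and all function names are hypothetical.

```python
import random


def truthful_direction(door, query):
    """+1 ('right') if the door is at or to the right of the query, else -1 ('left')."""
    return 1 if door >= query else -1


def update(posterior, doors, pis, query, response):
    """Bayes update of the joint posterior[i][j] over (doors[i], pis[j])."""
    new = []
    for i, door in enumerate(doors):
        agrees = truthful_direction(door, query) == response
        # A response that agrees with hypothesis (door, p) has likelihood p, else 1 - p.
        new.append([posterior[i][j] * (p if agrees else 1.0 - p)
                    for j, p in enumerate(pis)])
    total = sum(map(sum, new))
    return [[v / total for v in row] for row in new]


def marginal_pi(posterior):
    """Marginal posterior over the truthfulness values."""
    return [sum(col) for col in zip(*posterior)]


def marginal_door(posterior):
    """Marginal posterior over the door positions."""
    return [sum(row) for row in posterior]


def thompson_query(posterior, doors, rng):
    """Sample a door from its marginal posterior and use it as the next query."""
    return rng.choices(doors, weights=marginal_door(posterior), k=1)[0]


def run(doors, pis, true_door, true_pi, steps, seed=0):
    """Simulate the loop against an oracle that tells the truth with probability true_pi."""
    rng = random.Random(seed)
    n = len(doors) * len(pis)
    posterior = [[1.0 / n] * len(pis) for _ in doors]   # flat prior
    for _ in range(steps):
        query = thompson_query(posterior, doors, rng)
        answer = truthful_direction(true_door, query)
        if rng.random() >= true_pi:                      # the oracle lies
            answer = -answer
        posterior = update(posterior, doors, pis, query, answer)
    return posterior
```

Under these assumptions, `marginal_pi(run(...))` gives the kind of distribution over truthfulness that Figure 1 depicts, and `marginal_door` gives the corresponding belief about the door. The candidate truthfulness values should lie strictly between 0 and 1 so that no hypothesis is ruled out by a single response.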

3.3 Stochastic Point Location

The N-Door Puzzle, as outlined in the introduction, depends on two variables: one specifying the door leading to freedom and the other the truthfulness of the guards. Since the N-Door Puzzle does not pose any spatial requirements on the placement of the doors, we can generate a mapping from the N-Door Puzzle to the SPL problem by placing the doors uniformly over the unit interval.

As not all of the schemes evaluated in this section are Bayesian, we introduce the notion of regret, as is typical for the multi-armed bandit scenario, as a metric for measuring the performance of the different schemes. Regret can be stated as the cumulative penalty from selecting sub-optimal actions. In the case of SPL we define regret as the (unsigned) distance between the selected point and the optimal point.
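The door-to-interval mapping and the regret metric can be sketched in a few lines. This is an illustrative sketch only: placing the first and last door exactly on the endpoints of the unit interval is an assumption, and the function names are hypothetical.

```python
def door_positions(n_doors):
    """Place n_doors doors uniformly on the unit interval [0, 1], endpoints included."""
    if n_doors < 2:
        raise ValueError("need at least two doors")
    return [i / (n_doors - 1) for i in range(n_doors)]


def cumulative_regret(queries, optimum):
    """Sum of the unsigned distances between each selected point and the optimal point."""
    return sum(abs(q - optimum) for q in queries)
```

For example, with five doors the positions are 0, 0.25, 0.5, 0.75 and 1, and a scheme that queries 0.5, 0.75 and 1.0 when the optimal point is 0.75 accumulates a regret of 0.5.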

3.3.1 Informative SPL

We evaluate the performance of TS-SPL and TS-SPL-INF in an informative SPL problem against algorithms designed to handle informative environments. To the best of our knowledge this is the first time both the family of PBS based schemes and the family of SPL based schemes are compared.

The performance of the different schemes is summarized in Table 5. One significant observation is the performance difference between the Learning Automata (LA) based schemes (HSSL and SHSSL) and the Bayesian schemes. Performance-wise, the Bayesian schemes significantly outperform the LA based schemes. It should be noted, however, that the LA based schemes require less memory and run faster than the Bayesian ones due to their simplicity.

As can be deduced from Table 5, the distance between the optimal point and the center of the search space is an important indicator of how hard a particular SPL problem is to solve. This can be explained by the fact that most schemes start exploring from the center. Thus, if the optimal point is far from the center, such a scheme has to obtain more evidence before it explores the peripheral regions of the search space. This is particularly apparent for PBS-M, whose performance peaks in the case where the optimal point is closest to the center, even when faced with significant noise.

Since PBS-M pursues the median of the probability distribution, we can say that PBS-M is conservative in its exploration. This is because it takes significantly more evidence to move the point of exploration compared to TS-SPL. TS-SPL, on the other hand, has a tendency to explore too much, and as noted by Lattimore [41], using TS for exploration can lead to over-exploration when facing high variance distributions. In the low noise scenarios, on the other hand, NGBS-M is the most efficient scheme, exploring deterministically.
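The contrast between median-pursuing and sampling-based query selection can be illustrated with a small sketch. It assumes a discrete posterior over candidate points; the posterior values and the function names are hypothetical and chosen only to show the difference in behaviour.

```python
import random


def median_query(points, posterior):
    """Median-pursuing choice: the first point at which the cumulative posterior reaches 1/2."""
    cumulative = 0.0
    for point, prob in zip(points, posterior):
        cumulative += prob
        if cumulative >= 0.5:
            return point
    return points[-1]


def sampled_query(points, posterior, rng):
    """Thompson-style choice: draw the query point from the posterior itself."""
    return rng.choices(points, weights=posterior, k=1)[0]


points = [0.0, 0.25, 0.5, 0.75, 1.0]
posterior = [0.05, 0.15, 0.5, 0.2, 0.1]   # hypothetical posterior, mode at 0.5

rng = random.Random(0)
draws = [sampled_query(points, posterior, rng) for _ in range(1000)]
```

The median rule returns the same point until enough evidence has shifted half of the probability mass, whereas the sampled queries spread over every point with non-zero posterior mass, which is the source of the additional exploration.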

Moreover, from Table 6 we observe that TS-SPL-INF exhibits the lowest standard deviation overall, and is consequently the scheme that most consistently performs close to its expected regret in every trial. This is in sharp contrast to PBS-M, which outperforms TS-SPL on average regret in several settings but is unable to do so consistently. NGBS-M also displays significant variance in high noise scenarios.

Scheme       Avg Regret            Avg Regret             Avg Regret
TS-SPL       29.2 / 9.8 / 5.1      36.4 / 12.9 / 6.2      57.3 / 20.3 / 10.1
TS-SPL-INF   22.2 / 7.3 / 3.7      22.5 / 7.7 / 3.8       23.9 / 8.7 / 4.3
PBS-M        9.8 / 4.0 / 2.6       32.7 / 14.2 / 8.5      52.1 / 29.6 / 16.9
BZ-M         23.5 / 5.9 / 2.2      27.5 / 6.3 / 2.5       35.1 / 9.6 / 3.4
NGBS-M       36.9 / 3.5 / 1.0      48.9 / 4.5 / 1.5       68.5 / 7.1 / 2.3
ASS          45.8 / 17.0 / 6.7     30.4 / 8.9 / 3.6       38.8 / 11.7 / 3.9
OCBA         70.8 / 47.4 / 35.2    89.9 / 55.8 / 37.1     112.1 / 78.4 / 48.8
HSSL         117.3 / 23.1 / 8.2    111.7 / 16.7 / 4.8     131.5 / 19.1 / 5.3
SHSSL        152.2 / 32.6 / 11.8   151.8 / 23.5 / 6.5     175.1 / 26.1 / 7.3
Table 5: Average regret for the different schemes in an informative SPL. The result is reported in the format “a” / “b” / “c”, with one value per noise level. The number of time steps per trial is 1000, with 10000 independent trials per data point.

Scheme       Std. dev.             Std. dev.              Std. dev.
TS-SPL       16.8 / 5.9 / 2.6      20.5 / 6.5 / 3.1       30.9 / 10.3 / 4.6
TS-SPL-INF   13.8 / 4.2 / 2.0      14.2 / 4.4 / 2.5       15.7 / 5.7 / 2.4
PBS-M        15.2 / 10.3 / 10.2    69.1 / 40.9 / 31.5     94.2 / 71.1 / 56.6
BZ-M         30.4 / 8.9 / 3.1      40.8 / 9.8 / 4.9       48.8 / 15.3 / 5.3
NGBS-M       68.5 / 8.9 / 0.9      83.7 / 13.6 / 1.4      108.7 / 19.4 / 1.6
ASS          51.6 / 22.4 / 10.1    47.8 / 15.7 / 5.4      62.3 / 23.1 / 4.6
OCBA         46.2 / 27.6 / 19.4    63.9 / 43.9 / 25.6     76.1 / 64.6 / 41.9
HSSL         71.7 / 16.1 / 4.6     83.6 / 16.1 / 4.2      94.8 / 19.4 / 4.5
SHSSL        89.7 / 23.5 / 6.2     108.5 / 23.4 / 5.8     126.5 / 27.7 / 6.4
Table 6: Standard deviation for the different schemes in an informative SPL. The result is reported in the format “a” / “b” / “c”, with one value per noise level. The number of time steps per trial is 1000, with 10000 independent trials per data point.

3.4 Stochastic Point Location in Deceptive Environments

With the underlying truthfulness varied over its range, we test TS-SPL, CPL-AdS [10] and SHSSL [11] for speed of convergence and for how much regret one accumulates on average before converging. However, since CPL-AdS operates in a two-phase manner, first deciding whether the environment is informative or deceptive and only then searching, direct comparison with TS-SPL and SHSSL is inappropriate because the latter schemes operate on-line. Oommen et al. state in [10] that this decision phase needs approximately 200 time steps, and by this time TS-SPL is already close to converging to the actual solution. To further explore this point, see Table 7. Here it is clear that TS-SPL is superior to CPL-AdS by roughly two orders of magnitude, as well as outperforming SHSSL.

Another interesting observation is that the performance of TS-SPL is symmetrical: it accumulates the same regret in both settings of Table 7. Further note that, as stated earlier, SHSSL fails to converge for some values of the truthfulness, so SHSSL is effectively operating with a smaller search space for the truthfulness than both TS-SPL and CPL-AdS.

After modifying PBS, NGBS and BZ to support a Bayesian model of truthfulness, we can use the same prior that we apply in TS-SPL also for these schemes, leading to PBS-M, NGBS-M and BZ-M. The effect of this enhancement to existing schemes is summarized in Table 7. As clearly seen, the query selection method for these schemes is not suited to handle deceptive environments.

Scheme     Cumulative regret   Cumulative regret
TS-SPL     6.2                 6.2
CPL-AdS    501.6 / 354.9       842.8 / 502.3
PBS-M      31.5                77.5
BZ-M       4.9                 352.5
NGBS-M     1.4                 191.2
SHSSL      6.5                 6.5
Table 7: Cumulative regret for the deceptive SPL problem. All entries were estimated by averaging over an ensemble of 10000 independent trials, and each entry corresponds to the estimated cumulative regret after a fixed number of time steps, leading to a negligible variance of the estimates relative to the large difference in performance among the competing schemes. The number after the slash corresponds to the regret accumulated after CPL-AdS has determined whether the teacher is informative or deceptive.

3.5 Stochastic Root-Finding Problem

The deterministic root finding problem is the problem of locating a root of a function defined over an interval, that is, a point at which the function evaluates to zero. We assume that the function is unknown; however, an oracle returns the function value when queried at a point. The problem then becomes: how can we determine the root using as few queries as possible? If the response from the oracle is noisy, we obtain the Stochastic Root Finding Problem (SRFP) [6].

One approach to solving the deterministic root finding problem is the Bisection Method. In the Bisection Method we halve the search space in each iteration by continually querying the oracle at the midpoint of the remaining search space. However, for the SRFP the Bisection Method is unable to discard half of the search space, since the oracle may provide false information regarding the function value.
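For reference, the deterministic Bisection Method can be sketched as follows. The function name `bisect_root` and the tolerance are illustrative choices, and the sketch assumes a noise-free oracle whose values at the two endpoints have opposite signs.

```python
def bisect_root(g, lo, hi, tol=1e-9):
    """Bisection on [lo, hi]; requires g(lo) and g(hi) to have opposite signs."""
    g_lo, g_hi = g(lo), g(hi)
    if g_lo == 0:
        return lo
    if g_hi == 0:
        return hi
    if (g_lo > 0) == (g_hi > 0):
        raise ValueError("g must change sign on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        g_mid = g(mid)
        if g_mid == 0:
            return mid
        if (g_mid > 0) == (g_lo > 0):
            lo = mid      # same sign as the left end: the root lies in the right half
        else:
            hi = mid      # sign differs from the left end: the root lies in the left half
    return (lo + hi) / 2.0
```

Each iteration discards half of the remaining interval on the strength of a single sign, which is exactly the step that becomes unsafe once the oracle may report the wrong sign.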

The objective is therefore to select a sequence of queries that gathers information about the root, such that the final query is close to the root [42].

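Formally, we consider a function with a root in the interval such that the sign of the function is equal for all points to the left of the root, and likewise equal for all points to the right of the root.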

For any query point, the oracle generates a sample consisting of the function value plus stochastic noise. Furthermore, we define the feedback as the sign of this sample. As we shall see, TS-SPL as well as PBS and its variants operate solely on this sign. While disregarding the scalar information of the sample might seem wasteful, it opens up for the highly efficient Bayesian framework deployed by TS-SPL. In the scenario that we investigate here, noise reverses the sign of the function value, i.e., the oracle returns the correct sign with probability π and the reversed sign with probability 1 − π.
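A sign-flipping oracle of this kind can be sketched as below. The names `sign` and `make_sign_oracle` are hypothetical, the parameter `pi` denotes the probability of a correct sign, and treating a function value of exactly zero as positive is an arbitrary choice made for the sketch.

```python
import random


def sign(value):
    """Sign of a number as +1 or -1 (zero is treated as +1 here, an arbitrary choice)."""
    return 1 if value >= 0 else -1


def make_sign_oracle(g, pi, rng):
    """Oracle that returns sign(g(x)) with probability pi and the reversed sign otherwise."""
    def oracle(x):
        s = sign(g(x))
        return s if rng.random() < pi else -s
    return oracle
```

With `pi` above one half the oracle is informative on average, and with `pi` below one half it is deceptive, since the reversed sign is then the more frequent answer.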

The traditional way of solving SRFP is to apply a variant of Stochastic Approximation (SA) [43, 44]. Implementation-wise, SA methods extend or modify the iterative Newton-Raphson algorithm to handle noise, scaling each update by a step length taken from a sequence that decreases as the iteration count increases. The form of SA considered here is also referred to as Classical Stochastic Approximation (CSA), as it closely resembles the original form proposed by Robbins and Monro [43].

Applying SA to SRFP has been extensively studied in the literature and a full literature review is outside the scope of this article; interested readers are referred to [46, 47, 6] and the references therein. As there exists a myriad of different SA algorithms, we have selected one of the more fundamental approaches to form a basis for comparing the different types of schemes. Note that a limitation of this SA scheme is that the function is required to be monotone.
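A minimal sketch of the classical Robbins-Monro recursion is given below. It is written for an increasing function and uses the common step length 1/n; both the step-length choice and the projection back onto the search interval are assumptions of this sketch rather than details confirmed by the text.

```python
def robbins_monro(oracle, x0, steps, lo=0.0, hi=1.0):
    """Classical SA for an increasing function: x_{n+1} = x_n - a_n * y_n, with a_n = 1/n."""
    x = x0
    for n in range(1, steps + 1):
        y = oracle(x)                      # (possibly noisy) observation of the function at x
        x = x - (1.0 / n) * y              # step against the observed value
        x = min(hi, max(lo, x))            # keep the iterate inside the interval
    return x
```

Because the update always moves against the observed value, the recursion relies on the function being monotone: on a function with local extrema the observed value no longer indicates on which side the root lies.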

The main difference between SRFP and SPL is that, unlike SPL, SRFP does not provide feedback concerning the direction of the root from the query location. To map the feedback into a direction it is necessary to know whether the function is increasing or decreasing. If it is increasing and the observed sign is positive, then the root is to the left of the query point (and to the right if the sign is negative). Conversely, if the function is decreasing and the observed sign is positive, then the root is to the right of the query point (and to the left if the sign is negative).
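The four cases above amount to a small lookup, sketched here with a hypothetical function name and with signs coded as +1 and -1.

```python
def root_direction(sign_value, increasing):
    """Map an observed sign (+1 or -1) to the side of the query on which the root lies."""
    if sign_value not in (1, -1):
        raise ValueError("sign_value must be +1 or -1")
    if increasing:
        return "left" if sign_value > 0 else "right"
    return "right" if sign_value > 0 else "left"
```

Flipping the `increasing` flag flips every returned direction, which is why an unknown slope behaves like an unknown choice between an informative and a deceptive environment.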

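Learning the direction of the function can be done by repeatedly querying a single point on the edge of the interval. To gain insight into how many repeated samples are sufficient, we employ the two-sided Hoeffding’s inequality, which bounds the probability that the average of the repeated queries at that point deviates from its expected value by more than a chosen margin. Setting the right-hand side of the inequality equal to the acceptable failure probability and solving for the number of queries yields the required sample size. With a failure probability of 1%, we are thus 99% sure of our estimate of the direction, given that the assumed margin holds.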
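The sample-size calculation can be reproduced with the standard two-sided Hoeffding bound, P(|mean − expectation| ≥ ε) ≤ 2 exp(−2nε² / R²), where R is the width of the range of a single sample. The sketch below solves this bound for n; the function name and the example values of ε and δ are illustrative assumptions and are not the values used in the paper.

```python
import math


def hoeffding_samples(epsilon, delta, value_range=1.0):
    """Smallest n with 2 * exp(-2 * n * epsilon**2 / value_range**2) <= delta."""
    if not (epsilon > 0 and 0 < delta < 1 and value_range > 0):
        raise ValueError("need epsilon > 0, 0 < delta < 1 and value_range > 0")
    n = value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2)
    return math.ceil(n)
```

For instance, under the assumed margin ε = 0.3 for samples in [0, 1] and failure probability δ = 0.01, the bound asks for 30 queries; the same number results for ±1 valued samples (range 2) with margin 0.6.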

However, it turns out that TS-SPL, being able to handle a deceptive environment, does not require this sampling phase. It merely requires an arbitrary, yet consistent, mapping of each sign to a direction, for instance one sign to left and the other to right. The reason is that if the initial mapping is wrong, TS-SPL will recognize that the feedback is deceptive and thus still be able to solve the problem with no additional effort. An informative scheme will, on the other hand, be unable to recognize this and will therefore be unable to find the root without an additional sampling phase.
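The argument can be made concrete with a small sketch: if the oracle reports the correct sign with probability π and the fixed sign-to-direction mapping happens to be wrong, every direction is flipped, so the mapped feedback points towards the root with probability 1 − π. The function names below are hypothetical.

```python
def effective_truthfulness(pi, mapping_correct):
    """Probability that the mapped direction points towards the root."""
    return pi if mapping_correct else 1.0 - pi


def classify(effective_pi):
    """Label an environment by whether its feedback is right more often than wrong."""
    if effective_pi > 0.5:
        return "informative"
    if effective_pi < 0.5:
        return "deceptive"
    return "uninformative"
```

An oracle with π = 0.8 combined with the wrong mapping therefore looks like a deceptive environment with truthfulness 0.2, which a scheme that learns the truthfulness can absorb without a separate sampling phase.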

We remark that the above sampling procedure only enables the other methods to handle SRFP for unknown functions in an informative environment; it does not provide a definite answer to whether the environment is informative or deceptive. See [10] for a way to implement this as an additional sampling phase. The functions that we use to measure performance and compare schemes are illustrated in Figure 2.

Figure 2: The three functions A, B and C for benchmarking stochastic root finding schemes.

Func A       Avg Regret       Avg Regret       Avg Regret
TS-SPL       46.0             17.1             8.6
TS-SPL-INF   55.2 (24.3)      39.1 (8.1)       35.0 (4.1)
PBS-M        55.1 (24.2)      40.3 (9.3)       35.2 (4.2)
NGBS-M       53.4 (22.4)      35.3 (4.3)       32.9 (1.9)
BZ-M         63.8 (32.8)      39.9 (8.9)       33.8 (2.8)
SA           32.4             14.8             5.4
ASS          80.4 (49.4)      38.7 (7.7)       33.5 (2.5)
HSSL         60.4 (29.4)      45.5 (14.6)      35.8 (4.8)
SHSSL        62.8             20.2             6.7
CPL-AdS      162.1 (107.9)    146.3 (97.4)     135.3 (90.1)
Table 8: Average residuals for the different schemes when finding the root of the monotonic function A under various noise levels. The results are given in the format “average residuals (average residuals after sampling)” for each scheme. For CPL-AdS the sampling period is the estimation period (epoch 0) as defined by the scheme. The number of iterations per trial is 250, with 10000 independent trials per data point.

Func B       Avg Regret       Avg Regret       Avg Regret
TS-SPL       47.1             17.8             8.5
TS-SPL-INF   53.8 (22.9)      39.5 (8.5)       35.1 (4.1)
PBS-M        41.1 (10.2)      35.3 (4.31)      33.3 (2.3)
SGBS-M       50.6 (19.7)      35.7 (4.7)       33.0 (2.0)
BZ-M         60.3 (29.4)      39.6 (8.7)       33.6 (2.6)
SA           175.1            204.5            223.3
ASS          81.6 (50.6)      39.5 (8.7)       40.2 (9.0)
HSSL         85.3 (54.4)      50.6 (19.6)      39.0 (8.0)
SHSSL        75.4             30.8             12.7
CPL-AdS      117.9 (109.3)    116.7 (107.1)    144.9 (96.5)
Table 9: Average residuals for the different schemes when finding the root of the quadratic function B under various noise levels. The results are given in the format “average residuals (average residuals after sampling)” for each scheme. For CPL-AdS the sampling period is the estimation period (epoch 0) as defined by the scheme. The number of iterations per trial is 250, with 10000 independent trials per data point.

Func C       Avg Regret       Avg Regret       Avg Regret
TS-SPL       36.9             13.7             6.4
TS-SPL-INF   52.8 (21.9)      38.8 (7.8)       34.9 (3.9)
PBS-M        49.6 (18.7)      39.2 (8.3)       34.3 (3.3)
SGBS-M       47.2 (16.2)      34.6 (3.6)       32.5 (1.6)
BZ-M         58.8 (27.9)      38.4 (7.4)       33.7 (2.7)
SA           149.0            178.0            185.0
ASS          54.3 (23.4)      39 (8.0)         33.5 (2.5)
HSSL         75.2 (44.3)      44.3 (13.4)      35.6 (4.6)
SHSSL        56.5             18.4             6.4
CPL-AdS      153.0 (101.6)    156.2 (103.7)    165.0 (109.6)
Table 10: Average residuals for the different schemes when finding the root of the sinusoidal function C under various noise levels. The results are given in the format “average residuals (average residuals after sampling)” for each scheme. For CPL-AdS the sampling period is the estimation period (epoch 0) as defined by the scheme. The number of iterations per trial is 250, with 10000 independent trials per data point.

From Tables 8, 9 and 10 it is clear that TS-SPL is overall the most efficient root solver among the evaluated schemes. This largely comes from the fact that it simultaneously learns whether the function is decreasing or increasing and tries to locate the root. In addition, there is the risk that the sampling procedure that the other schemes apply to determine whether the function is increasing may reach the wrong conclusion. If this happens, none of the schemes depending on the sampling will converge towards the root. Furthermore, an advantage of using TS-SPL is that it can be applied to a wide range of functions without regard to any local extrema residing in the function. This is unlike SA, which shows excellent performance only for monotonic functions, as exemplified in Table 8.

4 Conclusions and Further Work

In this paper, we investigated a novel reinforcement learning problem derived from the so-called “N-Door Puzzle”. This puzzle has the fascinating property that it involves stochastic compulsive liars. Feedback is erroneous on average, systematically misleading the decision maker. This renders traditional reinforcement learning (RL) based approaches ineffective due to their dependency on “on average” correct feedback.

To solve the problem of deceptive feedback, we recast the problem as a particularly intriguing variant of the multi-armed bandit problem, referred to as the Stochastic Point Location (SPL) Problem. The decision maker is here only told whether the optimal point on a line lies to the “left” or to the “right” of a current guess, with the feedback being erroneous with probability 1 − π. Solving this problem opens up for optimization in continuous action spaces with both informative and deceptive feedback.

Our solution to the above problem, introduced in the present paper, is based on a novel compact and scalable Bayesian representation of the solution space. This model simultaneously captures both the location of the optimal point and the probability of receiving correct feedback, π. We further introduced an accompanying Thompson Sampling guided Stochastic Point Location (TS-SPL) scheme for balancing exploration against exploitation. By learning π, TS-SPL supports deceptive environments that are lying about the direction of the optimal point.

The resulting scheme was applied to the Stochastic Point Location (SPL) problem and outperformed all of the Learning Automata driven methods. However, by enhancing the Soft Generalized Binary Search (SGBS) scheme with our Bayesian representation of the solution space, SGBS was able to outperform TS-SPL under informative feedback. For deceptive SPL problems, TS-SPL outperformed all of the existing state-of-the-art schemes by several orders of magnitude, even when the latter schemes were supported by our Bayesian model.

We also applied TS-SPL to the Stochastic Root Finding Problem (SRFP). We demonstrated that SRFP can be seen as a deceptive problem, allowing TS-SPL to outperform existing dedicated state-of-the-art SRFP schemes by an order of magnitude. Thus, TS-SPL can be considered state-of-the-art for both deceptive SPL and for the SRFP, while yielding comparable results to the top performing schemes in the case of informative SPLs.

Despite the above performance gains, TS-SPL is based on Thompson Sampling, which is known to have a tendency to over-explore high variance reward distributions [41]. In future work, it would therefore be interesting to investigate mechanisms that eliminate or reduce this tendency in order to further increase convergence speed.


  • [1] B. J. Oommen, “Stochastic Searching on the Line and its Applications to Parameter Learning in Nonlinear Optimization,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 27, no. 4, pp. 733–739, 1997.
  • [2] A. Yazidi, B. J. Oommen, and O.-C. Granmo, “A novel stochastic discretized weak estimator operating in non-stationary environments,” in Computing, Networking and Communications (ICNC), 2012 International Conference on.   IEEE, 2012, pp. 364–370.
  • [3] B. J. Oommen, S. Misra, and O.-C. Granmo, “Routing bandwidth-guaranteed paths in mpls traffic engineering: A multiple race track learning approach,” Computers, IEEE Transactions on, vol. 56, no. 7, pp. 959–976, 2007.
  • [4] B. J. Oommen, O.-C. Granmo, and Z. Liang, “A novel multidimensional scaling technique for mapping word-of-mouth discussions,” in Opportunities and Challenges for Next-Generation Applied Intelligence.   Springer, 2009, pp. 317–322.
  • [5] H. Chen and B. W. Schmeiser, “Stochastic root finding via retrospective approximation,” IIE Transactions, vol. 33, no. 3, pp. 259–275, 2001.
  • [6] R. Pasupathy and S. Kim, “The stochastic root-finding problem: Overview, solutions, and open questions,” ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 21, no. 3, p. 19, 2011.
  • [7] T. Tao, H. Ge, G. Cai, and S. Li, “Adaptive step searching for solving stochastic point location problem,” in Intelligent Computing Theories.   Springer, 2013, pp. 192–198.
  • [8] A. Yazidi, O.-C. Granmo, B. J. Oommen, and M. Goodwin, “A hierarchical learning scheme for solving the stochastic point location problem,” in Advanced Research in Applied Artificial Intelligence.   Springer, 2012, pp. 774–783.
  • [9] J. Zhang, L. Zhang, and M. Zhou, “Solving stationary and stochastic point location problem with optimal computing budget allocation,” in Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, Oct 2015, pp. 145–150.
  • [10] B. J. Oommen, G. Raghunath, and B. Kuipers, “On how to learn from a stochastic teacher or a stochastic compulsive liar of unknown identity,” in AI 2003: Advances in Artificial Intelligence.   Springer, 2003, pp. 24–40.
  • [11] J. Zhang, Y. Wang, C. Wang, and M. Zhou, “Symmetrical hierarchical stochastic searching on the line in informative and deceptive environments,” IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1–10, 2016.
  • [12] R. Smullyan, To Mock a Mockingbird and Other Logic Puzzles: Including an Amazing Adventure in Combinatory Logic.   Knopf, 1988.
  • [13] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.
  • [14] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
  • [15] O.-C. Granmo, “Solving two-armed bernoulli bandit problems using a bayesian learning automaton,” International Journal of Intelligent Computing and Cybernetics, vol. 3, no. 2, pp. 207–234, 2010.
  • [16] O. Chapelle and L. Li, “An empirical evaluation of thompson sampling,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds.   Curran Associates, Inc., 2011, pp. 2249–2257.
  • [17] S. Agrawal and N. Goyal, “Analysis of thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, COLT, 2012.
  • [18] ——, “Further optimal regret bounds for thompson sampling,” in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, 2013, pp. 99–107.
  • [19] ——, “Thompson sampling for contextual bandits with linear payoffs,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 127–135.
  • [20] S. Glimsdal and O.-C. Granmo, “Gaussian process based optimistic knapsack sampling with applications to stochastic resource allocation,” in Proceedings of the 24th Midwest Artificial Intelligence and Cognitive Science Conference 2013.   CEUR Workshop Proceedings, 2013, pp. 43–50.
  • [21] O.-C. Granmo and S. Glimsdal, “Accelerated bayesian learning for decentralized two-armed bandit based decision making with applications to the goore game,” Applied intelligence, vol. 38, no. 4, pp. 479–488, 2013.
  • [22] L. Jiao, X. Zhang, B. J. Oommen, and O.-C. Granmo, “Optimizing channel selection for cognitive radio networks using a distributed bayesian learning automata-based approach,” Applied Intelligence, vol. 44, no. 2, pp. 307–321, 2016.
  • [23] D. Tolpin and F. Wood, “Maximum a posteriori estimation by search in probabilistic programs,” in Eighth Annual Symposium on Combinatorial Search, 2015.
  • [24] O.-C. Granmo, B. J. Oommen, S. A. Myrer, and M. G. Olsen, “Learning Automata-based Solutions to the Nonlinear Fractional Knapsack Problem with Applications to Optimal Resource Allocation,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 37, no. 1, pp. 166–175, 2007.
  • [25] Y. Y. Jia and S. Mannor, “Unimodal bandits,” in International Conference on Machine Learning (ICML-11), 2011, pp. 41–48.
  • [26] E. Even-Dar, S. Mannor, and Y. Mansour, “Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems,” Journal of Machine Learning Research, vol. 7, no. Jun, pp. 1079–1105, 2006.
  • [27] K. G. Jamieson, M. Malloy, R. D. Nowak, and S. Bubeck, “lil’ucb: An optimal exploration algorithm for multi-armed bandits.” in COLT, vol. 35, 2014, pp. 423–439.
  • [28] J.-Y. Audibert and S. Bubeck, “Best arm identification in multi-armed bandits,” in COLT-23th Conference on Learning Theory-2010, 2010, pp. 13–p.
  • [29] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck, “Multi-bandit best arm identification,” in Advances in Neural Information Processing Systems, 2011, pp. 2222–2230.
  • [30] Z. S. Karnin, T. Koren, and O. Somekh, “Almost optimal exploration in multi-armed bandits.” International Conference on Machine Learning (ICML-3), vol. 28, pp. 1238–1246, 2013.
  • [31] A. Pelc, “Searching games with errors – fifty years of coping with liars,” Theoretical Computer Science, vol. 270, no. 1, pp. 71–109, 2002.
  • [32] O. Atan, C. Tekin, and M. van der Schaar, “Global multi-armed bandits with hölder continuity,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015, pp. 28–36.
  • [33] M. Horstein, “Sequential transmission using noiseless feedback,” Information Theory, IEEE Transactions on, vol. 9, no. 3, pp. 136–143, 1963.
  • [34] R. Nowak, “Generalized binary search,” in Communication, Control, and Computing, 2008 46th Annual Allerton Conference on.   IEEE, 2008, pp. 568–574.
  • [35] M. V. Burnashev and K. Zigangirov, “An interval estimation problem for controlled observations,” Problemy Peredachi Informatsii, vol. 10, no. 3, pp. 51–61, 1974.
  • [36] R. Waeber, P. I. Frazier, and S. G. Henderson, “Bisection search with noisy responses,” SIAM Journal on Control and Optimization, vol. 51, no. 3, pp. 2261–2279, 2013.
  • [37] R. D. Nowak, “The geometry of generalized binary search,” Information Theory, IEEE Transactions on, vol. 57, no. 12, pp. 7893–7906, 2011.
  • [38] L. Hyafil and R. L. Rivest, “Constructing optimal binary decision trees is np-complete,” Information Processing Letters, vol. 5, no. 1, pp. 15–17, 1976.
  • [39] A. Singh, R. Nowak, and P. Ramanathan, “Active learning for adaptive mobile sensing networks,” in Proceedings of the 5th international conference on Information processing in sensor networks.   ACM, 2006, pp. 60–68.
  • [40] R. M. Castro and R. D. Nowak, “Upper and lower error bounds for active learning,” in The 44th Annual Allerton Conference on Communication, Control and Computing, vol. 2, no. 2.1, 2006, p. 1.
  • [41] T. Lattimore, “Optimally confident ucb: Improved regret for finite-armed bandits,” arXiv preprint arXiv:1507.07880, 2015.
  • [42] R. Waeber, P. Frazier, and S. G. Henderson, “A bayesian approach to stochastic root finding,” in Simulation Conference (WSC), Proceedings of the 2011 Winter.   IEEE, 2011, pp. 4033–4045.
  • [43] H. Robbins and S. Monro, “A stochastic approximation method,” The annals of mathematical statistics, pp. 400–407, 1951.
  • [44] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
  • [45] R. Pasupathy and B. W. Schmeiser, “Root finding via darts – dynamic adaptive random target shooting,” in Simulation Conference (WSC), Proceedings of the 2010 Winter.   IEEE, 2010, pp. 1255–1262.
  • [46] T. L. Lai, “Stochastic approximation,” Annals of Statistics, pp. 391–406, 2003.
  • [47] S. Asmussen and P. W. Glynn, Stochastic simulation: Algorithms and Analysis.   Springer Science & Business Media, 2007, vol. 57.