1 Introduction
Research on the Stochastic Point Location (SPL) problem [1] has delivered increasingly efficient schemes for locating the optimal point on a line. In all brevity, the optimal point must be found by iteratively proposing candidate points, with each candidate revealing whether the optimal point lies to the candidate's left or to its right. The provided directions can be erroneous, and the goal is to locate the optimal point with as few non-optimal candidate proposals as possible. The SPL problem can also be cast as an agent that moves on a line, attempting to locate a particular location $\lambda$. The agent communicates with a teacher that notifies the agent whether its current location is greater or lower than $\lambda$. However, the teacher is of a stochastic nature and with probability $1 - p$ feeds the agent erroneous feedback.
Despite the simplicity of the SPL problem, SPL schemes have provided novel solutions for a wide range of problems. Intriguing applications include estimation of non-stationary binomial distributions [2], communication network routing [3], and meta-optimization [4]. Furthermore, recent research that addresses the related Stochastic Root-Finding (SRF) problem provides promising solutions for parameter estimation, transportation system optimization, as well as supply chain optimization [5, 6].

State-of-the-art. Adaptive Step Searching (ASS) [7] is currently the leading approach to solving SPL problems, although it is outperformed by Hierarchical Stochastic Searching on the Line (HSSL) [8] in highly volatile non-stationary environments [7]. Optimal Computing Budget Allocation (OCBA) has also been applied to SPL [9] and provides stable solutions, while converging slightly slower than ASS. Unfortunately, these state-of-the-art schemes fail when noise increases beyond a certain degree, which happens when the majority of the obtained directions mislead rather than guide. Indeed, by naively following the directions provided under such circumstances, one is systematically led away from the optimal point. We refer to this kind of problem environment as a deceptive environment, as opposed to an informative one, as further clarified below.
To the best of the authors' knowledge, the pioneering CPLAdS [10] scheme was the first known approach to handle deceptive SPL environments. CPLAdS relies on two consecutive phases. In the first phase, a sequence of intelligently selected questions is used to classify the environment as either informative or deceptive. By spending a sufficient amount of time in this phase, the classification can be made arbitrarily accurate. In the second phase, a regular SPL scheme is applied, except that the directions obtained are reversed if the problem environment was classified as deceptive in the first phase. This means that the scheme may have to remain in the first phase for an extensive amount of time to ensure that the problem environment is correctly classified; otherwise, one risks being systematically misled in the second phase. These properties largely render CPLAdS inappropriate for online or anytime problem solving.
Recently, HSSL has been extended by Zhang et al. to cover both informative and deceptive environments, using a Symmetric HSSL (SHSSL) [11]. This scheme essentially runs two HSSL schemes in conjunction: one regular scheme that handles informative environments, and one where all feedback from the environment is inverted, to handle deceptive environments. The hierarchy navigation capabilities of HSSL are then exploited to allow SHSSL to switch between the two HSSLs, depending on the nature of the environment. However, a significant limitation of HSSL, namely, that the probability $p$ of receiving correct feedback must be larger than the conjugate of the golden ratio, $\phi^{-1} \approx 0.618$, carries over to SHSSL. Indeed, SHSSL fails to converge for $1 - \phi^{-1} < p < \phi^{-1}$, which amounts to approximately 24% of the feasible values for $p$. This is in contrast to the approach we propose in this paper, as well as to CPLAdS [10], since both of these schemes operate along the whole range of $p$ (apart from $p = \frac{1}{2}$).
To cast further light on the challenges outlined above, we here introduce the NDoor Puzzle as a framework for modeling deception. We further propose an accompanying novel solution scheme: Thompson Sampling guided Stochastic Point Location (TSSPL). The TSSPL scheme handles both SPL and SRF problems, and is capable of simultaneously solving the problem and determining whether we are dealing with an informative or a deceptive environment. As we shall see, not only does this scheme handle an arbitrary level of noise, but it also outperforms current state-of-the-art techniques in both informative and deceptive environments.
The NDoor Puzzle. In the book "To Mock a Mockingbird" [12], the following puzzle is formulated: "Someone was sentenced to death, but since the king loves riddles, he threw this guy into a room with two doors. One leading to death, one leading to freedom. There are two guards, each one guarding one door. One of the guards is a perfect liar, the other one will always tell the truth. The man is allowed to ask one guard a single yes-no question and then has to decide which door to take. What single question can he ask to guarantee his freedom?" To avoid spoiling the puzzle for the reader, we omit the solution here, and merely note that asking a double negative question will often be the correct course of action for these types of puzzles.
The above puzzle can be generalized by increasing the number of doors. Instead of deciding between merely two doors, the prisoner now faces $N$ doors, with a guard posted between each pair of adjacent doors. Only a single door leads to freedom; the remaining doors lead to death. At sunrise each day, the prisoner is allowed to ask one of the guards whether the door leading to freedom is to the guard's left or to the guard's right. However, only a fixed proportion of the guards answer truthfully; the rest are compulsive liars. Further, the guards are randomly assigned a position each sunrise, and thus, knowing who lies and who tells the truth is impossible. As an additional complication, depending on the mood of the king, the prisoner may be ordered to walk through one door of his choosing on an arbitrary day. Therefore, to save his life, it is imperative that the prisoner determines which door leads to freedom as quickly as possible.
Specifically, let $p$ be the fraction of truthful guards. Since the guards are randomly assigned a position each day, the probability of obtaining a truthful answer is governed by $p$. If $p < \frac{1}{2}$, then the majority of the guards are compulsive liars, and the guards as an entity can be characterized as being deceptive. Conversely, if $p > \frac{1}{2}$, then the majority of the guards are truthful, and the guards can be seen as informative. For completeness, we mention that the puzzle is unsolvable for the case where $p$ is exactly equal to $\frac{1}{2}$, since it then becomes impossible to gain information on either the nature of the doors or the guards.
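To make the puzzle dynamics concrete, a single day's feedback can be sketched as follows. This is a minimal simulation under our own conventions: doors are indexed from zero, and guard $g$ is assumed to stand between doors $g$ and $g + 1$.

```python
import random

def guard_answer(door_star, guard, p, rng=random):
    """One day's answer from guard `guard` (standing between doors
    `guard` and `guard + 1`) in an N-Door Puzzle whose door to freedom
    is `door_star`.

    Since the guards are reshuffled each sunrise, each answer is
    independently truthful with probability p. Returns "left" if the
    guard claims the door to freedom is among doors 0..guard, and
    "right" otherwise.
    """
    truth = "left" if door_star <= guard else "right"
    lie = "right" if truth == "left" else "left"
    return truth if rng.random() < p else lie
```

With $p < \frac{1}{2}$ the function mostly returns the lie, which is exactly the deceptive regime described above.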
Thompson Sampling. The Thompson Sampling (TS) principle was introduced by Thompson as early as 1933 [13] and now forms the basis for several state-of-the-art approaches to the Multi-Armed Bandit (MAB) problem: a fundamental sequential resource allocation problem that has challenged researchers for decades. At each time step in MAB, one is offered to pull one out of $N$ bandit arms, which in turn triggers a stochastic reward. Each arm has an underlying probability of providing a reward; however, these probabilities are unknown to the decision maker. The challenge is thus to decide which of the arms to pull at every time step, so as to maximize the expected total number of rewards obtained [14].
In all brevity, TS seeks to achieve the above goal by quickly shifting from exploring reward probabilities to maximizing the number of rewards obtained. This is achieved by recursively estimating the underlying reward probability of each arm, using Bayesian filtering of the rewards obtained thus far. TS then simply selects the next arm to pull based on the Bayesian estimates of the reward probabilities (one reward probability density function per arm).
The arm selection strategy of TS is rather straightforward, yet surprisingly efficient. To determine which arm to pull, a single candidate reward probability is sampled from the probability density function of each arm. The arm whose sampled value is the highest is the one pulled next. The outcome of pulling that arm is in turn used to perform the next Bayesian update of the arm's reward probability estimate. It is this simple scheme that makes TS select arms with frequency proportional to the posterior probability of being optimal, leading to quick convergence towards always selecting the optimal arm.
TS has turned out to be among the top performers for traditional MAB problems [15, 16], supported by theoretical regret bounds [17, 18]. It has also been successfully applied to contextual MAB problems [19], Constrained Gaussian Process optimization [20], Distributed Quality of Service Control in Wireless Networks [21], and Cognitive Radio Optimization [22], as well as serving as a foundation for solving the Maximum a Posteriori Estimation problem [23].

Pure Exploration Bandits. Throughout this paper we assume that each SPL problem potentially takes part in a larger system consisting of multiple SPL problems, not necessarily operating in isolation. From existing applications in the literature, such as web crawler balancing [24], it is clear that the value of an SPL scheme hinges upon its ability to cooperate and interact with other decision makers. Such cooperation demands predictable behaviour from the individual decision makers, as well as coordinated balancing of exploring new solution candidates against maintaining good solution candidates. Without such an ability, the system as a whole will not be able to systematically move towards the more promising areas of the search space, gradually focusing in on an optimal configuration. Therefore, in this paper we omit a direct comparison with schemes that rely on a "fixed sampling then decide" approach, such as Unimodal Bandits [25]. For the same reason, we will not investigate pure exploration bandits [26, 27, 28, 29, 30], that is, bandits that have a predefined finite time horizon and whose performance is only measured at the end of that horizon. Consequently, such algorithms are free to explore without any negative impact, and they have been shown to outperform traditional exploration-exploitation bandits, such as TS and UCB, in scenarios where exploitation is not required. (There also exists a wide spectrum of techniques and schemes in the literature on the topic of searching with noise; see for instance [31] for a comprehensive survey. These are unable to handle unknown and deceptive environments with stochastic directional feedback, and are therefore not directly comparable to SPL solution schemes. We have therefore not included this class of techniques in the present paper.)
Paper Contributions. The contributions of this paper can be summarized as follows. First, we formulate a compact and scalable Bayesian representation of the solution space. This Bayesian representation simultaneously captures both the location of the optimal point (bandit arm) and the probability of receiving correct feedback. We further introduce the accompanying Thompson Sampling guided Stochastic Point Location (TSSPL) scheme for balancing exploration against exploitation. By learning the truthfulness $p$ of the environment, TSSPL also supports deceptive environments that lie about the direction of the optimal arm. This, in turn, allows us to solve the fundamental Stochastic Root-Finding (SRF) problem. More specifically, the contributions of the paper can be summarized as follows:

We introduce the novel TSSPL scheme, which represents the solution space of NDoor Puzzles, and thus SPL problems, in terms of a Bayesian model. As opposed to competing solutions that merely maintain and refine a single candidate solution, our Bayesian model encompasses the complete space of candidate solutions at every time instant. This Bayesian representation of the problem enables efficient exploration and exploitation of the solution space with Thompson Sampling.

We formulate a compact and scalable Bayesian representation of the solution space that simultaneously captures both the location of the optimal point (arm), as well as the probability of receiving correct feedback.

We link TSSPL to so-called Stochastic Bisection Search, and unify accompanying methods under the umbrella of Thompson Sampling.

Similarly, we enhance Soft Generalized Binary Search (SGBS), Probabilistic Bisection Search (PBS) and the Burnashev-Zigangirov Algorithm (BZ) by introducing novel parameter-free solutions that take advantage of our Bayesian model of the NDoor Puzzle/SPL problem. This approach eliminates the previous reliance on prior knowledge of the degree of noise affecting the system to be optimized.

We finally demonstrate the empirical performance of TSSPL for both SPL and SRF problems. TSSPL outperforms state-of-the-art algorithms in both informative and deceptive environments, except that it is beaten by the SGBS and BZ schemes when these are given a correctly specified observation noise.
Paper Outline. The paper is organized as follows. In Section 2, we present our scheme for Thompson Sampling guided Stochastic Point Location (TSSPL). We first introduce the Bayesian model of the NDoor Puzzle. Based on the Bayesian model, we then formulate a TS based scheme that balances solution space exploration against reward maximization. We further extend selected state-of-the-art solution schemes with the Bayesian model that TSSPL employs. This extension removes the need for prior information on observation noise. Then, in Section 3, we provide extensive empirical results comparing TSSPL with state-of-the-art schemes for both SPL and SRF. We conclude in Section 4 and point to promising avenues for further work.
2 Thompson Sampling guided Stochastic Point Location (TSSPL)
In this section, we introduce the Thompson Sampling guided Stochastic Point Location (TSSPL) scheme. At the core of TSSPL we find a Bayesian model of the NDoor Puzzle (introduced in Section 1).
Formally, we represent an NDoor Puzzle instance as a tuple $\Lambda = (d, p)$, where $d \in \mathcal{D}$ is a door from the set of doors $\mathcal{D}$ and $p$ is the truthfulness of the guards. Let $\Lambda^* = (d^*, p^*)$ be the particular NDoor Puzzle faced. A novel aspect of TSSPL is that instead of maintaining a single or a limited set of candidate solutions, we maintain a posterior distribution over the whole solution space, $P(\Lambda \mid O_t)$. This distribution is conditioned on the feedback $O_t$ already obtained up to time step $t$, allowing us to zero in on $\Lambda^*$ as the number of time steps increases, ultimately converging to $\Lambda^*$.
Assuming no prior information, we assign a uniform distribution over the solution space, i.e., all puzzle instances are equally probable. By gradually refining the posterior distribution, we can select guards to question in a goal-directed manner. In all brevity, we sample a solution candidate $\hat{\Lambda} = (\hat{d}, \hat{p})$ from $P(\Lambda \mid O_t)$, selecting the guard to the left or to the right of door $\hat{d}$. The answer of the selected guard is then used to update our posterior distribution. By repeating this procedure, the expected posterior probability of the underlying NDoor Puzzle instance, $\Lambda^*$, increases monotonically, reducing the probability of the other puzzle instances. In effect, given enough iterations, TSSPL will correctly identify the door leading to freedom as the posterior probability of $\Lambda^*$ approaches unity.

2.1 Bayesian Model of the NDoor Puzzle
The main purpose of the Bayesian model is to facilitate efficient calculation of a posterior distribution over the possible NDoor Puzzle instances, $P(\Lambda \mid O_t)$. Since the prisoner does not initially know which problem instance he is facing, and since the observations are stochastic, we cast $d$ and $p$ as two random variables. We further assume that $d$ and $p$ are independent of each other. Furthermore, the information we obtain from questioning the guards is represented as a set of random variables $O_t = \{o_1, \ldots, o_t\}$, with each random variable $o_i$ representing the answer to question $i$. Finally, we assume that the outcomes of the individual questions, $o_i$, are independent when conditioned on $d$ and $p$. For each question $i$, we can then compute the probability of the answer $o_i$ ("left" or "right") that we received from the guard, as summarized in Table 1.
Guard to the left of door $d$ leading to freedom: $P(o_i = \text{left} \mid d, p) = 1 - p$, $P(o_i = \text{right} \mid d, p) = p$

Guard to the right of door $d$ leading to freedom: $P(o_i = \text{left} \mid d, p) = p$, $P(o_i = \text{right} \mid d, p) = 1 - p$
As an example, assume that the truthfulness of the guards is $p$. Let us further solicit the guard to the left of door $j$, with the guard replying that the door leading to freedom lies to his left. We can then infer that each door to the left of the guard has likelihood $p$ of leading to freedom, while each door to the right of the guard has likelihood $1 - p$ of leading to freedom.
Applying Bayes' Theorem to $P(o_i \mid d, p)$, defined in Table 1, we are able to derive closed-form expressions for the posterior distributions of both $d$ and $p$. The derivation of $P(d \mid O_t)$ follows (the derivation of $P(p \mid O_t)$ is analogous, and is left out here for the sake of brevity):

$P(d \mid O_t)$  (1)

$= \frac{P(O_t \mid d)\, P(d)}{P(O_t)}$  (2)

$= \frac{\sum_{p} P(O_t \mid d, p)\, P(d, p)}{P(O_t)}$  (3)

$= \frac{\sum_{p} P(O_t \mid d, p)\, P(d)\, P(p)}{P(O_t)}$  (4)

$= \frac{\sum_{p} \prod_{i=1}^{t} P(o_i \mid d, p)\, P(d)\, P(p)}{P(O_t)}$  (5)

Above, $O_t = \{o_1, \ldots, o_t\}$ and $P(O_t) = \sum_{d} \sum_{p} P(O_t \mid d, p) P(d) P(p)$, and (2) follows directly from Bayes' Theorem. We obtain (3) by marginalizing out $p$. Eq. (4) is a result of the independence of $d$ and $p$, and (5) of the conditional independence between the questions in $O_t$. This leads us to the following two equations for updating our knowledge surrounding both the door probabilities (Eq. 6) and the truthfulness of the guards (Eq. 7).

$P(d \mid O_t) \propto \sum_{p} \prod_{i=1}^{t} P(o_i \mid d, p)\, P(d)\, P(p)$  (6)

$P(p \mid O_t) \propto \sum_{d} \prod_{i=1}^{t} P(o_i \mid d, p)\, P(d)\, P(p)$  (7)
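Assuming the truthfulness is discretized to a finite grid, the joint update behind Eqs. (6) and (7) can be sketched as follows. This is a minimal illustration: the array layout and the convention that guard $g$ stands between doors $g$ and $g + 1$ are our own assumptions.

```python
import numpy as np

def update_posterior(post, guard, answer, p_grid):
    """One Bayesian update of the joint posterior over (door, truthfulness).

    post   : (N, K) array, post[d, k] = probability of the instance where
             door d leads to freedom and the truthfulness is p_grid[k].
    guard  : index of the queried guard (between doors guard, guard + 1).
    answer : "left" or "right".

    Multiplies in the likelihood from Table 1: the answer points toward
    the true door with probability p and away from it with probability
    1 - p, then renormalizes.
    """
    N, K = post.shape
    d = np.arange(N)[:, None]                         # candidate doors
    toward = (d <= guard) if answer == "left" else (d > guard)
    lik = np.where(toward, p_grid[None, :], 1.0 - p_grid[None, :])
    post = post * lik
    return post / post.sum()
```

Marginalizing one axis recovers the door posterior of Eq. (6), `post.sum(axis=1)`, and the truthfulness posterior of Eq. (7), `post.sum(axis=0)`.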
2.2 Guard Selection
We have now formally established how we can turn information from the guards into a probability distribution over which door leads to freedom. However, as mentioned previously, we here face a trade-off between exploring different doors and zeroing in on the best door found so far. To handle this trade-off, we model the door selection as a so-called Global Information MAB (GIMAB) [32].

To decide which door to select at each iteration, we solve the GIMAB by utilizing the principle of TS. Here, the selection process simply amounts to selecting a door at random, proportionally to the probability that the door is the one leading to freedom. Once the door has been selected, we need to decide which of the guards to query: the guard to the left or the guard to the right of the selected door. We do this by, again, selecting one of the two guards randomly, proportionally to the sum of the probabilities of the doors next to each guard. Assume for instance that we have three doors with probabilities $q_1$, $q_2$ and $q_3$ of leading to freedom. Then, according to the TS principle, these are also the probabilities with which we sample each particular door. Note that since the answer obtained from each queried guard affects the complete probability distribution over the doors (the probability associated with every door is updated), we have a GIMAB as opposed to a traditional MAB.
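The two-stage sampling procedure above can be sketched as follows. This is a hedged interpretation: the convention that guard $g$ stands between doors $g$ and $g + 1$, and the handling of the edge doors, which only have one adjacent guard, are our own assumptions.

```python
import numpy as np

def select_guard(door_probs, rng):
    """Thompson-style guard selection for the N-Door Puzzle (a sketch).

    door_probs : length-N array with the current door posterior P(d | O_t).

    First a door d is sampled proportionally to its posterior probability.
    Then one of the two guards flanking d is sampled proportionally to the
    summed posterior mass of the doors adjacent to each guard.
    Returns the index of the chosen guard.
    """
    N = len(door_probs)
    d = rng.choice(N, p=door_probs)       # sample door ~ P(d | observations)
    if d == 0:                            # edge doors have a single guard
        return 0
    if d == N - 1:
        return N - 2
    w_left = door_probs[d - 1] + door_probs[d]     # guard d - 1, left of d
    w_right = door_probs[d] + door_probs[d + 1]    # guard d, right of d
    return d - 1 if rng.random() < w_left / (w_left + w_right) else d
```

Feeding the selected guard's answer back into the posterior update closes the TSSPL loop.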
2.3 Improving State-of-the-Art Schemes with the Bayesian Model
A main advantage of TSSPL compared to similar schemes is its Bayesian model, which enables TSSPL to operate without prior problem parameters. Due to TSSPL's close connection to Probabilistic Bisection Search (PBS) [33], Noisy Generalized Binary Search (NGBS) [34] and the BZ algorithm [35], we will here utilize our Bayesian TSSPL model to also make these other schemes parameter-free.
Probabilistic Bisection Search
The goal of Probabilistic Bisection Search (PBS), which in this context also covers the Stochastic Binary Search [36, 34], is to locate an unknown point $x^*$. To acquire intelligence on the location of $x^*$, one queries an Oracle about the relation between a point $x$ and $x^*$. The Oracle responds by informing whether $x$ is on the left or the right side of $x^*$. If we assume that the Oracle always tells the truth, then the well-known deterministic Bisection Search, which halves the search space with each query, can be employed to efficiently find $x^*$. However, in PBS we assume that the Oracle provides correct answers with probability $p$ and erroneous ones with probability $1 - p$.
The origin of PBS can be traced to Horstein [33]. In PBS, a probability distribution is mapped over the search space and gradually updated using a Bayesian methodology, under the assumption that the environment noise is known a priori. The search space is then continuously explored using the median of the posterior distribution as the point of interest. It has been shown that PBS has a geometric rate of convergence under the latter assumptions [36].
As the noise is assumed given, one can simply invoke Eq. 8 to calculate the posterior distribution after observing the answer $o_t$ to a query at point $x_t$:

$P_{t+1}(x^*) = \frac{P(o_t \mid x_t, x^*)\, P_t(x^*)}{\sum_{x'} P(o_t \mid x_t, x')\, P_t(x')}$  (8)

Here $P(o_t \mid x_t, x^*)$ is the conditional probability of obtaining answer $o_t$. That is, for every location $x^*$ to the left of $x_t$, the probability that the Oracle directs the decision maker to the right is $1 - p$, while it directs him to the left with probability $p$. And conversely, for $x^*$ to the right of $x_t$.
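A discretized version of the PBS update in Eq. (8) can be sketched as follows. The boundary convention for bins coinciding with the queried point is our own assumption.

```python
import numpy as np

def pbs_update(post, x_idx, answer, p):
    """Probabilistic Bisection update on a discretized line (a sketch).

    post   : posterior over the bin locations of the target x*.
    x_idx  : index of the queried point; the Oracle reports whether x*
             lies to its "left" or "right".
    p      : known probability that the Oracle answers correctly.

    Bins on the indicated side are weighted by p, the remaining bins
    by 1 - p, followed by renormalization.
    """
    idx = np.arange(len(post))
    indicated = (idx < x_idx) if answer == "left" else (idx >= x_idx)
    post = post * np.where(indicated, p, 1.0 - p)
    return post / post.sum()
```

In PBS proper, the next query point is then chosen as the median of the updated posterior.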
Generalized Binary Search
The Generalized Binary Search (GBS) problem can be formulated as follows [37, 34]. Consider a collection $\mathcal{F}$ of $N$ unique binary-valued functions defined on a domain $\mathcal{X}$. Each $f \in \mathcal{F}$ is defined as a mapping from $\mathcal{X}$ to $\{-1, 1\}$. Assume that there exists an optimal function $f^* \in \mathcal{F}$ that produces the correct binary labeling for each $x \in \mathcal{X}$. For each query $x \in \mathcal{X}$, the value of $f^*(x)$ is observed, possibly corrupted by independent binary noise. The objective is then to determine the function $f^*$ using as few queries as possible. In this paper we restrict $\mathcal{F}$ to the class of threshold binary functions, with the effect of turning the GBS problem into the informative NDoor Puzzle.
If the feedback is noiseless, then the problem boils down to the combinatorial problem of finding an optimal decision tree in the hypothesis space $\mathcal{F}$, a problem that Hyafil and Rivest showed to be NP-complete [38, 37]. The Soft-Decision Generalized Binary Search (SDGBSearch) [34, 37] is the state-of-the-art algorithm for finding $f^*$ when the probability of binary noise is less than $\frac{1}{2}$, i.e., for informative environments.
Similarly to TSSPL, SDGBSearch employs a probabilistic model that at time step $t$ assigns a probability $p_t(f)$ to each $f \in \mathcal{F}$. However, at each time step, it decides which $x \in \mathcal{X}$ to query next based on a deterministic heuristic, selecting the point where the weighted predictions of the candidate functions are as balanced as possible:

$x_t = \arg\min_{x \in \mathcal{X}} \left| \sum_{f \in \mathcal{F}} p_t(f)\, f(x) \right|$  (9)
SDGB uses the following equation to determine and update $p_{t+1}$ at each time step:

$p_{t+1}(f) = \frac{1}{Z_t}\, p_t(f) \left( (1 - \beta)\, \frac{1 + f(x_t)\, y_t}{2} + \beta\, \frac{1 - f(x_t)\, y_t}{2} \right)$  (10)

where $Z_t$ is a normalizing constant, $\beta \in (0, \frac{1}{2})$ is the assumed noise level, and $y_t \in \{-1, 1\}$ is the response from the Oracle. Simplifying Eq. 10, we observe that $\frac{1 + f(x_t) y_t}{2}$ represents an agreement indicator that takes on the value 1 if $f(x_t)$ is equal to $y_t$ and 0 otherwise. Furthermore, we note that since $f(x_t), y_t \in \{-1, 1\}$, one of $\frac{1 + f(x_t) y_t}{2}$ and $\frac{1 - f(x_t) y_t}{2}$ will have to take the value 1, while the other takes the value 0.
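The update in Eq. (10) can be sketched as follows, a vectorized illustration in which the variable names are our own.

```python
import numpy as np

def sgbs_update(probs, fx, y, beta):
    """Soft-decision GBS posterior update (a sketch with known noise beta).

    probs : current probability p_t(f) for each candidate function f.
    fx    : array of predictions f(x_t) in {-1, +1}, one per candidate.
    y     : observed, possibly noisy, label in {-1, +1}.
    beta  : assumed noise level, 0 < beta < 1/2.

    Candidates agreeing with y are weighted by 1 - beta, the remaining
    candidates by beta, followed by renormalization.
    """
    agree = (1 + fx * y) / 2          # 1 if f(x_t) == y, else 0
    w = (1 - beta) * agree + beta * (1 - agree)
    probs = probs * w
    return probs / probs.sum()
```

Since the weights only depend on agreement with $y_t$, the update mirrors the PBS rule, which is the observation exploited below.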
This update scheme is identical to the one found in PBS and thus suffers from the same limitation (the noise probability is assumed known a priori). In the same manner as we enhanced PBS with a prior over the noise, we can enhance SDGB (using Eqs. 6 and 7) to become a parameter-free scheme, again employing our Bayesian TSSPL model. In the following, we will denote this improved version of SDGB as SDGBM.
Burnashev-Zigangirov Algorithm
The Burnashev-Zigangirov (BZ) Algorithm [35] is one of the most widely used algorithms for solving the discrete PBS problem, and has in particular been employed in the context of Active Learning [39, 40]. In BZ, we search for a point $x^*$ that is located on a line. This line is discretized into $N$ bins, and we are only allowed to query the borders of the bins for the direction of $x^*$. The BZ algorithm suffers from the same practical limitation as PBS and SDGB, namely, a dependency on knowing the noise level beforehand.

We will now show how BZ can be improved in a similar fashion as PBS and SDGB, leveraging our Bayesian model. Let $q_i(t)$ denote the probability of $x^*$ residing in bin $i$ at time step $t$. The probability mass function (pmf) over all the bins is therefore $q(t) = (q_1(t), \ldots, q_N(t))$, with its cumulative distribution function (cdf) denoted $Q(t)$.
To decide which point to investigate next, that is, deciding a value for $x_t$, BZ selects one of the two bin borders closest to the median of $Q(t)$. We denote this point $x_t$. The binary response variable $y_t$ then points towards $x^*$ with probability $1 - \beta$, whereas it points away from $x^*$ with probability $\beta$, where $\beta$ denotes the noise probability. To update the probability distribution over the bins, the posterior mass on the side of $x_t$ indicated by $y_t$ is scaled up in proportion to $1 - \beta$, and the mass on the opposite side is scaled down in proportion to $\beta$, with the normalizing factor computed as part of the update itself.

To change BZ into a parameter-free scheme, we first notice that, for any given noise level $\beta$, the BZ and PBS updates assign the same relative weight to each bin. After some simple algebraic manipulations, it turns out that the updating scheme of the BZ algorithm is identical to PBS except that:

BZ calculates the normalizing factor as part of the updating rule itself, instead of first multiplying in the likelihood values and then normalizing, as PBS does.

BZ samples at the interval edges, while PBS samples the midpoints of each interval.
3 Empirical Results
In this section we evaluate the performance of TSSPL empirically against competing schemes. We investigate both the effect various parameter settings have on behavior, and the capability of TSSPL to handle different applications, including Stochastic Point Location and Stochastic Root-Finding problems. Unless otherwise noted, the empirical results report the average of 10 000 independent trials.
For some of the applications we investigate here, we do not find any existing scheme that handles deceptive environments. Instead, the schemes we have identified assume that feedback is informative on average. To render the comparison fair, we thus introduce TSSPLINF, configured with a prior stating that the feedback is informative. This also serves to exemplify the power of our Bayesian approach, since we can leverage a prior tailored to the task at hand. Note that this informed prior is equivalent to the priors used by the other probability theory based schemes, PBSM and SDGBM.
Further note that we apply a fixed set of parameter values across the whole suite of experiments, set to optimize overall performance. For SHSSL [11] and HSSL [8] we used the same tree branching factor, and the tunable parameters of ASS [7] and OCBA [9] were likewise held fixed. The priors used for TSSPL are uniform over the unit interval, discretized both for the doors and for the truthfulness. For the informative schemes TSSPLINF, PBSM, SDGBM and BZM, we utilize the same prior for the doors as for TSSPL; however, we use a uniform prior over the informative interval $(\frac{1}{2}, 1]$ for the truthfulness.
We will in the following subsections investigate (1) the effect of different priors on TSSPL; (2) TSSPL's ability to identify the nature of the underlying stochastic environment; (3) the ability to solve the Stochastic Point Location problem; and (4) performance on Stochastic Root-Finding problems, a particularly intriguing class of deceptive environments that arises naturally from the properties of stochastic root finding.
3.1 Sensitivity to Discretization and Distribution of Prior
Although TSSPL is a parameter-free scheme, it depends on defining the space of all possible NDoor Puzzles, and then formulating a prior distribution over this space. Since TSSPL is a discrete scheme, an important question is how TSSPL fares under various levels of discretization, that is, how the performance of TSSPL is affected by the cardinality of the set of doors.
We define convergence of TSSPL to an interval of doors as the event that 95% of the probability mass is contained in that interval, i.e., $P(d \in [d_l, d_u] \mid O_t) \geq 0.95$. The measure of interest is then the number of time steps that pass before convergence.
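This convergence criterion can be sketched as follows, a minimal check in which the inclusive interval indexing is our own convention.

```python
import numpy as np

def has_converged(door_probs, lo, hi, mass=0.95):
    """Convergence test (sketch): TSSPL is considered converged to the
    door interval [lo, hi] once at least `mass` (here 95%) of the
    posterior probability lies inside that interval."""
    return door_probs[lo:hi + 1].sum() >= mass
```

The reported convergence step is then the first time step at which this test succeeds.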
From Table 2 we identify that the cardinality of the door set does, in fact, affect the performance of TSSPL. As the cardinality increases, so does the time it takes before TSSPL converges. However, from Table 2 it is evident that this relationship between convergence time and cardinality is not linear; indeed, the increase in convergence time is insignificant even when doubling from 3200 to 6400 possible doors, suggesting a logarithmic relation between the cardinality and the convergence time.
Number of doors:  100  200  400  800  1600  3200  6400  

Convergence Steps:  31.4  36.0  38.9  39.4  39.3  40.2  40.9 
To see how the granularity of the truthfulness affects performance, we gradually increase the discretization of the interval $[0, 1]$ over which the truthfulness $p$ is defined. We fix the cardinality of the door set to 100. Observing Table 3, it is clear that an increase in the discretization of $p$ does not significantly affect performance.
Discretization of truthfulness:  50  100  200  400  800  1600  3200  

Convergence Steps:  51.6  50.8  48.4  52.1  51.0  52.4  52.1 
Another advantage of our Bayesian scheme is the ability to incorporate prior information to guide the algorithm. On the other hand, specifying an incorrect prior can deteriorate performance instead of enhancing it. In Table 4 we give the results for informed priors over the door and the truthfulness. Relative to the correct underlying values, we specify three types of priors: Correct, Incorrect and Flat (all solutions equally probable), denoted C, I and F, respectively. From Table 4 we can see the effect of the different priors. In brief, having a correct prior over the doors contributes more to reducing convergence time than having a correct prior over the truthfulness of the guards. The disadvantage of setting an incorrectly biased prior is also evident, as the flat prior performs better than any combination involving an incorrect prior.
Door  Truthfulness  Convergence  Door  Truthfulness  Convergence  Door  Truthfulness  Convergence 
F  F  36.4  C  F  30.2  I  F  46.4 
F  C  35.7  C  C  30.0  I  C  45.2 
F  I  41.2  C  I  40.5  I  I  113.1 
3.2 Tracking the Truthfulness of the Environment
An interesting property of TSSPL is its ability to provide a distribution over the truthfulness $p$ of the problem instance at hand. This is a significant advantage in practical applications, as it presents the end-user with better insight into the underlying environment. It can in particular be leveraged in the case of repeated trials, where the information from previous trials can be used as a prior for subsequent trials, greatly increasing the speed of convergence, as seen in Section 3.1. Figure 1 shows the probability of each level of noise as TSSPL progresses in a highly deceptive environment. As seen, TSSPL is capable of quickly estimating the correct value of $p$.
3.3 Stochastic Point Location
The NDoor Puzzle, as outlined in the introduction, is defined by two variables, $d$ and $p$, with $d$ specifying the door leading to freedom and $p$ the truthfulness of the guards. Since the NDoor Puzzle does not pose any spatial requirements on the placement of the doors, we can generate a mapping from the NDoor Puzzle to the SPL problem by placing the doors uniformly over the unit interval.
As not all of the schemes evaluated in this section are Bayesian, we introduce the notion of regret, as is typical for the multi-armed bandit scenario, as a metric for measuring the performance of the different schemes. Regret can be stated as the cumulative penalty from selecting suboptimal actions. In the case of SPL, we define regret as the (unsigned) distance between the selected point and the optimal point $\lambda$.
3.3.1 Informative SPL
We evaluate the performance of TSSPL and TSSPLINF on an informative SPL problem against algorithms designed to handle informative environments. To the best of our knowledge, this is the first time the family of PBS based schemes and the family of SPL based schemes are compared.
The performance of the different schemes is summarized in Table 5. One significant observation is the performance difference between the Learning Automata (LA) based schemes (HSSL and SHSSL) and the Bayesian schemes. It is clear that, performance-wise, the Bayesian schemes significantly outperform the LA based schemes; however, it should be noted that the LA based schemes require less memory and run faster than the Bayesian ones due to their simplicity.
As can be deduced from Table 5, the distance between the optimal point $\lambda$ and the center of the search space is an important indicator of how hard a particular SPL problem is to solve. This can be explained by the fact that most schemes start exploring from the center. Thus, if $\lambda$ is far from the center, such a scheme has to obtain more evidence before exploring the peripheral regions of the search space. This is particularly apparent for PBSM, as its performance peaks when $\lambda$ lies at the center, even when faced with significant noise.
Since PBSM pursues the median of the probability distribution, we can say that PBSM is conservative in its exploration: it takes significantly more evidence to move its point of exploration than is the case for TSSPL. TSSPL, on the other hand, has a tendency to explore too much; as noted by Lattimore [41], using TS for exploration can lead to over-exploration when facing high-variance distributions. In the low noise scenarios, on the other hand, NGBSM is the most efficient scheme, exploring deterministically.
Moreover, from Table 6 we observe that TS-SPL-INF exhibits the lowest standard deviation overall and is consequently the scheme that most consistently performs close to its expected regret in every trial. This is in sharp contrast to PBS-M, which outperforms TS-SPL in terms of average regret but is unable to do so consistently. NGBS-M also displays significant variance in high-noise scenarios.
Table 5: Average regret.

| Scheme | Avg. regret | Avg. regret | Avg. regret |
|---|---|---|---|
| TS-SPL | 29.2 / 9.8 / 5.1 | 36.4 / 12.9 / 6.2 | 57.3 / 20.3 / 10.1 |
| TS-SPL-INF | 22.2 / 7.3 / 3.7 | 22.5 / 7.7 / 3.8 | 23.9 / 8.7 / 4.3 |
| PBS-M | 9.8 / 4.0 / 2.6 | 32.7 / 14.2 / 8.5 | 52.1 / 29.6 / 16.9 |
| BZ-M | 23.5 / 5.9 / 2.2 | 27.5 / 6.3 / 2.5 | 35.1 / 9.6 / 3.4 |
| NGBS-M | 36.9 / 3.5 / 1.0 | 48.9 / 4.5 / 1.5 | 68.5 / 7.1 / 2.3 |
| ASS | 45.8 / 17.0 / 6.7 | 30.4 / 8.9 / 3.6 | 38.8 / 11.7 / 3.9 |
| OCBA | 70.8 / 47.4 / 35.2 | 89.9 / 55.8 / 37.1 | 112.1 / 78.4 / 48.8 |
| HSSL | 117.3 / 23.1 / 8.2 | 111.7 / 16.7 / 4.8 | 131.5 / 19.1 / 5.3 |
| SHSSL | 152.2 / 32.6 / 11.8 | 151.8 / 23.5 / 6.5 | 175.1 / 26.1 / 7.3 |
Table 6: Standard deviation of the regret.

| Scheme | Std. dev. | Std. dev. | Std. dev. |
|---|---|---|---|
| TS-SPL | 16.8 / 5.9 / 2.6 | 20.5 / 6.5 / 3.1 | 30.9 / 10.3 / 4.6 |
| TS-SPL-INF | 13.8 / 4.2 / 2.0 | 14.2 / 4.4 / 2.5 | 15.7 / 5.7 / 2.4 |
| PBS-M | 15.2 / 10.3 / 10.2 | 69.1 / 40.9 / 31.5 | 94.2 / 71.1 / 56.6 |
| BZ-M | 30.4 / 8.9 / 3.1 | 40.8 / 9.8 / 4.9 | 48.8 / 15.3 / 5.3 |
| NGBS-M | 68.5 / 8.9 / 0.9 | 83.7 / 13.6 / 1.4 | 108.7 / 19.4 / 1.6 |
| ASS | 51.6 / 22.4 / 10.1 | 47.8 / 15.7 / 5.4 | 62.3 / 23.1 / 4.6 |
| OCBA | 46.2 / 27.6 / 19.4 | 63.9 / 43.9 / 25.6 | 76.1 / 64.6 / 41.9 |
| HSSL | 71.7 / 16.1 / 4.6 | 83.6 / 16.1 / 4.2 | 94.8 / 19.4 / 4.5 |
| SHSSL | 89.7 / 23.5 / 6.2 | 108.5 / 23.4 / 5.8 | 126.5 / 27.7 / 6.4 |
3.4 Stochastic Point Location in Deceptive Environments
With the underlying p taking on values across the interval [0, 1], we test TS-SPL, CPL-AdS [10], and SHSSL [11] for speed of convergence and for how much regret one accumulates on average before converging. However, since CPL-AdS operates in a two-phase manner, direct comparison with TS-SPL and SHSSL is inappropriate, because the latter schemes operate online. Oommen et al. state in [10] that the decision phase needs approximately 200 time steps, and by this time TS-SPL is already close to converging to the actual solution. To further explore this point, see Table 7. Here it is clear that TS-SPL is superior to CPL-AdS by several orders of magnitude, in addition to outperforming SHSSL.
Another interesting observation is that the performance of TS-SPL is symmetrical around p = 0.5. Further note that, as stated earlier, SHSSL fails to converge for p close to 0.5, so SHSSL effectively operates with a smaller search space for p than both TS-SPL and CPL-AdS.
After modifying PBS, NGBS, and BZ to support a Bayesian model of truthfulness, we can apply the same prior used in TS-SPL to these schemes as well, yielding PBS-M, NGBS-M, and BZ-M. The effect of this enhancement of the existing schemes is summarized in Table 7. As is clearly seen, the query selection method of these schemes is not suited to handle deceptive environments.
Table 7: Average regret in the deceptive SPL setting.

| Scheme | Avg. regret | Avg. regret |
|---|---|---|
| TS-SPL | 6.2 | 6.2 |
| CPL-AdS | 501.6 / 354.9 | 842.8 / 502.3 |
| PBS-M | 31.5 | 77.5 |
| BZ-M | 4.9 | 352.5 |
| NGBS-M | 1.4 | 191.2 |
| SHSSL | 6.5 | 6.5 |
3.5 Stochastic Root-Finding Problem
The deterministic root-finding problem is the task of locating a root x* such that g(x*) = 0 for a function g defined over an interval [a, b]. We assume that g is unknown; however, an oracle returns g(x) when queried at a point x. The problem then becomes: how can we determine the root x* using as few queries as possible? If the response from the oracle is noisy, we obtain the Stochastic Root-Finding Problem (SRFP) [6].
One approach to solving the deterministic root-finding problem is the Bisection Method, which halves the search space each iteration by repeatedly querying the oracle at the midpoint of the remaining search space. For the SRFP, however, the Bisection Method is unable to discard half of the search space, since the oracle may provide false information regarding the value of g(x).
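For reference, a minimal sketch of the deterministic Bisection Method, assuming a noise-free oracle and a sign change over the interval:

```python
def bisect_root(g, lo, hi, tol=1e-8):
    """Deterministic bisection: halve the search space each iteration
    by querying the oracle g at the midpoint of the remaining interval.
    Assumes g(lo) and g(hi) have opposite signs and g is noise-free."""
    assert g(lo) * g(hi) < 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:   # root lies in the left half
            hi = mid
        else:                     # root lies in the right half
            lo = mid
    return (lo + hi) / 2

# Example: root of g(x) = x - 0.3 on [0, 1].
print(bisect_root(lambda x: x - 0.3, 0.0, 1.0))  # ~0.3
```

With a noisy oracle, a single wrong sign makes this scheme discard the half containing the root, which is exactly why the SRFP requires a different treatment.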
The objective is therefore to select a sequence of queries x_1, x_2, …, x_T to gather information about g such that the final query x_T is close to the root x*, i.e., |x_T − x*| ≤ ε for some tolerance ε [42].
Formally, let g be a function with root x* such that (without loss of generality) sign(g(x)) = −1 for all x < x* and sign(g(x)) = +1 for all x > x*.
For any x, the oracle generates a sample g̃(x) = g(x) + ξ, where ξ is stochastic noise. Furthermore, define s(x) as the sign of g̃(x), i.e., s(x) = sign(g̃(x)). As we shall see, TS-SPL, as well as PBS and its variants, operates solely on s(x). While disregarding the scalar information in g̃(x) might seem wasteful, it opens up for the highly efficient Bayesian framework deployed by TS-SPL. In the scenario that we investigate here, noise reverses the sign of the function value, i.e., the oracle returns sign(g(x)) with probability p and −sign(g(x)) with probability 1 − p.
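The sign-reversing oracle described above can be sketched as follows; `noisy_sign_oracle` is an illustrative name, and p denotes the probability of correct feedback:

```python
import random

def noisy_sign_oracle(g, x, p):
    """Return sign(g(x)), flipped with probability 1 - p.
    With p > 0.5 the oracle is informative; with p < 0.5 it is
    deceptive, reporting the wrong sign more often than not."""
    s = 1 if g(x) > 0 else -1
    return s if random.random() < p else -s

# With p = 1.0 the oracle is exact:
print(noisy_sign_oracle(lambda x: x - 0.3, 0.8, 1.0))  # 1
```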
The traditional way of solving the SRFP is to apply a variant of Stochastic Approximation (SA) [43, 44]. Implementation-wise, SA methods extend or modify the iterative Newton-Raphson algorithm to handle noise (the form of SA shown here is also referred to as Classical Stochastic Approximation (CSA), as it closely resembles the original form proposed by Robbins and Monro [43]):

x_{n+1} = x_n − a_n g̃(x_n),

where {a_n} is a sequence of step lengths that decreases as n increases. Applying SA to the SRFP has been studied extensively in the literature, and a full review is outside the scope of this article; interested readers are referred to [46, 47, 6] and the references therein. As there exists a myriad of different SA algorithms, we have selected one of the more fundamental approaches as a basis for comparing the different types of schemes. Note that a limitation of this SA scheme is that g is required to be monotone.
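A minimal sketch of this classical SA iteration, with the common step-length choice a_n = a_0/n (the exact step-length sequence used in the experiments is not specified here, so this choice is an assumption):

```python
import random

def robbins_monro(noisy_g, x0, n_iters=1000, a0=1.0):
    """Classical Stochastic Approximation (Robbins-Monro):
    x_{n+1} = x_n - a_n * g~(x_n), where g~(x_n) is a noisy
    observation of g(x_n) and a_n = a0 / n is a decreasing
    step-length sequence. Converges to the root when g is
    monotonically increasing through it."""
    x = x0
    for n in range(1, n_iters + 1):
        x = x - (a0 / n) * noisy_g(x)
    return x

random.seed(0)
# Noisy observations of the monotone function g(x) = x - 0.3.
est = robbins_monro(lambda x: (x - 0.3) + random.gauss(0, 0.1), 0.5)
print(est)  # close to 0.3
```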
The main difference between the SRFP and SPL is that, unlike SPL, the SRFP does not provide feedback concerning the direction of the root x* from the query location x. To map the feedback s(x) into a direction, it is necessary to know whether g is increasing or decreasing. If g is increasing and g(x) > 0, then the root is to the left of x (and to the right if g(x) < 0). Conversely, if g is decreasing and g(x) > 0, then x* is to the right of x (and to the left if g(x) < 0).
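The sign-to-direction mapping above amounts to a small lookup, sketched here with an illustrative helper:

```python
def root_direction(sign_gx: int, increasing: bool) -> str:
    """Map the sign of g(x) to the direction of the root x*.
    If g is increasing and g(x) > 0, the root lies to the left of x;
    the remaining three cases follow by symmetry."""
    if increasing:
        return "left" if sign_gx > 0 else "right"
    return "right" if sign_gx > 0 else "left"

print(root_direction(+1, increasing=True))   # left
print(root_direction(+1, increasing=False))  # right
```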
Learning the direction of g can be done by repeatedly querying a single point x on the edge of the interval [a, b]. To gain insight into how many repeated samples are sufficient, we employ the two-sided Hoeffding inequality P(|ȳ_n − E[ȳ_n]| ≥ ε) ≤ 2 exp(−2nε²), where ȳ_n is the average of n queries at x, ε > 0, and the observations are normalized to [0, 1]. Setting the right-hand side equal to δ and solving for n, we obtain n ≥ ln(2/δ)/(2ε²). Plugging in δ = 0.01, we are 99% sure of our estimate of the sign of g(x) once n satisfies this bound.
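The sample-size bound can be computed directly; the value of ε below is an illustrative assumption, not taken from the experiments:

```python
import math

def hoeffding_samples(eps: float, delta: float) -> int:
    """Number of repeated queries n needed so that the empirical mean
    is within eps of its expectation with probability at least
    1 - delta, by solving 2*exp(-2*n*eps**2) <= delta for n
    (observations normalized to [0, 1])."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Illustrative values: eps = 0.1, 99% confidence (delta = 0.01).
print(hoeffding_samples(eps=0.1, delta=0.01))  # 265
```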
However, it turns out that TS-SPL, being able to handle deceptive environments, does not require this sampling phase. It merely requires an arbitrary, yet consistent, mapping of each sign to a direction, for instance, s(x) = −1 to left and s(x) = +1 to right. The reason is that if the initial mapping is wrong, TS-SPL will recognize that the feedback is deceptive and thus still be able to solve the problem with no additional effort. An informative scheme, on the other hand, is unable to recognize this and will therefore be unable to find the root without an additional sampling phase.
We remark that the above sampling procedure only enables the other methods to handle the SRFP for an unknown function in an informative environment; it does not provide a definite answer as to whether the environment is informative or deceptive. See [10] for a way to implement this as an additional sampling phase. The functions that we use to measure performance and compare schemes are illustrated in Figure 2.
| Func A | Avg. regret | Avg. regret | Avg. regret |
|---|---|---|---|
| TS-SPL | 46.0 | 17.1 | 8.6 |
| TS-SPL-INF | 55.2 (24.3) | 39.1 (8.1) | 35.0 (4.1) |
| PBS-M | 55.1 (24.2) | 40.3 (9.3) | 35.2 (4.2) |
| NGBS-M | 53.4 (22.4) | 35.3 (4.3) | 32.9 (1.9) |
| BZ-M | 63.8 (32.8) | 39.9 (8.9) | 33.8 (2.8) |
| SA | 32.4 | 14.8 | 5.4 |
| ASS | 80.4 (49.4) | 38.7 (7.7) | 33.5 (2.5) |
| HSSL | 60.4 (29.4) | 45.5 (14.6) | 35.8 (4.8) |
| SHSSL | 62.8 | 20.2 | 6.7 |
| CPL-AdS | 162.1 (107.9) | 146.3 (97.4) | 135.3 (90.1) |

Table 8: The results are given in the format "average residuals (average residuals after sampling)" for each scheme. For CPL-AdS, the sampling period is the estimation period (epoch 0) as defined by the scheme. The number of iterations per trial is 250, with 10,000 independent trials per data point.
| Func B | Avg. regret | Avg. regret | Avg. regret |
|---|---|---|---|
| TS-SPL | 47.1 | 17.8 | 8.5 |
| TS-SPL-INF | 53.8 (22.9) | 39.5 (8.5) | 35.1 (4.1) |
| PBS-M | 41.1 (10.2) | 35.3 (4.31) | 33.3 (2.3) |
| SGBS-M | 50.6 (19.7) | 35.7 (4.7) | 33.0 (2.0) |
| BZ-M | 60.3 (29.4) | 39.6 (8.7) | 33.6 (2.6) |
| SA | 175.1 | 204.5 | 223.3 |
| ASS | 81.6 (50.6) | 39.5 (8.7) | 40.2 (9.0) |
| HSSL | 85.3 (54.4) | 50.6 (19.6) | 39.0 (8.0) |
| SHSSL | 75.4 | 30.8 | 12.7 |
| CPL-AdS | 117.9 (109.3) | 116.7 (107.1) | 144.9 (96.5) |
| Func C | Avg. regret | Avg. regret | Avg. regret |
|---|---|---|---|
| TS-SPL | 36.9 | 13.7 | 6.4 |
| TS-SPL-INF | 52.8 (21.9) | 38.8 (7.8) | 34.9 (3.9) |
| PBS-M | 49.6 (18.7) | 39.2 (8.3) | 34.3 (3.3) |
| SGBS-M | 47.2 (16.2) | 34.6 (3.6) | 32.5 (1.6) |
| BZ-M | 58.8 (27.9) | 38.4 (7.4) | 33.7 (2.7) |
| SA | 149.0 | 178.0 | 185.0 |
| ASS | 54.3 (23.4) | 39.0 (8.0) | 33.5 (2.5) |
| HSSL | 75.2 (44.3) | 44.3 (13.4) | 35.6 (4.6) |
| SHSSL | 56.5 | 18.4 | 6.4 |
| CPL-AdS | 153.0 (101.6) | 156.2 (103.7) | 165.0 (109.6) |
From Tables 8, 9, and 10, it is clear that TS-SPL is the most efficient root solver among the state-of-the-art schemes. This largely stems from the fact that it simultaneously learns whether g is decreasing or increasing while also locating the root x*. In addition, there is the risk that the sampling procedure the other schemes apply to determine whether g is increasing may conclude with the wrong answer. If this happens, none of the schemes depending on the sampling will converge towards the root x*. Furthermore, an advantage of TS-SPL is that it can be applied to a wide range of functions without regard to any local extrema residing in the function, unlike SA, which shows excellent performance only for monotonic functions, as exemplified in Table 8.
4 Conclusions and Further Work
In this paper, we investigated a novel reinforcement learning problem derived from the so-called "N-Door Puzzle". This puzzle has the fascinating property that it involves stochastic compulsive liars: feedback is erroneous on average, systematically misleading the decision maker. This renders traditional reinforcement learning (RL) approaches ineffective due to their dependency on "on average" correct feedback. To solve the problem of deceptive feedback, we recast the problem as a particularly intriguing variant of the multi-armed bandit problem, referred to as the Stochastic Point Location (SPL) problem. The decision maker is here only told whether the optimal point on a line lies to the "left" or to the "right" of a current guess, with the feedback being erroneous with probability 1 − p. Solving this problem opens up for optimization in continuous action spaces with both informative and deceptive feedback.
Our solution to the above problem, introduced in the present paper, is based on a novel compact and scalable Bayesian representation of the solution space. This model simultaneously captures both the location of the optimal point λ and the probability p of receiving correct feedback. We further introduced an accompanying Thompson Sampling guided Stochastic Point Location (TS-SPL) scheme for balancing exploration against exploitation. By learning p, TS-SPL supports deceptive environments that lie about the direction of the optimal point.
The resulting scheme was applied to the Stochastic Point Location (SPL) problem and outperformed all of the Learning Automata driven methods. However, by enhancing the Soft Generalized Binary Search (SGBS) scheme with our Bayesian representation of the solution space, SGBS was able to outperform TS-SPL under informative feedback. For deceptive SPL problems, TS-SPL outperformed all of the existing state-of-the-art schemes by several orders of magnitude, even when the latter were supported by our Bayesian model.
We also applied TS-SPL to the Stochastic Root-Finding Problem (SRFP). We demonstrated that the SRFP can be seen as a deceptive problem, allowing TS-SPL to outperform existing dedicated state-of-the-art SRFP schemes by an order of magnitude. Thus, TS-SPL can be considered state-of-the-art for both deceptive SPL and the SRFP, while yielding results comparable to the top-performing schemes in the case of informative SPL.
Despite the above performance gains, TS-SPL is based on Thompson Sampling, which is known to have a tendency to over-explore high-variance reward distributions [41]. In future work, it would therefore be interesting to investigate mechanisms that eliminate or reduce this tendency, to further increase convergence speed.
References
 [1] B. J. Oommen, “Stochastic Searching on the Line and its Applications to Parameter Learning in Nonlinear Optimization,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 27, no. 4, pp. 733–739, 1997.
 [2] A. Yazidi, B. J. Oommen, and O.-C. Granmo, “A novel stochastic discretized weak estimator operating in non-stationary environments,” in Computing, Networking and Communications (ICNC), 2012 International Conference on. IEEE, 2012, pp. 364–370.
 [3] B. J. Oommen, S. Misra, and O.-C. Granmo, “Routing bandwidth-guaranteed paths in MPLS traffic engineering: A multiple race track learning approach,” Computers, IEEE Transactions on, vol. 56, no. 7, pp. 959–976, 2007.
 [4] B. J. Oommen, O.-C. Granmo, and Z. Liang, “A novel multidimensional scaling technique for mapping word-of-mouth discussions,” in Opportunities and Challenges for Next-Generation Applied Intelligence. Springer, 2009, pp. 317–322.
 [5] H. Chen and B. W. Schmeiser, “Stochastic root finding via retrospective approximation,” IIE Transactions, vol. 33, no. 3, pp. 259–275, 2001.
 [6] R. Pasupathy and S. Kim, “The stochastic root-finding problem: Overview, solutions, and open questions,” ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 21, no. 3, p. 19, 2011.
 [7] T. Tao, H. Ge, G. Cai, and S. Li, “Adaptive step searching for solving stochastic point location problem,” in Intelligent Computing Theories. Springer, 2013, pp. 192–198.
 [8] A. Yazidi, O.-C. Granmo, B. J. Oommen, and M. Goodwin, “A hierarchical learning scheme for solving the stochastic point location problem,” in Advanced Research in Applied Artificial Intelligence. Springer, 2012, pp. 774–783.
 [9] J. Zhang, L. Zhang, and M. Zhou, “Solving stationary and stochastic point location problem with optimal computing budget allocation,” in Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on, Oct 2015, pp. 145–150.
 [10] B. J. Oommen, G. Raghunath, and B. Kuipers, “On how to learn from a stochastic teacher or a stochastic compulsive liar of unknown identity,” in AI 2003: Advances in Artificial Intelligence. Springer, 2003, pp. 24–40.
 [11] J. Zhang, Y. Wang, C. Wang, and M. Zhou, “Symmetrical hierarchical stochastic searching on the line in informative and deceptive environments,” IEEE Transactions on Cybernetics, vol. PP, no. 99, pp. 1–10, 2016.
 [12] R. Smullyan, To Mock a Mockingbird and Other Logic Puzzles: Including an Amazing Adventure in Combinatory Logic. Knopf, 1988.
 [13] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.
 [14] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
 [15] O.-C. Granmo, “Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton,” International Journal of Intelligent Computing and Cybernetics, vol. 3, no. 2, pp. 207–234, 2010.
 [16] O. Chapelle and L. Li, “An empirical evaluation of Thompson sampling,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 2249–2257.
 [17] S. Agrawal and N. Goyal, “Analysis of Thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, COLT, 2012.
 [18] ——, “Further optimal regret bounds for Thompson sampling,” in Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, 2013, pp. 99–107.
 [19] ——, “Thompson sampling for contextual bandits with linear payoffs,” in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 127–135.
 [20] S. Glimsdal and O.-C. Granmo, “Gaussian process based optimistic knapsack sampling with applications to stochastic resource allocation,” in Proceedings of the 24th Midwest Artificial Intelligence and Cognitive Science Conference 2013. CEUR Workshop Proceedings, 2013, pp. 43–50.
 [21] O.-C. Granmo and S. Glimsdal, “Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore Game,” Applied Intelligence, vol. 38, no. 4, pp. 479–488, 2013.
 [22] L. Jiao, X. Zhang, B. J. Oommen, and O.-C. Granmo, “Optimizing channel selection for cognitive radio networks using a distributed Bayesian learning automata-based approach,” Applied Intelligence, vol. 44, no. 2, pp. 307–321, 2016.
 [23] D. Tolpin and F. Wood, “Maximum a posteriori estimation by search in probabilistic programs,” in Eighth Annual Symposium on Combinatorial Search, 2015.
 [24] O.-C. Granmo, B. J. Oommen, S. A. Myrer, and M. G. Olsen, “Learning Automata-based Solutions to the Nonlinear Fractional Knapsack Problem with Applications to Optimal Resource Allocation,” IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 37, no. 1, pp. 166–175, 2007.
 [25] Y. Y. Jia and S. Mannor, “Unimodal bandits,” in International Conference on Machine Learning (ICML-11), 2011, pp. 41–48.
 [26] E. Even-Dar, S. Mannor, and Y. Mansour, “Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems,” Journal of Machine Learning Research, vol. 7, no. Jun, pp. 1079–1105, 2006.
 [27] K. G. Jamieson, M. Malloy, R. D. Nowak, and S. Bubeck, “lil’ UCB: An optimal exploration algorithm for multi-armed bandits,” in COLT, vol. 35, 2014, pp. 423–439.
 [28] J.-Y. Audibert and S. Bubeck, “Best arm identification in multi-armed bandits,” in COLT - 23rd Conference on Learning Theory, 2010.
 [29] V. Gabillon, M. Ghavamzadeh, A. Lazaric, and S. Bubeck, “Multi-bandit best arm identification,” in Advances in Neural Information Processing Systems, 2011, pp. 2222–2230.
 [30] Z. S. Karnin, T. Koren, and O. Somekh, “Almost optimal exploration in multi-armed bandits,” in International Conference on Machine Learning (ICML-13), vol. 28, 2013, pp. 1238–1246.
 [31] A. Pelc, “Searching games with errors – fifty years of coping with liars,” Theoretical Computer Science, vol. 270, no. 1, pp. 71–109, 2002.
 [32] O. Atan, C. Tekin, and M. van der Schaar, “Global multi-armed bandits with Hölder continuity,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015, pp. 28–36.
 [33] M. Horstein, “Sequential transmission using noiseless feedback,” Information Theory, IEEE Transactions on, vol. 9, no. 3, pp. 136–143, 1963.
 [34] R. Nowak, “Generalized binary search,” in Communication, Control, and Computing, 2008 46th Annual Allerton Conference on. IEEE, 2008, pp. 568–574.
 [35] M. V. Burnashev and K. Zigangirov, “An interval estimation problem for controlled observations,” Problemy Peredachi Informatsii, vol. 10, no. 3, pp. 51–61, 1974.
 [36] R. Waeber, P. I. Frazier, and S. G. Henderson, “Bisection search with noisy responses,” SIAM Journal on Control and Optimization, vol. 51, no. 3, pp. 2261–2279, 2013.
 [37] R. D. Nowak, “The geometry of generalized binary search,” Information Theory, IEEE Transactions on, vol. 57, no. 12, pp. 7893–7906, 2011.
 [38] L. Hyafil and R. L. Rivest, “Constructing optimal binary decision trees is NP-complete,” Information Processing Letters, vol. 5, no. 1, pp. 15–17, 1976.
 [39] A. Singh, R. Nowak, and P. Ramanathan, “Active learning for adaptive mobile sensing networks,” in Proceedings of the 5th international conference on Information processing in sensor networks. ACM, 2006, pp. 60–68.
 [40] R. M. Castro and R. D. Nowak, “Upper and lower error bounds for active learning,” in The 44th Annual Allerton Conference on Communication, Control and Computing, vol. 2, no. 2.1, 2006, p. 1.
 [41] T. Lattimore, “Optimally confident ucb: Improved regret for finitearmed bandits,” arXiv preprint arXiv:1507.07880, 2015.
 [42] R. Waeber, P. Frazier, and S. G. Henderson, “A Bayesian approach to stochastic root finding,” in Simulation Conference (WSC), Proceedings of the 2011 Winter. IEEE, 2011, pp. 4033–4045.
 [43] H. Robbins and S. Monro, “A stochastic approximation method,” The Annals of Mathematical Statistics, pp. 400–407, 1951.
 [44] J. Kiefer and J. Wolfowitz, “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
 [45] R. Pasupathy and B. W. Schmeiser, “Root finding via darts – dynamic adaptive random target shooting,” in Simulation Conference (WSC), Proceedings of the 2010 Winter. IEEE, 2010, pp. 1255–1262.
 [46] T. L. Lai, “Stochastic approximation,” Annals of Statistics, pp. 391–406, 2003.
 [47] S. Asmussen and P. W. Glynn, Stochastic simulation: Algorithms and Analysis. Springer Science & Business Media, 2007, vol. 57.