## 1 Introduction

We consider the following set-guessing problem, which is similar to classical group testing [1] but differs in the form of the observations. Let $\mathbb{R}$ be the real line and $\theta = (\theta_1, \dots, \theta_k)$ be a vector containing the unknown locations of $k$ objects, where $k$ is known. One can sequentially choose subsets $A_1, A_2, \dots$ of $\mathbb{R}$, query the number of objects in each set, and obtain a series of noiseless answers $X_1, X_2, \dots$. Our goal is to devise a method for choosing the questions that allows us to find $\theta$ as accurately as possible, given a finite budget of questions. We work in a Bayesian setting, and use the entropy of the posterior distribution on $\theta$ to measure accuracy.

We consider both adaptive policies, i.e., policies that choose the next question based on previous answers, and non-adaptive policies, i.e., policies that choose all questions in advance. Adaptive policies promise to better localize the objects within the given query budget, by adapting later questions to provide more useful information, but non-adaptive policies offer easy parallelization because all questions may be asked simultaneously.

In this paper, we present two policies: a new non-adaptive policy, called the dyadic policy, which splits the search domain into successively finer partitions; and an adaptive policy, called the greedy policy, which chooses questions to maximize the one-step expected reduction in entropy. We make the following contributions:

We show that the dyadic policy achieves an information-theoretic upper bound on the expected entropy reduction achievable by a non-adaptive policy, showing it is optimal among non-adaptive policies. We also show that the dyadic policy’s performance is within a factor of two of an upper bound on the entropy reduction under any policy, adaptive or non-adaptive. Moreover, this non-adaptive policy is easy to compute and provides a posterior distribution that supports fast computation. Specifically, after $N$ questions and answers, the dyadic policy allows for explicitly computing the expected number of objects within each element of a partition of $\mathbb{R}$ into $2^N$ bins, which can be used in a second stage of querying. We also further characterize the entropy of the posterior under this policy by providing explicit expressions for its expected value and its asymptotic variance, and by showing that it is asymptotically normally distributed.

We also consider the greedy policy, and show its performance is at least as good as that of the dyadic policy, and in some cases is strictly better. Thus, this policy offers improved query efficiency, though it does not support parallelization and requires substantially more computation than the dyadic policy, making it the more appropriate choice for applications that do not allow asking questions in parallel, and for which questions are substantially more expensive than computation.

We also compare these policies against benchmarks and show that they offer substantial performance benefits over the previous state-of-the-art.

### 1.1 Literature Review

The previous literature on similar problems can be classified into two groups: those that consider a single object ($k = 1$); and those that consider multiple objects ($k > 1$).

Among single-object versions of this problem, the earliest is the Rényi-Ulam game [2, 3, 4]. In this game, one person (the responder) thinks of a number between one and one million and another person (the questioner) chooses a sequence of subsets to query in order to find this number. The responder can answer either YES or NO and is allowed to lie a given number of times.

Variations of the Rényi-Ulam game have been considered in [5]. Among these variations, the following continuous probabilistic version, first studied in [6], is similar to the problem we consider: the responder thinks of a number, and the questioner aims to find a set whose measure is below a prescribed threshold and that contains this number with at least a prescribed probability. In addition, the responder lies with probability no more than a given bound. That work analyzes whether the questioner can win this game as a function of the error probability, and provides search algorithms.

Among previous work on the single-object problem, perhaps the closest to the current work is [7], which considered a Bayesian setting and used the entropy of the posterior distribution as a measure of accuracy, as we do here. It considered two policies: a greedy policy called probabilistic bisection, which was originally proposed in [8] and further studied in [9, 10], and the dyadic policy. [11] generalized the probabilistic bisection policy to multiple questioners. Here, we generalize both policies to multiple objects.

Our work contrasts with this previous work on the single-object problem by considering multiple objects.

The previous literature includes work on three multiple-object problems:
the Group Testing problem [1, 12, 13, 14, 15, 16];
the subset-guessing game associated with the Random Chemistry algorithm [17, 18];
and the Guessing Secret game [19].
In the Group Testing problem, questions are of the form: *Does the queried set contain at least one object?*
In the subset-guessing game associated with the Random Chemistry algorithm,
questions are of the form: *Does the queried set contain all of the objects?*
In the Guessing Secret game, when queried with a set, the responder chooses an element from the collection of objects according to any rule that he likes, and tells the questioner whether this chosen element is in the queried set. The chosen element itself is not revealed, and may change after each question. Thus, the answer is affirmative when every object lies in the queried set, negative when no object lies in it, and can be either otherwise.

Our work contrasts with this previous work by considering a problem where the answer provided by the responder is not binary but instead counts the number of objects in the queried set.

Our use of the (differential) entropy as a measure of quality in localizing objects follows a similar use of entropy in other sequential interrogation problems, including optimization of continuous functions [20], online learning [21], and adaptive compressed sensing [22]. In this literature and here, the differential entropy is of direct interest as a measure of concentration of the posterior probability. Indeed, it is the logarithm of the volume of the smallest set containing “most of the probability”; see [23], p. 246. In our setting, when the prior distribution over each object’s location is uniform, the posterior distribution is also uniform, and the differential entropy is exactly the logarithm of the volume of the support set of the posterior density. When the querying process discussed here is followed by a second stage involving a different querying process with different kinds of questions and answers (as it is in each of the examples discussed below), the differential entropy may be considered a surrogate for the time complexity required in this second stage.

### 1.2 Applications

The problem we consider, or slight variants of it, appears in the three applications discussed below: heavy hitter detection in network traffic streams, screening for important input factors in complex simulators, and fast object localization in computer vision. It also appears in searching for auto-catalytic sets of molecules [17], and in searching for collections of multiple contingencies leading to cascading power failures in models of electrical networks [24].

In each of the three applications discussed below, objects’ locations are discrete rather than continuous. The policies we present, which result from an analysis considering differential entropy and a continuous prior, may still be used profitably even when objects’ locations are known to lie on a finite subset of $\mathbb{R}$, as long as the questions asked do not become finer than the granularity of that subset.

In heavy hitter detection [25], we operate a router within a computer network, and wish to detect a (presumably small) number of source IP addresses that are generating traffic through our network exceeding a given limit on packet rate. These source IP addresses are called “heavy hitters”.
Although we could, in theory, keep an ever-expanding list of all source IP addresses with associated packet counts, this would require a prohibitive amount of memory.
Instead, one can choose a set $A$ of IP addresses, and count how many packets fall into that set over a short time period. (Our framework allows general $A$, while in practice the set should be of a form that allows easily checking whether a packet resides within it, for example, by having the set consist of all source IP addresses simultaneously satisfying a collection of conditions on individual bits within the address. The dyadic policy that we construct has this form when the prior is uniform and the number of allowed queries is below a threshold.) By comparing this number to the limit on packet rate, one can obtain information about the number of heavy hitters (which are our objects) with source addresses within $A$.
By sequentially, or simultaneously, querying several such sets, one can obtain a low-entropy posterior distribution on the locations of all heavy hitters. One can then follow this first stage of queries on sets by a second confirmatory stage of queries on individual IP addresses that the first stage revealed were likely to be heavy hitters.

In screening for important factors in complex simulators [26, 27], we wish to determine which of a large number of input parameters have a significant effect on the output of a computer model. A factor model models each input to the simulator as a factor taking one of two values, “on” or “off”, and models the output of the simulator as approximately linear in the factors, with unknown coefficients that multiply each of the “on” factors to produce the output. The “important” factors are those with nonzero coefficients, and these are the objects we seek to identify. To identify them, we may choose a set of factors to turn on, and observe the output of the simulator, which gives us information about the number of important factors in the queried set. By sequentially, or simultaneously, querying several sets, we may obtain a low-entropy posterior distribution on the identity of the important factors. We can then individually query those factors believed to be important in a second confirmatory stage.

In computer vision applications, we may wish to localize object instances in images and video streams. Examples include detecting faces in images [28], finding quasars in astronomical data [29], and counting synapses in electron microscopy volumes [30]. To support this, high-performing but computationally expensive classifiers exist that can localize object instances accurately. While one way to localize each instance would be to run such a classifier at each and every pixel to assess whether an instance was centered at that pixel, this would be computationally intractable for large images or video sequences. Instead, one can divide the image into various sub-regions $A$, and use a computation to count how many instances fall in each region. Critically, counting the number of instances in a region is substantially faster than running the classifier at every pixel in that region; see [31, 32, 33]. Using a low-entropy posterior distribution obtained from these queries, one can compute the expected number of objects, among $\theta_1, \dots, \theta_k$, at each pixel. We can then run our expensive classifier in a second confirmatory stage at those pixels where an object instance has been identified as being likely to reside. We illustrate our policies on a substantially simplified version of this problem in Section 7.

In Section 2, we state the problem more formally and summarize our main results.

## 2 Problem Formulation and Summary of Main Results

Let $\theta = (\theta_1, \dots, \theta_k)$ be a random vector taking values in $\mathbb{R}^k$. The component $\theta_i$ represents the location of the $i$th object of interest, $i = 1, \dots, k$. We assume that $\theta_1, \dots, \theta_k$ are i.i.d. with density $f_0$, and joint density $p_0$. We assume $f_0$ is absolutely continuous with respect to the Lebesgue measure and has finite differential entropy, which is defined in (2). We refer to $p_0$ as the Bayesian prior probability distribution on $\theta$.

We will ask a series of $N$ questions to locate $\theta_1, \dots, \theta_k$, where each question takes the form of a subset of $\mathbb{R}$, and the answer to this question is the number of objects in this subset. More precisely, for each $n \in \{1, \dots, N\}$, the $n$th question is a set $A_n \subseteq \mathbb{R}$ and its answer is

$$X_n = \sum_{i=1}^{k} \mathbb{1}_{A_n}(\theta_i), \tag{1}$$

where $\mathbb{1}_{A}$ is the indicator function of the set $A$. Unless otherwise stated, our choice of the set $A_n$ may depend upon the answers to all previous questions, and upon some initial randomization through a uniform random variable $S$ on $[0,1]$ chosen independently of $\theta$. Thus, the set $A_n$ is random, through its dependence on $S$ and the answers to previous questions.

We call a rule for choosing the questions a policy. Formally, we define a policy $\pi$ to be a sequence $\pi = (\pi_1, \pi_2, \dots)$, where $\pi_n$ is a Borel-measurable subset of $[0,1] \times \{0, 1, \dots, k\}^{n-1} \times \mathbb{R}$. We denote the collection of all such policies by $\Pi$. With a policy $\pi$ specified, the choice of $A_n$ is then $A_n = \{u \in \mathbb{R} : (S, X_{1:n-1}, u) \in \pi_n\}$, so that specifying $\pi$ implicitly specifies a rule for choosing $A_n$ based on the random seed $S$ and the history $X_{1:n-1}$. Here, we have used the notation $X_{a:b}$ for any natural numbers $a$ and $b$ to indicate the sequence $(X_a, \dots, X_b)$ if $a \le b$, and the empty sequence if $a > b$. We define $A_{a:b}$ and $x_{a:b}$ similarly. The distribution of the questions and answers thus implicitly depends on $\pi$. When we wish to highlight this dependence, we will use the notation $P^{\pi}$ and $E^{\pi}$ to indicate probability and expectation respectively. However, when the policy being studied is clear, we will simply use $P$ and $E$.

This definition of $\Pi$ allows the choice of question to depend upon previous answers, and when we wish to emphasize this fact we will refer to $\Pi$ as the set of adaptive policies. We also define the set of non-adaptive policies $\Pi_N$ to be those under which each $A_n$ depends only on the random seed $S$, i.e., for a fixed $S = s$, the questions $A_{1:N}$ are deterministic. From a formal point of view, note that the set of adaptive policies includes the set of non-adaptive policies as a special case. Figure 1 illustrates, as a Bayesian network, the dependence structure of the random variables in our problem under an adaptive policy, and under a non-adaptive policy.

We refer to the posterior probability distribution on after questions as , so is the conditional distribution of given the past history . The dependence on arises because may depend on , in addition to . Equivalently, under any fixed policy , is the conditional distribution of given . This posterior can be computed using Bayes rule: is proportional to over the set , and outside.

After we exhaust our budget of $N$ questions, we will measure the quality of what we have learned via the differential entropy of the final posterior distribution $p_N$ on $\theta$,

$$H(p_N) = -\int_{\mathbb{R}^k} p_N(u) \log p_N(u)\, du. \tag{2}$$

Throughout this paper, we use to denote the logarithm to base 2. We let , and we assume .
The posterior distribution $p_n$, as well as its entropy $H(p_n)$, are random for $n \ge 1$, as they depend on $S$ and $X_{1:n}$. Thus, we measure the quality of a policy $\pi$ when given $N$ questions using the *rate of reduction in expected entropy*

$$R(\pi, N) = \frac{H(p_0) - E^{\pi}[H(p_N)]}{N}. \tag{3}$$

This rate is the average number of bits learned per question.

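Our goal in this paper is to characterize the solution to the optimization problem

$$\sup_{\pi \in \Pi'} R(\pi, N), \tag{4}$$

where $\Pi'$ is either the set of adaptive policies or the set of non-adaptive policies. Any policy that attains this supremum is called optimal. According to this definition, an optimal policy may not exist.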

While (4) can be formulated as a partially observable Markov decision process [34], and can be solved, in principle, via dynamic programming, the state space of this dynamic program is the space of posterior distributions over $\theta$, and the extreme size of this space prevents solving this dynamic program through brute-force computation. Thus, we must characterize optimal policies using other means.

We define two policies, the dyadic policy, which is non-adaptive, and the greedy policy, which is adaptive. (More precisely, the greedy policy is a class of policies, as its definition allows certain decisions to be made arbitrarily.) We will see below that the dyadic policy attains the supremum in (4) over non-adaptive policies, and thus is optimal among non-adaptive policies. We will also see that its performance comes within a factor of two of the supremum over adaptive policies, showing that it is a two-approximation among adaptive policies. We will also see below that the greedy policy performs at least as well as the dyadic policy, and so is also a two-approximation among adaptive policies.

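To define the dyadic policy $\pi_D$, let us recall that the quantile function of $\theta_1$ is

$$Q(p) = \inf\{u \in \mathbb{R} : p \le F_0(u)\}, \tag{5}$$

where $F_0$ is the cumulative distribution function of $\theta_1$, corresponding to its density $f_0$. The dyadic policy consists in choosing at step $n \ge 1$ the set

$$A_n = \left( \bigcup_{j=1}^{2^{n-1}} \left( Q\left(\tfrac{2j-1}{2^n}\right), Q\left(\tfrac{2j}{2^n}\right) \right] \right) \cap \operatorname{supp}(f_0), \tag{6}$$

where $\operatorname{supp}(f_0)$ is the support of $f_0$, i.e., the set of values $u \in \mathbb{R}$ for which $f_0(u) > 0$. For example, when $f_0$ is uniform over $(0,1]$, the dyadic policy is the one in which the first question is $A_1 = (\tfrac12, 1]$, the second question is $A_2 = (\tfrac14, \tfrac12] \cup (\tfrac34, 1]$, and each subsequent question $A_n$ is obtained by subdividing $(0,1]$ into $2^n$ equally sized subsets, and including every second subset. A further illustration of the dyadic question sets is provided in Figure 4 in Section 5. This definition of the dyadic policy generalizes a definition provided in [7] for single objects.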
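As an illustration, the dyadic question sets for a uniform prior on (0,1] can be enumerated directly. The following Python sketch is not taken from the paper: the function names `dyadic_question` and `count_answer` are hypothetical, the object locations are made up, and it assumes the convention that each question keeps the second cell of every consecutive pair of cells (keeping the first cells instead would carry the same information, since the two answers sum to the number of objects).

```python
from fractions import Fraction


def dyadic_question(n):
    """Return the n-th dyadic question for a uniform prior on (0, 1].

    The set is a union of half-open intervals (a, b], returned as a list of
    (a, b) pairs. Assumed convention: split (0, 1] into 2**n equal cells and
    keep every second cell, starting with the second one.
    """
    width = Fraction(1, 2 ** n)
    return [((2 * j - 1) * width, 2 * j * width)
            for j in range(1, 2 ** (n - 1) + 1)]


def count_answer(question, locations):
    """Noiseless answer: the number of objects lying in the queried set."""
    return sum(any(a < x <= b for a, b in question) for x in locations)


if __name__ == "__main__":
    objects = [Fraction(3, 10), Fraction(9, 10)]  # hypothetical locations
    for n in range(1, 4):
        q = dyadic_question(n)
        print(n, [(str(a), str(b)) for a, b in q], count_answer(q, objects))
```

Because the sets do not depend on earlier answers, all of them can be generated before any question is asked, which is what makes a non-adaptive policy easy to parallelize.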

We define a greedy policy to be any policy that chooses each of its questions to minimize the expected entropy of the posterior distribution one step forward in time,

(7) |

where the argmin is taken over all Borel-measurable subsets of . We show in Section 6 that this argmin exists.

We are now ready to present our main results:

(8) |

where $\pi_G$ is a greedy policy, $\pi_D$ is a dyadic policy, and

$$H\left(\mathrm{Bin}\left(k, \tfrac12\right)\right) = -\sum_{j=0}^{k} \binom{k}{j} 2^{-k} \log\left(\binom{k}{j} 2^{-k}\right) \tag{9}$$

is the entropy of a Binomial distribution $\mathrm{Bin}(k, \tfrac12)$.

The first inequality in (8) is an information-theoretic inequality (easily proved in Section 3). The second inequality is trivial since an optimal adaptive policy is at least as good as any other policy. The third inequality comes from a detailed computation of the posterior distribution of $\theta$ after observing answers for any possible sequence of questions (see Section 6.2). Additionally, we show that this inequality cannot be reversed, by presenting a special case in which there is a greedy policy whose performance is strictly better than that of the dyadic policy (see Section 6.3). The first equality comes from the characterization of the posterior distribution in the special case of the dyadic policy (see Section 5.2). The last equality relies on an information-theoretic inequality which exploits the conditional independence structure of non-adaptive policies. It is proven in Section 3. The last inequality, proven in Section 5.2, shows that the rate of an optimal non-adaptive policy is no less than half the rate of an optimal adaptive policy.

The power of these results is illustrated by Figure 2, which shows, as a function of the number of objects $k$, the number of questions required to reduce the expected entropy of the posterior on their locations by 20 bits per object. The figure shows the number of questions needed under the dyadic policy (solid line); under two benchmark policies described below, Benchmark 1 and Benchmark 2 (dotted and dash-dotted lines); and a lower bound on the number needed under the optimal adaptive policy (dashed line, and left-most expression in (8)). By (8), we know that the number of extra questions required by using either the dyadic or the greedy policy, instead of the optimal adaptive policy, is bounded above by the distance between the solid and dashed lines.

Benchmark 1 identifies each object individually, using an optimal single-object strategy. It first asks questions to localize the first object, reducing the entropy of our posterior distribution on that object’s location by 20 bits. This requires 20 questions, and can be achieved, for example, by a bisection policy [8]. It then uses the same strategy to localize each subsequent object, requiring 20 questions per object. Implementing such a policy would require the ability to ask questions about whether or not a single specified object (e.g., the first object) resides in a queried set, rather than the number of objects in that set. While this ability is not included in our formal model, Benchmark 1 nevertheless provides a useful comparison. The total number of questions required under this policy to achieve 20 bits of entropy reduction per object is $20k$.

Benchmark 2 is adapted from the sequential bifurcation policy of [27]. While [27] considered an application setting somewhat different from the problem that we consider here (screening for discrete event simulation), we were able to modify their policy to allow it to be used in our setting. A detailed description of the modified policy is provided in Appendix B. It makes full use of the ability to ask questions about multiple objects simultaneously, and improves slightly over Benchmark 1. We view this policy as the best previously proposed policy from the literature for solving the problem that we consider.

The figure shows that a substantial saving over both benchmarks is possible through the dyadic or greedy policy. For example, for $k = 16$ objects, Benchmark 1 and Benchmark 2 require 320 and 304 questions respectively. In contrast, the dyadic policy requires 106 questions, roughly one third of the number required by the benchmarks. Furthermore, (8) shows that the greedy policy performs at least as well as the dyadic policy. Thus, localizing objects’ locations jointly can be much more efficient than localizing them one-at-a-time, and the dyadic and greedy policies are implementable policies that can achieve much of the potential efficiency gains.

The figure also shows, again at $k = 16$ objects, that the optimal policy requires at least 80 questions, while the dyadic and greedy policies require no more than 106 questions, and so are within a factor of 1.325 of optimal. This is remarkable when we compare how little is lost in going from the hard-to-compute optimal policy to the easily computed dyadic policy with how much is gained in going to the dyadic policy from either of the two benchmark policies considered. Our results also show that this multiplicative factor is never worse than 2.
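The question counts quoted for Benchmark 1 and for the dyadic policy can be checked numerically. The sketch below is not from the paper: it assumes that the dyadic policy's rate equals the entropy, in bits, of a Binomial distribution with $k$ trials and success probability 1/2 (our reading of (8) and (9)), and the helper names are hypothetical.

```python
from math import ceil, comb, log2


def binomial_entropy_bits(k, p=0.5):
    """Entropy, in bits, of a Binomial(k, p) random variable."""
    total = 0.0
    for j in range(k + 1):
        prob = comb(k, j) * p ** j * (1 - p) ** (k - j)
        if prob > 0:
            total -= prob * log2(prob)
    return total


def questions_needed(k, bits_per_object, rate):
    """Smallest number of questions whose total expected entropy
    reduction, at the given rate in bits per question, reaches
    k * bits_per_object bits."""
    return ceil(k * bits_per_object / rate)


if __name__ == "__main__":
    k, bits = 16, 20
    # Benchmark 1 learns one bit per question, one object at a time.
    print("Benchmark 1:", questions_needed(k, bits, 1.0))
    # Assumed dyadic rate: H(Bin(k, 1/2)) bits per question.
    print("Dyadic:", questions_needed(k, bits, binomial_entropy_bits(k)))
```

Under this assumption the computation gives 320 questions for Benchmark 1 and 106 for the dyadic policy at $k = 16$, in agreement with the figures stated in the text.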

The dyadic policy can be computed extremely quickly, and can even be pre-computed, as the questions asked do not depend on the answers to previous questions. This makes it convenient in settings where multiple questions can be asked simultaneously, e.g., in a parallel or distributed computing environment. The greedy policy requires more computational effort than the dyadic policy, but is still substantially easier to compute than the optimal policy, and provides performance at least as good as that of the dyadic policy, as shown by (8), and sometimes strictly better, as will be shown in Section 6.3.

We see in the figure that the dyadic policy’s rate and the rate of the optimal policy come together at $k = 1$. This can also be seen directly from our theoretical results. When $k = 1$, the left-hand and right-hand sides of (8) are equal, since $\mathrm{Bin}(k, \tfrac12)$ becomes a Bernoulli$(\tfrac12)$ random variable, whose entropy is $\log 2 = 1$. This shows, when $k = 1$, that the rate of expected entropy reduction under the dyadic policy is the same as the upper bound on this rate under the optimal policy, which in turn shows that both dyadic and greedy policies are optimal, and the upper bound is tight. When $k = 1$, the well-known bisection policy is a greedy policy, and the dyadic policy is also greedy, i.e., satisfies (7).

We begin our analysis in Section 3, by justifying the left-most inequality in (8). We then provide an explicit expression for the posterior distribution in Section 4, which is used in later analysis. We analyze the dyadic policy in Section 5, and the greedy policy in Section 6. We illustrate the use of our policies on a stylized problem inspired by computer vision applications in Section 7. Finally, we offer concluding remarks in Section 8.

## 3 Upper Bounds on the Rate of Reduction in Expected Entropy

In this section, in Theorem 1 below, we prove the first inequality in (8), which is an easy upper bound on the reduction in expected entropy for a fixed number of questions and answers under an adaptive policy. This bound is obtained from the fact that the answer to each question is a number in $\{0, 1, \dots, k\}$, and so cannot provide more than $\log(k+1)$ bits. We also prove in Theorem 1 that the upper bound cannot be achieved for $k > 1$.

Then, we provide a complementary upper bound for non-adaptive policies in Theorem 2, which we later show in Section 5.2 is matched by the dyadic policy, showing that it is optimal among non-adaptive policies.

###### Theorem 1.

$$\sup_{\pi \in \Pi} R(\pi, N) \le \log(k+1). \tag{10}$$

Moreover, when $k > 1$, this inequality is strict.

###### Proof.

According to the definition of the rate of reduction in expected entropy in (3), in order to prove (10), we need to prove that under any valid policy $\pi$,

$$H(p_0) - E^{\pi}[H(p_N)] \le N \log(k+1). \tag{11}$$

Recall that is the entropy of the posterior distribution of , which is random through its dependence on the past history . Thus, . Furthermore, using information theoretic arguments, we have

(12) |

Moreover,

(13) |

where because the information contained in the random seed $S$ and the answers $X_{1:N}$ completely determines the questions $A_{1:N}$. Recall that for all $n$, $X_n$ is a discrete random variable with $k+1$ possible outcomes, namely $0, 1, \dots, k$. The maximum possible value for the entropy $H(X_n)$ is $\log(k+1)$, obtained when each outcome of $X_n$ has the same probability $\frac{1}{k+1}$, i.e., when $X_n$ is uniformly distributed on $\{0, 1, \dots, k\}$.

On the other hand,

(14) |

where for the same reason as above, and because the information contained in completely determines . Also, because the random seed is assumed to be independent of the objects .

We now prove that the inequality (10) is strict when , i.e. when there is more than one object. Consider any fixed , which specifies the questions set . Recall from (1) that and that are independent. As a consequence, , where . Therefore, when , implying . Thus, , so that there is no policy that can achieve the upper bound. ∎

Now, we provide an upper bound on the rate of reduction in expected entropy for all non-adaptive policies.

###### Theorem 2.

Under any non-adaptive policy $\pi \in \Pi_N$, we have

$$R(\pi, N) \le H\left(\mathrm{Bin}\left(k, \tfrac12\right)\right). \tag{15}$$

###### Proof.

To prove the claim (15), it suffices to prove that under any non-adaptive policy $\pi$,

$$H(p_0) - E^{\pi}[H(p_N)] \le N \, H\left(\mathrm{Bin}\left(k, \tfrac12\right)\right). \tag{16}$$

First of all, the relation between mutual information and entropy gives

(17) |

For the first term, we have

(18) |

For the second term, we have

(19) |

Furthermore, since the information contained in and completely determines . Also, , since we can see from Figure 1 that are conditionally independent given , and each is independent of conditional on as is the only parent of in the directed acyclic graph. In addition, since the random seed is assumed to be independent of the objects . Hence, we have

(20) |

Combining (17), (18) and (20) yields

(21) |

Recall that by definition, , which is a sum of i.i.d. Bernoulli random variables. Hence, for each fixed and , we have . Therefore,

(22) |

The claim of the theorem follows. ∎
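The role of the Binomial entropy in this bound can be illustrated numerically. As noted in the proof, under a non-adaptive policy each answer is, for a fixed seed, a Binomial count over $k$ independent objects, so a single answer can carry at most the largest entropy such a count can have. The sketch below uses hypothetical helper names and a simple grid search in place of an analytic argument; it suggests that this maximum is attained when each object lies in the queried set with probability 1/2, and that for $k > 1$ it stays strictly below the $\log(k+1)$ bound of Theorem 1.

```python
from math import comb, log2


def binomial_entropy_bits(k, p):
    """Entropy, in bits, of a Binomial(k, p) random variable."""
    total = 0.0
    for j in range(k + 1):
        prob = comb(k, j) * p ** j * (1 - p) ** (k - j)
        if prob > 0:
            total -= prob * log2(prob)
    return total


def best_single_question_entropy(k, grid=200):
    """Grid search over the probability p that one object falls in the
    queried set; returns (best p, entropy of Binomial(k, best p))."""
    candidates = [i / grid for i in range(grid + 1)]
    best_p = max(candidates, key=lambda p: binomial_entropy_bits(k, p))
    return best_p, binomial_entropy_bits(k, best_p)


if __name__ == "__main__":
    for k in (1, 2, 5, 16):
        p, h = best_single_question_entropy(k)
        print(k, p, round(h, 4), round(log2(k + 1), 4))
```

For $k = 1$ the two quantities printed in the last two columns coincide, which matches the statement that the bound of Theorem 1 is only strict when there is more than one object.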

## 4 Explicit Characterization of the Posterior Distribution

In this section, we first introduce in Section 4.1 some notation to characterize the joint location of the objects and provide an example to illustrate this notation. We then derive an explicit formula for the posterior distribution on the locations of the objects. In Section 4.2, we compute the conditional distribution of the next answer given the previous answers, which we will use later to analyze the rate of a policy.

### 4.1 The Posterior Distribution of the Objects

Consider a fixed , where . For each binary sequence of length , , let

(23) |

The collection is a partition of the support of . A history of questions provides information on which sets contain which objects among .

We will think of a sequence of binary sequences as a sequence of codewords indicating the sets in which each of the objects reside, i.e, indicating that is in , is in , etc. We may consider each binary sequence to be a column vector, and place them into an binary matrix, . This binary matrix then codes the location of all objects, and is a codeword for their joint location.

Moreover, to characterize the location of the random vector in terms of its codeword , define to be the Cartesian product

(24) |

To be consistent with an answer , we must have exactly objects located in the question set for each . This can be described in terms of a constraint on the matrix as , i.e., that the sum of the row in the matrix is . Thus, after observing the answers to the questions , the set of all possible joint codewords describing is

(25) |

To illustrate the previous construction, and also to provide the foundation for a later analysis in Section 6.3 showing the greedy policy is strictly better than the dyadic policy in some settings, we provide two examples of the posterior distribution, arising from two different responses to the same sequence of questions.

Suppose are two objects located in (0,1] with a uniform prior distribution . Let and be the first two questions of the dyadic policy, so and . Then consider two possibilities for the answers to these questions:

#### Example 1:

Suppose and . According to (25), there is only one matrix in the collection , which has . Thus where

(26) |

We can observe that when is in , and otherwise.

#### Example 2:

Suppose and . According to (25), there are four matrices in the collection ,

(27) |

We can observe that the posterior distribution has density when is in or or or , and is otherwise.

All possible joint locations of in the two examples above are shown in Figure 3.
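The counts of joint codewords in the two examples can be reproduced by brute force. This sketch (a hypothetical function name, not the paper's code) enumerates all binary matrices with one row per question and one column per object whose row sums match the observed answers. With two objects and two questions, any pair of answers in which each answer is 0 or 2 leaves a single matrix, as in Example 1, while answers of 1 and 1 leave four, as in Example 2.

```python
from itertools import product


def consistent_codewords(answers, k):
    """List all n-by-k binary matrices whose m-th row sums to answers[m].

    Entry (m, i) equal to 1 means that object i lies in the m-th
    question set; each matrix is returned as a tuple of row tuples.
    """
    n = len(answers)
    matrices = []
    for bits in product((0, 1), repeat=n * k):
        rows = [bits[m * k:(m + 1) * k] for m in range(n)]
        if all(sum(row) == x for row, x in zip(rows, answers)):
            matrices.append(tuple(rows))
    return matrices


if __name__ == "__main__":
    for answers in [(2, 0), (1, 1)]:
        found = consistent_codewords(answers, k=2)
        print(answers, len(found), found)
```

The enumeration is exponential in the number of matrix entries, so it is only meant for small illustrations like this one.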

Given this notation, we observe the following lemma:

###### Lemma 1.

Let a policy and the random seed be fixed. Then, for each , the event can be rewritten

(28) |

where we recall that depends on . Moreover for any with , the two sets and are disjoint.

###### Proof.

Clearly, according to the definition of in (25), when , the answers that we observe must satisfy . On the other hand, suppose . Then belongs to some nonempty set where . Hence, there exists , , such that , which implies that the answer to the question is . This proves (28).

Now, for any , there exists with such that . This implies that and are disjoint and the last assertion follows. ∎

At this point, the explicit characterization of the posterior distribution is immediate and we have the following lemma.

###### Lemma 2.

(29) |

and for . Here, for any measurable set , denotes the integral . Moreover,

(30) |

where denotes the integral .

### 4.2 The Posterior Predictive Distribution of $X_{n+1}$

We now provide an explicit form for the posterior predictive distribution of $X_{n+1}$, i.e., its conditional distribution given the history $X_{1:n}$ and the external source of randomness $S$ in the policy $\pi$. This is useful because Lemma 4 in the appendix shows that the expected entropy can be computed using the conditional entropy of each answer given the preceding history and the random seed. We use this in Sections 5.2 and 6.2 to compute the expected entropy for the dyadic and greedy policies respectively.

For $n = 0$, we have demonstrated in the proof of Theorem 1 that $X_1$ follows a binomial distribution given $S$.

Now, consider any , and any fixed history . Using the equality (28) presented in Lemma 1 we have,

(31) |

Now, since for any , according to Lemma 1, we can simplify:

(32) |

Also, using Lemma 2, we obtain

(33) |

Finally, according to (1), $X_{n+1}$ is the sum of the Bernoulli random variables $\mathbb{1}_{A_{n+1}}(\theta_1), \dots, \mathbb{1}_{A_{n+1}}(\theta_k)$. Given the event , these Bernoulli random variables are conditionally independent with respective parameters . This conditional independence can be verified as follows. Consider any fixed binary vector . For each , let be equal to if and its complement if . Then,

(34) |

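Using this conditional independence, we may provide an explicit probability mass function. When , is conditionally given and . In general, let $Z_1, \dots, Z_k$ be independent discrete random variables with $Z_i \sim \mathrm{Bernoulli}(q_i)$, where $q_1, \dots, q_k$ are any real numbers in [0,1]. The distribution of $Z = \sum_{i=1}^{k} Z_i$ is called the *Poisson Binomial* distribution, which was first studied by S. D. Poisson in [35]. We denote the distribution of $Z$ by $\mathrm{PB}(q_1, \dots, q_k)$, and its probability mass function is given by

$$P(Z = z) = \sum_{B \subseteq \{1, \dots, k\},\ |B| = z}\ \prod_{i \in B} q_i \prod_{j \notin B} (1 - q_j), \qquad z = 0, 1, \dots, k, \tag{35}$$

and has mean and variance given by

$$E[Z] = \sum_{i=1}^{k} q_i, \qquad \operatorname{Var}(Z) = \sum_{i=1}^{k} q_i (1 - q_i). \tag{36}$$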

Using this definition of the Poisson Binomial distribution, the conditional distribution of given and is .
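A Poisson Binomial probability mass function is straightforward to evaluate numerically. The sketch below is an illustration rather than the paper's method: instead of summing over subsets, it uses the standard one-variable-at-a-time convolution recursion, and the function names and example parameters are hypothetical.

```python
def poisson_binomial_pmf(params):
    """PMF of a sum of independent Bernoulli variables.

    params holds the success probabilities; the result is a list whose
    z-th entry is the probability that exactly z variables equal 1.
    """
    pmf = [1.0]  # distribution of the empty sum
    for q in params:
        nxt = [0.0] * (len(pmf) + 1)
        for z, mass in enumerate(pmf):
            nxt[z] += mass * (1.0 - q)  # this variable equals 0
            nxt[z + 1] += mass * q      # this variable equals 1
        pmf = nxt
    return pmf


def mean_and_variance(pmf):
    """Mean and variance computed directly from a PMF on 0, 1, 2, ..."""
    mean = sum(z * m for z, m in enumerate(pmf))
    var = sum((z - mean) ** 2 * m for z, m in enumerate(pmf))
    return mean, var


if __name__ == "__main__":
    params = [0.2, 0.5, 0.9]  # hypothetical Bernoulli parameters
    pmf = poisson_binomial_pmf(params)
    print(pmf, mean_and_variance(pmf))
```

The recursion costs on the order of the square of the number of Bernoulli variables, whereas a direct sum over subsets grows exponentially. When all parameters are equal, the result reduces to the ordinary Binomial probability mass function.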

Finally, putting together equations (31), (33), and the fact that is conditionally given and provides the following characterization of the conditional probability mass function of given .

###### Theorem 3.

For , given , . For , given , is a mixture of Poisson Binomial distributions with probability mass function: