 # The Sample Complexity of Search over Multiple Populations

This paper studies the sample complexity of searching over multiple populations. We consider a large number of populations, each corresponding to either distribution P0 or P1. The goal of the search problem studied here is to find one population corresponding to distribution P1 with as few samples as possible. The main contribution is to quantify the number of samples needed to correctly find one such population. We consider two general approaches: non-adaptive sampling methods, which sample each population a predetermined number of times until a population following P1 is found, and adaptive sampling methods, which employ sequential sampling schemes for each population. We first derive a lower bound on the number of samples required by any sampling scheme. We then consider an adaptive procedure consisting of a series of sequential probability ratio tests, and show it comes within a constant factor of the lower bound. We give explicit expressions for this constant when samples of the populations follow Gaussian and Bernoulli distributions. An alternative adaptive scheme is discussed which does not require full knowledge of P1, and comes within a constant factor of the optimal scheme. For comparison, a lower bound on the sampling requirements of any non-adaptive scheme is presented.


## I Introduction

This paper studies the sample complexity of finding a population corresponding to some distribution $P_1$ among a large number of populations corresponding to either distribution $P_0$ or $P_1$. More specifically, let $i = 1, 2, \dots$ index the populations. Samples of each population follow one of two distributions, indicated by a binary label $X_i$: if $X_i = 0$, then samples of population $i$ follow distribution $P_0$; if $X_i = 1$, then samples follow distribution $P_1$. We assume that $X_1, X_2, \dots$ are independently and identically distributed (i.i.d.) Bernoulli random variables with $P(X_i = 1) = \pi$ and $P(X_i = 0) = 1 - \pi$. Distribution $P_1$ is termed the atypical distribution, which corresponds to atypical populations, and the probability $\pi$ quantifies the occurrence of such populations. The goal of the search problem studied here is to find an atypical population with as few samples as possible.

In this search problem, populations are sampled a (deterministic or random) number of times, in sequence, until an atypical population is found. The total number of samples needed is a function of the sampling strategy, the distributions, the required reliability, and $\pi$. To build intuition, consider the following. As the occurrence of the atypical populations becomes infrequent (i.e., as $\pi \to 0$), the number of samples required to find one such population must, of course, increase. If $P_0$ and $P_1$ are extremely different (e.g., non-overlapping supports), then a search procedure could simply proceed by taking one sample of each population until an atypical population was found. The procedure would identify an atypical population with, on average, $1/\pi$ samples. More generally, when the two distributions are more difficult to distinguish, as is the concern of this paper, we must take multiple samples of some populations. As the required reliability of the search increases, a procedure must also take more samples to confirm, with increasing certainty, that an atypical population has been found.

The main contribution of this work is to quantify the number of samples needed to correctly find one atypical population. Specifically, we provide matching upper and lower bounds (to within a constant factor) on the expected number of samples required to find a population corresponding to $P_1$ with a specified level of certainty. We pay additional attention to this sample complexity as $\pi$ becomes small (and the occurrence of the atypical populations becomes rare). We consider two general approaches to find an atypical population, both of which sample populations in sequence. Non-adaptive procedures sample each population a predetermined number of times, make a decision, and, if the null hypothesis is accepted, move on to the next population. Adaptive methods, in contrast, enjoy the flexibility to sample each population sequentially, and thus the decision to continue sampling a particular population can be based on prior samples.

The developments in this paper proceed as follows. First, using techniques from sequential analysis, we derive a lower bound on the expected number of samples needed to reliably identify an atypical population. To preview the results, the lower bound implies that any procedure (adaptive or non-adaptive) is unreliable if it uses fewer than $\pi^{-1}\max\left(1, 1/D(P_0\|P_1)\right)$ samples on average, where $D(P_0\|P_1)$ is the Kullback-Leibler divergence. We then prove this is tight by showing that a series of sequential probability ratio tests (which we abbreviate as an S-SPRT) succeeds with high probability if the total number of samples is within a constant factor of the lower bound, provided a minor constraint on the log-likelihood statistic is satisfied (which holds for bounded, Gaussian, and exponential distributions, among others). We give explicit expressions for this constant in the Gaussian and Bernoulli cases. In the Bernoulli case, the bound derived by instantiating our general results produces the tightest known bound.

In many real world problems, insufficient knowledge of the distributions $P_0$ and $P_1$ makes implementing an S-SPRT impractical. To address this shortcoming, we propose a more practical adaptive procedure known as sequential thresholding, which does not require precise knowledge of $P_1$, and is particularly well suited for problems in which occurrence of an atypical population is rare. We show sequential thresholding is within a constant factor of optimal in terms of dependence on the problem parameters as $\pi \to 0$. Both the S-SPRT procedure and sequential thresholding are shown to be robust to imperfect knowledge of $\pi$. Lastly, we show that non-adaptive procedures require on the order of at least $\pi^{-1}\log \pi^{-1}$ samples to reliably find an atypical population, a factor of $\log \pi^{-1}$ more samples when compared to adaptive methods.

### I-A Motivating Applications

Finding an atypical population arises in many relevant problems in science and engineering. One of the main motivations for our work is the problem of spectrum sensing in cognitive radio. In cognitive radio applications, one is interested in finding a vacant radio channel among a potentially large number of occupied channels. Only once a vacant channel is identified can the cognitive device transmit, and thus, identifying a vacant channel as quickly as possible is of great interest. A number of works have looked at various adaptive methods for spectrum sensing in similar contexts, including [2, 3, 4].

Another captivating example is the Search for Extraterrestrial Intelligence (SETI) project. Researchers at the SETI Institute use large antenna arrays to sense for narrowband electromagnetic energy from distant star systems, with the hopes of finding extraterrestrial intelligence with technology similar to ours. The search space consists of a virtually unlimited number of stars, over 100 billion in the Milky Way alone, each with 9 million potential “frequencies” in which to sense for narrowband energy. The prior probability of extraterrestrial transmission is indeed very small (SETI has yet to make a contact), and thus occurrence of atypical populations is rare. Roughly speaking, SETI employs a variable sample size search procedure that repeatedly tests energy levels against a threshold up to five times [5, 6]. If any of the measurements are below the threshold, the procedure immediately passes to the next frequency/star. This procedure is closely related to sequential thresholding. Sequential thresholding results in substantial gains over fixed sample size procedures and, unlike the SPRT, it can be implemented without perfect knowledge of $P_1$.

### I-B Related Work

The prior work most closely related to the problem investigated here is that by Lai, Poor, Xin, and Georgiadis, in which the authors also examine the problem of quickest search across multiple populations, but do not focus on quantifying the sample complexity. The authors show that the S-SPRT (also termed a CUSUM test) minimizes a linear combination of the expected number of samples and the error probability. Complementary to this, our contributions include providing tight lower bounds on the expected number of samples required to achieve a desired probability of error, and then showing the sample complexity of the S-SPRT comes within a constant of this bound. This quantifies how the number of samples required to find an atypical population depends on the distributions $P_0$ and $P_1$ and the probability $\pi$, which was not explicitly investigated in that work. As a by-product, this proves the optimality of the S-SPRT.

An instance of the quickest search problem was also studied recently in [9], where the authors investigate the problem of finding a biased coin with the fewest flips. Our more general results are derived using different techniques, and cover this case with $P_0$ and $P_1$ as Bernoulli distributions. In [9], the authors present a bound on the expected number of flips needed to find a biased coin. The bound derived from instantiating our more general theory (see Example 2 and Corollary 4) is a minimum of 32 times tighter than the bound in [9].

Also closely related are the problem of sparse signal support recovery from point-wise observations [7, 10, 11, 12], classical work in optimal scanning theory [13, 14], and work on pure exploration in multi-armed bandit problems [15, 16]. The sparse signal recovery problems differ in that the total number of populations is finite, and the objective is to recover all (or most) populations following $P_1$, as opposed to finding a single population and terminating the procedure. Traditional multi-armed bandit problems differ in that no knowledge of the distributions of the arms is assumed.

## II Problem Setup

Consider an infinite number of populations indexed by $i = 1, 2, \dots$. For each population $i$, samples of that population are distributed either

$$Y_{i,j} \overset{\text{iid}}{\sim} P_0 \ \text{ if } X_i = 0 \qquad \text{or} \qquad Y_{i,j} \overset{\text{iid}}{\sim} P_1 \ \text{ if } X_i = 1$$

where $P_0$ and $P_1$ are probability measures supported on a common set $\mathcal{Y}$, $j$ indexes multiple i.i.d. samples of a particular population, and $X_i$ is a binary label. The goal is to find a population $i$ such that $X_i = 1$ as quickly and reliably as possible. The prior probability of a particular population following $P_1$ or $P_0$ is i.i.d., and denoted

$$P(X_i = 1) = \pi, \qquad P(X_i = 0) = 1 - \pi$$

where we assume $\pi \le 1/2$ without loss of generality.

A testing procedure samples a subset of the populations and returns a single index, denoted $I$. The performance of any testing procedure is characterized by two metrics: 1) the expected number of samples required for the procedure to terminate, denoted $E[N]$, and 2) the probability of error, defined as

$$P_e := P\left(I \in \{i : X_i = 0\}\right).$$

In words, $P_e$ is the probability a procedure returns an index that does not correspond to a population following $P_1$.

In order to simplify analysis, we make two assumptions on the form of the procedures under consideration.

Assumption 1. Search procedures follow the framework of Fig. 1. Specifically, a procedure starts at population $i = 1$. Until termination, a procedure then 1) takes a sample of population $i$, or 2) moves to index $i + 1$, or 3) terminates, declaring population $i$ as following distribution $P_1$ (deciding $\hat{X}_i = 1$).

Assumption 1 implies procedures do not revisit populations. It can be argued that this restricted setting has no loss of optimality when $P_0$, $P_1$, and $\pi$ are known; in this Bayesian setting, the posterior probability of population $i$ depends only on samples of that index. This posterior reflects the probability of error if the procedure were to terminate and estimate $\hat{X}_i = 1$. Since this probability is not affected by samples of other indices, for any procedure that returns to re-measure a population, there is a procedure requiring fewer samples with the same probability of error that either did not return to index $i$, or did not move away from index $i$ in the first place. Note that the prior work of Lai et al. makes the same assumption.

The second assumption we make is on the invariance of the procedure across indices. To be more specific, imagine that a procedure is currently sampling index $i$. For a given sampling procedure, if $X_i = 1$, the probability the procedure passes to index $i + 1$ without terminating is denoted $\beta$, and the probability the procedure correctly declares $\hat{X}_i = 1$ is $1 - \beta$. Likewise, for any $i$ such that $X_i = 0$, the procedure falsely declares $\hat{X}_i = 1$ with probability $\alpha$, and continues to index $i + 1$ with probability $1 - \alpha$.

Assumption 2. $\alpha$ and $\beta$ are invariant as the procedure progresses; i.e., they are not functions of the index under consideration.

Under Assumption 2, provided the procedure arrives at population $i$, we can write

$$\beta = P(\hat{X}_i = 0 \mid X_i = 1), \qquad \alpha = P(\hat{X}_i = 1 \mid X_i = 0).$$

Note that this restriction has no loss of optimality, as the known optimal procedure has this form. Restricted to the above framework, a procedure consists of a number of simple binary hypothesis tests, each with false positive probability $\alpha$ and false negative probability $\beta$. While any pair $\alpha$ and $\beta$ do parameterize the procedure, our goal is to develop universal bounds in terms of the underlying problem parameters, $P_0$, $P_1$, and $\pi$.

Assumptions 1 and 2 allow for the following recursive relationships, which will be central to our performance analysis. Let $N_i$ be the (random) number of samples taken of population $i$, and $N$ be the total number of samples taken by the procedure. We can write the expected number of samples as

$$E[N] = E[N_1] + \big((1-\pi)(1-\alpha) + \pi\beta\big)\, E\big[N - N_1 \mid \hat{X}_1 = 0\big] \tag{1}$$

where $(1-\pi)(1-\alpha) + \pi\beta$ is the probability the procedure arrives at the second index. The expected number of samples used from the second index onwards, given that the procedure arrives at the second index (without declaring $\hat{X}_1 = 1$), is simply equal to the expected total number of samples: $E[N - N_1 \mid \hat{X}_1 = 0] = E[N]$. Rearranging terms in (1) gives the following relationship:

$$E[N] = \frac{E[N_1]}{\alpha(1-\pi) + \pi(1-\beta)}. \tag{2}$$

In the same manner we arrive at the following expression for the probability of error:

$$P_e = \frac{\alpha(1-\pi)}{\alpha(1-\pi) + \pi(1-\beta)} = \frac{1}{1 + \frac{\pi(1-\beta)}{\alpha(1-\pi)}}. \tag{3}$$

From this expression we see that if

$$\frac{\alpha(1-\pi)}{\pi(1-\beta)} \ge \delta$$

for some $\delta > 0$, then $P_e \ge \frac{\delta}{1+\delta}$, and $P_e$ is greater than or equal to some positive constant.
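The two recursions are easy to evaluate numerically. The following is a minimal sketch, assuming the per-population quantities $E[N_1]$, $\alpha$, $\beta$ and the prior $\pi$ are given; the function name `search_metrics` is hypothetical and not part of the paper.

```python
def search_metrics(expected_n1, alpha, beta, pi):
    """Evaluate E[N] and P_e from the per-population test quantities.

    expected_n1: expected samples spent on one population, E[N_1]
    alpha:       false positive probability of the per-population test
    beta:        false negative probability of the per-population test
    pi:          prior probability that a population is atypical
    """
    # Probability that the procedure terminates on a given population.
    stop_prob = alpha * (1.0 - pi) + pi * (1.0 - beta)
    if stop_prob <= 0.0:
        raise ValueError("procedure never terminates: stop probability is zero")
    expected_n = expected_n1 / stop_prob          # relationship (2)
    p_error = alpha * (1.0 - pi) / stop_prob      # relationship (3)
    return expected_n, p_error


if __name__ == "__main__":
    # Perfect one-sample tests: the search costs 1/pi samples on average.
    print(search_metrics(1.0, 0.0, 0.0, 0.01))
```

With error-free one-sample tests the sketch reduces to the $1/\pi$ intuition from the introduction, which is a quick sanity check on the recursion.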

Lastly, the bounds derived throughout often depend on explicit constants, in particular the Kullback-Leibler divergence between the distributions, defined in the usual manner:

$$D(P_1\|P_0) := E_1[L(Y)], \qquad D(P_0\|P_1) := E_0[-L(Y)]$$

where

$$L(Y) := \log\frac{P_1(Y)}{P_0(Y)}$$

is the log-likelihood ratio. Other constants are denoted by $C_1$, $C_2$, etc., and represent distinct numerical constants which may depend on $P_0$ and $P_1$.

## III Lower Bound for Any Procedure

We begin with a lower bound on the number of samples required by any procedure to find a population following distribution $P_1$. The main theorem of the section is presented first, followed by two corollaries aimed at highlighting the relationship between the problem parameters.

###### Theorem 1.

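Any procedure with

$$P_e \le \frac{\delta}{1+\delta}$$

also has

$$E[N] \ge \frac{1-\pi}{\pi}\,\frac{(1-\delta)^2}{1+\delta}\,\max\left(1, \frac{1}{D(P_0\|P_1)}\right) + \frac{\log\left(\frac{1}{2\pi\delta}\right)}{D(P_1\|P_0)}\left(\frac{1 - \delta\frac{D(P_1\|P_0)}{D(P_0\|P_1)}}{1+\delta}\right) - \frac{1}{D(P_1\|P_0)}$$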

for any .

###### Proof.

See Appendix A. ∎
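To get a feel for the relative size of the three terms in Theorem 1, the bound can be evaluated directly. The sketch below is illustrative only: the function name and argument names (`d01` for $D(P_0\|P_1)$, `d10` for $D(P_1\|P_0)$) are assumptions, and the example divergence value $0.5$ corresponds to two unit-variance Gaussians whose means differ by one, which is a choice made here rather than taken from the paper.

```python
import math


def theorem1_lower_bound(pi, delta, d01, d10):
    """Return (search, confirm, offset, total) for the Theorem 1 bound.

    pi:    prior probability of an atypical population, 0 < pi < 1
    delta: error parameter, delta > 0 (P_e <= delta / (1 + delta))
    d01:   Kullback-Leibler divergence D(P0||P1) > 0
    d10:   Kullback-Leibler divergence D(P1||P0) > 0
    """
    if not (0.0 < pi < 1.0) or delta <= 0.0 or d01 <= 0.0 or d10 <= 0.0:
        raise ValueError("need 0 < pi < 1, delta > 0 and positive divergences")
    # First term: cost of finding an atypical population.
    search = (1.0 - pi) / pi * (1.0 - delta) ** 2 / (1.0 + delta) * max(1.0, 1.0 / d01)
    # Second term: cost of confirming that the population is atypical.
    confirm = (math.log(1.0 / (2.0 * pi * delta)) / d10
               * (1.0 - delta * d10 / d01) / (1.0 + delta))
    # Third term: constant offset.
    offset = 1.0 / d10
    return search, confirm, offset, search + confirm - offset


if __name__ == "__main__":
    for pi in (1e-1, 1e-3, 1e-5):
        search, confirm, offset, total = theorem1_lower_bound(pi, 0.1, 0.5, 0.5)
        print(f"pi={pi:g}: search={search:.1f} confirm={confirm:.1f} total={total:.1f}")
```

Running the loop shows the first term growing like $1/\pi$ while the second grows only logarithmically, which is the behaviour the corollaries make precise.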

Theorem 1 lower bounds the expected number of samples required by any procedure to achieve a desired performance, and is comprised of two terms with dependence on $\pi$ and the error probability, and a constant offset. To help emphasize this dependence on the problem parameters, we present the following corollary.

###### Corollary 1.

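Any procedure with

$$P_e \le \frac{\delta}{1+\delta}$$

also has

$$E[N] \ge \frac{1}{D(P_0\|P_1)}\left(\frac{1}{12\pi} + \frac{1}{3}\log\left(\frac{1}{2\pi\delta}\right) - 1\right) \tag{5}$$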

for any . Here, we assume for simplicity of presentation.

Proof of Corollary 1 follows immediately from Theorem 1, as and .

Corollary 1 provides a particularly intuitive way to quantify the number of samples required for the quickest search problem. The first term in (5), which has a $\pi^{-1}$ dependence, can be interpreted as the minimum number of samples required to find a population following distribution $P_1$. The second term, which has a $\log\frac{1}{\pi\delta}$ dependence, is best interpreted as the minimum number of samples required to confirm that a population following $P_1$ has been found.

When the populations following distribution $P_1$ become rare (when $\pi$ tends to zero), the second and third terms in (5) become small compared to the first term. This suggests the vast majority of samples are used to find a rare population, and a vanishing proportion are needed for confirmation. The corollary below captures this effect. The leading constants are of particular importance, as we relate them to upper bounds in Sec. IV. In the following, consider $E[N]$ and $P_e$ as functions of $\pi$, $P_0$, $P_1$, and some sampling procedure.

###### Corollary 2.

Rare population. Fix . Then any procedure that satisfies

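$$\limsup_{\pi \to 0} P_e \le \frac{\delta}{1+\delta}$$

also has

$$\liminf_{\pi \to 0} \pi E[N] \ge \frac{(1-\delta)^2}{1+\delta}\max\left(\frac{1}{D(P_0\|P_1)}, 1\right).$$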

The proof of Corollary 2 follows from Theorem 1 by noting that both the second and third terms of the bound in Theorem 1 are overwhelmed by the first as $\pi$ becomes small.

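Corollary 2 is best interpreted in two regimes: (1) the high SNR regime, when $D(P_0\|P_1) \ge 1$, and (2) the low SNR regime, when $D(P_0\|P_1) < 1$. In the high SNR regime, any procedure with $\limsup_{\pi \to 0} P_e \le \frac{\delta}{1+\delta}$ also has $\liminf_{\pi \to 0} \pi E[N] \ge \frac{(1-\delta)^2}{1+\delta}$. This simply implies that any procedure requiring fewer samples in expectation than a constant multiple of $1/\pi$ also has probability of error bounded away from zero. The bound becomes tight when the SNR becomes high: when $D(P_0\|P_1)$ is sufficiently large, we expect to classify each population with one sample. In the low SNR regime, any procedure with $\limsup_{\pi \to 0} P_e \le \frac{\delta}{1+\delta}$ also has $\liminf_{\pi \to 0} \pi E[N] \ge \frac{(1-\delta)^2}{(1+\delta)\,D(P_0\|P_1)}$. In the low SNR regime the sampling requirements are at best an additional factor of $1/D(P_0\|P_1)$ higher than when we can classify each distribution with one sample.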

## IV S-SPRT Procedure

The Sequential Probability Ratio Test (SPRT), optimal for simple binary hypothesis tests in terms of minimizing the expected number of samples for tests of a given power and significance, can be applied to the problem studied here by implementing a series of SPRTs on the individual populations. For notational convenience, we refer to this procedure as the S-SPRT. This is equivalent in form to the CUSUM test studied by Lai et al., which is traditionally applied to change point detection problems.

The S-SPRT operates as follows. Imagine the procedure has currently taken $j$ samples of population $i$. The procedure continues to sample population $i$ provided

$$\gamma_L < \Lambda_{i,j} < \gamma_U \tag{6}$$

where $\Lambda_{i,j} := \prod_{k=1}^{j} \frac{P_1(Y_{i,k})}{P_0(Y_{i,k})}$ is the likelihood ratio statistic, and $\gamma_U$ and $\gamma_L$ are scalar upper and lower thresholds. In words, the procedure continues to sample population $i$ provided the likelihood ratio comprised of samples of that population is between two scalar thresholds. The S-SPRT stops sampling population $i$ after $N_i$ samples, which is a random integer representing the smallest number of samples such that (6) no longer holds:

$$N_i := \min\left\{ j : \Lambda_{i,j} \le \gamma_L \ \text{ or } \ \Lambda_{i,j} \ge \gamma_U \right\}.$$

When the likelihood ratio exceeds (or equals) $\gamma_U$, then $\hat{X}_i = 1$, and the S-SPRT terminates, returning $I = i$. Conversely, if the likelihood ratio falls below (or equals) $\gamma_L$, then $\hat{X}_i = 0$, and the procedure moves to index $i + 1$. The procedure is detailed in Algorithm 1.
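The steps above can be sketched as a short simulation. This is a minimal illustration rather than the paper's Algorithm 1: it assumes, purely for concreteness, that $P_0$ is a unit-variance Gaussian with mean zero and $P_1$ a unit-variance Gaussian with mean `mu`, it works with the logarithm of the likelihood ratio, and the function name `s_sprt` and its return convention are hypothetical.

```python
import math
import random


def s_sprt(pi, gamma_l, gamma_u, mu, rng):
    """Run one S-SPRT search; return (index, true_label, total_samples).

    Assumed model (not from the paper): P0 = N(0, 1), P1 = N(mu, 1).
    Labels are drawn i.i.d. Bernoulli(pi) as each population is visited.
    """
    if not (0.0 < gamma_l < 1.0 < gamma_u):
        raise ValueError("need 0 < gamma_l < 1 < gamma_u")
    log_l, log_u = math.log(gamma_l), math.log(gamma_u)
    index, total = 0, 0
    while True:
        index += 1
        label = 1 if rng.random() < pi else 0      # hidden label X_i
        log_lr = 0.0                               # log-likelihood ratio before any sample
        while log_l < log_lr < log_u:              # continue-sampling region (6)
            y = rng.gauss(mu * label, 1.0)
            log_lr += mu * y - 0.5 * mu * mu       # Gaussian log-likelihood ratio increment
            total += 1
        if log_lr >= log_u:                        # declare population `index` atypical
            return index, label, total
        # otherwise the lower threshold was crossed: move to the next population


if __name__ == "__main__":
    rng = random.Random(1)
    runs = [s_sprt(0.05, 0.5, 400.0, 1.5, rng) for _ in range(500)]
    errors = sum(1 for _, label, _ in runs if label == 0)
    mean_samples = sum(total for _, _, total in runs) / len(runs)
    print(f"empirical error rate {errors / len(runs):.3f}, mean samples {mean_samples:.1f}")
```

Working in the log domain avoids overflow of the running product, and requiring `gamma_l < 1 < gamma_u` guarantees that every visited population receives at least one sample.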

The S-SPRT procedure studied by Lai et al. fixes the lower threshold in each individual SPRT at $\gamma_L = 1$ (and hence terms the procedure a CUSUM test). This has a very intuitive interpretation; since there are an infinite number of populations, anytime a sample suggests that a particular population does not follow $P_1$, moving to another population is best. While this approach is optimal, we use a strictly smaller threshold, as it results in a simpler derivation of the upper bound.

In the following theorem and corollary we assume a minor restriction on the tail distribution of the log-likelihood ratio test statistic. Specifically, recall that $L(Y)$ is the log-likelihood statistic. We require that

$$\max_{r \ge 0} E[L - r \mid L \ge r] < \infty \tag{7}$$

and

$$\min_{r \ge 0} E[L + r \mid L \le -r] > -\infty. \tag{8}$$

This condition is satisfied when $L(Y)$ follows any bounded distribution, a Gaussian distribution, or an exponential distribution, among others. It is not satisfied by distributions with infinite variance or polynomial tails.

###### Theorem 2.

The S-SPRT with $\gamma_L \in (0, 1)$ and $\gamma_U = \frac{1-\pi}{\pi\delta}$ satisfies

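$$P_e \le \frac{\delta}{1+\delta}$$

and

$$E[N] \le \frac{C_1}{\pi} + \frac{\log\frac{1}{\pi\delta}}{D(P_1\|P_0)} + C_2 \tag{9}$$

for some constants $C_1$ and $C_2$ independent of $\pi$ and $\delta$.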

###### Proof.

The full proof is given in Appendix B. ∎

The main argument of the proof of Theorem 2 follows from standard techniques in sequential analysis. The constant $C_1$ is a function of the underlying distributions and is given by

$$C_1 = \frac{C_1' + \log \gamma_L^{-1}}{(1-\gamma_L)\,D(P_0\|P_1)} \tag{10}$$

where $C_1'$ is a bound on the overshoot in the log-likelihood ratio when it falls outside $\gamma_L$ or $\gamma_U$. $C_1'$, and thus $C_1$, can be explicitly calculated depending on the underlying distributions in a number of cases (see Examples 1 and 2, and [19, page 145]).

###### Corollary 3.

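Rare population. Fix $\delta > 0$. The S-SPRT with any $\gamma_L \in (0, 1)$ and $\gamma_U = \frac{1-\pi}{\pi\delta}$ satisfies $P_e \le \frac{\delta}{1+\delta}$ and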

$$\lim_{\pi \to 0} \pi E[N] \le C_1$$

for some constant $C_1$ independent of $\pi$ and $\delta$.

The proof of Corollary 3 is an immediate consequence of Theorem 2. Note that , since we assume . As the atypical populations become rare, sampling is dominated by finding an atypical population, which is order . The constant factor of is the multiplicative increase in the number of samples required when the problem becomes noisy.

Remark 1. The S-SPRT procedure is fairly insensitive to our knowledge of the true prior probability . On one hand, if we overestimate by using a larger to specify the upper threshold , then according to (25) the probability of error increases and is approximately , while the order of remains the same. On the other hand, if our underestimates , then the probability of error is reduced by a factor of , and the order of also remains the same, provided , i.e., is not exponentially smaller than . As a consequence, it is sensible to underestimate , rather than overestimate as the latter would increase the probability of error.

Remark 2. Implementing a sequential probability ratio test on each population can be challenging for many practical problems. While the S-SPRT is optimal when both $P_0$ and $P_1$ are known and testing a single population amounts to a simple binary hypothesis test, scenarios often arise where some parameter of distribution $P_1$ is unknown. Since the SPRT is based on exact knowledge of $P_1$, it cannot be implemented in this case. A simple example where and , for some unknown , illustrates this issue. Many alternatives to the SPRT have been proposed for composite hypotheses (see [20, 21], etc.). In the next section we propose an alternative that is near optimal and also simple to implement.

## V Sequential Thresholding

Sequential thresholding, first proposed for sparse recovery problems, can be applied to the search for an atypical population, and admits a number of appealing properties. It is particularly well suited for problems in which the atypical distributions are rare. Sequential thresholding does not require full knowledge of the distributions, specifically $P_1$, as required by the S-SPRT (see Remarks 2 and 4). Moreover, the procedure admits a general error analysis, and perhaps most importantly is very simple to implement (a similar procedure is used in the SETI project [5, 6]). The procedure can substantially outperform non-adaptive procedures as $\pi$ becomes small. Roughly speaking, for small values of $\pi$, the procedure reliably recovers an atypical population with

$$E[N] \lesssim \frac{C}{\pi}$$

for some constant $C$ independent of $\pi$.

Sequential thresholding requires one input: $K$, an integer representing the maximum number of rounds for any particular index. Let $T_k$ represent a sufficient statistic for the likelihood ratio that does not depend on the parameters of $P_0$ or $P_1$ (for example, when $P_0$ and $P_1$ are Gaussian with different means, $T_k$ is the sum of the samples taken in round $k$).

The procedure searches for an atypical population as follows. Starting on population $i = 1$, the procedure takes one sample. If the sufficient statistic comprised of that sample is greater than the threshold, i.e., $T_1 > \gamma_1$, the procedure takes two additional samples of index $i$ and forms $T_2$ (which is only a function of the second and third samples). If $T_2 > \gamma_2$, three more samples are taken, and $T_3$ is tested against a threshold. The procedure continues in this manner, taking $k$ samples on round $k$, and testing the statistic up to a maximum of $K$ times. If the statistic is below the threshold, i.e., $T_k < \gamma_k$, on any round, the procedure immediately moves to the next population, setting $i = i + 1$, and resetting $k = 1$. Should any population survive all $K$ rounds, the procedure estimates $\hat{X}_i = 1$, and terminates. The procedure is detailed in Algorithm 2.

Control of the probability of error depends on the series of thresholds $\gamma_k$ and the number of rounds $K$. For our analysis the thresholds are set so as to satisfy

$$P_0(T_k > \gamma_k) = \frac{1}{2}.$$

In practice, the thresholds can be set in any way such that the test statistic under $P_0$ falls below the threshold with fixed non-zero probability.
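The round structure described above can be sketched in a few lines. This is an illustrative simulation, not the paper's Algorithm 2: it assumes a Gaussian model with $P_0$ equal to a zero-mean, unit-variance Gaussian and $P_1$ a unit-variance Gaussian with positive mean `mu`, in which case the sum of the $k$ samples in round $k$ has median zero under $P_0$ and every threshold can be taken as $\gamma_k = 0$. The function name and the choice to pass the number of rounds as a free parameter are assumptions.

```python
import random


def sequential_thresholding(pi, num_rounds, mu, rng):
    """Run one sequential-thresholding search; return (index, true_label, total_samples).

    Assumed model (not from the paper): P0 = N(0, 1), P1 = N(mu, 1) with mu > 0.
    Round k draws k fresh samples and compares their sum T_k with gamma_k = 0,
    so that a typical population survives each round with probability 1/2.
    """
    if num_rounds < 1:
        raise ValueError("num_rounds must be at least 1")
    index, total = 0, 0
    while True:
        index += 1
        label = 1 if rng.random() < pi else 0          # hidden label X_i
        survived = True
        for k in range(1, num_rounds + 1):
            # T_k uses only the k samples taken in this round (no aggregation).
            t_k = sum(rng.gauss(mu * label, 1.0) for _ in range(k))
            total += k
            if t_k <= 0.0:                             # statistic fell below the threshold
                survived = False
                break
        if survived:                                   # survived all rounds: declare atypical
            return index, label, total


if __name__ == "__main__":
    rng = random.Random(2)
    runs = [sequential_thresholding(0.05, 10, 1.5, rng) for _ in range(500)]
    errors = sum(1 for _, label, _ in runs if label == 0)
    mean_samples = sum(total for _, _, total in runs) / len(runs)
    print(f"empirical error rate {errors / len(runs):.3f}, mean samples {mean_samples:.1f}")
```

Note that the only model knowledge used by the test is the median of the statistic under $P_0$; the mean `mu` appears only in the data generator.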

Intuitively, the procedure controls the probability of error as follows. First, $\alpha$ can be made small by increasing $K$; as each round is independent, $\alpha = 2^{-K}$. Of course, as $K$ is increased, $\beta$ also increases. Fortunately, as $K$ grows, it can be shown that $\beta$ is strictly less than one (provided the Kullback-Leibler divergence between the distributions is non-zero). The following theorem quantifies the number of samples required to guarantee recovery of an index following $P_1$ as $\pi$ grows small.

###### Theorem 3.

Sequential Thresholding. Sequential thresholding with satisfies

$$\lim_{\pi \to 0} P_e = 0$$

and

$$\lim_{\pi \to 0} \pi E[N] \le C$$

for some constant $C$ independent of $\pi$.

###### Proof.

See Appendix C. ∎

Remark 3. Similar to the behavior of the SPRT discussed in Remark 1, sequential thresholding is also fairly insensitive to our prior knowledge of $\pi$, especially when we underestimate $\pi$. More specifically, overestimating $\pi$ increases the probability of error almost proportionally and has nearly no effect on $E[N]$, while underestimating $\pi$ decreases the probability of error and the order of $E[N]$ is the same as long as .

Remark 4. For many distributions in the exponential family, the log-likelihood ratio, $L(Y)$, is a monotonic function of a test statistic $T$ that does not depend on parameters of $P_1$. As a consequence of the sufficiency of $T$, the thresholds $\gamma_k$, $k = 1, \dots, K$, depend only on $P_0$, making sequential thresholding suitable when knowledge about $P_1$ is not available.

Perhaps most notably, in contrast to the SPRT-based procedure, sequential thresholding does not aggregate statistics. Roughly speaking, this results in increased robustness to modeling errors in $P_1$ at the cost of a sub-optimal procedure. Analysis of sequential thresholding in related sparse recovery problems can be found in [7, 11].

## VI Limitations of Non-Adaptive Procedures

For our purposes a non-adaptive procedure tests each individual population with a pre-determined number of samples, denoted $N_0$. In this case, the conditional number of samples for each individual test is simply $N_0$, giving

$$E[N] = \frac{N_0}{\alpha(1-\pi) + \pi(1-\beta)}. \tag{11}$$

To compare the sampling requirements of non-adaptive procedures to adaptive procedures, we present a necessary condition for reliable recovery. The theorem implies that non-adaptive procedures require a factor of $\log \pi^{-1}$ more samples than the best adaptive procedures.

###### Theorem 4.

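Any non-adaptive procedure with

$$P_e \le \frac{\delta}{1+\delta}$$

also has

$$E[N] \ge \frac{\log\left(\frac{1}{2\delta\pi}\right) - 1}{\pi(1+\delta)\,D(P_1\|P_0)}$$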

for .

###### Proof.

See Appendix D. ∎

Remark 5. The lower bound presented in Theorem 4 implies that non-adaptive procedures require at best a multiplicative factor of $\log \pi^{-1}$ more samples than adaptive procedures (as adaptive procedures are able to come within a constant factor of the lower bound in Theorem 1). For problems with even modestly small values of $\pi$, this can result in non-adaptive sampling requirements many times larger than those required by adaptive sampling procedures.
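The logarithmic gap can be made concrete by evaluating the Theorem 4 bound scaled by $\pi$. The sketch below reads the bound as $\big(\log\frac{1}{2\delta\pi} - 1\big) / \big(\pi(1+\delta)D(P_1\|P_0)\big)$; the function name and the example divergence value are assumptions chosen for illustration.

```python
import math


def nonadaptive_lower_bound(pi, delta, d10):
    """Evaluate the Theorem 4 lower bound on E[N] for non-adaptive procedures.

    pi:    prior probability of an atypical population, 0 < pi < 1
    delta: error parameter, delta > 0
    d10:   Kullback-Leibler divergence D(P1||P0) > 0
    """
    if not (0.0 < pi < 1.0) or delta <= 0.0 or d10 <= 0.0:
        raise ValueError("need 0 < pi < 1, delta > 0 and d10 > 0")
    return (math.log(1.0 / (2.0 * delta * pi)) - 1.0) / (pi * (1.0 + delta) * d10)


if __name__ == "__main__":
    # pi * E[N] stays bounded for good adaptive procedures, but the scaled
    # non-adaptive bound keeps growing like log(1/pi).
    for pi in (1e-2, 1e-4, 1e-6):
        scaled = pi * nonadaptive_lower_bound(pi, 0.1, 0.5)
        print(f"pi={pi:g}: pi * bound = {scaled:.2f}")
```

Each hundredfold decrease in $\pi$ adds the same increment to the scaled bound, which is the $\log \pi^{-1}$ penalty described in the remark.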

## VII Examples and Numerical Results

Example 1. Searching for a Gaussian with positive mean. Consider searching for a population following amongst a number of populations following for some . The Kullback-Leibler divergence between the Gaussian distributions is .

Focusing on the S-SPRT (Alg. 1) and Theorem 2, we have an explicit expression for $C_1'$ (defined in (10)) based on the overshoot of the likelihood ratio [19, page 145]:

$$C_1'(\mu) = 2\mu\left(\mu + \frac{e^{-\mu^2/2}}{\int_{-\mu}^{\infty} e^{-t^2/2}\,dt}\right). \tag{12}$$

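In order to make our bound on $E[N]$ as tight as possible, we would like to minimize $C_1$ from (10) with respect to $\gamma_L$. Since the minimizer has no closed form expression, we use the sub-optimal value $\gamma_L = 1/\mu$ for $\mu > 1$, and $\gamma_L = 1 - \sqrt{\mu}$ for $\mu < 1$. For this choice of $\gamma_L$, the constant in Theorem 2 and (10) is

$$C_1(\mu) = \begin{cases} \dfrac{C_1'(\mu) + \log(\mu)}{(1 - 1/\mu)\,D(P_0\|P_1)} & \text{if } \mu > 1 \\[2ex] \dfrac{C_1'(\mu) + \log\left((1-\sqrt{\mu})^{-1}\right)}{\sqrt{\mu}\,D(P_0\|P_1)} & \text{if } \mu < 1. \end{cases}$$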

Consider the following two limits. First, as $\mu \to \infty$,

$$\lim_{\mu \to \infty} C_1(\mu) = 1.$$

As a consequence (from Corollary 3),

$$\lim_{\mu \to \infty}\lim_{\pi \to 0} \pi E[N] \le \lim_{\mu \to \infty} C_1(\mu) = 1.$$

This implies Corollary 2 is tight in this regime. As $\mu$ tends to infinity we approach the noise-free case, and the procedure is able to make perfect decisions with one sample per population. As expected, the required number of samples grows as $1/\pi$.

Second, as $\mu \to 0$,

$$\lim_{\mu \to 0} C_1(\mu)\,D(P_0\|P_1) = 1$$

which implies (again from Corollary 3)

$$\lim_{\mu \to 0}\lim_{\pi \to 0} \pi\,D(P_0\|P_1)\,E[N] \le \lim_{\mu \to 0} C_1(\mu)\,D(P_0\|P_1) = 1.$$

Comparison to Corollary 2 shows the bound is tight. For small $\mu$, the S-SPRT requires approximately $\frac{1}{\pi D(P_0\|P_1)}$ samples as the distributions grow similar; no procedure can do better.

Fig. 2 plots the expected number of samples scaled by $\pi$ as a function of $\mu$. Specifically, the figure displays four plots. First, $\pi E[N]$ vs. $\mu$ obtained from simulation of the S-SPRT procedure is plotted (the simulation parameters are listed in the caption of Fig. 2). Second, the lower bound from Theorem 1 is shown. For small $\pi$, from Theorem 1, any reliable procedure has

$$E[N] \gtrsim \frac{1}{\pi}\max\left(1, \frac{1}{D(P_0\|P_1)}\right).$$

The upper bound from Theorem 2 is also plotted. From (9), for small values of $\pi$, the S-SPRT achieves

$$E[N] \lesssim \frac{C_1}{\pi}$$

where $C_1$ is calculated by minimizing (10) over $\gamma_L$ for each value of $\mu$. $C_1$ is within a small factor of the lower bound for all values of $\mu$.

Lastly, the performance of sequential thresholding (Alg. 2) is plotted. The maximum number of rounds is as specified in Theorem 3.

Fig. 2: Expected number of samples scaled by $\pi$ as a function of the mean of the atypical population, $\mu$, corresponding to Example 1. Simulation of the S-SPRT (Alg. 1) is plotted with the upper bound from Corollary 3 and lower bound from Corollary 2. Sequential thresholding (Alg. 2). $\pi = 10^{-3}$, $P_e \le 10^{-2}$, $10^3$ trials for each value of $\mu$.

Example 2. Searching for a biased coin. Consider the problem of searching for a coin with bias towards heads of $1/2 + b$ amongst coins with bias towards heads of $1/2 - b$, for some bias parameter $b > 0$. This problem was studied recently in [9].

###### Corollary 4.

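Biased Coin. The S-SPRT procedure (Alg. 1) with $\gamma_L = \frac{1-2b}{1+2b}$ and $\gamma_U = \frac{1-\pi}{\pi\delta}$ satisfies $P_e \le \frac{\delta}{1+\delta}$ and

$$E[N] \le \frac{1}{2b^2}\left(\frac{1}{\pi} + \log\left(\frac{1}{\pi\delta}\right) + 1\right).$$

###### Proof.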

The proof follows from evaluation of the constants in Theorem 2. The log-likelihood ratio corresponding to each sample (each coin flip) takes one of two values: if a coin reveals heads, $L = \log\frac{1+2b}{1-2b}$, and if a coin reveals tails, $L = -\log\frac{1+2b}{1-2b}$. When each individual SPRT terminates, it can exceed the threshold by no more than this value, giving

$$C_1'(b) = \log\frac{1+2b}{1-2b}, \qquad C_2'(b) = \log\frac{1+2b}{1-2b}$$

where $C_1'$ is defined in (10), and $C_2'$ in (29). With $\gamma_L = \frac{1-2b}{1+2b}$, we can directly calculate the constants in Theorem 2. From (10),

$$C_1(b) = \frac{1+2b}{4b^2} \le \frac{1}{2b^2}$$

as the Kullback-Leibler divergence is $D(P_0\|P_1) = D(P_1\|P_0) = 2b\log\frac{1+2b}{1-2b}$. Also note that $D(P_1\|P_0) \ge 2b^2$. Lastly, from (29),

$$C_2(b) = \frac{C_2'}{D(P_1\|P_0)} = \frac{1}{2b} \le \frac{1}{2b^2}.$$

Combining these with Theorem 2 completes the proof. ∎
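The constants in the proof can be checked numerically by plugging the coin model into (10). The sketch below assumes the lower threshold $\gamma_L = \frac{1-2b}{1+2b}$, which is the value that reproduces the stated constant $\frac{1+2b}{4b^2}$, and treats it as an inference rather than a quotation; the function name is hypothetical.

```python
import math


def coin_constants(b):
    """Return (C1, C2) for the biased-coin example, for 0 < b < 1/2.

    Model: P1 = Bernoulli(1/2 + b), P0 = Bernoulli(1/2 - b).
    Assumes gamma_L = (1 - 2b) / (1 + 2b), i.e. one tail below the start.
    """
    if not (0.0 < b < 0.5):
        raise ValueError("need 0 < b < 1/2")
    llr = math.log((1.0 + 2.0 * b) / (1.0 - 2.0 * b))   # log-likelihood ratio of one head
    kl = 2.0 * b * llr                                  # D(P0||P1) = D(P1||P0)
    gamma_l = (1.0 - 2.0 * b) / (1.0 + 2.0 * b)
    overshoot = llr                                     # C1' = C2': at most one flip past a threshold
    c1 = (overshoot + math.log(1.0 / gamma_l)) / ((1.0 - gamma_l) * kl)   # expression (10)
    c2 = overshoot / kl
    return c1, c2


if __name__ == "__main__":
    for b in (0.05, 0.25, 0.45):
        c1, c2 = coin_constants(b)
        print(f"b={b}: C1={c1:.4f} (closed form {(1 + 2 * b) / (4 * b * b):.4f}), C2={c2:.4f}")
```

The printed values agree with the closed forms $\frac{1+2b}{4b^2}$ and $\frac{1}{2b}$, and both stay below $\frac{1}{2b^2}$ over the allowed range of $b$.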

Comparison of Corollary 4 to [9, Theorem 2] shows the leading constant is a factor of 32 smaller in the bound presented here.

Moreover, closer inspection reveals that the constant can be further tightened. Specifically, note that when an individual SPRT estimates $\hat{X}_i = 0$ it must hit the lower threshold exactly (since $\gamma_L = \frac{1-2b}{1+2b}$). If we restrict attention to cases in which the upper threshold is an integer power of the per-flip likelihood ratio (i.e., $\gamma_U = \left(\frac{1+2b}{1-2b}\right)^m$ for some integer $m$), the overshoot here is also zero. Then $C_1' = 0$ and $C_2' = 0$, which give

$$C_1(b) = \frac{1+2b}{8b^2}. \tag{13}$$

From Corollary 3,

$$\lim_{\pi \to 0} \pi E[N] \le \frac{1+2b}{8b^2}. \tag{14}$$

For small $\pi$, the number of samples required by any procedure to reliably identify an atypical population is

$$E[N] \lesssim \frac{1}{\pi}\left(\frac{1+2b}{8b^2}\right).$$

If $b = 1/2$ (each coin flip is deterministic), $C_1 = 1$, and the expected number of samples grows as $1/\pi$, as expected. The upper bound in Corollary 3 and lower bound in Corollary 2 converge.

Likewise, as the bias of the coin becomes small, $C_1 \approx \frac{1}{8b^2}$, and the expected number of samples needed to reliably identify an atypical population grows as $\frac{1}{8b^2\pi}$. Again the upper and lower bounds converge.

Note that the S-SPRT procedure for testing the coins in this particular example is equivalent to a simple, intuitive procedure, which can be implemented as follows: beginning with coin $i = 1$ and a scalar statistic $T = 0$, if heads appears, add 1 to the statistic. Likewise, if tails appears, subtract 1 from the test statistic. Continue to flip the coin until either 1) $T$ falls below 0, or 2) $T$ exceeds some upper threshold (which controls the error rate). If the statistic falls below 0, move to a new coin, and reset the count, i.e., set $T = 0$; conversely, if the statistic exceeds the upper threshold, terminate the procedure. Note that any time the coin shows tails on the first flip, the procedure immediately moves to a new coin.

Fig. 3: Expected number of samples scaled by $\pi$ as a function of the bias of the coin corresponding to Example 2. Upper and lower bounds from Corollaries 3 and 2. Simulation of the S-SPRT (Alg. 1). Simulation of sequential thresholding (Alg. 2). $\pi = 10^{-3}$, $P_e \le 10^{-2}$, $10^3$ trials for each value of $b$.
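The counting procedure for the coins can be sketched as a small random-walk simulation. This is an illustration under stated assumptions: the function name `coin_search`, the integer upper threshold `upper` (the walk stops once the count reaches it), and the seeded generator are all choices made here, not details taken from the paper.

```python
import random


def coin_search(pi, b, upper, rng):
    """Run one search for a biased coin; return (coin, is_biased, total_flips).

    A coin is biased (heads probability 1/2 + b) with probability pi and
    otherwise has heads probability 1/2 - b. The count starts at 0, moves
    +1 on heads and -1 on tails, and a coin is abandoned when the count
    drops below 0. The search stops when the count reaches `upper`.
    """
    if upper < 1:
        raise ValueError("upper must be a positive integer")
    coin, flips = 0, 0
    while True:
        coin += 1
        biased = rng.random() < pi
        p_heads = 0.5 + b if biased else 0.5 - b
        count = 0                                   # reset the statistic for each new coin
        while 0 <= count < upper:
            count += 1 if rng.random() < p_heads else -1
            flips += 1
        if count >= upper:                          # upper threshold reached: terminate
            return coin, biased, flips
        # count fell below 0: move to the next coin


if __name__ == "__main__":
    rng = random.Random(3)
    runs = [coin_search(0.05, 0.2, 8, rng) for _ in range(500)]
    errors = sum(1 for _, biased, _ in runs if not biased)
    mean_flips = sum(flips for _, _, flips in runs) / len(runs)
    print(f"empirical error rate {errors / len(runs):.3f}, mean flips {mean_flips:.1f}")
```

Because the count moves in unit steps, it hits both stopping levels exactly, which mirrors the zero-overshoot argument used to tighten the constant.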

Fig. 3 plots the expected number of samples scaled by as a function of the bias of the atypical coins, . The S-SPRT was simulated with the lower threshold set at for and . The upper and lower bounds from Corollaries 3 and 2 are also plotted. The upper bound, , is given by the expression in (13).

Notice that the simulated S-SPRT procedure appears to achieve the upper bound. Closer inspection of the derivation of Theorem 2 with (as the overshoot in (27) is zero), shows the bound on the number of samples required by the S-SPRT is indeed tight for the search for the biased coin. Performance of sequential thresholding (Alg. 2) is included for comparison.
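The coin-flipping procedure described above is simple enough to simulate in a few lines. The following is a minimal sketch, not the simulation used for Fig. 3. It assumes that a coin is atypical with probability `pi`, that atypical coins show heads with probability 1/2 + b and typical coins with probability 1/2 - b, and that the statistic moves in unit steps. The function name, the upper threshold of 12, and the values of `pi` and `b` are hypothetical choices for illustration.

```python
import random


def search_biased_coin(pi, b, upper, rng):
    """Flip coins until the statistic of one coin exceeds `upper`.

    Returns (total_flips, found_biased).
    Assumed model (an illustration, not taken verbatim from the paper):
    a coin is atypical with probability pi; atypical coins show heads
    with probability 1/2 + b, typical coins with probability 1/2 - b.
    """
    total = 0
    while True:
        biased = rng.random() < pi           # label of a fresh coin
        p_heads = 0.5 + b if biased else 0.5 - b
        stat = 0                             # reset the statistic for each coin
        while 0 <= stat <= upper:
            stat += 1 if rng.random() < p_heads else -1
            total += 1
        if stat > upper:                     # upper threshold exceeded: stop
            return total, biased
        # otherwise the statistic fell below 0: move on to a new coin


if __name__ == "__main__":
    rng = random.Random(0)
    pi, b, upper, trials = 0.005, 0.25, 12, 1000
    runs = [search_biased_coin(pi, b, upper, rng) for _ in range(trials)]
    mean_n = sum(n for n, _ in runs) / trials
    errors = sum(1 for _, ok in runs if not ok) / trials
    print(f"pi * mean(N)         = {pi * mean_n:.3f}")
    print(f"(1 + 2b) / (8 b^2)   = {(1 + 2 * b) / (8 * b * b):.3f}")
    print(f"empirical error rate = {errors:.4f}")
```

For a fixed threshold and a value of `pi` that is small but not vanishing, the scaled sample count sits slightly above the constant in (13), since the flips spent on the atypical coin itself contribute a term that does not scale with `1/pi`.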

## VIII Conclusion

This paper explored the problem of finding an atypical population amongst a number of typical populations, a problem arising in many aspects of science and engineering.

More specifically, this paper quantified the number of samples required to recover an atypical population with high probability. We paid particular attention to problems in which the atypical populations themselves become increasingly rare. After establishing a lower bound based on the Kullback-Leibler divergence between the underlying distributions, we studied the number of samples required by the optimal S-SPRT procedure. This number is within a constant factor of the lower bound, and the constant can be derived explicitly in a number of cases. Two common examples, where the distributions are Gaussian and Bernoulli, were studied.

Sequential thresholding, a more robust procedure that can often be implemented with less prior knowledge about the distributions, was presented and analyzed in the context of the quickest search problem. Sequential thresholding requires a constant factor more samples than the S-SPRT. Both sequential thresholding and the SPRT procedure were shown to be fairly robust to modeling errors in the prior probability. Lastly, for comparison, a lower bound for non-adaptive procedures was presented.

## Appendix A

Proof of Theorem 1: Assume that and from (3) we have

$$\frac{\alpha(1-\pi)}{\pi(1-\beta)} \le \delta. \tag{15}$$

For ease of notation, define

$$E_1 = E[N_1 \mid X_1 = 1], \qquad E_0 = E[N_1 \mid X_1 = 0]. \tag{16}$$

From (2),

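$$\begin{aligned}
E[N] &= \frac{\pi E_1 + (1-\pi)E_0}{\alpha(1-\pi) + \pi(1-\beta)} \ge \frac{\pi E_1 + (1-\pi)E_0}{(1+\delta)\pi(1-\beta)} \\
&= \frac{E_1}{(1+\delta)(1-\beta)} + \frac{(1-\pi)E_0}{(1+\delta)\pi(1-\beta)}.
\end{aligned} \tag{17}$$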

From standard sequential analysis techniques (see [22, Theorem 2.29]) we have the following bounds relating the expected number of measurements to $\alpha$ and $\beta$, which hold for any binary hypothesis testing procedure:

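$$E_1 \ge \frac{\beta\log\left(\frac{\beta}{1-\alpha}\right) + (1-\beta)\log\left(\frac{1-\beta}{\alpha}\right)}{D(P_1\|P_0)} \tag{18}$$

$$E_0 \ge \frac{\alpha\log\left(\frac{\alpha}{1-\beta}\right) + (1-\alpha)\log\left(\frac{1-\alpha}{\beta}\right)}{D(P_0\|P_1)}. \tag{19}$$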
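The right-hand sides of (18) and (19) are easy to evaluate numerically. The sketch below does so for a pair of Bernoulli distributions. The function names, the Bernoulli parameters 0.25 and 0.75, and the error probabilities are illustrative assumptions rather than values from the paper. The code reads $\alpha$ as the probability of declaring a typical population atypical and $\beta$ as the probability of the reverse error, which is consistent with the denominator in (17).

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Kullback-Leibler divergence D(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def wald_lower_bounds(alpha, beta, d10, d01):
    """Right-hand sides of (18) and (19).

    alpha, beta: error probabilities, both strictly between 0 and 1.
    d10 = D(P1||P0), d01 = D(P0||P1).
    Returns (lower bound on E1, lower bound on E0).
    """
    e1 = (beta * math.log(beta / (1 - alpha))
          + (1 - beta) * math.log((1 - beta) / alpha)) / d10
    e0 = (alpha * math.log(alpha / (1 - beta))
          + (1 - alpha) * math.log((1 - alpha) / beta)) / d01
    return e1, e0


if __name__ == "__main__":
    p0, p1 = 0.25, 0.75            # hypothetical Bernoulli parameters
    alpha, beta = 1e-4, 0.3        # hypothetical error probabilities
    d10 = kl_bernoulli(p1, p0)
    d01 = kl_bernoulli(p0, p1)
    e1, e0 = wald_lower_bounds(alpha, beta, d10, d01)
    print(f"D(P1||P0) = {d10:.4f}, D(P0||P1) = {d01:.4f}")
    print(f"E1 >= {e1:.3f}, E0 >= {e0:.3f}")
```

Each numerator is itself a divergence between two Bernoulli distributions built from $\alpha$ and $\beta$, so both bounds are non-negative and vanish only when $\alpha + \beta = 1$.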

Substituting (18) and (19) into (17),

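$$\begin{aligned}
E[N] \ge{} & \underbrace{\frac{\beta\log\left(\frac{\beta}{1-\alpha}\right)}{(1+\delta)(1-\beta)D(P_1\|P_0)}}_{T_1} + \underbrace{\frac{\log\left(\frac{1-\beta}{\alpha}\right)}{(1+\delta)D(P_1\|P_0)}}_{T_2} \\
& + \underbrace{\frac{(1-\pi)\left(\alpha\log\left(\frac{\alpha}{1-\beta}\right) + (1-\alpha)\log\left(\frac{1-\alpha}{\beta}\right)\right)}{\pi(1+\delta)(1-\beta)D(P_0\|P_1)}}_{T_3}.
\end{aligned}$$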

We first bound $T_1$ as

$$T_1 \ge \frac{-1}{(1+\delta)D(P_1\|P_0)} \ge \frac{-1}{D(P_1\|P_0)} \tag{20}$$

since for all ,

$$\frac{\beta\log\frac{\beta}{1-\alpha}}{1-\beta} \ge \frac{\beta\log\beta}{1-\beta} \ge -1.$$
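The second inequality in this chain follows from the elementary bound $\log\beta \ge 1 - 1/\beta$, which gives $\beta\log\beta \ge -(1-\beta)$. As a sanity check, the sketch below evaluates the left-hand side on a grid of error probabilities and confirms that it never drops below -1. The function name and the grid resolution are arbitrary choices for illustration.

```python
import math


def t1_factor(alpha: float, beta: float) -> float:
    """beta * log(beta / (1 - alpha)) / (1 - beta), the factor bounded below by -1."""
    return beta * math.log(beta / (1.0 - alpha)) / (1.0 - beta)


if __name__ == "__main__":
    alphas = [k / 100 for k in range(0, 100)]   # 0.00, 0.01, ..., 0.99
    betas = [k / 100 for k in range(1, 100)]    # 0.01, 0.02, ..., 0.99
    worst = min(t1_factor(a, b) for a in alphas for b in betas)
    print(f"minimum over the grid: {worst:.6f}")
```

The minimum over the grid is attained at the smallest $\alpha$ and the largest $\beta$, which matches the observation that the bound becomes tight only as $\alpha \to 0$ and $\beta \to 1$.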

From (15),

$$T_2 \ge \frac{\log\left(\frac{1-\pi}{\pi\delta}\right)}{(1+\delta)D(P_1\|P_0)}.$$

Next, differentiating $T_3$ with respect to $\alpha$ gives

$$\frac{dT_3}{d\alpha} = \frac{(1-\pi)\log\frac{\alpha\beta}{(1-\alpha)(1-\beta)}}{(1+\delta)\pi(1-\beta)D(P_0\|P_1)}$$

showing that the expression is non-increasing in over the set of satisfying From (15), we are restricted to and thus, if , then (19) is non-increasing in . To show this, note that

$$\frac{\delta\pi}{1-\pi} \le \delta \le 1-\delta \le 1 - \frac{\alpha(1-\pi)}{\pi(1-\beta)} \le 1-\alpha \le \frac{1-\alpha}{\beta}$$

since both and . We can replace in (19) with . This gives

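$$\begin{aligned}
T_3 &\ge \frac{\delta\log\left(\frac{\delta\pi}{1-\pi}\right)}{(1+\delta)D(P_0\|P_1)} + \frac{(1-\pi)\left(1-\frac{\delta\pi(1-\beta)}{1-\pi}\right)\log\left(\frac{1}{\beta}-\frac{\delta\pi(1-\beta)}{\beta(1-\pi)}\right)}{\pi D(P_0\|P_1)(1+\delta)(1-\beta)} \\
&\ge \frac{\delta\log\left(\frac{\delta\pi}{1-\pi}\right)}{(1+\delta)D(P_0\|P_1)} + \frac{(1-\pi)(1-\delta)\log\left(\frac{1}{\beta}-\frac{\delta(1-\beta)}{\beta}\right)}{\pi D(P_0\|P_1)(1+\delta)(1-\beta)} \\
&\ge \frac{\delta\log\left(\frac{\delta\pi}{1-\pi}\right)}{(1+\delta)D(P_0\|P_1)} + \frac{(1-\pi)(1-\delta)^2}{\pi D(P_0\|P_1)(1+\delta)}
\end{aligned}$$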

where the first inequality follows from making the substitution for and from (15), and the second inequality follows since and , and the last inequality follows as

$$\frac{\log\left(\frac{1}{\beta}-\frac{\delta(1-\beta)}{\beta}\right)}{1-\beta} \ge 1-\delta \tag{21}$$

for all . To see the validity of (21), we note

 log(1β−δ(1−β)β)(1−β)(1−δ) = log(1+(1−δ)(1−β)β)(1−β)(1−δ) (⋆)≥ log(1