Optimal Stopping for Interval Estimation in Bernoulli Trials

11/18/2017 ∙ by Tony Yaacoub, et al.

We propose an optimal sequential methodology for obtaining confidence intervals for a binomial proportion θ. Assuming that an i.i.d. random sequence of Bernoulli(θ) trials is observed sequentially, we are interested in designing a) a stopping time T that decides when it is best to stop sampling the process, and b) an optimum estimator θ̂_T that provides the optimum center of the interval estimate of θ. We follow a semi-Bayesian approach, where we assume that there exists a prior distribution for θ, and our goal is to minimize the average number of samples while guaranteeing a minimal coverage probability level. The solution is obtained by applying standard optimal stopping theory and computing the optimum pair (T, θ̂_T) numerically. Regarding the optimum stopping time component T, we demonstrate that it enjoys certain very uncommon characteristics not encountered in solutions of other classical optimal stopping problems. Finally, we compare our method with the optimum fixed-sample-size procedure as well as with existing alternative sequential schemes.

I Introduction

Interval estimation of a binomial proportion θ is one of the most basic problems in statistics, with many important real-world applications. Some classical applications include interval estimation of the prevalence of a rare disease [1]; interval estimation of the overall response rate in clinical trials [2]; and accuracy assessment in remote sensing [3]. In these applications, the sample size is fixed in advance, and a confidence interval for θ is obtained. There exists an extensive bibliography regarding derivations of confidence intervals for θ when the sample size is fixed. Perhaps the most widely known in this category is Wald's interval, which takes the form X̄_T ± z_{α/2} √(X̄_T(1 − X̄_T)/T), where T is the fixed sample size, 1 − α expresses the desired coverage probability, X̄_T is the sample mean of X_1, …, X_T, and z_{α/2} satisfies Q(z_{α/2}) = α/2, with Q(·) denoting the complementary cdf of a standard Gaussian random variable. This confidence interval is derived based on the asymptotic normality of X̄_T and, therefore, exhibits poor behavior when T is small [4, 5, 6, 7]. Several efforts to improve Wald's classical method are reported in [8, 9, 10, 11, 12, 4]. There are also Bayesian-based techniques [5, 13, 14], while [4, 5, 6, 15, 7] contain interesting surveys that evaluate the relative performance of the above methods. Finally, we must mention that [16] provides explicit formulas for the required sample size that can guarantee a prescribed coverage probability.
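For quick reference, here is a minimal Python sketch of Wald's interval; the function name, the cropping at 0 and 1, and the example numbers are our own illustrative choices.

```python
import math
from statistics import NormalDist

def wald_interval(successes, trials, alpha=0.05):
    """Wald interval: sample mean +/- z_{alpha/2} * sqrt(mean*(1-mean)/T),
    cropped to [0, 1]."""
    p = successes / trials
    z = NormalDist().inv_cdf(1 - alpha / 2)        # z such that Q(z) = alpha/2
    half = z * math.sqrt(p * (1 - p) / trials)
    return max(p - half, 0.0), min(p + half, 1.0)

print(wald_interval(18, 50))                        # roughly (0.227, 0.493)
```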

In many modern applications, sampling observations is costly and time consuming. Therefore, there is a desire to limit the sample size without, however, compromising the quality of the interval estimate. For instance, in automatic fraud detection in finance, one needs to manually go through the "suspect" financial transactions that are automatically detected as fraudulent by some machine learning or other computer algorithm. Since the manual process is expensive in labor and cost, it is desirable to quickly estimate, with high confidence, what percentage of the suspect transactions are truly fraudulent. A different motivating application is in Statistical Model Checking, where, with an approximate verification method, one overcomes the state space explosion problem for probabilistic systems by Monte Carlo simulations. Given an executable stochastic system, we verify a system's property with simulation and we desire to estimate the probability θ by which the system satisfies the property in question. The goal is to estimate θ within acceptable margins of error and confidence (see [17] and references therein). Because Monte Carlo simulations very often tend to require extensive time and computing power, it is advantageous to reduce their number while assuring, at the same time, satisfactory quality levels for the corresponding estimate. The sequential version of interval estimation aims exactly at reducing the sample size by selecting it to be random and, in particular, a stopping time controlled by the observations themselves. The literature focusing on the sequential setup of the problem is limited compared to its fixed-sample-size counterpart (see [18, 19, 20]). However, none of these articles is able to claim optimality of their corresponding schemes in any sense.

The objective of our current work is to offer optimum sequential methods for interval estimation of θ, with the quality of the estimate expressed through the coverage probability. In addition to deriving the optimum scheme, we will also demonstrate some very uncommon but highly interesting properties of the optimum solution. These properties are not encountered in optimum sequential schemes derived for other well-known sequential problems (e.g. sequential hypothesis testing). We must also add that our methodology exhibits similarities with the work developed in [21]. However, the focus in [21] is on the actual estimate of θ, with the adopted criterion being a variation of the classical mean square error. In our work, as we pointed out, we focus on confidence intervals and coverage probabilities; and, as it turns out, this difference makes our derivations and proofs far more complicated, requiring original analytical methodology. This becomes particularly apparent when we attempt to establish the validity of the unique properties, mentioned before, that characterize our optimum solution.

The remainder of this article is organized as follows. In Section II we discuss our proposed framework for interval estimation of θ, propose a well-defined optimization problem, and discuss its general solution. In Section III we focus on the computational aspects of the optimum scheme and the unique properties that characterize it. In Section IV we compare the proposed scheme against the fixed-sample-size procedure and two existing sequential methods in the literature. Finally, Section V contains our conclusions.

II Proposed Framework

We observe sequentially an i.i.d. process X_1, X_2, … of Bernoulli random variables with P(X_n = 1) = θ and P(X_n = 0) = 1 − θ. The goal is to provide a confidence interval for θ. We are interested in confidence intervals of fixed width equal to 2h for some pre-specified h ∈ (0, 1/2). We would also like our scheme to be able to guarantee a coverage probability equal to 1 − α, where α ∈ (0, 1) is given. Our scheme consists of a pair (T, θ̂_T), that is, a stopping time and a mid-point estimator (the estimate θ̂_T does not have the meaning of a classical parameter estimator: it is the mid-point of the confidence interval and does not necessarily constitute an efficient estimate of θ), where T is adapted to the observation history {F_n} (the filtration generated by the observations) and θ̂_T is a function of the observations accumulated up to the time of stopping T. We would like to solve the following constrained optimization problem for the optimum pair (T, θ̂_T):

inf_{(T, θ̂_T)} E_θ[T], subject to P_θ(|θ̂_T − θ| ≤ h) ≥ 1 − α for all θ ∈ (0, 1),   (1)

where the desired interval estimate is [θ̂_T − h, θ̂_T + h] (with the two ends cropped at 0 and 1, respectively, whenever they exceed these limits) and where P_θ and E_θ denote probability and expectation for given θ.

Although (1) seems like the ideal formulation, it unfortunately targets an infeasible goal. We note that we are asking for the pair (T, θ̂_T) to minimize the average number of samples for every value of the parameter θ. In other words, we want our scheme to enjoy a uniform optimality property over all θ, a requirement which is impossible to satisfy. In order to be able to find a solution that has a well-defined form of optimality, we adopt a semi-Bayesian approach (the term "semi-Bayesian" is used because our setup involves two different components, one optimized and the other constrained, unlike full-Bayesian approaches that combine all terms into a single performance measure) and assume that a prior π(θ) for θ is available. This allows for the following modification of the previous constrained optimization:

inf_{(T, θ̂_T)} E[T], subject to P(|θ̂_T − θ| ≤ h) ≥ 1 − α,   (2)

where P and E denote probability and expectation, now including averaging over θ with the help of the prior π(θ).

Remark 1.

We must emphasize that the constraint in (2) does not guarantee that the desired coverage probability will also hold for each individual θ, namely P_θ(|θ̂_T − θ| ≤ h) ≥ 1 − α, a property which is particularly desirable in practice. Perhaps a more meaningful problem to consider in place of (2) would have been

inf_{(T, θ̂_T)} E[T], subject to inf_{θ ∈ (0,1)} P_θ(|θ̂_T − θ| ≤ h) ≥ 1 − α,   (3)

which assures a coverage probability of at least 1 − α for every θ. Unfortunately, it is unclear how to derive the optimal solution to this alternative formulation. Consequently, we focus on (2) for the optimum scheme we are going to develop but, in our numerical examples, we will evaluate it in terms of (3) as well.

Let λ > 0 denote a Lagrange multiplier that we use to combine the two terms in (2) into a single cost function, and consider the unconstrained optimization problem

inf_{(T, θ̂_T)} { λ E[T] + P(|θ̂_T − θ| > h) }.   (4)

We will first identify the solution to (4) and then demonstrate that a proper selection of λ can also solve the constrained problem in (2).

II-A The Unconstrained Problem

We start by considering the classical Bayes estimation problem for a fixed sample size n:

inf_{θ̂_n} P(|θ̂_n − θ| > h).   (5)

If we observe X_1, …, X_n then, given that the sequence is i.i.d. Bernoulli(θ), the probability of obtaining a specific combination of samples given θ is equal to θ^{S_n}(1 − θ)^{n−S_n}, where S_n = X_1 + ⋯ + X_n is the number of "successes" up to time n. This implies that the posterior probability density of θ given the observations can be written as

π_n(θ) = θ^{S_n}(1 − θ)^{n−S_n} π(θ) / ∫₀¹ u^{S_n}(1 − u)^{n−S_n} π(u) du.   (6)

From Bayesian estimation theory [22, Page 142], we have that the optimization in (5) is achieved by the following Bayes estimator

θ̂_n = arg max_{x ∈ [0,1]} P(|x − θ| ≤ h | F_n) = arg max_{x ∈ [0,1]} ∫_{max(x−h,0)}^{min(x+h,1)} π_n(θ) dθ,   (7)

yielding the corresponding optimum conditional complementary coverage probability

G_n(S_n) = min_{x ∈ [0,1]} P(|x − θ| > h | F_n) = 1 − ∫_{θ̂_n−h}^{θ̂_n+h} π_n(θ) dθ.   (8)

From (7) and (8) we observe that both quantities are F_n-measurable and, more precisely, functions of the pair (n, S_n). For a known prior π(θ) we can, at least numerically, compute the Bayes estimate θ̂_n and the corresponding optimum conditional complementary coverage probability G_n(S_n) for each combination of the integer pair (n, S_n).
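To illustrate this numerical step, the following Python sketch computes θ̂_n and G_n(S_n) for an arbitrary prior supplied as a pdf; the grid resolution and the function names are our own illustrative choices, not from the paper.

```python
import numpy as np

def midpoint_and_G(prior_pdf, n, s, h, m=4001):
    """Bayes mid-point (7) and complementary coverage G_n(S_n) (8) by grid search."""
    theta = np.linspace(0.0, 1.0, m)
    post = theta**s * (1.0 - theta)**(n - s) * prior_pdf(theta)   # eq. (6), unnormalized
    cdf = np.concatenate(([0.0],
          np.cumsum(0.5 * (post[1:] + post[:-1]) * np.diff(theta))))
    cdf /= cdf[-1]                                                # normalize the posterior
    x = theta[(theta >= h) & (theta <= 1.0 - h)]                  # eq. (9): h <= x <= 1-h
    coverage = np.interp(x + h, theta, cdf) - np.interp(x - h, theta, cdf)
    i = np.argmax(coverage)
    return x[i], 1.0 - coverage[i]                                # (theta_hat, G_n)

# Uniform prior, n = 20 trials with s = 7 successes, half-width h = 0.1:
print(midpoint_and_G(lambda t: np.ones_like(t), 20, 7, 0.1))
```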

Remark 2.

By focusing on (7), we can make a small but interesting observation regarding the Bayes estimate θ̂_n: it is easy to verify that

h ≤ θ̂_n ≤ 1 − h.   (9)

Indeed, this is clear because, if in (7) we select x < h or x > 1 − h, this will yield an inferior cost compared to the selection x = h or x = 1 − h, respectively. The implication of this observation is that θ̂_T will be biased and inconsistent when considered as an estimate of the true parameter θ, at least for values of θ outside the interval [h, 1 − h]. As we mentioned, the correct meaning of this quantity is that it constitutes the mid-point of the confidence interval, with the latter enjoying, for each fixed n, the largest possible coverage probability.

Consider now the optimization in (4), which will be performed in two steps: first we fix the stopping time T and minimize with respect to θ̂_T; the resulting expression is then minimized, during the second step, over T in order to obtain the optimum pair. We have the following lemma that addresses the first problem.

Lemma 1.

Assume the stopping time T is fixed and satisfies T ≤ N, where N is some deterministic integer. Then,

P(|θ̂_T − θ| > h) ≥ E[G_T(S_T)],   (10)

with equality when we apply the corresponding Bayes estimator (7) at the time of stopping.

Proof.

The proof is straightforward and presented in the Appendix. ∎

A side-product of Lemma 1, as can be verified from the corresponding proof in the Appendix, is the fact that the Bayes estimator is not only optimum for a fixed sample size, but retains its optimality property when the sample size is controlled by any stopping time adapted to the observations.

Using (10) from Lemma 1, we are now left with the optimization of the stopping time T. Assuming that the horizon N is a sufficiently large integer, we consider the following optimization over stopping times that are bounded by N:

inf_{T ≤ N} { λ E[T] + E[G_T(S_T)] }.   (11)

This is a classical finite-horizon optimal stopping problem with cost per sample equal to λ and cost for stopping at T = n equal to G_n(S_n). Of course, it is only natural to wonder why we limit our analysis to finite horizons instead of considering the more classical infinite-horizon version. As we will see in the sequel, for the most common prior we will be able to demonstrate that the infinite-horizon assumption is completely unnecessary. Indeed, the optimum stopping time will turn out to be bounded by a deterministic quantity, suggesting that by limiting ourselves to a (sufficiently large) finite horizon we do not suffer any performance loss.

In order to solve the optimization problem defined in (11), we follow classical optimal stopping theory [23]. For n = N, N − 1, …, 0, define the sequence of optimal average residual costs

V_n = inf_{n ≤ T ≤ N} E[ λ(T − n) + G_T(S_T) | F_n ];   (12)

then we have

V_n = min{ G_n(S_n), λ + E[V_{n+1} | F_n] },   (13)

with the backward recursion initialized with V_N = G_N(S_N). Regarding this last selection, it produces V_N ≤ 1, since the latter is a probability. In fact, this is exactly what the optimum residual cost at n = N must be because, if we have not stopped before N, then we necessarily stop at N and this produces cost G_N(S_N) (simply the cost of stopping at N). The total optimum cost is expressed through V_0, namely inf_{T ≤ N} { λE[T] + E[G_T(S_T)] } = V_0. The next lemma specifies in more detail the recursion in (13).

Lemma 2.

Consider the recursion in (13); then the optimal residual cost V_n is a function of S_n and therefore F_n-measurable. Furthermore, (13) can be written as

V_n(S_n) = min{ G_n(S_n), λ + A_n(S_n) },   (14)

where A_n(S_n) expresses the optimum average residual cost to continue, satisfying

A_n(S_n) = p_n(S_n) V_{n+1}(S_n + 1) + (1 − p_n(S_n)) V_{n+1}(S_n),   (15)
p_n(S_n) = P(X_{n+1} = 1 | F_n) = ∫₀¹ θ π_n(θ) dθ.   (16)

Finally, if the prior π(θ) is symmetric around 1/2, then the functions G_n(s), A_n(s), V_n(s) are symmetric with respect to s around the value n/2.

Proof.

The validity of this lemma is straightforward and can be easily established using induction. We therefore give no further details. ∎

Once the sequence of optimal residual costs {V_n(s)} has been obtained through the solution of (14), it is immediate to define the optimum stopping time that solves the minimization problem in (11). Again, optimal stopping theory [23] suggests that

T = inf{ 0 ≤ n ≤ N : V_n(S_n) = G_n(S_n) }.   (17)

In other words, we stop when the optimum residual cost matches, for the first time, the cost for stopping or, equivalently, when the cost of stopping is no larger than the residual cost of continuing. Since the functions involved depend on S_n, this quantity can serve as our test statistic, and we can express the stopping rule in (17) in terms of S_n. Specifically, for each time n we can find the sampling region C_n = {s : V_n(s) < G_n(s)}, with C_N = ∅, and we can equivalently define the stopping time as T = inf{0 ≤ n ≤ N : S_n ∉ C_n}.

II-B The Constrained Problem

Let us now turn to the constrained problem in (2), which we can solve with the results we have so far. We will show that (2) can be recovered as an instance of the unconstrained version (4) corresponding to a special selection of the Lagrange multiplier λ. Our result is summarized in the following theorem.

Theorem 1.

For the solution of (2) we distinguish two cases:

i) If G_0 ≤ α, where G_0 denotes the optimum complementary coverage probability in (8) evaluated at n = 0 (that is, computed with the prior alone), then the optimum is to stop without taking any samples, i.e. T = 0, and use as mid-point of the optimum confidence interval the value θ̂_0 obtained from (7) for n = 0, which is based only on the prior π(θ).

ii) If G_0 > α, then for any sufficiently large horizon N (large enough for the coverage constraint in (2) to be attainable), there exists a Lagrange multiplier λ ≥ 0 such that the solution of (4) is also the solution to (2), possibly involving a randomization before taking any samples.

Proof.

The proof of this theorem is presented in the Appendix. ∎

III Properties of the Optimum Solution

If we fix the value of the horizon N and the cost per sample λ, we can then compute the mid-points θ̂_n of the confidence intervals from (7). Assuming that the prior π(θ) is continuous, candidates for θ̂_n can be obtained from the solution of the following equation, which we obtain by differentiating (7) with respect to x:

π_n(x + h) = π_n(x − h).   (18)

This equation has a solution in the interval [h, 1 − h] whenever π_n(x + h) − π_n(x − h) changes sign there, with the corresponding value providing a (local) extremum for the coverage probability. To these candidate mid-points we must add the two end points x = h and x = 1 − h, since the global maximum can occur at the two ends as well. Therefore, we need to examine which of these cases provides the best coverage probability and select the corresponding value as our optimum mid-point θ̂_n. It is also possible for (18) to have no solution in [h, 1 − h]; in this case, θ̂_n is equal to one of the two end values h or 1 − h. Having identified the optimum mid-points θ̂_n, we apply (8) to compute the corresponding optimum complementary conditional coverage probabilities G_n(S_n).

The next step consists in computing p_n(s) for 0 ≤ s ≤ n and 0 ≤ n ≤ N with numerical integration. Once G_n(s) and p_n(s) are available, we can use them in the backward recursion (14) to find the sequence {A_n(s)} and the optimum residual cost sequence {V_n(s)}. To identify the stopping rule, according to (17) we must compare the two sequences {V_n(s)}, {G_n(s)} element-by-element. At coordinates (n, s) where the sequences differ, we decide to continue sampling; whereas if they are equal, we decide to stop. This generates the sequence of sampling regions {C_n}. Equivalently, we can compare G_n(s) with λ + A_n(s) and, wherever the first is no larger than the second, we stop, while we continue sampling in the opposite case.
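The whole pipeline just described fits in a short program. The sketch below is our own illustration for a Beta(a, b) prior, so that G_n can be evaluated with the Beta cdf (as in Section III-B) and p_n has the closed form (30); names and grid sizes are our choices.

```python
import numpy as np
from scipy.stats import beta

def solve_stopping(a, b, h, lam, N, m=2001):
    """Backward recursion (14)-(16) under a Beta(a,b) prior; returns the total
    optimum cost V_0 and the sampling regions {C_n} of (17)."""
    x = np.linspace(h, 1.0 - h, m)                    # candidate mid-points, eq. (9)
    def G(n, s):                                      # optimum complementary coverage (8)
        cov = beta.cdf(x + h, a + s, b + n - s) - beta.cdf(x - h, a + s, b + n - s)
        return 1.0 - cov.max()
    V = np.array([G(N, s) for s in range(N + 1)])     # initialization V_N = G_N
    C = {N: np.array([], dtype=int)}                  # we must stop at the horizon
    for n in range(N - 1, -1, -1):
        s = np.arange(n + 1)
        p = (a + s) / (a + b + n)                     # p_n(s), eq. (16)/(30)
        cont = lam + p * V[s + 1] + (1.0 - p) * V[s]  # lambda + A_n(s), eq. (15)
        stop = np.array([G(n, si) for si in s])
        C[n] = s[stop > cont]                         # sample while G_n > lambda + A_n
        V = np.minimum(stop, cont)                    # eq. (14)
    return V[0], C
```

For instance, solve_stopping(1, 1, 0.1, 0.01, 60) returns the total optimum cost together with the regions; the instants n at which C_n contains every s ≤ n are exactly the instants (light green in Fig. 1 below) where stopping is impossible.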

We now present a conjecture that contains two significant claims for the optimum stopping time T of the problem in (4), which we believe are valid for any prior π(θ). We were able to prove the first claim (Lemma 3) for a rich class of priors, and to prove both claims (Theorem 2), providing also quantitative information, when the prior is the Beta density. Regarding the latter case, we should note that the Beta density is among the most popular priors for the problem we are considering in this work.

Conjecture.

For any prior π(θ) and sufficiently large horizon N, the optimum stopping time T of the unconstrained problem in (4) enjoys the following two properties:
i) There exists a constant N₁ depending only on λ and not on N such that T ≤ N₁.
ii) For sufficiently small λ there exists a constant N₂ depending only on λ and not on N such that T ≥ N₂.

Below we present a general proof of property i) of the Conjecture under an additional assumption. Define the maximal conditional variance

V̄_n = max_{0 ≤ s ≤ n} Var[θ | S_n = s],   (19)

where the variance is computed with the posterior pdf π_n(θ) defined in (6), and assume that V̄_n → 0 as n → ∞. This forces the conditional variance to converge to 0 uniformly in S_n. It also implies that the posterior distribution converges, uniformly, to a degenerate measure at a single point (often the true θ) as n → ∞. This is clearly related to the consistency concept of posterior distributions in Bayesian statistics and is often considered a valid assumption (see [24]).

Lemma 3.

Let V̄_n be defined as in (19) and assume V̄_n → 0 as n → ∞. Then for sufficiently large horizon N there exists a constant N₁ depending only on λ such that T ≤ N₁, i.e. property i) in the Conjecture is true.

Proof.

The proof is a simple application of the Chebyshev inequality in combination with (19). Indeed we observe that

G_n(S_n) ≤ P(|θ − E[θ | F_n]| > h | F_n) ≤ V̄_n / h².   (20)

Since V̄_n → 0 as n → ∞, there exists N₁ such that V̄_{N₁}/h² ≤ λ and, therefore, from (14) we conclude that V_{N₁}(S_{N₁}) = G_{N₁}(S_{N₁}), which suggests that we will necessarily stop at N₁ for any value of S_{N₁}. The quantity N₁ is the smallest n for which this is true. ∎

Remark 3.

The assumption V̄_n → 0 does not hold for all prior distributions. A counterexample where it fails is when the prior is a two-point probability mass function, say with all its mass on two values θ₁ and θ₂. However, even for this case the Conjecture might still be valid, since the requirement V̄_n → 0, used in our proof, is only sufficient for the validity of our claim.

An interesting example where the assumption holds is when the prior is the Beta density π(θ) = θ^{a−1}(1 − θ)^{b−1}/B(a, b), where

B(a, b) = ∫₀¹ t^{a−1}(1 − t)^{b−1} dt = Γ(a)Γ(b)/Γ(a + b).   (21)

To see this, we note that the posterior pdf is of the same type, namely Beta(a + S_n, b + n − S_n), and thus the maximal conditional variance in (19) becomes

V̄_n = max_{0 ≤ s ≤ n} (a + s)(b + n − s) / [(a + b + n)²(a + b + n + 1)] ≤ 1 / [4(a + b + n + 1)],   (22)

where the equality is attainable when (n + b − a)/2 is an integer. Clearly, for fixed a, b we have V̄_n → 0 as n → ∞, and thus the assumption of Lemma 3 holds. Moreover, by the proof of Lemma 3, the optimum stopping time satisfies T ≤ N₁ for all N ≥ N₁, where N₁ is the smallest n with V̄_n/h² ≤ λ. This bound is of the order of 1/(4λh²). In Theorem 2, Section III-B, by applying a more advanced analysis, we will be able to improve it and provide a significantly tighter alternative estimate for the case of the symmetric prior Beta(a, a).
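To make the bound concrete, combine (22) with the stopping condition V̄_n/h² ≤ λ from the proof of Lemma 3 (the numerical values of h and λ below are our own illustrative choices, not taken from the paper):

\[
\frac{\bar V_n}{h^2} \;\le\; \frac{1}{4(a+b+n+1)h^2} \;\le\; \lambda
\quad\Longleftrightarrow\quad
n \;\ge\; \frac{1}{4\lambda h^2} - (a + b + 1),
\]

so for the uniform prior (a = b = 1), h = 0.05 and λ = 10⁻³, any n ≥ 1/(4 · 10⁻³ · 0.05²) − 3 = 10⁵ − 3 guarantees stopping; halving λ doubles this figure, reflecting the 1/λ growth of the bound.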

Remark 4.

Property i) of the Conjecture suggests that the number of samples, under the optimum scheme, will never exceed the value N₁, even if we allow the horizon N to grow without limit. This interesting and uncommon characteristic was also observed in [21], but with a cost function that is a variation of the classical mean square error. However, what is more intriguing in our conjecture is property ii), namely that we need first to accumulate a sufficient volume of information before we start asking ourselves whether we should stop sampling or not. This is an extremely uncommon feature and, to our knowledge, has never been reported before in Sequential Analysis as a property of optimum schemes. As we claim in our conjecture, we believe that both properties are valid for any prior π(θ). Fortunately, as we mentioned before, this double claim is not without solid evidence: with Theorem 2, we demonstrate its validity when the prior is the symmetric Beta density.

III-A Performance Evaluation

What we have presented so far allows for the determination of the stopping rule of the proposed scheme. We would now like to compute its performance, but also the performance of any stopping time which uses S_n as its test statistic and is defined in terms of a sequence of sampling regions {C_n}. In particular, we are interested in computing P_θ(|θ̂_T − θ| ≤ h) and E_θ[T], along with their Bayesian counterparts. Of course, we could obtain these quantities using Monte Carlo simulations, but it is also possible to determine them numerically. The following lemma provides the necessary formulas.

Lemma 4.

Let the stopping time T be bounded by N and have as test statistic the process {S_n}. Assume for each n that C_n denotes the sampling region. Suppose also that for the combination (n, S_n = s) the scheme provides the mid-point estimate θ̂_n(s) and the corresponding conditional complementary coverage probability G_n(s). For 0 ≤ s ≤ n, we then define the following backward recursions that must be applied for n = N − 1, …, 0:

γ_n^θ(s) = 1{|θ̂_n(s) − θ| ≤ h} 1{s ∉ C_n} + [θ γ_{n+1}^θ(s + 1) + (1 − θ) γ_{n+1}^θ(s)] 1{s ∈ C_n},   (23)
τ_n^θ(s) = [1 + θ τ_{n+1}^θ(s + 1) + (1 − θ) τ_{n+1}^θ(s)] 1{s ∈ C_n},   (24)
γ_n(s) = [1 − G_n(s)] 1{s ∉ C_n} + [p_n(s) γ_{n+1}(s + 1) + (1 − p_n(s)) γ_{n+1}(s)] 1{s ∈ C_n},   (25)
τ_n(s) = [1 + p_n(s) τ_{n+1}(s + 1) + (1 − p_n(s)) τ_{n+1}(s)] 1{s ∈ C_n},   (26)

where p_n(s) is defined in (16) and the four recursions are initialized with γ_N^θ(s) = 1{|θ̂_N(s) − θ| ≤ h}, τ_N^θ(s) = 0, γ_N(s) = 1 − G_N(s) and τ_N(s) = 0. Then P_θ(|θ̂_T − θ| ≤ h) = γ_0^θ(0) and E_θ[T] = τ_0^θ(0), while P(|θ̂_T − θ| ≤ h) = γ_0(0) and E[T] = τ_0(0).

Proof.

The validity of these expressions is established in the Appendix. ∎

The applicability of Lemma 4 is clearly not limited to the proposed scheme; it can also be used to compute the performance of the fixed-sample-size scheme and of other sequential alternatives that we intend to compare against the method we have developed.
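To illustrate how Lemma 4 is used, here is a sketch of the fixed-θ recursions (23) and (24); it is again our own code for a Beta(a, b) prior, with C holding the sampling regions in the format produced by the solve_stopping sketch above.

```python
import numpy as np
from scipy.stats import beta

def coverage_and_ess(theta, a, b, h, N, C, m=2001):
    """Recursions (23)-(24): P_theta(|theta_hat_T - theta| <= h) and E_theta[T]
    for the scheme with sampling regions C = {n: array of s where we sample}."""
    x = np.linspace(h, 1.0 - h, m)
    def mid(n, s):                                    # Bayes mid-point, eq. (7)
        cov = beta.cdf(x + h, a + s, b + n - s) - beta.cdf(x - h, a + s, b + n - s)
        return x[np.argmax(cov)]
    gam = np.array([abs(mid(N, s) - theta) <= h for s in range(N + 1)], float)
    tau = np.zeros(N + 1)                             # initialization at n = N
    for n in range(N - 1, -1, -1):
        s = np.arange(n + 1)
        g = np.array([abs(mid(n, si) - theta) <= h for si in s], float)
        t = np.zeros(n + 1)
        go = np.isin(s, C[n])                         # indices where we keep sampling
        g[go] = theta * gam[s + 1][go] + (1 - theta) * gam[s][go]
        t[go] = 1 + theta * tau[s + 1][go] + (1 - theta) * tau[s][go]
        gam, tau = g, t
    return gam[0], tau[0]
```

Replacing θ with p_n(s) and the indicator with 1 − G_n(s) yields the Bayesian recursions (25), (26) in exactly the same loop structure.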

III-B Beta Density as Prior

Let us now find the particular form of our scheme when we adopt as our prior the Beta density π(θ) = θ^{a−1}(1 − θ)^{b−1}/B(a, b), where B(a, b) is defined in (21). We observe that the selection a = b = 1 in the prior corresponds to the uniform density in [0, 1]. It is now straightforward to verify that the posterior pdf takes a similar form, namely

π_n(θ) = θ^{a+S_n−1}(1 − θ)^{b+n−S_n−1} / B(a + S_n, b + n − S_n),   (27)

while the conditional complementary coverage probability at time n becomes

P(|x − θ| > h | F_n) = 1 − [ I_{x+h}(a + S_n, b + n − S_n) − I_{x−h}(a + S_n, b + n − S_n) ],   (28)

where I_x(a, b) is the incomplete Beta function (see [25, Page 944]), which is the cdf of the Beta(a, b) distribution.

The Bayes estimator, according to (18), can be found as the solution of the equation

π_n(x + h) = π_n(x − h),

corresponding to the root in the interval [h, 1 − h]. Such a root always exists except when the posterior is monotone, which happens when a + S_n ≤ 1 or b + n − S_n ≤ 1 (for example, S_n = 0 or S_n = n under the uniform prior). For these cases, θ̂_n is equal to h or 1 − h, depending on which value provides the larger conditional coverage probability. The resulting optimum conditional complementary coverage probability becomes

G_n(S_n) = 1 − [ I_{θ̂_n+h}(a + S_n, b + n − S_n) − I_{θ̂_n−h}(a + S_n, b + n − S_n) ].   (29)

Finally, as indicated in (15) and (16), we need to find the probability p_n(S_n) = P(X_{n+1} = 1 | F_n), for which we have the following simple formula:

p_n(S_n) = (a + S_n) / (a + b + n).   (30)

We can now compute the sequences {A_n(s)}, {V_n(s)} as explained in (14) and compare, element-by-element, V_n(s) with G_n(s), or equivalently G_n(s) with λ + A_n(s), to identify the sampling and stopping regions.
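In code, the root of (18) can be bracketed and solved numerically; the sketch below (ours, assuming a, b ≥ 1 so that the posterior pdf remains finite at the interval end points) mirrors the endpoint fallback just described.

```python
from scipy.optimize import brentq
from scipy.stats import beta

def beta_midpoint(n, s, h, a=1.0, b=1.0):
    """Mid-point theta_hat_n under a Beta(a,b) prior: the root in (h, 1-h) of
    pi_n(x+h) = pi_n(x-h) (eq. (18)), else the better of the two endpoints."""
    al, be = a + s, b + n - s                         # posterior parameters, eq. (27)
    f = lambda x: beta.pdf(x + h, al, be) - beta.pdf(x - h, al, be)
    cov = lambda x: beta.cdf(x + h, al, be) - beta.cdf(x - h, al, be)
    if f(h) * f(1.0 - h) < 0:                         # sign change -> interior root
        return brentq(f, h, 1.0 - h)
    return h if cov(h) >= cov(1.0 - h) else 1.0 - h   # monotone posterior cases

# Uniform prior, 20 trials, 7 successes, half-width 0.1:
print(beta_midpoint(20, 7, 0.1))
```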

For the particular prior adopted in (21), as we mentioned before, the resulting optimum stopping time enjoys the unique properties claimed in the Conjecture. The next theorem provides the necessary evidence.

Theorem 2.

The Conjecture is true when the prior is the Beta density, with the optimum stopping time satisfying N₂ ≤ T ≤ N₁ for constants N₁, N₂ that depend only on λ and h.

Proof.

The proof is very technical and detailed in the Appendix. Unfortunately, the analytical techniques developed for the specific prior are not directly extendable to the general case. ∎

Perhaps it is worth mentioning that, from the proof of Theorem 2, we conclude that the two estimates for N₁ and N₂ in (43), (45) grow linearly, but with drastically different multiplicative coefficients and different offsets.

Fig. 1: Sampling (green) and stopping (red) regions in terms of the test statistic S_n, together with the upper and lower bounds N₁ and N₂ for the optimum stopping time. Light green marks the instances with no possibility of stopping.

As an illustration of these properties, Fig. 1 depicts the sampling (green) and the stopping (red) regions in terms of the test statistic S_n. Both regions are clearly limited between the lines S_n = 0 and S_n = n. Even though we have marked a whole region in red, only the points that are next to the green region are actually accessible, because S_n can increase at most by one unit as we go from n to n + 1. We can also see the two bounds N₂ and N₁ for T. For n < N₂ the light green region covers all points (n, S_n), thus identifying the time instances at which we can never stop. Also, we note that once we pass N₁ we are in the stopping region, suggesting that we must necessarily stop by N₁. For each n in between, the stopping region has an upper and a lower threshold and, as long as S_n is between these two limits, we need to sample. Since the prior distribution is symmetric with respect to 1/2, then, according to Lemma 2, the sampling region is symmetric around the value n/2.

Fig. 2: Average sample size E_θ[T] (red), with the lower (blue) and upper (green) limits N₂ and N₁, as functions of θ for the optimum stopping time.

In Fig. 2, after using (24), we plot the average sample size E_θ[T] and the two limits N₂, N₁ of T as functions of θ. We can see that the lower limit is significantly smaller than the resulting average, suggesting that the optimum scheme very quickly regards the accumulated information as capable of providing reliable interval estimates and therefore starts the process of questioning whether to stop or continue sampling.

IV Comparisons

Let us now compare our scheme with the optimal fixed-sample-size (FSS) procedure and two sequential methods: the first was proposed by Frey in [20] and the second, the Conditional method, was proposed in our earlier work [26]. Frey's method uses a modified Wald-type sequential confidence interval based on the stopping time

T_F = inf{ n ≥ 1 : z √( θ̃_n(1 − θ̃_n)/n ) ≤ h },   (31)

where θ̃_n = (S_n + a)/(n + 2a), a is a pre-specified constant, and z is chosen so that the confidence interval θ̃_{T_F} ± h, with T_F given by (31), has a confidence level of at least 1 − α. Table I provides the values of z and a recommended in [20] for best results.

TABLE I: Choices of z and a recommended in [20] for the three supported confidence levels, for confidence intervals of fixed half-width h.

From (31), and using the fact that θ̃_n(1 − θ̃_n) ≤ 1/4, we conclude that the corresponding stopping time satisfies T_F ≤ z²/(4h²). Regarding the fixed-sample-size method, it uses the optimum Bayes estimator θ̂_n obtained in (7), and the number of samples is selected to meet the desired coverage probability. Finally, for the Conditional method in [26], we should point out that it is a general sequential parameter estimation technique, based on conditional costs, which is not limited to binomial proportions. For the problem of interest, it stops the first time the optimum conditional complementary coverage probability falls below a threshold γ, that is T_c = inf{n : G_n(S_n) ≤ γ}, with θ̂_n, G_n(S_n) the Bayes estimator and the corresponding optimum conditional complementary coverage probability defined in (7), (8). The threshold γ is selected to guarantee that the resulting coverage probability is 1 − α. For T_c we have, from the proof of Theorem 1, eq. (32), that G_n(S_n) falls below any positive threshold for n sufficiently large; consequently T_c is bounded by a deterministic quantity as well. In other words, all four schemes satisfy the assumption of Lemma 4 of a bounded stopping time, and therefore the corresponding performance can be computed numerically by applying the recursions of the lemma, without the need to perform Monte Carlo simulations.
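For completeness, here is a sketch of Frey's rule as reconstructed in (31); the tuning constants z and a are those of Table I and are not reproduced here.

```python
import math

def frey_stop(xs, h, z, a):
    """Apply rule (31) to a sequence xs of 0/1 observations: stop at the first n
    where z * sqrt(t*(1-t)/n) <= h, with t = (S_n + a)/(n + 2a).
    Returns (n, t), or None if the sequence is exhausted first."""
    s = 0
    for n, xi in enumerate(xs, start=1):
        s += xi
        t = (s + a) / (n + 2.0 * a)                   # shrinkage estimate
        if z * math.sqrt(t * (1.0 - t) / n) <= h:     # interval half-width <= h
            return n, t
    return None
```

Since θ̃_n(1 − θ̃_n) ≤ 1/4, the stopping condition is guaranteed to hold once n ≥ z²/(4h²), which yields the deterministic bound mentioned above.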

Fig. 3: Average sample size versus coverage probability for the proposed (red), Frey (black), fixed-sample-size (blue) and Conditional (green) schemes.

For the competing methods, using (25), (26), we plot in Fig. 3 the average number of samples versus the coverage probability. Note that we have only three points for Frey's scheme, because its tuning parameters z and a are provided in Table I for only three confidence levels. As we can see, the proposed method outperforms the fixed-sample-size scheme and both alternative sequential techniques. It is only at very high coverage probability levels that the difference among the three sequential schemes becomes less pronounced.

As we pointed out in (3), Section II, there is practical interest in evaluating the performance for each individual θ. Clearly, in this case the requirement is to be able to guarantee a minimal coverage probability for all θ. Again, we resort to Lemma 4 and use (23), (24) to evaluate the performance of the competing methods for each θ. In Fig. 4(a), we plot the coverage probability of each scheme versus θ and, in Fig. 4(b), the corresponding average sample size required to obtain this performance. Parameters were selected so that all competing schemes provide the same worst-case coverage probability, assuring a coverage of at least the prescribed level for all θ. By observing the two figures, we can draw the following conclusions: The fixed-sample-size scheme can require up to almost eight times more samples than the proposed scheme. Of course, one may argue that it produces higher coverage probability levels. Indeed, this is true; but, unfortunately, this increased performance cannot be traded for a reduced sample size without compromising the worst-case level. Consequently, what we observe is in fact the best the fixed-sample-size method can offer. The Conditional scheme, around the middle of the range of θ, requires up to 30% more samples which, as in the case of the fixed-sample-size scheme, produce higher coverage probabilities. Again, it is impossible to sacrifice part of this increased performance to improve the corresponding sample size without degrading the worst-case coverage probability. Finally, we can see that the proposed and Frey's schemes require similar sample sizes over most θ. However, we observe that the proposed method has a coverage probability profile which is better than Frey's, since for most θ the corresponding probability is larger. Frey's scheme is slightly better only for θ close to 0 and 1; but even for these values of θ the proposed scheme requires almost 50% fewer samples.

Fig. 4: Coverage probability (a) and average sample size (b) as functions of the proportion θ for the proposed (red), Frey (black), fixed-sample-size (blue) and Conditional (green) schemes, with all schemes tuned to the same worst-case coverage probability.

V Conclusions

We proposed an optimal sequential scheme for obtaining confidence intervals for a binomial proportion θ under a well-defined formulation. We proved that, for a particular prior (the Beta density), our optimum stopping time enjoys certain uncommon properties not encountered in solutions of other classical optimal stopping problems. We also conjectured that these properties are present for any prior. Specifically, our claim is that our stopping time is always bounded from below, suggesting that we need to first accumulate a sufficient amount of information before we start applying our stopping rule, and bounded from above, so that it always terminates by a specific deterministic time even if we allow the time horizon to be infinite. Finally, our scheme was compared against the optimum fixed-sample-size procedure and against existing sequential alternatives. Numerical performance evaluations showed that the proposed method exhibits an overall improved performance profile compared to its rivals.

VI Acknowledgments

This work was supported by the US National Science Foundation under Grant CIF 1513373 through Rutgers University and under Grant CMMI 1362876 through Georgia Institute of Technology.

Proof of Lemma 1: From (10) we can write