A fundamental problem in applied statistics is the construction of a confidence interval (CI) for a binomial proportion, $p$. The simplest and most commonly used proportion CI is the Wald interval (Brown et al., 2001). In this work we investigate the performance of the Wald interval in the context of rare-event probabilities, along with three other common intervals: Clopper-Pearson (exact), Wilson (score) and Agresti-Coull (adjusted Wald) (Clopper and Pearson (1934); Wilson (1927); Agresti and Coull (1998)).
A rare event can be defined as a random experiment whereby the probability of success is quite small, and in this work, we consider as a rare-event probability. Such rare-event probabilities can occur in clinical statistics where $p$ could represent the proportion of patients exhibiting treatment side effects. Vehicular collision, aircraft engine failure, train derailment, etc., provide examples of rare events that may be encountered in the field of transport engineering. In quality control/reliability engineering the occurrence of a defect in the manufacture of reliable components is often deemed a rare event. For example, the aim of the continuous improvement philosophy Six-Sigma is to reduce the number of defective components to just 3.4 per one million components produced. The advances in automation and equipment technology since the inception of Six-Sigma have seen many high-volume manufacturing facilities meet and exceed this production target (Woodall and Montgomery (2014); Evans and Lindsay (2015)).
Despite its widespread use, the Wald interval is known to produce inadequate coverage when $p$ is near 0 or 1, and/or the sample size, $n$, is small. It has also been well documented that this interval suffers from erratic coverage, even when $n$ is moderate (Blyth and Still (1983); Vollset (1993); Böhning (1994); Agresti and Coull (1998)). Brown et al. (2001) show that this coverage fluctuation occurs for large $n$ and recommend against using the Wald interval in practice. Newcombe (1998) also discourages the use of the Wald interval and suggests that its use be restricted to sample size planning.
In determining sample sizes we adopt the common approach of setting the CI margin of error (in the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals) equal to a specified value, , and solving for . An inherent issue with determining sample sizes in this manner is that the value of must be defined in advance. When is moderately large, say , this does not pose a major problem as the analyst has sufficient flexibility in defining the estimate precision (albeit it is reasonable to assume that should be less than ). For example, might be considered as reasonable precision for , but could equally be considered as viable precision for . However, when dealing with small success probabilities the definition of is critical to the validity of the resulting interval. For example, could be considered as a valid margin of error for , but is far too large for a success probability of the order .
In order to maintain consistency between the margin of error and $p$, we propose a margin of error scheme that is considered relative to the magnitude of $p$. Whilst defining the margin of error in this manner provides more flexibility than prescribing a fixed value, an initial estimate of the magnitude of $p$ is then required. We briefly discuss the practicality of obtaining such an initial estimate, but primarily focus on assessment of interval performance after the estimate has been obtained. We note that it is standard practice to include an estimate of $p$ in sample size planning (unless a ‘default’ value is used), and, in any case, it is important to ensure that one does not end up with a mismatch between the value of the margin of error and the magnitude of $p$, as discussed in the previous paragraph.
We demonstrate the importance of a suitable definition of in the small probability regime of , and recommend practical tolerances for both the proposed relative margin of error scheme and coverage probability. We provide a comparison in terms of sample size requirements and show that when CI performance is assessed in terms of both coverage probability and relative margin of error, the four CI estimators perform similarly in many cases.
The remainder of this article is organised as follows. In Section 2 we briefly highlight some of the proposed proportion interval estimators and provide formulae for the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals. Section 3 provides details of the CI evaluation criteria used in the work. In Section 4 we discuss estimating for the purpose of sample size planning, examine the proposed relative margin of error scheme and suggest suitable tolerances for assessing CI performance. Sections 5 and 6 illustrate the importance of employing this relative margin of error scheme in CI assessment. Finally, the article is concluded in Section 7 with a summary and discussion.
2 Binomial Proportion Interval Estimators
Several techniques have been devised to estimate a binomial proportion, $p$, including the Wald, Clopper-Pearson, Wilson, Agresti-Coull, Jeffreys, arcsine transformation and likelihood ratio intervals. A range of ensemble/model-averaged approaches have also been considered, e.g., Turek and Fletcher (2012), Kabaila et al. (2016) and Park and Leemis (2019). In this work we assess the performance of the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals in standard form, i.e., without application of a modification or continuity correction.
The Wald interval is included in this study as it is the most widely known and used interval, both in theory and in practice. We assess the Clopper and Pearson (1934) interval as it is an exact method, and often regarded as the ‘gold standard’ in binomial proportion estimation (Agresti and Coull (1998); Newcombe (1998); Gonçalves et al. (2012)). Where the Wald interval is known to produce inadequate coverage for small or , the Clopper-Pearson interval is generally regarded as being overly conservative, unless is quite large (Newcombe (1998); Agresti and Coull (1998); Brown et al. (2001); Thulin (2014)). The intervals proposed by Wilson (1927) and Agresti and Coull (1998) both offer a compromise between the (liberal) Wald and (conservative) Clopper-Pearson intervals.
We assess the performance of the Wilson and Agresti-Coull intervals in this work given the popularity of these estimators among authors in the literature. For example, Brown et al. (2001) recommend the Wilson or Jeffreys interval for small $n$. For larger $n$ they recommend the Wilson, Jeffreys or Agresti-Coull intervals, preferring the Agresti-Coull method for its simpler presentation. Agresti and Coull (1998) recommend the Wilson interval, and add that the Wilson interval has similar performance to their method. Newcombe (1998) remarks on the mid-p method (an exact method closely related to the Clopper-Pearson method) and the Wilson method, noting the Wilson method's advantage of having a simple closed form. Vollset (1993) too recommends the mid-p and Wilson (uncorrected and continuity corrected) intervals, along with the Clopper-Pearson interval, stating that those four intervals can be safely used at all times. Pires and Amado (2008) recommend the continuity corrected arcsine transformation or the Agresti-Coull method.
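For reference, the four intervals assessed in this work can be sketched in a few lines of Python. This is a minimal illustration of the textbook forms (without modification or continuity correction), not the authors' implementation. The function names are hypothetical; `x` is the number of successes in `n` trials and `alpha` is the significance level. As a simplifying assumption, the Clopper-Pearson bounds are found by bisection on the binomial distribution function rather than through beta quantiles.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def z_value(alpha):
    """Upper alpha/2 quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1 (log scale for stability)."""
    if k < 0:
        return 0.0
    log_coef = lgamma(n + 1)
    return sum(
        exp(log_coef - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))
        for i in range(min(k, n) + 1)
    )


def wald(x, n, alpha=0.05):
    p_hat = x / n
    half = z_value(alpha) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson(x, n, alpha=0.05):
    z2 = z_value(alpha) ** 2
    p_hat = x / n
    centre = (p_hat + z2 / (2 * n)) / (1 + z2 / n)
    half = sqrt(z2 * (p_hat * (1 - p_hat) / n + z2 / (4 * n * n))) / (1 + z2 / n)
    return centre - half, centre + half


def agresti_coull(x, n, alpha=0.05):
    z = z_value(alpha)
    n_adj = n + z * z                  # adjusted number of trials
    p_adj = (x + z * z / 2) / n_adj    # adjusted point estimate
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


def _solve_increasing(f, target, iterations=100):
    """Bisection for f(p) = target on (0, 1), where f increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    # Lower bound solves P(X >= x | p) = alpha/2;
    # upper bound solves P(X <= x | p) = alpha/2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


for name, ci in [("Wald", wald), ("Clopper-Pearson", clopper_pearson),
                 ("Wilson", wilson), ("Agresti-Coull", agresti_coull)]:
    print(name, ci(5, 100))
```

Note that the Wald and Agresti-Coull bounds are returned untruncated, so the lower bound can fall below 0 when `x` is small; whether to truncate at 0 and 1 is left to the caller.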
3 Evaluation Criteria
The most commonly used CI evaluation criteria are coverage probability and expected width (Gonçalves et al., 2012), and these are the criteria that we consider in this work. However, other performance metrics have been proposed. For example, Vos and Hudson (2005) interpret an interval as the non-rejected parameter values in a hypothesis test and discuss two additional criteria: $p$-confidence and $p$-bias. Newcombe (1998) presents an evaluation criterion using non-coverage as an indicator of location. In a recent work, Park and Leemis (2019) adopt an ensemble approach and use root mean squared error and mean absolute deviation to measure CI performance.
3.1 Coverage Probability
The coverage probability can be interpreted as the long-run proportion of computed intervals that include the unknown parameter. Denoting $L(x)$ and $U(x)$ as the lower and upper CI bounds obtained with $x$ successes (suppressing the dependence on $n$ and the significance level, $\alpha$), the expected coverage probability for a fixed parameter $p$ is given by
$$\sum_{x=0}^{n} I\{L(x) \le p \le U(x)\} \binom{n}{x} p^{x} (1-p)^{n-x}, \qquad (1)$$
where $I\{\cdot\}$ is an indicator function that takes the value 1 when its argument is true, and 0 otherwise.
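The coverage calculation described above can be sketched as follows. This is a minimal illustration with hypothetical function names: it sums the binomial probabilities of those outcomes whose interval contains the fixed parameter `p`. The Wald interval is included only as an example of an interval function; any function with the same signature could be passed instead.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p), with 0 < p < 1 (log scale for stability)."""
    return exp(lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))


def wald(x, n, alpha=0.05):
    """Wald interval, used here only as an example interval function."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def coverage_probability(n, p, interval, alpha=0.05):
    """Expected coverage: total probability of outcomes whose interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, alpha)
        if lower <= p <= upper:
            total += binom_pmf(x, n, p)
    return total


# With n = 10 and p = 0.5 only x = 3, ..., 7 give Wald intervals containing p,
# so the coverage is 912/1024.
print(round(coverage_probability(10, 0.5, wald), 6))  # 0.890625
```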
3.2 Expected Width
The expected width, which we denote , is given by
and the expected margin of error, , is then given by
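A short sketch of these two quantities is given below. It assumes that the expected width is the binomial-weighted average of the interval widths over all possible outcomes, and that the expected margin of error is half of the expected width. Both are assumptions made for illustration, and the function names are hypothetical. The Wilson interval is used as the example interval function.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p), with 0 < p < 1."""
    return exp(lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))


def wilson(x, n, alpha=0.05):
    """Wilson (score) interval, used here as an example interval function."""
    z2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    p_hat = x / n
    centre = (p_hat + z2 / (2 * n)) / (1 + z2 / n)
    half = sqrt(z2 * (p_hat * (1 - p_hat) / n + z2 / (4 * n * n))) / (1 + z2 / n)
    return centre - half, centre + half


def expected_width(n, p, interval, alpha=0.05):
    """Binomial-weighted average of the interval width over x = 0, ..., n."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, alpha)
        total += (upper - lower) * binom_pmf(x, n, p)
    return total


def expected_margin_of_error(n, p, interval, alpha=0.05):
    """Assumed here to be half of the expected width."""
    return expected_width(n, p, interval, alpha) / 2


print(expected_width(100, 0.1, wilson))
print(expected_margin_of_error(100, 0.1, wilson))
```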
4 Calculating Sample Size
The first problem in CI estimation is determining the sample size required to achieve a desired estimation accuracy/precision. There have been a range of sample size determination methods discussed in the literature, e.g., Korn (1986), Liu and Bailey (2002) and Gonçalves et al. (2012), but here we adopt the common approach of deriving the sample size from the CI formula with fixed . Equation (3) displays the sample size obtained from the Wald confidence interval.
where denotes the ceiling function mapping to the least integer , and denotes the anticipated value of , i.e., an initial estimate. (Sample size formulae for the Clopper-Pearson, Wilson and Agresti-Coull intervals are given in appendix A.)
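As an illustration, the Wald sample size calculation can be sketched as below. It assumes the standard form obtained by setting the Wald margin of error equal to the specified value and solving for the sample size, i.e., the ceiling of $z_{\alpha/2}^2\, p^*(1-p^*)/\epsilon^2$. The names `p_star` (anticipated proportion) and `eps` (margin of error) are illustrative notation, not necessarily that of the original equation.

```python
from math import ceil
from statistics import NormalDist


def wald_sample_size(p_star, eps, alpha=0.05):
    """Smallest n for which the Wald margin of error at p_star is at most eps."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z * z * p_star * (1 - p_star) / eps ** 2)


print(wald_sample_size(0.5, 0.05))    # 385
print(wald_sample_size(0.01, 0.005))  # 1522
```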
4.1 Initial Estimate of
Selecting a value for the anticipated proportion is required to make equation (3) operational, and this is an inherent practical challenge in any such sample size calculation. In some situations it might be possible to overcome this problem by utilising subject matter knowledge or results from a previous study. If no previous information is available, a common approach is to consider the conservative value of 0.5, for which the sample size in equation (3) is a maximum, but we do not adopt this approach here. Given that the focus of this work is on small/rare-event probabilities, we consider the margin of error as a function of the magnitude of $p$. For example, fixing at might be reasonable for , but is impractically large for . Since $p$ is unknown it may not always be possible to obtain an accurate initial estimate; nonetheless, it is an important aspect of sample size planning to avoid, as much as possible, obtaining a margin of error that is incompatible with $p$. That is, we aim to avoid a margin of error that is too wide to be practically useful, or indeed a margin of error that is too narrow, in the sense that a reasonable estimate could have been obtained using fewer resources.
In some situations one might be able to gain a reasonable estimate of . Consider a manufacturing environment where, for example, past experience or consultation with subject matter experts could provide an analyst with information on the order of magnitude of . One could deduce that the true proportion is more likely to be of the order in rather than in . Such insight could be sufficient in setting a reasonable threshold for and subsequently determining an appropriate sample size. Such initial estimation of an unknown parameter is a topic worthy of further discussion but is beyond the scope of this work. Here the focus is on the performance of the intervals after an initial estimate has been obtained.
4.2 Margin of Error Relative to
Once the analyst is equipped with an initial estimate of the magnitude of , the margin of error can be defined accordingly. The required estimation precision is largely analysis dependent; what could be considered reasonable accuracy in one setting might be completely inappropriate in another. Although a margin of error scheme cannot be rigidly prescribed, we suggest a general scheme such that the margin of error is compatible with the order of magnitude of . To maintain compatibility between and , we consider a relative margin of error, , given by
To ensure that the margin of error is not larger than the order of magnitude of , we impose . Thus, one could consider as a plausible margin of error scheme. However, considering values too close to the bound of results in very wide intervals. We therefore suggest as a more reasonable scheme. Whilst smaller values are desirable, considering values too close to the bound of results in very narrow intervals for which an excessively large sample size is required — and such high precision is not likely to be required for many studies.
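The relative scheme can be illustrated with a short sketch. It assumes that the relative margin of error is the absolute margin of error divided by the anticipated proportion, so that the absolute margin is `eps_rel * p_star`, and it uses the standard Wald sample size form. The function names and the values in the loop are illustrative only and are not taken from the tables in this work.

```python
from math import ceil
from statistics import NormalDist


def absolute_margin(p_star, eps_rel):
    """Absolute margin of error implied by a relative margin eps_rel."""
    return eps_rel * p_star


def wald_sample_size_relative(p_star, eps_rel, alpha=0.05):
    """Wald sample size when the margin of error is set relative to p_star."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    eps = absolute_margin(p_star, eps_rel)
    return ceil(z * z * p_star * (1 - p_star) / eps ** 2)


# Illustrative values: the required n grows as p_star shrinks,
# because the absolute margin shrinks with it.
for p_star in (0.1, 0.01, 0.001):
    print(p_star, absolute_margin(p_star, 0.25),
          wald_sample_size_relative(p_star, 0.25))
```

Because the absolute margin scales with the anticipated proportion, the interval width stays compatible with the order of magnitude of the proportion, at the cost of larger sample sizes for rarer events.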
Table 1 provides a comparison of calculated sample sizes corresponding to . (Sample size values in this and subsequent tables are rounded to significant digits.)
Table 1 shows that the Wald, Wilson and Agresti-Coull sample sizes are similar across the range, but the Clopper-Pearson sample sizes are approximately larger, which is indicative of this method’s conservatism.
Given that is defined in relation to , the above calculated sample sizes are only applicable to each specific . Even if the true proportion equals (i.e., the initial estimate is perfect), will of course vary from sample to sample and may not equal ; hence, the realised margin of error will typically differ from .
Next we illustrate the importance of defining the margin of error in relation to the magnitude of the proportion. Consider the following fixed margin of error schemes:
Table 2 displays the calculated Wald sample sizes and coverage probabilities corresponding to the above margin of error schemes. (A comparison of Wald, Clopper-Pearson, Wilson and Agresti-Coull coverage probabilities for the above schemes is given in appendix B.)
|Margin of Error Scheme|
|Wald CI coverage shown in parentheses|
Table 2 shows that for Scheme 1, fixing and considering , creates sample sizes that reduce dramatically as a function of . This results in coverage probabilities that are completely inadequate for . In Scheme 2, both and are fixed and this creates a constant sample size of . This sample size is reasonable for , but is insufficient for the remaining values, which is reflected in the poor coverage performance.
Scheme 3 is similar to Scheme 1, but here, is reduced to . This combination produces sufficient coverage for , but the coverage deteriorates for the smaller values. In Scheme 4, is fixed at and ; this results in a constant sample size of , which produces excellent coverage throughout the range. Whilst the coverage is satisfactory in this scheme, the magnitude of is not compatible with all values, particularly and . For , the resulting interval is . This interval is far too narrow in the sense that a reasonable interval could be obtained with a significantly reduced sample size. For , the interval is truncated at . Here, even though the coverage is very close to , the interval is too wide to be practically useful since its width is an order of magnitude larger than .
Table 3 provides the Wald sample sizes and coverage probabilities associated with the relative margin of error schemes: .
|Wald coverage shown in parentheses|
|Coverage computed with|
Table 3 shows that by considering in relation to the magnitude of , the coverage probabilities are reasonable across the range, but now, the analyst must choose a scheme such that the resulting interval’s width is appropriate. For example, consider , at the resulting interval is . This interval is very narrow and the large sample size of reflects this quite stringent margin of error. Considering has the advantage of significantly reducing the sample size; however, in this scheme the interval is significantly wider at . To obtain intervals that are neither too liberal nor too conservative, and to avoid excessively large sample sizes, we recommend as a reasonable scheme. The coverage of the Clopper-Pearson, Wilson and Agresti-Coull intervals is similar to that shown in Table 3 for . A comparison of coverage probabilities for is given in Appendix B.
4.4 Suitability of Scheme
A range of qualifications/criteria are often used to check the validity of using approximate CI estimators, particularly the Wald interval. Fleiss et al. (2003) state that the normal distribution provides excellent approximations to exact binomial procedures when and . Leemis and Trivedi (1996) and Brown et al. (2001) also discuss the and qualification.
We examine our proposed scheme to assess its compatibility with the qualification and , where , in relation to the Wald sample size equation. Replacing in (3) with gives
For a given , when and, therefore, equation (4) is sufficient for our purposes to ensure both and are greater than .
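The substitution described above can be written out explicitly. The following is a sketch under assumed notation: $p^{*}$ for the anticipated proportion, $\epsilon_R$ for the relative margin of error and $z_{\alpha/2}$ for the standard normal quantile are illustrative symbols, and the Wald sample size is taken in its standard form.

```latex
n = \left\lceil \frac{z_{\alpha/2}^{2}\, p^{*}(1-p^{*})}{(\epsilon_R\, p^{*})^{2}} \right\rceil
  = \left\lceil \frac{z_{\alpha/2}^{2}\,(1-p^{*})}{\epsilon_R^{2}\, p^{*}} \right\rceil
\quad\Longrightarrow\quad
n p^{*} \;\ge\; \frac{z_{\alpha/2}^{2}\,(1-p^{*})}{\epsilon_R^{2}} .
```

As an illustrative calculation, taking $\alpha = 0.05$, $\epsilon_R = 0.5$ and $p^{*} = 0.1$ gives $n p^{*} \ge 3.8415 \times 0.9 / 0.25 \approx 13.8$. The bound increases as $\epsilon_R$ or $p^{*}$ decreases, and $n(1-p^{*}) \ge n p^{*}$ whenever $p^{*} \le 0.5$, so a lower bound on $n p^{*}$ applies to both quantities in the small-probability setting.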
4.5 Tolerances for Assessing CI Performance
In this section we suggest suitable tolerances for assessing interval performance in terms of coverage probability and relative margin of error.
In terms of achieving a desired coverage probability, one usually considers , where denotes a predefined coverage tolerance. The definition of such a tolerance is dependent on the individual researcher and particular study, and is thus difficult to quantify. In one study might be acceptable, whilst in another, one might require . We suggest that would be reasonable tolerances for most analyses, and as such, consider acceptable expected coverage probabilities as , where is described in equation (1).
A tolerance is also necessary with regard to the relative margin of error. As with the coverage, the desired margin of error is dependent on the particular research question and hence can not be rigidly prescribed. However, as previously discussed, it is vital that the magnitude of the margin of error reflect the magnitude of the estimated proportion.
Similar to the definition of , we define the expected relative margin of error, , as
where is defined in equation (2). As discussed in Section 4.2, a relative margin of error exceeding is not acceptable from a practical perspective, and, as per Section 4.4, we suggest that it should not exceed .
Table 5 provides suggested tolerances for the assessment of and for confidence intervals which could be considered reasonable in most settings.
|is considered acceptable as it offers a compromise between (desired) and (limit)|
5 Relative Margin of Error Central to Performance
Next we illustrate how the relative margin of error is fundamental to CI performance evaluation. We show that when a valid confidence interval is defined as achieving a desired coverage probability whilst simultaneously satisfying a minimum relative margin of error, the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals perform similarly for a given combination.
Figure 1 provides a CI performance comparison for . Referring to Figure 1, the Wilson, Agresti-Coull and Clopper-Pearson intervals all achieve satisfactory coverage across the entire sample size range, whereas for , the coverage of the Wald interval oscillates around the lower limit of . For example, satisfactory coverage is achieved at , but the coverage falls below at . This phenomenon of coverage oscillation relates to the discreteness of the binomial distribution and has been previously discussed in the literature, e.g., Blyth and Still (1983), Vollset (1993), Agresti and Coull (1998) and Brown et al. (2001). Whilst the coverage performance of the Wald interval is inferior to the other three intervals for , none of the intervals satisfy the requirement at these lower sample sizes. Thus, by stipulating a minimum requirement for , the poor coverage at small is rendered irrelevant and the performance of the Wald interval is more comparable to the other three intervals when .
Figure 1 shows that the performance of the four interval estimators is reasonably similar when is considered in conjunction with . When is small (), all four intervals fail to satisfy both coverage and relative margin of error requirements. However, when is large (), each interval produces valid estimates.
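The joint assessment described above can be sketched as follows. The sketch assumes that the expected relative margin of error is the expected margin of error (half the expected width) divided by the true proportion. The function names are hypothetical, and the tolerance values `cov_tol` and `rel_limit` are illustrative defaults, not the tolerances recommended in this work. The Wilson interval is used as the example.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p), with 0 < p < 1."""
    return exp(lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
               + x * log(p) + (n - x) * log(1 - p))


def wilson(x, n, alpha=0.05):
    """Wilson (score) interval, used here as an example interval function."""
    z2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    p_hat = x / n
    centre = (p_hat + z2 / (2 * n)) / (1 + z2 / n)
    half = sqrt(z2 * (p_hat * (1 - p_hat) / n + z2 / (4 * n * n))) / (1 + z2 / n)
    return centre - half, centre + half


def performance(n, p, interval, alpha=0.05):
    """Return (expected coverage, expected relative margin of error)."""
    coverage = 0.0
    width = 0.0
    for x in range(n + 1):
        prob = binom_pmf(x, n, p)
        lower, upper = interval(x, n, alpha)
        width += (upper - lower) * prob
        if lower <= p <= upper:
            coverage += prob
    return coverage, (width / 2) / p


def is_valid(n, p, interval, alpha=0.05, cov_tol=0.02, rel_limit=0.5):
    """Joint check on coverage and relative margin (illustrative tolerances)."""
    coverage, rel_margin = performance(n, p, interval, alpha)
    return abs(coverage - (1 - alpha)) <= cov_tol and rel_margin <= rel_limit


for n in (100, 200, 400):
    cov, rel = performance(n, 0.1, wilson)
    print(n, round(cov, 3), round(rel, 3), is_valid(n, 0.1, wilson))
```

Scanning `is_valid` over increasing `n` gives the first sample size at which an interval meets both criteria, which is the kind of comparison the performance tables summarise.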
Figure 2 compares the performance of a CI for and shows that the Wald interval achieves for . For the Wald interval encounters a total of five sample sizes where the coverage drops below the lower limit of . The Clopper-Pearson interval requires a sample size of to satisfy both and , with the coverage exceeding the upper limit of on five occasions for . The Agresti-Coull interval performs very well for with just one value (), failing to satisfy both and thereafter. The Wilson interval provides the best performance, satisfying both and requirements .
Figure 3 provides a comparison of the interval performance for with . At this significance level the coverage probabilities are more erratic with each interval producing values that oscillate outside of the limits. The Clopper-Pearson interval performs particularly poorly, requiring to produce just five valid intervals. The Wald interval is similar to the Wilson and Agresti-Coull intervals in terms of the proportion of values falling outside the desired range of . None of the methods satisfy at , although one could argue that the Wald interval is very close to satisfying both and criteria at this sample size.
6 CI Performance Tables
The table cells are colour coded according to the tolerances discussed in Table 5: target (green), acceptable (yellow), minimally acceptable (orange) and unacceptable (red).
Table 6 shows that for and , none of the intervals satisfy the desired and requirements simultaneously. The importance of considering the relative margin of error in CI evaluation is clearly evident. In several cases the coverage probability lies within the desired tolerance but the excessive relative margin of error renders the estimate impractical. For example, with reference to the Wilson interval, , however, which is not acceptable.
Table 7 shows that the Wald, Wilson and Agresti-Coull intervals satisfy at , but the Clopper-Pearson interval requires . Each interval encounters sample sizes where exceeds the desired bounds of , but overall, is satisfactory.
Table 8 shows that none of the intervals satisfy and for . The Wilson method performs best in this scheme, and if the tolerances of and were considered, it would produce a valid interval .
Table 9 shows that the Wald interval has the worst coverage with three values falling outside the limit, and one value outside the limit. The Wilson and Agresti-Coull intervals perform the best, but overall, all four intervals perform well in this sample size scheme, particularly if the coverage tolerance were considered as .
|: First where at least one interval satisfies and|
|: First where all intervals satisfy and|
|Coverage computed for|
Table 10 shows that the Wald, Wilson and Agresti-Coull methods perform similarly for a given combination. The and values of the Clopper-Pearson slightly exceed the desired limits, but overall, the performance is quite reasonable.
Table 11 illustrates the sample sizes required to maintain a desired level of and performance. For each performance scheme investigated: , and , the Wilson interval requires the smallest sample sizes providing further evidence of its overall superiority.
|: where and|
|: where and|
|: where and|
|Smallest sample size for each scheme shown in bold|
7 Summary and Conclusion
When constructing confidence intervals for small success probabilities it is important that the margin of error be considered relative to the magnitude of the proportion, $p$. Incompatibilities between the margin of error and $p$ can lead to completely unsatisfactory coverage or unnecessarily narrow intervals that require extremely large sample sizes. When dealing with moderate success probabilities, say , this is less important, but in the context of small or rare-event success probabilities, the consideration of the margin of error relative to $p$ is crucial to reduce the possibility of a substantial mismatch between the two. For example, might be considered as valid precision for , but such a margin of error is not compatible with a proportion of the order .
To ensure is compatible with the order of magnitude of , we have suggested the use of a relative margin of error, . We suggest restricting the range of values to as higher values lead to imprecision and poor CI coverage, whereas lower values lead to sample sizes that are likely to be impractically large for many studies. When is considered as a performance criterion in conjunction with the empirical coverage probability, , the Wald, Clopper-Pearson, Wilson and Agresti-Coull intervals perform similarly in many cases. In general, all four intervals fail to satisfy both criteria when the sample size is small, with improved performance at larger sample sizes as expected. For example, for a confidence interval when , none of the methods produce a satisfactory interval for . Each interval achieves the nominal coverage of at some (albeit not all) sample sizes in this range, but in each case the desired limit of is exceeded. Once the sample size is increased (), and the requirement is satisfied, all four intervals perform well in terms of coverage.
The coverage probabilities of the Wald and Clopper-Pearson intervals for small are generally poor, particularly in comparison to the Wilson and Agresti-Coull intervals. However, the considerable difference in coverage in such situations is rendered immaterial once the (we believe reasonable) requirement that is considered. When satisfactory performance is defined as achieving a desired and , the performance across these commonly used intervals is much more comparable, particularly if one considers empirical coverage in the range as reasonable. In this relative margin of error framework the criticisms of inadequate coverage for the Wald interval, and excessive conservatism for the Clopper-Pearson interval, are somewhat alleviated, and all four intervals perform quite similarly. Although there are performance similarities, the Wilson and Agresti-Coull intervals are generally superior to the intervals of Wald and Clopper-Pearson. The Wilson and Agresti-Coull intervals achieve similar and values for given combinations; however, the Wilson interval is narrower and achieves favourable performance at lower sample sizes.
When the success probability is small, failure to consider the margin of error relative to the order of magnitude of the estimated proportion can result in poor coverage, and/or intervals which are unnecessarily narrow or excessively wide. The proposed relative margin of error scheme ensures that the interval precision is compatible with the order of magnitude of . As such, the relative margin of error is an important feature of the study planning and sample size calculations and serves as a useful performance evaluation criterion in the estimation of rare-event binomial proportions.
References
- Agresti, A. and Coull, B. A. (1998). Approximate is better than ‘exact’ for interval estimation of binomial proportions. The American Statistician, 52(2):119–126.
- Blyth, C. R. and Still, H. A. (1983). Binomial confidence intervals. Journal of the American Statistical Association, 78(381):108–116.
- Böhning, D. (1994). Better approximate confidence intervals for a binomial parameter. The Canadian Journal of Statistics, 22(2):207–218.
- Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science, 16(2):101–133.
- Clopper, C. J. and Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413.
- Evans, J. R. and Lindsay, W. M. (2015). An introduction to Six Sigma & process improvement. CENGAGE Learning, USA, 2nd edition.
- Fleiss, J. L., Levin, B., and Cho Paik, M. (2003). Statistical methods for rates and proportions. John Wiley & Sons, New Jersey, 3rd edition.
- Gonçalves, L., De Oliveira, M. R., Pascoal, C., and Pires, A. (2012). Sample size for estimating a binomial proportion: comparison of different methods. Journal of Applied Statistics, 39(11):2453–2473.
- Kabaila, P., Welsh, A. H., and Abeysekera, W. (2016). Model-averaged confidence intervals. Scandinavian Journal of Statistics, 43(1):35–48.
- Korn, E. L. (1986). Sample size tables for bounding small proportions. Biometrics, 42(1):213–216.
- Leemis, L. M. and Trivedi, K. S. (1996). A comparison of approximate interval estimators for the Bernoulli parameter. The American Statistician, 50(1):63–68.
- Liu, W. and Bailey, B. J. R. (2002). Sample size determination for constructing a constant width confidence interval for a binomial success probability. Statistics & Probability Letters, 56(1):1–5.
- Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion: comparison of seven methods. Statistics in Medicine, 17(8):857–872.
- Park, H. and Leemis, L. M. (2019). Ensemble confidence intervals for binomial proportions. Statistics in Medicine, 38(1):3460–3475.
- Pires, A. M. and Amado, C. (2008). Interval estimators for a binomial proportion: comparison of twenty methods. REVSTAT - Statistical Journal, 6(2):165–197.
- Thulin, M. (2014). Coverage-adjusted confidence intervals for a binomial proportion. Scandinavian Journal of Statistics, 41(1):291–300.
- Turek, D. and Fletcher, D. (2012). Model-averaged Wald confidence intervals. Computational Statistics and Data Analysis, 56(1):2809–2815.
- Vollset, S. E. (1993). Confidence intervals for a binomial proportion. Statistics in Medicine, 12(9):809–824.
- Vos, P. and Hudson, S. (2005). Evaluation criteria for discrete confidence intervals: beyond coverage and length. The American Statistician, 59(2):137–142.
- Wilson, E. (1927). Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22(158):209–212.
- Woodall, W. H. and Montgomery, D. C. (2014). Some current directions in the theory and application of statistical process monitoring. Journal of Quality Technology, 46(1):78–94.
Appendix A Sample Size Formulae
Letting be the minimum satisfying: and letting be the minimum satisfying: , the sample size is given by: