1 Introduction
Consider an empirical distribution $\hat{p}$ generated from $n$ i.i.d. samples of a discrete random variable that takes one of $k$ values according to an unknown distribution $p$. A confidence region for $p$ is a subset of the simplex that depends on $\hat{p}$ and includes the unknown true distribution $p$ with a specified confidence. More precisely, $\mathcal{C}_\delta(\hat{p}) \subseteq \Delta_k$ is a confidence region at confidence level $1-\delta$ if
$$\mathbb{P}_p\big(p \in \mathcal{C}_\delta(\hat{p})\big) \ge 1 - \delta \tag{1}$$
holds for all $p \in \Delta_k$, where $\Delta_k$ denotes the simplex and $\mathbb{P}_p$ is the multinomial probability measure under $p$.
Construction of tight confidence regions for categorical distributions is a long-standing problem dating back nearly a hundred years [1]. The goal is to construct regions that are as small as possible, but still satisfy (1). Broadly speaking, approaches for constructing confidence regions can be classified into i) approximate methods that fail to guarantee coverage (i.e., (1) fails to hold for all $p$) and ii) methods that succeed in guaranteeing coverage, but have excessive volume, for example, approaches based on Sanov or Hoeffding-Bernstein type inequalities. Recent approaches based on combinations of methods [2] have shown improvement through numerical experiment, but do not provide theoretical guarantees on the volume of the confidence regions. To the best of our knowledge, construction of confidence regions for the multinomial parameter that have minimal volume and guarantee coverage is an open problem.
One construction that has shown promise empirically is the level-set approach of [3]. The level-set confidence regions are similar to ‘exact’ and Clopper-Pearson regions [1] (the two terms are synonymous in many texts [4]), as they involve inverting tail probabilities, but are applicable beyond the binomial case, i.e., they are defined for $k > 2$. Clopper-Pearson, exact, and level-set confidence regions are closely related to statistical significance testing; the confidence region defined by these approaches is synonymous with the range of parameters over which the outcome is not statistically significant at a p-value of $\delta$. For a thorough discussion of these relationships in the binomial case, see [5, 4] and references therein.
This paper proves that the level-set confidence regions of [3], which are extensions of exact tests, are optimal in that they minimize the average volume among all confidence region constructions. More precisely, when averaged across either i) the possible empirical outcomes, or ii) a uniform prior on the unknown parameter $p$, the level-set confidence regions have minimal volume among all confidence region constructions that satisfy the coverage guarantee. The proof first shows that arbitrary confidence regions can be expressed as the inversion of a set mapping. The level-set confidence regions are minimal in this setting by design, and the minimal average volume property follows. The authors of [3] observe through numerical experiment that the level-set confidence regions have small volume when compared with a variety of other approaches. This observation is explained by the result here: the regions minimize average volume among all constructions of confidence regions.
While motivation for tight confidence regions can be found across science and engineering, one motivation comes from the need for tighter confidence intervals for the mean. Indeed, confidence intervals for functionals such as the mean, variance, or median can be derived from confidence regions for the multinomial parameter by simply finding the range of values assumed by the functional in the confidence region. This paper also shows that the level-set confidence regions can be used to generate confidence intervals for functionals (such as the mean, variance and median) that are tighter than all known bounds. When compared to standard confidence bounds for the mean, for example those based on Bernstein's or Hoeffding's inequalities, the constructions can require several times fewer samples to achieve a desired interval width.
These improvements translate directly to reductions in sample complexity and regret for bandit and reinforcement learning. Confidence bounds for the mean are fundamental to the operation and analysis of many sequential learning algorithms [6, 7], guiding data collection and even providing their names (for example, the Upper Confidence Bounds algorithm of [8] and the lil’UCB of [9]). The performance of these methods hinges on the quality of sequential actions, which in turn depends critically on the confidence bounds. If the bounds are too loose, then such sequential algorithms may perform little better than non-adaptive or random action selection. If they are too aggressive (i.e., invalid confidence bounds), then guarantees are void and algorithm performance is suboptimal. This is particularly true in the small sample regime, where sequential learning algorithms have the most to gain over non-adaptive counterparts.
Direct computation of level-set regions involves enumerating all empirical outcomes and computing partial sums. In the small sample regime, computation of the level-set regions is straightforward. As computation scales with the number of empirical outcomes, $|\Delta_{k,n}| = \binom{n+k-1}{k-1}$, this becomes prohibitive for modest $k$. To aid in computation, we show an outer bound based on the Kullback-Leibler divergence that can be used to accelerate computation of the regions. The large sample regime, which is not the focus of the computational work here, is served well by traditional confidence regions based on asymptotic statistics.
2 Preliminaries
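Let $X = (X_1, \dots, X_n)$ be an i.i.d. sample of a categorical random variable, where each $X_j$ takes one of $k$ possible values from a set of categories $\mathcal{X} = \{x_1, \dots, x_k\}$. The empirical distribution of $X$ is the relative proportion of occurrences of each element of $\mathcal{X}$ in $X$. More precisely, let $n_i = |\{j : X_j = x_i\}|$ and define $\hat{p}_i = n_i / n$ for $i = 1, \dots, k$. Then $\hat{p} = (\hat{p}_1, \dots, \hat{p}_k) \in \Delta_{k,n}$, where $\Delta_{k,n}$ is the discrete simplex from $n$ samples over $k$ categories:
$$\Delta_{k,n} = \Big\{ \hat{p} \in \{0, 1/n, \dots, 1\}^k : \sum_{i=1}^{k} \hat{p}_i = 1 \Big\}.$$
We write $\mathbb{P}_p(\hat{p})$ as shorthand for the probability of observing the empirical distribution $\hat{p}$, where $\mathbb{P}_p$ denotes the probability measure under $p \in \Delta_k$ and $\Delta_k$ is the probability simplex over $k$ categories:
$$\Delta_k = \Big\{ p \in [0,1]^k : \sum_{i=1}^{k} p_i = 1 \Big\}.$$
We also write $\mathbb{P}_p(\mathcal{S})$ for $\mathcal{S} \subseteq \Delta_{k,n}$ as shorthand for $\mathbb{P}_p(\hat{p} \in \mathcal{S})$. The distribution of $\hat{p}$ is fully characterized by the multinomial distribution with parameter $p$:
$$\mathbb{P}_p(\hat{p}) = \frac{n!}{\prod_{i=1}^{k} (n \hat{p}_i)!} \prod_{i=1}^{k} p_i^{\,n \hat{p}_i}.$$
The parameter $p$ specifies the unknown distribution over $\mathcal{X}$.
The focus of this paper is construction of confidence regions for $p$ from a sample $X$. Since $\hat{p}$ is a sufficient statistic for $p$, we focus on construction of confidence regions that are functions of $\hat{p}$ with no loss of generality.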
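The objects defined in this section are small enough to enumerate directly for small problems. The following is a minimal sketch, not the authors' implementation: the helper names `compositions` and `multinomial_pmf`, the sample size, and the parameter value are all illustrative assumptions. It lists every element of the discrete simplex as a vector of counts, evaluates the multinomial probability of each, and checks that the probabilities sum to one.

```python
from math import comb, factorial, prod

def compositions(n, k):
    """Yield every count vector (n_1, ..., n_k) of nonnegative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(counts, p):
    """Probability that n i.i.d. draws from p produce exactly these category counts."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # the quotient stays an integer at every step
    return coef * prod(pi ** c for pi, c in zip(p, counts))

n, k = 5, 3
p = (0.2, 0.3, 0.5)  # illustrative parameter, not taken from the paper
outcomes = list(compositions(n, k))
total_mass = sum(multinomial_pmf(c, p) for c in outcomes)
print(len(outcomes), comb(n + k - 1, k - 1), round(total_mass, 12))
```

The number of outcomes equals the binomial coefficient printed next to it, which is the quantity that drives the cost of any enumeration-based construction.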
Definition 1.
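Confidence region. Denote by $\mathcal{P}(\Delta_k)$ the power set of $\Delta_k$. Let $\mathcal{C}_\delta : \Delta_{k,n} \to \mathcal{P}(\Delta_k)$ be a set-valued function that maps an observed empirical distribution to a subset of the simplex. $\mathcal{C}_\delta(\hat{p})$ is a confidence region at confidence level $1-\delta$ if (1) holds for all $p \in \Delta_k$.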
Observation 1.
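Equivalent Characterization via Covering Collections. Denote by $\mathcal{P}(\Delta_{k,n})$ the power set of $\Delta_{k,n}$. Let $\mathcal{S} : \Delta_k \to \mathcal{P}(\Delta_{k,n})$ be the preimage of $\mathcal{C}_\delta$:
$$\mathcal{S}(p) = \{ \hat{p} \in \Delta_{k,n} : p \in \mathcal{C}_\delta(\hat{p}) \}. \tag{2}$$
Then
$$\mathbb{P}_p\big(p \in \mathcal{C}_\delta(\hat{p})\big) = \mathbb{P}_p\big(\mathcal{S}(p)\big) \tag{3}$$
and
$$\mathcal{C}_\delta(\hat{p}) = \{ p \in \Delta_k : \hat{p} \in \mathcal{S}(p) \}. \tag{4}$$
We refer to $\mathcal{S}(p)$ as a covering collection [3], and observe that any confidence region construction can be equivalently expressed in terms of its covering collection according to (4). We also note that for any valid confidence region, $\mathbb{P}_p(\mathcal{S}(p)) \ge 1-\delta$ holds for all $p \in \Delta_k$, since $\mathbb{P}_p(\mathcal{S}(p)) = \mathbb{P}_p(p \in \mathcal{C}_\delta(\hat{p}))$ by (3).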
Next we define the minimum volume confidence region construction, which is termed the level-set region in [3]. The sets are defined in terms of their covering collection. We note that this construction differs from the definition in [3] in order to facilitate the main theorem of this paper. We discuss this difference in Section 3.2.
Definition 2.
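Minimal volume confidence region. Denote by $\mathcal{P}(\Delta_{k,n})$ the power set of $\Delta_{k,n}$. Let $\mathcal{S}^\star : \Delta_k \to \mathcal{P}(\Delta_{k,n})$ be any set-valued function that satisfies
$$\mathcal{S}^\star(p) \in \arg\min_{\mathcal{S} \in \mathcal{P}(\Delta_{k,n})} |\mathcal{S}| \quad \text{subject to} \quad \mathbb{P}_p(\mathcal{S}) \ge 1 - \delta \tag{5}$$
for all $p \in \Delta_k$. Then the minimal volume confidence region is given as
$$\mathcal{C}^\star_\delta(\hat{p}) = \{ p \in \Delta_k : \hat{p} \in \mathcal{S}^\star(p) \}. \tag{6}$$
$\mathcal{S}^\star$ is a set-valued function, mapping $p$ to a subset of empirical distributions with the minimal number of elements among subsets whose probability under $p$ equals or exceeds $1-\delta$. $\mathcal{C}^\star_\delta(\hat{p})$ is the subset of the simplex for which the set-valued function $\mathcal{S}^\star$ includes the observation $\hat{p}$.
Note that $\mathcal{S}^\star(p)$ is in general not unique, and many subsets of $\Delta_{k,n}$ can have minimal cardinality and sufficient probability. As we develop in what follows, any choice of subsets of $\Delta_{k,n}$ that satisfies (5) results in confidence regions with minimal, and thus equal, average volume. We discuss this in Section 3.2. Before proceeding, we note that the construction creates confidence regions with sufficient coverage, by definition.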
Observation 2.
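$\mathcal{C}^\star_\delta(\hat{p})$ is a confidence region at level $1-\delta$, since $\mathbb{P}_p(\mathcal{S}^\star(p)) \ge 1-\delta$ for all $p \in \Delta_k$.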
3 Results
3.1 Main Result
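We next proceed to the main result of the paper, which shows that the confidence regions of Definition 2, $\mathcal{C}^\star_\delta(\hat{p})$, have on average minimal volume among confidence regions at level $1-\delta$.
Theorem 1.
Let $\mathcal{C}^\star_\delta(\hat{p})$ be a confidence region given by Definition 2 and define $\mu(\cdot)$ as the Lebesgue measure. Then
$$\sum_{\hat{p} \in \Delta_{k,n}} \mu\big(\mathcal{C}^\star_\delta(\hat{p})\big) \le \sum_{\hat{p} \in \Delta_{k,n}} \mu\big(\mathcal{C}_\delta(\hat{p})\big)$$
for any confidence region $\mathcal{C}_\delta(\hat{p})$.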
Proof.
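The crux of the proof follows from a straightforward observation. Employing the relationship in (4), note that for any confidence region,
$$\sum_{\hat{p} \in \Delta_{k,n}} \mu\big(\mathcal{C}_\delta(\hat{p})\big) = \sum_{\hat{p} \in \Delta_{k,n}} \int_{\Delta_k} \mathbb{1}\{\hat{p} \in \mathcal{S}(p)\}\, dp = \int_{\Delta_k} |\mathcal{S}(p)|\, dp, \tag{7}$$
which, in words, states a simple fact: the sum of the Lebesgue measure of the confidence regions for all possible empirical outcomes is equal to the integral of the cardinality of $\mathcal{S}(p)$ over $\Delta_k$. By definition, $|\mathcal{S}^\star(p)| \le |\mathcal{S}(p)|$ for all $p \in \Delta_k$. This implies
$$\int_{\Delta_k} |\mathcal{S}^\star(p)|\, dp \le \int_{\Delta_k} |\mathcal{S}(p)|\, dp. \tag{8}$$
Given that any confidence region construction can be defined in terms of its covering collection according to Observation 1, together with (7), this implies the result. ∎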
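The counting argument in the proof can be checked numerically in the binomial case ($k = 2$), where the simplex is the interval $[0, 1]$. The sketch below is illustrative only: the function names, the grid resolution, and the choice of the Clopper-Pearson construction as the comparison are assumptions made for this example. For each $p$ on a grid it computes the cardinality of a minimal covering collection (most probable outcomes first) and of the Clopper-Pearson covering collection (outcomes not rejected by either one-sided test at level $\delta/2$), then averages over the grid to approximate the integral of the cardinality.

```python
from math import comb

def binom_pmf(n, p):
    """Probabilities of 0, 1, ..., n successes in n trials."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

def min_cover_size(pmf, delta):
    """Smallest number of outcomes whose total probability is at least 1 - delta."""
    mass, size = 0.0, 0
    for q in sorted(pmf, reverse=True):
        mass += q
        size += 1
        if mass >= 1 - delta - 1e-12:  # small slack for floating-point rounding
            break
    return size

def clopper_pearson_cover_size(pmf, delta):
    """Number of outcomes x not rejected by either one-sided test at delta / 2."""
    size = 0
    for x in range(len(pmf)):
        lower_tail = sum(pmf[: x + 1])  # P(X <= x)
        upper_tail = sum(pmf[x:])       # P(X >= x)
        if lower_tail > delta / 2 and upper_tail > delta / 2:
            size += 1
    return size

n, delta, grid = 10, 0.05, 2000
ps = [(i + 0.5) / grid for i in range(grid)]  # midpoint rule on [0, 1]
total_min = sum(min_cover_size(binom_pmf(n, p), delta) for p in ps) / grid
total_cp = sum(clopper_pearson_cover_size(binom_pmf(n, p), delta) for p in ps) / grid
print(total_min, total_cp)
```

By the identity stated in words in the proof, each printed number approximates the summed length of the $n + 1$ intervals produced by the corresponding construction, so the first should not exceed the second.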
Theorem 1 shows that, averaged over equally probable empirical distributions, the confidence regions of Definition 2 have minimal volume. We next show that if the multinomial parameter is chosen with uniform probability over the simplex, then the optimality property of the region still applies.
Proposition 1.
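Let $p$ be drawn uniformly at random from $\Delta_k$ and let $\mathbb{E}$ denote expectation with respect to the multinomial parameter $p$. Let $\mathcal{C}^\star_\delta(\hat{p})$ be a confidence region construction given by Definition 2. Then
$$\mathbb{E}\big[\mu\big(\mathcal{C}^\star_\delta(\hat{p})\big)\big] \le \mathbb{E}\big[\mu\big(\mathcal{C}_\delta(\hat{p})\big)\big]$$
for any confidence region $\mathcal{C}_\delta(\hat{p})$.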
Proof.
The proof follows from the observation that a multinomial parameter drawn uniformly at random in $\Delta_k$ induces a uniform distribution over the set of empirical distributions. To see this, note that the resulting distribution on $\Delta_{k,n}$ is the Dirichlet-multinomial distribution, or a compound Dirichlet distribution [10], with a uniform Dirichlet prior, which assigns probability $1/\binom{n+k-1}{k-1}$ to every element of $\Delta_{k,n}$. The result then follows directly from Theorem 1. ∎
3.2 Optimal Construction
As noted in Sec. 2, the minimum volume confidence construction is underspecified. In general there are many covering collections , each of which results in equal and minimal volume confidence regions.
A simple way to fully specify the confidence regions is to order the empirical distributions based on their probability under $p$ (with ties broken randomly), and construct $\mathcal{S}^\star(p)$ by including the most probable empirical distributions until a mass of $1-\delta$ is obtained. This results in covering collections that satisfy Definition 2 and also have an additional guarantee on their coverage probability. We capture this notion in the following proposition.
Proposition 2.
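For any $p \in \Delta_k$, let $\hat{p}^{(1)}, \hat{p}^{(2)}, \dots$ be an ordering of the empirical distributions such that $\mathbb{P}_p(\hat{p}^{(1)}) \ge \mathbb{P}_p(\hat{p}^{(2)}) \ge \cdots$, and let $m$ be the smallest integer that satisfies
$$\sum_{i=1}^{m} \mathbb{P}_p\big(\hat{p}^{(i)}\big) \ge 1 - \delta. \tag{9}$$
Define $\mathcal{S}^\star(p) = \{\hat{p}^{(1)}, \dots, \hat{p}^{(m)}\}$ and $\mathcal{C}^\star_\delta(\hat{p}) = \{p \in \Delta_k : \hat{p} \in \mathcal{S}^\star(p)\}$. Then, for any confidence region $\mathcal{C}_\delta(\hat{p})$ given by Definition 2,
$$\mathbb{P}_p\big(p \in \mathcal{C}^\star_\delta(\hat{p})\big) \ge \mathbb{P}_p\big(p \in \mathcal{C}_\delta(\hat{p})\big)$$
holds for all $p \in \Delta_k$.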
Proof.
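Since the coverage probability of a region equals the probability of its covering collection by the relationship in (3), and since the $m$ most probable empirical distributions have the largest total probability among all subsets of cardinality $m$ by the ordering above, the proof follows immediately. ∎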
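The ordering-based construction in Proposition 2 can be sketched by brute-force enumeration. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, and ties are broken deterministically by the sort order of the count vectors instead of randomly as described in the text.

```python
from math import factorial, prod

def compositions(n, k):
    """Yield every count vector of length k with nonnegative entries summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(counts, p):
    """Multinomial probability of the given category counts under parameter p."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(pi ** c for pi, c in zip(p, counts))

def covering_collection(p, n, delta):
    """Most probable count vectors under p, added until their mass reaches 1 - delta."""
    ranked = sorted(((multinomial_pmf(c, p), c) for c in compositions(n, len(p))),
                    reverse=True)  # deterministic tie-breaking (an assumption)
    chosen, mass = [], 0.0
    for prob, c in ranked:
        chosen.append(c)
        mass += prob
        if mass >= 1 - delta:
            break
    return chosen, mass

def in_region(p, counts, delta):
    """True if p belongs to the confidence region of the observed counts."""
    chosen, _ = covering_collection(p, sum(counts), delta)
    return tuple(counts) in chosen

p, n, delta = (0.2, 0.3, 0.5), 10, 0.05  # illustrative values
chosen, mass = covering_collection(p, n, delta)
print(len(chosen), round(mass, 4))
print(in_region(p, (2, 3, 5), delta), in_region((0.98, 0.01, 0.01), (0, 0, 10), delta))
```

Membership of a parameter in the region of an observed outcome is decided by rebuilding that parameter's covering collection, which is why the cost grows with the number of empirical outcomes.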
Proposition 2 shows that a particular choice for construction of the covering collection also satisfies a secondary optimality property: among all confidence regions that have minimal (and equal) average volume, $\mathcal{C}^\star_\delta(\hat{p})$ has maximal coverage probability for all $p \in \Delta_k$.
Proposition 2 highlights the observation that several confidence region constructions have equal average minimal volume. This occurs because the average is taken over the set of possible empirical distributions. Provided the minimal cardinality requirement is employed in the construction, the average volume is constant, but the coverage probability may vary.
Proposition 2 also highlights the difference between the definition of the minimal volume confidence regions defined here, and the level set construction in [3]. In the level set construction, equiprobable outcomes are either all included or excluded in the covering collections, which precludes the construction from having minimal average volume in this corner case.
4 Discussion and Extensions
4.1 Confidence Intervals for Functionals
Confidence regions for the multinomial parameter provide a direct path to confidence intervals for any functional of interest (such as the mean, variance and median) by simply finding the maximum and minimum of the functional over the confidence region.
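As a rough illustration of this recipe, the sketch below approximates a confidence interval for the mean of a three-category variable by scanning a grid over the simplex and keeping the parameters that belong to the region of the observed counts. Everything specific here is an assumption made for the example: the numeric values attached to the categories, the observed counts, the grid resolution, and the deterministic tie-breaking in the membership test. A grid scan only approximates the true extremes of the functional.

```python
from math import factorial, prod

def compositions(n, k):
    """Yield every count vector of length k with nonnegative entries summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(counts, p):
    """Multinomial probability of the given category counts under parameter p."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(pi ** c for pi, c in zip(p, counts))

def in_region(p, counts, delta):
    """True if the observed counts are among the most probable outcomes under p
    that are needed to reach mass 1 - delta."""
    n, k = sum(counts), len(p)
    ranked = sorted(((multinomial_pmf(c, p), c) for c in compositions(n, k)), reverse=True)
    mass = 0.0
    for prob, c in ranked:
        mass += prob
        if c == tuple(counts):
            return True
        if mass >= 1 - delta:
            return False
    return False

values = (0.0, 0.5, 1.0)  # hypothetical numeric value of each category
counts = (2, 3, 5)        # hypothetical observation with n = 10
delta, G = 0.05, 50       # confidence parameter and grid resolution

means = []
for i in range(G + 1):
    for j in range(G + 1 - i):
        p = (i / G, j / G, (G - i - j) / G)
        if in_region(p, counts, delta):
            means.append(sum(v * pi for v, pi in zip(values, p)))
lo, hi = min(means), max(means)
print(round(lo, 3), round(hi, 3))
```

The same loop yields an interval for any other functional by replacing the expression appended to `means`.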
The next proposition shows that the minimum average volume confidence sets yield minimum average volume confidence sets for any functional of the distribution (e.g., mean, variance, etc.).
Proposition 3.
Consider a functional of the form and the sets . The following equality holds:
(10) 
Proof.
This follows in a similar way as the counting argument used to obtain (7), except that the Lebesgue measure is now over the range space of the function . In particular, for a given , we can find all that map to it, and the covering collections for each of those . Thus for each that maps to , there are number of confidence regions that assign positive mass to it in the LHS of (10). Integrating over and gives the required result. ∎
To illustrate the power of the level-set construction, let us compare confidence intervals for the mean. The classical Chernoff bound and Hoeffding’s inequality are standard textbook examples that bound deviations of the empirical mean from the true mean. These are sometimes useful in algorithm analysis, but often too loose in practice [11], since they essentially assume the worst-case variance. Refinements such as the KL-Bernoulli bound [11, 12] can be significantly better, especially in cases where the true mean is close to the extremes, e.g., $0$ or $1$ in the case of random variables in $[0, 1]$. These bounds have shown theoretical and empirical improvement in multi-armed bandit algorithms [12, 13]. Bernstein’s inequality offers potential for improvement by taking the underlying scale/variance into account. The empirical Bernstein bound [14, 15, 16, 17, 18] uses an estimate of the variance to tighten confidence intervals on the mean. For sufficiently large sample sizes, this bound can be significantly better than those mentioned above, showing that additional information about the shape of the distribution can be helpful in improving bounds. However, the empirical Bernstein bound is quite loose in small sample regimes, which significantly reduces its practicality.
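Two of the baseline bounds mentioned above are easy to compute side by side. The following sketch is illustrative and makes its own assumptions: both intervals are two-sided with the error probability split evenly between the tails, the KL-Bernoulli interval is found by plain bisection, and the sample size and empirical mean are made-up values.

```python
from math import log, sqrt

def hoeffding_interval(p_hat, n, delta):
    """Two-sided Hoeffding interval for the mean of a [0, 1]-valued variable."""
    half = sqrt(log(2 / delta) / (2 * n))
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

def bernoulli_kl(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), with b clipped away from 0 and 1."""
    eps = 1e-15
    b = min(max(b, eps), 1 - eps)
    out = 0.0
    if a > 0:
        out += a * log(a / b)
    if a < 1:
        out += (1 - a) * log((1 - a) / (1 - b))
    return out

def kl_interval(p_hat, n, delta, iters=60):
    """Set of means q with n * kl(p_hat, q) <= log(2 / delta), located by bisection."""
    level = log(2 / delta) / n

    def solve(inside, outside):
        # Invariant: `inside` satisfies the KL constraint; `outside` is the far endpoint.
        for _ in range(iters):
            mid = (inside + outside) / 2
            if bernoulli_kl(p_hat, mid) <= level:
                inside = mid
            else:
                outside = mid
        return inside

    return solve(p_hat, 0.0), solve(p_hat, 1.0)

n, delta, p_hat = 20, 0.05, 0.9  # illustrative values
h_lo, h_hi = hoeffding_interval(p_hat, n, delta)
k_lo, k_hi = kl_interval(p_hat, n, delta)
print(round(h_hi - h_lo, 3), round(k_hi - k_lo, 3))
```

By Pinsker's inequality the KL-Bernoulli interval is always contained in the unclipped Hoeffding interval, and the gain is largest when the empirical mean is near 0 or 1, as in this example.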
The level-set construction proposed in this paper can require several times fewer samples to achieve a specific confidence interval width when compared with the approaches described above. This implies that the sample complexity or regret of bandit and reinforcement learning algorithms can be reduced by a similar factor [13]. We demonstrate this by plotting the widths of these methods with increasing sample size in Figure 1.
4.2 Relationship to significance testing
The confidence regions presented in this paper and in [3] are closely related to $p$-values in statistical significance testing. Often, the phrase $p$-value is used to describe an approximate $p$-value based on a normal approximation. A more precise interpretation of a $p$-value can be related to the construction of $\mathcal{S}^\star(p)$.
Definition 3.
$p$-value. The $p$-value of an outcome $\hat{p}$ (under the hypothesis $p$) is:
$$\rho_p(\hat{p}) = \sum_{q \in \Delta_{k,n} \,:\, \mathbb{P}_p(q) \le \mathbb{P}_p(\hat{p})} \mathbb{P}_p(q).$$
A $p$-value has the following interpretation in statistical significance testing: $\rho_p(\hat{p})$ is the probability that the observed outcome or something less probable occurred under the hypothesis $p$. A small $p$-value corresponds to a strange outcome under the null, and thus corresponds to rejection of the null hypothesis. The level-set confidence regions described in this paper and in [3] can be stated in terms of covering collections based on $p$-values: the covering collection of $p$ consists of the outcomes that are not statistically significant at level $\delta$ under the hypothesis $p$.
We note that the level-set confidence regions and their expressions herein are closely related to ‘exact’ confidence regions defined in [19] for the specific case when $k = 2$. The confidence region defined by an exact test is the range of parameters over which the outcome is not statistically significant at a $p$-value of $\delta$. Extending this to the multinomial setting is the essence of the level-set confidence regions.
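The p-value of Definition 3, read as the probability of the observed outcome or something less probable, can be computed by direct enumeration for small problems. The sketch below is illustrative: the helper names and the parameter are assumptions, and a small relative tolerance is used when comparing probabilities so that outcomes tied with the observation are counted despite floating-point rounding.

```python
from math import factorial, prod

def compositions(n, k):
    """Yield every count vector of length k with nonnegative entries summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(counts, p):
    """Multinomial probability of the given category counts under parameter p."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(pi ** c for pi, c in zip(p, counts))

def p_value(counts, p):
    """Total probability, under p, of outcomes no more probable than the observed one."""
    n, k = sum(counts), len(p)
    observed = multinomial_pmf(counts, p)
    threshold = observed * (1 + 1e-12)  # tolerance so exact ties are included
    return sum(q for q in (multinomial_pmf(c, p) for c in compositions(n, k))
               if q <= threshold)

p = (0.2, 0.3, 0.5)                # hypothesised parameter (illustrative)
typical = p_value((2, 3, 5), p)    # the most probable outcome under p
extreme = p_value((10, 0, 0), p)   # a very unlikely outcome under p
print(round(typical, 6), extreme)
```

A parameter would be excluded from the region of an observation when this quantity is small relative to the chosen significance level, which matches the significance-testing reading given above.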
4.3 Relationship to Sanov Confidence Regions
Sanov’s theorem (Theorem 11.4.1 in [20]) allows us to bound the probability of observing a set of empirical distributions using its Kullback-Leibler distance to the data-generating distribution. Since the statement of the theorem involves an infimum over Kullback-Leibler distances, we can use it to obtain the following inequality:
where
$$D(\hat{p} \,\|\, p) = \sum_{i=1}^{k} \hat{p}_i \log \frac{\hat{p}_i}{p_i}$$
is the Kullback-Leibler divergence. One can view the previous inequality as a concentration result for the Kullback-Leibler divergence between the observed empirical distribution and the true distribution. The work in [21] has sharpened these types of results in several parameter ranges. For example, when , [21] shows that
(11) 
Thus using Sanov’s theorem gives us a choice for a confidence region of level $1-\delta$. Another approach, used by [2] to obtain a confidence region, is to obtain bounds on the marginal probabilities $p_i$. This can be done because $\hat{p}_i$ is the average of $n$ i.i.d. realizations of a Bernoulli random variable having mean $p_i$. By allocating error probability in bounding each of the marginal parameters, we get using the Bernoulli-KL inequality [12] that for each $i$
(12) 
Both (11) and (12) give us valid confidence regions for the multinomial parameter. We plot these regions along with our proposed region in Figure 2.
4.4 Computation
Computation of $\mathcal{C}^\star_\delta(\hat{p})$ requires enumerating all empirical outcomes and computing partial sums. In our experiments, enumerating and ordering the empirical distributions and checking membership in $\mathcal{C}^\star_\delta(\hat{p})$ is feasible in around two seconds on a modern laptop for the values of $n$ and $k$ we used. Regardless, as computation scales with the number of empirical outcomes, $|\Delta_{k,n}| = \binom{n+k-1}{k-1}$, computation of membership in $\mathcal{C}^\star_\delta(\hat{p})$ becomes prohibitive for a modest number of categories. To aid in computation, we show an outer bound based on the Kullback-Leibler divergence that can be used to accelerate computation of the regions.
In the following theorem, we provide an outer bound that can be used to reduce computation. The bound provides a way to confirm if a particular is outside .
Theorem 2.
Outer bound. The following inequality holds:
Proof.
From [20] (Theorem 11.1.4), we can bound the probability of any empirical distribution under :
(13) 
Thus, for any ,
which implies the following. Let be a set of empirical distributions that satisfies for all . Then,
(14) 
Next, we require Sanov’s Theorem, [20] (Theorem 11.4.1), which states the following. Let be a set of empirical distributions. Then
(15) 
Choosing and combining (14) and (15), we conclude
∎
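The first ingredient of the proof, the upper bound on the probability of a single empirical distribution from Theorem 11.1.4 of [20], can be checked numerically. The sketch below writes that bound with natural logarithms, as $\mathbb{P}_p(\hat{p}) \le \exp(-n D(\hat{p} \,\|\, p))$; the helper names, sample size, and parameter are illustrative assumptions, and the check says nothing about the constants in Theorem 2 itself.

```python
from math import exp, factorial, log, prod

def compositions(n, k):
    """Yield every count vector of length k with nonnegative entries summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_pmf(counts, p):
    """Multinomial probability of the given category counts under parameter p."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)
    return coef * prod(pi ** c for pi, c in zip(p, counts))

def kl_divergence(q, p):
    """Kullback-Leibler divergence D(q || p) in nats, using the convention 0 log 0 = 0."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

n, p = 8, (0.2, 0.3, 0.5)  # illustrative values with strictly positive p
gaps = []
for c in compositions(n, len(p)):
    q = tuple(ci / n for ci in c)
    gaps.append(exp(-n * kl_divergence(q, p)) - multinomial_pmf(c, p))
print(min(gaps))  # nonnegative up to rounding when the bound holds
```

Because the bound depends on the outcome only through its divergence from $p$, an outcome with large divergence can be ruled out of a covering collection without enumerating the remaining outcomes, which is the practical use of an outer bound of this kind.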
Note that the above bound has an additional factor of two in the second term, beyond what arises from directly inverting Sanov’s Theorem [20]. This arises from the fact that is not necessarily the minimal empirical distribution in KL divergence, i.e., it is not necessarily true that equals
(16) 
We illustrate our proposed confidence region for a toy experiment on samples of a categorical random variable. Figure 3 shows the confidence regions at level for all possible empirical distributions in the discrete simplex overlaid on top of each other. We also show the uniform parameter and indicate the regions that include it at the chosen confidence level, i.e., its covering collection. From the figure, we can see that in this case.
The computation of the confidence regions relies on Brent’s method [22], a hybrid bisection method available in SciPy, to numerically compute the root of a continuous univariate function.
Proposition 4.
Consider the pmfs on a ray originating at and going towards an edge or face of the simplex. This set of pmfs can be characterized using a pmf which has at least one component as zero.
Denoting the elements of using their associated scalar as , we have that if .
Proof.
This is because the increases for larger values of . Letting be the entropy function,
Only the last term in the RHS above depends on . Since is convex in its arguments (and consequently in ), and the minimum value is zero when , we get the result. ∎
For a given subset of pmfs on the ray in Proposition 4, we numerically find the zero-crossing point of the following function.
The function is positive if and negative otherwise.
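The root-finding step can be sketched on a simpler stand-in problem. The example below is an assumption-laden illustration, not the function used in the paper: it parameterises the ray from an empirical distribution towards a pmf on the boundary of the simplex by a scalar $t \in [0, 1]$, uses the monotonicity of the KL divergence along the ray (Proposition 4), and locates the point where $n$ times the divergence crosses a hypothetical threshold. Plain bisection stands in for Brent's method so that the sketch needs only the standard library.

```python
from math import inf, log

def kl_divergence(q, p):
    """D(q || p) in nats; infinite if p puts zero mass where q does not."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0:
            if pi == 0:
                return inf
            total += qi * log(qi / pi)
    return total

def point_on_ray(p_hat, q, t):
    """Convex combination moving from p_hat (t = 0) to the boundary pmf q (t = 1)."""
    return tuple((1 - t) * a + t * b for a, b in zip(p_hat, q))

def boundary_on_ray(p_hat, q, n, threshold, iters=60):
    """Largest t with n * D(p_hat || p_t) <= threshold, found by bisection."""
    def f(t):
        return n * kl_divergence(p_hat, point_on_ray(p_hat, q, t)) - threshold

    lo, hi = 0.0, 1.0  # f(0) = -threshold < 0
    if f(hi) <= 0:
        return 1.0     # the whole ray satisfies the constraint
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) <= 0:
            lo = mid
        else:
            hi = mid
    return lo

p_hat = (0.2, 0.3, 0.5)  # illustrative empirical distribution
q = (0.0, 0.0, 1.0)      # illustrative boundary pmf with zero components
n, threshold = 10, 3.0   # hypothetical sample size and threshold
t_star = boundary_on_ray(p_hat, q, n, threshold)
print(round(t_star, 6), point_on_ray(p_hat, q, t_star))
```

Monotonicity along the ray is what guarantees a single sign change, so any bracketing root finder, including Brent's method, can be applied to the same function.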
5 Summary
Construction of tight confidence regions is a challenging problem with a long history. The problem has seen increased interest, as confidence bounds are central to the analysis and operation of many learning algorithms, especially sequential methods such as active learning, bandits, and reinforcement learning. In this paper, we have obtained optimal confidence regions for the multinomial parameter. In particular, we show that the level-set confidence regions proposed in [3] are tight in the sense that they have minimal average volume, answering a long-standing question.
Confidence sets in the simplex provide a direct approach to deriving confidence intervals for any functional of interest (for example, the mean, variance, or median). The resulting intervals inherit the properties of the confidence set; i.e., they have minimum average width. To achieve a desired interval width, the new bounds require several times fewer samples than standard bounds in many cases. This implies that the sample complexity or regret of bandit and reinforcement learning algorithms can be reduced by a corresponding factor.
While computation of the regions is possible for modest $n$ and $k$, it can become prohibitive for problems with a large number of categories and samples. To aid in computation, we relate the regions to $p$-values, and derive a bound based on the Kullback-Leibler divergence that can be used to accelerate computation.
References
[1] Charles J Clopper and Egon S Pearson. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4):404–413, 1934.
[2] Robert Nowak and Ervin Tànczos. Tighter confidence intervals for rating systems, 2019.
[3] Djalil Chafai and Didier Concordet. Confidence regions for the multinomial parameter with small sample size. Journal of the American Statistical Association, 104(487):1071–1079, 2009.
[4] Alan Agresti and Brent A Coull. Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52(2):119–126, 1998.
[5] Alan Agresti. Dealing with discreteness: making exact confidence intervals for proportions, differences of proportions, and odds ratios more exact. Statistical Methods in Medical Research, 12(1):3–21, 2003.
[6] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. On finding the largest mean among many. arXiv preprint arXiv:1306.3917, 2013.
[7] Matthew L Malloy and Robert D Nowak. Sequential testing for sparse recovery. IEEE Transactions on Information Theory, 60(12):7862–7873, 2014.
[8] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[9] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
[10] Bela A Frigyik, Amol Kapila, and Maya R Gupta. Introduction to the Dirichlet distribution and related processes. Department of Electrical Engineering, University of Washington, UWEETR-2010-0006, (0006):1–27, 2010.
[11] John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(Mar):273–306, 2005.
[12] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.
[13] Ervin Tanczos, Robert Nowak, and Bob Mankoff. A KL-LUCB algorithm for large-scale crowdsourcing. In Advances in Neural Information Processing Systems, pages 5894–5903, 2017.
[14] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679. ACM, 2008.
[15] A Maurer and M Pontil. Empirical Bernstein bounds and sample variance penalization. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.
[16] Thomas Peel, Sandrine Anthoine, and Liva Ralaivola. Empirical Bernstein inequalities for U-statistics. In Advances in Neural Information Processing Systems, pages 1903–1911, 2010.
[17] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
[18] Akshay Balsubramani and Aaditya Ramdas. Sequential nonparametric testing with the law of the iterated logarithm. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 42–51. AUAI Press, 2016.
[19] Colin R Blyth and Harold A Still. Binomial confidence intervals. Journal of the American Statistical Association, 78(381):108–116, 1983.
[20] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
[21] Jay Mardia, Jiantao Jiao, Ervin Tánczos, Robert D Nowak, and Tsachy Weissman. Concentration inequalities for the empirical distribution. arXiv preprint arXiv:1809.06522, 2018.
[22] Richard P Brent. Algorithms for minimization without derivatives. Courier Corporation, 2013.