Estimating the most likely outcome of a probability distribution is a useful primitive in many computing applications such as counting, natural language processing, clustering, etc. We study the probably approximately correct (PAC) sequential version of this problem in which the learner faces a stream of elements sampled independently and identically distributed (i.i.d.) from an unknown probability distribution, wishing to learn an element with the highest probability mass on it (a mode) with confidence. At any time, the learner can issue queries to obtain information about the identities of samples in the stream, and aims to use as few queries as possible to learn the distribution’s mode with high confidence. Specifically, we consider two natural models for sample identity queries – (a) each query, for a single sample of the stream so far, unambiguously reveals the identity (label) of the sample, (b) each query, for a pair of samples in the stream, reveals whether they are the same element or not.
A concrete application of mode estimation (and one of the main reasons that led to this formulation) is the problem of adaptive, partial clustering, where the objective is to find the largest cluster (i.e., equivalence class) of elements as opposed to learning the entire cluster grouping [12, 13, 11, 10]. We are given a set of elements with an unknown clustering or partition, and would like to find the elements comprising the largest cluster. Suppose a stream of elements is sampled uniformly and independently from the set, and at each time one can ask a comparison oracle questions of the form: “Do two sampled elements belong to the same cluster or not?” Under this uniform sampling distribution for the element stream, the probability of an element belonging to a certain cluster is simply proportional to the cluster’s size, so learning the heaviest cluster is akin to identifying the mode of the distribution of a sampled element’s cluster label.
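As a toy illustration of this reduction (the function and variable names below are ours, not from the literature), sampling elements uniformly and taking the mode of their cluster labels recovers the largest cluster:

```python
import random
from collections import Counter

def largest_cluster_via_mode(labels, num_samples, seed=0):
    """Estimate the largest cluster by sampling elements uniformly and
    taking the mode of the sampled cluster labels."""
    rng = random.Random(seed)
    counts = Counter(rng.choice(labels) for _ in range(num_samples))
    return counts.most_common(1)[0][0]

# Three clusters; cluster 'a' is the largest, so under uniform sampling
# 'a' is the mode of the label distribution.
labels = ['a'] * 60 + ['b'] * 25 + ['c'] * 15
print(largest_cluster_via_mode(labels, 2000))  # -> 'a'
```

With enough samples, the empirical label frequencies concentrate around the cluster-size proportions, so the modal label is the largest cluster with high probability.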
We make the following contributions towards understanding the sequential query complexity for estimating the mode of a distribution using a stream of samples. (a) For both the individual-label and pairwise similarity query models, we give sequential PAC query algorithms which provably output a mode of the sample-generating distribution with large probability, together with guarantees on the number of queries they issue. These query complexity upper bounds explicitly depend on parameters of the unknown discrete probability distribution, in that they scale inversely with the gap between the probability masses at the mode and at the other elements in the distribution’s support. The proposed algorithms exploit the probabilistic i.i.d. structure of the data stream to resolve uncertainty about the mode in a query-efficient fashion, and are based on the upper and lower confidence bounds (UCB, LCB) principle from online learning to guide adaptive exploration across time; in fact, we employ more refined empirical Bernstein bounds  to take better advantage of the exact structure of the unknown sample distribution. (b) We derive fundamental limits on the query complexity of any sequential mode-finding algorithm for both query models, whose constituents resemble those of the query complexity upper bounds for our specific query algorithms above. This indicates that the algorithms proposed make good use of their queries and the associated information in converging upon a mode estimate. (c) We report numerical simulation results that support our theoretical query complexity performance bounds.
1.1 Related Work
The mode estimation problem has been studied classically in the batch or non-sequential setting for many decades, dating back to the work of Parzen and Chernoff, among others. This line of work, however, focuses on the objective of consistent mode estimation (and the asymptotic distribution of the estimate) for continuous distributions, instead of finite-time PAC guarantees for large-support discrete distributions as considered here. Our problem is essentially a version of sequential composite hypothesis testing with adaptive “actions” or queries, and with an explicit high-confidence requirement on the testing algorithm upon stopping.
There has been a significant amount of work in the streaming algorithms community, within computer science, on the “heavy hitter” problem – detecting the most frequent symbol in an arbitrary (non-stochastic) sequence – and generalizations thereof pertaining to estimation of the empirical moments. However, the focus there is on understanding resource limits, such as memory and computational effort, for computing on arbitrary (non-stochastic, unstructured) streams that arise in highly dynamic network applications. We are instead interested in quantifying the statistical efficiency of mode estimation algorithms in terms of the structure of the generating probability distribution.
Adaptive decision making and resolution of the explore–exploit trade-off is the subject of work on the well-known multi-armed bandit model. At an abstract level, our problem of PAC mode estimation resembles a multi-armed bandit “best arm” identification problem, but with a different information structure – queries are not directly tied to any utility or reward structure as in bandits.
A related line of work studies settings where the aim is to learn the entire structure of an unknown clustering by making information queries. In this regard, studying the mode estimation problem helps shed light on the simpler, and often more natural, objective of merely identifying the largest cluster in many machine learning applications, which has not been addressed by previous work.
2 Problem formulation
In this section we develop the required notation and describe the query models.
Consider an underlying unknown discrete probability distribution with a finite support set. For each element in the support, let its probability mass be the probability that a random variable drawn from the distribution takes that value.
We would like to estimate the mode of the unknown distribution, defined as any member of the set of all maximisers of the probability mass function over the support. Towards this, we assume query access to an oracle containing a sequence of independent and identically distributed (i.i.d.) samples from the distribution. We study the mode estimation problem under the following query models for accessing the values of these i.i.d. samples:
Query Model 1 (QM1): For each query, we specify an index, following which the oracle reveals the value of the corresponding sample to us. Since the samples are i.i.d., we will assume without loss of generality that successive queries reveal successive samples of the stream.
Query Model 2 (QM2): In this model, the oracle answers pairwise similarity queries. For each query, we specify two indices, following which the oracle reveals whether the two corresponding samples are equal or not. Formally, the oracle’s response to a query indicates whether the two queried samples take the same value.
Note that, under this query model, multiple pairwise queries to the oracle might be required to determine the value of a sample.
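The two query models can be sketched as a simple simulated oracle; the class name and interface below are illustrative assumptions, not part of the formal model:

```python
import random

class SampleOracle:
    """Toy oracle holding an i.i.d. sample stream drawn from a known
    distribution (the learner never sees the stream directly)."""
    def __init__(self, probs, seed=0, stream_len=100_000):
        rng = random.Random(seed)
        # Pre-draw a long i.i.d. stream from the distribution `probs`.
        self.stream = rng.choices(range(len(probs)), weights=probs, k=stream_len)

    def qm1(self, t):
        # QM1: reveal the identity (label) of sample t.
        return self.stream[t]

    def qm2(self, s, t):
        # QM2: reveal only whether samples s and t share the same value.
        return self.stream[s] == self.stream[t]

oracle = SampleOracle([0.5, 0.3, 0.2])
label = oracle.qm1(0)       # an element of {0, 1, 2}
same = oracle.qm2(0, 1)     # True iff samples 0 and 1 coincide
```

A QM2 answer carries strictly less information than two QM1 answers: it tells the learner whether two labels agree without revealing either label.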
For each of the query models above, our goal is to design a statistically efficient sequential mode-estimation algorithm which, at each time, either makes a query to the oracle based on past observations or decides to stop and output an estimate of the distribution’s mode. Mathematically, a sequential algorithm with a stopping rule decides an action at each time depending only on past observations. For QM1, the action can be one of the following:
(continue,): Query the index ,
(stop,), : Stop querying and return as the mode estimate.
For QM2, can be one of the following:
(continue,): Continue with the next round, with possibly multiple sequential pairwise queries of the form for some . That is, we compare the sample with some or all of the previous samples.
(stop,), : Stop querying and return as the mode estimate.
The stopping time of the algorithm is defined as the first time at which it decides to stop.
The cost of the algorithm is measured by its query complexity – the number of queries made by it before stopping.
For , a sequential mode-estimation algorithm is defined to be a -true mode estimator if it correctly identifies the mode for every distribution on the support set with probability at least , i.e., . The goal is to obtain -true mode estimators for each query model (QM1 and QM2) that require as few queries as possible. For a -true mode estimator and a distribution , let denote the number of queries made by the estimator when the underlying unknown distribution is . We are interested in studying the optimal query complexity of -true mode estimators. Note that this query complexity is itself a random quantity, and our results either hold in expectation or with high probability.
For the purpose of this paper, we assume that the mode of the underlying distribution is the first element of the support, and hence a -true mode estimator returns it with probability at least . In Sections 3 and 4, we discuss -true mode estimators and analyze their query complexity for the QM1 and QM2 query models respectively. We provide some experimental results in Section 5 and further explore a few variations of the problem in Section 6. Several proofs have been relegated to the Appendix.
3 Mode estimation with QM1
We will begin by presenting an algorithm for mode estimation under QM1 and analyzing its query complexity.
Recall that under the QM1 query model, querying an index to the oracle reveals the value of the corresponding sample, generated i.i.d. according to the underlying unknown distribution. During the course of the algorithm, we form bins for each element in the support of the underlying distribution. A bin is created when the first sample with the corresponding value is revealed, and any further samples with that value are ‘placed’ in the same bin. For each query and revealed sample value, define an indicator variable which equals 1 if the revealed sample matches the bin’s element and 0 otherwise.
Note that for each given bin, this indicator is a Bernoulli random variable whose mean is the bin’s probability mass. Also, for any given bin, the indicators are i.i.d. over time.
Our mode estimation scheme is presented in Algorithm 1. At each stage of the algorithm, we maintain an empirical estimate of the probability of bin , , for each . Let denote the estimate at time , given by
where recall that for are the i.i.d. samples. Also, at each time instant, we maintain confidence bounds for the estimate of
. The confidence interval for the bin probability at each iteration captures the deviation of the empirical value from its true value. In particular, the interval width is chosen so that the true value lies in the interval with significantly large probability. The lower and upper boundaries of this interval are referred to as the lower confidence bound (LCB) and the upper confidence bound (UCB), respectively. The particular choice of interval width that we use for our algorithm is presented in Section 3.2.
Finally, our stopping rule is as follows: we stop when there exists a bin whose LCB is greater than the UCB of every other bin, and output the index of that bin as the mode estimate.
Given the way our confidence intervals are defined, this ensures that the output of the estimator is the mode of the underlying distribution with large probability.
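The stopping rule above can be sketched as follows (a minimal illustration; the confidence-bound arrays are assumed to be maintained elsewhere by the algorithm):

```python
def should_stop(lcb, ucb):
    """Return the index of a bin whose LCB exceeds every other bin's
    UCB, or None if no such bin exists yet."""
    for i in range(len(lcb)):
        if all(lcb[i] > ucb[j] for j in range(len(ucb)) if j != i):
            return i
    return None

# Bin 0's LCB (0.40) exceeds the UCBs of bins 1 and 2, so we stop.
print(should_stop([0.40, 0.10, 0.05], [0.55, 0.35, 0.30]))  # -> 0
# Here no bin has separated yet, so querying continues.
print(should_stop([0.30, 0.10, 0.05], [0.55, 0.35, 0.30]))  # -> None
```

When the intervals are valid, a bin returned by this check must contain the true mode, since its true probability lies above its LCB while every competitor's lies below that value.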
This proof is based on confidence bound arguments. To construct the confidence intervals for the probability values ’s, we use the empirical version of the Bernstein bound given in . The result used is stated as Theorem 9 in Appendix A. Using the result in our context, we get the following for any given pair with probability at least :
where is the sample variance, i.e., .
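As a sketch, the confidence radius implied by the empirical Bernstein bound of Maurer and Pontil can be computed as follows (the function name is ours; the constants are those of the standard statement of the bound, which we assume matches the form used in Appendix A):

```python
import math

def empirical_bernstein_radius(var_hat, t, delta):
    """Radius r such that, with probability >= 1 - delta,
    |p_hat - p| <= sqrt(2*var_hat*ln(2/delta)/t) + 7*ln(2/delta)/(3*(t-1)),
    where var_hat is the sample variance after t observations."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * var_hat * log_term / t) + 7.0 * log_term / (3.0 * (t - 1))

# The radius shrinks with the sample variance: a near-deterministic bin
# (var_hat close to 0) gets a much tighter interval than a
# high-variance one at the same time t.
print(empirical_bernstein_radius(0.25, 1000, 0.05))
print(empirical_bernstein_radius(0.01, 1000, 0.05))
```

This variance sensitivity is exactly why the empirical Bernstein bound outperforms a worst-case Hoeffding-style bound here: bins with small probability mass have small variance, so their intervals collapse quickly.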
To establish confidence bounds on the sample variance, we use the result given in , which is stated as Theorem 10 in Appendix A. Using the result in our context, and noting that the expected value of would be , we get the following for any given pair with probability at least :
Let denote the error event that for some pair the confidence bound (3) around is violated. Also, let denote the error event that for some pair the confidence bound (4) around the sample variance is violated. Choosing , and taking the union bound over all , from (3) and (4), we get and . Hence we get that
This means that with probability at least the confidence bounds corresponding to both equations (3) and (4) hold true for all pairs .
We now show that if the event holds, then Algorithm 1 returns the true mode. To see this, assume to the contrary that the algorithm returns some other element. Under the event, the confidence intervals hold true for all pairs, and hence the stopping condition defined in Line 10 of Algorithm 1 would imply
which is false. Thus Algorithm 1 returns as the mode if the event is true. Hence, the probability of returning as the mode is at least , which by (5) implies that Algorithm 1 is a -true mode estimator. ∎
3.3 Query Complexity Upper bound
For a -true mode estimator , corresponding to Algorithm 1, we have the following with probability at least .
If the event is true, it implies that all confidence intervals hold true. We find a value , which ensures that by , the confidence intervals of the bins have separated enough such that the algorithm stops. Since at each time one query is made, this value of would also be an upper bound on the query complexity . The derivation of has been relegated to Appendix B, which gives the upper bound as stated above. ∎
3.4 Query Complexity Lower bound
For any -true mode estimator , we have,
Let denote the mode of the distribution . By assumption, we have . Consider any such that . Let be a -true mode estimator and let be its output. Then, by definition,
Let be the stopping time associated with estimator . Using Wald’s lemma  we have,
where refers to the Kullback-Leibler (KL) divergence between the distributions and .
denotes the Bernoulli distribution with the corresponding parameter, and the last step is obtained by the data processing inequality. Furthermore, we have,
We choose as follows, for some small :
where follows from [16, Theorem 1.4] which gives an upper bound on the KL divergence and follows because can be arbitrarily small. Note that an upper bound on the stopping time would also give an upper bound on , since we have one query at each time. Hence, using (8) and (9), we get the following lower bound:
4 Mode estimation with QM2
In this section, we present an algorithm for mode estimation under QM2 and analyse its query complexity.
Recall that under the QM2 query model, a query to the oracle with two indices reveals whether the corresponding samples have the same value or not. Our mode estimation scheme is presented in Algorithm 2. During the course of the algorithm we form bins, where each bin is a collection of indices having the same value as revealed by the oracle; we also keep track of the element from the support that each bin represents. The algorithm proceeds in rounds. In each round, based on the observations made so far, we form a subset of candidate bins. We go over these bins one by one, and in each iteration within the round we query the oracle with the index of the new sample and a sample index from one of the candidate bins. The round ends when the oracle gives a positive answer to one of the queries or we exhaust all candidate bins; so, the number of queries in a round is at most the number of candidate bins. If we get a positive answer from the oracle, we add the new sample to the corresponding bin. A new bin is created if the oracle gives a negative answer to each of these queries.
We will now describe how we construct the subset of candidate bins that we compare against in each round. To do so, for each bin and each round, we maintain an empirical estimate of the bin’s probability mass, and we also maintain confidence intervals around these estimates; the choice of confidence interval is the same as that in QM1, given in (2). The candidate set is formed in each round by retaining only those bins whose UCB is greater than the LCB of every other bin.
The rest of the algorithm is similar to Algorithm 1, including the stopping rule: we stop when there is a bin whose LCB is greater than the UCB of every other bin in the candidate set. As before, the choice of confidence intervals and the stopping rule ensures that, upon stopping, the returned element is the mode with probability at least the target confidence.
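A minimal sketch of one round of this scheme, assuming the oracle is a pairwise same-value predicate and the candidate rule retains bins whose UCB is at least the best LCB of the others (all names are illustrative):

```python
def candidate_bins(lcb, ucb):
    """Bins still plausible as the mode: those whose UCB is at least
    the largest LCB among the other bins."""
    keep = []
    for i in range(len(ucb)):
        best_other = max((lcb[j] for j in range(len(lcb)) if j != i), default=0.0)
        if ucb[i] >= best_other:
            keep.append(i)
    return keep

def qm2_round(same, t, bins, candidates):
    """One round: compare new sample t against one representative of
    each candidate bin until a match is found; returns #queries used."""
    queries = 0
    for b in candidates:
        queries += 1
        if same(bins[b][0], t):        # pairwise "same value?" query
            bins[b].append(t)
            return queries
    bins.append([t])                   # no match anywhere: open a new bin
    return queries

# Demo on a fixed stream of sample values (0, 0, 1, 0, 2).
stream = [0, 0, 1, 0, 2]
same = lambda s, t: stream[s] == stream[t]
bins = [[0]]
for t in range(1, len(stream)):
    qm2_round(same, t, bins, list(range(len(bins))))
print(bins)  # -> [[0, 1, 3], [2], [4]]
```

The demo compares each new sample against all existing bins; Algorithm 2 saves queries precisely by shrinking the candidate list via `candidate_bins` as confidence intervals separate.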
The error analysis for Algorithm 2 is similar to that for Algorithm 1, given in Section 3.2. We consider the same two error events. Recall that the first denotes the error event that for some pair the confidence bound (3) is violated, and the second denotes the error event that for some pair the confidence bound (4) around the sample variance is violated. Again choosing similar values for the parameters, we get the same bound on the total error probability. We now need to show that if the good event holds, then Algorithm 2 returns the true mode; this then implies that Algorithm 2 is a -true mode estimator.
The analysis remains the same as discussed for QM1. The only additional factor we need to consider here is the event that the bin corresponding to the mode of the underlying distribution is discarded in one of the rounds of the algorithm, i.e., it is not a part of the candidate set for some round. A bin is discarded when its UCB becomes less than the LCB of some other bin. Under the good event, all confidence intervals hold true, and hence the UCB of the mode’s bin can never be less than the LCB of any other bin. This implies that, under the good event, the bin corresponding to the mode is never discarded.
Hence, the probability of returning as the mode is at least , which by (5) implies that Algorithm 2 is a -true mode estimator. ∎
4.3 Query Complexity Upper bound
From the analysis of the sample complexity for the QM1 model derived in Section 3.3, we get one upper bound for the QM2 case. Algorithm 1 continues for at most rounds with probability at least , where is given by Theorem 2. A sample accessed in each of these rounds can be compared to at most other samples. Thus, a natural upper bound on the query complexity of Algorithm 2 is . The following result provides a tighter upper bound.
For a -true mode estimator , corresponding to Algorithm 2, we have the following with probability at least .
The detailed calculations are provided in Appendix B, here we give a sketch. Under the event , where all confidence bounds hold true, for any bin represented during the run of the algorithm and corresponding to some element from the support, we have that it will definitely be excluded from when its confidence bound and . We find a which satisfies the stopping condition for each of the bins represented, and summing over them gives the total query complexity. Following the calculations in Appendix B, we get the following value of for the bin corresponding to element , such that by the bin will definitely be excluded from .
Also, for the first bin we get as follows.
Also, the total number of bins created will be at most
A sample from the bin corresponding to element will be involved in at most queries. Hence the total number of queries, , is bounded as follows
4.4 Query Complexity Lower bound
The following theorem gives a lower bound on the expected query complexity for the QM2 model.
For any -true mode estimator , we have,
Consider any -true mode estimator under query model QM2 and let denote its average (pairwise) query complexity when the underlying distribution is . Next, consider the QM1 query model and let estimator simply simulate the estimator under QM2, by querying the values of any sample indices involved in the pairwise queries. It is easy to see that since is a -true mode estimator under QM2, the same will be true for as well under QM1. Furthermore, the expected query complexity of under QM1 will be at most since each pairwise query involves two sample indices. Thus, if the query complexity of under QM2 is less than , then we have a -true mode estimator under query model QM1 with query complexity less than , which contradicts the lower bound in Theorem 3. The result then follows. ∎
The lower bound on the query complexity as given by the above theorem matches the first term of the upper bound given in Theorem 5. So, the lower bound will be close to the upper bound when the first term dominates, in particular when .
While the above lower bound matches the upper bound in a certain restricted regime, we would like a sharper lower bound which works more generally. Towards this goal, we consider a slightly altered (and relaxed) problem setting which relates to the problem of best arm identification in a multi-armed bandit (MAB) setting studied in , . The altered setting and the corresponding result are discussed in Appendix C.
5 Experimental Results
For both the QM1 and QM2 models, we simulate Algorithm 1 and Algorithm 2 for various synthetic distributions. We take and keep the difference constant for each distribution. For the other ’s we follow two different models:
Uniform distribution : The other ’s for are chosen such that each .
Geometric distribution : The remaining probabilities are chosen to form a decreasing geometric sequence which sums up to the leftover probability mass.
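The two synthetic models can be generated as follows (a sketch; the ratio parameter of the geometric model is an illustrative choice, not the paper's):

```python
def uniform_model(n, p1):
    """Mode mass p1; the remaining n-1 elements share the rest equally."""
    rest = (1.0 - p1) / (n - 1)
    return [p1] + [rest] * (n - 1)

def geometric_model(n, p1, ratio=0.8):
    """Mode mass p1; the remaining masses form a decreasing geometric
    sequence rescaled to sum to 1 - p1."""
    tail = [ratio ** k for k in range(1, n)]
    scale = (1.0 - p1) / sum(tail)
    return [p1] + [t * scale for t in tail]

print(uniform_model(5, 0.4))    # -> [0.4, 0.15, 0.15, 0.15, 0.15]
```

Both constructions keep the mode's mass (and hence the gap to the runner-up) under explicit control, which is what the query-complexity plots vary.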
For each distribution we run the experiment 50 times and take an average to plot the query complexity. In Fig. 1 we plot the number of queries taken by Algorithm 1 for QM1, for both the geometric and uniform distributions. As suggested by our theoretical results, the query complexity increases (almost) linearly with for a fixed . In Fig. 2 we plot the number of queries taken by Algorithm 2 for QM2 and compare it to the number of queries taken by a naive algorithm which queries a sample with all the bins formed, for both the uniform and geometric distributions.
To further see how Algorithm 2 performs better than the naive algorithm which queries a sample against all bins formed, we plot the number of queries in each round for a particular geometric distribution, in Fig. 3. We observe that over the rounds, for Algorithm 2 bins start getting discarded and hence the number of queries per round decreases, while for the naive algorithm, as more and more bins are formed, the number of queries per round keeps increasing.
Real world dataset: As mentioned in the introduction, one of the applications of mode estimation is partial clustering. Via experiments on a real-world purchase data set , we were able to benchmark the performance of our proposed Algorithm 2 for pairwise queries, a naive variant of it with no UCB-based bin elimination and the algorithm of  which performs a full clustering of the nodes. Using , we create a dataset where each entry represents an item available on Amazon.com along with a label corresponding to a category that the item belongs to. We take entries with a common label to represent a cluster and we consider the task of using pairwise oracle queries to identify a cluster of items within the top-10 clusters, from among the top-100 clusters in the entire dataset. Note that while our algorithm is tailored towards finding a heavy cluster, the algorithm in  proceeds by first learning the entire clustering and then identifying a large cluster. Some statistics of the dataset over which we ran our algorithm are: number of clusters - 100; number of nodes - 291925; size of largest cluster - 53551; and size of 11th largest cluster - 19390.
With a target confidence of 99%, our proposed algorithm terminates in pairwise queries while the naive algorithm takes queries ( more). The algorithm of  clusters all nodes first, and is thus expected to take around queries ( more) to terminate; this was larger than our computational budget and hence could not be run successfully. With a target confidence of 95%, our algorithm takes queries instead of the naive version which takes queries ( more) to terminate.
6 Variations
There are a few variations of the standard mode estimation problem which have important practical applications; we discuss them in the following subsections.
6.1 Top- estimation
An extension of the mode estimation problem discussed could be estimating the top- values, i.e., for a given , an algorithm should return the set of the heaviest elements. A possible application is in clustering, where we are interested in finding the largest clusters. Algorithms 1 and 2 for QM1 and QM2 would change only in the stopping rule: the new rule stops when there exist bins such that each of their LCBs is greater than the UCB of every remaining bin. In this setting, we define a -true top- estimator as an algorithm which returns the correct set with probability at least the target confidence. In the following results, we give bounds on the query complexity of a -true top- estimator for the QM1 query model.
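The modified stopping rule can be sketched as follows (an illustration assuming the confidence-bound arrays are given; the function name is ours):

```python
def top_k_stop(lcb, ucb, k):
    """Stop when the k bins with the highest LCBs each beat the UCB of
    every remaining bin; return those k indices, or None otherwise."""
    order = sorted(range(len(lcb)), key=lambda i: lcb[i], reverse=True)
    top, rest = order[:k], order[k:]
    if all(lcb[i] > ucb[j] for i in top for j in rest):
        return sorted(top)
    return None

# Bins 0 and 1 have separated from bin 2, so we can return the top-2 set.
print(top_k_stop([0.40, 0.30, 0.05], [0.50, 0.45, 0.20], 2))  # -> [0, 1]
```

Setting k = 1 recovers the original mode-estimation stopping rule.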
For a -true top- estimator , corresponding to Algorithm 1 for the top- case, we have the following with probability at least .
The proof follows along the same lines as Theorem 2. Here, for a bin and a bin , their confidence bounds would be separated when and . The calculations then follow similarly as before to give the above upper bound. ∎
For any -true top- estimator , we have,
The proof follows along the same lines as Theorem 3. Here, for , the alternate distribution that we choose would have and for some and some . We take the maximum value over all such to give the lower bound. ∎
6.2 Noisy oracle
Here, we consider a setting where the oracle answers queries noisily and we analyze the impact of errors on the query complexity for mode estimation.
For the noisy QM1 model, we assume that when we query the oracle with some index , the value revealed is the true sample value with probability and any of the values in the support with probability each. The problem of mode estimation in this noisy setting is equivalent to that in a noiseless setting where the underlying distribution is given by
Since the mode of this altered distribution is the same as the true distribution, we can use Algorithm 1 for mode estimation under the noisy QM1 model, and the corresponding query complexity bounds in Section 3 hold true.
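Assuming the noise model mixes the true label with a uniformly random label (our parameterisation of the description above), the revealed-label distribution and the preservation of the mode can be checked as follows:

```python
def noisy_qm1_distribution(p, p_err):
    """Distribution of the *revealed* label under the noisy QM1 oracle:
    the true value with probability 1 - p_err, else a uniformly random
    element of the support. The resulting mixture preserves the mode."""
    n = len(p)
    return [(1.0 - p_err) * pi + p_err / n for pi in p]

p = [0.5, 0.3, 0.2]
q = noisy_qm1_distribution(p, 0.1)
# The mode is unchanged, but every gap shrinks by the factor (1 - p_err),
# so the noiseless algorithm still works, just with more queries.
```

The gap shrinkage (1 - p_err) is where the noise shows up in the query complexity bounds, since those bounds scale inversely with the gaps.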
For the noisy QM2 model, we assume that for any pair of indices and sent to the oracle, it returns the correct answer ( if , otherwise) with probability . This setting is technically more involved and is in fact similar to clustering using pairwise noisy queries . The mode estimation problem in this setting corresponds to identifying the largest cluster, which has been studied in . Deriving tight bounds for this case is still open.
-  (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5 (1), pp. 1–122. Cited by: §1.1.
-  (1964) Estimation of the mode. Annals of the Institute of Statistical Mathematics 16 (1), pp. 31–41. Cited by: §1.1.
-  (2019-02) Top-m clustering with a noisy oracle. In 2019 National Conference on Communications (NCC) (NCC 2019), Bangalore, India. Cited by: §6.2.
-  (2012) Elements of information theory. John Wiley & Sons. Cited by: §3.4.
-  (2003) A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS) 28 (1), pp. 51–55. Cited by: §1.1.
-  (2016) On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research 17 (1), pp. 1–42. Cited by: Appendix C, Appendix C, §1.1, §3.4, §4.4.
-  (2014-06) SNAP Datasets: Stanford large network dataset collection. Note: http://snap.stanford.edu/data/com-Amazon.html Cited by: §5.
-  (2002) Approximate frequency counts over data streams. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases, pp. 346–357. Cited by: §1.1.
-  (2009) Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740. Cited by: Appendix A, Appendix A, §1, §3.2.
-  (2016) Clustering via crowdsourcing. arXiv preprint arXiv:1604.01839. Cited by: §1.1, §1.
- A theoretical analysis of first heuristics of crowdsourced entity resolution. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1.1, §1.
-  (2017) Clustering with noisy queries. In Advances in Neural Information Processing Systems, pp. 5788–5799. Cited by: §1.1, §1, §5, §5, §6.2.
-  (2017) Query complexity of clustering with side information. In Advances in Neural Information Processing Systems, pp. 4682–4693. Cited by: §1.1, §1.
-  (1982) Finding repeated elements. Science of computer programming 2 (2), pp. 143–152. Cited by: §1.1.
On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33 (3), pp. 1065–1076. Cited by: §1.1.
- Bounds for Kullback-Leibler divergence. Electronic Journal of Differential Equations 2016. Cited by: Appendix C, §3.4.
-  (2017) The simulator: understanding adaptive sampling in the moderate-confidence regime. In Proceedings of the 2017 Conference on Learning Theory, Cited by: Appendix C.
-  (2014) Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pp. 828–836. Cited by: §4.4.
-  (1944) On cumulative sums of random variables. The Annals of Mathematical Statistics 15 (3), pp. 283–296. Cited by: §3.4.
Appendix A Empirical Bernstein result
We use the following result, given in , to obtain the confidence bounds. Since we do not know the underlying distribution and its variance, we cannot use the standard Bernstein bound. The following result gives us a handle on the confidence bounds, using the empirical variance.
Let be i.i.d. random variables with values in and let . Then, with probability at least in the i.i.d. vector, we have
where is the sample variance
The following result, also given in , is used to establish bounds on the sample variance.
Let and be a vector of independent random variables with values in . Then for and we have,
Appendix B Details in proof of Theorem 2
Under the good event, the confidence intervals around the bin probabilities and the sample variances hold true for all pairs, and we derive the value of the time which ensures that the confidence intervals of the bins are separated enough to ensure that the true mode is returned.
For each bin created during the run of the algorithm, its UCB will definitely be less than the LCB of the first bin when and . This happens because
We find a time which satisfies the stopping condition for each bin created, and taking the maximum over these gives . For the bin we have,
Using the bound around the empirical variance and again choosing as before, we get,