1. Introduction
The CountMin sketch has proven to be one of the most effective sketches for obtaining approximate counts for pointwise queries and for computing approximate inner products. It is especially effective in the common data scenario where the count distribution is highly skewed.
However, there are notable cases where the sketch performs suboptimally or poorly. For example, when there are few heavy hitters and a large number of items relative to the size of the sketch, the CountMin sketch is highly biased and performs poorly compared to the Count sketch (charikar2002countsketch). This has led to a number of attempts (jin2003dynamically; lee2005improving; deng2007new; lu2008counter; chen2017bias) to improve estimation from the CountMin sketch in these regimes. In all cases but one (lu2008counter), these methods can be shown to perform worse than the basic CountMin estimator in some regimes or for some sketch parameter settings. The one method with guaranteed better accuracy, however, can only be applied in the highly restrictive and computationally expensive setting where all possible items are known and their counts jointly estimated. As a result, it is unclear to a practitioner which method to choose. Although several empirical studies (rusu2007statistical; cormode2008finding) have attempted to address this issue, choosing the best method has required a priori knowledge of the properties of the unseen data.
A second issue with the CountMin sketch is that although it has a probabilistic error guarantee, this guarantee is extremely loose and of no practical use when reporting the error of any query. Again, the only proposed method for obtaining errors of practical magnitude is given by (lu2008counter), where all counts must be decoded.
This paper introduces methods that provide better accuracy under all regimes and take the guesswork out of count estimation. The resulting estimator also has a tight, practical error bound. Furthermore, it can utilize joint estimation of multiple counts to yield more accurate results without needing to know the entire universe of items.
Our approach treats count estimation from the CountMin sketch as a statistical estimation problem where the irrelevant counts are modeled as error terms. The key idea is that the distribution of these error terms can be estimated from the sketch itself. Equipped with an error distribution, we consider two classes of estimators: ones which use the full likelihood information, and ad hoc estimators with some good properties. All existing estimators are shown to belong to the latter class. For these estimators, we show that bootstrap methods can be used to debias a wide class of estimators and to obtain tight confidence intervals that bound the error.
We propose two likelihood based estimators: the standard maximum likelihood estimator and a Bayesian estimator. The Bayesian estimator, while more computationally expensive, is proved to be optimal even when the sketch is of fixed depth. The more practical maximum likelihood estimator is empirically shown to outperform all other methods in all scenarios.
Key to the likelihood based methods is a nonparametric estimate of the error distribution. We show this can be accomplished with logconcave density estimation. This estimator has attractive properties as it requires no tuning parameters and yields a concave loglikelihood function that ensures maximum likelihood estimation is fast and easy. We further show that it generates robust count estimators even when the assumption of logconcavity is false.
In addition to the practical improvements motivated by theory, our work also advances our understanding of the CountMin and related sketches. We provide a brief survey of existing estimation algorithms and summarize the techniques they use. We show that unlike existing methods, which exploit only one or two of these techniques, our method is able to exploit all of them to obtain better results. Furthermore, we use asymptotic theory to explain under which regimes different count estimators and sketches perform well.
This understanding also has practical consequences for sketch construction. In particular, we find that, given a fixed space constraint, it is generally preferable to reduce the number of hash functions and increase the width of the sketch, as doing so increases the likelihood of being in the "superefficient" regime where the Min estimator achieves the optimal rate. When additional information about the error distribution is known a priori, we show how to optimally select these sketch parameters.
Our methods may also be applied to other sketches such as the Count sketch (charikar2002countsketch), also known as the FastAMS sketch when applied to inner product estimation (alon1999space).
The paper is structured as follows. First, we review the CountMin sketch and define the empirical error distribution relative to a pointwise query. Next, we give a brief survey of existing work on improving estimation for the sketch, provide insights into how these methods work, and show they can be generalized in natural ways. Section 5 then introduces the bootstrap, shows how simple statistics can be converted into unbiased estimators for the count, and gives procedures to construct tight error bounds. As simple statistics may not make full use of the information in the data, section 6 shows that the true likelihood can be estimated from the data and proposes estimators based on it. We also show that the resulting estimators have robust estimation properties and that they can be used to estimate multiple counts jointly through regression. Section 9 provides empirical results on real and synthetic data showing that our estimators are indeed the most accurate in a variety of settings and that the error bounds are tight. We then discuss asymptotics that aid our understanding of the sketch, applications to parameter tuning, and the use of our techniques with other sketches, in streaming settings, and for inner product estimation.
Throughout the paper we rely heavily on statistical estimation theory and concepts that we unfortunately do not have sufficient space to cover in detail. These concepts are the full-distribution counterparts to the tail probability and concentration inequality driven theory common in the sketching literature.
2. CountMin
The CountMin sketch compresses and aggregates a large and possibly unknown number of tuples into a finite sketch of numeric counters. It allows for two basic types of queries: 1) pointwise queries which provide an estimate of the aggregated count for any item or set of items, and 2) inner product queries which provide an estimate of the inner product $\langle \mathbf{a}, \mathbf{b} \rangle$ for count vectors $\mathbf{a}$ and $\mathbf{b}$ indexed by distinct items. We write the vector of counts indexed by item as $\mathbf{c}$. These two basic queries can be used to formulate more complex queries. For example, aggregated counts for range queries can be constructed out of pointwise queries that expand numeric valued items into membership in a set of dyadic ranges (cormode2005countmin). We focus on pointwise queries in this paper and briefly discuss the application of our techniques to the inner product case.
The CountMin summarization technique can be decomposed into two parts: the construction of the sketch and the estimation procedure for count queries. In this paper, we focus on improvements to estimation and not on sketch construction. For clarity, we will refer to the construction as the Count+ summarization and the estimator as the Min estimator. Here, the plus sign represents the onesided errors for the sketch.
A Count+ summarization consists of two parts: a hash based projection and replication. The first hashes each item to one of $w$ counters. The vector of observed counters is obtained by summing the counts in each bin. The second part simply replicates this process $d$ times with independent hashes. $d$ and $w$ are often referred to as the depth and width of the sketch.
More precisely, given a hash function $h_j$, the item, count pair $(i, c_i)$ updates the counter vector $C_j$ by the update rule

(1) $C_j[h_j(i)] \leftarrow C_j[h_j(i)] + c_i.$
This process is repeated $d$ times to obtain independent identically distributed (i.i.d.) vectors $C_1, \dots, C_d$ using independent hashes $h_1, \dots, h_d$.
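To make the construction concrete, the following is a minimal Python sketch of the Count+ summarization and the Min estimator. The class name, the salted use of Python's built-in `hash`, and the parameters are illustrative assumptions only; a production implementation would use pairwise independent hash families.

```python
import random

class CountPlusSketch:
    """Toy Count+ summarization: d replicates of w counters."""
    def __init__(self, depth, width, seed=0):
        self.d, self.w = depth, width
        rng = random.Random(seed)
        # One random salt per replicate stands in for an independent hash.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.counters = [[0] * width for _ in range(depth)]

    def _bucket(self, j, item):
        # Hash the (salt, item) pair into one of the w counters.
        return hash((self.salts[j], item)) % self.w

    def update(self, item, count=1):
        # Update rule (1): add the count to the hashed counter in every replicate.
        for j in range(self.d):
            self.counters[j][self._bucket(j, item)] += count

    def min_estimate(self, item):
        # Min estimator (2): minimum over replicates; can only overestimate.
        return min(self.counters[j][self._bucket(j, item)]
                   for j in range(self.d))

sk = CountPlusSketch(depth=3, width=64)
for _ in range(100):
    sk.update("heavy", 1)
sk.update("light", 1)
assert sk.min_estimate("heavy") >= 100  # upper bound on the true count
```

Since errors are sums of colliding nonnegative counts, the estimate is never below the true count, which is the bias this paper sets out to remove.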
Estimation from this sketch is simple and relies on the fact that counts are nonnegative. For any of the $d$ vectors $C_j$, the counter $C_j[h_j(i)]$ is an upper bound on the total count $c_i$ for item $i$. The original Min estimator for the CountMin sketch takes the minimum over the replicates

(2) $\hat{c}_i^{\min} = \min_{j = 1, \dots, d} C_j[h_j(i)].$
Several simple observations can be made from this construction and estimator. Only the counters that an item is hashed to contain any information about its count. Removing an item and its count from the Count+ summarization yields vectors of exchangeable error terms where the error terms are all nonnegative. The Min estimator is biased as it cannot underestimate the count. More formally, for any replicate $j$,

(3) $C_j[h_j(i)] = c_i + \varepsilon_j, \qquad \varepsilon_j \geq 0,$

where the $\varepsilon_j$ are identically distributed and exchangeable.
These observations motivate our basic strategy. Take counters which only contain error terms. Use them to empirically estimate a noncentered, nonnegative error distribution. An item’s counters plus the error distribution for those counters provides all the available information to estimate the item’s count. Apply statistical estimation techniques to estimate the count and obtain an error estimate. When the error distribution is correct, the resulting estimator is optimal.
Symbol  Definition
$\mathbf{c}$  Vector of all counts indexed by item
$\hat{c}_i$  Estimated count for item $i$
$n$  Number of distinct items
$\mathcal{I}_i$, $\mathcal{I}_\mathcal{S}$  Set of indices that $i$ or $\mathcal{S}$ are hashed to
$d$  Number of replicates in CountMin sketch
$w$  Number of counters in one replicate
$h_j$  Hash function for replicate $j$
$C$  CountMin counters
$C_{j,k}$  $k^{th}$ counter in replicate $j$
$\varepsilon$  Vector of errors (relative to some item $i$)
$F$, $\hat{F}$  True and empirical distribution of errors
$\Pi$, $\Pi_j$  Projection matrix for the sketch and for replicate $j$
$\lambda$  Expected number of items per counter, $n/w$
2.1. Linear algebra of the CountMin sketch
The Count+ summarization is an example of a linear sketch. In other words, each replicate is a random projection of the counts $\mathbf{c}$ where the construction of the projection $\Pi_j$ does not depend on $\mathbf{c}$. This may be expressed as

(4) $C_j = \Pi_j \mathbf{c}$

where $\Pi_j$ is a random binary matrix with precisely one nonzero value per column. More explicitly, $(\Pi_j)_{kl} = 1$ if $h_j(l) = k$ and $(\Pi_j)_{kl} = 0$ otherwise. For succinctness in notation we denote the concatenation of the $\Pi_j$ as simply $\Pi$ and likewise $C$ for the concatenation of the $C_j$. We also write $C_{j,k}$ for $C_j[k]$ and similarly for other indexed vectors.
Whenever only a subset $\mathcal{S}$ of items is of interest, the sketch has the form

(5) $C = \Pi_\mathcal{S} \mathbf{c}_\mathcal{S} + \Pi_{\mathcal{S}^c} \mathbf{c}_{\mathcal{S}^c}$

(6) $\;\;\;\;\; = \Pi_\mathcal{S} \mathbf{c}_\mathcal{S} + \varepsilon_\mathcal{S}.$
The equation representing the counters $C$ has the same form as a linear regression problem where the columns of $\Pi_\mathcal{S}$ are the known covariates and the counts $\mathbf{c}_\mathcal{S}$ are the unknown regression coefficients. The error terms are defined relative to the queried items $\mathcal{S}$. It differs slightly from typical linear regression problems in that the errors are not centered to have mean zero, and the distribution of the errors is not known or assumed. For notational convenience, we will simply write $\varepsilon$ for the error term $\varepsilon_\mathcal{S}$ as $\mathcal{S}$ is always clear from the context.

2.2. Empirical distributions
Given an item $i$ and a Count+ summary, only the $d$ counters that $i$ hashes to provide information about the count $c_i$. The remaining counters are draws from an error distribution. This large sample allows the error distribution to be accurately estimated and reduces the count estimation problem to a familiar problem of parameter estimation with a known error distribution.
Denote the unknown true error distribution's cumulative distribution function (c.d.f.) as $F$ and its density or mass function as $p$. When a random variable $X$ is drawn from a distribution with c.d.f. $G$, we write $X \sim G$. In the case of a pointwise query for a single item $i$, the distribution of a counter is $C_{j, h_j(i)} \sim F(\cdot - c_i)$. Estimating the count is a parametric estimation problem from the one-parameter location family of distributions $\{F(\cdot - c)\}$.

3. Existing work
Several existing improvements to the Min estimator have been proposed. The estimation techniques for the Count+ summary can be categorized into four basic ideas:

Bias reduction

Linear Regression

Support constraints

Robust objective choice
Each existing estimator exploits only one or two of these ideas. For example, the Min estimator exploits only the nonnegative support of the error distribution. The Median estimator exploits only a robust objective choice.
3.1. Debiasing
Most prior work (deng2007new; jin2003dynamically; chen2017bias) focuses on debiasing the estimator under different choices of objectives. We describe this debiasing operation with a more general procedure and list the choices made by each existing procedure. This allows us to extend debiasing to a large class of base estimators, such as any quantile.
Let $\mathcal{I}_i$ be the set of (replicate number, index) pairs that item $i$ is hashed to. Let $g$ be some function on a set of counters so that

(7) $g(x_1 + a, \dots, x_m + a) = g(x_1, \dots, x_m) + a.$
We refer to this as the translation property in this paper. Obvious examples of $g$ include the mean, minimum, median, and any quantile. These are also all special cases of minimizers of the form $g(x) = \arg\min_\mu \sum_k \rho(x_k - \mu)$. For the mean, $\rho(t) = t^2$, and for the median, $\rho(t) = |t|$, where $\rho$ is a robust loss function.
For any $g$ satisfying this property, $g(C_{\mathcal{I}_i}) - b$ is an unbiased estimate for $c_i$ when $b = \mathbb{E}\, g(\varepsilon_1, \dots, \varepsilon_m)$ is the bias under the error distribution. This yields a general method for constructing a debiased estimator: 1) choose a function $g$ with the translation property, and 2) find an empirical estimate $\hat{b}$ of the bias.
For the hCount* estimator (jin2003dynamically), $g$ remains the minimum. To estimate the bias, they explicitly query for a small set of items that are known to have count $0$ and take the average of the corresponding estimates.
For the CMM estimators (deng2007new), $g$ is taken to be the median. Rather than explicitly querying to find noise counters, they use the counters that do not contain the query key to estimate the bias. Since these counters have nearly the same distribution as the error terms regardless of the sketch dimensions $d$ and $w$, the resulting estimate is nearly unbiased.
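The median based debiasing can be sketched in a few lines of Python. This is a toy version with hypothetical inputs: it takes the item's counters and a sample of pure-noise counters as given, whereas the actual estimators select the noise counters from the sketch itself.

```python
import statistics

def median_debias_estimate(item_counters, error_counters):
    """Sketch of a CMM-style estimate: the median of the item's d
    counters, debiased by the median of counters known not to
    contain the item."""
    noise = statistics.median(error_counters)
    est = statistics.median(item_counters) - noise
    return max(est, 0)  # truncate: counts are nonnegative

# Toy example: true count 50, exchangeable nonnegative noise.
assert median_debias_estimate([53, 50, 57], [0, 2, 3, 5, 7, 1, 4, 3]) == 50.0
```

Note that the truncation at zero is applied only after debiasing, since a truncated base statistic no longer has the translation property.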
Bias Aware estimation (chen2017bias) proposes other debiased Median and Mean estimators. These differ from other debiasing methods in that they use information not contained in the sketch itself. Rather than directly applying the mean or median to the set of relevant counters, they compute "debiased counters" $C_{j,k} - n_{j,k} \bar{b}$, where $n_{j,k}$ is the number of items hashed to counter $(j,k)$ and $\bar{b}$ is a per-item bias estimate. The resulting statistic has the translation property and does not need further debiasing. However, computing this requires knowing the $n_{j,k}$ and being able to iterate over the universe of distinct items.
3.2. Regression and Support Constraints
When multiple items' counts are estimated together, estimation can be improved. One item's estimate can reduce the error for another item when there is a hash collision. More formally, equation 5 shows that adding elements to the set $\mathcal{S}$ of desired item counts reduces the number of items contributing to the error term. When the added items are heavy hitters, this can substantially reduce the magnitude of the error. The choice of regression model is thus dictated by what one knows about the universe of items and assumptions about the unknown error distribution.
Under the assumption that the error distribution is normal and only a subset $\mathcal{S}$ of items is known, one recovers the linear least squares method of (lee2005improving). This is equivalent to the solution of the maximization problem

(8) $\hat{\mathbf{c}}_\mathcal{S} = \arg\max_{\mathbf{c}_\mathcal{S}} -\left\| C - \Pi_\mathcal{S} \mathbf{c}_\mathcal{S} \right\|_2^2.$
In the case where all item counts are jointly estimated and the linear system is overdetermined, the leastsquares estimator finds the exact counts.
If the entire universe of items is known, the Counter Braids estimation algorithm (lu2008counter) is guaranteed to be no worse than the Min estimator and can often recover the exact counts. The Counter Braids estimator does so via a message passing algorithm that provides deterministic upper and lower bounds on the estimated counts. We show in appendix A.1 that this algorithm can be formulated as a standard optimization problem: it is a cutting plane algorithm (kelley1960cutting) for finding the feasible set of an optimization problem, and the feasible set exploits the nonnegative support of the error distributions.
Exploiting ideas from both methods yields the general class of regression based procedures that solve the constrained optimization problem

(9) $\hat{\mathbf{c}}_\mathcal{S} = \arg\min_{\mathbf{c}_\mathcal{S} \geq 0,\ \Pi_\mathcal{S} \mathbf{c}_\mathcal{S} \leq C} \sum_k L\left( C_k - (\Pi_\mathcal{S} \mathbf{c}_\mathcal{S})_k \right)$

where $L$ is some loss function. Section 6 will show that an estimated loglikelihood function yields a good loss function.
4. Our methods
When the problem is fully modeled by a statistical model, the four techniques listed in the previous section can be simplified into two: linear regression and modeling the error distribution. The error distribution encodes the bias, the support, and the optimal objective function to use for count estimation. In addition, knowledge of the error distribution yields the exact sampling distribution of an estimator and corresponding tight confidence intervals (CIs).
We propose two methods based on nonparametric modeling of the error distribution. First, we propose a class of bootstrap estimators. This class of estimators can be based off statistics that are fast and easy to compute and implement. It covers all existing debiased estimators and allows for the easy generation of others such as estimators based on other quantiles or trimmed means. Second, we propose full likelihood based estimators based on an empirical estimate of the error density or mass function. These methods can incorporate regression techniques to exploit information about the universe of items.
5. Bootstrap Estimators
Debiasing and computing tight error bounds require knowing the distribution of the base statistic $g$. The bootstrap (efron1979bootstrap) estimates this distribution by resampling observations and examining the distribution of the results on the simulated samples. The naïve bootstrap will not work since there are only a small number of relevant counters to resample. However, when a statistic has the translation property, one can instead sample from the error counters.
Theorem 5.1 shows that when this is done, any $g$ with the translation property can be turned into an unbiased estimator of an item's count. Existing debiased estimators can be seen as instances of this bootstrapping procedure. While our analysis suggests easier ways to compute the bias and yields new estimators, our primary contribution is applying the bootstrap to yield tight confidence intervals and applying it to new base statistics $g$. We also address computational issues that arise with the bootstrap and show that biases and confidence intervals for estimators based on the minimum value or any quantile can be recovered without resorting to an expensive Monte Carlo simulation.
Theorem 5.1.
Let $g$ be any function that satisfies the translation property. Consider an item $i$ and the collection $\mathcal{I}_i$ of indices that $i$ is hashed to. Consider the empirical distribution $\hat{F}$ of the counters excluding those in $\mathcal{I}_i$, and denote expectation under this distribution by $\mathbb{E}_{\hat{F}}$. Let $\varepsilon_1^*, \dots, \varepsilon_d^*$ be i.i.d. draws from this distribution. Then,

(10) $\hat{c}_i = g(C_{\mathcal{I}_i}) - \mathbb{E}_{\hat{F}}\, g(\varepsilon_1^*, \dots, \varepsilon_d^*)$

is an unbiased estimator for the count $c_i$.
Proof.
Let $i$ be a randomly chosen item with count $c_i$. Denote by $\varepsilon$ the vector of error terms for item $i$. By symmetry, $\varepsilon$ is distributed as i.i.d. draws from the empirical error distribution $\hat{F}$. Since $g(C_{\mathcal{I}_i}) = c_i + g(\varepsilon)$ whenever $g$ has the translation property, $\mathbb{E}\, g(C_{\mathcal{I}_i}) = c_i + \mathbb{E}_{\hat{F}}\, g(\varepsilon_1^*, \dots, \varepsilon_d^*)$. Hence, $\mathbb{E}\, \hat{c}_i = c_i$. ∎
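The debiasing step of Theorem 5.1 can be sketched in Python using Monte Carlo resampling to approximate the expectation under the empirical error distribution. The function name and inputs are illustrative; section 5.2 shows the expectation has a closed form for the minimum and other quantiles, so the resampling loop is not always necessary.

```python
import random

def bootstrap_debias(g, item_counters, error_counters, n_boot=2000, seed=1):
    """Debias a translation-equivariant statistic g: subtract the
    expected value of g on i.i.d. draws from the empirical error
    distribution, estimated by Monte Carlo resampling."""
    rng = random.Random(seed)
    d = len(item_counters)
    bias = sum(
        g([rng.choice(error_counters) for _ in range(d)])
        for _ in range(n_boot)
    ) / n_boot
    return g(item_counters) - bias

# Debias the Min estimator on toy data with nonnegative errors.
errors = [0, 1, 1, 2, 3, 5, 8, 0, 2, 4]
est = bootstrap_debias(min, [60, 58, 62], errors)
assert est < 58  # strictly below the raw (upward biased) Min estimate
```

The same call works unchanged with `statistics.median` or any quantile as the base statistic `g`.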
While this theorem constructs an unbiased estimator out of any base statistic that satisfies the translation property, it is possible for the resulting estimate to be negative. When counts are always nonnegative, it is sensible to truncate the estimate at 0 to ensure all estimates are nonnegative as well. This results in a slightly biased estimator. We apply this truncation to all estimators, and hence refer to them as debiased and not unbiased estimators. We also note that the base statistic cannot be a truncated statistic. Otherwise, it cannot have the translation property.
5.1. Tight error estimation
Theorem 5.2 shows the bootstrap can be used to construct confidence intervals that have the correct finite sample coverage in all situations. A trivial corollary shows the resulting confidence intervals are tight. A confidence interval $[L, U]$ for $c_i$ at level $1 - \alpha$ is a probabilistic error bound which guarantees that $P(L \leq c_i \leq U) \geq 1 - \alpha$. If expressible as $[\hat{c}_i - \delta_{lo}, \hat{c}_i + \delta_{hi}]$, the error guarantee is of the form $P(-\delta_{hi} \leq \hat{c}_i - c_i \leq \delta_{lo}) \geq 1 - \alpha$.
Theorem 5.2.
Let $q_\alpha$ be the $\alpha$ quantile of the empirical distribution of $g(\varepsilon_1^*, \dots, \varepsilon_d^*)$. The interval $[\, g(C_{\mathcal{I}_i}) - q_{1 - \alpha/2},\ g(C_{\mathcal{I}_i}) - q_{\alpha/2} \,]$ is a $1 - \alpha$ confidence interval for the count $c_i$. The coverage of the interval is $\hat{G}(q_{1-\alpha/2}) - \hat{G}^-(q_{\alpha/2})$ where $\hat{G}$ is the empirical distribution of $g(\varepsilon_1^*, \dots, \varepsilon_d^*)$ and $\hat{G}^-$ denotes the probability a draw from the empirical distribution is strictly less than its argument rather than less than or equal to it.
Proof.
By symmetry of the errors, $g(C_{\mathcal{I}_i}) - c_i$ is distributed as $g(\varepsilon_1^*, \dots, \varepsilon_d^*)$, so $P(q_{\alpha/2} \leq g(C_{\mathcal{I}_i}) - c_i \leq q_{1-\alpha/2}) \geq 1 - \alpha$. Substituting and rearranging gives the desired result. ∎
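The pivoting argument in this proof translates directly into code. The following is a toy Python sketch that forms the interval from quantiles of the base statistic applied to resampled errors; the inputs and Monte Carlo approximation are illustrative assumptions.

```python
import random

def bootstrap_ci(g, item_counters, error_counters, alpha=0.1,
                 n_boot=5000, seed=1):
    """Confidence interval for the count: pivot g(counters) by
    quantiles of g applied to resampled error counters."""
    rng = random.Random(seed)
    d = len(item_counters)
    sims = sorted(
        g([rng.choice(error_counters) for _ in range(d)])
        for _ in range(n_boot)
    )
    lo = sims[int((alpha / 2) * n_boot)]            # alpha/2 quantile
    hi = sims[min(int((1 - alpha / 2) * n_boot),    # 1 - alpha/2 quantile
                  n_boot - 1)]
    base = g(item_counters)
    return base - hi, base - lo

errors = [0, 1, 2, 2, 3, 4, 6, 0, 1, 3]
lower, upper = bootstrap_ci(min, [41, 40, 44], errors)
# Errors are nonnegative, so the upper endpoint never exceeds the raw Min.
assert lower <= upper <= 40
```

Because the error quantiles are shared across queries, the simulation can be run once and reused for every count estimate, as discussed in section 5.2.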
Corollary 5.3.
Any shorter interval, with width less than $q_{1-\alpha/2} - q_{\alpha/2}$, has coverage strictly less than $1 - \alpha$.
Proof.
Immediate, since the quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$ are attained by the empirical distribution. ∎
Past theoretical work effectively derives an extremely loose power calculation. It finds a sketch size that guarantees an error less than a desired "effect size" $\epsilon$ with probability at least $1 - \delta$ for some constants $\epsilon, \delta$. The precise constants needed to compute the needed sketch size depend on the unknown true counts. A simple search procedure allows one to convert power calculations, which map error tolerances $(\epsilon, \delta)$ to a sketch size, to confidence intervals, which map a sketch size to $(\epsilon, \delta)$, and vice versa. In section 8, we examine the power calculations for our tight confidence intervals to find optimal settings of the sketch parameters.
The confidence bound corresponding to the existing theory is obtained using Markov's inequality: $P(\varepsilon > t) \leq \mathbb{E}\,\varepsilon / t = N / (wt)$, where $N$ is the total count summed over all items. Setting this equal to the desired error probability gives a confidence interval with width proportional to $N / w$ per replicate. Given a fixed memory budget $m = dw$, this can be expressed as a onesided interval below the Min estimate. If one chooses the depth $d$ to optimize this interval, the interval width is of order $N \log(1/\delta) / m$.
This interval does not account for the shape of the count distribution and depends only on the total count $N$. As a result, a single heavy hitter can result in an arbitrarily wide confidence interval, even though the vast majority of items are not affected. While more refined analyses that account for the top heavy hitters have been proposed (cormode2005summarizing; cormode2012synopses), these require knowing the heavy hitters or a strong assumption of Zipf distributed counts. Neither is easily estimated or verified from the sketch alone. In contrast, our bootstrap confidence intervals automatically account for the entire shape of the count distribution, including the heavy hitters, and do so with only knowledge that is readily available from the sketch.
The improvement offered by tight error bounds is significant, as the practical performance of the CountMin and Count sketches often differs greatly from the theoretical bounds (rusu2007statistical; minton2014improved). In the context of inner product estimation, these bounds yielded error estimates that were many times larger than the true errors, though we see more modest differences in our pointwise queries. Figure 1 shows the disparity between our empirically driven confidence intervals and the existing Markov inequality based confidence intervals. Especially for heavy tailed distributions, the Markov inequality based intervals are often an order of magnitude larger than our intervals.
It also shows that the actual coverage of the estimators matches or exceeds the desired coverage. The coverage exceeds the desired coverage primarily when the intervals are narrow. In these cases, we verified that the excess coverage is due to the discrete jumps in probability in a discrete distribution. Attempts to shorten the intervals yielded insufficient coverage. For example, reducing the intervals slightly on each side, effectively turning the interval from a closed to an open interval for discrete counts, reduced the empirical coverage for the MLE estimator's CI to less than the advertised coverage. Thus, the empirical results verify the theory which states the intervals are as tight as possible.
5.2. Computation
Bootstrap quantities can pose some computational difficulty as they are typically calculated via Monte Carlo simulation. However, in some cases the quantities can be computed directly from the empirical distribution (efron1994introduction). In particular, the mean and distribution of an order statistic can be easily approximated. The $k^{th}$ order statistic of a set of items is the $k^{th}$ smallest value in that set. For example, the Min estimator is an order statistic as it is the smallest value in a set of $d$ values.
This can be done by relating the distribution of the order statistics of $\hat{F}$ distributed random variables to those of $\mathrm{Uniform}(0,1)$ random variables. Recall that the inverse c.d.f. transform generates an $\hat{F}$ distributed random variable from a uniform random variable via $X = \hat{F}^{-1}(U)$ for $U \sim \mathrm{Uniform}(0,1)$. Since $\hat{F}^{-1}$ is monotone, the $k^{th}$ order statistic satisfies $X_{(k)} = \hat{F}^{-1}(U_{(k)})$. The distribution of $U_{(k)}$ is well known and is $\mathrm{Beta}(k, d - k + 1)$.

When applied to debiasing operations, this gives the bias $\mathbb{E}\, X_{(k)} = \mathbb{E}\, \hat{F}^{-1}(U_{(k)})$. In particular, the Min estimator can be debiased using the estimated bias $\mathbb{E}\, \hat{F}^{-1}(U_{(1)})$ where $\hat{F}$ is the empirical distribution of the errors and $U_{(1)} \sim \mathrm{Beta}(1, d)$. More importantly, an exact confidence interval can be computed directly from the distribution of $U_{(k)}$ by using an outer confidence interval (meyer1987outer).
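The expected minimum can thus be computed exactly from the empirical distribution, with no Monte Carlo. The toy Python function below uses the equivalent identity $P(\min > x) = (1 - \hat{F}(x))^d$; the function name and inputs are illustrative.

```python
def exact_min_bias(error_counters, d):
    """Exact E[min of d i.i.d. draws from the empirical distribution
    of the error counters], via P(min > x) = (1 - F(x))^d."""
    n = len(error_counters)
    expectation, prev_surv = 0.0, 1.0
    for v in sorted(set(error_counters)):
        # Empirical survival probability just after value v.
        surv = sum(1 for e in error_counters if e > v) / n
        # P(min == v) = (survival before v)^d - (survival after v)^d.
        expectation += v * (prev_surv ** d - surv ** d)
        prev_surv = surv
    return expectation

# Sanity check: with a single replicate the bias is just the mean error.
errors = [0, 1, 2, 3, 4]
assert abs(exact_min_bias(errors, 1) - 2.0) < 1e-9
assert exact_min_bias(errors, 3) < 2.0  # more replicates shrink the bias
```

Subtracting this quantity from the raw Min estimate yields the debiased Min estimator without any resampling.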
For the Min estimator, a "onesided" confidence interval for the error is obtained from $\hat{F}^{-1}(u_\alpha)$, where $u_\alpha$ is the $\alpha$ quantile of a $\mathrm{Beta}(1, d)$ distribution. This leads to algorithm 2, which debiases the Min estimator and provides a confidence interval. We refer to this as a "onesided" confidence interval since the upper bound cannot be violated. A twosided interval for the Min or any quantile estimator can be similarly estimated. For the $k^{th}$ order statistic, compute a confidence interval for $U_{(k)}$. Theorem 5.2 gives that the transformed interval is a confidence interval for the estimate. For implementation purposes, note that $X_{(k)}$ is the base estimator prior to debiasing.
Even when the bootstrap quantities cannot be directly computed from the distribution of error counters, they can be computed just once and applied to all count estimates. Since quantiles are always robust and most estimators that we consider are also robust to large errors, there is little difference in estimating the bias and interval using all counters rather than only the counters that do not contain a given item. This yields algorithm 1 which debiases an estimator and returns a confidence interval.
6. Likelihood based estimation
For the bootstrapped estimators, the procedures directly resample from the error distribution without estimating the distribution itself. With likelihood based methods, it is necessary to estimate this distribution. By doing so, one is able to apply the statistical machinery for efficient estimation and inference.
We derive the error distribution and show how to estimate it nonparametrically and without any additional tuning parameters. This allows the easy application of maximum likelihood estimation as well as Bayes optimal estimation. Furthermore, the likelihood based approaches provide a framework for performing joint estimation of counts via regression to obtain even more accurate estimates.
6.1. Logconcave density estimation
To ensure good performance under all possible count distributions, we use a nonparametric estimate of the error distribution. We do this under the assumption that the distribution of the errors is logconcave. The concavity has the added benefit that the continuous relaxation of the maximum likelihood objective is easily maximized by standard concave maximization algorithms. Furthermore, unlike other nonparametric methods such as kernel density estimation, a logconcave density has a consistent maximum likelihood estimator (dumbgen2009maximum) that requires no tuning parameters such as the bandwidth.

Logconcave densities cover many common distributions, including the Poisson, Binomial, Exponential, Normal, and NegativeBinomial, among others. We remark that heavy tailed distributions with polynomially decaying tail probabilities have a log density or log mass function that is convex in the tails rather than concave. In this case, we compute a logconcave projection of the trimmed density, which results in linearly decaying tails. As shown in section 6.4, the resulting objective function is a robust objective which can perform well even when the assumptions are not met. It is similar to Huber's estimator which combines the quadratic loss associated with the mean estimator with the linear loss of the median or other quantiles.
We further note that in many commonly used distributions where the logconcavity assumption is invalid, the density or mass function is monotone decreasing. Though nonparametric density estimators for decreasing densities exist, they are unnecessary for the purposes of this paper. For a decreasing density with unbounded support, the Min estimator is the MLE. We make this precise in Theorem 6.1 and in Theorem 6.2, which states that the logconcave projection of a decreasing density is decreasing.
We are not aware of precise statements on the computational complexity of the logconcave density estimation algorithms. However, the final estimate of the log density is always a linear spline. Estimating the density with a spline is an optimization problem with constraints equal to the number of knots. We find that our final solutions typically have a small number of knots, 10 to 40, so that fitting the density is inexpensive.
Theorem 6.1.
Let $X_1, \dots, X_d$ be i.i.d. random variables with distribution $p(\cdot - c)$ for some decreasing density or mass function $p$ supported on $[0, \infty)$ or the nonnegative integers. The maximum likelihood estimator for $c$ given $X_1, \dots, X_d$ is $\hat{c} = \min_j X_j$.
Proof.
This trivially follows from comparing the likelihood at $c = \min_j X_j$ to the likelihood at any other point. ∎
Theorem 6.2.
Let $f$ be a probability mass function with finite entropy and $f^*$ be its logconcave projection. It follows that $f^*$ is decreasing.
Proof.
Given in appendix. ∎
6.2. Maximum likelihood estimation
When the error density $p$ is known, the maximum likelihood estimate (MLE) for the count is given by

(11) $\hat{c}_i = \arg\max_c \sum_{(j,k) \in \mathcal{I}_i} \log p(C_{j,k} - c)$

where $\mathcal{I}_i$ is the set of counters that item $i$ hashes to.
Although the likelihood accounts for shifts in the error distribution, the maximum likelihood estimator is still often biased. However, the estimator is of the form given in section 3.1, and hence, it can be fully debiased by the bootstrap procedure in section 5. Empirical results show this additional debiasing step is important for obtaining the best performing estimator, as shown in figure 5. Computation in this case can be moderately expensive, however, as there is no analytic form for the sampling distribution of the estimator, unlike for the Min estimator.
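A toy Python sketch of the maximization in equation 11 for integer counts follows. For simplicity it plugs in the raw empirical error mass function with a small hypothetical smoothing floor for unseen values, rather than the logconcave fit the paper uses, and it searches candidate counts by brute force.

```python
from collections import Counter
import math

def mle_count(item_counters, error_counters):
    """Maximize the summed log-likelihood of (counter - c) under an
    estimated error mass function, over integer candidates c."""
    n = len(error_counters)
    pmf = {v: k / n for v, k in Counter(error_counters).items()}
    floor = 0.5 / n  # hypothetical floor probability for unseen errors

    def loglik(c):
        return sum(math.log(pmf.get(x - c, floor)) for x in item_counters)

    # Errors are nonnegative, so c can be at most the smallest counter.
    candidates = range(0, min(item_counters) + 1)
    return max(candidates, key=loglik)

errors = [0, 1, 1, 2, 2, 2, 2, 3, 3, 4]
assert mle_count([12, 11, 13], errors) == 10
```

With the estimated logconcave density the inner objective is concave in `c`, so the brute-force search can be replaced by standard concave maximization.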
6.3. Regression algorithm
The same maximum likelihood approach can be applied for joint estimation of counts by applying linear regression with the estimated error distribution. The maximum likelihood estimate for a set of items $\mathcal{S}$ that are hashed to indices $\mathcal{I}_\mathcal{S}$ is given by a maximizer of the objective

(12) $\ell(\mathbf{c}_\mathcal{S}) = \sum_{(j,k) \in \mathcal{I}_\mathcal{S}} \log p\left( C_{j,k} - (\Pi_\mathcal{S} \mathbf{c}_\mathcal{S})_{j,k} \right).$
When many counts are jointly estimated, $\mathcal{I}_\mathcal{S}$ may be close to the size of the sketch. In this case, there are few counters containing purely error terms, and an estimate of the error distribution must utilize the information in the counters in $\mathcal{I}_\mathcal{S}$ as well. This turns equation 12 into a joint maximization problem over both $\mathbf{c}_\mathcal{S}$ and logconcave densities $p$, and requires extending the sum over the previously irrelevant counters:

(13) $(\hat{\mathbf{c}}_\mathcal{S}, \hat{p}) = \arg\max_{\mathbf{c}_\mathcal{S},\ p\ \mathrm{logconcave}} \sum_{j,k} \log p\left( C_{j,k} - (\Pi_\mathcal{S} \mathbf{c}_\mathcal{S})_{j,k} \right).$
The maximizer for this objective is known to exist and to be consistent (dumbgen2010approximation). However, the optimization problem is only biconvex in general. To estimate the maximizer, we alternate between maximizing over $\mathbf{c}_\mathcal{S}$ and over $p$.
6.4. Robust statistics
When the data is heavy tailed, the estimated logconcave objective mirrors those used in robust statistics. In this case, the trimmed logconvex tail of a heavy tailed distribution is projected to a logconcave density. This results in linear tails in the estimated error log density. These linear tails are extended so that the estimated log density has unbounded support. Objective functions with such linear tails are robust: they are insensitive to the actual value that an outlier takes. For example, consider a continuously differentiable objective $\ell(c) = \sum_k \phi(C_k - c)$ where $\phi$ is the log density, linear on the right tail. The maximizer satisfies $\ell'(c) = 0$. For an outlier $C_k$, the derivative $\phi'(C_k - c)$ is constant for all reasonable values of $c$. Thus, the value of $C_k$ has no effect on the solution beyond the fact that it is large. For nondifferentiable objectives a similar argument applies to subgradients.

Figure 2 illustrates this by showing the true log mass function for a sketch with an average of $\lambda$ items hashed to each counter, with item counts drawn from a heavy tailed distribution. This is compared to the corresponding logconcave estimate computed on a right trimmed sample for three different levels of trimming. The estimated and true distributions match well except at the logconvex right tail. In that region, the estimated distribution linearizes the tail to ensure concavity. The trimming changes the sensitivity of the resulting objective to large counts. It is included since otherwise the logconcave projection is not well defined when there is a logconvex tail with unbounded support. We trim the largest error values in our experiments.
6.5. Counter Distribution
We derive the exact asymptotic counter distribution given some unknown parameters. The significance of this derivation is that 1) it allows one to understand when the estimation assumptions are reasonable, 2) it allows one to easily compute how the error distribution changes as sketch parameters are changed, and 3) it allows us to make precise the conditions under which our Bayesian estimator is optimal.
Under the assumption that each hash generates a completely random mapping, items are assigned to any given counter with very small probability. It follows from the Poisson limit theorem that the number of items in each counter is asymptotically Poisson distributed as the number of items and the number of counters grow proportionally.
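As a quick sanity check of this Poisson limit (a simulation sketch with hypothetical sizes, not drawn from the paper's experiments), one can assign items to counters uniformly at random and compare counter occupancy against the Poisson pmf:

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
n_items, n_counters = 30_000, 10_000   # hypothetical sizes; rate = 3
assign = rng.integers(0, n_counters, size=n_items)   # "hash" each item
occupancy = np.bincount(assign, minlength=n_counters)

lam = n_items / n_counters
# Largest gap between empirical occupancy fractions and Poisson(3) pmf.
max_abs_err = max(
    abs(float(np.mean(occupancy == k)) - exp(-lam) * lam ** k / factorial(k))
    for k in range(8)
)
```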
Suppose the true counts have probability mass function , and excluding item , the number of items assigned to counter in replicate is denoted . This leads to the following asymptotic observational model for a single replicate in the sketch. When ,
(14)  
(15)  
(16) 
where denotes the convolutional power, that is, the distribution of the sum of i.i.d. random variables. The error distribution is thus a compound Poisson distribution. We denote this error distribution by and its corresponding density or probability mass function by .
In general, neither of these is known. Rather than estimating them, we directly estimate the error distribution nonparametrically under an assumption of logconcavity. Sufficient conditions for logconcavity of the error distribution are provided by Theorem 5.5 in (johnson2013log, ), which we restate here.
Theorem 6.3 (Sufficient conditions for logconcavity).
Let be a mass function supported on the positive integers, and be the corresponding sizebiased measure. A distribution is logconcave if is logconcave and .
Note that logconcavity of the count distribution implies logconcavity of the corresponding sizebiased measure, so if the underlying count distribution is logconcave, then so is the error distribution for a sufficiently large rate. Of particular note is the Negative Binomial distribution, which can be expressed as a compound Poisson distribution.
In this paper, the most useful property of the compound Poisson distribution is given in Lemma 6.4, which states that the distribution resulting from increasing the rate can be expressed using convolution. The resulting distribution on a bounded interval can be quickly computed using a Fast Fourier Transform. We demonstrate how this can be used to choose appropriate tuning parameters in section 8.
Lemma 6.4 ().
Let be the mass function or density of a distribution. Then, is the mass function or density of a distribution.
Proof.
This follows trivially from the superposition theorem for Poisson processes (kingman1993poisson, ). ∎
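Lemma 6.4 can be exercised directly. The snippet below is a minimal illustration using a plain Poisson base pmf (a compound Poisson whose jumps all have size one); tripling the rate is computed as a 3-fold convolution power via an FFT, with the support truncated at 64, where the neglected tail mass is negligible:

```python
import numpy as np
from math import exp, factorial

N = 64                      # truncated support; tail mass is negligible here
lam = 1.0
base = np.array([exp(-lam) * lam ** j / factorial(j) for j in range(N)])

k = 3                       # increase the rate by a factor of k
spec = np.fft.rfft(base, n=N)
conv_pow = np.fft.irfft(spec ** k, n=N)   # k-fold convolution in O(N log N)

# By the lemma, this should match the pmf at triple the rate: Poisson(3).
poisson3 = np.array([exp(-k * lam) * (k * lam) ** j / factorial(j)
                     for j in range(N)])
```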
6.6. Bayesian estimation
Since our procedure produces a likelihood function, it is natural to consider the resulting Bayesian estimator given a prior. In this case, it is possible to make precise statements about the optimality of the estimator.
Given a prior distribution for the unknown count and error density , the posterior distribution for is given by
(17) 
where the product runs over the counters that the item hashes to. By simply replacing the true error density with the estimated one, one obtains an estimated posterior. Given a loss function, the optimal Bayesian estimator is the minimizer
(18) 
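To make equations 17 and 18 concrete, here is an illustrative sketch of the posterior computation (all names are hypothetical, the geometric-shaped error pmf stands in for the estimated log-concave error density, and the prior is uniform):

```python
import numpy as np

def count_posterior(counters, err_pmf, prior):
    """Posterior over the true count given the counters an item hashes
    to, an error pmf on {0, 1, ...}, and a prior on the count.
    All names and inputs here are illustrative."""
    ns = np.arange(len(prior))
    log_post = np.log(np.maximum(prior, 1e-300))
    for c in counters:
        e = c - ns                                  # implied error per candidate count
        ok = (e >= 0) & (e < len(err_pmf))
        pe = np.where(ok, err_pmf[np.clip(e, 0, len(err_pmf) - 1)], 0.0)
        log_post = log_post + np.log(np.maximum(pe, 1e-300))
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Toy inputs: geometric-shaped error pmf on {0,...,9}, uniform prior on 0..49.
err = 0.5 ** np.arange(1, 11)
err = err / err.sum()
post = count_posterior([12, 14, 13], err, np.full(50, 1 / 50))
n_map = int(np.argmax(post))                 # MAP estimate
n_mean = float(np.dot(np.arange(50), post))  # posterior-mean estimate
```

On this toy input the MAP estimate coincides with the Min estimator (the smallest counter), while the posterior mean is pulled slightly below it.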
This leads to the optimality result in theorem 6.5. In simple terms, it states that if the number of replicates and average number of distinct items per counter stays the same but the number of error counters goes to infinity, then the Bayes optimal estimator using the approximate posterior converges to the true optimal estimator in probability.
Theorem 6.5 ().
Let be a sequence of infinitely exchangeable counts with bounded marginal mass function . Consider a sequence of Count+ summaries on the first counts where the sketch parameters are fixed and the width grows so that the rate converges. Let be the mass function of the limiting error distribution and be its c.d.f. Let be the optimal Bayes estimator given in equation 18 using a bounded loss function, and be the estimator using the approximate posterior obtained by estimating the error density using the maximum likelihood logconcave density estimator and an atomic mass at zero. Assume the error density is logconcave and has finite entropy. Further assume that the objective has a well separated maximum with probability 1; that is, estimates bounded away from the maximizer have objective values bounded away from the maximum. Then,
(19) 
Proof.
Given in the appendix. ∎
We note that this optimality result is a strong finite sample result, as only finitely many counters contain an item’s count, rather than an asymptotic optimality result or the even weaker rate results that are typical in the literature. Only finitely many replicates are observed for each item of interest.
7. Asymptotics
There is a rich body of work on asymptotics that aids in understanding what makes a count estimator statistically efficient. The more general form of the estimation problem is to find the true count from a set of observations drawn from a shifted error density with nonnegative support. Such a problem is sometimes referred to as endpoint estimation. A number of works, including (hall1982endpoint, ), (woodroofe1974, ), (cooke1979statistical, ), focus on the difficult case where the density vanishes at 0. More specifically, they consider densities that behave like a power of the distance to the endpoint. In this case, one finds three regimes. When the density drops off sharply near 0, the Min estimator is nearly optimal. When the density decays slowly near 0, the best possible rate using only points close to the minimum is worse than the rate achieved by the mean; in this case the support constraint provides little value. In the intermediate regime, one can achieve an improved rate using maximum likelihood estimation on the items closest to the minimum.
These three regimes are illustrated in figure 3, which shows the behavior of various estimators under different truncations of a normal distribution. For truncations far into the tail of the normal distribution, the Mean estimator is optimal. For truncations near the mode, a debiased Min estimator is near optimal. For truncations in between, both the Mean and Min estimators deviate from the optimal estimator.
8. Tuning sketch parameters
Although our methods take the guesswork out of what estimation procedure to choose, the sketch creator must still choose the number of replicates and the number of counters per replicate, or width. The original CountMin paper (cormode2005countmin, ) suggests choosing these to minimize the space required to achieve a desired error guarantee; their error bound yields a suggested width and depth as functions of the guarantee. It has been suggested (cormode2012synopses, ) that the depth is typically small in practice but can be smaller still (cormode2008finding, ) without obvious ill effects. Several industry implementations, such as the RedisLabs module (rediscountmin, ), choose a fixed small default.
The previous suggestion finds the smallest sketch that will guarantee a certain confidence level and interval width based on a loose confidence bound. The same can be applied to our tight confidence intervals. We demonstrate how this can be done efficiently without trial and error by using the counter distribution from section 6.5.
We first consider the natural case where there is a fixed memory budget and one desires the smallest interval width. As the asymptotic theory suggests that the regime where the Min estimator is optimal or near optimal is the best one, it is sensible to minimize the width of the Min estimator’s interval. Let one distribution function describe the error distribution, which depends on the distribution of item counts, and a second describe the minimum of the errors across replicates. Given a desired confidence level for the onesided confidence interval, the choice of depth is
(20) 
This is easily computed from a single Count+ summary. The summary provides the error distribution at its own rate along with a corresponding density estimate. Lemma 6.4 gives that the error distribution for any other choice of parameters can be computed as a convolutional power, which can be efficiently computed using a Fast Fourier transform. Figure 4 illustrates how the interval width changes with the depth for a range of confidence levels and a fixed memory budget.
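The depth selection can be sketched as follows, under simplifying assumptions: the base error pmf is taken to be Poisson with one item per counter, the per-counter rate is assumed to be an integer so the convolution power is a plain integer power of the transform, and the interval width is the confidence-level quantile of the minimum of depth i.i.d. errors. All names are illustrative:

```python
import numpy as np
from math import exp, factorial

def min_estimator_width(base_pmf, depth, budget, n_items, conf=0.95):
    """One-sided interval width of the Min estimator for a sketch of the
    given depth under a fixed memory budget. base_pmf is the error pmf at
    a rate of one item per counter; with budget counters split into depth
    rows, the per-counter rate is n_items * depth / budget (assumed to be
    an integer here)."""
    rate = round(n_items * depth / budget)
    N = len(base_pmf)
    spec = np.fft.rfft(base_pmf, n=N)
    pmf = np.maximum(np.fft.irfft(spec ** rate, n=N), 0.0)  # Lemma 6.4
    cdf = np.cumsum(pmf)
    cdf_min = 1.0 - (1.0 - cdf) ** depth   # minimum of depth i.i.d. errors
    return int(np.searchsorted(cdf_min, conf))

# Illustrative base error pmf: Poisson with one item per counter.
base = np.array([exp(-1.0) / factorial(j) for j in range(64)])
w2 = min_estimator_width(base, depth=2, budget=100_000, n_items=100_000)
w8 = min_estimator_width(base, depth=8, budget=100_000, n_items=100_000)
```

Under the fixed budget, a deeper sketch widens the Min estimator's interval in this toy setting because the per-counter rate grows with the depth.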
Furthermore, the underlying data can be downsampled using coordinated or bottomk sampling (cohen2013coordinated, ) to estimate error distributions with even smaller rates. This allows one to explore the confidence interval widths for a range of sketch sizes as well.
As an illustration of how this can be applied in a database system, consider the Google Ngram viewer, which deals with the canonical natural language processing task of computing counts of ngrams. An ngram is a sequence of words; for example, ”An ngram consists” is a 3gram. The number of ngrams and possible pointwise queries is very large. One study (yang2007ngram, ) found a very large number of unique 5grams in 100 million English web pages, only a fraction of which appeared at least 5 times. Naively tuning parameters is costly: it requires computing a large number of exact counts as well as repeatedly computing a sketch and estimated counts for a large number of parameter settings. With our method, no true counts need to be computed, the error is obtained by a single quantile calculation, and only one sketch needs to be computed for all parameter settings. Even when prior information about the error or count distribution is unavailable, the asymptotic theory provides guidance on how to choose the sketch parameters, as wider sketches tend to be closer to the ”superefficient” regime where the Min estimator is nearly optimal.
9. Empirical Results
We test our MLE estimator in a variety of real and synthetic situations. It is shown to match or best other estimators in all situations. We also empirically show that our confidence intervals provide the correct coverage. A comparison of these tight bounds with prior bounds shows that they are orders of magnitude better.
For synthetic simulations, we use the family of ZipfMandelbrot, or discrete power law, distributions. These distributions have a probability mass function on the positive integers with two parameters: an offset that adjusts the mass near 1, with smaller values placing more mass at 1, and an exponent that controls the tail behavior, with smaller values giving heavier tails. For small exponents, the distribution has infinite variance. We always consider a universe with a fixed number of items.
For real world datasets, we used a network and a natural language processing dataset. For network data, we used the CAIDA Anonymized OC48 Internet Traces dataset (oc48, ). In 15 minutes of network traffic there were 21.8 million packets from 1.6 million distinct source addresses and ports. We use a Count+ summary to estimate the number of packets for each source. For natural language processing data, we used the Google Ngrams dataset (michel2011quantitative, ) for all 2grams starting with the letters ’ta’. There are 1.4 million distinct 2grams out of a total of 713 million.
We used the R package logcondens (dumbgen2010logcondens, ) to perform logconcave density estimation, though we note there is a corresponding package logcondiscr (balabdaoui2013asymptotics, ) for discrete distributions. Although our data is discrete, we chose the continuous valued density estimation package so that the resulting objective function is continuous and can be easily solved by a standard real valued optimizer.
Although we do not consider timings for our simulation to be representative of a practical implementation, as R is slow, we report that count estimation for 2000 counts took roughly 4 ms per count on a 2.4 GHz CPU when running on a single thread. On average, each count estimate used roughly 16 evaluations of the objective function when using the function optimize, which does not make use of known gradient or Hessian information.
To compare the sketches, we use the root mean squared error and the relative efficiency. The relative efficiency of one estimator relative to another on random data is
(21) 
where the expectation is over the true values being estimated. For unbiased estimators of a real valued quantity, this computes the ratio of the variances, and under regular assumptions where the variance scales inversely with sample size, a given relative efficiency represents needing that many times more data for the less efficient estimator to achieve the same error as the other.
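A minimal implementation of these two metrics (with the relative efficiency written as a ratio of mean squared errors; the direction of the ratio is a convention choice):

```python
import numpy as np

def rmse(estimates, truth):
    estimates, truth = np.asarray(estimates, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((estimates - truth) ** 2)))

def relative_efficiency(est_a, est_b, truth):
    # Efficiency of estimator A relative to B as a ratio of mean squared
    # errors; a value of r means B needs roughly r times as much data to
    # match A under the usual 1/n variance scaling.
    return rmse(est_b, truth) ** 2 / rmse(est_a, truth) ** 2

truth = np.zeros(4)
eff = relative_efficiency(truth + 1.0, truth + 2.0, truth)  # errors 1 vs 2
```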
For single count estimation, we compare the following estimators: the Min, Debiased Min, Debiased Mean, Debiased Median, MLE, and Debiased MLE estimators. Of these, the MLE estimators are the only completely new estimators. The other estimators benefit from our computational simplification when applicable. For all of these estimators, the tight confidence intervals come from our new bootstrap procedure. For each sketch, we estimate the counts for the top 2000 heavy hitters. In simulations, the sketch sizes range in depth from 2 to 16 replicates and over a range of widths per replicate. Figure 5 shows the empirical error and efficiency under the real and synthetic scenarios. The debiased MLE estimator is clearly the best estimator under all scenarios.
Figure 1 shows the coverage of the corresponding confidence intervals for each of the estimators. They match the desired confidence levels at all levels in a multitude of settings. The resulting error bounds are orders of magnitude better than those available from theoretical analysis.
9.1. Regression results
For our regression simulation, we use a small sketch, as in (lee2005improving, ), with depth 4. We note that this choice of small sketch is to highlight the limited regime in which regression significantly improves the count estimator, namely the regime in which items of interest have hash collisions with the known heavy hitters. As shown in (lee2005improving, ) and figure 6, regression yields almost no effect when the sketch is wide relative to the number of items of interest. When collisions with other heavy hitters are likely, the error can be substantially reduced.
10. Discussion
We discuss the applicability of our techniques to other counting sketches and inner product estimation and address computational issues that arise with using empirical error distributions and likelihood based estimators.
10.1. Application to other counting sketches
The same idea of empirically estimating an error distribution to improve count estimation can be applied to other counting sketches and modifications of the Count+ summary. It is straightforward to apply to linear sketches such as the Count sketch (charikar2002countsketch, ) and modifications to the Count+ summary that preserve linearity, for example the time adaptive Adasketch (shrivastava2016time, ).
We note, however, that there is little reason to prefer the Count summary over the Count+ summary for pointwise queries when using our likelihood based estimators. Prior results (deng2007new, ) show that the accuracy of the Debiased Median estimates closely matches the Count sketch estimates. The Count summary is the same as the Count+ summary except that an item’s counter is randomly incremented by either +1 or −1 rather than always by 1. Thus, the error terms are necessarily noisier than those in a Count+ summary, and estimation should not be expected to be better when exploiting the full likelihood.
Several other modifications can be described as random nonlinear transformations of an underlying linear Count+ summary. These include the CountMinLog sketch (talbot2009countminlog, ), (pitel2015count, ), which uses approximate counters to save space, and a proposed sketch that replaces the simple additive counters with approximate distinct counters (cormode2009forward, ). The same idea of empirically estimating an error distribution and applying statistical estimation techniques can be applied to improve estimation. The resulting estimators are more complex, as the observed counter values cannot be used directly; computing the likelihood requires integrating over the error distribution.
Nonlinear sketches such as the Conservative Update CountMin sketch result in summaries where the error terms are no longer exchangeable. The irrelevant counters for an item are not necessarily informative of the error distribution in the relevant counters. The conservative update modification updates only the smallest of the counters that an item hashes to. This substantially reduces the raw magnitude of the error vector and potentially improves performance in the regime where the biased Min estimator is nearly optimal. However, in other regimes, the error will still grow linearly, since there is no debiasing operation. In contrast, the error in the Count+ MLE estimator will grow with the standard deviation. Furthermore, there is no procedure for generating tight confidence intervals when using conservative updates.
10.2. Computational complexity and Application to streaming settings
Thus far, estimation of the empirical error distribution has been assumed to have manageable computational cost. This is aided by the fact that if a sketch does not change, then the error distribution only needs to be estimated once. This may not be the case in streaming settings. Furthermore, in extremely high throughput situations, the maximum likelihood estimator may also be relatively expensive to compute in comparison to simple estimators like the Min, Mean, and Median.
These problems may be alleviated in two ways. First, the estimated error distribution can be updated infrequently. If the empirical distribution is updated only when it could have changed appreciably, for example each time the stream grows by a constant factor, then the number of times the estimated error distribution is updated is logarithmic in the stream size, and the amortized cost of adding a count to the sketch goes to 0. Second, rather than using the MLE estimator, the tight error bounds can be used to periodically select the best simple estimator. Thus, the estimator can smoothly transition from the regime where the Min estimator is optimal to ones where the Mean or some quantile estimator is optimal.
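The first mitigation can be sketched as a refresh schedule (an illustrative helper; `growth` controls how much the stream must grow between refits):

```python
def refresh_points(stream_len, growth=2.0):
    """Stream positions at which the error-distribution estimate is
    re-fit: refresh only when the stream has grown by `growth`, so the
    number of refits is logarithmic in the stream length."""
    points, nxt = [], 1
    while nxt <= stream_len:
        points.append(nxt)
        nxt = max(nxt + 1, int(nxt * growth))
    return points

pts = refresh_points(10 ** 6)   # 20 refits for a million-item stream
```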
10.3. Inner products
Inner product estimation is somewhat more challenging than item count estimation. All counters contain relevant item counts so the error distribution cannot be simply gleaned from unused counters. We provide a means to generate an approximate error distribution but do not evaluate this procedure in this paper as we regard it as a substantial separate topic.
Given two sketches of true count vectors, the naive CountMin inner product estimate is the minimum over replicates of the inner product of the corresponding counter rows. It has been empirically shown to perform well when the data is highly skewed (rusu2007statistical, ), but in cases where there is a heavy tail, the bias in the estimate is large, and it can perform an order of magnitude worse than other methods. An unbiased estimator is given by (thorup2004tabulation, ), which empirically performs similarly to the AMS sketch (alon2005estimating, ). The AMS sketch is the same as the Count sketch but with an inner product estimator instead of an item count estimator.
To find an error distribution, expand the product of two counters to identify the form of the error. Consider a single counter index and the set of all items that hash to it. The error in the product is given by
(22)  
(23) 
The error is a sum of products of counts over pairs of colliding items, where the counts within each pair are drawn independently, though there is dependence between pairs. An imperfect surrogate of the error distribution can be obtained by multiplying random counters in the sketch. For random pairs of indices,
(24) 
This ensures that the number of pairs is approximately correct when the number of items is large, and it preserves part of the dependence structure between pairs. Rather than explicitly constructing a sample, this error distribution can be computed by estimating a distribution for log counter values and taking the convolution.
11. Conclusion
This paper addresses a number of practical problems for counting sketches and advances our understanding of the mechanisms by which they work. We provide two distinct primary contributions: 1) we give the first method that produces practical and tight error estimates for a pointwise query, and 2) we derive improved and optimal estimators that make full use of the information contained in the sketch. Besides their immediate contributions to counting sketches, we show that they help solve other problems facing a practitioner, including which sketch and which count estimator to use and how to select optimal sketch tuning parameters.
References
 [1] The caida ucsd anonymized passive oc48 internet traces dataset 20030424. http://www.caida.org/data/passive/passive_oc48_dataset.xml.
 [2] N. Alon, N. Duffield, C. Lund, and M. Thorup. Estimating arbitrary subset sums with few probes. In PODS, 2005.

 [3] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
 [4] F. Balabdaoui, H. Jankowski, K. Rufibach, and M. Pavlides. Asymptotics of the discrete logconcave maximum likelihood estimator and related applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(4):769–790, 2013.
 [5] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer, 2002.
 [6] J. Chen and Q. Zhang. Biasaware sketches. VLDB, 2017.
 [7] E. Cohen and H. Kaplan. What you can do with coordinated samples. In RANDOM, 2013.
 [8] P. Cooke. Statistical inference for bounds of random variables. Biometrika, 66(2):367–374, 1979.
 [9] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1–3):1–294, 2012.
 [10] G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. VLDB, 1(2):1530–1541, 2008.
 [11] G. Cormode and S. Muthukrishnan. An improved data stream summary: the countmin sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
 [12] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SIAM International Conference on Data Mining. SIAM, 2005.
 [13] G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu. Forward decay: A practical time decay model for streaming systems. In ICDE, pages 138–149. IEEE, 2009.
 [14] F. Deng and D. Rafiei. New estimation algorithms for streaming data: Countmin can do more, 2007.
 [15] L. Dümbgen, K. Rufibach, et al. Maximum likelihood estimation of a logconcave density and its distribution function: Basic properties and uniform consistency. Bernoulli, 15(1):40–68, 2009.
 [16] L. Dümbgen, K. Rufibach, et al. logcondens: Computations related to univariate logconcave density estimation. Journal of Statistical Software, 2010.
 [17] L. Dümbgen, R. Samworth, and D. Schuhmacher. Approximation by logconcave distributions, with applications to regression. Technical report, University of Bern, 2010.
 [18] B. Efron et al. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1):1–26, 1979.
 [19] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
 [20] P. Hall. On estimating the endpoint of a distribution. The Annals of Statistics, pages 556–568, 1982.
 [21] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou. Dynamically maintaining frequent items over a data stream. In Proceedings of the twelfth international conference on Information and knowledge management, pages 287–294. ACM, 2003.
 [22] O. Johnson, I. Kontoyiannis, and M. Madiman. Logconcavity, ultralogconcavity, and a maximum entropy property of discrete compound poisson measures. Discrete Applied Mathematics, 161(9):1232–1250, 2013.
 [23] J. Kelley, Jr. The cuttingplane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.
 [24] J. F. C. Kingman. Poisson processes. Wiley Online Library, 1993.
 [25] G. M. Lee, H. Liu, Y. Yoon, and Y. Zhang. Improving sketch reconstruction accuracy using linear least squares method. In Internet Measurement Conference, 2005.
 [26] Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter braids: a novel counter architecture for perflow measurement. SIGMETRICS, 2008.
 [27] J. S. Meyer. Outer and inner confidence intervals for finite population quantile intervals. Journal of the American Statistical Association, 82(397):201–204, 1987.
 [28] J.B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, et al. Quantitative analysis of culture using millions of digitized books. science, 331(6014):176–182, 2011.
 [29] G. T. Minton and E. Price. Improved concentration bounds for countsketch. SODA, 2014.
 [30] G. Pitel and G. Fouquier. Countminlog sketch: Approximately counting with approximate counters. arXiv preprint arXiv:1502.04885, 2015.
 [31] RedisLabs. Countmin sketch. https://github.com/RedisLabsModules/countminsketch, 2017.
 [32] F. Rusu and A. Dobra. Statistical analysis of sketch estimators. In SIGMOD, 2007.
 [33] A. Shrivastava, A. C. König, and M. Bilenko. Time adaptive sketches (adasketches) for summarizing data streams. SIGMOD, 2016.
 [34] D. Talbot. Succinct approximate counting of skewed data. In IJCAI09 Proceedings, pages 1243–1248, 2009.
 [35] M. Thorup and Y. Zhang. Tabulation based 4universal hashing with applications to second moment estimation. In SODA, volume 4, pages 615–624, 2004.
 [36] A. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2000.
 [37] M. Woodroofe. Maximum likelihood estimation of translation parameter of truncated distribution ii. Ann. Statist., 2(3):474–488, 05 1974.
 [38] S. Yang, H. Zhu, A. Apostoli, and P. Cao. Ngram statistics in english and chinese: similarities and differences. In Semantic Computing, 2007. ICSC 2007. International Conference on, pages 454–460. IEEE, 2007.
Appendix A Proofs
Theorem 6.5
Proof.
Since , it follows that the error distribution . By Theorem 3.4 in [4], . For any , choose such that , and take . Hence, for any , the approximate likelihood with probability eventually. Thus, the approximate posterior uniformly converges to the true posterior in probability, . Boundedness of the loss function ensures uniform convergence of the objectives where and is similarly defined on the approximate posterior. The wellseparation gives the desired convergence in probability of the maximizers by the Mestimation consistency theorem [36]. ∎
Theorem 6.2
Proof.
The logconcave projection is the maximizer of over logconcave mass functions. Assume is not decreasing. Without loss of generality, assume the left endpoint of the support of is 0. Let be the smallest value such that . Such a value must exist since otherwise is linearly increasing with bounded support. Since the uniform distribution is logconcave and attains a higher objective value, cannot be linearly increasing.
For , define if , if , and if where . It is easy to verify that is a probability mass function, and that it is logconcave on and . We will verify that it satisfies the condition for logconcavity at for sufficiently small . It follows that for small enough , the condition holds. We can now show that also attains a higher objective value for small enough . Since is decreasing and is strictly increasing on , and . Taking , it follows that which is for sufficiently small . Thus any nondecreasing cannot be a maximizer. ∎
a.1. Counter Braids
When the entire universe of items is known, the true counts must be in the feasible set . This set is expensive to compute when the number of counters is large. The counter braids estimator instead keeps track of upper and lower bounds for the feasible set so that
(25) 
It follows that
(26)  
(27) 
which has the same form as a Count+ estimation problem. Using the Min estimator, which only exploits the nonnegative support constraint, yields the updates
(28)  
(29) 
Initializing with , repeating these iterations until convergence, and returning either or as the estimate yields the counter braids estimator.
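A toy dense implementation of this iteration (one consistent reading of the updates above; `hashes` and `counters` are hypothetical inputs, and the example is small enough that the bounds collapse to the exact counts):

```python
import numpy as np

def counter_braids_decode(hashes, counters, iters=10):
    """Iterative bound-tightening decode when the full universe of items
    is known. hashes[i] lists the counter indices item i maps to;
    counters[j] holds the observed sum for counter j."""
    n = len(hashes)
    lo, hi = np.zeros(n), np.full(n, np.inf)
    for _ in range(iters):
        for i in range(n):
            for j in hashes[i]:
                # Items other than i that also contribute to counter j.
                others = [k for k in range(n) if k != i and j in hashes[k]]
                # Tighten: counter minus optimistic/pessimistic collision mass.
                hi[i] = min(hi[i], counters[j] - sum(lo[k] for k in others))
                lo[i] = max(lo[i], counters[j] - sum(hi[k] for k in others))
    return lo, hi

# Toy sketch: 3 items, 2 rows of counters, true counts (5, 3, 2).
hashes = [[0, 2], [1, 2], [1, 3]]    # counter indices per item
counters = [5.0, 5.0, 8.0, 2.0]      # observed counter sums
lo, hi = counter_braids_decode(hashes, counters)
```

Here the upper and lower bounds meet, recovering the true counts exactly.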