This paper focuses on counting, which is among the most fundamental operations in almost every field of science and engineering. Computing the sum $\sum_{i=1}^{D} A_t[i]$ is the simplest counting ($t$ denotes time). Counting the $\alpha$th moment $F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}$ is more general. When $\alpha = 0$, $F_{(0)}$ counts the total number of non-zeros in $A_t$. When $\alpha = 2$, $F_{(2)}$ counts the “energy” or “power” of the signal $A_t$. If $A_t[i]$ actually outputs the power of an underlying signal $B_t[i]$, i.e., $A_t[i] = B_t[i]^2$, counting the sum $\sum_{i=1}^{D} A_t[i]$ is equivalent to computing $\sum_{i=1}^{D} B_t[i]^2$.
Counting $F_{(\alpha)}$ for massive data streams is practically important, among many challenging issues in data stream computations. In fact, the general theme of “scaling up for high dimensional data and high speed data streams” is among the “ten challenging problems in data mining research.”
Because the elements, $A_t[i]$, are time-varying, a naïve counting mechanism requires a system of $D$ counters to compute $F_{(\alpha)}$ exactly. This is not always realistic when $D$ is large and we only need an approximate answer. For example, $D$ may be on the order of the size of the IP address space if $A_t$ records the arrivals of IP addresses. Or, $D$ can be the total number of checking/savings accounts.
Compressed Counting (CC) is a new scheme for approximating the $\alpha$th frequency moments of data streams (where $0 < \alpha \le 2$) using low memory. The underlying technique is based on what we call skewed stable random projections.
1.1 The Data Models
We consider the popular Turnstile data stream model. The input stream $a_t = (i_t, I_t)$, $i_t \in [1, D]$, arriving sequentially describes the underlying signal $A$, meaning $A_t[i_t] = A_{t-1}[i_t] + I_t$. The increment $I_t$ can be either positive (insertion) or negative (deletion). Restricting $I_t \ge 0$ results in the cash register model. Restricting $A_t[i] \ge 0$ at all $t$ (but $I_t$ can still be either positive or negative) results in the strict Turnstile model, which suffices for describing most (although not all) natural phenomena. For example, in a database, a record can only be deleted if it was previously inserted. Another example is the checking/savings account, which allows deposits/withdrawals but generally does not allow overdraft.
Compressed Counting (CC) is applicable when, at the time $t$ for the evaluation, $A_t[i] \ge 0$ for all $i$. This is more flexible than the strict Turnstile model, which requires $A_t[i] \ge 0$ at all $t$. In other words, CC is applicable when data streams are (a) insertion only (i.e., the cash register model), or (b) always non-negative (i.e., the strict Turnstile model), or (c) non-negative at check points. We believe our model suffices for describing most natural data streams in practice.
With the realistic restriction that $A_t[i] \ge 0$ at $t$, the definition of the $\alpha$th frequency moment becomes
$$F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}, \qquad (2)$$
and the case $\alpha = 1$ becomes trivial, because
$$F_{(1)} = \sum_{i=1}^{D} A_t[i] = \sum_{s=0}^{t} I_s. \qquad (3)$$
In other words, for $\alpha = 1$, we need only a simple counter to accumulate all values of the increments/decrements $I_s$.
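A minimal sketch may make this concrete (the toy stream below is illustrative, not from the paper): in the Turnstile model, $F_{(1)}$ needs only one accumulator, while exact $F_{(\alpha)}$ for $\alpha \ne 1$ needs the full system of $D$ counters.

```python
# Sketch: F_(1) = sum_i A_t[i] equals the running sum of all increments I_s,
# so a single counter suffices; exact F_(alpha) for alpha != 1 requires
# maintaining all D counters A_t[i].
D = 5
stream = [(0, 3.0), (2, 1.0), (0, -1.0), (4, 2.5), (2, 2.0)]  # (i_t, I_t) pairs

A = [0.0] * D        # naive system of D counters
counter = 0.0        # single counter for F_(1)
for i_t, I_t in stream:
    A[i_t] += I_t
    counter += I_t

alpha = 1.5
F1 = sum(A)                                        # equals the single counter
F_alpha = sum(a ** alpha for a in A if a > 0)      # needs all of A
print(counter, F1, F_alpha)
```

The single counter and the $D$-counter system agree on $F_{(1)}$ for any interleaving of insertions and deletions, because both are linear in the increments.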
For $\alpha \ne 1$, however, counting $F_{(\alpha)}$ in (2) is still a non-trivial problem. Intuitively, there should exist an intelligent counting system that performs almost like a simple counter when $\alpha = 1 \pm \Delta$ with small $\Delta$. The parameter $\alpha$ may bear a clear physical meaning. For example, $\Delta$ may be the “decay rate” or “interest rate,” which is usually small.
The proposed Compressed Counting (CC) provides such an intelligent counting system. Because its underlying technique is based on skewed stable random projections, we first provide a brief introduction to skewed stable distributions.
1.2 Skewed Stable Distributions
A random variable $Z$ follows a $\beta$-skewed $\alpha$-stable distribution if the Fourier transform of its density is
$$\varphi_Z(t) = E\left[e^{\sqrt{-1}\, Z t}\right] = \exp\left(-F |t|^{\alpha} \left(1 - \sqrt{-1}\,\beta\, \mathrm{sign}(t) \tan\left(\tfrac{\pi\alpha}{2}\right)\right)\right), \qquad \alpha \ne 1,$$
where $-1 \le \beta \le 1$ and $F > 0$ is the scale parameter. We denote $Z \sim S(\alpha, \beta, F)$. Here $\sqrt{-1}$ is the imaginary unit. When $\alpha > 2$ or $|\beta| > 1$, the inverse Fourier transform is either unbounded or not a probability density. This is why Compressed Counting is limited to $0 < \alpha \le 2$ (and $-1 \le \beta \le 1$).
Consider two independent random variables, $Z_1, Z_2 \sim S(\alpha, \beta, 1)$. For any non-negative constants $C_1$ and $C_2$, the “$\alpha$-stability” follows from properties of Fourier transforms:
$$C_1 Z_1 + C_2 Z_2 \sim S\left(\alpha, \beta, C_1^{\alpha} + C_2^{\alpha}\right).$$
However, if $C_1$ and $C_2$ do not have the same signs, the above “stability” does not hold. To see this, we consider $C_1 Z_1 - C_2 Z_2$, with $C_1 \ge 0$ and $C_2 \ge 0$. Then, because $-Z_2 \sim S(\alpha, -\beta, 1)$,
$$C_1 Z_1 - C_2 Z_2 \sim S\left(\alpha,\ \beta\,\frac{C_1^{\alpha} - C_2^{\alpha}}{C_1^{\alpha} + C_2^{\alpha}},\ C_1^{\alpha} + C_2^{\alpha}\right),$$
which does not represent a stable law with the same skewness $\beta$, unless $\beta = 0$, or $C_1 = 0$, or $C_2 = 0$. This is the fundamental reason why Compressed Counting needs the restriction that, at the time of evaluation, the elements in the data streams should have the same signs.
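Both the stability property and its failure for mixed signs can be checked directly on the characteristic functions; a small numeric sketch (the values of $\alpha$, $C_1$, $C_2$ are arbitrary illustrations):

```python
import numpy as np

def stable_cf(t, alpha, beta, F):
    """Fourier transform exp(-F|t|^alpha (1 - i*beta*sign(t)*tan(pi*alpha/2)))."""
    return np.exp(-F * np.abs(t) ** alpha
                  * (1 - 1j * beta * np.sign(t) * np.tan(np.pi * alpha / 2)))

alpha, beta = 0.9, 1.0
C1, C2 = 2.0, 3.0
t = np.linspace(-5, 5, 201)

# cf of C1*Z1 + C2*Z2 is the product of the individual (rescaled) cfs
lhs = stable_cf(C1 * t, alpha, beta, 1.0) * stable_cf(C2 * t, alpha, beta, 1.0)
rhs = stable_cf(t, alpha, beta, C1 ** alpha + C2 ** alpha)
print(np.max(np.abs(lhs - rhs)))    # ~0: stability holds for same-sign C1, C2

# with C1*Z1 - C2*Z2, the skewness is averaged: -Z2 ~ S(alpha, -beta, 1)
mixed = stable_cf(C1 * t, alpha, beta, 1.0) * stable_cf(C2 * t, alpha, -beta, 1.0)
print(np.max(np.abs(mixed - rhs)))  # far from 0: not S(alpha, beta, C1^a + C2^a)
```

The first difference is zero up to floating-point error, since for $C_1, C_2 > 0$ we have $\mathrm{sign}(C_j t) = \mathrm{sign}(t)$ and the exponents simply add; the second is not, because the skewness parameter changes.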
1.3 Skewed Stable Random Projections
Given a vector $R \in \mathbb{R}^{D}$ with each element $r_i \sim S(\alpha, \beta, 1)$ i.i.d., then (assuming all $A_t[i] \ge 0$)
$$X = \sum_{i=1}^{D} A_t[i]\, r_i \sim S\left(\alpha, \beta, \sum_{i=1}^{D} A_t[i]^{\alpha}\right),$$
meaning $X$ represents one sample of the stable distribution whose scale parameter $F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}$ is what we are after.
Of course, we need more than one sample to estimate $F_{(\alpha)}$. We can generate a matrix $R \in \mathbb{R}^{D \times k}$ with each entry $r_{ij} \sim S(\alpha, \beta, 1)$ i.i.d. The resultant vector $X = R^{\mathsf{T}} A_t$ contains $k$ i.i.d. samples: $x_j = \sum_{i=1}^{D} A_t[i]\, r_{ij} \sim S\left(\alpha, \beta, F_{(\alpha)}\right)$, $j = 1$ to $k$.
Note that this is a linear projection; and recall that the Turnstile model is also linear. Thus, skewed stable random projections are applicable to dynamic data streams. For every incoming $a_t = (i_t, I_t)$, we update $x_j \leftarrow x_j + I_t\, r_{i_t j}$ for $j = 1$ to $k$. This way, at any time $t$, we maintain $k$ i.i.d. stable samples. The remaining task is to recover $F_{(\alpha)}$, which is a statistical estimation problem.
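Because both the projection and the Turnstile update are linear, the sketch $x = R^{\mathsf{T}} A_t$ can be maintained incrementally. A minimal sketch of this bookkeeping (Gaussian entries stand in for skewed stable ones here, since only the linearity is being illustrated):

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 100, 10
# In Compressed Counting, r_ij ~ S(alpha, 1, 1); any fixed random matrix
# illustrates the linear-update property (Gaussian entries as a stand-in).
R = rng.standard_normal((D, k))

x = np.zeros(k)          # running sketch, x_j = sum_i A_t[i] r_ij
A = np.zeros(D)          # ground-truth signal (kept only for verification)

stream = [(int(rng.integers(D)), float(rng.standard_normal())) for _ in range(1000)]
for i_t, I_t in stream:
    x += I_t * R[i_t, :]  # O(k) work per stream element
    A[i_t] += I_t

# the incrementally maintained sketch equals the full projection R^T A_t
print(np.allclose(x, R.T @ A))
```

The equality holds for any interleaving of insertions and deletions; only $x$ (of size $k$) and the projection entries need to be accessible at stream time.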
1.4 Counting in Statistical/Learning Applications
The method of moments is often convenient and popular in statistical parameter estimation. Consider, for example, the three-parameter generalized gamma distribution, which is highly flexible for modeling positive data, e.g., $x_i \sim GG(\theta, \gamma, \eta)$. If $x \sim GG(\theta, \gamma, \eta)$, then the first three moments $E(x)$, $E(x^2)$, and $E(x^3)$ are available in closed form as functions of $(\theta, \gamma, \eta)$. Thus, one can estimate $\theta$, $\gamma$, and $\eta$ from i.i.d. samples $x_i$ by counting the first three empirical moments from the data. However, some moments may be (much) easier to compute than others if the data $x_i$'s are collected from data streams. Instead of using integer moments, the parameters can also be estimated from any three fractional moments, i.e., $E(x^{\alpha})$, for three different values of $\alpha$. Because the sample size is very large, any consistent estimator is likely to provide a good estimate. Thus, it might be reasonable to choose $\alpha$ mainly based on the computational cost. See Appendix A for comments on the situation in which one may also care about the relative accuracy caused by different choices of $\alpha$.
The logarithmic norm $\sum_{i=1}^{D} \log A_t[i]$ arises in statistical estimation, for example, in the maximum likelihood estimators for the Pareto and gamma distributions. Since it is closely connected to the moment problem, Section 4 provides an algorithm for approximating the logarithmic norm, as well as for the logarithmic distance; the latter can be quite useful in machine learning practice with massive heavy-tailed data (either dynamic or static) in lieu of the usual $l_2$ distance.
Entropy is also an important summary statistic. Recently, it was proposed to approximate the entropy moment $\sum_{i=1}^{D} A_t[i] \log A_t[i]$ using the $\alpha$th moments with $\alpha = 1 \pm \Delta$ and very small $\Delta$.
1.5 Comparisons with Previous Studies
Since the pioneering work, there have been many studies on approximating the $\alpha$th frequency moment $F_{(\alpha)}$. The pioneering paper considered the integer moments, $\alpha = 0$, 1, 2, as well as $\alpha > 2$. Soon after, [5, 9] provided improved algorithms for $\alpha = 2$. [18, 3] proved the sample complexity lower bounds for $\alpha > 2$. Later work proved the optimal lower bounds for all frequency moments, except for $\alpha = 1$, because for non-negative data, $F_{(1)}$ can be computed essentially error-free with a simple counter [16, 6, 1]. Subsequent work provided algorithms for $\alpha > 2$ to (essentially) achieve the lower bounds proved in [18, 3].
Note that an algorithm which “achieves the optimal bound” is not necessarily practical, because the constant may be very large. In a sense, the method based on symmetric stable random projections is one of the few successful algorithms that are simple and free of large constants. Earlier work described the procedure for approximating $F_{(\alpha)}$ in data streams and proved the bound for $\alpha = 2$ (although not explicitly); for $0 < \alpha < 2$, a conceptual algorithm was provided. Later work proposed various estimators for symmetric stable random projections and provided the constants explicitly for all $0 < \alpha \le 2$.
None of the previous studies, however, captures the intuition that, when $\alpha = 1$, a simple counter suffices for computing $F_{(1)}$ (essentially) error-free, and that, when $\alpha = 1 \pm \Delta$ with small $\Delta$, the sample complexity (the number of projections, $k$) should be low and vary continuously as a function of $\Delta$.
Compressed Counting (CC) is proposed for $0 < \alpha \le 2$, and it works particularly well when $\alpha = 1 \pm \Delta$ with small $\Delta$. This can be practically very useful. For example, $\Delta$ may be the “decay rate” or the “interest rate,” which is usually small; thus CC can count the total value in the future, taking into account the effect of decaying or interest accruement. In parameter estimation using the method of moments, one may choose the $\alpha$th moments with $\alpha$ close to 1. Also, one can approximate the entropy moment using the $\alpha$th moments with $\alpha = 1 \pm \Delta$ and very small $\Delta$.
1.6 Two Statistical Estimators
Recall that Compressed Counting (CC) boils down to a statistical estimation problem. That is, given $k$ i.i.d. samples $x_j \sim S\left(\alpha, \beta, F_{(\alpha)}\right)$, estimate the scale parameter $F_{(\alpha)}$. Section 2 will explain why we fix $\beta = 1$. This paper provides two estimators, whose asymptotic variances are illustrated in Figure 1.
The geometric mean estimator,
$$\hat{F}_{(\alpha),gm} = \frac{\prod_{j=1}^{k} |x_j|^{\alpha/k}}{D_{gm}}, \qquad D_{gm} = \left(E\,|S(\alpha,1,1)|^{\alpha/k}\right)^{k},$$
is unbiased. We prove its sample complexity bound explicitly and show that a small number of samples $k$ suffices for $\alpha$ around 1.
The harmonic mean estimator, $\hat{F}_{(\alpha),hm}$, for $0 < \alpha < 1$:
$$\hat{F}_{(\alpha),hm} = \frac{k\, \frac{\cos(\pi\alpha/2)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k} |x_j|^{-\alpha}}.$$
It is considerably more accurate than $\hat{F}_{(\alpha),gm}$, and its sample complexity bound is also provided in an explicit form. Here $\Gamma(\cdot)$ is the usual gamma function.
1.7 Paper Organization
Section 2 begins with analyzing the moments of skewed stable distributions, from which the geometric mean and the harmonic mean estimators are derived. The rest of Section 2 is devoted to the detailed analysis of the geometric mean estimator.
2 The Geometric Mean Estimator
We first prove a fundamental result about the moments of skewed stable distributions. If $Z \sim S(\alpha, \beta, F)$, then for any $-1 < \lambda < \alpha$,
$$E\left(|Z|^{\lambda}\right) = F^{\lambda/\alpha} \left(1 + \beta^2 \tan^2\left(\tfrac{\pi\alpha}{2}\right)\right)^{\frac{\lambda}{2\alpha}} \frac{\Gamma\left(1 - \tfrac{\lambda}{\alpha}\right)}{\Gamma(1 - \lambda)}\, \frac{\cos\left(\tfrac{\lambda}{\alpha} \arctan\left(\beta \tan\left(\tfrac{\pi\alpha}{2}\right)\right)\right)}{\cos\left(\tfrac{\pi\lambda}{2}\right)},$$
which can be simplified when $\beta = 1$ and $\alpha < 1$, to be
$$E\left(|Z|^{\lambda}\right) = F^{\lambda/\alpha} \cos^{-\lambda/\alpha}\left(\tfrac{\pi\alpha}{2}\right) \frac{\Gamma\left(1 - \tfrac{\lambda}{\alpha}\right)}{\Gamma(1 - \lambda)}.$$
For $\beta = 1$ and $\alpha < 1$, $Z > 0$, and the above expression in fact holds for all $-\infty < \lambda < \alpha$.
Proof: See Appendix B.
Recall that Compressed Counting boils down to estimating $F_{(\alpha)}$ from the $k$ i.i.d. samples $x_j \sim S\left(\alpha, \beta, F_{(\alpha)}\right)$. Setting $\lambda = \alpha/k$ in Lemma 2 yields an unbiased estimator:
$$\hat{F}_{(\alpha),gm,\beta} = \frac{\prod_{j=1}^{k} |x_j|^{\alpha/k}}{D_{gm,\beta}}, \qquad D_{gm,\beta} = \left(E\,|S(\alpha,\beta,1)|^{\alpha/k}\right)^{k}.$$
The following Lemma shows that the variance of $\hat{F}_{(\alpha),gm,\beta}$ decreases with increasing $\beta$. The variance of $\hat{F}_{(\alpha),gm,\beta}$ is a decreasing function of $\beta \in [0, 1]$.
Proof: The result follows from the fact that $E\left(\hat{F}_{(\alpha),gm,\beta}^2\right)$ is a decreasing function of $\beta$.
Therefore, for attaining the smallest variance, we take $\beta = 1$. For brevity, we simply use $\hat{F}_{(\alpha),gm}$ instead of $\hat{F}_{(\alpha),gm,1}$. In fact, the rest of the paper will always consider $\beta = 1$ only.
We rewrite $\hat{F}_{(\alpha),gm}$ (i.e., $\hat{F}_{(\alpha),gm,\beta=1}$) as
$$\hat{F}_{(\alpha),gm} = \frac{\prod_{j=1}^{k} |x_j|^{\alpha/k}}{\cos^{-1}\left(\tfrac{\kappa(\alpha)\pi}{2}\right)\left[\frac{\Gamma\left(1 - \tfrac{1}{k}\right)}{\Gamma\left(1 - \tfrac{\alpha}{k}\right)}\, \frac{\cos\left(\tfrac{\kappa(\alpha)\pi}{2k}\right)}{\cos\left(\tfrac{\pi\alpha}{2k}\right)}\right]^{k}}. \qquad (4)$$
Here, $\kappa(\alpha) = \alpha$, if $\alpha < 1$, and $\kappa(\alpha) = 2 - \alpha$, if $\alpha > 1$.
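A numerical sketch of the whole pipeline may be helpful. The sketch below is not the paper's closed-form constant: it calibrates the geometric mean normalizer $D_{gm} = \left(E\,|S(\alpha,1,1)|^{\alpha/k}\right)^{k}$ by Monte Carlo, which makes the estimator unbiased by construction, and it relies on scipy's default "S1" parameterization matching the Fourier transform used here (an assumption worth checking in your scipy version).

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(7)
alpha, D, k = 0.95, 200, 50

# non-negative signal; F_(alpha) = sum_i A[i]^alpha is the target
A = rng.random(D)
F_true = np.sum(A ** alpha)

# skewed stable projections: r_ij ~ S(alpha, beta=1, 1)
R = levy_stable.rvs(alpha, 1.0, size=(D, k), random_state=rng)
x = R.T @ A   # k i.i.d. samples from S(alpha, 1, F_(alpha))

# geometric mean estimator; normalizer (E|S(alpha,1,1)|^(alpha/k))^k
# estimated from calibration samples instead of the closed form
z = levy_stable.rvs(alpha, 1.0, size=200_000, random_state=rng)
D_gm = np.mean(np.abs(z) ** (alpha / k)) ** k

F_hat = np.prod(np.abs(x) ** (alpha / k)) / D_gm
print(F_true, F_hat)
```

For $\alpha$ near 1 and $\beta = 1$ the estimate concentrates tightly around $F_{(\alpha)}$ even for modest $k$, consistent with the variance behavior discussed below.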
Lemma 2 concerns the asymptotic moments of $\hat{F}_{(\alpha),gm}$. As $k \to \infty$, the normalizing constant converges monotonically with increasing $k$ ($k \ge 2$); Euler's constant $\gamma_e$ appears in the limit. For any fixed $k$, as $\alpha \to 1$, the asymptotic variance of $\hat{F}_{(\alpha),gm}$ vanishes.
Proof: See Appendix C.
In (4), the denominator depends on $k$, which is inconvenient for small $k$. For convenience in analyzing tail bounds, we consider an asymptotically equivalent geometric mean estimator, $\hat{F}_{(\alpha),gm,b}$, in which the denominator is replaced by its limit as $k \to \infty$.
The right tail bound: for $\epsilon > 0$,
$$\Pr\left(\hat{F}_{(\alpha),gm,b} \ge (1+\epsilon) F_{(\alpha)}\right) \le \exp\left(-k\, \frac{\epsilon^2}{G_R}\right),$$
and the left tail bound: for $0 < \epsilon < 1$,
$$\Pr\left(\hat{F}_{(\alpha),gm,b} \le (1-\epsilon) F_{(\alpha)}\right) \le \exp\left(-k\, \frac{\epsilon^2}{G_L}\right),$$
where the constants $G_R$ and $G_L$ are the solutions to transcendental equations involving the “Psi” function $\psi(z) = \frac{\partial \log \Gamma(z)}{\partial z}$.
Proof: See Appendix D.
For fixed $\epsilon$, as $\alpha \to 1$ (i.e., $\Delta \to 0$), the constants $G_R$ and $G_L$ admit simple asymptotic approximations.
Proof: See Appendix E.
Figure 3 plots the constants $G_R$ and $G_L$ for small values of $\Delta$, along with the approximations suggested in Lemma 2. Since we usually consider that $\epsilon$ should not be too large, we can write that, as $\Delta \to 0$, $G_R \to 0$, and $G_L \to 0$ if $\alpha < 1$; both at the rate $O(\Delta)$. However, if $\alpha > 1$, $G_L$ decreases to zero considerably faster than $O(\Delta)$, which is extremely fast.
The sample complexity bound is then straightforward. Using the geometric mean estimator, it suffices to let $k = O\left(G\, \epsilon^{-2} \log(2/\delta)\right)$ so that the error will be within a $1 \pm \epsilon$ factor with probability $1 - \delta$, where $G = \max\{G_R, G_L\}$. In the neighborhood of $\alpha = 1$, $k = O\left(\Delta\, \epsilon^{-2} \log(2/\delta)\right)$ only.
3 The Harmonic Mean Estimator
For $0 < \alpha < 1$, the harmonic mean estimator can considerably improve upon $\hat{F}_{(\alpha),gm}$. Unlike the harmonic mean estimator for symmetric stable random projections, which is useful only for small $\alpha$ and has no exponential tail bounds except for $\alpha = 0+$, the harmonic mean estimator in this study has very nice tail properties for all $0 < \alpha < 1$.
The harmonic mean estimator takes advantage of the fact that if $Z \sim S\left(\alpha, 1, F_{(\alpha)}\right)$ with $\alpha < 1$, then $E\left(|Z|^{\lambda}\right)$ exists for all $-\infty < \lambda < \alpha$.
Assume $k$ i.i.d. samples $x_j \sim S\left(\alpha, 1, F_{(\alpha)}\right)$, $\alpha < 1$. Define the harmonic mean estimator $\hat{F}_{(\alpha),hm}$,
$$\hat{F}_{(\alpha),hm} = \frac{k\, \frac{\cos(\pi\alpha/2)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k} |x_j|^{-\alpha}},$$
and the bias-corrected harmonic mean estimator $\hat{F}_{(\alpha),hm,c}$,
$$\hat{F}_{(\alpha),hm,c} = \frac{k\, \frac{\cos(\pi\alpha/2)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k} |x_j|^{-\alpha}} \left(1 - \frac{1}{k}\left(\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1\right)\right).$$
The bias and variance of $\hat{F}_{(\alpha),hm,c}$ are
$$E\left(\hat{F}_{(\alpha),hm,c}\right) = F_{(\alpha)} + O\left(\frac{1}{k^2}\right), \qquad \mathrm{Var}\left(\hat{F}_{(\alpha),hm,c}\right) = \frac{F_{(\alpha)}^2}{k}\left(\frac{2\Gamma^2(1+\alpha)}{\Gamma(1+2\alpha)} - 1\right) + O\left(\frac{1}{k^2}\right).$$
The right tail bound of $\hat{F}_{(\alpha),hm}$ is, for $\epsilon > 0$,
$$\Pr\left(\hat{F}_{(\alpha),hm} \ge (1+\epsilon) F_{(\alpha)}\right) \le \exp\left(-k\, \frac{\epsilon^2}{G_{R,hm}}\right),$$
where $G_{R,hm}$ is the solution to a transcendental equation given in Appendix F. The left tail bound of $\hat{F}_{(\alpha),hm}$ is, for $0 < \epsilon < 1$,
$$\Pr\left(\hat{F}_{(\alpha),hm} \le (1-\epsilon) F_{(\alpha)}\right) \le \exp\left(-k\, \frac{\epsilon^2}{G_{L,hm}}\right),$$
where $G_{L,hm}$ is the solution to a transcendental equation given in Appendix F.
Proof: See Appendix F.
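A numerical sketch of a harmonic mean estimator of this form. The constant $\cos(\pi\alpha/2)/\Gamma(1+\alpha)$, obtained from $E\left(|x_j|^{-\alpha}\right)$ for maximally skewed positive stable laws, and the $O(1/k)$ delta-method correction are assumptions of this sketch, to be checked against Appendix F:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import levy_stable

rng = np.random.default_rng(3)
alpha, D, k = 0.8, 200, 100

A = rng.random(D)                      # non-negative signal
F_true = np.sum(A ** alpha)

R = levy_stable.rvs(alpha, 1.0, size=(D, k), random_state=rng)
x = R.T @ A                            # x_j ~ S(alpha, 1, F_(alpha)), positive

# harmonic mean estimator, using (assumed)
# E(|x_j|^-alpha) = cos(pi*alpha/2) / (F * Gamma(1+alpha)) for beta = 1, alpha < 1
c = np.cos(np.pi * alpha / 2) / gamma(1 + alpha)
F_hm = k * c / np.sum(np.abs(x) ** (-alpha))

# bias-corrected version (delta-method O(1/k) correction)
F_hm_c = F_hm * (1 - (2 * gamma(1 + alpha) ** 2 / gamma(1 + 2 * alpha) - 1) / k)
print(F_true, F_hm, F_hm_c)
```

With $k = 100$ and $\alpha = 0.8$, the predicted relative standard deviation $\sqrt{\left(2\Gamma^2(1+\alpha)/\Gamma(1+2\alpha) - 1\right)/k}$ is only a few percent, so the estimate lands close to $F_{(\alpha)}$.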
4 The Logarithmic Norm and Distance
The logarithmic norm and distance can be important in practice. Consider estimating the parameter $\theta$ of a Pareto distribution from $k$ i.i.d. samples $x_i$. The density function is $f(x; \theta) = \theta x^{-(\theta+1)}$, $x \ge 1$, and the likelihood equation yields
$$\hat{\theta} = \frac{k}{\sum_{i=1}^{k} \log x_i}.$$
If instead, $x_i \sim \mathrm{gamma}(\theta, \gamma)$, $i = 1$ to $k$, then the density is $f(x; \theta, \gamma) = \frac{x^{\gamma-1} e^{-x/\theta}}{\Gamma(\gamma)\, \theta^{\gamma}}$, $x > 0$, and the likelihood equations become
$$\hat{\gamma}\hat{\theta} = \frac{1}{k}\sum_{i=1}^{k} x_i, \qquad \psi(\hat{\gamma}) + \log \hat{\theta} = \frac{1}{k}\sum_{i=1}^{k} \log x_i.$$
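For the Pareto case, the MLE depends on the data only through $\sum_i \log x_i$; a minimal sketch (the density $f(x;\theta) = \theta x^{-(\theta+1)}$ on $x \ge 1$, with MLE $\hat{\theta} = k / \sum_i \log x_i$, is the standard textbook form, and the toy samples are illustrative):

```python
import math

# Pareto(theta) on [1, inf): f(x) = theta * x^{-(theta+1)}.
# Log-likelihood: k*log(theta) - (theta+1)*sum(log x_i),
# so the MLE is theta_hat = k / sum(log x_i):
# the data enter only through the logarithmic norm sum_i log x_i.
x = [math.e, math.e ** 2, math.e ** 3]   # toy samples with sum(log x) = 6
k = len(x)
log_norm = sum(math.log(v) for v in x)
theta_hat = k / log_norm
print(theta_hat)  # ~0.5
```

This is exactly the quantity that the stream algorithm of this section approximates when the $x_i$'s live inside a massive data stream.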
Therefore, the logarithmic norm occurs at least in the context of maximum likelihood estimation of common distributions. Now, consider that the data $x_i$'s are actually the elements of a data stream, $A_t[i]$'s. Estimating the logarithmic norm $\sum_{i=1}^{D} \log A_t[i]$ becomes an interesting and practically meaningful problem.
Our solution is based on the fact that, as $\Delta \to 0$,
$$\frac{\sum_{i=1}^{D} A_t[i]^{\Delta} - D}{\Delta} \to \sum_{i=1}^{D} \log A_t[i],$$
which can be shown by L’Hôpital’s rule. More precisely,
$$\frac{\sum_{i=1}^{D} A_t[i]^{\Delta} - D}{\Delta} = \sum_{i=1}^{D} \log A_t[i] + \frac{\Delta}{2} \sum_{i=1}^{D} \log^2 A_t[i] + O\left(\Delta^2\right),$$
which can be shown by Taylor expansions.
Therefore, we obtain one solution to approximating the logarithmic norm, using $F_{(\Delta)}$ with very small $\alpha = \Delta$. Of course, we have assumed that $A_t[i] > 0$ strictly. In fact, this also suggests an approach for approximating the logarithmic distance between two streams $A_t$ and $B_t$, provided we use symmetric stable random projections.
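The limit and the Taylor expansion above can be verified numerically; a minimal sketch, assuming strictly positive elements (the toy vector is illustrative):

```python
import numpy as np

A = np.array([0.5, 1.0, 2.0, 3.0, 10.0])   # strictly positive elements
D = len(A)
exact = np.sum(np.log(A))

for delta in (0.1, 0.01, 0.001):
    # (sum_i A[i]^delta - D) / delta  ->  sum_i log A[i]  as delta -> 0
    approx = (np.sum(A ** delta) - D) / delta
    # subtracting the (delta/2) * sum log^2 A[i] term leaves an O(delta^2) error
    corrected = approx - (delta / 2) * np.sum(np.log(A) ** 2)
    print(delta, approx - exact, corrected - exact)
```

The first-order error shrinks linearly in $\Delta$, and the Taylor-corrected value shrinks quadratically, matching the expansion.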
The logarithmic distance can be useful in machine learning practice with massive heavy-tailed data (either static or dynamic) such as image and text data. For those data, the usual $l_2$ distance would not be useful without “term-weighting” the data, and taking the logarithm is one simple weighting scheme. Thus, our method provides a direct way to compute pairwise distances, taking data weighting into account automatically.
One may also be interested in the tail bounds, which, however, cannot be expressed in terms of the logarithmic norm (or distance). Nevertheless, we can obtain, e.g., tail bounds for the estimated $F_{(\Delta)}$ using the geometric mean estimator.
If the harmonic mean estimator is used, we just replace the corresponding constants in the above expressions. If we are interested in the logarithmic distance, we simply apply symmetric stable random projections and use an appropriate estimator of the distance; the corresponding tail bounds will have the same format.
5 Conclusion
Counting is a fundamental operation. In data streams $A_t[i]$, $i \in [1, D]$, counting the $\alpha$th frequency moment $F_{(\alpha)} = \sum_{i=1}^{D} A_t[i]^{\alpha}$ has been extensively studied. Our proposed Compressed Counting (CC) takes advantage of the fact that most data streams encountered in practice are non-negative, although they are subject to deletion and insertion. In fact, CC only requires that at the time $t$ for the evaluation, $A_t[i] \ge 0$; at other times, the data streams can actually go below zero.
Compressed Counting successfully captures the intuition that, when $\alpha = 1$, a simple counter suffices, and that, when $\alpha = 1 \pm \Delta$ with small $\Delta$, an intelligent counting system should require low space (varying continuously as a function of $\Delta$). The case $\alpha = 1 \pm \Delta$ with small $\Delta$ can be practically important. For example, $\Delta$ may be the “decay rate” or “interest rate,” which is usually small. CC can also be very useful for statistical parameter estimation based on the method of moments. Also, one can approximate the entropy moment using the $\alpha$th moments with $\alpha = 1 \pm \Delta$ and very small $\Delta$.
Compared with previous studies, e.g., [10, 14], Compressed Counting achieves, in a sense, an “infinite improvement” in terms of the asymptotic variances when $\alpha \to 1$. Two estimators based on the geometric mean and the harmonic mean are provided in this study, together with their variances, tail bounds, and sample complexity bounds.
We analyze our sample complexity bound