 # Compressed Counting

Counting is among the most fundamental operations in computing. For example, counting the pth frequency moment has been a very active area of research in theoretical computer science, databases, and data mining. When p=1, the task (i.e., counting the sum) can be accomplished using a simple counter. Compressed Counting (CC) is proposed for efficiently computing the pth frequency moment of a data stream signal A_t, where 0<p<=2. CC is applicable if the streaming data follow the Turnstile model, with the restriction that at the time t for the evaluation, A_t[i]>=0, which includes the strict Turnstile model as a special case. For natural data streams encountered in practice, this restriction is minor. The underlying technique for CC is what we call skewed stable random projections, which captures the intuition that, when p=1, a simple counter suffices, and when p = 1±Δ with small Δ, the sample complexity of a counter system should be low (continuously as a function of Δ). We show that at small Δ the sample complexity (number of projections) is k = O(1/ϵ) instead of O(1/ϵ^2). Compressed Counting can serve as a basic building block for other tasks in statistics and computing, for example, estimating entropies of data streams and parameter estimation using the method of moments and maximum likelihood. Finally, another contribution is an algorithm for approximating the logarithmic norm, ∑_{i=1}^D log A_t[i], and the logarithmic distance. The logarithmic distance is useful in machine learning practice with heavy-tailed data.


## 1 Introduction

This paper focuses on counting, which is among the most fundamental operations in almost every field of science and engineering. Computing the sum $\sum_{i=1}^D A_t[i]$ is the simplest counting ($t$ denotes time). Counting the $\alpha$th moment $\sum_{i=1}^D A_t[i]^\alpha$ is more general. When $\alpha\to 0+$, $\sum_{i=1}^D A_t[i]^\alpha$ counts the total number of non-zeros in $A_t$. When $\alpha=2$, it counts the “energy” or “power” of the signal $A_t$. If $A_t$ actually outputs the power of an underlying signal $B_t$, counting the sum $\sum_{i=1}^D A_t[i]$ is equivalent to computing $\sum_{i=1}^D B_t[i]^2$.

Here, $A_t$ denotes a time-varying signal, for example, data streams [8, 5, 10, 2, 4, 17]. In the literature, the $\alpha$th frequency moment of a data stream $A_t$ is defined as

$$F_{(\alpha)}=\sum_{i=1}^{D}|A_t[i]|^{\alpha}. \tag{1}$$

Counting $F_{(\alpha)}$ for massive data streams is practically important, among many challenging issues in data stream computations. In fact, the general theme of “scaling up for high dimensional data and high speed data streams” is among the “ten challenging problems in data mining research.”

Because the elements, $A_t[i]$, are time-varying, a naive counting mechanism requires a system of $D$ counters to compute $F_{(\alpha)}$ exactly. This is not always realistic when $D$ is large and we only need an approximate answer. For example, $D$ may be on the order of the entire address space if $A_t$ records the arrivals of IP addresses. Or, $D$ can be the total number of checking/savings accounts.

Compressed Counting (CC) is a new scheme for approximating the $\alpha$th frequency moments of data streams (where $0<\alpha\le 2$) using low memory. The underlying technique is based on what we call skewed stable random projections.

### 1.1 The Data Models

We consider the popular Turnstile data stream model. The input stream $a_t=(i_t, I_t)$, $i_t\in[1, D]$, arriving sequentially, describes the underlying signal $A$, meaning $A_t[i_t]=A_{t-1}[i_t]+I_t$. The increment $I_t$ can be either positive (insertion) or negative (deletion). Restricting $I_t\ge 0$ results in the cash register model. Restricting $A_t[i]\ge 0$ at all $t$ (but $I_t$ can still be either positive or negative) results in the strict Turnstile model, which suffices for describing most (although not all) natural phenomena. For example, in a database, a record can only be deleted if it was previously inserted. Another example is the checking/savings account, which allows deposits/withdrawals but generally does not allow overdraft.

Compressed Counting (CC) is applicable when, at the time $t$ for the evaluation, $A_t[i]\ge 0$ for all $i$. This is more flexible than the strict Turnstile model, which requires $A_t[i]\ge 0$ at all $t$. In other words, CC is applicable when data streams are (a) insertion only (i.e., the cash register model), or (b) always non-negative (i.e., the strict Turnstile model), or (c) non-negative at check points. We believe our model suffices for describing most natural data streams in practice.

With the realistic restriction that $A_t[i]\ge 0$ at $t$, the definition of the $\alpha$th frequency moment becomes

$$F_{(\alpha)}=\sum_{i=1}^{D}A_t[i]^{\alpha}; \tag{2}$$

and the case $\alpha=1$ becomes trivial, because

$$F_{(1)}=\sum_{i=1}^{D}A_t[i]=\sum_{s=1}^{t}I_s. \tag{3}$$

In other words, for $F_{(1)}$, we need only a simple counter to accumulate all values of increment/decrement $I_t$.
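To make (3) concrete, here is a minimal sketch of a Turnstile stream processed with a single counter. The function name `run_turnstile` and the toy stream are illustrative assumptions, not part of the paper; the exact signal is stored only so the counter can be checked against it.

```python
from collections import defaultdict


def run_turnstile(stream):
    """Process (index, increment) updates of a Turnstile stream.

    Returns the exact signal A_t (kept here only for checking) and the
    single running counter that accumulates every increment I_s.
    """
    signal = defaultdict(int)  # exact A_t[i]; a real system would not store this
    counter = 0                # the one counter needed for F_(1)
    for index, increment in stream:
        signal[index] += increment  # A_t[i_t] = A_{t-1}[i_t] + I_t
        counter += increment        # running sum of all I_s
    return signal, counter


# Toy stream with insertions and deletions; all entries end non-negative.
stream = [(3, 5), (7, 2), (3, -4), (9, 6), (7, -2)]
signal, counter = run_turnstile(stream)

# The counter agrees with the sum of the final signal, as in equation (3).
print(counter, sum(signal.values()))
```

The counter uses constant memory regardless of $D$, which is the behavior CC aims to approach when $\alpha$ is close to 1.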

For $\alpha\neq 1$, however, counting (2) is still a non-trivial problem. Intuitively, there should exist an intelligent counting system that performs almost like a simple counter when $\alpha=1\pm\Delta$ with small $\Delta$. The parameter $\Delta$ may bear a clear physical meaning. For example, $\Delta$ may be the “decay rate” or “interest rate,” which is usually small.

The proposed Compressed Counting (CC) provides such an intelligent counting system. Because its underlying technique is based on skewed stable random projections, we provide a brief introduction to skewed stable distributions.

### 1.2 Skewed Stable Distributions

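A random variable $Z$ follows a $\beta$-skewed $\alpha$-stable distribution if the Fourier transform of its density is

$$\mathscr{F}_Z(t)=E\exp\left(\sqrt{-1}\,Zt\right)=\exp\left(-F|t|^{\alpha}\left(1-\sqrt{-1}\,\beta\,\mathrm{sign}(t)\tan\left(\frac{\pi\alpha}{2}\right)\right)\right),\quad \alpha\neq 1,$$

where $-1\le\beta\le 1$ and $F>0$ is the scale parameter. We denote $Z\sim S(\alpha,\beta,F)$. Here $0<\alpha\le 2$. When $\alpha<0$, the inverse Fourier transform is unbounded; and when $\alpha>2$, the inverse Fourier transform is not a probability density. This is why Compressed Counting is limited to $0<\alpha\le 2$.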

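Consider two independent variables, $Z_1, Z_2\sim S(\alpha,\beta,1)$. For any non-negative constants $C_1$ and $C_2$, the “$\alpha$-stability” follows from properties of Fourier transforms:

$$Z=C_1Z_1+C_2Z_2\sim S\left(\alpha,\beta,C_1^{\alpha}+C_2^{\alpha}\right).$$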
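As a short worked step (a sketch that uses only independence and the definition of the Fourier transform given in this section, assuming $Z_1, Z_2\sim S(\alpha,\beta,1)$ and $C_1, C_2>0$), the transform of $Z$ factorizes and the two exponents add:

$$\begin{aligned}
\mathscr{F}_Z(t) &= \mathscr{F}_{Z_1}(C_1t)\,\mathscr{F}_{Z_2}(C_2t)\\
&= \exp\left(-\left(C_1^{\alpha}+C_2^{\alpha}\right)|t|^{\alpha}\left(1-\sqrt{-1}\,\beta\,\mathrm{sign}(t)\tan\left(\frac{\pi\alpha}{2}\right)\right)\right).
\end{aligned}$$

The second line uses $|C_jt|^{\alpha}=C_j^{\alpha}|t|^{\alpha}$ and $\mathrm{sign}(C_jt)=\mathrm{sign}(t)$, both of which require $C_j>0$. The result is again of the stable form with scale $C_1^{\alpha}+C_2^{\alpha}$.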

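However, if $C_1$ and $C_2$ do not have the same signs, the above “stability” does not hold (unless $\beta=0$ or $\alpha=2$, $0+$). To see this, we consider $Z=C_1Z_1-C_2Z_2$, with $C_1\ge 0$ and $C_2\ge 0$. Then, because $\mathscr{F}_{-Z_2}(t)=\mathscr{F}_{Z_2}(-t)$,

$$\mathscr{F}_Z(t)=\exp\left(-|C_1t|^{\alpha}\left(1-\sqrt{-1}\,\beta\,\mathrm{sign}(t)\tan\left(\frac{\pi\alpha}{2}\right)\right)\right)\times\exp\left(-|C_2t|^{\alpha}\left(1+\sqrt{-1}\,\beta\,\mathrm{sign}(t)\tan\left(\frac{\pi\alpha}{2}\right)\right)\right),$$

which does not represent a stable law, unless $\beta=0$ or $\alpha=2$, $0+$. This is the fundamental reason why Compressed Counting needs the restriction that, at the time of evaluation, elements in the data streams should have the same signs.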

### 1.3 Skewed Stable Random Projections

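Given $R\in\mathbb{R}^{D}$ with each element $r_i\sim S(\alpha,\beta,1)$ i.i.d., then

$$R^{T}A_t=\sum_{i=1}^{D}r_iA_t[i]\sim S\left(\alpha,\beta,F_{(\alpha)}=\sum_{i=1}^{D}A_t[i]^{\alpha}\right),$$

meaning $R^{T}A_t$ represents one sample of the stable distribution whose scale parameter $F_{(\alpha)}$ is what we are after.

Of course, we need more than one sample to estimate $F_{(\alpha)}$. We can generate a matrix $\mathbf{R}\in\mathbb{R}^{D\times k}$ with each entry $r_{ij}\sim S(\alpha,\beta,1)$. The resultant vector $X=\mathbf{R}^{T}A_t\in\mathbb{R}^{k}$ contains $k$ i.i.d. samples: $x_j\sim S(\alpha,\beta,F_{(\alpha)})$, $j=1$ to $k$.

Note that this is a linear projection; and recall that the Turnstile model is also linear. Thus, skewed stable random projections are applicable to dynamic data streams. For every incoming $a_t=(i_t,I_t)$, we update $x_j\leftarrow x_j+r_{i_tj}I_t$ for $j=1$ to $k$. This way, at any time $t$, we maintain $k$ i.i.d. stable samples. The remaining task is to recover $F_{(\alpha)}$, which is a statistical estimation problem.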
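The update rule can be sketched as follows. This is a minimal illustration under stated assumptions: the sampler uses the Chambers-Mallows-Stuck construction for stable variables (a standard technique that the paper does not prescribe), the names `sample_skewed_stable`, `make_projection`, and `update_sketch` are hypothetical, and the projection matrix is stored explicitly only because the toy dimension is small.

```python
import math
import random


def sample_skewed_stable(alpha, rng):
    """One draw from S(alpha, beta=1, scale=1), alpha != 1.

    Chambers-Mallows-Stuck construction (an assumption of this sketch).
    """
    v = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    tan_term = math.tan(math.pi * alpha / 2)
    b = math.atan(tan_term) / alpha
    s = (1 + tan_term ** 2) ** (1 / (2 * alpha))
    return (s * math.sin(alpha * (v + b)) / math.cos(v) ** (1 / alpha)
            * (math.cos(v - alpha * (v + b)) / w) ** ((1 - alpha) / alpha))


def make_projection(dim, k, alpha, seed=0):
    """D x k matrix with i.i.d. S(alpha, 1, 1) entries."""
    rng = random.Random(seed)
    return [[sample_skewed_stable(alpha, rng) for _ in range(k)]
            for _ in range(dim)]


def update_sketch(sketch, projection, index, increment):
    """Apply one stream element (i_t, I_t): x_j += r_{i_t, j} * I_t."""
    row = projection[index]
    for j in range(len(sketch)):
        sketch[j] += row[j] * increment


alpha, dim, k = 0.9, 20, 8
projection = make_projection(dim, k, alpha)
stream = [(3, 5.0), (7, 2.0), (3, -4.0), (9, 6.0), (7, -2.0)]

sketch = [0.0] * k
signal = [0.0] * dim  # exact A_t, kept only to verify linearity
for index, increment in stream:
    signal[index] += increment
    update_sketch(sketch, projection, index, increment)

# Linearity: the streamed sketch matches projecting the final signal directly.
direct = [sum(projection[i][j] * signal[i] for i in range(dim))
          for j in range(k)]
max_gap = max(abs(a - b) for a, b in zip(sketch, direct))
print(max_gap)
```

Each update touches only the $k$ sketch entries, so the per-element cost is independent of $D$. In practice the entries $r_{ij}$ would typically be regenerated on demand from a seed rather than stored.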

### 1.4 Counting in Statistical/Learning Applications

The method of moments is often convenient and popular in statistical parameter estimation. Consider, for example, the three-parameter generalized gamma distribution, which is highly flexible for modeling positive data. Its first three moments are available in closed form in terms of the parameters. Thus, one can estimate the three parameters from $D$ i.i.d. samples $x_i$ by counting the first three empirical moments from the data. However, some moments may be (much) easier to compute than others if the data $x_i$'s are collected from data streams. Instead of using integer moments, the parameters can also be estimated from any three fractional moments, i.e., $\sum_{i=1}^{D}x_i^{\alpha}$, for three different values of $\alpha$. Because $D$ is very large, any consistent estimator is likely to provide a good estimate. Thus, it might be reasonable to choose $\alpha$ mainly based on the computational cost. See Appendix A for comments on the situation in which one may also care about the relative accuracy caused by different choices of $\alpha$.

The logarithmic norm arises in statistical estimation, for example, the maximum likelihood estimators for the Pareto and gamma distributions. Since it is closely connected to the moment problem, Section 4 provides an algorithm for approximating the logarithmic norm, as well as for the logarithmic distance; the latter can be quite useful in machine learning practice with massive heavy-tailed data (either dynamic or static) in lieu of the usual distance.

Entropy is also an important summary statistic. It was recently proposed to approximate the entropy moment using the $\alpha$th moments with $\alpha=1\pm\Delta$ and very small $\Delta$.

### 1.5 Comparisons with Previous Studies

Pioneered by, there have been many studies on approximating the th frequency moment .  considered integer moments, , 1, 2, as well as . Soon after, [5, 9] provided improved algorithms for . [18, 3] proved the sample complexity lower bounds for .  proved the optimal lower bounds for all frequency moments, except for , because for non-negative data, can be computed essentially error-free with a counter[16, 6, 1].  provided algorithms for to (essentially) achieve the lower bounds proved in [18, 3].

Note that an algorithm, which “achieves the optimal bound,” is not necessarily practical because the constant may be very large. In a sense, the method based on symmetric stable random projections is one of the few successful algorithms that are simple and free of large constants.  described the procedure for approximating in data streams and proved the bound for (although not explicitly). For ,  provided a conceptual algorithm.  proposed various estimators for symmetric stable random projections and provided the constants explicitly for all .

None of the previous studies, however, captures the intuition that, when $\alpha=1$, a simple counter suffices for computing $F_{(1)}$ (essentially) error-free, and when $\alpha=1\pm\Delta$ with small $\Delta$, the sample complexity (number of projections, $k$) should be low and vary continuously as a function of $\Delta$.

Compressed Counting (CC) is proposed for $0<\alpha\le 2$ and it works particularly well when $\alpha=1\pm\Delta$ with small $\Delta$. This can be practically very useful. For example, $\Delta$ may be the “decay rate” or the “interest rate,” which is usually small; thus CC can count the total value in the future, taking into account the effect of decay or interest accrual. In parameter estimation using the method of moments, one may choose the $\alpha$th moments with $\alpha$ close to 1. Also, one can approximate the entropy moment using the $\alpha$th moments with $\alpha=1\pm\Delta$ and very small $\Delta$.

Our study has connections to the Johnson-Lindenstrauss Lemma, which proved $k=O(1/\epsilon^{2})$ at $\alpha=2$. An analogous bound holds for $0<\alpha\le 2$ [10, 14]. The dependency on $1/\epsilon^{2}$ may raise concerns when $\epsilon$ is small. We will show that CC achieves $k=O(1/\epsilon)$ in the neighborhood of $\alpha=1$.

### 1.6 Two Statistical Estimators

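Recall that Compressed Counting (CC) boils down to a statistical estimation problem. That is, given $k$ i.i.d. samples $x_j\sim S(\alpha,\beta,F_{(\alpha)})$, estimate the scale parameter $F_{(\alpha)}$. Section 2 will explain why we fix $\beta=1$.

Part of this paper is to provide estimators which are convenient for theoretical analysis, e.g., tail bounds. We provide the geometric mean and the harmonic mean estimators, whose asymptotic variances are illustrated in Figure 1.

Figure 1: Let $\hat{F}$ be an estimator of $F$ with asymptotic variance $\mathrm{Var}(\hat{F})=V\frac{F^{2}}{k}+O\left(\frac{1}{k^{2}}\right)$. We plot the $V$ values for the geometric mean and the harmonic mean estimators, along with the $V$ values for the geometric mean estimator in  (symmetric GM). When $\alpha\to 1$, our method achieves an “infinite improvement” in terms of the asymptotic variances.

• The geometric mean estimator, $\hat{F}_{(\alpha),gm}$:

$$\hat{F}_{(\alpha),gm}=\frac{\prod_{j=1}^{k}|x_j|^{\alpha/k}}{D_{gm}},\quad (D_{gm}\text{ depends on }\alpha\text{ and }k),$$

$$\mathrm{Var}\left(\hat{F}_{(\alpha),gm}\right)=\frac{F_{(\alpha)}^{2}}{k}\frac{\pi^{2}}{12}\left(\alpha^{2}+2-3\kappa^{2}(\alpha)\right)+O\left(\frac{1}{k^{2}}\right),$$

$$\kappa(\alpha)=\alpha\ \text{ if }\ \alpha<1,\qquad \kappa(\alpha)=2-\alpha\ \text{ if }\ \alpha>1.$$

$\hat{F}_{(\alpha),gm}$ is unbiased. We prove the sample complexity explicitly and show that $k=O(1/\epsilon)$ suffices for $\alpha$ around 1.

• The harmonic mean estimator, $\hat{F}_{(\alpha),hm,c}$, for $\alpha<1$:

$$\hat{F}_{(\alpha),hm,c}=\frac{k\frac{\cos\left(\frac{\alpha\pi}{2}\right)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k}|x_j|^{-\alpha}}\left(1-\frac{1}{k}\left(\frac{2\Gamma^{2}(1+\alpha)}{\Gamma(1+2\alpha)}-1\right)\right),$$

$$\mathrm{Var}\left(\hat{F}_{(\alpha),hm,c}\right)=\frac{F_{(\alpha)}^{2}}{k}\left(\frac{2\Gamma^{2}(1+\alpha)}{\Gamma(1+2\alpha)}-1\right)+O\left(\frac{1}{k^{2}}\right).$$

It is considerably more accurate than $\hat{F}_{(\alpha),gm}$, and its sample complexity bound is also provided in an explicit form. Here $\Gamma(\cdot)$ is the usual gamma function.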
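The two variance factors can be evaluated directly from the closed forms stated for the estimators. The following is a small sketch (function names are illustrative, not from the paper) that tabulates the factor $V$ for both estimators as $\alpha$ approaches 1 from below.

```python
import math


def kappa(alpha):
    """kappa(alpha) = alpha if alpha < 1, and 2 - alpha otherwise."""
    return alpha if alpha < 1 else 2 - alpha


def v_geometric_mean(alpha):
    """Leading variance factor of the geometric mean estimator."""
    return math.pi ** 2 / 12 * (alpha ** 2 + 2 - 3 * kappa(alpha) ** 2)


def v_harmonic_mean(alpha):
    """Leading variance factor of the harmonic mean estimator (alpha < 1)."""
    return 2 * math.gamma(1 + alpha) ** 2 / math.gamma(1 + 2 * alpha) - 1


for alpha in (0.5, 0.8, 0.9, 0.99, 0.999):
    print(f"alpha={alpha:<6} V_gm={v_geometric_mean(alpha):.5f} "
          f"V_hm={v_harmonic_mean(alpha):.5f}")
```

Both factors shrink toward zero as $\alpha\to 1$, and the harmonic mean factor stays below the geometric mean factor on this grid, which is consistent with the statement that the harmonic mean estimator is more accurate for $\alpha<1$.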

### 1.7 Paper Organization

Section 2 begins with analyzing the moments of skewed stable distributions, from which the geometric mean and harmonic mean estimators are derived. Section 2 is then devoted to the detailed analysis of the geometric mean estimator.

Section 3 analyzes the harmonic mean estimator. Section 4 addresses the application of CC in statistical parameter estimation and an algorithm for approximating the logarithmic norm and distance. The proofs are presented as appendices.

## 2 The Geometric Mean Estimator

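We first prove a fundamental result about the moments of skewed stable distributions. If $Z\sim S(\alpha,\beta,F_{(\alpha)})$, then for any $-1<\lambda<\alpha$,

$$E\left(|Z|^{\lambda}\right)=F_{(\alpha)}^{\lambda/\alpha}\cos\left(\frac{\lambda}{\alpha}\tan^{-1}\left(\beta\tan\left(\frac{\alpha\pi}{2}\right)\right)\right)\times\left(1+\beta^{2}\tan^{2}\left(\frac{\alpha\pi}{2}\right)\right)^{\frac{\lambda}{2\alpha}}\left(\frac{2}{\pi}\sin\left(\frac{\pi}{2}\lambda\right)\Gamma\left(1-\frac{\lambda}{\alpha}\right)\Gamma(\lambda)\right),$$

which can be simplified, when $\beta=1$, to

$$E\left(|Z|^{\lambda}\right)=F_{(\alpha)}^{\lambda/\alpha}\frac{\cos\left(\frac{\kappa(\alpha)}{\alpha}\frac{\lambda\pi}{2}\right)}{\cos^{\lambda/\alpha}\left(\frac{\kappa(\alpha)\pi}{2}\right)}\left(\frac{2}{\pi}\sin\left(\frac{\pi}{2}\lambda\right)\Gamma\left(1-\frac{\lambda}{\alpha}\right)\Gamma(\lambda)\right),$$

$$\kappa(\alpha)=\alpha\ \text{ if }\ \alpha<1,\quad\text{and}\quad\kappa(\alpha)=2-\alpha\ \text{ if }\ \alpha>1.$$

For $\alpha<1$, $\beta=1$, and $-\infty<\lambda<\alpha$,

$$E\left(|Z|^{\lambda}\right)=E\left(Z^{\lambda}\right)=F_{(\alpha)}^{\lambda/\alpha}\frac{\Gamma\left(1-\frac{\lambda}{\alpha}\right)}{\cos^{\lambda/\alpha}\left(\frac{\alpha\pi}{2}\right)\Gamma(1-\lambda)}.$$

Proof: See Appendix B.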

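Recall that Compressed Counting boils down to estimating $F_{(\alpha)}$ from these $k$ i.i.d. samples $x_j\sim S(\alpha,\beta,F_{(\alpha)})$. Setting $\lambda=\alpha/k$ in Lemma 2 yields an unbiased estimator:

$$\hat{F}_{(\alpha),gm,\beta}=\frac{\prod_{j=1}^{k}|x_j|^{\alpha/k}}{D_{gm,\beta}},$$

$$D_{gm,\beta}=\cos^{k}\left(\frac{1}{k}\tan^{-1}\left(\beta\tan\left(\frac{\alpha\pi}{2}\right)\right)\right)\times\left(1+\beta^{2}\tan^{2}\left(\frac{\alpha\pi}{2}\right)\right)^{\frac{1}{2}}\left[\frac{2}{\pi}\sin\left(\frac{\pi\alpha}{2k}\right)\Gamma\left(1-\frac{1}{k}\right)\Gamma\left(\frac{\alpha}{k}\right)\right]^{k}.$$

The following Lemma shows that the variance of $\hat{F}_{(\alpha),gm,\beta}$ decreases with increasing $\beta\in[0,1]$. The variance of $\hat{F}_{(\alpha),gm,\beta}$,

$$\mathrm{Var}\left(\hat{F}_{(\alpha),gm,\beta}\right)=F_{(\alpha)}^{2}V_{gm,\beta},$$

$$V_{gm,\beta}=\frac{\cos^{k}\left(\frac{2}{k}\tan^{-1}\left(\beta\tan\left(\frac{\alpha\pi}{2}\right)\right)\right)}{\cos^{2k}\left(\frac{1}{k}\tan^{-1}\left(\beta\tan\left(\frac{\alpha\pi}{2}\right)\right)\right)}\times\frac{\left[\frac{2}{\pi}\sin\left(\frac{\pi\alpha}{k}\right)\Gamma\left(1-\frac{2}{k}\right)\Gamma\left(\frac{2\alpha}{k}\right)\right]^{k}}{\left[\frac{2}{\pi}\sin\left(\frac{\pi\alpha}{2k}\right)\Gamma\left(1-\frac{1}{k}\right)\Gamma\left(\frac{\alpha}{k}\right)\right]^{2k}}-1,$$

is a decreasing function of $\beta\in[0,1]$.

Proof: The result follows from the fact that

$$\frac{\cos\left(\frac{2}{k}\tan^{-1}\left(\beta\tan\left(\frac{\alpha\pi}{2}\right)\right)\right)}{\cos^{2}\left(\frac{1}{k}\tan^{-1}\left(\beta\tan\left(\frac{\alpha\pi}{2}\right)\right)\right)}=2-\sec^{2}\left(\frac{1}{k}\tan^{-1}\left(\beta\tan\left(\frac{\alpha\pi}{2}\right)\right)\right)$$

is a decreasing function of $\beta\in[0,1]$.

Therefore, for attaining the smallest variance, we take $\beta=1$. For brevity, we simply use $\hat{F}_{(\alpha),gm}$ instead of $\hat{F}_{(\alpha),gm,1}$. In fact, the rest of the paper will always consider $\beta=1$ only.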

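We rewrite $\hat{F}_{(\alpha),gm}$ (i.e., $\hat{F}_{(\alpha),gm,\beta=1}$) as

$$\hat{F}_{(\alpha),gm}=\frac{\prod_{j=1}^{k}|x_j|^{\alpha/k}}{D_{gm}},\quad (k\ge 2), \tag{4}$$

$$D_{gm}=\left(\cos^{k}\left(\frac{\kappa(\alpha)\pi}{2k}\right)\Big/\cos\left(\frac{\kappa(\alpha)\pi}{2}\right)\right)\times\left[\frac{2}{\pi}\sin\left(\frac{\pi\alpha}{2k}\right)\Gamma\left(1-\frac{1}{k}\right)\Gamma\left(\frac{\alpha}{k}\right)\right]^{k}.$$

Here, $\kappa(\alpha)=\alpha$ if $\alpha<1$, and $\kappa(\alpha)=2-\alpha$ if $\alpha>1$.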

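Lemma 2 concerns the asymptotic moments of $\hat{F}_{(\alpha),gm}$. As $k\to\infty$,

$$\left[\cos\left(\frac{\kappa(\alpha)\pi}{2k}\right)\frac{2}{\pi}\Gamma\left(\frac{\alpha}{k}\right)\Gamma\left(1-\frac{1}{k}\right)\sin\left(\frac{\pi}{2}\frac{\alpha}{k}\right)\right]^{k}\to\exp\left(-\gamma_e(\alpha-1)\right), \tag{5}$$

monotonically with increasing $k$ ($k\ge 2$), where $\gamma_e=0.5772\ldots$ is Euler's constant. For any fixed $t$, as $k\to\infty$,

$$\begin{aligned}
E\left(\left(\hat{F}_{(\alpha),gm}\right)^{t}\right)&=F_{(\alpha)}^{t}\frac{\cos^{k}\left(\frac{\kappa(\alpha)\pi}{2k}t\right)\left[\frac{2}{\pi}\sin\left(\frac{\pi\alpha}{2k}t\right)\Gamma\left(1-\frac{t}{k}\right)\Gamma\left(\frac{\alpha}{k}t\right)\right]^{k}}{\cos^{kt}\left(\frac{\kappa(\alpha)\pi}{2k}\right)\left[\frac{2}{\pi}\sin\left(\frac{\pi\alpha}{2k}\right)\Gamma\left(1-\frac{1}{k}\right)\Gamma\left(\frac{\alpha}{k}\right)\right]^{kt}}\\
&=F_{(\alpha)}^{t}\exp\left(\frac{1}{k}\frac{\pi^{2}(t^{2}-t)}{24}\left(\alpha^{2}+2-3\kappa^{2}(\alpha)\right)+O\left(\frac{1}{k^{2}}\right)\right),
\end{aligned}$$

$$\mathrm{Var}\left(\hat{F}_{(\alpha),gm}\right)=\frac{F_{(\alpha)}^{2}}{k}\frac{\pi^{2}}{12}\left(\alpha^{2}+2-3\kappa^{2}(\alpha)\right)+O\left(\frac{1}{k^{2}}\right).$$

Proof: See Appendix C.

In (4), the denominator $D_{gm}$ depends on $k$ for small $k$. For convenience in analyzing tail bounds, we consider an asymptotically equivalent geometric mean estimator:

$$\hat{F}_{(\alpha),gm,b}=\exp\left(\gamma_e(\alpha-1)\right)\cos\left(\frac{\kappa(\alpha)\pi}{2}\right)\prod_{j=1}^{k}|x_j|^{\alpha/k}.$$

Lemma 2 provides the tail bounds for $\hat{F}_{(\alpha),gm,b}$ and Figure 2 plots the tail bound constants. One can infer the tail bounds for $\hat{F}_{(\alpha),gm}$ from the monotonicity result (5).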

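The right tail bound:

$$\Pr\left(\hat{F}_{(\alpha),gm,b}-F_{(\alpha)}\ge\epsilon F_{(\alpha)}\right)\le\exp\left(-k\frac{\epsilon^{2}}{G_{R,gm}}\right),\quad\epsilon>0,$$

and the left tail bound:

$$\Pr\left(\hat{F}_{(\alpha),gm,b}-F_{(\alpha)}\le-\epsilon F_{(\alpha)}\right)\le\exp\left(-k\frac{\epsilon^{2}}{G_{L,gm}}\right),\quad 0<\epsilon<1,$$

$$\frac{\epsilon^{2}}{G_{R,gm}}=C_R\log(1+\epsilon)-C_R\gamma_e(\alpha-1)-\log\left(\cos\left(\frac{\kappa(\alpha)\pi C_R}{2}\right)\frac{2}{\pi}\Gamma(\alpha C_R)\Gamma(1-C_R)\sin\left(\frac{\pi\alpha C_R}{2}\right)\right),$$

$$\frac{\epsilon^{2}}{G_{L,gm}}=-C_L\log(1-\epsilon)+C_L\gamma_e(\alpha-1)+\log\alpha-\log\left(\cos\left(\frac{\kappa(\alpha)\pi}{2}C_L\right)\Gamma(C_L)\right)+\log\left(\Gamma(\alpha C_L)\cos\left(\frac{\pi\alpha C_L}{2}\right)\right).$$

$C_R$ and $C_L$ are solutions to

$$-\gamma_e(\alpha-1)+\log(1+\epsilon)+\frac{\kappa(\alpha)\pi}{2}\tan\left(\frac{\kappa(\alpha)\pi}{2}C_R\right)-\frac{\alpha\pi/2}{\tan\left(\frac{\alpha\pi}{2}C_R\right)}-\psi(\alpha C_R)\alpha+\psi(1-C_R)=0,$$

$$\log(1-\epsilon)-\gamma_e(\alpha-1)-\frac{\kappa(\alpha)\pi}{2}\tan\left(\frac{\kappa(\alpha)\pi}{2}C_L\right)+\frac{\alpha\pi}{2}\tan\left(\frac{\alpha\pi}{2}C_L\right)-\psi(\alpha C_L)\alpha+\psi(C_L)=0.$$

Here $\psi(z)=\Gamma'(z)/\Gamma(z)$ is the “Psi” function.

Proof: See Appendix D.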

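It is important to understand the behavior of the tail bounds as $\alpha=1\pm\Delta\to 1$. ($\alpha=1+\Delta$ if $\alpha>1$; and $\alpha=1-\Delta$ if $\alpha<1$.) See more comments in Appendix A. Lemma 2 describes the precise rates of convergence.

For fixed $\epsilon$, as $\alpha\to 1$ (i.e., $\Delta\to 0$),

$$G_{R,gm}=\frac{\epsilon^{2}}{\log(1+\epsilon)-2\sqrt{\Delta\log(1+\epsilon)}+o\left(\sqrt{\Delta}\right)},$$

$$\text{If }\alpha>1,\text{ then }\ G_{L,gm}=\frac{\epsilon^{2}}{-\log(1-\epsilon)-2\sqrt{-2\Delta\log(1-\epsilon)}+o\left(\sqrt{\Delta}\right)},$$

$$\text{If }\alpha<1,\text{ then }\ G_{L,gm}=\frac{\epsilon^{2}}{\Delta\exp\left(\frac{-\log(1-\epsilon)}{\Delta}-1-\gamma_e\right)+o\left(\Delta\exp\left(\frac{1}{\Delta}\right)\right)}.$$

Proof: See Appendix E.

Figure 3 plots the constants for small values of $\Delta$, along with the approximations suggested in Lemma 2. Since $\epsilon$ should usually not be too large, we can write, as $\alpha\to 1$, $G_{R,gm}=O(\epsilon)$, and $G_{L,gm}=O(\epsilon)$ if $\alpha>1$; both converge at the rate $O(\sqrt{\Delta})$. However, if $\alpha<1$, $G_{L,gm}$ decays exponentially in $1/\Delta$, which is extremely fast.

The sample complexity bound is then straightforward. Using the geometric mean estimator, it suffices to let $k=G\frac{1}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$ so that the error will be within a $1\pm\epsilon$ factor with probability $1-\delta$, where $G=\max(G_{R,gm},G_{L,gm})$. In the neighborhood of $\alpha=1$, $k=O(1/\epsilon)$ only.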
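The step from the tail bounds to the sample complexity is a union bound. As a brief worked sketch, writing $\hat{F}$ for the geometric mean estimator, $\delta$ for the allowed failure probability, and $G=\max(G_{R,gm},G_{L,gm})$ (this notation is an assumption made here for illustration):

$$\Pr\left(\left|\hat{F}-F_{(\alpha)}\right|\ge\epsilon F_{(\alpha)}\right)\le\exp\left(-k\frac{\epsilon^{2}}{G_{R,gm}}\right)+\exp\left(-k\frac{\epsilon^{2}}{G_{L,gm}}\right)\le 2\exp\left(-k\frac{\epsilon^{2}}{G}\right).$$

Requiring the right-hand side to be at most $\delta$ gives

$$k\ge\frac{G}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right).$$

Whenever $G$ is of order $\epsilon$, which is what the expansions suggest near $\alpha=1$ for small $\epsilon$, the required $k$ scales as $\frac{1}{\epsilon}\log\left(\frac{2}{\delta}\right)$ rather than $\frac{1}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$.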

## 3 The Harmonic Mean Estimator

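For $\alpha<1$, the harmonic mean estimator can considerably improve $\hat{F}_{(\alpha),gm}$. Unlike the harmonic mean estimator in , which is useful only for small $\alpha$ and has no exponential tail bounds except for $\alpha\to 0+$, the harmonic mean estimator in this study has very nice tail properties for all $0<\alpha<1$.

The harmonic mean estimator takes advantage of the fact that if $Z\sim S(\alpha<1,\beta=1,F_{(\alpha)})$, then $E\left(|Z|^{\lambda}\right)$ exists for all $-\infty<\lambda<\alpha$.

Assume $k$ i.i.d. samples $x_j\sim S(\alpha<1,\beta=1,F_{(\alpha)})$, define the harmonic mean estimator $\hat{F}_{(\alpha),hm}$,

$$\hat{F}_{(\alpha),hm}=\frac{k\frac{\cos\left(\frac{\alpha\pi}{2}\right)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k}|x_j|^{-\alpha}},$$

and the bias-corrected harmonic mean estimator $\hat{F}_{(\alpha),hm,c}$,

$$\hat{F}_{(\alpha),hm,c}=\frac{k\frac{\cos\left(\frac{\alpha\pi}{2}\right)}{\Gamma(1+\alpha)}}{\sum_{j=1}^{k}|x_j|^{-\alpha}}\left(1-\frac{1}{k}\left(\frac{2\Gamma^{2}(1+\alpha)}{\Gamma(1+2\alpha)}-1\right)\right).$$

The bias and variance of $\hat{F}_{(\alpha),hm,c}$ are

$$E\left(\hat{F}_{(\alpha),hm,c}\right)=F_{(\alpha)}+O\left(\frac{1}{k^{2}}\right),\qquad\mathrm{Var}\left(\hat{F}_{(\alpha),hm,c}\right)=\frac{F_{(\alpha)}^{2}}{k}\left(\frac{2\Gamma^{2}(1+\alpha)}{\Gamma(1+2\alpha)}-1\right)+O\left(\frac{1}{k^{2}}\right).$$

The right tail bound of $\hat{F}_{(\alpha),hm}$ is, for $\epsilon>0$,

$$\Pr\left(\hat{F}_{(\alpha),hm}-F_{(\alpha)}\ge\epsilon F_{(\alpha)}\right)\le\exp\left(-k\left(\frac{\epsilon^{2}}{G_{R,hm}}\right)\right),$$

$$\frac{\epsilon^{2}}{G_{R,hm}}=-\log\left(\sum_{m=0}^{\infty}\frac{\Gamma^{m}(1+\alpha)}{\Gamma(1+m\alpha)}\left(-t_1^{*}\right)^{m}\right)-\frac{t_1^{*}}{1+\epsilon},$$

where $t_1^{*}$ is the solution to

$$\frac{\sum_{m=1}^{\infty}(-1)^{m}m\left(t_1^{*}\right)^{m-1}\frac{\Gamma^{m}(1+\alpha)}{\Gamma(1+m\alpha)}}{\sum_{m=0}^{\infty}(-1)^{m}\left(t_1^{*}\right)^{m}\frac{\Gamma^{m}(1+\alpha)}{\Gamma(1+m\alpha)}}+\frac{1}{1+\epsilon}=0.$$

The left tail bound of $\hat{F}_{(\alpha),hm}$ is, for $0<\epsilon<1$,

$$\Pr\left(\hat{F}_{(\alpha),hm}-F_{(\alpha)}\le-\epsilon F_{(\alpha)}\right)\le\exp\left(-k\left(\frac{\epsilon^{2}}{G_{L,hm}}\right)\right),$$

$$\frac{\epsilon^{2}}{G_{L,hm}}=-\log\left(\sum_{m=0}^{\infty}\frac{\Gamma^{m}(1+\alpha)}{\Gamma(1+m\alpha)}\left(t_2^{*}\right)^{m}\right)+\frac{t_2^{*}}{1-\epsilon},$$

where $t_2^{*}$ is the solution to

$$-\frac{\sum_{m=1}^{\infty}m\left(t_2^{*}\right)^{m-1}\frac{\Gamma^{m}(1+\alpha)}{\Gamma(1+m\alpha)}}{\sum_{m=0}^{\infty}\left(t_2^{*}\right)^{m}\frac{\Gamma^{m}(1+\alpha)}{\Gamma(1+m\alpha)}}+\frac{1}{1-\epsilon}=0.$$

Proof: See Appendix F.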
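The harmonic mean estimator can be tried on simulated samples. The sketch below is an illustration under stated assumptions: samples are drawn with the Chambers-Mallows-Stuck construction (a standard sampler, not something the paper specifies), rescaled by $F^{1/\alpha}$ so that their scale parameter is a chosen `true_f`, and all function names are hypothetical.

```python
import math
import random


def sample_skewed_stable(alpha, rng):
    """One draw from S(alpha, beta=1, scale=1), alpha != 1 (CMS construction)."""
    v = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    tan_term = math.tan(math.pi * alpha / 2)
    b = math.atan(tan_term) / alpha
    s = (1 + tan_term ** 2) ** (1 / (2 * alpha))
    return (s * math.sin(alpha * (v + b)) / math.cos(v) ** (1 / alpha)
            * (math.cos(v - alpha * (v + b)) / w) ** ((1 - alpha) / alpha))


def harmonic_mean_estimate(samples, alpha):
    """k * cos(alpha*pi/2) / Gamma(1+alpha), divided by sum of |x_j|^(-alpha)."""
    k = len(samples)
    numerator = k * math.cos(math.pi * alpha / 2) / math.gamma(1 + alpha)
    return numerator / sum(abs(x) ** (-alpha) for x in samples)


def bias_corrected_estimate(samples, alpha):
    """Harmonic mean estimate multiplied by the (1 - V/k) correction."""
    k = len(samples)
    v = 2 * math.gamma(1 + alpha) ** 2 / math.gamma(1 + 2 * alpha) - 1
    return harmonic_mean_estimate(samples, alpha) * (1 - v / k)


rng = random.Random(1)
alpha, k, true_f = 0.9, 2000, 50.0

# If Z ~ S(alpha, 1, 1) then c * Z ~ S(alpha, 1, c**alpha) for c > 0,
# so scaling by true_f ** (1 / alpha) yields scale parameter true_f.
samples = [true_f ** (1 / alpha) * sample_skewed_stable(alpha, rng)
           for _ in range(k)]

estimate = bias_corrected_estimate(samples, alpha)
relative_error = abs(estimate - true_f) / true_f
print(estimate, relative_error)
```

With $\alpha=0.9$ the variance factor is roughly $0.1$, so for $k=2000$ the stated asymptotic variance predicts a relative standard deviation below one percent; the simulated error should be of that order.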

## 4 The Logarithmic Norm and Distance

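The logarithmic norm and distance can be important in practice. Consider estimating the parameters from $D$ i.i.d. samples $x_i\sim\mathrm{Gamma}(\theta,\gamma)$. The density function is $f_X(x)=\frac{x^{\theta-1}\exp(-x/\gamma)}{\gamma^{\theta}\Gamma(\theta)}$, and the log-likelihood is

$$(\theta-1)\sum_{i=1}^{D}\log x_i-\sum_{i=1}^{D}x_i/\gamma-D\theta\log(\gamma)-D\log\Gamma(\theta).$$

If instead, $x_i\sim\mathrm{Pareto}(\theta)$, $i=1$ to $D$, then the density is $f_X(x)=\frac{\theta}{x^{\theta+1}}$, $x\ge 1$, and the log-likelihood is

$$D\log\theta-(\theta+1)\sum_{i=1}^{D}\log x_i.$$

Therefore, the logarithmic norm occurs at least in the context of maximum likelihood estimation of common distributions. Now, suppose the data $x_i$'s are actually the elements of data streams, $A_t[i]$'s. Estimating $\sum_{i=1}^{D}\log A_t[i]$ becomes an interesting and practically meaningful problem.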

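Our solution is based on the fact that, as $\alpha\to 0+$,

$$\frac{D}{\alpha}\log\left(\frac{1}{D}\sum_{i=1}^{D}A_t[i]^{\alpha}\right)\to\sum_{i=1}^{D}\log A_t[i],$$

which can be shown by L'Hôpital's rule. More precisely,

$$\left|\frac{D}{\alpha}\log\left(\frac{1}{D}\sum_{i=1}^{D}A_t[i]^{\alpha}\right)-\sum_{i=1}^{D}\log A_t[i]\right|=O\left(\frac{\alpha}{D}\left(\sum_{i=1}^{D}\log A_t[i]\right)^{2}\right)+O\left(\alpha\sum_{i=1}^{D}\log^{2}A_t[i]\right),$$

which can be shown by Taylor expansions.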
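The limit can be checked numerically on a small positive vector. This is a minimal sketch with made-up data and a hypothetical helper name; it uses the exact moment $\sum_i A_t[i]^{\alpha}$, whereas in a streaming setting that moment would be replaced by an estimate.

```python
import math


def log_norm_via_moment(values, alpha):
    """(D / alpha) * log((1 / D) * sum(v ** alpha)) for strictly positive values."""
    d = len(values)
    moment = sum(v ** alpha for v in values)
    return d / alpha * math.log(moment / d)


# Made-up, strictly positive, heavy-tailed-looking data.
values = [1.0, 2.5, 10.0, 400.0, 0.3]
exact = sum(math.log(v) for v in values)

errors = {}
for alpha in (0.1, 0.01, 0.001):
    errors[alpha] = abs(log_norm_via_moment(values, alpha) - exact)
    print(alpha, errors[alpha])
```

The absolute error shrinks roughly in proportion to $\alpha$, in line with the $O(\alpha)$ terms in the bound above.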

Therefore, we obtain one solution to approximating the logarithmic norm using very small $\alpha$. Of course, we have assumed that $A_t[i]>0$ strictly. In fact, this also suggests an approach for approximating the logarithmic distance between two streams, provided we use symmetric stable random projections.

The logarithmic distance can be useful in machine learning practice with massive heavy-tailed data (either static or dynamic) such as image and text data. For those data, the usual distance would not be useful without “term-weighting” the data; and taking logarithm is one simple weighting scheme. Thus, our method provides a direct way to compute pairwise distances, taking into account data weighting automatically.

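One may also be interested in the tail bounds, which, however, cannot be expressed in terms of the logarithmic norm (or distance). Nevertheless, we can obtain, e.g.,

$$\Pr\left(\left[\frac{D}{\alpha}\log\left(\frac{1}{D}\hat{F}_{(\alpha),hm}\right)\right]\ge(1+\epsilon)\left[\frac{D}{\alpha}\log\left(\frac{1}{D}F_{(\alpha)}\right)\right]\right)\le\exp\left(-k\frac{\left(\left(F_{(\alpha)}/D\right)^{\epsilon}-1\right)^{2}}{G_{R,hm}}\right),\quad\epsilon>0,$$

$$\Pr\left(\left[\frac{D}{\alpha}\log\left(\frac{1}{D}\hat{F}_{(\alpha),hm}\right)\right]\le(1-\epsilon)\left[\frac{D}{\alpha}\log\left(\frac{1}{D}F_{(\alpha)}\right)\right]\right)\le\exp\left(-k\frac{\left(1-\left(D/F_{(\alpha)}\right)^{\epsilon}\right)^{2}}{G_{L,hm}}\right),\quad 0<\epsilon<1.$$

If $\hat{F}_{(\alpha),gm}$ is used, we just replace the corresponding constants in the above expressions. If we are interested in the logarithmic distance, we simply apply symmetric stable random projections and use an appropriate estimator of the distance; the corresponding tail bounds will have the same format.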

## 5 Conclusion

Counting is a fundamental operation. In data streams $A_t[i]$, $i\in[1,D]$, counting the $\alpha$th frequency moments has been extensively studied. Our proposed Compressed Counting (CC) takes advantage of the fact that most data streams encountered in practice are non-negative, although they are subject to deletion and insertion. In fact, CC only requires that at the time $t$ for the evaluation, $A_t[i]\ge 0$; at other times, the data streams can actually go below zero.

Compressed Counting successfully captures the intuition that, when $\alpha=1$, a simple counter suffices, and when $\alpha=1\pm\Delta$ with small $\Delta$, an intelligent counting system should require low space (continuously as a function of $\Delta$). The case with small $\Delta$ can be practically important. For example, $\Delta$ may be the “decay rate” or “interest rate,” which is usually small. CC can also be very useful for statistical parameter estimation based on the method of moments. Also, one can approximate the entropy moment using the $\alpha$th moments with $\alpha=1\pm\Delta$ and very small $\Delta$.

Compared with previous studies, e.g., [10, 14], Compressed Counting achieves, in a sense, an “infinite improvement” in terms of the asymptotic variances when $\alpha\to 1$. Two estimators based on the geometric mean and the harmonic mean are provided in this study, including their variances, tail bounds, and sample complexity bounds.

We analyze our sample complexity bound