1 Introduction
When analyzing massive data sets, even simple operations such as computing a sum or mean are costly and time consuming. These simple operations are frequently performed both by people investigating the data interactively and asking a series of questions about it as well as in automated systems which must monitor or collect a multitude of statistics.
Data sketching algorithms enable the information in these massive datasets to be efficiently processed, stored, and queried. This allows them to be applied, for example, in real-time systems, both for ingesting massive data streams and for interactive analysis.
In order to achieve this efficiency, sketches are designed to only answer a specific class of question, and there is typically error in the answer. In other words, it is a form of lossy compression on the original data where one must choose what to lose in the original data. A good sketch makes the most efficient use of the data so that the errors are minimized while having the flexibility to answer a broad range of questions of interest. Some sketches, such as HyperLogLog, are constrained to answer very specific questions with extremely little memory. On the other end of the spectrum, sampling based methods such as coordinated sampling Brewer et al. (1972), Cohen and Kaplan (2013) are able to answer almost any question on the original data but at the cost of far more space to achieve the same approximation error.
We introduce a sketch, Unbiased Space Saving, that simultaneously addresses two common data analysis problems: the disaggregated subset sum problem and the frequent item problem. This makes the sketch more flexible than previous sketches that address one problem or the other. Furthermore, it is efficient as it provides state of the art performance on the disaggregated subset sum problem. On i.i.d. streams it has a stronger provable consistency guarantee for frequent item count estimation than previous results, and on non-i.i.d. streams it performs well both theoretically and empirically. In addition, we derive an error estimator with good coverage properties that allows a user to assess the quality of a disaggregated subset sum result.
The disaggregated subset sum estimation problem is a more challenging variant of the subset sum estimation problem Duffield et al. (2007), the extremely common problem of computing a sum or mean over a dataset with arbitrary filtering conditions. In the disaggregated subset sum problem Cohen et al. (2007), Gibbons and Matias (1998) the data is “disaggregated” so that a per item metric of interest is split across multiple rows. For example, in an ad click stream, the data may arrive as a stream of single clicks that are identified with each ad while the metric of interest is the total number of clicks per ad. The frequent item problem is the problem of identifying the heavy hitters or most frequent items in a dataset. Several sketches exist for each of these individual problems. In particular, the Sample and Hold methods of Cohen et al. (2007), Estan and Varghese (2003), Gibbons and Matias (1998) address the disaggregated subset sum estimation problem. Frequent item sketches include the Space Saving sketch Metwally et al. (2005), Misra-Gries sketch Misra and Gries (1982), and Lossy Counting sketch Manku and Motwani (2002).
Our sketch is an extension of the Space Saving frequent item sketch, and as such, has stronger frequent item estimation properties than Sample and Hold. In particular, unlike Sample and Hold, theorem 3 gives both that a frequent item will eventually be included in the sketch with probability 1, and that the proportion of times it appears will be consistently estimated for i.i.d. streams. In contrast to frequent item sketches which are biased, our Unbiased Space Saving sketch gives unbiased estimates for any subset sum, including subsets containing no frequent items.
Our contributions are in three parts: 1) the development of the Unbiased Space Saving sketch, 2) the generalizations obtained from understanding the properties of the sketch and the mechanisms by which it works, and 3) the theoretical and empirical results establishing the correctness and efficiency of the sketch for answering the problems of interest. In particular, the generalizations allow multiple sketches to be merged so that information from multiple data sets may be combined, and they allow the sketch to be applied in distributed systems. Other generalizations include the ability to handle signed and real-valued updates as well as time-decayed aggregation. We empirically test the sketch on both synthetic and real ad prediction data. Surprisingly, we find that it even outperforms priority sampling, a method that requires pre-aggregated data.
This paper is structured as follows. First, we describe the disaggregated subset sum problem, some of its applications, and related sketching problems. We then introduce our sketch, Unbiased Space Saving, as a small but significant modification of the Space Saving sketch. We examine its relation to other frequent item sketches, and show that they differ only in a “reduction” operation. This is used to show that any unbiased reduction operation yields an unbiased sketch for the disaggregated subset sum estimation problem. The theoretical properties of the sketch are then examined. We prove its consistency for the frequent item problem and for drawing a probability proportional to size sample. We derive a variance estimator and show that it can be used to generate good confidence intervals for estimates. Finally, we present experiments using real and synthetic data.
2 Two Sketching Problems
3 Disaggregated subset sum problem
Many data analysis problems consist of a simple aggregation over some filtering and group by conditions.
SELECT sum(metric), dimensions FROM table WHERE filters GROUP BY dimensions
This problem has several variations that depend on what is known about the possible queries and about the data before the sketch is constructed. For problems in which there is no group by clause and the set of possible filter conditions is known before the sketch is constructed, counting sketches such as the CountMin sketch Cormode and Muthukrishnan (2005) and AMS sketch Alon et al. (1999) are appropriate. When the filters and group by dimensions are arbitrary and not known in advance, the problem is the subset sum estimation problem. Sampling methods such as priority sampling Duffield et al. (2007) can be used to solve it. These work by exploiting a measure of importance for each row and sampling important rows with high probability. For example, when computing a sum, the rows containing large values contribute more to the sum and should be retained in the sample.
The disaggregated subset sum estimation problem is a more difficult variant where there is little to no information about row importance and only a small amount of information about the queries. For example, many user metrics, such as number of clicks, are computed as aggregations over some event stream where each event has the same weight 1 and hence, the same importance. Filters and group by conditions can be arbitrary except for a small restriction that one cannot query at a granularity finer than a specified unit of analysis. In the click example, the finest granularity may be at the user level. One is allowed to query over arbitrary subsets of users but cannot query a subset of a single user’s clicks. The data is “disaggregated” since the relevant per unit metric is split across multiple rows. We will refer to something at the smallest unit of analysis as an item to distinguish it from one row in the data.
Since pre-aggregating to compute per unit metrics does not reduce the amount of relevant information, it follows that the best accuracy one can achieve is to first pre-aggregate and then apply a sketch for subset sum estimation. This operation, however, is extremely expensive, especially as the number of units is often large. Examples of units include users and ad id pairs for ad click prediction, source and destination IP pairs for IP flow metrics, and distinct search queries or terms. Each of these has trillions or more possible units.
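To make the distinction between rows and items concrete, the following is a minimal Python sketch of the exact (and expensive) baseline: pre-aggregate the disaggregated rows into per-item counts, then answer subset sum queries with arbitrary filters. The click rows, the field names `user` and `ad`, and the helper names are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical click stream: one row per click, every row has weight 1.
rows = [
    {"user": "u1", "ad": "a1"},
    {"user": "u2", "ad": "a1"},
    {"user": "u1", "ad": "a1"},
    {"user": "u3", "ad": "a2"},
    {"user": "u1", "ad": "a2"},
]

def pre_aggregate(rows, unit=("user", "ad")):
    """Collapse rows into one exact count per item (the unit of analysis)."""
    return Counter(tuple(r[k] for k in unit) for r in rows)

def subset_sum(item_counts, predicate):
    """Sum the per-item metric over the items selected by an arbitrary filter."""
    return sum(c for item, c in item_counts.items() if predicate(item))

counts = pre_aggregate(rows)
clicks_u1 = subset_sum(counts, lambda item: item[0] == "u1")
clicks_a1 = subset_sum(counts, lambda item: item[1] == "a1")
```

The exact table `counts` needs one entry per distinct item, which is what becomes infeasible when the number of units is very large.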
Several sketches based on sampling have been proposed that address the disaggregated subset sum problem. These include the bottom-k sketch Cohen and Kaplan (2007) which samples items uniformly at random, the class of “NetFlow” sketches Estan et al. (2004), and the Sample and Hold sketches Cohen et al. (2007), Estan and Varghese (2003), Gibbons and Matias (1998). Of these, the Sample and Hold sketches are clearly the best as they use strictly more information than the other methods to construct samples and maintain aggregate statistics. We describe them in more depth in section 5.4.
The Unbiased Space Saving sketch we propose throws away even less information than previous sketches. Surprisingly, this allows it to match the accuracy of priority sampling, a nearly optimal subset sum estimation algorithm Szegedy (2006), which uses pre-aggregated data. In some cases, our sketch achieves better accuracy despite being computed on disaggregated data.
3.1 Applications
The disaggregated subset sum problem has many applications. These include machine learning and ad prediction Shrivastava et al. (2016), analyzing network data Estan et al. (2004), Cohen et al. (2007), detecting distributed denial of service attacks Sekar et al. (2006), database query optimization and join size estimation Vengerov et al. (2015), as well as analyzing web users’ activity logs or other business intelligence applications. For example, in ad prediction the historical click-through rate and other historical data are among the most powerful features for predicting future ad clicks He et al. (2014). Since there is no historical data for newly created ads, one may use historical click or impression data for previous ads with similar attributes such as the same advertiser or product category Richardson et al. (2007). In join size estimation, it allows the sketch to estimate the size under the arbitrary filtering conditions that a user might impose.
It can also be naturally applied to hierarchical aggregation problems. For network traffic data, IP addresses are arranged hierarchically. A network administrator may be interested both in individual nodes that receive or generate an excess of traffic and in aggregated traffic statistics on a subnet. Several sketches have been developed to exploit hierarchical aggregations including Cormode and Hadjieleftheriou (2008), Mitzenmacher et al. (2012), and Zhang et al. (2004). Since a disaggregated subset sum sketch can handle arbitrary group by conditions, it can compute the next level in a hierarchy.
3.2 Frequent item problem
The frequent item or heavy hitter problem is related to the disaggregated subset sum problem. Our sketch is an extension of Space Saving, Metwally et al. (2005), a frequent item sketch. Like the disaggregated subset sum problem, frequent item sketches are computed with respect to a unit of analysis that requires a partial aggregation of the data. However, only functions of the most frequent items are of interest. Most frequent item sketches are deterministic and have deterministic guarantees on both the identification of frequent items and the error in the counts of individual items. However, since counts in frequent item sketches are biased, further aggregation on the sketch can lead to large errors when bias accumulates as shown in section 6.3.
4 Unbiased Space-saving
Our sketch is based on the Space Saving sketch Metwally et al. (2005) used in frequent item estimation. We will refer to it as Deterministic Space Saving to differentiate it from our randomized sketch. For simplicity, we consider the case where the metric of interest is the count for each item. The Deterministic Space Saving sketch works by maintaining a list of $m$ bins labeled by distinct items. A new row with item $i$ increments $i$’s counter if it is in the sketch. Otherwise, the smallest bin is incremented, and its label is changed to $i$. Our sketch introduces one small modification. If $\hat{N}_{min}$ is the count for the smallest bin, then only change the label with probability $1/(\hat{N}_{min}+1)$. This change provably yields unbiased counts as shown in theorem 1. Algorithm 1 describes these Space Saving sketches more formally.
Notation | Definition |
---|---|
$t$ | Number of rows encountered or time |
$\hat{N}_i(t)$ | Estimate for item $i$ at time $t$ |
$\hat{N}_{min}(t)$ | Count in the smallest bin at time $t$ |
$n_i$, $n_{tot}$ | True count for item $i$ and total over all items |
$\hat{N}_S$, $n_S$ | Estimated and true total count of items in $S$ |
$\hat{\mathbf{N}}$, $\mathbf{n}$ | Vector of estimated and true counts |
$p_i$ | Relative frequency $n_i/n_{tot}$ of item $i$ |
$m$ | Number of bins in sketch |
$Z_i$ | Binary indicator if item $i$ is a label in the sketch |
$\pi_i$ | Probability of inclusion $P(Z_i = 1)$ |
$C_S$ | Number of items from set $S$ in the sketch |
Theorem 1.
For any item $x$, the randomized Space-Saving algorithm in figure 1 gives an unbiased estimate of the count of $x$.
Proof.
Let $\hat{N}_x(t)$ denote the estimate for the count of $x$ at time $t$ and $\hat{N}_{min}(t)$ be the count in the smallest bin. We show that the expected increment to $\hat{N}_x(t)$ is 1 if $x$ is the next item and 0 otherwise. Suppose $x$ is the next item. If it is in the list of counters, then it is incremented by exactly 1. Otherwise, it is incremented by $\hat{N}_{min}(t)+1$ with probability $1/(\hat{N}_{min}(t)+1)$ for an expected increment of 1. Now suppose $x$ is not the next item. The estimated count $\hat{N}_x(t)$ can only be modified if $x$ is the label for the smallest count. It is incremented with probability $\hat{N}_x(t)/(\hat{N}_x(t)+1)$. Otherwise $\hat{N}_x(t+1)$ is updated to 0. This gives the update an expected increment of $E\hat{N}_x(t+1) - \hat{N}_x(t) = (\hat{N}_x(t)+1)\,\hat{N}_x(t)/(\hat{N}_x(t)+1) - \hat{N}_x(t) = 0$ when the new item is not $x$. ∎
We note that although given any fixed item , the estimate of its count is unbiased, each stored pair often contains an overestimate of the item’s count. This occurs since any item with a positive count will receive a downward biased estimate of 0 conditional on it not being in the sketch. Thus, conditional on an item appearing in the list, the count must be biased upwards.
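The update rule described at the start of this section can be sketched in Python as follows. This is a minimal illustration rather than the paper's Algorithm 1: the class name `SpaceSaving`, the `unbiased` flag, the dictionary representation of bins, and the linear scan for the smallest bin are all assumptions made for brevity (the stream summary structure of Metwally et al. (2005) avoids the scan). Empty bins are treated as having count 0, so the first $m$ distinct items are always inserted.

```python
import random

class SpaceSaving:
    """Space Saving sketch with m bins.

    unbiased=False: always relabel the smallest bin (Deterministic Space Saving).
    unbiased=True:  relabel only with probability 1 / (n_min + 1).
    """

    def __init__(self, m, unbiased=True, rng=None):
        self.m = m
        self.unbiased = unbiased
        self.rng = rng or random.Random()
        self.counts = {}  # label -> estimated count

    def update(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.m:
            self.counts[item] = 1  # an empty bin has count 0, so it is always relabeled
        else:
            # Linear scan for the smallest bin; ties are broken by dict order.
            label = min(self.counts, key=self.counts.get)
            n_min = self.counts[label]
            relabel = (not self.unbiased) or self.rng.random() < 1.0 / (n_min + 1)
            if relabel:
                del self.counts[label]
                self.counts[item] = n_min + 1
            else:
                self.counts[label] = n_min + 1

    def estimate(self, item):
        return self.counts.get(item, 0)

# Hypothetical stream: item "a" appears 5 times among 40 singletons.
stream = ["a"] * 5 + [f"x{i}" for i in range(40)]
random.Random(1).shuffle(stream)
sketch = SpaceSaving(m=4, rng=random.Random(2))
for row in stream:
    sketch.update(row)
total = sum(sketch.counts.values())  # every row adds exactly 1 to some bin
```

Averaging `estimate("a")` over many independent runs should approach the true count 5, even though any single run returns either 0 or an overestimate.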
5 Related sketches and further generalizations
Although our primary goal is to demonstrate the usefulness of the Unbiased Space-Saving sketch, we also try to understand the mechanisms by which it works and use this understanding to find extensions and generalizations. Readers only interested in the properties of Unbiased Space Saving may skip to the next section.
In particular, we examine the relationships between Unbiased Space Saving and existing deterministic frequent items sketches as well as its relationship with probability proportional to size sampling. We show that existing frequent item sketches all share the same structure as an exact increment of the count followed by a size reduction. This size reduction is implemented as an adaptive sequential thresholding operation which biases the counts. Our modification replaces the thresholding operation with a subsampling operation. This observation allows us to extend the sketch. This includes endowing it with an unbiased merge operation that can be used to combine datasets or in distributed computing environments.
The sampling design in the reduction step may also be chosen to give the sketch different properties. For example, time-decayed sampling methods may be used to weight recently occurring items more heavily. If multiple metrics are being tracked, multi-objective sampling Cohen (2015) may be used.
5.1 Probability proportional to size sampling
Our key observation in generalizing Unbiased Space Saving is that the choice of label is a sampling operation. In particular, this sampling operation chooses the item with probability proportional to its size. We briefly review probability proportional to size sampling and priority sampling as well as the Horvitz-Thompson estimator which allows one to unbias the sum estimate from any biased sampling scheme. Probability proportional to size sampling (PPS) is of special importance for sampling for subset sum estimation as it is essentially optimal. Any good sampling procedure mimics PPS sampling.
For unequal probability samples, an unbiased estimator for the sum over the true population $\{x_i\}$ is given by the Horvitz-Thompson estimator $\hat{S} = \sum_i x_i Z_i / \pi_i$ where $Z_i$ denotes whether $x_i$ is in the sample and $\pi_i = P(Z_i = 1)$ is the inclusion probability. When only linear statistics of the sampled items are computed, the item values may be updated $x_i^{new} \leftarrow x_i / \pi_i$.
When drawing a sample of fixed size, it is trivial to see that an optimal set of inclusion probabilities is given by $\pi_i \propto x_i$ when this is possible. In other words, it generates a probability proportional to size (PPS) sample. In this case, each term in the sum is constant, so that the estimator is exact and has zero variance. When the data is skewed, drawing a truly probability proportional to size sample may be impossible for sample sizes greater than 1. For example, given values $1, 1,$ and $10$, any scheme to draw 2 items with probabilities exactly proportional to size has inclusion probabilities bounded by $1/10, 1/10,$ and $1$. The expected sample size is at most $12/10 < 2$. In this case, one often chooses inclusion probabilities $\pi_i = \min\{\alpha x_i, 1\}$ for some constant $\alpha$. The inclusion probabilities are proportional to the size if the size is not too large and 1 otherwise.
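The thresholded inclusion probabilities and the Horvitz-Thompson estimator can be illustrated with a short Python sketch. It assumes strictly positive values and a target sample size `k` no larger than the number of values; the function names are hypothetical. The constant $\alpha$ is found greedily: the largest values are forced to probability 1 until the remaining ones can be made exactly proportional to size.

```python
def pps_inclusion_probabilities(values, k):
    """Return pi_i = min(alpha * x_i, 1) chosen so that sum(pi) == k."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    remaining = float(sum(values))  # total size of items not yet forced to 1
    pi = [0.0] * len(values)
    certain = 0                     # number of items with probability 1
    alpha = 0.0
    for i in order:
        alpha = (k - certain) / remaining
        if alpha * values[i] < 1.0:
            break                   # all smaller items are proportional to size
        pi[i] = 1.0
        certain += 1
        remaining -= values[i]
    for i in order[certain:]:
        pi[i] = alpha * values[i]
    return pi

def horvitz_thompson(values, in_sample, pi):
    """Unbiased estimate of sum(values) from the sampled items only."""
    return sum(x / p for x, z, p in zip(values, in_sample, pi) if z)

values = [1, 1, 10]
pi = pps_inclusion_probabilities(values, k=2)
# One possible sample of size 2: the large item plus the first small item.
estimate = horvitz_thompson(values, [True, False, True], pi)
```

With values 1, 1, 10 and a sample of size 2, the large item is always kept and each small item is kept with probability 1/2, so any such sample yields the exact total.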
Many algorithms exist for generating PPS samples. We briefly describe two as they are necessary for the merge operation given in section 5.5. The splitting procedure of Deville and Tillé (1998) provides a class of methods to generate a fixed size PPS sample with the desired inclusion probabilities. Another method which approximately generates a PPS sample is priority sampling. Instead of exact inclusion probabilities which are typically intractable to compute, priority sampling generates a set of pseudo-inclusion probabilities.
The splitting procedure is based on a simple recursion. At each step, the target distribution is split into a mixture of two simpler distributions. One flips a coin and based on the result, chooses to sample from one of the two simpler distributions. More formally, given a target vector of inclusion probabilities $\pi$ and two vectors of probabilities $\pi^{(1)}$ and $\pi^{(2)}$ with $\pi = \alpha \pi^{(1)} + (1-\alpha)\pi^{(2)}$, then drawing $Z \sim Bernoulli(\alpha)$ and then drawing a sample with marginal inclusion probabilities $\pi^{(1)}$ if $Z = 1$ and $\pi^{(2)}$ otherwise gives a sample with inclusion probabilities matching the target $\pi$. There is great flexibility in choosing how to split, and when the split yields inclusion probabilities equal to 0 or 1, the subsequent sampling becomes easier.
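One concrete member of the splitting family is the pivotal method of Deville and Tillé (1998), in which each split resolves one of two units to probability 0 or 1. The Python sketch below is an illustrative sequential variant, not the construction used in the merge operation of this paper; the function name, the `eps` tolerance, and the handling of a non-integer total are assumptions.

```python
import random

def pivotal_sample(pi, rng=random, eps=1e-12):
    """Draw a sample whose marginal inclusion probabilities match pi.

    Each duel between two units is one split of the target vector: after the
    coin flip, one unit has probability 0 or 1 and the other carries the rest.
    """
    pi = list(pi)
    sample = []
    carry = None  # index of the unit whose probability is still fractional
    for j in range(len(pi)):
        if carry is None:
            carry = j
        else:
            a, b = pi[carry], pi[j]
            if a + b < 1.0:
                # One unit is rejected; the survivor carries probability a + b.
                keep_carry = rng.random() < a / (a + b)
                winner, loser = (carry, j) if keep_carry else (j, carry)
                pi[winner], pi[loser] = a + b, 0.0
                carry = winner
            else:
                # One unit is selected; the other carries probability a + b - 1.
                take_carry = rng.random() < (1.0 - b) / (2.0 - a - b)
                chosen, other = (carry, j) if take_carry else (j, carry)
                pi[chosen], pi[other] = 1.0, a + b - 1.0
                sample.append(chosen)
                carry = other
        # Resolve the carried unit once its probability reaches 0 or 1.
        if pi[carry] <= eps:
            carry = None
        elif pi[carry] >= 1.0 - eps:
            sample.append(carry)
            carry = None
    # If sum(pi) is not an integer, the last unit is included at random.
    if carry is not None and rng.random() < pi[carry]:
        sample.append(carry)
    return sorted(sample)

target = [0.2, 0.3, 0.5, 0.6, 0.4]  # sums to 2, so every sample has size 2
example = pivotal_sample(target, random.Random(0))
```

Each duel preserves the expected inclusion probability of both units, which is the mixture identity $\pi = \alpha \pi^{(1)} + (1-\alpha)\pi^{(2)}$ applied to a pair of coordinates.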
Priority sampling is a method that approximately draws a PPS sample. It generates a random priority $R_i = U_i / n_i$, with $U_i \sim Uniform(0,1)$, for an item $i$ with value $n_i$. The values corresponding to the $m$ smallest priorities form the sample. Surprisingly, by defining the threshold $\tau$ to be the $(m+1)^{th}$ smallest priority, it can be shown that for almost any function of just the samples, the expected value under this sampling scheme is the same as the expected value under independent $Bernoulli(\min\{1, n_i \tau\})$ sampling.
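A minimal Python sketch of priority sampling follows, under the assumption that priorities are uniform random numbers divided by the item value, that the $m$ smallest priorities are kept, and that the threshold is the $(m+1)$-th smallest priority. Each sampled value is then divided by its pseudo-inclusion probability $\min(1, n_i \tau)$, which equals taking $\max(n_i, 1/\tau)$. The function name and the dictionary interface are hypothetical.

```python
import random

def priority_sample(counts, m, rng=random):
    """counts: dict item -> positive value. Returns dict item -> adjusted value."""
    if len(counts) <= m:
        return dict(counts)  # nothing to discard, values stay exact
    priorities = {i: rng.random() / n for i, n in counts.items()}
    ranked = sorted(priorities, key=priorities.get)
    tau = priorities[ranked[m]]  # (m+1)-th smallest priority
    # Horvitz-Thompson adjustment: n / min(1, n * tau) == max(n, 1 / tau).
    return {i: max(counts[i], 1.0 / tau) for i in ranked[:m]}

counts = {i: i + 1 for i in range(20)}  # hypothetical values 1..20, total 210
sample = priority_sample(counts, m=5, rng=random.Random(0))
subset_estimate = sum(v for i, v in sample.items() if i % 2 == 0)
```

Any subset sum is estimated by summing the adjusted values of the sampled items that fall in the subset, as in `subset_estimate` above.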
5.2 Misra-Gries and frequent item sketches
The Misra-Gries sketch Misra and Gries (1982), Demaine et al. (2002), Karp et al. (2003) is a frequent item sketch and is isomorphic to the Deterministic Space Saving sketch Agarwal et al. (2013). The only difference is that it decrements all counters rather than incrementing the smallest bin when processing an item that is not in the sketch. Thus, the count in the smallest bin for the Deterministic Space Saving sketch is equal to the total number of decrements in the Misra-Gries sketch. Given estimates $\hat{N}_i$ from a Deterministic Space Saving sketch, the corresponding estimated item counts for the Misra-Gries sketch are $\hat{N}_i^{MG} = (\hat{N}_i - \hat{N}_{min})_+$ where $\hat{N}_{min}$ is the count for the smallest bin and the operation $(x)_+$ truncates negative values to be 0. In other words, the Misra-Gries estimate is the same as the Deterministic Space Saving estimate soft thresholded by $\hat{N}_{min}$. Equivalently, the Deterministic Space Saving estimates are obtained by adding back the total number of decrements to any nonzero counter in the Misra-Gries sketch.
The sketch has a deterministic error guarantee. When the total number of items is $n_{tot}$ then the error for any item is at most $n_{tot}/m$.
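The isomorphism and the error guarantee can be checked numerically. The Python sketch below assumes, following Agarwal et al. (2013), that a Deterministic Space Saving sketch with $m$ bins corresponds to a Misra-Gries sketch with $m-1$ counters, and that empty Space Saving bins count as 0. The function names and the synthetic stream are hypothetical.

```python
import random
from collections import Counter

def space_saving(stream, m):
    """Deterministic Space Saving with m bins."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < m:
            counts[x] = 1
        else:
            label = min(counts, key=counts.get)  # smallest bin
            counts[x] = counts.pop(label) + 1    # relabel and increment
    return counts

def misra_gries(stream, k):
    """Misra-Gries with k counters: decrement everything on a miss when full."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif len(counts) < k:
            counts[x] = 1
        else:
            counts = {i: c - 1 for i, c in counts.items() if c > 1}
    return counts

def space_saving_to_misra_gries(counts, m):
    """Soft threshold the Space Saving counts by the smallest bin."""
    n_min = min(counts.values()) if len(counts) == m else 0
    return {i: c - n_min for i, c in counts.items() if c > n_min}

rng = random.Random(0)
stream = rng.choices(range(30), weights=[1 / (i + 1) for i in range(30)], k=2000)
m = 6
ss = space_saving(stream, m)
mg = misra_gries(stream, m - 1)
converted = space_saving_to_misra_gries(ss, m)
```

On this stream `converted` and `mg` should coincide, and every Space Saving count should overestimate the true count by at most the smallest bin, which is itself at most the stream length divided by $m$.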
Other frequent item sketches include the deterministic lossy counting and randomized sticky sampling sketches Manku and Motwani (2002). We describe only lossy counting as sticky sampling has both worse practical performance and weaker guarantees than other sketches.
A simplified version of Lossy counting applies the same decrement reduction as the Misra-Gries sketch but decrements occur at a fixed schedule rather than one which depends on the data itself. To count items with frequency $> n_{tot}/m$, all counters are decremented after every $m$ rows. Lossy counting does not provide a guarantee that the number of counters can be bounded by $m$. In the worst case, the size can grow to $m \log(n_{tot}/m)$ counters. Similar to the isomorphism between the Misra-Gries and Space-saving sketches, the original Lossy counting algorithm is recovered by adding the number of decrements back to any nonzero counter.
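The simplified variant with a fixed decrement schedule can be sketched in a few lines of Python. This follows the description above (exact increment, then decrement every counter after each block of $m$ rows) and is not the original algorithm of Manku and Motwani (2002); the function name and return convention are assumptions.

```python
import random
from collections import Counter

def simplified_lossy_counting(stream, m):
    """Increment exactly; after every m rows decrement all counters by 1."""
    counts = {}
    decrements = 0
    for t, x in enumerate(stream, start=1):
        counts[x] = counts.get(x, 0) + 1
        if t % m == 0:
            # Fixed schedule: the reduction does not depend on the data.
            counts = {i: c - 1 for i, c in counts.items() if c > 1}
            decrements += 1
    return counts, decrements

rng = random.Random(0)
stream = rng.choices(range(50), weights=[1 / (i + 1) for i in range(50)], k=2000)
counts, decrements = simplified_lossy_counting(stream, m=10)
# Adding the decrements back to nonzero counters gives an upper-biased estimate.
restored = {i: c + decrements for i, c in counts.items()}
```

Each counter underestimates the true count by at most the number of decrements, so `restored` never underestimates an item that is still in the sketch.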
5.3 Reduction operations
Existing deterministic frequent item sketches differ in only the operation to reduce the number of nonzero counters. They all have the form described in algorithm 2 and have reduction operations that can be expressed as a thresholding operation. Although it is isomorphic to the Misra-Gries sketch, Deterministic Space Saving’s reduction operation can also be described as collapsing the two smallest bins by adding the larger bin’s count to the smaller one’s.
Modifying the reduction operation provides the sketch with different properties. We highlight several uses for alternative reduction operations.
The reduction operation for Unbiased Space Saving can be seen as a PPS sample on the two smallest bins. A natural generalization is to consider a PPS sample on all the bins. We highlight three benefits of such a scheme. First, items can be added with arbitrary counts or weights. Second, the sketch size can be reduced by multiple bins in one step. Third, there is less quadratic variation added by one sampling step, so error can be reduced. The first two benefits are obvious consequences of the generalization. To see the third, consider when a new row contains an item not in the sketch, and let $\mathcal{J}$ be the set of bins equal to $\hat{N}_{min}$. When using the thresholded PPS inclusion probabilities from section 5.1, the resulting PPS sample has inclusion probability $\alpha = |\mathcal{J}| / (1 + |\mathcal{J}| \hat{N}_{min})$ for the new row’s item and $\alpha \hat{N}_{min}$ for bins in $\mathcal{J}$. Other bins have inclusion probability 1. After sampling, the Horvitz-Thompson adjusted counts are $1/\alpha = \hat{N}_{min} + 1/|\mathcal{J}|$. Unbiased Space Saving is thus a further randomization to convert the real valued increment $1/|\mathcal{J}|$ over $|\mathcal{J}|$ bins to an integer update on a single bin. Since Unbiased Space Saving adds an additional randomization step, the PPS sample has smaller variance. The downside of this procedure, however, is that it requires real valued counters that require more space per bin. The update cost when using the stream summary data structure Metwally et al. (2005) remains $O(1)$.
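The inclusion probabilities for a PPS reduction over all bins can be worked out exactly for a small example. The Python sketch below assumes integer bin counts with a smallest count of at least 1 and a new item of weight 1, and it solves for the constant $\alpha$ that makes the inclusion probabilities of the $m+1$ candidates sum to $m$. The function name and the example bins are hypothetical.

```python
from fractions import Fraction

def pps_reduction(bin_counts):
    """Thresholded PPS probabilities when a weight-1 item joins m full bins."""
    n_min = min(bin_counts)
    tied = sum(1 for c in bin_counts if c == n_min)  # bins equal to the minimum
    # Solve alpha * 1 + tied * alpha * n_min + (m - tied) * 1 == m for alpha.
    alpha = Fraction(tied, 1 + tied * n_min)
    pi_new = alpha           # inclusion probability of the new item
    pi_tied = alpha * n_min  # inclusion probability of each smallest bin
    adjusted = 1 / alpha     # Horvitz-Thompson count of any sampled candidate
    return alpha, pi_new, pi_tied, tied, adjusted

bins = [3, 3, 5, 9]
alpha, pi_new, pi_tied, tied, adjusted = pps_reduction(bins)
expected_size = pi_new + tied * pi_tied + (len(bins) - tied)
```

For bins 3, 3, 5, 9 the two smallest bins are tied, $\alpha = 2/7$, and a surviving candidate among the new item and the tied bins receives the adjusted count $3 + 1/2$, while the larger bins are kept with probability 1.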
Changing the sampling procedure can also provide other desirable behaviors. Applying forward decay sampling Cormode et al. (2009) allows one to obtain estimates that weight recent items more heavily. Other possible operations include adaptively varying the sketch size in order to only remove items with small estimated frequency.
Furthermore, the reduction step does not need to be limited strictly to subsampling. Theorem 2 gives that any unbiased reduction operation yields unbiased estimates. This generalization allows us to analyze Sample-and-Hold sketches.
Theorem 2.
Any reduction operation where the expected post-reduction estimates are equal to the pre-reduction estimates yields an unbiased sketch for the disaggregated subset sum estimation problem. More formally, if $E(\hat{N}(t) \mid S_{pre}(t)) = \hat{N}_{pre}(t)$ where $S_{pre}(t), \hat{N}_{pre}(t)$ are the sketch and estimated counts before reduction at time step $t$ and $\hat{N}(t)$ is the post reduction estimate, then $\hat{N}(t)$ is an unbiased estimator.
Proof.
Since $\hat{N}_{pre}(t) = \hat{N}(t-1) + (n(t) - n(t-1))$, it follows that $\hat{N}(t) - n(t)$ is a martingale with respect to the filtration adapted to $S(t)$. Thus, $E\hat{N}(t) = n(t)$, and the sketch gives unbiased estimates for the disaggregated subset sum problem. ∎
We also note that reduction operations can be biased. The merge operation on the Misra-Gries sketch given by Agarwal et al. (2013) performs a soft-thresholding by the size of the $(m+1)^{th}$ largest counter rather than by 1. This also allows it to reduce the size of the sketch by more than 1 bin at a time. It can be modified to handle deletions and arbitrary numeric aggregations by making the thresholding operation two-sided so that negative values are shrunk toward 0 as well. In this case, we do not provide a theoretical analysis of the properties.
5.4 Sample and Hold
To the best of our knowledge, the current state of the art sketches designed to answer disaggregated subset sum estimation problems are the family of sample and hold sketches Gibbons and Matias (1998), Estan and Varghese (2003), Cohen et al. (2007). These methods can also be described with a randomized reduction operation.
For adaptive sample and hold Cohen et al. (2007), the sketch maintains an auxiliary variable $p$ which represents the sampling rate. Each point in the stream is assigned a $U_i \sim Uniform(0,1)$ random variable, and the items in the sketch are those with $U_i < p$. If an item remains in the sketch starting from time $t_0$, then the counter stores the number of times it appears in the stream after the initial time. Every time the sketch becomes too large, the sampling rate is decreased so that under the new rate $p'$, one item is no longer in the sketch.
It can be shown that unbiased estimates can be obtained by keeping a counter value the same with probability $p'/p$, where $p$ and $p'$ are the sampling rates before and after the reduction, and decrementing the counter by a $Geometric(p')$ random variable otherwise. If a counter becomes negative, then it is set to 0 and dropped. Adding back the mean $(1-p')/p'$ of the $Geometric(p')$ random variable to the nonzero counters gives an unbiased estimator. Effectively, the sketch replaces the first time an item enters the sketch with the expected number of tries before it successfully enters the sketch and it adds the actual count after the item enters the sketch. Using the memoryless property of Geometric random variables, it is easy to show that the sketch satisfies the conditions of theorem 2. It is also clear that one update step adds more error than Unbiased Space Saving as it potentially adds $Geometric(p')$ noise with variance $(1-p')/p'^2$ to every bin. Furthermore, the eliminated bin may not even be the smallest bin. Since $p'$ is the sampling rate, it is expected to be close to 0. By contrast, Unbiased Space Saving has bounded increments of 1 for bins other than the smallest bin, and the only bin that can be removed is the current smallest bin.
The discrepancy is especially prominent for frequent items. A frequent item in an i.i.d. stream for Unbiased Space Saving enters the sketch almost immediately, and the count for the item is nearly exact as shown in theorem 3. For adaptive sample and hold, the first occurrences of item $i$ are expected to be discarded and replaced with a high variance Geometric random variable. Since the sampling rate $p'$ is typically small in order to keep the number of counters low, most of the information about the count is discarded.
Another sketch, step sample-and-hold, avoids the problem by maintaining counts for each “step” when the sampling rate changes. However, this is more costly both from a storage perspective and from a computational one. For each item in the sketch, computing the expected count takes time quadratic in the number of steps in which the step’s counter for the item is nonzero, and storage is linear in this number of steps.
5.5 Merging and Distributed counting
The more generalized reduction operations allow for merge operations on the sketches. Merge operations and mergeable sketches Agarwal et al. (2013) are important since they allow a collection of sketches, each answering questions about the subset of data it was constructed on, to be combined to answer a question over all the data. For example, a set of frequent item sketches that give trending news for each country can be combined to give trending news for Europe as well as a multitude of other possible combinations. Another common scenario arises when sketches are aggregated across time. Sketches for clicks may be computed per day, but the final machine learning feature may combine the last 7 days.
Furthermore, merges make sketches more practical to use in real world systems. In particular, they allow for simple distributed computation. In a map-reduce framework, each mapper can quickly compute a sketch, and only a set of small sketches needs to be sent over the network to perform an aggregation at the reducer.
As noted in the previous section, the Misra-Gries sketch has a simple merge operation which preserves its deterministic error guarantee. It simply soft thresholds by the $(m+1)^{th}$ largest counter so that at most $m$ nonzero counters are left. Mathematically, this is expressed as $\hat{N}_i^{new} = (\hat{N}_i^{(1)} + \hat{N}_i^{(2)} - \hat{N}_{(m+1)}^{combined})_+$ where $\hat{N}_i^{(s)}$ is the estimated count from sketch $s$ and $\hat{N}_{(m+1)}^{combined}$ is the $(m+1)^{th}$ largest value obtained by summing the estimated counts from the two sketches. Previously, no merge operation existed for Deterministic Space Saving except to first convert it to a Misra-Gries sketch. Theorem 2 shows that replacing the pairwise randomization with priority sampling or some other sampling procedure still allows one to obtain an Unbiased Space Saving merge that can preserve the expected count in the sketch rather than biasing it downward.
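The two kinds of merge can be contrasted in a short Python sketch. The Misra-Gries merge soft thresholds the summed counts by the $(m+1)$-th largest value. The unbiased merge shown here uses priority sampling as the reduction, which is one of the options the text allows; it is an illustrative choice rather than the paper's exact procedure, and it produces real valued counts. The function names and example sketches are hypothetical.

```python
import random

def combine(a, b):
    """Sum the estimated counts of two sketches item by item."""
    combined = dict(a)
    for i, c in b.items():
        combined[i] = combined.get(i, 0) + c
    return combined

def misra_gries_merge(a, b, m):
    """Biased merge: subtract the (m+1)-th largest combined count, drop the rest."""
    combined = combine(a, b)
    if len(combined) <= m:
        return combined
    threshold = sorted(combined.values(), reverse=True)[m]
    return {i: c - threshold for i, c in combined.items() if c > threshold}

def unbiased_merge(a, b, m, rng=random):
    """Unbiased merge: reduce the combined counts to m bins by priority sampling."""
    combined = combine(a, b)
    if len(combined) <= m:
        return combined
    priorities = {i: rng.random() / c for i, c in combined.items()}
    ranked = sorted(priorities, key=priorities.get)
    tau = priorities[ranked[m]]  # (m+1)-th smallest priority
    return {i: max(combined[i], 1.0 / tau) for i in ranked[:m]}

sketch_a = {"x": 10, "y": 4, "z": 1}
sketch_b = {"x": 3, "w": 5, "v": 2}
biased = misra_gries_merge(sketch_a, sketch_b, m=3)
unbiased = unbiased_merge(sketch_a, sketch_b, m=3, rng=random.Random(0))
```

The biased merge keeps the three largest items with counts shrunk toward 0, while the unbiased merge keeps three randomly chosen items whose adjusted counts sum to the combined total in expectation.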
The trade-off required for such an unbiased merge operation is that the sketch may detect fewer of the top items by frequency than the biased Misra-Gries merge. Rather than truncating and preserving more of the “head” of the distribution, it must move mass from the tail closer to the head. This is illustrated in figure 1.
6 Sketch Properties
We study the properties of the space saving sketches here. These include provable asymptotic properties, variance estimates, heuristically and empirically derived properties, behavior on pathological and adversarial sequences, and costs in time and space. In particular, we prove that when the data is i.i.d., the sketch eventually includes all frequent items with probability 1 and that the estimated proportions for these frequent items are consistent. We prove there is a sharp transition between frequent items which are sampled with probability 1 eventually and infrequent items which are sampled with probability proportional to their sizes. This is also borne out in the experimental results where the observed inclusion probabilities match the theoretical ones and in estimation error where Unbiased Space Saving matches or even exceeds the accuracy of priority sampling. In pathological cases, we demonstrate that Deterministic Space Saving completely fails at the subset sum estimation problem. Furthermore, these pathological sequences arise naturally. Any sequence where items’ arrival rates change significantly over time forms a pathological sequence. We show that we can derive a variance estimator as well. Since it works under pathological scenarios, the estimator is upward biased. However, we heuristically show that it is close to the variance for a PPS sample. This is confirmed in experiments as well. For both i.i.d. and pathological cases, we examine the resulting empirical inclusion probabilities. Likewise, they behave similarly to a probability proportional to size or priority sample.
6.1 Asymptotic consistency
Our main theoretical result for frequent item estimation states that the sketch contains all frequent items eventually on i.i.d. streams. Thus it does no worse than Deterministic Space Saving asymptotically. We also derive a finite sample bound in section 6.3. Furthermore, the guarantee states that the estimated proportion of times the item appears is strongly consistent, so that its error goes to 0. This is better than deterministic guarantees which only ensure that the error is within some constant.
Assume that items are drawn from a possibly infinite, discrete distribution with probabilities $p_1 \ge p_2 \ge \ldots$ and, without loss of generality, assume they are labeled by their index into this sequence of probabilities. Let $m$ be the number of bins and $t$ be the number of items processed by the sketch. We will also refer to $t$ as time. Let $I(t)$ be the set of items that are in the sketch at time $t$ and $Z_i(t) = 1(i \in I(t))$. To simplify the analysis, we will give a small further randomization by randomly choosing the smallest bin to increment when multiple bins share the same smallest count. Define an absolutely frequent item to be an item drawn with probability $> 1/m$ where $m$ is the number of bins in the sketch. By removing absolutely frequent items and decrementing the sketch size by 1 each time, the set of frequent items can be defined by the condition in corollary 4 which depends only on the tail probability. We first state the theorem and a corollary that immediately follows by induction.
Theorem 3.
If , then as the number of items , eventually.
Corollary 4.
If for all and for some , then for all eventually.
Corollary 5.
Given the conditions of corollary 4, the estimate is strongly consistent for all as .
Proof.
Suppose item becomes sticky after items are processed. After , the number of times appears is counted exactly correctly. As , the number of times appears after will dominate the number of times it appears before . By the strong law of large numbers, the estimate is strongly consistent. ∎
Lemma 6.
Let . For any , eventually as .
Proof.
Note that any item not in the sketch is added to the smallest bin. The probability of encountering an item not in the sketch is lower bounded by . Furthermore, by the strong law of large numbers, the actual number of items encountered that are not in the sketch must be eventually. If there are items added to the smallest bin, then with bins, . ∎
We now give the proof of theorem 3. The outline of the proof is as follows. We first show that item will always reappear in the sketch if it is replaced. When it reappears, its bin will accumulate increments faster than the average bin, and as long as it is not replaced during this process, it will escape and never return to being the smallest bin. Since the number of items that can be added before the label on the minimum bin is changed is linear in the size of the minimum bin, there is enough time for item to “escape” from the minimum bin with some constant probability. Even if it fails to escape on a given try, it will have infinitely many tries, so eventually it will escape.
Proof.
Trivially, since there are bins, and the minimum is less than the average number of items in each bin. If item is not in the sketch, then the smallest bin will take on as its label with probability . Since conditional on item 1 not being in the sketch, these are independent events, the second Borel-Cantelli lemma gives that item is in the sketch infinitely often. Whenever item is in the sketch, is a submartingale with bounded increments. Furthermore, it can be lower bounded by an asymmetric random walk where the expected increment is . Let . Let be the time item 1 flips the label of the smallest bin. Lemma 6 gives that the difference for any . If item 1 is not displaced, then after additional rows, Azuma’s inequality gives, after rearrangement, . The probability that item 1 is instead displaced during this time is which can be simplified to some positive constant that does not depend on . In other words, there is some constant probability such that item 1 will go from being in the smallest bin to a value greater than the mean. From there, there is a constant probability that the bounded submartingale never crosses back to zero or below. Since item 1 appears infinitely often, it must either become sticky or there are infinitely many 0 upcrossings for . In the latter case, there is a constant probability that lower bounds the probability the item becomes sticky. Thus a geometric random variable lower bounds the number of tries before item “sticks,” and it must eventually be sticky. ∎
6.2 Approximate PPS Sample
We prove that for i.i.d. streams, Unbiased Space Saving approximates a PPS sample and does so without the expensive pre-aggregation step. This is borne out by simulations as, surprisingly, it often empirically outperforms priority sampling on computationally expensive, pre-aggregated data. Since frequent items are included with probability 1, we consider only the remaining bins and the items in the tail.
Lemma 7.
Let denote the count in the bin. If then eventually.
Proof.
If then is not the smallest bin. In this case, the expected difference after 1 time step is bounded above by . Consider a random walk with an increment of with probability and otherwise. By Azuma’s inequality, if it is started at time at value then the probability it exceeds is bounded by . Since for to be , it must upcross 0 at some time , maximizing over gives an upper bound on the probability . It is easy to derive that is the maximizer and the probability is bounded by . When , , and the conclusion holds by the Borel-Cantelli lemma. ∎
Lemma 8.
If then and eventually
Proof.
Since there are finitely many bins, by lemma 7, eventually. The other inequality holds since ∎
Theorem 9.
If , then the items in the sketch converge in distribution to a PPS sample.
Proof.
The label in each bin is obtained by reservoir sampling. Thus it is a uniform sample on the rows that go into that bin. Since all bins have almost exactly the same size , it follows that item is a label with probability . ∎
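The reservoir sampling step used in this proof can be illustrated with a minimal sketch of the update rule. This is an illustration under stated assumptions rather than the reference implementation: the class name `UnbiasedSpaceSaving`, its method names, and the stream parameters (10 bins, 20000 rows, one item with probability 0.5) are hypothetical, and the relabeling probability is taken to be one over the bin's count after the increment, which is what a size-1 reservoir sample over the rows added to a bin requires. The linear scans make each update cost time proportional to the number of bins, unlike the stream summary structure discussed in section 6.7.

```python
import random


class UnbiasedSpaceSaving:
    """Minimal illustrative sketch with a fixed number of bins."""

    def __init__(self, num_bins, rng=None):
        self.labels = [None] * num_bins  # item label held by each bin
        self.counts = [0] * num_bins     # count held by each bin
        self.rng = rng or random.Random()

    def update(self, item):
        if item in self.labels:
            # The item is already tracked, so its bin is incremented.
            self.counts[self.labels.index(item)] += 1
            return
        # Otherwise increment a smallest bin, breaking ties at random.
        smallest = min(self.counts)
        ties = [b for b, c in enumerate(self.counts) if c == smallest]
        b = self.rng.choice(ties)
        self.counts[b] += 1
        # Reservoir step: relabel with probability 1 / (new bin count).
        if self.rng.random() < 1.0 / self.counts[b]:
            self.labels[b] = item

    def estimate(self, item):
        """Estimated count: the bin count if the item is a label, else 0."""
        if item in self.labels:
            return self.counts[self.labels.index(item)]
        return 0


rng = random.Random(0)
sketch = UnbiasedSpaceSaving(num_bins=10, rng=rng)
num_rows = 20000
for _ in range(num_rows):
    # Item 0 has probability 0.5; the rest is spread over 1000 rare items.
    item = 0 if rng.random() < 0.5 else rng.randrange(1, 1001)
    sketch.update(item)

print(sum(sketch.counts) == num_rows)  # every row increments exactly one bin
print(sketch.estimate(0) / num_rows)   # estimated proportion for item 0
```

Because every row increments exactly one bin, the bin counts always sum to the number of rows processed, and in this i.i.d. setting the frequent item's estimated proportion should land close to 0.5.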
The asymptotic analysis of Unbiased Space Saving splits items into two regimes. Frequent items are in the sketch with probability 1 and the associated counts are nearly exact. The threshold at which frequent and infrequent items are divided is given by corollary 4 and is the same as the threshold in the merge operation shown in figure 1. The counts for infrequent items in the tail are all . The actual probability for the item in the bin is irrelevant since items not in the sketch will force the bin’s rate to catch up to the rate for other bins in the tail. Since an item changes the label of a bin with probability where is the size of the bin, the bin label is a reservoir sample of size 1 for the items added to the bin. Thus, the labels for bins in the tail are approximately proportional to their frequency. Figure 2 illustrates that the empirical inclusion probabilities match the theoretical ones for a PPS sample. The item counts are chosen to approximate a rounded distribution. This is a skewed distribution where the standard deviation is roughly times the mean.
We note, however, that the resulting PPS sample has limitations not present in PPS samples on pre-aggregated data. For pre-aggregated data, one has both the original value and the Horvitz-Thompson adjusted value where is the inclusion probability. This allows the sample to compute non-linear statistics such as the population variance which uses the second moment estimator . With the PPS samples from disaggregated subset sum sketching, only the adjusted values are observed.
6.3 Pathological sequences
Deterministic Space Saving has remarkably low error when estimating the counts of frequent items Cormode and Hadjieleftheriou (2008). However, we will show that it fails badly when estimating subset sums when the data stream is not i.i.d. Unbiased Space Saving performs well on both i.i.d. and pathological sequences.
Pathological cases arise when an item’s arrival rate changes over time rather than staying constant. Consider a sketch with 2 bins. For a sequence of 1’s, 2’s, a single 3 and 4, the Deterministic Space Saving algorithm will always return 3 and 4, each with count . By contrast, Unbiased Space Saving will return 1 and 2 with probability when is large. Note that in this case, the count for each frequent item is slightly below the threshold that guarantees inclusion in the sketch, . This example illustrates the behavior for the deterministic algorithm. When an item is not in the ”frequent item head” of the distribution then the bins that represent the tail pick the labels of the most recent items without regard to the frequency of older items.
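The two-bin example above can be reproduced with a short simulation. This is a sketch under assumptions: the helper `space_saving`, the repeat count `c = 50`, and the number of trials are hypothetical choices, and the unbiased variant relabels a bin with probability one over its count after the increment. Under that rule, each of the final two rows displaces a label with probability 1/(c+1), so both original labels survive with probability (c/(c+1))^2.

```python
import random


def space_saving(stream, num_bins, unbiased, rng):
    """Run Deterministic (unbiased=False) or Unbiased Space Saving."""
    labels = [None] * num_bins
    counts = [0] * num_bins
    for item in stream:
        if item in labels:
            counts[labels.index(item)] += 1
            continue
        # Increment a smallest bin, breaking ties at random.
        smallest = min(counts)
        b = rng.choice([i for i, n in enumerate(counts) if n == smallest])
        counts[b] += 1
        # Deterministic: always relabel. Unbiased: relabel w.p. 1 / count.
        if not unbiased or rng.random() < 1.0 / counts[b]:
            labels[b] = item
    return {lab: n for lab, n in zip(labels, counts) if lab is not None}


c = 50  # hypothetical number of repeats of items 1 and 2
stream = [1] * c + [2] * c + [3, 4]
rng = random.Random(1)

# The deterministic sketch ends with labels 3 and 4, each with count c + 1.
deterministic = space_saving(stream, 2, unbiased=False, rng=rng)
print(deterministic)

# Fraction of runs in which the unbiased sketch still holds items 1 and 2.
trials = 2000
kept = sum(
    set(space_saving(stream, 2, unbiased=True, rng=rng)) == {1, 2}
    for _ in range(trials)
)
fraction_kept = kept / trials
print(fraction_kept, (c / (c + 1)) ** 2)
```

The deterministic run discards both frequent items regardless of tie-breaking, while the randomized run keeps them in the large majority of trials.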
We note that natural pathological sequences can easily occur. For instance, partially sorted data can naturally lead to such pathological sequences. This can occur from sketching the output of some join. Data partitioned by some key where the partitions are processed in order is another case. We explore this case empirically in section 7. Periodic bursts of an item followed by periods in which its frequency drops below the threshold of guaranteed inclusion are another example. The most obvious “pathological” sequence is the case where every row is unique. The Deterministic Space Saving sketch always consists of the last items rather than a random sample, and no meaningful subset sum estimate can be derived.
For Unbiased Space Saving, we show that even for non-i.i.d. sequences, it essentially never has an inclusion probability worse than simple random sampling which has inclusion probability where denotes the falling factorial.
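For reference, the simple random sampling baseline can be computed directly. The snippet below uses the standard expression for sampling rows without replacement: an item is missed only if none of its rows are drawn, and the resulting ratio of binomial coefficients equals a ratio of falling factorials. The function name and the example values are illustrative assumptions, not quantities taken from the experiments.

```python
from math import comb


def srs_inclusion_probability(n_total, n_item, sample_size):
    """P(at least one row of the item is in a simple random sample).

    comb(n_total - n_item, m) / comb(n_total, m) is the probability that all
    m sampled rows avoid the item; the m! factors cancel, so it equals the
    ratio of falling factorials (n_total - n_item)_m / (n_total)_m.
    """
    missed = comb(n_total - n_item, sample_size) / comb(n_total, sample_size)
    return 1.0 - missed


# An item with 1 of 10 rows, sampled with 3 rows: 1 - 84/120 = 0.3.
print(srs_inclusion_probability(10, 1, 3))
# An item with 50 of 1000 rows and a sample of 100 rows.
print(srs_inclusion_probability(1000, 50, 100))
```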
Theorem 10.
An item occurring times has worst case inclusion probability . An item with asymptotic frequency has an inclusion probability as .
Proof.
Whether an item is in the sketch depends only on the sequence of additions to the minimum sized bin. Let be last time an item is added to bin while it is the minimum bin. Let be the number of times item is added to bin by time and be the count of bin at time . Item is not the label of bin with probability , and it is not in the sketch with probability . Note that for item to not be in the sketch, the last occurrence of must have been added to the minimum sized bin. Thus, maximizing this probability under the constraints that and gives an upper bound on and yields the stated result. ∎
We note that the bound is often quite loose. It assumes a pathological sequence where the minimum sized bin is as large as possible, namely . If , the asymptotic bound would be .
At the same time, we note that the bound is tight in the sense that one can construct a pathological sequence that achieves the upper bound. Consider the sequence consisting of distinct items followed by item for times with and both being multiples of . It is easy to see that the only way that item is not in the sketch is for no bin to ever take on label and for the bins to all be equal in size to the minimum sized bin. The probability of this event is equal to the given upper bound.
Although Deterministic Space Saving is poor on pathological sequences, we note that if data arrives in uniformly random order or if the data stream consists of i.i.d. data, one expects the Deterministic Space Saving algorithm to share similar unbiasedness properties as the randomized version as in both cases the label for a bin can be treated roughly as a uniform random choice out of the items in that bin.
6.4 Variance
In addition to the count estimate, one may also want an estimate of the variance. In the case of i.i.d. streams, this is simple since it forms an approximate PPS sample. Since the inclusion of items is negatively correlated, a fixed size PPS sample of size has variance upper bounded by
(1)
When the marginal sampling probabilities are small, this upper bound is nearly exact. For the non-i.i.d. case, we provide a coarse upper bound. Since is a martingale as shown in theorem 2, the quadratic variation process taking the squares of the increments yields an unbiased estimate of the variance. There are only two cases where the martingale increment is non-zero: the new item is and is not in the sketch or the new item is not and item is in the smallest bin. In each case the expected squared increment is since the updated value is where . Let be the time when item becomes “sticky.” That is, it is the time at which a bin acquires label and never changes afterwards. If item does not become sticky, then . Define . It is the number of times item is added up to when it becomes sticky. This leads to the following upper bound on the variance
(2)
(3)
We note that the same variance argument holds when computing a further aggregation to estimate for a set of items . In this case is the total number of times items in are added to the sketch excluding the deterministic additions to the final set of “sticky” bins.
To obtain a variance estimate for a count, we plug in an estimate for into equation 3. We use the following estimate
(4)
(5)
where is the greater of and the number of times an item in appears in the sketch.
The estimate is an upward biased estimate for . For items with count , one has no information about their relative frequency compared to other infrequent items. Thus, we choose the worst case as our estimate . For items with count , we also take a worst case approach for estimating . Consider a bin with size . The probability that an additional item will cause a change in the label is . Since is the largest possible “non-sticky” bin, it follows where . Taking the expectation given gives the upward biased estimate . In this case, we drop the for simplicity and because it is an overestimate.
We compare this variance estimate with the variance of a Poisson PPS sample and show that they are similar for infrequent items, but that the estimate adds an additional term for each frequent item in the worst case for Unbiased Space Saving. Note that in the i.i.d. scenario for Unbiased Space Saving, and converges to . Plugging these into equation 5 gives a variance estimate of which differs only by a factor of from the variance of a Poisson PPS sample given in equation 1. For infrequent items, is typically small. For frequent items, a Poisson PPS sample has inclusion probability and zero variance. In this case, the worst case behavior for Unbiased Space Saving contributes the same variance as an infrequent item.
The similar behavior to PPS samples is also borne out by experimental results. Figure 8 shows that the variance estimate is often very accurate and close to the variance of a true PPS sample.
6.5 Confidence Intervals
As the inclusion of a specific item is a binary outcome, confidence intervals for individual counts are meaningless. However, the variance estimate allows one to compute Normal confidence intervals when computing sufficiently large subset sums. Thus, a system employing the sketch can provide estimates for the error along with the count estimate itself. These estimates are valid even when the input stream is a worst case non-i.i.d. stream. Experiments in section 7 show that these Normal confidence intervals have close to or better than advertised coverage whenever the central limit theorem applies, even for pathological streams.
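As a small illustration of how such an interval might be formed, the snippet below applies the usual Normal approximation: the subset sum estimate plus or minus a Normal quantile times the square root of the variance estimate. The function name and the numeric inputs are hypothetical, and the variance estimate is assumed to be supplied by the sketch.

```python
from math import sqrt
from statistics import NormalDist


def normal_confidence_interval(estimate, variance_estimate, level=0.95):
    """Two-sided Normal interval: estimate +/- z * sqrt(variance)."""
    # For level = 0.95 this quantile is approximately 1.96.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(variance_estimate)
    return estimate - half_width, estimate + half_width


# Hypothetical subset sum estimate of 1000 with estimated variance 400.
low, high = normal_confidence_interval(1000.0, 400.0)
print(low, high)
```

Because the variance estimate is upward biased, intervals built this way tend to be conservative rather than too narrow.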
6.6 Robustness
For the same reasons it has much better behavior under pathological sequences, Unbiased Space Saving is also more robust to adversarial sequences than Deterministic Space Saving. Theorem 11 shows that by inserting an additional items, one can force all estimated counts to 0, including estimates for frequent items, as long as they are not too frequent. This complete loss of useful information is a strong contrast to the theoretical and empirical results for Unbiased Space Saving which suggest that polluting a dataset with noise items will simply halve the sample size, since it will still return a sample that approximates a PPS sample.
Theorem 11.
Let be a vector of counts with and for all . There is a sequence of rows such that item appears exactly times, but the Deterministic Space Saving sketch returns an estimate of for all items .
Proof.
Sort the items from most frequent to least frequent. Add additional distinct items. The resulting deterministic sketch will consist only of the additional distinct items and each bin will have count . ∎
6.7 Running time and space complexity
The update operation is identical to the Deterministic Space Saving update except that it changes the label of a bin less frequently. Thus, each update can be performed in time Metwally et al. (2005) when the stream summary data structure is used. In this case the space usage is where is the number of bins.
7 Experiments
We perform experiments with both simulations and real ad prediction data. For synthetic data, we consider three cases: randomly permuted sequences, realistic pathological sequences for Deterministic Space Saving, and “pathological” sequences for Unbiased Space Saving. For each we draw the count for each item using a Weibull distribution that is discretized to integer values. That is for item . The discretized Weibull distribution is a generalization of the geometric distribution that allows us to adjust the tail of the distribution to be more heavy tailed. We choose it over the Zipfian or other truly heavy tailed distributions as few real data distributions have infinite variance. Furthermore, we expect our methods to perform better under heavy tailed distributions with greater data skew as shown in figure 3. For more easily reproducible behavior we applied the inverse cdf method where the are on a regular grid of values rather than independent random variables. Randomly permuting the order in which individual rows arrive yields an exchangeable sequence which we note is identical to an i.i.d. sequence in the limit by de Finetti’s theorem. In each case, we draw at least samples to estimate the error.
For real data, we use a Criteo ad click prediction dataset (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/). This dataset provides a sample of 45 million ad impressions. Each sample includes the outcome of whether or not the ad was clicked as well as multiple integer valued and categorical features. We do not randomize the order in which data arrives in this case. We pick a subset of 9 of these features. There are over 500 million possible tuples on these features and many more possible filtering conditions.
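Returning to the synthetic data, the inverse cdf construction on a regular grid might be sketched as follows. The details are assumptions made for illustration: the grid uses midpoints (i + 0.5)/n, values are discretized by rounding to the nearest integer, and the scale and shape parameters are hypothetical rather than those used in the experiments.

```python
from math import log


def discretized_weibull_counts(num_items, scale, shape):
    """Item counts from the Weibull inverse cdf evaluated on a regular grid."""
    counts = []
    for i in range(num_items):
        u = (i + 0.5) / num_items  # regular grid in (0, 1); an assumed choice
        # Weibull inverse cdf: scale * (-ln(1 - u)) ** (1 / shape).
        x = scale * (-log(1.0 - u)) ** (1.0 / shape)
        counts.append(round(x))    # discretize by rounding (assumption)
    return counts


# Hypothetical parameters; a shape below 1 gives a heavier tail.
counts = discretized_weibull_counts(num_items=1000, scale=100.0, shape=0.5)
print(counts[:5], counts[-5:], sum(counts))
```

Using a deterministic grid instead of random uniforms makes the set of item counts identical across runs, so only the arrival order and the sketch's own randomness vary.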
The Criteo dataset provides a natural application of the disaggregated subset sum problem. Historical clicks are a powerful feature in click prediction Richardson et al. (2007), Hillard et al. (2010). While the smallest unit of analysis is the or the pair, the data is in a disaggregated form with one row per impression. Furthermore, since there may not be enough data for a particular ad, the relevant click prediction feature may be the historical click through rate for the advertiser or some other higher level aggregation. Past work using sketches to estimate these historical counts Shrivastava et al. (2016) includes the CountMin counting sketch as well as the Lossy Counting frequent item sketch.
To simulate a variety of possible filtering conditions, we draw random subsets of 100 items to evaluate the randomly permuted case. As expected, subsets which mostly pick items in the tail of the distribution and have small counts also have estimates with higher relative root mean squared error. The relative root mean squared error (RRMSE) is defined as where is the true subset sum. For unbiased estimators this is equivalent to where is the standard deviation of the estimator. Note that an algorithm with times the root mean squared error of a baseline algorithm typically requires times the space as the variance, not the standard deviation, scales linearly with size.
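The error metric and the space trade-off described above can be written out in a few lines. The function names and sample numbers are hypothetical; the RRMSE is computed as the root mean squared error over repeated estimates divided by the true subset sum, and the space multiplier assumes that variance scales inversely with sketch size.

```python
from math import sqrt


def rrmse(estimates, true_sum):
    """Relative root mean squared error of repeated subset sum estimates."""
    mse = sum((e - true_sum) ** 2 for e in estimates) / len(estimates)
    return sqrt(mse) / true_sum


def space_multiplier(rmse_ratio):
    """Approximate extra space needed to offset a given RMSE ratio.

    Assumes variance is inversely proportional to sketch size, so matching a
    baseline with c times lower RMSE takes about c ** 2 times the space.
    """
    return rmse_ratio ** 2


# Two hypothetical estimates of a true subset sum of 100.
print(rrmse([90.0, 110.0], 100.0))
# An algorithm with twice the RMSE needs roughly four times the space.
print(space_multiplier(2.0))
```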
We compare our method to uniform sampling of items using the bottom-k sketch, priority sampling, and Deterministic Space Saving. Although we do not directly compare with sample and hold methods, we note that figure 2 in Cohen et al. (2007) shows that sample and hold performs significantly worse than priority sampling.
Surprisingly, figure 5 shows that Unbiased Space Saving performs better than priority sampling even though priority sampling is applied on pre-aggregated data. We are unsure as to the exact reason for this. However, we note that, unlike Unbiased Space Saving, priority sampling does not ensure the total count is exactly correct. A priority sample of size when all items have the same count will have relative error of when estimating the total count.
This added variability in the threshold and the relatively small sketch sizes for the simulations on i.i.d. streams may explain why Unbiased Space Saving performs even better than what could be considered close to a ”gold standard” on pre-aggregated data.
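The variability of the priority sampling total can be seen in a small simulation. The code follows the standard priority sampling estimator of Duffield et al. (2007): each item gets priority weight/u for a uniform u, the k largest priorities are kept, and each kept item is estimated as the larger of its weight and the (k+1)-th largest priority. The function name, the equal weights, and the sample size are hypothetical choices for illustration.

```python
import random


def priority_sample_total(weights, k, rng):
    """Priority sampling estimate of the total weight from a size-k sample."""
    # 1 - rng.random() lies in (0, 1], which avoids division by zero.
    ranked = sorted(
        ((w / (1.0 - rng.random()), w) for w in weights), reverse=True
    )
    # Threshold: the (k+1)-th largest priority (0 if every item is kept).
    tau = ranked[k][0] if k < len(ranked) else 0.0
    return sum(max(w, tau) for _, w in ranked[:k])


rng = random.Random(2)
weights = [10] * 1000  # all items have the same count
true_total = sum(weights)
totals = [priority_sample_total(weights, 100, rng) for _ in range(200)]

print(min(totals), max(totals))           # the estimates vary around the truth
print(sum(totals) / len(totals), true_total)
```

The average of the estimates is close to the true total, as expected for an unbiased estimator, but individual estimates fluctuate with the random threshold. Unbiased Space Saving, by contrast, always reports the exact total because every row increments exactly one bin.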
7.1 Pathological cases and variance
For pathological sequences we find that Unbiased Space Saving performs well in all cases while Deterministic Space Saving gives unacceptably large errors even for reasonable non-i.i.d. sequences. First we consider a pathological sequence for Deterministic Space Saving. This sequence is generated by splitting the sequence into two halves. Each half is an independent i.i.d. stream from a discretized Weibull frequency distribution. This is a natural scenario as the data may be randomly partitioned into blocks, for example, by hashed user id, and each block is fed into the sketch for summarization. As shown in figure 7, Deterministic Space Saving completely ignores infrequent items in the first half of the stream, resulting in large bias and error. In this case, the sketches used are small with only bins, and the disparity would only increase with larger sketches and streams where the bias of Deterministic Space Saving remains the same but the variance decreases for Unbiased Space Saving.
The types of streams that induce worst case behavior for Deterministic and Unbiased Space Saving are different. For Unbiased Space Saving, we consider a sorted stream arranged in ascending order by frequency. Note that the reverse order where the largest items occur first gives an optimally favorable stream for Unbiased Space Saving. Every frequent item is deterministically included in the sketch, and the count is exact. The sequence consists of distinct items and rows where the item counts are from a discretized Weibull distribution. We use bins in these experiments. To evaluate our method, we partition the sequence into 10 epochs containing an equal number of distinct items and estimate the counts of items from each epoch. We find in this case our variance estimate given in equation 5 yields an upward biased estimate of the variance as expected. Furthermore, it is accurate except for very small counts and the last items in a stream. Figure 9 shows the true counts and the corresponding 95% confidence intervals computed as . In epochs 4 and 5, there are on average roughly 3 and 13 items in the sample, and the asymptotic properties from the central limit theorem needed for accurate normal confidence intervals have not fully manifested. For epochs 1 and 2, the upward bias of the variance estimate gives 100% coverage despite the central limit theorem not being applicable. The coverage of a confidence interval is defined to be the probability the interval includes the true value. A 95% confidence interval should have almost exactly 95% coverage. Lower coverage represents an underestimation of variability or risk. Less harmful is higher coverage, which represents an overly conservative estimation of variability.
We note that the behavior of Deterministic Space Saving is easy to derive in this case. The first 9 epochs have estimated count of 0 and the last epoch has estimated count . Figure 10 shows that except for small counts, Unbiased Space Saving performs an order of magnitude better than Deterministic Space Saving.
8 Conclusion
We have introduced a novel sketch, Unbiased Space Saving, that answers both the disaggregated subset sum and frequent item problems and gives state of the art performance under all scenarios. Surprisingly, for the disaggregated subset sum problem, the sketch can outperform even methods that run on pre-aggregated data.
We prove that asymptotically, it can answer the frequent item problem for i.i.d. sequences with probability 1 eventually. Furthermore, it gives stronger probabilistic consistency guarantees on the accuracy of the count than previous results for Deterministic Space Saving. For non-i.i.d. streams, we show that Unbiased Space Saving still has attractive frequent item estimation properties and exponential concentration of inclusion probabilities to 1.
For the disaggregated subset sum problem, we prove that the sketch provides unbiased results. For i.i.d. streams, we show that items selected for the sketch are sampled approximately according to an optimal PPS sample. For non-i.i.d. streams we show that it empirically performs well and is close to a PPS sample even if given a pathological stream on which Deterministic Space Saving fails badly. We derive a variance estimator for subset sum estimation and show that it is nearly equivalent to the estimator for a PPS sample. It is shown to be accurate on pathological sequences and yields confidence intervals with good coverage.
We study Unbiased Space Saving’s behavior and connections to other data sketches. In particular, we identify that the primary difference between many of the frequent item sketches is a slightly different operation to reduce the number of bins. We use that understanding to provide multiple generalizations to the sketch which allow it to be applied in distributed settings, handle weight decay over time, and adaptively change its size over time. This also allows us to compare Unbiased Space Saving to the family of sample and hold sketches that are also designed to answer the disaggregated subset sum problem, and to show mathematically that Unbiased Space Saving is superior.
References
- Agarwal et al. [2013] P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and K. Yi. Mergeable summaries. ACM Transactions on Database Systems, 38(4):26, 2013.
- Alon et al. [1999] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
- Brewer et al. [1972] K. R. Brewer, L. Early, and S. Joyce. Selecting several samples from a single population. Australian & New Zealand Journal of Statistics, 14(3):231–239, 1972.
- Cohen [2015] E. Cohen. Multi-objective weighted sampling. In Hot Topics in Web Systems and Technologies (HotWeb), 2015 Third IEEE Workshop on, pages 13–18. IEEE, 2015.
- Cohen and Kaplan [2007] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In PODC, 2007.
- Cohen and Kaplan [2013] E. Cohen and H. Kaplan. What you can do with coordinated samples. In RANDOM, 2013.
- Cohen et al. [2007] E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup. Sketching unaggregated data streams for subpopulation-size queries. In PODS. ACM, 2007.
- Cormode and Hadjieleftheriou [2008] G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. VLDB, 1(2):1530–1541, 2008.
- Cormode and Muthukrishnan [2005] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
- Cormode et al. [2009] G. Cormode, V. Shkapenyuk, D. Srivastava, and B. Xu. Forward decay: A practical time decay model for streaming systems. In ICDE, pages 138–149. IEEE, 2009.
- Demaine et al. [2002] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In European Symposium on Algorithms, pages 348–360, 2002.
- Deville and Tillé [1998] J. Deville and Y. Tillé. Unequal probability sampling without replacement through a splitting method. Biometrika, 85(1):89–101, 1998.
- Duffield et al. [2007] N. Duffield, C. Lund, and M. Thorup. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM), 54(6):32, 2007.
- Estan and Varghese [2003] C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Transactions on Computer Systems (TOCS), 21(3):270–313, 2003.
- Estan et al. [2004] C. Estan, K. Keys, D. Moore, and G. Varghese. Building a better netflow. ACM SIGCOMM Computer Communication Review, 34(4):245–256, 2004.
- Ghashami et al. [2016] M. Ghashami, E. Liberty, and J. M. Phillips. Efficient frequent directions algorithm for sparse matrices. KDD, 2016.
- Gibbons and Matias [1998] P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. ACM SIGMOD Record, 27(2):331–342, 1998.
- He et al. [2014] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at facebook. In Proceedings of the International Workshop on Data Mining for Online Advertising, pages 1–9. ACM, 2014.
- Hillard et al. [2010] D. Hillard, S. Schroedl, E. Manavoglu, H. Raghavan, and C. Leggetter. Improving ad relevance in sponsored search. In WSDM, pages 361–370. ACM, 2010.
- Karp et al. [2003] R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS), 28(1):51–55, 2003.
- Liberty [2013] E. Liberty. Simple and deterministic matrix sketching. In KDD. ACM, 2013.
- Manku and Motwani [2002] G. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB, 2002.
- Metwally et al. [2005] A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In ICDT, 2005.
- Misra and Gries [1982] J. Misra and D. Gries. Finding repeated elements. Science of computer programming, 2(2):143–152, 1982.
- Mitzenmacher et al. [2012] M. Mitzenmacher, T. Steinke, and J. Thaler. Hierarchical heavy hitters with the space saving algorithm. In Meeting on Algorithm Engineering & Experiments, pages 160–174, 2012.
- Richardson et al. [2007] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: estimating the click-through rate for new ads. In WWW, pages 521–530. ACM, 2007.
- Sekar et al. [2006] V. Sekar, N. Duffield, O. Spatscheck, J. van der Merwe, and H. Zhang. Lads: large-scale automated ddos detection system. In USENIX, 2006.
- Shrivastava et al. [2016] A. Shrivastava, A. C. König, and M. Bilenko. Time adaptive sketches (ada-sketches) for summarizing data streams. SIGMOD, 2016.
- Szegedy [2006] M. Szegedy. The dlt priority sampling is essentially optimal. In STOC, pages 150–158. ACM, 2006.
- Vengerov et al. [2015] D. Vengerov, A. C. Menck, M. Zait, and S. P. Chakkappen. Join size estimation subject to filter conditions. Proceedings of the VLDB Endowment, 8(12):1530–1541, 2015.
- Zhang et al. [2004] Y. Zhang, S. Singh, S. Sen, N. Duffield, and C. Lund. Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications. In Internet Measurement Conference (IMC), pages 101–114. ACM, 2004.