Real-world datasets are often noisy; imprecise references to the same real-world entities are ubiquitous in business and scientific databases. For example, YouTube contains many videos of almost the same content; they appear slightly different due to cuts, compression, and changes of resolution. A large number of webpages on the Internet are near-duplicates of each other. Numerous tweets and WhatsApp/WeChat messages are re-sent with small edits. This phenomenon makes data analytics more difficult: direct statistical analysis on such noisy datasets will be erroneous. For instance, if we perform standard distinct sampling, then the sampling will be biased towards those elements that have a large number of near-duplicates.
On the other hand, due to the sheer size of the data it becomes infeasible to perform a comprehensive data cleaning step before the actual analytic phase. In this paper we study how to process datasets containing near-duplicates in the data stream model (FM85, ; AMS99, ), where we can only make a sequential scan of data items using a small memory space before the query-answering phase. When answering queries we need to treat all the near-duplicates as the same universe element.
This general problem was recently proposed in (CZ16, ), where the authors studied the estimation of the number of distinct elements of the data stream (also called $F_0$). In this paper we extend this line of research by studying another fundamental problem in the data stream literature: distinct sampling (a.k.a. $\ell_0$-sampling), where at the time of query we need to output a random sample among all the distinct elements of the dataset. $\ell_0$-sampling has many applications that we shall mention shortly.
We remark, as also pointed out in (CZ16, ), that we cannot place our hope on a magic hash function that maps all the near-duplicates to the same element and all other items to different elements, simply because such a magic hash function, if it exists, needs a lot of bits to describe.
The Noisy Data Model and Problems
Let us formally define the noisy data model and the problems we shall study. In this paper we will focus on points in the Euclidean space. More complicated data objects such as documents and images can be mapped to points in their feature spaces.
We first recall a few concepts (introduced in (CZ16, )) to facilitate our discussion. Let be the distance function of the Euclidean space, and let be a parameter (distance threshold) representing the maximum distance between any two points in the same group.
Definition 1.1 (data sparsity).
We say a dataset is -sparse in the Euclidean space for some if for any we have either or . We call the separation ratio.
Definition 1.2 (well-separated dataset).
We say a dataset is well-separated if the separation ratio of is larger than .
Definition 1.3 (natural partition; of well-separated dataset).
We can naturally partition a well-separated dataset to a set of groups such that the intra-group distance is at most , and the inter-group distance is more than . We call this the unique natural partition of . Define the number of distinct elements of a well-separated dataset w.r.t. , denoted as , to be the number of groups in the natural partition.
We will assume that is given as a user-chosen input to our algorithms. In practice, can be obtained for example by sampling a small number of items of the dataset and then comparing their labels.
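To make the natural partition concrete, the following is a minimal offline sketch (not a streaming algorithm and not part of the paper's method). It assumes points are coordinate tuples and that the dataset is well-separated, so that "within the distance threshold" is transitive and comparing a point against the first point of each existing group suffices. The names `natural_partition` and `alpha` are illustrative.

```python
import math


def natural_partition(points, alpha):
    """Greedily group points; a point joins the first group whose
    first point is within distance alpha, else it opens a new group.

    Assumes a well-separated dataset, so the result does not depend
    on which member of a group is used for the comparison.
    """
    groups = []  # groups[i][0] is the first point seen in group i
    for p in points:
        for g in groups:
            if math.dist(p, g[0]) <= alpha:
                g.append(p)
                break
        else:
            # No existing group is close enough: p starts a new group.
            groups.append([p])
    return groups


def num_distinct(points, alpha):
    """Number of distinct elements = number of groups in the partition."""
    return len(natural_partition(points, alpha))
```

On a stream this bookkeeping would need space linear in the number of groups, which is what the sampling-based algorithms of this paper are designed to avoid.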
For a general dataset, we need to define the number of distinct elements as an optimization problem as follows.
Definition 1.4 ( of general dataset).
The number of distinct elements of given a distance threshold , denoted by , is defined to be the size of the minimum cardinality partition of such that for any , and for any pair of points , we have .
Note that the definition for general datasets is consistent with the one for well-separated datasets.
We next define -sampling for noisy datasets. To differentiate it from the standard -sampling we will call it robust -sampling; but we may omit the word “robust” in the rest of the paper when it is clear from the context. We start with well-separated datasets.
Definition 1.5 (robust -sampling on well-separated dataset).
Let be a well-separated dataset with natural partition . The robust -sampling on outputs a point such that
That is, we output a point from each group with equal probability; we call the outputted point the robust -sample.
It is a little more subtle to define robust -sampling on general datasets, since there could be multiple minimum cardinality partitions, and without fixing a particular partition we cannot define -sampling. We will circumvent this issue by targeting a slightly weaker sampling goal.
Definition 1.6 (robust -sampling on general dataset).
Let be a dataset and let . The robust -sampling on outputs a point such that,
where is the ball centered at with radius .
and thus we can rewrite (1) as
Comparing (2) and (3), one can see that the definition of robust -sampling on general dataset is consistent with that on well-separated dataset, except that we have relaxed the sample probability by a constant factor.
We study robust -sampling in the standard streaming model, where the points come one by one in order, and we need to maintain a sketch of (denoted by ) such that at any time we can output an -sample of using . The goal is to minimize the size of the sketch (or, the memory space usage) and the processing time per point under certain accuracy/approximation guarantees.
We also study the sliding window models. Let be the window size. In the sequence-based sliding window model, at any time step we should be able to output an -sample of where is the latest point that we receive by the time . In the time-based sliding window model, we should be able to output an -sample of where are points received in the last time steps . The sliding window models are generalizations of the standard streaming model (which we call the infinite window model), and are very useful in the case that we are only interested in the most recent items. Our algorithms for sliding windows will work for both sequence-based and time-based cases. The only difference is that the definitions of the expiration of a point are different in the two cases.
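The difference between the two window models can be illustrated with a small sketch. It assumes each point carries a stamp (its arrival index in the sequence-based model, its arrival time in the time-based model) and that the window covers the half-open range `(now - w, now]`; this boundary convention and the names `is_expired` and `window_contents` are illustrative assumptions.

```python
def is_expired(stamp, now, w):
    """True if a point with the given stamp lies outside (now - w, now]."""
    return stamp <= now - w


def window_contents(stamped_points, now, w):
    """Return the points of [(stamp, point), ...] that are still in the window."""
    return [p for stamp, p in stamped_points if not is_expired(stamp, now, w)]


# Five points with arrival times 1, 1, 2, 7, 9 (arrival indices 1..5).
arrivals = [(1, "a"), (1, "b"), (2, "c"), (7, "d"), (9, "e")]

# Sequence-based: stamp = arrival index, window = last w points.
by_index = [(i + 1, p) for i, (_, p) in enumerate(arrivals)]
last_three = window_contents(by_index, now=5, w=3)

# Time-based: stamp = arrival time, window = last w time steps.
recent = window_contents(arrivals, now=9, w=3)
```

Here `last_three` holds the three most recent points, while `recent` holds only the points that arrived during the last three time steps, so the same stream can yield windows of different sizes in the two models.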
This paper makes the following theoretical contributions.
We propose a robust -sampling algorithm for well-separated datasets in the streaming model in constant dimensional Euclidean spaces; the algorithm uses words of space ( is the length of the stream) and processing time per point, and succeeds with probability during the whole streaming process. This result matches the one in the corresponding noiseless data setting. See Section 2.1.
We next design an algorithm for sliding windows under the same setting. The algorithm works for both sequence-based and time-based sliding windows, using words of space and processing time per point with success probability during the whole streaming process. We comment that the sliding window algorithm is much more complicated than the one for the infinite window, and is our main technical contribution. See Section 2.2.
We further show that our algorithms can also handle datasets in high dimensional Euclidean spaces given sufficiently large separation ratios. See Section 4.
Finally, we show that our -sampling algorithms can be used to efficiently estimate in both the standard streaming model and the sliding window models. See Section 5.
We have also implemented and tested our -sampling algorithm for the infinite window case, and verified its effectiveness on various datasets. See Section 6.
We now briefly survey related works on distinct sampling, and previous work dealing with datasets with near-duplicates.
The problem of -sampling is among the most well-studied problems in the data stream literature. It was first investigated in (GT01, ; CMR05, ; FIS08, ), and the current best result is due to Jowhari et al. (JST11, ). We refer readers to (CF14, ) for an overview of a number of -samplers under a unified framework. Besides being used in various statistical estimations (CMR05, ), -sampling finds applications in dynamic geometric problems (e.g., -approximation, minimum spanning tree (FIS08, )), and dynamic graph streaming algorithms (e.g., connectivity (AGM12a, ), graph sparsifiers (AGM12b, ; AGM13, ), vertex cover (CCHM15, ; CCEHMMV16, ), maximum matching (AGM12a, ; Konrad15, ; AKLY16, ; CCEHMMV16, ), etc.; see (McGregor14, ) for a survey). However, all the algorithms for -sampling proposed in the literature only work for noiseless streaming datasets.
-sampling in the sliding windows on noiseless datasets can be done by running the algorithm in (BDM02, ) with the rank of each item being generated by a random hash function. As before, this approach cannot work for datasets with near-duplicates simply because the hash values assigned to near-duplicates will be different.
-sampling has also been studied in the distributed streaming setting (CT15, ) where there are multiple streams and we want to maintain a distinct sample over the union of the streams. The sampling algorithm in (CT15, ) is essentially an extension of the random sampling algorithms in (CMYZ12, ; TW11, ) by using a hash function to generate random ranks for items, and is thus again unsuitable for datasets with near-duplicates.
The list of works for estimation is even longer (e.g., (FM85, ; BJKST02, ; DF03, ; Ganguly07, ; FFG+08, ; KNW10, ); to mention just a few). Estimating in the sliding window model was studied in (ZZCL10, ). Again, all these works target noiseless data.
The general problem of processing noisy data streams without a comprehensive data cleaning step was only studied fairly recently (CZ16, ) for the problem. A number of statistical problems (, -sampling, heavy hitters, etc.) were studied in the distributed model under the same noisy data model (Zhang15, ). Unfortunately the multi-round algorithms designed in the distributed model cannot be used in the data stream model because on data streams we can only scan the whole dataset once without looking back.
This line of research is closely related to entity resolution (also called data deduplication, record linkage, etc.); see, e.g., (KSS06, ; EIV07, ; HSW07, ; DN09, ). However, all these works target finding and merging all the near-duplicates, and thus cannot be applied to the data stream model where we only have a small memory space and cannot store all the items.
The high level idea of our algorithm for the infinite window is very simple. Suppose we can modify the stream by keeping only one representative point (e.g., the first point according to the order of the data stream) of each group; then we can just perform uniform random sampling on the representative points, which can be done for example by the following folklore algorithm: we assign each point a random rank in , and maintain the point with the minimum rank as the sample during the streaming process. Now the question becomes:
Can we identify (not necessarily store) the first point of each group space-efficiently?
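The folklore minimum-rank sampler mentioned above can be sketched as follows. The sketch assumes ranks are drawn uniformly from the unit interval and that the stream already contains only one representative per group; the function name `min_rank_sample` is illustrative.

```python
import random


def min_rank_sample(stream, rng=None):
    """Return a uniformly random item of `stream` using O(1) memory.

    Each item gets an independent random rank; the item with the
    minimum rank seen so far is kept as the current sample.
    """
    if rng is None:
        rng = random.Random()
    best_rank, best_item = None, None
    for item in stream:
        rank = rng.random()  # assumed rank range: [0, 1)
        if best_rank is None or rank < best_rank:
            best_rank, best_item = rank, item
    return best_item  # None for an empty stream
```

Since the ranks are independent and identically distributed, every item is equally likely to hold the minimum, so the kept item is a uniform sample at every prefix of the stream.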
Unfortunately, we will need to use space ( is the number of groups) to identify the first point of each group for a noisy streaming dataset, since we have to store at least bit to “record” the first point of each group to avoid selecting other later-coming points of the same group. One way to deal with this challenge is to subsample a set of groups in advance, and then only focus on the first points of this set of groups. Two issues remain to be dealt with:
How to sample a set of groups in advance?
How to determine the sample rate?
Note that before we have seen all points in the group, the group itself is not well-defined, and thus it is difficult to assign an ID to a group at the beginning and perform the subsampling. Moreover, the number of groups will keep increasing as we see more points; we therefore have to decrease the sample rate along the way to keep the space usage small.
For the first question, the idea is to post a random grid of side length ( is the group distance threshold) upon the point set, and then sample cells of the grid instead of groups using a hash function. We then say a group
is sampled if and only if ’s first point falls into a sampled cell;
is rejected if has a point in a sampled cell, but ’s first point is not in a sampled cell;
is ignored if has no point in a sampled cell.
We note that the second item is critical since we want to judge a group only by its first point; even if there is another point in the group that is sampled, if it is not the first point of the group, then we will still consider the group as rejected. On the other hand, we do not need to worry about those ignored groups since they are not considered at the very beginning.
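The three-way classification above can be sketched in a few lines. The sketch assumes a group is given as its points in stream order, that the grid is anchored at the origin (the random shift is omitted), and that `is_sampled` is any predicate on cells; `cell_of` and `classify_group` are illustrative names.

```python
import math


def cell_of(p, side):
    """Index of the grid cell (of the given side length) containing point p."""
    return tuple(math.floor(x / side) for x in p)


def classify_group(group, side, is_sampled):
    """Classify a group, given as its points in arrival order.

    'sampled'  : the first point lies in a sampled cell
    'rejected' : some later point lies in a sampled cell, the first does not
    'ignored'  : no point lies in a sampled cell
    """
    flags = [is_sampled(cell_of(p, side)) for p in group]
    if flags[0]:
        return "sampled"
    if any(flags):
        return "rejected"
    return "ignored"
```

Judging a group only by its first point is what makes the decision depend on the stream order alone rather than on how many near-duplicates the group contains.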
To guarantee that our decision is consistent on each group we have to keep some neighborhood information on each rejected group as well to avoid “double-counting”, which seems to be space-expensive at first glance. Fortunately, for constant dimensional Euclidean space, we can show that if grid cells are randomly sampled, then the number of non-sampled groups is within a constant factor of that of sampled groups. We thus can control the space usage of the algorithm by dynamically decreasing the sample rate for grid cells. More precisely, we try to maintain a sample rate as low as possible while guaranteeing that there is at least one group that is sampled. This answers the second question.
The situation in the sliding window case becomes complicated because points will expire, and consequently we cannot keep decreasing the grid cell sample rate. In fact, we have to increase the cell sample rate when there are not enough groups being sampled. However, if we increase the cell sample rate in the middle of the process, then the neighborhood information of those previously ignored groups has already been lost. To handle this dilemma we choose to maintain a hierarchical sampling structure. We describe the high level ideas as well as the actual algorithm in Section 2.2.2 after some basic algorithms and concepts have been introduced.
For general datasets, we show that our algorithms for well-separated datasets can still return an almost uniform random distinct sample. We first relate our robust -sampling algorithm to a greedy partition process, and show that our algorithm will return a random group among the groups generated by that greedy partition. We then compare that particular greedy partition with the minimum cardinality partition, and show that the number of groups produced by the two partitions are within a constant factor of each other.
Comparison with (CZ16, ). We note that although this work follows the noisy data model of (CZ16, ) and the roadmap of this paper is similar to that of (CZ16, ) (which we think is the natural way for the presentation), the contents of this paper, namely, the ideas, proposed algorithms, and analysis techniques, are all very different from those in (CZ16, ). After all, the -sampling problem studied in this paper is different from the estimation studied in (CZ16, ). We note, however, that there are natural connections between distinct elements and distinct sampling, and thus would like to mention a few points.
In the infinite window case, we can easily use our robust -sampling algorithm to get an algorithm for -approximating robust using the same amount of space as that in (CZ16, ) (see Section 5). We note that in the noiseless data setting, the problems of -sampling and estimation can be reduced to each other by easy reductions. However, it is not clear how to straightforwardly use estimation to perform -sampling in the noisy data setting using the same amount of space as we have achieved. We believe that since there is no magic hash function, a procedure similar to finding the representative point of each group is necessary in any -sampling algorithm in the noisy data setting.
In order to deal with general datasets, in (CZ16, ) the authors introduced a concept called -ambiguity and used it as a parameter in the analysis. Intuitively, -ambiguity measures the least fraction of points that we need to remove in order to make the dataset well-separated. This definition works for problems whose answer is a single number, which does not depend on the actual group partition. However, different group partitions do affect the result of -sampling, even if all those partitions have the minimum cardinality. In Section 3 we show that by introducing a relaxed version of random sampling we can bypass the issue of data ambiguity.
In Table 1 we summarize the main notations used in this paper. We use to denote .
We say is -approximation of if .
|stream of points|
|length of the stream|
|length of the sliding window|
|number of groups|
|set of groups / a group|
|group containing point|
|threshold of group diameter|
|grid / a grid cell|
|cell containing point|
|set of cells adjacent to cell|
|approximation ratio for|
We need the following versions of the Chernoff bound.
Lemma 1.7 (Standard Chernoff Bound).
Let be independent Bernoulli random variables such that . Let . Let . It holds that and for any .
Lemma 1.8 (Variant of Chernoff bound).
Let be independent random variables such that for some . Let . Then for any , we have
2. Well-Separated Datasets in Constant Dimensions
We start with the discussion of -sampling on well-separated datasets in constant dimensional Euclidean space.
2.1. Infinite Window
We first consider the infinite window case. We present our algorithm for -dimensional Euclidean space, but it can be trivially extended to -dimensions by appropriately changing the constant parameters.
Let be the natural group partition of the well-separated stream of points . We post a random grid with side length on , and call each grid square a cell. For a point , define to be the cell containing . Let
where is defined to be the minimum distance between and a point in . We say a group intersects a cell if .
Assume that all points have and coordinates in the range for a large enough value . Let . We assign the cell on the -th row and the -th column of the grid a numerical identification (ID) . For convenience we will use “cell” and its ID interchangeably throughout the paper when there is no confusion.
For ease of presentation, we will assume that we can use fully random hash functions for free. In fact, by Chernoff-Hoeffding bounds for limited independence (SSS95, ; DP09, ), all our analysis still holds when we adopt -wise independent hash functions, using which will not affect the asymptotic space and time costs of our algorithms.
Let be a fully random hash function, and define for a given parameter to be . We will use to perform sampling. In particular, given a set of IDs , we call the set of sampled IDs of by . We also call the sample rate of .
As discussed in the techniques overview in the introduction, our main idea is to sample cells instead of groups in advance using a hash function.
Definition 2.1 (sampled cell).
A cell is sampled by if and only if .
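As an illustration of sampling cells by a hash function, here is a minimal sketch. It assumes cell IDs are integers, that the restricted hash is obtained by reducing a base hash modulo a parameter `R` (a power of two), and that a cell is sampled when the reduced value is zero, giving a sample rate of `1/R`. The use of BLAKE2b as a stand-in for a fully random hash function and the `seed` parameter are assumptions of this sketch.

```python
import hashlib


def h(cell_id, seed=0):
    """Stand-in for a fully random hash function on cell IDs."""
    data = f"{seed}:{cell_id}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def h_R(cell_id, R, seed=0):
    """Hash value reduced to the range {0, ..., R - 1}."""
    return h(cell_id, seed) % R


def is_sampled(cell_id, R, seed=0):
    """A cell is sampled at rate 1/R when its reduced hash value is 0."""
    return h_R(cell_id, R, seed) == 0
```

With this construction the samples are nested: a cell sampled with parameter `2R` is also sampled with parameter `R`, so doubling `R` only ever removes cells from the sample, which is convenient when the sample rate is lowered during the stream.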
By our choices of the grid cell size and the hash function we have:
Fact 1 ().
(a) Each cell will intersect at most one group, and each group will intersect at most cells.
(b) For any set of points ,
In the infinite window case (this section) we choose the representative point of each group to be the first point of the group. We note that the representative points are fully determined by the input stream, and are independent of the hash function. We will define the representative point slightly differently in the sliding window case (next section).
We define a few sets which we will maintain in our algorithms.
Definition 2.2 ().
Let be the set of representative points of all groups. Define the accept set to be
and the reject set to be
For convenience we also introduce the following concepts.
Definition 2.3 (sampled, rejected, candidate group).
We say a group is a sampled group if , a rejected group if , and a candidate group if .
Figure 1 illustrates some of the concepts we have introduced so far.
Obviously, the set of sampled groups and the set of rejected groups are disjoint, and their union is the set of candidate groups. Also note that is the set of representative points of the sampled groups, and is the set of representative points of rejected groups.
We comment that it is important to keep the set , even though at the end we will only sample a point from . This is because otherwise we will have no information regarding the first points of those groups that may have points other than the first ones falling into a sampled cell, and consequently points in may also be included into , which will make the final sampling non-uniform among the groups. One may wonder whether this additional storage will cost too much space. Fortunately, since each group has diameter at most , we only need to monitor groups that are at a distance of at most away from sampled cells, whose cardinality can be shown to be small. More precisely, for a group , letting be its representative point, we monitor if and only if there exists a sampled cell such that . The set of representative points of such groups is precisely .
Our algorithm for -sampling in the infinite window case is presented in Algorithm 1. We control the sample rate by doubling the range of the hash function when the number of points of exceeds a threshold (Line 1 of Algorithm 1). We will also update and accordingly to maintain Definition 2.2.
When a new point comes, if is sampled and is the first point in (Line 1), we add to ; that is, we make the representative point of the sampled group . Otherwise if is not sampled but there is at least one sampled cell in , and is the first point in (Line 1), then we add to ; that is, we make the representative point of the rejected group .
On the other hand, if there is at least one sampled cell in (i.e., is a candidate group) and is not the first point (Line 1), then we simply discard . Note that we can test this since we have already stored the representative points of all candidate groups. In the remaining case in which is not a candidate group, we also discard .
At the time of query, we return a random point in .
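The procedure described above can be sketched for the plane as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: the grid side length is taken to be `alpha / 2` and anchored at the origin (no random shift), the cells adjacent to a point are those within distance `alpha` of it, `threshold` is a free parameter, and BLAKE2b stands in for a fully random hash function. The class and method names are hypothetical.

```python
import hashlib
import math
import random


class RobustL0Sampler:
    def __init__(self, alpha, threshold, seed=0):
        self.alpha, self.side = alpha, alpha / 2.0  # assumed grid side
        self.threshold, self.seed = threshold, seed
        self.R = 1                 # reciprocal of the cell sample rate
        self.acc, self.rej = [], []  # accept set and reject set
        self.rng = random.Random(seed)

    def _cell(self, p):
        return (math.floor(p[0] / self.side), math.floor(p[1] / self.side))

    def _sampled(self, cell):
        data = f"{self.seed}:{cell}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.R == 0

    def _adj_sampled(self, p):
        """True if some cell within distance alpha of p is sampled."""
        cx, cy = self._cell(p)
        reach = math.ceil(self.alpha / self.side)
        for i in range(cx - reach, cx + reach + 1):
            for j in range(cy - reach, cy + reach + 1):
                # Distance from p to the square cell (i, j).
                dx = max(i * self.side - p[0], 0.0, p[0] - (i + 1) * self.side)
                dy = max(j * self.side - p[1], 0.0, p[1] - (j + 1) * self.side)
                if math.hypot(dx, dy) <= self.alpha and self._sampled((i, j)):
                    return True
        return False

    def insert(self, p):
        # p is the first point of its group iff no stored representative is close.
        known = any(math.dist(p, u) <= self.alpha for u in self.acc + self.rej)
        if not known:
            if self._sampled(self._cell(p)):
                self.acc.append(p)
            elif self._adj_sampled(p):
                self.rej.append(p)
        while len(self.acc) > self.threshold:
            # Halve the sample rate and re-filter both sets.
            self.R *= 2
            old = self.acc + self.rej
            self.acc = [u for u in self.acc if self._sampled(self._cell(u))]
            self.rej = [u for u in old
                        if not self._sampled(self._cell(u)) and self._adj_sampled(u)]

    def sample(self):
        """Return a random representative from the accept set (None if empty)."""
        return self.rng.choice(self.acc) if self.acc else None
```

The reject set is what prevents a later point of a group from being mistaken for its first point: such a point always finds a stored representative nearby and is discarded.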
Correctness and Complexity
We show the following theorem.
Theorem 2.4 ().
In constant dimensional Euclidean space for a well-separated dataset, there exists a streaming algorithm (Algorithm 1) that, with probability , outputs a robust -sample at any time step. The algorithm uses words of space and processing time per point.
The correctness of the algorithm is straightforward. First, is a random subset of since each point is included in if and only if . Second, the outputted point is a random point in . The only thing left to be shown is that we have at any time step.
Lemma 2.5 ().
With probability , we have throughout the execution of the algorithm.
At the first time step of the streaming process, is added into with certainty since is initialized to be . Then keeps growing. At the moment when , is doubled so that each point in is resampled with probability . After the resampling,
By a union bound over at most resample steps, we conclude that with probability , throughout the execution of the algorithm. ∎
We next analyze the space and time complexities of Algorithm 1.
Lemma 2.6 ().
With probability we have throughout the execution of the algorithm.
Consider a fixed time step. Let . For a fixed , since (we are in the -dimensional Euclidean space), and each cell is sampled randomly, we have
We only need to prove the lemma for the case ; the case follows directly since is less likely to be included in .
For each , define to be a random variable such that if , and otherwise. Let . We have and . By a Chernoff bound (Lemma 1.7), we have
If then we immediately have . Otherwise by (6) we have
We thus have
According to Algorithm 1 it always holds that . Therefore with probability at least . The lemma follows by a union bound over time steps of the streaming process. ∎
By Lemma 2.6 the space used by the algorithm can be bounded by words. The processing time per point is also bounded by .
2.2. Sliding Windows
We now consider the sliding window case. Let be the window size. We first present an algorithm that maintains a set of sampled points in with a fixed sample rate ; it will be used as a subroutine in our final sliding window algorithm (Section 2.2.2).
2.2.1. A Sliding Window Algorithm with Fixed Sample Rate
We describe the algorithm in Algorithm 2, and explain it in words below.
Besides maintaining the accept set and the reject set as in the infinite window case, Algorithm 2 also maintains a set consisting of key-value pairs , where is the representative point of a candidate group ( can be a point outside the sliding window as long as the group has at least one point inside the sliding window), and is the latest point of the same group (thus must be in the sliding window). Define .
For each sliding window, we guarantee that each candidate group has exactly one representative point. This is achieved by the following process: for each candidate group , if there is no maintained representative point, then we take the first point as the representative point (Line 2 and Line 2). When the last point of the group expires, we delete the maintained representative point from , and delete from (Line 2).
For a new arriving point , if there already exists a point in the same group , then we simply update the last point in the pair we maintained for (Line 2). Otherwise is the first point of in the sliding window. If is a sampled group, then we add to and to (Line 2); else if is a rejected group, then we add to and to (Line 2).
Observation 1 ().
In Algorithm 2, at any time for the current sliding window, we have
Each group has exactly one representative point, which is fully determined by the stream and is independent of the hash function. More precisely, a point becomes the representative point of group in the current window if is the latest point in such that the window right before (inclusive) has no point in . See Figure 2 for an illustration.
The representative point of each group in the current window is included in with probability .
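The bookkeeping of representative and latest points can be sketched as follows. To keep the sketch short, the decision whether a new group is sampled or rejected is delegated to a caller-supplied function `classify` (hypothetical), which returns "accept", "reject", or None at the fixed sample rate. Points are coordinate tuples, each arrival carries a stamp `now`, and the window is assumed to cover `(now - w, now]`. This is an illustration under these assumptions, not the paper's Algorithm 2.

```python
import math


class FixedRateWindow:
    def __init__(self, alpha, w, classify):
        self.alpha, self.w, self.classify = alpha, w, classify
        # representative -> [status, latest point of the group, its stamp]
        self.pairs = {}

    def _expire(self, now):
        """Drop every group whose latest point has left the window."""
        self.pairs = {u: v for u, v in self.pairs.items()
                      if v[2] > now - self.w}

    def insert(self, p, now):
        self._expire(now)
        for u, entry in self.pairs.items():
            if math.dist(u, p) <= self.alpha:
                # Known group: only refresh its latest point.
                entry[1], entry[2] = p, now
                return
        # p is the first point of its group in the current window.
        status = self.classify(p)
        if status in ("accept", "reject"):
            self.pairs[p] = [status, p, now]

    def accept_set(self, now):
        """Representatives of the sampled groups alive in the window."""
        self._expire(now)
        return [u for u, v in self.pairs.items() if v[0] == "accept"]
```

Note that a representative may itself lie outside the window: it is kept as long as the latest point of its group has not expired, which matches the first item of the observation above.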
2.2.2. A Space-Efficient Algorithm for Sliding Windows
We now present our space-efficient sliding window algorithm. Note that the algorithm presented in Section 2.2.1, though being able to produce a random sample in the sliding window setting, does not have a good space usage guarantee; it may use space up to where is the window size.
The sliding window algorithm presented in this section works simultaneously for both sequence-based sliding window and time-based sliding window.
High Level Ideas
As mentioned, the main challenge of the sliding window algorithm design is that points will expire, and thus we cannot keep decreasing the sample rate. On the contrary, if at some point there are too few non-expired sampled points, then we have to increase the sample rate to guarantee that there is at least one point in the sliding window belonging to . However, if we increase the sample rate in the middle of the streaming process, then the neighborhood information of a newly sampled group may already get lost. In other words, we cannot properly maintain when the sample rate increases.
To resolve this issue we have to “prepare” such a decrease of in advance. To this end, we maintain a hierarchical set of instances of Algorithm 2, with sample rates being respectively. We thus can guarantee that in the lowest level (the one with sample rate ) we must have at least one sampled point.
Of course, to achieve a good space usage we cannot endlessly insert points to all the Algorithm 2 instances. We instead make sure that each level stores at most points, where and are the accept set and reject set respectively in the run of an Algorithm 2 instance at level . We achieve this by maintaining a dynamic partition of the sliding window. In the -th subwindow we run an instance of Algorithm 2 with sample rate . For each incoming point, we “accept” it at the highest level in which the point falls into , and then delete all points in the accept and reject sets in all the lower levels. Whenever the number of points in at some level exceeds the threshold for some constant , we “promote” most of its points to level . The process may cascade to the top level.
At the time of query we properly resample the points maintained at each to unify their sample probabilities, and then merge them to . In order to guarantee that if the sliding window is not empty then we always have at least one sampled point in , during the algorithm (in particular, the promotion procedure) we make sure that the last point of each level is always in the accept set .
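The query-time unification of sample probabilities can be illustrated with a short sketch. It assumes each level is summarized by the reciprocal `R` of its sample rate together with its accept set, and that resampling uses fresh randomness: a point from a level with parameter `R` is kept with probability `R / R_max`, so every surviving point has overall probability `1 / R_max`. The function name `unified_sample` and the input layout are assumptions; the paper's Algorithm 3 may realize the resampling differently (for example through the hash function).

```python
import random


def unified_sample(levels, rng=None):
    """levels: list of (R, accept_points) pairs, one per level.

    Returns one point chosen uniformly among the points that survive
    resampling to the common rate 1 / R_max, or None if all accept
    sets are empty.
    """
    if rng is None:
        rng = random.Random()
    nonempty = [(R, pts) for R, pts in levels if pts]
    if not nonempty:
        return None
    R_max = max(R for R, _ in nonempty)
    merged = []
    for R, pts in nonempty:
        keep = R / R_max  # equals 1.0 for the level with R == R_max
        merged.extend(p for p in pts if rng.random() < keep)
    # The level with R == R_max is non-empty and fully kept,
    # so `merged` always contains at least one point.
    return rng.choice(merged)
```

Because the points of the level with the lowest sample rate are always kept, the merged set is non-empty whenever some accept set is non-empty.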
Remark 1 ().
The hierarchical set of windows is reminiscent of the exponential histogram technique by Datar et al. (DGIM02, ) for basic counting in the sliding window model. However, on a careful look one will notice that our algorithm is very different from the exponential histogram, and is (naturally) more complicated since we need to deal with both distinct elements and near-duplicates. For example, the exponential histogram algorithm in (DGIM02, ) partitions the sliding window deterministically to subwindows of size . Suppose we are only interested in the representative point of each group; we basically need to delete all the other points in each group in the sliding window, which will change the sizes of the subwindows. Handling near-duplicates adds another layer of difficulty to the algorithm design; we handle this by employing Algorithm 2 (which is a variant of the algorithm for the infinite window in Section 2.1) at each of the subwindows with different sample rates. The interplay between these components makes the overall algorithm involved.
We use to represent an instance of Algorithm 2. For convenience, we also use to represent all the data structures that are maintained during the run of Algorithm 2, and write , where are the accept and reject sets respectively, is the key-value pair store, and is the reciprocal of the sample rate.
If after including , the size of becomes more than , we have to do a series of updates to restore the invariant that the accept set of each level contains at most points at any time step (Line 3 to Line 3). To do this, we first split the instance of into two instances (Algorithm 4). Let point be the last point in which is sampled by hash function . We promote all the points in arriving before (and including) to level by resampling them using , which gives a new level instance .
We next try to merge with , which has the same sample rate, by merging their accept/reject sets and the sets of key-value pairs (Algorithm 5). The merge may result in , in which case we have to perform the split and merge again. These operations may propagate to the upper levels until we reach a level in which we have after the merge.
Correctness and Complexity
In this section we prove the following theorem.
Theorem 2.7 ().
In constant dimensional Euclidean space for a well-separated dataset, there exists a sliding window algorithm (Algorithm 3) that, with probability , outputs a robust -sample at any time step. The algorithm uses words of space and amortized processing time per point.
First, it is easy to show that the probability that Algorithm 3 outputs “error” is negligible.
Lemma 2.8 ().
Fix a sliding window . Let be the groups in . The sample rate at level is . Let be a random variable such that if the -th group is sampled, and otherwise. Let . Thus . By a Chernoff bound (Lemma 1.8) we have that with probability , we have for a large enough constant . The lemma then follows by a union bound over at most sampling steps. ∎
The following definition is useful for the analysis.
Definition 2.9 (subwindows).
For a fixed sliding window , we define a subwindow for each instance as follows. starts from the first point in the sliding window to the last point (denoted by ) in . For , starts from to the last point (denoted by ) in . starts from to the last point in the window .
See Figure 3 for an illustration of subwindows.
We note that a subwindow can be empty. We also note the following immediate facts by the definitions of subwindows.
Fact 2 ().
covers the whole window .
Fact 3 ().
Each subwindow ends with a point in .
For , let be the set of groups whose last points lie in , and let be the set of their representative points. From Algorithms 3, 4 and 5 it is easy to see that the following is maintained during the whole streaming process.
Fact 4 ().
During the run of Algorithm 3, at any time step, is formed by sampling each point in with probability .
The following lemma guarantees that at the time of query we can always output a sample.
Lemma 2.10 ().
During the run of Algorithm 3, at any time step, if the sliding window contains at least one point, then when querying we can always return a sample, i.e., .
The lemma follows from Fact 3, and the fact that includes every point in (). ∎
Now we are ready to prove the theorem.
(for Theorem 2.7).
We have the following facts:
are the sets of representatives of disjoint sets of groups . And is the set of all groups whose last points lie inside the sliding window.
By Fact 4 each is formed by sampling each point in with probability .
By Lemma 2.10, if the sliding window is not empty.
The final sample returned is a random sample of .
By Lemma 2.8, with probability the algorithm will not output “error”.
We now analyze the space and time complexities. The space usage at each level can be bounded by words. This is due to the fact that is always no more than , and consequently has key-value pairs. Using Lemma 2.6 we have that with probability , . (Footnote: We can reduce the failure probability to by appropriately changing the constants in the proof.) Thus by a union bound, with probability , the total space is bounded by words since we have levels.
For the time cost, simply note that the time spent on each point at each level during the whole streaming process can be bounded by , and thus the amortized processing time per item can be bounded by . ∎
We conclude the section with some discussions and easy extensions.
Sampling Points with/without Replacement
Sampling groups with replacement can be trivially achieved by running instances of the algorithm for sampling one group (Algorithm 1 or Algorithm 3) in parallel. For sampling groups without replacement, we can increase the threshold at Line 1 of Algorithm 1 to , by which we can show using exactly the same analysis in Section 2.1 that with probability we have . Similarly, for the sliding window case we can increase the threshold at Line 3 of Algorithm 3 to .
Random Point As Group Representative
We can easily augment our algorithms such that instead of always returning the (fixed) representative point of a randomly sampled group, we can return a random point of the group. In other words, we want to return each point with equal probability .
For the infinite window case we can simply plug classical reservoir sampling (Vitter85, ) into Algorithm 1. We can implement this as follows: For each group that has a point stored in , we maintain an pair where is a counter counting the number of points of this group, and is the random representative point. At the beginning (when the first point of group arrives) we set . When a new point is inserted, if there exists such that (i.e., and are in the same group), we increment the counter for group , and reset with probability . For the sliding window case, we can just replace reservoir sampling with a random sampling algorithm for sliding windows (e.g., the one in (BOZ09, )).
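The per-group bookkeeping described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name `GroupReservoir` and its methods are hypothetical, and the sketch assumes the caller has already decided which group a point belongs to (in the paper this is done by a distance test against the stored points), passing that decision in as `group_id`. The replacement probability used is the standard one for reservoir sampling with a reservoir of size one: after the counter is incremented to c, the new point replaces the representative with probability 1/c.

```python
import random


class GroupReservoir:
    """Keeps, for every group, a point counter and one uniformly random member.

    Hypothetical helper for illustration; group membership is assumed to be
    resolved by the caller.
    """

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._state = {}  # group_id -> [count, representative]

    def insert(self, group_id, point):
        entry = self._state.get(group_id)
        if entry is None:
            # First point of the group: it is the representative by default.
            self._state[group_id] = [1, point]
            return
        entry[0] += 1
        # Size-one reservoir sampling: replace with probability 1/count.
        if self._rng.random() < 1.0 / entry[0]:
            entry[1] = point

    def count(self, group_id):
        return self._state[group_id][0]

    def representative(self, group_id):
        return self._state[group_id][1]
```

With this rule every point of a group is the stored representative with the same probability, so returning the representative of a uniformly sampled group returns a uniformly random point of that group.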
3. General Datasets
In this section we consider general datasets which may not be well-separated, and consequently there is no natural partition of groups. However, we show that Algorithm 1 still gives the following guarantee.
Theorem 3.1 ().
Before proving the theorem, we first study group partitions generated by a greedy process.
Definition 3.2 (Greedy Partition).
Given a dataset , a greedy partition is generated by the following process: pick an arbitrary point , create a new group and update ; repeat this process until .
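As a rough illustration of the greedy process, the following Python sketch builds one greedy partition. The symbols of the definition are not fully legible here, so the sketch rests on an assumption: the group created for a picked point consists of all remaining points within Euclidean distance `alpha` of it, where `alpha` stands for the near-duplicate threshold. The function name `greedy_partition` is hypothetical, and the "arbitrary" pick is taken to be the first remaining point.

```python
import math


def greedy_partition(points, alpha):
    """Return a list of groups covering `points`.

    Assumption: a group is the set of remaining points within distance
    `alpha` of the picked point. Points are tuples of coordinates.
    """
    remaining = list(points)
    groups = []
    while remaining:
        center = remaining[0]  # "arbitrary" pick: first remaining point
        group = [q for q in remaining if math.dist(center, q) <= alpha]
        groups.append(group)
        # Remove the new group from the dataset and repeat.
        remaining = [q for q in remaining if math.dist(center, q) > alpha]
    return groups
```

The result depends on the pick order. For the points (0, 0), (0.5, 0), (1, 0), (5, 5) with `alpha = 0.6`, starting from (0, 0) gives three groups, while starting from (0.5, 0) gives two, which is why the number of groups in a greedy partition is compared against the minimum cardinality partition.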
Lemma 3.3 ().
Given a dataset , let be the number of groups in the minimum cardinality partition of , and be the number of groups in an arbitrary greedy partition. We always have .
We first show . Let be the groups in the greedy partition in the order in which they were created, and let be the minimum partition.
We prove by induction. First it is easy to see that must cover the group containing in the minimum part