1. Introduction
Estimating the underlying distribution of data is crucial for many applications. It is common to approximate an entire Cumulative Distribution Function (CDF) or specific quantiles. The median (the 0.5 quantile) and the 95th and 99th percentiles are widely used in financial metrics, statistical tests, and system monitoring. Quantile summaries have found applications in databases (Selinger et al., 1979; Poosala et al., 1996), sensor networks (Li et al., 2011), logging systems (Pike et al., 2005), distributed systems (DeWitt et al., 1991), and decision trees (Chen and Guestrin, 2016). While computing quantiles is conceptually very simple, doing so naively becomes infeasible for very large data.
Formally, the quantiles problem can be defined as follows. Let $S$ be a multiset of items $\{s_i\}_{i=1}^{n}$. The items in $S$ exhibit a full ordering and the corresponding smaller-than comparator is known. The rank of a query $q$ (w.r.t. $S$) is the number of items in $S$ which are smaller than $q$. An algorithm should process $S$ such that it can compute the rank of any query item. Answering rank queries exactly for every query is trivially possible by storing the multiset $S$. Storing $S$ in its entirety is also necessary for this task.
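As a concrete reference point, the exact rank defined above can be computed by storing and sorting the whole multiset. The snippet below is only a minimal illustration; the function name `exact_rank` is ours and is not taken from the paper.

```python
from bisect import bisect_left


def exact_rank(items, q):
    """Exact rank of q: the number of stored items strictly smaller than q.

    Sorting on every call keeps the example short; it needs the whole
    multiset in memory, which is exactly what a sketch tries to avoid.
    """
    return bisect_left(sorted(items), q)


stream = [5, 1, 4, 1, 3]
print(exact_rank(stream, 4))  # the items 1, 1 and 3 are smaller than 4
```

The memory footprint of this baseline grows linearly with the stream, which is the cost the approximate formulations are meant to remove.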
An approximate version of the problem relaxes this requirement. It is allowed to output an approximate rank which is off by at most $\varepsilon n$ from the exact rank. In a randomized setting, the algorithm is allowed to fail with probability at most $\delta$. Note that, for the randomized version to provide a correct answer to all possible queries, it suffices to amplify the success probability by running the algorithm with a failure probability of $\delta\varepsilon$, and applying the union bound over $O(1/\varepsilon)$ quantiles. Uniform random sampling of $O(\varepsilon^{-2}\log(1/(\delta\varepsilon)))$ items solves this problem.
In network monitoring (Liu et al., 2016) and other applications it is critical to maintain statistics while making only a single pass over the data and minimizing the communication and update time. As a result, the problem of approximating quantiles was considered in several models including distributed settings (DeWitt et al., 1991; Greenwald and Khanna, 2004; Shrivastava et al., 2004), continuous monitoring (Cormode et al., 2005; Yi and Zhang, 2013), streaming (Agarwal et al., 2013; Karnin et al., 2016; Manku et al., 1998, 1999; Wang et al., 2013; Greenwald and Khanna, 2001, 2016), and sliding windows (Arasu and Manku, 2004; Lin et al., 2004). In the present paper, the quantiles problem is considered in a standard streaming setting. The algorithm receives the items in $S$ one by one in an iterative manner. The algorithm's approximation guarantees should not depend on the order or the content of the updates, and its space complexity should depend on $n$ at most polylogarithmically. (Footnote 1: Throughout this manuscript we assume that each item in the stream requires $O(1)$ space to store.)
In their pioneering paper (Munro and Paterson, 1980), Munro and Paterson showed that one would need $\Omega(n^{1/p})$ space and $p$ passes over the dataset to find a median. They also suggested an optimal iterative algorithm to find it. Later Manku et al. (Manku et al., 1998) showed that the first iteration of the algorithm in (Munro and Paterson, 1980) can be used to solve the $\varepsilon$-approximate quantile problem in one pass using only $O((1/\varepsilon)\log^2(\varepsilon n))$ space. Note that, for a small enough $\varepsilon$, this is a significant improvement over the naive algorithm, which samples $O((1/\varepsilon^2)\log(1/\varepsilon))$ items of the stream using reservoir sampling. The algorithm in (Manku et al., 1998) is deterministic; however, unlike reservoir sampling, it assumes the length of the stream is known in advance. In many applications such an assumption is unrealistic. In their follow-up paper (Manku et al., 1999) the authors suggested a randomized algorithm without that assumption. A further improvement by Agarwal et al. (Agarwal et al., 2013), via randomizing the core subroutine, pushed the space requirements down to $O((1/\varepsilon)\log^{3/2}(1/\varepsilon))$. Also, the new data structure was proven to be fully mergeable. Greenwald and Khanna (Greenwald and Khanna, 2001) presented an algorithm that maintains upper and lower bounds for each quantile individually, rather than one bound for all quantiles. It is deterministic and requires only $O((1/\varepsilon)\log(\varepsilon n))$ space. It is not known to be fully mergeable. Later Felber and Ostrovsky (Felber and Ostrovsky, 2015) suggested non-trivial techniques of feeding sampled items into sketches from (Greenwald and Khanna, 2001) and improved the space complexity to $O((1/\varepsilon)\log(1/\varepsilon))$. Recently Karnin et al. (Karnin et al., 2016) presented an asymptotically optimal but non-mergeable data structure with space usage of $O((1/\varepsilon)\log\log(1/\delta))$ and a matching lower bound. They also presented a fully mergeable algorithm whose space complexity is $O((1/\varepsilon)\log^2\log(1/\delta))$.
In the current paper, we suggest several further improvements to the algorithms introduced in (Karnin et al., 2016). These improvements do not affect the asymptotic guarantees of (Karnin et al., 2016) but reduce the upper bounds by constant terms, both in theory and practice. The suggested techniques also improve the worst-case update time. Additionally, we suggest two algorithms for the extended version of the problem where updates have weights. All the algorithms presented operate in the comparison model. They can only store (and discard) items from the stream and compare them. For more background on quantile algorithms in the streaming model see (Greenwald and Khanna, 2016; Wang et al., 2013).
2. A unified view of previous randomized solutions
To introduce further improvements to the streaming quantiles algorithms we will first re-explain the previous work using the simplified concepts of one pair compression and a compactor. Consider a simple problem in which your data set contains only two items $a$ and $b$ with $a < b$, while your data structure can only store one item. We focus on the comparison based framework where we can only compare items and cannot compute new items via operations such as averaging. In this framework, the only option for the data structure is to pick one of them and store it explicitly. The stored item $x$ is assigned weight 2. Given a rank query $q$ the data structure will report 0 for $q \le x$, and 2 for $q > x$. For $q \le a$ and for $q > b$ the output of the data structure will be correct; however, for $a < q \le b$ the correct rank is 1 and the data structure will output either 0 or 2. It therefore introduces a $+1$ or $-1$ error depending on which item was retained. From this point on, $q$ is an inner query with respect to the pair $(a, b)$ if $a < q \le b$ and an outer query otherwise. This lets us distinguish those queries for which an error is introduced from those that were not influenced by a compression. Figure 1 depicts the above example of one pair compression.
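The one pair compression can be illustrated with a few lines of Python. This is a minimal sketch under our own naming (`compress_pair`, `summary_rank`, `is_inner` are hypothetical helpers); it assumes ranks count items strictly smaller than the query, as in the problem definition.

```python
import random


def compress_pair(a, b, rng=random):
    """Keep one of the two items uniformly at random and give it weight 2."""
    kept = a if rng.random() < 0.5 else b
    return kept, 2


def summary_rank(kept, weight, q):
    """Rank of q as reported by the one-item summary."""
    return weight if kept < q else 0


def is_inner(a, b, q):
    """True if q falls between the two compressed items."""
    lo, hi = min(a, b), max(a, b)
    return lo < q <= hi


a, b = 3, 7
for q in (2, 5, 9):
    kept, weight = compress_pair(a, b)
    true_rank = sum(1 for x in (a, b) if x < q)
    error = summary_rank(kept, weight, q) - true_rank
    print(q, "inner" if is_inner(a, b, q) else "outer", error)
```

Outer queries always report an error of 0, while an inner query reports $+1$ or $-1$ depending on the retained item, so the error is zero in expectation.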
The example gives rise to a high-level method for the original problem with a dataset of size $n$ and memory capacity of $k$ items. Namely, 1) keep adding items to the data structure until it is full; 2) choose any pair of items with the same weight and compress them. Notice that if we choose those pairs without care, in the worst case, we might end up representing the full dataset by its top $k$ elements, introducing an error of almost $n$, which is much larger than $\varepsilon n$. Intuitively, pairs being compacted (compressed) should have their ranks as close as possible, thereby affecting as few queries as possible.
This intuition is implemented via a compactor. First introduced by Manku et al. in (Manku et al., 1998), it defines an array of $k$ items with weight $w$ each, and a compaction procedure which compresses all $k$ items into $k/2$ items with weight $2w$. A compaction procedure first sorts all items, then deletes either the even or the odd positions and doubles the weight of the rest. Figure 2 depicts the error introduced for different rank queries $q$ by a compaction procedure applied to an example array of items. Notice that the compactor utilizes the same idea as the one pair compression, but on the pairs of neighbors in the sorted array; thus by performing non-intersecting compressions it introduces an overall error of $w$ as opposed to $kw/2$.
The algorithm introduced in (Manku et al., 1998) defines a stack of $H = \lceil\log_2(n/k)\rceil$ compactors, each of size $k$. Each compactor obtains as an input a stream and outputs a stream with half the size by performing a compact operation each time its buffer is full. The output of the final compactor is a stream of length $k$ that can simply be stored in memory. The bottom compactor observes items of weight $w_1 = 1$; the next one observes items of weight $w_2 = 2$, and the top one of weight $w_H = 2^{H-1}$. The output of the compactor on the $h$-th level is the input of the compactor on the $(h+1)$-th level. Note that the error introduced on the $h$-th level is equal to the number of compactions $m_h = n/(k w_h)$ times the error introduced by one compaction, $w_h$. The total error can be computed as:
$$\mathrm{Err} = \sum_{h=1}^{H} m_h w_h = \sum_{h=1}^{H} \frac{n}{k} = \frac{Hn}{k}.$$
Setting $k = O((1/\varepsilon)\log(\varepsilon n))$ will lead to an approximation error of $\varepsilon n$. The space used by $H$ compactors of size $k$ each is $O((1/\varepsilon)\log^2(\varepsilon n))$. Note that the algorithm is deterministic.
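A single compaction step can be sketched as follows. The code is a minimal illustration rather than the implementation of (Manku et al., 1998); the function names and the explicit `keep_odd` flag are our assumptions, and the buffer is assumed to have even length.

```python
def compact(buffer, weight, keep_odd):
    """Compact a buffer of equal-weight items into half as many items.

    Positions are counted from 1 in sorted order, so keep_odd=True keeps
    the 1st, 3rd, ... items. Returns (survivors, doubled weight).
    """
    if len(buffer) % 2:
        raise ValueError("this sketch expects an even number of items")
    ordered = sorted(buffer)
    start = 0 if keep_odd else 1
    return ordered[start::2], 2 * weight


def weighted_rank(items, weight, q):
    """Total weight of the items strictly smaller than q."""
    return weight * sum(1 for x in items if x < q)


buffer = [8, 3, 5, 1, 7, 2, 6, 4]
for keep_odd in (True, False):
    survivors, w = compact(buffer, 1, keep_odd)
    errors = [weighted_rank(survivors, w, q) - weighted_rank(buffer, 1, q)
              for q in range(0, 10)]
    print(keep_odd, survivors, errors)
```

Because the compressed pairs are neighbors in sorted order, any query lies inside at most one pair, so every entry of `errors` has magnitude at most the original weight.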
Later, Agarwal et al. (Agarwal et al., 2013) suggested that the compactor choose the odd or even positions randomly and equiprobably, pushing the introduced error to zero in expectation. Additionally, the authors suggested a new way of feeding a subsampled stream into the data structure, recalling that $O((1/\varepsilon^2)\log(1/\varepsilon))$ samples preserve quantiles with $\varepsilon n$ approximation error. The proposed algorithm requires $O((1/\varepsilon)\log^{3/2}(1/\varepsilon))$ space and succeeds with high constant probability.
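To prove the result the authors introduced a random variable $X_{i,h}$ denoting the error introduced on the $i$-th compaction at the $h$-th level. Then the overall error is $\mathrm{Err} = \sum_{h=1}^{H}\sum_{i=1}^{m_h} w_h X_{i,h}$, where $w_h$ is the weight of the items at level $h$, $m_h$ is the number of compactions at that level, and $X_{i,h}$ is bounded ($|X_{i,h}| \le 1$), has mean zero and is independent of the other variables. Thus, due to Hoeffding's inequality:
$$\Pr\left[|\mathrm{Err}| > \varepsilon n\right] \le 2\exp\left(-\frac{\varepsilon^2 n^2}{2\sum_{h=1}^{H}\sum_{i=1}^{m_h} w_h^2}\right).$$
Setting $k$ and $H$ appropriately will keep the error probability bounded by $\delta$ for $O(1/\varepsilon)$ quantiles.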
The following improvements were made by Karnin et al. (Karnin et al., 2016).
(1) Use exponentially decreasing size of the compactor. Higher weighted items receive higher capacity compactors.
(2) Replace compactors of capacity 2 with a sampler. This retains only the top compactors.
(3) Keep the size of the top compactors fixed.
(4) Replace the top compactors with a GK sketch (Greenwald and Khanna, 2001).
(1) and (2) reduced the space complexity to $O((1/\varepsilon)\sqrt{\log(1/\delta)})$, (3) pushed it further to $O((1/\varepsilon)\log^2\log(1/\delta))$, and (4) led to an optimal $O((1/\varepsilon)\log\log(1/\delta))$. The authors also provided a matching lower bound. Note that the last solution is not mergeable due to the use of GK (Greenwald and Khanna, 2001) as a subroutine.
While (3) and (4) lead to asymptotically better algorithms, their implementation is complicated for application purposes and they are mostly of theoretical interest. In this paper we build upon the KLL algorithm of (Karnin et al., 2016) using only (1) and (2).
In (Karnin et al., 2016), the authors suggest that the size of the compactor decrease as $k_h = \lceil k c^{H-h} \rceil$, for $c \in (0.5, 1)$, and derive a corresponding tail bound on the total error. (Footnote 2: In fact (Karnin et al., 2016) has a fixable mistake in their derivation. For the sake of completeness, in Appendix A we clarify that the original results hold, although with slightly different constant terms.) Setting $k$ accordingly leads to the desired approximation guarantee for all $O(1/\varepsilon)$ quantiles with constant probability. Note that the smallest meaningful compactor has size 2; thus the lowest levels of the hierarchy form a stack of compactors of size 2, whose number grows with $\log n$. The authors suggested replacing that stack with a basic sampler, which picks one item at random out of every block of updates whose length is a power of two matching the height of the replaced stack; it is logically identical but consumes only $O(1)$ space. The resulting space complexity is $O((1/\varepsilon)\sqrt{\log(1/\delta)})$. We provide the pseudocode for the core routine in Algorithm 1.
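To make the geometric capacity schedule tangible, the sketch below lists the capacities of a stack of compactors. It assumes levels indexed from 0 (bottom) to `height - 1` (top), a decrease rate `c = 2/3`, and a floor of 2 on the capacity; the function name is ours.

```python
import math


def compactor_capacities(k, height, c=2 / 3, min_capacity=2):
    """Capacities from the bottom level (index 0) to the top level.

    The top level gets capacity k and each level below it is smaller by a
    factor of c, but never smaller than min_capacity.
    """
    return [max(min_capacity, math.ceil(k * c ** (height - 1 - h)))
            for h in range(height)]


caps = compactor_capacities(k=200, height=8)
print(caps)
print(sum(caps), "items in total; geometric limit k / (1 - c) =", 200 / (1 - 2 / 3))
```

The total number of stored items is bounded by a constant multiple of $k$ (plus rounding), which is why the levels of capacity 2 are the natural candidates for replacement by a sampler.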
3. Our Contribution
Although the asymptotic optimum is already achieved for the quantile problem, there remains room for improvement from a practical perspective. In what follows we provide novel modifications to the existing algorithms that improve both their memory consumption and runtime. In addition to the performance, we ensure the algorithm is easy to use by (1) having the algorithm require only a memory limit, as opposed to versions that must know certain parameter values in advance, and (2) extending the functionality of the sketching algorithm to handle weighted examples. We demonstrate the value of our algorithm in Section 5 with empirical experiments.
3.1. Lazy compactions
Consider a simplified model, in which the length of the stream $n$ is known in advance. One can easily identify the weight on the top layer of the KLL data structure, as well as the sampling rate and the size of each compactor. Additionally, these parameters do not change while processing the stream. Then note that while we are processing the first half of the stream, the top layer of KLL will be at most half full, i.e. half of the top compactor memory will not be in use while processing the first $n/2$ items. Let $s$ be the total amount of allocated memory and $c$ be the compactor size decrease rate. The top layer is of size $s(1-c)$, meaning that a fraction of $(1-c)/2$ of the memory is not used throughout that time period. The suggested value for $c$ is $2/3$, which means that this quantity is $1/6$. This is of course a lower estimate, as the other layers in the algorithm are also not fully utilized in various stages of the processing. A similar problem arises when we do not know the final $n$ and keep updating it online: when the top layer is full the algorithm compacts it into a new layer; at this moment the algorithm basically doubles its guess of the final $n$. Although some items immediately appear on the new top layer after this compaction, part of that layer remains unused until the next update of $n$, and this unused part again accounts for a constant fraction of the overall allocated memory.
We suggest that all the compactors share the pool of allocated memory and perform a compaction only when the pool is fully saturated. This way each compaction is applied to a potentially larger set of items compared to the fixed budget setting, leading to fewer compactions. Each compaction introduces a fixed amount of error, thus the total error introduced is lower. Algorithm 2 gives the formal lazy-compacting algorithm, and Figure 4 visualizes its advantage: in vanilla KLL all compactors have fewer items than their individual capacities, while in lazy KLL this is not enforced due to sharing the pool of memory. In Figure 3 you can see that the memory is indeed unsaturated even when we compact the top level.
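The difference between eager and lazy compaction can be sketched in a small self-contained class. This is not the paper's Algorithm 2: it omits the sampler, uses our own names (`KLLSketch`, `lazy`, `max_size`), assumes a decrease rate of 2/3, and in lazy mode performs a single compaction of the lowest over-full level whenever the shared pool is saturated.

```python
import math
import random


class KLLSketch:
    def __init__(self, k, c=2 / 3, lazy=True, seed=None):
        self.k, self.c, self.lazy = k, c, lazy
        self.rng = random.Random(seed)
        self.compactors = [[]]      # level h stores items of weight 2**h
        self.compactions = 0

    def capacity(self, h):
        depth = len(self.compactors) - 1 - h
        return max(2, math.ceil(self.k * self.c ** depth))

    def size(self):
        return sum(len(buf) for buf in self.compactors)

    def max_size(self):
        return sum(self.capacity(h) for h in range(len(self.compactors)))

    def update(self, item):
        self.compactors[0].append(item)
        if not self.lazy:
            self._compress(stop_after_first=False)   # per-level budgets
        elif self.size() >= self.max_size():
            self._compress(stop_after_first=True)    # shared pool is full

    def _compress(self, stop_after_first):
        h = 0
        while h < len(self.compactors):
            if len(self.compactors[h]) >= self.capacity(h):
                self._compact(h)
                if stop_after_first:
                    return
            h += 1

    def _compact(self, h):
        if h + 1 == len(self.compactors):
            self.compactors.append([])
        buf = sorted(self.compactors[h])
        leftover = [buf.pop()] if len(buf) % 2 else []
        offset = self.rng.randrange(2)               # keep odd or even positions
        self.compactors[h + 1].extend(buf[offset::2])
        self.compactors[h] = leftover
        self.compactions += 1

    def rank(self, q):
        return sum((1 << h) * sum(1 for x in buf if x < q)
                   for h, buf in enumerate(self.compactors))


data = random.Random(1).sample(range(5000), 5000)
for lazy in (False, True):
    sketch = KLLSketch(k=64, lazy=lazy, seed=7)
    for x in data:
        sketch.update(x)
    print(lazy, sketch.size(), sketch.compactions, abs(sketch.rank(2500) - 2500))
```

Running both modes on the same input makes it possible to compare the number of compactions and the observed rank error; one would expect the lazy mode to compact less often, since it lets individual levels exceed their nominal capacity while the pool has room.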
3.2. Reduced randomness via AntiCorrelations
Consider the process involving a single compactor layer. A convenient way of analyzing its performance is viewing it as a stream processing unit. It receives a stream of size $n$ and outputs a stream of size $n/2$. When collecting $k$ items it sorts them, and outputs (to the output stream) either those at even or those at odd locations. A deterministic compactor may admit an error of up to $n/k$. A compactor that decides whether to output the evens or the odds uniformly at random at every step admits an error of $\sqrt{n/k}$ in expectation, as the directions of the errors are completely uncorrelated. Here we suggest a way to force a negative correlation that reduces the mean error by a factor of $\sqrt{2}$. The idea is to group the compaction operations into pairs. At the $(2i-1)$-th compaction, choose uniformly at random whether to output the even or odd items, as described above. In the $(2i)$-th compaction, perform the opposite decision compared to the $(2i-1)$-th compaction.
This way, each coin flip defines two consecutive compactions: with probability $1/2$ it is even then odd ($e \to o$), and with probability $1/2$ it is odd then even ($o \to e$).
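The pairing of compactions can be sketched as a small coin object that a compactor would consult before each compaction. The class name and the 0/1 encoding of the odd/even choice are our assumptions.

```python
import random


class AntiCorrelatedCoin:
    """Returns 0 or 1; every second draw is the complement of the one before."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.pending = None      # complement waiting to be returned

    def flip(self):
        if self.pending is None:
            first = self.rng.randrange(2)   # fresh random decision
            self.pending = 1 - first
            return first
        second, self.pending = self.pending, None
        return second                        # forced opposite decision


coin = AntiCorrelatedCoin(seed=42)
print([coin.flip() for _ in range(10)])
```

Each consecutive pair of draws contains exactly one 0 and one 1, so only half as many random bits are consumed as with independent flips.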
Let's analyze the error under this strategy. Recall from Section 2 that for a rank query $q$ and a compaction operation, $q$ is either an inner or outer query. If it is an outer query, it suffers no error. If it is an inner query, it suffers an error of $+w$ if we output the odds and $-w$ if we output the evens. Consider the error associated with a single query after two consecutive and anti-correlated compactions. We represent the four possibilities of $q$ as $io$ (inner-outer), $oi$, $ii$, $oo$.
Table 1. Error of a single query after two anti-correlated compactions.
order | $io$ | $oi$ | $ii$ | $oo$ | probability
$e \to o$ | $-w$ | $+w$ | $0$ | $0$ | w.p. $1/2$
$o \to e$ | $+w$ | $-w$ | $0$ | $0$ | w.p. $1/2$
Clearly, in expectation every two compactions introduce $0$ error. Additionally, we conclude that instead of suffering an error of up to $w$ for every single compaction operation, we suffer that error for every two compaction operations. It follows that the variance of the error is halved, hence the mean error is cut by a factor of $\sqrt{2}$.
3.3. Error spreading
Recall that in the analysis of all compactor based solutions (Manku et al., 1998; Agarwal et al., 2013; Karnin et al., 2016; Wang et al., 2013), during a single compaction we can distinguish two types of rank queries: inner queries, for which some error is introduced, and outer queries, for which no error is introduced. Though the algorithms use this distinction in their analysis, they do not take any action to reduce the number of inner queries. It follows that for an arbitrary stream and an arbitrary query, the query may be an inner query the majority of the time, as it is treated in the analysis. In this section we provide a method that makes sure that a query has an equal chance of being inner or outer, thereby cutting in half the variance of the error associated with any query, for any stream. Consider a single compactor with a buffer of $k$ slots, and suppose $k$ is odd. On each compaction we flip a coin and then either compact the items with indices $1$ to $k-1$ (prefix compaction) or $2$ to $k$ (suffix compaction) equiprobably. This way each query is either inner or outer equiprobably. Formally, for a fixed rank query $q$: with probability at least $1/2$ it is an outer query and then no error is introduced; with probability at most $1/4$ it is an inner query with error $+w$; and with probability at most $1/4$ it is an inner query with error $-w$. We thus still have an unbiased estimator for the query's rank but the variance is cut in half. We note that the same analysis applies for two consecutive compactions using the reduced randomness improvement discussed in Section 3.2: the configurations ($io$, $oi$, $ii$, $oo$) of a query in two consecutive compactions described in Table 1 will now happen with equal probability, hence we have the same distribution for the error: $0$ with probability at least $1/2$, and $+w$ and $-w$ with probability at most $1/4$ each, meaning that the variance is cut in half compared to its worst-case analysis without the error-spreading improvement. Figure 5 visualizes the analysis of the error for a fixed query during a single compaction operation.
3.4. Sweep-compactor
The error bound for all compactor based algorithms follows from the property that every batch of $k/2$ pair compressions is disjoint. In other words, the compactor makes sure that all of the compacted pairs can be partitioned into sets of size exactly $k/2$, the intervals corresponding to each set are disjoint, and the error bound is a result of this property. In this section we provide a modified compactor that compacts pairs one at a time while maintaining the guarantee that pairs can be split into sets of size at least $k/2$ such that the intervals corresponding to the pairs of each set are disjoint. Compacting a single pair takes constant time; hence we reduce the worst-case update time from $O(1/\varepsilon)$ to $O(\log(1/\varepsilon))$. Additionally, for some data streams the disjoint batch size is strictly larger than $k/2$, resulting in a reduction in the overall error.
The modified compactor operates in phases we call sweeps. It maintains the same buffer as before and an additional threshold $\theta$ initialized as a special null value. The items in the buffer are stored in non-decreasing sorted order. When we reach capacity we compact a single pair. If $\theta$ is null we set it either to $-\infty$ (Footnote 3: Notice that $-\infty$ is still defined in the comparison model.) or to the value of the smallest item, uniformly at random. This mimics the behavior of the prefix/suffix compressions of Section 3.3. (Footnote 4: If we wish to ignore prefix/suffix compactions, $\theta$ should always be initialized to $-\infty$.) The pair we compact is a pair of consecutive items where the smaller item is the smallest item in the buffer that is larger than $\theta$. (Footnote 5: We ignore the case of items with equal value. Note that if that happens, these two items should be compacted together as this is guaranteed not to incur a loss.) If no such pair exists due to $\theta$ being too large, we start a new sweep, meaning we set $\theta$ to null and act as detailed above. We note that a sweep is the equivalent of a compaction of a standard compactor. For this reason, we consistently keep either the smaller or the larger item when compacting a single pair throughout a sweep. To stay true to the technique of Section 3.2, sweep number $2i$ draws a coin to determine whether the small or the large items are kept, and sweep number $2i+1$ does the opposite. The pseudocode for the sweep-compactor is given in Algorithm 3, and Figure 6 visualizes the inner state of the sweep-compactor during a single sweep.
Notice that for an already sorted stream the modified compactor performs only a single sweep; hence in this scenario the resulting error would not be a sum of many i.i.d. error terms, each of magnitude $w$, but rather a single error term of magnitude $w$. Though this extreme situation may not happen very often, it is likely that the data admits some sorted subsequences and that the average sweep would contain more than $k/2$ pairs. We demonstrate this empirically in our experiments.
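The sweeping behavior can be sketched as below. This is an illustration under several assumptions of ours, not the paper's Algorithm 3: the threshold is advanced to the larger item of the compacted pair, a sentinel object stands in for minus infinity, ties get no special treatment, and a plain sorted list is used, so insertion here is not constant time (a balanced structure would be needed for the stated update bound).

```python
import bisect
import random

_NEG_INF = object()   # sentinel treated as smaller than every item


class SweepCompactor:
    def __init__(self, capacity, seed=None):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.buffer = []          # kept in non-decreasing order
        self.threshold = None     # None: no sweep in progress
        self.keep_larger = False
        self.sweeps = 0

    def insert(self, item):
        """Add an item; return the items emitted with doubled weight (0 or 1)."""
        bisect.insort(self.buffer, item)
        if len(self.buffer) < self.capacity:
            return []
        return [self._compact_one_pair()]

    def _pair_index(self):
        """Index of the smallest item above the threshold that has a successor."""
        if self.threshold is _NEG_INF:
            i = 0
        else:
            i = bisect.bisect_right(self.buffer, self.threshold)
        return i if i + 1 < len(self.buffer) else None

    def _start_sweep(self):
        # minus infinity or the smallest item, equiprobably (prefix/suffix idea)
        self.threshold = _NEG_INF if self.rng.random() < 0.5 else self.buffer[0]
        if self.sweeps % 2 == 0:
            self.keep_larger = self.rng.random() < 0.5   # fresh coin
        else:
            self.keep_larger = not self.keep_larger      # anti-correlated
        self.sweeps += 1

    def _compact_one_pair(self):
        i = None if self.threshold is None else self._pair_index()
        if i is None:                 # no pair above the threshold: new sweep
            self._start_sweep()
            i = self._pair_index()
            if i is None:             # degenerate buffers (tiny or many ties)
                i = 0
        smaller, larger = self.buffer[i], self.buffer[i + 1]
        del self.buffer[i:i + 2]
        self.threshold = larger
        return larger if self.keep_larger else smaller


sorted_run = SweepCompactor(capacity=10, seed=0)
emitted = [y for x in range(1000) for y in sorted_run.insert(x)]
print(len(emitted), sorted_run.sweeps)
```

On an already sorted input the threshold always stays below the newest items, so the whole stream is processed within one sweep; on shuffled input `sweeps` grows with the length of the stream.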
4. Weighted stream extension
In the current section, we extend the existing quantiles algorithms to handle weighted inputs. Consider the stream of updates $(x_i, w_i)$, where each item $x_i$ comes with a weight $w_i$. After feeding the stream into a data structure, it is queried with a quantile $\phi$, and should report an item $x$ such that:
$$(\phi - \varepsilon)\,W \le \sum_{i:\, x_i < x} w_i \le (\phi + \varepsilon)\,W,$$
where $W = \sum_i w_i$ is the total weight of the entire stream.
Such a scenario may come up if the observed samples have different importance, e.g., in the case of boosted decision trees where examples are reweighted according to the current loss corresponding to them (Chen and Guestrin, 2016), or load balancing of tasks with a different associated cost. It can also occur when we are observing samples from a set that were not chosen uniformly at random, or when we suffer a distribution drift and wish to give preference to more recent items.
One can approach the problem naively: break down each update $(x_i, w_i)$ into $w_i$ unitary updates and feed them into the lazy sweeping KLL. In the worst case, the time to process one unitary update is $O(\log(1/\varepsilon))$, and the time to process one weighted update is $O(w_i\log(1/\varepsilon))$.
However, in a common scenario, weights do not increase exponentially over the course of the stream. In this case, for long enough streams, the vast majority of updates would have a weight $w_i$ smaller than the sampling rate $g$ of the KLL sampler object.
Recall that in KLL the sampler maintains a reservoir sample of a single item until it observes $g$ items, and then outputs the sample to the stream observed by the first compactor. Each compactor then processes its input stream and provides an output stream to the next compactor, and so on.
In (Karnin et al., 2016), in order to obtain mergeable sketches, the sampler object is in fact defined in a way that it can accept weighted inputs. It feeds the inputs into a weighted reservoir sample until the accumulated weight exceeds $g$. At that point, the sampler holds the reservoir sample, of weight less than $g$, and a new item with its own weight. One of these items is output into the input stream of the bottom compactor with a weight of exactly $g$, with probabilities that ensure an unbiased error. The weighted reservoir sampler can process updates with weights less than the sampling rate without breaking them into unitary updates. Therefore, if $w_i$ does not grow exponentially, the worst-case update time for the majority of updates does not depend on their weight.
Further, we provide two approaches for handling the weighted input scenario in the general case, where we do not assume the slow growth of $w_i$. The first is achieved via a near black-box approach, wrapping the KLL algorithm and manipulating the input data, which introduces extra overhead to both the worst-case and the amortized update time. In the second algorithm, we modify the core component, the compactor. We obtain a compactor that can handle items of different weights and use the KLL paradigm with these new compactors to handle weighted inputs. The second approach does not suffer from the overhead of manipulating the incoming data and offers the same asymptotic runtime as the unweighted version.
4.1. Splitting the Input
The first algorithm we provide is obtained via a black-box approach on top of KLL for unweighted streams. Let $H$ be the minimal integer such that the total observed weight $W$, including that of a newly observed item, fits into a sketch of height $H$, given the size $k$ of the top compactor and the sampling rate $g$. When a new item of weight $w$ is fed to the stream we view it in a (partially) binary representation:
$$w = w_0 + g\sum_{h=1}^{H} a_h 2^{h-1}.$$
Here, $0 \le w_0 < g$; for $h < H$, $a_h$ are either 0 or 1, and for $h = H$, $a_H$ can take any non-negative integer value allowed by the bound on $W$. We feed the item with weight $w_0$ to the sampler, then for all $h < H$ for which $a_h = 1$ we feed the item to the appropriate compactor of level $h$. Finally, we feed $a_H$ copies of the item to the compactor at the top level $H$. The process is depicted in Figure 7.
The computation has one additional step in case $H$ increased due to the new item. In this case, all items from the compactors whose output weight is now handled by the sampler are fed to the sampler. The entire process is given in Algorithm 4. It implements the update function, adding an item with a weight to the sketch. It invokes in a black-box manner the subprocedure we call KLL.pushItems that implements the item update of the unweighted KLL algorithm. This subprocedure can add either an arbitrary weight value for an item that is added to the sampler, or a weight equal to a power of two, determining the compactor that will receive the item.
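The partially binary decomposition of a weight can be sketched as follows. The exact indexing is an assumption on our side: weights are taken to be non-negative integers, levels are numbered 1 to `height`, and level $h$ is assumed to hold items of weight `g * 2**(h - 1)`, where `g` denotes the sampling rate. The function names are hypothetical and this is not the paper's Algorithm 4.

```python
def split_weight(w, g, height):
    """Split weight w into (sampler part, bits for levels 1..height-1, top copies)."""
    if w < 0 or g < 1 or height < 1:
        raise ValueError("expected w >= 0, g >= 1, height >= 1")
    remainder, units = w % g, w // g
    # bits[h - 1] tells whether one copy goes to the compactor at level h
    bits = [(units >> (h - 1)) & 1 for h in range(1, height)]
    top_copies = units >> (height - 1)   # everything left goes to the top level
    return remainder, bits, top_copies


def recombine(remainder, bits, top_copies, g):
    """Inverse of split_weight, used here as a sanity check."""
    height = len(bits) + 1
    units = sum(b << i for i, b in enumerate(bits)) + (top_copies << (height - 1))
    return remainder + g * units


parts = split_weight(1003, g=8, height=4)
print(parts, recombine(*parts, g=8))
```

Only the top level can receive more than one copy of an item, so a single weighted update touches each lower compactor at most once.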
Theorem 4.1.
Algorithm 4 processes a stream of weighted updates and outputs all approximate quantiles with high probability using memory . In the worstcase scenario, a single update invokes update calls to compactors of the KLL sketch, and calls to the sampler, resulting in a worstcase runtime. The amortized runtime is .
Due to space restrictions we defer the proof to Appendix B.
4.2. Weightaware Compactor
Here we suggest a solution which does not require any stream transformation. Instead, we modify the main building block, the compactor, to handle different weights for its inputs. We define a weight-aware compactor as an object that receives a stream of items of weights in $[w, 2w]$ for some scalar $w$ and outputs items of weight in $[2w, 4w]$.
Suppose you are given two pairs $(a, w_a)$ and $(b, w_b)$, such that $a < b$; however, the data structure can store only one item. Due to the limitations of the comparison model, as described in Section 2, the only option is to pick either $a$ or $b$. The weight-aware compactor chooses $a$ with probability $w_a/(w_a + w_b)$ and $b$ with probability $w_b/(w_a + w_b)$, assigns weight $w_a + w_b$ to the chosen item and drops the other one.
To carefully control the variance we define the weight-aware compactor as an array of $k$ pairs $(x_i, w_i)$ such that $w_i \in [w, 2w]$ for some predefined scalar $w$, and the compaction procedure is similar to the unweighted case:
(1) sort the array using $x_i$ as the key;
(2) break the array into pairs of neighbors;
(3) compress each pair, using the procedure described above.
The intuition behind the weighted pair compression is depicted in Figure 8, and the rest of the process is given in Algorithm 5, which due to space restrictions is available in the Appendix.
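The three compaction steps can be sketched as below. The code is a minimal illustration and not the paper's Algorithm 5; the function names are ours, items are represented as `(item, weight)` tuples, and an unpaired largest item is simply left in the buffer, which is an assumption. Each item is kept with probability proportional to its own weight, which is the choice that makes the rank error of an inner query zero in expectation.

```python
import random


def compress_weighted_pair(first, second, rng=random):
    """Merge two (item, weight) pairs into one, keeping an item w.p. ~ its weight."""
    (x_a, w_a), (x_b, w_b) = first, second
    total = w_a + w_b
    kept = x_a if rng.random() < w_a / total else x_b
    return kept, total


def weighted_compact(pairs, rng=random):
    """Sort by item, pair up neighbors, compress each pair.

    Returns (compressed pairs, leftover); leftover holds the largest item
    when the number of pairs is odd.
    """
    ordered = sorted(pairs, key=lambda p: p[0])
    leftover = [ordered.pop()] if len(ordered) % 2 else []
    compressed = [compress_weighted_pair(ordered[i], ordered[i + 1], rng)
                  for i in range(0, len(ordered), 2)]
    return compressed, leftover


rng = random.Random(3)
buffer = [(x, 1 + rng.random()) for x in (9, 2, 7, 4, 1, 8)]   # weights in [1, 2)
print(weighted_compact(buffer, rng))
```

With input weights in $[w, 2w]$ every output weight lies in $[2w, 4w]$, and the total weight of the buffer is preserved by the compaction.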
Lemma 4.2.
Given a stream of items of weights in , a weightaware compactor outputs a stream of items of weight . If the memory budget of the weightaware compactor is we have that for any query , the error in its rank in the output stream compared to the input stream is equal to . Here, the ’s are independent random variables. For every we have and w.p. 1.
Due to space restrictions we defer the proof to Appendix B.
The new algorithm operates as the unweighted version of KLL did. It maintains a hierarchy of compactors, and a sampler at the bottom of the hierarchy. A compactor at level $h$ accepts inputs of weights in $[w_h, 2w_h]$, instead of exactly $w_h$ as in the unweighted case. As before, the sampler outputs items of a fixed weight and accepts items of any smaller weight.
Theorem 4.3.
Algorithm 5 processes a stream of weighted updates and outputs all approximate quantiles with high probability using space and has both worstcase runtime and amortized runtime equal .
Due to space restrictions we defer the proof to Appendix B.
5. Experimental Results
5.1. Data Sets
To study the algorithms' properties we tested them on both synthetic and real datasets, with various sizes, underlying distributions and orders. Note that none of the approximation guarantees of the investigated algorithms depend on the order of the data; however, in practice the order might significantly influence the precision of the output within the theoretical guarantees. Surprisingly, the worst case is achieved when the dataset is randomly shuffled. Therefore, we will pay more attention to randomly ordered datasets in this section. We also experiment with semi-random orders that more closely resemble real-life applications. Due to space limitations we present here only the most interesting findings.
Our experiments were carried out on the following synthetic datasets.
Sorted is a stream with all unique items in ascending order.
Shuffled is a randomly shuffled stream with all unique items.
Trending is a stream whose items are the sum of a trend component and a mean-zero random variable. The trending stream mimics a statistical drift over time (widely used in ML).
Brownian simulates a Brownian motion or a random walk, which generates time series data not unlike CPU usage, stock market, traffic congestion, etc.
The length of the stream varies from to for all the datasets.
In addition to synthetic data we use two publicly available datasets. One contains IP addresses and the other contains text information. Both object types have a natural order and can be fed as input to a quantile sketching algorithm.
(1) Anonymized Internet Traces 2015 (CAIDA) (cai, 2015). The dataset contains anonymized passive traffic traces from the internet data collection monitor which belongs to CAIDA (Center for Applied Internet Data Analysis) and is located at an Equinix data center in Chicago, IL. For simplicity we work with a stream of IP address pairs. The comparison model is lexicographic. We evaluate the performance on prefixes of the dataset of different sizes. Note that evaluating the CDF of the underlying distribution for traffic flows helps optimize packet management. CAIDA's datasets are widely used for verifying different sketching techniques that maintain statistics over the flow, and specifically for finding quantiles and heavy hitters.
(2) Page view statistics for Wikimedia projects (Wiki) (wik, 2016). The dataset contains counts of the number of requests for each page of the Wikipedia project during 8 months of 2016. The data is aggregated by day, i.e. within each day the data is sorted and each item is assigned a count of requests during that day. Every update in this dataset is a string containing the title of a Wikipedia page. We experiment with both the original dataset and its shuffled version. Similarly to CAIDA, we consider prefixes of the Wiki dataset of different sizes. The comparison model is lexicographic.


5.2. Implementation and Evaluation Details
All the algorithms and experimental settings are implemented in Python 3.6.3. The advantage of using a scripting language is fast prototyping and readable code for distribution inside the community. Time performance of the algorithms is not the subject of the research in the current paper, and we leave its investigation for future work. This in particular applies to the sweep-compactor KLL and the algorithms for weighted quantiles, which theoretically improve the worst-case update time exponentially. All the algorithms in the current comparison are randomized, thus for each experiment the results presented are averaged over 50 independent runs. KLL and all suggested modifications are compared with each other and with LWYC (the algorithm Random from (Huang et al., 2011)). In (Wang et al., 2013) the authors carried out an experimental study of the algorithms from (Manku et al., 1998, 1999; Agarwal et al., 2013; Greenwald and Khanna, 2001) and concluded that their own algorithm (LWYC) is preferable to the others: it is better in accuracy than (Greenwald and Khanna, 2001) and similar in accuracy to (Manku et al., 1999), while LWYC has simpler logic and is easier to implement.
As mentioned earlier, we compared our algorithms under a fixed space restriction. In other words, in all experiments we fixed the space allocated to the sketch and evaluated the algorithm based on the best accuracy it can achieve under that space limit. We measured the accuracy as the maximum deviation among all quantile queries, otherwise known as the Kolmogorov-Smirnov divergence, widely used to measure the distance between the CDFs of two distributions. Additionally, we measure the variance introduced separately by the compaction steps and by sampling. Its value can help the user to evaluate the accuracy of the output. Note that for KLL this value depends on the size of the stream, and is independent of the arrival order of the items. In other words, the guarantees of KLL are the same for all types of streams, adversarial and structured. Some of our improvements change this property; recall that the sweep-compactor KLL, when applied to sorted input, requires only a single sweep per layer. For this reason, in our experiments we found the variance to be dependent not only on the internal randomness of the algorithm but also on the arrival order of the stream items.
5.3. Results
Note that the majority of the modifications presented in this paper can be combined for better performance; due to space limitations we present only some of the combinations. For simplicity, we fix the order of the suggested modifications as lazy (Section 3.1), reduced randomness (Section 3.2), error spreading (Section 3.3), and sweeping (Section 3.4), and denote each combination by four digits, i.e. 0000 denotes the vanilla KLL without any modifications, while 0011 denotes KLL with the error spreading trick and sweeping.
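The four-digit naming scheme can be decoded mechanically. The sketch below assumes each digit is a 0/1 flag given in the order listed above (lazy, reduced randomness, error spreading, sweeping); the helper `decode_variant` is a hypothetical name used only for illustration.

```python
# Order of the flags follows the order fixed in the text above.
MODIFICATIONS = ("lazy", "reduced randomness", "error spreading", "sweeping")


def decode_variant(code):
    """Return the modifications enabled by a four-digit code such as '0011'."""
    if len(code) != len(MODIFICATIONS) or set(code) - {"0", "1"}:
        raise ValueError("expected four binary digits, got %r" % code)
    return [name for name, bit in zip(MODIFICATIONS, code) if bit == "1"]


vanilla = decode_variant("0000")
spread_and_sweep = decode_variant("0011")
```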
In Figures 8(b) and 8(a) we compare the size/precision trade-off for LWYC, vanilla KLL, and KLL with modifications. First, all KLL-based algorithms provide a significantly better approximation ratio than LWYC as the space allocation grows, which confirms the theoretical guarantees. Second, the experiments make clear that all algorithms behave worse on data without any order, i.e. on a shuffled stream. Although laziness gives the most significant boost to the performance of vanilla KLL, the other modifications improve the precision even further when combined. This can be seen in Table 8(g) for the shuffled dataset and in Table 8(h) for the sorted stream. The same experiments were carried out for the CAIDA dataset (Fig. 8(d)) and for shuffled Wikipedia page statistics (Fig. 8(e)).
Although theoretically none of the algorithms should depend on the length of the dataset, we verified this property in practice; the results are shown in Figure 8(f).
In Figure 8(c) we verified that, although all the theoretical bounds hold, the performance of KLL and LWYC indeed depends on the amount of randomness in the stream: more randomness leads to less precision. Our experiments were run on the trending dataset, i.e. a stream containing two components: (mean-zero random variable) and (trend ). Figure 8(c) shows how precision drops as starts to grow (X-axis). Note that the modified algorithm does not lose precision as fast as vanilla KLL or LWYC.
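A stream of this kind can be generated along the following lines. The exact noise distribution and trend function are not recoverable from the text here, so the Gaussian noise, the linear trend, and the names `trending_stream`, `noise_scale`, and `slope` are all assumptions made for illustration.

```python
import random


def trending_stream(n, noise_scale, slope=1.0, seed=0):
    """Stream whose t-th item is a linear trend plus mean-zero Gaussian noise (assumed forms)."""
    rng = random.Random(seed)
    return [slope * t + rng.gauss(0.0, noise_scale) for t in range(n)]


def fraction_in_order(stream):
    """Share of adjacent pairs that arrive in non-decreasing order (1.0 means sorted)."""
    pairs = list(zip(stream, stream[1:]))
    return sum(a <= b for a, b in pairs) / len(pairs)


calm = trending_stream(1000, noise_scale=0.0)     # pure trend: a sorted stream
noisy = trending_stream(1000, noise_scale=100.0)  # noise dominates the trend
```

With zero noise the stream is sorted; increasing `noise_scale` moves the arrival order toward that of a shuffled stream, which is the regime where the text reports lower precision.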
6. High-Performance Implementation
For simplicity of analysis, experimentation, and exposition, the pseudocode so far assumes list-based data structures. In reality those would include link fields that double the space usage for data types whose physical size is similar to that of a pointer. Moreover, they are not very efficient in terms of update operations. In practice, a factor of two in space and update time is very significant.
Fortunately, lazy KLL can be implemented in a way that optimizes the constant factors in time and space. The key idea is to store the levels in a shared, fully packed array, with the invariant that all levels except level 0 are already sorted. Because there are no gaps between the subarrays occupied by the various levels, the data motion that must occur during a compaction is somewhat tricky; it is diagrammed in Figure 10. First, the algorithm searches the levels from left to right to find the first one that is at or above capacity. After a coin flip to decide between retaining the odd or even positions, it halves the items to the left and creates free space to the right of the level. Then an in-place merge-reduce occurs with the level above (physically to the right). Finally, the data from all lower levels is shifted to close the gap created. We end up with free space at the extreme left of the array, which can subsequently be used to grow the unsorted contents of level zero. Due to space restrictions we do not analyze this version of the algorithm here, but note that the same asymptotic bounds apply. An efficient implementation is available in the DataSketches (Rhodes et al., 2013) open source streaming algorithms library; the code can be found at datasketches.github.io (in the process of moving to datasketches.apache.org). The core code can be found at github.com/apache/incubatordatasketchesjava.
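The logic of a single compaction (coin flip, halving, merge with the level above) can be sketched as below. For readability this uses one Python list per level rather than the shared packed array, so it illustrates the logic only and not the memory layout or data motion of Figure 10. The name `compact_level`, its parameters, and the choice to hold back the largest item when the level has an odd count are assumptions, not details taken from the DataSketches code.

```python
import heapq
import random


def compact_level(levels, h, rng=random):
    """Compact level h of `levels` (a list of lists) into level h + 1."""
    buf = sorted(levels[h])          # level 0 may be unsorted; higher levels already are
    if len(levels) == h + 1:
        levels.append([])            # create the level above on demand
    # With an odd count, hold one item back so that only whole pairs are compacted.
    leftover = [buf.pop()] if len(buf) % 2 else []
    offset = rng.randrange(2)        # coin flip: keep the even or the odd positions
    survivors = buf[offset::2]
    # Each survivor represents two items, so it moves up one level (its weight doubles).
    levels[h + 1] = list(heapq.merge(levels[h + 1], survivors))
    levels[h] = leftover
    return levels


levels = [[5, 1, 4, 2, 3, 6, 7]]
compact_level(levels, 0, rng=random.Random(0))
total_weight = sum(len(level) * 2 ** h for h, level in enumerate(levels))
```

Whichever way the coin lands, the total weight represented by the levels is unchanged (seven items here), and the level above stays sorted because the survivors are merged rather than appended.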
For completeness we provide a few runtime measurements of the high-performance implementation. Figure 11 (in the appendix) shows the results of stream-processing timings performed on a 3.1 GHz MacBook Pro with 16 GB of memory running the macOS Mojave operating system. The initial behavior on each stream is somewhat complicated, but as the stream gets longer, the update time stabilizes at about 50 nanoseconds per item. We also plotted the time without the sorting of level 0, to demonstrate that this part is significant in terms of speed.
7. Conclusion
We verified experimentally that the KLL algorithm proposed by Karnin et al. (Karnin et al., 2016) achieves the predicted asymptotic improvement over LWYC (Wang et al., 2013). We proposed four modifications to KLL with provably better constants in the approximation bounds. Experiments verified that the approximation is roughly twice as good in practice compared to KLL and more than four times better compared to LWYC (with the gap growing with the space allocated to the sketch). Moreover, the worst-case update time for the presented sweep-compactor-based KLL is , which improves over the rest of the compactor-based algorithms. The two algorithms proposed for weighted streams improve over the naive extension from to while maintaining the same space complexity. Finally, we provide a very efficient data structure for maintaining compactor-based structures such as the algorithms above.
References
 cai (2015) 2015. The CAIDA UCSD Anonymized Internet Traces, 20150219. (2015). http://www.caida.org/data/passive/passive_dataset.xml
 wik (2016) 2016. Page view statistics for Wikimedia projects. (2016). https://dumps.wikimedia.org/other/pagecountsraw/
 Agarwal et al. (2013) Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei, and Ke Yi. 2013. Mergeable summaries. ACM Transactions on Database Systems (TODS) 38, 4 (2013), 26.
 Arasu and Manku (2004) Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 286–296.
 Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 785–794.
 Cormode et al. (2005) Graham Cormode, Minos Garofalakis, S Muthukrishnan, and Rajeev Rastogi. 2005. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, 25–36.
 DeWitt et al. (1991) David J DeWitt, Jeffrey F Naughton, and Donovan A Schneider. 1991. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Parallel and distributed information systems, 1991., proceedings of the first international conference on. IEEE, 280–291.
 Felber and Ostrovsky (2015) David Felber and Rafail Ostrovsky. 2015. A randomized online quantile summary in O (1/epsilon* log (1/epsilon)) words. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 40. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
 Greenwald and Khanna (2001) Michael Greenwald and Sanjeev Khanna. 2001. Space-efficient online computation of quantile summaries. In ACM SIGMOD Record, Vol. 30. ACM, 58–66.
 Greenwald and Khanna (2004) Michael B Greenwald and Sanjeev Khanna. 2004. Power-conserving computation of order-statistics over sensor networks. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 275–285.
 Greenwald and Khanna (2016) Michael B Greenwald and Sanjeev Khanna. 2016. Quantiles and equi-depth histograms over streams. In Data Stream Management. Springer, 45–86.
 Huang et al. (2011) Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu. 2011. Sampling based algorithms for quantile computation in sensor networks. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data. ACM, 745–756.
 Karnin et al. (2016) Zohar Karnin, Kevin Lang, and Edo Liberty. 2016. Optimal quantile approximation in streams. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on. IEEE, 71–78.
 Li et al. (2011) Zhenjiang Li, Mo Li, Jiliang Wang, and Zhichao Cao. 2011. Ubiquitous data collection for mobile users in wireless sensor networks. In INFOCOM, 2011 Proceedings IEEE. IEEE, 2246–2254.
 Lin et al. (2004) Xuemin Lin, Hongjun Lu, Jian Xu, and Jeffrey Xu Yu. 2004. Continuously maintaining quantile summaries of the most recent n elements over a data stream. In Data Engineering, 2004. Proceedings. 20th International Conference on. IEEE, 362–373.
 Liu et al. (2016) Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, Vyas Sekar, and Vladimir Braverman. 2016. One sketch to rule them all: Rethinking network flow monitoring with univmon. In Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 101–114.
 Manku et al. (1998) Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. 1998. Approximate medians and other quantiles in one pass and with limited memory. In ACM SIGMOD Record, Vol. 27. ACM, 426–435.
 Manku et al. (1999) Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. 1999. Random sampling techniques for space efficient online computation of order statistics of large datasets. In ACM SIGMOD Record, Vol. 28. ACM, 251–262.
 Munro and Paterson (1980) J Ian Munro and Mike S Paterson. 1980. Selection and sorting with limited storage. Theoretical computer science 12, 3 (1980), 315–323.
 Pike et al. (2005) Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. 2005. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming 13, 4 (2005), 277–298.
 Poosala et al. (1996) Viswanath Poosala, Peter J Haas, Yannis E Ioannidis, and Eugene J Shekita. 1996. Improved histograms for selectivity estimation of range predicates. In ACM Sigmod Record, Vol. 25. ACM, 294–305.
 Rhodes et al. (2013) Lee Rhodes, Kevin Lang, Alexander Saydakov, Edo Liberty, and Justin Thaler. 2013. DataSketches: A library of stochastic streaming algorithms. Open source software: https://datasketches.github.io/, (in process of moving to datasketches.apache.org). (2013).
 Selinger et al. (1979) P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data. ACM, 23–34.
 Shrivastava et al. (2004) Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri. 2004. Medians and beyond: new aggregation techniques for sensor networks. In Proceedings of the 2nd international conference on Embedded networked sensor systems. ACM, 239–249.
 Wang et al. (2013) Lu Wang, Ge Luo, Ke Yi, and Graham Cormode. 2013. Quantiles over data streams: an experimental study. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 737–748.
 Yi and Zhang (2013) Ke Yi and Qin Zhang. 2013. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65, 1 (2013), 206–223.
Appendix A Fixing the Original KLL Proof
The original paper by Karnin et al. (Karnin et al., 2016) contains a mistake regarding the number of compactions performed at a single level. Correcting the mistake is trivial and does not change the authors' claim. Nevertheless, we provide a correction of their argument. The authors use compactors of exponentially decreasing size: higher-weight items receive higher-capacity compactors. The error appears in the last inequality of the bound on , the number of compactions made at level (page in (Karnin et al., 2016)):
(1) 
where is the height of the top compactor and is the size of the compactor at height . Note that the last inequality implies , while from the definition of it follows that at least one compaction happened on level . Therefore . Fixing this slightly increases the constant in the final upper bound.
Recall that and . We reuse the notation and refer to the height of the top compactor as . Additionally, we introduce which denotes the height of the top compactor of size . Due to the choice of and we can conclude that .
Every compactor of size contains at most one item, otherwise it would be compacted. Therefore, the bottom compactors have total weight . Similarly, every compactor of size contains at most items. Then the total weight of compactors from level to is:
Putting together the total weight of the bottom and top compactors we get the upper bound on the number of items processed:
Plugging into the last inequality of Equation 1 leads to , which is times worse than the initial derivation. Repeating the argument as in (Karnin et al., 2016) and in Section 2 of the current paper, we get . As in (Karnin et al., 2016), applying Hoeffding’s inequality gives
. However, the constant has changed from to . Note that all asymptotic guarantees stay the same as in (Karnin et al., 2016).
Appendix B Missing Proofs
Theorem 4.1 Algorithm 4 processes a stream of weighted updates. It outputs all approximate quantiles with high probability using memory . In the worst-case scenario, a single update invokes update calls to compactors of the KLL sketch and calls to the sampler. This results in a worst-case update runtime. The amortized runtime is .
Proof.
The analysis of the error of this algorithm is straightforward, as an item of weight is broken into several weights summing to . For the runtime analysis, we decompose it into three parts:

push to the top compactor (line 13)
For the first part, the worst case occurs when all compactors are full and all of them must be deleted due to an increase of . Therefore, line 6 of Algorithm 4 will push items into the data structure at a total cost of . Since each of these items must have been inserted earlier, the amortized runtime for the first part is .
The second part is associated with the different components of except the largest one, . In the worst case, for all . Recall that the worst-case runtime of lazy sweeping KLL is . Therefore the worst-case runtime is . The amortized running time is as well.
Finally, the third part accounts for adding the same element to the top layer times. In the worst case, in total time. For the amortized case, although could be equal to , we do not actually need to add copies of the item but can instead record the number of times the item is inserted. It follows that the amortized cost is the same as that of inserting an item into a compactor, which is
Summing the three components, we get a worst-case runtime of and an amortized time of . ∎
Lemma 4.2 Given a stream of items of weights in , a weight-aware compactor outputs a stream of items of weight . For a size weight-aware compactor, the added error in rank between the input and output streams (for any query ) is equal to . Here, ’s are independent random variables such that and .
Proof.
The claim regarding the stream length is trivial, as every two items become a single item in the compact operation. Also, since the weight of an output is , with it follows that the output weights are in . For the error, consider an arbitrary query . In a single compact operation, is an inner query if for some even and an outer query otherwise. If is an outer query, the error associated with it is 0. Otherwise, the error is with probability and with probability . Denoting the error for at compaction as we get that and as claimed. Finally, since the size of the compactor is , a compact operation will occur for every items and indeed the number of error variables is . ∎
Theorem 4.3 Algorithm 5 processes a stream of weighted updates and outputs all approximate quantiles with high probability using space . Its worst-case update time is .
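The zero-mean behavior of the compaction error can be checked on a small example. The sketch below illustrates only the simpler case in which all items in the buffer share one weight; it does not model the weight-aware compactor of the lemma, where item weights differ. The function `rank_error` and its parameters are hypothetical names, and the buffer is assumed to be sorted and of even length.

```python
def rank_error(buf, query, offset, weight=1):
    """Rank error for `query` after compacting sorted `buf` with a given coin flip.

    offset -- 0 keeps the even positions, 1 keeps the odd positions
    weight -- common weight of the input items; survivors get weight 2 * weight
    """
    survivors = buf[offset::2]
    before = weight * sum(1 for item in buf if item < query)
    after = 2 * weight * sum(1 for item in survivors if item < query)
    return after - before


buf = [1, 2, 3, 4, 5, 6]
# An odd number of items lies below 3.5: the error is +w or -w depending on the coin.
odd_split = [rank_error(buf, 3.5, offset) for offset in (0, 1)]
# An even number of items lies below 4.5: the error is 0 for both outcomes.
even_split = [rank_error(buf, 4.5, offset) for offset in (0, 1)]
```

Since the two coin outcomes are equally likely and their errors cancel, the expected error of a single compaction is zero for every query.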
Proof.
The error in the algorithm is introduced via the compaction procedures in line 8 and via dropping the bottom compactors in line 4. The analysis for this compactor extension is similar to the unit-weight case. Let be a random variable which indicates the sign of the error introduced during the th compaction on the th level, and let it be equal to zero if no error is introduced. Note that in Lemma 4.2 we showed that and , therefore the total error introduced is
Repeating the argument as in Appendix A we conclude that:
To reach the same approximation guarantees with the same probability of failure, one needs to set , i.e. this algorithm will use twice as much space as the naive implementations. Additionally, it stores a weight for each item explicitly, which doubles the space complexity (this depends on the memory footprint of a stream item).
Note that the error introduced in line 4 is not zero in expectation and might accumulate over time. However, line 4 is only executed when an item of weight more than is processed. We can bound the overall weight of the items in the bottom compactors that were discarded as a small fraction of the overall number of items processed. The cumulative weight of the dropped items turns out to be a geometric sequence dominated by its last element, which in turn is a small fraction of the overall weight. In Appendix A we show that the total weight of items in a compactor of height is at most . For weighted compactors it is . The number of bottom compactors that are to be dropped is and the level of the highest dropped compactor is . Therefore, the total weight dropped is less than . Our goal is to bound the portion of the total weight that we dropped. Therefore, we estimate the ratio of the dropped weight to the weight of the added items.
The last equation holds since and . It follows that with each compactor drop we discard at most an portion of the stream. At the same time, before every such drop the total weight increases by at least : . We conclude that if we have a large number of compactor drops, the final error introduced is . Adjusting the input memory allowance by a constant factor leads to the desired approximation.
To process any weighted update, Algorithm 5 applies lines 8, 10 and 12. If we use lazy compactions with sweeping, lines 8 and 10 in the worst case require running time. As for line 12, we store a single item and its multiplicity , instead of pushing up to items into the top compactor. Hence, in the worst case line 12 accounts for runtime. ∎
Appendix C Base2update update procedure
Algorithm 5 contains the pseudocode for the Base2update update procedure. In it, rb is a uniform random number in , and append, delete, and sort are the standard operations of a list.
Appendix D Speed measurements of efficient implementation
Figure 11 plots the runtime measurements of the high-performance implementation discussed in Section 6. For each of several values of we measured the average (over multiple trials) time required to process a stream consisting of a random permutation of the integers between 1 and . The axis of the plot is the total processing time (including object creation and initialization, and garbage collection), measured in nanoseconds and then divided by . This can be interpreted as the average per-item update time for the data structure. The blue line excludes the sorting of level 0 and the red line includes the entire procedure. Experiments were conducted using a single thread and an Intel Core i7-2670QM 3.1 GHz processor.
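A measurement of this kind can be set up roughly as follows. This is a sketch of the methodology only, not the authors' benchmark: the harness `per_item_update_ns`, the `make_sketch` factory, and the stand-in `ListSketch` class are hypothetical, and any object with an `update(item)` method can be plugged in. Timings from an interpreted Python harness will not be comparable to the figures reported for the optimized library.

```python
import random
import time


class ListSketch:
    """Stand-in data structure (an assumption); replace with the sketch under test."""

    def __init__(self):
        self.items = []

    def update(self, item):
        self.items.append(item)


def per_item_update_ns(make_sketch, n, trials=3, seed=0):
    """Average time per item, in nanoseconds, over random permutations of 1..n."""
    rng = random.Random(seed)
    total_ns = 0
    for _ in range(trials):
        stream = list(range(1, n + 1))
        rng.shuffle(stream)              # random permutation, generated outside the timer
        start = time.perf_counter_ns()
        sketch = make_sketch()           # object creation is included in the timing
        for item in stream:
            sketch.update(item)
        total_ns += time.perf_counter_ns() - start
    return total_ns / (trials * n)


avg_ns = per_item_update_ns(ListSketch, n=1000)
```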