1. Introduction
Modern datacenters provide a distributed computing environment with hundreds of thousands of machines. Stream analytics systems are key components in such large-scale systems, as datacenter operators employ them to monitor these systems and respond to events in real time (Qian et al., 2013; Zaharia et al., 2013; Chandramouli et al., 2014; Kulkarni et al., 2015; Lin et al., 2016). For instance, datacenter network latency (Guo et al., 2015; Tan et al., 2019) and web search engine (Dean and Barroso, 2013; Jeon et al., 2016) monitoring systems collect server response latencies to assess the health of the systems and/or to guide load balancing decisions. These monitoring systems continuously receive massive amounts of data from thousands of machines, perform computations over recent data as scoped by a temporal window, and periodically report results, typically every few seconds or minutes. Supporting complex computation over such data volumes in real time requires hundreds of machines (Qian et al., 2013; Zaharia et al., 2013; Lin et al., 2016), calling for improvements in stream processing throughput (Miao et al., 2017).
The quantile operator lies at the heart of such real-time monitoring systems, as it can characterize the typical (0.5-quantile) or abnormal (0.99-quantile) behavior of the monitored system. Formally, the ϕ-quantile of n values is the element of rank ⌈ϕn⌉ in their sorted order, computed for different values of ϕ. For instance, in network troubleshooting and network health dashboards, a static set of quantiles is continuously computed on the round-trip times (RTTs) between datacenter servers so as to measure the quality of network reachability. In web search engines, a predefined set of quantiles is computed on query response times across clusters and is employed by load balancers so as to meet strict service-level agreements on query latency (Dean and Barroso, 2013). In such scenarios, highly accurate quantiles are required to reduce false-positive discoveries and ineffective scheduling decisions.
Satisfying the requirements for high-throughput, real-time, and accurate computation of quantiles is challenging, as exact and low-latency computation of quantiles is resource-intensive and often infeasible. Unlike aggregation operators (e.g., average) that can be computed incrementally with a small memory footprint, exact quantile computation requires storing and processing the entire value distribution over the temporal window. In real-world scenarios such as the network latency monitoring system (Guo et al., 2015), where millions of events arrive every second and the temporal window can be on the order of minutes, the memory and compute requirements for exact and low-latency quantile computation are massive.
In these scenarios, exact solutions are often not needed to satisfy the accuracy requirements, and approximate quantiles with low value error are acceptable if they are computable within a fraction of the resources needed for the exact solution. For instance, quantiles are often utilized to guide a high-level decision, such as in network health dashboards and web search engines, where quantiles are compared to predefined thresholds to discover outliers (Guo et al., 2015) or to guide load balancing (Dean and Barroso, 2013). In such cases, approximate quantiles should be computed within a small value error from the corresponding exact quantiles so as to avoid making different high-level decisions.

In this work, we uncover opportunities in approximate quantiles by characterizing real-world workloads. Our study shows that practical workloads have many recurring values and can have a substantial skew. For instance, in a datacenter networking latency dataset (NetMon in Figure 1), most latencies are small and concentrated, with more than 90% falling below 1,247 us, while a few latencies are very large and heavy-tailed, reaching up to 74,265 us. When studying the distributions across different time scales, we also find that the distribution of small values is self-similar (Leland et al., 1994). This is not surprising because datacenter networks work well and function consistently most of the time, resulting in similar latencies and distributions.

Implications of value distribution on approximate quantiles. We find that while existing work on approximate quantiles (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016) achieves low value errors for non-high quantiles (e.g., the 0.5- to 0.9-quantiles), it fails to achieve low value error for higher quantiles (e.g., the 0.99-quantile). These works seek to minimize the rank error, a widely used quantile approximation metric, rather than the value error. Rank error is the distance between the exact rank and the approximate rank. When distributions are skewed, a small rank error translates to a high value error for higher quantiles.
We demonstrate these implications with a real-world example. Consider a data stream of size n with its elements sorted in increasing order as x_1 ≤ x_2 ≤ … ≤ x_n. The ϕ-quantile (0 < ϕ ≤ 1) is the element with rank r = ⌈ϕn⌉. For a given rank r and data size n, prior work focuses on delivering an approximate quantile within a deterministic rank-error bound called ε-approximation, i.e., the rank of the approximate quantile is within the range [r − εn, r + εn]. Assume ε = 0.02 and a window size of n = 100K elements; the rank error is then bounded by εn = 2K, resulting in the rank interval [r − 2K, r + 2K], where r − 2K and r + 2K are the lower and upper bounds, respectively. Hence, the same rank interval delivers different value distances depending on the underlying data distribution. For the datacenter networking latency scenario, the latency at the 0.5-quantile (median, r = 50K) is 798 us, while the element at rank distance +2K (i.e., rank 52K) sits at 814 us, resulting in a value error of only 2%. On the contrary, the latency at the 0.99-quantile (r = 99K) is 1,874 us, and the element at its rank distance +2K sits at 74,265 us, which is 40x larger. These results demonstrate that designing an efficient and accurate quantile approximation algorithm requires taking into account the underlying data distribution and its influence on value errors.
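The effect is easy to reproduce on synthetic data (a sketch; the log-normal sample below is our stand-in for a skewed latency distribution, not the NetMon dataset):

```python
import random

# On a skewed, heavy-tailed sample, the same rank error of +2K yields very
# different value errors at the median versus the 0.99-quantile.
random.seed(0)
n = 100_000
data = sorted(random.lognormvariate(6.5, 0.9) for _ in range(n))  # synthetic "latencies"

def value_error_at(phi, rank_err):
    r = int(phi * n) - 1                      # exact rank (0-indexed)
    approx = data[min(r + rank_err, n - 1)]   # value at the upper end of the rank interval
    exact = data[r]
    return (approx - exact) / exact           # relative value error

print(f"median: {value_error_at(0.50, 2000):.1%}")
print(f"q0.99 : {value_error_at(0.99, 2000):.1%}")
```

With the fixed seed, the relative value error at the 0.99-quantile is orders of magnitude larger than at the median, even though the rank error is identical.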
Our proposal. We present approximate Quantiles with LOw Value Error, or QLOVE. QLOVE leverages the observation that, in monitoring scenarios, the quantiles to be computed are fixed throughout the temporal window. QLOVE takes the underlying data distribution into account for these quantiles and proposes different approaches for computing non-high quantiles and high quantiles. Each approach capitalizes on the insights from our workload characterization, delivering efficient and accurate computation of quantiles.
Non-high quantiles. Based on the observation that the value distribution of small latencies (i.e., the ones used in computing non-high quantiles) is self-similar, QLOVE first performs quantile computation at the granularity of subwindows (i.e., windows smaller than the temporal window defined by the user). The quantile for a given temporal window is computed by averaging the quantiles of all subwindows falling within the temporal window. This optimization enables QLOVE to store only the values associated with the in-flight subwindow and the quantiles of all previous subwindows, resulting in higher throughput due to a smaller memory footprint.
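A minimal sketch of this estimator (in Python; QLOVE itself is implemented in Trill/C#):

```python
import math

def subwindow_quantile(values, phi):
    """Exact phi-quantile of one subwindow: the element of rank ceil(phi*b)."""
    s = sorted(values)
    return s[math.ceil(phi * len(s)) - 1]

def qlove_estimate(window, subwindow_size, phi):
    """Average the exact subwindow quantiles to approximate the window quantile."""
    summaries = [
        subwindow_quantile(window[i:i + subwindow_size], phi)
        for i in range(0, len(window), subwindow_size)
    ]
    return sum(summaries) / len(summaries)

# e.g., a window of 4 subwindows with 5 elements each
window = [3, 1, 2, 5, 4, 2, 3, 1, 4, 5, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
print(qlove_estimate(window, 5, 0.5))  # → 3.0 (average of the four subwindow medians)
```

Only the per-subwindow quantiles need to be retained once a subwindow completes; the raw values can be discarded.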
Similar to prior work on approximate quantiles (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016), memory consumption is reduced because fewer values are retained during a temporal window. QLOVE further reduces memory consumption by capitalizing on an additional workload insight: values are mostly concentrated in a small collection of numerical values. For instance, in our datacenter network latency monitoring example, only a small fraction of the elements in a one-hour temporal window are unique. This high data redundancy allows a considerable reduction in space by maintaining the frequency distribution of in-flight data (i.e., {value, count} pairs) instead of the full value distribution. In doing so, due to the high concentration of values, our quantile estimate can achieve low value error for non-high quantiles (e.g., the median) while minimizing the memory footprint.
High quantiles. QLOVE explicitly records tail values to better approximate the high quantiles (e.g., the 0.99-quantile). Typically, these values are infrequent and can be stored efficiently. Our technique, called few-k merging, carefully chooses which and how many tail values to store based on the query parameters (i.e., window size and period) as well as the observed data distributions. We highlight two scenarios where few-k merging is needed: (i) Statistical inefficiency. When a subwindow contains too few data points, its high quantiles are not statistically robust, as they are determined by too few values. For instance, if a subwindow has 1K elements, the 0.999-quantile is decided only by the two largest elements; and (ii) Bursty traffic. If all of the tail values are concentrated in one or a few subwindows, their impact on the overall quantiles is not reflected well by the quantiles of their corresponding subwindows. We discuss how to choose the few-k size, how to merge few-k values to produce an answer for the window, and how to manage the space budget. As with non-high quantiles, we compress individual values and store the frequency distribution to reduce the memory needed for the few-k values.
We have implemented QLOVE in the Trill open-source streaming analytics engine (Chandramouli et al., 2014) and have deployed it in production in a streaming network monitoring system at a large enterprise. We evaluate QLOVE using both real-world and synthetic workloads. Our experiments show that, relative to computing exact quantiles, QLOVE offers higher throughput when subwindows have 10K elements each, and the gain grows further when subwindows have 1M elements each. Value compression lowers the space usage considerably. Moreover, the average relative value error remains low across the different quantiles. In comparison, algorithms built atop the rank-error approximation metric (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016) either incur a substantially higher relative value error than QLOVE or deliver lower throughput (on a sliding window of 100K elements that includes 10 subwindows).
In summary, this paper has the following contributions:

We define the problem of approximate computation of quantiles with low value error (rather than rank error) and present a solution, QLOVE. To the best of our knowledge, this is the first attempt to tackle this problem.

We design and implement mechanisms to reduce memory consumption through space-efficient summaries and frequency distributions, enabling high throughput in the presence of huge data streams and large temporal windows.

We implement QLOVE in a streaming engine, and show its practicality using realworld use cases.
2. Quantile Processing Model
We introduce our streaming query execution model (i.e., incremental evaluation) and define our quantile approximation problem.
Streaming model. A data stream is a continuous sequence of data elements arriving over time. Each element e has a value associated with a timestamp that captures the order of e's occurrence. Under this data stream model, a query defines a window to specify the elements to consider in the query evaluation. For example, a query uses a window to process the latest n elements seen so far, where n is the window size. Due to the continuous arrival of data, a window also requires a period to determine how frequently the query must execute (i.e., the frequency of query evaluation). For example, a query can process the most recent n elements periodically upon every insertion of p new elements, where p is the window period. Our work can also be applied to windows defined by time parameters, e.g., evaluate the query every minute (window period) over the elements seen in the last hour (window size).
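These count-based semantics can be sketched as a simple generator (Python; the helper name is ours, not part of any engine):

```python
def windows(stream, size, period):
    """Yield the count-based windows a query evaluates: every `period` new
    elements, emit the latest `size` elements (a tumbling window when
    size == period, a sliding window when size > period)."""
    buf = []
    for i, e in enumerate(stream, 1):
        buf.append(e)
        buf = buf[-size:]          # expire elements older than the window
        if i >= size and (i - size) % period == 0:
            yield list(buf)

# sliding window: size 4, period 2, over the stream 1..8
print(list(windows(range(1, 9), 4, 2)))
# → [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
```

Note how consecutive sliding windows share elements, which is exactly what makes deaccumulation necessary.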
This paper mainly considers two windowing models (Akidau et al., 2015): (1) the tumbling window, where the window size equals the window period, and (2) the sliding window, where the window size is larger than the window period. As the window size and period are the same in a tumbling window, there is no overlap between the data elements considered by two successive query evaluations; an element processed in one window is never reused by any subsequent window. In contrast, successive evaluations of a sliding-window query overlap, so each element remains valid across consecutive windows. We do not consider windowing models where elements never expire (Wang et al., 2013).
Incremental evaluation. Incremental evaluation (Ghanem et al., 2007; Chandramouli et al., 2014) supports a unified design logic for stream processing while allowing window-based queries to be integrated with various performance optimizations. The basic idea is to keep a state for the query to evaluate, updating the state as new elements are inserted or old elements are deleted. The state is typically smaller than the data covered by a window, thus using resources efficiently. Further, when computing the query result, using the state directly is typically faster than a naive, stateless approach that accumulates all elements at the moment the query is evaluated. To implement an incremental operator, developers define the following functions (Chandramouli et al., 2014):
InitialState:() => S: Returns an initial state S.

Accumulate:(S, E) => S: Updates an old state S with newly arrived event E.

Deaccumulate:(S, E) => S: Updates an old state S upon the expiration of event E.

ComputeResult:S => R: Computes the result R from the current state S.
For example, the following illustrates how to write an average operator using the incremental evaluation functions:
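A sketch of such an average operator (in Python rather than Trill's C#; representing the state S as a (sum, count) pair is our choice for illustration):

```python
# Incremental average operator: state S is a (sum, count) pair.

def initial_state():
    return (0.0, 0)

def accumulate(state, event):
    s, c = state
    return (s + event, c + 1)

def deaccumulate(state, event):
    s, c = state
    return (s - event, c - 1)

def compute_result(state):
    s, c = state
    return s / c if c else None

# sliding window of size 3, period 1, over the stream 1, 2, 3, 4
state = initial_state()
for e in (1, 2, 3):
    state = accumulate(state, e)
print(compute_result(state))      # → 2.0
state = deaccumulate(state, 1)    # element 1 expires from the window
state = accumulate(state, 4)      # element 4 arrives
print(compute_result(state))      # → 3.0
```

The state never stores the window's elements themselves, only their running sum and count.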
Incremental evaluation on sliding windows tends to execute more slowly than on tumbling windows. The primary reason is that a tumbling-window query is implemented with a smaller set of functions, without Deaccumulate. In this case, the query accumulates all data of a period on an initialized state, computes a result, and simply discards the state. In contrast, a sliding-window query must invoke Deaccumulate for every element deleted from the current window. For a sliding window, performance largely depends on how complicated the Deaccumulate logic is. Simple operators such as average can deaccumulate an element from the state efficiently, whereas doing so is challenging for complicated operators such as quantiles.
3. Algorithm for Nonhigh Quantiles
First, we present an algorithm that effectively handles non-high quantiles with high underlying distribution density, together with its cost and error analysis. We also illustrate scenarios where the algorithm alone is insufficient, motivating the techniques introduced in Section 4.
3.1. Algorithm Overview
The key idea is to partition a window into subwindows and leverage the results of subwindow computations to give an approximate answer to the quantile for the entire window. Subwindows are created following the windowing semantics in use, by which the size of each subwindow is aligned with the window period. These subwindows follow the timestamp order of data elements, i.e., subwindows containing older data elements are generated earlier than those containing newer ones. For each subwindow, we maintain a small summary instead of keeping all data elements. At the window level, an aggregation function merges subwindow summaries to approximate the exact quantile answer.
More formally, assume a sliding window divided into m subwindows, where each subwindow includes b data points. If the i-th subwindow holds a sequence of data X_i, we observe the whole data (X_1, …, X_m) in the sliding window. Here, the sample ϕ-quantile of the sliding window is denoted by Q(ϕ), which is the exact result to approximate. QLOVE estimates Q(ϕ) through two-level hierarchical processing, as presented in Figure 2. Level 1 computes the exact quantile of each subwindow. The exact ϕ-quantile of the i-th subwindow is denoted by q_i(ϕ), which becomes the summary of that subwindow. Level 2 aggregates the summaries of all subwindows to estimate the exact quantile Q(ϕ). QLOVE uses the mean as the aggregation function, guided by the Central Limit Theorem (Stuart and Ord, 1994; Tierney, 1983), and obtains the aggregated quantile of the sliding window, denoted by Q̂(ϕ) = (1/m) Σ_{i=1..m} q_i(ϕ). When the window slides by one time step, as the figure shows, QLOVE discards the oldest summary q_1(ϕ) and adds q_{m+1}(ϕ) for the new subwindow, thereby forming a new bag of summaries {q_2(ϕ), …, q_{m+1}(ϕ)}.
In principle, QLOVE is a hybrid approach that introduces tumbling windows into the original sliding-window quantile computation, delivering improved performance. Specifically, while creating a summary, Level 1 runs as a tumbling window, which avoids deaccumulation. Once a subwindow completes, all its values are discarded after being used to compute the summary. Level 2 operates a sliding window over summaries, which requires deaccumulation. However, since a summary contains only a few entries associated with the specified quantiles, and its size does not depend on other factors such as the subwindow size, both accumulation and deaccumulation are fast.
(Level 1) Creating a new subwindow summary. During Level 1 processing, we exploit opportunities for volume reduction through data redundancy. During subwindow processing, in-flight data are maintained in a compressed state of {value, count} pairs, where the count is the frequency of the corresponding value. A critical property here is that each distinct value is unknown until we observe it in the incoming data. To efficiently insert a new element into the state while continuously keeping it ordered, we use a red-black tree (Guibas and Sedgewick, 1978) in which the value acts as the key for sorting nodes. This also avoids a sorting cost when computing the exact quantiles at the end of Level 1 processing for the in-flight subwindow.
The logic for managing the red-black tree is sketched in Algorithm 1. InitialState and Accumulate are self-explanatory, so we explain ComputeResult in detail. By the time ComputeResult runs, the tree has already sorted the subwindow's elements. The computation therefore performs an in-order traversal of the tree, using the frequency attribute to track the position of each unique element. As the total number of elements is maintained in the state, it is straightforward to determine the fraction of elements below the current element during the traversal. A query may ask for multiple quantiles at a time; in this case, ComputeResult evaluates all quantiles in a single pass, with the smallest one reached first during the same in-order traversal.
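A sketch of this logic (in Python; a frequency map sorted at result time stands in for the red-black tree, which in the real implementation keeps keys ordered on insertion and so avoids the sort below):

```python
import math

def initial_state():
    return {"freq": {}, "total": 0}

def accumulate(state, value):
    state["freq"][value] = state["freq"].get(value, 0) + 1
    state["total"] += 1
    return state

def compute_result(state, phis):
    """Answer several quantiles in one in-order pass over the sorted values."""
    targets = sorted((math.ceil(p * state["total"]), p) for p in phis)
    results, seen, it = {}, 0, iter(targets)
    rank, phi = next(it)
    for value in sorted(state["freq"]):    # in-order traversal
        seen += state["freq"][value]       # number of elements at or below value
        while seen >= rank:                # this value covers the target rank
            results[phi] = value
            try:
                rank, phi = next(it)
            except StopIteration:
                return results
    return results

state = initial_state()
for v in [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]:
    state = accumulate(state, v)
print(compute_result(state, [0.5, 0.9]))   # → {0.5: 3, 0.9: 5}
```

Duplicates collapse into a single node with a count, so both space and traversal cost scale with the number of unique values rather than the subwindow size.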
Lastly, to increase the number of duplicates, some insignificant low-order digits of streamed values may be zeroed out. We typically keep only the three most significant digits of the original value, which keeps the quantized value within less than 1% relative error.
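A sketch of this quantization step (Python; the floor-based rounding here is one possible way to zero out the low-order digits):

```python
import math

def quantize(value, digits=3):
    """Zero out low-order digits, keeping `digits` significant digits
    (floor-based, so the relative error stays below 10**(1 - digits))."""
    if value == 0:
        return 0
    scale = 10 ** (math.floor(math.log10(abs(value))) - digits + 1)
    return int(value // scale) * scale

print(quantize(74_265))  # → 74200
print(quantize(1_247))   # → 1240
print(quantize(998))     # → 998
```

All latencies between 74,200 and 74,299 us collapse to one tree node, boosting redundancy at a bounded accuracy cost.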
(Level 2) Aggregating subwindow summaries. The logic for aggregating all subwindow summaries is almost identical to the incremental evaluation of the average introduced in Section 2. The only distinction is that, to answer the specified quantiles, there is one instance of the average's state (i.e., sum and count) per quantile. The accumulation and deaccumulation handlers update these states to compute the average of each quantile separately.
3.2. Algorithm Analysis
Space complexity. The space complexity of our approximate algorithm is O(k · n/b + u). For summaries of k independent quantiles, we need O(k · n/b) space, where n and b are the window size and the subwindow size, respectively. There is at most one subwindow under construction, for which we maintain a sorted tree of in-flight data. Its space usage is O(u), where u, the number of unique elements, can range from 1 to b depending on the degree of duplicates in the workload. In one extreme case, all elements have the same value, so u = 1; in the other, there are no duplicates at all, so u = b. This spectrum allows the data arrival handler to reduce space usage significantly when there is high data redundancy in the workload.
Time complexity. The degree of duplicates also reduces the theoretical time cost of the Level 1 stage. For a subwindow of size b with u unique elements, the Accumulate cost is O(log u) per element, which falls along a continuum between O(1) and O(log b), again depending on the degree of duplicates. Likewise, the complexity of ComputeResult is O(u), irrespective of the number of quantiles to search. The Level 2 stage in QLOVE runs extremely fast with a static cost: each of the k specified quantiles needs two add operations for Accumulate or Deaccumulate, and one division for ComputeResult. Section 5.4 shows an experiment on how a higher degree of duplicates leads to higher throughput.
Error bound. Summary-driven aggregation is designed based on the self-similarity of the data distribution for non-high quantiles with high underlying distribution density, as described in Section 1. We present the theory behind our error bound analysis; our theorem assumes several conditions. (1) We consider the ϕ-quantile q_i(ϕ) of each subwindow as a random variable before we actually observe the data. Similarly, the ϕ-quantile Q(ϕ) of the sliding window is also a random variable as long as the data is not given. (2) The target quantiles across subwindows are independent and identically distributed (i.i.d.). (3) Data in the window follow a continuous distribution.

Now, we derive a statistical guarantee that the aggregated estimate is close to the exact quantile using the Central Limit Theorem (see Theorem 1 for details). To illustrate how to interpret Theorem 1, suppose we obtain the aggregated estimate Q̂(ϕ) and the error bound ε over the data stream of a window. Since the exact quantile is Q(ϕ), we use Q̂(ϕ) to approximate Q(ϕ), and evaluate how close they are using |Q̂(ϕ) − Q(ϕ)|. Essentially, we claim that |Q̂(ϕ) − Q(ϕ)| ≤ ε with high confidence (e.g., probability 95%).
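Under assumptions (1)-(3), the bound takes the usual CLT form (a sketch consistent with the notation above; the paper's Theorem 1 may state the constants differently):

```latex
% m subwindow quantiles q_1(\phi), \dots, q_m(\phi), treated as i.i.d. draws
\hat{Q}(\phi) = \frac{1}{m}\sum_{i=1}^{m} q_i(\phi), \qquad
s^2 = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(q_i(\phi)-\hat{Q}(\phi)\bigr)^2,
\qquad
\Pr\Bigl[\bigl|\hat{Q}(\phi)-Q(\phi)\bigr|
      \le \underbrace{z_{0.975}\,\tfrac{s}{\sqrt{m}}}_{\varepsilon}\Bigr] \approx 0.95
```

Strictly, the CLT controls the deviation of Q̂(ϕ) from E[q_i(ϕ)]; it is the self-similarity assumption that lets E[q_i(ϕ)] stand in for Q(ϕ).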
Our probabilistic error bound depends on the density of the underlying data distribution at the specified quantile. Recall Figure 1, where the density decays only in the tail, making the density at the 0.5-quantile (median) much larger than that at the 0.999-quantile. In this case, the error bound is expected to be much tighter for non-high quantiles than for high quantiles. Narrower error bounds imply lower estimation errors; otherwise, the error bound is not informative. Across a number of tests we performed, including a query for Table 1, the observed value error is much lower than the error bound ε.
Finally, regarding condition (2) assumed in our theorem, dependence between data does not undermine our method. In Section 5.4, we show that our aggregated estimator can be effectively applied to non-i.i.d. data across a diverse spectrum of data dependence, with accuracy competitive with the i.i.d. case.
3.3. Achieving High Accuracy
There are two complementary cases where achieving high accuracy needs special consideration.
Statistical inefficiency. The inaccuracy of high quantiles becomes more significant when there is a lack of data to accurately estimate the quantiles in a subwindow. For example, in Table 2, the estimated error increases noticeably at the 0.999-quantile if sliding windows use 1K elements per period (i.e., a 1K period). In this case, only the two largest elements are used when computing the 0.999-quantile in each subwindow. This makes the statistical estimation not robust under the data distribution and thus misleads the approximation.
There are several remedies for this curse of high quantiles. Users can change the query window parameters to operate larger subwindows with more data points. This provides a better chance of observing data points in the tail, allowing more precise estimates of high quantiles. Another approach is to cache a small proportion of raw data without changing the windowing semantics, and use it to directly compute accurate high quantiles of the sliding window. We generalize this idea in Section 4.
Bursty traffic. When bursty traffic happens, extremely large values are highly skewed toward one or a few subwindows. In effect, they dominate the values that should be observed across the whole window when computing high quantiles. Note that statistical inefficiency for high quantiles may occur even when the data distribution is self-similar across the subwindows, while bursty traffic means that the data distribution at the tail is highly variable across the subwindows.
4. Few-k Merging
Few-k merging in QLOVE leverages raw data points to handle the large value errors that can appear in high quantiles as a result of statistical inefficiency or bursty traffic. During few-k merging, each subwindow collects k data points from among the largest values in its portion of the streaming data and uses these values to compute the target high quantile for the window. We begin by discussing the issues that make collecting the right values challenging.
4.1. Challenges
Through analyzing our monitoring workloads, we find that the values must be collected differently based on the observed streaming traffic. To illustrate, Figure 3 exemplifies four patterns in which the largest 10 values (colored dark) of the window are distributed differently among the subwindows. Assume the target high quantile can be obtained precisely by exploiting these 10 values. One extreme pattern represents extremely bursty traffic, where a single subwindow includes all of the largest values, whereas the opposite extreme represents values that are completely evenly distributed across subwindows. To produce the exact answer in the bursty case, few-k merging must collect all 10 values from one subwindow; in the even case, a small per-subwindow k suffices. Using that small k sacrifices accuracy for the other patterns, and the more concentrated the largest values are, the worse the accuracy becomes. Driven by this observation, our first challenge is to handle both statistical inefficiency and bursty traffic under diverse distributions of the largest values over subwindows.
Our workloads show that large values are spread over concurrent subwindows most of the time, similar to the more even patterns in Figure 3. Bursty traffic occurs from time to time as a result of infrequent yet unpredictable abnormal behavior of the monitored system. Therefore, we cannot assume that the distribution of the in-flight largest values is known ahead of time or repeats over time; it can only be estimated once we observe the data. Thus, our second challenge is building a robust mechanism to dynamically recognize the current traffic pattern. Section 4.2 addresses these challenges.
4.2. QLOVE’s Approach
One possible approach is keeping enough data points to compute the exact quantile regardless of statistical inefficiency and bursty traffic. Formally, to guarantee the exact answer of the ϕ-quantile (ϕ close to 1) on a window of size n, each subwindow must return its ⌈(1 − ϕ)n⌉ + 1 largest elements. Then, the entire window needs space for (⌈(1 − ϕ)n⌉ + 1) · n/b elements in total, where b is the subwindow size. This approach can be costly if the window size of a given sliding-window query is significantly larger than the window period. We thus consider few-k merging when space is limited.
Let S be the space budget, where S is smaller than the space required for the exact quantile. Given such an S, we assign each subwindow the same budget s = S/m, where m = n/b is the number of subwindows. Within each subwindow, s is further partitioned into two parts: s_top to address statistical inefficiency through top-k merging, and s_sample to address bursty traffic through sample-k merging, such that s = s_top + s_sample. We now explain how to use the subwindow budgets s_top and s_sample to handle statistical inefficiency and bursty traffic.
Top-k merging for statistical inefficiency. To handle statistical inefficiency, each subwindow caches its s_top largest values (s_top being its top-k budget), based on the observation that the global largest values tend to be widely spread, making the higher-ranked values in each subwindow more important. All of these cached values are merged across the entire window to answer the ϕ-quantile: for window size n, we draw the (n − ⌈ϕn⌉ + 1)-th largest value from the merged data to approximate the ϕ-quantile. We show how we can opportunistically trade off small space consumption for high accuracy (i.e., low value error) for some classes of applications in Section 5.3.
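A sketch of top-k merging (Python; `s_top` and the helper name are ours):

```python
import heapq
import math

def topk_quantile(subwindows, s_top, phi):
    """Merge each subwindow's s_top largest values and read a high
    phi-quantile off the merged set (s_top is the per-subwindow budget)."""
    merged = []
    for sw in subwindows:
        merged.extend(heapq.nlargest(s_top, sw))
    n = sum(len(sw) for sw in subwindows)
    rank_from_top = n - math.ceil(phi * n) + 1   # rank 1 = the maximum
    return sorted(merged, reverse=True)[rank_from_top - 1]

# 100 values interleaved across 4 subwindows: the global top values are
# spread out, so a budget of 3 per subwindow recovers the exact 0.95-quantile.
subwindows = [list(range(i, 101, 4)) for i in range(1, 5)]
print(topk_quantile(subwindows, 3, 0.95))  # → 95
```

The example also shows the technique's blind spot: if all the largest values landed in a single subwindow, a budget of 3 would miss most of them.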
Sample-k merging for bursty traffic. We determine that traffic is bursty when the largest values in some subwindows deviate markedly from those in others. Such subwindows appear dynamically over time, and thus we do not differentiate among the largest values within each subwindow when coping with bursty traffic.
In QLOVE, each subwindow takes s_sample samples (its sample-k budget) from its largest values so as to capture the distribution of the largest values using a smaller fraction. It uses interval sampling, which picks every i-th element of the ranked values (Luo et al., 2016); e.g., for i = 2, we select all even-ranked values. The sampling interval i is inversely proportional to the allocated fraction s_sample. After merging all samples, the resulting quantile is obtained by referring to the ⌈(n − ⌈ϕn⌉ + 1)/i⌉-th largest sampled value, to factor in the data reduction by sampling.
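A sketch of sample-k merging under these definitions (Python; parameter names are ours):

```python
import math

def samplek_quantile(subwindows, s_sample, interval, phi):
    """Each subwindow keeps every `interval`-th of its ranked largest values
    (at most s_sample of them); the rank used to answer the quantile is
    scaled down by the sampling interval."""
    merged = []
    for sw in subwindows:
        ranked = sorted(sw, reverse=True)
        merged.extend(ranked[interval - 1::interval][:s_sample])
    n = sum(len(sw) for sw in subwindows)
    rank_from_top = n - math.ceil(phi * n) + 1
    sampled_rank = math.ceil(rank_from_top / interval)
    return sorted(merged, reverse=True)[sampled_rank - 1]

# a bursty window: one subwindow holds all of the large values
subwindows = [list(range(1, 101)), list(range(101, 201)),
              list(range(201, 301)), list(range(10000, 10100))]
print(samplek_quantile(subwindows, 10, 2, 0.99))  # → 10094 (exact answer: 10095)
```

Because ranks within a subwindow are not weighted, the bursty subwindow's tail is captured even though it holds far more of the top values than the others.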
Deciding s_top. When deciding the size of s_top, QLOVE estimates the number of data points a window needs to compute an answer under non-bursty traffic, such as the evenly distributed pattern in Figure 3. We divide this estimate by the number of subwindows to obtain per-subwindow data points, which we use as s_top. QLOVE assigns all of the remaining budget to s_sample. Note that a more conservative approach may assume a less even pattern in Figure 3, in which case s_top doubles. s_sample is typically larger than s_top because s_top covers only a very small portion of the largest data. Sample-k merging benefits considerably from this, since the value error when estimating high quantiles through sampling is sensitive to the sampling rate, owing to the low density of the underlying data.
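The budget split can be sketched as follows (Python; the symbol names and the exact split rule are our reading of the description above, not the paper's formulas):

```python
import math

def split_budget(S, n, b, phi):
    """Split a total budget S over m = n/b subwindows: s_top covers the
    evenly-spread case for the phi-quantile, the remainder goes to sample-k."""
    m = n // b
    s = S // m                                  # per-subwindow budget
    k_window = n - math.ceil(phi * n) + 1       # points needed when evenly spread
    s_top = max(1, math.ceil(k_window / m))     # per-subwindow share of that
    s_sample = s - s_top
    return s, s_top, s_sample

print(split_budget(S=400, n=100_000, b=10_000, phi=0.999))  # → (40, 11, 29)
```

As the text notes, most of the budget ends up in s_sample, which is what sampling-based tail estimation needs.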
4.3. Traffic Handling at Runtime
The two proposed value-merging techniques can be active at the same time during streaming query processing, triggered by different conditions. The use of top-k merging is decided by the query semantics (e.g., the windowing model and target quantiles). If any quantile suffers from statistical inefficiency, top-k merging for that quantile is switched on. In contrast, sample-k merging is a standing pipeline, exploited whenever bursty traffic is ongoing.
Currently, several decisions made for traffic handling are guided by empirical study or parameters measured offline. Future work includes making these processes entirely online.
When to enable top-k. For each quantile ϕ, QLOVE initiates the top-k merging process if (1 − ϕ)b < T, where T is the threshold for deciding statistical inefficiency when using a subwindow of size b. Otherwise, QLOVE directly uses the results estimated by the approximation algorithm presented in Section 3. We set T to 10 based on evaluating the monitoring workloads we run in our system.
Selecting outcomes. QLOVE must decide at runtime which of the two pipelines' outcomes to take. Results from sample-k merging are prioritized if bursty traffic is detected. Otherwise, QLOVE uses the results from top-k merging for those high quantiles that face statistical inefficiency. To detect bursty traffic, we check whether the sampled largest values in the current subwindow are distributionally different from, and stochastically larger than, those in the preceding subwindow, using an existing methodology (Mann and Whitney, 1947).
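The detection step can be sketched with a hand-rolled one-sided Mann-Whitney U test (Python; normal approximation without tie correction, a simplification of the cited methodology):

```python
from math import sqrt
from statistics import NormalDist

def stochastically_larger(current, previous, alpha=0.05):
    """Is `current` stochastically larger than `previous`?
    One-sided Mann-Whitney U test via the normal approximation."""
    n1, n2 = len(current), len(previous)
    # U = number of (c, p) pairs with c > p, counting ties as one half
    u = sum((c > p) + 0.5 * (c == p) for c in current for p in previous)
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return 1 - NormalDist().cdf(z) < alpha      # reject: current is larger

prev = [10, 12, 11, 13, 12, 11, 10, 12]
burst = [50, 55, 60, 52, 58, 61, 57, 54]
print(stochastically_larger(burst, prev))   # → True
print(stochastically_larger(prev, prev))    # → False
```

In practice the test would run on the sampled largest values of adjacent subwindows; the normal approximation is reasonable for the sample sizes involved.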
5. Experimental Evaluation
We implement QLOVE along with competing policies in the Trill open-source streaming analytics engine (Chandramouli et al., 2014) and conduct our evaluation using real-world workloads. Trill extends temporal-relational models with language-integrated queries (LINQ), supporting diverse streaming analytics scenarios. It also provides a framework for integrating custom incremental evaluation operators, which we use to compare all the policies experimentally.
5.1. Experimental Setup
Machine setup and workloads. The machine for all experiments has two 2.40 GHz 12-core Intel 64-bit Xeon processors with hyperthreading and 128 GB of main memory, and runs Windows. Our evaluation is based on two real-world datasets. (1) The NetMon dataset includes network latency in microseconds (us), with each entry measuring the round-trip time (RTT) between two servers in a large-scale datacenter. (2) The Search dataset includes the query response times of index serving nodes (ISNs) at a search engine. The response time is in microseconds (us) and is measured from the moment the ISN receives a query to the moment it responds to the user. Each dataset contains 10 million entries.
Query. We run the following query, written in LINQ, on the datasets to estimate the 0.5-, 0.9-, 0.99-, and 0.999-quantiles (henceforth denoted by Q0.5, Q0.9, Q0.99, and Q0.999):
Qmonitor = Stream
    .Window(windowSize, period)
    .Where(e => e.errorCode != 0)
    .Aggregate(c => c.Quantile(0.5, 0.9, 0.99, 0.999))
Policies in comparison. We compare QLOVE to five strategies that support the sliding-window model. (1) Exact is the baseline policy that computes exact quantiles. It extends Algorithm 1 with de-accumulation logic: the node representing an expired element's value decrements its frequency by one, and is deleted from the red-black tree if the frequency becomes zero. This outperformed other methods for exact quantiles. (2) CMQS, Continuously Maintaining Quantile Summaries (Lin et al., 2004), deterministically bounds the rank error of quantile approximation: for a given rank, its approximation returns a value within a bounded rank interval around that rank. (3) AM is another deterministic algorithm with a rank error bound, designed by Arasu and Manku (Arasu and Manku, 2004). (4) Random is a state-of-the-art algorithm that uses sampling to bound the rank error with constant probability (Luo et al., 2016). (5) Moment Sketch is an algorithm that uses mergeable moment-based quantile sketches to predict the original data distribution from a summary of moment statistics.
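For concreteness, here is a minimal Python sketch of the Exact baseline's accumulate/de-accumulate logic; a sorted list with a frequency map stands in for the red-black tree, and all names are ours, not the paper's:

```python
import bisect
import math

class ExactSlidingQuantile:
    """Frequency-annotated ordered state: accumulate arrivals,
    de-accumulate expired elements one at a time."""
    def __init__(self):
        self.keys = []   # sorted distinct values (stand-in for the tree)
        self.freq = {}   # value -> frequency within the window

    def accumulate(self, v):
        if v not in self.freq:
            bisect.insort(self.keys, v)
            self.freq[v] = 0
        self.freq[v] += 1

    def deaccumulate(self, v):
        # Expired element: decrement its node; delete it at frequency zero.
        self.freq[v] -= 1
        if self.freq[v] == 0:
            del self.freq[v]
            self.keys.pop(bisect.bisect_left(self.keys, v))

    def quantile(self, phi):
        n = sum(self.freq.values())
        target = max(1, math.ceil(phi * n))  # rank of the phi-quantile
        seen = 0
        for k in self.keys:
            seen += self.freq[k]
            if seen >= target:
                return k
```

The per-element work in `deaccumulate` is exactly the cost that grows with the window size and that QLOVE's subwindow summaries avoid.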
Metrics. We use average relative error as the accuracy metric, the number of variables as the memory usage metric, and million elements per second (M ev/s) processed by a single thread as the throughput metric. Average relative error (in %) is measured by $|\hat{v} - v| / v \times 100$, where $\hat{v}$ is the value estimated by the approximation and $v$ is the exact value. As stream processing continuously organizes a window and evaluates the query on it, the error is averaged over query evaluations.
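The accuracy metric can be computed as follows (a trivial helper we add for clarity):

```python
def avg_relative_error(estimates, exacts):
    """Average relative error in %, averaged over query evaluations:
    mean of |estimate - exact| / exact * 100."""
    errors = [abs(e - x) / x * 100 for e, x in zip(estimates, exacts)]
    return sum(errors) / len(errors)
```

For example, estimates [110, 95] against exact values [100, 100] give errors of 10% and 5%, i.e., 7.5% on average.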
5.2. Comparison to Competing Algorithms
This section compares QLOVE with the competing algorithms. We disable few-k merging in QLOVE until Section 5.3 to show how our algorithm from Section 3 works on its own.
Approximation error. The accuracy column in Table 1 shows the average value error and rank error for a set of quantiles when using a 16K window period and a 128K window size on the NetMon dataset. For CMQS, AM, and Random, we configure the error-bound parameter $\epsilon$ as 0.02, guided by its sensitivity to value error. For Moment, we set its parameter as 12 to obtain similar error bounds. For a given quantile, in addition to the average value error, we present its average rank error, measured by $\frac{1}{T}\sum_{i=1}^{T} |r_i - r^*| / W$, where $r^*$ is the exact rank of the quantile, $r_i$ is the rank of the value returned for the $i$-th query evaluation, $T$ is the total number of query evaluations, and $W$ is the window size. Setting $\epsilon = 0.02$ guarantees that none of the individual rank errors is larger than 0.02.
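As a helper for clarity, the average rank error over a sequence of query evaluations can be computed as (names are ours):

```python
def avg_rank_error(exact_rank, returned_ranks, window_size):
    """Mean over T evaluations of |r_i - r*| / W, where r* is the exact
    rank of the quantile, r_i the rank of the value returned in
    evaluation i, and W the window size."""
    T = len(returned_ranks)
    return sum(abs(r - exact_rank) for r in returned_ranks) / (T * window_size)
```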
The results in Table 1 show that CMQS, AM, and Random all successfully bound the rank error. The average rank errors stay low, with the largest error observed in any individual query evaluation across all the policies below 0.0105, which confirms the effectiveness of these methods. It is also noteworthy that the rank error of values returned by QLOVE is comparable, with even slightly lower errors across the quantiles.
Comparing across the policies, QLOVE outperforms the others in value error, especially for the very high quantile (Q0.999). Moreover, the results indicate that a given rank error has very different influence on the value error across different quantiles. For example, comparing Q0.5 and Q0.999, the rank error in Q0.999 is lower while its corresponding value error is adversely higher. This is primarily because the NetMon workload exhibits high variability: the value at the high quantile is orders of magnitude larger than at the medium quantile, as indicated in Figure 1. As a result, a small change in the Q0.999 rank error leads to more than a 3x difference in value error.
Space usage. The space usage column in Table 1 presents the number of variables each algorithm stores in memory. The space usage is calculated from the theoretical bounds (Analytical) found in (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016), as well as measured at runtime (Observed) while running the algorithm. QLOVE benefits from the high data redundancy present in NetMon, substantially reducing its runtime memory usage below its analytical cost. Recall that our theoretical cost depends on the window size and the period size (see Section 3.2). The actual cost approaches its lower bound as we see more data duplicates in the workload. This is how QLOVE reduces memory consumption in practice.
Additionally, to reduce space usage, we test a larger error-bound parameter for CMQS, AM, and Random, and a correspondingly relaxed parameter for Moment. Space usage goes down considerably, to around 6,000 variables, but the value errors become extremely significant.
Throughput. For throughput, we compare QLOVE with CMQS, which we observe to be the highest-performing among the rank-bound algorithms. In CMQS, each subwindow creates a data structure, namely a sketch, and all active sketches are combined to compute approximate quantiles over a sliding window. The capacity of each subwindow is chosen to ensure the rank error bound $\epsilon$ of the approximation (Lin et al., 2004). In other words, once the sizes relevant to a sliding window are given, $\epsilon$ is deterministically ensured. In this experiment, we consider a query with a 1K period and a 100K window. Here $\epsilon$ is calculated as 0.02, and to cover a wider error spectrum, we let $\epsilon$ range from 0.02 (1x) to 0.2 (10x).
Figure 4 presents the throughput of QLOVE compared with CMQS for varying $\epsilon$ values, and also with Exact. Overall, QLOVE achieves higher throughput than both Exact and CMQS across all $\epsilon$ values.
CMQS has a clear tradeoff between accuracy and throughput. If $\epsilon$ is set too small (e.g., 1x), the strategy is too aggressive and performance drops sharply (below Exact). If $\epsilon$ is set too large (e.g., 10x), the throughput largely recovers; however, the strategy becomes too conservative and bounds the error too loosely. In theory, this allows a rank distance for the approximate quantile of up to 20% of the window size, which is unacceptable.
Scalability.
For scalability tests, we create two large synthetic datasets, each containing 1 billion entries: the Normal dataset, generated from a normal distribution with a mean of 1 million and a standard deviation of 50 thousand, and the Uniform dataset, generated from a uniform distribution ranging from 90 to 110. Using these datasets, we increase the window size from 1K up to 100M elements to further stress the query, which uses a 1K period.
Figure 5 shows that QLOVE scales with the window size, delivering consistent throughput for all window sizes on both datasets. In comparison, Exact's throughput degrades once it begins to use a sliding window. For example, when the size is increased to 10K, Exact's throughput degrades by 79%. This is a consequence of paying the de-accumulation cost to search for and eliminate the oldest 1K elements from the tree state in every windowing period. QLOVE achieves high scalability by avoiding such de-accumulation cost and using a small state as the summary of each subwindow.
So far we have shown how QLOVE achieves low value error, low space consumption, and scalable performance. Next, we present the benefit of few-k merging.
5.3. Few-k Merging
Addressing statistical inefficiency. As explained in Section 3.3, a larger period enables us to use more data points to estimate high quantiles and deliver more accurate results. Small periods, however, are where top-k merging, which caches the largest values, can be effective. To quantify this, we fix the window size at 128K elements and vary the period size over a wide range from 64K down to 1K; the trends were similar for larger window sizes.
Table 2 summarizes the average relative errors prior to applying few-k merging for different period sizes. We observe that varying the period size has little effect on Q0.5 and Q0.9, with relative errors below 1%, whereas it matters for Q0.999, with the error going up to 18.93%. The accuracy target is domain-specific. For example, in the real-world NetMon query, a 5% relative error is considered adequate. Therefore, if we set this as the optimization target, we need to exploit the top-k values for periods smaller than 16K. Interestingly, this circumstance does not arise in the Search dataset, where all relative value errors fall below 1%. (The Search ISN limits query execution to a predefined response-time SLA, e.g., 200 ms. The queries terminated by the SLA are concentrated at Q0.9 and above, incurring high density in the tail of the data distribution.)
Focusing on Q0.999 in NetMon, we measure the accuracy while varying the fraction of the caching size that guarantees the exact answer in few-k merging, and show the results in Table 3 along with the observed space usage. For the 128K window size, each subwindow needs to maintain its 132 largest entries for the exact answer. As the table shows, exploiting a smaller fraction of each subwindow's data can reduce the Q0.999 value error significantly. In particular, using half the space (i.e., a fraction of 0.5) yields accuracy nearly as good as the optimal solution that needs the entire top-132 values. Using the top-13 values (i.e., a fraction of 0.1) brings the error around or below our desired value-error target (5%). This excellent tradeoff also indicates that the largest entries in NetMon are fairly well distributed across subwindows.
In NetMon, missing the largest values in each subwindow hurts accuracy. For example, when sampling is applied instead with a fraction of one half (i.e., 0.5), the value errors for the 8K, 4K, 2K, and 1K periods grow to 2.23%, 4.60%, 8.33%, and 13.36%, respectively.
Addressing bursty traffic. Next, we discuss the effect of sample-k merging on bursty traffic. The nature of bursty traffic is that the largest values from one or a few subwindows decide the target high quantile in the window. Sampling here aims at capturing those values using a smaller amount of space. To evaluate this, we inject burst traffic into NetMon such that it affects Q0.999 and above and appears just once in every evaluation of the sliding window. That is, for window size $W$, quantile $\phi$, and period $p$, we increase by 10x the values of the top $(1-\phi)W$ elements in every $(W/p)$-th subwindow of size $p$.
Table 4 presents the average relative errors for two queries with 16K and 4K period sizes, both using the 128K window. The fraction is defined similarly: the amount of space assigned for holding sampled data relative to the amount needed to guarantee the exact answer. A zero fraction indicates the case where QLOVE handles bursty traffic without samples. Looking at the zero fraction in the table, the bursty traffic damages Q0.999 in both queries, and Q0.99 in the 4K-period query is also compromised. This is because burst traffic blows up more when using smaller periods: the bursty traffic exhibits top-132 values, which sweep in the 40th largest value the query refers to for Q0.99.
Using the sampled values, both queries improve their accuracy. In general, Q0.999 needs a higher sampling rate (e.g., a fraction of 0.5) since its neighboring values are heavy-tailed. Q0.99 works well even with conservative sampling (e.g., a fraction of 0.1) since its neighboring entries are gathered in a smaller value range.
Throughput. Throughput in few-k merging is tightly coupled with the number of entries to process per window. This is because the merged values must be kept in sorted form and used directly. With more merged data, the state grows bigger, consumes more processing cycles, and lowers throughput. In the extreme case of the 1K period, the most resource-demanding among our evaluated queries, the throughput improves with a smaller fraction due to the lower cost of managing the smaller state. For example, for NetMon, with all entries cached (i.e., a fraction of 1), we see a 21.2% throughput penalty compared to QLOVE without few-k merging. At a smaller fraction of 0.2, where the average relative error is only 0.6%, the throughput penalty shrinks to 9.0%.
5.4. Sensitivity Study
Data redundancy. We derive low-precision variants of the NetMon and Search datasets to test the impact that higher redundancy in data streams has on QLOVE's throughput. For the low-precision datasets, we discard the two low-order digits from the original data, resulting in a data precision of 100 us instead of 1 us. With the window period fixed at 1K elements, we vary the window size from 1K to 1M elements and measure the throughput improvement over processing the original datasets. We see a clear increase in throughput. For example, at the 1K window size, i.e., a tumbling window, the throughput increases by 2.7x and 1.8x in NetMon and Search, respectively. The relative increase is bigger for sliding windows, at 3.7x–4.6x. The gain comes from keeping the tree state smaller during incremental evaluation operations, which benefits both Exact and QLOVE by accelerating those operations.
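The low-precision derivation is a simple truncation; a sketch of how we interpret it, assuming integer microsecond values:

```python
def to_low_precision(latencies_us):
    """Discard the two low-order decimal digits: 1 us -> 100 us precision.
    Fewer distinct values mean higher redundancy and a smaller tree state."""
    return [v // 100 * 100 for v in latencies_us]
```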
Data skewness. We create the Pareto dataset containing integers from a skewed, heavy-tailed Pareto distribution (Newman, 2011), with a Q0.5 of 20, a Q0.999 of 10,000, and a maximum of 1.1 billion. Providing low value errors at high quantiles is more challenging for the Pareto workload because the value distances among data in the tail are much wider. We compare QLOVE with the competing algorithms using a 16K window period and a 128K window size, as in Table 1. QLOVE is significantly better at reducing value errors at high quantiles. For example, at Q0.999, QLOVE achieves a 4.00% relative value error, while AM and Random achieve 29.22% and 35.17%, respectively.
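A heavy-tailed Pareto sample can be drawn by inverse-transform sampling; the parameters below are illustrative, not the paper's exact generator:

```python
import random

def pareto_dataset(n, alpha=1.16, x_min=10.0, seed=7):
    """Draw n Pareto(alpha, x_min) integers by inverting the CDF
    F(x) = 1 - (x_min / x)**alpha at uniform random points."""
    rng = random.Random(seed)
    return [int(x_min / (1.0 - rng.random()) ** (1.0 / alpha)) for _ in range(n)]
```

Such tails make value distances between adjacent ranks near the maximum enormous, which is why rank-bounded methods suffer large value errors here.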
Non-i.i.d. data. We test whether our aggregated estimator can be applied to non-i.i.d. data with accuracy competitive to that on i.i.d. data. To model a diverse spectrum of data dependence, we generate non-i.i.d. datasets from an AR(1) model of order 1 (Box et al., 2015), $X_t - \mu = \rho\,(X_{t-1} - \mu) + \epsilon_t$, with coefficient $\rho$, where (1) $\rho$ represents the correlation between a data point and the next, and (2) a larger $\rho$ indicates a stronger correlation among neighboring data points. Data points in the dataset are identically and normally distributed, with a mean of 1 million and a standard deviation of 50 thousand. For comparison, we generate another, i.i.d. dataset from a normal distribution with the same mean and standard deviation, which is equivalent to the AR(1) model with $\rho = 0$. We evaluate the accuracy using the average relative errors between the estimated and exact values for different quantiles. Table 5 shows the results for selected quantiles using three datasets that range from low correlation to high correlation. We find that the errors increase slightly for non-i.i.d. data with low correlation, and mildly for non-i.i.d. data with high correlation, compared to the i.i.d. case ($\rho = 0$). Also, the empirical probabilities that the absolute errors fall within the corresponding error bounds are always 1 across the different $\rho$ values and quantiles. Therefore, we achieve (1) competitive accuracy of estimated quantiles on non-i.i.d. data and (2) high probabilities of absolute errors within error bounds. Hence, our approach is robust to the underlying dependence.
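A sketch of such an AR(1) generator (parameter names are ours), keeping the marginal distribution fixed at Normal(1 million, 50 thousand) for every correlation coefficient:

```python
import math
import random

def ar1_series(n, rho, mean=1_000_000, std=50_000, seed=7):
    """AR(1): X_t - mean = rho * (X_{t-1} - mean) + eps_t, with eps_t
    scaled so each X_t stays N(mean, std^2) marginally; rho=0 is i.i.d."""
    rng = random.Random(seed)
    eps_std = std * math.sqrt(1.0 - rho * rho)
    x = rng.gauss(mean, std)           # stationary start
    out = [x]
    for _ in range(n - 1):
        x = mean + rho * (x - mean) + rng.gauss(0.0, eps_std)
        out.append(x)
    return out
```

Shrinking the innovation variance by the factor 1 - rho^2 is what keeps the marginals identical across rho, so accuracy differences are attributable to dependence alone.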
6. Related Work
Stream processing engines have been developed for both single-machine (Chandramouli et al., 2014; Miao et al., 2017) and distributed (Qian et al., 2013; Zaharia et al., 2013; Lin et al., 2016) settings to provide runtimes for parallelism, scalability, and fault tolerance. QLOVE can be applied to all these frameworks. In this section, we focus on quantile approximation algorithms in the literature in detail.
The work related to quantile approximation over data streams falls into two main categories. The first is theoretical and focuses on improving space and time complexities; for space, the bound is a function of a user-specified error-tolerance parameter (Blum et al., 1973; Lin et al., 2004; Arasu and Manku, 2004). The second category addresses quantile approximation under specific assumptions, e.g., approximating quantiles when the raw data is distributed (Agarwal et al., 2012), or leveraging GPUs for quantile computation (Govindaraju et al., 2005). The error bounds of the techniques in both categories are defined in terms of rank. In contrast, QLOVE approximates quantile computations with value-based error, which is more suitable than rank-based error in many practical applications, as illustrated in Section 1. Approximating quantiles by value is a unique distinction of QLOVE.
In (Blum et al., 1973), linear-time algorithms were presented to compute a given quantile exactly over a fixed set of elements, while (Lin et al., 2004; Arasu and Manku, 2004) presented algorithms for quantile approximation over sliding windows. The work in (Lin et al., 2004; Arasu and Manku, 2004) is the most closely related to QLOVE. (Arasu and Manku, 2004) improves the space complexity of (Lin et al., 2004), i.e., less memory is used to achieve the same accuracy. Similarly, the randomized algorithms in (Luo et al., 2016) use the memory size as a parameter to provide a desired error bound. However, the work in (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016) cannot scale to very large sliding windows when low latency is a requirement, because the cost of de-accumulating expired elements does not scale with the sliding-window size. In contrast, QLOVE scales to large sliding windows due to its ability to de-accumulate an entire expiring subwindow at a time at low cost. Hence, QLOVE is a better fit for real-time applications, where low latency is a key requirement.
Many research efforts (Greenwald and Khanna, 2004; Huang et al., 2011; Agarwal et al., 2012) assume that the input data used to compute the quantiles is distributed. Similarly, the work in (Cormode et al., 2005; Yi and Zhang, 2009) computes quantiles over distributed data and goes a step further by continuously monitoring updates to maintain the latest quantiles. In QLOVE, we target applications where a large stream of data may originate from different sources and is processed by a streaming engine. QLOVE performs a single pass over the data to scale to large volumes of input streams.
Deterministic algorithms for approximating biased quantiles were first presented in (Cormode et al., 2006). Biased quantiles are those with extreme values, e.g., the 0.99-quantile. In contrast, QLOVE is designed to compute both biased and unbiased quantiles. Moreover, (Cormode et al., 2006) is sensitive to the maximum value of the streaming elements, while QLOVE is not: the memory consumed by (Cormode et al., 2006) depends on a parameter that represents the maximum value a streaming element can have. Biased quantiles can have very large values in many applications, and QLOVE is able to estimate them without any cost tied to the actual values of the biased quantiles.
7. Conclusion
Quantile computation is a challenging operator in real-time streaming analytics systems, as it requires high throughput, low latency, and high accuracy. We presented QLOVE, which satisfies these requirements through workload-driven and value-based quantile approximation. We evaluated QLOVE using synthetic and real-world workloads on a state-of-the-art streaming engine, demonstrating high throughput over a wide range of window sizes while delivering small relative value errors. Although the evaluation is based on a single machine, our quantile design can deliver better aggregate throughput while using fewer machines in distributed computing.
References
 Agarwal et al. (2012) Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, and Ke Yi. 2012. Mergeable Summaries. In PODS.
 Akidau et al. (2015) Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1792–1803.
 Arasu and Manku (2004) Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate Counts and Quantiles over Sliding Windows. In PODS.
 Blum et al. (1973) Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. 1973. Time Bounds for Selection. J. Comput. Syst. Sci. 7, 4 (Aug. 1973), 448–461. https://doi.org/10.1016/S0022-0000(73)80033-9
 Box et al. (2015) George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
 Chandramouli et al. (2014) Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (Dec. 2014), 401–412.
 Cormode et al. (2005) Graham Cormode, Minos Garofalakis, S. Muthukrishnan, and Rajeev Rastogi. 2005. Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles. In SIGMOD.
 Cormode et al. (2006) Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. 2006. Space- and Time-efficient Deterministic Algorithms for Biased Quantiles over Data Streams. In PODS.
 Dean and Barroso (2013) Jeffrey Dean and Luiz André Barroso. 2013. The Tail at Scale. Commun. ACM 56, 2 (Feb. 2013), 74–80.
 Ghanem et al. (2007) Thanaa M. Ghanem, Moustafa A. Hammad, Mohamed F. Mokbel, Walid G. Aref, and Ahmed K. Elmagarmid. 2007. Incremental Evaluation of Sliding-Window Queries over Data Streams. IEEE Trans. on Knowl. and Data Eng. 19, 1 (Jan. 2007), 57–72.
 Govindaraju et al. (2005) Naga K. Govindaraju, Nikunj Raghuvanshi, and Dinesh Manocha. 2005. Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors. In SIGMOD.
 Greenwald and Khanna (2004) Michael B. Greenwald and Sanjeev Khanna. 2004. Power-conserving Computation of Order-statistics over Sensor Networks. In PODS.
 Guibas and Sedgewick (1978) Leo J. Guibas and Robert Sedgewick. 1978. A Dichromatic Framework for Balanced Trees. In SFCS.
 Guo et al. (2015) Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, ZhiWei Lin, and Varugis Kurien. 2015. Pingmesh: A LargeScale System for Data Center Network Latency Measurement and Analysis. In SIGCOMM.
 Huang et al. (2011) Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu. 2011. Sampling Based Algorithms for Quantile Computation in Sensor Networks. In SIGMOD.
 Jeon et al. (2016) Myeongjae Jeon, Yuxiong He, Hwanju Kim, Sameh Elnikety, Scott Rixner, and Alan L. Cox. 2016. TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services. In ASPLOS.
 Kulkarni et al. (2015) Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In SIGMOD.
 Leland et al. (1994) Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. 1994. On the Self-similar Nature of Ethernet Traffic (Extended Version). IEEE/ACM Trans. Netw. 2, 1 (1994), 1–15.
 Lin et al. (2016) Wei Lin, Haochuan Fan, Zhengping Qian, Junwei Xu, Sen Yang, Jingren Zhou, and Lidong Zhou. 2016. STREAMSCOPE: Continuous Reliable Distributed Processing of Big Data Streams. In NSDI.
 Lin et al. (2004) Xuemin Lin, Hongjun Lu, Jian Xu, and Jeffrey Xu Yu. 2004. Continuously maintaining quantile summaries of the most recent N elements over a data stream. In ICDE.
 Luo et al. (2016) Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. 2016. Quantiles over Data Streams: Experimental Comparisons, New Analyses, and Further Improvements. The VLDB Journal 25, 4 (Aug. 2016), 449–472.
 Mann and Whitney (1947) Henry B Mann and Donald R Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics (1947), 50–60.
 Miao et al. (2017) Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin. 2017. StreamBox: Modern Stream Processing on a Multicore Machine. In USENIX ATC.
 Newman (2011) M. E. J. Newman. 2011. Power laws, Pareto distributions and Zipf’s law. In arXiv.
 Qian et al. (2013) Zhengping Qian, Yong He, Chunzhi Su, Zhuojie Wu, Hongyu Zhu, Taizhi Zhang, Lidong Zhou, Yuan Yu, and Zheng Zhang. 2013. TimeStream: Reliable Stream Computation in the Cloud. In EuroSys.
 Stuart and Ord (1994) Alan Stuart and J Keith Ord. 1994. Kendall's Advanced Theory of Statistics, Volume 1: Distribution Theory. London, UK.
 Tan et al. (2019) Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. 2019. NetBouncer: Active Device and Link Failure Localization in Data Center Networks. In NSDI.
 Tierney (1983) Luke Tierney. 1983. A space-efficient recursive procedure for estimating a quantile of an unknown distribution. SIAM J. Sci. Statist. Comput. 4, 4 (1983), 706–711.
 Wang et al. (2013) Lu Wang, Ge Luo, Ke Yi, and Graham Cormode. 2013. Quantiles over Data Streams: An Experimental Study. In SIGMOD.
 Yi and Zhang (2009) Ke Yi and Qin Zhang. 2009. Optimal Tracking of Distributed Heavy Hitters and Quantiles. In PODS.
 Zaharia et al. (2013) Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In SOSP.
Appendix A Error Bound Analysis
Theorem 1 ().
Suppose the data in the sliding window are independent and identically distributed (i.i.d.), and let the window of $n$ elements consist of $m$ subwindows of $s$ elements each ($n = ms$). Let $q_\phi$ be the $\phi$-quantile of the data distribution, $\hat{q}$ the aggregated estimate, and $q_n$ the sample $\phi$-quantile of the sliding window. When $s \to \infty$, with probability at least $1 - 2\alpha$, the following holds asymptotically
$$|\hat{q} - q_n| \;\le\; \frac{2\, z_{\alpha/2} \sqrt{\phi(1-\phi)}}{f(q_\phi)\sqrt{n}},$$
where $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ is the upper $\alpha/2$-quantile of the standard normal distribution and $\Phi^{-1}$ its inverse,
$\Phi$ is the cumulative distribution function of the standard normal distribution, and
$f(q_\phi)$ is the probability density of the data distribution at its quantile $q_\phi$. In particular, we take $\alpha = 0.025$. When $s \to \infty$, with probability at least 0.95, the following holds asymptotically
$$|\hat{q} - q_n| \;\le\; \frac{2\, z_{0.0125} \sqrt{\phi(1-\phi)}}{f(q_\phi)\sqrt{n}}.$$
Proof.
By the Central Limit Theorem for the sample quantile of i.i.d. data (Stuart and Ord, 1994; Tierney, 1983), in each subwindow we have
(1) $\sqrt{s}\,\bigl(\hat{q}_j - q_\phi\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \frac{\phi(1-\phi)}{f(q_\phi)^2}\right)$
when $s \to \infty$. Since the data are i.i.d., the subwindow estimates $\hat{q}_j$ are i.i.d. as well. Then for the aggregated estimate $\hat{q} = \frac{1}{m}\sum_{j=1}^{m}\hat{q}_j$, we have
(2) $\sqrt{ms}\,\bigl(\hat{q} - q_\phi\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \frac{\phi(1-\phi)}{f(q_\phi)^2}\right)$
when $s \to \infty$. Therefore, with probability $1 - \alpha$, the following holds asymptotically
(3) $|\hat{q} - q_\phi| \;\le\; \frac{z_{\alpha/2}\sqrt{\phi(1-\phi)}}{f(q_\phi)\sqrt{n}}$
when $s \to \infty$. On the other hand, for the sample quantile $q_n$ of the sliding window, we have
(4) $\sqrt{n}\,\bigl(q_n - q_\phi\bigr) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \frac{\phi(1-\phi)}{f(q_\phi)^2}\right)$
when $n \to \infty$. Therefore, with probability $1 - \alpha$, the following holds asymptotically
(5) $|q_n - q_\phi| \;\le\; \frac{z_{\alpha/2}\sqrt{\phi(1-\phi)}}{f(q_\phi)\sqrt{n}}$
when $n \to \infty$. Combining (3) and (5) via the union bound and the triangle inequality, with probability at least $1 - 2\alpha$, the following holds asymptotically when $s \to \infty$:
(6) $|\hat{q} - q_n| \;\le\; \frac{2\, z_{\alpha/2}\sqrt{\phi(1-\phi)}}{f(q_\phi)\sqrt{n}}.$
∎
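To make the bound concrete, the following computes its width under the assumed CLT form $2 z_{\alpha/2}\sqrt{\phi(1-\phi)} / (f(q_\phi)\sqrt{n})$ for the Normal(1 million, 50 thousand) data used in our scalability tests; this is an illustration, not part of the proof:

```python
import math
from statistics import NormalDist

def clt_bound_width(phi, n, alpha, density_at_q):
    """Asymptotic bound width 2 * z_{alpha/2} * sqrt(phi*(1-phi))
    / (f(q_phi) * sqrt(n)); an assumed form for illustration."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * z * math.sqrt(phi * (1.0 - phi)) / (density_at_q * math.sqrt(n))

# Normal(1e6, 5e4) data, median (phi = 0.5), 100K-element window:
dist = NormalDist(1_000_000, 50_000)
width = clt_bound_width(0.5, 100_000, 0.05, dist.pdf(1_000_000))
```

At the median the width is well under 0.1% of the mean, consistent with the low relative errors observed for Q0.5.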