Approximate Quantiles for Datacenter Telemetry Monitoring

06/01/2019 · by Gangmuk Lim, et al.

Datacenter systems require efficient troubleshooting and effective resource scheduling so as to minimize downtimes and to efficiently utilize limited resources. In doing so, datacenter operators employ streaming analytics for collecting and processing datacenter telemetry over a temporal window. The quantile operator is key in these systems as it can summarize the typical and abnormal behavior of the monitored system. Computing quantiles in real time is resource-intensive as it requires processing hundreds of millions of events in seconds while providing high quantile accuracy. We overcome these challenges through workload-driven approximation. Our study uncovers three insights: (i) values are dominated by a set of recurring small values, (ii) the distribution of small values is consistent across different time scales, and (iii) tail values are dominated by a small set of large values. We propose QLOVE, an efficient and accurate quantile approximation algorithm that capitalizes on these insights. QLOVE minimizes the memory footprint of the quantile operator via compression and frequency-based summarization of small values. While these summaries are stored and processed at sub-window granularity for memory efficiency, they can extend to compute quantiles on user-defined temporal windows. Low value error for tail quantiles is achieved by retaining a few tail values per sub-window. QLOVE estimates quantiles with high throughput and less than 5% relative value error, while state-of-the-art algorithms either have a high relative value error (13-35%) or deliver lower throughput (by 15-92%).


1. Introduction

Modern datacenters provide a distributed computing environment with hundreds of thousands of machines. Stream analytic systems are key components in such large-scale systems as they are employed by datacenter operators for monitoring them and responding to events in real time (Qian et al., 2013; Zaharia et al., 2013; Chandramouli et al., 2014; Kulkarni et al., 2015; Lin et al., 2016). For instance, datacenter network latency (Guo et al., 2015; Tan et al., 2019) and web search engine (Dean and Barroso, 2013; Jeon et al., 2016) monitoring systems collect response latencies of servers to assess the health of the systems and/or to guide load balancing decisions. These monitoring systems continuously receive massive amounts of data from thousands of machines, perform computations over recent data as scoped by a temporal window, and periodically report results, typically in seconds or minutes. Supporting complex computation over such data volumes in real time requires hundreds of machines (Qian et al., 2013; Zaharia et al., 2013; Lin et al., 2016), calling for improvements in stream processing throughput (Miao et al., 2017).

The quantile operator lies at the heart of such real-time monitoring systems as it can determine the typical (0.5-quantile) or abnormal behavior (0.99-quantile) of the monitored system. Formally, the $\phi$-quantile ($0 < \phi \le 1$) of $N$ values is the value of rank $\lceil \phi N \rceil$ when the values are sorted in increasing order, for different values of $\phi$. For instance, in network troubleshooting and network health dashboards, a static set of quantiles are continuously computed on the round-trip times (RTTs) between datacenter servers so as to measure the quality of network reachability. In web search engines, a predefined set of quantiles are computed on query response times across clusters and are employed by load balancers so as to meet strict service-level agreements on query latency (Dean and Barroso, 2013). In such scenarios, highly accurate quantiles are required to reduce any false-positive discoveries or ineffective scheduling decisions.

Satisfying the requirements for high throughput, real-time, and accurate computation of quantiles is a challenging task as exact and low-latency computation of quantiles is resource-intensive and often infeasible. Unlike aggregation operators (e.g., average) that can be computed incrementally with a small memory footprint, exact quantile computation requires storing and processing the entire value distribution over the temporal window. In real-world scenarios, such as the network latency monitoring system (Guo et al., 2015), where millions of events arrive every second and the temporal window can be on the order of minutes, the memory and compute requirements for exact and low-latency quantile computation are massive.

In these scenarios, exact solutions are often not needed to satisfy the accuracy requirements, and approximate quantiles with low value error are acceptable if they are computable within a fraction of the resources needed for the exact solution. For instance, quantiles are often utilized to guide a high-level decision, such as in network health dashboards and web search engines, where quantiles are compared to predefined thresholds to discover outliers (Guo et al., 2015) or to guide load balancing (Dean and Barroso, 2013). In such cases, approximate quantiles should be computed within a small value error from the corresponding exact quantiles so as to avoid making different high-level decisions.

In this work, we uncover opportunities in approximate quantiles by characterizing real-world workloads. Our study shows that practical workloads have many recurring values and can have a substantial skew. For instance, in datacenter networking latency datasets (NetMon in Figure 1), while most latencies are small and concentrated, with more than 90% below 1,247 us, a few latencies are very large and heavy-tailed, reaching up to 74,265 us. When studying the distributions across different time scales, we also find that the distribution of small values is self-similar (Leland et al., 1994). This is not surprising because datacenter networks work well and function consistently most of the time, resulting in similar latencies and distributions.

Implications of value distribution on approximate quantiles. We find that while existing work on approximate quantiles (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016) results in low value errors for non-high quantiles (e.g., 0.5- to 0.9-quantiles), it fails to achieve low value error for higher quantiles (e.g., 0.99-quantile). These works seek to minimize the rank error, a widely used quantile approximation metric, rather than the value error. Rank error is the distance between the exact rank and the approximate rank. When distributions are skewed, a small amount of rank error translates to a high value error for higher quantiles.

We demonstrate these implications by considering a real-world example. Consider a data stream of size $N$ with its elements sorted as $v_1 \le v_2 \le \dots \le v_N$ in increasing order. The $\phi$-quantile ($0 < \phi \le 1$) is the $\lceil \phi N \rceil$-th element, i.e., the element with rank $r = \lceil \phi N \rceil$. For a given rank $r$ and data size $N$, prior work focuses on delivering an approximate quantile within a deterministic rank-error bound called $\epsilon$-approximation, i.e., the rank of the approximate quantile is within the range $[r - \epsilon N, r + \epsilon N]$. Assume $\epsilon = 0.02$ and a window size of 100K elements; the rank error is then bounded by $\pm$2K, thereby resulting in the rank interval $[r - 2\text{K}, r + 2\text{K}]$, where $r - 2$K and $r + 2$K are the lower and upper bounds respectively. Hence, the same rank interval will deliver different value distances according to the underlying data distribution. For the datacenter networking latency scenario, the latency at the 0.5-quantile (median, $r = 50$K) is 798 us while its rank distance +2K (i.e., rank 52K) sits at 814 us, resulting in a value error of only 2%. On the contrary, the latency at the 0.99-quantile ($r = 99$K) is 1,874 us, and its rank distance +2K sits at 74,265 us, which is ~40x larger. These results demonstrate that designing an efficient and accurate quantile approximation algorithm requires taking into account the underlying data distribution and its influence on value errors.
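To make the gap concrete, the following small sketch contrasts rank error and value error on a synthetic heavy-tailed latency sample. The lognormal parameters and helper names are illustrative assumptions and do not reproduce the actual NetMon data.

import random

# Hypothetical heavy-tailed latencies (lognormal), loosely mimicking the shape
# of NetMon described above; the parameters are illustrative, not the real data.
random.seed(0)
latencies = sorted(random.lognormvariate(6.7, 0.5) for _ in range(100_000))

def value_at_rank(sorted_vals, rank):
    # Return the element at a 1-based rank, clamped to the valid range.
    return sorted_vals[min(max(rank, 1), len(sorted_vals)) - 1]

N, rank_err = len(latencies), 2_000   # epsilon = 0.02 on a 100K window -> +-2K ranks
for phi in (0.5, 0.99):
    r = int(phi * N)
    exact = value_at_rank(latencies, r)
    approx = value_at_rank(latencies, r + rank_err)   # worst-case +2K rank shift
    rel_err = (approx - exact) / exact
    print(f"phi={phi}: exact={exact:.0f} us, +2K ranks -> {approx:.0f} us, "
          f"value error={rel_err:.1%}")

The same +2K rank shift produces a negligible value error near the median but a large one at the 0.99-quantile, because the tail of the distribution is sparse.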

Our proposal. We present approximate Quantiles with LOw Value Error, or QLOVE. QLOVE leverages the observation in monitoring scenarios that the quantiles to be computed are fixed throughout the temporal window. QLOVE takes into account the underlying data distribution for these quantiles and proposes different approaches for computing non-high quantiles and high quantiles. Each approach capitalizes on the insights from our workload characterization, delivering efficient and accurate computation of quantiles.

Non-high quantiles. Based on the observation that the value distribution of small latencies (i.e., the ones used in the computation of non-high quantiles) is self-similar, QLOVE first performs quantile computation at the granularity of sub-windows (i.e., windows smaller than the temporal window defined by the user). The quantile for a given temporal window is computed by averaging the quantiles of all sub-windows falling within the temporal window. This optimization enables QLOVE to store only the values of the in-flight sub-window and the quantiles of all previous sub-windows, resulting in higher throughput due to the smaller memory footprint.

Similar to prior work on approximate quantiles (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016), memory consumption is reduced because fewer values are retained during a temporal window. QLOVE further reduces memory consumption by capitalizing on an additional workload insight: values are mostly concentrated in a small collection of numerical values. For instance, in our datacenter network latency monitoring example, only a small percentage of the elements in a one-hour temporal window are unique. This high data redundancy allows a considerable reduction in space by maintaining the frequency distribution of in-flight data (i.e., {value, count}) instead of the value distribution. In doing so, due to the high concentration of values, our quantile estimate can achieve low value error for non-high quantiles (e.g., median) while minimizing the memory footprint.

Figure 1. Histogram of 100K latency values (in us) in NetMon. The x-axis is cut at 10,000 due to a very long tail.

High quantiles. QLOVE explicitly records tail values to better approximate the high quantiles (e.g., 0.99-quantile). Typically, these values are infrequent and can be stored efficiently. Our technique, called few-k merging, carefully chooses which and how many tail values to store based on the query parameters (i.e., window size and period) as well as the observed data distributions. We highlight two scenarios where few-k merging is needed: (i) Statistical inefficiency. When the sub-window contains too few data points, the high quantiles in each sub-window are not statistically robust as they are impacted by too few values. For instance, if a sub-window has 1K elements, the 0.999-quantile is decided only by the two largest elements; and (ii) Bursty traffic. If all of the tail values are concentrated in one or a few sub-windows, their impact on the overall quantiles is not reflected well by the quantiles of their corresponding sub-windows. We discuss how to choose the few-k size, how to merge few-k values to produce an answer for the window, and how to manage the space budget. Similar to non-high quantiles, we compress individual values and store a frequency distribution to reduce the memory needed to store the few-k values.

We have implemented QLOVE in the Trill open-source streaming analytics engine (Chandramouli et al., 2014) and have deployed it in production in a streaming network monitoring system at a large enterprise. We evaluate QLOVE using both real-world and synthetic workloads. Our experiments show that, relative to computing the exact quantiles, QLOVE offers substantially higher throughput when sub-windows have 10K elements each, and the gain grows further when sub-windows have 1M elements each. Value compression further lowers the space usage. Moreover, the average relative value error for different quantiles falls below 5%. In comparison, algorithms built atop the rank-error approximation metric (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016) either incur a high relative value error (13-35% higher than QLOVE) or result in lower throughput (15-92% lower on a sliding window of 100K elements that includes 10 sub-windows).

In summary, this paper has the following contributions:

  • We define the problem of approximate computing of quantiles with low value error (rather than rank error) and present a solution, QLOVE. To the best of our knowledge, this is the first attempt to tackle this problem.

  • We design and implement mechanisms to reduce memory consumption through space-efficient summaries and frequency distribution, enabling high throughput in the presence of huge data streams and large temporal windows.

  • We implement QLOVE in a streaming engine, and show its practicality using real-world use cases.

2. Quantile Processing Model

We introduce our streaming query execution model (i.e., incremental evaluation), and define the problem of our quantile approximation.

Streaming model. A data stream is a continuous sequence of data elements that arrive over time. Each element $e$ has a value associated with a timestamp that captures the order of $e$'s occurrence. Under this data stream model, a query defines a window to specify the elements to consider in the query evaluation. For example, a query uses a window to process the latest $W$ elements seen so far, where $W$ is the window size. Due to the continuous arrival of data, a window requires a period to determine how frequently the query must execute (i.e., the frequency of query evaluation). For example, a query can process the recent $W$ elements periodically upon every insertion of $P$ new elements, where $P$ is the window period. Our work can be applied to windows defined by time parameters, e.g., evaluate the query every one minute (window period) over the elements seen in the last one hour (window size).

This paper mainly considers two types of windowing models (Akidau et al., 2015): (1) Tumbling Window, where the window size is equal to the window period, and (2) Sliding Window, where the window size is larger than the window period. As window size and period are the same in a tumbling window, there is no overlap between data elements considered by two successive query evaluations. Therefore, an element processed in one window is never reused by any subsequent windows. In contrast, successive evaluations of a sliding-window query overlap in the data they consider, thus allowing each element to remain valid across consecutive windows. We do not consider the windowing model where elements never expire (Wang et al., 2013).

Incremental evaluation. Incremental evaluation (Ghanem et al., 2007; Chandramouli et al., 2014) supports a unified design logic for stream processing while allowing window-based queries to be integrated with various performance optimizations. The basic idea is to keep state for the query to evaluate; the state is updated as new elements are inserted or old elements are deleted. State is typically smaller in size than the data covered by a window, thus making use of resources efficiently. Further, when computing the query result, using the state directly is typically faster than a naive, stateless approach that accumulates all elements at the moment of evaluating the query. To implement an incremental operator, developers should define the following functions (Chandramouli et al., 2014):

  • InitialState:() => S: Returns an initial state S.

  • Accumulate:(S, E) => S: Updates an old state S with newly arrived event E.

  • Deaccumulate:(S, E) => S: Updates an old state S upon the expiration of event E.

  • ComputeResult:S => R: Computes the result R from the current state S.

For example, the following illustrates how to write an average operator using the incremental evaluation functions:

InitialState: () => S = {Count : 0, Sum : 0}
Accumulate: (S, E) => {S.Count + 1, S.Sum + E.Value}
Deaccumulate: (S, E) => {S.Count - 1, S.Sum - E.Value}
ComputeResult: S => S.Sum / S.Count
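The same four functions can also be rendered as a small, self-contained program. The following Python sketch is illustrative only; the names are ours and do not correspond to Trill's actual API.

from dataclasses import dataclass

# Illustrative rendering of the incremental average operator above.
@dataclass
class AvgState:
    count: int = 0
    total: float = 0.0

def initial_state() -> AvgState:
    return AvgState()

def accumulate(state: AvgState, value: float) -> AvgState:
    state.count += 1
    state.total += value
    return state

def deaccumulate(state: AvgState, value: float) -> AvgState:
    state.count -= 1
    state.total -= value
    return state

def compute_result(state: AvgState) -> float:
    return state.total / state.count

# Sliding-window usage: accumulate arriving events, deaccumulate expired ones.
s = initial_state()
for v in (3.0, 5.0, 10.0):
    accumulate(s, v)
deaccumulate(s, 3.0)          # the oldest event expires
print(compute_result(s))      # (5 + 10) / 2 = 7.5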

Incremental evaluation on sliding windows tends to be slower to execute than that on tumbling windows. The primary reason is that the tumbling-window query is implemented with a smaller set of functions, without Deaccumulate. In this case, the query accumulates all data of a period on an initialized state, computes a result, and simply discards the state. In contrast, a sliding-window query must invoke Deaccumulate for every deletion of an element from the current window. For a sliding window, the performance largely varies with how complicated the Deaccumulate logic is. Simple operators such as average can deaccumulate an element from the state efficiently, whereas doing so is challenging for complicated operators such as quantiles.

3. Algorithm for Non-high Quantiles

First, we present an algorithm that effectively handles non-high quantiles with high underlying distribution density, along with its cost and error analysis. We also illustrate scenarios where the algorithm alone is insufficient, motivating the techniques introduced in Section 4.

3.1. Algorithm Overview

The key idea is to partition a window into sub-windows, and leverage the results from sub-window computations to give an approximate answer for the $\phi$-quantile over the entire window. Sub-windows are created following the windowing semantics in use, by which the size of each sub-window is aligned with the window period. These sub-windows follow the timestamp order of data elements, i.e., sub-windows containing older data elements are generated earlier than those containing newer ones. For each sub-window, we maintain a small-size summary instead of keeping all data elements. At the window level, an aggregation function merges sub-window summaries to approximate the exact $\phi$-quantile answer.

Figure 2. Sliding window processing in QLOVE.

More formally, assume a sliding window divided into $m$ sub-windows, where each sub-window includes $n$ data points. If the $i$-th sub-window has a sequence of data $X_{i,1}, \dots, X_{i,n}$, we observe the whole $m \cdot n$ data points in the sliding window. Here, the sample $\phi$-quantile of the sliding window is denoted by $q_\phi$, which is the exact result to approximate. QLOVE estimates $q_\phi$ through two-level hierarchical processing, as presented in Figure 2. Level 1 computes the exact $\phi$-quantile of each sub-window. The exact $\phi$-quantile of the $i$-th sub-window is denoted as $q_{\phi,i}$, which becomes the summary of the corresponding sub-window. Level 2 aggregates the summaries of all sub-windows to estimate the exact $\phi$-quantile $q_\phi$. QLOVE uses the mean as an aggregation function guided by the Central Limit Theorem (Stuart and Ord, 1994; Tierney, 1983) and obtains the aggregated $\phi$-quantile of the sliding window, denoted by $\hat{q}_\phi = \frac{1}{m}\sum_{i=1}^{m} q_{\phi,i}$.

When the window slides after one time step, as the figure shows, QLOVE discards the oldest summary $q_{\phi,1}$ and adds $q_{\phi,m+1}$ for the new sub-window, thereby forming a new bag of summaries $\{q_{\phi,2}, \dots, q_{\phi,m+1}\}$.

In principle, QLOVE is a hybrid approach that combines tumbling windows into the original sliding-window quantile computation, delivering improved performance. Specifically, while creating a summary, Level 1 runs a tumbling window, which avoids deaccumulation. Once a sub-window completes, all values are discarded after they are used to compute the summary. Level 2 operates a sliding window on summaries, requiring deaccumulation. However, since a summary contains only a few entries associated with the specified quantiles, and not with other factors such as sub-window size, it can perform fast deaccumulation as well as fast accumulation.

(Level 1) Creating a new sub-window summary. During Level 1 processing, we exploit opportunities for volume reduction via data redundancy. During the sub-window processing, in-flight data are maintained in a compressed state of $\{(v, f_v)\}$ pairs, where $f_v$ is the frequency of element $v$. A critical property here is that each value $v$ is unknown until we observe it in new incoming data. To efficiently insert a new element into the state and continuously keep it ordered, we use a red-black tree (Guibas and Sedgewick, 1978) where the element value acts as the key to sort nodes in the tree. This also avoids the sorting cost when computing the exact quantiles at the end of Level 1 processing for the in-flight sub-window.

The logic to manage the red-black tree is sketched in Algorithm 1. InitialState and Accumulate are self-explanatory, and we explain ComputeResult in detail. At the time of result computation in ComputeResult, the tree already holds the sub-window's elements in sorted order. Thus, the computation does an in-order traversal of the tree while using the frequency attribute to count the position of each unique element. As the total number of elements is maintained in the state, it is straightforward to know the percentage of elements below the current element during the in-order traversal. A query may ask for multiple quantiles at a time. In this case, ComputeResult evaluates the quantiles in a single pass, with the smallest one searched first during the same in-order traversal.

Lastly, to increase data duplicates, some insignificant low-order digits of streamed values may be zeroed out. Often, we consider only the three most significant digits of the original value, which keeps the quantized value within 1% relative error of the original.
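A minimal sketch of such a quantization step, assuming positive numeric values (e.g., latencies in microseconds); the function name and the truncation-based rounding are our illustration, not necessarily the exact rule used in QLOVE.

import math

def quantize(value, sig_digits=3):
    # Zero out low-order digits, keeping only the leading significant digits.
    # With three significant digits, the truncated value stays within 1%
    # relative error of the original.
    if value == 0:
        return 0
    exponent = math.floor(math.log10(abs(value))) - (sig_digits - 1)
    scale = 10 ** exponent
    return math.floor(value / scale) * scale

print(quantize(74_265))   # 74200
print(quantize(1_247))    # 1240
print(quantize(798))      # 798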

(Level 2) Aggregating sub-window summaries. The logic for aggregating all sub-window summaries is almost identical to the incremental evaluation for the average introduced in Section 2. The only distinction is that, to answer the specified quantiles, there is one instance of the average's state (i.e., sum and count) per specified quantile. The accumulation and deaccumulation handlers update these states to compute the average of each quantile separately.
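The following sketch shows one way to hold that per-quantile state; the class and method names are illustrative, not QLOVE's actual implementation.

# Level 2 state: one (sum, count) pair per requested quantile, averaged
# incrementally over the sub-window summaries in the current window.
class Level2State:
    def __init__(self, quantiles):
        self.quantiles = list(quantiles)
        self.sums = {q: 0.0 for q in self.quantiles}
        self.count = 0   # number of sub-window summaries currently in the window

    def accumulate(self, summary):
        # summary maps each quantile to its exact value in the new sub-window.
        for q in self.quantiles:
            self.sums[q] += summary[q]
        self.count += 1

    def deaccumulate(self, summary):
        # Remove the summary of the expired (oldest) sub-window.
        for q in self.quantiles:
            self.sums[q] -= summary[q]
        self.count -= 1

    def compute_result(self):
        return {q: self.sums[q] / self.count for q in self.quantiles}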

3.2. Algorithm Analysis

Space complexity. The space complexity of our approximate algorithm is $O(s \cdot W/P + P)$, where $s$ is the number of quantiles to answer. For the summaries of the $s$ independent quantiles, we need $O(s \cdot W/P)$ space, where $W$ and $P$ are the window size and the sub-window size, respectively. There is at most one sub-window under construction, for which we maintain a sorted tree of in-flight data. Its space usage is $O(d)$, where $d$ is the number of unique in-flight values, which can range from 1 to $P$ depending on the degree of duplicates in the workload. In one extreme case, all elements have the same value, so $d = 1$; in the other extreme case, there is no duplicate at all, so $d = P$. This spectrum allows the data arrival handler to reduce space usage significantly if there is high data redundancy in the workload.
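As a concrete check (taking the Section 5.2 query as an assumed example, with $W$ = 128K, $P$ = 16K, and $s$ = 4 quantiles), the analytical cost works out to

$$ s \cdot \frac{W}{P} + P = 4 \cdot \frac{131{,}072}{16{,}384} + 16{,}384 = 32 + 16{,}384 = 16{,}416 $$

variables, which matches QLOVE's analytical space usage reported in Table 1.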

Time complexity. The degree of duplicates is a factor that also reduces the theoretical time cost of the Level 1 stage. For a sub-window of size $P$ with $d$ unique elements, the Accumulate cost is $O(\log d)$ per element, which falls along a continuum between $O(1)$ and $O(\log P)$, again depending on the degree of duplicates. Likewise, the complexity of ComputeResult is $O(d)$ irrespective of the number of quantiles to search. The Level 2 stage in QLOVE runs extremely fast with a static cost: each of the $s$ specified quantiles needs two add operations for Accumulate or Deaccumulate, and one division operation for ComputeResult. Section 5.4 shows an experiment on how a higher degree of duplicates leads to higher throughput.

1:procedure InitialState:
2:      return new red-black tree T with T.total = 0
3:end procedure
4:
5:procedure Accumulate: (state, input)
6:      if input.value ∉ state then
7:            state[input.value] ← 0
8:      end if
9:      state[input.value] ← state[input.value] + 1
10:      state.total ← state.total + 1
11:end procedure
12:
13:procedure ComputeResult: (state)
14:      cumulative ← 0
15:      Q ← quantiles to answer
16:      sort Q in non-decreasing order
17:      i ← 1
18:      rank ← ⌈Q[i] × state.total⌉ ▷ rank of the first quantile
19:      for each node n in in-order traversal of state do
20:            cumulative ← cumulative + n.frequency
21:            while cumulative ≥ rank do
22:                  answer[Q[i]] ← n.value
23:                  if i = |Q| then
24:                        return answer
25:                  end if
26:                  i ← i + 1
27:                  rank ← ⌈Q[i] × state.total⌉ ▷ next quantile
28:            end while
29:      end for
30:end procedure
Algorithm 1 Incremental computation for Level 1
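For reference, the same logic can be written as a compact runnable sketch. A plain dictionary plus a sort at result time stands in for the red-black tree of Algorithm 1, and the names are ours.

import math

class Level1State:
    # Frequency-based summary of the in-flight sub-window.
    def __init__(self):
        self.freq = {}     # value -> occurrence count
        self.total = 0

    def accumulate(self, value):
        self.freq[value] = self.freq.get(value, 0) + 1
        self.total += 1

    def compute_result(self, quantiles):
        # Answer multiple quantiles in one pass over the sorted unique values.
        qs = sorted(quantiles)
        answers, i = {}, 0
        rank = max(1, math.ceil(qs[0] * self.total))   # rank of the first quantile
        cumulative = 0
        for value in sorted(self.freq):                # in-order traversal
            cumulative += self.freq[value]
            while cumulative >= rank:
                answers[qs[i]] = value
                i += 1
                if i == len(qs):
                    return answers
                rank = max(1, math.ceil(qs[i] * self.total))   # next quantile
        return answers

state = Level1State()
for v in (798, 798, 798, 810, 814, 820, 850, 900, 1874, 74265):
    state.accumulate(v)
print(state.compute_result([0.5, 0.9]))   # {0.5: 814, 0.9: 1874}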

Error bound. Summary-driven aggregation is designed based on self-similarity of the data distribution for non-high quantiles with high underlying distribution density, as described in Section 1. We present the theory behind our approach for error bound analysis; our theorem assumes several conditions. (1) We consider the $\phi$-quantile of each sub-window as a random variable before we actually observe the data. Similarly, the $\phi$-quantile of the sliding window is also a random variable as long as the data is not given. (2) The target $\phi$-quantiles across sub-windows are independent and identically distributed (i.i.d.). (3) Data in the window have a continuous distribution.

Now, we derive a statistical guarantee that the aggregated estimate is close to the exact $\phi$-quantile using the Central Limit Theorem (see Theorem 1 for details). To illustrate how to interpret the results in Theorem 1, suppose we obtain the aggregated estimate $\hat{q}_\phi$ and the error bound $\varepsilon_\phi$ over the data stream of a window. Since the exact quantile is $q_\phi$, we use $\hat{q}_\phi$ to approximate $q_\phi$, and evaluate how close they are using $|q_\phi - \hat{q}_\phi|$. Essentially, we claim that $|q_\phi - \hat{q}_\phi| \le \varepsilon_\phi$ with high confidence (e.g., probability 95%).
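Theorem 1 itself is not reproduced here; the following is a standard CLT-style confidence bound of the kind described above, written in our own notation ($m$ sub-windows with exact sub-window quantiles $q_{\phi,i}$ and their sample standard deviation $s_\phi$) and stated as an assumption rather than the paper's exact theorem:

$$ \hat{q}_\phi = \frac{1}{m}\sum_{i=1}^{m} q_{\phi,i}, \qquad \varepsilon_\phi = z_{\alpha/2}\,\frac{s_\phi}{\sqrt{m}}, \qquad s_\phi^2 = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(q_{\phi,i} - \hat{q}_\phi\bigr)^2, $$

with $\Pr\bigl(|q_\phi - \hat{q}_\phi| \le \varepsilon_\phi\bigr) \approx 1 - \alpha$ (e.g., $\alpha = 0.05$ and $z_{\alpha/2} \approx 1.96$ for 95% confidence).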

Our probabilistic error bound depends on the density of the underlying data distribution at the specified $\phi$-quantile. Recall Figure 1, where the density only decays in the tail, making the density at the 0.5-quantile (median) much larger than that at the 0.999-quantile. In this case, the error bound is expected to be much tighter for non-high quantiles than for high quantiles. Narrower error bounds imply lower estimation errors; otherwise the error bound is not informative. For a number of tests we performed, including the query for Table 1, we see that the observed value error is much lower than the error bound $\varepsilon_\phi$.

Finally, condition (2) of our theorem assumes independence, so dependence between data points could in principle undermine our method. In Section 5.4, we show that our aggregated estimator can be effectively applied to non-i.i.d. data across a diverse spectrum of data dependence, with accuracy competitive to the i.i.d. case.

3.3. Achieving High Accuracy

There are two complementary cases where achieving high accuracy needs special consideration.

Statistical inefficiency. The inaccuracy of high quantiles becomes more significant when there is lack of data to accurately estimate the quantiles in a sub-window. For example, in Table 2, the estimated error increases noticeably at the 0.999-quantile if sliding windows use 1K elements in a period (i.e., 1K period). In this case, the two largest elements are used when computing the 0.999-quantile in each sub-window. This makes statistical estimation not robust under the data distribution, and thus misleads the approximation.

There are several remedies for this curse of high quantiles. Users can change the parameters of the query window to operate larger sub-windows with more data points. This provides a better chance to observe data points in the tail, allowing more precise estimates of high quantiles. Another approach is to cache a small proportion of the raw data without changing the windowing semantics, and use it to directly compute accurate high quantiles of the sliding window. We generalize this idea in Section 4.

Bursty traffic. When bursty traffic happens, extremely large values are highly skewed in one or a few sub-windows. In effect, they dominate the values to observe across the window for computing high quantiles. Note that statistical inefficiency for high quantiles may occur when the data distribution is self-similar across the sub-windows, while bursty traffic indicates that the data distribution at the tail is highly variant across the sub-windows.

4. Few-k merging

Few-k merging in QLOVE leverages raw data points to handle the large value errors that can appear in high quantiles as a result of statistical inefficiency or bursty traffic. During few-k merging, each sub-window collects a small number ($k$) of the largest values in its portion of the streaming data, and these values are used to compute the target high quantile for the window. We begin by discussing the issues that make collecting the right values challenging.

4.1. Challenges

Figure 3. Examples where the largest 10 values (colored in dark) appear differently across sub-windows.

Through analyzing our monitoring workloads, we find that values must be collected differently based on the observed streaming traffic. To illustrate, Figure 3 exemplifies four patterns in which the largest 10 values (colored in dark) of the window are distributed differently among the sub-windows. Assume the target high quantile can be obtained precisely by exploiting these 10 values. One extreme pattern represents extremely bursty traffic, where a single sub-window includes all of the largest values, whereas the opposite extreme spreads them completely evenly across sub-windows. To produce the exact answer for the bursty pattern, few-k merging must cache all 10 values in every sub-window, whereas for the evenly spread pattern a much smaller per-sub-window cache suffices. Caching only that smaller number sacrifices accuracy for the other patterns, and the loss grows with how concentrated the largest values are. Driven by this observation, our first challenge is providing a solution to handle both statistical inefficiency and bursty traffic under diverse patterns of the largest values over sub-windows.

In our workloads, large values tend to be spread over concurrent sub-windows most of the time, similar to the more evenly spread patterns in Figure 3. Bursty traffic occurs from time to time as a result of infrequent yet unpredictable abnormal behavior of the monitored system. Therefore, we cannot assume that the distribution of the in-flight largest values is known ahead of time or repeats over time. The distribution can only be estimated once we observe the data. Thus our second challenge is building a robust mechanism to dynamically recognize the current traffic pattern. Section 4.2 addresses these challenges.

4.2. QLOVE’s Approach

One possible approach is claiming enough data points to compute the exact quantile regardless of statistical inefficiency and bursty traffic. Formally, to guarantee the exact answer of the $\phi$-quantile ($\phi$ close to 1) on a window of size $W$, each sub-window must return its $\lceil (1-\phi) W \rceil$ largest elements. Then, the entire window will need space for $(W/P) \cdot \lceil (1-\phi) W \rceil$ elements in total, where $P$ is the sub-window size. This approach could be costly if the window size of a given sliding-window query is significantly larger than the window period. We thus consider few-k merging when the space is limited.

Let $B$ be the space budget, where $B$ is smaller than the space for the exact $\phi$-quantile, i.e., $B < (W/P) \cdot \lceil (1-\phi) W \rceil$. Given such $B$, we assign each sub-window the same space budget $k$, where $k = B / (W/P)$ and $k < \lceil (1-\phi) W \rceil$. Within each sub-window, $k$ will be further partitioned into two parts, $k_{top}$ to address statistical inefficiency by top-k merging and $k_{sample}$ to address bursty traffic by sample-k merging, such that $k = k_{top} + k_{sample}$. We now explain how to use the given sub-window budgets, $k_{top}$ and $k_{sample}$, to handle statistical inefficiency and bursty traffic.

Top-k merging for statistical inefficiency. When handling statistical inefficiency, each sub-window caches its $k_{top}$ largest values, based on the observation that the globally largest values tend to be widely spread, making higher-rank values in each sub-window more important. All these cached values are merged across the entire window to answer the $\phi$-quantile. For window size $W$, we draw the $\lceil (1-\phi) W \rceil$-th largest value in the merged data to approximate the $\phi$-quantile. We show how we can opportunistically trade off small space consumption for high accuracy (i.e., low value error) for some classes of applications in Section 5.3.
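A sketch of top-k merging under the notation above; the helper name is ours, and heapq.nlargest simply stands in for whatever per-sub-window structure an implementation would actually use.

import heapq
import math

def topk_merge_quantile(subwindow_topk_lists, phi, window_size):
    # Merge each sub-window's cached largest values and read off the high quantile.
    # subwindow_topk_lists holds one list of cached largest values per sub-window.
    merged = heapq.nlargest(window_size,
                            (v for lst in subwindow_topk_lists for v in lst))
    rank_from_top = math.ceil((1 - phi) * window_size)   # e.g. 132 for phi=0.999, W=128K
    if rank_from_top > len(merged):
        return None   # not enough cached values to answer this quantile
    return merged[rank_from_top - 1]                     # merged is sorted descending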

Sample-k for bursty traffic. We determine that traffic is bursty when the largest values in some sub-windows are substantially larger (i.e., worse) than those in others. Such sub-windows happen to appear dynamically over time, and thus we do not differentiate the largest values in each sub-window when coping with bursty traffic.

In QLOVE, each sub-window takes $k_{sample}$ samples from its $\lceil (1-\phi) W \rceil$ largest values so as to capture the distribution of the largest values using a smaller fraction. It uses interval sampling, which picks every $j$-th element of the ranked values (Luo et al., 2016); e.g., for $j = 2$, we select all even-ranked values. The sampling interval $j$ is inversely proportional to the allocated fraction, i.e., $j = \lceil (1-\phi) W \rceil / k_{sample}$. After merging all samples, the resulting $\phi$-quantile is obtained by referring to the $\lceil (1-\phi) W / j \rceil$-th largest value to factor in the data reduction by sampling.
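A sketch of that sampling and rank adjustment under the same assumptions; function names are ours.

import math

def sample_k(top_values_desc, k_sample):
    # Interval-sample every j-th element of a sub-window's ranked largest values,
    # where j is inversely proportional to the allocated fraction.
    j = max(1, math.ceil(len(top_values_desc) / k_sample))
    return top_values_desc[j - 1::j], j        # j = 2 keeps the even-ranked values

def samplek_merge_quantile(subwindow_samples, phi, window_size, j):
    # Merge the samples of all sub-windows and adjust the target rank by the
    # sampling interval j to account for the data reduction.
    merged = sorted((v for s in subwindow_samples for v in s), reverse=True)
    rank_from_top = math.ceil((1 - phi) * window_size / j)
    return merged[min(rank_from_top, len(merged)) - 1]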

Deciding $k_{top}$. When deciding the size of $k_{top}$, QLOVE estimates the number of data points that a window needs to compute an answer for non-bursty traffic, such as the evenly spread pattern in Figure 3 (which requires the fewest cached values per sub-window). We use this estimate as the per-sub-window budget $k_{top}$. QLOVE assigns all the remaining budget to $k_{sample}$. Note that a more conservative approach may assume a less even pattern in Figure 3, in which case $k_{top}$ doubles. $k_{sample}$ is typically larger than $k_{top}$ because $k_{top}$ is based on a very small portion of the largest data. Sample-k merging benefits from this considerably, since the value error in the estimation of high quantiles through sampling is sensitive to the sampling rate due to the low density of the underlying data.

4.3. Traffic Handling at Runtime

The two proposed value merging techniques can be active at the same time during streaming query processing, triggered by different conditions. The use of top-k merging is decided by query semantics (e.g., windowing model and target quantiles). If there is any quantile that suffers from statistical inefficiency, top-k merging for that quantile will be switched on. In contrast, sample-k merging is a standing pipeline that is exploited whenever bursty traffic is ongoing.

Currently, several decisions made for traffic handling are guided by empirical study or parameters measured offline. Future work includes integrating these processes entirely online.

When to enable Top-k. For each $\phi$-quantile, QLOVE initiates the top-k merging process if $(1-\phi) \cdot P < \theta$, where $\theta$ is the threshold deciding statistical inefficiency when using a sub-window of size $P$. Otherwise, QLOVE directly exploits results estimated from our approximation algorithm presented in Section 3. We set $\theta$ to 10 based on evaluating the monitoring workloads we run in our system.

Selecting outcomes. QLOVE needs to decide at runtime which of the two pipelines' outcomes to take. Results from sample-k merging are prioritized if bursty traffic is detected. Otherwise, QLOVE uses the results from top-k merging for those high quantiles that face statistical inefficiency. To detect bursty traffic, we identify whether the sampled largest values in the current sub-window are distributionally different from and stochastically larger than those in the immediately preceding sub-window. We use an existing methodology, the Mann-Whitney U test (Mann and Whitney, 1947), for this comparison.
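A sketch of that burst check using SciPy's Mann-Whitney U test; the 0.05 significance level and the function name are assumed choices, not values stated in the paper.

from scipy.stats import mannwhitneyu

def looks_bursty(current_top_samples, previous_top_samples, alpha=0.05):
    # Flag bursty traffic if the current sub-window's sampled largest values are
    # stochastically larger than those of the preceding sub-window.
    if len(current_top_samples) < 2 or len(previous_top_samples) < 2:
        return False
    _, p_value = mannwhitneyu(current_top_samples, previous_top_samples,
                              alternative="greater")
    return p_value < alpha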

5. Experimental Evaluation

We implement QLOVE along with the competing policies using the Trill open-source streaming analytics engine (Chandramouli et al., 2014), and conduct our evaluation using real-world workloads. Trill extends temporal-relational models with language-integrated queries (LINQ), supporting diverse scenarios in streaming analytics. It also provides a framework for integrating custom incremental evaluation operators, which we use to compare all the policies experimentally.

Policy  | Rank error                     | Value error (%)             | Space usage
        | Q0.5    Q0.9    Q0.99   Q0.999 | Q0.5   Q0.9   Q0.99  Q0.999 | Analytical  Observed
QLOVE   | 0.0016  0.0005  0.0002  0.0001 | 0.10   0.06   0.78   4.40   | 16,416      3,340
CMQS    | 0.0034  0.0018  0.0009  0.0007 | 0.31   0.26   1.78   28.47  | 33,504      31,194
AM      | 0.0020  0.0011  0.0004  0.0004 | 0.24   0.20   0.94   13.25  | 45,309      36,253
Random  | 0.0021  0.0012  0.0005  0.0005 | 0.20   0.20   1.00   16.69  | 45,611      68,001
Moment  | 0.018   0.0017  0.0004  0.0002 | 0.98   0.28   0.76   9.30   | NA          16,596

Table 1. Accuracy and space usage of five approximation algorithms.

5.1. Experimental Setup

Machine setup and workload. The machine for all experiments has two 2.40 GHz 12-core Intel 64-bit Xeon processors with hyperthreading and 128GB of main memory, and runs Windows. Our evaluation is based on two real-world datasets. (1) The NetMon dataset includes network latency in microseconds (us), with each entry measuring the round-trip time (RTT) between two servers in a large-scale datacenter. (2) The Search dataset includes the query response time of an index serving node (ISN) at a search engine. The response time is in microseconds (us), measured from the time the ISN receives the query to the time it responds to the user. Each dataset contains 10 million entries.

Query. We run the query written in LINQ on the datasets to estimate 0.5, 0.9, 0.99, and 0.999-quantile (henceforth denoted by Q0.5, Q0.9, Q0.99, and Q0.999):

Qmonitor = Stream
 .Window(windowSize, period)
 .Where(e => e.errorCode != 0)
 .Aggregate(c => c.Quantile(0.5,0.9,0.99,0.999))

Policies in comparison. We compare QLOVE to the following strategies that support the sliding window model. (1) Exact is the baseline policy that computes exact quantiles. It extends Algorithm 1 with deaccumulation logic; the node representing the expired element's value decrements its frequency by one, and is deleted from the red-black tree if the frequency becomes zero. This outperformed other methods for the exact quantiles. (2) CMQS, Continuously Maintaining Quantile Summaries (Lin et al., 2004), bounds rank errors of quantile approximation deterministically; for a given rank $r$ and $N$ elements, its $\epsilon$-approximation returns a value within the rank interval $[r - \epsilon N, r + \epsilon N]$. (3) AM is another deterministic algorithm with a rank error bound designed by Arasu and Manku (Arasu and Manku, 2004). (4) Random is a state-of-the-art randomized algorithm that uses sampling to bound the rank error with constant probability (Luo et al., 2016). (5) Moment Sketch is an algorithm using mergeable moment-based quantile sketches to predict the original data distribution from a summary of moment statistics.

Metrics. We use average relative error as the accuracy metric, the number of variables as the memory usage metric, and million elements per second (M ev/s) processed by a single thread as the throughput metric. Average relative error (in %) is measured by $\frac{100}{T}\sum_{t=1}^{T} \frac{|\hat{v}_t - v_t|}{v_t}$, where $\hat{v}_t$ is the estimated value from approximation and $v_t$ is the exact value at the $t$-th query evaluation. As stream processing continuously organizes a window and evaluates the query on it, the error is averaged over the $T$ query evaluations.

5.2. Comparison to Competing Algorithms

This section compares QLOVE with the competing algorithms. We disable few-k merging in QLOVE until Section 5.3 to show how our algorithm in Section 3 alone works.

Approximation error. The accuracy columns in Table 1 show the average value error and rank error for a set of quantiles when using a 16K window period and 128K window size on the NetMon dataset. For CMQS, AM and Random, we configure the error-bound parameter $\epsilon$ as 0.02, guided by its sensitivity to value error. For Moment, we set its parameter to 12 to obtain similar error bounds. For a given quantile $\phi$, in addition to the average value error, we present its average rank error measured by $\frac{1}{T}\sum_{t=1}^{T} \frac{|r - \hat{r}_t|}{W}$, where $r$ is the exact rank of $\phi$, $\hat{r}_t$ is the rank of the returned value for the $t$-th query evaluation, $T$ is the total number of query evaluations, and $W$ is the window size. $\epsilon = 0.02$ guarantees that none of the per-evaluation rank errors $\frac{|r - \hat{r}_t|}{W}$ is larger than 0.02.

The results in Table 1 show that CMQS, AM and Random can all successfully bound errors by rank. The average rank errors stay low, well within the 0.02 bound, with the largest error observed in individual query evaluations across all the policies below 0.0105, which confirms the effectiveness of these methods. It is also noteworthy that the rank error of values returned by QLOVE is comparable, with even slightly lower error results across the quantiles.

Comparing across the policies, QLOVE outperforms the others in value error, especially for the very high quantile (Q0.999). Moreover, the results indicate that a given rank error has very different influence on the value error across different quantiles. For example, comparing Q0.5 and Q0.999, the rank error at Q0.999 is lower while its corresponding value error is adversely higher. This is primarily because the NetMon workload exhibits high variability, where the value at the high quantile is orders of magnitude larger than at the median, as indicated in Figure 1. As a result, a small change in Q0.999 rank error ends up leading to more than a 3x difference in value error.

Figure 4. Throughput comparison.

Space usage. The space usage columns in Table 1 present the number of variables each algorithm stores in memory. The space usage is calculated from the theoretical bound (Analytical) found in (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016), as well as measured at runtime (Observed) while running the algorithm. QLOVE benefits from the high data redundancy present in NetMon, reducing memory usage at runtime substantially below its analytical cost. Recall that our theoretical cost is $s \cdot W/P + P$ (see Section 3.2) for the window size $W$ and the period size $P$. For the in-flight sub-window term, the actual cost approaches 1 from $P$ as we see more data duplicates in the workload. This is how QLOVE reduces memory consumption in practice.

Additionally, we test a larger $\epsilon$ for CMQS, AM, and Random, and a different parameter setting for Moment, in order to reduce their space usage. The space usage drops considerably to around 6,000 variables, but the value errors become extremely significant.

Throughput. For throughput, we compare QLOVE with CMQS, which we observe to be the highest-performing among the rank-bound algorithms. In CMQS, each sub-window creates a data structure, namely a sketch, and all active sketches are combined to compute approximate quantiles over a sliding window. The capacity of each sub-window is chosen to ensure the rank error bound of the $\epsilon$-approximation (Lin et al., 2004). In other words, if the sizes relevant to a sliding window are given, we can determine the $\epsilon$ that is deterministically ensured. In this experiment, we consider a query with a 1K period and a 100K window. Here $\epsilon$ is calculated as 0.02, and to cover a wider error spectrum, we enforce $\epsilon$ to range from 0.02 (1x) to 0.2 (10x).

Figure 4 presents the throughput of QLOVE compared with CMQS for varying $\epsilon$ values, and also with Exact. Overall, QLOVE achieves higher throughput than CMQS across all $\epsilon$ values, as well as higher throughput than Exact.

CMQS has a clear trade-off between accuracy and throughput. If $\epsilon$ is set too small (e.g., 1x), then the strategy will be too aggressive and will largely lower performance (lower than Exact). If $\epsilon$ is set too large (e.g., 10x), then the throughput is largely recovered. However, in this case, the strategy becomes too conservative and will be too loose in bounding the error. In theory, this will allow a rank distance for the approximate quantile of up to $\epsilon N$ = 20K elements, which is unacceptable.

(a) Normal
(b) Uniform
Figure 5. Scalability tests using synthetic datasets.

Scalability.

For scalability tests, we create two large-size synthetic datasets, each including 1 billion entries: Normal dataset generated from a normal distribution, with a mean of 1 million and a standard deviation of 50 thousand, and Uniform dataset generated from a uniform distribution ranging from 90 to 110. Using these datasets, we increase the window size from 1K up to 100M elements to further stress the query which uses 1K period.

Figure 5 shows that QLOVE scales with the window size. QLOVE shows consistent throughput for all window sizes on both datasets. In comparison, Exact suffers throughput degradation as soon as the window begins to slide. For example, when the size is increased to 10K, Exact shows a throughput degradation of 79%. This is a consequence of paying the deaccumulation cost to search and eliminate the oldest 1K elements from the tree state for every window period. QLOVE achieves high scalability by mitigating such deaccumulation cost and using a small-size state as the summary of each sub-window.

We have so far presented how QLOVE achieves low value error, low space consumption, and scalable performance. Next, we present the benefit of using few-k merging.

5.3. Few-k Merging

Quantile | 64K   | 32K   | 16K   | 8K    | 4K    | 2K    | 1K
0.5      | 0.04  | 0.06  | 0.10  | 0.15  | 0.22  | 0.28  | 0.35
0.9      | 0.03  | 0.04  | 0.06  | 0.08  | 0.10  | 0.14  | 0.27
0.99     | 0.13  | 0.27  | 0.78  | 1.27  | 1.73  | 2.27  | 3.39
0.999    | 1.82  | 3.31  | 4.40  | 7.04  | 10.46 | 10.55 | 18.93

Table 2. Average relative errors without few-k merging for period sizes ranging from 64K to 1K in 128K window size.

Addressing statistical inefficiency. As explained in Section 3.3, a larger period enables us to use more data points to estimate high quantiles, and deliver more accurate results. However, small periods are where top-k merging that caches the largest values can be effective. To quantify this, we fix window size to include 128K elements, and we vary the period size over a wide range from 64K to 1K; the trends were similar for larger window sizes.

Table 2 summarizes the average relative errors prior to applying few-k merging for different period sizes. We observe that varying the period size has an insignificant effect on Q0.5 and Q0.9, with relative errors less than 1%, whereas it matters to Q0.999, with the error going up to 18.93%. The accuracy target is domain-specific. For example, in the real-world NetMon query, 5% relative error is considered adequate. Therefore, if we set this as the optimization target, we need to exploit the top-k values for periods smaller than 16K. Interestingly, this circumstance does not exist in the Search dataset, where all relative value errors fall below 1%. (The Search ISN limits query execution to a pre-defined response time SLA, e.g., 200 ms. The queries terminated by the SLA are concentrated at Q0.9 and above, incurring high density in the tail of the data distribution.)

Fraction | 8K           | 4K           | 2K           | 1K
0.1      | 5.54 (209)   | 2.43 (419)   | 1.67 (838)   | 1.30 (1,677)
0.5      | 0.68 (1,049) | 0.40 (2,097) | 0.36 (4,194) | 0.35 (8,389)

Table 3. Average relative errors (and observed space usage) for different fractions used in top-k merging, relative to the space that guarantees the exact Q0.999.

Having focused on Q0.999 in NetMon, we measure the accuracy while varying the fraction of the caching size that guarantees the exact answer in few-k merging, and show the results in Table 3 along with the observed space usage. Considering the 128K window size, each sub-window needs to maintain the 132 largest entries for the exact answer. As the table shows, exploiting a smaller fraction of this for each sub-window's data can still reduce Q0.999 value errors significantly. In particular, using half the space (i.e., a fraction of 0.5) results in accuracy nearly as good as the optimal solution that needs the entire top-132 values. Using the top-13 values (i.e., a fraction of 0.1) makes the error fall around or below our desired value-error target (5%). This excellent trade-off also indicates that the largest entries in NetMon are fairly well distributed across sub-windows.

In NetMon, missing the largest values in each sub-window hurts accuracy. For example, when sampling is instead applied using a half fraction (i.e., 0.5), the value errors for the 8K, 4K, 2K, and 1K periods explode to 2.23%, 4.60%, 8.33%, and 13.36%, respectively.

Addressing bursty traffic. Next, we discuss the effect of sample-k merging on bursty traffic. The nature of bursty traffic is that the largest values from one or a few sub-windows decide the target high quantile in the window. Sampling here aims at capturing those values using a smaller amount of space. For evaluation, we inject bursty traffic into NetMon such that it affects Q0.999 and above and appears just once in every evaluation of the sliding window. That is, for window size $W$ and quantile $\phi$, we increase the values of the top $\lceil (1-\phi) W \rceil$ elements in every $(W/P)$-th sub-window of size $P$ by 10x.

Table 4 presents the average relative errors for two queries with 16K and 4K period sizes, both using the 128K window. The fraction is defined similarly: the amount of data assigned for holding sampled data relative to the amount needed to give the guaranteed exact answer. The zero fraction indicates the case where QLOVE handles bursty traffic without samples. Looking at the zero fraction in the table, the bursty traffic is damaging Q0.999 in both queries, with Q0.99 in the 4K-period query also compromised. This is because burst traffic blows up more when using smaller periods. That is, the injected burst inflates the top-132 values, which sweeps in the 40th largest value that the query refers to for Q0.99.

Using the sampled values, both queries can improve their accuracy. In general, Q0.999 needs a higher sampling rate (e.g., fraction of 0.5) since the neighboring values are heavy-tailed. Q0.99 works well even with conservative sampling (e.g., fraction of 0.1) since the neighboring entries are gathered in a smaller value range.

         | 16K period                  | 4K period
Fraction | Q0.99         Q0.999        | Q0.99          Q0.999
0.0      | 0.08 (0)      44.10 (0)     | 28.15 (0)      55.36 (0)
0.1      | 0.14 (1,048)  25.97 (104)   | 0.43 (4,194)   17.38 (419)
0.5      | 0.05 (5,242)  1.75 (524)    | 0.30 (20,971)  1.52 (2,097)

Table 4. Average relative errors (and observed space usage) for different fractions used in sample-k merging, relative to the space that guarantees the exact answer.

Throughput. Throughput in few-k merging is tightly coupled with the number of entries to process per window. This is because the merged values must be kept in a sorted form and utilized directly. With more merged data, the state grows bigger, consumes more processing cycles, and lowers throughput. In the extreme case of using the 1K period, which is the most resource-demanding case among our evaluated queries, the throughput improves with a smaller fraction due to the lower cost of managing the smaller state. For example, for NetMon, with all entries cached (i.e., a fraction of 1), we see a 21.2% throughput penalty compared to QLOVE without few-k merging. At a smaller fraction of 0.2, where the average relative error is only 0.6%, the throughput penalty shrinks to 9.0%.

5.4. Sensitivity Study

Data redundancy. We derive low-precision variants of the NetMon and Search datasets to test the impact that higher redundancy in data streams has on QLOVE's throughput. We discard the two lowest-order digits from the original datasets, resulting in a data precision of 100 us instead of 1 us. With the window period fixed at 1K elements, we vary the window size from 1K to 1M elements, and measure throughput improvements over processing the original datasets. We see a clear increase in throughput. For example, at the 1K window size, i.e., a tumbling window, the throughput increases by 2.7x and 1.8x in NetMon and Search, respectively. The relative increase is bigger for sliding windows, at 3.7x - 4.6x. Such gains come from keeping the tree state smaller during incremental evaluation operations, which accelerates their processing and benefits both Exact and QLOVE.

Data skewness. We create Pareto dataset to include integers from a skewed, heavy-tailed Pareto distribution (Newman, 2011), with Q0.5 of 20, Q0.999 of 10,000, and the max of 1.1 billion. It is more challenging for the processing of Pareto workload to provide low value errors in high quantiles, because value distances among data in the tail are much wider. We compare QLOVE with the competing algorithms using 16K window period and 128K window size, as done in Table 1. QLOVE is significantly better in reducing value errors in high quantiles. For example, at Q0.999, QLOVE achieves 4.00% relative value error, while AM and Random achieve 29.22% and 35.17%, respectively.

Non-i.i.d. data. We test whether our aggregated estimator can be applied to non-i.i.d. data with accuracy competitive to i.i.d. data. To model a diverse spectrum of data dependence, we generate a non-i.i.d. dataset from an AR(1) model, i.e., an autoregressive model (Box et al., 2015) of order 1, with coefficient $\rho$, where (1) $\rho$ represents the correlation between a data point and its next data point, and (2) a larger $\rho$ indicates a stronger correlation among neighboring data points. Data points in the dataset are identically and normally distributed, with a mean of 1 million and a standard deviation of 50 thousand. For the purpose of comparison, we generate another i.i.d. dataset from a normal distribution with the same mean and standard deviation, which is equivalent to the AR(1) model with $\rho = 0$.

We evaluate the accuracy using the average relative errors between the estimated and exact values for different quantiles. Table 5 shows the results for some selected quantiles using three datasets that range from low correlation to high correlation. We find that the errors slightly increase when $\rho = 0.2$ (i.e., non-i.i.d. data with low correlation), and mildly increase when $\rho = 0.8$ (i.e., non-i.i.d. data with high correlation), compared to those when $\rho = 0$ (i.e., i.i.d. data). Also, the empirical probabilities that the absolute errors fall within the corresponding error bounds are always 1 for different $\rho$ and quantiles $\phi$. Therefore, we achieve (1) competitive accuracy of estimated quantiles on non-i.i.d. data, and (2) high probabilities of absolute errors falling within the error bounds. Hence, our approach is robust to the underlying dependence in this sense.

Quantiles     | 0.5 | 0.9 | 0.99
ρ = 0.0       |     |     |
ρ = 0.2       |     |     |
ρ = 0.8       |     |     |

Table 5. Average relative errors for three datasets from autoregressive model with different correlation factors.

6. Related Work

Stream processing engines have been developed for both single-machine (Chandramouli et al., 2014; Miao et al., 2017) and distributed computing (Qian et al., 2013; Zaharia et al., 2013; Lin et al., 2016) to provide runtimes for parallelism, scalability, and fault tolerance. QLOVE can be applied to all these frameworks. In this section, we focus on quantile approximation algorithms from the literature in detail.

The work related to quantile approximation over data streams can be organized into two main categories. The first is a theoretical category that focuses on improving space and time complexity. For the space complexity, the space bound is a function of an error-tolerance parameter that the user specifies (Blum et al., 1973; Lin et al., 2004; Arasu and Manku, 2004). The work in the second category addresses challenges in quantile approximation under certain assumptions, e.g., approximating quantiles when the raw data is distributed (Agarwal et al., 2012), or leveraging GPUs for quantile computation (Govindaraju et al., 2005). The error bounds of the approximation techniques in these two categories are defined in terms of rank. In contrast, QLOVE approximates quantile computations with a value-based error, which is more suitable than rank-based error in many practical applications, as illustrated in Section 1. Approximating quantiles by value is a unique distinction of QLOVE.

In (Blum et al., 1973), linear time algorithms were presented to compute a given quantile exactly over a fixed set of elements, while (Lin et al., 2004; Arasu and Manku, 2004) presented algorithms for quantile approximation over sliding windows. The work in (Lin et al., 2004; Arasu and Manku, 2004) is the most closely related to QLOVE. In (Arasu and Manku, 2004), the space complexity of (Lin et al., 2004) is improved: i.e., less memory is used to achieve the same accuracy. Similarly, the randomized algorithms in (Luo et al., 2016) use the memory size as a parameter to provide a desired error bound. However, the work in (Lin et al., 2004; Arasu and Manku, 2004; Luo et al., 2016) cannot scale to very large sliding windows when low latency is a requirement. This is because the cost of deaccumulating expired elements does not scale with the sliding-window size. In contrast, QLOVE can scale to large sliding windows due to its ability to deaccumulate an entire expiring sub-window at a time with low cost. Hence, QLOVE is a better fit for real-time applications, where low latency is a key requirement.

Many research efforts (Greenwald and Khanna, 2004; Huang et al., 2011; Agarwal et al., 2012) assume that the input data used to compute the quantiles is distributed. Similarly, the work done in  (Cormode et al., 2005; Yi and Zhang, 2009) computes quantiles over distributed data and takes a further step by continuously monitoring any updates to maintain the latest quantiles. In QLOVE, we target applications where a large stream of data may originate from different sources to be processed by a streaming engine. QLOVE performs a single pass over the data to scale for large volume of input streams.

Deterministic algorithms for approximating biased quantiles are first presented in (Cormode et al., 2006). Biased quantiles are those with extreme values, e.g., 0.99-quantile. In contrast, QLOVE is designed to compute both biased and unbiased quantiles. Moreover, (Cormode et al., 2006) is sensitive to the maximum value of the streaming elements, while QLOVE is not. In particular, the memory consumed by (Cormode et al., 2006) includes a parameter that represents the maximum value a streaming element can have. Biased quantiles can have very large values in many applications. QLOVE is able to estimate them without any cost associated with the actual values of the biased quantiles.

7. Conclusion

The quantile operator is challenging for real-time streaming analytics systems as it requires high throughput, low latency, and high accuracy. We present QLOVE, which satisfies these requirements through workload-driven and value-based quantile approximation. We evaluated QLOVE using synthetic and real-world workloads on a state-of-the-art streaming engine, demonstrating high throughput over a wide range of window sizes while delivering small relative value errors. Although the evaluation is based on a single machine, our quantile design can deliver better aggregate throughput while using fewer machines in distributed computing.

References

  • Agarwal et al. (2012) Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, and Ke Yi. 2012. Mergeable Summaries. In PODS.
  • Akidau et al. (2015) Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1792–1803.
  • Arasu and Manku (2004) Arvind Arasu and Gurmeet Singh Manku. 2004. Approximate Counts and Quantiles over Sliding Windows. In PODS.
  • Blum et al. (1973) Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. 1973. Time Bounds for Selection. J. Comput. Syst. Sci. 7, 4 (Aug. 1973), 448–461. https://doi.org/10.1016/S0022-0000(73)80033-9
  • Box et al. (2015) George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. 2015. Time series analysis: forecasting and control. John Wiley & Sons.
  • Chandramouli et al. (2014) Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (Dec. 2014), 401–412.
  • Cormode et al. (2005) Graham Cormode, Minos Garofalakis, S. Muthukrishnan, and Rajeev Rastogi. 2005. Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles. In SIGMOD.
  • Cormode et al. (2006) Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. 2006. Space- and Time-efficient Deterministic Algorithms for Biased Quantiles over Data Streams. In PODS.
  • Dean and Barroso (2013) Jeffrey Dean and Luiz André Barroso. 2013. The Tail at Scale. Commun. ACM 56, 2 (Feb. 2013), 74–80.
  • Ghanem et al. (2007) Thanaa M. Ghanem, Moustafa A. Hammad, Mohamed F. Mokbel, Walid G. Aref, and Ahmed K. Elmagarmid. 2007. Incremental Evaluation of Sliding-Window Queries over Data Streams. IEEE Trans. on Knowl. and Data Eng. 19, 1 (Jan. 2007), 57–72.
  • Govindaraju et al. (2005) Naga K. Govindaraju, Nikunj Raghuvanshi, and Dinesh Manocha. 2005. Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors. In SIGMOD.
  • Greenwald and Khanna (2004) Michael B. Greenwald and Sanjeev Khanna. 2004. Power-conserving Computation of Order-statistics over Sensor Networks. In PODS.
  • Guibas and Sedgewick (1978) Leo J. Guibas and Robert Sedgewick. 1978. A Dichromatic Framework for Balanced Trees. In SFCS.
  • Guo et al. (2015) Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, Zhi-Wei Lin, and Varugis Kurien. 2015. Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In SIGCOMM.
  • Huang et al. (2011) Zengfeng Huang, Lu Wang, Ke Yi, and Yunhao Liu. 2011. Sampling Based Algorithms for Quantile Computation in Sensor Networks. In SIGMOD.
  • Jeon et al. (2016) Myeongjae Jeon, Yuxiong He, Hwanju Kim, Sameh Elnikety, Scott Rixner, and Alan L. Cox. 2016. TPC: Target-Driven Parallelism Combining Prediction and Correction to Reduce Tail Latency in Interactive Services. In ASPLOS.
  • Kulkarni et al. (2015) Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In SIGMOD.
  • Leland et al. (1994) Will E. Leland, Murad S. Taqqu, Walter Willinger, and Daniel V. Wilson. 1994. On the Self-similar Nature of Ethernet Traffic (Extended Version). IEEE/ACM Trans. Netw. 2, 1 (1994), 1–15.
  • Lin et al. (2016) Wei Lin, Haochuan Fan, Zhengping Qian, Junwei Xu, Sen Yang, Jingren Zhou, and Lidong Zhou. 2016. STREAMSCOPE: Continuous Reliable Distributed Processing of Big Data Streams. In NSDI.
  • Lin et al. (2004) Xuemin Lin, Hongjun Lu, Jian Xu, and Jeffrey Xu Yu. 2004. Continuously maintaining quantile summaries of the most recent N elements over a data stream. In ICDE.
  • Luo et al. (2016) Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. 2016. Quantiles over Data Streams: Experimental Comparisons, New Analyses, and Further Improvements. The VLDB Journal 25, 4 (Aug. 2016), 449–472.
  • Mann and Whitney (1947) Henry B. Mann and Donald R. Whitney. 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18, 1 (1947), 50–60.
  • Miao et al. (2017) Hongyu Miao, Heejin Park, Myeongjae Jeon, Gennady Pekhimenko, Kathryn S. McKinley, and Felix Xiaozhu Lin. 2017. StreamBox: Modern Stream Processing on a Multicore Machine. In USENIX ATC.
  • Newman (2011) M. E. J. Newman. 2011. Power laws, Pareto distributions and Zipf’s law. In arXiv.
  • Qian et al. (2013) Zhengping Qian, Yong He, Chunzhi Su, Zhuojie Wu, Hongyu Zhu, Taizhi Zhang, Lidong Zhou, Yuan Yu, and Zheng Zhang. 2013. TimeStream: Reliable Stream Computation in the Cloud. In EuroSys.
  • Stuart and Ord (1994) Alan Stuart and J. Keith Ord. 1994. Kendall’s Advanced Theory of Statistics, Volume 1: Distribution Theory. London, UK.
  • Tan et al. (2019) Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. 2019. NetBouncer: Active Device and Link Failure Localization in Data Center Networks. In NSDI.
  • Tierney (1983) Luke Tierney. 1983. A space-efficient recursive procedure for estimating a quantile of an unknown distribution. SIAM J. Sci. Statist. Comput. 4, 4 (1983), 706–711.
  • Wang et al. (2013) Lu Wang, Ge Luo, Ke Yi, and Graham Cormode. 2013. Quantiles over Data Streams: An Experimental Study. In SIGMOD.
  • Yi and Zhang (2009) Ke Yi and Qin Zhang. 2009. Optimal Tracking of Distributed Heavy Hitters and Quantiles. In PODS.
  • Zaharia et al. (2013) Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In SOSP.

Appendix A Error Bound Analysis

Theorem 1.

Suppose the sliding window consists of $k$ sub-windows, each containing $n$ values, and that the data in the sliding window are independent and identically distributed (i.i.d.). Let $q_\phi$ denote the $\phi$-quantile of the data distribution, let $\hat{q}_\phi^{(i)}$ denote the sample $\phi$-quantile of the $i$-th sub-window, let $\bar{q}_\phi = \frac{1}{k}\sum_{i=1}^{k}\hat{q}_\phi^{(i)}$ denote the aggregated estimate, and let $\hat{q}_\phi$ denote the sample $\phi$-quantile of the entire sliding window. When $n \to \infty$, with probability at least $1-\delta$, the following holds asymptotically

$$|\bar{q}_\phi - \hat{q}_\phi| \;\le\; \frac{2\, z_{\delta/4}\, \sqrt{\phi(1-\phi)}}{f(q_\phi)\, \sqrt{kn}},$$

where $z_{\delta/4} = \Phi^{-1}(1-\delta/4)$ is the upper $\delta/4$-quantile of the standard normal distribution, $\Phi$ is the cumulative distribution function of the standard normal distribution and $\Phi^{-1}$ its inverse, and $f(q_\phi)$ is the probability density of the data distribution at its $\phi$-quantile $q_\phi$.

In particular, taking $\delta = 0.05$: when $n \to \infty$, with probability at least $95\%$, the following holds asymptotically

$$|\bar{q}_\phi - \hat{q}_\phi| \;\le\; \frac{2\, z_{0.0125}\, \sqrt{\phi(1-\phi)}}{f(q_\phi)\, \sqrt{kn}} \;\approx\; \frac{4.48\, \sqrt{\phi(1-\phi)}}{f(q_\phi)\, \sqrt{kn}}.$$

Proof.

By the Central Limit Theorem for the sample $\phi$-quantile of i.i.d. data (Stuart and Ord, 1994; Tierney, 1983), in each sub-window we have

$$\sqrt{n}\,\big(\hat{q}_\phi^{(i)} - q_\phi\big) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \frac{\phi(1-\phi)}{f(q_\phi)^2}\right) \qquad (1)$$

when $n \to \infty$. Since the data are i.i.d., the sub-window estimates $\hat{q}_\phi^{(1)}, \dots, \hat{q}_\phi^{(k)}$ are i.i.d. as well. Then for the aggregated estimate $\bar{q}_\phi = \frac{1}{k}\sum_{i=1}^{k}\hat{q}_\phi^{(i)}$, we have

$$\sqrt{kn}\,\big(\bar{q}_\phi - q_\phi\big) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \frac{\phi(1-\phi)}{f(q_\phi)^2}\right) \qquad (2)$$

when $n \to \infty$. Therefore, with probability $1-\delta/2$, the following holds asymptotically

$$|\bar{q}_\phi - q_\phi| \;\le\; \frac{z_{\delta/4}\, \sqrt{\phi(1-\phi)}}{f(q_\phi)\, \sqrt{kn}} \qquad (3)$$

when $n \to \infty$. On the other hand, for the sample $\phi$-quantile of the sliding window, which contains $kn$ values, we have

$$\sqrt{kn}\,\big(\hat{q}_\phi - q_\phi\big) \;\xrightarrow{d}\; \mathcal{N}\!\left(0,\; \frac{\phi(1-\phi)}{f(q_\phi)^2}\right) \qquad (4)$$

when $n \to \infty$. Therefore, with probability $1-\delta/2$, the following holds asymptotically

$$|\hat{q}_\phi - q_\phi| \;\le\; \frac{z_{\delta/4}\, \sqrt{\phi(1-\phi)}}{f(q_\phi)\, \sqrt{kn}} \qquad (5)$$

when $n \to \infty$. Combining (3) and (5) via the triangle inequality and the union bound, with probability at least $1-\delta$, the following holds asymptotically when $n \to \infty$:

$$|\bar{q}_\phi - \hat{q}_\phi| \;\le\; \frac{2\, z_{\delta/4}\, \sqrt{\phi(1-\phi)}}{f(q_\phi)\, \sqrt{kn}}. \qquad (6)$$

∎
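
Assuming the setup above (i.i.d. data and an aggregated estimate formed by averaging per-sub-window sample quantiles), the bound can be checked empirically with a short simulation; the exponential distribution and the values of k, n, φ, and δ below are illustrative choices only, not QLOVE's configuration.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    k, n, phi, delta = 50, 2_000, 0.99, 0.05   # sub-windows, sub-window size, quantile, failure probability

    # Exponential(1) data: q_phi = -ln(1 - phi) and density f(q_phi) = 1 - phi.
    q_phi = -np.log(1.0 - phi)
    f_q = 1.0 - phi

    # Asymptotic bound on |aggregated estimate - window sample quantile| from Theorem 1.
    bound = 2.0 * norm.ppf(1.0 - delta / 4.0) * np.sqrt(phi * (1.0 - phi)) / (f_q * np.sqrt(k * n))

    trials, violations = 1_000, 0
    for _ in range(trials):
        window = rng.exponential(1.0, size=(k, n))
        aggregated = np.quantile(window, phi, axis=1).mean()   # average of per-sub-window quantiles
        exact = np.quantile(window, phi)                       # sample quantile of the whole window
        violations += abs(aggregated - exact) > bound
    print(f"bound = {bound:.4f}, empirical violation rate = {violations / trials:.3f} (expected <= {delta})")

Because the aggregated and exact estimates are computed from the same data, their errors are strongly correlated, so the observed violation rate is typically far below δ; the union bound used in the proof is conservative.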